Running GCHP: Basics
Overview
This page presents the basic information needed to run GCHP as well as how to verify a successful run and reuse a run directory. A pre-run checklist is included at the end to help prevent run errors. The GCHP "standard" simulation run directory is configured for a 1-hour simulation at c24 resolution and is a good first test case to check that GCHP runs on your system.
How to Run GCHP
You can run GCHP locally from within your run directory ("interactively") or by submitting your run to a job scheduler if one is available. Either way, it is useful to put run commands into a reusable script we call the run script. Executing the script will either run GCHP or submit a job that will run GCHP.
There is a symbolic link in the GCHP run directory called runScriptSamples that points to a directory in the source code containing example run scripts. Each file includes extra commands that make the run process easier and less prone to user error. These commands include:
- Source environment file symbolic link gchp.env
- Source config file runConfig.sh to set run-time configuration
- Delete any previous run output files that might interfere with the new run if present
- Send standard output to run-time log file gchp.log
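As a rough illustration, the four steps above might look like the following in a local run script. This is a sketch and not a copy of any file in runScriptSamples: the function names, the runConfig.log file, the use of mpirun as the launcher, and the use of tee are all assumptions, and the sample scripts remain the authoritative versions.

```shell
#!/bin/bash
# Hypothetical sketch of a local GCHP run script; see runScriptSamples for the real ones.

nCores=6        # number of processors to use (at least 6)
log=gchp.log    # run-time log file

prepare_run() {
    # 1. Source the environment file symbolic link, if present
    if [ -e gchp.env ]; then source ./gchp.env; fi
    # 2. Source runConfig.sh to apply the run-time configuration
    #    (sending its output to runConfig.log is an assumption)
    if [ -e runConfig.sh ]; then source ./runConfig.sh > runConfig.log; fi
    # 3. Delete previous output that could interfere with the new run
    rm -f cap_restart
}

launch_gchp() {
    # 4. Send standard output to the log file while still showing it on screen.
    #    mpirun is an assumption; your MPI library may use a different launcher.
    mpirun -np "$nCores" ./gchp | tee "$log"
}

# Only launch when the executable is actually present
if [ -x ./gchp ]; then
    prepare_run && launch_gchp
fi
```

Keeping the preparation steps in one function makes it harder to forget one of them when adapting the script for a different machine.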
Copy or adapt the example run script gchp.local.run to run GCHP locally on your machine. Before running, open your run script and set nCores to the number of processors you plan to use. Make sure that many processors are available locally; the number must be at least 6. Next, open runConfig.sh and set NUM_CORES, NUM_NODES, and NUM_CORES_PER_NODE to be consistent with your run script.
To run, type the following at the command prompt:
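Assuming you copied the sample script into your run directory and kept its name, gchp.local.run, the invocation would look like the sketch below. The existence check is an illustrative addition and is not part of the sample script.

```shell
# Run from inside the run directory.
# Assumes your run script is a copy of gchp.local.run with the same name.
run_script=gchp.local.run
if [ -f "$run_script" ]; then
    chmod +x "$run_script"   # only needed the first time
    ./"$run_script"
else
    echo "$run_script not found; copy it from runScriptSamples first"
fi
```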
Standard output will be displayed on your screen in addition to being sent to log file gchp.log.
Running as a Batch Job
Batch job run scripts will vary based on what job scheduler you have available. Most of the example run scripts are for use with SLURM, and the most basic example of these is gchp.run. You may copy any of the example run scripts to your run directory and adapt for your system and preferences as needed.
At the top of all batch job scripts are configurable run settings. The most critical are the requested number of cores, number of nodes, time, and memory. Figuring out the optimal values for your run can take some trial and error. For a basic six core standard simulation job on one node you should request at least ___ min and __ Gb. In general, the more cores you request the faster GCHP will run.
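For SLURM, these settings are given as #SBATCH directives at the top of the script. The fragment below is a sketch for the six-core, one-node case; the flags are standard sbatch options, but the time and memory entries are deliberately left as placeholders, since suitable values depend on your system.

```shell
#!/bin/bash
# Total number of cores (tasks)
#SBATCH -n 6
# Number of nodes
#SBATCH -N 1
# Wall-time limit: replace <time> with a value suited to your system
#SBATCH -t <time>
# Memory per node: replace <mem> with a value suited to your system
#SBATCH --mem=<mem>
```

Whatever values you choose here, keep the core and node counts consistent with NUM_CORES, NUM_NODES, and NUM_CORES_PER_NODE in runConfig.sh.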
To submit a batch job using SLURM:
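A minimal sketch, assuming your run script is a copy of the sample gchp.run in the current directory. The availability checks are an illustrative addition.

```shell
# gchp.run is the basic SLURM sample script; adjust the name if you renamed your copy
job_script=gchp.run
if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    sbatch "$job_script"
else
    echo "sbatch or $job_script not found"
fi
```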
To submit a batch job using Grid Engine:
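A minimal sketch using the Grid Engine qsub command. The script name my_gchp_sge.run is hypothetical and stands for a run script you have adapted for Grid Engine; the availability checks are an illustrative addition.

```shell
# my_gchp_sge.run is a hypothetical name for a run script adapted for Grid Engine
job_script=my_gchp_sge.run
if command -v qsub >/dev/null 2>&1 && [ -f "$job_script" ]; then
    qsub "$job_script"
else
    echo "qsub or $job_script not found"
fi
```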
Standard output will be sent to log file gchp.log once the job is started unless you change that feature of the run script. Standard error will be sent to a file specific to your scheduler, e.g. slurm-jobid.out if using SLURM, unless you configure your run script to do otherwise.
If your computational cluster uses a different job scheduler, e.g. Grid Engine, LSF, or PBS, check with your IT staff or search the internet for how to configure and submit batch jobs. For each job scheduler, batch job configurable settings and acceptable formats are available on the internet and are often accessible from the command line. For example, type man sbatch to scroll through options for SLURM, including various ways of specifying number of cores, time and memory requested.
Verifying a Successful Run
There are several ways to verify that your run was successful.
- NetCDF files are present in the OutputDir subdirectory
- Standard output file gchp.log ends with Model Throughput timing information
- The job scheduler log does not contain any error messages
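The three checks above can be scripted. The following is a sketch under stated assumptions: the function name verify_run is hypothetical, output files are assumed to match *.nc4, the timing summary is assumed to fall within the last 20 lines of gchp.log, and any line containing "error" in the scheduler log is treated as a problem.

```shell
#!/bin/bash
# Hypothetical helper; run it from the run directory.
# Usage: verify_run [scheduler_log]
verify_run() {
    local sched_log=${1:-}   # optional scheduler log, e.g. a slurm-*.out file
    local status=0

    # 1. NetCDF output exists in OutputDir (the *.nc4 pattern is an assumption)
    if ! ls OutputDir/*.nc4 >/dev/null 2>&1; then
        echo "FAIL: no NetCDF files in OutputDir"; status=1
    fi

    # 2. gchp.log ends with Model Throughput timing information
    #    (the 20-line window is an assumption)
    if ! tail -n 20 gchp.log 2>/dev/null | grep -q "Model Throughput"; then
        echo "FAIL: gchp.log does not end with Model Throughput"; status=1
    fi

    # 3. The scheduler log, if given, contains no error messages
    if [ -n "$sched_log" ] && grep -qi "error" "$sched_log"; then
        echo "FAIL: error messages found in $sched_log"; status=1
    fi

    if [ "$status" -eq 0 ]; then echo "Run looks successful"; fi
    return "$status"
}
```

For example, after an interactive run you might call verify_run with no argument, and after a SLURM job pass the name of the job's slurm output file.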
If it looks like something went wrong, scan through the log files to determine where there may have been an error. Here are a few debugging tips:
- Review all of your configuration files to ensure you have proper setup
- MAPL_Cap errors typically indicate an error with your start time, end time, and/or duration set in runConfig.sh
- MAPL_ExtData errors often indicate an error with your input files specified in either HEMCO_Config.rc or ExtData.rc
- MAPL_HistoryGridComp errors are related to your configured output in HISTORY.rc
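A quick way to apply these tips is to search the log for the component names. The sketch below uses a hypothetical function, triage_log, that prints the matching hint for each name it finds. A match is only a pointer to where to look, since these names may also appear in messages that are not errors.

```shell
#!/bin/bash
# Hypothetical helper: map MAPL component names found in a log to the
# configuration files worth inspecting.
# Usage: triage_log [logfile]   (defaults to gchp.log)
triage_log() {
    local log=${1:-gchp.log}
    if grep -q "MAPL_Cap" "$log"; then
        echo "MAPL_Cap: check start time, end time, and duration in runConfig.sh"
    fi
    if grep -q "MAPL_ExtData" "$log"; then
        echo "MAPL_ExtData: check input files in HEMCO_Config.rc and ExtData.rc"
    fi
    if grep -q "MAPL_HistoryGridComp" "$log"; then
        echo "MAPL_HistoryGridComp: check configured output in HISTORY.rc"
    fi
    return 0
}
```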
If you cannot figure out where the problem is please do not hesitate to create a GCHPctm GitHub issue.
Reusing a Run Directory
Archiving a Run
One of the benefits of GCHP relative to GEOS-Chem Classic is that you can reuse a run directory for different grid resolutions without recompiling. You can also copy your executable between different simulation run directories as long as you are using the same code. However, reusing a run directory risks overwriting your old work. To mitigate this, the utility shell script archiveRun.sh archives data output and configuration files to a subdirectory within your run directory. All you need to do is pass a non-existent subdirectory name of your choosing. Here is an example:
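A minimal sketch of such a call, using the archive name c24_3hr that appears in the sample output on this page. The checks around the call are an illustrative addition and are not part of archiveRun.sh.

```shell
# c24_3hr is the archive name from the sample output; it must not already exist
archive_dir=c24_3hr
if [ -x ./archiveRun.sh ] && [ ! -e "$archive_dir" ]; then
    ./archiveRun.sh "$archive_dir"
else
    echo "Need ./archiveRun.sh and a new (non-existent) directory name"
fi
```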
The following output is then printed to screen to show you exactly what is being archived and where:
Archiving files to directory c24_3hr
Moving files and directories...
Warning: No files to move from Plots
-> c24_3hr/diagnostics/GCHP.SpeciesConc.20160101_0030z.nc4
-> c24_3hr/diagnostics/GCHP.SpeciesConc.20160101_0130z.nc4
-> c24_3hr/diagnostics/GCHP.SpeciesConc.20160101_0230z.nc4
Copying files...
-> c24_3hr/config/input.geos
-> c24_3hr/config/CAP.rc
-> c24_3hr/config/ExtData.rc
-> c24_3hr/config/fvcore_layout.rc
-> c24_3hr/config/GCHP.rc
-> c24_3hr/config/HEMCO_Config.rc
-> c24_3hr/config/HEMCO_Diagn.rc
-> c24_3hr/config/HISTORY.rc
-> c24_3hr/config/runConfig.sh
-> c24_3hr/config/gchp.local.run
-> c24_3hr/config/gchp.run
-> c24_3hr/config/gchp.env
-> c24_3hr/logs/compile.log
-> c24_3hr/logs/gchp.log
-> c24_3hr/logs/HEMCO.log
-> c24_3hr/logs/mem_transportTracers_1mo.log
-> c24_3hr/logs/PET00000.GEOSCHEMchem.log
Warning: slurm-* not found
-> c24_3hr/checkpoints/gcchem_internal_checkpoint.20160101_0000z.nc4
-> c24_3hr/checkpoints/gcchem_internal_checkpoint.restart.20160101_030000.nc4
-> c24_3hr/checkpoints/cap_restart
-> c24_3hr/restart/initial_GEOSChem_rst.c24_TransportTracers.nc
Complete!
All files except the output diagnostics are copied, so you can still see them in the run directory after archiving. This includes restart files, which remain in your run directory until you delete them. The diagnostic data, however, are moved rather than copied, leaving your OutputDir directory empty. In this particular example the run was interactive, so no SLURM file was found, and it was a single-segment run (a single job), so archiveRun.sh may also warn about a missing multi-run file. These warnings can be ignored. If you do a multi-run, which involves running multiple consecutive jobs, archiving will move data and copy other files from all runs to your archive directory. If you run as a batch job using SLURM, the SLURM files will be sent to the logs archive directory.
Since archiveRun.sh is a simple bash script, you may edit it to customize archiving to your own preferences.
Cleaning the Run Directory
If you have archived your last run, or simply do not want to keep it, you should clean your run directory prior to your next run by running make cleanup_output. Here is an example of the output printed when cleaning the run directory:
rm -f /n/home/gchp_RnPbBe/OutputDir/*.nc4
rm -f trac_avg.*
rm -f tracerinfo.dat
rm -f diaginfo.dat
rm -f cap_restart
rm -f gcchem*
rm -f *.rcx
rm -f *~
rm -f gchp.log
rm -f HEMCO.log
rm -f PET*.log
rm -f multirun.log
rm -f logfile.000000.out
rm -f slurm-*
rm -f 1
rm -f EGRESS
Rerunning Without Cleaning
You can reuse a run directory without cleaning it and without archiving your last run. Files will generally be replaced by the files generated in the next run. This works with two exceptions.
cap_restart must be deleted
The output cap_restart file must be removed prior to subsequent runs if you are starting a run from scratch. The cap_restart file contains a date and time string for the end of your last run. GCHP will attempt to start your next run at this date and time if the file is present. This is useful for splitting up a run into multiple jobs. Unless you are doing this you should always delete cap_restart before a new run. This is included in all sample run scripts except the multi-run run script which has special handling of cap_restart to pick up where the last run left off. See the next chapter for more information on the multi-run option.
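If you are not using one of the sample run scripts, the same cleanup can be done by hand. The sketch below uses a hypothetical helper, reset_cap_restart, that shows the date and time stored in the file before deleting it, so you can confirm you are not discarding the restart point of a split run by mistake.

```shell
#!/bin/bash
# Hypothetical helper: report and remove cap_restart so the next run starts
# at the configured start time rather than where the last run ended.
reset_cap_restart() {
    if [ -f cap_restart ]; then
        echo "Removing cap_restart (last run ended at: $(cat cap_restart))"
        rm -f cap_restart
    fi
}
```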
gcchem_internal_checkpoint must be deleted or renamed
The GCHP output restart filename is configured in GCHP.rc. If a file with that name exists at the start of a run, GCHP will fail at the end when it tries to overwrite it. This is a quirk of the new version of MAPL introduced in GCHP 12.5.0. To get around this, all sample run scripts in the runScriptSamples directory rename the output checkpoint file to a name containing 'restart' and a timestamp. However, if your run fails with an early exit, the original restart file may still be present, since it is created at the start of the run and remains empty until the run ends successfully. Using make cleanup_output to clean up your run directory prior to rerunning will prevent this issue since it includes deletion of all files starting with "gcchem".
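The renaming step could look roughly like the sketch below. It is not the code from the sample run scripts: the function name rename_checkpoint is hypothetical, and it assumes cap_restart holds the end date and time as two space-separated fields (for example "20160101 030000"), which are combined to form the timestamp.

```shell
#!/bin/bash
# Hypothetical sketch; the sample run scripts contain the actual logic.
# Usage: rename_checkpoint <output restart filename configured in GCHP.rc>
rename_checkpoint() {
    local ckpt=$1
    local day time
    # Nothing to do unless both the checkpoint and cap_restart exist
    [ -f "$ckpt" ] && [ -f cap_restart ] || return 0
    # Assumed cap_restart format: "YYYYMMDD hhmmss"
    read -r day time < cap_restart || true
    # Strip a trailing .nc4 if present, then add 'restart' and the timestamp
    mv "$ckpt" "${ckpt%.nc4}.restart.${day}_${time}.nc4"
}
```

With a checkpoint named gcchem_internal_checkpoint and the cap_restart contents shown above, this produces gcchem_internal_checkpoint.restart.20160101_030000.nc4, the same naming pattern that appears in the archive listing on this page.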
Pre-run Checklist
Prior to running GCHP, always run through the following checklist to ensure everything is set up properly.
- Your run directory contains the executable gchp.
- All symbolic links in your run directory are valid (no broken links)
- You have looked through and set all configurable settings in runConfig.sh (discussed in the next chapter)
- If running via a job scheduler: you have a run script and the resource allocation in runConfig.sh and your run script are consistent (# nodes and cores)
- If running interactively: the resource allocation in runConfig.sh is available locally
- If reusing a run directory (optional but recommended): you have archived your last run with ./archiveRun.sh if you want to keep it and you have deleted old output files with ./cleanRunDir.sh
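Some of these checks can be automated. The helpers below are a sketch and are not part of GCHP: all three function names are hypothetical, and check_resources assumes that NUM_CORES should equal NUM_NODES times NUM_CORES_PER_NODE in addition to being at least 6.

```shell
#!/bin/bash
# Hypothetical helpers for the checklist above; run them from the run directory.

# Succeeds if the run directory contains an executable named gchp
has_executable() {
    [ -x ./gchp ]
}

# Prints any broken symbolic links found directly in the current directory
find_broken_links() {
    find . -maxdepth 1 -type l ! -exec test -e {} \; -print
}

# Usage: check_resources NUM_CORES NUM_NODES NUM_CORES_PER_NODE
# Succeeds if there are at least 6 cores and cores = nodes * cores per node
# (the product rule is an assumption about how the three settings relate)
check_resources() {
    local cores=$1 nodes=$2 per_node=$3
    [ "$cores" -ge 6 ] && [ "$cores" -eq $(( nodes * per_node )) ]
}
```

For example, check_resources 6 1 6 succeeds, while check_resources 12 2 5 fails because two nodes with five cores each cannot supply twelve cores.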