Running GCHP: Basics

Previous | Next | Getting Started with GCHP | GCHP Main Page

  1. Hardware and Software Requirements
  2. Downloading Source Code and Data Directories
  3. Obtaining a Run Directory
  4. Setting Up the GCHP Environment
  5. Compiling
  6. Running GCHP: Basics
  7. Running GCHP: Configuration
  8. Output Data
  9. Developing GCHP
  10. Run Configuration Files


Overview

This page presents the basic information needed to run GCHP as well as how to verify a successful run and reuse a run directory. The GCHP "standard" simulation run directory is configured for a 1-hr simulation at c24 resolution. This simple configuration is a good first test case to check that GCHP runs on your system.

How to Run GCHP

You can run GCHP locally from within your run directory ("interactively") or by submitting your run to a job scheduler if one is available. Either way, it is useful to put run commands into a reusable script we call the run script. Executing the script will either run GCHP or submit a job that will run GCHP.

There is a symbolic link in the GCHP run directory called runScriptSamples that points to a directory in the source code containing example scripts to run GCHP. Each file includes extra commands that make the run process easier and less prone to user error. These commands, illustrated in the sketch below, include:

  1. Source the environment file that is the target of symbolic link gchp.env.
  2. Source configuration file runConfig.sh to set run-time configuration.
  3. Send standard output to run-time log file gchp.log.
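
Here is a minimal sketch of what such a run script might contain. It is illustrative only; the actual sample scripts include additional checks, and the mpirun invocation varies by MPI implementation and system:

 #!/bin/bash
 # Minimal run script sketch (illustrative; adapt for your system)
 source gchp.env                            # (1) load required libraries and modules
 source runConfig.sh                        # (2) apply run-time settings to config files
 mpirun -np 6 ./gchp 2>&1 | tee gchp.log    # (3) run; stdout goes to screen and gchp.log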

Running Interactively

Copy or adapt the example run script gchp.local.run to run GCHP locally on your machine. Before running, open your run script and set nCores to the number of processors you plan to use. Make sure you have this number of processors available locally; it must be at least 6. Next, open file runConfig.sh and set NUM_CORES, NUM_NODES, and NUM_CORES_PER_NODE to be consistent with your run script.
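
For example, a consistent six-core local setup might look like this (the values are illustrative):

 # In gchp.local.run:
 nCores=6
 # In runConfig.sh:
 NUM_NODES=1
 NUM_CORES_PER_NODE=6
 NUM_CORES=6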

To run, type the following at the command prompt:

./gchp.local.run

Standard output will be displayed on your screen in addition to being sent to log file gchp.log.

Running as a Batch Job

Batch job run scripts will vary based on what job scheduler you have available. Most of the example run scripts are for use with SLURM, and the most basic of these is gchp.run. You may copy any of the example run scripts to your run directory and adapt them for your system and preferences as needed.

At the top of all batch job scripts are configurable run settings. The most critical are the requested number of cores, number of nodes, wall time, and memory. Finding the optimal values for your run can take some trial and error. For a basic six-core standard simulation job on one node you should request at least ___ min and __ GB. The more cores you request, the faster GCHP will run.
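
As an illustration, the resource requests at the top of a SLURM run script take the form of #SBATCH directives like these (the values shown are placeholders, not recommendations):

 #SBATCH -n 6                 # total number of cores
 #SBATCH -N 1                 # number of nodes
 #SBATCH -t 0-01:00           # wall time limit (D-HH:MM)
 #SBATCH --mem=REQUESTED_MB   # memory per node, in MB; set to your request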

To submit a batch job using SLURM:

 sbatch gchp.run

To submit a batch job using Grid Engine:

 qsub gchp.run

If your computational cluster uses a different job scheduler, e.g. LSF or PBS, check with your IT staff or search the internet for how to configure and submit batch jobs.

Standard output will be sent to log file gchp.log once the job is started unless you change that feature of the run script. Standard error will be sent to a file specific to your scheduler, e.g. slurm-jobid.out if using SLURM, unless you configure your run script to do otherwise.

For each job scheduler, batch job configurable settings and acceptable formats are available on the internet and are often accessible from the command line. For example, type man sbatch to scroll through options for SLURM, including various ways of specifying number of cores and time and memory requested.

Verifying a Successful Run

There are several ways to verify that your run was successful.

  1. NetCDF files are present in the OutputDir subdirectory.
  2. gchp.log ends with timing information for the run.
  3. Your scheduler log (e.g. output from SLURM) does not contain any obvious errors.
  4. gchp.log contains text with format "AGCM Date: YYYY/MM/DD Time: HH:mm:ss" for each timestep of your run (see the check below).
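
For example, a quick way to verify item 4 is to grep the log; one "AGCM Date" line should appear per completed timestep:

 grep "AGCM Date" gchp.log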

If it looks like something went wrong, scan through gchp.log (sometimes the error is near the top) as well as your scheduler output file (if one exists) to determine where the error occurred. Beware that a problem in one of your configuration files will likely produce a MAPL error with a traceback to the GCHP/Shared directory; review all of your configuration files to ensure they are set up properly. Errors in "CAP" typically indicate a problem with the start time, end time, and/or duration set in runConfig.sh (more on this file in the next chapter). Errors in "ExtData" often indicate a problem with the input files specified in HEMCO_Config.rc or ExtData.rc. Errors in "HISTORY" are related to your configured output in HISTORY.rc.

GCHP errors can be cryptic. If you find yourself debugging within MAPL, you may be on the wrong track; most issues can be resolved by updating the run settings. If you cannot figure out where you are going wrong, please create an issue on the GCHP GitHub issue tracker located at https://github.com/geoschem/gchp/issues.

Reusing a Run Directory

Archiving a Run

One of the benefits of GCHP relative to GEOS-Chem Classic is that you can reuse a run directory for different grid resolutions without recompiling. You can also copy your executable between different simulation run directories as long as you are using the same code. However, reusing a run directory comes with the risk of losing your old work. To mitigate this there is a utility shell script, archiveRun.sh, which archives data output and configuration files to a subdirectory within your run directory. All you need to do is pass it a non-existent subdirectory name of your choosing. Here is an example:

./archiveRun.sh c24_3hr

The following output is then printed to screen to show you exactly what is being archived and where:

Archiving files to directory c24_3hr
Moving files and directories...
  Warning: No files to move from Plots
  -> c24_3hr/diagnostics/GCHP.SpeciesConc.20160101_0030z.nc4
  -> c24_3hr/diagnostics/GCHP.SpeciesConc.20160101_0130z.nc4
  -> c24_3hr/diagnostics/GCHP.SpeciesConc.20160101_0230z.nc4
Copying files...
  -> c24_3hr/config/input.geos
  -> c24_3hr/config/CAP.rc
  -> c24_3hr/config/ExtData.rc
  -> c24_3hr/config/fvcore_layout.rc
  -> c24_3hr/config/GCHP.rc
  -> c24_3hr/config/HEMCO_Config.rc
  -> c24_3hr/config/HEMCO_Diagn.rc
  -> c24_3hr/config/HISTORY.rc
  -> c24_3hr/config/runConfig.sh
  -> c24_3hr/config/gchp.local.run
  -> c24_3hr/config/gchp.run
  -> c24_3hr/config/gchp.env
  -> c24_3hr/logs/compile.log
  -> c24_3hr/logs/gchp.log
  -> c24_3hr/logs/HEMCO.log
  -> c24_3hr/logs/mem_transportTracers_1mo.log
  -> c24_3hr/logs/PET00000.GEOSCHEMchem.log
  Warning: slurm-* not found
  -> c24_3hr/checkpoints/gcchem_internal_checkpoint.20160101_0000z.nc4
  -> c24_3hr/checkpoints/gcchem_internal_checkpoint.restart.20160101_030000.nc4
  -> c24_3hr/checkpoints/cap_restart
  -> c24_3hr/restart/initial_GEOSChem_rst.c24_TransportTracers.nc
Complete!

All files except output diagnostics data are copied rather than moved, so you can still see them in the run directory after archiving. This includes restart files, which remain in your run directory until you delete them. The diagnostic data, however, are moved rather than copied, leaving your OutputDir directory empty. This particular example was run interactively, so no SLURM file was found, and it archived a single-segment run (a single job), which is why there is a warning about a missing multi-run file; these warnings can be ignored. If you do a multi-run, which involves running multiple consecutive jobs, archiving will move data and copy other files from all runs to your archive directory. If you run GCHP as a batch job using SLURM, the SLURM files will be sent to the logs archive directory.

Since archiveRun.sh is a simple bash script, you may edit it to customize archiving based on your own preferences.

Cleaning the Run Directory

If you have archived your last run, or simply do not want to keep it, you should clean your run directory prior to your next run by running "make cleanup_output". Here is an example of the output printed when cleaning the run directory:

rm -f /n/home/gchp_RnPbBe/OutputDir/*.nc4
rm -f trac_avg.*
rm -f tracerinfo.dat
rm -f diaginfo.dat
rm -f cap_restart
rm -f gcchem*
rm -f *.rcx
rm -f *~
rm -f gchp.log
rm -f HEMCO.log
rm -f PET*.log
rm -f multirun.log
rm -f logfile.000000.out
rm -f slurm-*
rm -f 1
rm -f EGRESS

Rerunning Without Cleaning

You can reuse a run directory without cleaning it and without archiving your last run. Files will generally simply be replaced by files generated in the next run. This works fine with two exceptions.

cap_restart must be deleted

The output cap_restart file must be removed prior to subsequent runs if you are starting a run from scratch. The cap_restart file contains a date and time string marking the end of your last run, and GCHP will attempt to start your next run at that date and time if the file is present. This is useful for splitting up a run into multiple jobs; unless you are doing this, always delete cap_restart before a new run. This step is included in all sample run scripts except the multi-run script, which has special handling of cap_restart to pick up where the last run left off. See the next chapter for more information on the multi-run option.
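
For example, before starting a fresh run you can simply remove the file, as the sample run scripts do:

 rm -f cap_restart   # without this file, GCHP starts at the date/time set in runConfig.sh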

gcchem_internal_checkpoint must be deleted or renamed

The GCHP output restart filename is configured in GCHP.rc. If a file with that name exists at the start of a run, GCHP will fail at the end of the run when it tries to overwrite the file. This is a quirk of the new version of MAPL introduced in GCHP 12.5.0. To get around it, all sample run scripts in the run directory's runScriptSamples directory rename the output checkpoint file to a filename containing 'restart' and a timestamp. However, if your run fails and exits early, the original checkpoint file may still be present, since it is created at the start of the run and remains empty until the run ends successfully. Using make cleanup_output to clean up your run directory prior to rerunning will prevent this issue, since it includes deletion of all files starting with "gcchem".
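
A minimal sketch of the rename step, assuming the checkpoint is written with its default name and using an illustrative timestamp taken from the archive listing above:

 # Rename the output checkpoint after a successful run so the next run
 # does not fail when trying to overwrite it (timestamp is illustrative)
 mv gcchem_internal_checkpoint gcchem_internal_checkpoint.restart.20160101_030000.nc4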

Pre-run Checklist

Prior to running GCHP, always run through the following checklist to ensure everything is set up properly.

  1. Your run directory contains the executable gchp.
  2. All symbolic links in your run directory are valid (no broken links; see the check below).
  3. You have looked through and set all configurable settings in runConfig.sh (discussed in the next chapter).
  4. If running via a job scheduler: you have a run script, and the resource allocations in runConfig.sh and your run script are consistent (number of nodes and cores).
  5. If running interactively: the resource allocation in runConfig.sh is available locally.
  6. If reusing a run directory (optional but recommended): you have archived your last run with ./archiveRun.sh if you want to keep it, and you have deleted old output files with ./cleanRunDir.sh.
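
One quick way to verify item 2, assuming GNU find is available:

 find . -xtype l   # lists any broken symbolic links in the run directory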

Previous | Next | Getting Started with GCHP | GCHP Main Page