Debugging GEOS-Chem
Contents
- 1 Overview
- 2 Debugging tips
  - 2.1 Check the GEOS-Chem and HEMCO log files
  - 2.2 Make sure you did not max out your allotted time or memory
  - 2.3 Check if someone else has already reported the bug
  - 2.4 Recompile GEOS-Chem with debug options turned on
  - 2.5 Identify whether the error happens consistently
  - 2.6 Run GEOS-Chem in a debugger to find the source of error
  - 2.7 Isolate the error to a particular operation
  - 2.8 Check any code modifications that you have added
  - 2.9 Check for math errors
  - 2.10 When in doubt, print it out!
  - 2.11 When all else fails, use the brute force method
- 3 Using profiling tools to determine the source of computational bottlenecks
- 4 Using the GEOS-Chem Unit Tester
- 5 Contacting the GEOS-Chem Support Team for assistance
Overview
If your GEOS-Chem simulation dies unexpectedly with an error, or takes much longer to execute than it should, the most important thing is to isolate the source of the error or bottleneck right away. Below are some debugging tips that you can use.
Also see the Common GEOS-Chem error messages wiki page for detailed information on common errors and how to resolve them.
Debugging tips
Check the GEOS-Chem and HEMCO log files
If your GEOS-Chem simulation stopped with an error, but you cannot tell where, turn on the ND70 diagnostic (debug output) in input.geos and rerun your simulation. The ND70 diagnostic will print debug output at several locations in the code (after transport, chemistry, emissions, dry deposition, etc.). This should let you pinpoint the location of the error.
If the log file indicates your run stopped in emissions, you can check the HEMCO.log file for additional information (GEOS-Chem v10-01 and later versions only). We recommend setting both the Verbose and Warnings options in HEMCO_Config.rc to 3 to print all debug statements and warning messages to your HEMCO.log file.
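If you prefer to make this change from the command line, here is a minimal sed sketch. It assumes that the Verbose and Warnings settings appear in your HEMCO_Config.rc as lines beginning with "Verbose:" and "Warnings:"; if your file is formatted differently, simply edit those lines by hand.

    # Assumption: HEMCO_Config.rc contains settings lines starting with "Verbose:" and "Warnings:"
    sed -i 's/^Verbose:.*/Verbose:                     3/'  HEMCO_Config.rc
    sed -i 's/^Warnings:.*/Warnings:                    3/' HEMCO_Config.rc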
If your run stopped with an error, see the Common GEOS-Chem error messages wiki page for detailed information on common errors and how to resolve them.
Make sure you did not max out your allotted time or memory
If you are running GEOS-Chem on a shared computer system, chances are you will have used a scheduler (such as LSF, PBS, Grid Engine, or SLURM) to submit your GEOS-Chem job to a computational queue. You should be aware of the run time and memory limits for each of the queues on your system.
If your GEOS-Chem job uses more memory or run time than the computational queue allows, your job can be cancelled by the scheduler. You will usually get an error message printed out to the stderr stream. Be sure to check all of the log files created by your GEOS-Chem jobs for such error messages.
The solution will usually be to submit your GEOS-Chem simulation to a queue with a longer run-time limit or a larger memory limit (see the example job script below). You can also split up your GEOS-Chem simulation into several smaller stages that take less time to complete.
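For example, if your system uses SLURM, you can request more time and memory through #SBATCH directives in your job script. The sketch below is only an illustration: the partition name (my_queue), the resource amounts, and the run script name (run_geoschem.sh) are placeholders that you should replace with values appropriate for your system.

    #!/bin/bash
    #SBATCH -n 1                  # One task...
    #SBATCH -c 8                  # ...using 8 cores for OpenMP
    #SBATCH -t 2-00:00            # Request 2 days of wall time
    #SBATCH --mem=15000           # Request 15 GB of memory
    #SBATCH -p my_queue           # Placeholder: a queue with sufficient limits

    # Placeholder: replace with however you normally start your GEOS-Chem run
    ./run_geoschem.sh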
Check if someone else has already reported the bug
Before trying to debug your code, we recommend that you check our Bugs and fixes wiki page to see if your error is a known issue, and if someone has already submitted a fix.
Recompile GEOS-Chem with debug options turned on
Check for common problems like array-out-of-bounds errors, floating-point exceptions, and parallelization issues by turning on debug compiler switches:
Debug options for GEOS-Chem Classic simulations
Compile GEOS-Chem Classic with the options listed below, then run a GEOS-Chem simulation and check the log files for error messages. An example compile command follows the table.
| Debugging flag | Description |
|---|---|
| DEBUG=yes | This option turns off all optimization. It also prepares GEOS-Chem so that it can be run in a debugger such as TotalView. |
| BOUNDS=yes | This option turns on runtime array-out-of-bounds checking, which looks for instances of invalid array indices (e.g. the A array only has 10 elements but you try to reference A(11)). |
| TRACEBACK=yes | This option turns on the -traceback option (ifort only), which will print a list of the routines that were active when the error occurred. NOTE: This option is turned on by default in GEOS-Chem v11-01 and newer versions. |
| FPEX=yes or FPE=yes | This option turns on error checking for floating-point exceptions (i.e. div-by-zero, NaN, floating-invalid, and similar errors). |
| OMP=no | This option turns off OpenMP parallelization (which is turned on by default). Use it to check for parallelization issues. |
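For example, to rebuild GEOS-Chem Classic with all of these checks enabled at once (a sketch only; the -j4 value is just a suggestion for the number of parallel make jobs, and you can add OMP=no to also check for parallelization issues):

    # Remove previously built files, then rebuild with the debugging switches above
    make clean
    make -j4 DEBUG=yes BOUNDS=yes TRACEBACK=yes FPEX=yes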
Debug options for GCHP simulations
To debug a GCHP simulation, follow these steps (a command-line sketch appears at the end of this subsection):
- In the file GCHP/Shared/Config/ESMA_base.mk, change the flag BOPT = O to BOPT = g.
- Clean up the GCHP directory with make clean_all.
- Recompile GCHP from scratch using the option make compile_debug.
- Submit a GCHP simulation and check the log files for error messages.
--Bob Yantosca (talk) 16:34, 21 December 2018 (UTC)
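Here is a minimal command-line sketch of the steps above, run from the directory where you normally build GCHP. It assumes the flag appears literally as "BOPT = O" in ESMA_base.mk; you can of course make that change in an editor instead of with sed.

    # 1. Switch the build from optimized (O) to debug (g) flags
    sed -i 's/BOPT = O/BOPT = g/' GCHP/Shared/Config/ESMA_base.mk

    # 2. Remove all previously built files
    make clean_all

    # 3. Rebuild GCHP from scratch with debugging options
    make compile_debug

    # 4. Submit your GCHP run as usual, then check the log files for error messages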
Identify whether the error happens consistently
If the error happens at the same model date & time, it could indicate bad input data. Check our List of reprocessed met fields to make sure there is not a known issue with the met fields or emissions for that date. This is a list of met field data files that had to be regenerated due to known issues (i.e. incomplete data or other such problems). You might be able to fix your problem by simply re-downloading the affected file or files.
If the error happened only once, it could be caused by a network problem or other such transient condition.
Run GEOS-Chem in a debugger to find the source of error
If you have access to a debugger (e.g. GDB, IDB, DBX, TotalView), you can save a lot of time and hassle by learning basic commands, such as how to:
- Examine data when a program stops
- Navigate the stack when a program stops
- Set break points
To run GEOS-Chem in a debugger, you should add the DEBUG=yes option to the make command. This will compile GEOS-Chem with the -g flag, which tells the compiler to generate symbolic debug information. The DEBUG=yes option also uses the -O0 flag, which switches off the compiler optimizations that can reorder individual instructions. To apply these options, type:
    make -j4 DEBUG=yes OMP=no    # Without parallelization
    make -j4 DEBUG=yes           # With parallelization
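Once the debug executable is built, you can run it under a debugger. The session below is a minimal sketch using GDB, assuming the GEOS-Chem Classic executable in your run directory is named geos; adapt the commands for IDB, DBX, or TotalView as needed.

    gdb ./geos            # Start GDB on the GEOS-Chem executable (assumed name: geos)
    (gdb) run             # Run until the program crashes or hits a breakpoint
    (gdb) backtrace       # After a crash: list the routines that were active
    (gdb) frame 2         # Move to a particular stack frame
    (gdb) print my_var    # Examine a variable in that frame (my_var is a placeholder)
    (gdb) quit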
Isolate the error to a particular operation
Can you tell if the error happens in transport, chemistry, emissions, dry deposition, etc.? Try turning off these operations one at a time in input.geos to see if you get past the error.
Also try turning on the ND70 diagnostic, which will add additional debug print statements to the output. This will help you to see the last subroutine that was exited before the error occurred.
Check any code modifications that you have added
If you have made modifications to a "fresh out-of-the-box" GEOS-Chem version, then you should look over your changes to search for the source of error.
You can also use Git to revert to the last known error-free state of GEOS-Chem, and use that as a reference.
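For example, here is a minimal sketch of how to compare against, and temporarily return to, the unmodified code with Git (these are generic Git commands, not a GEOS-Chem-specific workflow):

    git status        # List the files you have modified
    git diff          # Show your changes relative to the last commit
    git stash         # Temporarily set your uncommitted changes aside
    #   ... rebuild and rerun GEOS-Chem to test the unmodified code ...
    git stash pop     # Restore your changes when you are done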
Check for math errors
If you suspect a floating-point math error, such as:
- Division by zero
- Logarithm of a negative number
- Numerical overflow or underflow
- Infinity
then run make clean and recompile with the FPEX=yes flag. This turns on additional error checking that will stop your GEOS-Chem run if a floating-point error is encountered.
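A minimal sketch of that rebuild sequence (the -j4 value is only a suggestion; TRACEBACK=yes is optional but makes the resulting error message easier to trace):

    make clean                          # Remove previously built object files
    make -j4 FPEX=yes TRACEBACK=yes     # Rebuild with floating-point exception checks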
You can often detect numerical errors by adding debugging print statements into your source code:
- Check the minimum and maximum values of an array with the MINVAL and MAXVAL intrinsic functions:
    PRINT*, '### Min, Max: ', MINVAL( ARRAY ), MAXVAL( ARRAY )
    CALL FLUSH( 6 )
- Check the sum of an array with the SUM intrinsic function:
    PRINT*, '### Sum of X : ', SUM( ARRAY )
    CALL FLUSH( 6 )
See our Floating point math issues wiki page for information on how to avoid some common pitfalls.
When in doubt, print it out!
Print out the values of variables in the area where you suspect the error lies. You can also add CALL FLUSH( 6 ) after each print statement to force the output buffer to be written immediately. You may well spot something wrong in the output.
When all else fails, use the brute force method
If the bug is difficult to locate, then comment out a large section of code and run GEOS-Chem. If the error does not occur, then uncomment some more code and run GEOS-Chem again. Repeat the process until you find the location of the error. The brute force method may be tedious, but it will usually lead you to the source of the problem.
Using profiling tools to determine the source of computational bottlenecks
If you think your GEOS-Chem simulation is taking too long to run, consider using profiling tools to generate a list of the time that is spent in each routine. This can help you identify badly written or parallelized code that is causing GEOS-Chem to slow down. For more information, please see our Profiling GEOS-Chem wiki page.
Using the GEOS-Chem Unit Tester
The GEOS-Chem Unit Tester is an external package that can run several test GEOS-Chem simulations with a set of very strict debugging options. The debugging options are designed to detect issues such as floating-point math errors, array-out-of-bounds errors, inefficient subroutine calls, and parallelization errors. You can use this tool to find many common numerical errors and programming issues in your GEOS-Chem code.
For complete instructions on how the GEOS-Chem Unit Tester can assist your debugging efforts, please see our Debugging with the GEOS-Chem unit tester wiki page.
Contacting the GEOS-Chem Support Team for assistance
If you have tried to solve your code problem but cannot, then please report it to the GEOS-Chem Support Team (GCST). We will be happy to assist you. In order for us to diagnose your issue, we request that you send us the following items:
A description of the issue
Please include a brief description of the issue and any error messages that you encountered. Also include the steps (e.g. compile and run commands) that you used.
Of particular importance: The GEOS-Chem Support Team needs to know if you are using an unmodified (aka "out-of-the-box") version of GEOS-Chem, or if you have made any modifications to the code or input data. If you have modified GEOS-Chem, then please take a moment to double-check your modified code or data to make sure that it is not the source of the error.
The lastbuild file
The GEOS-Chem "Classic" compilation sequence generates a file that contains all of the options that were used to build the executable file. This file is named either lastbuild.mp (if OpenMP parallelization is turned on) or lastbuild.sp (if OpenMp parallelization is turned off).
Here is a sample lastbuild.mp file from a recent 1-month benchmark simulation:
    LAST BUILD INFORMATION:
      CODE_DIR    : /n/home05/msulprizio/GC/Code.Dev
      CODE_BRANCH : dev/12.1.0
      LAST_COMMIT : Revert to reading I3 met fields each hour instead of once per day
      COMMIT_DATE : Wed Nov 21 09:07:51 2018 -0500
      VERSION     : 12.1.0
      VERSION_TAG : GC_12.1.0
      MET         : geosfp
      GRID        : 4x5
      SIM         : benchmark
      NEST        : n
      TRACEBACK   : y
      BOUNDS      : n
      DEBUG       : n
      FPE         : n
      NO_ISO      : n
      NO_REDUCED  : y
      CHEM        : Standard
      TOMAS12     : n
      TOMAS15     : n
      TOMAS30     : n
      TOMAS40     : n
      RRTMG       : n
      MASSCONS    : n
      TIMERS      : 1
      TAU_PROF    : n
      BPCH_DIAG   : y
      NC_DIAG     : y
      COMPILER    : ifort 17.0.4
      Datetime    : 2018/11/21 10:33
At a very minimum, the GCST needs to know the following items in order to be able to diagnose your issue:
| Item | Description |
|---|---|
| LAST_COMMIT | The location (aka "commit") of your source code in the Git version history. (You can use the gitk browser to view the Git version history.) |
| VERSION | The GEOS-Chem version number. |
| MET | The meteorology that is driving GEOS-Chem (e.g. geosfp, merra2). |
| GRID | The horizontal grid (e.g. 4° x 5°, 2° x 2.5°, 0.25° x 0.3125°, 0.5° x 0.625°) that your simulation uses. |
| SIM | The name of your GEOS-Chem simulation (e.g. Standard, Benchmark, CH4, Rn-Pb-Be, Hg, etc.). |
| NEST | The nested region that your simulation uses (e.g. as, af, ch, na, eu, etc.). This only applies to nested-grid simulations. |
| COMPILER | The compiler version that you used to build the GEOS-Chem executable file. |
The GEOS-Chem log file
The GEOS-Chem Support Team will also need to look at the complete log file from your GEOS-Chem simulation. In many cases, GEOS-Chem users send us screenshots of the error message, but in practice this does not provide enough information to solve the problem.
The GEOS-Chem log file contains an "echo-back" of all input options, as well as a list of files that are being read in, and operations that are being done at each timestep. Here is an example log file from a recent 1-month benchmark simulation:
    ************* S T A R T I N G 4 x 5 G E O S - C H E M *************

    ===> Mode of operation         : GEOS-Chem "Classic"
    ===> GEOS-Chem version         : 12.1.0
    ===> Compiler                  : Intel Fortran Compiler (aka ifort)
    ===> Driven by meteorology     : GMAO GEOS-FP (on native 72-layer vertical grid)
    ===> ISORROPIA ATE package     : ON
    ===> Parallelization w/ OpenMP : ON
    ===> Binary punch diagnostics  : ON
    ===> netCDF diagnostics        : ON
    ===> netCDF file compression   : SUPPORTED
    ===> SIMULATION START TIME: 2018/11/21 10:38 <===

    ===============================================================================
    G E O S - C H E M   U S E R   I N P U T

    READ_INPUT_FILE: Reading input.geos

    SIMULATION MENU
    ---------------
    Start time of run          : 20160701 000000
    End time of run            : 20160801 000000
    Run directory              : ./
    Data Directory             : /n/holylfs/EXTERNAL_REPOS/GEOS-CHEM/gcgrid/data/ExtData/
    CHEM_INPUTS directory      : /n/holylfs/EXTERNAL_REPOS/GEOS-CHEM/gcgrid/data/ExtData/CHEM_INPUTS/
    Resolution-specific dir    : GEOS_4x5/
    Is this a nested-grid sim? : F
    Global offsets I0, J0      : 0 0

    TIMESTEP MENU
    ---------------
    Transport/Convection [sec] : 600
    Chemistry/Emissions  [sec] : 1200
    ... etc ...
If your simulation stopped with an error, a detailed error message should also appear in the GEOS-Chem log file.
The HEMCO log file
The HEMCO emissions component also produces a log file. The log file is named HEMCO.log. The HEMCO log file contains information about how emissions, met fields, and other relevant data are read from disk and processed for input into GEOS-Chem.
(NOTE: When your simulation finishes successfully, the HEMCO.log file is automatically renamed to either HEMCO.log.mp (if OpenMP parallelization is turned on) or HEMCO.log.sp (if OpenMP parallelization is turned off). But if your run dies with an error, the HEMCO log file will not have been renamed.)
Here is a sample HEMCO.log file from a recent 1-month benchmark simulation (edited for brevity):
    -------------------------------------------------------------------------------
    Using HEMCO v2.1.010
    -------------------------------------------------------------------------------
    Registering HEMCO species:
    Species NO
    Species O3
    Species PAN
    Species CO
    ... etc ...
    -------------------------------------------------------------------------------
    -------------------------------------------------------------------------------
    Use air-sea flux emissions (extension module)
     - Use species:
       - DMS   25
       - ACET   9
       - ALD2  11
    -------------------------------------------------------------------------------
    Use ParaNOx ship emissions (extension module)
     - Use the following species: (MW, emitted as HEMCO ID)
       NO  : 30.00    1
       NO2 : 46.00   64
       O3  : 48.00    2
       HNO3: 63.00    7
    READ_LUT_NCFILE: Reading /n/holylfs/EXTERNAL_REPOS/GEOS-CHEM/gcgrid/data/ExtData/HEMCO/PARANOX/v2015-02/ship_plume_lut_02ms.nc
    READ_LUT_NCFILE: Reading /n/holylfs/EXTERNAL_REPOS/GEOS-CHEM/gcgrid/data/ExtData/HEMCO/PARANOX/v2015-02/ship_plume_lut_06ms.nc
    READ_LUT_NCFILE: Reading /n/holylfs/EXTERNAL_REPOS/GEOS-CHEM/gcgrid/data/ExtData/HEMCO/PARANOX/v2015-02/ship_plume_lut_10ms.nc
    READ_LUT_NCFILE: Reading /n/holylfs/EXTERNAL_REPOS/GEOS-CHEM/gcgrid/data/ExtData/HEMCO/PARANOX/v2015-02/ship_plume_lut_14ms.nc
    READ_LUT_NCFILE: Reading /n/holylfs/EXTERNAL_REPOS/GEOS-CHEM/gcgrid/data/ExtData/HEMCO/PARANOX/v2015-02/ship_plume_lut_18ms.nc
    -------------------------------------------------------------------------------
    Use lightning NOx emissions (extension module)
     - Use species NO->   1
     - Use OTD-LIS factors from file? T
     - Use GEOS-5 flash rates: F
     - Use scalar scale factor: 1.000000
     - Use gridded scale field: none
     - INIT_LIGHTNOX: Reading /n/holylfs/EXTERNAL_REPOS/GEOS-CHEM/gcgrid/data/ExtData/HEMCO/LIGHTNOX/v2014-07/light_dist.ott2010.dat
    -------------------------------------------------------------------------------
    Use soil NOx emissions (extension module)
     - NOx species  : NO   1
     - NOx scale factor : 1.000000
     - NOx scale field : none
     - Use fertilizer NOx : T
     - Fertilizer scale factor: 6.800000090152025E-003
    -------------------------------------------------------------------------------
    -------------------------------------------------------------------------------
    Use DEAD dust emissions (extension module)
    Use the following species (Name: HcoID):
    DST1: 38
    DST2: 39
    DST3: 40
    DST4: 41
    Global mass flux tuning factor: 8.328599999999999E-004
    -------------------------------------------------------------------------------
    -------------------------------------------------------------------------------
    Use sea salt aerosol emissions (extension module)
    Accumulation aerosol: SALA: 42
     - size range : 1.000000000000000E-002 0.500000000000000
    Coarse aerosol : SALC: 43
     - size range : 1.000000000000000E-002 0.500000000000000
     - wind scale factor: 1.00000000000000
    -------------------------------------------------------------------------------
    Use MEGAN biogenic emissions (extension module)
    -------------------------------------------------------------------------------
     - This is instance 1
     - Use the following species:
       Isoprene       = ISOP    6
       Acetone        = ACET    9
       C3 Alkenes     = PRPE   18
       Ethene         = C2H4   -1
       ALD2           = ALD2   11
       EOH            = EOH    96
       SOA-Precursor  = SOAP   94
       SOA-Simple     = SOAS   95
       --> Isoprene scale factor is 1.00000000000000
       --> Use CO2 inhibition on isoprene option T
       --> Global atmospheric CO2 concentration : 390.000000000000 ppmv
       --> Normalize LAI by PFT: T
     - MEGAN monoterpene option enabled
       CO             = CO      4
       a-,b-pinene    = MTPA   70
       Other monoterp.= MTPO   72
       Limonene       = LIMO   71
       Sesquiterpenes = SESQ  184
    -------------------------------------------------------------------------------
    Use GFED extension
     - Use GFED-4 : T
     - Use daily scale factors : F
     - Use hourly scale factors: F
     - Hydrophilic OC fraction : 0.5000000
     - Hydrophilic BC fraction : 0.2000000
     - POG1 fraction : 0.4900000
     - SOAP fraction : 1.3000000E-02
     - Emit GFED species NO as model species NO
       --> Will use scale factor: 1.000000
       --> Will use scale field : none
     - Emit GFED species CO as model species CO
       --> Will use scale factor: 1.050000
       --> Will use scale field : none
    ... etc...
    -------------------------------------------------------------------------------
    HEMCO v2.1.010 FINISHED.
    Warnings (level 1 or lower): 25386
    -------------------------------------------------------------------------------
--Bob Yantosca (talk) 16:16, 29 November 2018 (UTC)
The job output file from the batch scheduler
If you submitted your GEOS-Chem simulation through a batch scheduler, please also send us the job output file that the scheduler creates. For example, SLURM writes this output to a file named after the job ID, such as:

    slurm-60034556.out