|
|
(7 intermediate revisions by 2 users not shown) |
Line 1: |
Line 1: |
− | On this page we provide information that you can use to debug errors that may arise in your GEOS-Chem simulations.
| + | This content has been migrated to the [https://geos-chem.readthedocs.io/en/stable/geos-chem-shared-docs/supplemental-guides/debug-guide.html '''''Debug GEOS-Chem and HEMCO errors'''''] supplemental guide at [https://geos-chem.readthedocs.io geos-chem.readthedocs.io]. |
− | | + | |
− | == Overview ==
| + | |
− | | + | |
− | If your GEOS-Chem simulation dies unexpectedly with an error or takes much longer to execute than it should, the most important thing is to isolate the source of the error or bottleneck right away. Below are some debugging tips that you can use.
| + | |
− | | + | |
− | Also see the [[Common_GEOS-Chem_error_messages|''Common GEOS-Chem error messages'' wiki page]] for detailed information on common errors and how to resolve them.
| + | |
− | | + | |
− | == Debugging tips ==
| + | |
− | | + | |
− | === Check to see if your error is listed on the wiki ===
| + | |
− | | + | |
− | The [[GEOS-Chem Support Team]] has compiled a list of common GEOS-Chem errors (and their solutions) at our ''[[Guide to GEOS-Chem error messages]]''.
| + | |
− | | + | |
− | The GCST also keeps a list of [[Bugs and fixes|GEOS-Chem bugs/technical issues and the versions in which they were fixed]]. In some cases, the fastest solution to your problem might be to upgrade to a newer version of GEOS-Chem in which your issue has been fixed.
| + | |
− | | + | |
− | === Look at the traceback to find where the error happened ===
| + | |
− | | + | |
− | When GEOS-Chem dies with an error, it will print out an error stack, also known as a '''traceback'''. This is a list of routines that were called when the error occurred and the line at which the error occurred.
| + | |
− | | + | |
− | An error stack is included below:
| + | |
− | | + | |
− | forrtl: severe (174): SIGSEGV, segmentation fault occurred
| + | |
− | Image PC Routine Line Source
| + | |
− | libintlc.so.5 00002ACA46B91961 Unknown Unknown Unknown
| + | |
− | libintlc.so.5 00002ACA46B900B7 Unknown Unknown Unknown
| + | |
− | libnetcdff.so.5 00002ACA4473D682 Unknown Unknown Unknown
| + | |
− | libnetcdff.so.5 00002ACA4473D4D6 Unknown Unknown Unknown
| + | |
− | libnetcdff.so.5 00002ACA4471DD4C Unknown Unknown Unknown
| + | |
− | libnetcdff.so.5 00002ACA44721DB8 Unknown Unknown Unknown
| + | |
− | libpthread.so.0 00000031A0A0F710 Unknown Unknown Unknown
| + | |
− | <span style="color:red">'''geos.mp 000000000175FF79 hco_interface_mod 341 hco_interface_mod.F90'''</span>
| + | |
− | geos.mp 00000000005F1F47 carbon_mod_mp_emi 5490 carbon_mod.F
| + | |
− | geos.mp 00000000016EAF33 emissions_mod_mp_ 206 emissions_mod.F90
| + | |
− | geos.mp 00000000010BB119 MAIN__ 1383 main.F
| + | |
− | geos.mp 000000000040370E Unknown Unknown Unknown
| + | |
− | libc.so.6 00000031A061ED5D Unknown Unknown Unknown
| + | |
− | geos.mp 0000000000403619 Unknown Unknown Unknown
| + | |
− | | + | |
− | The top line with a valid routine and line number printed is the location of the error. In this case, there is an issue in <tt>hco_interface_mod.F90</tt> at line 341. You may also step back through the routines to determine what went wrong. In this case, the problematic routine in <tt>hco_interface_mod.F90</tt> was called from <tt>carbon_mod.F</tt> (line 5490), which was in turn called from <tt>emissions_mod.F90</tt> (line 206), and so on. It may be useful to [[#Recompile_GEOS-Chem_with_debug_options_turned_on|recompile and rerun GEOS-Chem with additional debug options turned on]] (e.g. <tt>BOUNDS=yes</tt>, <tt>FPE=yes</tt>) to determine the cause of the error.
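For long tracebacks, the rule described above (take the first row that has a numeric line number, then read downward to follow the call chain) can be applied mechanically. The following is a minimal Python sketch, not part of GEOS-Chem. The function name <tt>known_frames</tt> is hypothetical, and the sketch assumes the plain-text, five-column ifort layout shown above, without any wiki markup.

```python
import re

# One traceback row whose "Line" column is a number, for example:
# geos.mp  000000000175FF79  hco_interface_mod  341  hco_interface_mod.F90
# Rows that report "Unknown" in the Line column do not match.
FRAME = re.compile(
    r"^\s*(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]+)\s+(?P<routine>\S+)"
    r"\s+(?P<line>\d+)\s+(?P<source>\S+)\s*$"
)


def known_frames(traceback_text):
    """Return (routine, line, source) for each row with a numeric line."""
    frames = []
    for row in traceback_text.splitlines():
        match = FRAME.match(row)
        if match:
            frames.append(
                (match["routine"], int(match["line"]), match["source"])
            )
    return frames


# Rows copied from the sample error stack above.
sample = """\
libpthread.so.0    00000031A0A0F710  Unknown            Unknown  Unknown
geos.mp            000000000175FF79  hco_interface_mod  341      hco_interface_mod.F90
geos.mp            00000000005F1F47  carbon_mod_mp_emi  5490     carbon_mod.F
geos.mp            00000000016EAF33  emissions_mod_mp_  206      emissions_mod.F90
geos.mp            00000000010BB119  MAIN__             1383     main.F
geos.mp            000000000040370E  Unknown            Unknown  Unknown
"""

frames = known_frames(sample)
print("Error location:", frames[0])
for routine, line, source in frames[1:]:
    print("  called from", source, "line", line)
```

The first tuple returned is the location of the error. The remaining tuples are the callers, innermost first.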
| + | |
− | | + | |
− | === Check the GEOS-Chem and HEMCO log files ===
| + | |
− | | + | |
− | If your GEOS-Chem simulation stopped with an error, but you cannot tell where, turn on the ND70 diagnostic (debug output) in <tt>input.geos</tt> and rerun your simulation. The ND70 diagnostic will print debug output at several locations in the code (after transport, chemistry, emissions, dry deposition, etc.). This should let you pinpoint the location of the error.
| + | |
− | | + | |
− | If the log file indicates your run stopped in emissions, you can check the <tt>HEMCO.log</tt> file for additional information ([[GEOS-Chem v10-01]] and later versions only). We recommend setting both the <tt>Verbose</tt> and <tt>Warnings</tt> options in <tt>HEMCO_Config.rc</tt> to 3 to print all debug statements and warning messages to your <tt>HEMCO.log</tt> file.
| + | |
− | | + | |
− | If your run stopped with an error, see the [[Common_GEOS-Chem_error_messages|''Common GEOS-Chem error messages'' wiki page]] for detailed information on common errors and how to resolve them.
| + | |
− | | + | |
− | === Make sure you did not max out your allotted time or memory ===
| + | |
− | | + | |
− | If you are running GEOS-Chem on a shared computer system, chances are you will have used a scheduler (such as LSF, PBS, Grid Engine, or SLURM) to submit your GEOS-Chem job to a computational queue. You should be aware of the run time and memory limits for each of the queues on your system.
| + | |
− | | + | |
− | If your GEOS-Chem job uses more memory or run time than the computational queue allows, your job can be cancelled by the scheduler. You will usually get an error message printed out to the stderr stream. Be sure to check all of the log files created by your GEOS-Chem jobs for such error messages.
| + | |
− | | + | |
− | The solution will usually be to submit your GEOS-Chem simulation to a queue with a longer run-time limit, or larger memory limit. You can also split up your GEOS-Chem simulation into several smaller stages that take less time to complete.
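Checking several log files by hand for scheduler messages can be tedious. The following is a minimal Python sketch of such a scan, not part of GEOS-Chem. The function name <tt>find_limit_messages</tt> is hypothetical, and the search patterns are assumptions based on messages that SLURM, LSF, and PBS commonly print. The exact wording depends on your scheduler and its version, so adjust the patterns to match your system.

```python
import re

# Example patterns only (assumptions): adjust them to the messages
# that your own scheduler actually prints.
LIMIT_PATTERNS = {
    "time limit": re.compile(
        r"DUE TO TIME LIMIT|TERM_RUNLIMIT|walltime .* exceeded", re.I
    ),
    "memory limit": re.compile(
        r"oom-kill|out of memory|TERM_MEMLIMIT|exceeded .*memory", re.I
    ),
}


def find_limit_messages(log_lines):
    """Return (reason, line_number, text) for each suspicious log line."""
    hits = []
    for number, text in enumerate(log_lines, start=1):
        for reason, pattern in LIMIT_PATTERNS.items():
            if pattern.search(text):
                hits.append((reason, number, text.strip()))
    return hits


# Made-up log excerpt for illustration; the job ID and node name are
# hypothetical.
sample_log = [
    "(ordinary simulation output)",
    "slurmstepd: error: *** JOB 12345 ON node01 CANCELLED AT "
    "2019-06-14T20:33:00 DUE TO TIME LIMIT ***",
    "slurmstepd: error: Detected 1 oom-kill event(s) in step 12345.batch cgroup.",
]

hits = find_limit_messages(sample_log)
for reason, number, text in hits:
    print(f"line {number}: possible {reason} problem: {text}")
```

To scan a real log, pass an open file object instead of the sample list. Remember to check the stderr file written by the scheduler as well as the GEOS-Chem log.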
| + | |
− | | + | |
− | === Check if someone else has already reported the bug ===
| + | |
− | | + | |
− | Before trying to debug your code, we recommend that you check [[Bugs_and_fixes|our ''Bugs and fixes'' wiki page]] to see if your error is a known issue, and if someone has already submitted a fix. Also check [[Common GEOS-Chem error messages|our ''Common GEOS-Chem error messages'' wiki page]] for a list of commonly-encountered issues.
| + | |
− | | + | |
− | === Recompile GEOS-Chem with debug options turned on ===
| + | |
− | | + | |
− | Check for common problems like array-out-of-bounds errors, floating-point exceptions, and parallelization issues by turning on debug compiler switches:
| + | |
− | | + | |
− | ==== Debug options for GEOS-Chem Classic simulations ====
| + | |
− | | + | |
− | Compile GEOS-Chem classic with the options listed below. Then run a GEOS-Chem simulation and check the log files for error messages.
| + | |
− | | + | |
− | {| border=1 cellspacing=0 cellpadding=5
| + | |
− | |-valign="top" bgcolor="#CCCCCC"
| + | |
− | !width="150px"|Debugging flag
| + | |
− | !width="850px"|Description
| + | |
− | | + | |
− | |-valign="top"
| + | |
− | |<tt>DEBUG=yes</tt>
| + | |
− | |This option turns off all optimization. It also prepares GEOS-Chem so that it can be run in a debugger like TotalView.
| + | |
− | | + | |
− | |-valign="top"
| + | |
− | |<tt>BOUNDS=yes</tt>
| + | |
− | |This option turns on runtime [[Common GEOS-Chem error messages#Array out-of-bounds error|array-out-of-bounds]] checking, which looks for instances of invalid array indices (e.g. if the <tt>A</tt> array has only 10 elements but you try to reference <tt>A(11)</tt>).
| + | |
− | | + | |
− | |-valign="top"
| + | |
− | |<tt>TRACEBACK=yes</tt>
| + | |
− | | This option turns on the <tt>-traceback</tt> option (ifort only) and will print a list of routines that were called when the error occurred.<br><span style="color:darkorange">'''''NOTE: This option will always be turned on by default in [[GEOS-Chem v11-01]] and newer versions.'''''</span>
| + | |
− | | + | |
− | |-valign="top"
| + | |
− | |<tt>FPEX=yes</tt> or<br><tt>FPE=yes</tt>
| + | |
− | |This option turns on error checking for [[Floating_point_math_issues|floating-point exceptions]] (i.e. div-by-zero, NaN, floating-invalid, and similar errors).
| + | |
− | | + | |
− | |-valign="top"
| + | |
− | |<tt>OMP=no</tt>
| + | |
− | |This option turns off [[Parallelizing_GEOS-Chem|OpenMP parallelization]] (which is turned on by default). If the error disappears when the simulation runs on a single core, a parallelization issue is the likely cause.
| + | |
− | | + | |
− | |}
| + | |
− | | + | |
− | ==== Debug options for GCHP simulations ====
| + | |
− | | + | |
− | To debug a GCHP simulation, follow these steps:
| + | |
− | | + | |
− | # In file <tt>GCHP/Shared/Config/ESMA_base.mk</tt> change the flag <tt>BOPT = O</tt> to <tt>BOPT = g</tt>
| + | |
− | # Clean up the GCHP directory with <tt>make clean_all</tt>.
| + | |
− | # Recompile GCHP from scratch using the option <tt>make compile_debug</tt>
| + | |
− | # Submit a GCHP simulation and check the log files for error messages.
| + | |
− | | + | |
− | --[[User:Bmy|Bob Yantosca]] ([[User talk:Bmy|talk]]) 16:34, 21 December 2018 (UTC)
| + | |
− | | + | |
− | === Identify whether the error happens consistently ===
| + | |
− | | + | |
− | If the error happens at the same model date & time, it could indicate bad input data. Check our [[List of reprocessed met fields]] to make sure there is not a known issue with the met fields or emissions for that date. This is a list of met field data files that had to be regenerated due to known issues (i.e. incomplete data or other such problems). You might be able to fix your problem by simply re-downloading the affected file or files.
| + | |
− | | + | |
− | If the error happened only once, it could be caused by a network problem or other such transient condition.
| + | |
− | | + | |
− | === Run GEOS-Chem in a debugger to find the source of error ===
| + | |
− | | + | |
− | If you have access to a debugger (e.g. GDB, IDB, DBX, Totalview), you can save a lot of time and hassle by learning the basic commands such as how to:
| + | |
− | | + | |
− | *Examine data when a program stops
| + | |
− | *Navigate the stack when a program stops
| + | |
− | *Set break points
| + | |
− | | + | |
− | To run GEOS-Chem in a debugger, add the <tt>DEBUG=yes</tt> option to the make command. This will compile GEOS-Chem with the <tt>-g</tt> flag, which tells the compiler to generate symbolic debug information. The <tt>DEBUG=yes</tt> option also uses the <tt>-O0</tt> flag, which switches off compiler optimization. Optimization can otherwise change the sequence in which individual instructions occur, which makes it harder to step through the code. To apply these options, type:
| + | |
− | | + | |
− | make -j4 DEBUG=yes OMP=no # Without parallelization
| + | |
− | make -j4 DEBUG=yes # With parallelization
| + | |
− | | + | |
− | === Isolate the error to a particular operation ===
| + | |
− | | + | |
− | Can you tell if the error happens in transport, chemistry, emissions, dry deposition, etc.? Try turning off these operations one at a time in <tt>input.geos</tt> to see if you get past the error.
| + | |
− | | + | |
− | Also try turning on the ND70 diagnostic, which will add additional debug print statements to the output. This will help you to see the last subroutine that was exited before the error occurred.
| + | |
− | | + | |
− | === Check any code modifications that you have added ===
| + | |
− | | + | |
− | If you have made modifications to a "fresh out-of-the-box" GEOS-Chem version, then you should look over your changes to search for the source of error.
| + | |
− | | + | |
− | You can also use Git to revert to the last known error-free state of GEOS-Chem, and use that as a reference.
| + | |
− | | + | |
− | === Check for math errors ===
| + | |
− | | + | |
− | If you suspect a floating-point math error, such as:
| + |
− | | + |
− | *Division by zero
| + |
− | *Logarithm of a negative number
| + |
− | *Numerical overflow or underflow
| + |
− | *Infinity
| + |
− | | + |
− | then run <tt>make clean</tt> and recompile with the <tt>FPEX=yes</tt> flag. This will turn on additional error checking that will stop your GEOS-Chem run if a floating-point error is encountered.
| + | |
− | | + | |
− | You can often detect numerical errors by adding debugging print statements into your source code:
| + | |
− | | + | |
− | *Check the minimum and maximum values of an array with the <tt>MINVAL</tt> and <tt>MAXVAL</tt> intrinsic functions:
| + | |
− | | + | |
− | PRINT*, '### Min, Max: ', MINVAL( ARRAY ), MAXVAL( ARRAY )
| + | |
− | CALL FLUSH( 6 )
| + | |
− | | + | |
− | *Check the sum of an array with the <tt>SUM</tt> intrinsic function:
| + | |
− | | + | |
− | PRINT*, '### Sum of ARRAY: ', SUM( ARRAY )
| + | |
− | CALL FLUSH( 6 )
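The same minimum, maximum, and sum checks can also be applied outside the model, for example to values copied from a log file. The following is a minimal Python sketch, not part of GEOS-Chem. The function name <tt>summarize</tt> is hypothetical. It counts NaN and infinite values separately so that they do not distort the other statistics.

```python
import math


def summarize(values):
    """Report min, max, and sum of the finite values, plus NaN/Inf counts."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
        "sum": math.fsum(finite),
        "n_nan": sum(1 for v in values if math.isnan(v)),
        "n_inf": sum(1 for v in values if math.isinf(v)),
    }


# Illustrative values only: three ordinary numbers, one NaN, one infinity.
values = [1.0, 2.5, float("nan"), float("inf"), -3.0]
summary = summarize(values)
print(summary)
```

A nonzero NaN or infinity count suggests that a floating-point error occurred before the values were written.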
| + | |
− | | + | |
− | See [[Floating_point_math_issues|our ''Floating point math issues'' wiki page]] for information on how to avoid some common pitfalls.
| + | |
− | | + | |
− | === When in doubt, print it out! ===
| + | |
− | | + | |
− | Print out the values of variables in the area where you suspect the error lies. You can also add <tt>call flush(6)</tt> to flush the output buffer after writing. Maybe you will see something wrong in the output.
| + | |
− | | + | |
− | === When all else fails, use the brute force method ===
| + | |
− | | + | |
− | If the bug is difficult to locate, then comment out a large section of code and run GEOS-Chem. If the error does not occur, then uncomment some more code and run GEOS-Chem again. Repeat the process until you find the location of the error. The brute force method may be tedious, but it will usually lead you to the source of the problem.
| + | |
− | | + | |
− | == Using profiling tools to determine the source of computational bottlenecks ==
| + | |
− | | + | |
− | If you think your GEOS-Chem simulation is taking too long to run, consider using profiling tools to generate a list of the time that is spent in each routine. This can help you identify badly written or parallelized code that is causing GEOS-Chem to slow down. For more information, please see [[Profiling GEOS-Chem|our ''Profiling GEOS-Chem'' wiki page]].
| + | |
− | | + | |
− | == Using the GEOS-Chem Unit Tester ==
| + | |
− | | + | |
− | The GEOS-Chem Unit Tester is an external package that can run several test GEOS-Chem simulations with a set of very strict debugging options. The debugging options are designed to detect issues such as floating-point math errors, array-out-of-bounds errors, inefficient subroutine calls, and parallelization errors. You can use this tool to find many common numerical errors and programming issues in your GEOS-Chem code.
| + | |
− | | + | |
− | For complete instructions on how the GEOS-Chem Unit Tester can assist your debugging efforts, please see our [[Debugging with the GEOS-Chem unit tester|''Debugging with the GEOS-Chem unit tester'' wiki page]].
| + | |
− | | + | |
− | == Contacting the GEOS-Chem Support Team for assistance ==
| + | |
− |
| + | |
− | If you have tried to solve your code problem but cannot, then please report it to the [[GEOS-Chem Support Team]]. We will be happy to assist you. Please see [[Submitting GEOS-Chem support requests|our ''Submitting GEOS-Chem support requests'' wiki page]] for a checklist of items to include in your support request.
| + | |
− | | + | |
− | --[[User:Bmy|Bob Yantosca]] ([[User talk:Bmy|talk]]) 20:33, 14 June 2019 (UTC)
| + | |