Bugs and fixes

NaN's in SMVGEAR

05-Jul-2007

From Lok Lamsal (lok.lamsal@fizz.phys.dal.ca)

I ran into a problem while running GEOS-4, version v7-04-09, at 2x2.5. The simulation stops on 15 July 2006 with different error messages on two of our machines, tuque and beret. One of the error messages on tuque is like this:

  sum of rrate =  Infinity
  SMVGEAR: CNEW is NaN!
  Species index :            1
  Grid Box      :          121          15           1
  STOP in smvgear.f!
      - CLEANUP: deallocating arrays now...
 forrtl: severe (174): SIGSEGV, segmentation fault occurred

And on beret the message is like this:

      - CLEANUP: deallocating arrays now...
  forrtl: severe (174): SIGSEGV, segmentation fault occurred
  Image              PC                Routine            Line        Source
  geos_4_nv          400000000059EA10  Unknown            Unknown     Unknown
  libguide.so        200000000039C1A0  Unknown            Unknown     Unknown
  etc.

The message which is repeated in either case is like this:

  SMVGEAR: DELT= 9.96E-16 TOO LOW DEC YFAC. KBLK, KTLOOP, NCS, TIME, TIMREMAIN, YFAC, EPS =

Could you suggest what the problem could be? Just to inform you: while trying to figure out the problem, I learned from Bastien that he did not have a problem on that day with version v7-04-10, which stopped on 13 September 2006.

Response by Bob Yantosca (yantosca@seas.harvard.edu):

I think there is a division by zero somewhere that is causing SMVGEAR to choke. It could be a couple of things:

(1) Make sure that in your a6_read_mod.f (routine READ_A6) you have the following code to prevent Q from going to zero, which can make logarithms blow up in other places in the code:

          !--------------------------------
          ! Q: 6-h avg specific humidity
          ! (GEOS-4 only)
          !--------------------------------
          CASE ( 'Q' )
             READ( IU_A6, IOSTAT=IOS ) XYMD, XHMS, Q3
             IF ( IOS /= 0 ) CALL IOERROR( IOS, IU_A6, 'read_a6:16' )
     
             IF ( CHECK_TIME( XYMD, XHMS, NYMD, NHMS ) ) THEN
                IF ( PRESENT( Q ) ) CALL TRANSFER_3D( Q3, Q )
                NFOUND = NFOUND + 1
     
                ! NOTE: Now set negative Q to a small positive #
                ! instead of zero, so as not to blow up logarithms
                ! (bmy, 9/8/06)
                WHERE ( Q < 0d0 ) Q = 1d-32
             ENDIF

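As a quick illustration of why the clamp uses a small positive value rather than zero (a standalone sketch, not GEOS-Chem code; under default IEEE arithmetic, LOG of zero returns -Infinity rather than stopping the program):

      PROGRAM LOG_DEMO
      IMPLICIT NONE
      REAL*8 :: Q

      ! LOG of an exactly zero humidity value is -Infinity, which
      ! then poisons any downstream arithmetic with Inf/NaN values
      Q = 0d0
      PRINT *, LOG( Q )

      ! Clamped to a tiny positive number, the result stays finite
      Q = MAX( Q, 1d-32 )
      PRINT *, LOG( Q )     ! approximately -73.7

      END PROGRAM LOG_DEMO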

(2) In fvdas_convect_mod.f, make SMALLEST a smaller number (i.e. 1d-60):

    !=================================================================
    ! MODULE VARIABLES
    !=================================================================
     
    ! Variables
    INTEGER            :: LIMCNV

    ! Constants
    LOGICAL, PARAMETER :: RLXCLM   = .TRUE.
    REAL*8,  PARAMETER :: CMFTAU   = 3600.d0
    REAL*8,  PARAMETER :: EPS      = 1.0d-13       
    REAL*8,  PARAMETER :: GRAV     = 9.8d0
    !-------------------------------------------------------
    ! Prior to 12/19/06:
    ! Make SMALLEST smaller (bmy, 12/19/06)
    !REAL*8,  PARAMETER :: SMALLEST = 1.0d-32
    !-------------------------------------------------------
    REAL*8,  PARAMETER :: SMALLEST = 1.0d-60
    REAL*8,  PARAMETER :: TINYALT  = 1.0d-36           
    REAL*8,  PARAMETER :: TINYNUM  = 2*SMALLEST

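Note that 1.0d-60 is still far above the smallest normal REAL*8 value (about 2.2d-308), so shrinking SMALLEST this way cannot underflow; it simply makes the floor less likely to perturb physically meaningful concentrations.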

(3) In "fvdas_convect_mod.f", avoid division by zero in routine CONVTRAN:


             IF ( CDIFR > 1.d-6 ) THEN
     
                ! If the two layers differ significantly,
                ! use a geometric averaging procedure
                CABV = MAX( CMIX(I,KM1), MAXC*TINYNUM, SMALLEST )
                CBEL = MAX( CMIX(I,K),   MAXC*TINYNUM, SMALLEST )
  !-----------------------------------------------------------------
  !  Prior to 12/19/06:
  ! Avoid division by zero (bmy, 12/19/06)
  !                  CHAT(I,K) = LOG( CABV / CBEL)
  !     &                       /   ( CABV - CBEL)
  !     &                       *     CABV * CBEL
  !-----------------------------------------------------------------
     
                ! If CABV-CBEL is zero then set CHAT=SMALLEST
                ! so that we avoid div by zero (bmy, 12/19/06)
                IF ( ABS( CABV - CBEL ) > 0d0 ) THEN
                   CHAT(I,K) = LOG( CABV / CBEL )
   &                         /    ( CABV - CBEL )
   &                         *      CABV * CBEL
                ELSE
                   CHAT(I,K) = SMALLEST
                ENDIF
     
             ELSE                           
                ! Small diff, so just arithmetic mean
                CHAT(I,K) = 0.5d0 * ( CMIX(I,K) + CMIX(I,KM1) )
             ENDIF

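For reference, the guarded expression is the logarithmic-mean interpolation

   CHAT = CABV * CBEL * LOG( CABV / CBEL ) / ( CABV - CBEL )

As CBEL approaches CABV, the factor LOG( CABV / CBEL ) / ( CABV - CBEL ) tends to 1/CABV, so CHAT tends smoothly to CABV; at exact equality, however, the expression is 0/0, which is why the new IF test (together with the arithmetic-mean branch for nearly equal layers) is needed.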

(4) Also I had to rewrite the parallel DO loops in the routine HACK_CONV since this was causing some kind of a memory fault.

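One common cause of memory faults in parallel DO loops is a scratch variable that is missing from the OpenMP PRIVATE clause, so that threads overwrite each other's copies. A generic sketch of the safe pattern (illustrative only; SCALE_Q, Q, FRAC, and DELTA are made-up names, not the actual HACK_CONV code):

      SUBROUTINE SCALE_Q( Q, FRAC )
      IMPLICIT NONE

      ! Illustrative 2 x 2.5 grid dimensions
      INTEGER, PARAMETER :: IIPAR=144, JJPAR=91, LLPAR=30

      ! Arguments and per-thread scratch
      REAL*8  :: Q(IIPAR,JJPAR,LLPAR), FRAC, DELTA
      INTEGER :: I, J, L

!$OMP PARALLEL DO
!$OMP+DEFAULT( SHARED )
!$OMP+PRIVATE( I, J, L, DELTA )
      DO L = 1, LLPAR
      DO J = 1, JJPAR
      DO I = 1, IIPAR

         ! DELTA is loop-body scratch: if it were left out of the
         ! PRIVATE clause, threads would clobber each other's
         ! values, giving wrong answers or memory faults
         DELTA    = Q(I,J,L) * FRAC
         Q(I,J,L) = Q(I,J,L) - DELTA

      ENDDO
      ENDDO
      ENDDO
!$OMP END PARALLEL DO

      END SUBROUTINE SCALE_Q
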
You may just want to get the most recent version of fvdas_convect_mod.f, which has all of these fixes installed. See:

  ftp ftp.as.harvard.edu
  cd pub/exchange/bmy
  get fvdas_convect_mod.f

So I would recommend implementing these fixes and seeing if this solves your problem.

NOTE: These fixes have been introduced into GEOS-Chem v7-04-10.

regrid_1x1_mod.f

16 Oct 2007

From Mike Barkley (mbarkley@staffmail.ed.ac.uk)

I think I've found an error in one of the regrid_1x1_mod.f subroutines (attached in the text file):

SUBROUTINE REGRID_MASS_TO_2x25( I1, J1, L1, IN, I2, J2, OUT )

There is a DO loop over longitude with the upper limit defined as the input latitude (J1) instead of what should (?) be the output longitude (I2). I've indicated where this is in the program. Which is correct? We didn't notice this until we were running multi-processor 2x2.5 simulations on different servers.

The bug was:

  !-----------------------
  ! Non-polar latitudes
  !-----------------------
  DO J = 2, J2-1    
     ...          
     DO I = 1, J1

which needs to be replaced by:

  !-----------------------
  ! Non-polar latitudes
  !-----------------------
  DO J = 2, J2-1    
     ...          
     DO I = 1, I1

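(Assuming the standard 360 x 181 1x1 input grid, J1 = 181 is smaller than I1 = 360, so the mistyped bound silently truncated the longitude loop rather than overrunning an array, which would explain why the error could go unnoticed for so long.)
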
This bug has now been fixed in GEOS-Chem v7-04-13.

smvgear.f

02 Nov 2007

From Bob Yantosca (yantosca@seas.harvard.edu)

Some of you have reported a weird error in SMVGEAR that causes GEOS-Chem simulations to die unexpectedly. The main symptom of this error is that concentrations of some species (e.g. CO) appear to go to zero, while other species (e.g. Ox) seem to reach unphysically high values, all within a single chemistry timestep. Then the simulation dies shortly thereafter.

May Fu and Philippe Le Sager have isolated the cause of the problem. They found that in some instances it is possible (e.g. due to locally low OH) to get into a regime where the first derivative of a species goes very negative during SMVGEAR's internal iteration loop. This then causes the new species concentration to be negative. This can sometimes happen even if the local & global error tolerance checks have passed. Then upon exiting the internal iteration loop, SMVGEAR would automatically reset any negative species concentrations to zero (actually a small positive number like 1e-99). A species with zero concentration can adversely affect other species within the SMVGEAR solver process. Furthermore, sometimes these zero concentrations were propagating out of SMVGEAR and into the STT tracer array, which caused problems in other areas of the code.

May & Philippe implemented a fix into the file "smvgear.f" which does the following: if a negative species concentration value is found during an internal iteration, then we don't set it to zero. We instead reduce the internal iteration timestep and do another iteration (i.e. re-evaluate the Jacobian matrix) to solve for the new species concentration. This process is repeated until SMVGEAR converges onto a non-negative solution. May & Philippe also added an extra error trap to stop the simulation if any negative species concentrations still persist upon exiting the subroutine. So the entire process should now be more robust.

You may download the updated "smvgear.f" file from our anonymous FTP site:

  ftp ftp.as.harvard.edu
  cd pub/geos-chem/patches/v7-04-12
  get README
  get smvgear.f

Then copy the "smvgear.f" file to your own source code directory and recompile. Please see the README file for more information on how to locate the places in "smvgear.f" that were modified.

This is not really a "bug" but more of a "design flaw" in the original SMVGEAR package.
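
In outline, the new logic works something like this (a schematic sketch only: DO_INTERNAL_ITERATION and the halving factor are illustrative stand-ins, while CNEW and DELT are SMVGEAR's own names for the species concentrations and internal timestep, and ERROR_STOP is GEOS-Chem's standard error handler):

      ! Schematic of the retry logic; not the actual smvgear.f code
      DO
         ! Take one internal step with the current internal timestep
         CALL DO_INTERNAL_ITERATION( CNEW, DELT )

         ! Accept the step once all concentrations are non-negative
         IF ( ALL( CNEW >= 0d0 ) ) EXIT

         ! Otherwise shrink the internal timestep and iterate again,
         ! re-evaluating the Jacobian, instead of resetting the
         ! negative values to zero
         DELT = 0.5d0 * DELT
      ENDDO

      ! Extra error trap: halt if negatives somehow persist on exit
      IF ( ANY( CNEW < 0d0 ) ) THEN
         CALL ERROR_STOP( 'Negative concentrations!', 'smvgear.f' )
      ENDIF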

This bug has now been fixed in GEOS-Chem v7-04-13.