This page describes how GEOS-Chem "Classic" scales as extra cores are added to a simulation.
Scalability by GEOS-Chem version
Please see this presentation by Bob Yantosca for a detailed analysis of GEOS-Chem 12.0.0's scalability.
The above plot displays the "wall clock" time from several 1-month GEOS-Chem "Classic" simulations.
Machine: All simulations were performed with v11-02c and used between 6 and 30 cores. They were run on the "shared" partition of the Harvard Odyssey cluster (odyssey.rc.fas.harvard.edu). Each node in the "shared" partition has 2 Intel Broadwell CPUs x 16 cores/CPU = 32 cores/node and 128 GB RAM. The network between nodes is FDR InfiniBand. Each simulation requested an entire Odyssey node in order to prevent "backfilling" jobs from affecting the timing results.
Discussion: Adding more cores to both the 2° x 2.5° and the 4° x 5° simulations always results in shorter run times. The run-time curves tend towards an asymptote, which is the expected behavior of shared-memory (OpenMP) parallelization: each extra core adds overhead because all cores contend for the same memory, so the gain from each added core shrinks. Nevertheless, we do not see a "turnover" in the plot (i.e. a point beyond which adding more cores results in slower run times). The 4° x 5° simulation with 30 cores is about 2x as fast as with 6 cores. A similar speedup is observed for the 2° x 2.5° simulations.
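One way to read the asymptote described above is through Amdahl's law, which is not used in the original analysis and is brought in here only as an illustration. The sketch below assumes run time follows T(n) = T1 * ((1 - p) + p / n), where p is the parallelizable fraction, and backs out the p implied by the quoted "about 2x" speedup between 6 and 30 cores. The function names are hypothetical and the result is a rough estimate, not a measurement.

```python
def amdahl_parallel_fraction(n1, n2, ratio):
    """Parallel fraction p implied by T(n1) / T(n2) == ratio under Amdahl's law.

    Amdahl's law models run time as T(n) = T1 * ((1 - p) + p / n).
    Setting T(n1) / T(n2) = ratio and solving for p gives the expression below.
    """
    return (ratio - 1.0) / ((ratio - 1.0) + 1.0 / n1 - ratio / n2)


def amdahl_relative_time(p, n):
    """Run time on n cores relative to the single-core run time."""
    return (1.0 - p) + p / n


# Figure quoted in the text: 30 cores is about 2x as fast as 6 cores.
p = amdahl_parallel_fraction(6, 30, 2.0)
print(f"implied parallel fraction: {p:.3f}")

# Relative run time flattens out towards the serial floor (1 - p).
for n in (6, 12, 24, 30, 64, 1_000_000):
    print(f"{n:>8} cores: relative time {amdahl_relative_time(p, n):.3f}")
```

Under this simple model the implied parallel fraction is about 0.91, so the run time can never fall below roughly 9% of the single-core time however many cores are added. Real simulations also pay memory-contention costs that Amdahl's law ignores, so this should be read as an optimistic bound.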
The above plot shows the amount of time spent in each GEOS-Chem operation. This information was taken from the same simulations shown in the scalability plot above.
Discussion: Here is a summary of what we found. We will continue to investigate how to speed up GEOS-Chem in future versions.
HEMCO and Diagnostics do not show much speedup when increasing the number of cores from 6 to 30. This is expected: HEMCO mostly involves disk input and the Diagnostics mostly involve disk output, and disk I/O operations are not parallelized in GEOS-Chem "Classic".
Wet deposition does not scale well at all. We believe that this may be attributed to non-optimal loop ordering. In wetdep, we loop over each surface box, then over vertical boxes and species (IJLN). But the optimal ordering is to have the N loop on the outside and the I loop innermost (NLJI), because Fortran stores arrays with the first index varying fastest, so this steps through the array in the order it is laid out in memory. (Contiguous access makes better use of the CPU cache and reduces trips to main memory.) We will investigate this as time allows.
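The loop-ordering argument can be checked with a small sketch. It assumes an array dimensioned (I, J, L, N) and stored in Fortran (column-major) order, and it uses toy sizes rather than GEOS-Chem's real grid. The helper names are hypothetical. The code computes the memory offset of every access for a given loop nest and reports how far each access jumps from the previous one.

```python
from itertools import product


def fortran_offset(idx, shape):
    """Linear element offset of idx in a column-major (Fortran-order) array."""
    offset, stride = 0, 1
    for i, n in zip(idx, shape):
        offset += i * stride
        stride *= n
    return offset


def access_strides(shape, loop_order):
    """Offset jumps between consecutive accesses for a given loop nest.

    loop_order lists array dimensions from the outermost loop to the innermost,
    e.g. (3, 2, 1, 0) means N outermost and I innermost for an (I, J, L, N) array.
    """
    ranges = [range(shape[d]) for d in loop_order]
    offsets = []
    for combo in product(*ranges):  # the last range varies fastest (innermost)
        idx = [0] * len(shape)
        for d, v in zip(loop_order, combo):
            idx[d] = v
        offsets.append(fortran_offset(idx, shape))
    return [b - a for a, b in zip(offsets, offsets[1:])]


shape = (4, 3, 2, 2)  # toy (I, J, L, N) sizes, an assumption for illustration
nlji = access_strides(shape, (3, 2, 1, 0))  # N outermost, I innermost
ijln = access_strides(shape, (0, 1, 2, 3))  # I outermost, N innermost

print("NLJI unit-stride steps:", sum(s == 1 for s in nlji), "of", len(nlji))
print("IJLN unit-stride steps:", sum(s == 1 for s in ijln), "of", len(ijln))
print("IJLN largest jump (elements):", max(abs(s) for s in ijln))
```

With N outermost and I innermost every access lands on the next element in memory, whereas with N innermost each step jumps by at least a full I x J slab. On real grid sizes those jumps are far larger than a cache line, which is the effect suspected in wetdep.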
The GEOS-Chem operation that scales the best is "Gas-Phase Chemistry", which includes both the FlexChem/KPP solver and the FAST-JX photolysis module. When going from 6 cores to 30 cores, a speedup of about 3.5X is obtained.
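To put the quoted speedups in context, the sketch below converts them to a relative parallel efficiency, i.e. the achieved speedup divided by the ideal speedup for the same increase in cores. The function name is hypothetical, and the two speedups are the approximate values quoted on this page for the 4° x 5° simulations.

```python
def relative_efficiency(speedup, cores_before, cores_after):
    """Fraction of the ideal speedup achieved when scaling up the core count."""
    ideal = cores_after / cores_before
    return speedup / ideal


# Approximate speedups quoted in the text for going from 6 to 30 cores.
for label, speedup in (("Gas-phase chemistry", 3.5), ("Whole simulation", 2.0)):
    eff = relative_efficiency(speedup, 6, 30)
    print(f"{label}: {speedup:.1f}x of an ideal 5.0x -> {eff:.0%} efficiency")
```

By this measure gas-phase chemistry retains about 70% of the ideal speedup, against about 40% for the simulation as a whole.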
Transport and Convection scale less well than Gas-Phase Chemistry. Transport probably incurs some computational overhead from flipping the arrays in the vertical. Convection might also be affected by non-optimal loop ordering.
All other operations take too little time for any scaling trend to be distinguished from noise.
The 2° x 2.5° plot looks similar to the 4° x 5° plot; for this reason we have not shown it here.
Mat Evans and his group at the University of York have done an analysis of how GEOS-Chem v9-02 performs when compared to the previous version, GEOS-Chem v9-01-03. Please follow this post on our GEOS-Chem v9-02 wiki page to view the results.
--Bob Y. 10:57, 19 November 2013 (EST)
GEOS-Chem v9-01-03, U. Liege user group
Manu Mahieu wrote:
- GEOS-Chem v9-01-03 is now installed and running at ULg. I have performed a few benchmark runs on our server involving from 4 up to 32 CPUs. This PDF document provides information about our server, the OS and compiler used as well as the running times for the various configurations tested up to now. By far, the best performance was obtained when submitting the GC simulation to all available CPUs (i.e. 32).
--Bob Y. 11:44, 19 June 2013 (EDT)
GEOS-Chem v9-01-03, MIT user group
Colette Heald wrote:
- I have been benchmarking GEOS-Chem on my new system here at MIT and I thought you might be interested in seeing the results for the scaling. This is with a dual hex-core Xeon 3.07 GHz chip & 48 GB RAM from Thinkmate.
Jack Yatteau replied:
- Note that there is no difference between 12 and 24. It’s not just scaling. With 12 real cores, jobs run faster than on old non-hyperthreaded cores at the same clock speed, but once you start relying on hyperthreading (>12) you don’t gain speed, even if the job scales. But you could run two 12-core jobs at about the speed of the older processors, which is what we do when the cluster is busy.
Colette Heald wrote:
- Yup, hyperthreading doesn't appear to buy me anything. But I did test submitting 2 12-core jobs to the same machine and the run time rose from 36 min to 58 min. I suppose not quite a doubling, but didn't seem like a worthwhile experiment on my system.
--Bob Y. 12:00, 17 April 2012 (EDT)
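The run times quoted in this thread can be turned into a throughput comparison. The sketch below assumes that 36 minutes is the run time of one 12-core job alone on the machine and that 58 minutes is the run time of each of two 12-core jobs sharing it. That reading is an interpretation of the e-mail, and the function name is hypothetical.

```python
def throughput_gain(solo_minutes, concurrent_minutes, n_jobs=2):
    """Jobs per hour with n_jobs sharing the node, relative to one job at a time."""
    solo_rate = 1.0 / solo_minutes
    shared_rate = n_jobs / concurrent_minutes
    return shared_rate / solo_rate


gain = throughput_gain(36, 58)
print(f"two concurrent jobs: {gain:.2f}x the throughput of back-to-back runs")
```

On these numbers, running two jobs side by side finishes both in 58 minutes instead of 72, a throughput gain of roughly 24%, at the cost of each individual job taking longer.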
NOTE: The hardware mentioned in this thread is now very old. The most recent GEOS-Chem version (v11-02) shows excellent scalability, as discussed above.
Colette Heald wrote:
- I was wondering if you could give me your thoughts on GEOS-Chem scalability? I'm about to purchase some new servers, and the default would be dual hex-core servers, so 12 processors total. I see a huge difference in my 4p vs. 8p machines, but I'm wondering if there's much advantage going beyond that to 12p. My sense from past discussions is that GC does not scale very well.
Jack Yatteau replied:
- First, if you’re getting Intel processors with hyperthreading, your 2 socket hex core system will look as if it has 24 processors under Linux. We’re currently using 2 socket quad core systems that appear to have 16 processors under Linux. Codes run almost twice as fast up to 8 threads, and run about the same speed at 16 threads, meaning, an 8p job will run faster on the newer system than on an older 2 socket quad core system without hyperthreading at about the same clock speed, but two 8p jobs running simultaneously will each run at about the same speed as they would on the older systems. So the system appears to slow down as you add more than 8 threads. On a hex core system, the threshold would be at 12 threads. So you’ll have a difficult time making sense of timing tests if you use one of the newer systems unless you disable hyperthreading, in which case you might as well limit the number of threads to 12 and leave hyperthreading enabled.
- I measured scaling 5 years ago using a 16 processor Origin 2000 and a 12 processor Altix and you can see the results and my analysis of them in this Powerpoint presentation.
- Since then I’ve run tests at 4x5 resolution on dual core Opteron processors up to 16 cores and on modern Xeon systems up to 8p. GEOS-Chem still runs about 1.5-1.6 times faster on 8 threads than on 4 threads. In our environment, it matters more how many runs get completed. Even if we could get a job to run 25% faster on 16 threads than on 8 threads, we’d be better off running 2 simultaneous jobs each using 8 threads. Also, be aware that at 2x2.5 resolution GEOS-Chem doesn’t scale as well, since more time is spent doing transport, and the transport code doesn’t scale as well as the chemistry code.
- Finally, we’ve done very well using dual socket systems since for the past several years computers have been designed with high bandwidth to memory for pairs of processors. Going to more than 2 sockets (e.g. 4 quad or hex core systems), the bandwidth between 2 or more pairs of processors drops, and I’d expect that to slow down SMP jobs whose threads don’t all run on the same pair. So my recommendation would be to stick to 2 socket systems and use the savings to add more of them. Plus, maybe I’ve convinced you that you’ll be getting a machine with 3 times the capability of an older dual quad-core system.
--Bob Y. 13:09, 12 August 2010 (EDT)
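The "two 8-thread jobs instead of one 16-thread job" argument in this thread can be sketched as a throughput ratio. The code below reads "25% faster" as a speedup factor of 1.25 and adds a hypothetical `sharing_slowdown` parameter for any slowdown the two jobs cause each other. Both are assumptions made for illustration and are not figures from the thread.

```python
def two_narrow_vs_one_wide(wide_speedup, sharing_slowdown=1.0):
    """Throughput of two concurrent half-width jobs relative to one full-width job.

    wide_speedup:     how much faster one job runs on all threads than on half.
    sharing_slowdown: factor by which each half-width job slows down when two
                      run side by side (1.0 means no slowdown, an assumption).
    """
    wide_rate = wide_speedup              # jobs per unit time; half-width job = 1
    narrow_rate = 2.0 / sharing_slowdown  # two jobs progressing at once
    return narrow_rate / wide_rate


print(two_narrow_vs_one_wide(1.25))                        # no mutual slowdown
print(two_narrow_vs_one_wide(1.25, sharing_slowdown=1.1))  # 10% mutual slowdown
```

The break-even point is a speedup of 2 from doubling the thread count. Below that, running two narrower jobs completes more simulations per day, which matches the recommendation in the thread for environments where total throughput matters more than the turnaround of a single run.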