Scalability

To calculate how well a GEOS-Chem simulation scaled, we compute the ratio of CPU time / wall time. The CPU time and wall time metrics can be obtained by checking the job information from your scheduler. We describe the steps to take to obtain this information and calculate this ratio in the sections below.

Computing scalability from SLURM scheduler output

After your job has finished, type:

  sacct -l -j JOBID

The -l option returns the “long” listing of information for your job. You may also specify exactly which fields you would like sacct to report. For example:

  sacct -j JOBID --format=JobID,JobName,User,Partition,NNodes,NCPUS,MaxRSS,TotalCPU,Elapsed

From the returned output, note the values for TotalCPU and Elapsed. For example:

        JobID    JobName      User  Partition   NNodes      NCPUS     MaxRSS   TotalCPU    Elapsed 
 ------------ ---------- --------- ---------- -------- ---------- ---------- ---------- ---------- 
 53901011     HEMCO+Hen+ ryantosca      jacob        1          8        16? 1-03:35:33   04:15:53 
 53901011.ba+      batch                             1          8   6329196K 1-03:35:33   04:15:53 

Note that there are two entries. The first line represents the queue to which you submitted the job (i.e., jacob), and the second line represents the internal queue in which the job actually ran (i.e., batch).

A good measure of how well your job scales across multiple CPUs is the ratio of CPU time to wall-clock time. You can compute this by taking the ratio of the SLURM quantities

  TotalCPU [s] / Elapsed [s]

as reported by the sacct command.

From the above example:

  CPU time / wall time = 1d 03h 35m 33s / 4h 15m 53s             
                       = 99333 s        / 15353 s    = 6.4699

A theoretically ideal job running on 8 CPUs would have a CPU time / wall time ratio of exactly 8. In practice this is never attained, due to file I/O and system overhead. By dividing the CPU time / wall time ratio computed above by the number of CPUs that were used (in this example, 8), you can estimate how efficiently your job ran compared to ideal performance:

  % of ideal performance = ( 6.4699 / 8 ) * 100 = 80.87%
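
If you prefer to script this arithmetic, the D-HH:MM:SS values reported by sacct must first be converted to seconds. The snippet below is a minimal Python sketch (illustrative only, not one of the GEOS-Chem tools described in the next section) that reproduces the numbers from this example:

  # Minimal sketch (illustrative only, not a GEOS-Chem tool): convert the
  # SLURM time strings shown above to seconds and compute the ratio.

  def slurm_time_to_seconds(timestr):
      """Convert a SLURM duration ([D-]HH:MM:SS[.SSS]) to seconds."""
      days, _, hms = timestr.rpartition("-")
      h, m, s = ([0.0, 0.0] + [float(x) for x in hms.split(":")])[-3:]
      return (int(days) if days else 0) * 86400 + h * 3600 + m * 60 + s

  cpu_time  = slurm_time_to_seconds("1-03:35:33")   # TotalCPU -> 99333.0 s
  wall_time = slurm_time_to_seconds("04:15:53")     # Elapsed  -> 15353.0 s
  ncpus     = 8

  ratio = cpu_time / wall_time
  print(f"CPU time / wall time   = {ratio:.4f}")                   # 6.4699
  print(f"% of ideal performance = {100.0 * ratio / ncpus:.2f}%")  # 80.87%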

Scripts for parsing SLURM scheduler output

To simplify reporting the scalability for GEOS-Chem jobs from SLURM scheduler output, the GEOS-Chem Support Team has written two Perl scripts, jobstats (http://acmg.seas.harvard.edu/geos/wiki_docs/machines/jobstats) and jobinfo (http://acmg.seas.harvard.edu/geos/wiki_docs/machines/jobinfo), upon which jobstats relies. All you need to supply is the SLURM job ID, as shown below:

  > jobstats 53901011

  SLURM JobID #         : 53901011
  Job Name              : HEMCO+Henry.run
  Submit time           : 2015-12-17 11:16:16
  Start  time           : 2015-12-17 11:16:16
  End    time           : 2015-12-17 15:32:09
  Partition             : jacob
  Node                  : regal12
  CPUs                  : 8
  Memory                : 6.3292 GB
  CPU  Time             : 1-03:35:33  (      99333 s)
  Wall Time             : 04:15:53    (      15353 s)
  CPU  Time / Wall Time : 6.4699      ( 80.87% ideal)
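
If the Perl scripts are not available on your system, you can approximate this report by querying sacct directly from a script. The sketch below is a rough, unofficial equivalent and is not a substitute for jobstats; it assumes that sacct is on your PATH and that its --noheader and --parsable2 options are available:

  # Rough, unofficial sketch: query sacct for a finished job and report the
  # CPU time / wall time ratio, similar to what jobstats prints.
  import subprocess
  import sys

  def to_seconds(t):
      """Convert a SLURM duration ([D-]HH:MM:SS[.SSS]) to seconds."""
      days, _, hms = t.rpartition("-")
      h, m, s = ([0.0, 0.0] + [float(x) for x in hms.split(":")])[-3:]
      return (int(days) if days else 0) * 86400 + h * 3600 + m * 60 + s

  def report(jobid):
      """Print the scalability of a finished SLURM job."""
      out = subprocess.run(
          ["sacct", "-j", str(jobid), "--noheader", "--parsable2",
           "--format=NCPUS,TotalCPU,Elapsed"],
          capture_output=True, text=True, check=True).stdout
      ncpus, totalcpu, elapsed = out.splitlines()[0].split("|")
      ratio = to_seconds(totalcpu) / to_seconds(elapsed)
      print(f"CPU time / wall time : {ratio:.4f} "
            f"({100.0 * ratio / int(ncpus):.2f}% ideal)")

  if __name__ == "__main__":
      report(sys.argv[1])   # e.g. python report_scalability.py 53901011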

--Bob Yantosca (talk) 17:33, 21 December 2015 (UTC)

Computing scalability from Grid Engine scheduler output

After your job has finished, type:

  qacct -j JOBID

From the returned output, note the values for cpu and ru_wallclock. For example:

  ==============================================================
  qname        bench               
  hostname     titan-10.as.harvard.edu
  group        mpayer              
  owner        mpayer              
  project      NONE                
  department   defaultdepartment   
  jobname      v10-01-public-release-Run1.run
  jobnumber    81969               
  taskid       undefined
  account      sge                 
  priority     0                   
  qsub_time    Thu Jun 18 17:14:15 2015
  start_time   Thu Jun 18 17:14:55 2015
  end_time     Fri Jun 19 01:01:48 2015
  granted_pe   bench               
  slots        8
  failed       0    
  exit_status  0                   
  ru_wallclock 28013
  ru_utime     189568.938   
  ru_stime     1718.925     
  ru_maxrss    5941376             
  ru_ixrss     0
  ru_ismrss    0                   
  ru_idrss     0                   
  ru_isrss     0                   
  ru_minflt    5437936             
  ru_majflt    23                  
  ru_nswap     0                   
  ru_inblock   36810536            
  ru_oublock   834224              
  ru_msgsnd    0                   
  ru_msgrcv    0                   
  ru_nsignals  0                   
  ru_nvcsw     390660              
  ru_nivcsw    19093052            
  cpu          191287.863
  mem          1266832.593       
  io           30.376            
  iow          0.000             
  maxvmem      6.817G
  arid         undefined

Calculate the CPU time / wall time ratio using:

  cpu / ru_wallclock

From the above example:

  CPU time / wall time = 191287.863 s / 28013 s = 6.8285

A theoretically ideal job running on 8 CPUs would have a CPU time / wall time ratio of exactly 8. In practice this is never attained, due to file I/O and system overhead. By dividing the CPU time / wall time ratio computed above by the number of CPUs (aka "slots") that were used (in this example, 8), you can estimate how efficiently your job ran compared to ideal performance:

  % of ideal performance = ( 6.8285 / 8 ) * 100 = 85.36%
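
As with the SLURM case, this bookkeeping is easy to script. The sketch below is illustrative only (the parse_qacct helper is a hypothetical name, not part of any GEOS-Chem tool); it pulls the cpu, ru_wallclock, and slots fields out of qacct's key/value output and computes the same two numbers:

  # Illustrative sketch (hypothetical helper, not a GEOS-Chem tool): parse
  # the key/value lines printed by "qacct -j JOBID" and compute scalability.

  def parse_qacct(text):
      """Return a dict of the key/value pairs in qacct output."""
      fields = {}
      for line in text.splitlines():
          parts = line.split(None, 1)   # split on the first run of whitespace
          if len(parts) == 2 and not line.startswith("="):
              fields[parts[0]] = parts[1].strip()
      return fields

  # In a real script you would capture this text from "qacct -j JOBID"
  # (e.g. via subprocess); here we use the relevant lines from the example.
  text = """\
  slots        8
  ru_wallclock 28013
  cpu          191287.863
  """

  job = parse_qacct(text)
  ratio = float(job["cpu"]) / float(job["ru_wallclock"])
  print(f"CPU time / wall time   = {ratio:.4f}")                               # 6.8285
  print(f"% of ideal performance = {100.0 * ratio / int(job['slots']):.2f}%")  # 85.36%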

--Bob Yantosca (talk) 17:34, 21 December 2015 (UTC)

Scripts for parsing Grid Engine output

To simplify reporting the scalability for GEOS-Chem jobs from Grid Engine scheduler output, the GEOS-Chem Support Team has written a Perl script, scale (http://acmg.seas.harvard.edu/geos/wiki_docs/machines/scale). All you need to supply is the Grid Engine job ID, as shown below:

  > scale 81969
  
  SGE JobID #       : 81969 
  Submitted at      : Thu Jun 18 17:14:15 2015
  Run began at      : Thu Jun 18 17:14:55 2015
  Run ended at      : Fri Jun 19 01:01:48 2015
  Ran on host       : titan-10.as.harvard.edu
  Ran in queue      : bench
  CPU  Time [s]     : 191287.863 s
  CPU  Time [h:m:s] : 53:08:07
  Wall Time [s]     : 28013 s
  Wall Time [h:m:s] : 07:46:53
  Scalability       : 6.8285
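
If you want to double-check the h:m:s values that scale prints, the conversion is straightforward. The snippet below is purely illustrative:

  # Purely illustrative: reproduce the h:m:s values printed by scale above.
  def hms(seconds):
      """Format a duration in seconds as HH:MM:SS (fractions truncated)."""
      s = int(seconds)
      return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

  print(hms(191287.863))              # 53:08:07  (CPU time)
  print(hms(28013))                   # 07:46:53  (wall time)
  print(f"{191287.863 / 28013:.4f}")  # 6.8285    (scalability)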

--Bob Yantosca (talk) 17:20, 21 December 2015 (UTC)