Difference between revisions of "Scalability"
Latest revision as of 20:44, 21 December 2015
To calculate how well a GEOS-Chem simulation scaled, we compute the ratio of CPU time / wall time. The CPU time and wall time metrics can be obtained by checking the job information from your scheduler. We describe the steps to take to obtain this information and calculate this ratio in the sections below.
Computing scalability from SLURM scheduler output
After your job has finished, type:
sacct -l -j JOBID
The -l option returns the “long” output information for your job. You may also specify which information you would like to obtain for your job. For example:
sacct -j JOBID --format=JobID,JobName,User,Partition,NNodes,NCPUS,MaxRSS,TotalCPU,Elapsed
From the returned output, note the values for TotalCPU and Elapsed. For example:
JobID           JobName      User  Partition   NNodes      NCPUS     MaxRSS   TotalCPU    Elapsed
------------ ---------- --------- ---------- -------- ---------- ---------- ---------- ----------
53901011     HEMCO+Hen+ ryantosca      jacob        1          8        16? 1-03:35:33   04:15:53
53901011.ba+      batch                             1          8   6329196K 1-03:35:33   04:15:53
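Note that there are 2 entries. The first line represents the job as submitted to your chosen partition (here, jacob), and the second line (53901011.ba+, i.e. the batch step of the job) represents the step in which the job script actually ran. Both lines report the same TotalCPU and Elapsed values in this example.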
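If you want to read these values programmatically, the fixed-width table can be split into columns by using the dashed ruler line to locate the column boundaries. The following Python sketch illustrates the idea; the function name parse_sacct_table is hypothetical, and the sample text is simply the example table above retyped by hand, so the exact column alignment is an assumption.

```python
import re


def parse_sacct_table(text):
    """Split fixed-width sacct output into one dict per row.

    Hypothetical helper: column boundaries are taken from the runs of
    dashes in the second line, so empty cells (e.g. User on the batch
    step) come back as empty strings instead of shifting the columns.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, ruler, rows = lines[0], lines[1], lines[2:]
    spans = [(m.start(), m.end()) for m in re.finditer(r"-+", ruler)]
    names = [header[a:b].strip() for a, b in spans]
    return [
        {name: row[a:b].strip() for name, (a, b) in zip(names, spans)}
        for row in rows
    ]


# Example table from this page, retyped by hand (alignment assumed).
SAMPLE = "\n".join([
    "JobID           JobName      User  Partition   NNodes      NCPUS     MaxRSS   TotalCPU    Elapsed",
    "------------ ---------- --------- ---------- -------- ---------- ---------- ---------- ----------",
    "53901011     HEMCO+Hen+ ryantosca      jacob        1          8        16? 1-03:35:33   04:15:53",
    "53901011.ba+      batch                             1          8   6329196K 1-03:35:33   04:15:53",
])

rows = parse_sacct_table(SAMPLE)
for row in rows:
    print(row["JobID"], row["NCPUS"], row["TotalCPU"], row["Elapsed"])
```

Splitting on whitespace alone would not work here, because the batch-step line has empty User and Partition cells.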
A good measure of how well your job scales across multiple CPUs is the ratio of CPU time to wall-clock time. You can compute this by taking the ratio of the SLURM quantities
TotalCPU [s] / Elapsed [s]
as reported by the sacct command.
From the above example:
CPU time / wall time = 1d 03h 35m 33s / 4h 15m 53s
                     = 99333 s / 15353 s = 6.4699
A theoretically ideal job running on 8 CPUs would have a CPU time / wall time ratio of exactly 8. In practice this is never attained, because of file I/O and system overhead. Dividing the CPU time / wall time ratio computed above by the number of CPUs that were used (8 in this example) gives an estimate of how efficient your job was compared to ideal performance:
% of ideal performance = ( 6.4699 / 8 ) * 100 = 80.87%
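The arithmetic above can be sketched in Python as follows. This is only an illustration, not the Support Team's script: the function names are hypothetical, and the parser assumes that durations are given as [days-]hours:minutes:seconds or minutes:seconds, possibly with fractional seconds.

```python
def slurm_time_to_seconds(text):
    """Convert a SLURM duration such as '1-03:35:33' or '04:15:53' to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [float(f) for f in text.split(":")]
    # Pad on the left so that MM:SS and HH:MM:SS are handled alike.
    while len(fields) < 3:
        fields.insert(0, 0.0)
    hours, minutes, seconds = fields
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def scalability(total_cpu, elapsed, ncpus):
    """Return (CPU time / wall time, percent of ideal performance)."""
    cpu_s = slurm_time_to_seconds(total_cpu)
    wall_s = slurm_time_to_seconds(elapsed)
    ratio = cpu_s / wall_s
    return ratio, 100.0 * ratio / ncpus


# Values taken from the sacct example above.
ratio, percent = scalability("1-03:35:33", "04:15:53", 8)
print(f"CPU time / wall time = {ratio:.4f} ({percent:.2f}% of ideal)")
```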
Scripts for parsing SLURM scheduler output
To simplify reporting the scalability for GEOS-Chem jobs from SLURM scheduler output, the GEOS-Chem Support Team has written two Perl scripts: jobstats (http://acmg.seas.harvard.edu/geos/wiki_docs/machines/jobstats) and jobinfo (http://acmg.seas.harvard.edu/geos/wiki_docs/machines/jobinfo), on which jobstats relies. All you need to supply is the SLURM job ID, as shown below:
> jobstats 53901011

SLURM JobID #        : 53901011
Job Name             : HEMCO+Henry.run
Submit time          : 2015-12-17 11:16:16
Start time           : 2015-12-17 11:16:16
End time             : 2015-12-17 15:32:09
Partition            : jacob
Node                 : regal12
CPUs                 : 8
Memory               : 6.3292 GB
CPU Time             : 1-03:35:33 ( 99333 s)
Wall Time            : 04:15:53 ( 15353 s)
CPU Time / Wall Time : 6.4699 ( 80.87% ideal)
--Bob Yantosca (talk) 17:33, 21 December 2015 (UTC)
Computing scalability from Grid Engine scheduler output
After your job has finished, type:
qacct -j JOBID
From the returned output, note the values for cpu and ru_wallclock. For example:
==============================================================
qname        bench
hostname     titan-10.as.harvard.edu
group        mpayer
owner        mpayer
project      NONE
department   defaultdepartment
jobname      v10-01-public-release-Run1.run
jobnumber    81969
taskid       undefined
account      sge
priority     0
qsub_time    Thu Jun 18 17:14:15 2015
start_time   Thu Jun 18 17:14:55 2015
end_time     Fri Jun 19 01:01:48 2015
granted_pe   bench
slots        8
failed       0
exit_status  0
ru_wallclock 28013
ru_utime     189568.938
ru_stime     1718.925
ru_maxrss    5941376
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    5437936
ru_majflt    23
ru_nswap     0
ru_inblock   36810536
ru_oublock   834224
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     390660
ru_nivcsw    19093052
cpu          191287.863
mem          1266832.593
io           30.376
iow          0.000
maxvmem      6.817G
arid         undefined
Calculate the CPU time / wall time ratio using:
cpu / ru_wallclock
From the above example:
CPU time / wall time = 191287.863 s / 28013 s = 6.8285
A theoretically ideal job running on 8 CPUs would have a CPU time / wall time ratio of exactly 8. In practice this is never attained, because of file I/O and system overhead. Dividing the CPU time / wall time ratio computed above by the number of CPUs (aka "slots") that were used (8 in this example) gives an estimate of how efficient your job was compared to ideal performance:
% of ideal performance = ( 6.8285 / 8 ) * 100 = 85.36%
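The same calculation can be sketched in Python by reading the cpu, ru_wallclock and slots fields from the qacct listing. This is an illustrative sketch only: the function names are hypothetical, the sample text is an abridged copy of the listing above, and stripping a trailing "s" from the numbers is a defensive assumption in case a Grid Engine version prints a unit suffix.

```python
def parse_qacct(text):
    """Collect the 'key value' lines of a qacct -j listing into a dict."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines and the '=====' separator line.
        if len(parts) == 2 and not parts[0].startswith("="):
            record[parts[0]] = parts[1].strip()
    return record


def to_seconds(value):
    # Assumption: tolerate an optional trailing 's' unit suffix.
    return float(value.rstrip("s"))


# Abridged copy of the qacct example above.
SAMPLE = """\
==============================================================
qname        bench
jobnumber    81969
slots        8
ru_wallclock 28013
ru_utime     189568.938
ru_stime     1718.925
cpu          191287.863
"""

record = parse_qacct(SAMPLE)
cpu_s = to_seconds(record["cpu"])
wall_s = to_seconds(record["ru_wallclock"])
slots = int(record["slots"])

ratio = cpu_s / wall_s
percent = 100.0 * ratio / slots
print(f"CPU time / wall time = {ratio:.4f} ({percent:.2f}% of ideal)")
```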
--Bob Yantosca (talk) 17:34, 21 December 2015 (UTC)
Scripts for parsing Grid Engine output
To simplify reporting the scalability for GEOS-Chem jobs from Grid Engine scheduler output, the GEOS-Chem Support Team has written a Perl script, scale (http://acmg.seas.harvard.edu/geos/wiki_docs/machines/scale). All you need to supply is the Grid Engine job ID, as shown below:
> scale 81969

SGE JobID #       : 81969
Submitted at      : Thu Jun 18 17:14:15 2015
Run began at      : Thu Jun 18 17:14:55 2015
Run ended at      : Fri Jun 19 01:01:48 2015
Ran on host       : titan-10.as.harvard.edu
Ran in queue      : bench
CPU Time [s]      : 191287.863 s
CPU Time [h:m:s]  : 53:08:07
Wall Time [s]     : 28013 s
Wall Time [h:m:s] : 07:46:53
Scalability       : 6.8285
--Bob Yantosca (talk) 17:20, 21 December 2015 (UTC)