George Washington University
George Mason University
Experimental Comparative Study of Job Management Systems
http://ece.gmu.edu/lucite
Outline:
• Review of experiments
• Results
• Encountered problems
• Functional comparison
• Extension to reconfigurable hardware
Our Testbed (domains: gmu.edu, science.gmu.edu)
• alicja: Solaris 8, UltraSparcIIi, 360 MHz, 512 MB RAM
• anna: Solaris 8, UltraSparcIIi, 440 MHz, 128 MB RAM
• magdalena: Solaris 8, UltraSparcIIi, 440 MHz, 512 MB RAM
• redfox: Solaris 8, UltraSparcIIi, 330 MHz, 128 MB RAM
• palpc2: Linux, PII, 400 MHz, 128 MB RAM
• pallj / m0: Linux RH7.0, PIII, 450 MHz, 512 MB RAM
• m1-m4: 4 x Linux RH6.2, 2xPIII, 500 MHz, 128 MB RAM
• m5-m7: 3 x Linux RH6.2, 2xPIII, 450 MHz, 128 MB RAM
SHORT JOBS (execution time between 1 s and 2 minutes)
* benchmarks used to determine the relative CPU factors of execution hosts
CPU factors for medium benchmark list, based on the execution time for bt.W and Sobel1024i

Machine names   Host Type   Host Model        CPU Factor
m1-m4           Linux       PIII_2_500_128    1.65
m5-m7           Linux       PIII_2_450_128    1.55
pallj           Linux       PIII_1_450_512    1.60
palpc2          Linux       P2_1_400_128      1.70
alicja          Solaris64   USIIi_1_360_512   1.0
anna            Solaris64   USIIi_1_440_128   1.2
magdalena       Solaris64   USIIi_1_440_512   1.2
redfox          Solaris64   USIIi_1_330_128   1.2
MEDIUM JOBS (execution time between 2 and 10 minutes)

No.  Group   Name      Class  Script name           CPU time [min:s]  Memory usage [MB]  Memory requirements [MB]
1    NPB     EP        A      ep.A.sh               7:45              1.3                3
2    NPB     LU        W      lu.W.sh               8:09              6.8                9
3*   NPB     SP        W      sp.W.sh               6:07              15.1               19
4    Crypto  Mars      M      crypto.mars.M.sh      9:21              0.4                1
5    Crypto  RC6       M      crypto.rc6.M.sh       6:21              0.4                1
6    Crypto  Rijndael  M      crypto.rijndael.M.sh  4:11              0.4                1
7    Crypto  Serpent   M      crypto.serpent.M.sh   8:54              0.4                1
8*   Crypto  Twofish   M      crypto.twofish.M.sh   8:05              0.4                1
Average CPU time: 7:22

* benchmarks used to determine the relative CPU factors of execution hosts
LONG JOBS (execution time between 10 and 30 minutes)
* benchmarks used to determine the relative CPU factors of execution hosts
INPUT/OUTPUT JOBS (execution time between 1 second and 10 minutes)
Typical experiment
• N jobs are submitted with pseudorandom delays between consecutive job submissions
• Poisson distribution of the job submission rate
• N = 150 for medium and small jobs, N = 75 for long jobs
• Timeline starts at time = 0; jobs finish execution at times i1, ..., iN
• Total time of an experiment: 2 hours
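The submission pattern described for a typical experiment can be sketched as follows. This is a minimal illustration, not the authors' test harness: it assumes that a Poisson-distributed submission rate is realized by drawing exponentially distributed delays between consecutive submissions. The function name `submission_times`, the fixed seed, and the 4 jobs/min rate are hypothetical choices made for the example.

```python
import random


def submission_times(n_jobs, rate_per_min, seed=0):
    """Return n_jobs submission times in seconds, measured from time = 0.

    Assumption: a Poisson submission process is modeled by drawing
    exponentially distributed delays between consecutive submissions.
    """
    rng = random.Random(seed)          # fixed seed gives a repeatable pseudorandom sequence
    rate_per_s = rate_per_min / 60.0   # average submissions per second
    t = 0.0
    times = []
    for _ in range(n_jobs):
        t += rng.expovariate(rate_per_s)  # pseudorandom delay before the next submission
        times.append(t)
    return times


# Example: 150 medium jobs at an average rate of 4 jobs/min (hypothetical setting).
times = submission_times(n_jobs=150, rate_per_min=4)
print(f"first submission at {times[0]:.1f} s, last at {times[-1]:.1f} s")
```

Reusing the same seed for every JMS would give each system an identical submission sequence, which is one way to keep such a comparison fair.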
Definition of timing parameters
Time instants:
• ts: submission time
• tb: begin of execution time
• te: end of execution time
• td: delivery time
Intervals:
• TR: response time
• TEXE: execution time
• TD: delivery time
• TTA: turn-around time
Typical scenario
• ts: submission time
• tb: begin of execution time
• te: end of execution time
• TR: response time
• TEXE: execution time
• TD = 0 (delivery time = 0)
• TTA: turn-around time
All times are determined using the gettimeofday() function.
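The relations between the time instants and the intervals were drawn in the slide figure and are not spelled out in the text. The sketch below therefore assumes the usual reading: TR = tb - ts, TEXE = te - tb, TD = td - te, and TTA = td - ts. The class name `JobTimes` and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobTimes:
    ts: float  # submission time [s]
    tb: float  # begin of execution time [s]
    te: float  # end of execution time [s]
    td: float  # delivery time [s]

    # Assumed interval definitions (not stated explicitly in the slides).
    @property
    def response(self):      # TR
        return self.tb - self.ts

    @property
    def execution(self):     # TEXE
        return self.te - self.tb

    @property
    def delivery(self):      # TD
        return self.td - self.te

    @property
    def turn_around(self):   # TTA
        return self.td - self.ts


# Typical scenario: results are delivered as soon as execution ends, so TD = 0.
job = JobTimes(ts=0.0, tb=13.0, te=455.0, td=455.0)
print(job.response, job.execution, job.delivery, job.turn_around)
```

Under these assumptions, TTA = TR + TEXE whenever TD = 0, which is the case in the typical scenario.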
Total Throughput
Total Throughput = N / TN
TN: time necessary to execute N jobs
(Timeline: job submissions 1, ..., N start at time = 0; jobs finish execution at times i1, ..., iN.)
Partial Throughput
Throughput(k) = k / Tk
Tk: time necessary to execute k jobs
(Timeline: job submissions 1, ..., N start at time = 0; jobs finish execution at times i1, ..., ik, ..., iN.)
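Both throughput definitions can be computed from the list of job finishing times. The sketch below assumes that Tk is measured from time = 0 to the moment the k-th job finishes, and it reports results in jobs/hour, the unit used in the result charts. The function names and the sample finishing times are hypothetical.

```python
def partial_throughput(finish_times, k):
    """Throughput(k) = k / Tk in jobs/hour.

    Assumption: Tk is the time from time = 0 until the k-th job
    (in order of completion) finishes; finish_times are in seconds.
    """
    ordered = sorted(finish_times)
    t_k = ordered[k - 1]
    return k * 3600.0 / t_k


def total_throughput(finish_times):
    """Total Throughput = N / TN, i.e. the partial throughput for k = N."""
    return partial_throughput(finish_times, len(finish_times))


# Hypothetical finishing times (seconds from time = 0) of four jobs.
finish = [300.0, 900.0, 1200.0, 1800.0]
print(partial_throughput(finish, 1))  # 12.0 jobs/hour
print(total_throughput(finish))       # 8.0 jobs/hour
```

The example also shows that a partial throughput can differ from the total throughput for the same run.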
Utilization
• For each machine 1, ..., M: CPU utilization (0% to 100%) is recorded over time while jobs run on that machine, and its average is taken
• Overall utilization: computed from the average CPU utilization of machines 1, ..., M
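The formula for the overall utilization did not survive in the slide text. The sketch below assumes the simplest possibility, an unweighted arithmetic mean of the per-machine average CPU utilizations; a weighting by CPU count or CPU factor would be an equally plausible reading. The function names, host names, and sample values are hypothetical.

```python
def average_utilization(samples):
    """Average CPU utilization [%] of one machine from equally spaced samples."""
    return sum(samples) / len(samples)


def overall_utilization(per_machine_samples):
    """Assumed definition: unweighted mean of the per-machine averages."""
    averages = [average_utilization(s) for s in per_machine_samples.values()]
    return sum(averages) / len(averages)


# Hypothetical utilization samples [%] for two machines.
samples = {
    "machine1": [100, 100, 50, 50],  # average 75%
    "machine2": [50, 0, 50, 0],      # average 25%
}
print(overall_utilization(samples))  # 50.0
```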
Medium jobs – Total Throughput [bar chart: throughput in jobs/hour for LSF, PBS, Codine, and Condor at average job submission rates of 2, 4, and 12 jobs/min]
Medium jobs – Turn-around Time [bar chart: turn-around time in seconds for LSF, PBS, Codine, and Condor at average job submission rates of 2, 4, and 12 jobs/min]
Medium jobs – Response Time [bar chart: response time in seconds for LSF, PBS, Codine, and Condor at average job submission rates of 2, 4, and 12 jobs/min]
Medium jobs – Utilization [bar chart: utilization in % for LSF, PBS, Codine, and Condor at average job submission rates of 2, 4, and 12 jobs/min]
Long jobs – Total Throughput [bar chart: throughput in jobs/hour for LSF, PBS, Codine, and Condor at average job submission rates of 0.5 and 2 jobs/min]
Long jobs – Turn-around Time [bar chart: turn-around time in seconds for LSF, PBS, Codine, and Condor at average job submission rates of 0.5 and 2 jobs/min]
Long jobs – Response Time [bar chart: response time in seconds for LSF, PBS, Codine, and Condor at average job submission rates of 0.5 and 2 jobs/min]
Long jobs – Utilization [bar chart: utilization in % for LSF, PBS, Codine, and Condor at average job submission rates of 0.5 and 2 jobs/min]
Short jobs – Total Throughput [bar chart: throughput in jobs/hour for LSF, PBS, Codine, and Condor at average job submission rates of 4, 6, 12, 30, and 60 jobs/min]
Short jobs – Turn-around Time [bar chart: turn-around time in seconds for LSF, PBS, Codine, and Condor at average job submission rates of 4, 6, 12, 30, and 60 jobs/min]
Short jobs – Response Time [bar chart: response time in seconds for LSF, PBS, Codine, and Condor at average job submission rates of 4, 6, 12, 30, and 60 jobs/min]
Short jobs – Utilization [bar chart: utilization in % for LSF, PBS, Codine, and Condor at average job submission rates of 4, 6, 12, 30, and 60 jobs/min]
Medium jobs – Total Throughput [bar chart: throughput in jobs/hour for LSF, PBS, Codine, and Condor with a maximum of 1 job/CPU and 2 jobs/CPU, both at 4 jobs/min]
Medium jobs – Turn-around Time [bar chart: turn-around time in seconds for LSF, PBS, Codine, and Condor with a maximum of 1 job/CPU and 2 jobs/CPU, both at 4 jobs/min]
Medium jobs – Response Time [bar chart: response time in seconds for LSF, PBS, Codine, and Condor with a maximum of 1 job/CPU and 2 jobs/CPU, both at 4 jobs/min]
Medium jobs – Utilization [bar chart: utilization in % for LSF, PBS, Codine, and Condor with a maximum of 1 job/CPU and 2 jobs/CPU, both at 4 jobs/min]
1. Jobs with high requirements on the stack size
Indication: Certain jobs do not finish execution when run under LSF. The same jobs run correctly outside of any JMS and under other job management systems.
Source: Variable STACKLIMIT in $LSB_CONFDIR/<cluster_name>/configdir/lsb.queues
Remaining problem: Documentation of default limits.
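One way to confirm that a stack limit is the cause is to print the limit that a job actually sees, once outside any JMS and once when submitted through each JMS. The diagnostic below is a sketch that is not part of the original study. It uses Python's standard `resource` module (Unix only), and the helper name `stack_limit_kb` is hypothetical.

```python
import resource


def stack_limit_kb():
    """Return the (soft, hard) stack size limits of this process in KB.

    A value of None means the limit is unlimited.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)

    def to_kb(value):
        # RLIM_INFINITY marks an unlimited resource.
        return None if value == resource.RLIM_INFINITY else value // 1024

    return to_kb(soft), to_kb(hard)


soft_kb, hard_kb = stack_limit_kb()
print(f"stack limit: soft={soft_kb} KB, hard={hard_kb} KB")
```

If the value printed inside a job is smaller than the value printed from an interactive shell on the same host, the queue configuration is a likely place to look.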
2. Frequently submitted small jobs
Indication: Unexpectedly high response time and turn-around time for a medium job submission rate.
Possible solution: Define the variable CHUNK_JOB_SIZE (e.g., =5) in lsb.queues and the variable LSB_CHUNK_NORUSAGE=y in lsf.conf.
3. Ordering of machines fulfilling resource requirements
Default: r1m : pg
Question: How many machines are dropped from the list based on the first ordering?
4. Random behavior from iteration to iteration
Indication: Assignment of jobs to particular machines is different in each iteration of the experiment.
Question: Why is r1m different each time?
5. Boundary effects in the calculation of the throughput
Indication: The steady-state partial throughput differs from the total throughput.
Question: How to define the steady-state throughput?
6. Throughput vs. turn-around time
Indication: No correlation between the ranking of JMSes in terms of throughput and in terms of turn-around time.
Question: How to explain the lack of this correlation?
Operating system, flexibility, user interface
Systems compared: RES, CONDOR, PBS, Codine, LSF
• Distribution: pub, com, pub/com, pub, gov
• Source code
• OS support: Solaris, Linux, Tru64, NT
• User interface: GUI & CLI, GUI & CLI, CLI, GUI & CLI, GUI & CLI
Scheduling and Resource Management
Systems compared: RES, CONDOR, PBS, Codine, LSF
• Batch jobs
• Interactive jobs
• Parallel jobs
• Accounting
Efficiency and Utilization
Systems compared: RES, CONDOR, PBS, Codine, LSF
• Stage-in and stage-out
• Timesharing
• Process migration
• Dynamic load balancing
• Scalability
Fault Tolerance and Security
Systems compared: RES, CONDOR, PBS, Codine, LSF
• Checkpointing
• Daemon fault recovery
• Authentication
• Authorization
Documentation and Technical Support
Systems compared: RES, CONDOR, PBS, Codine, LSF
• Documentation
• Technical support
JMS features supporting extension to reconfigurable hardware
• capability to define new dynamic resources
• strong support for stage-in and stage-out of:
  - configuration bitstreams
  - executable code
  - input/output data
• support for Windows NT and Linux
Ranking of Centralized Job Management Systems (1)
Capability to define new dynamic resources:
• Excellent: LSF, PBS, CODINE
• More difficult: CONDOR, RES
Stage-in and stage-out:
• Excellent: LSF, PBS
• Limited: CONDOR
• No: CODINE, RES
Ranking of Centralized Job Management Systems (2)
Overall suitability to extend to reconfigurable hardware, in ranked order:
• LSF
• CODINE
• PBS
• CONDOR
• RES
The ranking distinguishes systems that can be extended without changing the JMS source code from systems that require changes to the JMS source code.
Extension to reconfigurable hardware