Connecting LRMS to GRMS
Jeff Templon, PDP Group, NIKHEF
HEPiX Batch Workshop, 12-13 May 2005
Example Site Scenario • Computer cluster at NIKHEF: • 50% guaranteed for SC-Grid • 50% guaranteed for LHC experiments • Allow either group to exceed 50% if the other group is not active • Allow the D0 experiment to scavenge any crumbs • Give 'dteam' (operations group) extremely high priority, but limit it to 2 concurrent jobs • Limit running jobs from production groups to ~95% of capacity (always keep a few CPUs free for e.g. operations checks)
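A minimal sketch of how such a policy could be encoded; the group names and numbers mirror the slide, but this is hypothetical Python, not the actual NIKHEF scheduler configuration.

```python
# Hypothetical sketch of the site policy above; not the real
# NIKHEF scheduler configuration.

TOTAL_CPUS = 100
PROD_CAP = int(0.95 * TOTAL_CPUS)   # keep ~5% of CPUs free for ops checks

POLICY = {
    # group: guaranteed share of the farm, concurrency cap, priority
    "sc-grid": dict(share=0.50, max_jobs=None, priority=10),
    "lhc":     dict(share=0.50, max_jobs=None, priority=10),
    "d0":      dict(share=0.00, max_jobs=None, priority=1),    # scavenger
    "dteam":   dict(share=0.00, max_jobs=2,    priority=100),  # ops, capped
}

def next_group(running, queued):
    """Choose which group's job starts next.

    running: {group: CPUs in use}; queued: {group: waiting jobs}.
    Highest priority wins (dteam first, within its 2-job cap); among
    equal priorities, a group still under its guaranteed share wins,
    so idle capacity spills over to whichever group is active.
    """
    if sum(running.values()) >= PROD_CAP:
        return None                           # production cap reached
    best, chosen = None, None
    for group, p in POLICY.items():
        if not queued.get(group):
            continue                          # nothing waiting
        if p["max_jobs"] is not None and running.get(group, 0) >= p["max_jobs"]:
            continue                          # concurrency cap hit
        under_share = running.get(group, 0) < p["share"] * TOTAL_CPUS
        key = (p["priority"], under_share, group)
        if best is None or key > best:
            best, chosen = key, group
    return chosen

# dteam beats everyone; SC-Grid, being under its 50% share, beats D0:
print(next_group({"lhc": 60, "d0": 10}, {"sc-grid": 5, "d0": 20, "dteam": 1}))
```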
Example User Scenarios • "polite user" • Uses the grid job submission tools 'bare' and lets the grid figure it out • "high-throughput user" • Ignores grid suggestions on sites; blasts each site until jobs start piling up in the 'waiting' state, then moves on to the next site • "sneaky high-throughput user" • Like the above, but doesn't even look at whether jobs pile up … the jobs aren't real jobs, they are 'pilot' jobs (supermarket approach) • "fast turnaround user" • Wants jobs to complete as soon as possible (special priority)
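For concreteness, a minimal sketch of the "high-throughput user" loop; submit_job, waiting_count and the threshold are invented stand-ins for the real grid tools.

```python
# Sketch of the "high-throughput user" strategy: ignore the grid's
# site ranking and blast each site until its waiting queue piles up.
# submit_job() and waiting_count() are hypothetical stand-ins for the
# actual grid submission tools; the threshold is illustrative.

WAITING_THRESHOLD = 5

def blast(sites, jobs, submit_job, waiting_count):
    """Submit jobs site by site, moving on once jobs start waiting."""
    jobs = list(jobs)
    for site in sites:
        while jobs and waiting_count(site) < WAITING_THRESHOLD:
            submit_job(site, jobs.pop())
        if not jobs:
            break
    return jobs   # leftovers found no site with spare capacity
```

The "sneaky" variant simply drops the waiting_count check and submits pilots everywhere.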
Connect Users to Sites with “Maximal Joint Happiness” • Users: work finished ASAP • Sites: always full and usage matches fair-share commitments
Key Question: How Long to Run? • Users: want to submit to the sites that will complete their job as fast as possible • Sites: a site may be "full", i.e. no free CPUs, BUT: • NIKHEF 100% full with ATLAS jobs means that • any 'SC-Grid' jobs submitted will run as soon as a free CPU appears • If you can't get this message to users, you won't get any SC-Grid jobs • Should be clear from this that the answer to "how long" depends on who is asking!
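A toy model (mine, not from the talk) of why the estimate depends on the asker: a VO below its fair share effectively sees only the time until the next CPU frees up, while a VO at or above its share waits behind its own backlog.

```python
# Toy illustration of why "how long until my job runs" depends on who
# asks.  The numbers and the one-line model are made up.

def estimated_start(vo, running, queued, share, total_cpus,
                    avg_job_remaining):
    """Very rough time-to-start estimate (seconds) for one VO."""
    if running.get(vo, 0) < share[vo] * total_cpus:
        # under-share: the next free CPU goes to this VO
        return 0 if sum(running.values()) < total_cpus else avg_job_remaining
    # over-share: wait for the VO's own queued jobs to drain first
    return (queued.get(vo, 0) + 1) * avg_job_remaining

# Site "100% full with ATLAS": ATLAS sees a long wait, SC-Grid sees
# only the time until the next running job finishes.
running = {"atlas": 100}
queued = {"atlas": 40}
share = {"atlas": 0.5, "sc-grid": 0.5}
print(estimated_start("sc-grid", running, queued, share, 100, 3600))  # 3600
print(estimated_start("atlas", running, queued, share, 100, 3600))    # 147600
```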
Different answers, same question • [Figure: time to start (sec) vs. real time (sec), one panel each for dteam and ATLAS; black lines are measured, blue triangles are statistical predictions] • See Laurence's talk
How Long to Run • Need reasonable normalized estimates from users • Need normalized CPU units • Need a solution for the heterogeneous CPU populations behind most sites' grid entry points (NIKHEF has these) • Probably see Laurence's talk here too! • Added value: good run-time estimates help LRMS scheduling (e.g. MPI jobs & backfill)
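One way to read "normalized CPU units", sketched under assumed per-node speed factors; the reference unit and the numbers are invented (something SpecInt-like), not an agreed standard.

```python
# Hedged sketch of run-time normalization across a heterogeneous farm.
# The node names and speed factors are invented for illustration.

NODE_SPEED = {            # node speed relative to a reference CPU
    "old-pentium": 0.5,
    "xeon-2.8":    1.0,
    "opteron":     1.4,
}

def local_walltime(estimate_ref_hours, node_type):
    """Convert a user's estimate, given in reference-CPU hours, into
    wall-clock hours on a specific node type."""
    return estimate_ref_hours / NODE_SPEED[node_type]

# A job estimated at 10 reference-CPU hours:
for node in NODE_SPEED:
    print(node, round(local_walltime(10, node), 1), "hours")
```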
Sneaky HT vs. Polite Users • Polite almost always loses • Sneaky HT is good for sites to 0th order – a mix of waiting jobs allows good scheduling • However: • Templon needs to run 10 jobs • Submits 10 jobs to each of 100 sites in the grid • The first ten to start grab the 'real' work • The other 990 look exactly like black-hole jobs • Waste ~16 CPU hrs (2-minute scheduling cycle × ~500 passes)
Polite Users still Lose unless we solve: • One question, one answer … one size fits nobody • High overhead in WMS: avg 250 sec life cycle for a 20 sec job! • [Figure: grid speedup vs. number of jobs submitted; conditions: two-hour jobs, single user, single RB, best RB performance, scheduling cycle the only delay at the site]
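A back-of-the-envelope model (my own, not from the talk) of why the 250 s WMS life cycle wrecks short jobs but hardly matters for two-hour jobs, assuming the per-job overhead is serialized through a single broker.

```python
# Toy model: speedup of N grid jobs versus running them serially on
# one local CPU, assuming the per-job WMS overhead is serialized
# through a single resource broker while execution parallelizes.

def grid_speedup(n_jobs, job_secs, overhead_secs, cpus):
    serial = n_jobs * job_secs
    grid = n_jobs * overhead_secs + (n_jobs / cpus) * job_secs
    return serial / grid

print(grid_speedup(100, 20, 250, 100))     # 20 s jobs: speedup << 1
print(grid_speedup(100, 7200, 250, 100))   # 2 h jobs: overhead negligible
```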
High Priority Users • Solution 1: dedicated CPUs (standing reservations) (expensive!) • Solution 2: virtualization with preemption (a long way off?)
Other Issues • Transferring info to the LRMS • Run-time estimate • helps enormously in e.g. scheduling MPI jobs • also may help in answering "the question" • Memory usage, disk space needs, etc. • MPI & accounting – what about "the dip"? • Self-disabling sites (avoid hundreds of lost jobs and tens of lost person-hours) • "Circuit breakers"? (Miron Livny)
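As one illustration of "transferring info to the LRMS" (entirely hypothetical plumbing; real middleware does this differently): a run-time estimate in reference-CPU hours could be scaled for the slowest node it might land on, padded, and emitted as a PBS-style walltime request.

```python
# Hypothetical sketch: turn a user's run-time estimate (in
# reference-CPU hours) into a PBS walltime directive, scaling for the
# slowest node and adding a safety pad.  Numbers are illustrative.

def pbs_walltime(estimate_ref_hours, slowest_speed=0.5, pad=1.2):
    hours = estimate_ref_hours / slowest_speed * pad
    h = int(hours)
    m = int((hours - h) * 60)
    return f"#PBS -l walltime={h:02d}:{m:02d}:00"

print(pbs_walltime(10))   # '#PBS -l walltime=24:00:00'
```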