Scheduling under LCG at RAL
UK HEP Sysman, Manchester, 11th November 2004
Steve Traylen, s.traylen@rl.ac.uk
RAL, LCG, torque and Maui
• Observations of RAL within LCG vs. traditional batch.
• Various issues that arose and what was done to tackle them.
• Upcoming changes in the next release.
• Some items still to be resolved.
LCG Grid vs. Traditional Batch
• Observations from the LHCb vs. Atlas period earlier this year.
• In common for the LCG Grid and local batch:
  • RAL must provide 40% to Atlas, 30% to LHCb, … as dictated by GridPP.
• Different for the LCG Grid and local batch:
  • Batch - 4000 jobs queued for 400 job slots.
  • LCG - often < 5 jobs queued for 400 job slots.
LCG Grid vs. Traditional Batch
• Providing the allocations is difficult with LHCb submitting at a faster rate.
• RAL only received LHCb jobs.
• The only solution with OpenPBS is to hard limit LHCb.
  • But idle CPUs are a waste of money; it is always better to give out an allocation as soon as possible.
• LHCb jobs pile up due to the apparently free resource.
• RAL becomes unattractive (via the Estimated Traversal Time, ETT) to Atlas.
Queues per VO
• Many sites (CNAF, LIP, NIKHEF, …) have moved to queues per VO.
• Advantages:
  • The estimated traversal time calculation is orthogonal for each VO.
  • While LHCb jobs pile up, Atlas jobs are still attracted; hopefully there is always at least one queued job available.
  • Queue lengths can be customised.
• Disadvantages:
  • A change in the farm just to fit into LCG.
  • Adding VOs becomes harder.
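Setting up such a queue in Torque is a one-off qmgr exercise. A minimal sketch, assuming jobs from a VO arrive under a Unix group of the same name; the queue name, limits and group mapping below are illustrative only, not RAL's configuration:

    # on the Torque server: one execution queue per VO, restricted by group ACL
    qmgr -c "create queue atlas queue_type=execution"
    qmgr -c "set queue atlas acl_group_enable = true"
    qmgr -c "set queue atlas acl_groups = atlas"
    qmgr -c "set queue atlas resources_max.walltime = 48:00:00"
    qmgr -c "set queue atlas resources_max.cput = 48:00:00"
    qmgr -c "set queue atlas enabled = true"
    qmgr -c "set queue atlas started = true"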
Queues per VO (2)
• The ETT calculation is rudimentary: it just increases as jobs are queued, on a per-queue basis.
• RAL only gives 1 CPU to Zeus and 399 to Atlas.
• The ETT calculation does not really reflect this.
• In fact RAL’s queues now have a zero FIFO component.
• But it still works: once Zeus jobs pile up they stop coming.
CPU Scaling
• CPU variation can now be removed within the batch farm by configuring pbs_mom to normalise CPU time.
• The normalised speed is published into the info system.
• Walltime scaling is more confusing.
• RAL does scale walltime: we fairshare the whole farm on walltime.
• However, what we advertise is a lie.
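For reference, the per-node normalisation is a pbs_mom configuration item. A minimal sketch using Torque's standard mom config directives; the factor is made up, a real farm would derive it from benchmarks of that node type:

    # mom_priv/config on a worker node
    # multiply the CPU time and walltime the node reports by a per-node factor
    $cputmult 2.5
    $wallmult 2.5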
CPU Scaling and ETT
• Only at RAL is the scaling extreme.
• Normalised to a Pentium 450; nodes are currently scaled by factors of 4.7 to 5.0.
• So the advertised CPU limits and walltimes are very long (9 days).
• Once jobs are queued, RAL becomes very unattractive.
• We modified the info provider to make the “ETT” comparable to other sites.
• We will renormalise at some point soon.
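The actual change to the RAL provider is not reproduced here, but the idea can be sketched as a post-filter on the LDIF the dynamic PBS info provider emits, dividing the advertised response times by the node scaling factor. The attribute names are from the Glue 1.x schema; the input file name is hypothetical and the factor 4.7 is the scaling quoted above:

    # hypothetical filter over the provider's LDIF output
    awk -v f=4.7 '
      /^GlueCEStateEstimatedResponseTime:/ { printf "%s %d\n", $1, $2/f; next }
      /^GlueCEStateWorstResponseTime:/     { printf "%s %d\n", $1, $2/f; next }
      { print }' < provider-output.ldif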
OpenPBS to Maui/Torque
• OpenPBS (as in LCG today) hangs when one node crashes.
• Torque is okay (most of the time).
• Torque is just a new version of OpenPBS maintained by www.supercluster.org.
  • No integration required.
  • Active user community and mailing list.
  • Well maintained; bug fixes and patches are accepted and added regularly.
• Maui is a more sophisticated scheduler, capable of fairshare for instance.
Fairshare with Maui
• The default is FIFO and so the same as the default PBS scheduler.
• Maui supports fairshare on walltime, e.g.:
  • Consider the last 7 days of operation.
  • Give Atlas 50%, CMS 20%.
  • Give lhcbsgm a huge priority but limit “them” to one job.
  • Reserve one CPU for a 10 minute queue (monitoring jobs).
• Maui will strive to reach these targets.
• Tools exist to diagnose and understand why Maui is not doing what you hope for, allowing tuning; a configuration sketch follows.
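A sketch of the corresponding maui.cfg fragment. The parameter names are standard Maui ones, but the targets, group names and the "short" class simply mirror the examples above; they are not RAL's actual configuration:

    # maui.cfg (fragment): fairshare on dedicated processor-seconds (~ walltime)
    FSPOLICY            DEDICATEDPS
    FSDEPTH             7            # keep 7 fairshare windows ...
    FSINTERVAL          24:00:00     # ... of one day each, i.e. the last 7 days
    FSWEIGHT            100

    GROUPCFG[atlas]     FSTARGET=50
    GROUPCFG[cms]       FSTARGET=20

    # software manager account: very high priority but never more than one job
    USERCFG[lhcbsgm]    PRIORITY=100000 MAXJOB=1

    # standing reservation: keep one CPU free for the short (10 minute) queue
    SRCFG[monitor]      PERIOD=INFINITY TASKCOUNT=1 CLASSLIST=short

Commands such as diagnose -f (fairshare usage) and checkjob provide the diagnostic view mentioned above.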
Heterogeneous Clusters
• Many farms currently have mixed memory, local disk space, …
• Glue contains sub clusters, but they are currently identified only by the hostname:
  • GlueSubClusterUniqueID=lcgce02.gridpp.rl.ac.uk
  • So only one hardware type per CE is possible.
• The RB joins the SubCluster against a GlueCE object.
Heterogeneous Clusters (2)
• It would seem easy to describe a second sub cluster and use a unique key (sketched below):
  • GlueSubClusterUniqueID=lcgce02.gridpp.rl.ac.uk-bigmem
• Different GlueCEs could then join on this. Does this work?
  • Information providers may need tweaking.
  • Will the RB do this, and what else will break?
  • Can the JobManager support different attributes per queue to target nodes?
  • Advertising fake queues is possible.
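An illustrative LDIF fragment for such a second sub cluster. The DN suffix, the "bigmem" naming and the memory value are assumptions; only the unique-key scheme matters here:

    dn: GlueSubClusterUniqueID=lcgce02.gridpp.rl.ac.uk-bigmem,GlueClusterUniqueID=lcgce02.gridpp.rl.ac.uk,mds-vo-name=local,o=grid
    objectClass: GlueSubCluster
    GlueSubClusterUniqueID: lcgce02.gridpp.rl.ac.uk-bigmem
    GlueSubClusterName: bigmem
    GlueHostMainMemoryRAMSize: 2048
    GlueChunkKey: GlueClusterUniqueID=lcgce02.gridpp.rl.ac.uk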
Future Possibilities
• One queue per VO, per memory size, per local disk space, per time period, … = a lot of queues.
• Some sites only have one queue and insist on users setting requirements per job (example below).
  • It is a good idea within the batch farm.
  • But the Resource Broker does not pass this on to the GRAM transfer; how much do we want this?
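In the single-queue model the job itself carries the requirements, e.g. (queue name and resource values illustrative):

    # the user states the job's needs instead of picking a specialised queue
    qsub -q grid -l walltime=36:00:00,mem=1gb myjob.sh

The open question above is whether such -l requirements can be carried all the way from the RB and the GRAM transfer down to the local qsub.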
Maui based Info Provider
• The current info provider only interrogates PBS.
• PBS has no idea what is going to happen next.
• A Maui based provider could calculate the ETT better.
• But it may be difficult to port to LSF, BQS, …
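Maui can already answer this question per job or per resource request, which is what such a provider could build on. The job id and resource request below are illustrative:

    # when is job 1234 expected to start, given reservations and fairshare?
    showstart 1234

    # what backfill window exists right now for a 1-processor, 2-hour request?
    # a provider could turn this into a more realistic ETT
    showbf -n 1 -d 2:00:00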
Conclusions
• Moving to torque and Maui is transparent.
• Maui and queues per VO will introduce more control of resources within a site and increase their occupancy.
• Major adjustments are needed to existing queue infrastructures.
• Heterogeneous cluster support within LCG…
• As LCG resources merge with other EGEE resources, new challenges arise, such as running parallel jobs in production – Crossgrid?
References
• Maui and Torque homepages including documentation: http://www.supercluster.org/
• Maui/Torque RPMs appearing in LCG: http://www.gridpp.rl.ac.uk/tb-support/faq/torque.html
• More Maui/Torque RPMs and a qstat cache mechanism: http://www.dutchgrid.nl/Admin/nikhef/