Parallel Session B2 - CPU and Resource Allocation
• Panelists:
  • Charles Young (BaBar)
  • David Bigagli
• Seed Questions:
  • Batch queuing system in use?
  • Turnaround guarantees?
  • Pre-allocation of resources?
Batch System
• Vendor or home-brewed?
• Maintenance and support issues.
• If vendor, licensing and cost issues.
  • Already a significant fraction of H/W cost.
  • Per node? Per unit computing power? Or?
• Management concerns.
  • Can one really manage 10K nodes?
  • Split into separate management domains?
Slide from Charles Young (BaBar)
LSF - Talk given by David Bigagli (Platform Computing)
• What is LSF
• Developer's view of LSF
• Architecture
• Scalability
• Dealing with resources
• Load Information Manager
• Batch
• Lively discussion
Discussion: which batch systems are you using and why?
• LSF (~30%?)
  • LSF has worked for us and continues to work.
• PBS (~30%)
  • PBS is free
  • Collaborators want us to use PBS (because PBS is free)
  • Ability to modify source
  • Nobody is using ProPBS
• Condor (~20%)
  • Condor costs nothing
  • Cycle stealing allows us to get computing done
Discussion: which batch systems continued...
• BQS
  • Homegrown at IN2P3
  • Used in a small number of external sites
  • Have had it for seven years
  • Everyone likes it
• FBS
  • Homegrown at FNAL
  • Used in some external sites
  • LSF is expensive
  • FBS is designed to be used on farms
  • Lightweight and flexible
Other issues
• Mosix
  • Only CERN has looked at it
  • Appears to be difficult to take down individual machines in the cluster
• How to deal with abusers?
  • Turn them over to the user community
• How do people schedule downtime?
  • Train people that jobs longer than 24 hours are at risk.
  • CERN posts a future shutdown time for the job starter (internal)
  • BQS has this feature built in.
  • Condor has eventd for draining.
  • Labs reboot and have maintenance windows.
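The downtime points above (jobs longer than 24 hours are at risk, and a job starter that knows about a posted shutdown time) can be illustrated with a small sketch. This is not how CERN's job starter, BQS, or Condor's eventd are actually implemented; the function names `may_start` and `at_risk` and the idea of a user-supplied requested runtime are assumptions made only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold taken from the "longer than 24 hours" rule of thumb.
RISK_LIMIT = timedelta(hours=24)


def may_start(now: datetime,
              requested_runtime: timedelta,
              shutdown_at: Optional[datetime]) -> bool:
    """Return True if the job should finish before the posted shutdown.

    shutdown_at is None when no shutdown has been posted.
    """
    if shutdown_at is None:
        return True
    return now + requested_runtime <= shutdown_at


def at_risk(requested_runtime: timedelta,
            limit: timedelta = RISK_LIMIT) -> bool:
    """Return True if the job runs longer than the advertised safe length."""
    return requested_runtime > limit


if __name__ == "__main__":
    now = datetime(2002, 1, 1, 12, 0)
    shutdown = datetime(2002, 1, 2, 6, 0)  # posted 18 hours ahead
    for hours in (6, 24, 30):
        runtime = timedelta(hours=hours)
        print(hours, may_start(now, runtime, shutdown), at_risk(runtime))
```

With a check like this, short jobs keep flowing right up to the maintenance window while long jobs are held back, which drains the farm without killing running work.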