
Workload Management

Status of current activity. GridPP 13, Durham, 6th July 2005. Activities covered: scalability testing, analysis of current middleware performance, SGE integration, and GridCC.



  1. Workload Management Status of current activity GridPP 13, Durham, 6th July 2005

  2. Activity…
     • Scalability testing
     • Analysis of current middleware performance
     • SGE integration
     • GridCC
     Workload Management David Colling, Imperial College London

  3. Scalability Testing: People involved: Janusz Martyniak, Luke Dickens, Barry MacEvoy, Steve McGough, David Colling

  4. Scalability Testing: Why…
     • From EDG we knew that it was easy to build a system capable of running 5 jobs concurrently.
     • Not so easy to build one capable of running 500 or 5000 jobs concurrently.
     • The plan was to perform testing to find software bottlenecks and hot spots.
     • Feed the results back to the developers in a “virtuous circle”.

  5. Scalability Testing: The methodology…
     • Original plan was to build a testbed across 2 sites (Imperial HEP and LeSC). This was deliverable X.Y.
     • Take an “engineering” approach, i.e. submit tests to the testbed and monitor how the different components respond.
     • Metrics to be tested were to evolve in complexity as the stability grew.

  6. Scalability Testing: What happened…
     • Decided to join the JRA1 testbed instead of forming our own. This gave us better access to the developers and much support on other parts of the system that we were not directly testing but which are needed to run the tests, e.g. VOMS, RGMA. It also made a contribution to the wider community. This decision has been praised by Bob Jones and Frederic Hemmer.
     • Still decided to run two sites (as per the deliverable), as this gave a better testing environment for scalability tests.

  7. Scalability Testing: What happened…
     • We were delayed by the late release of the WMS in EGEE.
     • However, we have had two sites in JRA1 testing since immediately after the Athens meeting. The two sites are maintained by JM and LD (labelled Site 2 and Site 1 on the slide) and they consist of:
     One site:
       • Machines: 1 WMS, 2 CEs (+1), 2 WNs (+1)
       • Install: apt
       • Config: Site
       • Version: R1.1 (+ QF7&8)
     The other site:
       • Machines: 1 WMS, 1 CE, 2 WNs, 1 RGMA Server, 1 IO Server, 1 UI
       • Install: Manual
       • Config: Site (mostly)
       • Version: R1.1

  8. Scalability Testing: To add to these sites…
     • SEs
     • VOMS
     • Second RGMA server (to complete split)

  9. Scalability Testing: Actual testing…
     • Only really started writing scalability tests a couple of weeks ago
     • Have defined some basic metrics
       • Time to submit as a function of number of jobs for serial submission
       • Time to submit for parallel submission
       • Failure rates as a function of active jobs
       • etc.
     • Use LB database and system monitoring on WMS node to reconstruct what is going on
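The serial-submission metrics on this slide could be collected with something like the following Python sketch. It is only an illustration under stated assumptions: `submit` is a hypothetical callable standing in for the real job-submission step, which the talk does not show, and the function names and result keys are invented here.

```python
import time
from typing import Callable, Dict, List


def time_serial_submission(submit: Callable[[int], bool], n_jobs: int) -> Dict[str, float]:
    """Submit n_jobs one after another and record timing and failures.

    `submit` is a hypothetical stand-in for the real submission command;
    it receives a job index and returns True on success.
    """
    per_job: List[float] = []
    failures = 0
    start = time.perf_counter()
    for i in range(n_jobs):
        t0 = time.perf_counter()
        ok = submit(i)
        per_job.append(time.perf_counter() - t0)
        if not ok:
            failures += 1
    total = time.perf_counter() - start
    return {
        "n_jobs": n_jobs,
        "total_s": total,
        "mean_s": sum(per_job) / n_jobs if n_jobs else 0.0,
        "failure_rate": failures / n_jobs if n_jobs else 0.0,
    }


def sweep(submit: Callable[[int], bool], sizes: List[int]) -> List[Dict[str, float]]:
    """Repeat the measurement for several batch sizes (time vs. number of jobs)."""
    return [time_serial_submission(submit, n) for n in sizes]
```

Calling `sweep(submit, [10, 100, 1000])` would give one row per batch size, which is the shape needed to plot submission time against number of jobs. A parallel-submission variant would need a different driver and is not sketched here.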

  10. Scalability Testing: So, 100 simple jobs submitted sequentially…
     • Result preliminary
     • Example of what we are trying to do
     • Bypassed known problems, especially cross matching
     • Summary…

  11. Scalability Testing: Summary…
     • 28 Success
     • 53 Proxy expired (12 hours after the jobs were submitted!)
     • 3 Aborted due to reaching retry count
     • 16 Ready state
     In this sample the greatest source of failure is CondorC.
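As a quick check on the figures above, the four outcome counts can be turned into fractions of the 100 submitted jobs. The counts are taken from the slide; the dictionary keys and the helper name `fractions` are made up for this sketch.

```python
from typing import Dict

# Outcome counts as quoted on the slide for the 100-job sequential run.
outcomes: Dict[str, int] = {
    "Success": 28,
    "Proxy expired": 53,
    "Aborted (retry count reached)": 3,
    "Ready state": 16,
}


def fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each outcome as a fraction of all jobs in the sample."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    for name, frac in fractions(outcomes).items():
        print(f"{name}: {frac:.0%}")
```

The counts sum to 100, so each count is also its percentage of the sample.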

  12. Scalability Testing: 100 jobs submitted sequentially. All registered in 3 minutes.

  13. Scalability Testing: Long tail of retries. Greatest number <5000s (Excel binning).

  14. Scalability Testing: 100 jobs submitted sequentially. Can plot for individual processes or groups of processes. Plot annotations: "5 Minutes", "Still activity 1 hour later".

  15. Scalability Testing: Future plans…
     • Automate testing scripts
     • Output directed to web pages
     • Expand metrics as appropriate

  16. Performance of middleware: We have access to the job data through the LB databases, so why not have a look?
     • People involved: Gidon Moont and David Colling

  17. Performance of middleware: Long tail

  18. Performance of middleware (plot labels: Number of entries, Efficiency, RunTime (s))

  19. Performance of middleware: Future plans…
     • Keep monitoring this across different releases
     • Low-level activity
     • Feedback into JRA2

  20. SGE Porting: People involved: David McBride, Mona Aggarwal and Owen Maroney

  21. SGE Porting: LCG Integration with Sun Grid Engine (SGE)
     • Wish to add LCG as an additional entry point for our existing SGE cluster.
     • Problem: LCG installation assumes the use of PBS as the cluster management system.
     • Solution: replace PBS-specific components with SGE-specific components.

  22. SGE Porting: PBS-specific components in LCG (that need replacing)
     • Globus JobManager
       • Already have an existing alternative Globus JobManager for Sun Grid Engine to replace the lcgpbs version.
       • Implemented in Perl, well understood.
       • Supports 5.x, 6.x revisions of SGE.
       • Currently installed, about to enter the first run of testing as part of an LCG CE installation.

  23. SGE Porting: PBS-specific components in LCG (that need replacing)
     • Information Reporter
       • Have developed a first-pass attempt at an SGE information reporter.
       • Again, developed in Perl, small, relatively straightforward. (Existing PBS code wasn't very clear, but the GLUE Schema is public.)
       • Installed on site CE, about to enter first run of validation and iterative improvement.

  24. SGE Porting: PBS-specific components in LCG (that need replacing)
     • Accounting (APEL)
       • APEL: Accounting using PBS Event Logs.
       • SGE does have advanced accounting records, but they are not stored in the same format as PBS!
       • Existing Java-based tooling seems large and complex for what should be a fairly straightforward task; not obvious where changes could/should be made.
       • Refactored version exists in gLite, but would still require a new implementation of an SGE-specific backend.
       • Using the updated gLite revision on site may well work, but would introduce manageability issues at upgrade time.
       • Currently wondering whether APEL can simply be replaced with a small Perl script(!) Currently looking for documentation on the APEL/R-GMA reporting interface.
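To give a feel for how small the SGE side of such an accounting script might be, here is a minimal Python sketch that reads SGE accounting records. It is not the authors' Perl code, and it does not attempt the APEL/R-GMA reporting step. It assumes the colon-separated layout described in the SGE `accounting(5)` man page, with the leading fields in the order listed in `FIELDS`; that order should be checked against the installed SGE version. The sample record in the usage note is invented for illustration.

```python
from typing import Dict, Iterable, Iterator, Union

# Leading fields of an SGE accounting record, in the order assumed from
# the accounting(5) man page. Verify against the installed SGE release.
FIELDS = [
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
]

Record = Dict[str, Union[str, int, float]]


def parse_accounting(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one dict per job record, skipping comments and short lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(":")
        if len(parts) < len(FIELDS):
            continue  # malformed or truncated record
        rec: Record = dict(zip(FIELDS, parts))
        for key in ("job_number", "submission_time", "start_time",
                    "end_time", "exit_status"):
            rec[key] = int(rec[key])
        rec["ru_wallclock"] = float(rec["ru_wallclock"])
        yield rec
```

For example, the made-up record `all.q:node01:hep:alice:test.sh:42:sge:0:1120600000:1120600010:1120600070:0:0:60` would parse to owner `alice`, job number 42, and a wall-clock time of 60 seconds. Splitting on `:` assumes job names contain no colons, which a production script would need to handle.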

  25. SGE Porting: Community of Interest formed
     • Code available from: http://www.lesc.ic.ac.uk/projects/SGE-LCG.html
     • Mailing list: coi-sge-lcg@imperial.ac.uk

  26. GridCC: People involved: Marko Krznaric, Janusz Martyniak, Luke Dickens, John Darlington, Steve McGough, David McBride and David Colling + Tiziana & Costas

  27. GridCC: There was a lot about GridCC at GridPP12, so this is a brief update
     • Discussions between GridCC and EGEE (Bob Jones and Frederic Hemmer)
     • Agreed to collaborate (e.g. use EGEE CVS); GridCC relies on EGEE
     • First release September this year
     • Review October this year

  28. GridCC: Bits in red from UK WMS activity

  29. Summary
     • Activity in 4 areas: testing, analysis, SGE port, GridCC
