1 / 10

Job submission, CE and WMS

Job submission, CE and WMS. Claudio Grandi (INFN & CERN) GDB meeting 31/8/2007. WMS/LB: acceptance criteria. Before being taken in certification: 5 days consecutive run without intervention 10K jobs/day handled by a single instance of the WMS number of jobs in non-final state < 0.5%

hasad-mckee
Download Presentation

Job submission, CE and WMS

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Job submission, CE and WMS Claudio Grandi (INFN & CERN) GDB meeting 31/8/2007

  2. WMS/LB: acceptance criteria • Before being taken in certification: • 5 days consecutive run without intervention • 10K jobs/day handled by a single instance of the WMS • number of jobs in non-final state < 0.5% • Proxy renewal must work at the 98% level • Note that the LB should be installed on a separate machine (that can serve multiple WMS instances at the same time) GDB - 31 August 2007

  3. WMS/LB: status • Easter: • run one week at 15000 jobs/day without manual intervention with 0.3% of jobs in non-final state • Stress tests: • almost 27000 jobs/day • Check-point patches: • #1116 on 02/04/07 • #1140 on 20/04/07 • #1167 on 15/05/07 • #1203 on 21/06/07 • #1251 on 18/07/07 • Patch #1251 on the PPS • Will be released to PS next Wednesday if no major issues found • Issue: all that is built in the gLite 3.0 environment • The SL4/VDT 1.6.0 release (from ETICS) is being tested now 27000 jobs/day Final states } Job on CE } Job in WMS Small number of jobs not assigned to CEs GDB - 31 August 2007

  4. CE: acceptance criteria • Before being taken in certification: • 5000 simultaneous jobs per CE node • Job failure rates due to CE in normal operation: < 0.5% • Job failures due to restart of CE services or CE reboot <0.5%. • For 2007 dress rehearsals • 50 user/role/submission_node combinations (Condor_C instances) per CE node • 5 days unattended running with performance on day 5 equivalent to that on day 1 • In the longer term • 1 CE node should support an unlimited number of user/role/submission node combinations, from at least 20 VOs • On the gLiteCE requires user switching done by glexec in blah • 1 month unattended running without performance degradation GDB - 31 August 2007

  5. CE: additional requirements • Support for commonly used batch systems • High priority: LSF, PBS-Torque/Maui • Condor, SGE • Support for commonly used submission systems • gLite WMS, Condor-G, pre-WS GRAM • In future BES. WS-GRAM? • Note that by construction not all the submission systems may be implemented by the same product • may need more than one product at each site... • Support for services needed on the infrastructure • GIP to publish on the bdII information system • Accounting data to the GOC (APEL, DGAS?) • Interoperability with our major partner projects (OSG) GDB - 31 August 2007

  6. The gLiteCE and CREAM passed the accepatance tests gLite CE (GSI-enabled Condor-C) on SL3 (tests @CERN) Implements the identity change with glexec (called by BLAH) Up to 8000 jobs in a day submitted with 40 users with no failures https://edms.cern.ch/document/863275/1 CREAM on SL3 - Tests going on on SL4 now (tests @CNAF) Uses the same BLAH of the gLite CE > 90000 in 8 days with 50 users and 111 failures (LSF errors) https://edms.cern.ch/document/863276/1 CE: status 1/2 CREAM ~ 6 K jobs on the CE at any time Run phase: ~10 K jobs/day Load phase: ~ 1 K jobs/hour } GDB - 31 August 2007

  7. CE: status 2/2 • LCG-CE has been ported to SL4 / VDT 1.6.0 • GT4 pre-WS GRAM (tests @CERN) • Uses the globus call-out implemented in GT4 and the new LCAS/LCMAPS that was developed for the gLiteCE • Does not scale to the level of the acceptance tests • ok for single user, but 21% success rate with 9000 jobs/day submitted by 15 users and high load on the machine • https://twiki.cern.ch/twiki/bin/view/LCG/LCGCETest GDB - 31 August 2007

  8. CE: next steps • Assuming that the project cannot support more than one CE in the long term, the TCG decided that: • The first priority is to deploy the LCG-CE on SL4 as it is • On the PPS in early September, on the PS by end of September • Full develop CREAM and make it available on the production infrastructure • Certification by December, in production in February-March 2008 • https://twiki.cern.ch/twiki/bin/view/EGEE/CECheckList • Sites supporting applications that use native globus interface for job submission will continue to deploy the LCG-CE • The development of a new node type that combines the LCG-CE and CREAM is in discussion and will be addressed after the release of CREAM in production • Ask VOs to switch to other supported interfaces • No effort will be spent on the gLite CE • We will continue to accept requirements on BLAH and glexec from Condor that will be included in the RedHat and Fedora distributions • TCG:https://twiki.cern.ch/twiki/bin/viewfile/EGEE/TCGHome?rev=1;filename=CE-proposal.doc GDB - 31 August 2007

  9. CE: summary CREAM or BES client User Globus client UI • LCG-CE (GT4 pre-WS GRAM) • Globus client, Condor-G • Globus code not modified • Not passing acceptance tests • Condor-C (was gLite CE) • Condor-G • Maintained by Condor • VDT includes BLAH & glexec • CREAM (WS-I) • CREAM and BES client, ICE, Condor-G (prototype) gLite WMS ICE Condor-G CREAM CEMon Condor-C LCG-CE (GT4 + add-ons) Site gLite component non-gLite component User / Resource BLAH EGEE authZ, InfoSys, Accounting Batch System In production Existing prototype GIP GDB - 31 August 2007

  10. Job Prioritization • TCG-WG active since more than one year • Focus is on a short term solution that support access to batch system shares based on the user credentials (VOMS) • Situation summarized in a note (Andrea and Simone): • http://cern.ch/egee-intranet/NA1/TCG/wgs/Job%20Priorities%20Implementation%20Plan.doc • Proposal: the VOView Access Control Base Rule (ACBR) is published in a way that reflects the behaviour of LCAS/LCMAPS • VOViews are mutually exclusive (use DENY tags) • Required modifications to the match maker (WMS), the Generic Information Provider (GIP) and YAIM • Implementation is done but a few details still prevent a correct configuration of the GIP • In the longer term a consistent way of treating authorization in middleware components is needed • Authorization service that will also be able to handle policies • Better mechanisms on the CE to map users to batch shares GDB - 31 August 2007

More Related