Update on gLite WMS tests
Andrea Sciabà
WLCG-OSG-EGEE Operations meeting, September 21, 2006
Testing the gLite WMS
• RB installed with gLite 3.0.2 + various patches
• Dedicated machine at CERN (rb102.cern.ch)
  • 2 × Xeon 3.0 GHz
  • 4 GB of RAM
  • 3 RAID1 partitions for better I/O performance
• Closely monitored by GD, FIO and JRA1 people
• Tests run by CMS, GD, ATLAS
CMS Test description
• Application
  • Fake analysis jobs (~30 min of CPU time)
  • Run on CMS Tier-1s and Tier-2s
• Different submission methods
  • Network server
  • WMProxy
  • Bulk submission
• Submission from 1-3 UIs in parallel
• VOMS proxies
• MyProxy renewal on
• Deep resubmission off
• Shallow resubmission ≤ 3 (these settings are sketched below)
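The submission settings above map onto a handful of gLite WMS JDL attributes. The following is a minimal sketch, written in Python only to show how one node of a 200-job collection could be described; the attribute names (RetryCount, ShallowRetryCount, MyProxyServer, Type = "collection", Nodes) follow the gLite WMS JDL, while the executable, sandbox files and MyProxy host are illustrative assumptions, not values taken from the actual test.

```python
# Illustrative sketch only; not the JDL used in the test.
def node_jdl(i: int) -> str:
    """One node of the collection: a fake-analysis job with the test's
    resubmission settings (deep off, shallow <= 3) and proxy renewal on."""
    return (
        '[ Executable = "fake_analysis.sh"; '          # assumed script name
        f'Arguments = "{i}"; '
        'StdOutput = "std.out"; StdError = "std.err"; '
        'InputSandbox = {"fake_analysis.sh"}; '
        'OutputSandbox = {"std.out", "std.err"}; '
        'RetryCount = 0; '                             # deep resubmission off
        'ShallowRetryCount = 3; '                      # shallow resubmission <= 3
        'MyProxyServer = "myproxy.example.org" ]'      # renewal on; host assumed
    )

# Bulk submission groups many nodes into a single collection
# (200 jobs per collection were used in the test).
collection_jdl = (
    '[ Type = "collection"; Nodes = { '
    + ", ".join(node_jdl(i) for i in range(200))
    + " } ]"
)
```

The point of the bulk path through WMProxy is to amortize the per-job submission overhead over the whole collection, rather than paying it once per job through the network server.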
Latest results (I)
• No. of jobs = 3 UIs × 33 CEs × 200 jobs/collection ≈ 20000 jobs
• ~2.5 hours to submit all jobs
  • ~0.5 sec/job
• Submission failed for 6 collections
• ~17 hours to dispatch all jobs
  • Equivalent to ~26000 jobs/day (see the back-of-envelope check below)
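A quick back-of-envelope check of the quoted figures, using only the numbers on this slide; the one assumption (flagged in the comments) is that the 6 collections whose submission failed never reached the dispatch step.

```python
# Back-of-envelope check of the rates quoted above (illustrative).
uis, ces, jobs_per_collection = 3, 33, 200
total_jobs = uis * ces * jobs_per_collection        # 19800, i.e. ~20000 jobs

submit_hours = 2.5
print(submit_hours * 3600 / total_jobs)             # ~0.45 s/job, i.e. ~0.5 sec/job

# Assumption: the 6 failed collections (6 x 200 jobs) were never dispatched.
dispatched = total_jobs - 6 * jobs_per_collection   # 18600 jobs
dispatch_hours = 17
print(dispatched / dispatch_hours * 24)             # ~26000 jobs/day, as quoted
```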
Latest results (II)

Site                          Submit    Wait   Ready   Sched     Run Done(S) Done(F)   Abort   Clear  Cancel
cclcgceli02.in2p3.fr               0       0       0       0       0     200       0       0       0       0
ce01-lcg.cr.cnaf.infn.it           0       0       0       2     122       0       0      76       0       0
ce01-lcg.projects.cscs.ch          0       0       0     195       5       0       0       0       0       0
ce03-lcg.cr.cnaf.infn.it           0       0       0     200       0       0       0       0       0       0
ce04-lcg.cr.cnaf.infn.it           0      10       0       0       0       0      23       0       0     167
ce04.pic.es                        0       0       0       0       0     200       0       0       0       0
ce101.cern.ch                      0       0       0       0       0       0       0     200       0       0
ce102.cern.ch                      0       0       0       0       0       0       0     200       0       0
ce103.cern.ch                      0       9       0       0       0       0       1      16       0     174
ce104.cern.ch                      0      10       0       0       0       0      66      28       0      96
ce105.cern.ch                      0       0       0       0       0       0       0     200       0       0
ce106.cern.ch                      0       0       0       0       0       0       0     200       0       0
ceitep.itep.ru                     0       0       0     150       3      47       0       0       0       0
cmslcgce.fnal.gov                  0       0       0       0       0     200       0       0       0       0
cmsrm-ce01.roma1.infn.it           0       0       0     200       0       0       0       0       0       0
dgc-grid-40.brunel.ac.uk           0       0       0       0       0       0       0     200       0       0
egeece.ifca.org.es                 0       0       0       0       0     190      10       0       0       0
grid-ce1.desy.de                   0       0       0       1       0     199       0       0       0       0
grid-ce2.desy.de                   0       0       0     200       0       0       0       0       0       0
grid10.lal.in2p3.fr                0       0       0       0       0       0       0     200       0       0
grid109.kfki.hu                    0       0       0       0       0     189       0      11       0       0
gridba2.ba.infn.it                 0       0       0       0       1       0       0     199       0       0
gridce.iihe.ac.be                  0       9       0       0       0       0       3      15       0     173
gridce.pi.infn.it                  0       0       0     180      20       0       0       0       0       0
gw39.hep.ph.ic.ac.uk               0       0       0      86      11     103       0       0       0       0
lcg00125.grid.sinica.edu.tw        0       0       0     200       0       0       0       0       0       0
lcg02.ciemat.es                    0      10       0      12       2     150       2       0       0      24
lcg06.sinp.msu.ru                  0       1       0      34      11     154       0       0       0       0
lcgce01.gridpp.rl.ac.uk            0      10       0       0       0       0     158       0       0      32
lcgce01.jinr.ru                    0       1       0     199       0       0       0       0       0       0
polgrid1.in2p3.fr                  0       0       0       0       0       0       3     197       0       0
t2-ce-02.lnl.infn.it               0       0       0       0       0     200       0       0       0       0
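To connect this table to the overall numbers discussed on the next slides, a small aggregation sketch; only a few rows are typed in here (the rest follow the same pattern), and the column names simply mirror the table header above.

```python
# Sum per-site job states into overall totals (only a subset of rows shown).
COLUMNS = ["Submit", "Wait", "Ready", "Sched", "Run",
           "Done(S)", "Done(F)", "Abort", "Clear", "Cancel"]

rows = {
    "cclcgceli02.in2p3.fr":     [0,  0, 0, 0,   0, 200,   0,  0, 0,  0],
    "ce01-lcg.cr.cnaf.infn.it": [0,  0, 0, 2, 122,   0,   0, 76, 0,  0],
    "ce104.cern.ch":            [0, 10, 0, 0,   0,   0,  66, 28, 0, 96],
    "lcgce01.gridpp.rl.ac.uk":  [0, 10, 0, 0,   0,   0, 158,  0, 0, 32],
    # ... remaining sites as in the table above
}

totals = {col: sum(vals[i] for vals in rows.values())
          for i, col in enumerate(COLUMNS)}
finished = sum(totals[c] for c in ("Done(S)", "Done(F)", "Abort", "Cancel"))
print(totals)
print("Done(S) fraction among finished jobs:",
      totals["Done(S)"] / finished if finished else float("nan"))
```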
Failure reasons
• Application errors
• Maradona errors
• "Got a job held event, reason: The PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE" errors
  • The WMS could not submit the job to a gLite CE
• Jobs remaining in Waiting status while Pending events are generated every 5 minutes with the error
  • mkfifo /tmp/…: File exists
• Unspecified gridmanager error
  • Normally a batch system problem
  • Shallow resubmission often recovers it, but if the error happens again the job is aborted (sometimes it appears as Cancelled instead); see the sketch below
• Authentication failed with the Belgian CE (expired CRL)
• Negligible fractions of other errors
  • Could not upload a sandbox file
  • "Got a job held event, reason: Globus error 124: old job manager is still alive"
  • Gatekeeper unreachable
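The resubmission behaviour referred to above can be summarised as a simple policy. The sketch below is a simplified model of that policy as described on this slide, not the actual WMS implementation: shallow resubmission (retrying a job that never started running) is allowed up to 3 times, deep resubmission is off, and once retries are exhausted the job ends up Aborted.

```python
# Simplified model of the resubmission policy used in the test; illustrative only.
import random

SHALLOW_RETRY_COUNT = 3   # shallow resubmission <= 3
DEEP_RETRY_COUNT = 0      # deep resubmission off

def run_job(attempt) -> str:
    """attempt() returns 'done', 'failed_before_running' or 'failed_while_running'."""
    shallow_left, deep_left = SHALLOW_RETRY_COUNT, DEEP_RETRY_COUNT
    while True:
        outcome = attempt()
        if outcome == "done":
            return "Done (Success)"
        if outcome == "failed_before_running" and shallow_left > 0:
            shallow_left -= 1     # shallow resubmission: the job never started running
            continue
        if outcome == "failed_while_running" and deep_left > 0:
            deep_left -= 1        # deep resubmission (disabled in this test)
            continue
        return "Aborted"          # retries exhausted (sometimes shown as Cancelled)

# Hypothetical attempt function: 80% chance of success per try, else a CE-side failure.
print(run_job(lambda: random.choice(["done"] * 4 + ["failed_before_running"])))
```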
Conclusions
• Very small fraction of failed jobs due to the WMS
  • Only those remaining in Waiting status (O(100))
  • All other failures are due either to the application, to the CE, or to authentication problems (expired CRL)
• Performance seems to indicate a maximum rate of ~26000 jobs/day
  • Measured with "Job Robot" jobs; it may be different for other kinds of jobs
• The WMS looks reasonably fine now