100 likes | 215 Views
Latest results with the gLite WMS. Andrea Sciabà. CMS-WLCG Task Force Meeting September 20, 2006. Status of the WMS. rb102.cern.ch has been reconfigured to improve the I/O disk performance RAID5 partition split in three RAID1 partitions No new patches since two weeks ago. Test.
E N D
Latest results with the gLite WMS Andrea Sciabà CMS-WLCG Task Force Meeting September 20, 2006
Status of the WMS • rb102.cern.ch has been reconfigured to improve the I/O disk performance • RAID5 partition split in three RAID1 partitions • No new patches since two weeks ago
Test • 200 jobs/collection • 33 CEs • 3 UI used in parallel • Total number of jobs: 19800 • “Job Robot” jobs with CMSSW_0_6_1 • VOMS proxies of 24 hours • Myproxy renewal on • No deep resubmission • Shallow resubmission ≤ 3 times
Results (I) • Submission failed for six collections • 18600 jobs effectively submitted • ~ 2h30m to submit all jobs • 0.5 s/job • ~ 17 hours to process all jobs • No Submitted jobs left • 26000 jobs/day
Results (II) Site Submit Wait Ready Sched Run Done(S) Done(F) Abo Clear Canc cclcgceli02.in2p3.fr 0 0 0 0 0 200 0 0 0 0 ce01-lcg.cr.cnaf.infn.it 0 0 0 2 122 0 0 76 0 0 ce01-lcg.projects.cscs.ch 0 0 0 195 5 0 0 0 0 0 ce03-lcg.cr.cnaf.infn.it 0 0 0 200 0 0 0 0 0 0 ce04-lcg.cr.cnaf.infn.it 0 10 0 0 0 0 23 0 0 167 ce04.pic.es 0 0 0 0 0 200 0 0 0 0 ce101.cern.ch 0 0 0 0 0 0 0 200 0 0 ce102.cern.ch 0 0 0 0 0 0 0 200 0 0 ce103.cern.ch 0 9 0 0 0 0 1 16 0 174 ce104.cern.ch 0 10 0 0 0 0 66 28 0 96 ce105.cern.ch 0 0 0 0 0 0 0 200 0 0 ce106.cern.ch 0 0 0 0 0 0 0 200 0 0 ceitep.itep.ru 0 0 0 150 3 47 0 0 0 0 cmslcgce.fnal.gov 0 0 0 0 0 200 0 0 0 0 cmsrm-ce01.roma1.infn.it 0 0 0 200 0 0 0 0 0 0 dgc-grid-40.brunel.ac.uk 0 0 0 0 0 0 0 200 0 0 egeece.ifca.org.es 0 0 0 0 0 190 10 0 0 0 grid-ce1.desy.de 0 0 0 1 0 199 0 0 0 0 grid-ce2.desy.de 0 0 0 200 0 0 0 0 0 0 grid10.lal.in2p3.fr 0 0 0 0 0 0 0 200 0 0 grid109.kfki.hu 0 0 0 0 0 189 0 11 0 0 gridba2.ba.infn.it 0 0 0 0 1 0 0 199 0 0 gridce.iihe.ac.be 0 9 0 0 0 0 3 15 0 173 gridce.pi.infn.it 0 0 0 180 20 0 0 0 0 0 gw39.hep.ph.ic.ac.uk 0 0 0 86 11 103 0 0 0 0 lcg00125.grid.sinica.edu.tw 0 0 0 200 0 0 0 0 0 0 lcg02.ciemat.es 0 10 0 12 2 150 2 0 0 24 lcg06.sinp.msu.ru 0 1 0 34 11 154 0 0 0 0 lcgce01.gridpp.rl.ac.uk 0 10 0 0 0 0 158 0 0 32 lcgce01.jinr.ru 0 1 0 199 0 0 0 0 0 0 polgrid1.in2p3.fr 0 0 0 0 0 0 3 197 0 0 t2-ce-02.lnl.infn.it 0 0 0 0 0 200 0 0 0 0
Summary of failures • Application errors • Maradona errors • “Periodic Hold…” errors • Only with gLite CE • Jobs remaining in Waiting status while Pending events are generated every 5 minutes with error • Mkfifo /tmp/…: File Exists • Unspecified gridmanager error • Shallow resubmission often recovers, but if the error happens again, the job is aborted (but sometimes appears as Cancelled) • Authentication failed with Belgian CE (CRL expired) • Negligible fractions of other errors • Could not upload a sandbox file • Got a job held event, reason: Globus error 124: old job manager is still alive • Gatekeeper unreachable
Efficiencies • gLite CEs ignored (too unreliable) • Collections not submitted ignored • Scheduled, Running and Done (Success) jobs all considered as successes
Conclusions • Very small fraction of failed jobs due to the WMS • Only those remaining in Waiting status • All other failures are due either to the application, to the CE or to authentication problems (expired CRL) • Performance seems to indicate a maximum rate of ~26000 jobs/day • “Job Robot” jobs, it may be different for other kinds of jobs • The WMS looks reasonably fine now