Latest results with the gLite WMS

Andrea Sciabà
CMS-WLCG Task Force Meeting
September 20, 2006

Status of the WMS
• rb102.cern.ch has been reconfigured to improve disk I/O performance
• RAID5 partition split into three RAID1 partitions
• No new patches in the last two weeks

Test
• 200 jobs/collection
• 33 CEs
• 3 UIs used in parallel
• Total number of jobs: 19800
• “Job Robot” jobs with CMSSW_0_6_1
• VOMS proxies valid for 24 hours
• MyProxy renewal enabled
• No deep resubmission
• Shallow resubmission ≤ 3 times

Results (I)
• Submission failed for six collections
• 18600 jobs actually submitted
• ~ 2h30m to submit all jobs
• 0.5 s/job
• ~ 17 hours to process all jobs
• No jobs left in Submitted status
• 26000 jobs/day
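The rates quoted above can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the figures on the slides; the variable names are illustrative, and it assumes one collection per CE per UI (not stated explicitly, but consistent with the 19800 total).

```python
# Figures quoted on the slides (variable names are illustrative).
jobs_per_collection = 200
num_ces = 33
num_uis = 3
failed_collections = 6
submit_seconds = 2 * 3600 + 30 * 60  # ~2h30m to submit all jobs
processing_hours = 17                # ~17 hours to process all jobs

# Assumption: one collection per CE per UI.
planned_jobs = jobs_per_collection * num_ces * num_uis
submitted_jobs = planned_jobs - failed_collections * jobs_per_collection

# Average submission time per job and sustained processing rate.
seconds_per_job = submit_seconds / submitted_jobs
jobs_per_day = submitted_jobs / processing_hours * 24

print(f"planned={planned_jobs} submitted={submitted_jobs}")
print(f"{seconds_per_job:.2f} s/job, {jobs_per_day:.0f} jobs/day")
```

Since the submission and processing times are themselves approximate, the computed values only need to agree with the quoted 0.5 s/job and 26000 jobs/day to within rounding.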

Results (II)

Job counts per CE and job status (Submit = Submitted, Wait = Waiting, Sched = Scheduled, Run = Running, Done(S) = Done (Success), Done(F) = Done (Failed), Abo = Aborted, Clear = Cleared, Canc = Cancelled)

Site                          Submit    Wait   Ready   Sched     Run Done(S) Done(F)     Abo   Clear    Canc
cclcgceli02.in2p3.fr               0       0       0       0       0     200       0       0       0       0
ce01-lcg.cr.cnaf.infn.it           0       0       0       2     122       0       0      76       0       0
ce01-lcg.projects.cscs.ch          0       0       0     195       5       0       0       0       0       0
ce03-lcg.cr.cnaf.infn.it           0       0       0     200       0       0       0       0       0       0
ce04-lcg.cr.cnaf.infn.it           0      10       0       0       0       0      23       0       0     167
ce04.pic.es                        0       0       0       0       0     200       0       0       0       0
ce101.cern.ch                      0       0       0       0       0       0       0     200       0       0
ce102.cern.ch                      0       0       0       0       0       0       0     200       0       0
ce103.cern.ch                      0       9       0       0       0       0       1      16       0     174
ce104.cern.ch                      0      10       0       0       0       0      66      28       0      96
ce105.cern.ch                      0       0       0       0       0       0       0     200       0       0
ce106.cern.ch                      0       0       0       0       0       0       0     200       0       0
ceitep.itep.ru                     0       0       0     150       3      47       0       0       0       0
cmslcgce.fnal.gov                  0       0       0       0       0     200       0       0       0       0
cmsrm-ce01.roma1.infn.it           0       0       0     200       0       0       0       0       0       0
dgc-grid-40.brunel.ac.uk           0       0       0       0       0       0       0     200       0       0
egeece.ifca.org.es                 0       0       0       0       0     190      10       0       0       0
grid-ce1.desy.de                   0       0       0       1       0     199       0       0       0       0
grid-ce2.desy.de                   0       0       0     200       0       0       0       0       0       0
grid10.lal.in2p3.fr                0       0       0       0       0       0       0     200       0       0
grid109.kfki.hu                    0       0       0       0       0     189       0      11       0       0
gridba2.ba.infn.it                 0       0       0       0       1       0       0     199       0       0
gridce.iihe.ac.be                  0       9       0       0       0       0       3      15       0     173
gridce.pi.infn.it                  0       0       0     180      20       0       0       0       0       0
gw39.hep.ph.ic.ac.uk               0       0       0      86      11     103       0       0       0       0
lcg00125.grid.sinica.edu.tw        0       0       0     200       0       0       0       0       0       0
lcg02.ciemat.es                    0      10       0      12       2     150       2       0       0      24
lcg06.sinp.msu.ru                  0       1       0      34      11     154       0       0       0       0
lcgce01.gridpp.rl.ac.uk            0      10       0       0       0       0     158       0       0      32
lcgce01.jinr.ru                    0       1       0     199       0       0       0       0       0       0
polgrid1.in2p3.fr                  0       0       0       0       0       0       3     197       0       0
t2-ce-02.lnl.infn.it               0       0       0       0       0     200       0       0       0       0

Summary of failures
• Application errors
• Maradona errors
• “Periodic Hold…” errors
  • Only with gLite CE
• Jobs remaining in Waiting status while Pending events are generated every 5 minutes with error
  • Mkfifo /tmp/…: File Exists
  • Unspecified gridmanager error
  • Shallow resubmission often recovers, but if the error happens again, the job is aborted (but sometimes appears as Cancelled)
• Authentication failed with Belgian CE (CRL expired)
• Negligible fractions of other errors
  • Could not upload a sandbox file
  • Got a job held event, reason: Globus error 124: old job manager is still alive
  • Gatekeeper unreachable

Efficiencies
• gLite CEs ignored (too unreliable)
• Collections not submitted ignored
• Scheduled, Running and Done (Success) jobs all considered as successes
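The success criterion above can be sketched as a per-CE calculation. The three sample rows are copied from the Results (II) table; the function name and the choice of denominator (all jobs counted for the CE) are assumptions made for illustration, and the exclusion of gLite CEs is not modelled because the slides do not say which CEs those are.

```python
# Job status columns, in the order used in the Results (II) table.
STATES = ["Submit", "Wait", "Ready", "Sched", "Run",
          "Done(S)", "Done(F)", "Abo", "Clear", "Canc"]

# Statuses counted as successes, per the Efficiencies slide.
SUCCESS_STATES = {"Sched", "Run", "Done(S)"}

# Sample rows copied from the Results (II) table.
rows = {
    "ceitep.itep.ru":           [0, 0, 0, 150, 3, 47, 0, 0, 0, 0],
    "lcg02.ciemat.es":          [0, 10, 0, 12, 2, 150, 2, 0, 0, 24],
    "ce01-lcg.cr.cnaf.infn.it": [0, 0, 0, 2, 122, 0, 0, 76, 0, 0],
}


def efficiency(counts):
    """Fraction of jobs in a success status (hypothetical helper).

    Assumption: the denominator is every job counted for the CE.
    """
    by_state = dict(zip(STATES, counts))
    total = sum(counts)
    successes = sum(by_state[state] for state in SUCCESS_STATES)
    return successes / total if total else 0.0


for site, counts in rows.items():
    print(f"{site}: {efficiency(counts):.2f}")
```

Counting Scheduled and Running jobs as successes reflects that the WMS has already delivered them to the CE; whether they later complete depends on the site and the application.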

Efficiency table (I)

Efficiency table (II)

Conclusions
• Very small fraction of failed jobs due to the WMS
  • Only those remaining in Waiting status
• All other failures are due to the application, to the CE or to authentication problems (expired CRL)
• Performance seems to indicate a maximum rate of ~26000 jobs/day
  • Measured with “Job Robot” jobs; the rate may differ for other kinds of jobs
• The WMS looks reasonably fine now
