
Latest results with the gLite WMS

Latest results with the gLite WMS. Andrea Sciabà. CMS-WLCG Task Force Meeting, September 20, 2006. Status of the WMS: rb102.cern.ch has been reconfigured to improve disk I/O performance (the RAID5 partition was split into three RAID1 partitions). No new patches in the last two weeks.


Presentation Transcript


  1. Latest results with the gLite WMS • Andrea Sciabà • CMS-WLCG Task Force Meeting • September 20, 2006

  2. Status of the WMS • rb102.cern.ch has been reconfigured to improve disk I/O performance • RAID5 partition split into three RAID1 partitions • No new patches in the last two weeks

  3. Test • 200 jobs/collection • 33 CEs • 3 UIs used in parallel • Total number of jobs: 19800 • “Job Robot” jobs with CMSSW_0_6_1 • VOMS proxies valid for 24 hours • MyProxy renewal on • No deep resubmission • Shallow resubmission ≤ 3 times

  4. Results (I) • Submission failed for six collections • 18600 jobs actually submitted • ~ 2h30m to submit all jobs • ~ 0.5 s/job • ~ 17 hours to process all jobs • No jobs left in Submitted status • ~ 26000 jobs/day
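The figures on slides 3 and 4 can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes one 200-job collection per CE per UI (which reproduces the quoted total of 19800) and that the jobs/day figure is the submitted jobs divided by the ~17 hours of processing, scaled to 24 hours. The variable names are illustrative and do not come from the slides.

```python
# Test parameters quoted on the "Test" slide.
JOBS_PER_COLLECTION = 200
NUM_CES = 33
NUM_UIS = 3

# Assumption: one collection per CE per UI.
planned_jobs = JOBS_PER_COLLECTION * NUM_CES * NUM_UIS

# Six collections failed at submission time.
FAILED_COLLECTIONS = 6
submitted_jobs = planned_jobs - FAILED_COLLECTIONS * JOBS_PER_COLLECTION

# Approximate durations quoted on the "Results (I)" slide.
submit_seconds = 2.5 * 3600   # ~2h30m to submit all jobs
processing_hours = 17         # ~17 hours to process all jobs

# Average submission time per job.
seconds_per_job = submit_seconds / submitted_jobs

# Assumption: daily throughput is the processing rate scaled to 24 hours.
jobs_per_day = submitted_jobs / processing_hours * 24

print(f"planned jobs:   {planned_jobs}")
print(f"submitted jobs: {submitted_jobs}")
print(f"submission:     {seconds_per_job:.2f} s/job")
print(f"throughput:     {jobs_per_day:.0f} jobs/day")
```

Under these assumptions the results agree with the rounded values on the slide (about 0.5 s/job and about 26000 jobs/day).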

  5. Results (II)

     Site                          Submit    Wait   Ready   Sched     Run Done(S) Done(F)     Abo   Clear    Canc
     cclcgceli02.in2p3.fr               0       0       0       0       0     200       0       0       0       0
     ce01-lcg.cr.cnaf.infn.it           0       0       0       2     122       0       0      76       0       0
     ce01-lcg.projects.cscs.ch          0       0       0     195       5       0       0       0       0       0
     ce03-lcg.cr.cnaf.infn.it           0       0       0     200       0       0       0       0       0       0
     ce04-lcg.cr.cnaf.infn.it           0      10       0       0       0       0      23       0       0     167
     ce04.pic.es                        0       0       0       0       0     200       0       0       0       0
     ce101.cern.ch                      0       0       0       0       0       0       0     200       0       0
     ce102.cern.ch                      0       0       0       0       0       0       0     200       0       0
     ce103.cern.ch                      0       9       0       0       0       0       1      16       0     174
     ce104.cern.ch                      0      10       0       0       0       0      66      28       0      96
     ce105.cern.ch                      0       0       0       0       0       0       0     200       0       0
     ce106.cern.ch                      0       0       0       0       0       0       0     200       0       0
     ceitep.itep.ru                     0       0       0     150       3      47       0       0       0       0
     cmslcgce.fnal.gov                  0       0       0       0       0     200       0       0       0       0
     cmsrm-ce01.roma1.infn.it           0       0       0     200       0       0       0       0       0       0
     dgc-grid-40.brunel.ac.uk           0       0       0       0       0       0       0     200       0       0
     egeece.ifca.org.es                 0       0       0       0       0     190      10       0       0       0
     grid-ce1.desy.de                   0       0       0       1       0     199       0       0       0       0
     grid-ce2.desy.de                   0       0       0     200       0       0       0       0       0       0
     grid10.lal.in2p3.fr                0       0       0       0       0       0       0     200       0       0
     grid109.kfki.hu                    0       0       0       0       0     189       0      11       0       0
     gridba2.ba.infn.it                 0       0       0       0       1       0       0     199       0       0
     gridce.iihe.ac.be                  0       9       0       0       0       0       3      15       0     173
     gridce.pi.infn.it                  0       0       0     180      20       0       0       0       0       0
     gw39.hep.ph.ic.ac.uk               0       0       0      86      11     103       0       0       0       0
     lcg00125.grid.sinica.edu.tw        0       0       0     200       0       0       0       0       0       0
     lcg02.ciemat.es                    0      10       0      12       2     150       2       0       0      24
     lcg06.sinp.msu.ru                  0       1       0      34      11     154       0       0       0       0
     lcgce01.gridpp.rl.ac.uk            0      10       0       0       0       0     158       0       0      32
     lcgce01.jinr.ru                    0       1       0     199       0       0       0       0       0       0
     polgrid1.in2p3.fr                  0       0       0       0       0       0       3     197       0       0
     t2-ce-02.lnl.infn.it               0       0       0       0       0     200       0       0       0       0

  6. Summary of failures • Application errors • Maradona errors • “Periodic Hold…” errors (only with gLite CE) • Jobs remaining in Waiting status while Pending events are generated every 5 minutes with one of these errors: “Mkfifo /tmp/…: File Exists”; “Unspecified gridmanager error” • Shallow resubmission often recovers, but if the error happens again, the job is aborted (but sometimes appears as Cancelled) • Authentication failed with the Belgian CE (CRL expired) • Negligible fractions of other errors: “Could not upload a sandbox file”; “Got a job held event, reason: Globus error 124: old job manager is still alive”; “Gatekeeper unreachable”

  7. Efficiencies • gLite CEs ignored (too unreliable) • Collections not submitted ignored • Scheduled, Running and Done (Success) jobs all considered as successes
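The counting rule on this slide can be sketched as a small function. The success states (Scheduled, Running, Done (Success)) and the exclusion of some CEs come from the slide. Everything else is an assumption made for illustration: the denominator (all jobs listed for a site), the function and parameter names, and the `ignored_sites` argument, which stands in for the gLite CEs that the slides do not identify by name. The sample counts are three rows copied from the Results (II) table.

```python
# States counted as successes, as stated on the "Efficiencies" slide.
SUCCESS_STATES = ("Sched", "Run", "Done(S)")


def efficiency(counts):
    """Fraction of a site's jobs in a success state.

    Assumption: the denominator is every job listed for the site.
    Returns None when the site has no jobs at all.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    successes = sum(counts.get(state, 0) for state in SUCCESS_STATES)
    return successes / total


def site_efficiencies(table, ignored_sites=frozenset()):
    """Per-site efficiency, skipping sites listed in ignored_sites.

    ignored_sites is a hypothetical parameter for the excluded gLite CEs.
    """
    return {
        site: efficiency(counts)
        for site, counts in table.items()
        if site not in ignored_sites
    }


# Three rows from the Results (II) table (zero entries omitted).
sample = {
    "lcg02.ciemat.es": {"Wait": 10, "Sched": 12, "Run": 2,
                        "Done(S)": 150, "Done(F)": 2, "Canc": 24},
    "gw39.hep.ph.ic.ac.uk": {"Sched": 86, "Run": 11, "Done(S)": 103},
    "ce01-lcg.cr.cnaf.infn.it": {"Sched": 2, "Run": 122, "Abo": 76},
}

for site, eff in site_efficiencies(sample).items():
    print(f"{site:<28} {eff:.2f}")
```

With this definition the three sample sites come out at 0.82, 1.00 and 0.62. The values in the original efficiency tables may have been computed differently.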

  8. Efficiency table (I)

  9. Efficiency table (II)

  10. Conclusions • Very small fraction of failed jobs due to the WMS (only those remaining in Waiting status) • All other failures are due to the application, the CE, or authentication problems (expired CRL) • Performance seems to indicate a maximum rate of ~26000 jobs/day (measured with “Job Robot” jobs; it may be different for other kinds of jobs) • The WMS looks reasonably fine now
