Current status: WMS and CREAM CE deployment
Patricia Mendez Lorenzo, ALICE TF Meeting (CERN, 02/04/09)
WMS: some highlights
• In December 2008 ALICE finished the migration of all sites to a WMS submission approach
• The instabilities found in the system have forced the experiment and the support to babysit the system and the production continuously
• This procedure does not scale to real data taking (in a few months)
• ALICE has not changed the submission procedure, which was defined even before the 2006 DC
• IMHO it is not for the experiment to change the submission procedure because a new service is not providing the corresponding stability
• It is the service that must cope with the experiment requirements and computing model, not the opposite
• Let's stop:
  • Saying that this issue affects ALICE only: it is simply NOT TRUE
    • Daily I see similar issues with Geant4, Lattice QCD, sixT.
  • Asking ALICE to change the submission procedure
    • It is not realistic at this point; in addition, I do not see the point of changing one workload management system due to (not well understood) instabilities in a service
ALICE approach
• ALICE requires deployment of the CREAM-CE at all sites
• This is the highest priority
• Sites might be excluded from the production if the service is not provided
• The experiment therefore will not maintain a new submission procedure for some months
• Intermediate period from WMS to CREAM
• In addition, both systems must be maintained together
• Bulk submission is not yet supported at the CLI level by CREAM
• It is not realistic for ANY application to have 2 submission approaches at this time
Status of the WMS in production
• Distribution of WMS in the ALICE production
• For the T0 site
  • Optimal situation: 3 WMS covering the production and the Pass 1 reconstruction at the T0 only
  • The reality: each node has reached a limit of 13K jobs/day (confirmed by the WMS operation experts). In addition, these nodes have to cope with the instabilities of external WMS
• For T1 sites
  • Optimal situation: each T1 site should provide at least 2 WMS, which should be dedicated when many T2 sites in the country depend on it
  • The reality: this affects basically Italy and France, and it is ensured by Italy
• For T2 sites
  • Optimal situation: large federations WITHOUT a regional T1 should follow the structure asked for the T2 sites (case of Russia)
  • The reality: the available T1 WMS must move from one T2 to another depending on the daily overload status
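To make the T0 numbers concrete, the short sketch below turns the quoted per-node limit of 13K jobs/day into a capacity estimate. Only the 13K figure and the count of 3 WMS come from the slide; the function names and the 50,000 jobs/day load used in the example are hypothetical, and the model ignores the extra load from external WMS instabilities.

```python
# Per-node limit quoted on the slide (confirmed by the WMS operation experts).
JOBS_PER_DAY_LIMIT = 13_000


def max_daily_jobs(nodes, per_node_limit=JOBS_PER_DAY_LIMIT):
    """Upper bound on jobs/day that `nodes` WMS nodes can handle together."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * per_node_limit


def wms_nodes_needed(expected_jobs_per_day, per_node_limit=JOBS_PER_DAY_LIMIT):
    """Smallest number of WMS nodes whose combined limit covers the load."""
    if expected_jobs_per_day < 0:
        raise ValueError("expected_jobs_per_day must be non-negative")
    # Integer ceiling division, avoiding floating-point rounding.
    return -(-expected_jobs_per_day // per_node_limit)


if __name__ == "__main__":
    # The 3 WMS of the "optimal situation" for the T0.
    print(max_daily_jobs(3))
    # Hypothetical daily load, chosen only for illustration.
    print(wms_nodes_needed(50_000))
```

Under these assumptions, the three T0 nodes top out at 39,000 jobs/day, so any sustained load above that needs a fourth node before instabilities are even considered.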
Some truths and some lies about the ALICE submission procedure and the WMS
• The latest WMS mega-patch solves the overloading issues observed in gLite 3.0: FALSE
• We have not seen huge backlogs anymore: TRUE
• The ALICE submission procedure has changed recently, producing the instabilities observed in some WMS: FALSE
• The experiment tried to accommodate the submission procedure to the WMS as much as possible, within the limits of its own computing model: TRUE
• Same WMS configuration file as in AFS@CERN
• Proxy renewal triggered only once per hour
• RESUBMISSION FEATURE OF THE WMS DISCARDED BY THE EXPERIMENT AT THE JDL LEVEL SINCE FEB 2009
• ALICE is therefore using the WMS at tree level (RB mode)
• All the rest of the features are simply not used and not required
WHAT WAS HAPPENING IN FRANCE?
• Issues in GRIF and CCIN2P3 are totally uncorrelated
• GRIF
  • grid33.lal.in2p3.fr got overloaded yesterday
  • In addition, it was announced that ALICE was overloading the CE
  • Resubmission approach was discarded
  • Number of jobs not visible in the IS not the LB (later on)
• CCIN2P3
  • This is the unique VO supporting CE in the T1 and T2
  • CEs with different ranks
  • This situation was filling one CE (best ranking), leaving the rest of the CEs empty
  • The query to the info system was returning 0 waiting jobs for those (worse-ranking) CEs, and therefore the system kept on submitting jobs
  • T1 and T2 clusters will be separated in different VOBOXES
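The CCIN2P3 behaviour can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the actual ALICE or WMS logic: the CE names, the ranks, the "submit one job whenever a CE reports 0 waiting jobs" rule and the absence of job draining are all hypothetical. It shows why routing every job to the best-ranked CE while polling the empty ones keeps the submission loop running.

```python
def simulate(ranks, rounds):
    """Toy model of the feedback loop described for CCIN2P3.

    ranks:  dict mapping a (hypothetical) CE name to its rank; higher is better.
    rounds: number of polling rounds to simulate.
    Returns the number of waiting jobs per CE after the last round.
    """
    waiting = {ce: 0 for ce in ranks}
    best = max(ranks, key=ranks.get)  # every job is routed to this CE
    for _ in range(rounds):
        for ce in ranks:
            # Assumed rule: a CE that reports 0 waiting jobs triggers
            # one more submission.
            if waiting[ce] == 0:
                # The job lands on the best-ranked CE, so the polled CE
                # still reports 0 in the next round.
                waiting[best] += 1
    return waiting


if __name__ == "__main__":
    # Hypothetical CE names and ranks, for illustration only.
    print(simulate({"ce_a": 3, "ce_b": 2, "ce_c": 1}, rounds=5))
```

In this model the worse-ranked CEs never leave the "0 waiting" state, so the queue on the best-ranked CE grows every round, which matches the symptom reported on the slide.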
Status of the CREAM-CE
• New sites providing CREAM-CE:
  • RU-SPbSU (under testing)
  • Prague (still to be tested)
  • Subatech (still to be tested)
• Already existing sites with production infrastructures:
  • FZK (just upgraded to the next version)
  • Kolkata (performing fine)
  • KISTI (no issues)
  • GSI (pending the setup in production)
  • RAL (no issues)
  • CNAF (no issues)
  • CERN (moving the system from SLC5 to SLC4 to increase the number of resources)
  • Torino (no issues)
  • SARA (no issues)