Christmas production: WMS part
ALICE TF Meeting, 8th-Jan-2009
General points
• ALICE has been running in production mode during Christmas 2009
  • About 3500-4000 concurrent jobs on average
• AliEn v2.16 has entered production for the first time
  • New WMS submission mode (multiple WMS used in random mode)
  • New CREAM module available, although not used during this period
• Most important issue in terms of job submission during Xmas '09
  • The CERN WMS have been overloaded most of the time
  • Continuous observation needed
  • Draining operations still needed this week
• GS gave us access to one of the CMS WMS
  • wms103 (ALICE old-hw), wms109 (ALICE old-hw), wms204 (ALICE new-hw) and wms117 (CMS)
WMS issues
• The load of the CERN WMS has been extremely high during the vacation period
• We were creating huge backlogs in the WMS
  • http://wmsmon.cern.ch/monitoring/monitoring.html
• In addition, this high load has been observed in several VOBOXES
• i.fl: the number of unprocessed entries in the input.fl file, the input queue for the WM service
• q.fl: the number of unprocessed entries in the queue.fl file, the input queue for the Job Controller service
[Plots: i.fl and q.fl backlogs]
Effects of this high load
• We were seeing jobs in status READY and WAITING for a long time
  • And we already know what this means for us: huge bunches of jobs submitted to the sites once the WMS comes back into production
• Still, we do not consider READY and WAITING as problematic statuses, and we keep submitting and submitting… SNOWBALL: creating huge backlogs
• Why these job statuses?
  • The Workload Manager has to deal with all preceding requests in the "i.fl" queue at least once (a submission that does not match will be retried later if it has not expired by that time)
  • The Job Controller has to deal with all preceding requests in the "q.fl" queue
• On the 30th of December the numbers were:
  • wms103 had a "q.fl" backlog of 5979 and wms109 had 1529 in that column; wms204 had an "i.fl" backlog of 31302
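The snowball effect comes from continuing to submit while jobs pile up in READY and WAITING. A minimal sketch of a guard against it is given here, assuming the submitter can obtain a list of status strings for the jobs it has already sent to a given WMS. The function name `may_submit` and the threshold `max_backlogged` are hypothetical and do not come from the slides or from AliEn.

```python
from collections import Counter

# Statuses that the slides identify as signs of a WMS backlog.
BACKLOG_STATUSES = {"READY", "WAITING"}


def may_submit(job_statuses, max_backlogged=50):
    """Return True if it looks safe to submit more agents to this WMS.

    job_statuses: iterable of status strings for jobs already submitted.
    max_backlogged: hypothetical threshold (not a value from the slides).
    """
    counts = Counter(job_statuses)
    backlogged = sum(counts[status] for status in BACKLOG_STATUSES)
    return backlogged <= max_backlogged


if __name__ == "__main__":
    # 40 READY + 30 WAITING = 70 backlogged jobs, above the threshold of 50.
    statuses = ["RUNNING"] * 10 + ["READY"] * 40 + ["WAITING"] * 30
    print(may_submit(statuses))
```

With such a check in front of each submission cycle, a WMS that stops processing its queues would stop receiving new agents instead of accumulating them.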
Why did this effect appear?
• The WMS service is ready to deal with up to 15000 jobs per day and per node without any remarkable issue
• The issue was therefore not in the WMS… but in the myproxy server at CERN
• We were overloading the myproxy server at CERN
  • Huge number of connections to the myproxy server from many VOBOXES
  • For example: on the 2nd of January ipnvobox.in2p3.fr was connecting at a rate exceeding 1 Hz?!
• This was also affecting other services, including the WMS, and not only the ALICE ones
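To put the 1 Hz figure in perspective, the short calculation below converts it into connections per day and compares it with the 15000 jobs per day that one WMS node is quoted as handling. This is only back-of-the-envelope arithmetic on the two numbers from the slide. Comparing connections with jobs is an assumption made for illustration, since the two are not the same unit of work.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figure quoted on the slide for one WMS node.
WMS_JOBS_PER_DAY = 15000


def connections_per_day(rate_hz):
    """Convert a sustained connection rate in Hz into connections per day."""
    return rate_hz * SECONDS_PER_DAY


if __name__ == "__main__":
    one_vobox = connections_per_day(1.0)
    print(one_vobox)                     # 86400.0
    print(one_vobox / WMS_JOBS_PER_DAY)  # 5.76
```

A single VOBOX connecting at 1 Hz therefore makes more than five times as many myproxy connections per day as the number of jobs a WMS node is expected to process.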
Actions to take
• From the point of view of IT/FIO
  • The myproxy server should be migrated to faster machines (already proposed by GS to FIO)
• From the point of view of ALICE
  • We will have to change our submission and control mechanisms
ALICE actions
• Control actions:
  • READY and WAITING are extremely dangerous statuses for us
• Submission actions:
  • Single job submission admits two procedures
  • glite-wms-job-submit -a jdlfile
    • Automatic delegation procedure kept by WMProxy
    • This is what we are doing now
    • In addition, we delegate a proxy each time we submit an agent
  • glite-wms-job-submit -d mydelegID
    • Explicitly creates a named delegated credential on the WMProxy and refers to this delegated proxy at each job submission
  • The advantage of the former is simplicity, while the latter gives better performance (delegation requires a non-negligible amount of time)
  • In addition, from the VOBOX it is trivial
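The difference between the two procedures can be sketched by building the command lines for a batch of agents and counting how many proxy delegations each plan causes. The sketch only constructs argument lists and does not run anything. It assumes that the named credential is created beforehand with the separate `glite-wms-job-delegate-proxy -d` command, which the slide does not mention. The function names, the delegation ID and the JDL file names are hypothetical.

```python
def plan_per_job_delegation(jdl_paths):
    """Option -a: every submission triggers its own automatic delegation."""
    return [["glite-wms-job-submit", "-a", path] for path in jdl_paths]


def plan_named_delegation(jdl_paths, deleg_id):
    """Option -d: delegate once under a name, then reuse it for every job."""
    commands = [["glite-wms-job-delegate-proxy", "-d", deleg_id]]
    commands += [
        ["glite-wms-job-submit", "-d", deleg_id, path] for path in jdl_paths
    ]
    return commands


def count_delegations(commands):
    """Count the commands in a plan that cause a proxy delegation."""
    return sum(
        1
        for cmd in commands
        if cmd[0] == "glite-wms-job-delegate-proxy" or "-a" in cmd
    )


if __name__ == "__main__":
    jdls = ["agent_1.jdl", "agent_2.jdl", "agent_3.jdl"]  # hypothetical files
    print(count_delegations(plan_per_job_delegation(jdls)))               # 3
    print(count_delegations(plan_named_delegation(jdls, "mydelegID")))    # 1
```

With automatic delegation the number of delegations grows with the number of agents, while with a named credential it stays at one per refresh of the proxy.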
ALICE actions (cont.)
• … and it is trivial because we already have a refreshed delegated proxy in each VOBOX
  • Even refreshed by the VOBOX itself
• We should stop delegating a new proxy each time we make a new agent submission
• In addition, we can exploit the bulk submission procedure
  • As easy as caching all the agent JDLs in a file
  • Not so many changes in the LCG package
  • The whole bunch is made with just one delegation
  • At this point the -a and -d options are irrelevant (they must be used, but are irrelevant to the proxy delegation procedure)
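One possible shape for the bulk submission is sketched here: the agent JDLs are cached on disk and the whole bunch is submitted with a single command. The sketch assumes the `--collection` option of `glite-wms-job-submit`, which takes a directory of JDL files; the slide speaks of caching them "in a file", so the mechanism the author had in mind may differ. The helper names, the file naming scheme, the delegation ID and the JDL text are placeholders, and the command is only built, not executed.

```python
import tempfile
from pathlib import Path


def cache_agent_jdls(jdl_texts, directory):
    """Write each agent JDL into `directory` and return the created paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(jdl_texts):
        path = directory / f"agent_{index:04d}.jdl"  # hypothetical naming
        path.write_text(text)
        paths.append(path)
    return paths


def bulk_submit_command(directory, deleg_id):
    """Build one submission command for all JDLs cached in `directory`."""
    return [
        "glite-wms-job-submit",
        "-d", deleg_id,
        "--collection", str(directory),
    ]


if __name__ == "__main__":
    placeholder_jdl = 'Executable = "/bin/hostname";\n'  # not a real agent JDL
    with tempfile.TemporaryDirectory() as tmp:
        cached = cache_agent_jdls([placeholder_jdl] * 3, tmp)
        print(len(cached))
        print(bulk_submit_command(tmp, "mydelegID"))
```

However many agents are cached, the plan contains one submission command and therefore one use of the delegated proxy.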
Conclusions
• IMHO we should not think of CREAM as the solution at the current moment
  • It surely will be, but we do not have a date
• Whatever CREAM will do for us, we should make the WMS perfectly workable for us
  • CREAM will be in the testing phase for a while
  • In case of issues, WMS submission should be the quick replacement, knowing that we can trust its performance