
Christmas production: WMS part

Christmas production: WMS part. ALICE TF Meeting, 8th-Jan-2009. General points: ALICE has been running in production mode during Christmas 2009. About 3500-4000 concurrent jobs on average. AliEn v2.16 has entered production for the first time.

kami

Presentation Transcript


  1. Christmas production: WMS part • ALICE TF Meeting, 8th-Jan-2009

  2. General points • ALICE has been running in production mode during Christmas 2009 • About 3500-4000 concurrent jobs on average • AliEn v2.16 has entered production for the first time • New WMS submission mode (multiple WMS used in random mode) • New CREAM module available, although not used during this period • Most important issue in terms of job submission during Xmas'09: • The CERN WMS have been overloaded most of the time • Continuous observation needed • Draining operations still needed this week • GS gave us access to one of the CMS WMS • wms103 (ALICE old-hw), wms109 (ALICE old-hw), wms204 (ALICE new-hw) and wms117 (CMS)

  3. WMS issues • The load on the CERN WMS has been extremely high during the vacation period • We were creating huge backlogs in the WMS • http://wmsmon.cern.ch/monitoring/monitoring.html • In addition, this high load has been observed in several VOBOXes • i.fl: the number of unprocessed entries in the input.fl file, the input queue for the Workload Manager service • q.fl: the number of unprocessed entries in the queue.fl file, the input queue for the Job Controller service

  4. Effects of this high load • We were seeing jobs in status READY and WAITING for a long time • And we already know what this means for us: huge bunches of jobs submitted to the sites once the WMS comes back into production • Still, we do not consider READY and WAITING as problematic statuses and we keep submitting and submitting… SNOWBALL: creating huge backlogs • Why these job statuses? • The Workload Manager has to deal with all preceding requests in the "i.fl" queue at least once (a submission that does not match will be retried later if it has not expired by that time) • The Job Controller has to deal with all preceding requests in the "q.fl" queue • On the 30th of December the numbers were: • wms103 had a "q.fl" backlog of 5979 and wms109 had 1529 in that column; wms204 had an "i.fl" backlog of 31302
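The snowball described on this slide can be avoided by checking the READY/WAITING backlog before each new submission. The following is only a minimal sketch of that idea: the threshold `MAX_PENDING` and the function names are hypothetical, and how the job statuses are obtained (for instance from the WMS or from AliEn) is left out.

```python
from collections import Counter

# Hypothetical limit: the slides give no threshold, this value is illustrative only.
MAX_PENDING = 200

# The two statuses the slide identifies as the source of the backlog.
PENDING_STATES = ("READY", "WAITING")


def pending_count(job_states):
    """Count jobs whose status is READY or WAITING (case-insensitive)."""
    counts = Counter(state.upper() for state in job_states)
    return sum(counts[state] for state in PENDING_STATES)


def may_submit(job_states, max_pending=MAX_PENDING):
    """Allow a new submission only while the pending backlog is below the limit."""
    return pending_count(job_states) < max_pending


# Example with made-up numbers: 210 pending jobs exceed the limit of 200.
states = ["RUNNING"] * 50 + ["READY"] * 120 + ["WAITING"] * 90
print(pending_count(states))  # 210
print(may_submit(states))     # False
```

With such a check in the submission loop, a WMS that stops processing its queues would see submissions pause instead of pile up.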

  5. Why did this effect appear? • The WMS service is ready to deal with up to 15000 jobs per day and per node without any remarkable issue • The issue was therefore not in the WMS… but in the myproxy server at CERN • We were overloading the myproxy server at CERN • Huge number of connections to the myproxy server from many VOBOXes • For example: on the 2nd of January ipnvobox.in2p3.fr was connecting at a rate exceeding 1 Hz?! • This was also affecting other services, including the WMS, and not only the ALICE ones
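To give a sense of scale for the 1 Hz figure, the short calculation below converts a connection rate into connections per day and sets it next to the 15000 jobs per day that one WMS node is said to handle. Connections and jobs are different quantities, so the ratio is only an order-of-magnitude comparison, not a measurement from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figure quoted on the slide: jobs per day that one WMS node can handle.
WMS_JOBS_PER_DAY = 15000


def connections_per_day(rate_hz):
    """Convert a connection rate in Hz into connections per day."""
    return rate_hz * SECONDS_PER_DAY


per_day = connections_per_day(1)  # a single VOBOX connecting at 1 Hz
print(per_day)                                # 86400
print(round(per_day / WMS_JOBS_PER_DAY, 2))   # 5.76
```

A single VOBOX at 1 Hz therefore opens more myproxy connections per day than the number of jobs a whole WMS node is sized for, and the slide reports many VOBOXes connecting at once.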

  6. Actions to take • From the point of view of IT/FIO • The myproxy server should be migrated to faster machines (already proposed by GS to FIO) • From the point of view of ALICE • We will have to change our submission and control mechanisms

  7. ALICE actions • Control actions: • READY and WAITING are extremely dangerous statuses for us • Submission actions: • Single job submission admits two procedures • glite-wms-job-submit -a jdlfile • Automatic delegation procedure performed by WMProxy • This is what we are doing now • In addition, we delegate a proxy each time we submit an agent • glite-wms-job-submit -d mydelegID • Explicit delegation: a named delegated credential is created on the WMProxy and each job submission refers to this delegated proxy • The advantage of the former is simplicity, while the latter gives better performance (delegation requires a non-negligible amount of time) • In addition, from the VOBOX it is trivial
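The difference between the two procedures can be made concrete by listing the commands each one would issue for a batch of agents. The sketch below only builds the command lines and does not run them. It assumes the named credential is created once with `glite-wms-job-delegate-proxy -d`, which is an assumption about how the delegation would be set up, and the JDL file names are hypothetical.

```python
def submission_plan(jdl_files, deleg_id=None):
    """Return the list of commands (as argv lists) needed to submit the given JDLs.

    Without deleg_id, every submission uses -a (automatic delegation).
    With deleg_id, one explicit delegation is followed by -d submissions.
    """
    if deleg_id is None:
        return [["glite-wms-job-submit", "-a", jdl] for jdl in jdl_files]
    plan = [["glite-wms-job-delegate-proxy", "-d", deleg_id]]
    plan += [["glite-wms-job-submit", "-d", deleg_id, jdl] for jdl in jdl_files]
    return plan


def delegation_count(plan):
    """Count the commands in a plan that trigger a proxy delegation."""
    return sum(
        1
        for cmd in plan
        if cmd[0] == "glite-wms-job-delegate-proxy" or "-a" in cmd
    )


jdls = ["agent1.jdl", "agent2.jdl", "agent3.jdl"]  # hypothetical file names
print(delegation_count(submission_plan(jdls)))               # 3
print(delegation_count(submission_plan(jdls, "mydelegID")))  # 1
```

The automatic mode delegates once per agent, while the named credential is delegated once and reused, which is where the performance gain mentioned on the slide comes from.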

  8. ALICE actions (cont.) • …and it is trivial because we already have a refreshed delegated proxy in each VOBOX • Even refreshed by the VOBOX itself • We should stop delegating a new proxy each time we make a new agent submission • In addition, we can exploit the bulk submission procedure • As easy as caching all the agents' JDLs in a file • Not so many changes to the LCG package • The whole bunch is made with just one delegation • At this point the -a and -d options are irrelevant (they must be used, but are irrelevant from the point of view of proxy delegation)
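One possible shape for the bulk submission is sketched below: the agent JDLs are cached on disk and the whole set is handed to the WMS in one call. The sketch assumes the `--collection` option of `glite-wms-job-submit`, which takes a directory of JDL files. The slide speaks of caching the JDLs "in a file", so using a directory is an assumption, and the JDL content and function names are hypothetical placeholders. Nothing is actually submitted.

```python
import tempfile
from pathlib import Path


def cache_agent_jdls(jdl_texts, directory):
    """Write each agent JDL into the cache directory and return the file paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(jdl_texts):
        path = directory / f"agent_{index:04d}.jdl"
        path.write_text(text)
        paths.append(path)
    return paths


def bulk_submit_command(deleg_id, directory):
    """Build (but do not run) a single submission command for the cached JDLs."""
    return ["glite-wms-job-submit", "-d", deleg_id, "--collection", str(directory)]


agent_jdl = 'Executable = "agent.sh";\n'  # hypothetical placeholder JDL
with tempfile.TemporaryDirectory() as tmp:
    cached = cache_agent_jdls([agent_jdl] * 3, tmp)
    print(len(cached))  # 3
print(bulk_submit_command("mydelegID", "agents"))
```

However many agents are cached, the plan contains one submission command and therefore one delegation for the whole bunch.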

  9. Conclusions • IMHO we should not think of CREAM as the solution at the present moment • It surely will be, but we do not have a date • Whatever CREAM will do for us, we should make the WMS perfectly workable for us • CREAM will be in a testing phase for a while • In case of issues, WMS submission should be the quick replacement, knowing that we can trust its performance
