Resource Broker, gLite etc. CMS vs. middleware

geri

Presentation Transcript


  1. Resource Broker, gLite etc. CMS vs. middleware

  2. Middleware for CMS
     • Event Data:
        • Catalogs are CMS-made: need to be tailored to the experiment
     • Non-Event Data:
        • Access outside CERN via HTTP + standard web caches (Squid)
     • Data transfer
        • Middleware provides the storage: SRM v2.2
        • Middleware provides File Transfer Service
        • CMS moves datasets on top of that: PhEDEx
     • Running jobs
        • Middleware provides remote job submission
           • LCG RB, gLite WMS, Condor-G
        • CMS embeds that into CMS user workflows
           • CRAB, CRAB Analysis Server, ProductionAgent
     • Resource sharing (job priorities and all that)
        • In the near future: managed at the sites

  3. Issues for 2007
     • Data Management: SRM v2
        • Site interoperability
        • Better control at Tier1 of disk/tape, pin/unpin
        • New FTS and some changes needed in PhEDEx
     • Job Priorities: only a configuration/deployment issue
        • Have asked for 3 “service classes” at all sites
           • software manager: express queue
           • production: up to 50% of resources
           • normal users: all the rest, fair-share based, static mapping will help
     • Job Submission: still a big issue
        • LCG RB is slow (~one job/minute)
        • LCG RB chokes at ~5K jobs/day vs. 200K/day target for 2008
        • gLite WMS: much promised, still not in production after 2 years
        • Condor-G: fast and basic (too basic?)
        • Will the CE be the next bottleneck?
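The three requested service classes can be read as a simple slot-allocation rule. The sketch below is only an illustration of that reading, not how any site or batch system actually implements it: the function name `allocate_slots`, the group names, and the equal split among user groups (a crude stand-in for real fair share) are all assumptions.

```python
def allocate_slots(total_slots, sgm_demand, production_demand, user_demand):
    """Split a site's job slots among the three service classes.

    user_demand maps a (hypothetical) user group name to requested slots.
    """
    # Express queue: software-manager jobs are served first (assumption).
    sgm = min(sgm_demand, total_slots)
    remaining = total_slots - sgm

    # Production may take up to 50% of the site's resources.
    production = min(production_demand, total_slots // 2, remaining)
    remaining -= production

    # Normal users get all the rest; an equal split among groups with
    # unmet demand stands in for a real fair-share scheduler.
    users = {group: 0 for group in user_demand}
    active = {g for g, d in user_demand.items() if d > 0}
    while remaining > 0 and active:
        share = max(remaining // len(active), 1)
        for group in sorted(active):
            give = min(share, user_demand[group] - users[group], remaining)
            users[group] += give
            remaining -= give
        active = {g for g in active if users[g] < user_demand[g]}

    return {"software_manager": sgm, "production": production, "users": users}


if __name__ == "__main__":
    print(allocate_slots(100, 2, 80, {"higgs": 60, "top": 10}))
```

In this toy model production is capped at half the site even when it asks for more, and slots a small group does not need flow to the groups that still have demand.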

  4. CMS plans
     • For 2007, middleware integration and testing for CMS is tackled within the Computing Commissioning sub-project (i.e. S.B.)
     • Work on the current issues (especially scaling up the Job Submission tools) will be tackled jointly with OSG collaborators
        • Means everybody checks their own tools, but we compare, possibly using the same test suite, and will jointly pick the best solution for each use case
     • A workplan for the next 6 months has been outlined
     • CMS-Italy and INFN have responsibility for testing the gLite tools
        • gLite WMS
        • gLite CE
        • CREAM CE (the next all-Italian computing element)

  5. From Computing Commissioning Plan
     • SRM v2.2
        • Make sure CMS can use new SRMs
     • gLite 3.x
        • New WMS, new gLite CE
        • gLite 3.1 single job (CMS), bulk submission (ATLAS)
        • Better error reporting in UI (important for dashboard)
     • OSG
        • Stress test of various job submission tools
        • Stress test of current and future OSG CEs
        • Stress test of dCache
     • Job priorities
        • Verify that they are consistently deployed and work
     • Interoperability
        • Keep OSG and EGEE interoperating
        • Integrate NDGF aka NorduGrid
        • Condor-G submission to work for EGEE sites

  6. Work program on gLite WMS
     • gLite WMS to replace LCG-RB for single job submission
        • Better scalability, faster submission, additional features
        • Tested already to 1-2K jobs/day continuously, 5K for short times
        • Work by EIS team (Andrea Sciabà and Enzo Miccio)
        • Time to use it with Production
     • gLite WMS for bulk submission: higher performance
        • Stress test until April by EIS team
        • Already available in CRAB (but not advised for general users)
        • Work in progress to integrate in Production Agent
           • Carlos, Ale, Giuseppe, William
     • gLite CE
        • EIS team to add them to test suite, easy
        • Expect better reliability and error reporting
        • Work for March, April, May
     • CREAM CE
        • Use same test suite, easy to add, have to see how it works
        • From April onward

  7. Status of gLite WMS
     • Bulk submission from UI to WMS is fast
     • Problem so far is that WMS dies under its own load
        • Could make 20K jobs/day, but not day after day
        • Not as simple as “reboot it”. Need specific actions (kill processes, restart processes, clean hung jobs, clean logs) every day or so. Not viable for production
     • Current “production” version gLite 3.0: no way
     • Crash effort under way since last fall on gLite 3.1
        • One machine at CERN under stress by ATLAS (same pattern as CMS, using Andrea Sciabà’s test suite)
        • Enormous work and progress by developers in the last months, many components improved, including new Condor versions and processes that terminate themselves after some time. Tons of new patches
        • As of last week it submits ~15K jobs/day using bulk submission continuously (5 days in a row by now)
        • More robustness expected after rewriting one critical piece to avoid Condor DAGMan (work by F. Giacomini, almost finished)
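To put the quoted rates side by side with the 200K jobs/day target for 2008, here is a small back-of-the-envelope calculation. The rates are the ones given in the slides; the assumption that throughput scales linearly across independent broker instances is mine and is almost certainly optimistic.

```python
import math

MINUTES_PER_DAY = 24 * 60
TARGET_PER_DAY = 200_000  # 2008 target quoted in the slides

# Daily rates quoted in the slides for each tool or configuration.
observed_per_day = {
    "LCG RB (ceiling)": 5_000,
    "gLite WMS 3.1 bulk (sustained)": 15_000,
    "gLite WMS (peak, not day after day)": 20_000,
}


def per_minute(jobs_per_day):
    """Convert a daily job count into an average per-minute rate."""
    return jobs_per_day / MINUTES_PER_DAY


def instances_needed(jobs_per_day, target=TARGET_PER_DAY):
    """Instances required to reach the target, ASSUMING linear scaling."""
    return math.ceil(target / jobs_per_day)


if __name__ == "__main__":
    # One job per minute, the quoted LCG RB submission speed, is 1440 jobs/day.
    print("1 job/minute =", 1 * MINUTES_PER_DAY, "jobs/day")
    print("target = %.0f jobs/minute" % per_minute(TARGET_PER_DAY))
    for name, rate in observed_per_day.items():
        print(f"{name}: {instances_needed(rate)} instances for the target")
```

Under that linear-scaling assumption the target needs about 40 brokers at the LCG RB ceiling but about 14 at the sustained gLite 3.1 bulk rate, which is why the sustained figure matters more than the peak.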

  8. Conclusion
     • Resource Broker is not what one would like yet
        • Still it may be almost there
        • Future is in the hands of CMS-Italy (and ATLAS-Italy)
     • Keeping the Grid filled from a few submission points (lxplus, a few ProductionAgents) will be a daunting task anyhow
     • One hammer does not fit all screws
        • Do not be surprised if in the end different submission tools better serve different use cases
        • CRAB and Production Tools developers will make that transparent to users
     • Do not panic at cryptic Grid error messages, we will analyse the data
