Resource Broker, gLite etc. CMS vs. middleware.
Middleware for CMS
• Event Data:
  • Catalogs are CMS-made: need to be tailored to the experiment
• Non-Event Data:
  • Access outside CERN via HTTP + standard web caches (Squid)
• Data transfer
  • Middleware provides the storage: SRM v2.2
  • Middleware provides File Transfer Service
  • CMS moves datasets on top of that: PhEDEx
• Running jobs
  • Middleware provides remote job submission
    • LCG RB, gLite WMS, Condor-G
  • CMS embeds that into CMS user workflows
    • CRAB, CRAB Analysis Server, ProductionAgent
• Resource sharing (job priorities and all that)
  • In the near future: managed at the sites
Middleware
Issues for 2007
• Data Management: SRM v2
  • site interoperability
  • better control at Tier1 of disk/tape, pin/unpin
  • New FTS and some changes needed in PhEDEx
• Job Priorities: only a configuration/deployment issue
  • Have asked for 3 “service classes” at all sites
    • software manager: express queue
    • production: up to 50% of resources
    • normal users: all the rest, fair share based, static mapping will help
• Job Submission: still a big issue
  • LCG RB is slow (~one job/minute)
  • LCG RB chokes at ~5K jobs/day vs. 200K/day target for 2008
  • gLite WMS: much promised, still not in production after 2 years
  • Condor-G: fast and basic (too basic?)
  • Will the CE be the next bottleneck?
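The job submission figures above can be put side by side with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes load could be spread evenly over independent broker instances, which the slides do not claim.

```python
import math

MINUTES_PER_DAY = 24 * 60

# Figures quoted on the slide.
TARGET_2008_PER_DAY = 200_000    # target for 2008
RB_DAILY_LIMIT = 5_000           # LCG RB "chokes" at about this many jobs/day
RB_JOBS_PER_MINUTE = 1.0         # "~one job/minute"


def jobs_per_day(jobs_per_minute: float) -> float:
    """Daily throughput of a submitter running non-stop at a fixed rate."""
    return jobs_per_minute * MINUTES_PER_DAY


def instances_needed(target_per_day: float, capacity_per_day: float) -> int:
    """Instances required to reach the target, ASSUMING perfect load
    spreading across independent instances (a simplification)."""
    return math.ceil(target_per_day / capacity_per_day)


if __name__ == "__main__":
    serial = jobs_per_day(RB_JOBS_PER_MINUTE)
    print(f"One submission stream at 1 job/minute: {serial:.0f} jobs/day")
    print("Brokers needed at 5K jobs/day each:",
          instances_needed(TARGET_2008_PER_DAY, RB_DAILY_LIMIT))
    print("Streams needed at 1 job/minute each:",
          instances_needed(TARGET_2008_PER_DAY, serial))
```

Under these assumptions a single stream at one job per minute yields 1,440 jobs/day, so the 2008 target is a factor of 40 above the point where the LCG RB chokes.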
CMS plans
• For 2007, middleware integration and testing for CMS is tackled within the Computing Commissioning sub-project (i.e. S.B.)
• Work on the current issues (especially scaling up the Job Submission tools) will be tackled jointly with OSG collaborators
  • Means everybody checks their own tools, but we compare, possibly using the same test suite, and will jointly pick the best solution for each use case
• A workplan for the next 6 months has been outlined
• CMS-Italy and INFN have responsibility for testing the gLite tools
  • gLite WMS
  • gLite CE
  • CREAM CE (the next all-Italian computing element)
From Computing Commissioning Plan
• SRM v2.2
  • Make sure CMS can use new SRMs
• gLite 3.x
  • New WMS, new gLite CE
  • gLite 3.1 single job (CMS), bulk submission (ATLAS)
  • Better error reporting in UI (important for dashboard)
• OSG
  • Stress test of various job submission tools
  • Stress test of current and future OSG CEs
  • Stress test of dCache
• Job priorities
  • Verify that the configuration is consistently deployed and works
• Interoperability
  • Keep OSG and EGEE interoperating
  • Integrate NDGF aka NorduGrid
  • Condor-G submission to work for EGEE sites
Work program on gLite WMS
• gLite WMS to replace LCG-RB for single job submission
  • Better scalability, faster submission, additional features
  • Tested already to 1-2K jobs/day continuously, 5K for short times
    • Work by EIS team (Andrea Sciabà and Enzo Miccio)
  • Time to use it with Production
• gLite WMS for bulk submission: higher performance
  • Stress test until April by EIS team
  • Already available in CRAB (but not advised for general users)
  • Work in progress to integrate in Production Agent
    • Carlos, Ale, Giuseppe, William
• gLite CE
  • EIS team to add them to test suite, easy
  • Expect better reliability and error reporting
  • Work for March, April, May
• CREAM CE
  • Use same test suite, easy to add, have to see how it works
  • From April onward
Status of gLite WMS
• Bulk submission from UI to WMS is fast
• Problem so far is that WMS dies under its own load
  • Could make 20K jobs/day, but not day after day
  • Not as simple as “reboot it”. Need specific actions (kill processes, restart processes, clean hung jobs, clean logs) every day or so. Not viable for production
• Current “production” version gLite 3.0: no way
• Crash effort on gLite 3.1 under way since last fall
  • One machine at CERN under stress by ATLAS (same pattern as CMS, using Andrea Sciabà’s test suite)
  • Enormous work and progress by developers in recent months: many components improved, including new Condor versions and processes that terminate themselves after some time. Tons of new patches
  • As of last week it submits ~15K jobs/day using bulk submission continuously (5 days in a row by now)
  • More robustness expected after rewriting one critical piece to avoid Condor DAGMan (work by F. Giacomini, almost finished)
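The distinction drawn above between a peak rate ("20K jobs/day, but not day after day") and a sustained one ("5 days in a row") can be expressed as a small check on daily job counts. This is only a sketch: the helper name is hypothetical and the sample numbers are invented for illustration, not measurements from the stress tests.

```python
from typing import Iterable


def longest_sustained_run(daily_jobs: Iterable[int], threshold: int) -> int:
    """Return the longest streak of consecutive days on which the number
    of submitted jobs was at or above `threshold`."""
    best = current = 0
    for jobs in daily_jobs:
        # Extend the streak on a good day, reset it otherwise.
        current = current + 1 if jobs >= threshold else 0
        best = max(best, current)
    return best


if __name__ == "__main__":
    # INVENTED sample data: one peak day, a bad day, then a steady week.
    sample = [20_000, 9_000, 15_200, 15_100, 15_400, 15_000, 15_300]
    print("Days in a row at >= 15K:", longest_sustained_run(sample, 15_000))
    print("Days in a row at >= 20K:", longest_sustained_run(sample, 20_000))
```

With such a check, a single 20K day counts as a streak of one, while a service that holds ~15K for a working week shows a streak of five, which is the property that matters for production use.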
Conclusion
• Resource Broker is not what one would like yet
  • Still it may be almost there
  • Future is in the hands of CMS-Italy (and ATLAS-Italy)
• Keeping the Grid filled from a few submission points (lxplus, a few ProductionAgents) will be a daunting task anyhow
• One hammer does not fit all screws
  • Do not be surprised if in the end different submission tools better serve different use cases
  • CRAB and Production Tools developers will make that transparent to users
• Do not panic at cryptic Grid error messages, we will analyse the data