
Project Status Report


Presentation Transcript


  1. Project Status Report
  WLCG Overview Board, 3rd September 2010, CERN
  Ian Bird

  2. Topics
  • Experience with data
  • Service quality
  • Resource installations
  • Milestones
  • Budget issues
  • RRB issues
  • Tier 3s
  • Service evolution

  3. Five months with LHC data
  • ATLAS: 1.7 PB raw
  • CMS:
    • 220 TB of RAW data at 7 TeV
    • 70 TB of cosmics during this period
    • 110 TB of tests and exercises with the trigger
  • LHCb: 70 TB raw data since June
  • ALICE: 550 TB

  4. WLCG Usage
  • Use remains consistently high: ~1 M jobs/day; ~100k CPU-days/day (cross-checked in the sketch below)
  • Large numbers of analysis users: CMS ~500, ATLAS ~1000, LHCb/ALICE ~200
  • ALICE: ~200 users, using 5-10% of Grid resources
  [Charts: jobs/day and CPU-days/day, with LHCb and CMS highlighted]
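As a back-of-the-envelope cross-check of the figures quoted above (a sketch using only the numbers on the slide): 100k CPU-days/day spread over 1 M jobs/day implies an average job length of about 2.4 CPU-hours.

```python
# Back-of-the-envelope check of the usage figures quoted above.
jobs_per_day = 1_000_000        # ~1 M jobs/day
cpu_days_per_day = 100_000      # ~100k CPU-days/day

avg_cpu_hours_per_job = cpu_days_per_day * 24 / jobs_per_day
print(f"Average job length: {avg_cpu_hours_per_job:.1f} CPU-hours")  # -> 2.4
```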

  5. Job workloads
  • CMS: 100k jobs per day (red: analysis)
  • ATLAS: analysis jobs
  • ALICE: 60 sites, ~20k jobs running in parallel
  [Charts: job workloads for CMS, LHCb, ATLAS and ALICE]

  6. Data transfers (mostly!) at the rates anticipated
  • CMS saw the effect of going from 24x24 to 48x48 bunches: throughput rose from ~500 MB/s (24x24) to ~800 MB/s (48x48); daily volumes are computed below
  • Traffic on the OPN up to 70 Gb/s during ATLAS reprocessing campaigns
  [Chart: ATLAS throughput]
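To put these rates in context, a short illustration (arithmetic only, using the figures quoted on this slide) of what they mean as daily data volumes:

```python
# Daily data volumes implied by the transfer rates quoted above.
def tb_per_day(rate_mb_s: float) -> float:
    """Convert a sustained rate in MB/s into TB/day (decimal units)."""
    return rate_mb_s * 86_400 / 1_000_000

print(f"24x24 bunches, 500 MB/s: {tb_per_day(500):.0f} TB/day")  # ~43 TB/day
print(f"48x48 bunches, 800 MB/s: {tb_per_day(800):.0f} TB/day")  # ~69 TB/day

# The 70 Gb/s OPN peak expressed in MB/s (1 Gb/s = 125 MB/s).
opn_peak_mb_s = 70 * 125
print(f"OPN peak: {opn_peak_mb_s} MB/s "
      f"({tb_per_day(opn_peak_mb_s):.0f} TB/day if sustained)")
```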

  7. Data distribution
  • ATLAS: total throughput T0-T1, T1-T1, T1-T2
  • CMS: T0-T1
  • LHCb: T0-T1
  [Charts: data-distribution throughput per experiment]

  8. Data distribution for analysis
  • For all experiments, early data has been available for analysis within hours of data taking

  9. Resource status

  10. Issues with resources
  • CNAF: remainder of disk and CPU to be available 1st September
  • NL-T1: serious issues with disks; data corruption meant two thirds of the disk was unusable for some time
  • ASGC: remaining 1 PB of disk not available before the end of October

  11. CPU – July
  • Significant use of Tier 2s for analysis
  • The frequently expressed concern that too much analysis would be done at CERN is not borne out
  • Tier 0 capacity is underused in general, but this is expected to change as luminosity increases

  12. Service incidents – Q2 2010

  13. Service incidents – Q3 2010
  • The NL-T1 DB issue illustrates the criticality of the LFC for ATLAS
  • We must set up a live stand-by of the LFC for all ATLAS T1s; this is already done at CNAF and could be hosted centrally (a sketch of a stand-by availability probe follows below)
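A live stand-by implies routinely probing the catalogue endpoints so a switch-over can be triggered. As a minimal sketch (hostnames are hypothetical; it assumes the standard LFC client tools, which select the target server via the LFC_HOST environment variable):

```python
# Minimal availability probe for LFC instances (sketch; hostnames hypothetical).
# Relies on the standard LFC client command `lfc-ls`, which reads the target
# server from the LFC_HOST environment variable.
import os
import subprocess

LFC_HOSTS = ["lfc-primary.example.org", "lfc-standby.example.org"]  # hypothetical

def lfc_alive(host: str, path: str = "/grid", timeout: int = 30) -> bool:
    """Return True if the LFC at `host` answers a simple directory listing."""
    env = dict(os.environ, LFC_HOST=host)
    try:
        result = subprocess.run(["lfc-ls", path], env=env,
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False

for host in LFC_HOSTS:
    print(f"{host}: {'up' if lfc_alive(host) else 'DOWN'}")
```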

  14. ALICE data loss (May)
  • A configuration error in Castor resulted in data being directed across all available tape pools instead of to the dedicated raw-data pools
  • For ALICE, ATLAS and CMS this included a pool where the tapes were recycled after a certain time
  • As a result, a number of files were lost on tapes that were recycled
  • For ATLAS and CMS the tapes had not been overwritten and could be fully recovered (the fall-back would have been to re-copy files back from Tier 1s)
  • For ALICE, 10k files were on tapes that were recycled, including 1700 files of 900 GeV data
  • Actions taken:
    • Underlying problem addressed; all recycle pools removed
    • Software change procedures being reviewed now
    • Action to improve user-facing monitoring in Castor
    • Tapes sent to IBM and Sun for recovery; ~97% of the critical (900 GeV sample) files and ~50% of all ALICE files have been recovered
    • Work with ALICE to ensure that two copies of the data are always available
  • In heavy-ion running there is a risk for several weeks until all data is copied to Tier 1s; several options to mitigate this risk are under discussion
  • As this was essentially a procedural problem, we will organise a review of Castor operations procedures (software development, deployment, operation, etc.) together with the experiments and outside experts, on a timescale of September

  15. Milestones
  • Pilot jobs, gLExec, SCAS, etc. (1st July)
    • No pressure at the moment, as data taking takes priority
  • CREAM deployment:
    • >100 sites have a CREAM CE
    • Still missing job submission from Condor-G (needed by ATLAS); a sketch of direct submission follows below
    • Stability and reliability of CREAM are still not good enough to replace the LCG-CE; many open tickets
  • Gathering of installed capacity (see the following slides)
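For reference, direct CREAM CE submission goes through the gLite command-line client rather than Condor-G. A minimal sketch (the CE endpoint and queue name are hypothetical; it assumes the standard glite-ce-job-submit client, with -a requesting automatic proxy delegation and -r naming the endpoint and queue):

```python
# Direct job submission to a CREAM CE (sketch; endpoint/queue hypothetical).
import subprocess

JDL = """\
[
  Executable    = "/bin/hostname";
  StdOutput     = "std.out";
  StdError      = "std.err";
  OutputSandbox = {"std.out", "std.err"};
]
"""

with open("test.jdl", "w") as f:
    f.write(JDL)

# -a: automatic proxy delegation; -r: CE endpoint and batch-system queue.
subprocess.check_call([
    "glite-ce-job-submit", "-a",
    "-r", "cream.example.org:8443/cream-pbs-grid",  # hypothetical endpoint
    "test.jdl",
])
```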

  16. Milestones – 2
  • Storage accounting reports:
    • For Tier 1 sites, a proto-report for the beginning of September
    • For Tier 2 sites, for the end of October
    • Before these are regularly published, significant effort has to be devoted to checking the published data (see the slide on installed capacity) – a lot is missing!
  • From the data-management and virtualisation activities:
    • Create a small architectural team
    • Address several strategic questions (November):
      • Xrootd as a strategy for data access by all experiments?
      • Need for the ability to reserve a whole worker node (to make better use of cores) – is this a clear requirement for all experiments?
      • CernVM and/or CVMFS as a strategy for all experiments?
    • Proposal for better managing the Information System and its data (end of September)

  17. Installed capacity
  • This action has been pending for a long time:
    • The information was requested by the RRB/RSG
    • Significant effort was put into defining and agreeing how to publish and collect the data, via the information system
    • The document has been available for some time
    • The tool (gstat2) now has the ability to present this information
  • Significant effort is still required to validate this data before it is publishable
  • Propose to task Tier 1s with validation for the Tier 2s, but the Tier 1 data is not good yet!
  • Sites need to publish good data urgently (a sketch of how the published numbers can be queried follows below)
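The installed-capacity figures are published through the information system (BDII) in the GLUE schema, which is what gstat2 reads. A minimal sketch of pulling a site's published numbers for validation (the BDII host is hypothetical; attribute names follow the GLUE 1.3 schema used for capacity reporting):

```python
# Query a site BDII for published capacity figures (sketch; host hypothetical).
# BDIIs serve LDAP on port 2170 with base "o=grid".
import subprocess

BDII = "ldap://bdii.example.org:2170"  # hypothetical site BDII

def ldap_query(filt: str, attr: str) -> str:
    """Run an anonymous ldapsearch against the BDII and return raw LDIF."""
    return subprocess.check_output(
        ["ldapsearch", "-x", "-LLL", "-H", BDII, "-b", "o=grid", filt, attr],
        text=True)

# Total online storage (GB) per storage element, as published by the site.
print(ldap_query("(objectClass=GlueSE)", "GlueSETotalOnlineSize"))
# Logical CPU counts per sub-cluster.
print(ldap_query("(objectClass=GlueSubCluster)", "GlueSubClusterLogicalCPUs"))
```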

  18. Published capacities ...


  20. Budget issues at CERN
  • IT was asked to reduce 15 MCHF over the MTP (2011-15)
  • A proposal of a 12 MCHF reduction was accepted; this is what was approved by the FC last week
  • Part of it (8.25 MCHF over 5 years) comes from the LCG budgets:
    • Move to 4-year equipment replacement cycles
      • Saves 2 MCHF in 2011, very little afterwards
    • Stop the CERN contribution to USLHCNet
      • Saves 350 kCHF/year; NB: no CERN contribution to the costs of other Tier 1s
    • Reduce the slope of the Tier 0 computing resource increase
      • Saves ~1 MCHF/year on average
      • This is the main mechanism we have to reduce costs; the current assumption was ~30%/year growth (see the illustration below)
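To illustrate what "reducing the slope" means over the MTP period (a sketch: the 30%/year baseline is from the slide, while the reduced rate is an assumed figure chosen purely for illustration):

```python
# Compound capacity growth over the 2011-15 MTP under two growth slopes.
# Baseline 30%/year is quoted above; 20%/year is an illustrative assumption.
years = 5

for label, rate in [("baseline 30%/yr", 0.30), ("assumed 20%/yr", 0.20)]:
    growth = (1 + rate) ** years
    print(f"{label}: x{growth:.2f} capacity after {years} years")
# baseline: ~x3.71; assumed reduced slope: ~x2.49
```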

  21. Budget implications
  • Note that the proposals have not been discussed with the experiments, nor with the committees overseeing LCG (LHCC, RRB/C-RSG, OB, etc.)
  • Reducing the Tier 0 computing resources for the LHC experiments does not seem wise now that the detectors are ramping up
    • Not even taking into account the lack of experience with heavy ions
  • Slowing down the replacement cycles and reducing the slope of the computing resource increase:
    • Delays the need for additional computing infrastructure for the Tier 0 (e.g. containers)
    • Will strictly limit the overall experiment computing growth rate in the Tier 0
    • Assumes no additional requests from non-LHC experiments
    • Detailed planning requires further studies
  • Extending the hardware lifetime from 3 to 4 years:
    • Gains nothing if the maintenance is extended
    • Has a negative impact on the power budget
    • Requires additional spares, and therefore additional budget
    • May impact service quality
    • Implies additional effort to cover for repairs
  • USLHCNet: may have secondary effects ...

  22. Tier 0 planning – status
  • The plan for a new Tier 0 building at CERN has been cancelled (noted at the SPC last week)
  • The need for containers (or another alternative solution) is delayed by ~1 year or more:
    • The budget reduction will have implications for total power needs
    • Efforts invested in recent years have benefitted total power needs
  • Ongoing plans for upgrading B513 from 2.9 to 3.5 MW, including up to 600 kW of diesel-backed power, together with use of local hosting (100 kW today, potentially more)
  • Remote Tier 0-bis: see F. Hemmer's talk

  23. Schedule

  24. October RRB
  • Two points to be addressed:
    • The scrutiny report on the use of resources in light of experience with data
      • Plan to have a report with the same details from each experiment
      • Discuss with the chair of the C-RSG at the MB on 7 September
    • What should be said about requirements for 2013
      • According to the MoU process, something should have been said already
      • Very difficult to be precise without details of the running conditions, and without a better analysis of how experience with data correlates with the planning (TDRs etc.)

  25. Tier 3 resources
  • Formally, Tier 3s are outside the scope of the WLCG MoU; but ...
    • They may be co-located at a Tier 1 or Tier 2
  • Many Tier 3 sites are coming:
    • Some are "Tier 2" but simply not party to the MoU (yet ...); these are not a problem and are treated in the same way as a formal Tier 2
    • Some do analysis only, but need to access data, so they need to be grid-aware (customers of grid services) and known to the infrastructure(s)
  • Strong statements of support for them from the experiment managements; they need these Tier 3s to be able to do analysis
  • Who supports them operationally?
    • In countries with a ROC (NGI operations, or equivalent), not a problem: the ROC should support them
    • What about non-ROC countries? Can we rely on the infrastructures? What is the process?
    • Come back to this in the point on EGI-WLCG interactions

  26. Tier 3 issues – proposal
  • We want Tier 3s to be known and supported as clients ...
  • They should not interfere with Tier 1/2 daily operations:
    • Either by overloading the Tier 1/2
    • Or through the support load
  • No central support if the Tier 3 is really only a national analysis facility (i.e. not for use by the wider experiment collaboration)
  • Issue: shared facilities (e.g. part of a cluster is Tier 2, part is for national use)
    • The part for national-only use should not be part of the pledge
    • How should this be accounted for? (In principle it should not be, but in a shared cluster care needs to be taken)

  27. Evolution and sustainability
  • Need to adapt to changing technologies:
    • Major re-think of storage and data access
    • Use of many-core CPUs (and other processor types?)
    • Virtualisation as a solution for job management
      • Brings us in line with industrial technology
      • Integration with public and commercial clouds
  • Network infrastructure:
    • This is the most reliable service we have
    • Invest in networks and make full use of the distributed system
  • Grid middleware:
    • Complexity of today's middleware compared with the actual use cases
    • Evolve by using more "standard" technologies: e.g. message brokers and monitoring systems are first steps
  • But: retain the WLCG infrastructure
    • Global collaboration, service management, operational procedures, support processes, etc.
    • Security infrastructure – this is a significant achievement:
      • both the global A&A service and trust network (X.509), and
      • the operational security and policy frameworks

  28. Service evolution strategy
  • We have started discussion on several aspects:
    • Data management: Kors Bos's talk
      • Discussion in March, workshop in June, follow-ups in GDBs (next week for the first)
    • Multi-core and virtualisation: Predrag Buncic's talk
      • A 2nd workshop was held in June (the 1st was last year)
  • Long term:
    • Clarify the distinction between the WLCG distributed computing infrastructure and the software that we use to implement it
    • Understand how to make the grid middleware supportable and sustainable
    • Use of other components (e.g. industrial messaging, tools like Nagios, etc.; see the sketch below); what does cloud technology bring us, and how do we integrate it while maintaining our advantages (global trust, VOs, etc.)?
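On the "industrial messaging" point: monitoring data can be exchanged through a standard message broker instead of bespoke channels. A minimal sketch of publishing one message over STOMP, the text protocol spoken by brokers such as ActiveMQ (the broker host and destination are hypothetical; the frame layout follows STOMP 1.0, and the broker is assumed to accept anonymous connections):

```python
# Publish one message to a broker over STOMP 1.0 (sketch; host/queue hypothetical).
# STOMP frames are plain text: COMMAND line, header lines, blank line, body, NUL.
import socket

HOST, PORT = "broker.example.org", 61613     # hypothetical ActiveMQ broker
DESTINATION = "/topic/grid.monitoring.test"  # hypothetical destination

def frame(command: str, headers: dict, body: str = "") -> bytes:
    head = "".join(f"{k}:{v}\n" for k, v in headers.items())
    return f"{command}\n{head}\n{body}\x00".encode()

with socket.create_connection((HOST, PORT)) as sock:
    sock.sendall(frame("CONNECT", {}))       # anonymous connect (assumption)
    reply = sock.recv(4096)                  # expect a CONNECTED frame back
    assert reply.startswith(b"CONNECTED"), reply
    sock.sendall(frame("SEND", {"destination": DESTINATION},
                       "site=EXAMPLE-T2 service=CE status=OK"))
    sock.sendall(frame("DISCONNECT", {}))
```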
