GridPP: Meeting The Particle Physics Computing Challenge
Tony Doyle, AHM05 Meeting
Contents
“The particle physicists are now well on their way to constructing a genuinely global particle physics Grid to enable them to exploit the massive data streams expected from the Large Hadron Collider in CERN that will turn on in 2007.” Tony Hey, AHM 2005
Introduction
• Why? LHC motivation (“one in a billion events”, “20 million readout channels”, “1000s of physicists”, “10 million lines of code”)
• What? The world’s largest Grid (according to The Economist)
• How? “Get Fit Plan” and current status (197 sites, 13,797 CPUs, 5 PB storage)
• When? Accounting and planning overview (“50 PetaBytes of data”, “100,000 of today’s processors”, “2007-08”)
Reference: http://www.allhands.org.uk/2005/proceedings/papers/349.pdf
4 LHC Experiments
• ATLAS: general purpose detector (origin of mass, supersymmetry, micro black holes?); 2,000 scientists from 34 countries
• CMS: general purpose detector; 1,800 scientists from 150 institutes
• LHCb: to study the differences between matter and antimatter, producing over 100 million b and b-bar mesons each year
• ALICE: heavy ion collisions, to create quark-gluon plasmas; 50,000 particles in each collision
“One Grid to Rule Them All”?
Why (particularly) the LHC?
1. Rare phenomena, huge background: the rate of all interactions is some 9 orders of magnitude above that of the Higgs (“one in a billion events”)
2. Complexity: “20 million readout channels”
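The “9 orders of magnitude” and “one in a billion” figures can be written as a single selectivity ratio. This is a back-of-envelope restatement, not taken from the slides; the ~10^9 interactions per second is the commonly quoted LHC design rate and is an assumption here.

```latex
\[
\frac{N_{\text{Higgs}}}{N_{\text{interactions}}} \sim 10^{-9},
\qquad
R_{\text{int}} \sim 10^{9}\ \text{s}^{-1}
\;\Rightarrow\;
R_{\text{Higgs}} \sim 1\ \text{s}^{-1}.
\]
```

In other words, the trigger and offline selection must pick out of order one candidate per second from a billion interactions per second, which is what drives the data volumes discussed below.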
What are the Grid challenges?
Must:
• share data between thousands of scientists with multiple interests
• link major (Tier-0 [Tier-1]) and minor (Tier-1 [Tier-2]) computer centres
• ensure all data is accessible anywhere, anytime
• grow rapidly, yet remain reliable for more than a decade
• cope with the different management policies of different centres
• ensure data security
• be up and running routinely by 2007
What are the Grid challenges? (Data Management, Security and Sharing)
1. Software process
2. Software efficiency
3. Deployment planning
4. Link centres
5. Share data
6. Manage data
7. Install software
8. Analyse data
9. Accounting
10. Policies
Grid Overview
• Aim: by 2008 (first full year’s data taking): CPU ~100 MSi2k (100,000 CPUs), storage ~80 PB, involving >100 institutes worldwide
• Build on complex middleware being developed in advanced Grid technology projects, both in Europe (gLite) and in the USA (VDT)
• Prototype went live in September 2003 in 12 countries
• Extensively tested by the LHC experiments in September 2004
• Currently (September 2005): 197 sites, 13,797 CPUs, 5 PB storage
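To put the September 2005 status against the 2008 target, a quick scale-up comparison using only the figures quoted above (a sketch; the ~1 kSi2k-per-CPU conversion implied by “100 MSi2k ≈ 100,000 CPUs” is an assumption):

```python
# Back-of-envelope scale-up from the September 2005 status to the 2008 target,
# using only the numbers quoted on this slide.

current_cpus = 13_797        # CPUs on the Grid, September 2005
current_storage_pb = 5       # PB of storage, September 2005

target_cpus = 100_000        # ~100 MSi2k at ~1 kSi2k per CPU (assumed conversion)
target_storage_pb = 80       # ~80 PB needed by 2008

print(f"CPU scale-up needed:     x{target_cpus / current_cpus:.1f}")
print(f"Storage scale-up needed: x{target_storage_pb / current_storage_pb:.1f}")
# -> roughly a 7x growth in CPU and a 16x growth in storage in about three years
```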
Tier Structure
• Tier 0: CERN computer centre (online system and offline farm)
• Tier 1: national centres (RAL UK, USA, Germany, Italy, France)
• Tier 2: regional groups (ScotGrid, NorthGrid, SouthGrid, London)
• Tier 3: institutes (Glasgow, Edinburgh, Durham)
• Tier 4: workstations
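The same hierarchy expressed as a small data structure, purely as an illustrative sketch; the site names are the examples given on the slide, not a complete inventory:

```python
# Illustrative sketch of the LCG tier hierarchy described above.
# Site lists are the examples from the slide, not an exhaustive inventory.
TIER_MODEL = {
    0: {"role": "CERN computer centre", "sites": ["CERN (online system, offline farm)"]},
    1: {"role": "national centres",     "sites": ["RAL (UK)", "USA", "Germany", "Italy", "France"]},
    2: {"role": "regional groups",      "sites": ["ScotGrid", "NorthGrid", "SouthGrid", "London"]},
    3: {"role": "institutes",           "sites": ["Glasgow", "Edinburgh", "Durham"]},
    4: {"role": "workstations",         "sites": []},
}

for tier, info in TIER_MODEL.items():
    examples = f" (e.g. {', '.join(info['sites'])})" if info["sites"] else ""
    print(f"Tier {tier}: {info['role']}{examples}")
```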
Functionality for the LHC Experiments
The basic functionality of the Tier-1s is:
• ALICE: reconstruction, chaotic analysis
• ATLAS: reconstruction, scheduled analysis/skimming, calibration
• CMS: reconstruction
• LHCb: reconstruction, scheduled skimming, analysis
The basic functionality of the Tier-2s is:
• ALICE: simulation production, analysis
• ATLAS: simulation, analysis, calibration
• CMS: analysis, all simulation production
• LHCb: simulation production, no analysis
Technical Design Reports (June 2005)
Computing Technical Design Reports: http://doc.cern.ch/archive/electronic/cern/preprints/lhcc/public/
• ALICE: lhcc-2005-018.pdf
• ATLAS: lhcc-2005-022.pdf
• CMS: lhcc-2005-023.pdf
• LHCb: lhcc-2005-019.pdf
• LCG: lhcc-2005-024.pdf
LCG Baseline Services Group Report: http://cern.ch/LCG/peb/bs/BSReport-v1.0.pdf
These contain all you (probably) need to know about LHC computing. End of the prototype phase.
Timescales
• Service Challenges: UK deployment plans
• End point: April ’07
• Context: first real (cosmics) data in ’05
Baseline Functionality
Concentrate on robustness and scale. The experiments have assigned external middleware priorities.
Exec2 Summary
• GridPP2 has already met 21% of its original targets, with 86% of the metrics within specification
• “Get fit” deployment plan in place: LCG 2.6 deployed at 16 sites as a preliminary production service
• gLite 1 was released in April as planned, but components have not yet been deployed or their robustness tested by the experiments (1.3 available on the pre-production service)
• Service Challenge (SC) 2, addressing networking, was a success at CERN and the RAL Tier-1 in April 2005
• SC3, also addressing file transfers, has just been completed
• Long-term concern: planning for 2007-08 (LHC startup)
• Short-term concerns: some under-utilisation of resources and the deployment of Tier-2 resources
At the end of GridPP2 Year 1, the initial foundations of “The Production Grid” are built. The focus is on “efficiency”.
People and Roles
More than 100 people in the UK: http://www.gridpp.ac.uk/members/
Project Map
GridPP Deployment Status: 18/9/05 [2/7/05] (9/1/05)
Measurable improvements:
• Sites functional-tested
• 3,000 CPUs
• Storage via SRM interfaces
• UK+Ireland federation
New Grid Monitoring Maps
Demo AND Google Map: http://gridportal.hep.ph.ic.ac.uk/rtm/ and http://map.gridpp.ac.uk/
Preliminary Production Grid Status
Accounting
LCG Tier-1 Planning
• 2006: pledged
• 2007-10: bottom-up and top-down estimates, with ~50% uncertainty
LCG Tier-1 Planning (CPU & Storage)
Experiment requests are large: e.g. in 2008, CPU ~50 MSi2k and storage ~50 PB! They can be met globally except in 2008. The UK plans to contribute >7% [and currently contributes >10%].
First LCG Tier-1 Compute Law: CPU:Storage ~1 [kSi2k/TB]
Second LCG Tier-1 Storage Law: Disk:Tape ~1
(The number to remember is... 1)
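Applied to the 2008 request quoted above, the two Tier-1 “laws” give a rough storage estimate. This is an illustrative sketch, not GridPP planning code: the function name is invented, and interpreting “storage” as disk plus tape split evenly is an assumption on top of the Disk:Tape ~1 rule.

```python
# Rules of thumb from the slide:
#   First Tier-1 law:  CPU : storage ~ 1 kSi2k per TB
#   Second Tier-1 law: disk : tape   ~ 1
# Names and the disk+tape interpretation are illustrative assumptions.

def tier1_storage_estimate(cpu_ksi2k: float) -> dict:
    total_storage_tb = cpu_ksi2k / 1.0          # ~1 kSi2k of CPU per TB of storage
    disk_tb = tape_tb = total_storage_tb / 2.0  # disk:tape ~ 1, split evenly
    return {"storage_tb": total_storage_tb, "disk_tb": disk_tb, "tape_tb": tape_tb}

# 2008 request quoted above: ~50 MSi2k of CPU
print(tier1_storage_estimate(cpu_ksi2k=50_000))
# -> ~50,000 TB (~50 PB) of storage, split roughly evenly between disk and tape
```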
LCG Tier-1 Planning (Storage)
LCG Tier-2 Planning
Third LCG Tier-2 Compute Law: Tier-1:Tier-2 CPU ~1
Fifth LCG Tier-2 Storage Law: CPU:Disk ~5 [kSi2k/TB]
Zeroth LCG Law: there is no Zeroth Law; all is uncertain
• 2006: pledged
• 2007-10: bottom-up and top-down estimates, with ~100% uncertainty
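The Tier-2 rules of thumb combine with the Tier-1 figure in the same back-of-envelope spirit (again an illustrative sketch with invented names, not a planning tool):

```python
# Rules of thumb from the slide:
#   Third Tier-2 law: Tier-1 : Tier-2 CPU ~ 1  (Tier-2 CPU roughly matches Tier-1 CPU)
#   Fifth Tier-2 law: CPU : disk ~ 5 kSi2k per TB at Tier-2
# Function and variable names are illustrative only.

def tier2_estimate(tier1_cpu_ksi2k: float) -> dict:
    tier2_cpu_ksi2k = tier1_cpu_ksi2k       # Tier-1:Tier-2 CPU ~ 1
    tier2_disk_tb = tier2_cpu_ksi2k / 5.0   # ~5 kSi2k of CPU per TB of disk
    return {"cpu_ksi2k": tier2_cpu_ksi2k, "disk_tb": tier2_disk_tb}

print(tier2_estimate(tier1_cpu_ksi2k=50_000))
# -> ~50 MSi2k of Tier-2 CPU and ~10,000 TB (~10 PB) of Tier-2 disk
```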
The “Get Fit” Plan
• Set SMART (Specific, Measurable, Achievable, Realistic, Time-phased) goals
• Systematic approach and measurable improvements in the deployment area
• See “Grid Deployment and Operations for EGEE, LCG and GridPP” (Jeremy Coles), which provides the context for Grid “efficiency”
Service Challenges
• SC2 (April): RAL joined computing centres around the world in a networking challenge, transferring 60 TeraBytes of data over ten days.
• SC3 (September): RAL to CERN (T1-T0) at rates of up to 650 Mb/s; e.g. Edinburgh to RAL (T2-T1) at rates of up to 480 Mb/s.
• UKLight service tested from Lancaster to RAL.
• Overall, the File Transfer Service is very reliable, with the failure rate now below 1%.
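As a sanity check, the SC2 figure of 60 TB in ten days corresponds to an average rate in the same ballpark as the SC3 peak rates. A quick calculation, assuming decimal terabytes and continuous transfer:

```python
# Average rate implied by SC2: 60 TB moved in ten days.
# Assumes decimal units (1 TB = 1e12 bytes) and continuous transfer.

bytes_moved = 60e12                 # 60 TB
seconds = 10 * 24 * 3600            # ten days
avg_rate_mbps = bytes_moved * 8 / seconds / 1e6

print(f"SC2 average rate: ~{avg_rate_mbps:.0f} Mb/s")   # ~556 Mb/s
# Comparable to the sustained SC3 rates quoted above (up to ~650 Mb/s T1-T0).
```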
Middleware Development
• Network monitoring
• Configuration management
• Grid data management
• Storage interfaces
• Information services
• Security
gLite Status
• 1.2 installed on the Grid pre-production service
• 1.3: some components have been upgraded
• 1.4: upgrades to VOMS and the registration tools, plus additional bulk job submission components
• LCG 2.6 is the (August) production release
• The UK’s R-GMA is incorporated (production and pre-production)
• LCG 3 will be based upon gLite...
Application Development
e.g. Reprocessing DØ data with SAMGrid (Frederic Villeneuve-Seguier)
ATLAS, LHCb, CMS, SAMGrid (Fermilab), QCDGrid, BaBar (SLAC)
Workload Management Efficiency Overview
Integrated over all VOs and RBs: successes/day 12,722; success rate 67%, improving from 42% to 70-80% during 2005
Problems identified:
• half WMS (Grid)
• half JDL (User)
LHC VOs
• ALICE: successes/day N/A, success rate 53%
• ATLAS: successes/day 2,435, success rate 84%
• CMS: successes/day 448, success rate 59%
• LHCb: successes/day 3,463, success rate 68%
Note: some caveats apply, see http://egee-jra2.web.cern.ch/EGEE-JRA2/QoS/JobsMetrics/JobMetrics.htm
Selection by the experiments of “production sites” using Site Functional Tests (currently ~110 of the 197 sites), or the use of pre-test software agents, leads to >90% experiment production efficiency
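One way to read the per-VO figures together is to weight each VO’s success rate by its throughput. This is an illustrative calculation only: it assumes “successes/day” counts successful jobs (an interpretation, not stated on the slide) and it omits ALICE, whose throughput is quoted as N/A.

```python
# Illustrative aggregation of the per-VO figures above.
# Assumes "successes/day" = successful jobs per day, so jobs/day = successes / rate.
# ALICE is omitted (successes/day quoted as N/A).

vo_stats = {           # VO: (successes per day, success rate)
    "ATLAS": (2435, 0.84),
    "CMS":   (448,  0.59),
    "LHCb":  (3463, 0.68),
}

total_successes = sum(s for s, _ in vo_stats.values())
total_jobs = sum(s / rate for s, rate in vo_stats.values())

print(f"Combined success rate: {100 * total_successes / total_jobs:.0f}%")
# -> roughly 73%, consistent with the 70-80% trend quoted on the previous slide
```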
“UK contributes to EGEE's battle with malaria”
WISDOM (Wide In Silico Docking On Malaria): the first biomedical data challenge for drug discovery, which ran on the EGEE Grid production service from 11 July 2005 until 19 August 2005.
• BioMed: successes/day 1,107; success rate 77%
• GridPP resources in the UK contributed ~100,000 kSI2k-hours from 9 sites
[Figures shown: number of biomedical jobs processed by country; normalised CPU hours contributed to the biomedical VO for UK sites, July-August 2005]
1. Why? 2. What? 3. How? 4. When?
From the particle physics perspective, the Grid is:
1. mainly (but not just) for physicists; more generally, for those needing to utilise large-scale computing resources efficiently and securely
2. a) a working production-scale system running today
   b) about seamless discovery of computing resources
   c) using evolving standards for interoperation
   d) the basis for computing in the 21st century
   e) not (yet) as seamless, robust or efficient as end-users need
3. methods outlined here; please come to the PPARC stand and Jeremy Coles’ talk
4. a) now, at “preliminary production service” level, for simple(r) applications (e.g. experiment Monte Carlo production)
   b) 2007, for a fully tested 24x7 LHC service (a large distributed computing resource) for more complex applications (e.g. data analysis)
   c) planned to meet the LHC Computing Challenge