Grid Computing and High Energy Physics
• LHC and the Experiments
• The LHC Computing Grid
• The Data Challenges and some first results from the ATLAS experiment
AICA@Udine – L.Perini
The Experiments and the challenges of their computing
• GRID is a “novel” technology
• Many High Energy Physics experiments are now using it at some level, but…
• The experiments at the new LHC collider at CERN (start of data taking foreseen for mid 2007) are the first ones with a computing model designed from the beginning for the GRID
• This talk will concentrate on the GRID as planned and used by the LHC experiments
• And, in the end, on one experiment (ATLAS) specifically
CERN (founded 1954) = “Conseil Européen pour la Recherche Nucléaire” – “European Organisation for Nuclear Research”
• Particle Physics: 27 km circumference tunnel housing the CERN Large Hadron Collider
• Annual budget: ~1000 MSFr (~700 M€)
• Staff members: 2650, plus 225 Fellows, 270 Associates and 6000 CERN users
• Member states: 20
Particle Physics: establish a periodic system of the fundamental building blocks and understand the forces between them
LHC: protons colliding at a centre-of-mass energy of 14 TeV – the most powerful microscope, creating conditions similar to the Big Bang
From raw data to physics results
• Raw data: the detector response, streams of numbers to which calibration and alignment are applied
• Reconstruction: convert to physics quantities – pattern recognition, particle identification
• Simulation (Monte-Carlo): basic physics process (e.g. e+e- → Z0 → ff), fragmentation and decay, interaction with detector material, detector response
• Analysis: physics analysis of the reconstructed (and simulated) data gives the physics results
Challenge 1: Large, distributed community (ATLAS, CMS, LHCb, …)
• ~5000 physicists around the world – around the clock
• “Offline” software effort: 1000 person-years per experiment
• Software life span: 20 years
LCG: the worldwide Grid project – LHC users and participating institutes
• Europe: 267 institutes, 4603 users
• Elsewhere: 208 institutes, 1632 users
ATLAS is not one experiment: its physics programme spans Higgs, SUSY, Extra Dimensions, QCD, Electroweak, B physics and Heavy Ion Physics
• First physics analysis expected to start in 2008
Challenge 2: Data Volume
• Annual data storage: 12-14 PetaBytes/year
• For scale: 50 CD-ROMs = 35 GB ≈ a 6 cm stack, so a CD stack holding 1 year of LHC data would be ~20 km high – above Concorde’s cruising altitude (15 km), below a stratospheric balloon (30 km), far above Mt. Blanc (4.8 km)
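A quick back-of-the-envelope check of the CD-stack comparison, as a minimal sketch; it uses only the numbers quoted above (12-14 PB/year, 50 CD-ROMs = 35 GB ≈ 6 cm, i.e. ~1.2 mm per disc):

```python
# Rough check of the "CD stack" comparison above.
ANNUAL_DATA_PB = 13          # midpoint of the 12-14 PB/year quoted above
CD_CAPACITY_GB = 35 / 50     # 50 CD-ROMs = 35 GB, i.e. 0.7 GB per CD
CD_THICKNESS_MM = 1.2        # 6 cm per 50 discs, i.e. 1.2 mm per disc

n_cds = ANNUAL_DATA_PB * 1e6 / CD_CAPACITY_GB   # PB -> GB -> number of CDs
stack_km = n_cds * CD_THICKNESS_MM / 1e6        # mm -> km

print(f"{n_cds:.1e} CDs per year, a stack ~{stack_km:.0f} km high")
# ~1.9e+07 CDs, a stack ~22 km high -- consistent with the ~20 km on the slide
```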
Challenge 3: Find the Needle in a Haystack
• Rare phenomena, huge background: the Higgs is buried ~9 orders of magnitude below the rate of all interactions
• Complex events
Therefore: provide mountains of CPU
• Calibration, Reconstruction, Simulation, Analysis
• For LHC computing, some 100 Million SPECint2000 are needed! (produced by Intel today in ~6 hours)
• 1 SPECint2000 = 0.1 SPECint95 = 1 CERN-unit = 4 MIPS; a 3 GHz Pentium 4 has ~1000 SPECint2000
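As a minimal sanity check of the scale, using only the conversion factor quoted on the slide (a 3 GHz Pentium 4 ≈ 1000 SPECint2000):

```python
# How many 2005-era processors does "100 Million SPECint2000" correspond to?
REQUIRED_SI2K = 100e6     # 100 Million SPECint2000, as quoted above
PENTIUM4_SI2K = 1000      # per 3 GHz Pentium 4, as quoted above

print(f"~{REQUIRED_SI2K / PENTIUM4_SI2K:,.0f} Pentium-4-class processors")  # ~100,000
```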
The CERN Computing Centre
• ~2,400 processors, ~200 TBytes of disk, ~12 PB of magnetic tape
• Even with technology-driven improvements in performance and costs, CERN can provide nowhere near enough capacity for LHC!
LCG: the LHC Computing Grid
• A single BIG computing centre is not the best solution for the challenges we have seen:
  • single points of failure
  • difficult to handle costs
  • countries dislike paying cheques without having local returns and sharing of responsibilities…
• Use the Grid idea and plan a really distributed computing system: LCG
• In June 2005 the Grid-based Computing Technical Design Reports of the 4 experiments and of LCG were published
The LCG Project
• Approved by the CERN Council in September 2001
• Phase 1 (2001-2004): development and prototyping of a distributed production prototype at CERN and elsewhere, operated as a platform for the data challenges; leading to a Technical Design Report, which serves as a basis for agreeing the relations between the distributed Grid nodes and their co-ordinated deployment and exploitation
• Phase 2 (2005-2007): installation and operation of the full world-wide initial production Grid system, requiring continued manpower efforts and substantial material resources
• A Memorandum of Understanding has been developed, defining the Worldwide LHC Computing Grid Collaboration of CERN (as host lab) and the major computing centres, and the organizational structure for Phase 2 of the project
What is the Grid?
• Resource Sharing – on a global scale, across the labs/universities
• Secure Access – needs a high level of trust
• Resource Use – load balancing, making most efficient use
• The “Death of Distance” – requires excellent networking (transfers of 5.44-6.25 Gbps and 1.1 TB in 30 min. were demonstrated in April 2004)
• Open Standards – allow constructive distributed development
• There is not (yet) a single Grid
How will it work? The GRID middleware:
• Finds convenient places for the scientist’s “job” (computing task) to be run
• Optimises use of the widely dispersed resources
• Organises efficient access to scientific data
• Deals with authentication to the different sites that the scientists will be using
• Interfaces to local site authorisation and resource allocation policies
• Runs the jobs
• Monitors progress
• Recovers from problems
• … and tells you when the work is complete and transfers the result back!
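In other words, the middleware implements a match–submit–monitor–retrieve loop. The sketch below is only a hypothetical illustration of that loop, not the LCG/EGEE middleware API; all names (Site, Job, match_site, run) are invented for this example:

```python
from dataclasses import dataclass, field

@dataclass
class Site:                       # hypothetical stand-in for a Computing Element
    name: str
    free_cpus: int
    datasets: set = field(default_factory=set)

@dataclass
class Job:                        # hypothetical stand-in for a user's computing task
    input_dataset: str
    status: str = "SUBMITTED"

def match_site(job, sites):
    """Matchmaking: prefer sites that hold the input data and have free CPUs."""
    candidates = [s for s in sites if s.free_cpus > 0 and job.input_dataset in s.datasets]
    return max(candidates, key=lambda s: s.free_cpus) if candidates else None

def run(job, sites):
    site = match_site(job, sites)           # find a convenient place for the job
    if site is None:
        job.status = "WAITING"              # no suitable resource yet; retry later
        return
    site.free_cpus -= 1
    job.status = f"RUNNING at {site.name}"  # monitoring would track this state
    # ... the job executes, output is transferred back, status becomes "DONE" ...

sites = [Site("CERN", 10, {"dsA"}), Site("CNAF", 5, {"dsA", "dsB"})]
job = Job(input_dataset="dsB")
run(job, sites)
print(job.status)                           # RUNNING at CNAF
```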
The LHC Computing Grid Project – LCG
• Collaboration: the LHC experiments, Grid projects in Europe and the US, regional & national centres
• Choices: adopt Grid technology; go for a “Tier” hierarchy; use Intel CPUs in standard PCs; use the LINUX operating system
• Goal: prepare and deploy the computing environment to help the experiments analyse the data from the LHC detectors
• Tier hierarchy: CERN Tier-0/Tier-1, national Tier-1 centres (e.g. UK, USA, France, Italy, Germany, Japan, Taipei), Tier-2 regional centres and grids for regional groups, Tier-3 physics department clusters and grids for physics study groups, desktops
Cooperation with other projects
• Grid Software: Globus, Condor and VDT have provided key components of the middleware used; key members participate in OSG and EGEE; Enabling Grids for E-sciencE (EGEE) includes a substantial middleware activity
• Grid Operational Groupings: the majority of the resources used are made available as part of the EGEE Grid (~140 sites, 12,000 processors); EGEE also supports Core Infrastructure Centres and Regional Operations Centres
• The US LHC programmes contribute to and depend on the Open Science Grid (OSG); formal relationship with LCG through the US-ATLAS and US-CMS computing projects
• Network Services: LCG will be one of the most demanding applications of national research networks such as the pan-European backbone network, GÉANT
• The Nordic Data Grid Facility (NDGF) will begin operation in 2006; prototype work is based on the NorduGrid middleware ARC
Grid Projects
• ATLAS must span 3 major Grid deployments (LCG/EGEE, Grid3/OSG and NorduGrid)
• Until the deployments provide interoperability, the experiments must provide it themselves
EGEE
• Proposal submitted to the EU IST 6th framework programme; project started April 1st, 2004
• Total approved budget of approximately 32 M€ over 2 years
• Activities: deployment and operation of the Grid infrastructure (SA1); re-engineering of Grid middleware (WSRF environment) (JRA1); dissemination, training and applications (NA4)
• Italy takes part in all 3 areas of activity with a global funding of 4.7 M€
• An EGEE-2 project with similar funding has been submitted for a further 2 years of work
• 11 regional federations covering 70 partners in 26 countries
EGEE Activities
• Services (48%): SA1 Grid Operations, Support and Management; SA2 Network Resource Provision
• Networking (28%): NA1 Management; NA2 Dissemination and Outreach; NA3 User Training and Education; NA4 Application Identification and Support; NA5 Policy and International Cooperation
• Joint Research (24%): JRA1 Middleware Engineering and Integration; JRA2 Quality Assurance; JRA3 Security; JRA4 Network Services Development
• Emphasis in EGEE is on operating a production grid and supporting the end-users
• Started 1st April 2004 for 2 years (1st phase) with EU funding of ~32 M€
LCG/EGEE coordination
• LCG Project Leader in the EGEE Project Management Board (most of the other members are representatives of HEP funding agencies and CERN)
• EGEE Project Director in the LCG Project Overview Board
• Middleware and Operations are common to both LCG and EGEE
• Cross-representation on the Project Executive Boards: EGEE Technical Director in the LCG PEB
• EGEE HEP applications hosted in the CERN/EP division
The Hierarchical “Tier” Model
• Tier-0 at CERN: record RAW data (1.25 GB/s for ALICE); distribute a second copy to the Tier-1s; calibrate and do first-pass reconstruction
• Tier-1 centres (11 defined): manage permanent storage – RAW, simulated, processed; capacity for reprocessing and bulk analysis
• Tier-2 centres (>~100 identified): Monte Carlo event simulation; end-user analysis
• Tier-3: facilities at universities and laboratories, with access to data and processing in Tier-2s and Tier-1s; outside the scope of the project
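The same hierarchy restated as data, purely as an illustrative sketch (the role lists are taken from the bullets above; the structure itself is not part of any LCG software):

```python
# The Tier hierarchy restated as a simple mapping (roles from the list above).
TIER_ROLES = {
    "Tier-0 (CERN)":         ["record RAW data", "distribute second copy to Tier-1s",
                              "calibration and first-pass reconstruction"],
    "Tier-1 (11 centres)":   ["permanent storage of RAW/simulated/processed data",
                              "reprocessing", "bulk analysis"],
    "Tier-2 (~100 centres)": ["Monte Carlo event simulation", "end-user analysis"],
    "Tier-3 (institutes)":   ["access to data and processing in Tier-1s/Tier-2s"],
}

for tier, roles in TIER_ROLES.items():
    print(f"{tier}: " + "; ".join(roles))
```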
Tier-1s: 11 centres defined
Tier-2s: ~100 identified – number still growing
The Eventflow
• 50 days of running in 2007; 10^7 seconds/year of pp running from 2008 on
• ~10^9 events/experiment per year
• 10^6 seconds/year of heavy ion running
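These round numbers imply an average recorded rate of order 100 events per second per experiment; a minimal arithmetic check, using only the figures above:

```python
# Implied average recorded event rate per experiment during pp running.
EVENTS_PER_YEAR = 1e9      # ~10^9 events/experiment
SECONDS_PER_YEAR = 1e7     # ~10^7 seconds/year of pp running

print(f"~{EVENTS_PER_YEAR / SECONDS_PER_YEAR:.0f} events recorded per second")  # ~100 Hz
```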
CPU, Disk and Tape Requirements (shared between CERN, Tier-1s and Tier-2s)
• CPU: 58% of the required capacity pledged so far
• Disk: 54% pledged
• Tape (CERN and Tier-1s): 75% pledged
Experiments’ Requirements
• A single Virtual Organization (VO) across the Grid
• Standard interfaces for Grid access to Storage Elements (SEs) and Computing Elements (CEs)
• Need for a reliable Workload Management System (WMS) to efficiently exploit distributed resources
• Non-event data such as calibration and alignment data, but also detector construction descriptions, will be held in databases: read/write access to central (Oracle) databases at Tier-0, read access at Tier-1s, with a local database cache at Tier-2s
• Analysis scenarios and specific requirements are still evolving; prototype work is in progress (ARDA)
• Online requirements are outside the scope of LCG, but there are connections: raw data transfer and buffering, database management and data export, some potential use of Event Filter Farms for offline processing
Architecture – Grid services
• Storage Element: Mass Storage System (MSS) (CASTOR, Enstore, HPSS, dCache, etc.); the Storage Resource Manager (SRM) provides a common way to access the MSS, independent of implementation; File Transfer Services (FTS) provided e.g. by GridFTP or srmCopy
• Computing Element: interface to the local batch system, e.g. the Globus gatekeeper; accounting, status query, job monitoring
• Virtual Organization Management: Virtual Organization Management Services (VOMS); authentication and authorization based on the VOMS model
• Grid Catalogue Services: mapping of Globally Unique Identifiers (GUIDs) to local file names; hierarchical namespace, access control
• Interoperability: EGEE and OSG both use the Virtual Data Toolkit (VDT); different implementations are hidden behind common interfaces
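To make the catalogue role concrete, here is a hypothetical sketch of a GUID-to-replica lookup; it is not the interface of any LCG catalogue, and all names, paths and URLs in it are invented:

```python
# Hypothetical sketch: a Grid file catalogue maps a globally unique
# identifier (GUID) to a logical file name and a set of physical replicas.
CATALOGUE = {
    "guid-1234": {
        "lfn": "/grid/atlas/rome/dataset073/events_0001.root",     # invented
        "replicas": [
            "srm://se.tier1-a.example/atlas/events_0001.root",     # invented
            "srm://se.tier2-b.example/atlas/events_0001.root",     # invented
        ],
    },
}

def resolve(guid, preferred_site=None):
    """Return one physical replica for a GUID, preferring a given site if possible."""
    replicas = CATALOGUE[guid]["replicas"]
    if preferred_site:
        for url in replicas:
            if preferred_site in url:
                return url
    return replicas[0]

print(resolve("guid-1234", preferred_site="tier2-b"))
```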
Prototypes
• It is important that the hardware and software systems developed in the framework of LCG be exercised in more and more demanding challenges
• Data Challenges have now been done by all experiments; though the main goal was to validate the distributed computing model and to gradually build the computing systems, the results have been used for physics performance studies and for detector, trigger and DAQ design
• Limitations of the Grids have been identified and are being addressed
• Presently, a series of Service Challenges aims at realistic end-to-end testing of experiment use-cases over an extended period, leading to stable production services
Data Challenges
• ALICE: PDC04 using AliEn services, native or interfaced to the LCG Grid – 400,000 jobs run, producing 40 TB of data for the Physics Performance Report; PDC05: event simulation, first-pass reconstruction, transmission to Tier-1 sites, second-pass reconstruction (calibration and storage), analysis with PROOF – using Grid services from LCG SC3 and AliEn
• ATLAS: used tools and resources from LCG, NorduGrid and Grid3 at 133 sites in 30 countries, with over 10,000 processors, where 235,000 jobs produced more than 30 TB of data using an automatic production system in 2004; in 2005, production for the Physics Workshop in Rome – next slides
• CMS: 100 TB of simulated data reconstructed at a rate of 25 Hz, distributed to the Tier-1 sites and reprocessed there
• LHCb: LCG provided more than 50% of the capacity for the first data challenge 2004-2005; the production used the DIRAC system
ATLAS Production System
• A central production database (ProdDB), the AMI metadata catalogue and the Don Quijote data management system
• A common supervisor (Windmill) talks via SOAP and Jabber to Grid-specific executors: Lexor (LCG), Dulcinea (NorduGrid), Capone (Grid3), plus an executor for LSF batch resources
• Each Grid has its own replica catalogue (RLS)
• A big problem is data management: the system must cope with >= 3 Grid catalogues, and the demands are even greater for analysis
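The supervisor/executor split can be illustrated with a small, purely hypothetical sketch; the class names echo the components named above, but the code is invented for illustration and is not the ATLAS production system:

```python
# Hypothetical sketch of the supervisor/executor pattern: one supervisor
# takes job definitions and hands them to Grid-specific executors that all
# expose the same submit() interface.
class Executor:
    name = "base"
    def submit(self, job):
        print(f"[{self.name}] submitting job {job['id']}")
        return "RUNNING"

class LCGExecutor(Executor):       name = "LCG"
class NorduGridExecutor(Executor): name = "NorduGrid"
class Grid3Executor(Executor):     name = "Grid3"

class Supervisor:
    """Dispatches jobs from a production queue to the available executors."""
    def __init__(self, executors):
        self.executors = executors
    def run(self, jobs):
        for i, job in enumerate(jobs):
            backend = self.executors[i % len(self.executors)]   # trivial round-robin
            job["status"] = backend.submit(job)

jobs = [{"id": n} for n in range(4)]
Supervisor([LCGExecutor(), NorduGridExecutor(), Grid3Executor()]).run(jobs)
```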
ATLAS massive productions on 3 Grids
• July-September 2004: DC2 Geant4 simulation (long jobs) – 40% on the LCG/EGEE Grid, 30% on Grid3 and 30% on NorduGrid
• October-December 2004: DC2 digitization and reconstruction (short jobs)
• February-May 2005: Rome production (a mix of jobs, as digitization and reconstruction were started as soon as samples had been simulated) – 65% on the LCG/EGEE Grid, 24% on Grid3, 11% on NorduGrid
• CPU consumption for the CPU-intensive simulation phase (until May 20th): Grid3 80 kSI2K·years, NorduGrid 22 kSI2K·years, LCG 178 kSI2K·years – total 280 kSI2K·years
• Note: this CPU was almost fully consumed in 40 days, and the results were used for the real physics analyses presented at the Rome Workshop, with the participation of >400 ATLAS physicists
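To put the 280 kSI2K·years in perspective, a minimal arithmetic check of the average capacity this corresponds to over the 40 days quoted above:

```python
# Average computing power sustained during the Rome production.
TOTAL_KSI2K_YEARS = 280    # total CPU consumed, as quoted above
DAYS_USED = 40             # the period in which it was almost fully consumed

avg_ksi2k = TOTAL_KSI2K_YEARS * 365 / DAYS_USED
print(f"~{avg_ksi2k:.0f} kSI2K sustained on average")   # ~2555 kSI2K, i.e. ~2.6 MSI2K
```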
Rome production statistics
• 73 data sets containing 6.1M events simulated and reconstructed (without pile-up)
• Total simulated data: 8.5M events
• Pile-up done later (1.3M events done, 50K of them reconstructed)
This is the first successful use of the Grid by a large user community; it has, however, also revealed several shortcomings which now need to be fixed, as LHC turn-on is only two years ahead!
• Very instructive comments from the user feedback were presented at the Workshop (this was one of its main themes and purposes)
• All this is available on the Web
ATLAS Rome production: countries (sites)
• Austria (1), Canada (3), CERN (1), Czech Republic (2), Denmark (3), France (4), Germany (1+2), Greece (1), Hungary (1), Italy (17), Netherlands (2), Norway (2), Poland (1), Portugal (1), Russia (2), Slovakia (1), Slovenia (1), Spain (3), Sweden (5), Switzerland (1+1), Taiwan (1), UK (8), USA (19)
• 22 countries, 84 sites in total (subsets: 17 countries, 51 sites; 7 countries, 14 sites)
Status and plans for ATLAS production on LCG
• The global efficiency of the ATLAS production for Rome was good in the WMS area (>95%), while improvements are still needed in the Data Management area (~75%); WMS speed, however, needs improvement too
• ATLAS is ready to test new EGEE middleware components as soon as they are released from the internal certification process: the File Transfer Service and the LCG File Catalogue, together with the new ATLAS Data Management layer, and the new (gLite) version of the WMS with support for bulk submission, task queue and the full model
• Accounting, monitoring and priority systems (VOMS role- and group-based) are expected to be in production use for the new big production rounds in mid-2006
Conclusions
• The HEP experiments at the LHC collider are committed to GRID-based computing
• The LHC Computing Grid Project is providing the common effort needed to support them
• EU- and US-funded Grid projects develop, maintain and deploy the middleware
• In the last year the Data Challenges have demonstrated the feasibility of huge real productions
• Much work still needs to be done in the next 2 years to meet the challenge of the real data to be analysed