The Promise of Computational Grids in the LHC Era
Paul Avery, University of Florida, Gainesville, Florida, USA
avery@phys.ufl.edu
http://www.phys.ufl.edu/~avery/
CHEP 2000, Padova, Italy, Feb. 7-11, 2000
LHC Computing Challenges
• Complexity of the LHC environment and the resulting data
• Scale: petabytes of data per year
• Geographical distribution of people and resources
  Example: CMS has 1800 physicists from 150 institutes in 32 countries
Dimensioning / Deploying IT Resources
• LHC computing scale is "something new"
  • Solution requires directed effort and new initiatives
  • Solution must build on existing foundations
• Robust computing at national centers is essential
• Universities must have resources to maintain intellectual strength, foster training, and engage fresh minds
• Scarce resources are, and will remain, a fact of life → plan for it
• Goal: get new resources, and optimize deployment of all resources to maximize effectiveness
  • CPU: CERN / national lab / region / institution / desktop
  • Data: CERN / national lab / region / institution / desktop
  • Networks: international / national / regional / local
Deployment Considerations
• Proximity of datasets to appropriate IT resources
  • Massive datasets → CERN & national labs
  • Data caches → regional centers
  • Mini-summaries → institutions
  • Micro-summaries → desktops
• Efficient use of network bandwidth (see the sketch below)
  • Local > regional > national > international
• Utilizing all intellectual resources
  • CERN, national labs, universities, remote sites
  • Scientists, students
  • Leverage training and education at universities
• Follow the lead of the commercial world
  • Distributed data and web servers
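A minimal sketch, not from the talk, of the locality preference described above: read a dataset from the closest tier that holds a copy, following the "local > regional > national > international" ordering. The tier names, the replica catalog, and the dataset names are illustrative assumptions.

```python
# Illustrative sketch: prefer the closest copy of a dataset, following the
# "local > regional > national > international" ordering from the slide.
# Catalog contents and dataset names are hypothetical.

TIER_PREFERENCE = ["desktop", "institution", "regional", "national", "CERN"]

# Hypothetical replica catalog: dataset name -> tiers holding a copy
replica_catalog = {
    "higgs_micro_summary": {"desktop", "institution"},
    "jet_data_cache": {"regional", "national"},
    "raw_events": {"national", "CERN"},
}

def closest_replica(dataset, catalog=replica_catalog):
    """Return the nearest tier holding a copy of `dataset`, or None."""
    for tier in TIER_PREFERENCE:
        if tier in catalog.get(dataset, set()):
            return tier
    return None

if __name__ == "__main__":
    for name in replica_catalog:
        print(f"{name}: fetch from {closest_replica(name)}")
```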
Solution: A Data Grid
• A hierarchical grid is the best deployment option
  • Hierarchy → optimal resource layout (MONARC studies)
  • Grid → unified system
• Arrangement of resources
  • Tier 0: central laboratory computing resources (CERN)
  • Tier 1: national center (Fermilab / BNL)
  • Tier 2: regional computing center (university)
  • Tier 3: university group computing resources
  • Tier 4: individual workstation/CPU
• We call this arrangement a "Data Grid" to reflect the overwhelming role that data plays in deployment
Layout of Resources
• Want a good "impedance match" between Tiers
  • Tier N-1 serves Tier N
  • Tier N big enough to exert influence on Tier N-1
  • Tier N-1 small enough not to duplicate Tier N
• Resources roughly balanced across Tiers (see the sketch below)
• Reasonable balance?
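A minimal sketch of the tiered layout and the "roughly balanced" idea, using the per-site capacities quoted later in the talk (CERN ~350k SI95, a Tier 1 lab ~70k SI95, a Tier 2 center ~20k SI95). The number of sites at each tier is an illustrative assumption, not a figure from the slides.

```python
# Sketch: represent the tier hierarchy and compare aggregate capacity per level.
# Per-site SI95 figures come from the "US Model Circa 2005" diagram; the site
# counts below are assumptions made only for illustration.

from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    si95_per_site: float          # CPU capacity per site, in SpecInt95
    sites: int                    # assumed number of sites at this tier
    children: list = field(default_factory=list)

    def total_si95(self):
        return self.si95_per_site * self.sites

tier2 = Tier("Tier 2 (regional centers)", 20_000, sites=10)
tier1 = Tier("Tier 1 (FNAL/BNL)", 70_000, sites=2, children=[tier2])
tier0 = Tier("Tier 0 (CERN)", 350_000, sites=1, children=[tier1])

def report_balance(parent):
    """Print each child level's aggregate capacity relative to its parent."""
    for child in parent.children:
        ratio = child.total_si95() / parent.total_si95()
        print(f"{child.name}: {child.total_si95():,} SI95 "
              f"({ratio:.2f} x {parent.name})")
        report_balance(child)

if __name__ == "__main__":
    report_balance(tier0)
```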
Data Grid Hierarchy (Schematic)
[Diagram: Tier 0 (CERN) at the top, connected to Tier 1 centers, each serving several Tier 2 (T2) centers, which in turn serve Tier 3 and Tier 4 resources]
US Model Circa 2005
[Diagram: CERN (CMS/ATLAS) with 350k SI95, 350 TB disk, and tape robot; a Tier 1 center (FNAL/BNL) with 70k SI95, 70 TB disk, and robot; Tier 2 centers with 20k SI95, 25 TB disk, and robot; Tier 3 university working groups (WG 1 ... WG N). The tiers are linked at 622 Mbits/s and 2.4 Gbps, with optional air freight for bulk data transport; see the transfer-time sketch below]
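A back-of-the-envelope sketch, under the optimistic assumption of fully sustained throughput with no protocol overhead, of how long it would take to replicate the 25 TB Tier 2 disk cache shown above over the quoted WAN links. It illustrates why the diagram lists air freight as an option for bulk data.

```python
# Sketch: time to move a Tier 2 sized data cache over the WAN links in the
# diagram, assuming ideal sustained throughput (an optimistic simplification).

def transfer_days(bytes_to_move, link_bits_per_sec):
    """Days needed to move `bytes_to_move` at a sustained `link_bits_per_sec`."""
    seconds = bytes_to_move * 8 / link_bits_per_sec
    return seconds / 86_400

tier2_cache_bytes = 25e12      # 25 TB of Tier 2 disk, from the diagram

for label, rate in [("622 Mbits/s", 622e6), ("2.4 Gbits/s", 2.4e9)]:
    print(f"25 TB over {label}: {transfer_days(tier2_cache_bytes, rate):.1f} days")
```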
Data Grid Hierarchy (CMS)
[Diagram. Scale: 1 TIPS = 25,000 SpecInt95; a PC today is ~10-20 SpecInt95. Rates are cross-checked in the sketch below]
• Online system: bunch crossings every 25 ns, ~100 triggers per second, events of ~1 MByte; ~PBytes/sec off the detector, ~100 MBytes/sec to the offline farm (~20 TIPS)
• Tier 0: CERN Computer Center, fed at ~100 MBytes/sec
• Tier 1: ~622 Mbits/sec links to Fermilab (~4 TIPS) and the France, Germany, and Italy Regional Centers
• Tier 2: ~2.4 Gbits/sec links to Tier 2 centers (~1 TIPS each, with HPSS mass storage)
• Tier 3: ~622 Mbits/sec links to institutes (~0.25 TIPS each); physicists work on analysis "channels", each institute has ~10 physicists working on one or more channels, and data for these channels is cached by the institute server
• Tier 4: workstations on 1-10 Gbits/sec local networks
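A quick cross-check of the rates in the diagram: 100 triggers per second at ~1 MByte per event gives the quoted ~100 MBytes/sec, and an assumed ~10^7 seconds of effective running per year (a conventional rough figure, not stated on the slide) gives the "petabytes per year" scale quoted earlier.

```python
# Sketch: derive the storage rate and yearly data volume from the trigger rate
# and event size on the slide. The live time per year is an assumption.

trigger_rate_hz = 100            # events per second after the trigger
event_size_bytes = 1e6           # ~1 MByte per event
live_seconds_per_year = 1e7      # assumed effective running time per year

rate_bytes_per_sec = trigger_rate_hz * event_size_bytes
yearly_bytes = rate_bytes_per_sec * live_seconds_per_year

print(f"Raw rate to storage: {rate_bytes_per_sec / 1e6:.0f} MBytes/sec")
print(f"Yearly volume:       {yearly_bytes / 1e15:.1f} PBytes")
```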
Why a Data Grid: Physical
• Unified system: all computing resources are part of the grid
  • Efficient resource use (manage scarcity)
  • Averages out spikes in usage
  • Resource discovery / scheduling / coordination truly possible
  • "The whole is greater than the sum of its parts"
• Optimal data distribution and proximity
  • Labs are close to the data they need
  • Users are close to the data they need
  • No data or network bottlenecks
• Scalable growth
Why a Data Grid: Political
• A central lab cannot manage or help thousands of users
  • Easier to leverage resources, maintain control, and assert priorities regionally
• Cleanly separates functionality
  • Different resource types in different Tiers
  • Funding complementarity (NSF vs DOE)
  • Targeted initiatives
• New IT resources can be added "naturally"
  • Additional matching resources at Tier 2 universities
  • Larger institutes can join, bringing their own resources
  • Tap into new resources opened by the IT "revolution"
• Broadens the community of scientists and students
  • Training and education
  • Vitality of the field depends on the University / Lab partnership
Tier 2 Regional Centers
[Diagram: Tier 0 through Tier 4, with more organization toward the top and more flexibility toward the bottom]
• Possible model: CERN : National : Tier 2 = 1/3 : 1/3 : 1/3
• Complementary role to Tier 1 lab-based centers
  • Less need for 24 x 7 operation → lower component costs
  • Less production-oriented → can respond to analysis priorities
  • Flexible organization, e.g., by physics goals or subdetectors
  • Variable fraction of resources available to outside users
• Range of activities includes
  • Reconstruction, simulation, physics analyses
  • Data caches / mirrors to support analyses
  • Production in support of the parent Tier 1
  • Grid R&D
  • ...
Distribution of Tier 2 Centers
• Tier 2 centers arranged regionally in the US model
  • Good networking connections to move data (caches)
  • Location independence of users is always maintained
• Increases collaborative possibilities
• Emphasis on training and involvement of students
• High-quality desktop environment for remote collaboration, e.g., a next-generation VRVS system
Strawman Tier 2 Architecture
• Linux Farm of 128 Nodes: $0.30M
• Sun Data Server with RAID Array: $0.10M
• Tape Library: $0.04M
• LAN Switch: $0.06M
• Collaborative Infrastructure: $0.05M
• Installation and Infrastructure: $0.05M
• Net Connect to Abilene Network: $0.14M
• Tape Media and Consumables: $0.04M
• Staff (Ops and System Support): $0.20M*
• Total Estimated Cost (First Year): $0.98M (re-added in the sketch below)
• Cost in Succeeding Years, for evolution, upgrade, and ops: $0.68M
* 1.5 - 2 FTE of support required per Tier 2; physicists from the institute also aid in support.
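A small sketch that simply re-adds the strawman first-year line items listed above to confirm the quoted total; all items and prices are taken directly from the slide.

```python
# Sketch: sum the strawman first-year Tier 2 cost items (in millions of USD).

first_year_costs_musd = {
    "Linux farm (128 nodes)":           0.30,
    "Sun data server + RAID array":     0.10,
    "Tape library":                     0.04,
    "LAN switch":                       0.06,
    "Collaborative infrastructure":     0.05,
    "Installation and infrastructure":  0.05,
    "Net connect to Abilene":           0.14,
    "Tape media and consumables":       0.04,
    "Staff (ops and system support)":   0.20,
}

total = sum(first_year_costs_musd.values())
print(f"First-year total: ${total:.2f}M")   # matches the quoted $0.98M
```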
Strawman Tier 2 Evolution, 2000 → 2005 (growth factors sketched below)
• Linux Farm: 1,500 SI95 → 20,000 SI95*
• Disks on CPUs: 4 TB → 20 TB
• RAID Array: 1 TB → 20 TB
• Tape Library: 1 TB → 50-100 TB
• LAN Speed: 0.1-1 Gbps → 10-100 Gbps
• WAN Speed: 155-622 Mbps → 2.5-10 Gbps
• Collaborative Infrastructure: MPEG2 VGA (1.5-3 Mbps) → Realtime HDTV (10-20 Mbps)
RAID disk is used for "higher availability" data.
* Reflects lower Tier 2 component costs due to less demanding usage, e.g., simulation.
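A minimal sketch of the growth factors implied by the 2000 → 2005 table above; where the slide gives a range, the midpoint is used, which is a simplifying assumption on my part.

```python
# Sketch: implied 2000 -> 2005 growth factors from the evolution table.
# Ranges are represented by their midpoints (an assumption for illustration).

evolution = {
    # item: (2000 value, 2005 value), units as noted in the table
    "Linux farm (SI95)":  (1_500, 20_000),
    "Disks on CPUs (TB)": (4, 20),
    "RAID array (TB)":    (1, 20),
    "Tape library (TB)":  (1, 75),         # midpoint of 50-100 TB
    "LAN speed (Gbps)":   (0.55, 55),      # midpoints of 0.1-1 and 10-100
    "WAN speed (Mbps)":   (388.5, 6_250),  # midpoints of 155-622 Mbps and 2.5-10 Gbps
}

for item, (y2000, y2005) in evolution.items():
    print(f"{item}: x{y2005 / y2000:.0f} growth over 5 years")
```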
The GriPhyN Project
• Joint project involving
  • US-CMS, US-ATLAS
  • LIGO (gravitational wave experiment)
  • SDSS (Sloan Digital Sky Survey)
  • http://www.phys.ufl.edu/~avery/mre/
• Requesting funds from NSF to build the world's first production-scale grid(s)
  • Sub-implementations for each experiment
  • NSF pays for Tier 2 centers, some R&D, some networking
• Realization of a unified Grid system requires research
  • Many common problems across the different implementations
  • Requires partnership with CS professionals
R&D Foundations I
• Globus (Grid middleware)
  • Grid-wide services
  • Security
• Condor (see M. Livny's paper)
  • General language for service seekers / service providers
  • Resource discovery
  • Resource scheduling, coordination, (co)allocation
• GIOD (networked object databases)
• Nile (fault-tolerant distributed computing)
  • Java-based toolkit, running on CLEO
R&D Foundations II
• MONARC
  • Construct and validate architectures
  • Identify important design parameters
  • Simulate an extremely complex, dynamic system
• PPDG (Particle Physics Data Grid)
  • DOE / NGI funded for 1 year
  • Testbed systems
  • Later program of work incorporated into GriPhyN
The NSF ITR Initiative
• Information Technology Research program
  • Aimed at funding innovative research in IT
  • $90M in funds authorized
  • Maximum of $12.5M for a single proposal (5 years)
  • Requires extensive student support
• GriPhyN submitted a preproposal Dec. 30, 1999
  • Intend that ITR fund most of our Grid research program
  • Major costs are for people, especially students / postdocs
  • Minimal equipment
  • Some networking
• Full proposal due April 17, 2000
Summary of Data Grids and the LHC
• Develop an integrated distributed system while meeting LHC goals
  • ATLAS/CMS: production, data-handling oriented
  • (LIGO/SDSS: computation, "commodity component" oriented)
• Build and test the regional center hierarchy
  • Tier 2 / Tier 1 partnership
  • Commission and test software, data handling systems, and data analysis strategies
• Build and test the enabling collaborative infrastructure
  • Focal points for student-faculty interaction in each region
  • Real-time high-resolution video as part of the collaborative environment
• Involve students at universities in building the data analysis, and in the physics discoveries at the LHC