Explore the history, characteristics, and evolution of CERN computing, from mainframes to the modern LHC Computing Grid: data handling and processing, physics analysis, event simulation, and more.
LHC Computing Grid Project
Les Robertson
CERN - IT Division
les.robertson@cern.ch
CERN Data Handling and Computation for Physics Analysis
[data-flow diagram: detector → event filter (selection & reconstruction) → raw data → event reprocessing → event summary data (processed data) → batch physics analysis → analysis objects (extracted by physics topic) → interactive physics analysis; event simulation also feeds the chain]
HEP Computing Characteristics
• Large numbers of independent events – trivial parallelism
• Large data sets – smallish records, mostly read-only
• Modest I/O rates – a few MB/sec per fast processor
• Modest floating-point requirement – SPECint performance is what matters
• Very large aggregate requirements – computation and data
• Scaling up is not just big – it is also complex
• … and what happens once you exceed the capabilities of a single geographical installation?
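The "trivial parallelism" point is the key architectural simplification: events are independent, so throughput scales by adding processors rather than by building a tightly coupled supercomputer. A minimal illustrative sketch (not CERN code; the event records and the reconstruct() logic are invented placeholders):

```python
# Illustrative sketch only: processing a batch of independent events in
# parallel. Each event is a smallish, read-only record; no synchronisation
# between workers is needed.
from multiprocessing import Pool

def reconstruct(event):
    """Process one event; the real selection/reconstruction would go here."""
    return {"id": event["id"], "summary": sum(event["hits"])}

def main():
    # Fake events standing in for smallish, read-only records.
    events = [{"id": i, "hits": [i, i + 1, i + 2]} for i in range(1000)]
    with Pool() as pool:                 # one worker per CPU core
        summaries = pool.map(reconstruct, events)
    print(len(summaries), "events processed")

if __name__ == "__main__":
    main()
```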
History
• 1960s through 1980s – the largest scientific supercomputers & mainframes (Control Data, Cray, IBM, Siemens/Fujitsu); time-sharing interactive services on IBM & DEC-VMS; scientific workstations from 1982 (Apollo) for development and final analysis
• 1989 – first batch services on RISC, a joint project with HP (Apollo DN10000)
• 1990 – simulation service, 4 × mainframe capacity
• 1991 – SHIFT: data-intensive applications, distributed model
• 1993 – first central interactive service on RISC
• 1996 – last of the mainframes decommissioned
• 1997 – first batch services on PCs
• 1998 – NA48 records 70 TeraBytes of data
• 2000 – more than 75% of capacity delivered by PCs
The SHIFT Software Model (1990)
• all data available to all processes – via an API which can be implemented over IP
• replicated component model – scalable, heterogeneous, distributed
• standard APIs – disk I/O, mass storage, job scheduler
• mass storage model – active data cached on disk (stager)
• physical implementation transparent to the application/user (implementations on SMPs, SP2, clusters, WAN clusters)
• flexible evolution – scalable capacity, multiple platforms, smooth integration of new technologies
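The stager idea is the heart of the model: applications always see files on disk, and recall from mass storage happens transparently behind a simple API. A minimal sketch of that interface, with invented paths and names (this is not the actual SHIFT/RFIO API):

```python
# Minimal sketch of a disk-cache "stager": applications ask for a file by name
# and always receive a path on the disk cache; data not already cached is
# recalled from mass storage first. Paths below are placeholders.
import shutil
from pathlib import Path

DISK_CACHE = Path("/tmp/disk_cache")      # fast pool of disk servers (stand-in)
MASS_STORE = Path("/tmp/mass_storage")    # tape archive (stand-in)

def stage_in(filename: str) -> Path:
    """Return a disk-resident copy of `filename`, recalling it if necessary."""
    cached = DISK_CACHE / filename
    if not cached.exists():
        DISK_CACHE.mkdir(parents=True, exist_ok=True)
        shutil.copy(MASS_STORE / filename, cached)   # the "tape recall"
    return cached

# The application never needs to know whether the data was on disk or tape:
#   path = stage_in("run1234.raw"); data = path.read_bytes()
```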
Generic computing farm
[diagram: application servers and a disk data cache in front of mass storage, connected to the WAN]
LHC Computing Fabric – can we scale up the current commodity-component-based approach?
CERN Physics Data Handling
[chart: evolution of CPU capacity and cost through the nineties; annotations: 50% annual growth, 80% annual growth, LEP startup]
LHC Offline Computing Scale, Cost and the Model
CERN's Users in the World
• Europe: 267 institutes, 4603 users
• Elsewhere: 208 institutes, 1632 users
On-line System
• Multi-level trigger – filter out background, reduce data volume
• 24 x 7 operation
• Trigger chain: 40 MHz (1000 TB/sec) → Level 1, special hardware → 75 kHz (75 GB/sec) → Level 2, embedded processors → 5 kHz (5 GB/sec) → Level 3, farm of commodity CPUs → 100 Hz (100 MB/sec) → data recording & offline analysis
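A quick consistency check, using only the rates and bandwidths quoted above, shows the implied event sizes (bandwidth divided by rate) and the overall rate reduction of the trigger chain:

```python
# Back-of-the-envelope check of the trigger-chain numbers quoted above.
# The "event size" is simply bandwidth / rate at each stage.
levels = [
    # (stage,                          rate in Hz, bandwidth in bytes/s)
    ("Collision rate (detector)",      40e6,       1000e12),
    ("Level 1 - special hardware",     75e3,       75e9),
    ("Level 2 - embedded processors",  5e3,        5e9),
    ("Level 3 - commodity CPU farm",   100,        100e6),
]

for name, rate, bandwidth in levels:
    size_mb = bandwidth / rate / 1e6
    print(f"{name:32s} {rate:10.0f} Hz  ->  ~{size_mb:.0f} MB/event")

reduction = levels[0][1] / levels[-1][1]
print(f"Overall rate reduction: {reduction:,.0f} : 1")   # 400,000 : 1
```

The front end implies roughly 25 MB per bunch crossing, each level after Level 1 works with events of about 1 MB, and the chain as a whole reduces the rate by a factor of 400,000.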
How Much Data is Involved?
[chart: Level-1 trigger rate (Hz) versus event size (bytes) for LEP, UA1, H1, ZEUS, NA49, ALICE, CDF, CDF II, KLOE, HERA-B, LHCb, ATLAS and CMS – the LHC experiments combine a high Level-1 trigger rate (up to ~1 MHz), a high number of channels, high bandwidth (~500 Gbit/s) and a PetaByte-scale data archive; the data rate is compared with 1 billion people surfing the Web]
The Large Hadron Collider Project
• 4 detectors – ATLAS, CMS, LHCb, ALICE
• Storage – raw recording rate 0.1–1 GBytes/sec, accumulating at 5–8 PetaBytes/year, with 10 PetaBytes of disk
• Processing – 200,000 of today's fastest PCs
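The recording rate and the yearly accumulation are consistent if one assumes roughly 10^7 seconds of effective data taking per year (an assumption, not a figure from the slide):

```python
# Rough consistency check of the accumulation figure. The effective running
# time of ~1e7 seconds per year is an assumption, not a number from the slide.
seconds_per_year = 1e7                      # assumed effective data-taking time
for rate_gb_per_s in (0.1, 0.5, 1.0):       # raw recording rates quoted above
    petabytes = rate_gb_per_s * 1e9 * seconds_per_year / 1e15
    print(f"{rate_gb_per_s:.1f} GB/s  ->  {petabytes:4.0f} PB/year")
# A sustained 0.5-0.8 GB/s matches the quoted 5-8 PB/year.
```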
Worldwide distributed computing system
• Small fraction of the analysis at CERN
• ESD analysis using 12–20 large regional centres
  – how to use the resources efficiently
  – establishing and maintaining a uniform physics environment
• Data exchange with tens of smaller regional centres, universities, labs
Planned capacity evolution at CERN
[chart: projected CPU, disk and mass-storage capacity for LHC and other experiments, compared against Moore's law]
Importance of cost containment
• components & architecture
• utilisation efficiency
• maintenance, capacity evolution
• personnel & management costs
• ease of use (usability efficiency)
[chart: IT Division – LTP Planning – Materials]
The MONARC Multi-Tier Model (1999)
[diagram: CERN as Tier 0, connected over 2.5 Gbps, 622 Mbps and 155 Mbps links to Tier 1 regional centres (IN2P3, RAL, FNAL), then Tier 2 centres (universities and labs), departments and desktops]
MONARC report: http://home.cern.ch/~barone/monarc/RCArchitecture.html
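One way to see why the multi-tier model pushes analysis out to regional centres rather than shipping bulk data on demand: moving a petabyte-scale dataset over the link speeds shown in the diagram takes weeks to years. The 1 PB dataset size below is an assumed example, not a slide figure:

```python
# Transfer times for a petabyte-scale dataset over the link speeds shown in
# the MONARC diagram. The 1 PB dataset size is an assumed example; real
# transfers would also lose time to protocol overhead and competing traffic.
dataset_bytes = 1e15                        # assumed: ~1 PB of event summary data

links_mbps = {"2.5 Gbps link": 2500,
              "622 Mbps link": 622,
              "155 Mbps link": 155}

for name, mbps in links_mbps.items():
    seconds = dataset_bytes * 8 / (mbps * 1e6)
    print(f"{name:14s}: {seconds / 86400:5.0f} days at full line rate")
```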
The opportunity of Grid technology
LHC Computing Model 2001 – evolving
[diagram: the LHC Computing Centre at CERN (Tier 0 / Tier 1) connected through a grid to Tier 1 national centres (France, Germany, Italy, UK, USA, …), Tier 2 regional groups, Tier 3 physics departments, labs, universities and the desktops of physics groups]
Major Activities
• Computing Fabric Management
• Networking
• Grid Technology
• Software
• Prototyping & Data Challenges
• Deployment
• Regional Centre Coordination & Planning
Computing Fabric Management
Key issues –
• scale
• efficiency & performance
• resilience – fault tolerance
• cost – acquisition, maintenance, operation
• usability
• security
Working assumptions for Computing Fabric at CERN
• single physical cluster – Tier 0, Tier 1 and the 4 experiments – partitioned by function, and (maybe) by user
• an architecture that accommodates mass-market components and supports cost-effective and seamless capacity evolution
• new level of operational automation; novel style of fault tolerance – self-healing fabrics (where are the industrial products?)
• plan for active mass storage (tape), but hope to use it only as an archive
• one platform – Linux on Intel
• ESSENTIAL to remain flexible on all fronts
[diagram: application servers and a disk data cache in front of mass storage, with a WAN connection to the Grid]
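"Self-healing fabric" here means a control loop that compares each node's observed state with its desired state and repairs drift automatically, instead of paging an operator for every failure. A minimal sketch of the idea, with stubbed-out probe and repair actions invented purely for illustration:

```python
# Minimal sketch (not an existing product) of a self-healing control loop.
# The node model, health probe and repair action are stubs for illustration.
import random

def probe_node(node):
    """Stub health probe; a real fabric would check daemons, disks, configuration."""
    return random.choice(["ok", "ok", "ok", "degraded"])

def repair(node):
    """Stub standing in for: drain the node, reimage it, return it to production."""
    print(f"{node}: draining, reinstalling, returning to production")

def healing_pass(nodes):
    """One pass of the loop; a real fabric manager would run this continuously."""
    for node in nodes:
        if probe_node(node) != "ok":
            repair(node)

healing_pass([f"node{i:03d}" for i in range(10)])
```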
Grid Technology
• wave of interest in grid technology as a basis for revolutionising e-Science and e-Commerce
• LHC offers an ideal testbed, and will gain major usability benefits – a win-win situation?
• DataGrid & associated national initiatives have placed HEP at the centre of the action in Europe and the US
• important to stay mainline, embrace standards and industrial solutions
• important to get the DataGrid testbed going now – integrate successive LHC Grid prototypes and get to work with Data Challenges driven by the experiments' needs
• attack the real, not the theoretical, problems
DataGrid Testbed Sites (>40)
[map: HEP and ESA sites including Dubna, Lund, Moscow, ESTEC, KNMI, RAL, Berlin, IPSL, Prague, Paris, Brno, CERN, Lyon, Santander, Milano, Grenoble, PD-LNL, Torino, Madrid, Marseille, BO-CNAF, Pisa, Lisboa, Barcelona, ESRIN, Roma, Valencia, Catania]
Contacts: Francois.Etienne@in2p3.fr – Antonia.Ghiselli@cnaf.infn.it
Grid Technology Coordination
Significant coordination issues: DataGrid, INFN Grid, GridPP, PPDG, NorduGrid, GriPhyN, CrossGrid, Globus, GGF, Dutch Grid, Hungarian Grid, ……… "InterGrid" Europe-US committee, DataTag, …………
• LHC – HEP regional and national initiatives
• HEP national initiatives – national grids
• DataGrid
  – includes earth observation & biology applications but is dominated by HEP
  – came from a HEPCCC initiative that included the coordination of national HEP grid activities
  – key role in the founding of GGF
  – close relationship with Globus
• The LHC Project should
  – invest in and support DataGrid and associated projects to deliver the grid technology for the LHC prototype
  – support GGF for long-term standardisation
  – and keep a close watch on industry
Software
• basic environment for worldwide deployment – libraries, compilers, development tools, webs, portals
• common frameworks & tools – simulation, analysis, …
  – what is common to 4, 3, 2 experiments?
  – essential that developments are collaborative efforts between the experiments and the labs
  – need clear requirements from the SC2, and clear guidelines from the POB
• adaptation of collaboration software to grid middleware – every "job" must run on 100s of processors at many different sites
• anticipate strong pressure for a standard environment in regional centres – and the same on the desktop / notebook / palmtop
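The requirement that every "job" runs on hundreds of processors at many sites means the software must split an analysis over many input files into sub-jobs bound to the sites holding the data. A minimal illustrative sketch; the site names, file names and round-robin placement are invented, not part of any LHC framework:

```python
# Sketch of splitting one logical analysis "job" into per-site sub-jobs.
# Site and file names are placeholders; a real system would place sub-jobs
# according to where the data is actually cached.
from itertools import cycle

def split_job(input_files, sites, files_per_subjob=10):
    """Return a list of (site, files) sub-jobs covering all input files."""
    site_cycle = cycle(sites)
    subjobs = []
    for i in range(0, len(input_files), files_per_subjob):
        subjobs.append((next(site_cycle), input_files[i:i + files_per_subjob]))
    return subjobs

files = [f"esd_{n:05d}.root" for n in range(100)]
for site, chunk in split_job(files, ["CERN", "Tier1-A", "Tier1-B"]):
    print(f"{site}: {len(chunk)} files, first = {chunk[0]}")
```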
Prototyping & Data Challenges
• local, network and grid testing – usability, performance, reliability
• operating the Data Challenges, driven by the needs of the collaborations
Deployment
• system integration, distribution, maintenance and support
• grid operation
• LHC computing service operation – registration, accounting, reporting
Regional Centre Coordination
• fostering common solutions – standards & strategies
• prototyping
• planning
Time constraints
[timeline 2001–2006: continuing R&D programme; prototypes 1, 2 and 3; technology selection; system software selection, development and acquisition; hardware selection and acquisition; pilot service; 1st production service around 2005–2006]