Petabyte-scale computing for LHC
Ian Bird, CERN WLCG Project Leader
ISEF Students, 18th June 2012
Accelerating Science and Innovation
Enter a New Era in Fundamental Science
The start-up of the Large Hadron Collider (LHC), one of the largest and most truly global scientific projects ever undertaken, is the most exciting turning point in particle physics.
• Exploration of a new energy frontier, with data from the four main experiments: ALICE, ATLAS, CMS and LHCb
• LHC ring: 27 km circumference
Some history of scale…
• For comparison: in the 1990s the total LEP data set was ~a few TB – it would fit on one tape today
• Today: one year of LHC data is ~25 PB
• CERN has about 60,000 physical disks to provide about 20 PB of reliable storage
Where does all this data come from?
150 million sensors deliver data … 40 million times per second
What is this data?
• Raw data: was a detector element hit? How much energy? What time?
• Reconstructed data: momentum of tracks (4-vectors), origin, energy in clusters (jets), particle type, calibration information, …
Data and Algorithms
• HEP data are organised as Events (particle collisions)
• Simulation, Reconstruction and Analysis programs process one Event at a time
• Events are fairly independent, which allows trivial parallel processing
• Event-processing programs are composed of a number of Algorithms selecting and transforming “raw” Event data into “processed” (reconstructed) Event data and statistics
Data tiers (see the sketch below):
• RAW – triggered events recorded by DAQ, detector digitisation; ~2 MB/event
• ESD/RECO – reconstructed, pseudo-physical information: clusters, track candidates; ~100 kB/event
• AOD – analysis information, physical quantities: transverse momentum, association of particles, jets, identification of particles; ~10 kB/event
• TAG – classification information relevant for fast event selection; ~1 kB/event
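To make the event-at-a-time model concrete, here is a minimal sketch in Python: a chain of independent “algorithms” reduces each event from RAW towards TAG. All class and function names are hypothetical illustrations, not part of any real HEP framework; only the per-event sizes are taken from the slide.

```python
from dataclasses import dataclass, field

# Approximate per-event sizes quoted above (kB); everything else is invented.
TIER_SIZES_KB = {"RAW": 2000, "ESD/RECO": 100, "AOD": 10, "TAG": 1}

@dataclass
class Event:
    """One particle collision; later tiers hold progressively smaller summaries."""
    raw: dict
    esd: dict = field(default_factory=dict)
    aod: dict = field(default_factory=dict)
    tag: dict = field(default_factory=dict)

def reconstruct(event: Event) -> Event:
    # Turn raw hits into track candidates (placeholder logic).
    event.esd = {"n_track_candidates": len(event.raw.get("hits", []))}
    return event

def make_aod(event: Event) -> Event:
    # Reduce reconstructed data to analysis-level quantities.
    event.aod = {"n_tracks": event.esd["n_track_candidates"]}
    return event

def make_tag(event: Event) -> Event:
    # Keep only what is needed for fast event selection.
    event.tag = {"interesting": event.aod["n_tracks"] > 2}
    return event

ALGORITHMS = [reconstruct, make_aod, make_tag]

def process(events):
    """Events are independent, so this loop parallelises trivially."""
    for ev in events:
        for algorithm in ALGORITHMS:
            ev = algorithm(ev)
        yield ev

if __name__ == "__main__":
    sample = [Event(raw={"hits": [1, 2, 3]}), Event(raw={"hits": [1]})]
    for ev in process(sample):
        print(ev.tag)
    print({tier: f"~{kb} kB/event" for tier, kb in TIER_SIZES_KB.items()})
```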
Data Handling and Computation for Physics Analysis
[Flow diagram: the detector feeds the event filter (selection & reconstruction), producing raw data; reconstruction turns raw data into event summary data; batch physics analysis and event reprocessing produce analysis objects (extracted by physics topic) for interactive physics analysis; event simulation provides additional processed data.]
The LHC Computing Challenge
• Signal/Noise: 10⁻¹³ (10⁻⁹ offline)
• Data volume: high rate × large number of channels × 4 experiments → 15 PB of new data each year (22 PB recorded in 2011)
• Compute power: event complexity × number of events × thousands of users → 200k CPUs and 45 PB of disk storage originally planned (250k CPUs and 150 PB by 2011)
• Worldwide analysis & funding: computing is funded locally in major regions & countries, yet efficient analysis is needed everywhere → GRID technology
(A back-of-the-envelope check of the data volume is sketched below.)
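As a rough plausibility check of the data-volume figure above, the short calculation below multiplies an assumed trigger rate and event size by a typical year of data taking. The trigger rate, event size and live time are round, assumed numbers, not figures from the slide.

```python
# Illustrative, order-of-magnitude estimate of annual LHC data volume.
events_per_second = 300       # events kept per experiment after the trigger (assumed)
event_size_mb = 1.5           # average size of one recorded event in MB (assumed)
live_seconds_per_year = 1e7   # ~10^7 seconds of data taking per year (rule of thumb)
experiments = 4

petabytes_per_year = (events_per_second * event_size_mb
                      * live_seconds_per_year * experiments) / 1e9  # MB -> PB
print(f"~{petabytes_per_year:.0f} PB/year")  # ~18 PB/year, same order as the 15-25 PB quoted
```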
A collision at LHC
The Data Acquisition
Tier 0 at CERN: acquisition, first-pass reconstruction, storage & distribution
[Figure annotations: 2011: 4–6 GB/sec; 2011: 400–500 MB/sec; 1.25 GB/sec (ions)]
WLCG – what and why?
• A distributed computing infrastructure to provide the production and analysis environments for the LHC experiments
• Managed and operated by a worldwide collaboration between the experiments and the participating computer centres
• The resources are distributed – for funding and sociological reasons
• Our task was to make use of the resources available to us – no matter where they are located
Roles of the tiers (a toy sketch follows below):
• Tier-0 (CERN): data recording, initial data reconstruction, data distribution
• Tier-1 (11 centres): permanent storage, re-processing, analysis
• Tier-2 (~130 centres): simulation, end-user analysis
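As a purely illustrative way of summarising the tier architecture just listed, the toy snippet below encodes the roles in a small data structure; the names `WLCG_TIERS` and `tiers_for` are hypothetical, and the site counts are the ones quoted on the slide.

```python
# Hypothetical summary of the WLCG tier model described above.
WLCG_TIERS = {
    "Tier-0": {"sites": 1,   "roles": {"data recording", "initial reconstruction", "data distribution"}},
    "Tier-1": {"sites": 11,  "roles": {"permanent storage", "re-processing", "analysis"}},
    "Tier-2": {"sites": 130, "roles": {"simulation", "end-user analysis"}},
}

def tiers_for(task: str) -> list[str]:
    """Return which tiers would typically take on a given kind of work."""
    return [name for name, info in WLCG_TIERS.items() if task in info["roles"]]

print(tiers_for("analysis"))           # ['Tier-1']
print(tiers_for("end-user analysis"))  # ['Tier-2']
```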
WLCG Grid Sites
• Today >140 sites
• >250k CPU cores
• >150 PB disk
[Map of sites, keyed by Tier 0 / Tier 1 / Tier 2]
WLCG Collaboration Status
• Tier 0; 11 Tier 1s; 68 Tier 2 federations
• Today we have 49 MoU signatories, representing 34 countries: Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep., Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. Korea, Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, UK, Ukraine, USA
• Sites shown on the map: CERN (Tier 0), plus Tier 1s at Ca-TRIUMF, US-BNL, US-FNAL, UK-RAL, Lyon/CCIN2P3, Bologna/CNAF, Barcelona/PIC, De-FZK, Amsterdam/NIKHEF-SARA, NDGF and Taipei/ASGC
Original Computing model
From testing to data: independent experiment data challenges and service challenges
• DC04 (ALICE, CMS, LHCb) and DC2 (ATLAS) in 2004 saw the first full chain of the computing models running on grids
• Service Challenges were proposed in 2004 to demonstrate the service aspects: data transfers for weeks on end, data management, scaling of job workloads, security incidents (“fire drills”), interoperability, support processes
• 2004: SC1 – basic transfer rates
• 2005: SC2 – basic transfer rates; SC3 – sustained rates, data management, service reliability
• 2006: SC4 – nominal LHC rates, disk → tape tests, all Tier 1s, some Tier 2s
• 2008: CCRC’08 – readiness challenge, all experiments, ~full computing models
• 2009: STEP’09 – scale challenge, all experiments, full computing models, tape recall + analysis
• The focus was on real and continuous production use of the service over several years (simulations since 2003, cosmic-ray data, etc.)
• Data and Service Challenges exercised all aspects of the service – not just data transfers, but workloads, support structures etc.
WLCG: Data in 2010, 2011 and 2012
• In 2010+2011 ~38 PB of data were accumulated; about 30 PB more are expected in 2012
• 23 PB of data were written to tape in 2011, and in 2012 the rate is ~3 PB/month
• Data rates to tape are in excess of the original plans: up to 6 GB/s during heavy-ion (HI) running (cf. the nominal 1.25 GB/s)
[Plot annotations: HI: ALICE data into Castor > 4 GB/s; HI: overall rates to tape > 6 GB/s]
Grid Usage
• Use remains consistently high: >1.5 M jobs/day and ~10⁹ HEPSPEC-hours/month, equivalent to ~150k CPUs in continuous use (see the arithmetic below)
• At the end of 2010 we saw all Tier 1 and Tier 2 job slots being filled; CPU usage is now well over double that of mid-2010 (the inset shows the build-up over previous years)
• As well as LHC data processing, large simulation productions are always ongoing
• Large numbers of analysis users: ATLAS and CMS ~1000 each, LHCb and ALICE ~250 each
• In 2011 WLCG delivered ~150 CPU-millennia!
[Plot: CPU used at Tier 1s + Tier 2s (HS06·hours/month) over the last 24 months]
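The conversion from HEPSPEC-hours to a CPU count can be checked roughly as below. The benchmark score per core (~10 HS06) is an assumption typical of hardware of that era, not a number given on the slide.

```python
# Back-of-the-envelope conversion of the usage figures quoted above.
hs06_hours_per_month = 1e9
hs06_per_core = 10            # assumed benchmark score of one CPU core
hours_per_month = 30 * 24

cores_in_continuous_use = hs06_hours_per_month / (hs06_per_core * hours_per_month)
print(f"~{cores_in_continuous_use:,.0f} cores busy all the time")  # ~139,000

# Keeping ~140-150k cores busy for a full year is ~150,000 CPU-years,
# i.e. roughly the ~150 CPU-millennia quoted for 2011.
print(f"~{cores_in_continuous_use / 1000:.0f} CPU-millennia per year")
```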
Tier usage vs pledges
We use everything we are given!
CPU – around the Tiers
• The grid really works
• All sites, large and small, can contribute – and their contributions are needed!
Data transfers
[Plots: global transfers over the last month; global transfers exceeding 10 GB/s in one day; CERN → Tier 1 transfers over the last 2 weeks]
LHC Networking
• Relies on the OPN, GEANT, US-LHCNet, and the NRENs & other national & international providers
Today’s Grid Services
• Security services: Certificate Management Service, VO Membership Service, Authentication Service, Authorization Service
• Job management services: Compute Element, Workload Management Service, VO Agent Service, Application Software Install Service
• Data management services: Storage Element, File Catalogue Service, Grid file access tools, File Transfer Service, GridFTP service
• Information services: Information System, Messaging Service, Accounting Service, Site Availability Monitor
• Database and DB Replication Services; POOL Object Persistency Service
• Experiments invested considerable effort into integrating their software with these grid services, and into hiding the complexity from users
• Monitoring tools: experiment dashboards; site monitoring
(A purely conceptual sketch of how a job passes through these services follows below.)
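To give a feel for how the services above fit together when a user runs a job, here is a deliberately simplified, hypothetical sketch. None of these functions correspond to a real middleware API; they only mirror the roles named on the slide (authentication, file catalogue / storage element lookup, workload management).

```python
# Hypothetical, simplified life cycle of one grid job across the service
# types listed above; all names here are invented for illustration.

def authenticate(user_certificate: str) -> str:
    """Authentication / VO Membership: turn a certificate into a short-lived proxy."""
    return f"proxy-for-{user_certificate}"

def locate_input(file_catalogue: dict, logical_name: str) -> str:
    """File Catalogue: map a logical file name to a replica on a Storage Element."""
    return file_catalogue[logical_name]

def submit(workload_manager: list, job: dict) -> None:
    """Workload Management: queue the job for a matching Compute Element."""
    workload_manager.append(job)

if __name__ == "__main__":
    proxy = authenticate("physicist.pem")
    catalogue = {"/lhc/run1234/raw.root": "srm://storage.example.org/raw.root"}
    replica = locate_input(catalogue, "/lhc/run1234/raw.root")
    queue: list = []
    submit(queue, {"proxy": proxy, "input": replica, "executable": "reco.sh"})
    print(queue)
```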
Technical evolution: Background
Consider that:
• Computing models have evolved: we have a far better understanding of the requirements now than 10 years ago, and they have evolved even since the large-scale challenges
• Experiments have developed various workarounds to manage shortcomings in the middleware: pilot jobs and central task queues are (almost) ubiquitous
• Operational effort is often too high: many services were not designed for redundancy, fail-over, etc.
• Technology evolves rapidly, and the rest of the world also does (large-scale) distributed computing – we don’t need entirely home-grown solutions
• We must be concerned about long-term support and where it will come from
Evolution of computing models: from hierarchy to mesh
Connectivity challenge
• Not just bandwidth: we are a global collaboration, but well-connected countries do better
• We need to effectively connect everyone that wants to participate in LHC science
• There are large actual and potential communities in the Middle East, Africa, Asia and Latin America – but also on the edges of Europe
Impact of the LHC Computing Grid
• WLCG has been leveraged on both sides of the Atlantic, to the benefit of the wider scientific community:
• Europe: Enabling Grids for E-sciencE (EGEE), 2004–2010, followed by the European Grid Infrastructure (EGI), 2010–
• USA: Open Science Grid (OSG), 2006–2012 (+ extension?)
• Many scientific applications: archeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences, …
Spectrum of grids, clouds, supercomputers, etc.
• Grids: collaborative environment; distributed resources (for political/sociological reasons); commodity hardware (also supercomputers); (HEP) data management; complex interfaces (a bug, not a feature)
• Supercomputers: expensive; low-latency interconnects; applications are peer-reviewed; parallel/coupled applications; traditional interfaces (login); there are also supercomputer grids (DEISA, TeraGrid)
• Clouds: proprietary (in implementation); economies of scale in management; commodity hardware; virtualisation for service provision and for encapsulating the application environment; details of the physical resources are hidden; simple interfaces (too simple?)
• Volunteer computing: a simple mechanism to access millions of CPUs; difficult if (much) data is involved; control of the environment ✓; community building – people involved in science; potential for huge amounts of real work
Many different problems, amenable to different solutions – there is no single right answer. So:
• Consider ALL of these as a combined e-Infrastructure ecosystem
• Aim for interoperability, and combine the resources into a consistent whole
• Keep applications agile so they can operate in many environments
Grid <-> Cloud?
• A grid is a distributed computing service: it integrates distributed resources, provides global single sign-on (the same credential is used everywhere), and enables (virtual) collaboration
• A cloud is a large (remote) data centre: economy of scale comes from centralising resources in large centres, and virtualisation enables dynamic provisioning of resources
• The technologies are not exclusive: in the future our collaborative grid sites will use cloud technologies (virtualisation etc.), and we will also use cloud resources to supplement our own
Grids → clouds?
• We have a grid because we need to collaborate and share resources – thus we will always have a “grid”
• Our network of trust is of enormous value for us and for (e-)science in general
• We also need distributed data management that supports very high data rates and throughputs; we will continually work on these tools
• But the rest can be more mainstream (open source, commercial, …)
• We use message brokers more and more for inter-process communication
• Virtualisation of our grid sites is happening, with many drivers: power, dependencies, provisioning, …
• Remote job submission could become cloud-like
• There is interest in making use of commercial cloud resources, especially for peak demand
Clouds & Virtualisation
Several strategies:
• Use of virtualisation at CERN and other computer centres: the lxcloud pilot plus the CVI dynamic virtualised infrastructure (which may include “bare-metal” provisioning); no change to any grid or service interfaces (but new possibilities); likely based on OpenStack. Other WLCG sites are also virtualising their infrastructure.
• Investigating use of commercial clouds – “bursting”: additional resources, and the potential to outsource some services? Prototyping with the Helix Nebula project; the experiments have various activities of their own (with Amazon, etc.)
• Can cloud technology replace or supplement some grid services? This is more speculative: feasibility? timescales?
CERN Data Centre Numbers (from http://sls.cern.ch/sls/service.php?id=CCBYNUM)
CERN Infrastructure Evolution
Evolution of capacity: CERN & WLCG