The Data Deluge and the Grid
• The Data Deluge
• The Large Hadron Collider
• The LHC Data Challenge
• The Grid
• Grid Applications
• GridPP
• Conclusion
Steve Lloyd, Queen Mary University of London (s.l.lloyd@qmul.ac.uk)
The Data Deluge
Expect massive increases in the amount of data being collected in several diverse fields over the next few years:
• Astronomy: massive sky surveys
• Biology: genome databases etc.
• Earth observation
• Digitisation of paper, film and tape records to create digital libraries, museums . . .
• Particle physics: the Large Hadron Collider
• . . .
1 PByte ~ 1,000 TBytes ~ 1M GBytes ~ 1.4M CDs (Petabyte, Terabyte, Gigabyte)
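These unit relations are easy to verify; a short Python check (assuming a ~700 MB CD, which is what the 1.4M figure implies):

```python
# Sanity check of the scale comparison: 1 petabyte in other units
petabyte = 1e15                # bytes
cd_capacity = 700e6            # bytes per CD (assumption: a ~700 MB CD)

print(petabyte / 1e12)                          # 1000.0 (terabytes)
print(petabyte / 1e9)                           # 1000000.0 (gigabytes)
print(round(petabyte / cd_capacity / 1e6, 1))   # 1.4 (million CDs)
```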
Digital Sky Project
Federating new astronomical surveys:
• ~40,000 square degrees
• ~1/2 trillion pixels (1 arc second)
• ~1 TB x multiple wavelengths
• > 1 billion sources
Integrated catalogue and image database:
• Digital Palomar Observatory Sky Survey
• 2 Micron All Sky Survey (2MASS)
• NRAO VLA Sky Survey
• VLA FIRST Radio Survey
Later:
• ROSAT
• IRAS
• Westerbork 327 MHz Survey
Sloan Digital Sky Survey
Survey 10,000 square degrees of the Northern Sky over 5 years:
• ~1 million spectra
• positions and images of 100 million objects
• 5 wavelength bands
• ~40 TB
VISTA: the Visible and Infrared Survey Telescope for Astronomy
Virtual Observatories
The Crab Nebula imaged in X-ray (Chandra), optical (HST), infra-red (Gemini mid-IR) and radio (VLA); also the jet in M87.
NASA's Earth Observing System
~1 TB of data per day. Example image: the Galapagos oil spill.
ESA EO Facilities
Satellites (LANDSAT 7, TERRA/MODIS, AVHRR, SEAWIFS, SPOT, IRS-P3) feed ground stations (Kiruna/ESRANGE, Tromsø, Matera, Maspalomas, Neustrelitz), historical archives and standard production chains, delivering products to users via ESRIN. Example product: GOME analysis detected ozone thinning over Europe, 31 Jan 2002.
Species 2000
To enumerate all ~1.7 million known species of plants, animals, fungi and microbes on Earth. A federation of initially 18 taxonomic databases, eventually ~200 databases.
Genomics
The LHC
The Large Hadron Collider (LHC) will be a 14 TeV centre-of-mass proton-proton collider operating in the existing 26.7 km LEP tunnel at CERN. Due to start operation after 2006.
• 1,232 superconducting main dipoles of 8.3 Tesla
• 788 quadrupoles
• 2,835 bunches of 10^11 protons each, spaced by 25 ns
Particle Physics Questions
• Need to discover (confirm) the Higgs particle
• Study its properties
• Prove that Higgs couplings depend on masses
Other unanswered questions:
• Does Supersymmetry exist?
• How are quarks and leptons related?
• Why are there 3 sets of quarks and leptons?
• What about gravity?
• Anything unexpected?
The LHC
The LEP/LHC Tunnel
LHC Experiments
The LHC will house 4 experiments:
• ATLAS and CMS are large 'general purpose' detectors designed to detect everything and anything
• LHCb is a specialised experiment designed to study CP violation in the b quark system
• ALICE is a dedicated heavy ion physics detector
Schematic View of the LHC
The ATLAS Experiment
ATLAS consists of:
• an inner tracker to measure the momentum of each charged particle
• a calorimeter to measure the energies carried by the particles
• a muon spectrometer to identify and measure muons
• a huge magnet system for bending charged particles for momentum measurement
A total of more than 10^8 electronic channels.
The ATLAS Detector
Simulated ATLAS Higgs Event
LHC Event Rates
• The LHC proton bunches collide every 25 ns and each crossing yields ~20 proton-proton interactions superimposed in the detector, i.e. 40 MHz x 20 = 8x10^8 pp interactions/sec
• The (110 GeV) Higgs cross section is 24.2 pb
• A good channel is H → γγ, with a branching ratio of 0.19% and a detector acceptance of ~50%
• At full (10^34 cm^-2 s^-1) LHC luminosity this gives 10^34 x 24.2x10^-12 x 10^-24 x 0.0019 x 0.5 ≈ 2x10^-4 H → γγ events per second
A 2x10^-4 needle in an 8x10^8 haystack.
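A quick numerical check of that rate, using exactly the values quoted on the slide (Python used purely for illustration):

```python
# Back-of-envelope Higgs rate from the quoted numbers
luminosity = 1e34            # cm^-2 s^-1, full LHC luminosity
sigma_pb = 24.2              # Higgs production cross section, picobarns
pb_to_cm2 = 1e-12 * 1e-24    # 1 pb = 1e-12 barn; 1 barn = 1e-24 cm^2
branching_ratio = 0.0019     # H -> gamma gamma
acceptance = 0.5             # detector acceptance

rate = luminosity * sigma_pb * pb_to_cm2 * branching_ratio * acceptance
print(f"{rate:.1e} H -> gamma gamma events per second")  # 2.3e-04
```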
'Online' Data Reduction
Selecting interesting events based on progressively more detector information:
• Collision rate: 40 MHz (40 TB/sec)
• Level 1, special hardware trigger: 10^4-10^5 Hz (10-100 GB/sec)
• Level 2, embedded processor trigger: 10^2-10^3 Hz (1-10 GB/sec)
• Level 3, processor farm: 10-100 Hz (100-200 MB/sec)
• Raw data storage, then offline data reconstruction
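The reduction chain can be sketched in a few lines; the slide quotes ranges, so single representative rates are used here for illustration:

```python
# Sketch of the online reduction chain using the rates quoted above
collision_rate_hz = 40e6                    # 40 MHz bunch crossings
stages = [
    ("Level 1: hardware trigger", 1e5),     # upper end of 1e4-1e5 Hz
    ("Level 2: embedded processors", 1e3),  # upper end of 1e2-1e3 Hz
    ("Level 3: processor farm", 100.0),     # 10-100 Hz to raw data storage
]

rate = collision_rate_hz
for name, out_rate in stages:
    print(f"{name}: {rate:.0e} -> {out_rate:.0e} Hz "
          f"(rejection ~{rate / out_rate:.0f}x)")
    rate = out_rate
```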
Offline Analysis
• Raw data from detector: 1-2 MB/event at 100-400 Hz
• Total data per year from one experiment: 1 to 8 PBytes (10^15 Bytes)
• Data reconstruction (digits to energy/momentum etc.) produces Event Summary Data: 0.5 MB/event
• Event selection produces Analysis Object Data: 10 kB/event
• Physics analysis
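The yearly volume follows directly from the event size and rate. Here is the arithmetic, assuming ~10^7 seconds of effective data-taking per year (a common rule of thumb, not stated on the slide):

```python
# Annual raw-data volume implied by the quoted figures
event_size_mb = 1.5        # mid-range of 1-2 MB per raw event
rate_hz = 250              # mid-range of 100-400 Hz to storage
seconds_per_year = 1e7     # effective running time (assumption)

total_pb = event_size_mb * rate_hz * seconds_per_year / 1e9  # MB -> PB
print(f"~{total_pb} PB/year")  # ~3.75 PB/year, within the quoted 1-8 PB
```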
Computing Resources Required
CPU power (reconstruction, simulation, user analysis etc.):
• 2 million SpecInt95
• (a 1 GHz PC is rated at ~40 SpecInt95)
• i.e. 50,000 of today's PCs
'Tape' storage: 20,000 TB
Disk storage: 2,500 TB
Analysis carried out throughout the world by hundreds of physicists.
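The PC count is just the ratio of the two SpecInt95 figures:

```python
# How many ~1 GHz PCs the required CPU power corresponds to
required_si95 = 2e6        # total CPU power required, SpecInt95
pc_rating_si95 = 40        # rating of a ~1 GHz PC, per the slide

pcs_needed = required_si95 / pc_rating_si95
print(int(pcs_needed))     # 50000
```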
Worldwide Collaboration
CMS: 1,800 physicists from 150 institutes in 32 countries.
Solutions
• Centralised solution: put all resources at CERN
• Funding agencies certainly won't place all their investment at CERN
• Sociological problems
• Distributed solution:
• exploit established computing expertise and infrastructure in national labs and universities
• reduce dependence on links to CERN
• tap additional funding sources (spin-off)
Is the Grid the solution?
What is The Grid?
Analogy with the electricity power grid:
• unlimited, ubiquitous distributed computing
• transparent access to multi-petabyte distributed databases
• easy to plug in
• complexity of the infrastructure hidden
The Grid
Five emerging models:
• Distributed Computing: synchronous processing
• High-Throughput Computing: asynchronous processing
• On-Demand Computing: dynamic resources
• Data-Intensive Computing: databases
• Collaborative Computing: scientists
Ian Foster and Carl Kesselman, editors, "The Grid: Blueprint for a New Computing Infrastructure", Morgan Kaufmann, 1999, http://www.mkp.com/grids
The Grid
Ian Foster / Carl Kesselman: "A computational Grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities."
The Grid
• Dependable: need to rely on remote equipment as much as on the machine on your desk
• Consistent: machines need to communicate, so need consistent environments and interfaces
• Pervasive: the more resources that participate in the same system, the more useful they all are
• Inexpensive: important for pervasiveness, i.e. built using commodity PCs and disks
The Grid
You simply submit your job to the 'Grid': you shouldn't have to know where the data you want is or where the job will run. The Grid software (middleware) will take care of:
• running the job where the data is, or
• moving the data to where there is CPU power available
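That scheduling choice can be sketched as a toy function. This is purely illustrative, not a real middleware API; all names here are hypothetical:

```python
# Toy illustration of the placement decision Grid middleware makes:
# run the job where the data is, or move the data to spare CPU.

def place_job(data_site, cpu_free):
    """Return (site, action) for a job whose input data lives at data_site.

    cpu_free maps site name -> number of free CPU slots.
    """
    if cpu_free.get(data_site, 0) > 0:
        # Preferred: avoid moving data at all
        return data_site, "run job where the data is"
    # Otherwise pick the site with the most free CPU and replicate the data
    best = max(cpu_free, key=cpu_free.get)
    return best, "copy data to site with spare CPU"

site, action = place_job("RAL", {"RAL": 0, "CERN": 120, "QMUL": 40})
print(site, "->", action)  # CERN -> copy data to site with spare CPU
```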
The Grid for the Scientist
The scientist supplies the physics (E = mc^2); the Grid middleware absorbs the "@#%&*!". "Putting the bottleneck back in the Scientist's mind."
Grid Tiers
For the LHC we envisage a 'hierarchical' structure based on several 'Tiers', since the data mostly originates at one place:
• Tier-0: CERN, the source of the data
• Tier-1: ~10 major regional centres (including the UK)
• Tier-2: smaller, more specialised regional centres (4 in the UK?)
• Tier-3: university groups
• Tier-4: my laptop? Mobile phone?
The structure doesn't need to be hierarchical; for biologists, for example, it is probably not desirable.
Grid Services
• Applications: cosmology, chemistry, environment, biology, particle physics
• Application Toolkits: data-intensive, remote computing, problem solving, remote instrumentation, collaborative and distributed visualization applications
• Grid Services (middleware): resource-independent and application-independent services such as authentication, authorization, resource location, resource allocation, events, accounting, remote data access, information, policy and fault detection
• Grid Fabric (resources): resource-specific implementations of the basic services, e.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory services, OS bypass
Problems
• Scalability: will it scale to thousands of processors, thousands of disks, PetaBytes of data, Terabits/sec of I/O?
• Wide-area distribution: how to distribute, replicate, cache, synchronise and catalogue the data? How to balance local ownership of resources with the requirements of the whole?
• Adaptability/Flexibility: need to adapt to rapidly changing hardware and costs, new analysis methods etc.
SETI@home
A distributed computing project, not really a Grid project: you pull the data from them rather than them submitting jobs to you.
• total of 3,864,230 users
• 564,194,228 results received
• 1,063,104 years of CPU time
• 1.8x10^21 floating point operations
• 77 different CPU types
• ~100 different operating systems
Data source: the Arecibo telescope in Puerto Rico.
SETI@home
Entropia
Uses idle cycles on home PCs for profit and non-profit projects:
• Mersenne Prime Search: 146,622 machines, 784,360,165 CPU hours
• FightAIDS@Home: 13,944 machines, 1,652,126 CPU hours
NASA Information Power Grid
Knit together widely distributed computing, data, instrumentation and human resources to address complex large-scale computing and data analysis problems.
Collaborative Engineering
Example: the Unitary Plan Wind Tunnel, with multi-source data analysis, real-time collection and archival storage.
Other Grid Applications
• Distributed supercomputing: simultaneous execution across multiple supercomputers
• Smart instruments: enhance the power of scientific instruments by providing access to data archives, online processing capabilities and visualisation, e.g. coupling Argonne's Advanced Photon Source to a supercomputer
GridPP http://www.gridpp.ac.uk
GridPP Overview
• Provide architecture and middleware
• Build prototype Tier-1 and Tier-2s in the UK and implement middleware in the experiments
• Running US experiments: use the Grid with real data
• Future LHC experiments: use the Grid with simulation data
The Prototype UK Tier-1
• Jan 2002: central facilities used by all experiments: 250 CPUs (450 MHz-1 GHz), 10 TB disk, 35 TB of tape in use (theoretical tape capacity 330 TB)
• March 2002: extra resources for LHC and BaBar: 312 CPUs, 40 TB disk, an extra 36 TB of tape and three new drives
Conclusions
• Enormous data challenges in the next few years.
• The Grid is the likely solution.
• The Web gives ubiquitous access to distributed information; the Grid will give ubiquitous access to computing resources and hence knowledge.
• Many Grid projects and testbeds are starting to take off.
• GridPP is building a UK Grid for particle physicists to prepare for future LHC data.