Large-Scale Computing with Grids • Jeff Templon • KVI Seminar, 17 February 2004
Information I intend to transfer • Why are Grids interesting? Grids are solutions, and solutions should solve a problem, so I will spend some time on the problem and show how Grids are relevant. • What are Computational Grids? • How are Grids being used, in HEP as well as in other fields? • What are some examples of current Grid “research topics”?
Our Problem • Place the event information on a 3D map • Trace trajectories through the hits • Assign a type to each track • Find the particles you want • A needle in a haystack! • And this is the “relatively easy” case
Data Handling and Computation for Physics Analysis • [Data-flow diagram: the detector feeds the event filter (selection & reconstruction), which produces raw data and event summary data; event reprocessing and event simulation feed the same stores; batch physics analysis extracts analysis objects by physics topic from the processed data, and these are used in interactive physics analysis]
Computational Aspects • Reconstructing and analyzing one event takes about 90 seconds • Maybe only a few out of a million events are interesting – but we have to check them all! • The analysis program needs a lot of calibration, which is determined by inspecting the results of a first pass • So each event will be analyzed several times!
One of the four LHC detectors • [Diagram: the online system uses a multi-level trigger to filter out background and reduce the data volume] • Collisions: 40 MHz (40 TB/sec) • Level 1 – special hardware: 75 kHz (75 GB/sec) • Level 2 – embedded processors: 5 kHz (5 GB/sec) • Level 3 – PCs: 100 Hz (100 MB/sec) • Data recording & offline analysis
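A quick check of the overall reduction these trigger stages achieve, using only the numbers on the slide:

$$\frac{40\ \text{MHz}}{100\ \text{Hz}} \;=\; \frac{40\ \text{TB/s}}{100\ \text{MB/s}} \;=\; 4 \times 10^{5}$$

so only about one bunch crossing in 400,000 survives to data recording.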
Computational Implications (2) • 90 seconds per event to reconstruct and analyze • 100 incoming events per second • To keep up, need either: • A computer that is nine thousand times faster, or • nine thousand computers working together • Moore’s Law: wait 20 years and computers will be 9000 times faster (we need them in 2007!)
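Spelling out the arithmetic behind "nine thousand", using the numbers on the slide:

$$100\ \frac{\text{events}}{\text{s}} \;\times\; 90\ \frac{\text{CPU-seconds}}{\text{event}} \;=\; 9000\ \frac{\text{CPU-seconds}}{\text{second}} \;\Rightarrow\; \sim 9000\ \text{CPUs to keep up in real time}$$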
More Computational Issues • Four LHC experiments – roughly 36k CPUs needed • BUT: accelerator not always “on” – need fewer • BUT: multiple passes per event – need more! • BUT: haven’t accounted for Monte Carlo production – more!! • AND: haven’t addressed the needs of “physics users” at all!
Large Dynamic Range Computing • Clear that we have enough work to keep somewhere between 50 and 100 k CPUs busy • Most of the activities are “bursty”, so doing them with dedicated facilities is inefficient • Need some meta-facility that can serve all the activities • Reconstruction & reprocessing • Analysis • Monte-Carlo • All four LHC experiments • All the LHC users
LHC User Distribution • Putting all computers in one spot leads to traffic jams • Which spot is willing to pay for & maintain 100k CPUs?
Classic Motivation for Grids • Large Scales: 50k CPUs, petabytes of data • (if we’re only talking ten machines, who cares?) • Large Dynamic Range: bursty usage patterns • Why buy 25k CPUs if 60% of the time you only need 900 CPUs? • Multiple user groups on single system • Can’t “hard-wire” the system for your purposes • Wide-area access requirements • Users not in same lab or even continent
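A back-of-the-envelope illustration of why dedicated hardware is wasteful here, assuming the remaining 40% of the time really does need all 25k CPUs (that assumption is mine, not a number from the talk):

$$\text{average utilization} \;\le\; \frac{0.6 \times 900 \;+\; 0.4 \times 25000}{25000} \;\approx\; 42\%$$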
Solution using Grids • Large Scales: 50k CPUs, petabytes of data • Assemble 50k+ CPUs and petabytes of mass storage • Don’t need to be in the same place! • Large Dynamic Range: bursty usage patterns • When you need less than you have, others use excess capacity • When you need more, use others’ excess capacities • Multiple user groups on single system • “Generic” grid software services (think web server here) • Wide-area access requirements • Public Key Infrastructure for authentication & authorization
How do Grids Work? Public Key Infrastructure & Generic Grid Software Services
Public Key Infrastructure – Single Login • Based on asymmetric cryptography: public/private key pairs • Requires a network of trusted Certificate Authorities (e.g. VeriSign) • CAs issue certificates to users, computers, or services running on computers (typically valid for two years) • A certificate carries: an “identity”, e.g. /O=dutchgrid/O=users/O=kvi/CN=Nasser Kalantar • a public key • the identity of the issuing CA • a digital signature (a transformation of the certificate text using the CA’s private key) • This part still needs tightening up – it takes too long
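A minimal sketch of what "verifying a grid identity" looks like in practice, written with the Python `cryptography` package (not something used in the talk); the file names `usercert.pem` and `ca_cert.pem` and the RSA assumption are mine, for illustration only:

```python
# Sketch: check that a user certificate was really signed by a trusted CA,
# and read the identity (Distinguished Name) it carries.
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import padding

def load_cert(path):
    with open(path, "rb") as f:
        return x509.load_pem_x509_certificate(f.read())

user_cert = load_cert("usercert.pem")   # assumed file names
ca_cert = load_cert("ca_cert.pem")

# The "identity" and issuer are Distinguished Names, like the /O=dutchgrid/... example above.
print("subject:", user_cert.subject.rfc4514_string())
print("issuer :", user_cert.issuer.rfc4514_string())
print("expires:", user_cert.not_valid_after)

# Verify the CA's digital signature over the certificate body.
ca_cert.public_key().verify(
    user_cert.signature,
    user_cert.tbs_certificate_bytes,
    padding.PKCS1v15(),                      # assumes an RSA-signed certificate
    user_cert.signature_hash_algorithm,
)
print("signature check passed: certificate was issued by this CA")
```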
What is a Grid? (what are these Generic Grid Software Services?)
Info system defines grid • [Diagram: compute clusters (C) and a data store (DS), each running a grid software service (like an http server), publish themselves to the Information System] • The Information System is the central nervous system of the Grid
Data Grid • [Diagram: data stores (DS) join the compute clusters (C); all are tracked by the Information System (I.S.) and the Data Management System (D.M.S.)]
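A toy sketch of the "information system as central nervous system" idea: resources publish their state, and everything else discovers them by querying. All names and attributes here are invented for illustration; this is not the actual grid middleware:

```python
# Toy information system: resources publish themselves, clients query it.
from dataclasses import dataclass, field

@dataclass
class Resource:
    name: str
    kind: str                 # "cluster" or "data-store"
    attrs: dict = field(default_factory=dict)

class InformationSystem:
    def __init__(self):
        self._registry = {}

    def publish(self, resource: Resource):
        # Each grid service periodically republishes its current state.
        self._registry[resource.name] = resource

    def query(self, kind=None, **requirements):
        # Return resources of the right kind whose attributes satisfy the requirements.
        hits = []
        for r in self._registry.values():
            if kind and r.kind != kind:
                continue
            if all(r.attrs.get(k) == v for k, v in requirements.items()):
                hits.append(r)
        return hits

# Example: two clusters and a data store publish; a client asks for Linux clusters.
infosys = InformationSystem()
infosys.publish(Resource("nikhef-farm", "cluster", {"os": "linux", "free_cpus": 120}))
infosys.publish(Resource("cern-farm", "cluster", {"os": "linux", "free_cpus": 40}))
infosys.publish(Resource("sara-store", "data-store", {"free_tb": 300}))
print([r.name for r in infosys.query(kind="cluster", os="linux")])
```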
Computing Task Submission • [Diagram: the user sends a proxy + command (and possibly data) to the Workload Management System (W.M.S.); the W.M.S. passes coarse requirements to the I.S., gets back candidate clusters, then gets fresh, detailed info before dispatching the task to a cluster (C)]
Computing Task Execution • [Diagram: the job (proxy + command + data) arrives at a cluster; it asks the I.S. how to find the D.M.S., asks the D.M.S. “Where is my data?”, and gets back a list of the best locations; progress is reported to the logger]
Computing Task Execution (continued) • [Diagram: the finished job asks “Where do I put the data?” and “How to contact the O.D.S.?”, ships the proxy + data to a data store, registers the output with the D.M.S., and reports “Done” to the logger]
Computing Task Submission with data • [Diagram: as before, but the proxy + command now carries a data file spec; the W.M.S. asks the D.M.S. to find copies of the data and combines that with the candidate clusters from the I.S. when choosing where to run]
Computing Task Execution with data • [Diagram: the job runs on a cluster with local access to the data; it still asks the D.M.S. “Where is my data?” and gets back a list of the best locations, with the local copy preferred]
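A minimal sketch of the matchmaking step these slides describe: coarse requirements select candidate clusters from the information system, and data location breaks the tie. The names and the scoring rule are invented for illustration, not taken from the actual W.M.S.:

```python
# Toy workload-management step: pick a cluster for a job, preferring clusters
# that sit at a site holding a replica of the job's input data.

clusters = [                                   # what the information system might report
    {"name": "nikhef-farm", "free_cpus": 120, "site": "NIKHEF"},
    {"name": "cern-farm",   "free_cpus": 40,  "site": "CERN"},
]
replicas = {                                   # what the data management system might report
    "lfn:/grid/demo/run1234.dat": ["CERN", "RAL"],
}

def choose_cluster(job, clusters, replicas):
    # 1. Coarse requirements: enough free CPUs.
    candidates = [c for c in clusters if c["free_cpus"] >= job["min_cpus"]]
    # 2. Rank candidates: a cluster at a site with a copy of the input data wins;
    #    otherwise prefer the cluster with the most free CPUs.
    sites_with_data = set(replicas.get(job["input"], []))
    def score(cluster):
        return (cluster["site"] in sites_with_data, cluster["free_cpus"])
    return max(candidates, key=score, default=None)

job = {"min_cpus": 10, "input": "lfn:/grid/demo/run1234.dat"}
best = choose_cluster(job, clusters, replicas)
print("dispatch to:", best["name"] if best else "no match")   # -> cern-farm (has the data)
```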
Grids are working now • See it in action
Applications • Biomedical Image Comparison • D0 Run II Data Reprocessing • Ozone Profile Computation, Storage, & Access
Grid Research Topics • “Push” vs. “pull” models of operation (you just saw “push”) • We know how to do fine-grained priority & access control in “push” • We think “pull” is operationally more robust • Software distribution – HEP has BIG programs with many versions … “pre-install” vs. “zero-install” • Data distribution – similar issues (software is just executable data!) • Data distribution was a definite bottleneck in the D0 reprocessing (1 Gbit / 100 / 2 / 2 = modem!): speedup limited to 92% of theoretical, about 1 hr of CPU lost per job • Optimization in general is hard – and is it worth it?
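A toy illustration of the "push" vs. "pull" distinction (invented code, not the project's middleware): in push mode a central broker decides where each job goes from its possibly stale view of the farm; in pull mode idle workers fetch work themselves, so a dead worker simply stops taking jobs:

```python
# Toy contrast between "push" and "pull" dispatch when one worker has silently died.
from collections import deque

jobs = [f"job-{i}" for i in range(6)]
workers = {"wn-01": "up", "wn-02": "down", "wn-03": "up"}   # broker doesn't know wn-02 died

# Push: a central broker assigns jobs using its (stale) view of the workers.
def push_dispatch(jobs, workers):
    names = list(workers)                      # broker still believes all three are alive
    assigned = {job: names[i % len(names)] for i, job in enumerate(jobs)}
    lost = [j for j, w in assigned.items() if workers[w] == "down"]
    return assigned, lost                      # jobs pushed to wn-02 are stuck until resubmitted

# Pull: only live workers ask for work, so nothing lands on the dead node.
# (Assumes at least one live worker, otherwise the queue never drains.)
def pull_dispatch(jobs, workers):
    queue = deque(jobs)
    assigned = {}
    while queue:
        for name, state in workers.items():
            if state == "up" and queue:        # a dead worker never pulls
                assigned[queue.popleft()] = name
    return assigned

pushed, lost = push_dispatch(jobs, workers)
print("push:", pushed, "lost to the dead node:", lost)
print("pull:", pull_dispatch(jobs, workers))
```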
Directed Acyclic Graphs
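A minimal, invented sketch of the idea behind the title: a grid workflow expressed as a directed acyclic graph of tasks, executed in an order that respects the dependencies. The task names are made up for illustration:

```python
# Toy workflow DAG: run each task only after the tasks it depends on have finished.
from graphlib import TopologicalSorter   # Python 3.9+

# Each task lists its prerequisites.
workflow = {
    "fetch-raw-data": [],
    "calibrate":      ["fetch-raw-data"],
    "reconstruct":    ["fetch-raw-data", "calibrate"],
    "analyze":        ["reconstruct"],
    "store-results":  ["analyze"],
}

order = list(TopologicalSorter(workflow).static_order())
print("execution order:", order)
# -> fetch-raw-data, calibrate, reconstruct, analyze, store-results
```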
Didn’t talk about • Grid projects • European Data Grid – ends this month, 10 M€ / 3y • VL-E – ongoing, 20 M€ / 5y • EGEE – starts in April, 30 M€ / 2y • Gridifying HEP software & computing activities • Transition from supercomputing to Grid computing (funding) • Integration with other technologies • Submit, monitor, control computing work from your PDA or mobile!
Conclusions • Grids are working now for workload management in HEP • Big developments in next two years with large-scale deployments of LCG & EGEE projects • Seems Grids are catching on (IBM / Globus alliance) • Lots of interesting stuff to try out!!!
What’s There Now? • Job Submission • A marriage of Globus, Condor-G, and the EDG Workload Manager • The latest version is reliable at the 99% level • Information System • New system (R-GMA) – a very good information model, but the implementation is still evolving • Old system (MDS) – poor information model, poor architecture, but hacks allow 99% “uptime” • Data Management • Replica Location Service – a convenient and powerful system for locating and manipulating distributed data; mostly still user-driven (no heuristics) • Data Storage • “Bare gridFTP server” – reliable, but mostly suited to disk-only mass storage • SRM – no mature implementations yet
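To make the Replica Location Service idea concrete, here is a toy sketch of the mapping such a catalogue maintains: one logical file name pointing to many physical copies, with the choice of copy left to the user (no heuristics), as the slide says. The LFN and the URLs are invented:

```python
# Toy replica catalogue: logical file name (LFN) -> physical replica URLs.
replica_catalogue = {
    "lfn:/grid/demo/run1234.dat": [
        "gsiftp://se.cern.ch/data/run1234.dat",
        "gsiftp://se.nikhef.nl/data/run1234.dat",
    ],
}

def list_replicas(lfn):
    return replica_catalogue.get(lfn, [])

def register_replica(lfn, url):
    # Registering a new copy just adds another physical location for the same LFN.
    replica_catalogue.setdefault(lfn, []).append(url)

register_replica("lfn:/grid/demo/run1234.dat", "gsiftp://se.ral.ac.uk/data/run1234.dat")
for url in list_replicas("lfn:/grid/demo/run1234.dat"):
    print(url)   # the user (not the middleware) decides which copy to read
```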
Grids are Information • [Diagram: two separate grids, an A-Grid and a B-Grid, each defined by its own Information System]