Global Data Grids for 21st Century Science Paul Avery, University of Florida http://www.phys.ufl.edu/~avery/ avery@phys.ufl.edu Physics Colloquium, University of Texas at Arlington, Jan. 24, 2002
What is a Grid? • Grid: Geographically distributed computing resources configured for coordinated use • Physical resources & networks provide raw capability • “Middleware” software ties it together
Applications for Grids • Climate modeling • Climate scientists visualize, annotate, & analyze Terabytes of simulation data • Biology • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour • High energy physics • 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data • Engineering • Civil engineers collaborate to design, execute, & analyze shake table experiments • A multidisciplinary analysis in aerospace couples code and data in four companies From Ian Foster
Applications for Grids (cont.) • Application Service Providers • A home user invokes architectural design functions at an application service provider • An application service provider purchases cycles from compute cycle providers • Commercial • Scientists at a multinational soap company design a new product • Communities • An emergency response team couples real time data, weather model, population data • A community group pools members’ PCs to analyze alternative designs for a local road • Health • Hospitals and international agencies collaborate on stemming a major disease outbreak From Ian Foster
Proto-Grid: SETI@home • Community: SETI researchers + enthusiasts • Arecibo radio data sent to users (250KB data chunks) • Over 2M PCs used
More Advanced Proto-Grid: Evaluation of AIDS Drugs • Community • 1000s of home computer users • Philanthropic computing vendor (Entropia) • Research group (Scripps) • Common goal • Advance AIDS research
Early Information Infrastructure • Network-centric • Simple, fixed end systems • Few embedded capabilities • Few services • No user-level quality of service • O(10^8) nodes
Emerging Information Infrastructure • Application-centric • Heterogeneous, mobile end-systems • Many embedded capabilities • Rich services (caching, resource discovery, processing, QoS) • User-level quality of service • O(10^10) nodes • Qualitatively different, not just “faster and more reliable”
Why Grids? • Resources for complex problems are distributed • Advanced scientific instruments (accelerators, telescopes, …) • Storage and computing • Groups of people • Communities require access to common services • Scientific collaborations (physics, astronomy, biology, eng. …) • Government agencies • Health care organizations, large corporations, … • Goal is to build “Virtual Organizations” • Make all community resources available to any VO member • Leverage strengths at different institutions • Add people & resources dynamically
Grid Challenges • Overall goal • Coordinated sharing of resources • Technical problems to overcome • Authentication, authorization, policy, auditing • Resource discovery, access, allocation, control • Failure detection & recovery • Resource brokering • Additional issue: lack of central control & knowledge • Preservation of local site autonomy • Policy discovery and negotiation important
Layered Grid Architecture (Analogy to Internet Architecture) • Application: specialized services; app-specific distributed services (user level) • Collective: managing multiple resources; ubiquitous infrastructure services • Resource: sharing single resources; negotiating access, controlling use • Connectivity: talking to things; communications, security • Fabric: controlling things locally; accessing, controlling resources • (Internet protocol analogy: Application / Transport / Internet / Link) From Ian Foster
Globus Project and Toolkit • Globus Project™ (Argonne + USC/ISI) • O(40) researchers & developers • Identify and define core protocols and services • Globus Toolkit™ • A major product of the Globus Project • Reference implementation of core protocols & services • Growing open source developer community • Globus Toolkit used by all Data Grid projects today • US: GriPhyN, PPDG, TeraGrid, iVDGL • EU: EU-DataGrid and national projects
Globus General Approach • Define Grid protocols & APIs • Protocol-mediated access to remote resources • Integrate and extend existing standards • Develop reference implementation • Open source Globus Toolkit • Client & server SDKs, services, tools, etc. • Grid-enable wide variety of tools • FTP, SSH, Condor, SRB, MPI, … • Learn about real world problems • Deployment • Testing • Applications • (Layering: applications → diverse global services → core services → diverse OS services)
Globus Toolkit Protocols • Security (connectivity layer) • Grid Security Infrastructure (GSI) • Resource management (resource layer) • Grid Resource Allocation Management (GRAM) • Information services (resource layer) • Grid Resource Information Protocol (GRIP) • Data transfer (resource layer) • Grid File Transfer Protocol (GridFTP)
Data Grids
Data Intensive Science: 2000-2015 • Scientific discovery increasingly driven by IT • Computationally intensive analyses • Massive data collections • Data distributed across networks of varying capability • Geographically distributed collaboration • Dominant factor: data growth (1 Petabyte = 1000 TB) • 2000 ~0.5 Petabyte • 2005 ~10 Petabytes • 2010 ~100 Petabytes • 2015 ~1000 Petabytes? How to collect, manage, access and interpret this quantity of data? Drives demand for “Data Grids” to handle the additional dimension of data access & movement
Global Data Grid Challenge “Global scientific communities will perform computationally demanding analyses of distributed datasets that will grow by at least 3 orders of magnitude over the next decade, from the 100 Terabyte to the 100 Petabyte scale.”
Data Intensive Physical Sciences • High energy & nuclear physics • Gravity wave searches • LIGO, GEO, VIRGO • Astronomy: Digital sky surveys • Now: Sloan Sky Survey, 2MASS • Future: VISTA, other Gigapixel arrays • “Virtual” Observatories (Global Virtual Observatory) • Time-dependent 3-D systems (simulation & data) • Earth Observation • Climate modeling • Geophysics, earthquake modeling • Fluids, aerodynamic design • Pollutant dispersal scenarios
Data Intensive Biology and Medicine • Medical data • X-Ray, mammography data, etc. (many petabytes) • Digitizing patient records (ditto) • X-ray crystallography • Bright X-Ray sources, e.g. Argonne Advanced Photon Source • Molecular genomics and related disciplines • Human Genome, other genome databases • Proteomics (protein structure, activities, …) • Protein interactions, drug delivery • Brain scans (3-D, time dependent) • Virtual Population Laboratory (proposed) • Database of populations, geography, transportation corridors • Simulate likely spread of disease outbreaks Craig Venter keynote @SC2001
Data and Corporations • Corporations and Grids • National, international, global • Business units, research teams • Sales data • Transparent access to distributed databases • Corporate issues • Short term and long term partnerships • Overlapping networks • Manage, control access to data and resources • Security
Example: High Energy Physics • “Compact” Muon Solenoid at the LHC (CERN) • (Figure: detector shown next to the Smithsonian standard man for scale)
LHC Computing Challenges • “Events” resulting from beam-beam collisions: • Signal event is obscured by 20 overlapping uninteresting collisions in same crossing • CPU time does not scale from previous generations (2000 vs. 2007)
LHC: Higgs Decay into 4 muons • 10^9 events/sec, selectivity: 1 in 10^13
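The selectivity figure implies a vanishingly small surviving rate; a quick back-of-the-envelope check (rate numbers taken from the slide):

```python
# Numbers quoted on the slide.
collision_rate = 1e9          # beam-crossing events per second
selectivity = 1e-13           # one interesting event in 10^13

signal_rate = collision_rate * selectivity   # events/second surviving selection
seconds_per_day = 86_400
per_day = signal_rate * seconds_per_day      # roughly 9 signal events per day

print(f"{signal_rate:.0e} signal events/s, about {per_day:.1f} per day")
```

This is why the full 10^9 events/sec can never be recorded: the online trigger must throw away almost everything while keeping those few events per day.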
LHC Computing Challenges • Complexity of LHC interaction environment & resulting data • Scale: Petabytes of data per year (100 PB by ~2010-12) • Global distribution of people and resources: 1800 physicists, 150 institutes, 32 countries
Global LHC Data Grid • Tier0: CERN • Tier1: National Lab • Tier2: Regional Center (University, etc.) • Tier3: University workgroup • Tier4: Workstation • Key ideas: • Hierarchical structure • Tier2 centers
Global LHC Data Grid • CERN/Outside resource ratio ~1:2; Tier0 : ΣTier1 : ΣTier2 ~1:1:1 • Experiment: ~PBytes/sec; bunch crossing every 25 nsec, 100 triggers per second, each event ~1 MByte • Online system → CERN Computer Center (Tier 0+1, >20 TIPS, HPSS) at ~100 MBytes/sec • Tier 1: France, Italy, UK, USA centers (each with HPSS), linked at 2.5 Gbits/sec • Tier 2: regional centers at ~622 Mbits/sec • Tier 3: institutes (~0.25 TIPS each) with physics data caches, at 100-1000 Mbits/sec • Tier 4: workstations, other portals • Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels
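The bandwidth figures in the tier diagram can be sanity-checked with simple arithmetic. A sketch using only the numbers quoted on the slide (decimal units assumed throughout):

```python
# Slide numbers: ~1 MByte/event, 100 triggers/sec after online selection.
event_size_mb = 1.0
trigger_rate = 100
stream_mb_s = event_size_mb * trigger_rate     # ~100 MB/s into the Tier 0 center

seconds_per_day = 86_400
daily_tb = stream_mb_s * seconds_per_day / 1e6  # MB -> TB: ~8.6 TB/day raw

# Tier0 -> Tier1 links are quoted at 2.5 Gbits/sec.
link_gbps = 2.5
link_mb_s = link_gbps * 1000 / 8                # Gb/s -> MB/s

fraction = stream_mb_s / link_mb_s              # share of one link the raw stream fills
print(f"{daily_tb:.2f} TB/day; raw stream fills {fraction:.0%} of one 2.5 Gb/s link")
```

So the raw stream alone occupies about a third of a single Tier1 link, before any replication or derived-data traffic, which is why the hierarchy fans out to multiple national centers.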
Example: Global Virtual Observatory • Multi-wavelength astronomy, multiple surveys • Image data, source catalogs, data standards • Specialized data: spectroscopy, time series, polarization • Information archives: derived & legacy data (NED, Simbad, ADS, etc.) • Discovery tools: visualization, statistics
GVO Data Challenge • Digital representation of the sky • All-sky + deep fields • Integrated catalog and image databases • Spectra of selected samples • Size of the archived data • 40,000 square degrees • Resolution < 0.1 arcsec → > 50 trillion pixels • One band (2 bytes/pixel) → ~100 Terabytes • Multi-wavelength: 500-1000 Terabytes • Time dimension: many Petabytes • Large, globally distributed database engines • Integrated catalog and image databases • Multi-Petabyte data size • GByte/s aggregate I/O speed per site
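The pixel-count and storage figures on this slide follow directly from the survey parameters:

```python
# Survey parameters from the slide.
sky_sq_deg = 40_000            # all-sky survey area in square degrees
resolution_arcsec = 0.1        # pixel scale
bytes_per_pixel = 2            # one band at 2 bytes/pixel

arcsec_per_deg = 3600
pixels_per_deg = arcsec_per_deg / resolution_arcsec      # 36,000 pixels per degree
total_pixels = sky_sq_deg * pixels_per_deg ** 2          # ~5.2e13: "> 50 trillion pixels"
one_band_tb = total_pixels * bytes_per_pixel / 1e12      # ~104 TB: "~100 Terabytes"

print(f"{total_pixels:.2e} pixels, {one_band_tb:.0f} TB for one band")
```

Multiplying by 5-10 bands reproduces the 500-1000 Terabyte multi-wavelength estimate.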
Sloan Digital Sky Survey Data Grid
LIGO (Gravity Wave) Data Grid • Hanford and Livingston observatories → Caltech Tier1 (OC48, OC3/OC12 links) • MIT and LSC (LIGO Scientific Collaboration) Tier2 sites connected via Internet2 Abilene
Data Grid Projects
Large Data Grid Projects • Funded projects • GriPhyN USA NSF $11.9M + $1.6M 2000-2005 • EU DataGrid EU EC €10M 2001-2004 • PPDG USA DOE $9.5M 2001-2004 • TeraGrid USA NSF $53M 2001-? • iVDGL USA NSF $13.7M + $2M 2001-2006 • DataTAG EU EC €4M 2002-2004 • Proposed projects • GridPP UK PPARC >$15M? 2001-2004 • Many national projects • Initiatives in US, UK, Italy, France, NL, Germany, Japan, … • EU networking initiatives (Géant, SURFNet)
PPDG Middleware Components • Object- and file-based application services • Resource management: matchmaking service, cost estimation, request interpreter, request planner • Data handling: file replication index, file access service, file fetching service, cache manager, file mover, mass storage manager • End-to-end network services • Security domain / site boundary • Future: OO-collection export; cache & state tracking; prediction
EU DataGrid Project
GriPhyN: PetaScale Virtual-Data Grids • Users: individual investigators, workgroups, production teams • Scale: ~1 Petaflop, ~100 Petabytes • Tools: interactive user tools; request planning & scheduling tools; request execution & management tools; virtual data tools • Underlying services: resource management, security and policy, other Grid services • Distributed resources (code, storage, CPUs, networks), raw data sources, transforms
GriPhyN Research Agenda • Virtual Data technologies (fig.) • Derived data, calculable via algorithm • Instantiated 0, 1, or many times (e.g., caches) • “Fetch value” vs. “execute algorithm” • Very complex (versions, consistency, cost calculation, etc.) • LIGO example: “Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year” • For each requested data value, need to • Locate item and algorithm • Determine costs of fetching vs. calculating • Plan data movements & computations required to obtain results • Execute the plan
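The fetch-vs-calculate decision at the heart of virtual data can be sketched as a toy cost comparison. This is an illustrative model only: the class names, the cost formula, and the numbers are assumptions for the sketch, not the actual GriPhyN planner.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    """A materialized copy of a derived data product (hypothetical model)."""
    size_gb: float
    link_mb_s: float           # bandwidth to the site holding the copy

@dataclass
class Derivation:
    """The recipe to recompute the product (hypothetical model)."""
    cpu_seconds: float         # cost of rerunning the transformation

def plan(replica: Replica, derivation: Derivation) -> str:
    """Choose the cheaper option: fetch the cached copy or recompute it."""
    fetch_cost = replica.size_gb * 1000 / replica.link_mb_s   # seconds to transfer
    return "fetch" if fetch_cost <= derivation.cpu_seconds else "compute"

# A 1 GB product behind a slow 1 MB/s link loses to a 200 s recomputation:
print(plan(Replica(size_gb=1.0, link_mb_s=1.0), Derivation(cpu_seconds=200.0)))
```

A real planner must also weigh versions, consistency, and local vs. global policy, which is why the slide calls the problem "very complex".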
Virtual Data in Action • A data request may • Fetch item • Compute locally • Compute remotely • Access local data • Access remote data • Scheduling based on • Local policies • Global policies • Cost • Resource hierarchy: major facilities & archives; regional facilities & caches; local facilities & caches
GriPhyN/PPDG Data Grid Architecture • Application → Planner (produces a DAG) → Executor (DAGMan, Kangaroo) → compute resource (Globus GRAM) and storage resource (GridFTP; GRAM; SRM) • Catalog services (MCAT; GriPhyN catalogs) with replica management (GDMP) • Monitoring and information services (MDS) • Policy/security (GSI, CAS) • Reliable transfer service • (Marked components: initial solution is operational)
Catalog Architecture • Metadata Catalog (transparency wrt location): logical name → logical object name (e.g. X → logO1; Y → logO2; F.X → logO3; G(1).Y → logO4) • Replica Catalog: logical container name → physical file names (e.g. logC1 → URL1; logC2 → URL2, URL3; logC3 → URL4; logC4 → URL5, URL6); URLs give physical file locations • Derived Metadata Catalog: app-specific attributes → derivation id (e.g. i2, i10) • Derived Data Catalog (transparency wrt materialization; updated upon materialization): id, transformation, parameter → name (e.g. i1: F, X → F.X; i2: F, Y → F.Y; i10: G, P → G(P).Y) • Transformation Catalog: transformation → program, cost (e.g. F → URL:f, 10; G → URL:g, 20); URLs give program locations
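The catalog relationships on this slide can be mimicked with a few dictionaries. A minimal sketch: entries are taken from the slide where given and invented where the slide is ambiguous (in particular, the logical-object-to-container mapping is collapsed here), and this stands in for real catalog services.

```python
# Toy versions of the slide's catalogs. Mapping of logical object to
# logical container is collapsed into one step for the sketch.
metadata_catalog = {"X": "logC1", "F.X": "logC3"}        # logical name -> container
replica_catalog = {"logC1": ["URL1"], "logC3": ["URL4"]} # container -> physical copies
derived_data_catalog = {"G(P).Y": ("G", "P")}            # name -> (transformation, param)
transformation_catalog = {"F": ("URL:f", 10), "G": ("URL:g", 20)}  # -> (program, cost)

def resolve(name):
    """Return physical replicas if the object is materialized; otherwise
    return the recipe (program URL, parameter, cost) needed to derive it."""
    if name in metadata_catalog:
        return ("replicas", replica_catalog[metadata_catalog[name]])
    trans, param = derived_data_catalog[name]
    prog, cost = transformation_catalog[trans]
    return ("derive", (prog, param, cost))
```

`resolve("F.X")` finds cached copies through the metadata and replica catalogs, while `resolve("G(P).Y")` falls through to the derived data and transformation catalogs, which is exactly the location/materialization transparency the slide describes.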
Early GriPhyN Challenge Problem: CMS Data Reconstruction (April 2001; Caltech, NCSA, Wisconsin) 1) Master Condor job running at Caltech workstation 2) Launch secondary job on Wisconsin pool; input files via Globus GASS 3) 100 Monte Carlo jobs on Wisconsin Condor pool 4) 100 data files transferred via GridFTP, ~1 GB each 5) Secondary reports complete to master 6) Master starts reconstruction jobs via Globus jobmanager on NCSA Linux cluster 7) GridFTP fetches data from NCSA UniTree (GridFTP-enabled FTP server) 8) Processed Objectivity database stored to UniTree 9) Reconstruction job reports complete to master
Trace of a Condor-G Physics Run • Pre / simulation jobs / post (UW Condor) • ooHits at NCSA • ooDigis at NCSA • Delay due to script error
iVDGL: A World Grid Laboratory • International Virtual-Data Grid Laboratory • A global Grid laboratory (US, EU, Asia, …) • A place to conduct Data Grid tests “at scale” • A mechanism to create common Grid infrastructure • A facility to perform production exercises for LHC experiments • A laboratory for other disciplines to perform Data Grid tests • US part funded by NSF: Sep. 25, 2001 • $13.65M + $2M “We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science.” From NSF proposal, 2001
iVDGL Summary Information • Principal components • Tier1 sites (laboratories) • Tier2 sites (universities) • Selected Tier3 sites (universities) • Fast networks: US, Europe, transatlantic, transpacific • Grid Operations Center (GOC) • Computer Science support teams (6 UK Fellows) • Coordination, management • Proposed international participants • Initially US, EU, Japan, Australia • Other world regions later • Discussions w/ Russia, China, Pakistan, India, Brazil • Complementary EU project: DataTAG • Transatlantic network from CERN to STAR-TAP (+ people) • Initially 2.5 Gb/s
US iVDGL Proposal Participants • U Florida CMS • Caltech CMS, LIGO • UC San Diego CMS, CS • Indiana U ATLAS, iGOC • Boston U ATLAS • U Wisconsin, Milwaukee LIGO • Penn State LIGO • Johns Hopkins SDSS, NVO • U Chicago CS • U Southern California CS • U Wisconsin, Madison CS • Salish Kootenai Outreach, LIGO • Hampton U Outreach, ATLAS • U Texas, Brownsville Outreach, LIGO • Fermilab CMS, SDSS, NVO • Brookhaven ATLAS • Argonne Lab ATLAS, CS • (Categories: T2/software; CS support; T3/outreach; T1/labs, not funded)
Initial US-iVDGL Data Grid • Tier1: Fermilab (FNAL) • Proto-Tier2 and Tier3 university sites: SKC, BU, Wisconsin, Michigan, BNL, PSU, Indiana, Hampton, Caltech/UCSD, Florida, Brownsville • Other sites to be added in 2002
iVDGL Map (2002-2003) • Facilities: Tier0/1, Tier2, Tier3 • Links: 10 Gbps, 2.5 Gbps, 622 Mbps, other • Networks: SURFnet, DataTAG
“Infrastructure” Data Grid Projects • GriPhyN (US, NSF) • Petascale Virtual-Data Grids • http://www.griphyn.org/ • Particle Physics Data Grid (US, DOE) • Data Grid applications for HENP • http://www.ppdg.net/ • European Data Grid (EC, EU) • Data Grid technologies, EU deployment • http://www.eu-datagrid.org/ • TeraGrid Project (US, NSF) • Dist. supercomp. resources (13 TFlops) • http://www.teragrid.org/ • iVDGL + DataTAG (NSF, EC, others) • Global Grid lab & transatlantic network • Collaborations of application scientists & computer scientists • Focus on infrastructure development & deployment • Broad application
Data Grid Project Timeline • Q4 00: GriPhyN approved ($11.9M + $1.6M); EU DataGrid approved ($9.3M) • Q1 01: PPDG approved ($9.5M); 1st Grid coordination meeting • Q2 01: 2nd Grid coordination meeting • Q3 01: 3rd Grid coordination meeting; LHC Grid Computing Project • Q4 01: TeraGrid approved ($53M); iVDGL approved ($13.65M + $2M); DataTAG approved (€4M) • Q1 02: 4th Grid coordination meeting
Need for Common Grid Infrastructure • Grid computing sometimes compared to the electric grid • You plug in to get a resource (CPU, storage, …) • You don’t care where the resource is located • This analogy is more appropriate than originally intended • It expresses a USA viewpoint: a uniform power grid • What happens when you travel around the world? • Different frequencies: 60 Hz, 50 Hz • Different voltages: 120 V, 220 V • Different sockets! USA 2-pin, France, UK, etc. • Want to avoid this situation in Grid computing
Role of Grid Infrastructure • Provide essential common Grid services • Cannot afford to develop separate infrastructures (manpower, timing, immediate needs, etc.) • Meet needs of high-end scientific & engineering collaborations • HENP, astrophysics, GVO, earthquake, climate, space, biology, … • Already international and even global in scope • Drive future requirements • Be broadly applicable outside science • Government agencies: national, regional (EU), UN • Non-governmental organizations (NGOs) • Corporations, business networks (e.g., suppliers, R&D) • Other “virtual organizations” (see Anatomy of the Grid) • Be scalable to the global level