

Global Data Grids for 21st Century Science. Paul Avery, University of Florida. http://www.phys.ufl.edu/~avery/ avery@phys.ufl.edu. Physics Colloquium, University of Texas at Arlington, Jan. 24, 2002.


Presentation Transcript


  1. Global Data Grids for 21st Century Science Paul Avery University of Florida http://www.phys.ufl.edu/~avery/ avery@phys.ufl.edu Physics Colloquium, University of Texas at Arlington, Jan. 24, 2002 Paul Avery

  2. What is a Grid? • Grid: Geographically distributed computing resources configured for coordinated use • Physical resources & networks provide raw capability • “Middleware” software ties it together Paul Avery

  3. Applications for Grids • Climate modeling • Climate scientists visualize, annotate, & analyze Terabytes of simulation data • Biology • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour • High energy physics • 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data • Engineering • Civil engineers collaborate to design, execute, & analyze shake table experiments • A multidisciplinary analysis in aerospace couples code and data in four companies From Ian Foster Paul Avery

  4. Applications for Grids (cont.) • Application Service Providers • A home user invokes architectural design functions at an application service provider • An application service provider purchases cycles from compute cycle providers • Commercial • Scientists at a multinational soap company design a new product • Communities • An emergency response team couples real time data, weather model, population data • A community group pools members’ PCs to analyze alternative designs for a local road • Health • Hospitals and international agencies collaborate on stemming a major disease outbreak From Ian Foster Paul Avery

  5. Proto-Grid: SETI@home • Community: SETI researchers + enthusiasts • Arecibo radio data sent to users (250KB data chunks) • Over 2M PCs used Paul Avery

  6. More Advanced Proto-Grid: Evaluation of AIDS Drugs • Community • 1000s of home computer users • Philanthropic computing vendor (Entropia) • Research group (Scripps) • Common goal • Advance AIDS research Paul Avery

  7. Early Information Infrastructure • Network-centric • Simple, fixed end systems • Few embedded capabilities • Few services • No user-level quality of service • O(10^8) nodes Paul Avery

  8. Emerging Information Infrastructure • Application-centric • Heterogeneous, mobile end-systems • Many embedded capabilities (caching, resource discovery, processing, QoS) • Rich services • User-level quality of service • O(10^10) nodes • The Grid: qualitatively different, not just “faster and more reliable” Paul Avery

  9. Why Grids? • Resources for complex problems are distributed • Advanced scientific instruments (accelerators, telescopes, …) • Storage and computing • Groups of people • Communities require access to common services • Scientific collaborations (physics, astronomy, biology, eng. …) • Government agencies • Health care organizations, large corporations, … • Goal is to build “Virtual Organizations” • Make all community resources available to any VO member • Leverage strengths at different institutions • Add people & resources dynamically Paul Avery

  10. Grid Challenges • Overall goal • Coordinated sharing of resources • Technical problems to overcome • Authentication, authorization, policy, auditing • Resource discovery, access, allocation, control • Failure detection & recovery • Resource brokering • Additional issue: lack of central control & knowledge • Preservation of local site autonomy • Policy discovery and negotiation important Paul Avery

  11. Layered Grid Architecture (Analogy to Internet Architecture) • Fabric: controlling things locally (accessing, controlling resources) • Connectivity: talking to things (communications, security) • Resource: sharing single resources (negotiating access, controlling use) • Collective: managing multiple resources (ubiquitous infrastructure services) • User: specialized services (application-specific distributed services) • Internet analogy: Link, Internet, Transport, Application layers From Ian Foster Paul Avery

  12. Globus Project and Toolkit • Globus Project™ (Argonne + USC/ISI) • O(40) researchers & developers • Identify and define core protocols and services • Globus Toolkit™ • A major product of the Globus Project • Reference implementation of core protocols & services • Growing open source developer community • Globus Toolkit used by all Data Grid projects today • US: GriPhyN, PPDG, TeraGrid, iVDGL • EU: EU-DataGrid and national projects Paul Avery

  13. Globus General Approach • Define Grid protocols & APIs • Protocol-mediated access to remote resources • Integrate and extend existing standards • Develop reference implementation • Open source Globus Toolkit • Client & server SDKs, services, tools, etc. • Grid-enable wide variety of tools • FTP, SSH, Condor, SRB, MPI, … • Learn about real world problems • Deployment • Testing • Applications • (Diagram: applications atop diverse global services, core services, and diverse OS services) Paul Avery

  14. Globus Toolkit Protocols • Security (connectivity layer) • Grid Security Infrastructure (GSI) • Resource management (resource layer) • Grid Resource Allocation Management (GRAM) • Information services (resource layer) • Grid Resource Information Protocol (GRIP) • Data transfer (resource layer) • Grid File Transfer Protocol (GridFTP) Paul Avery

  15. Data Grids Paul Avery

  16. Data Intensive Science: 2000-2015 • Scientific discovery increasingly driven by IT • Computationally intensive analyses • Massive data collections • Data distributed across networks of varying capability • Geographically distributed collaboration • Dominant factor: data growth (1 Petabyte = 1000 TB) • 2000 ~0.5 Petabyte • 2005 ~10 Petabytes • 2010 ~100 Petabytes • 2015 ~1000 Petabytes? How to collect, manage, access and interpret this quantity of data? Drives demand for “Data Grids” to handle additional dimension of data access & movement Paul Avery
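The growth figures on this slide imply a roughly constant doubling rate; a quick back-of-the-envelope check (a sketch using only the slide's numbers, not figures from the talk itself):

```python
import math

# Slide figures: ~0.5 Petabyte in 2000 growing to ~1000 Petabytes by 2015.
start_pb, end_pb, years = 0.5, 1000.0, 15

doublings = math.log2(end_pb / start_pb)        # ~11 doublings over 15 years
months_per_doubling = years * 12 / doublings    # roughly 16 months per doubling

print(f"{doublings:.1f} doublings, one every {months_per_doubling:.1f} months")
```

A doubling time around 16 months outpaces the classic 18-24 month hardware doubling, which is one way to read the pressure toward Data Grids rather than ever-larger single sites.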

  17. Global Data Grid Challenge “Global scientific communities will perform computationally demanding analyses of distributed datasets that will grow by at least 3 orders of magnitude over the next decade, from the 100 Terabyte to the 100 Petabyte scale.” Paul Avery

  18. Data Intensive Physical Sciences • High energy & nuclear physics • Gravity wave searches • LIGO, GEO, VIRGO • Astronomy: Digital sky surveys • Now: Sloan Sky Survey, 2MASS • Future: VISTA, other Gigapixel arrays • “Virtual” Observatories (Global Virtual Observatory) • Time-dependent 3-D systems (simulation & data) • Earth Observation • Climate modeling • Geophysics, earthquake modeling • Fluids, aerodynamic design • Pollutant dispersal scenarios Paul Avery

  19. Data Intensive Biology and Medicine • Medical data • X-Ray, mammography data, etc. (many petabytes) • Digitizing patient records (ditto) • X-ray crystallography • Bright X-Ray sources, e.g. Argonne Advanced Photon Source • Molecular genomics and related disciplines • Human Genome, other genome databases • Proteomics (protein structure, activities, …) • Protein interactions, drug delivery • Brain scans (3-D, time dependent) • Virtual Population Laboratory (proposed) • Database of populations, geography, transportation corridors • Simulate likely spread of disease outbreaks Craig Venter keynote @SC2001 Paul Avery

  20. Data and Corporations • Corporations and Grids • National, international, global • Business units, research teams • Sales data • Transparent access to distributed databases • Corporate issues • Short term and long term partnerships • Overlapping networks • Manage, control access to data and resources • Security Paul Avery

  21. Example: High Energy Physics • “Compact” Muon Solenoid at the LHC (CERN) • (Shown beside the Smithsonian standard man for scale) Paul Avery

  22. LHC Computing Challenges • “Events” resulting from beam-beam collisions: • Signal event is obscured by 20 overlapping uninteresting collisions in same crossing • CPU time does not scale from previous generations (2000 vs 2007) Paul Avery

  23. LHC: Higgs Decay into 4 muons 10^9 events/sec, selectivity: 1 in 10^13 Paul Avery
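The selectivity figure on this slide can be turned into an event rate; a back-of-the-envelope sketch, with both constants taken from the slide:

```python
collision_rate = 1e9      # events per second (10^9 events/sec, from the slide)
selectivity = 1e-13       # 1 interesting event in 10^13 (from the slide)

signal_rate = collision_rate * selectivity          # surviving events per second
hours_between_signals = 1.0 / signal_rate / 3600.0  # ~2.8 hours per selected event

print(f"~{hours_between_signals:.1f} hours between selected events")
```

So even at a billion collisions per second, a candidate event of this rarity appears only every few hours, which is why the online trigger must discard almost everything in real time.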

  24. LHC Computing Challenges • Complexity of LHC interaction environment & resulting data • Scale: Petabytes of data per year (100 PB by ~2010-12) • Global distribution of people and resources: 1800 physicists, 150 institutes, 32 countries Paul Avery

  25. Global LHC Data Grid • Tier0: CERN • Tier1: national lab • Tier2: regional center (university, etc.) • Tier3: university workgroup • Tier4: workstation • Key ideas: • Hierarchical structure • Tier2 centers Paul Avery

  26. Global LHC Data Grid • CERN/Outside resource ratio ~1:2; Tier0 : (sum of Tier1) : (sum of Tier2) ~1:1:1 • Online system: experiment produces ~PBytes/sec; bunch crossing every 25 nsec; 100 triggers per second; each event ~1 MByte • Tier 0+1: CERN Computer Center (>20 TIPS, HPSS), fed at ~100 MBytes/sec from the online system • Tier 1: national centers (France, Italy, UK, USA) with HPSS, 2.5 Gbits/sec links • Tier 2: regional centers, ~622 Mbits/sec links • Tier 3: institutes (~0.25 TIPS, physics data cache), 100-1000 Mbits/sec • Tier 4: workstations and other portals • Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels Paul Avery
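The online-system numbers in this tier diagram are consistent with the "Petabytes per year" scale quoted earlier; a quick check, assuming a typical HEP live time of about 10^7 seconds per year (that live-time figure is an assumption, not stated on the slide):

```python
trigger_rate = 100       # triggers per second (from the diagram)
event_size = 1e6         # bytes, ~1 MByte per event (from the diagram)
live_seconds = 1e7       # assumed live running time per year

raw_rate = trigger_rate * event_size       # 1e8 B/s, i.e. the ~100 MBytes/sec into Tier 0
bytes_per_year = raw_rate * live_seconds   # 1e15 bytes = 1 Petabyte per year

print(f"{raw_rate / 1e6:.0f} MB/s, {bytes_per_year / 1e15:.0f} PB/year")
```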

  27. Example: Global Virtual Observatory • Multi-wavelength astronomy, multiple surveys • Image data, source catalogs • Specialized data: spectroscopy, time series, polarization • Information archives: derived & legacy data (NED, Simbad, ADS, etc.) • Discovery tools: visualization, statistics • Standards Paul Avery

  28. GVO Data Challenge • Digital representation of the sky • All-sky + deep fields • Integrated catalog and image databases • Spectra of selected samples • Size of the archived data • 40,000 square degrees • Resolution < 0.1 arcsec → >50 trillion pixels • One band (2 bytes/pixel): ~100 Terabytes • Multi-wavelength: 500-1000 Terabytes • Time dimension: many Petabytes • Large, globally distributed database engines • Integrated catalog and image databases • Multi-Petabyte data size • GByte/s aggregate I/O speed per site Paul Avery
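The pixel and volume figures on this slide can be verified directly; a sketch using only the constants the slide gives:

```python
sky_deg2 = 40_000                 # all-sky coverage in square degrees
pixels_per_degree = 3600 / 0.1    # 0.1 arcsec resolution -> 36,000 pixels per degree

total_pixels = sky_deg2 * pixels_per_degree ** 2   # ~5.2e13, i.e. >50 trillion pixels
one_band_bytes = total_pixels * 2                  # 2 bytes/pixel -> ~100 Terabytes

print(f"{total_pixels:.2e} pixels, {one_band_bytes / 1e12:.0f} TB per band")
```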

  29. Sloan Digital Sky Survey Data Grid Paul Avery

  30. LIGO (Gravity Wave) Data Grid • Tier1: Caltech (OC48) • Observatories: Hanford and Livingston (OC3/OC12 links), plus MIT • Tier2: LIGO Scientific Collaboration (LSC) sites • Networks: Internet2/Abilene Paul Avery

  31. Data Grid Projects Paul Avery

  32. Large Data Grid Projects • Funded projects • GriPhyN USA NSF $11.9M + $1.6M 2000-2005 • EU DataGrid EU EC €10M 2001-2004 • PPDG USA DOE $9.5M 2001-2004 • TeraGrid USA NSF $53M 2001-? • iVDGL USA NSF $13.7M + $2M 2001-2006 • DataTAG EU EC €4M 2002-2004 • Proposed projects • GridPP UK PPARC >$15M? 2001-2004 • Many national projects • Initiatives in US, UK, Italy, France, NL, Germany, Japan, … • EU networking initiatives (Géant, SURFNet) Paul Avery

  33. PPDG Middleware Components • Object- and file-based application services • Request interpreter and request planner • Matchmaking service, resource management, cost estimation • File replication index, file access service, file fetching service • Cache manager, file mover, mass storage manager • End-to-end network services • Security domain and site boundary (shown in the architecture diagram) • Future: OO-collection export; cache, state tracking; prediction Paul Avery

  34. EU DataGrid Project Paul Avery

  35. GriPhyN: PetaScale Virtual-Data Grids • Scale: ~1 Petaflop, ~100 Petabytes • Users: production teams, individual investigators, workgroups • Interactive user tools • Request planning & scheduling tools, request execution & management tools, virtual data tools • Resource management services, security and policy services, other Grid services • Transforms • Distributed resources (code, storage, CPUs, networks) • Raw data sources Paul Avery

  36. GriPhyN Research Agenda • Virtual Data technologies (fig.) • Derived data, calculable via algorithm • Instantiated 0, 1, or many times (e.g., caches) • “Fetch value” vs “execute algorithm” • Very complex (versions, consistency, cost calculation, etc) • LIGO example • “Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year” • For each requested data value, need to • Locate item location and algorithm • Determine costs of fetching vs calculating • Plan data movements & computations required to obtain results • Execute the plan Paul Avery
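The "fetch value vs execute algorithm" decision described above can be sketched as a tiny planner; the function name and cost units below are hypothetical, intended only to illustrate the logic:

```python
def plan_request(replica_exists: bool, transfer_cost: float, compute_cost: float) -> str:
    """Decide how to satisfy a virtual-data request.

    Costs are in arbitrary comparable units (e.g., estimated seconds);
    a real planner would also weigh policies, load, and reliability.
    """
    if not replica_exists:
        # No materialized copy anywhere: must execute the algorithm.
        return "compute"
    # A copy exists somewhere: take the cheaper of moving it or rederiving it.
    return "fetch" if transfer_cost <= compute_cost else "compute"

# Example: a cached strain series is cheap to move, so fetch it.
print(plan_request(replica_exists=True, transfer_cost=5, compute_cost=120))  # fetch
```

In the LIGO example above, each of the 200 requested strain segments would pass through a decision like this before any data movement or computation is scheduled.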

  37. Virtual Data in Action • Data request may • Compute locally • Compute remotely • Access local data • Access remote data • Fetch item • Scheduling based on • Local policies • Global policies • Cost • Resource hierarchy: major facilities & archives; regional facilities & caches; local facilities & caches Paul Avery

  38. GriPhyN/PPDG Data Grid Architecture • Application → DAG → Planner → DAG → Executor (DAGMAN, Kangaroo) • Catalog services (MCAT, GriPhyN catalogs) • Monitoring and info services (MDS) • Replica management (GDMP) • Policy/security (GSI, CAS) • Reliable transfer service • Compute resource (Globus GRAM) • Storage resource (GridFTP, GRAM, SRM) • (Legend: marked components have an operational initial solution) Paul Avery

  39. Catalog Architecture • Metadata Catalog: transparency wrt location; maps object names and app-specific attributes to logical object names (e.g., X → logO1, Y → logO2, F.X → logO3, G(1).Y → logO4) • Derived Metadata Catalog: app-specific attributes of derived items (ids i2, i10, …); updated upon materialization • Derived Data Catalog: transparency wrt materialization; maps id → transformation, parameter, name (e.g., i1: F.X from F applied to X; i2: F.Y; i10: G(P).Y) • Transformation Catalog: maps transformation → program location (URL) and cost (e.g., F → URL:f, cost 10; G → URL:g, cost 20) • Replica Catalog: maps logical container names (LCN) to physical file URLs (e.g., logC1 → URL1; logC2 → URL2, URL3; logC3 → URL4; logC4 → URL5, URL6) Paul Avery
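The catalog chain of slide 39 can be modeled with a few dictionaries. The names echo the slide's examples, but the structures are a simplified, hypothetical sketch (the real catalogs also track containers, versions, and per-site costs):

```python
# Four toy catalogs, loosely following slide 39's examples.
metadata_catalog = {"F.X": "logO3", "G(1).Y": "logO4"}      # derived name -> logical object
replica_catalog = {"logO3": ["URL4"], "logO4": []}           # logical object -> physical replicas
derived_data_catalog = {"F.X": ("F", "X"), "G(1).Y": ("G", "1")}   # recipe: (transform, param)
transformation_catalog = {"F": ("URL:f", 10), "G": ("URL:g", 20)}  # transform -> (program, cost)

def resolve(name: str):
    """Return a plan: fetch an existing replica, or derive via a transformation."""
    logical = metadata_catalog[name]
    replicas = replica_catalog.get(logical, [])
    if replicas:
        return ("fetch", replicas[0])
    transform, param = derived_data_catalog[name]
    program, cost = transformation_catalog[transform]
    return ("derive", program, param, cost)

print(resolve("F.X"))     # a replica exists: ('fetch', 'URL4')
print(resolve("G(1).Y"))  # no replica: ('derive', 'URL:g', '1', 20)
```

The point of the layering is that the requester never names a physical file: location transparency comes from the replica catalog, and materialization transparency from the derived data and transformation catalogs.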

  40. Early GriPhyN Challenge Problem: CMS Data Reconstruction (April 2001; Caltech, NCSA, Wisconsin) • Master Condor job running at Caltech workstation • 2) Launch secondary job on Wisconsin pool; input files via Globus GASS • 3) 100 Monte Carlo jobs on Wisconsin Condor pool • 4) 100 data files transferred via GridFTP, ~1 GB each • 5) Secondary reports complete to master • 6) Master starts reconstruction jobs via Globus jobmanager on NCSA Linux cluster • 7) GridFTP fetches data from UniTree (NCSA UniTree: GridFTP-enabled FTP server) • 8) Processed Objectivity database stored to UniTree • 9) Reconstruction job reports complete to master Paul Avery

  41. Trace of a Condor-G Physics Run • Pre / simulation jobs / post (UW Condor) • ooHits at NCSA • ooDigis at NCSA • Delay due to script error Paul Avery

  42. iVDGL: A World Grid Laboratory • International Virtual-Data Grid Laboratory • A global Grid laboratory (US, EU, Asia, …) • A place to conduct Data Grid tests “at scale” • A mechanism to create common Grid infrastructure • A facility to perform production exercises for LHC experiments • A laboratory for other disciplines to perform Data Grid tests • US part funded by NSF: Sep. 25, 2001 • $13.65M + $2M • “We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science.” From NSF proposal, 2001 Paul Avery

  43. iVDGL Summary Information • Principal components • Tier1 sites (laboratories) • Tier2 sites (universities) • Selected Tier3 sites (universities) • Fast networks: US, Europe, transatlantic, transpacific • Grid Operations Center (GOC) • Computer Science support teams (6 UK Fellows) • Coordination, management • Proposed international participants • Initially US, EU, Japan, Australia • Other world regions later • Discussions w/ Russia, China, Pakistan, India, Brazil • Complementary EU project: DataTAG • Transatlantic network from CERN to STAR-TAP (+ people) • Initially 2.5 Gb/s Paul Avery

  44. US iVDGL Proposal Participants • U Florida (CMS) • Caltech (CMS, LIGO) • UC San Diego (CMS, CS) • Indiana U (ATLAS, iGOC) • Boston U (ATLAS) • U Wisconsin, Milwaukee (LIGO) • Penn State (LIGO) • Johns Hopkins (SDSS, NVO) • U Chicago (CS) • U Southern California (CS) • U Wisconsin, Madison (CS) • Salish Kootenai (Outreach, LIGO) • Hampton U (Outreach, ATLAS) • U Texas, Brownsville (Outreach, LIGO) • Fermilab (CMS, SDSS, NVO) • Brookhaven (ATLAS) • Argonne Lab (ATLAS, CS) • Categories: T2/software, CS support, T3/outreach, T1/labs (not funded) Paul Avery

  45. Initial US-iVDGL Data Grid (map) • Site types: Tier1 (FNAL), proto-Tier2, Tier3 university • Sites shown: SKC, BU, Wisconsin, Michigan, BNL, PSU, Fermilab, Indiana, Hampton, Caltech/UCSD, Florida, Brownsville • Other sites to be added in 2002 Paul Avery

  46. iVDGL Map (2002-2003) • Site types: Tier0/1 facility, Tier2 facility, Tier3 facility • Links: 10 Gbps, 2.5 Gbps, 622 Mbps, other • Transatlantic connectivity via SURFNet and DataTAG Paul Avery

  47. “Infrastructure” Data Grid Projects • GriPhyN (US, NSF) • Petascale Virtual-Data Grids • http://www.griphyn.org/ • Particle Physics Data Grid (US, DOE) • Data Grid applications for HENP • http://www.ppdg.net/ • European Data Grid (EC, EU) • Data Grid technologies, EU deployment • http://www.eu-datagrid.org/ • TeraGrid Project (US, NSF) • Dist. supercomp. resources (13 TFlops) • http://www.teragrid.org/ • iVDGL + DataTAG (NSF, EC, others) • Global Grid lab & transatlantic network • Collaborations of application scientists & computer scientists • Focus on infrastructure development & deployment • Broad application Paul Avery

  48. Data Grid Project Timeline • Q4 00: GriPhyN approved ($11.9M + $1.6M); EU DataGrid approved ($9.3M) • Q1 01: PPDG approved ($9.5M); 1st Grid coordination meeting • Q2 01: 2nd Grid coordination meeting • Q3 01: 3rd Grid coordination meeting; LHC Grid Computing Project • Q4 01: TeraGrid approved ($53M); iVDGL approved ($13.65M + $2M); DataTAG approved (€4M) • Q1 02: 4th Grid coordination meeting Paul Avery

  49. Need for Common Grid Infrastructure • Grid computing sometimes compared to electric grid • You plug in to get a resource (CPU, storage, …) • You don’t care where the resource is located • This analogy is more appropriate than originally intended • It expresses a USA viewpoint → uniform power grid • What happens when you travel around the world? • Different frequencies: 60 Hz, 50 Hz • Different voltages: 120 V, 220 V • Different sockets! USA 2-pin, France, UK, etc. • Want to avoid this situation in Grid computing Paul Avery

  50. Role of Grid Infrastructure • Provide essential common Grid services • Cannot afford to develop separate infrastructures (manpower, timing, immediate needs, etc.) • Meet needs of high-end scientific & engineering collaborations • HENP, astrophysics, GVO, earthquake, climate, space, biology, … • Already international and even global in scope • Drive future requirements • Be broadly applicable outside science • Government agencies: national, regional (EU), UN • Non-governmental organizations (NGOs) • Corporations, business networks (e.g., suppliers, R&D) • Other “virtual organizations” (see “Anatomy of the Grid”) • Be scalable to the global level Paul Avery
