Global Data Grids for 21st Century Science
Paul Avery, University of Florida
http://www.phys.ufl.edu/~avery/ | avery@phys.ufl.edu
GriPhyN/iVDGL Outreach Workshop, University of Texas, Brownsville, March 1, 2002
What is a Grid? • Grid: Geographically distributed computing resources configured for coordinated use • Physical resources & networks provide raw capability • “Middleware” software ties it together
What Are Grids Good For? • Climate modeling • Climate scientists visualize, annotate, & analyze Terabytes of simulation data • Biology • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour • High energy physics • 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data • Engineering • Civil engineers collaborate to design, execute, & analyze shake table experiments • A multidisciplinary analysis in aerospace couples code and data in four companies From Ian Foster
What Are Grids Good For? • Application Service Providers • A home user invokes architectural design functions at an application service provider… • …which purchases computing cycles from cycle providers • Commercial • Scientists at a multinational toy company design a new product • Cities, communities • An emergency response team couples real time data, weather model, population data • A community group pools members’ PCs to analyze alternative designs for a local road • Health • Hospitals and international agencies collaborate on stemming a major disease outbreak From Ian Foster
Proto-Grid: SETI@home • Community: SETI researchers + enthusiasts • Arecibo radio data sent to users (250KB data chunks) • Over 2M PCs used
More Advanced Proto-Grid: Evaluation of AIDS Drugs • Community • Research group (Scripps) • 1000s of PC owners • Vendor (Entropia) • Common goal • Drug design • Advance AIDS research
Why Grids? • Resources for complex problems are distributed • Advanced scientific instruments (accelerators, telescopes, …) • Storage and computing • Groups of people • Communities require access to common services • Scientific collaborations (physics, astronomy, biology, eng. …) • Government agencies • Health care organizations, large corporations, … • Goal is to build “Virtual Organizations” • Make all community resources available to any VO member • Leverage strengths at different institutions • Add people & resources dynamically
Grids: Why Now? • Moore’s law improvements in computing • Highly functional end systems • Burgeoning wired and wireless Internet connections • Universal connectivity • Changing modes of working and problem solving • Teamwork, computation • Network exponentials • (Next slide)
Network Exponentials & Collaboration • Network vs. computer performance • Computer speed doubles every 18 months • Network speed doubles every 9 months • Difference = order of magnitude per 5 years • 1986 to 2000 • Computers: x 500 • Networks: x 340,000 • 2001 to 2010? • Computers: x 60 • Networks: x 4000 Scientific American (Jan-2001)
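A quick back-of-the-envelope check of the doubling-time claim above, as a sketch that uses only the 18-month and 9-month figures quoted on the slide:

```python
# Rough check of the "order of magnitude per 5 years" claim:
# if computers double every 18 months and networks every 9 months,
# the network/computer gap grows by 2**(60/9) / 2**(60/18) over 5 years.
def growth(doubling_months, years):
    """Growth factor over `years` given a doubling time in months."""
    return 2 ** (years * 12 / doubling_months)

years = 5
computers = growth(18, years)   # ~10x
networks = growth(9, years)     # ~100x
print(f"Computers: x{computers:.0f}, Networks: x{networks:.0f}, "
      f"gap: x{networks / computers:.0f}")   # gap ~10x = one order of magnitude
```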
Grid Challenges • Overall goal: Coordinated sharing of resources • Technical problems to overcome • Authentication, authorization, policy, auditing • Resource discovery, access, allocation, control • Failure detection & recovery • Resource brokering • Additional issue: lack of central control & knowledge • Preservation of local site autonomy • Policy discovery and negotiation important
Layered Grid Architecture (Analogy to Internet Architecture) • Application: specialized, application-specific distributed services (the user’s view) • Collective: managing multiple resources; ubiquitous infrastructure services • Resource: sharing single resources; negotiating access, controlling use • Connectivity: talking to things; communications, security • Fabric: controlling things locally; accessing and controlling resources • Rough correspondence to the Internet protocol stack: Application, Transport, Internet, Link From Ian Foster
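One way to keep the layers straight is to list the kinds of services that live in each one. The mapping below is purely illustrative (the example services are my own shorthand, not an authoritative assignment):

```python
# Illustrative (not authoritative) mapping of Grid architecture layers
# to the kinds of services each layer typically hosts, per the slide above.
GRID_LAYERS = {
    "Application":  ["experiment analysis frameworks", "science portals"],
    "Collective":   ["resource brokering", "replica management", "monitoring"],
    "Resource":     ["job submission (GRAM-style)", "data access (GridFTP-style)"],
    "Connectivity": ["authentication (GSI-style credentials)", "transport protocols"],
    "Fabric":       ["compute clusters", "storage systems", "networks"],
}

for layer, services in GRID_LAYERS.items():
    print(f"{layer:>12}: {', '.join(services)}")
```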
Globus Project and Toolkit • Globus Project™ (Argonne + USC/ISI) • O(40) researchers & developers • Identify and define core protocols and services • Globus Toolkit™ 2.0 • A major product of the Globus Project • Reference implementation of core protocols & services • Growing open source developer community • Globus Toolkit used by all Data Grid projects today • US: GriPhyN, PPDG, TeraGrid, iVDGL • EU: EU-DataGrid and national projects • Recent announcement of applying “web services” to Grids • Keeps Grids in the commercial mainstream • GT 3.0
Globus General Approach • Define Grid protocols & APIs • Protocol-mediated access to remote resources • Integrate and extend existing standards • Develop reference implementation • Open source Globus Toolkit • Client & server SDKs, services, tools, etc. • Grid-enable wide variety of tools • FTP, SSH, Condor, SRB, MPI, … • Learn about real world problems • Deployment • Testing • Applications • (Figure: hourglass of applications sitting on diverse global services, core services, and diverse resources) From Ian Foster
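A toy sketch of the “protocol-mediated access to remote resources” idea: application code talks to every storage resource through one interface, and each backend adapts it. The classes here are invented for illustration and are not the Globus Toolkit API:

```python
# Hypothetical illustration only: uniform access interface over diverse resources.
from abc import ABC, abstractmethod

class StorageResource(ABC):
    @abstractmethod
    def fetch(self, logical_name: str) -> bytes:
        """Return the bytes of a named data object."""

class LocalDisk(StorageResource):
    def fetch(self, logical_name: str) -> bytes:
        with open(logical_name, "rb") as f:
            return f.read()

class RemoteGridSite(StorageResource):
    def __init__(self, host: str):
        self.host = host
    def fetch(self, logical_name: str) -> bytes:
        # In a real deployment this would be a GridFTP-style transfer,
        # authenticated with the user's grid credential.
        raise NotImplementedError(f"would transfer {logical_name} from {self.host}")

def analyze(resource: StorageResource, name: str) -> int:
    """Application code is identical no matter where the data lives."""
    return len(resource.fetch(name))
```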
Data Intensive Science: 2000-2015 • Scientific discovery increasingly driven by IT • Computationally intensive analyses • Massive data collections • Data distributed across networks of varying capability • Geographically distributed collaboration • Dominant factor: data growth (1 Petabyte = 1000 TB) • 2000 ~0.5 Petabyte • 2005 ~10 Petabytes • 2010 ~100 Petabytes • 2015 ~1000 Petabytes? • How to collect, manage, access and interpret this quantity of data? • Drives demand for “Data Grids” to handle additional dimension of data access & movement
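As a rough consistency check, a sketch that computes the average growth rate implied by the two endpoint figures quoted above:

```python
# Implied average growth of the data volumes quoted above
# (~0.5 PB in 2000 growing to ~1000 PB by 2015).
start_pb, end_pb, years = 0.5, 1000.0, 15
annual_factor = (end_pb / start_pb) ** (1 / years)
print(f"Average growth: x{annual_factor:.2f} per year "
      f"(~x{annual_factor**5:.0f} every 5 years)")
```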
Data Intensive Physical Sciences • High energy & nuclear physics • Including new experiments at CERN’s Large Hadron Collider • Gravity wave searches • LIGO, GEO, VIRGO • Astronomy: Digital sky surveys • Sloan Digital Sky Survey, VISTA, other Gigapixel arrays • “Virtual” Observatories (multi-wavelength astronomy) • Time-dependent 3-D systems (simulation & data) • Earth Observation, climate modeling • Geophysics, earthquake modeling • Fluids, aerodynamic design • Pollutant dispersal scenarios
Data Intensive Biology and Medicine • Medical data • X-Ray, mammography data, etc. (many petabytes) • Digitizing patient records (ditto) • X-ray crystallography • Bright X-Ray sources, e.g. Argonne Advanced Photon Source • Molecular genomics and related disciplines • Human Genome, other genome databases • Proteomics (protein structure, activities, …) • Protein interactions, drug delivery • Brain scans (3-D, time dependent) • Virtual Population Laboratory (proposed) • Database of populations, geography, transportation corridors • Simulate likely spread of disease outbreaks Craig Venter keynote @SC2001
Example: High Energy Physics • (Figure: the “Compact” Muon Solenoid detector at the LHC (CERN), shown next to a “Smithsonian standard man” for scale)
LHC Computing Challenges • Complexity of LHC interaction environment & resulting data • Scale: Petabytes of data per year (100 PB by ~2010-12) • Global distribution of people and resources • 1800 physicists, 150 institutes, 32 countries
Global LHC Data Grid • Tier0: CERN • Tier1: National laboratory • Tier2: Regional center (university, etc.) • Tier3: University workgroup • Tier4: Workstation • Key ideas: hierarchical structure, Tier2 centers
Global LHC Data Grid (data flow) • Experiment: ~PBytes/sec off the detector; the online system filters this to ~100 MBytes/sec (bunch crossings every 25 nsec, ~100 triggers per second, each event ~1 MByte) • Tier 0+1: CERN Computer Center (> 20 TIPS, HPSS mass storage) • 2.5 Gbits/sec links to Tier 1 national centers (France, Italy, UK, USA) • ~622 Mbits/sec links to Tier 2 regional centers (each with HPSS) • Tier 3: institute servers (~0.25 TIPS, physics data cache), with 100-1000 Mbits/sec to Tier 4 workstations and other portals • CERN/outside resource ratio ~1:2; Tier0 : ΣTier1 : ΣTier2 ~ 1:1:1 • Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels
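A quick sanity check of the trigger and event-size figures above. This is a sketch; the effective running time per year (~10^7 seconds) is an assumed round number, not a figure from the slide:

```python
# Back-of-the-envelope event data rate from the trigger figures above.
trigger_rate_hz = 100          # events written per second (from the slide)
event_size_mb = 1.0            # ~1 MByte per event (from the slide)
seconds_per_year = 1e7         # assumed effective running time per year

rate_mb_per_s = trigger_rate_hz * event_size_mb
yearly_pb = rate_mb_per_s * seconds_per_year / 1e9   # MB -> PB
print(f"~{rate_mb_per_s:.0f} MB/s to storage, "
      f"~{yearly_pb:.1f} PB of raw events per year")
```

This reproduces the ~100 MBytes/sec online rate and a raw-event volume of order a petabyte per year, consistent with the “Petabytes of data per year” scale quoted earlier.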
Sloan Digital Sky Survey Data Grid
LIGO (Gravity Wave) Data Grid • Tier1 at Caltech, connected to the Hanford and Livingston observatories and to MIT over OC3-OC48 links • Tier2 centers at LIGO Scientific Collaboration (LSC) institutions, reached over Internet2/Abilene
Data Grid Projects • Particle Physics Data Grid (US, DOE) • Data Grid applications for HENP expts. • GriPhyN (US, NSF) • Petascale Virtual-Data Grids • iVDGL (US, NSF) • Global Grid lab • TeraGrid (US, NSF) • Dist. supercomp. resources (13 TFlops) • European Data Grid (EU, EC) • Data Grid technologies, EU deployment • CrossGrid (EU, EC) • Data Grid technologies, EU • DataTAG (EU, EC) • Transatlantic network, Grid applications • Japanese Grid Project (APGrid?) (Japan) • Grid deployment throughout Japan • Collaborations of application scientists & computer scientists • Infrastructure devel. & deployment • Globus based
Coordination of U.S. Grid Projects • Three U.S. projects • PPDG: HENP experiments, short term tools, deployment • GriPhyN: Data Grid research, Virtual Data, VDT deliverable • iVDGL: Global Grid laboratory • Coordination of PPDG, GriPhyN, iVDGL • Common experiments + personnel, management integration • iVDGL as “joint” PPDG + GriPhyN laboratory • Joint meetings (Jan. 2002, April 2002, Sept. 2002) • Joint architecture creation (GriPhyN, PPDG) • Adoption of VDT as common core Grid infrastructure • Common Outreach effort (GriPhyN + iVDGL) • New TeraGrid project (Aug. 2001) • 13 TFlops across 4 sites, 40 Gb/s networking • Goal: integrate into iVDGL, adopt VDT, common Outreach
Worldwide Grid Coordination • Two major clusters of projects • “US based”: built around the GriPhyN Virtual Data Toolkit (VDT) • “EU based”: different packaging of similar components
GriPhyN = App. Science + CS + Grids • GriPhyN = Grid Physics Network • US-CMS High Energy Physics • US-ATLAS High Energy Physics • LIGO/LSC Gravity wave research • SDSS Sloan Digital Sky Survey • Strong partnership with computer scientists • Design and implement production-scale grids • Develop common infrastructure, tools and services (Globus based) • Integration into the 4 experiments • Broad application to other sciences via “Virtual Data Toolkit” • Strong outreach program • Multi-year project • R&D for grid architecture (funded at $11.9M + $1.6M) • Integrate Grid infrastructure into experiments through VDT
GriPhyN Institutions • U Florida • U Chicago • Boston U • Caltech • U Wisconsin, Madison • USC/ISI • Harvard • Indiana • Johns Hopkins • Northwestern • Stanford • U Illinois at Chicago • U Penn • U Texas, Brownsville • U Wisconsin, Milwaukee • UC Berkeley • UC San Diego • San Diego Supercomputer Center • Lawrence Berkeley Lab • Argonne • Fermilab • Brookhaven
GriPhyN: PetaScale Virtual-Data Grids • Scale: ~1 Petaflop of computing, ~100 Petabytes of data • Users: production teams, individual investigators, and workgroups, via interactive user tools • Service layers: virtual data tools; request planning & scheduling tools; request execution & management tools • Underlying Grid services: resource management, security and policy, other Grid services • Fabric: transforms, raw data sources, and distributed resources (code, storage, CPUs, networks)
GriPhyN Research Agenda • Virtual Data technologies (fig.) • Derived data, calculable via algorithm • Instantiated 0, 1, or many times (e.g., caches) • “Fetch value” vs “execute algorithm” • Very complex (versions, consistency, cost calculation, etc.) • LIGO example • “Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year” • For each requested data value, need to • Locate item and algorithm • Determine costs of fetching vs calculating • Plan data movements & computations required to obtain results • Execute the plan
Virtual Data in Action • A data request may: compute locally, compute remotely, access local data, or access (fetch) remote data • Scheduling based on local policies, global policies, and cost • Data flows across a hierarchy of major facilities/archives, regional facilities/caches, and local facilities/caches
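A minimal sketch of the “fetch value vs. execute algorithm” decision described on the two slides above. The cost model, data classes, and example URLs are invented for illustration; a real planner would consult the Grid catalogs and policy services:

```python
# Toy planner for a virtual-data request: materialize a derived data product
# by fetching an existing replica or by re-running its transformation,
# whichever the (illustrative) cost model says is cheaper.
from dataclasses import dataclass

@dataclass
class Replica:
    url: str
    size_gb: float
    bandwidth_gb_s: float      # achievable transfer rate from this site

@dataclass
class Transformation:
    program: str
    cpu_hours: float

def plan(replicas: list[Replica], transform: Transformation,
         cpu_hour_cost: float = 1.0) -> str:
    """Return a one-step plan: fetch the cheapest replica or recompute."""
    fetch_costs = [r.size_gb / r.bandwidth_gb_s / 3600 for r in replicas]  # hours
    best_fetch = min(fetch_costs, default=float("inf"))
    recompute = transform.cpu_hours * cpu_hour_cost
    if best_fetch < recompute:
        best = replicas[fetch_costs.index(best_fetch)]
        return f"fetch {best.url} (~{best_fetch:.2f} h)"
    return f"run {transform.program} (~{recompute:.2f} h of CPU)"

# Hypothetical example: one remote replica vs. a 4-CPU-hour recomputation.
print(plan([Replica("gsiftp://tier1.example.org/strain.h5", 200, 0.03)],
           Transformation("compute_strain", cpu_hours=4)))
```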
GriPhyN Research Agenda (cont.) • Execution management • Co-allocation of resources (CPU, storage, network transfers) • Fault tolerance, error reporting • Interaction, feedback to planning • Performance analysis (with PPDG) • Instrumentation and measurement of all grid components • Understand and optimize grid performance • Virtual Data Toolkit (VDT) • VDT = virtual data services + virtual data tools • One of the primary deliverables of R&D effort • Technology transfer mechanism to other scientific domains
GriPhyN/PPDG Data Grid Architecture • The application submits a request as a DAG to the Planner, which hands a concrete DAG to the Executor (DAGMan, Kangaroo) • Supporting services: Catalog Services (MCAT, GriPhyN catalogs), Monitoring and Info Services (MDS), Replica Management (GDMP), Policy/Security (GSI, CAS), Reliable Transfer Service • Underlying resources: Compute Resource (Globus GRAM) and Storage Resource (GridFTP, GRAM, SRM) • (= initial solution is operational)
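A minimal sketch of the planner/executor split above: the planner produces a DAG of concrete jobs, and an executor runs each node once its dependencies are satisfied. This is a toy stand-in for the dependency logic only, not the actual DAGMan interface:

```python
# Toy DAG executor: run each job once all of its parents have completed.
# Real systems (e.g., Condor DAGMan) add persistent state, retries, and
# remote submission; this only shows the dependency-ordering logic.
from collections import deque

def execute_dag(jobs: dict[str, list[str]]) -> list[str]:
    """jobs maps job name -> list of parent jobs it depends on.
    Returns one valid execution order (a topological sort)."""
    children = {j: [] for j in jobs}
    pending = {j: len(parents) for j, parents in jobs.items()}
    for job, parents in jobs.items():
        for p in parents:
            children[p].append(job)
    ready = deque(j for j, n in pending.items() if n == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)          # a real executor would submit the job here
        for child in children[job]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    return order

# Hypothetical example: stage in data, run two analyses, then merge.
dag = {"stage_in": [], "analyze_A": ["stage_in"],
       "analyze_B": ["stage_in"], "merge": ["analyze_A", "analyze_B"]}
print(execute_dag(dag))   # e.g. ['stage_in', 'analyze_A', 'analyze_B', 'merge']
```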
Catalog Architecture • Metadata Catalog: object name → logical object name (transparency with respect to location) • Derived Metadata Catalog: application-specific attributes → derived object id (transparency with respect to materialization) • Derived Data Catalog: id → transformation, parameters, and object name; updated upon materialization • Transformation Catalog: transformation → program location (URL) and cost; programs live in program storage • Replica Catalog: logical container name → physical file names (URLs) pointing into physical file storage
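A toy illustration of the lookup chain these catalogs support, from a metadata query down to physical replicas. The catalogs are plain dictionaries here and every entry, name, and URL is invented for illustration:

```python
# Illustrative resolution chain: metadata -> logical object -> replica URLs,
# falling back to the transformation recipe if nothing is materialized.
metadata_catalog = {            # app-specific attributes -> logical object name
    ("run", 42, "calibrated"): "logO3",
}
derived_data_catalog = {        # logical object -> (transformation, input)
    "logO3": ("F", "X"),
}
transformation_catalog = {      # transformation -> (program URL, cost)
    "F": ("http://repo.example.org/bin/F", 10),
}
replica_catalog = {             # logical object -> physical replica URLs
    "logO3": ["gsiftp://tier1.example.org/d/logO3",
              "gsiftp://tier2.example.org/d/logO3"],
}

def resolve(attrs):
    """Return existing replicas if the object is materialized,
    otherwise the recipe (program URL, cost) needed to derive it."""
    logical = metadata_catalog[attrs]
    replicas = replica_catalog.get(logical, [])
    if replicas:
        return ("fetch", replicas)
    transform, _input = derived_data_catalog[logical]
    return ("derive", transformation_catalog[transform])

print(resolve(("run", 42, "calibrated")))
```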
iVDGL: A Global Grid Laboratory • International Virtual-Data Grid Laboratory • A global Grid laboratory (US, EU, South America, Asia, …) • A place to conduct Data Grid tests “at scale” • A mechanism to create common Grid infrastructure • A facility to perform production exercises for LHC experiments • A laboratory for other disciplines to perform Data Grid tests • A focus of outreach efforts to small institutions • Funded for $13.65M by NSF • “We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science.” From NSF proposal, 2001
iVDGL Components • Computing resources • Tier1, Tier2, Tier3 sites • Networks • USA (TeraGrid, Internet2, ESNET), Europe (Géant, …) • Transatlantic (DataTAG), Transpacific, AMPATH, … • Grid Operations Center (GOC) • Indiana (2 people) • Joint work with TeraGrid on GOC development • Computer Science support teams • Support, test, upgrade GriPhyN Virtual Data Toolkit • Outreach effort • Integrated with GriPhyN • Coordination, interoperability
Current iVDGL Participants • Initial experiments (funded by NSF proposal) • CMS, ATLAS, LIGO, SDSS, NVO • U.S. Universities and laboratories • (Next slide) • Partners • TeraGrid • EU DataGrid + EU national projects • Japan (AIST, TITECH) • Australia • Complementary EU project: DataTAG • 2.5 Gb/s transatlantic network
U.S. iVDGL Proposal Participants • U Florida: CMS • Caltech: CMS, LIGO • UC San Diego: CMS, CS • Indiana U: ATLAS, GOC • Boston U: ATLAS • U Wisconsin, Milwaukee: LIGO • Penn State: LIGO • Johns Hopkins: SDSS, NVO • U Chicago/Argonne: CS • U Southern California: CS • U Wisconsin, Madison: CS • Salish Kootenai: Outreach, LIGO • Hampton U: Outreach, ATLAS • U Texas, Brownsville: Outreach, LIGO • Fermilab: CMS, SDSS, NVO • Brookhaven: ATLAS • Argonne Lab: ATLAS, CS • Site categories: Tier2/software, CS support, Tier3/outreach, Tier1/labs (funded elsewhere)
Initial US-iVDGL Data Grid • Site types: Tier1 (FNAL), proto-Tier2, Tier3 university • Sites on the initial map: Fermilab, BNL, Caltech, UCSD, Florida, Indiana, Boston U, Wisconsin, JHU, PSU, SKC, Hampton, Brownsville • Other sites to be added in 2002
iVDGL Map (2002-2003) • Map legend: Tier0/1 facility, Tier2 facility, Tier3 facility; 10 Gbps, 2.5 Gbps, 622 Mbps, and other links • Transatlantic connectivity via SURFnet and DataTAG • Later: Brazil, Pakistan, Russia, China
Summary • Data Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing • The iVDGL will provide vast experience for new collaborations • Many challenges during the coming transition • New grid projects will provide rich experience and lessons • Difficult to predict situation even 3-5 years ahead
Grid References • Grid Book • www.mkp.com/grids • Globus • www.globus.org • Global Grid Forum • www.gridforum.org • TeraGrid • www.teragrid.org • EU DataGrid • www.eu-datagrid.org • PPDG • www.ppdg.net • GriPhyN • www.griphyn.org • iVDGL • www.ivdgl.org