Grid Computing Ian Foster Mathematics and Computer Science Division Argonne National Laboratory and Department of Computer Science The University of Chicago http://www.mcs.anl.gov/~foster Seminar at Fermilab, October 31st, 2001
Issues I Will Address • Grids in a nutshell • Problem statement • Major Grid projects • Grid architecture • Globus Project™ and Toolkit™ • HENP data grid projects • GriPhyN, PPDG, iVDGL
The Grid Problem Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
Grid Communities & Applications: Data Grids for High Energy Physics • [Tiered data grid diagram; image courtesy Harvey Newman, Caltech] The Online System (detector output ~PBytes/sec) feeds an offline processor farm (~20 TIPS) and the Tier 0 CERN Computer Centre at ~100 MBytes/sec; Tier 1 regional centres (FermiLab ~4 TIPS, plus France, Germany, and Italy regional centres) connect at ~622 Mbits/sec (or air freight, deprecated); Tier 2 centres (~1 TIPS each, e.g., Caltech) connect onward at ~622 Mbits/sec; institute servers (~0.25 TIPS) hold the physics data cache and serve Tier 4 physicist workstations at ~1 MBytes/sec; HPSS provides archival storage at each centre • There is a “bunch crossing” every 25 nsecs and ~100 “triggers” per second; each triggered event is ~1 MByte in size • Physicists work on analysis “channels”; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server • 1 TIPS is approximately 25,000 SpecInt95 equivalents
Grid Communities and Applications: Network for Earthquake Engineering Simulation • NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other • On-demand access to experiments, data streams, computing, archives, collaboration • NEESgrid: Argonne, Michigan, NCSA, UIUC, USC • www.neesgrid.org
Access Grid • Collaborative work among large groups • ~50 sites worldwide • Uses Grid services for discovery, security • Node components include a presenter mic, presenter camera, ambient (tabletop) mic, and audience camera • www.scglobal.org • Access Grid: Argonne, others • www.accessgrid.org
Why Grids? • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour • 1,000 physicists worldwide pool resources for petaop analyses of petabytes of data • Civil engineers collaborate to design, execute, & analyze shake table experiments • Climate scientists visualize, annotate, & analyze terabyte simulation datasets • An emergency response team couples real-time data, weather models, and population data
Why Grids? (contd) • A multidisciplinary analysis in aerospace couples code and data in four companies • A home user invokes architectural design functions at an application service provider • An application service provider purchases cycles from compute cycle providers • Scientists working for a multinational soap company design a new product • A community group pools members’ PCs to analyze alternative designs for a local road
Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Sharing always conditional: issues of trust, policy, negotiation, payment, … • Coordinated problem solving • Beyond client-server: distributed data analysis, computation, collaboration, … • Dynamic, multi-institutional virtual orgs • Community overlays on classic org structures • Large or small, static or dynamic
A Little History • Early 90s • Gigabit testbeds, metacomputing • Mid to late 90s • Early experiments (e.g., I-WAY), software projects (e.g., Globus), application experiments • 2001 • Major application communities emerging • Major infrastructure deployments are underway • Rich technology base has been constructed • Global Grid Forum: >1000 people on mailing lists, 192 orgs at last meeting, 28 countries
The Grid World: Current Status • Dozens of major Grid projects in scientific & technical computing/research & education • Deployment, application, technology • Considerable consensus on key concepts and technologies • Membership, security, resource discovery, resource management, … • Global Grid Forum has emerged as a significant force • And first “Grid” proposals at IETF
Selected Major Grid Projects • [Table of deployment, application, and technology projects, several marked as new; not reproduced here] • Also many technology R&D projects: e.g., Condor, NetSolve, Ninf, NWS • See also www.gridforum.org
Grid Architecture & Globus Toolkit™ • The question: • What is needed for resource sharing & coordinated problem solving in dynamic virtual organizations (VOs)? • The answer: • Major issues identified: membership, resource discovery & access, …, … • Grid architecture captures core elements, emphasizing pre-eminent role of protocols • Globus Toolkit™ has emerged as de facto standard for major protocols & services
Layered Grid Architecture (By Analogy to Internet Architecture) • Application layer (corresponds to the Internet Application layer) • Collective layer (“Coordinating multiple resources”): ubiquitous infrastructure services, app-specific distributed services • Resource layer (“Sharing single resources”): negotiating access, controlling use • Connectivity layer (“Talking to things”): communication (Internet protocols) & security; corresponds to the Transport and Internet layers • Fabric layer (“Controlling things locally”): access to, & control of, resources; corresponds to the Link layer • For more info: www.globus.org/research/papers/anatomy.pdf
Grid Services Architecture (1):Fabric Layer • Just what you would expect: the diverse mix of resources that may be shared • Individual computers, Condor pools, file systems, archives, metadata catalogs, networks, sensors, etc., etc. • Few constraints on low-level technology: connectivity and resource level protocols form the “neck in the hourglass” • Globus toolkit provides a few selected components (e.g., bandwidth broker)
Grid Services Architecture (2):Connectivity Layer Protocols & Services • Communication • Internet protocols: IP, DNS, routing, etc. • Security: Grid Security Infrastructure (GSI) • Uniform authentication & authorization mechanisms in multi-institutional setting • Single sign-on, delegation, identity mapping • Public key technology, SSL, X.509, GSS-API (several Internet drafts document extensions) • Supporting infrastructure: Certificate Authorities, key management, etc.
GSI in Action: “Create Processes at A and B that Communicate & Access Files at C” • Single sign-on via “grid-id” and generation of a proxy credential (or retrieval of a proxy credential from an online repository) • Remote process creation requests, with mutual authentication, go to GSI-enabled GRAM servers at Site A (Kerberos) and Site B (Unix); each server authorizes the user, maps the grid-id to a local id (Kerberos ticket or local Unix id), creates the process, and generates credentials (restricted proxies) for it • The two processes communicate with mutual authentication • A remote file access request carrying a restricted proxy, again with mutual authentication, goes to a GSI-enabled FTP server at Site C (Kerberos storage system), which authorizes, maps to a local id, and accesses the file
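To make the delegation chain concrete, here is a minimal Python sketch of the idea behind GSI proxy credentials. It is purely illustrative: the Credential class and sign_proxy helper are hypothetical stand-ins, not the Globus GSI API, and no real X.509 handling or cryptography is performed.

```python
# Conceptual sketch only: hypothetical names, no real X.509 or crypto.
from dataclasses import dataclass

@dataclass
class Credential:
    subject: str              # distinguished name of the holder
    issuer: str               # who signed this credential
    restrictions: tuple = ()  # policy limits carried by a restricted proxy

def sign_proxy(signer: Credential, restrictions=()) -> Credential:
    """Derive a short-lived proxy from a longer-lived credential.

    The proxy keeps the same subject but is signed by its parent, so a
    chain of delegations can be verified back to the user's long-term
    certificate without ever shipping the long-term private key."""
    return Credential(subject=signer.subject,
                      issuer=signer.subject,
                      restrictions=restrictions)

# Single sign-on: one proxy is created from the user's long-term credential...
user_cert = Credential("/O=Grid/CN=Jane Physicist", "/O=Grid/CN=Example CA")
user_proxy = sign_proxy(user_cert)

# ...and each site that starts a process delegates a further, restricted proxy
# so the process can itself access files at site C on the user's behalf.
process_proxy_a = sign_proxy(user_proxy, restrictions=("read:/data/cms",))
process_proxy_b = sign_proxy(user_proxy, restrictions=("read:/data/cms",))
```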
Grid Services Architecture (3):Resource Layer Protocols & Services • Resource management: GRAM • Remote allocation, reservation, monitoring, control of [compute] resources • Data access: GridFTP • High-performance data access & transport • Information: MDS (GRRP, GRIP) • Access to structure & state information • & others emerging: catalog access, code repository access, accounting, … • All integrated with GSI
GRAM Resource Management Protocol • Grid Resource Allocation & Management • Allocation, monitoring, control of computations • Secure remote access to diverse schedulers • Current evolution • Immediate and advance reservation • Multiple resource types: manage anything • Recoverable requests, timeout, etc. • Evolve to Web Services • Policy evaluation points for restricted proxies Karl Czajkowski, Steve Tuecke, others
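As a concrete illustration, the sketch below builds a GT2-era RSL (Resource Specification Language) string of the kind GRAM consumes. The attribute set shown (executable, count, arguments) is recalled from that era, and submit_to_gram is a hypothetical placeholder rather than a real client call; an actual submission would go through a GRAM client such as globusrun.

```python
# Minimal sketch, assuming GT2-style RSL; submit_to_gram() is hypothetical.

def make_rsl(executable, arguments=(), count=1):
    """Build an RSL string of the form &(attribute=value)(attribute=value)..."""
    clauses = [f"(executable={executable})", f"(count={count})"]
    if arguments:
        clauses.append("(arguments=" + " ".join(arguments) + ")")
    return "&" + "".join(clauses)

def submit_to_gram(contact, rsl):
    # Placeholder: a real client would make a GSI-authenticated request to the
    # GRAM gatekeeper named by `contact` (e.g. "host/jobmanager-pbs").
    raise NotImplementedError(f"would submit {rsl!r} to {contact}")

rsl = make_rsl("/bin/hostname", count=4)
print(rsl)   # &(executable=/bin/hostname)(count=4)
# submit_to_gram("cluster.example.org/jobmanager-pbs", rsl)
```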
Data Access & Transfer • GridFTP: extended version of the popular FTP protocol for Grid data access and transfer • Secure, efficient, reliable, flexible, extensible, parallel, concurrent, e.g.: • Third-party data transfers, partial file transfers • Parallelism, striping (e.g., on PVFS) • Reliable, recoverable data transfers • Reference implementations • Existing clients and servers: wuftpd, ncftp • Flexible, extensible libraries • Bill Allcock, Joe Bester, John Bresnahan, Steve Tuecke, others
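For illustration, a hedged sketch of driving a third-party GridFTP transfer from Python by shelling out to the globus-url-copy client. It assumes that client is installed and that the -p flag selects the number of parallel streams, as in the GT2-era tool; the endpoint hostnames are made up.

```python
import subprocess

def third_party_copy(src_url, dst_url, parallel_streams=4):
    """Ask the source and destination GridFTP servers to move the data
    directly, so the bytes never pass through the machine running this
    script; a valid proxy credential is assumed to be in place."""
    cmd = ["globus-url-copy", "-p", str(parallel_streams), src_url, dst_url]
    subprocess.run(cmd, check=True)

# Example (hypothetical endpoints): replicate a ~1 GB event file.
# third_party_copy("gsiftp://tier1.example.org/data/run42/events.root",
#                  "gsiftp://tier2.example.edu/cache/run42/events.root")
```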
Grid Services Architecture (4): Collective Layer Protocols & Services • Community membership & policy • E.g., Community Authorization Service • Index/metadirectory/brokering services • Custom views on community resource collections (e.g., GIIS, Condor Matchmaker) • Replica management and replica selection • Optimize aggregate data access performance • Co-reservation and co-allocation services • End-to-end performance • Etc., etc.
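The replica-selection idea can be illustrated with a small, purely hypothetical sketch: given several physical copies of a logical file, pick the one with the lowest estimated transfer time. In practice the bandwidth estimates would come from a forecasting service such as NWS; here they are hard-coded assumptions.

```python
def select_replica(replicas, file_size_gb):
    """replicas: list of (physical_url, predicted_bandwidth_gbit_per_s)."""
    def eta_seconds(entry):
        _, bandwidth_gbit_s = entry
        return (file_size_gb * 8) / bandwidth_gbit_s   # ignores latency, load
    best_url, _ = min(replicas, key=eta_seconds)
    return best_url

replicas = [
    ("gsiftp://tier1.example.org/cms/higgs_042.root", 0.3),  # ~300 Mbit/s path
    ("gsiftp://tier2.example.edu/cms/higgs_042.root", 0.9),  # ~900 Mbit/s path
]
print(select_replica(replicas, file_size_gb=1.0))  # picks the tier2 copy
```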
The Grid Information Problem • Large numbers of distributed “sensors” with different properties • Need for different “views” of this information, depending on community membership, security constraints, intended purpose, sensor type
Globus Toolkit Solution: MDS-2 • Registration & enquiry protocols, information models, query languages • Provides standard interfaces to sensors • Supports different “directory” structures for various discovery/access strategies • Karl Czajkowski, Steve Fitzgerald, others
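Because MDS-2 presents its registration and enquiry interfaces over LDAP, a generic LDAP client can query a GIIS index. The sketch below uses python-ldap; the default port (2135) and base DN ("Mds-Vo-name=local, o=grid") are recalled from GT2-era deployments and should be checked against the local installation.

```python
# Hedged sketch: anonymous LDAP enquiry against an MDS-2 GIIS index.
import ldap  # python-ldap

def list_registered_entries(giis_host, port=2135,
                            base_dn="Mds-Vo-name=local, o=grid"):
    conn = ldap.initialize(f"ldap://{giis_host}:{port}")
    # A broad subtree search; real clients would filter on MDS object classes.
    for dn, _attrs in conn.search_s(base_dn, ldap.SCOPE_SUBTREE, "(objectclass=*)"):
        print(dn)

# list_registered_entries("giis.example.org")
```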
Community Authorization (Prototype shown August 2001) • 1. CAS request: the user asks the CAS for a capability, naming resources and operations; the CAS, holding community membership, resource/collective CA info, and collective policy information, decides whether the collective policy authorizes this request for this user • 2. CAS reply, with capability and resource CA info • 3. Resource request, authenticated with the capability: the resource checks, against its local policy information, whether the request is authorized by the capability and whether such requests are authorized for the CAS • 4. Resource reply • Laura Pearlman, Steve Tuecke, Von Welch, others
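The four-step flow above can be summarized in a toy Python sketch. The class and method names are invented for illustration and do not correspond to the CAS prototype's actual interfaces; the point is that community policy is evaluated at the CAS, while the resource still applies its own local policy and its trust in the CAS.

```python
from dataclasses import dataclass

@dataclass
class Capability:
    user: str
    resource: str
    operations: frozenset
    issuer: str = "CAS"

class CommunityAuthorizationService:
    def __init__(self, community_policy):
        self.policy = community_policy   # {(user, resource): allowed operations}

    def request_capability(self, user, resource, operations):
        """Steps 1-2: grant a capability only for operations the community allows."""
        allowed = self.policy.get((user, resource), frozenset())
        granted = frozenset(operations) & allowed
        return Capability(user, resource, granted) if granted else None

class GridResource:
    def __init__(self, name, trusted_issuers, local_policy):
        self.name, self.trusted, self.local = name, trusted_issuers, local_policy

    def handle(self, capability, operation):
        """Steps 3-4: honor the capability only if this resource trusts the CAS
        and its own local policy also permits the operation."""
        return (capability is not None
                and capability.issuer in self.trusted
                and capability.resource == self.name
                and operation in capability.operations
                and operation in self.local)

cas = CommunityAuthorizationService({("jane", "se.example.org"): frozenset({"read"})})
se = GridResource("se.example.org", trusted_issuers={"CAS"}, local_policy={"read", "write"})
cap = cas.request_capability("jane", "se.example.org", {"read"})
print(se.handle(cap, "read"))   # True
print(se.handle(cap, "write"))  # False: the community never granted write
```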
HENP Related Data Grid Projects • Funded (or about to be funded) projects • PPDG I USA DOE $2M 1999-2001 • GriPhyN USA NSF $11.9M + $1.6M 2000-2005 • EU DataGrid EU EC €10M 2001-2004 • PPDG II USA DOE $9.5M 2001-2004 • iVDGL USA NSF $13.7M + $2M 2001-2006 • DataTAG EU EC €4M 2002-2004 • Proposed projects • GridPP UK PPARC >$15M? 2001-2004 • Many national projects of interest to HENP • Initiatives in US, UK, Italy, France, NL, Germany, Japan, … • EU networking initiatives (Géant, SURFNet) • US Distributed Terascale Facility ($53M, 12 TFL, 40 Gb/s net)
Background on Data Grid Projects • They support several disciplines • GriPhyN: CS, HEP (LHC), gravity waves, digital astronomy • PPDG: CS, HEP (LHC + current expts), Nuc. Phys., networking • DataGrid: CS, HEP (LHC), earth sensing, biology, networking • They are already joint projects • Multiple scientific communities • High-performance scientific experiments • International components and connections • They have common elements & foundations • Interconnected management structures • Sharing of people, projects (e.g., GDMP) • Globus infrastructure • Data Grid Reference Architecture • HENP Intergrid Coordination Board
Grid Physics Network (GriPhyN) • Enabling R&D for advanced data grid systems, focusing in particular on the Virtual Data concept • Participating experiments: ATLAS, CMS, LIGO, SDSS
Virtual Data in Action • A data request may: access local data, compute locally, compute remotely, or access remote data • Scheduling & execution subject to local & global policies
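A toy sketch of that decision, under stated assumptions rather than GriPhyN planner code: satisfy a request from a local replica if one exists, otherwise fetch a remote replica or re-derive the product, whichever policy and estimated cost favor. All names and costs are invented.

```python
def plan_request(logical_name, local_replicas, remote_replicas,
                 derivation_cost_s, transfer_cost_s, allow_remote_fetch=True):
    if logical_name in local_replicas:
        return ("use_local", local_replicas[logical_name])
    if (allow_remote_fetch and logical_name in remote_replicas
            and transfer_cost_s <= derivation_cost_s):
        return ("fetch_remote", remote_replicas[logical_name])
    return ("rederive", logical_name)   # run the recorded transformation again

print(plan_request("higgs_042.root",
                   local_replicas={},
                   remote_replicas={"higgs_042.root": "gsiftp://tier1.example.org/cms/higgs_042.root"},
                   derivation_cost_s=3600, transfer_cost_s=120))
# ('fetch_remote', 'gsiftp://tier1.example.org/cms/higgs_042.root')
```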
GriPhyN Status, October 2001 • Data Grid Reference Architecture defined • v1: core services (Feb 2001) • v2: request planning/mgmt, catalogs (RSN) • Progress on ATLAS, CMS, LIGO, SDSS • Requirements statements developed • Testbeds and experiments proceeding • Progress on technology • DAGMAN request management • Catalogs, security, policy • Virtual Data Toolkit v1.0 out soon
GriPhyN/PPDG Data Grid Architecture • [Architecture diagram] An Application submits an abstract DAG to a Planner; the Planner produces a concrete DAG for an Executor (DAGMan, Kangaroo), supported by Catalog Services (MCAT; GriPhyN catalogs), Info Services and Monitoring (MDS), Replica Management (GDMP), and Policy/Security (GSI, CAS) • A Reliable Transfer Service, Compute Resources, and Storage Resources sit below, annotated in the diagram with Globus GRAM, GridFTP, and SRM • Components with an initial operational solution are marked in the original figure • Ewa Deelman, Mike Wilde
Catalog Architecture • Metadata Catalog (with a Derived Metadata Catalog): maps application-specific attributes to logical object names, giving transparency with respect to materialization; updated upon materialization • Derived Data Catalog: records, for each derived object (e.g., F.X, G(P).Y), the transformation id and parameters that produce it • Transformation Catalog: maps transformation names (F, G) to program location URLs and costs • Replica Catalog: maps logical container names (logC1, logC2, …) to the URLs of physical file copies, giving transparency with respect to location • Logical containers, program storage, and physical file storage hold the actual data and code (accessed via GCMS)
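To show how the catalogs fit together, here is a toy resolution of a virtual-data request in which plain dictionaries stand in for the real services. The entries and names are invented for illustration, following the flavor of the diagram (logical objects such as F.X, containers such as logC3).

```python
# attrs -> logical object name
metadata_catalog = {("channel", "H->4mu"): "F.X"}
# logical object -> (transformation, parameters, logical container)
derived_data_catalog = {"F.X": ("F", {"param": "X"}, "logC3")}
# transformation -> (program URL, cost)
transformation_catalog = {"F": ("http://repo.example.org/bin/F", 10)}
# logical container -> physical file URLs
replica_catalog = {"logC3": ["gsiftp://site-a.example.org/store/f3",
                             "gsiftp://site-b.example.org/store/f3"]}

def resolve(attrs):
    logical_object = metadata_catalog[attrs]
    transform, params, container = derived_data_catalog[logical_object]
    replicas = replica_catalog.get(container, [])
    if replicas:                          # already materialized somewhere
        return ("transfer_one_of", replicas)
    program_url, _cost = transformation_catalog[transform]
    return ("derive_with", program_url, params)

print(resolve(("channel", "H->4mu")))
```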
[SAM (D0) data handling system architecture, mapped onto the Grid layers] • Client applications: D0 Framework C++ codes, Python codes, Java codes, Web, command line • Collective services: Request Formulator and Planner, Request Manager, Cache Manager, Job Manager, Storage Manager (SAM component names include “Dataset Editor”, “Project Master”, “Station Master”, “File Storage Server”); batch systems (LSF, FBS, PBS, Condor) and job services; SAM resource management; Data Mover (“Optimiser”, “Stager”); Significant Event Logger; Naming Service; Catalog Manager; Database Manager • Protocols: CORBA, UDP, catalog protocols; file transfer protocols (ftp, bbftp, rcp, GridFTP); mass storage system protocols (e.g., encp, HPSS) • Connectivity and resource layer: GSI; SAM-specific user, group, node, and station registration; bbftp ‘cookie’ authentication and security • Fabric: tape and disk storage elements, compute elements, LANs and WANs, code repository, resource and services catalog, metadata catalog, replica catalog • In the original figure, marked components will be replaced, enhanced, or added using PPDG and Grid tools; names in quotes are SAM-given software component names
Early GriPhyN Challenge Problem: CMS Data Reconstruction • A master Condor job runs at Caltech (Caltech workstation) • 2) It launches a secondary Condor job on the Wisconsin pool; input files are delivered via Globus GASS • 3) 100 Monte Carlo jobs run on the Wisconsin Condor pool • 4) 100 data files, ~1 GB each, are transferred via GridFTP • 5) The secondary job reports complete to the master • 6) The master starts reconstruction jobs on the NCSA Linux cluster via the Globus jobmanager • 7) GridFTP fetches data from NCSA UniTree (a GridFTP-enabled FTP server) • 8) The processed Objectivity database is stored to UniTree • 9) The reconstruction job reports complete to the master • Scott Koranda, Miron Livny, others
Trace of a Condor-G Physics Run • [Timeline plot] Pre-processing, simulation jobs, and post-processing on the UW Condor pool; ooHits and ooDigis production at NCSA; the trace shows a delay due to a script error
The 13.6 TF TeraGrid: Computing at 40 Gb/s • TeraGrid/DTF sites: NCSA/PACI (8 TF, 240 TB), SDSC (4.1 TF, 225 TB), Caltech, Argonne • Site resources include HPSS and UniTree archival storage, linked by external networks at 40 Gb/s • www.teragrid.org
International Virtual Data Grid Laboratory (iVDGL) • [World map] Tier0/1, Tier2, and Tier3 facilities connected by 10+ Gbps, 2.5 Gbps, 622 Mbps, and other links
Grids and Industry • Scientific/technical apps in industry • “IntraGrid”: aerospace, pharmaceuticals, … • “InterGrid”: multi-company teams • Globus Toolkit provides a starting point • Compaq, Cray, Entropia, Fujitsu, Hitachi, IBM, Microsoft, NEC, Platform, SGI, and Sun are all porting it, among other activities • Enable resource sharing outside S&TC • “Grid Services”: extend Web Services with capabilities provided by Grid protocols • Focus of IBM-Globus collaboration • Future computing infrastructure based on Grid-enabled xSPs & applications?
Acknowledgments • Globus R&D is joint with numerous people • Carl Kesselman, Co-PI; Steve Tuecke, principal architect at ANL; others acked below • GriPhyN R&D is joint with numerous people • Paul Avery, Co-PI; Mike Wilde, project coordinator; Carl Kesselman, Miron Livny CS leads; numerous others • PPDG R&D is joint with numerous people • Richard Mount, Harvey Newman, Miron Livny, Co-PIs; Ruth Pordes, project coordinator • Numerous other project partners • Support: DOE, DARPA, NSF, NASA, Microsoft
Summary • “Grids”: Resource sharing & problem solving in dynamic virtual organizations • Many projects now working to develop, deploy, apply relevant technologies • Common protocols and services are critical • Globus Toolkit a source of protocol and API definitions, reference implementations • Rapid progress on definition, implementation, and application of Data Grid architecture • First indications of industrial adoption