The Anatomy of the Grid: Enabling Scalable Virtual Organizations John DYER TERENA john.dyer@terena.nl Acknowledgement to: Ian Foster Mathematics and Computer Science Division Argonne National Laboratory
Grids are “hot” … but what are they really about?
Presentation Agenda • Problem statement • Architecture • Globus Toolkit • Futures
The Grid Problem Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations
Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Sharing always conditional: issues of trust, policy, negotiation, payment, … (Cost v Performance) • Coordinated problem solving • Beyond client-server: distributed data analysis, computation, collaboration, … • Dynamic, multi-institutional virtual orgs • Community overlays on classic org structures • Large or small, static or dynamic
Computational Astrophysics [Figure: coupled run across the SDSC IBM SP (1024 procs; 5x12x17 = 1020 used) and the NCSA Origin Array (256+128+128; 5x12x(4+2+2) = 480 used), with Gig-E at 100 MB/sec inside each site but an OC-12 line (only 2.5 MB/sec effective) between them] • Solved EEs (Einstein's equations) for gravitational waves • Tightly coupled, communications required • Must communicate 30 MB/step between machines
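A quick back-of-the-envelope check, using only the figures quoted on this slide, shows why the wide-area link rather than the processors dominates each tightly coupled step:

```python
# Rough illustration using only the numbers quoted on this slide.
step_data_mb = 30.0            # MB exchanged between machines per simulation step
gige_mb_per_s = 100.0          # Gig-E throughput quoted within a site
oc12_effective_mb_per_s = 2.5  # effective throughput achieved over the OC-12 WAN link

print(f"per-step transfer over Gig-E:  {step_data_mb / gige_mb_per_s:.2f} s")
print(f"per-step transfer over OC-12:  {step_data_mb / oc12_effective_mb_per_s:.2f} s")
# -> 0.30 s vs. 12.00 s: the inter-site link, not the compute, limits each step.
```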
Data Grids for High Energy Physics [Figure: tiered data grid hierarchy; image courtesy Harvey Newman, Caltech] • There is a "bunch crossing" every 25 nsecs; there are 100 "triggers" per second; each triggered event is ~1 MByte in size • Detector to Online System at ~PBytes/sec; Online System to Tier 0 (CERN Computer Centre, offline processor farm of ~20 TIPS) at ~100 MBytes/sec • Tier 0 to Tier 1 regional centres (FermiLab ~4 TIPS; France, Germany, Italy regional centres; HPSS archives) at ~622 Mbits/sec or air freight (deprecated) • Tier 1 to Tier 2 centres (~1 TIPS each, e.g. Caltech) at ~622 Mbits/sec • Tier 2 to Tier 3 institutes (~0.25 TIPS, physics data cache) at ~622 Mbits/sec, and on to Tier 4 physicist workstations (Pentium II 300 MHz class) at ~1 MBytes/sec • 1 TIPS is approximately 25,000 SpecInt95 equivalents • Physicists work on analysis "channels"; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server
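The ~100 MBytes/sec into Tier 0 follows directly from the trigger rate and event size given in the figure; a small sanity check (the yearly figure additionally assumes continuous running, which the slide does not state):

```python
# Sanity check of the rates quoted in the figure.
triggers_per_second = 100   # triggered events per second
event_size_mb = 1.0         # ~1 MByte per triggered event

ingest_mb_per_s = triggers_per_second * event_size_mb
print(f"online -> Tier 0 rate: ~{ingest_mb_per_s:.0f} MBytes/sec")  # matches the ~100 MBytes/sec label

seconds_per_year = 365 * 24 * 3600
print(f"yearly raw volume: ~{ingest_mb_per_s * seconds_per_year / 1e9:.1f} PBytes"
      " (assuming continuous running, an assumption not stated on the slide)")
```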
Network for Earthquake Engineering Simulation • NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other • On-demand access to experiments, data streams, computing, archives, collaboration NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
Grid Applications: Mathematicians • Community = an informal collaboration of mathematicians and computer scientists • Condor-G delivers 3.46E8 CPU seconds in 7 days (600E3 seconds real-time) • Peak of 1009 processors in U.S. and Italy (8 sites) MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
Grid Architecture: Isn't it just the Next Generation Internet, so why bother?
Why Discuss Architecture? • Descriptive • Provide a common vocabulary for use when describing Grid systems • Guidance • Identify key areas in which services are required - FRAMEWORK • Prescriptive • Define standards • But in the existing standards framework • GGF working with IETF, Internet2 etc.
What Sorts of Standards? • Need for interoperability when different groups want to share resources • E.g., IP lets me talk to your computer, but how do we establish & maintain sharing? • How do I discover, authenticate, authorize, describe what I want to do, etc., etc.? • Need for shared infrastructure services to avoid repeated development, installation, e.g. • One port/service for remote access to computing, not one per tool/application • X.509 enables sharing of Certificate Authorities • MIDDLEWARE !
In Defining Grid Architecture, We Must Address . . . • Development of Grid protocols & services • Protocol-mediated access to remote resources • New services: e.g., resource brokering • Mostly (extensions to) existing protocols • Development of Grid APIs & SDKs • Facilitate application development by supplying higher-level abstractions • The model is the Internet and Web
The Role of Grid Services (Middleware) and Tools [Figure: applications and tools (collaboration tools, data management tools, distributed simulation, …) layered over Grid services (information services, resource management, fault detection, …), which in turn sit on the network]
Grid Architecture: Status • No “official” standards exist • But: • Globus Toolkit has emerged as the de facto standard for several important Connectivity, Resource, and Collective protocols • GGF has an architecture working group • Technical specifications are being developed for architecture elements: e.g., security, data, resource management, information • Internet drafts submitted in security area
Layered Grid Architecture (alongside the Internet Protocol Architecture) • Application: does the science (cf. Internet application layer) • Collective: job management; directory, discovery, monitoring • Resource: negotiation & control; sharing resources, controlling them • Connectivity: comms & authentication; single sign-on, trust (cf. transport and internet layers) • Fabric: all physical resources; net, CPUs, storage, sensors (cf. link layer)
Toolkits & Components • CONDOR - harnessing the processing capacity of idle workstations • www.cs.wisc.edu/condor/ • LEGION - developing an object-oriented framework for grid applications • www.cs.virginia.edu/~legion • Globus Toolkit SDK - APIs • www.globus.org/
Architecture: Fabric Layer • Just what you would expect: the diverse mix of resources that may be shared • Individual computers, Condor pools, file systems, archives, metadata catalogs, networks, sensors, etc., etc. • Few constraints on low-level technology: connectivity and resource level protocols • Globus toolkit provides a few selected components (e.g., bandwidth broker)
Architecture: Connectivity • Communication • Internet protocols: IP, DNS, routing, etc. • Security: Grid Security Infrastructure (GSI) • Uniform authentication & authorization mechanisms in multi-institutional setting • Single sign-on, delegation, identity mapping • Public key technology, SSL, X.509, GSS-API (several Internet drafts document extensions) • Supporting infrastructure: Certificate Authorities, key management, etc.
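GSI builds its mutual authentication and single sign-on on standard public-key machinery (SSL/TLS plus X.509 certificates). As a minimal sketch of that underlying layer only, not of GSI itself, the following shows mutually authenticated TLS with Python's standard ssl module; the certificate file names, host name, and port are placeholders, not real Globus artifacts:

```python
import socket
import ssl

# Minimal sketch: mutually authenticated TLS, the transport-level mechanism
# that GSI layers its proxy/delegation extensions on top of.
# "ca-bundle.pem", "usercert.pem", "userkey.pem", and the host are placeholders.
context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH,
                                     cafile="ca-bundle.pem")      # trusted CAs
context.load_cert_chain(certfile="usercert.pem",
                        keyfile="userkey.pem")                    # our X.509 identity

with socket.create_connection(("gridservice.example.org", 2119)) as sock:
    with context.wrap_socket(sock, server_hostname="gridservice.example.org") as tls:
        peer = tls.getpeercert()
        print("authenticated peer:", peer.get("subject"))
```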
GSI Futures • Scalability in numbers of users & resources • Credential management • Online credential repositories • Account management • Authorization • Policy languages • Community authorization • Protection against compromised resources • Restricted delegation, smartcards
Architecture: Resources • Resource management: remote allocation, reservation, monitoring, and control of [compute] resources - GRAM (access & management) • Data access: GridFTP • High-performance data access & transport • Information: • GRIP (Grid Resource Information Protocol), cf. LDAP • GRRP - Registration Protocol • Access to structure & state information • Others emerging: catalog access, code repository access, accounting, … • All integrated with GSI
GRAM Resource Management Protocol • Grid Resource Allocation & Management • Allocation, monitoring, control of computations • Simple HTTP-based RPC • Job request: Returns opaque, transferable “job contact” string for access to job • Job cancel, Job status, Job signal • Event notification (callbacks) for state changes • Protocol/server address robustness (exactly once execution), authentication, authorization • Servers for most schedulers; C and Java APIs
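The slide describes GRAM only as a "simple HTTP-based RPC", so the following is a hypothetical client-side sketch of that pattern: POST a job request, keep the opaque job contact string that comes back, and use it later for status or cancel calls. The URL, payload fields, and endpoint paths are invented for illustration and are not the real GRAM wire format.

```python
import json
import urllib.request

# Hypothetical GRAM-like client sketch: endpoint, JSON payload, and
# "job contact" handling are illustrative only, not the actual GRAM protocol.
GATEKEEPER = "https://gridnode.example.org:2119"   # placeholder resource manager

def submit_job(executable, arguments, count=1):
    """Submit a job request; return the opaque, transferable job contact."""
    payload = json.dumps({"executable": executable,
                          "arguments": arguments,
                          "count": count}).encode()
    req = urllib.request.Request(f"{GATEKEEPER}/jobs", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode().strip()        # opaque job contact string

def job_status(job_contact):
    """Query job state via the contact string obtained at submission."""
    with urllib.request.urlopen(f"{GATEKEEPER}/jobs/{job_contact}/status") as resp:
        return resp.read().decode()

if __name__ == "__main__":
    contact = submit_job("/bin/hostname", [], count=4)
    print("job contact:", contact)
    print("state:", job_status(contact))
```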
Data Access & Transfer • GridFTP: extended version of the popular FTP protocol for Grid data access and transfer • Secure, efficient, reliable, flexible, extensible, parallel, concurrent, e.g.: • Third-party data transfers, partial file transfers • Parallelism, striping (e.g., on PVFS) • Reliable, recoverable data transfers • Reference implementations • Existing clients and servers: wuftpd, ncftp • Flexible, extensible libraries
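GridFTP itself needs a GSI-enabled client, but the partial/restartable-transfer idea it inherits from FTP can be illustrated with Python's standard ftplib, which supports the REST (restart-at-offset) mechanism; the host, login, and file name below are placeholders:

```python
from ftplib import FTP

# Plain-FTP illustration of a partial/restartable transfer (the REST offset),
# one of the capabilities GridFTP extends with security, parallelism and striping.
# Host, login and file name are placeholders.
with FTP("ftp.example.org") as ftp:
    ftp.login("anonymous", "user@example.org")
    offset = 1_000_000                       # resume after the first 1 MB
    with open("dataset.part", "wb") as out:
        # rest=offset asks the server to start sending at the given byte offset
        ftp.retrbinary("RETR dataset.dat", out.write, rest=offset)
```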
Architecture: Collective • Bringing the underlying resources together to provide the requested services • Resource brokers (e.g., Condor Matchmaker) • Resource discovery and allocation • Replica management and replica selection • Optimize aggregate data access performance • Co-reservation and co-allocation services • End-to-end performance • Etc., etc.
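As a toy illustration of the matchmaking idea behind a resource broker (inspired by, but far simpler than, the Condor Matchmaker and its ClassAds), the sketch below pairs a job's requirements with advertised resource offers; all attribute names and the ranking rule are made up:

```python
# Toy matchmaker: pick the best advertised resource satisfying a request's
# requirements. Attribute names and the ranking rule are illustrative only.
offers = [
    {"name": "sp.sdsc.example",  "cpus": 1024, "memory_gb": 512, "arch": "power"},
    {"name": "o2k.ncsa.example", "cpus": 512,  "memory_gb": 256, "arch": "mips"},
]

request = {"min_cpus": 256, "min_memory_gb": 128, "arch": "mips"}

def matches(offer, req):
    return (offer["cpus"] >= req["min_cpus"]
            and offer["memory_gb"] >= req["min_memory_gb"]
            and offer["arch"] == req["arch"])

candidates = [o for o in offers if matches(o, request)]
# Rank by spare capacity (a stand-in for a real ranking expression).
best = max(candidates, key=lambda o: o["cpus"], default=None)
print("matched resource:", best["name"] if best else "none")
```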
Globus Toolkit Solution • Registration & enquiry protocols, information models, query languages • Provides standard interfaces to sensors • Supports different “directory” structures for various discovery/access strategies (Karl Czajkowski, Steve Fitzgerald, and others)
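The slide names registration and enquiry protocols without giving their syntax, so the following in-memory sketch only illustrates the two roles such a directory service plays: resources register (and periodically re-register, so stale entries age out) and clients query by attribute. It is not the MDS/LDAP interface itself; attribute names and the 300 s lifetime are assumptions.

```python
import time

# Toy information service: soft-state registration plus attribute queries.
REGISTRATION_LIFETIME_S = 300
_registry = {}   # resource name -> (attributes, expiry time)

def register(name, attributes):
    """Called periodically by a resource; refreshes its soft-state entry."""
    _registry[name] = (attributes, time.time() + REGISTRATION_LIFETIME_S)

def enquire(**constraints):
    """Return non-expired resources whose attributes match all constraints."""
    now = time.time()
    return [name for name, (attrs, expiry) in _registry.items()
            if expiry > now and all(attrs.get(k) == v for k, v in constraints.items())]

register("storage.site-a", {"type": "gridftp", "free_tb": 12})
register("compute.site-b", {"type": "gram", "queue": "batch"})
print(enquire(type="gridftp"))   # -> ['storage.site-a']
```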
Major Grid Projects [Table of major Grid projects; individual entries not recoverable] • Also many technology R&D projects: e.g., Condor, NetSolve, Ninf, NWS • See also www.gridforum.org
The 13.6 TF TeraGrid: Computing at 40 Gb/s [Figure: four sites joined by 40 Gb/s external networks: NCSA/PACI (8 TF, 240 TB), SDSC (4.1 TF, 225 TB), Caltech, and Argonne, each with local site resources including HPSS or UniTree archives] TeraGrid/DTF: NCSA, SDSC, Caltech, Argonne (www.teragrid.org)
International Virtual Data Grid Laboratory [Figure: world map of Tier0/1, Tier2, and Tier3 facilities connected by 10 Gbps, 2.5 Gbps, 622 Mbps, and other links] U.S. PIs: Avery, Foster, Gardner, Newman, Szalay (www.ivdgl.org)
Problem Evolution • Past-present: O(10^2) high-end systems; Mb/s networks; centralized (or entirely local) control • I-WAY (1995): 17 sites, week-long; 155 Mb/s • GUSTO (1998): 80 sites, long-term experiment • NASA IPG, NSF NTG: O(10) sites, production • Present: O(10^4)-O(10^6) data systems, computers; Gb/s networks; scaling, decentralized control • Scalable resource discovery; restricted delegation; community policy; GriPhyN Data Grid: 100s of sites, O(10^4) computers; complex policies • Future: O(10^6)-O(10^9) data, sensors, computers; Tb/s networks; highly flexible policy, control
The Future • We don’t build or buy “computers” anymore, we borrow or lease required resources • When I walk into a room, need to solve a problem, need to communicate • A “computer” is a dynamically, often collaboratively constructed collection of processors, data sources, sensors, networks • Similar observations apply for software
And Thus … • Reduced barriers to access mean that we do much more computing, and more interesting computing, than today => Many more components (& services); massive parallelism • All resources are owned by others => Sharing (for fun or profit) is fundamental; trust, policy, negotiation, payment • All computing is performed on unfamiliar systems => Dynamic behaviors, discovery, adaptivity, failure
The Global Grid Forum • Merger of the (US) Grid Forum & EuroGRID • Cooperative forum of working groups • Open to all who show up • Meets every four months, alternating between the US and Europe • GGF1 - Amsterdam, NL • GGF2 - Washington, US • GGF3 - Frascati, IT http://www.gridforum.org
Global Grid Forum History, 1998-2002 [Figure: timeline] • Grid Forum (US): GF BOF (Orlando), GF1 (San Jose, NASA Ames), GF2 (Chicago, Northwestern), GF3 (San Diego, SDSC), GF4 (Redmond, Microsoft), GF5 (Boston, Sun) • European Grid Forum: eGrid1 (Poznan, PSNC), eGrid2 (Munich, Europar), eGrid and GF BOFs (Portland) • Asia-Pacific GF Planning (Yokohama) • Global Grid Forum: Global GF BOF (Dallas), GGF-1 (Amsterdam, WTCW), GGF-2 (Washington, DC, DOD-MSRC), GGF3 (Rome, INFN) 7-10 October 2001, GGF4 (Toronto, NRC) 17-20 February 2002, GGF5 (Edinburgh) 21-24 July 2002, jointly with HPDC (24-26 July)
Summary • The Grid problem: Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations • Grid architecture: Emphasize protocol and service definition to enable interoperability and resource sharing • Globus Toolkit a source of protocol and API definitions, reference implementations • See: globus.org, griphyn.org, gridforum.org