SKR 5800 Selected Topics in Distributed Computing. Grid Computing: Introduction. AZIZOL ABDULLAH, PhD, DEPARTMENT OF COMMUNICATION TECHNOLOGY AND NETWORK
Lecture Contents • Why do we have Grid Computing • What is Grid Computing • Ian Foster’s 3 point checklist • Defining Grid Computing • What is Grid and Grid Computing? • Why we need grids • Why Now? • The Grid Problems
Why do We Have Grid Computing? • The term was coined in 1996 by Ian Foster and Carl Kesselman • It described the software needed by the rapidly growing, highly advanced high-performance computing (HPC) community • Resources that scale with technology: • Supercomputers (MFlops in '96, now TFlops) • Big and not portable • Large data sets (GB in '96, now petabytes) • Need fast networks to move data around to resources • Need security: • NSF (and other government agencies) spend money to build infrastructure, so it is hard to get access
What is Grid Computing? • Is it a new, unique idea or the next generation of distributed or meta-computing? Please find and read this paper: Ian Foster Paper “What is the Grid? A Three-point Checklist” http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf
Ian Foster’s 3 point checklist • A Grid is a system that is able to • coordinate “resources that are not subject to centralized control” • Use “standard, open, general-purpose protocols and interfaces” • “to deliver nontrivial qualities of service.” • What does this mean? • We will try to understand this in this course.
Defining Grid Computing • There are several competing definitions for “The Grid” and Grid computing • These definitions tend to focus on: • Implementation of Distributed computing • A common set of interfaces, tools and APIs • Some stress the inter-institutional aspect of grids and Virtual Organizations • “The Virtualization of Resources” abstraction of resources
What is Grid and Grid Computing? • Grid computing promises a standard, ‘complete’ set of distributed computing capabilities • There is a lot of hype around grid computing • Traditional users need to get work done now! • Some CS researchers see it as a fad • But there is real-world value! • In e-science and e-business
What is Grid and Grid Computing? (cont.) • Grid computing must provide basic functions • resource discovery and information collection & publishing • data management on and between resources • process management on and between resources • common security mechanism underlying the above • process and session recording/accounting • Current grid computing tools such as Globus provide most of the above at some level • The current capabilities are incomplete • New web-service-based standards will help current tools become interoperable
The Grid • “Resource sharing & coordinated problem solving in dynamic … virtual organizations” • Enable integration of distributed service & resources • Using general-purpose protocols & infrastructure • To achieve useful qualities of service • Source: “The Anatomy of the Grid”, Foster, Kesselman, Tuecke, 2001
Grid3: An Operational Grid • 28 sites (2100-2800 CPUs) & growing • 400-1300 concurrent jobs • 8 substantial applications + CS experiments • Running since October 2003 • Slide courtesy of Ian Foster, http://www.ivdgl.org/grid3
Data Grids for High Energy Physics (figure: tiered data grid; image courtesy Harvey Newman, Caltech) • There is a “bunch crossing” every 25 nsecs; there are 100 “triggers” per second; each triggered event is ~1 MByte in size • Online System: ~PBytes/sec in, ~100 MBytes/sec out • Offline Processor Farm: ~20 TIPS, ~100 MBytes/sec • Tier 0: CERN Computer Centre • Tier 0 to Tier 1: ~622 Mbits/sec or Air Freight (deprecated) • Tier 1: FermiLab (~4 TIPS), France Regional Centre, Germany Regional Centre, Italy Regional Centre • Tier 1 to Tier 2: ~622 Mbits/sec • Tier 2: Tier2 Centres at ~1 TIPS each, including Caltech • HPSS storage at several centres • Tier 2 to institutes: ~622 Mbits/sec • Institutes: ~0.25 TIPS each, with a physics data cache; ~1 MBytes/sec to workstations • Tier 4: physicist workstations (Pentium II 300 MHz) • Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server • 1 TIPS is approximately 25,000 SpecInt95 equivalents
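The rates quoted in the figure can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes decimal units (1 MByte = 10^6 bytes) and a wide-area link that sustains its full nominal rate, neither of which is stated on the slide.

```python
# Back-of-the-envelope check of the rates quoted in the figure.
# Assumptions (not from the slide): decimal units and a link that
# sustains its full nominal 622 Mbits/sec.

BUNCH_CROSSING_INTERVAL_S = 25e-9   # one bunch crossing every 25 nsecs
TRIGGERS_PER_S = 100                # triggers per second
EVENT_SIZE_BYTES = 1_000_000        # ~1 MByte per triggered event
WAN_LINK_BITS_PER_S = 622_000_000   # ~622 Mbits/sec between tiers
SECONDS_PER_DAY = 24 * 60 * 60

# How many bunch crossings happen per second, and how many are kept.
crossings_per_s = 1 / BUNCH_CROSSING_INTERVAL_S
crossings_per_trigger = crossings_per_s / TRIGGERS_PER_S

# Data rate produced by the triggered events.
event_rate_bytes_per_s = TRIGGERS_PER_S * EVENT_SIZE_BYTES
daily_volume_bytes = event_rate_bytes_per_s * SECONDS_PER_DAY

# What a single wide-area link can carry, in bytes per second.
link_bytes_per_s = WAN_LINK_BITS_PER_S / 8
link_fraction = link_bytes_per_s / event_rate_bytes_per_s

print(f"bunch crossings per second: {crossings_per_s:.3g}")
print(f"crossings per kept trigger: {crossings_per_trigger:.3g}")
print(f"event stream: {event_rate_bytes_per_s / 1e6:.0f} MBytes/sec")
print(f"daily volume: {daily_volume_bytes / 1e12:.2f} TBytes")
print(f"one link carries {link_fraction:.0%} of the event stream")
```

Under these assumptions, 100 triggers per second at ~1 MByte each reproduces the ~100 MBytes/sec figure, while a 622 Mbits/sec link moves a little under 78 MBytes/sec. A single link therefore cannot keep up with the full event stream, which may be why the figure lists air freight as an alternative.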
Grid Physics Network (GriPhyN) • Enabling R&D for advanced data grid systems, focusing in particular on the Virtual Data concept • Experiments shown: ATLAS, CMS, LIGO, SDSS • www.griphyn.org; slide from C. Kesselman/Cal(IT)2 presentation
Why Now? • The Internet as infrastructure • Increasing bandwidth, advanced services • Advances in storage capacity • Terabytes, petabytes per site • Increased availability of compute resources • clusters, supercomputers, etc. • Advanced applications • simulation based design, advanced scientific instruments, ...
The Grid Problem • Flexible, secure, coordinated sharing of computation among dynamic collections of individuals, institutions, and resources • Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals -- assuming the absence of… • central location • central control • omniscience • existing trust relationships The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001.
Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Sharing always conditional: issues of trust, policy, negotiation, payment, … • Coordinated problem solving • Beyond client-server: distributed data analysis, computation, collaboration, … • Dynamic, multi-institutional virtual orgs • Community overlays on classic org structures • Large or small, static or dynamic
The Programming Problem • Applications require resources (compute power, storage, data, instruments, displays) at many sites for many users. • Some requirements: • Abstractions and models to increase speed/robustness/etc. of development • Tools to ease application development and diagnose common problems, ease deployment • Code/tool sharing to allow reuse of code components developed by others
Grid must support computational workflows • Locate “suitable” computers • Authenticate with appropriate sites • Allocate resources on those computers • Initiate computation on those computers • Configure those computations • Select “appropriate” communication methods • Compute with “suitable” algorithms • Access data files, return output • Respond “appropriately” to resource changes
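The first few workflow steps (locate, authenticate, allocate, initiate, respond to changes) can be sketched as a small in-memory simulation. This is an illustration only, not any grid toolkit's API: the `Site` class, the function names, and the trust model (a set of user names per site) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """A hypothetical grid site with some free CPUs and a list of trusted users."""
    name: str
    free_cpus: int
    trusted_users: set = field(default_factory=set)


def locate(sites, cpus_needed):
    """Locate "suitable" computers: here, any site with enough free CPUs."""
    return [s for s in sites if s.free_cpus >= cpus_needed]


def authenticate(site, user):
    """Authenticate with a site: here, a simple membership check."""
    return user in site.trusted_users


def allocate(site, cpus_needed):
    """Allocate resources; fails if the site filled up in the meantime."""
    if site.free_cpus < cpus_needed:
        return False
    site.free_cpus -= cpus_needed
    return True


def run_workflow(sites, user, cpus_needed, task):
    """Try each suitable site in turn until the task completes on one."""
    for site in locate(sites, cpus_needed):
        if not authenticate(site, user):
            continue
        if not allocate(site, cpus_needed):
            continue
        try:
            result = task(site)           # initiate the computation
        except RuntimeError:
            continue                      # resource failed: try the next site
        finally:
            site.free_cpus += cpus_needed  # always release the allocation
        return site.name, result
    raise RuntimeError("no suitable site could run the task")


sites = [
    Site("alpha", 4, {"alice"}),
    Site("beta", 16, {"bob"}),
    Site("gamma", 32, {"alice"}),
]
site_name, output = run_workflow(sites, "alice", 8, lambda s: f"ran on {s.name}")
print(site_name, output)
```

In this example "alpha" is too small and "beta" does not trust the user, so the task ends up on "gamma". A real grid has to perform each of these steps across administrative domains, which is what makes them hard.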
Grid Requirements • identity & authentication • authorization & policy • resource/service discovery • resource allocation • (co-)reservation, workflow • remote data access • rapid data transfer • monitoring • intrusion detection • resource management • accounting • fault management • system evolution • and more…
Grid Computing - Functions • Grid computing must provide typically these basic functions (Foster/Kesselman) • resource discovery and information collection & publishing • data management on and between resources • process management on and between resources • common security mechanism underlying the above • In addition, it should include: • process and session recording/accounting
Grid Computing Vs Distributed Computing • How does grid computing differ from traditional distributed computing? • Where do grids get their names? • Grid hardware • Grid applications
Distributed Computing: A Quick Review • Andrew Tanenbaum: “A distributed system is a collection of independent computers that appear to the users of the system as a single computer.”
Distributed Systems: Hardware • Distributed in the local area • Memory organization: • Shared-memory multiprocessors • Single virtual address space shared by all CPUs • Multicomputers with private memories • Separate address spaces • Interconnection network organization: • Bus-based • A single shared network, backplane, bus or cable • Switch-based • Individual connections between machines
Simplest Hardware: A Bus-based Shared-Memory Multiprocessor • Shared memory • Caches must be kept consistent • Bus bandwidth limits scaling to ~64 processors • (Figure: processors, each with its own cache, connected to a shared memory by a single bus)
Bus-based Distributed Shared-Memory (DSM) Multiprocessor • Each processor contains a portion of the shared memory • Local accesses fast, remote accesses slow • “NUMA”: non-uniform memory access • (Figure: four processors, each with its own cache and memory, connected by a bus)
Switch-Based Multicomputer: Workstation Cluster • Workstations share resources: file servers, printers, storage archives • Schedule jobs • Use idle workstations • (Figure: workstations connected through an Ethernet switch)
Hardware: What is different in a grid? • Heterogeneous hardware environment • computing platforms • network connections • storage systems and caches • Wide-area distribution • Wide-area network latency and bandwidth • Resources in different administration domains • Dynamic environment • Resources enter and leave grid
Software: Issues in Distributed Operating Systems • Communication models • Client-Server Model • Remote procedure call • Group communication • In a grid: • Algorithms must tolerate wide-area latency for message transfers • Avoid large numbers of messages • Typically perform larger transfers, initiate remote jobs rather than procedure calls
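A simple cost model shows why a grid favours a few large transfers over many small messages. The sketch below assumes each message pays one full network latency and that messages are sent one after another; the latency and bandwidth figures are illustrative assumptions, not measurements.

```python
def transfer_time(total_bytes, n_messages, latency_s, bandwidth_bytes_per_s):
    """Time to move total_bytes split across n_messages sent sequentially.

    Simplified model: every message pays one latency, and the payload
    moves at the full bandwidth.
    """
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# Illustrative, assumed values.
LAN_LATENCY_S = 0.0005          # 0.5 ms within a local network
WAN_LATENCY_S = 0.05            # 50 ms across a wide-area network
BANDWIDTH_BYTES_PER_S = 10_000_000
TOTAL_BYTES = 10_000_000

chatty_lan = transfer_time(TOTAL_BYTES, 10_000, LAN_LATENCY_S, BANDWIDTH_BYTES_PER_S)
chatty_wan = transfer_time(TOTAL_BYTES, 10_000, WAN_LATENCY_S, BANDWIDTH_BYTES_PER_S)
batched_wan = transfer_time(TOTAL_BYTES, 10, WAN_LATENCY_S, BANDWIDTH_BYTES_PER_S)

print(f"10,000 small messages on a LAN: {chatty_lan:.1f} s")
print(f"10,000 small messages on a WAN: {chatty_wan:.1f} s")
print(f"10 large messages on a WAN:     {batched_wan:.1f} s")
```

With these numbers the same 10 MBytes takes about 6 seconds as 10,000 messages on the LAN, about 501 seconds as 10,000 messages on the WAN, and about 1.5 seconds as 10 batched messages on the WAN. Latency, not bandwidth, dominates the chatty wide-area case.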
Software: Issues in Distributed Operating Systems • Synchronization • Clock synchronization • Election algorithms: determine a coordinator • Atomic transactions • In a grid: • With wide-area latencies, typically perform synchronization on larger grain • Can implement atomic operations
Software: Issues in Distributed Operating Systems • Processes and Processors • Threads • Allocating Processors • Scheduling and co-scheduling resources • Fault tolerance • In a grid: scheduling, allocation, & fault tolerance issues get more complicated in the wide area environment
Software: Issues in a Distributed Operating System • Distributed file systems • File service that reads and writes file, controls access • Creating, deleting & managing directories • Naming • Sharing • Caching and consistency • Replication and updates • In a grid, same issues complicated by wide area distribution, different administrative domains, enormous data sets
Software: Issues for a Distributed Operating System • Distributed Shared Memory • Generally applies to machines in a LAN • Each processor contains memory corresponding to part of the shared memory address space • Each processor caches data from other processors • Many consistency algorithms • In a grid: EASIER! Globus does not support a shared address space, so there is no wide-area memory consistency to maintain • Legion has a single shared object space
Summary: Heterogeneity makes things harder in a grid • Heterogeneous software and hardware • Different administrative domains • Different policies for use and management of local resources • Must do coordinated scheduling • Different security policies • Dynamic environment • Must discover resources • Robust in the presence of network, resource failures
Where do computational grids get their names? • “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.” • Name (and definition) imply an analogy to the electric power grid • Power inexpensive, universally available • Enabled new devices and industries
An Infrastructure Analogy: The Electric Power Grid • Revolutionary development: transmission and distribution of electricity • Before: power accessible in crude forms • human work • horses • water power • steam engines • Today: cheap, reliable power universally available
Electric Power Grid (cont.) • Power to billions of devices • Efficient • Low-cost • Reliable • North America: 10,000 generators linked to billions of outlets • Heterogeneous components, distributed ownership • Interconnections between regions: share reserve capacity, trade excess power
Electric Power Grid (cont.) • Required more than just technology • Regulatory, political and institutional development • Infrastructure for monitoring and management • Huge social impact • Fundamentally changed work and home life • Huge environmental impact • Consume resources, generate pollution, global warming, …
Based on Infrastructure Analogies: Desired Characteristics of Grids • Pooling of resources • Compute cycles, data, people, sensors • Dependable service • Predictable • Sustained performance • Often high-performance
Grid Characteristics (cont.) • Consistent service • Standard services available • Via standard interfaces • Enable application development • Pervasive • Services always available • Inexpensive • Otherwise not widely accepted and used
A Grid Application Scenario • A distributed simulation involving 10 supercomputers at 10 different locations • How do you know where they are? • How do you identify yourself to each? • How do you get permission to use them? • How do you submit remote jobs? • How do you get access to resources on all the machines simultaneously? • What happens if a machine fails? • How are input/output files managed?
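The question of getting all the machines simultaneously is a co-allocation problem: the simulation can only start if every machine is reserved, so a partial reservation has to be rolled back. The sketch below shows that all-or-nothing pattern with a hypothetical `Machine` class; it is not how any particular grid scheduler implements it.

```python
class Machine:
    """A hypothetical remote machine that can be reserved by one job at a time."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available   # False models a failed or offline machine
        self.reserved = False

    def reserve(self):
        if not self.available or self.reserved:
            return False
        self.reserved = True
        return True

    def release(self):
        self.reserved = False


def co_allocate(machines):
    """Reserve every machine, or none of them.

    If any reservation fails, release the ones already held so that
    other users are not blocked by a job that cannot start.
    """
    acquired = []
    for machine in machines:
        if machine.reserve():
            acquired.append(machine)
        else:
            for held in acquired:
                held.release()
            return False
    return True


# Ten supercomputers at ten locations; one of them is currently down.
machines = [Machine(f"site-{i}") for i in range(10)]
machines[7].available = False

first_attempt = co_allocate(machines)
held_after_failure = sum(m.reserved for m in machines)

machines[7].available = True          # the machine comes back
second_attempt = co_allocate(machines)
held_after_success = sum(m.reserved for m in machines)

print(first_attempt, held_after_failure)
print(second_attempt, held_after_success)
```

The first attempt fails and leaves nothing reserved; the second succeeds and holds all ten machines.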
Grid Services Architecture • Applications: high-energy physics data analysis, collaborative engineering, on-line instrumentation, regional climate studies, parameter studies • Application Toolkit Layer: distributed computing, collaborative design, remote control, data-intensive, remote visualization, … • Grid Services Layer: information, resource management, security, data access, fault detection, … • Grid Fabric Layer: transport, multicast, instrumentation, control interfaces, QoS mechanisms
Layered Grid Architecture (By Analogy to Internet Architecture) • Application • Collective: “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services • Resource: “Sharing single resources”: negotiating access, controlling use • Connectivity: “Talking to things”: communication (Internet protocols) & security • Fabric: “Controlling things locally”: access to, & control of, resources • Internet Protocol Architecture, for comparison: Application, Transport, Internet, Link • Slide courtesy of C. Kesselman, Cal(IT)2 presentation
Layered Grid Architecture • Fabric Layer - provides the local services of a resource: • computational, storage, network • Connectivity Layer - core communication and authentication protocols • Enables exchange of data between fabric layer resources • Security and authentication important here
Layered Grid Architecture (cont.) • Resource Layer - enables resource sharing • Builds on connectivity layer to control and access resources (Ex: data servers) • Collective Layer - coordinates interactions across multiple resources • Ties multiple resources and services together • (Ex: metacatalogues) • Application Layer - user applications use collective, resource, and connectivity layers to perform grid operations in a virtual organization
Basic Grid Services • Security • Authentication: both client and server • Authorization: what privileges does the client have? • Access control: Sites want local control of operations that remote users are allowed to perform • Confidential data transfer using encryption
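The split between authorization (what the client is entitled to) and access control (what the site allows) can be sketched as two checks that must both pass. The data structures and names below are hypothetical, and authentication is assumed to have already established who the user is.

```python
def authorize(user, operation, vo_privileges, site_policy):
    """Allow an operation only if both checks pass.

    vo_privileges: privileges granted to each user by the virtual
                   organization (hypothetical structure).
    site_policy:   operations the local site permits remote users to
                   perform (hypothetical structure).
    """
    granted_by_vo = operation in vo_privileges.get(user, set())
    allowed_locally = operation in site_policy.get("allowed_operations", set())
    return granted_by_vo and allowed_locally


vo_privileges = {
    "alice": {"submit_job", "read_data", "delete_data"},
    "bob": {"read_data"},
}
site_policy = {"allowed_operations": {"submit_job", "read_data"}}

print(authorize("alice", "submit_job", vo_privileges, site_policy))
print(authorize("alice", "delete_data", vo_privileges, site_policy))
print(authorize("bob", "submit_job", vo_privileges, site_policy))
print(authorize("mallory", "read_data", vo_privileges, site_policy))
```

Here the site keeps the final say: "alice" holds the delete privilege in the virtual organization, but the site does not let remote users delete data, so the request is refused.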
Basic Grid Services (cont.) • Resource management • Mechanism for submitting jobs to remote locations • Local policies for use, management, resource configuration • Scheduling of important resources • Coordinating scarce, expensive resources (e.g., cooperating supercomputers) • Advanced reservations to guarantee: • Quality of service • Completion of operations (e.g., reserve disk space for a large data transfer)
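An advance reservation amounts to checking that a resource has enough spare capacity for the whole requested time window before committing it. The sketch below uses a hypothetical `ReservationBook` class with half-open time windows; the units (hours, gigabytes) are assumptions chosen for the disk-space example.

```python
class ReservationBook:
    """Advance reservations against a fixed capacity (for example, disk space)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reservations = []   # list of (start, end, amount), end exclusive

    def _in_use(self, t):
        """Total amount reserved at time t."""
        return sum(amount for start, end, amount in self.reservations
                   if start <= t < end)

    def reserve(self, start, end, amount):
        """Reserve `amount` for [start, end) if capacity is never exceeded."""
        if start >= end or amount <= 0:
            raise ValueError("need start < end and a positive amount")
        # Usage only rises when some reservation begins, so it is enough
        # to check the window start and every start inside the window.
        check_points = [start] + [s for s, _, _ in self.reservations
                                  if start < s < end]
        peak = max(self._in_use(t) for t in check_points)
        if peak + amount > self.capacity:
            return False
        self.reservations.append((start, end, amount))
        return True


# A 500 GB storage system; times are hours from now.
disk = ReservationBook(capacity=500)
first = disk.reserve(0, 10, 300)     # accepted
second = disk.reserve(5, 15, 300)    # overlaps hours 5-10: would need 600 GB
third = disk.reserve(10, 20, 300)    # starts as the first one ends
print(first, second, third)
```

The second request is refused because it would exceed capacity while it overlaps the first; the third is accepted because the first reservation has ended by then.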
Basic Grid Services (cont.) • Information Services • Register and query information about grid resources • Where are all the Cray T3Es in the grid? • Where is a storage system with 250 gigabytes of free space that transfers data at 1 gigabit/sec? • Centerpiece for many Grid components • Performance measurement services • What is the current bandwidth of the link from jupiter.isi.edu to apogee.sdsc.edu? • Dynamic environment: assume the information service contains old information
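Register-and-query with possibly stale data can be sketched as a registry that timestamps each record and lets the caller decide how old is too old. The `InformationService` class, the attribute names, and the resource names below are hypothetical; real grid information services are considerably richer.

```python
import time


class InformationService:
    """A toy registry of grid resources with a timestamp on every record."""

    def __init__(self):
        self._records = {}   # resource name -> (attributes, registration time)

    def register(self, name, attributes, now=None):
        """Publish (or refresh) the attributes of a resource."""
        timestamp = time.time() if now is None else now
        self._records[name] = (dict(attributes), timestamp)

    def query(self, predicate, max_age_s, now=None):
        """Return names of matching resources whose record is fresh enough."""
        now = time.time() if now is None else now
        matches = []
        for name, (attributes, timestamp) in self._records.items():
            if now - timestamp <= max_age_s and predicate(attributes):
                matches.append(name)
        return sorted(matches)


info = InformationService()
# Explicit times (in seconds) keep the example deterministic.
info.register("store-a", {"free_gb": 400, "gbit_per_s": 1.0}, now=900)
info.register("store-b", {"free_gb": 100, "gbit_per_s": 10.0}, now=900)
info.register("store-c", {"free_gb": 900, "gbit_per_s": 2.0}, now=0)


def big_and_fast(attributes):
    """250 gigabytes of free space at 1 gigabit/sec or better."""
    return attributes["free_gb"] >= 250 and attributes["gbit_per_s"] >= 1.0


fresh_matches = info.query(big_and_fast, max_age_s=600, now=1000)
any_age_matches = info.query(big_and_fast, max_age_s=10_000, now=1000)
print(fresh_matches)
print(any_age_matches)
```

With a ten-minute freshness limit only "store-a" is returned; "store-c" matches on paper, but its record is old enough that the free space it advertises may no longer exist.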