SKR 5800 Selected Topics in Distributed Computing

Presentation Transcript


  1. SKR 5800 Selected Topics in Distributed Computing Grid Computing: Introduction AZIZOL ABDULLAH, PhD DEPARTMENT OF COMMUNICATION TECHNOLOGY AND NETWORK

  2. Lecture Contents • Why do we have Grid Computing • What is Grid Computing • Ian Foster’s 3 point checklist • Defining Grid Computing • What is Grid and Grid Computing? • Why we need grids • Why Now? • The Grid Problems

  3. Why do We Have Grid Computing? • The term was coined in 1996 by Ian Foster and Carl Kesselman • Used to describe software needed by the rapidly growing, highly advanced high-performance computing (HPC) community • Resources that scale with technologies: • Supercomputers (MFlops in '96, now TFlops) • Big and not portable • Large data sets (GB in '96, now petabytes) • Need fast networks to move data around to resources • Need security: • NSF (and other government agencies) spend money to build infrastructure, so it is hard to get access

  4. What is Grid Computing? • Is it a new, unique idea or the next generation of distributed or meta-computing? Please find and read this paper: Ian Foster Paper “What is the Grid? A Three-point Checklist” http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf

  5. Ian Foster’s 3 point checklist • A Grid is a system that: • coordinates “resources that are not subject to centralized control” • uses “standard, open, general-purpose protocols and interfaces” • “to deliver nontrivial qualities of service.” • What does this mean? • We will try to understand this in this course.
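
The checklist can be read as a three-way conjunction: a system qualifies only if every point holds. The sketch below is a minimal illustration of that reading; the `SystemProfile` class, its field names, and the two example systems are hypothetical and do not come from Foster's paper.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    """Hypothetical description of a system; field names are illustrative."""
    name: str
    decentralized_control: bool    # coordinates resources not under central control
    open_standard_protocols: bool  # uses standard, open, general-purpose protocols
    nontrivial_qos: bool           # delivers nontrivial qualities of service


def is_grid(system: SystemProfile) -> bool:
    """Return True only if all three checklist points hold."""
    return (system.decentralized_control
            and system.open_standard_protocols
            and system.nontrivial_qos)


# Two made-up examples: a centrally managed cluster and a multi-site federation.
cluster = SystemProfile("local batch cluster", False, False, True)
multi_site = SystemProfile("multi-site federation", True, True, True)

print(is_grid(cluster))     # False: fails the first two points
print(is_grid(multi_site))  # True: all three points hold
```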

  6. Defining Grid Computing • There are several competing definitions for “The Grid” and Grid computing • These definitions tend to focus on: • Implementation of Distributed computing • A common set of interfaces, tools and APIs • Some stress the inter-institutional aspect of grids and Virtual Organizations • “The Virtualization of Resources” abstraction of resources

  7. What is Grid and Grid Computing? • Grid computing promises a standard, ‘complete’ set of distributed computing capabilities • There is a lot of hype around grid computing • Traditional users need to get work done now! • Some CS researchers see it as a fad • But there is real-world value! • In e-science and e-business

  8. What is Grid and Grid Computing? (cont.) • Grid computing must provide basic functions • resource discovery and information collection & publishing • data management on and between resources • process management on and between resources • common security mechanism underlying the above • process and session recording/accounting • Current grid computing tools such as Globus provide most of the above at some level • The current capabilities are incomplete • New web-service-based standards will help current tools become interoperable.

  9. The Grid • “Resource sharing & coordinated problem solving in dynamic … virtual organizations” • Enable integration of distributed services & resources • Using general-purpose protocols & infrastructure • To achieve useful qualities of service (“The Anatomy of the Grid”, Foster, Kesselman, Tuecke, 2001)

  10. Why we need grids

  11. Grid3: An Operational Grid • 28 sites (2100-2800 CPUs) & growing • 400-1300 concurrent jobs • 8 substantial applications + CS experiments • Running since October 2003 • Figure: site map (label: Korea) • Slide courtesy of Ian Foster; http://www.ivdgl.org/grid3

  12. Data Grids for High Energy Physics (figure: tiered data-distribution diagram; image courtesy Harvey Newman, Caltech) • Online System: there is a “bunch crossing” every 25 nsecs; there are 100 “triggers” per second; each triggered event is ~1 MByte in size; ~PBytes/sec in, ~100 MBytes/sec out • Offline Processor Farm: ~20 TIPS • Tier 0: CERN Computer Centre, fed at ~100 MBytes/sec • Tier 1: FermiLab (~4 TIPS) and the France, Germany, and Italy Regional Centres; links of ~622 Mbits/sec or air freight (deprecated) • Tier 2: Tier2 Centres of ~1 TIPS each, including Caltech; links of ~622 Mbits/sec • Institutes: ~0.25 TIPS, with links of ~622 Mbits/sec and a physics data cache • Tier 4: physicist workstations (Pentium II 300 MHz), fed at ~1 MBytes/sec • HPSS storage is shown five times in the figure • 1 TIPS is approximately 25,000 SpecInt95 equivalents • Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
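
The rates quoted on this slide can be cross-checked with a little arithmetic. The sketch below assumes decimal units (1 MByte = 8 Mbits, 1 TByte = 1,000,000 MBytes) and ignores protocol overhead; the variable names are illustrative.

```python
# Figures taken from the slide.
TRIGGERS_PER_SECOND = 100   # "100 triggers per second"
EVENT_SIZE_MBYTES = 1       # "each triggered event is ~1 MByte"
LINK_MBITS_PER_SEC = 622    # "~622 Mbits/sec" wide-area links

# Triggered data rate: events per second times event size.
event_rate_mbytes_per_s = TRIGGERS_PER_SECOND * EVENT_SIZE_MBYTES

# Capacity of one 622 Mbits/sec link expressed in MBytes/sec (8 bits per byte).
link_mbytes_per_s = LINK_MBITS_PER_SEC / 8

# Volume produced per day at the triggered rate, in TBytes (decimal units).
SECONDS_PER_DAY = 24 * 60 * 60
daily_volume_tbytes = event_rate_mbytes_per_s * SECONDS_PER_DAY / 1_000_000

print(event_rate_mbytes_per_s)  # 100, matching the ~100 MBytes/sec on the slide
print(link_mbytes_per_s)        # 77.75
print(daily_volume_tbytes)      # 8.64
```

Under these assumptions a single ~622 Mbits/sec link carries somewhat less than the ~100 MBytes/sec triggered stream, which is consistent with the slide spreading data across several regional centres.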

  13. Grid Physics Network (GriPhyN) • Enabling R&D for advanced data grid systems, focusing in particular on the Virtual Data concept • Experiments shown: ATLAS, CMS, LIGO, SDSS • www.griphyn.org; slide from C. Kesselman/Cal(IT)2 presentation

  14. Why Now? • The Internet as infrastructure • Increasing bandwidth, advanced services • Advances in storage capacity • Terabytes, petabytes per site • Increased availability of compute resources • clusters, supercomputers, etc. • Advanced applications • simulation based design, advanced scientific instruments, ...

  15. The Grid Problem • Flexible, secure, coordinated sharing of computation among dynamic collections of individuals, institutions, and resources • Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals -- assuming the absence of… • central location • central control • omniscience • existing trust relationships The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001.

  16. Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Sharing always conditional: issues of trust, policy, negotiation, payment, … • Coordinated problem solving • Beyond client-server: distributed data analysis, computation, collaboration, … • Dynamic, multi-institutional virtual orgs • Community overlays on classic org structures • Large or small, static or dynamic

  17. The Programming Problem • Applications require resources (compute power, storage, data, instruments, displays) at many sites for many users. • Some requirements: • Abstractions and models to increase speed/robustness/etc. of development • Tools to ease application development and diagnose common problems, ease deployment • Code/tool sharing to allow reuse of code components developed by others

  18. Grid must support computational workflows • Locate “suitable” computers • Authenticate with appropriate sites • Allocate resources on those computers • Initiate computation on those computers • Configure those computations • Select “appropriate” communication methods • Compute with “suitable” algorithms • Access data files, return output • Respond “appropriately” to resource changes

  19. Grid Requirements • identity & authentication • authorization & policy • resource/service discovery • resource allocation • (co-)reservation, workflow • remote data access • rapid data transfer • monitoring • intrusion detection • resource management • accounting • fault management • system evolution • and more…

  20. Grid Computing - Functions • Grid computing must typically provide these basic functions (Foster/Kesselman) • resource discovery and information collection & publishing • data management on and between resources • process management on and between resources • common security mechanism underlying the above • In addition, it should include: • process and session recording/accounting

  24. Grid Computing Vs Distributed Computing • How does grid computing differ from traditional distributed computing? • Where do grids get their names? • Grid hardware • Grid applications

  25. Distributed Computing: A Quick Review • Andrew Tanenbaum: “A distributed system is a collection of independent computers that appear to the users of the system as a single computer.”

  26. Distributed Systems: Hardware • Distributed in the local area • Memory organization: • Shared-memory multiprocessors • Single virtual address space shared by all CPUs • Multicomputers with private memories • Separate address spaces • Interconnection network organization: • Bus-based • A single shared network, backplane, bus or cable • Switch-based • Individual connections between machines

  27. Simplest Hardware: A Bus-based Shared-Memory Multiprocessor (figure: processors, each with a cache, connected by a bus to a shared memory) • Shared memory • Caches must be kept consistent • Bus bandwidth limits to ~64 processors

  28. Bus-based Distributed Shared-Memory (DSM) Multiprocessor (figure: four processors, each with a cache and a memory, connected by a bus) • Each processor contains portion of shared memory • Local accesses fast, remote accesses slow • “NUMA”: non-uniform memory access
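
The "local accesses fast, remote accesses slow" point can be made concrete with a weighted average. The sketch below is a simple cost model, not a measurement; the function name and the 100 ns / 400 ns latencies are assumptions chosen only for illustration.

```python
def average_access_ns(local_fraction: float, local_ns: float, remote_ns: float) -> float:
    """Expected memory access time when a given fraction of accesses is local."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Assumed latencies (illustrative only): 100 ns local, 400 ns remote.
print(average_access_ns(1.0, 100.0, 400.0))   # 100.0: everything is local
print(average_access_ns(0.75, 100.0, 400.0))  # 175.0: a quarter of accesses are remote
print(average_access_ns(0.0, 100.0, 400.0))   # 400.0: everything is remote
```

Even a modest share of remote accesses raises the average noticeably, which is why data placement matters on NUMA machines.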

  29. Switch-Based Multicomputer: Workstation Cluster (figure: workstations connected through an Ethernet switch) • Workstations share resources: file servers, printers, storage archives • Schedule jobs • Use idle workstations

  30. Hardware: What is different in a grid? • Heterogeneous hardware environment • computing platforms • network connections • storage systems and caches • Wide-area distribution • Wide-area network latency and bandwidth • Resources in different administration domains • Dynamic environment • Resources enter and leave grid

  31. Software: Issues in Distributed Operating Systems • Communication models • Client-Server Model • Remote procedure call • Group communication • In a grid: • Algorithms must tolerate wide-area latency for message transfers • Avoid large numbers of messages • Typically perform larger transfers, initiate remote jobs rather than procedure calls
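
Why a grid prefers a few large transfers over many small messages follows from a simple cost model: each message pays the wide-area latency once, and the payload pays for bandwidth. The sketch below assumes messages are sent one after another with no pipelining; the function name and the latency and bandwidth figures are illustrative assumptions.

```python
def transfer_time_s(total_bytes: int, num_messages: int,
                    latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Time to move total_bytes split across num_messages sequential messages."""
    if num_messages < 1:
        raise ValueError("need at least one message")
    # Each message pays the latency once; the payload pays for bandwidth once.
    return num_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# Assumed wide-area link: 50 ms latency, 10 MBytes/sec, moving 10 MBytes.
PAYLOAD = 10_000_000
many_small = transfer_time_s(PAYLOAD, 10_000, 0.05, 10_000_000)
one_large = transfer_time_s(PAYLOAD, 1, 0.05, 10_000_000)

print(round(many_small, 2))  # about 501 seconds, dominated by latency
print(round(one_large, 2))   # about 1.05 seconds, dominated by bandwidth
```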

  32. Software: Issues in Distributed Operating Systems • Synchronization • Clock synchronization • Election algorithms: determine a coordinator • Atomic transactions • In a grid: • With wide-area latencies, typically perform synchronization on larger grain • Can implement atomic operations
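
The slide does not name a particular election algorithm. As one common example, the bully algorithm ends with the highest-numbered reachable node becoming coordinator. The sketch below shows only that outcome rule, not the message exchange; `elect_coordinator` and the liveness callback are hypothetical names.

```python
from typing import Callable, Iterable


def elect_coordinator(node_ids: Iterable[int], is_alive: Callable[[int], bool]) -> int:
    """Pick the highest-numbered live node, as the bully algorithm would."""
    live = [node for node in node_ids if is_alive(node)]
    if not live:
        raise RuntimeError("no live nodes to elect")
    return max(live)


# Node 5 has failed, so node 3 becomes the coordinator.
failed = {5}
winner = elect_coordinator([1, 2, 3, 5], lambda node: node not in failed)
print(winner)  # 3
```

In a wide-area setting the liveness check itself is slow and unreliable, which is one reason the slide suggests synchronizing at a larger grain.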

  33. Software: Issues in Distributed Operating Systems • Processes and Processors • Threads • Allocating Processors • Scheduling and co-scheduling resources • Fault tolerance • In a grid: scheduling, allocation, & fault tolerance issues get more complicated in the wide area environment

  34. Software: Issues in a Distributed Operating System • Distributed file systems • File service that reads and writes files, controls access • Creating, deleting & managing directories • Naming • Sharing • Caching and consistency • Replication and updates • In a grid, same issues complicated by wide area distribution, different administrative domains, enormous data sets

  35. Software: Issues for a Distributed Operating System • Distributed Shared Memory • Generally applies to machines in a LAN • Each processor contains memory corresponding to part of the shared memory address space • Each processor caches data from other processors • Many consistency algorithms • In a grid: EASIER! Globus does not support a shared address space • Legion has a single shared object space

  36. Summary: Heterogeneity makes things harder in a grid • Heterogeneous software and hardware • Different administrative domains • Different policies for use and management of local resources • Must do coordinated scheduling • Different security policies • Dynamic environment • Must discover resources • Robust in the presence of network, resource failures

  37. Where do computational grids get their names? • “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.” • Name (and definition) imply an analogy to the electric power grid • Power inexpensive, universally available • Enabled new devices and industries

  38. An Infrastructure Analogy: The Electric Power Grid • Revolutionary development: transmission and distribution of electricity • Before: power accessible in crude forms • human work • horses • water power • steam engines • Today: cheap, reliable power universally available

  39. Electric Power Grid (cont.) • Power to billions of devices • Efficient • Low-cost • Reliable • North America: 10,000 generators linked to billions of outlets • Heterogeneous components, distributed ownership • Interconnections between regions: share reserve capacity, trade excess power

  40. Electric Power Grid (cont.) • Required more than just technology • Regulatory, political and institutional development • Infrastructure for monitoring and management • Huge social impact • Fundamentally changed work and home life • Huge environmental impact • Consume resources, generate pollution, global warming, …

  41. Based on Infrastructure Analogies: Desired Characteristics of Grids • Pooling of resources • Compute cycles, data, people, sensors • Dependable service • Predictable • Sustained performance • Often high-performance

  42. Grid Characteristics (cont.) • Consistent service • Standard services available • Via standard interfaces • Enable application development • Pervasive • Services always available • Inexpensive • Otherwise not widely accepted and used

  43. A Grid Application Scenario • A distributed simulation involving 10 supercomputers at 10 different locations • How do you know where they are? • How do you identify yourself to each? • How do you get permission to use them? • How do you submit remote jobs? • How do you get access to resources on all the machines simultaneously? • What happens if a machine fails? • How are input/output files managed?

  44. Grid Services Architecture (figure: layered diagram) • Applications: High-energy physics data analysis, Collaborative engineering, On-line instrumentation, Regional climate studies, Parameter studies • Application Toolkit Layer: Distributed computing, Collab. design, Remote control, Data-intensive, Remote viz • Grid Services Layer: Information, Resource mgmt, Security, Data access, Fault detection, . . . • Grid Fabric Layer: Transport, Multicast, Instrumentation, Control interfaces, QoS mechanisms, . . .

  45. Layered Grid Architecture (By Analogy to Internet Architecture) • Grid layers, top to bottom: • Application • Collective: “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services • Resource: “Sharing single resources”: negotiating access, controlling use • Connectivity: “Talking to things”: communication (Internet protocols) & security • Fabric: “Controlling things locally”: access to, & control of, resources • Internet Protocol Architecture, top to bottom: Application, Transport, Internet, Link • Slide courtesy of C. Kesselman Cal(IT)2 presentation

  46. Layered Grid Architecture • Fabric Layer - provides the local services of a resource: • computational, storage, network • Connectivity Layer - core communication and authentication protocols • Enables exchange of data between fabric layer resources • Security and authentication important here

  47. Layered Grid Architecture (cont.) • Resource Layer – enables resource sharing • Builds on connectivity layer to control and access resources (Ex: data servers) • Collective Layer - coordinates interactions across multiple resources • Ties multiple resources and services together • (Ex: metacatalogues) • Application Layer - user applications use collective, resource, and connectivity layers to perform grid operations in a virtual organization

  48. Basic Grid Services • Security • Authentication: both client and server • Authorization: what privileges does the client have? • Access control: Sites want local control of operations that remote users are allowed to perform • Confidential data transfer using encryption
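
The split between authentication (who is the client?) and authorization under site-local control (what may that client do here?) can be sketched as two separate lookups. Everything below is hypothetical: the identities, tokens, and operation names are made up, the credential table merely stands in for a real authentication mechanism, and only client authentication is shown even though the slide calls for authenticating both sides.

```python
from typing import Dict, Optional, Set

# Hypothetical credential table standing in for a real authentication mechanism.
KNOWN_CREDENTIALS: Dict[str, str] = {
    "token-alice": "alice@vo.example.org",
    "token-bob": "bob@vo.example.org",
}

# Hypothetical site-local policy: operations each remote identity may perform.
SITE_POLICY: Dict[str, Set[str]] = {
    "alice@vo.example.org": {"submit_job", "read_data"},
    "bob@vo.example.org": {"read_data"},
}


def authenticate(credential: str) -> Optional[str]:
    """Map a credential to an identity, or None if it is not recognized."""
    return KNOWN_CREDENTIALS.get(credential)


def is_allowed(credential: str, operation: str) -> bool:
    """Authenticate first, then apply the site's own authorization policy."""
    identity = authenticate(credential)
    if identity is None:
        return False  # unauthenticated clients get no privileges
    return operation in SITE_POLICY.get(identity, set())


print(is_allowed("token-alice", "submit_job"))  # True
print(is_allowed("token-bob", "submit_job"))    # False: authenticated, not authorized
print(is_allowed("token-eve", "read_data"))     # False: not authenticated
```

Keeping the policy table at the site reflects the slide's point that sites want local control over what remote users may do.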

  49. Basic Grid Services (cont.) • Resource management • Mechanism for submitting jobs to remote locations • Local policies for use, management, resource configuration • Scheduling of important resources • Coordinating scarce, expensive resources (e.g., cooperating supercomputers) • Advanced reservations to guarantee: • Quality of service • Completion of operations (e.g., reserve disk space for a large data transfer)
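
An advance reservation, such as reserving disk space for a large transfer, amounts to admitting a request only if capacity is never exceeded during its time window. The sketch below is one minimal way to check that, assuming half-open integer time intervals and a single disk pool; the class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reservation:
    start: int      # inclusive start time (arbitrary integer units)
    end: int        # exclusive end time
    gigabytes: int  # disk space held for the whole interval


class DiskReservations:
    """Admit a reservation only if capacity holds at every instant of its window."""

    def __init__(self, capacity_gb: int) -> None:
        self.capacity_gb = capacity_gb
        self.accepted: List[Reservation] = []

    def _usage_at(self, t: int) -> int:
        return sum(r.gigabytes for r in self.accepted if r.start <= t < r.end)

    def try_reserve(self, request: Reservation) -> bool:
        if request.end <= request.start or request.gigabytes <= 0:
            raise ValueError("reservation must have positive length and size")
        # Usage only rises when a reservation starts, so checking the request's
        # start and every accepted start inside its window is sufficient.
        check_points = {request.start} | {
            r.start for r in self.accepted if request.start <= r.start < request.end
        }
        if any(self._usage_at(t) + request.gigabytes > self.capacity_gb
               for t in check_points):
            return False
        self.accepted.append(request)
        return True


pool = DiskReservations(capacity_gb=500)
print(pool.try_reserve(Reservation(0, 10, 300)))   # True
print(pool.try_reserve(Reservation(5, 15, 300)))   # False: 600 GB needed at t=5
print(pool.try_reserve(Reservation(10, 20, 300)))  # True: first one has ended
```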

  50. Basic Grid Services (cont.) • Information Services • Register and query information about grid resources • Where are all the Cray T3E’s in the grid? • Where is a storage system with 250 gigabytes of free space that transfers data at 1 gigabit/sec? • Centerpiece for many Grid components • Performance measurement services • What is the current bandwidth of the link from jupiter.isi.edu to apogee.sdsc.edu? • Dynamic environment: assume the information service contains old information
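
Registering and querying resources, while treating stored information as possibly stale, can be sketched with a registry that timestamps each record and lets the caller bound its age. The class below is a toy in-memory illustration, not the interface of any real grid information service; the host names and attribute keys are made up.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ResourceRecord:
    name: str
    attributes: Dict[str, float]
    registered_at: float  # seconds since the epoch


class InformationService:
    """Toy registry: resources publish attributes, clients query with an age limit."""

    def __init__(self) -> None:
        self._records: Dict[str, ResourceRecord] = {}

    def register(self, name: str, attributes: Dict[str, float],
                 now: Optional[float] = None) -> None:
        timestamp = time.time() if now is None else now
        self._records[name] = ResourceRecord(name, dict(attributes), timestamp)

    def query(self, predicate: Callable[[Dict[str, float]], bool],
              max_age_s: float, now: Optional[float] = None) -> List[str]:
        current = time.time() if now is None else now
        return sorted(
            record.name for record in self._records.values()
            if current - record.registered_at <= max_age_s
            and predicate(record.attributes)
        )


info = InformationService()
info.register("storage-a.example.org", {"free_gb": 400, "gbit_per_s": 1.0}, now=1000.0)
info.register("storage-b.example.org", {"free_gb": 100, "gbit_per_s": 1.0}, now=1000.0)
info.register("storage-c.example.org", {"free_gb": 900, "gbit_per_s": 1.0}, now=0.0)


def wanted(attrs: Dict[str, float]) -> bool:
    # The slide's example query: 250 GB free and 1 gigabit/sec transfers.
    return attrs["free_gb"] >= 250 and attrs["gbit_per_s"] >= 1.0


# Only records at most 300 seconds old are trusted.
print(info.query(wanted, max_age_s=300.0, now=1100.0))  # ['storage-a.example.org']
```

The record for `storage-c.example.org` matches the query but is too old to trust, which illustrates the slide's advice to assume the information service may hold outdated data.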
