SKR 5800 Selected Topics in Distributed Computing

SKR 5800 Selected Topics in Distributed Computing • Grid Computing: Introduction • AZIZOL ABDULLAH, PhD • DEPARTMENT OF COMMUNICATION TECHNOLOGY AND NETWORK

Lecture Contents • Why do we have Grid Computing • What is Grid Computing • Ian Foster’s 3 point checklist • Defining Grid Computing • What is Grid and Grid Computing? • Why we need grids • Why Now? • The Grid Problems

Why do We Have Grid Computing? • The term was coined in 1996 by Ian Foster and Carl Kesselman • Used to describe software needed by the rapidly growing high-performance computing (HPC) community • Resources that scale with technologies: • Supercomputers (GFlops in 1996, TFlops now) • Big and not portable • Large data sets (gigabytes in 1996, petabytes now) • Need fast networks to move data around to resources • Need security: • NSF (and other government agencies) spend money to build the infrastructure, so access is restricted and hard to obtain

What is Grid Computing? • Is it a new, unique idea or the next generation of distributed or meta-computing? Please find and read this paper: Ian Foster Paper “What is the Grid? A Three-point Checklist” http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf

Ian Foster’s 3 point checklist • A Grid is a system that is able to • coordinate “resources that are not subject to centralized control” • Use “standard, open, general-purpose protocols and interfaces” • “to deliver nontrivial qualities of service.” • What does this mean? • We will try to understand this in this course.

Defining Grid Computing • There are several competing definitions for “The Grid” and Grid computing • These definitions tend to focus on: • Implementation of distributed computing • A common set of interfaces, tools and APIs • Some stress the inter-institutional aspect of grids and Virtual Organizations • “The Virtualization of Resources”: abstraction of resources

What is Grid and Grid Computing? • Grid computing promises a standard, ‘complete’ set of distributed computing capabilities • There is a lot of hype around grid computing • Traditional users need to get work done now! • Some CS researchers see it as a fad • But there is real-world value! • In e-science and e-business

What is Grid and Grid Computing? (cont.) • Grid computing must provide basic functions • resource discovery and information collection & publishing • data management on and between resources • process management on and between resources • common security mechanism underlying the above • process and session recording/accounting • Current grid computing tools such as Globus provide most of the above at some level • The current capabilities are incomplete • New web-service-based standards will help current tools become interoperable.

The Grid “Resource sharing & coordinated problem solving in dynamic … virtual organizations” Enable integration of distributed service & resources Using general-purpose protocols & infrastructure To achieve useful qualities of service “The Anatomy of the Grid”, Foster, Kesselman, Tuecke, 2001

Why we need grids

Grid3: An Operational Grid • 28 sites (2100-2800 CPUs) & growing • 400-1300 concurrent jobs • 8 substantial applications + CS experiments • Running since October 2003 • Slide courtesy of Ian Foster, http://www.ivdgl.org/grid3 [Figure: map of Grid3 sites; one label reads “Korea”]

Data Grids for High Energy Physics [Figure: tiered data grid, listed top to bottom] • Online System: ~PBytes/sec in; there is a “bunch crossing” every 25 nsecs; there are 100 “triggers” per second; each triggered event is ~1 MByte in size • ~100 MBytes/sec to the Offline Processor Farm (~20 TIPS) • ~100 MBytes/sec to Tier 0: CERN Computer Centre • ~622 Mbits/sec, or Air Freight (deprecated), to Tier 1: FermiLab (~4 TIPS), France Regional Centre, Germany Regional Centre, Italy Regional Centre • HPSS storage systems are shown at five sites • ~622 Mbits/sec to Tier 2: Tier2 Centres, including Caltech (~1 TIPS each) • ~622 Mbits/sec to Institutes (~0.25 TIPS), each with a physics data cache • ~1 MBytes/sec to Tier 4: Physicist workstations (Pentium II 300 MHz) • 1 TIPS is approximately 25,000 SpecInt95 equivalents • Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server • Image courtesy Harvey Newman, Caltech
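The rates quoted on this slide can be cross-checked with a little arithmetic. The sketch below uses only the figures given on the slide (25 ns bunch spacing, 100 triggers per second, ~1 MByte per event, ~622 Mbits/sec links). The variable names, the decimal units (1 MByte = 10^6 bytes) and the neglect of protocol overhead are assumptions made for illustration.

```python
# Figures from the slide; names and unit conventions are illustrative assumptions.
BUNCH_SPACING_NS = 25        # one bunch crossing every 25 ns
TRIGGERS_PER_SEC = 100       # events kept by the trigger each second
EVENT_SIZE_MBYTES = 1        # ~1 MByte per triggered event
LINK_MBITS_PER_SEC = 622     # Tier 0 to Tier 1 link speed

# Crossing rate: one second divided by the bunch spacing.
crossings_per_sec = 1_000_000_000 // BUNCH_SPACING_NS

# Data rate leaving the online system.
stream_mbytes_per_sec = TRIGGERS_PER_SEC * EVENT_SIZE_MBYTES

# Fraction of bunch crossings that survive the trigger.
kept_fraction = TRIGGERS_PER_SEC / crossings_per_sec

# Volume accumulated in one day, in TBytes.
tbytes_per_day = stream_mbytes_per_sec * 24 * 60 * 60 / 1_000_000

# Hours needed to push one day of data through a single 622 Mbit/s link.
hours_to_ship_one_day = (tbytes_per_day * 8_000_000) / LINK_MBITS_PER_SEC / 3600

print(crossings_per_sec, stream_mbytes_per_sec, kept_fraction)
print(round(tbytes_per_day, 2), round(hours_to_ship_one_day, 1))
```

Under these assumptions the trigger output matches the ~100 MBytes/sec shown in the figure, and a single ~622 Mbits/sec link would need roughly 31 hours to carry one day of data, which suggests why air freight once appeared as an alternative.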

Grid Physics Network (GriPhyN) • Enabling R&D for advanced data grid systems, focusing in particular on the Virtual Data concept • Experiments: ATLAS, CMS, LIGO, SDSS • www.griphyn.org; slide from C. Kesselman/Cal(IT)2 presentation

Why Now? • The Internet as infrastructure • Increasing bandwidth, advanced services • Advances in storage capacity • Terabytes, petabytes per site • Increased availability of compute resources • clusters, supercomputers, etc. • Advanced applications • simulation based design, advanced scientific instruments, ...

The Grid Problem • Flexible, secure, coordinated sharing of computation among dynamic collections of individuals, institutions, and resources • Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals -- assuming the absence of… • central location • central control • omniscience • existing trust relationships The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001.

Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Sharing always conditional: issues of trust, policy, negotiation, payment, … • Coordinated problem solving • Beyond client-server: distributed data analysis, computation, collaboration, … • Dynamic, multi-institutional virtual orgs • Community overlays on classic org structures • Large or small, static or dynamic

The Programming Problem • Applications require resources (compute power, storage, data, instruments, displays) at many sites for many users. • Some requirements: • Abstractions and models to increase speed/robustness/etc. of development • Tools to ease application development and diagnose common problems, ease deployment • Code/tool sharing to allow reuse of code components developed by others

Grid must support computational workflows • Locate “suitable” computers • Authenticate with appropriate sites • Allocate resources on those computers • Initiate computation on those computers • Configure those computations • Select “appropriate” communication methods • Compute with “suitable” algorithms • Access data files, return output • Respond “appropriately” to resource changes
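The first few workflow steps (locate, authenticate, allocate, initiate, respond to failure) can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not the interface of Globus or any other toolkit: the `Site` class, the function names and the trust check are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical description of one grid site (not a real toolkit type)."""
    name: str
    free_cpus: int
    trusted_users: set = field(default_factory=set)


def locate(sites, cpus_needed):
    """Locate 'suitable' computers: here, sites with enough free CPUs."""
    return [s for s in sites if s.free_cpus >= cpus_needed]


def authenticate(site, user):
    """Authenticate with a site: here, a simple membership check."""
    return user in site.trusted_users


def run_workflow(sites, user, cpus_needed, task):
    """Try each suitable site in turn and fall back to the next on failure."""
    for site in locate(sites, cpus_needed):
        if not authenticate(site, user):
            continue                        # not allowed here, try the next site
        site.free_cpus -= cpus_needed       # allocate resources
        try:
            return site.name, task()        # initiate the computation
        except RuntimeError:
            pass                            # resource failed: try the next site
        finally:
            site.free_cpus += cpus_needed   # always release the allocation
    raise LookupError("no suitable site available")


sites = [
    Site("site-a", free_cpus=4, trusted_users={"alice"}),
    Site("site-b", free_cpus=64, trusted_users={"bob"}),
    Site("site-c", free_cpus=128, trusted_users={"alice"}),
]
result = run_workflow(sites, "alice", cpus_needed=32, task=lambda: sum(range(10)))
print(result)
```

In this example site-a is too small and site-b does not trust the user, so the job lands on site-c. A real grid would replace each helper with a network service, which is where most of the difficulty lies.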

Grid Requirements • identity & authentication • authorization & policy • resource/service discovery • resource allocation • (co-)reservation, workflow • remote data access • rapid data transfer • monitoring • intrusion detection • resource management • accounting • fault management • system evolution • and more…

Grid Computing - Functions • Grid computing must typically provide these basic functions (Foster/Kesselman) • resource discovery and information collection & publishing • data management on and between resources • process management on and between resources • common security mechanism underlying the above • In addition, it should include: • process and session recording/accounting

Grid Computing Vs Distributed Computing • How does grid computing differ from traditional distributed computing? • Where do grids get their names? • Grid hardware • Grid applications

Distributed Computing: A Quick Review • Andrew Tanenbaum: “A distributed system is a collection of independent computers that appear to the users of the system as a single computer.”

Distributed Systems: Hardware • Distributed in the local area • Memory organization: • Shared-memory multiprocessors • Single virtual address space shared by all CPUs • Multicomputers with private memories • Separate address spaces • Interconnection network organization: • Bus-based • A single shared network, backplane, bus or cable • Switch-based • Individual connections between machines

Simplest Hardware: A Bus-based Shared-Memory Multiprocessor • Shared memory • Caches must be kept consistent • Bus bandwidth limits scaling to ~64 processors [Figure: three processors, each with its own cache, connected by a bus to a single shared memory]

Bus-based Distributed Shared-Memory (DSM) Multiprocessor • Each processor contains a portion of the shared memory • Local accesses fast, remote accesses slow • “NUMA”: non-uniform memory access [Figure: four processors, each with its own cache and memory, connected by a bus]

Switch-Based Multicomputer: Workstation Cluster • Workstations share resources: file servers, printers, storage archives • Schedule jobs • Use idle workstations [Figure: six workstations connected through an Ethernet switch]

Hardware: What is different in a grid? • Heterogeneous hardware environment • computing platforms • network connections • storage systems and caches • Wide-area distribution • Wide-area network latency and bandwidth • Resources in different administrative domains • Dynamic environment • Resources enter and leave the grid

Software: Issues in Distributed Operating Systems • Communication models • Client-Server Model • Remote procedure call • Group communication • In a grid: • Algorithms must tolerate wide-area latency for message transfers • Avoid large numbers of messages • Typically perform larger transfers, initiate remote jobs rather than procedure calls
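A simple cost model shows why grid software prefers a few large transfers over many small messages. The sketch assumes messages are sent one after another, each paying one network latency, and that the payload moves at a fixed bandwidth. The latency, bandwidth and payload values are illustrative assumptions, not measurements.

```python
def transfer_time(total_bytes, n_messages, latency_s, bandwidth_bytes_per_s):
    """Serial cost model: one latency per message plus time on the wire."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


TOTAL = 100 * 1_000_000       # 100 MB payload (assumed)
BANDWIDTH = 10 * 1_000_000    # 10 MB/s on both networks (assumed)
LAN_LATENCY = 0.0005          # 0.5 ms per message (assumed)
WAN_LATENCY = 0.05            # 50 ms per message (assumed)

# Same payload, split into 100,000 small messages or sent as one transfer.
many_small_lan = transfer_time(TOTAL, 100_000, LAN_LATENCY, BANDWIDTH)
many_small_wan = transfer_time(TOTAL, 100_000, WAN_LATENCY, BANDWIDTH)
one_large_wan = transfer_time(TOTAL, 1, WAN_LATENCY, BANDWIDTH)

print(f"LAN, many small messages: {many_small_lan:8.2f} s")
print(f"WAN, many small messages: {many_small_wan:8.2f} s")
print(f"WAN, one large transfer:  {one_large_wan:8.2f} s")
```

With these numbers the chatty pattern costs about a minute on the LAN but well over an hour on the WAN, while a single bulk transfer over the WAN takes about ten seconds. Pipelining or parallel streams would change the figures, but not the direction of the argument.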

Software: Issues in Distributed Operating Systems • Synchronization • Clock synchronization • Election algorithms: determine a coordinator • Atomic transactions • In a grid: • With wide-area latencies, synchronization is typically performed at a coarser grain • Atomic operations can be implemented

Software: Issues in Distributed Operating Systems • Processes and Processors • Threads • Allocating Processors • Scheduling and co-scheduling resources • Fault tolerance • In a grid: scheduling, allocation, & fault tolerance issues get more complicated in the wide-area environment

Software: Issues in a Distributed Operating System • Distributed file systems • File service that reads and writes files, controls access • Creating, deleting & managing directories • Naming • Sharing • Caching and consistency • Replication and updates • In a grid, the same issues are complicated by wide-area distribution, different administrative domains, and enormous data sets

Software: Issues for a Distributed Operating System • Distributed Shared Memory • Generally applies to machines in a LAN • Each processor contains memory corresponding to part of the shared memory address space • Each processor caches data from other processors • Many consistency algorithms • In a grid: EASIER! Globus does not support a shared address space • Legion has a single shared object space

Summary: Heterogeneity makes things harder in a grid • Heterogeneous software and hardware • Different administrative domains • Different policies for use and management of local resources • Must do coordinated scheduling • Different security policies • Dynamic environment • Must discover resources • Must be robust in the presence of network and resource failures

Where do computational grids get their names? • “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.” • Name (and definition) imply an analogy to the electric power grid • Power inexpensive, universally available • Enabled new devices and industries

An Infrastructure Analogy: The Electric Power Grid • Revolutionary development: transmission and distribution of electricity • Before: power accessible in crude forms • human work • horses • water power • steam engines • Today: cheap, reliable power universally available

Electric Power Grid (cont.) • Power to billions of devices • Efficient • Low-cost • Reliable • North America: 10,000 generators linked to billions of outlets • Heterogeneous components, distributed ownership • Interconnections between regions: share reserve capacity, trade excess power

Electric Power Grid (cont.) • Required more than just technology • Regulatory, political and institutional development • Infrastructure for monitoring and management • Huge social impact • Fundamentally changed work and home life • Huge environmental impact • Consume resources, generate pollution, global warming, …

Based on Infrastructure Analogies: Desired Characteristics of Grids • Pooling of resources • Compute cycles, data, people, sensors • Dependable service • Predictable • Sustained performance • Often high-performance

Grid Characteristics (cont.) • Consistent service • Standard services available • Via standard interfaces • Enable application development • Pervasive • Services always available • Inexpensive • Otherwise not widely accepted and used

A Grid Application Scenario • A distributed simulation involving 10 supercomputers at 10 different locations • How do you know where they are? • How do you identify yourself to each? • How do you get permission to use them? • How do you submit remote jobs? • How do you get access to resources on all the machines simultaneously? • What happens if a machine fails? • How are input/output files managed?

Grid Services Architecture [Figure: layered stack, top to bottom] • Applications: high-energy physics data analysis, collaborative engineering, on-line instrumentation, regional climate studies, parameter studies • Application Toolkit Layer: distributed computing, data-intensive, collaborative design, remote visualization, remote control • Grid Services Layer: information, resource management, security, data access, fault detection, ... • Grid Fabric Layer: transport, multicast, instrumentation, control interfaces, QoS mechanisms, ...

Layered Grid Architecture (By Analogy to Internet Architecture) [Figure: grid layers shown beside the Internet protocol stack] • Application • Collective - “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services • Resource - “Sharing single resources”: negotiating access, controlling use • Connectivity - “Talking to things”: communication (Internet protocols) & security • Fabric - “Controlling things locally”: access to, & control of, resources • Internet Protocol Architecture, for comparison: Application, Transport, Internet, Link • Slide courtesy of C. Kesselman, Cal(IT)2 presentation

Layered Grid Architecture • Fabric Layer - provides the local services of a resource: • computational, storage, network • Connectivity Layer - core communication and authentication protocols • Enables exchange of data between fabric layer resources • Security and authentication are important here

Layered Grid Architecture (cont.) • Resource Layer - enables resource sharing • Builds on the connectivity layer to control and access resources (Ex: data servers) • Collective Layer - coordinates interactions across multiple resources • Ties multiple resources and services together • (Ex: metacatalogues) • Application Layer - user applications use the collective, resource, and connectivity layers to perform grid operations in a virtual organization

Basic Grid Services • Security • Authentication: both client and server • Authorization: what privileges does the client have? • Access control: Sites want local control of operations that remote users are allowed to perform • Confidential data transfer using encryption

Basic Grid Services (cont.) • Resource management • Mechanism for submitting jobs to remote locations • Local policies for use, management, resource configuration • Scheduling of important resources • Coordinating scarce, expensive resources (e.g., cooperating supercomputers) • Advanced reservations to guarantee: • Quality of service • Completion of operations (e.g., reserve disk space for a large data transfer)
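Advance reservation can be illustrated with a small capacity check, using the slide's example of reserving disk space for a large data transfer. The `ReservationBook` class below is hypothetical and in-memory only. It assumes time is given as plain numbers, intervals are half-open, and a request is refused whenever it would push reserved space above the capacity at any moment.

```python
class ReservationBook:
    """Hypothetical advance-reservation table for one resource of fixed capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reservations = []  # list of (start, end, amount), end exclusive

    def _peak_usage(self, start, end):
        """Largest amount already reserved at any time in [start, end)."""
        # Usage only rises when a reservation starts, so it is enough to check
        # the window start and every reservation start inside the window.
        points = [start] + [s for s, _, _ in self.reservations if start < s < end]
        return max(
            sum(a for s, e, a in self.reservations if s <= t < e)
            for t in points
        )

    def reserve(self, start, end, amount):
        """Record the reservation and return True, or return False if it does not fit."""
        if end <= start or amount <= 0:
            raise ValueError("reservation needs end > start and amount > 0")
        if self._peak_usage(start, end) + amount > self.capacity:
            return False
        self.reservations.append((start, end, amount))
        return True


disk = ReservationBook(capacity=500)   # 500 GB of scratch space (assumed)
r1 = disk.reserve(0, 10, 300)          # fits on an empty disk
r2 = disk.reserve(5, 15, 250)          # overlaps r1 and would need 550 GB
r3 = disk.reserve(10, 20, 250)         # starts exactly when r1 ends
r4 = disk.reserve(8, 12, 200)          # overlaps r1 and r3, peak stays at 500 GB
print(r1, r2, r3, r4)
```

A transfer that holds an accepted reservation cannot run out of disk space halfway through, which is the "completion of operations" guarantee the slide refers to. Co-reservation across several sites would need the same check to succeed at every site at once.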

Basic Grid Services (cont.) • Information Services • Register and query information about grid resources • Where are all the Cray T3E’s in the grid? • Where is a storage system with 250 gigabytes of free space that transfers data at 1 gigabit/sec? • Centerpiece for many Grid components • Performance measurement services • What is the current bandwidth of the link from jupiter.isi.edu to apogee.sdsc.edu? • Dynamic environment: assume the information service contains old information
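The register-and-query pattern, together with the warning about old information, can be sketched as follows. The `InfoService` class, the record fields and the resource names are hypothetical. The point of the sketch is that each record carries the time it was reported, so a query can discard stale entries and tell the caller how old each answer is.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical resource report published by a storage system."""
    name: str
    free_gb: float
    gbit_per_s: float
    reported_at: float   # time of the report, in seconds


class InfoService:
    """Toy in-memory information service (not a real grid API)."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # A newer report from the same resource replaces the older one.
        self._records[record.name] = record

    def query(self, min_free_gb, min_gbit_per_s, now, max_age_s):
        """Return (name, age in seconds) for matching records, freshest first."""
        hits = [
            (r.name, now - r.reported_at)
            for r in self._records.values()
            if r.free_gb >= min_free_gb
            and r.gbit_per_s >= min_gbit_per_s
            and now - r.reported_at <= max_age_s
        ]
        return sorted(hits, key=lambda hit: hit[1])


info = InfoService()
info.register(Record("store-a", free_gb=400, gbit_per_s=1, reported_at=1000))
info.register(Record("store-b", free_gb=300, gbit_per_s=2, reported_at=400))
info.register(Record("store-c", free_gb=100, gbit_per_s=10, reported_at=990))

# "Where is a storage system with 250 gigabytes free that moves 1 gigabit/sec?"
fresh = info.query(250, 1, now=1010, max_age_s=60)
lenient = info.query(250, 1, now=1010, max_age_s=3600)
print(fresh, lenient)
```

Even a fresh record can be wrong by the time it is used, so a client should still expect the eventual allocation request to fail and be ready to query again.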
