Grid computing : an introduction. Lionel Brunie, Institut National des Sciences Appliquées, Lyon, France
Hansel and Gretel are lost in the forest of the definitions • Distributed system • Parallel system • Cluster computing • Meta-computing • Grid computing • Peer to peer • Global computing • Internet Computing • Network computing
Distributed system • N autonomous computers (sites) : n administrators, n data/control flows • an interconnection network • User view : one single (virtual) system • « Traditional » programmer view : client-server
Parallel System • 1 computer, n nodes : one administrator, one scheduler, one power source • memory : it depends • Programmer view : one single machine executing parallel codes. Various programming models (message passing, distributed shared memory, data parallelism…)
Cluster computing • Use of PCs interconnected by a (high performance) network as a (cheap) parallel machine • Two main approaches • dedicated network (based on a high performance network : Myrinet, SCI, Fibre Channel...) • non-dedicated network (based on a (good) LAN)
Network computing • From LAN (cluster) computing to WAN computing • Set of machines distributed over a MAN/WAN that are used to execute loosely coupled parallel codes • Depending on the infrastructure (software and hardware), network computing specializes into Internet computing, P2P, Grid computing, etc.
Meta computing • Definitions become fuzzy... • A meta computer = set of (widely) distributed (high performance) processing resources that can be combined to run a parallel code that is not so loosely coupled • A meta computer = parallel virtual machine over a distributed system • [Figure: a supercomputer, clusters of PCs and a visualization resource connected through SAN, LAN and WAN links]
Grid computing (1) “Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations” (I. Foster)
Grid computing (2) • Information grid : large access to distributed data : the Web • Data grid : management and processing of very large distributed data sets • Computing grid ~ meta computer • Ex : Globus, Legion
Internet computing • Use of (idle) computers interconnected by the Internet for processing high-throughput applications • Ex : SETI@HOME, Décrypthon, RSA-155 • Programmer view : a single master, n servants
Global computing • Internet computing on a pool of sites • Meta computing with loosely coupled codes • Grid computing with poor communication facilities • Ex : Condor
Peer to peer computing • A site is both client and server : servent • Dynamic servent discovery by « contamination » • 2 approaches : • centralized management : Napster • distributed management : Gnutella, Kazaa • Application : file sharing
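Discovery by « contamination » can be read as neighbour-to-neighbour flooding: a servent asks its neighbours, which ask theirs, until a hop limit is reached. The sketch below is only an illustration of that reading; the overlay, the function name `discover` and the `ttl` parameter are assumptions, not taken from Gnutella or Kazaa.

```python
from collections import deque

def discover(neighbours, start, ttl):
    """Return the servents reachable from `start` within `ttl` hops (breadth-first flooding)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == ttl:
            continue  # hop limit reached: do not forward further
        for peer in neighbours.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                frontier.append((peer, hops + 1))
    return seen - {start}

# Hypothetical overlay: each servent only knows its direct neighbours.
overlay = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
print(sorted(discover(overlay, "a", ttl=2)))
```

With a hop limit of 2, servent "e" stays undiscovered, which shows how the limit bounds both the traffic and the visible part of the network.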
Data Intensive Physical Sciences • High energy & nuclear physics • Simulation • Earth observation, climate modeling • Geophysics, earthquake modeling • Fluids, aerodynamic design • Pollutant dispersal scenarios • Astronomy : digital sky surveys (the planned Large Synoptic Survey Telescope will produce over 10 petabytes per year by 2008 !) • Molecular genomics • Medical images
A Brain is a Lot of Data! (Mark Ellisman, UCSD) • And comparisons must be made among many • We need to get to one micron to know the location of every cell. We’re just now starting to get to 10 microns.
Performance evolution of computer components • Network vs. computer performance • Computer speed doubles every 18 months • Network speed doubles every 9 months • Disk capacity doubles every 12 months • 1986 to 2000 • Computers: x 500 • Networks: x 340,000 • 2001 to 2010 • Computers: x 60 • Networks: x 4000 • [Figure: Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan-2001) by Cleo Vilett, source Vinod Khosla, Kleiner Perkins Caufield & Byers.]
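The 2001 to 2010 multipliers can be recovered from the doubling periods by simple compounding. A minimal sketch; the function name `growth_factor` is ours, and counting 2001 to 2010 as 9 years is an assumption rather than something stated on the slide.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative improvement after `months` when performance doubles every period."""
    return 2 ** (months / doubling_period_months)

# Assumption: 2001 to 2010 spans 9 years, i.e. 108 months.
months = 9 * 12
computers = growth_factor(months, 18)  # 2**6
networks = growth_factor(months, 9)    # 2**12
print(f"computers: x{computers:.0f}, networks: x{networks:.0f}")
```

The results, 64 and 4096, are consistent with the rounded x 60 and x 4000 quoted for 2001 to 2010.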
Partial conclusion • It is not a fantasy ! • Real need for very high performance infrastructures • Basic idea : share computing resources
Back to roots (routes) • Railways, telephone, electricity, roads, banking system • Complexity, standards, distribution, integration (large/small) • Impact on society : how the US grew • Big differences : • clients (the citizens) are NOT providers (State or companies) • small number of actors/providers • small number of applications • strong supervision/control
Computational grid • « HW and SW infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities » • Performance criteria : • security • reliability • computing power • latency • services • throughput
Applications • Distributed supercomputing • High throughput computing • On demand (real time) computing • Data intensive computing • Collaborative computing
An Example Virtual Organization : CERN’s Large Hadron Collider • 1800 physicists, 150 institutes, 32 countries • 100 PB of data by 2010 ; 50,000 CPUs ?
Grid Communities & Applications : Data Grids for High Energy Physics • [Figure: tiered data grid for the LHC, summarized below] • Online system : input at ~PBytes/sec ; there is a “bunch crossing” every 25 nsecs, there are 100 “triggers” per second, and each triggered event is ~1 MByte in size • Offline processor farm : ~20 TIPS, fed at ~100 MBytes/sec • Tier 0 : CERN Computer Centre (HPSS), fed at ~100 MBytes/sec • Tier 1 : France, Germany and Italy Regional Centres and FermiLab (~4 TIPS), with HPSS, linked at ~622 Mbits/sec or by air freight (deprecated) • Tier 2 : Tier2 Centres at ~1 TIPS each (e.g. Caltech), linked at ~622 Mbits/sec • Institutes : ~0.25 TIPS each, with a physics data cache, linked at ~622 Mbits/sec • Tier 4 : physicist workstations (Pentium II 300 MHz), fed at ~1 MBytes/sec • Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels ; data for these channels should be cached by the institute server • www.griphyn.org • www.ppdg.net • www.eu-datagrid.org
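The ~100 MBytes/sec rate can be checked against the trigger figures on this slide. A small sketch; the 10^7 seconds of data taking per year is an assumed, purely illustrative value that does not appear on the slide.

```python
TRIGGERS_PER_SECOND = 100   # from the slide
EVENT_SIZE_MBYTES = 1       # "~1 MByte" per triggered event, from the slide

rate_mbytes_per_s = TRIGGERS_PER_SECOND * EVENT_SIZE_MBYTES

# Assumption (not from the slide): about 1e7 seconds of data taking per year.
SECONDS_OF_RUNNING_PER_YEAR = 10_000_000
yearly_pbytes = rate_mbytes_per_s * SECONDS_OF_RUNNING_PER_YEAR / 1e9  # 1 PB = 1e9 MB

print(f"{rate_mbytes_per_s} MBytes/sec, about {yearly_pbytes:.1f} PB per year")
```

Under that assumption the raw triggered data alone amounts to about one petabyte per year, before any derived or simulated data is counted.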
Levels of cooperation • End system (computer, disk, sensor…) • multithreading, local I/O • Cluster (heterogeneous) • synchronous communications, DSM, parallel I/O • parallel processing • Intranet • heterogeneity, distributed admin, distributed FS and databases • low supervision, resource discovery • high throughput • Internet • no control, collaborative systems, (international) WAN • brokers, negotiation
Basic services • Authentication • Authorization • Activity control • Resource information • Resource brokering • Scheduling • Job submission, data access/migration and execution • Accounting
Layered Grid Architecture (by analogy to Internet architecture) • Application • Collective : “Coordinating multiple resources” : ubiquitous infrastructure services, app-specific distributed services • Resource : “Sharing single resources” : negotiating access, controlling use • Connectivity : “Talking to things” : communication (Internet protocols) & security • Fabric : “Controlling things locally” : access to, & control of, resources • Internet Protocol Architecture, for comparison : Application, Transport, Internet, Link • From I. Foster
Aspects of the Problem • Need for interoperability when different groups want to share resources • Diverse components, policies, mechanisms • E.g., standard notions of identity, means of communication, resource descriptions • Need for shared infrastructure services to avoid repeated development, installation • E.g., one port/service/protocol for remote access to computing, not one per tool/application • E.g., Certificate Authorities: expensive to run • A common need for protocols & services From I. Foster
Security : Why Grid Security is Hard • Resources being used may be extremely valuable & the problems being solved extremely sensitive • Resources are often located in distinct administrative domains • Each resource may have its own policies & procedures • Users may be different • The set of resources used by a single computation may be large, dynamic, and/or unpredictable • Not just client/server • It must be broadly available & applicable • Standard, well-tested, well-understood protocols • Integration with a wide variety of tools
Grid Security : various views • User view : 1) Easy to use 2) Single sign-on 3) Run applications : ftp, ssh, MPI, Condor, Web,… 4) User based trust model 5) Proxies/agents (delegation) • Resource owner view : 1) Specify local access control 2) Auditing, accounting, etc. 3) Integration w/ local system : Kerberos, AFS, license mgr. 4) Protection from compromised resources • Developer view : API/SDK with authentication, flexible message protection, flexible communication, delegation, ... Direct calls to various security functions (e.g. GSS-API), or security integrated into higher-level SDKs, e.g. GlobusIO, Condor
Grid security : requirements • Authentication • Authorization and delegation of authority • Assurance • Accounting • Auditing and monitoring • Integrity and confidentiality
Resources • Description • Advertising • Cataloging • Matching • Claiming • Reserving • Checkpointing
Resource layers • Application layer • tasks, resource requests • Application resource management layer • intertask resource management, execution environment • System layer • resource matching, global brokering • Owner layer • owner policy : who may use what • End-resource layer • end-resource policy (e.g. O.S.)
Resource management (1) • Services and protocols depend on the infrastructure • Some parameters • stability of the infrastructure (same set of resources or not) • freshness of the resource availability information • reservation facilities • multiple resource or single resource brokering • Example request : I need from 10 to 100 computing elements (CE), each with at least 128 MB RAM and a computing power of 50 MIPS
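The example request can be expressed as a filter over a pool of candidate resources. The sketch below is only an illustration: the `ComputingElement` record, the `select` function and the pool contents are hypothetical, and a real broker would also deal with reservation and stale availability information.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComputingElement:
    name: str
    ram_mb: int
    mips: int

def select(candidates, min_count, max_count, min_ram_mb, min_mips):
    """Return up to max_count matching elements, or [] if fewer than min_count match."""
    matching = [ce for ce in candidates
                if ce.ram_mb >= min_ram_mb and ce.mips >= min_mips]
    if len(matching) < min_count:
        return []  # the request cannot be satisfied at all
    return matching[:max_count]

# Hypothetical pool: 12 adequate nodes and 3 nodes with too little RAM.
pool = [ComputingElement(f"node{i}", 256, 80) for i in range(12)]
pool += [ComputingElement(f"small{i}", 64, 80) for i in range(3)]

# "From 10 to 100 CE, each with at least 128 MB RAM and 50 MIPS."
chosen = select(pool, min_count=10, max_count=100, min_ram_mb=128, min_mips=50)
print(len(chosen))
```

Returning an empty list when fewer than 10 elements match reflects the all-or-nothing nature of the lower bound in the request.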
Resource management (2) • Figure : the structure of an RMS...
Resource management and scheduling (1) • Levels of scheduling • job scheduling (global level ; perf : throughput) • resource scheduling (perf : fairness, utilization) • application scheduling (perf : response time, speedup, produced data…) • Mapping/scheduling • resource discovery and selection • assignment of tasks to computing resources • data distribution • task scheduling on the computing resources • (communication scheduling) • Individual perfs are not necessarily consistent with the global (system) perf !
Resource management and scheduling (2) • Grid problems • predictions are not definitive : dynamicity ! • Heterogeneous platforms • Checkpointing and migration
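A Resource Management System example (Globus) • [Figure: the application submits an RSL request to a broker, which performs RSL specialization using queries & info from the information service ; the resulting ground RSL goes to a co-allocator, which sends simple ground RSL requests to the GRAMs ; each GRAM fronts a local resource manager (LSF, Condor, NQE)]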
Resource information (1) • What is to be stored ? • Organization, people, computing resources, software packages, communication resources, event producers, devices… • what about data ??? • A key issue in such dynamic environments • A first approach : (distributed) directory (LDAP) • easy to use • tree structure • distribution • static • mostly read ; inefficient updates • hierarchical • poor procedural language
Resource information (2) • But : • dynamicity • complex relationships • frequent updates • complex queries • A second approach : (relational) database
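The sketch below illustrates why a relational store suits frequent updates and complex queries. It uses Python's built-in `sqlite3` module with an in-memory database; the `resource` table, its columns and its rows are hypothetical and not part of any grid information service.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource ("
    "name TEXT PRIMARY KEY, site TEXT, ram_mb INTEGER, free_cpus INTEGER)"
)
conn.executemany(
    "INSERT INTO resource VALUES (?, ?, ?, ?)",
    [("n1", "lyon", 512, 4), ("n2", "lyon", 128, 0), ("n3", "geneva", 256, 2)],
)

# Frequent update: one CPU on n1 has just been claimed.
conn.execute("UPDATE resource SET free_cpus = free_cpus - 1 WHERE name = ?", ("n1",))

# Complex query: free CPUs per site, counting only nodes with at least 256 MB of RAM.
rows = conn.execute(
    "SELECT site, SUM(free_cpus) FROM resource "
    "WHERE ram_mb >= 256 GROUP BY site ORDER BY site"
).fetchall()
print(rows)
```

The aggregate query spans all sites at once, something a tree-structured directory does not express as directly.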
Data management • It was long forgotten !!! • Though it is a key issue ! • Issues : • indexing • retrieval • replication • caching • traceability • (auditing) • And security !!!
The Replica Management Problem • Maintain a mapping between logical names for files and collections and one or more physical locations • Decide where and when a piece of data must be replicated • Important for many applications • Example: CERN high-level trigger data • Multiple petabytes of data per year • Copy of everything at CERN (Tier 0) • Subsets at national centers (Tier 1) • Smaller regional centers (Tier 2) • Individual researchers will have copies • Even more complex with sensitive data like medical data !!!
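The mapping from logical names to physical locations can be pictured as a small catalog. This is a minimal in-memory sketch, not the Globus replica catalog: the class `ReplicaCatalog`, the `lfn:` naming and the `example.org` URLs are all hypothetical.

```python
from collections import defaultdict

class ReplicaCatalog:
    """Maps a logical file name to the set of physical locations holding a copy."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, logical_name, physical_url):
        self._replicas[logical_name].add(physical_url)

    def unregister(self, logical_name, physical_url):
        self._replicas[logical_name].discard(physical_url)

    def lookup(self, logical_name):
        """Return the known physical locations, sorted for reproducible output."""
        return sorted(self._replicas.get(logical_name, ()))

catalog = ReplicaCatalog()
# Hypothetical names: one copy at Tier 0 and one at a Tier 1 centre.
catalog.register("lfn:run42/events.dat", "gsiftp://tier0.example.org/data/run42/events.dat")
catalog.register("lfn:run42/events.dat", "gsiftp://tier1-fr.example.org/cache/run42/events.dat")
print(catalog.lookup("lfn:run42/events.dat"))
```

Deciding where and when to create a new replica is the harder half of the problem and is not addressed by this sketch.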
Programming on the grid : potential programming models • Message passing (PVM, MPI) • Distributed Shared Memory • Data Parallelism (HPF, HPC++) • Task Parallelism (Condor) • Client/server - RPC • Agents • Integration system (CORBA, DCOM, RMI)
Program execution : issues • Parallelize the program with the right job structure, communication patterns/procedures, algorithms • Discover the available resources • Select the suitable resources • Allocate or reserve these resources • Migrate the data • Initiate computations • Monitor the executions ; checkpoints ? • React to changes • Collect results
The Legion system • University of Virginia • Object-oriented approach. Objects = data, applications, sensors, computing resources, codes… : everything is an object ! • Loosely coupled codes • Single naming space • Reuse of existing OS and protocols ; definition of message formats and high level protocols • Core objects : naming, binding, object creation/activation/deactivation/destruction • Methods : description via an IDL • Security : in the hands of the users • Resource allocation : a site can define its own policy
The Globus toolkit • A set of integrated executable management (GEM) services for the Grid • Services • resource management (GRAM-DUROC) • communication (NEXUS - MPICH-G2, globus_io) • information (MDS) • data management (replica catalog) • security (GSI) • monitoring (HBM) • remote data access (GASS - GridFTP - RIO) • executable management (GEM) • execution • Commodity Grid Kits (Java, Python, Corba, Matlab…)
High-Throughput Computing: Condor • High-throughput computing platform for mapping many tasks to idle computers • Since 1986 ! • Major components • A central manager manages pool(s) of [distributively owned or dedicated] computers. A CM = scheduler + coordinator • DAGman manages user task pools • Matchmaker schedules tasks to computers using classified ads • Checkpointing and process migration • No simple communications • Parameter studies, data analysis • Condor married Globus : Condor-G • More than 150 Condor pools in the world ; or on your machine !
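Matchmaking with classified ads can be pictured as two-sided matching: a task and a computer are paired only if each satisfies the other's requirements. The sketch below is a simplified Python illustration of that idea, not Condor's ClassAd language or its matchmaker; all names and attributes are hypothetical.

```python
def matches(job, machine):
    """A pairing is valid only if both sides accept each other."""
    return job["requirements"](machine) and machine["requirements"](job)

def matchmake(jobs, machines):
    """Greedily assign each job to the first free machine that matches it."""
    free = list(machines)
    assignments = {}
    for job in jobs:
        for machine in free:
            if matches(job, machine):
                assignments[job["name"]] = machine["name"]
                free.remove(machine)  # safe: we leave the inner loop right away
                break
    return assignments

# Hypothetical ads: machines state an owner policy, jobs state their needs.
machines = [
    {"name": "pc1", "memory_mb": 64, "idle": True,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "pc2", "memory_mb": 256, "idle": True,
     "requirements": lambda job: True},
]
jobs = [
    {"name": "sim", "owner": "alice",
     "requirements": lambda m: m["idle"] and m["memory_mb"] >= 128},
    {"name": "scan", "owner": "guest",
     "requirements": lambda m: m["idle"]},
]
print(matchmake(jobs, machines))
```

Here "sim" lands on pc2, while "scan" stays unmatched because the only remaining machine refuses guest jobs: the owner's policy counts as much as the task's needs.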
Defining a DAG • A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
• Each node will run the Condor job specified by its accompanying Condor submit file • [Figure: diamond-shaped DAG with Job A at the top, Jobs B and C in the middle and Job D at the bottom] • From Condor tutorial
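To see what the dependencies imply, the diamond DAG can be parsed and put into a valid execution order. This sketch is not DAGMan: it only understands the `Job` and `Parent ... Child ...` lines exactly as written on the slide, and it relies on `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter

DAG_TEXT = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

def parse_dag(text):
    """Return {node: set of parent nodes} from Job and Parent/Child lines."""
    parents_of = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        if words[0] == "Job":
            parents_of.setdefault(words[1], set())
        elif words[0] == "Parent":
            split = words.index("Child")
            for child in words[split + 1:]:
                parents_of.setdefault(child, set()).update(words[1:split])
    return parents_of

# TopologicalSorter expects a mapping from each node to its predecessors.
order = list(TopologicalSorter(parse_dag(DAG_TEXT)).static_order())
print(order)
```

Any valid order starts with A and ends with D; B and C have no dependency on each other, so they may run in parallel.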
Conclusion • Just a new toy for scientists or a revolution ? • Complexity from heterogeneity, wide distribution, security, dynamicity • Many approaches • Still much work to do !!! • A global framework for grid computing, pervasive computing and Web services ?
Functional View of Grid Data Management • Application • Metadata Service : location based on data attributes • Planner : data location, replica selection, selection of compute and storage nodes • Replica Location Service : location of one or more physical replicas • Information Services : state of grid resources, performance measurements and predictions • Security and Policy • Executor : initiates data transfers and computations • Data Movement • Data Access • Compute Resources • Storage Resources
Components in Globus Toolkit 3.0 • Security : GSI, WS-Security • Data Management : WU GridFTP, RFT (OGSI), RLS • Resource Management : Pre-WS GRAM, WS GRAM (OGSI) • Information Services : MDS2, WS-Index (OGSI) • WS Core : JAVA WS Core (OGSI), OGSI C Bindings