Distributed Systems Management Presented by: Rajesh Kumar Gabriela Jacques da Silva
Motivation “All problems in computer science can be solved by another level of indirection” ~ David Wheeler
Problem Distributed infrastructures are popular and are here to stay. How to efficiently and smoothly manage operations in distributed systems?
Metacomputing “A networked environment of resources connected by high-speed links functioning as one huge virtual super-computer” • Isn’t that grid computing?
Virtual Organizations (figure; source: Wikipedia)
Why metacomputing? • It comes down to money: • A particular configuration is only rarely required • Resources are hard and expensive to replicate • The challenge of executing high-performance applications • Still valid in a few contexts
Applications • Desktop supercomputing • Visualization capability linked with supercomputers and databases • Smart instruments • Instruments linked with supercomputers for real-time data processing and actuation • Collaborative environments • Link virtual environments for remote user interaction • Distributed supercomputing • Connect multiple computers to tackle hard problems • Which one would be the most prevalent?
Metasystem characteristics • Scale and the need for selection • Heterogeneity at all levels • Unpredictable structure • Dynamic and unpredictable behavior • Multiple administrative domains
Globus • Goal: to address the problems of configuration and performance optimization in metacomputing environments • Developing low-level mechanisms to support high-level services • Techniques that allow higher-level services to observe and guide the operation of these mechanisms
Low-level modules • Resource location and allocation • Communications • Unified resource information service • Authentication • Process creation • Data access
Support for AWARE Services Adaptive Wide Area Resource Environment is one of the long-term goals of Globus. It is supported through the following mechanisms: • Rule-based selection • Resource property inquiry • Notification or call-back mechanism
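The notification/call-back mechanism and resource property inquiry can be sketched with a toy resource object. This is an illustrative model, not the Globus API; the class and property names are invented.

```python
# Illustrative sketch (not the Globus API): a resource supports
# property inquiry and notifies registered call-backs when a
# property changes, as in the AWARE mechanisms above.

class Resource:
    def __init__(self, **properties):
        self._props = dict(properties)
        self._callbacks = []

    def inquire(self, name):
        """Resource property inquiry."""
        return self._props[name]

    def subscribe(self, callback):
        """Register a call-back to be notified of property changes."""
        self._callbacks.append(callback)

    def update(self, name, value):
        self._props[name] = value
        for cb in self._callbacks:
            cb(name, value)

events = []
node = Resource(load=0.1, bandwidth_mbps=100)
node.subscribe(lambda name, value: events.append((name, value)))
node.update("load", 0.9)       # subscribers observe the change
print(node.inquire("load"))    # -> 0.9
print(events)                  # -> [('load', 0.9)]
```

A higher-level service would use such call-backs to re-run rule-based selection when resource conditions change.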
Communications • Based on Nexus Communication Library • Five concepts: node, thread, context, global pointer, Remote Service Request (RSR)
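Two of the Nexus concepts, the global pointer and the remote service request (RSR), can be modeled in a few lines. This is a toy sketch, not the Nexus library; in Nexus the RSR would cross the network to another node.

```python
# Toy model (not the Nexus API) of two Nexus concepts: a global
# pointer names an object inside a remote context, and a remote
# service request (RSR) invokes a handler on that object.

class Context:
    """A Nexus context: an address space holding local objects."""
    def __init__(self, name):
        self.name = name
        self._objects = {}

    def export(self, obj_id, obj):
        self._objects[obj_id] = obj
        return GlobalPointer(self, obj_id)

class GlobalPointer:
    """Names a (context, object) pair, valid anywhere in the metasystem."""
    def __init__(self, context, obj_id):
        self.context, self.obj_id = context, obj_id

def rsr(gp, handler, *args):
    """Remote service request: run `handler` on the pointed-to object."""
    target = gp.context._objects[gp.obj_id]
    return handler(target, *args)

ctx = Context("node-A")
gp = ctx.export("counter", {"value": 0})

def increment(obj, amount):
    obj["value"] += amount
    return obj["value"]

print(rsr(gp, increment, 5))   # -> 5
```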
Metacomputing Directory Service • Information provided includes: • Resource configuration details • Real-time performance information • Application-specific information • Information can be aggregated from multiple sources such as NIS, SNMP • Information represented and accessed using interface defined by LDAP
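The aggregation idea behind MDS can be sketched as follows: merge attributes from several underlying sources into one directory and query it with attribute filters. The source feeds, entry names, and attributes here are invented stand-ins for real NIS/SNMP data, and the query is far simpler than real LDAP filters.

```python
# Hypothetical sketch of the MDS idea: aggregate resource information
# from several underlying sources (faked NIS and SNMP feeds) into one
# directory, queried by simple attribute filters.

def from_nis():
    return {"cn=node1": {"Arch": "INTEL", "OpSys": "LINUX"}}

def from_snmp():
    return {"cn=node1": {"LoadAvg": 0.3, "Memory": 2048},
            "cn=node2": {"LoadAvg": 0.9, "Memory": 512}}

def aggregate(*sources):
    directory = {}
    for source in sources:
        for dn, attrs in source().items():
            directory.setdefault(dn, {}).update(attrs)
    return directory

def search(directory, **required):
    """Return entry names whose attributes match all given pairs."""
    return [dn for dn, attrs in directory.items()
            if all(attrs.get(k) == v for k, v in required.items())]

mds = aggregate(from_nis, from_snmp)
print(search(mds, Arch="INTEL"))    # -> ['cn=node1']
```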
Authentication • Uses the Generic Security Service API (GSS-API), which defines a standard procedure and API for obtaining credentials (passwords or certificates), for mutual authentication of client and server, and for message-oriented encryption and decryption. • GSS-API is independent of any particular security mechanism and can be layered on top of different methods, such as Kerberos and SSL.
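The mutual-authentication idea can be illustrated with a toy challenge-response exchange over a shared secret. This is not GSS-API (which would negotiate via an underlying mechanism such as Kerberos tickets or certificates); it only shows that each side must prove knowledge of a credential to the other.

```python
# Toy challenge-response sketch of *mutual* authentication over a
# shared secret. Illustrative only; GSS-API delegates this to an
# underlying mechanism (e.g. Kerberos or SSL certificates).
import hashlib
import hmac
import os

SECRET = b"shared-credential"

def prove(challenge, secret):
    """Respond to a challenge by proving knowledge of the secret."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()

def authenticate(client_secret, server_secret):
    """Each side challenges the other; both must verify."""
    c1, c2 = os.urandom(16), os.urandom(16)
    server_accepts = hmac.compare_digest(
        prove(c1, client_secret), prove(c1, server_secret))
    client_accepts = hmac.compare_digest(
        prove(c2, server_secret), prove(c2, client_secret))
    return server_accepts and client_accepts

print(authenticate(SECRET, SECRET))    # -> True
print(authenticate(b"wrong", SECRET))  # -> False
```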
Data Access Services (figure; source: original paper)
Globus Toolkit v4 (figure: component overview) • Security: Authentication & Authorization, Delegation, Credential Mgmt, Community Authorization • Data Mgmt: GridFTP, Reliable File Transfer, Data Replication, Replica Location, Data Access & Integration • Execution Mgmt: Grid Resource Allocation & Management, Community Scheduling Framework, Workspace Management, Grid Telecontrol Protocol • Info Services: Index, Trigger, WebMDS • Common Runtime: C, Java, and Python runtimes • Available as high-quality open source software at www.globus.org • Source: I. Foster, Globus Toolkit Version 4: Software for Service-Oriented Systems, LNCS 3779, 2-13, 2005
Higher-level services • Parallel Programming Interfaces • Numerous parallel programming interfaces have been adapted to use Globus low-level services • Unified Certificate-based Authentication • Defines a global, public-key-based authentication space for all users and resources • Provides a centralized authority that defines system-wide names (“accounts”) for users and resources • A basic implementation shows how it can be done
Discussion • Does it support executing multiple applications on the grid? • How would it handle conflicting requirements? • Inter-domain job scheduling • Job migration? • Fault tolerance and error recovery
Condor and the Grid Douglas Thain, Todd Tannenbaum, Miron Livny
Condor • Batch scheduler for high-throughput computing • University of Wisconsin, 1988 • Steals idle cycles from workstations: opportunistic computing • Acts as resource finder, batch queue manager, and scheduler • Widely deployed; the basis for commercial systems such as LoadLeveler
Key Mechanisms • ClassAds: resource matching • Job checkpoint and migration: on failure, or when a workstation comes back into use • Remote system calls: a mobile sandbox with redirection of I/O
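The checkpoint-and-migration mechanism can be sketched with explicit state saving. Real Condor checkpoints the entire process image transparently; this toy version just pickles the job's state so that a preempted computation can resume elsewhere, with the loop variables invented for illustration.

```python
# Minimal sketch of checkpoint/migration: periodically save a job's
# state so it can resume on another machine after preemption.
# (Real Condor checkpoints the whole process image transparently.)
import pickle

def run(state, steps, checkpoint_every=2):
    """Advance a toy computation; return (state, latest checkpoint)."""
    ckpt = None
    for _ in range(steps):
        state["i"] += 1
        state["total"] += state["i"]
        if state["i"] % checkpoint_every == 0:
            ckpt = pickle.dumps(state)   # checkpoint to stable storage
    return state, ckpt

state, ckpt = run({"i": 0, "total": 0}, steps=5)
# Workstation reclaimed: migrate, restore the last checkpoint (taken
# at step 4), and redo the lost step on another machine.
resumed = pickle.loads(ckpt)
final, _ = run(resumed, steps=1)
print(final["total"])   # -> 15 (= 1+2+3+4+5)
```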
Condor-G = Condor + Globus • Globus: widely used; speaks to foreign batch systems; inter-domain authentication; but no error recovery • Condor: reliable submission and job management
Building Computing Communities (figure showing how the components fit together: User, Problem Solver, Agent, Matchmaker, Resource, Shadow, Sandbox, Job)
Condor Pool (figure: a central Matchmaker pairs an Agent with several Resources)
Collaboration between Pools (figure: pools A and B, each with a Matchmaker (M) and Resources (R); an Agent (A) in one pool can be matched to resources in the other)
Pools in Condor-G (figure: an Agent (A) submits through GRAM interfaces to foreign batch schedulers, each with its own queue (Q) and Resources (R), alongside a Matchmaker (M))
Pools in Condor-G (figure: a personal Condor pool with an Agent (A), Matchmaker (M), queues (Q), and Resources (R))
Planning and Scheduling • Different administrative domains have different scheduling policies • Planning: where • Scheduling: when • Matchmaker: agents and resources publish classified advertisements (ClassAds) and are paired based on constraints • Protocol: Matching, Notification, Claiming
Classified Advertisements • Submitting a job creates a job ClassAd:
[
Type = “Job”;
Owner = “gjsilva”;
Cmd = “my_computation”;
WantRemoteSyscalls = 1;
WantCheckpoint = 1;
Args = “-example args”;
Constraint = other.Type == “Machine” && Arch == “INTEL” && OpSys == “LINUX” && other.Memory >= 1000;
]
Classified Advertisements • Announcing resources:
[
Type = “Machine”;
Activity = “idle”;
KeyboardIdle = 36000; // seconds
Disk = 20; // GB
Memory = 512;
Arch = “INTEL”;
OpSys = “LINUX”;
State = “Unclaimed”;
Friends = { “gjsilva”, “rkumar8” };
Constraint = member(other.Owner, Friends);
]
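The symmetric matching rule behind ClassAds can be sketched as follows. This is a toy matchmaker: the constraints are Python lambdas rather than the real ClassAd expression language, and the attribute values mirror the ads above (with the machine's Memory raised so the pair actually matches).

```python
# Toy matchmaker: a match requires *both* the job's and the machine's
# Constraint to evaluate to true against the other ad.

job = {
    "Type": "Job", "Owner": "gjsilva",
    "Constraint": lambda my, other:
        other["Type"] == "Machine" and other["Arch"] == "INTEL"
        and other["OpSys"] == "LINUX" and other["Memory"] >= 1000,
}

machine = {
    "Type": "Machine", "Arch": "INTEL", "OpSys": "LINUX",
    "Memory": 2048, "State": "Unclaimed",
    "Friends": ["gjsilva", "rkumar8"],
    "Constraint": lambda my, other: other["Owner"] in my["Friends"],
}

def match(a, b):
    """Symmetric matchmaking: each ad's constraint sees the other ad."""
    return a["Constraint"](a, b) and b["Constraint"](b, a)

print(match(job, machine))   # -> True
```

After a match, the matchmaker would notify both parties, and the agent would then claim the resource directly.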
Problem Solvers • Provide programming models: how to execute jobs at the application layer • Master-Worker: for embarrassingly parallel programs whose jobs are independent • Directed Acyclic Graph Manager (DAGMan): enforces ordering constraints between jobs
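The DAGMan idea of enforcing order can be sketched with a small dependency-driven scheduler: a job is released only once all of its parents have completed. The job names and DAG are invented for illustration.

```python
# Sketch of the DAGMan idea: release a job only when all of its
# parents in the DAG have completed.

def dag_run(deps):
    """deps maps job -> set of parent jobs; return a valid run order."""
    done, order = set(), []
    pending = dict(deps)
    while pending:
        ready = [j for j, parents in pending.items() if parents <= done]
        if not ready:
            raise ValueError("cycle in DAG")
        for job in sorted(ready):    # deterministic order for the demo
            order.append(job)
            done.add(job)
            del pending[job]
    return order

order = dag_run({"prepare": set(),
                 "simulate": {"prepare"},
                 "analyze": {"simulate"},
                 "render": {"simulate"},
                 "publish": {"analyze", "render"}})
print(order)   # parents always precede their children
```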
Split Execution • Guarantees job isolation and correct execution • Shadow (at the submit site): represents the user; provides the executable, arguments, environment, and input files • Sandbox (at the execute site): a safe place to execute; creates the appropriate environment and fetches files through remote procedure calls
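The shadow/sandbox split can be sketched by modeling the remote system calls as method calls: the sandbox runs the job but forwards all file I/O back to the shadow. The file names and toy job are invented for illustration.

```python
# Sketch of split execution: the sandbox runs the job but forwards
# file I/O to the shadow on the submit machine via remote system
# calls (modeled here as plain method calls).

class Shadow:
    """Runs at the submit site; owns the user's files and job spec."""
    def __init__(self, files):
        self.files = dict(files)

    def read(self, name):          # remote system call: open/read
        return self.files[name]

    def write(self, name, data):   # remote system call: write/close
        self.files[name] = data

class Sandbox:
    """Runs at the execute site; isolates the job from the machine."""
    def __init__(self, shadow):
        self.shadow = shadow

    def run(self, job):
        inp = self.shadow.read("input.txt")    # fetched, not local
        out = job(inp)
        self.shadow.write("output.txt", out)   # results flow back

shadow = Shadow({"input.txt": "3 4 5"})
Sandbox(shadow).run(lambda text: str(sum(map(int, text.split()))))
print(shadow.files["output.txt"])   # -> "12"
```

The execute machine never needs the user's files on local disk, which is what makes opportunistic execution on borrowed workstations safe.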
Discussion • A very reliable, stable system; most issues have been addressed in recent versions • Problem solvers: simple models that fail to provide solutions that exploit locality • Checkpointing: a single independent job is the easy case; how to migrate jobs with open connections, e.g. MPI applications? • No single point of failure, but it is hard to find information about failure handling in critical components • Is it possible to be selfish? Set a low rank, allow matchmaking, but deny the resource claim • Firewall issues?
Globus and PlanetLab Resource Management Solutions Compared M. Ripeanu, M. Bowman, J. Chase, I. Foster, M. Milenkovic
What is common anyway? • Both build infrastructures that enable federated, extensible, and secure resource sharing • Across distributed trust domains
Why compare? • Some functionality can be transferred • Some functionality might be complementary • Synergistic evolution is possible
Some issues though • Both are active projects, so the comparison covers existing and planned functionality • The projects are complementary • Globus is a software toolkit with many deployments; PlanetLab is a single deployment with some software
PlanetLab - recap • An infrastructure testbed, especially for network services • Best suited for services that need dispersed nodes • Supports experimental and production use • Designed to run on dedicated hosts • Uses virtualization: a low-level system abstraction; the user sees a distributed set of virtual containers; higher-value services are built on top • Currently: 753 nodes at 363 sites
Similar but not quite • User communities • Application characteristics • Resources • Resource Ownership
User communities • PlanetLab: CS researchers; minimal functionality; users build their own (duplicated effort, but competitive) • Globus: a heterogeneous user set; rich, standardized functionality that can be further built upon
Application characteristics • PlanetLab: network-intensive; experimentation; distributedness is an objective • Globus: computation-intensive; high-performance; distributedness is a necessary “evil”
Resources • PlanetLab: the trend is toward standardization of resources (an economic necessity) • Globus: supports diverse devices and platforms (a feature)