Scheduling & Resource Management in Distributed Systems Rajesh Rajamani, raj@cs.wisc.edu http://www.cs.wisc.edu/condor May 2001
Outline • High-throughput computing and Condor • Resource Management in distributed systems • Matchmaking • Current research/Misc.
Power of Computing environments • Power = Work / Time • High Performance Computing • Fixed amount of work; how much time? • Traditional Performance metrics: FLOPS, MIPS • Response time/latency oriented • High Throughput Computing • Fixed amount of time; how much work? • Application specific performance metrics • Throughput oriented
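The two views of Power = Work / Time can be written as a pair of one-line functions. This is only an illustrative sketch: the function names and the figure of 50 simulations per day are assumptions, not numbers from the talk.

```python
def time_to_finish(work, power):
    """HPC view: the amount of work is fixed, so ask how long it takes."""
    return work / power


def work_completed(power, time_available):
    """HTC view: the amount of time is fixed, so ask how much work gets done."""
    return power * time_available


# Hypothetical pool that sustains 50 simulations per day.
print(time_to_finish(work=1000, power=50))          # days needed for 1000 simulations
print(work_completed(power=50, time_available=90))  # simulations finished in 90 days
```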
In other words … • HPC - Enormous amounts of computing power over relatively short periods of time (+) Good for applications under sharp time constraints • HTC - Large amounts of computing power over lengthy periods (+) What if you want to simulate 1000 applications on your latest DSP chip design over the next 3 months?
The Condor Project • Goal - To develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources
More about Condor • Started in the late 1980s • Principal Investigator - Prof. Miron Livny • Latest version 6.3.0 released • Supports 14 different platforms (OS + Arch) including Linux, Solaris and WinNT • Currently employs over 20 students and 5 staff • We write code, debug, port, publish papers and YES, we also provide support!
Distributed ownership of resources • Underutilized - 70% of CPU cycles in a cluster go to waste • Fragmented - Resources are owned by different people • Use these resources to provide HTC, BUT without impacting the QoS available to the owner • Achieved by allowing the owner to set an access policy using control expressions
Access policy • Current state of the resource (e.g., keyboard idle for 15 minutes or load average less than 0.2) • Characteristics of the request (e.g., run only jobs of research associates) • Time of day/night that jobs can be run
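An owner's access policy of this kind can be sketched as a predicate over the machine's current state and the incoming request. The class names, the `owner_role` field, the 18:00 to 08:00 window, and the choice to AND the three conditions together are all assumptions made for illustration; Condor itself expresses such policies as control expressions, not Python.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    keyboard_idle_minutes: float
    load_average: float
    hour_of_day: int  # 0-23, local time


@dataclass
class JobRequest:
    owner: str
    owner_role: str  # hypothetical attribute describing the requester


def owner_allows(state: MachineState, job: JobRequest) -> bool:
    # Current state of the resource (thresholds taken from the slide).
    resource_idle = state.keyboard_idle_minutes >= 15 or state.load_average < 0.2
    # Characteristics of the request.
    requester_ok = job.owner_role == "research_associate"
    # Time of day: an assumed overnight window.
    off_hours = state.hour_of_day >= 18 or state.hour_of_day < 8
    return resource_idle and requester_ok and off_hours


job = JobRequest(owner="raman", owner_role="research_associate")
print(owner_allows(MachineState(20, 0.9, 22), job))  # idle keyboard, at night
print(owner_allows(MachineState(20, 0.9, 12), job))  # same machine at noon
```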
What happens when you submit a job • Resources announce their properties to the Central Manager periodically • 1. User submits a job on the submitting machine • 2. Submitting machine sends the classad of the job to the Central Manager • 3. Matchmaker notifies both parties (submitting machine and available resource) of a match • 4. Parties negotiate
Condor Architecture • Manager • Collector: Database of resources • Negotiator: Matchmaker • Accountant: Priority maintenance • Startds (represent owners of resources) • Implement owner's access control policy • Schedds (represent customers of the system) • Maintain persistent queues of resource requests
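The Collector's role as a database of resources can be pictured with a small in-memory store that keeps the most recent ad per resource. This is a sketch under stated assumptions, not Condor's implementation: the `Collector` class shape, the `max_age_seconds` expiry rule, and the host name `owl` are all hypothetical.

```python
import time


class Collector:
    """Keeps the latest advertisement received from each resource."""

    def __init__(self, max_age_seconds=300.0):
        self._ads = {}  # name -> (timestamp, ad)
        self._max_age = max_age_seconds

    def advertise(self, name, ad, now=None):
        # A newer ad from the same resource replaces the older one.
        timestamp = time.time() if now is None else now
        self._ads[name] = (timestamp, ad)

    def live_ads(self, now=None):
        # Assumed policy: ads that were not refreshed recently are ignored.
        now = time.time() if now is None else now
        return {
            name: ad
            for name, (timestamp, ad) in self._ads.items()
            if now - timestamp <= self._max_age
        }


collector = Collector(max_age_seconds=300)
collector.advertise("crow", {"Type": "Machine", "Memory": 64}, now=0)
collector.advertise("owl", {"Type": "Machine", "Memory": 128}, now=400)
print(sorted(collector.live_ads(now=500)))  # only the recently refreshed ad remains
```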
Power of Condor • Solved the NUG30 quadratic assignment problem, posed in 1968, in 6.9 days, delivering over 96,000 CPU hours by commandeering an average of 650 machines! • Compare this with the RSA-155 problem, posed in 1977 and solved in the late 90s using 300 computers over a period of 7 months. With the same amount of resources as were used to solve NUG30, this could have been done in 2 weeks! • “It (Chorus production) was done in parallel on machines in the computer center running XXX, and on the office machines under Condor. The latter did about 90% of the work!” - Helge MEINHARD (EP division, CERN)
Resource management using Matchmaking • Opportunistic Resource Exploitation • Resource availability is unpredictable • Exploit resources as soon as they are available • Matchmaking performed continuously • In contrast, a centralized scheduler would have to deal with: • Heterogeneity of resources • Distributed Ownership - widely varying allocation policies • Dynamic nature of the cluster
Classified Advertisements • A simple language used by resource providers and customers to express their properties/requirements to the Collector • Uses a semi-structured data model => no specific schema is required by the matchmaker, allowing it to work naturally in a heterogeneous environment • The language folds the query language into the data model: constraints may be expressed as attributes of the classad • Classads should conform to the advertising protocol
Matchmaking with Classads • 4 steps to managing resources - • Parties requiring matchmaking advertise their characteristics, preferences, constraints, etc. • Advertisements matched by a Matchmaker • Matched entities are notified • Matched entities establish an allocation through a claiming process - could include authentication, constraint verification, negotiation of terms etc • Method is symmetric
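The four steps can be walked through with a toy model in which ads are Python dicts and each `Constraint` is a callable applied to the other party's ad. Everything here is a simplified sketch: the `matchmake` and `claim` helpers, the greedy first-fit pairing, and the specific constraints are assumptions for illustration and not Condor's actual algorithm.

```python
def matchmake(ads):
    """Steps 2-3: pair job ads with compatible machine ads, return notifications."""
    jobs = [ad for ad in ads if ad["Type"] == "Job"]
    free = [ad for ad in ads if ad["Type"] == "Machine"]
    notifications = []
    for job in jobs:
        for machine in free:
            # Symmetric check: each side's constraint is evaluated on the other.
            if job["Constraint"](machine) and machine["Constraint"](job):
                notifications.append((job["Owner"], machine["Name"]))
                free.remove(machine)  # safe: we leave the inner loop right away
                break
    return notifications


def claim(job, machine_now):
    """Step 4: the parties re-check their constraints against current state."""
    return job["Constraint"](machine_now) and machine_now["Constraint"](job)


# Step 1: both parties advertise.
machine = {"Type": "Machine", "Name": "crow.cs.wisc.edu", "Arch": "INTEL",
           "Memory": 64, "Constraint": lambda job: job["Owner"] != "thief"}
job = {"Type": "Job", "Owner": "raman", "Memory": 31,
       "Constraint": lambda m: m["Arch"] == "INTEL" and m["Memory"] >= 31}

print(matchmake([machine, job]))  # who gets notified about which match
print(claim(job, machine))        # the claim succeeds if nothing has changed
```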
Classad example • Sample classad of a workstation: [ Type = “Machine”; OpSys = “Linux”; Arch = “INTEL”; Memory = 256 M; Constraint = true; ] • Sample classad of a job: [ Type = “Job”; Owner = “run_sim”; Constraint = other.Type == “Machine” && Arch == “INTEL” && OpSys == “Solaris251” && other.Memory >= Memory; ]
Example Classad (workstation) [ Type = “Machine”; Activity = “Idle”; Name = “crow.cs.wisc.edu”; Arch = “INTEL”; OpSys = “Solaris251”; Kflops = 21893; Memory = 64; Disk = 323496; //KB DayTime = 36107;
Example Classad (contd.) ResearchGrp = {“miron”, “thain”, “john”}; Untrusted = {“bgates”, “lalooyadav”, “thief” }; Rank = member(other.Owner, ResearchGrp)*10; Constraint = !member(other.Owner, Untrusted) && Rank >= 10 ?true : false //To prevent malicious users ]
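To see how the workstation's Rank and Constraint behave for different job owners, the two expressions can be transcribed into Python. This is a hand translation, not a ClassAd evaluator: it assumes `member()` returns a boolean that counts as 1 or 0 in arithmetic, and that the conditional operator binds more loosely than `&&`, as in C.

```python
RESEARCH_GRP = {"miron", "thain", "john"}
UNTRUSTED = {"bgates", "lalooyadav", "thief"}


def rank(owner):
    # Rank = member(other.Owner, ResearchGrp) * 10
    return int(owner in RESEARCH_GRP) * 10


def constraint(owner):
    # Constraint = !member(other.Owner, Untrusted) && Rank >= 10 ? true : false
    return True if (owner not in UNTRUSTED and rank(owner) >= 10) else False


for owner in ("miron", "bgates", "raman"):
    print(owner, rank(owner), constraint(owner))
```

Read this way, the policy admits only members of ResearchGrp: an owner such as raman, who is in neither list, gets Rank 0 and therefore fails the Rank >= 10 test.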
Example Classad (Submitted job) [ Type = “Job”; QDate = 886799469; Owner = “raman”; Cmd = “run_sim”; Iwd = “/usr/raman/sim2”; Memory = 31; Rank = Kflops/1e3 + other.Memory/32; Constraint = other.Type == “Machine” && OpSys == “Solaris251” && Disk >= 10000 && other.Memory >= self.Memory; ]
Matchmaking • Evaluates expressions in an environment that allows each classad to access attributes of the other, e.g. other.Memory >= self.Memory • A reference to a non-existent attribute evaluates to undefined • Considers a pair of ads incompatible unless both of their Constraint expressions evaluate to true • Rank is then used to choose among compatible matches • Both parties are notified of the match; the matchmaker could also generate and hand off a session key for authentication and security
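The evaluation rules above can be mimicked in Python with a sentinel for undefined. This is a rough sketch, not the ClassAd evaluator: `UNDEFINED`, `attr`, `all_true`, `compatible` and `best_match` are hypothetical helpers, the host names other than crow are made up, and `all_true` collapses the language's three-valued `&&` into the single question that matters for matching, namely whether the constraint is exactly true.

```python
class _Undefined:
    """Value of a reference to an attribute that an ad does not define."""

    def _propagate(self, _other):
        return self  # any comparison involving undefined stays undefined

    __eq__ = __ne__ = __lt__ = __le__ = __gt__ = __ge__ = _propagate
    __hash__ = object.__hash__

    def __repr__(self):
        return "UNDEFINED"


UNDEFINED = _Undefined()


def attr(ad, name):
    return ad.get(name, UNDEFINED)


def all_true(*values):
    # True only if every conjunct is exactly True (not False, not UNDEFINED).
    return all(value is True for value in values)


def compatible(a, b):
    # Both Constraint expressions must evaluate to true.
    return a["Constraint"](a, b) is True and b["Constraint"](b, a) is True


def best_match(job, machines):
    # Rank chooses among the compatible candidates.
    candidates = [m for m in machines if compatible(job, m)]
    return max(candidates, key=lambda m: job["Rank"](job, m), default=None)


def accept_all(self, other):
    return True


job = {
    "Type": "Job", "Owner": "raman", "Memory": 31,
    "Constraint": lambda self, other: all_true(
        attr(other, "Type") == "Machine",
        attr(other, "OpSys") == "Solaris251",
        attr(other, "Disk") >= 10000,
        attr(other, "Memory") >= attr(self, "Memory"),
    ),
    "Rank": lambda self, other: attr(other, "Kflops") / 1e3 + attr(other, "Memory") / 32,
}
crow = {"Type": "Machine", "Name": "crow", "OpSys": "Solaris251", "Memory": 64,
        "Disk": 323496, "Kflops": 21893, "Constraint": accept_all}
slow = dict(crow, Name="slow", Kflops=10000)
diskless = {"Type": "Machine", "Name": "diskless", "OpSys": "Solaris251",
            "Memory": 64, "Kflops": 30000, "Constraint": accept_all}

print(compatible(job, diskless))                         # Disk is undefined there
print(best_match(job, [slow, diskless, crow])["Name"])   # highest-ranked compatible ad
```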
Separation of Matching and Claiming • Weak consistency requirements - Claiming allows provider and customer to verify their constraints with respect to their current state • Claiming protocol could use cryptographic techniques (authentication) • Principals involved in a match are themselves responsible for establishing, maintaining and servicing a match
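One way to picture an authenticated claim is with a session key that the matchmaker hands to both parties and an HMAC over the claim request. The slides only say that cryptographic techniques could be used, so this scheme, the message format, and the function names are assumptions made for illustration. The `machine_is_still_idle` flag stands in for the provider re-checking its constraints against its current state.

```python
import hashlib
import hmac
import secrets


def issue_session_key():
    """Matchmaker side: a fresh random key handed to both matched parties."""
    return secrets.token_bytes(32)


def sign_claim(session_key, job_owner, machine_name):
    """Customer side: authenticate the claim request with the shared key."""
    message = f"{job_owner}|{machine_name}".encode()
    return hmac.new(session_key, message, hashlib.sha256).digest()


def accept_claim(session_key, job_owner, machine_name, tag, machine_is_still_idle):
    """Provider side: verify the tag, then re-check current state."""
    expected = sign_claim(session_key, job_owner, machine_name)
    authenticated = hmac.compare_digest(expected, tag)
    return authenticated and machine_is_still_idle


key = issue_session_key()
tag = sign_claim(key, "raman", "crow.cs.wisc.edu")
print(accept_claim(key, "raman", "crow.cs.wisc.edu", tag, machine_is_still_idle=True))
print(accept_claim(key, "raman", "crow.cs.wisc.edu", tag, machine_is_still_idle=False))
```

Because the match was computed from possibly stale ads, the second call shows the provider refusing a correctly authenticated claim when its state has changed since it advertised.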
Work outside the Condor kernel - New challenges • Multilateral Matchmaking - Gangmatching • I/O regulation and disk allocation - Kangaroo • User interfaces - ClassadView • Grid applications - Globus • Security
Summary • Matchmaking provides a scalable and robust resource management solution for HTC environments • Classads are used by workstations and jobs • Matchmaker forms the match and informs the parties, who in turn invoke the claiming protocol • The parties are responsible for establishing, maintaining and servicing a match • Questions ?
Gangmatch request [ Type = “Job”; Owner = “raj”; Cmd = “run_sim”; Ports = { [ Label = “cpu”; ImageSize = 28 M; //Rank and constraints ], [ Label = “License”; Host = cpu.Name; //Rank and constraints ] } ]
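A gangmatch has to fill every port of the request at once, and one port may refer to another (here the License port refers to cpu.Name). The sketch below is one plausible reading, not the Gangmatching algorithm itself: the exhaustive search over pairs, the `Memory >= ImageSize` check, the license ads' `Host` attribute, and the host name `owl` are all hypothetical, since the slide leaves the per-port rank and constraints unspecified.

```python
from itertools import product


def gangmatch(machines, licenses, image_size_mb):
    """Return one (cpu, license) pair that satisfies both ports together."""
    for cpu, lic in product(machines, licenses):
        cpu_ok = cpu["Memory"] >= image_size_mb  # assumed cpu-port constraint
        license_ok = lic["Host"] == cpu["Name"]  # Host = cpu.Name from the request
        if cpu_ok and license_ok:
            return {"cpu": cpu, "License": lic}
    return None


machines = [
    {"Name": "crow.cs.wisc.edu", "Memory": 64},
    {"Name": "owl.cs.wisc.edu", "Memory": 16},
]
licenses = [
    {"App": "run_sim", "Host": "owl.cs.wisc.edu"},
    {"App": "run_sim", "Host": "crow.cs.wisc.edu"},
]

match = gangmatch(machines, licenses, image_size_mb=28)
print(match["cpu"]["Name"], match["License"]["Host"])
```

Matching the two ports separately would not be enough: a machine with sufficient memory and a free license are only useful to this job when the license is valid on that same machine.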