Learn how Condor optimizes CPU usage by matching idle workstations with tasks, facilitating research with split execution and resource sharing.
Condor and the Grid • D. Thain, T. Tannenbaum, M. Livny • Christopher M. Moretti • 23 February 2007
Problem & Opportunity • Users need CPUs • Scientific computing • Mathematical modeling • Data mining • Many CPU cycles are unused • Personal workstations • General use laboratories • Research machines
Solution: Condor • “A hunter of idle workstations” • Keeps track of resources needed and available • Determines and assigns matches • Monitors progress • Cleans up and reports results
Architecture • Three principals: • Agent: machine needing resources • Matchmaker • Resource: machine lending resources • Three phases: • Advertising • Matching/Claiming • Deploying/Executing
Advertising • Agent (needy.cse.nd.edu) to Matchmaker: “I need X” • Lender (idle.cse.nd.edu) to Matchmaker: “I have Y” • Matchmaker: “Does Y satisfy X?”
Matching & Claiming • Matchmaker to Agent (needy.cse.nd.edu): “Use idle.cse.nd.edu” • Matchmaker to Lender (idle.cse.nd.edu): “Listen for needy.cse.nd.edu” • Agent to Lender: “Are you still available?” • Lender to Agent: “Yes.”
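The claiming exchange above can be sketched in a few lines of Python. This is a minimal illustration, not Condor's actual protocol or API: the `Resource` class, the `introduce` function, and the `available` flag are all hypothetical names. It shows one plausible reason the agent asks "Are you still available?" directly: the matchmaker's information may be stale by the time the agent acts on it.

```python
class Resource:
    """Hypothetical lender; 'available' flips to False when the owner returns."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.expected_agent = None
        self.claimed_by = None

    def notify_match(self, agent_name):
        # Matchmaker to lender: "Listen for <agent>"
        self.expected_agent = agent_name

    def request_claim(self, agent_name):
        # Agent to lender: "Are you still available?"
        ok = (
            self.available
            and self.claimed_by is None
            and self.expected_agent == agent_name
        )
        if ok:
            self.claimed_by = agent_name
        return ok


def introduce(agent_name, resource):
    """Matchmaker step: tell the lender whom to expect, tell the agent whom to contact."""
    resource.notify_match(agent_name)
    return resource


lender = Resource("idle.cse.nd.edu")
target = introduce("needy.cse.nd.edu", lender)
claimed = target.request_claim("needy.cse.nd.edu")
```

In this sketch the matchmaker only introduces the two parties; the lender itself makes the final decision, so a claim fails if the owner has returned since the match was made.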
Deploying / Executing • Agent (needy.cse.nd.edu) forks a Shadow • Lender (idle.cse.nd.edu) forks a Sandbox • Shadow to Sandbox: “Run job J.” • Job J, inside the Sandbox: “I need file /tmp/foo.” • Together: Split Execution
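A rough Python sketch of split execution follows, under heavy simplification: both halves live in one process, and the `Shadow` and `Sandbox` classes, their methods, and the file contents are hypothetical stand-ins rather than Condor code. The point it illustrates is that the job runs on the lender, while a request such as "I need file /tmp/foo" is answered from the submitting side.

```python
class Shadow:
    """Submitting side: knows the job and holds the user's files (an in-memory dict here)."""

    def __init__(self, files):
        self.files = files

    def read_file(self, path):
        # Serve a file request forwarded from the remote sandbox.
        return self.files[path]


class Sandbox:
    """Lending side: runs the job and forwards its file requests to the shadow."""

    def __init__(self, shadow):
        self.shadow = shadow

    def open(self, path):
        return self.shadow.read_file(path)

    def run(self, job):
        # The job only sees the sandbox, never the lender's own files.
        return job(self)


def job_j(env):
    """Hypothetical job J: counts the words in /tmp/foo."""
    return len(env.open("/tmp/foo").split())


shadow = Shadow({"/tmp/foo": "idle cycles are a resource"})
result = Sandbox(shadow).run(job_j)
```

In a real deployment the two halves would sit on different machines and talk over the network; the in-process call here only stands in for that channel.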
Matching • How are matches determined? • Policy • ClassAds • Why independently claim a match? • What if the Matchmaker dies?
ClassAds • Job ad: MyType="Job" TargetType="Machine" Requirements=((other.Arch=="INTEL" && other.OpSys=="LINUX" && KeyboardIdle>600)) Cmd="/tmp/a.out" Owner="cmoretti" • Machine ad: MyType="Machine" TargetType="Job" Machine="dustpuppy.cse.nd.edu" Requirements=((KeyboardIdle>600)) Arch="INTEL" OpSys="LINUX"
Flocking • Using another pool’s resources • Utilize more total resources • Find resources that match needs • Two methods • Gateway flocking • Direct flocking
Gateway Flocking • Each pool has a known “gateway” • Gateways negotiate sharing • Advertise resources and needs • Transmit requests to local matchmaker • Pool-level granularity • Accounting • Policy • Now obsolete
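Gateway Flocking [diagram: an agent (A), resources (R), and a matchmaker (MM) in one pool, connected through two gateways to a second pool's matchmaker and resources; steps numbered 1 to 5]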
Direct Flocking • Agents report to other matchmakers • No gateways • Equivalent to being in multiple pools? • Now the preferred (only) method
Direct Flocking [diagram: an agent (A) contacting the matchmakers (MM) of two pools of resources (R), with no gateways; steps numbered 1 to 3]
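Direct flocking, where the agent itself reports to more than one matchmaker, might be sketched as below. Everything here is a hypothetical illustration: the `Matchmaker` class, the `flock` function, the pool names, and the machine attributes are invented for the example and do not reflect Condor's configuration or API.

```python
class Matchmaker:
    """Hypothetical matchmaker for one pool; machines are plain dicts."""

    def __init__(self, pool, machines):
        self.pool = pool
        self.machines = machines

    def find_match(self, requirements):
        # Return the first machine whose attributes satisfy the agent's needs.
        for machine in self.machines:
            if requirements(machine):
                return machine["Machine"]
        return None


def flock(matchmakers, requirements):
    """Ask each matchmaker in order (home pool first) until one yields a match."""
    for mm in matchmakers:
        host = mm.find_match(requirements)
        if host is not None:
            return mm.pool, host
    return None


home = Matchmaker("home", [{"Machine": "a.example.edu", "OpSys": "SOLARIS"}])
remote = Matchmaker("remote", [{"Machine": "b.example.org", "OpSys": "LINUX"}])

found = flock([home, remote], lambda m: m["OpSys"] == "LINUX")
```

No gateway appears anywhere in the sketch: the agent simply holds a list of matchmakers, which is why the arrangement resembles belonging to several pools at once.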
Flocking Comparison • Gateway Flocking: transparency; fosters organization-level sharing; poor accounting; complicated • Direct Flocking: no gateways; individual relationships supported; non-transparent; fewer organization-level agreements
Things Aren’t Perfect • What happens if (when) … • Matchmaker goes down • Network or Agent fails during deploy • Resource or App fails during compute • Non-dedicated machines • How do we keep owners happy? • What happens when an owner reclaims a resource?
Condor at Notre Dame: Total Consumption in 2006 • 2376456 CPU-Hours Total (100%) • 281003 CPU-Hours Consumed by Owner at Keyboard (11%) • 934277 CPU-Hours Totally Unused (39%) • 1161176 CPU-Hours Harnessed by Condor (48%) • http://www.cse.nd.edu/~ccl/operations/condor/2005/users.html • “Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006”, Douglas Thain
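As a quick arithmetic check on the figures above, the short Python snippet below recomputes each share from the reported CPU-hours. The dictionary keys are shorthand labels chosen for this example. The three categories add up exactly to the total, and the slide's percentages correspond to truncated (not rounded) shares, which is why they sum to 98 rather than 100.

```python
total = 2376456  # CPU-hours, as reported on the slide

cpu_hours = {
    "owner": 281003,    # consumed by owner at keyboard
    "unused": 934277,   # totally unused
    "condor": 1161176,  # harnessed by Condor
}

# The three categories account for every CPU-hour in the total.
accounted = sum(cpu_hours.values())

# Integer division truncates, reproducing the slide's 11%, 39%, and 48%.
shares = {name: 100 * hours // total for name, hours in cpu_hours.items()}
```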
Current Donors, Feb 2007 [chart] “Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006”, Douglas Thain
CPU History [chart] “Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006”, Douglas Thain
Recap • Condor facilitates distributed computation on dedicated or scavenged CPUs arranged by a matchmaker using ClassAds. • Split Execution is necessary to fit the job’s needs to the environment. • An agent can advertise to multiple matchmakers to examine more potential matches.