Condor. High Throughput Computing System. Sean Blackbourn, Teresia Djunaedi. Outline. What is Condor? Fault Tolerance (MW & DAGMan) Resource Discovery (Matchmaking) Job Deployment (Universe) Communication (Remote System Calls) Applications Contributions Critique. What is Condor?
Condor High Throughput Computing System Sean Blackbourn Teresia Djunaedi
Outline • What is Condor? • Fault Tolerance (MW & DAGMan) • Resource Discovery (Matchmaking) • Job Deployment (Universe) • Communication (Remote System Calls) • Applications • Contributions • Critique
What is Condor? • High Throughput Computing • “Deliver large amounts of processing capacity over long periods of time.” (HTCondor) • Developed at the University of Wisconsin around 1983. • Goal: utilize as many idle resources as possible in order to gain increased performance. • Renamed to HTCondor in 2012
Condor Philosophy • Let communities grow naturally. • Plan ahead, but don’t be overly concerned about choosing a perfect match. • Let the owner retain control. • Lend expertise to the research community while integrating knowledge from other sources. • Build on top of previous research.
Core Condor Components • Matchmaker (central manager) • Problem Solver (DAGMan, Master-Worker) • Agent (schedd) • Resource (startd) • Shadow (shadow) • Sandbox (starter) • User • Job
Master-Worker • The master manages a set of user-defined tasks and a pool of workers. • The master matches tasks to workers. • Workers can leave at any time during the computation. • Machines can arrive at any time, and computation can be suspended and resumed. • The state of the computation is checkpointed at a user-defined frequency.
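The requeue behaviour described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the MW library's API: `run_master` and the `fails_on` parameter are hypothetical names, and worker departures are simulated rather than detected over a network.

```python
from collections import deque


def run_master(tasks, workers, fails_on=frozenset()):
    """Assign tasks to workers; requeue a task if its worker leaves.

    `fails_on` is a hypothetical set of (worker, task) pairs for which the
    worker disappears mid-task. It exists only to simulate departures.
    """
    pending = deque(tasks)
    available = deque(workers)
    results = {}
    while pending:
        if not available:
            raise RuntimeError("no workers left in the pool")
        task = pending.popleft()
        worker = available.popleft()
        if (worker, task) in fails_on:
            # The worker left: drop it from the pool and put the task back.
            pending.append(task)
            continue
        results[task] = worker      # the task finished on this worker
        available.append(worker)    # the worker returns to the pool
    return results


done = run_master(["t1", "t2", "t3"], ["w1", "w2"], fails_on={("w1", "t1")})
print(done)
```

In this run worker `w1` leaves while holding `t1`, so `t1` goes back on the queue and is later completed by `w2`.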
DAGMan (Directed Acyclic Graph Manager) • Meta-scheduler • A job does not start until all of its parents have finished. • Each node requires its own HTCondor submit description file. • Responsible for scheduling, recovery, and reporting
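The parent-before-child rule can be illustrated with a small Python sketch. It is not DAGMan's implementation or input format: `dag_run_order` and the node names are hypothetical, and real DAGMan submits each node as an HTCondor job instead of just listing it.

```python
def dag_run_order(parents):
    """Return an order in which nodes may run, given each node's parents.

    `parents` maps a node name to the list of its parent node names.
    A node becomes ready only once every parent has finished.
    """
    done, order = set(), []
    remaining = dict(parents)
    while remaining:
        ready = sorted(
            node for node, ps in remaining.items()
            if all(p in done for p in ps)
        )
        if not ready:
            raise ValueError("no runnable node: cycle or unknown parent")
        for node in ready:
            order.append(node)
            done.add(node)
            del remaining[node]
    return order


# Diamond-shaped workflow: B and C depend on A, D depends on both.
diamond = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
order = dag_run_order(diamond)
print(order)
```

Nodes in the same "ready" batch (here `B` and `C`) have no ordering constraint between them, so a scheduler is free to run them in parallel.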
Resource Discovery • Jobs are submitted to an Agent (schedd), which remembers jobs and manages the user’s policies. • An Agent must find a Resource (startd) capable of executing the job. Each Resource enforces its owner’s policies. • Agents and Resources are matched by a Matchmaker, which manages community policies.
Matchmaking Step 1: Agents and resources advertise themselves to the matchmaker. Step 2: The matchmaker finds potential matches and informs the respective candidates. Step 3: The agent and resource contact each other to confirm the match. [Figure: matchmaker (M), agent (A), and resources (R), with arrows numbered 1 to 3 for the steps above]
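The three steps can be sketched as a toy protocol in Python. Everything here is an assumption for illustration: the `Matchmaker` class, the `claim` function, and the `memory_mb` / `needs_memory_mb` / `idle` fields are hypothetical and do not correspond to HTCondor's actual daemons or wire protocol. The point is that the match in step 2 is only a hint, and the resource re-checks it in step 3.

```python
class Matchmaker:
    """Toy matchmaker that pairs agent ads with resource ads."""

    def __init__(self):
        self.agents, self.resources = [], []

    def advertise(self, kind, ad):
        # Step 1: agents and resources send their ads to the matchmaker.
        (self.agents if kind == "agent" else self.resources).append(ad)

    def negotiate(self):
        # Step 2: yield (agent_ad, resource_ad) pairs that look compatible.
        taken = set()
        for agent in self.agents:
            for i, res in enumerate(self.resources):
                if i not in taken and res["memory_mb"] >= agent["needs_memory_mb"]:
                    taken.add(i)
                    yield agent, res
                    break


def claim(agent_ad, resource_state):
    # Step 3: the agent contacts the resource directly. The resource checks
    # its current state, because the ad the matchmaker saw may be stale.
    return (resource_state["idle"]
            and resource_state["memory_mb"] >= agent_ad["needs_memory_mb"])


mm = Matchmaker()
mm.advertise("resource", {"name": "r1", "memory_mb": 2048})
mm.advertise("resource", {"name": "r2", "memory_mb": 8192})
mm.advertise("agent", {"name": "a1", "needs_memory_mb": 4096})

pairs = list(mm.negotiate())
matches = [(a["name"], r["name"]) for a, r in pairs]
agent_ad = pairs[0][0]

# The resource became busy after advertising, so the claim is refused.
claim_when_busy = claim(agent_ad, {"idle": False, "memory_mb": 8192})
claim_when_idle = claim(agent_ad, {"idle": True, "memory_mb": 8192})
print(matches, claim_when_busy, claim_when_idle)
```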
ClassAd • Agents and resources advertise themselves using schema-free classified advertisements (ClassAds). • ClassAd attributes use three-valued logic, in which expressions may evaluate to true, false, or undefined. • The matchmaking algorithm gives special meaning to two attributes. • Requirements - conditions that must hold for a match to be acceptable. • Rank - numeric value used to choose among potential matches.
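A rough Python sketch of how three-valued logic interacts with Requirements and Rank follows. It is not the ClassAd language or its evaluator: `UNDEFINED`, `and3`, `ge3`, `eq3`, and `best_match` are hypothetical helpers, ads are modelled as plain dictionaries, and the attribute names are illustrative. The assumptions are that a reference to a missing attribute yields undefined, and that a match requires both sides' Requirements to evaluate to exactly true.

```python
UNDEFINED = None  # stand-in for the ClassAd "undefined" value


def and3(a, b):
    """Three-valued AND: false dominates, otherwise undefined propagates."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def ge3(x, y):
    """x >= y, or UNDEFINED when either operand is missing."""
    if x is UNDEFINED or y is UNDEFINED:
        return UNDEFINED
    return x >= y


def eq3(x, y):
    """x == y, or UNDEFINED when either operand is missing."""
    if x is UNDEFINED or y is UNDEFINED:
        return UNDEFINED
    return x == y


def best_match(job, machines):
    """Pick the machine with the highest job Rank among mutual matches."""
    candidates = [
        m for m in machines
        if job["requirements"](m) is True and m["requirements"](job) is True
    ]
    return max(candidates, key=job["rank"], default=None)


job = {
    "Owner": "alice",
    # Missing attributes are read with .get(), which returns None (UNDEFINED).
    "requirements": lambda m: and3(eq3(m.get("Arch"), "X86_64"),
                                   ge3(m.get("Memory"), 1024)),
    "rank": lambda m: m.get("Memory", 0),
}

machines = [
    {"Name": "m1", "Arch": "X86_64", "Memory": 2048,
     "requirements": lambda j: True},
    {"Name": "m2", "Arch": "X86_64",            # no Memory attribute
     "requirements": lambda j: True},
    {"Name": "m3", "Arch": "X86_64", "Memory": 4096,
     "requirements": lambda j: eq3(j.get("Owner"), "alice")},
]

best = best_match(job, machines)
print(best["Name"])
```

Machine `m2` does not advertise `Memory`, so the job's Requirements evaluate to undefined for it and it is not treated as a match. This is how schema-free ads can be compared without every party agreeing on a fixed set of attributes.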
Gateway Flocking • Retain existing community policies enforced by established matchmakers • Not necessarily bidirectional • Transparent to participants - allow cross-pool matches between adjacent pools • Prevents a user from joining multiple communities • Complex
Direct Flocking • A job is not tied to a single community; it may execute wherever resources are available. • An agent may report itself to multiple matchmakers. • Only benefits the user who takes the initiative. • Easier for users to understand and deploy.
Gliding • Allows user to create personal Condor pool from remote resources • Accessible via Globus GRAM protocol
Job Deployment Once the agent and resource have agreed on a connection, two major components are needed: • Shadow - Represents the user; provides the resource with everything it needs to complete the job successfully, such as the executable and its arguments. • Sandbox - Provides the job with the execution environment it needs and protects the resource from misuse by the job.
Split Execution • A matched shadow and sandbox pair is called a universe. • I/O is handled through secure RPC. • The Condor C library converts local system calls into remote procedure calls. • Both the sandbox and the Condor library must obtain the shadow’s permission before making decisions.
Two-Phase Open 1: Open ‘alpha’ 2: Where is file ‘alpha’? 3: compress:remote:/data/newalpha.gz 4: Open ‘/data/newalpha.gz’ 5: Success 6: Success
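The two phases (resolve the name, then open using the returned access method) can be sketched in Python as follows. This is an illustration under stated assumptions, not Condor's I/O library: the `Shadow` class, `where_is`, `remote_read`, and `two_phase_open` are hypothetical names, the remote store is an in-memory dictionary, and `compress` is assumed to mean that the data is gzip-compressed and should be decompressed before it reaches the job.

```python
import gzip


class Shadow:
    """Hypothetical stand-in for the shadow on the submit side."""

    def __init__(self, name_table, files):
        self.name_table = name_table   # logical name -> "filters:method:path"
        self.files = files             # path -> bytes held at the submit site

    def where_is(self, logical_name):
        # Steps 2-3: answer the library's "where is this file?" query.
        return self.name_table[logical_name]

    def remote_read(self, path):
        # Steps 4-5: serve the real file over the (simulated) remote channel.
        return self.files[path]


def two_phase_open(shadow, logical_name):
    """Resolve the logical name first, then open via the returned method."""
    location = shadow.where_is(logical_name)
    *filters, method, path = location.split(":")
    if method != "remote":
        raise NotImplementedError(f"access method not sketched: {method}")
    data = shadow.remote_read(path)
    if "compress" in filters:
        data = gzip.decompress(data)
    return data   # step 6: the job's open of 'alpha' succeeds


shadow = Shadow(
    name_table={"alpha": "compress:remote:/data/newalpha.gz"},
    files={"/data/newalpha.gz": gzip.compress(b"contents of alpha")},
)
content = two_phase_open(shadow, "alpha")
print(content)
```

Separating the lookup from the open lets the shadow decide, per file, where the data really lives and how it should be accessed, while the job keeps using the plain name ‘alpha’.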
Applications • Scientific community research • DreamWorks Animation - rendering farms • C.O.R.E. Digital Pictures
Contributions • Clearly outlines the philosophies, goals, and main focal points of HTCondor. • Provides case studies that offer insight into how Condor has been used to increase productivity and efficiency. • Offers performance analysis on real-world problems, such as NUG30 (more than 10 years of computation completed in about 1 week).
Critique Drawbacks • Security – prone to attacks. • Current applications do not extend far beyond the scientific research community. Suggestions • Include more performance comparisons with similar systems, such as Globus, Legion, and PVM. • Include more tutorials to ease the steep learning curve.