Condor

Condor High Throughput Computing System Sean Blackbourn Teresia Djunaedi

Outline • What is Condor? • Fault Tolerance (MW & DAGMan) • Resource Discovery (Matchmaking) • Job Deployment (Universe) • Communication (Remote System Calls) • Applications • Contributions • Critique

What is Condor? • High Throughput Computing • “Deliver large amounts of processing capacity over long periods of time.” (HTCondor) • Developed at the University of Wisconsin around 1983. • Goal: utilize as many idle resources as possible in order to gain increased performance. • Renamed to HTCondor in 2012

Condor Philosophy • Let communities grow naturally. • Plan ahead, but don’t be overly concerned about choosing a perfect match. • Let the owner retain control. • Lend expertise to the research community while integrating knowledge from other sources. • Build on top of previous research.

Core Condor Components • User • Problem Solver (DAGMan, Master-Worker) • Agent (schedd) • Shadow (shadow) • Matchmaker (central manager) • Resource (startd) • Sandbox (starter) • Job

Master-Worker • Workers can leave at any time during computation. • Machines can arrive at any time and suspend/resume computation. • The state of the computation is checkpointed at a user-defined frequency. • The master manages a set of user-defined tasks and a pool of workers. • The master matches tasks to workers.
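The master's bookkeeping described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the MW library's API: `run_master`, the worker names, and the `fails_on` set (which simulates a worker leaving mid-task) are all hypothetical, and checkpointing is left out.

```python
from collections import deque


def run_master(tasks, workers, fails_on=frozenset()):
    """Hand tasks to workers and requeue a task when its worker leaves.

    tasks    -- list of (task_id, callable) pairs
    workers  -- list of worker names (hypothetical)
    fails_on -- set of (worker, task_id) pairs simulating a worker that
                leaves while running that task (an assumption of this sketch)
    """
    pending = deque(tasks)
    pool = deque(workers)
    results = {}
    while pending:
        if not pool:
            raise RuntimeError("no workers left to run the remaining tasks")
        task_id, fn = pending.popleft()
        worker = pool.popleft()
        if (worker, task_id) in fails_on:
            # The worker left: the task goes back on the queue and the
            # worker is dropped from the pool.
            pending.append((task_id, fn))
            continue
        results[task_id] = fn()
        pool.append(worker)  # the worker is free for the next task
    return results


tasks = [(i, lambda i=i: i * i) for i in range(4)]
results = run_master(tasks, ["w1", "w2"], fails_on={("w1", 0)})
```

The point of the sketch is that a departing worker costs only the task it was running; the master keeps the authoritative task list, so the lost task is rescheduled on another worker.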

DAGMan (Directed Acyclic Graph Manager) • Meta-scheduler • A job does not start until all of its parents have finished. • Each node requires its own HTCondor submit description file. • Responsible for scheduling, recovery, and reporting
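The parent-before-child rule can be illustrated with Python's standard `graphlib` module (Python 3.9+). This is a sketch of dependency-ordered execution only, not DAGMan itself: `run_dag` and `run_job` are hypothetical names, and jobs run synchronously here rather than being submitted to a pool.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(parents, run_job):
    """Run nodes in dependency order.

    parents -- dict mapping each node to the set of its parent nodes
    run_job -- hypothetical callable; returns True if the node succeeded
    Returns (done, failed). Descendants of a failed node are never started.
    """
    sorter = TopologicalSorter(parents)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    done, failed = [], []
    while True:
        ready = sorter.get_ready()
        if not ready:
            break  # nothing else can start
        for node in ready:
            if run_job(node):
                done.append(node)
                sorter.done(node)  # unblocks children whose parents are all done
            else:
                failed.append(node)  # never marked done, so children stay blocked
    return done, failed


# Diamond-shaped DAG: A runs first, then B and C, then D.
diamond = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
ok_done, ok_failed = run_dag(diamond, lambda node: True)
bad_done, bad_failed = run_dag(diamond, lambda node: node != "B")
```

In the second run, B fails, so D is never started while C still completes. Knowing which nodes finished and which did not is the information a recovery step needs in order to resume without redoing completed work.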

Resource Discovery • Jobs are submitted to an Agent (schedd), which is responsible for remembering jobs and managing user policies. • Agents must find a Resource (startd) capable of executing the job. Resources enforce the machine owner's policies. • Agents and Resources are matched by a Matchmaker, which enforces community policies.

Matchmaking Step 1: Agents and resources advertise themselves to the matchmaker. Step 2: Matchmaker finds potential matches and informs the respective candidates. Step 3: Agent and resource contact each other to confirm match.
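The three steps can be sketched as two small Python functions. Everything here is hypothetical and heavily simplified: advertisements are reduced to a single memory figure, `negotiate` stands in for step 2, and `claim` stands in for step 3, where the resource re-checks its current state because its advertisement may be stale by the time the agent calls.

```python
def negotiate(agent_ads, resource_ads):
    """Step 2: pair each advertised agent with a free, large-enough resource.

    agent_ads    -- dict: agent name -> memory needed (MB), as advertised in step 1
    resource_ads -- dict: resource name -> memory offered (MB), as advertised in step 1
    Returns a list of tentative (agent, resource) matches.
    """
    free = dict(resource_ads)
    matches = []
    for agent, needs_mb in agent_ads.items():
        res = next((r for r, mb in free.items() if mb >= needs_mb), None)
        if res is not None:
            matches.append((agent, res))
            del free[res]  # a resource is offered to one agent at a time
    return matches


def claim(live_resources, resource, needs_mb):
    """Step 3: the agent contacts the resource directly to confirm the match.

    live_resources holds the resources' current state, which may differ
    from what was advertised earlier.
    """
    mb = live_resources.get(resource)
    if mb is None or mb < needs_mb:
        return False  # resource vanished or no longer fits: match is void
    del live_resources[resource]  # resource is now claimed
    return True


agent_ads = {"a1": 512, "a2": 2048}
matches = negotiate(agent_ads, {"r1": 1024, "r2": 4096})

# By claim time r1 has left the pool (for example, its owner returned).
live = {"r2": 4096}
claimed = [claim(live, res, agent_ads[agent]) for agent, res in matches]
```

Separating the match from the claim means the matchmaker only introduces the two parties. The agent and resource still verify each other, so a stale advertisement causes a failed claim instead of a wrongly placed job.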

ClassAd • Agents and resources advertise themselves using schema-free classified advertisements (ClassAds). • ClassAd expressions use three-valued logic: they may evaluate to true, false, or undefined. • The matchmaking algorithm places importance on two particular attributes. • Requirements - conditions that must hold for an appropriate match. • Rank - a numeric value used to choose among potential matches (higher is preferred).

ClassAd Example
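The slide's original example is not preserved in this transcript. As a stand-in, here is a Python sketch of how two ads might be matched under three-valued logic. It is not ClassAd syntax: ads are plain dicts, `None` plays the role of undefined, and the helper names (`and3`, `gte`, `eq`, `machine_ad`, `best_match`) and attribute values are illustrative assumptions.

```python
def and3(a, b):
    """Three-valued AND: False dominates; otherwise undefined (None) wins over True."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def gte(a, b):
    """a >= b, or undefined (None) if either operand is undefined."""
    return None if a is None or b is None else a >= b


def eq(a, b):
    """a == b, or undefined (None) if either operand is undefined."""
    return None if a is None or b is None else a == b


def matches(a, b):
    """Two ads match only if both Requirements evaluate to exactly True."""
    return a["Requirements"](a, b) is True and b["Requirements"](b, a) is True


def best_match(job, machines):
    """Among matching machines, pick the one the job's Rank scores highest."""
    candidates = [m for m in machines if matches(job, m)]
    return max(candidates, key=lambda m: job["Rank"](job, m), default=None)


def machine_ad(name, **attrs):
    """Build a hypothetical machine ad that accepts any job fitting in its memory."""
    ad = {"Name": name, **attrs}
    ad["Requirements"] = lambda my, target: gte(my.get("Memory"), target.get("ImageSize"))
    return ad


job = {
    "Owner": "alice",
    "ImageSize": 512,
    "Requirements": lambda my, target: and3(
        eq(target.get("Arch"), "X86_64"),
        gte(target.get("Memory"), my.get("ImageSize")),
    ),
    "Rank": lambda my, target: target.get("Memory") or 0,
}
machines = [
    machine_ad("m1", Arch="X86_64", Memory=1024),
    machine_ad("m2", Arch="X86_64", Memory=4096),
    machine_ad("m3", Arch="X86_64"),  # no Memory attribute: comparisons are undefined
]
chosen = best_match(job, machines)
```

Because the ads are schema-free, `m3` may simply omit `Memory`. The comparison then evaluates to undefined, not to an error, and an undefined Requirements is treated as no match.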

Gateway Flocking • Retain existing community policies enforced by established matchmakers • Not necessarily bidirectional • Transparent to participants - allow cross-pool matches between adjacent pools • Prevents a user from joining multiple communities • Complex

Direct Flocking • Jobs are not required to be assigned to a single community; they may execute wherever resources are available • An agent may report itself to multiple matchmakers • Only benefits the user who takes the initiative • Easier for users to understand & deploy

Gliding • Allows user to create personal Condor pool from remote resources • Accessible via Globus GRAM protocol

Job Deployment Once the agent and resource have agreed on a match, two major components are needed: • Shadow - represents the user; provides the resource with everything it needs to complete the job successfully. • Sandbox - provides the job with the execution environment it needs, and protects the resource from a misbehaving job.

Split Execution • A matched shadow and sandbox pair is called a universe. • I/O is handled through secure RPC. • The Condor C library converts local system calls into remote procedure calls. • Both the sandbox and the Condor library must obtain the shadow’s permission before making decisions.

Two-Phase Open 1: Open ‘alpha’ 2: Where is file ‘alpha’? 3: compress:remote:/data/newalpha.gz 4: Open ‘/data/newalpha.gz’ 5: Success 6: Success
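The exchange on this slide can be sketched in Python. The sketch assumes that the job-side library first asks the shadow where the logical name lives, and then opens the returned physical path using the access methods named in the reply. The `Shadow` class, its method names, and the in-memory file store are hypothetical, and the reply is parsed naively on `:` in the `method:method:path` form shown on the slide.

```python
import gzip
import io


class Shadow:
    """Hypothetical stand-in for the shadow on the submit side."""

    def __init__(self, name_map, files):
        self.name_map = name_map  # logical name -> "method:...:physical path"
        self.files = files        # physical path -> raw bytes (fake file store)

    def where_is(self, logical_name):
        """Phase 1: translate a logical file name into an access description."""
        return self.name_map[logical_name]

    def remote_read(self, path):
        """Phase 2: serve the physical file over the (simulated) RPC channel."""
        return self.files[path]


def two_phase_open(shadow, logical_name):
    """What the job-side library might do when the job calls open(logical_name)."""
    location = shadow.where_is(logical_name)  # steps 2 and 3 on the slide
    *methods, path = location.split(":")
    if "remote" not in methods:
        raise NotImplementedError("this sketch only handles remote access")
    data = shadow.remote_read(path)           # steps 4 and 5
    if "compress" in methods:
        data = gzip.decompress(data)          # undo compression transparently
    return io.BytesIO(data)                   # step 6: the job gets a file object


shadow = Shadow(
    {"alpha": "compress:remote:/data/newalpha.gz"},
    {"/data/newalpha.gz": gzip.compress(b"hello")},
)
contents = two_phase_open(shadow, "alpha").read()
```

The job only ever names `alpha`. Where the data actually lives, and whether it is compressed, is decided on the shadow's side and can change without touching the job.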

Applications • Scientific community research • DreamWorks Animation - rendering farms • C.O.R.E. Digital Pictures

Contributions • Clearly outlines the philosophies, goals, and main focal points of HTCondor. • Provides case studies that offer insight into how Condor has been used to increase productivity and efficiency. • Offers performance analysis on real-world problems, such as NUG30 (more than 10 years of CPU time completed in about 1 week).

Critique Drawbacks: • Security – prone to attacks • Current applications do not extend far beyond the scientific research community. Suggestions: • Include more performance comparisons to similar systems, such as Globus, Legion, PVM, etc. • Include more tutorials to ease the steep learning curve.

Questions?
