CONDOR CISC 879 Parallel Computation Spring 2003 Preethi Natarajan
Outline
• Condor – Goals & Overview
• Components
• Matchmaking – ClassAds
• RPC in Condor
• Checkpoint/Restart
• Glance @ APIs
Condor – Objectives
• Condor's goal is to hunt for idle resources that can be exploited by user applications
• Performance vs. Throughput
  • High Performance Computing
    • CPU cycles/second under ideal circumstances. “How fast can I run simulation X on this machine?”
  • High Throughput Computing
    • CPU cycles/day (week, month, year?) under non-ideal circumstances. “How many times can I run simulation X in the next month using all available machines?”
• How much computing power is available to me?
• Condor converts collections of distributively owned workstations (different platforms) and dedicated clusters into a distributed high-throughput computing facility
Condor - Overview
• Customers advertise their job requirements to Condor – Resource Requests
• Resource owners advertise their resource descriptions – Resource Offers
• Condor provides
  • Matchmaking between jobs and resources
  • Notification of matches
  • Transparent access to the job's files during execution
  • Opportunistic scheduling – schedule resources when there is an opportunity
  • Checkpoint (save) job state when the current resource needs to be preempted
  • Restart the job from its checkpointed state on another available resource
[Figure labels: Condor Central Manager; Resource found appropriate for the job; Site at which job submitted]
Condor Components
• CUSTOMER AGENT
  • Submits Resource Requests (job requirements) in an application queue ordered by a priority scheme
  • Implementation is called the scheduling daemon – schedd
• RESOURCE AGENT
  • Periodically extracts resources' state information and updates its Resource Offers
  • Implementation is called the startd
[Figure labels: Accountant, Collector, Negotiator; Notify Match; Resource Requests; Resource Offers; schedd; startd; Job submission; Customer Agent; Resource Agent]
Condor Components (Cont.)
• CENTRAL MANAGER
  • Is the “kernel” of the Condor pool
  • Collector – periodically collects
    • Resource Offers from startds
    • Resource Requests from schedds
  • Negotiator
    • Matchmaking between Resource Requests and Offers
    • Notification about the match to both entities of the matched pair
    • The Claiming Protocol is then followed between the respective Customer and Resource Agents
  • Accountant – logs resource usage by jobs
ClassAds
• A Classified Advertisement (ClassAd) is a flexible and extensible data model used to represent
  • Resource Offers – resource services available
  • Resource Requests – job requirements
  • Access Policies – constraints on resource allocations & requirements
• A ClassAd is a mapping from attribute names to expressions, and it defines the semantics for evaluating those attributes
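To make the "mapping from attribute names to expressions" idea concrete, here is a minimal Python sketch. It is not Condor's real ClassAd language: the dictionaries, the `evaluate` helper, and all attribute values (`KeyboardIdle`, `ImageSize`, the 15-minute idle policy, the owner name) are illustrative assumptions. Constants are plain values, and expressions are callables evaluated against the ad itself and the candidate ad on the other side.

```python
# Hypothetical sketch of a ClassAd: attribute name -> constant or expression.
# This mimics the data model only; it is NOT Condor's ClassAd syntax.

def evaluate(ad, name, other):
    """Evaluate attribute `name` of `ad` in the context of the `other` ad."""
    expr = ad.get(name)
    if callable(expr):
        return expr(ad, other)   # expression: may look at both ads
    return expr                  # constant, or None if undefined

# Resource Offer (all values are made up for illustration).
machine_ad = {
    "Type": "Machine",
    "Arch": "INTEL",
    "OpSys": "LINUX",
    "Memory": 512,           # MB
    "KeyboardIdle": 1800,    # seconds since the owner last typed
    # Access policy: only accept jobs after 15 idle minutes.
    "Requirements": lambda my, other: my["KeyboardIdle"] > 15 * 60,
    "Rank": lambda my, other: 1 if other.get("Owner") == "alice" else 0,
}

# Resource Request (all values are made up for illustration).
job_ad = {
    "Type": "Job",
    "Owner": "alice",
    "ImageSize": 128,        # MB
    "Requirements": lambda my, other: (
        other.get("Arch") == "INTEL"
        and other.get("Memory", 0) >= my["ImageSize"]
    ),
    "Rank": lambda my, other: other.get("Memory", 0),
}

machine_ok = evaluate(machine_ad, "Requirements", job_ad)
job_ok = evaluate(job_ad, "Requirements", machine_ad)
job_rank = evaluate(job_ad, "Rank", machine_ad)
print(machine_ok, job_ok, job_rank)  # True True 512
```

Because both sides carry their own Requirements expression, the same structure can express a job's needs and a resource owner's access policy.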
ClassAds - Access Policies
• Resource access policy specifies
  • Who may use the resource
  • How they may use the resource
  • When they may use the resource
• Access policy specification in Condor is done using the following ClassAd attributes
[Figure: Policy Specification Example]
Matchmaking
• ClassAd Specification
  • ClassAds describe Resource Requests and Resource Offers with attributes like Type, Rank, Requirements, Vacate, etc.
• Advertising Protocol
  • Each entity periodically communicates its ClassAd and “contact address” to the Central Manager (Matchmaker)
• Matchmaking Algorithm
  • Matches are based on the Requirements specified in the Resource Requests and Offers
  • The match with the highest Rank is selected
  • Past resource usage (log) is used for fair scheduling
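The three steps of the matchmaking algorithm can be sketched as follows. This is a simplified illustration, not Condor's negotiator: the function names (`satisfies`, `rank`, `match`, `negotiate`), the ad contents, and the usage numbers are all assumptions. Requirements must hold in both directions, the request's Rank picks among the candidates, and owners with less recorded usage are served first as a stand-in for fair scheduling.

```python
# Hypothetical matchmaking sketch; ads are dicts, expressions are callables.

def satisfies(ad, other):
    """True if `ad` has no Requirements or its Requirements accept `other`."""
    req = ad.get("Requirements")
    return bool(req(ad, other)) if callable(req) else True

def rank(ad, other):
    """Numeric preference of `ad` for `other`; 0 when no Rank is given."""
    r = ad.get("Rank")
    return r(ad, other) if callable(r) else 0

def match(request, offers):
    """Best offer whose Requirements and the request's are mutually met."""
    candidates = [o for o in offers
                  if satisfies(request, o) and satisfies(o, request)]
    if not candidates:
        return None
    return max(candidates, key=lambda o: rank(request, o))

def negotiate(requests, offers, usage):
    """Serve owners with the least past usage first; one job per offer."""
    free = list(offers)
    matches = []
    for req in sorted(requests, key=lambda r: usage.get(r["Owner"], 0)):
        best = match(req, free)
        if best is not None:
            free.remove(best)
            matches.append((req["Name"], best["Name"]))
    return matches

def machine_requirements(my, other):
    return other["ImageSize"] <= my["Memory"]

def job_requirements(my, other):
    return other["Memory"] >= my["ImageSize"]

def job_rank(my, other):
    return other["Memory"]   # prefer machines with more memory

offers = [
    {"Name": "ws1", "Memory": 256, "Requirements": machine_requirements},
    {"Name": "ws2", "Memory": 1024, "Requirements": machine_requirements},
]
requests = [
    {"Name": "jobA", "Owner": "alice", "ImageSize": 128,
     "Requirements": job_requirements, "Rank": job_rank},
    {"Name": "jobB", "Owner": "bob", "ImageSize": 128,
     "Requirements": job_requirements, "Rank": job_rank},
]
usage = {"alice": 40.0, "bob": 5.0}   # made-up past usage per owner

matches = negotiate(requests, offers, usage)
print(matches)  # [('jobB', 'ws2'), ('jobA', 'ws1')]
```

Here bob has used less of the pool, so his job is considered first and gets the machine it ranks highest; alice's job then takes the remaining machine.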
Matchmaking (cont.)
• Matchmaking Protocol
  • The match is notified to the two matched parties at their “contact address”, along with the matched ClassAd
  • (Possible) authentication via hand-off of a session key
• Claiming Protocol
  • The match is only a mutual introduction of the two parties
  • The Customer contacts the Resource directly to negotiate resource allocation
After Match Notification…
• The schedd on the initiating (submit) machine first spawns a shadow process. The shadow process acts as the shadow of the job that will be executed on the remote machine
• The shadow negotiates with the startd of the remote machine to run the job
• If successful, the startd on the remote machine spawns a starter, which
  • Starts the remote job by spawning its process
  • Manages the execution of the remote job by communicating with the shadow
Exploiting RPC
• The remote machine agrees to run the submit machine's job, but the job's files are physically located at the submit machine
• open(), read(), write() calls in the job's code are executed at the submit machine as RPCs
• condor_syscall_lib has to be linked into these jobs
• If the files can be accessed via NFS/AFS, that is preferred over RPC when it is more efficient. The open() routine in condor_syscall_lib talks with the shadow at the submit machine to make this decision
[Figure labels: Remote Machine: Starter process for the remote job, spawns, Remote Job's process, Call to open(jobfile1); Submit Machine: Shadow process for the job, Local File System; Access 'jobfile1' via NFS/AFS or RPC]
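The remote system call idea above can be illustrated with a small in-process Python sketch. It is only a model: `Shadow`, `RemoteFile`, `condor_open`, and the `shared_dirs` setting are hypothetical names, and a direct method call stands in for the network RPC. The open routine first asks the shadow whether the path is reachable through a shared file system; if not, every read and write is forwarded to the shadow, which runs it on the submit machine's files.

```python
import os
import tempfile

class Shadow:
    """Stand-in for the shadow on the submit machine (hypothetical API)."""

    def __init__(self, shared_dirs=()):
        self.shared_dirs = tuple(shared_dirs)  # paths assumed visible via NFS/AFS
        self._files = {}
        self._next_fd = 0

    def how_to_open(self, path):
        """Tell the job whether to open the file directly or go through RPC."""
        if any(path.startswith(d) for d in self.shared_dirs):
            return "local"
        return "rpc"

    def rpc(self, op, *args):
        """Execute one file operation on the submit machine's file system."""
        if op == "open":
            path, mode = args
            fd = self._next_fd
            self._next_fd += 1
            self._files[fd] = open(path, mode)
            return fd
        if op == "read":
            fd, size = args
            return self._files[fd].read(size)
        if op == "write":
            fd, data = args
            return self._files[fd].write(data)
        if op == "close":
            (fd,) = args
            self._files.pop(fd).close()
            return None
        raise ValueError("unknown operation: " + op)

class RemoteFile:
    """File-like stub used by the job; each call becomes an RPC to the shadow."""

    def __init__(self, shadow, path, mode="r"):
        self._shadow = shadow
        self._fd = shadow.rpc("open", path, mode)

    def read(self, size=-1):
        return self._shadow.rpc("read", self._fd, size)

    def write(self, data):
        return self._shadow.rpc("write", self._fd, data)

    def close(self):
        self._shadow.rpc("close", self._fd)

def condor_open(shadow, path, mode="r"):
    """Sketch of the decision made inside the library's open() routine."""
    if shadow.how_to_open(path) == "local":
        return open(path, mode)             # shared file system: direct access
    return RemoteFile(shadow, path, mode)   # otherwise forward calls to shadow

# Demo: a file that exists only on the "submit machine".
workdir = tempfile.mkdtemp()
jobfile = os.path.join(workdir, "jobfile1")
with open(jobfile, "w") as f:
    f.write("input data\n")

shadow = Shadow(shared_dirs=())       # nothing is shared, so RPC is used
handle = condor_open(shadow, jobfile)
data = handle.read()
handle.close()
print(type(handle).__name__, repr(data))  # RemoteFile 'input data\n'
```

The job's code only ever sees a file-like object, which is why linking against the library is enough and the job's source does not need to change.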
Checkpoint
• To checkpoint an executing program is to take a snapshot of its current state in such a way that the program can be restarted from that state at a later time, possibly on a different resource
• Provides
  • Preemptive-resume scheduling
  • Fault tolerance – when checkpointing is done periodically
• In Condor, checkpointing running jobs is optional. If it is needed, the job should be linked with condor_syscall_lib
Checkpointing in Condor
• Implemented in condor_syscall_lib as a signal handler
• When Condor sends a signal to checkpoint, the handler saves the process's state information in a checkpoint file
  • From core – contents of the process's uarea, data and stack segments
  • From executable – symbol and debugging info, initialized data, text
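The signal-handler approach can be mimicked at application level in Python. This is only an analogy under stated assumptions: Condor's library saves the process image (core and executable contents), whereas this sketch pickles a small state dictionary. The choice of `SIGUSR1`, the file name, and the helper names are hypothetical, and a POSIX system is assumed.

```python
import os
import pickle
import signal
import tempfile
import time

CKPT_PATH = os.path.join(tempfile.mkdtemp(), "job.ckpt")  # illustrative path
state = {"i": 0, "total": 0}   # the job's entire state in this toy example

def checkpoint_handler(signum, frame):
    """On the checkpoint signal, snapshot the job state to a file."""
    tmp_path = CKPT_PATH + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, CKPT_PATH)   # atomic rename: never a half-written file

def restore():
    """Load the latest checkpoint, or start from scratch if none exists."""
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0}

def run(job_state, stop_at):
    """Toy job: add up 0, 1, 2, ... until the counter reaches stop_at."""
    while job_state["i"] < stop_at:
        job_state["total"] += job_state["i"]
        job_state["i"] += 1
    return job_state

signal.signal(signal.SIGUSR1, checkpoint_handler)

run(state, stop_at=4)                   # job makes partial progress
os.kill(os.getpid(), signal.SIGUSR1)    # the "checkpoint now" signal

# Give the interpreter a moment to run the handler (bounded wait).
deadline = time.monotonic() + 2.0
while not os.path.exists(CKPT_PATH) and time.monotonic() < deadline:
    time.sleep(0.01)

# Simulated restart: a fresh run picks up from the checkpoint file.
restored = restore()
resumed_from = restored["i"]
final = run(restored, stop_at=10)
print(resumed_from, final["total"])  # 4 45
```

The job code itself never calls the checkpoint routine; the handler fires whenever the signal arrives, which matches the idea of linking the mechanism in rather than writing it into each job.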
Checkpointing & Restart
• The shadow sends the latest checkpoint file to the new starter during restart
• The starter reads the job state from the checkpoint file and execution continues
• The starter periodically sends a checkpoint signal to the executing job
• condor_syscall_lib makes the job dump core and saves the job state in the checkpoint file
• The checkpoint file is temporarily stored at the remote machine
• The starter transfers the latest checkpoint file to the shadow when the job is vacated
[Figure labels: Remote Machine: Starter process for the remote job, Checkpoint signal, Code in condor_syscall_lib saves process state information, Checkpoint file; Submit Machine: Shadow process for the job, Local File System, Checkpoint file; Checkpoint file transferred when job vacated; Checkpoint file transferred when job restarted]
CONDOR APIs - Glance
• Compile as a Condor job
    gcc -c hello.c -o hello.o
    condor_compile gcc hello.o -o hello
• Submit a Condor job
    cat > submit.hello <<EOF
    Executable = hello
    Universe = standard
    Output = hello.out
    Log = hello.log
    Queue
    EOF
    condor_submit submit.hello    # creates the Job ClassAd
CONDOR APIs (Cont.)
• condor_master – starts the other daemons
• condor_vacate – vacates jobs running on specified hosts
• condor_status – displays the status of the Condor pool
• condor_rm – removes a Condor job from the queue
• More commands @ http://www.cs.wisc.edu/condor/manual/v6.4/
REFERENCES
• Condor Project Home Page: http://www.cs.wisc.edu/condor/
• Research Publications on Condor: http://www.cs.wisc.edu/condor/publications.html