Using Condor: An Introduction (ICE 2011)
The Condor Project (Established '85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students.
Condor is a batch computing system • High Throughput (HTC), • not High Performance (HPC) • Originated from desktop cycle scavenging
Cycle Scavenging • A good metaphor, even for dedicated clusters
Cycles Are Cheap! • Amazon.com EC2: 10 cents/hour • Academic computing: 4 cents/hour • Opportunistic computing: even cheaper
Total Usage between 2011-07-21 and 2011-07-22

Group Usage Summary
     User                       Hours    Pct   Demand
  -- ------------------------ -------- ------ ------
   1 Physics_Balantekin         6224.8  16.8%  46.4%
   2 ChE_dePablo                5932.3  16.0% 100.0%
   3 Astronomy_Friedman         5764.0  15.5%   0.0%
   4 Economics_Traczynski       4218.4  11.4%  61.1%
   5 Chemistry_Skinner          4186.5  11.3%  45.4%
   6 BMRB                       1731.5   4.7%  15.6%
   7 Physics_Petriello          1708.3   4.6%   7.1%
   8 CMS                        1494.6   4.0%  31.8%
   9 LMCG                       1444.4   3.9%  27.3%
  10 Biochem_Sussman             996.3   2.7%   3.6%
  11 Atlas                       847.9   2.3%  79.9%
  12 MSE                         812.5   2.2%   2.9%
  -- ------------------------ -------- ------ ------
     TOTAL                     37126.7 100.0% 100.0%
HTC in a nutshell • Work is divided into “jobs” • A cluster of computers is divided into “machines” • HTC runs jobs on machines.
Definitions • Job • The Condor representation of your work • Machine • The Condor representation of a computer that can perform the work • Matchmaking • Matching a job with a machine (“resource”)
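Under the hood, both jobs and machines are described by “ClassAds”: lists of attribute = value pairs that Condor compares during matchmaking. A minimal sketch of a job ad (the attribute names are real ClassAd attributes; the values are illustrative, and a real ad carries many more attributes):

MyType = "Job"
TargetType = "Machine"
Owner = "frieda"
Cmd = "/home/frieda/my_job"
Requirements = (OpSys == "LINUX")

Machines advertise a matching ad with MyType = "Machine", which is what makes the matchmaking described below possible.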
Job Jobs state their requirements and preferences: I need a Linux/x86 platform I need a machine with at least 500 MB of memory I prefer a machine with more memory
Machine Machines state their requirements and preferences: Run jobs only when there is no keyboard activity I prefer to run Frieda’s jobs I am a machine in the econ department Never run jobs belonging to Dr. Smith
The Magic of Matchmaking • Jobs and machines state their requirements and preferences • Condor matches jobs with machines based on requirements and preferences
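In practice, the job side of this goes in the submit description file and the machine side in the machine’s Condor configuration. A minimal sketch, assuming a 500 MB Linux job and a desktop whose owner favors Frieda (requirements, rank, START, and RANK are real Condor settings; the values are illustrative):

# Job side (submit description file)
requirements = (OpSys == "LINUX") && (Memory >= 500)
rank         = Memory

# Machine side (condor_config)
START = KeyboardIdle > 15 * 60
RANK  = (Owner == "frieda")

The matchmaker pairs a job with a machine only when both requirements expressions are satisfied, then uses the rank expressions to prefer better matches.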
Getting Started: Submitting Jobs to Condor • Overview: • Choose a “Universe” for your job • Make your job “batch-ready” • Create a submit description file • Run condor_submit to put your job in the queue
1. Choose the “Universe” • Controls how Condor handles jobs • Choices include: • Vanilla • Standard • Grid • Java • Parallel • VM
Using the Vanilla Universe • The Vanilla Universe: • Allows running almost any “serial” job • Provides automatic file transfer, etc. • Like vanilla ice cream • Can be used in just about any situation
2. Make your job batch-ready Must be able to run in the background • No interactive input • No GUI/window clicks • No music ;^)
Make your job batch-ready (continued)… • Job can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices • Similar to UNIX (or DOS) shell: • $ ./myprogram <input.txt >output.txt
3. Create a Submit Description File • A plain ASCII text file • Condor does not care about file extensions • Tells Condor about your job: • Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later) • Can describe many jobs at once (a “cluster”), each with different input, arguments, output, etc.
Simple Submit Description File

# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = vanilla
Executable = my_job
Output     = output.txt
Queue
4. Run condor_submit • You give condor_submit the name of the submit file you have created: • condor_submit my_job.submit • condor_submit: • Parses the submit file, checks for errors • Creates a “ClassAd” that describes your job(s) • Puts job(s) in the Job Queue
The Job Queue • condor_submit sends your job’s ClassAd(s) to the schedd • The schedd (more details later): • Manages the local job queue • Stores the job in the job queue • Atomic operation, two-phase commit • “Like money in the bank” • View the queue with condor_q
Example condor_submit and condor_q

% condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.

% condor_q

-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   6/16 06:52   0+00:00:00 I  0   0.0  my_job

1 jobs; 1 idle, 0 running, 0 held
%
Input, output & error files • Controlled by submit file settings • You can define the job’s standard input, standard output and standard error: • Read job’s standard input from “input_file”: • Input = input_file • Shell equivalent: program <input_file • Write job’s standard output to “output_file”: • Output = output_file • Shell equivalent: program >output_file • Write job’s standard error to “error_file”: • Error = error_file • Shell equivalent: program 2>error_file
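Putting the three together, a submit file fragment such as the following (filenames illustrative) behaves roughly like the shell command in the comment beneath it:

Input  = my_job.in
Output = my_job.out
Error  = my_job.err
# roughly: my_job <my_job.in >my_job.out 2>my_job.err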
Email about your job • Condor sends email about job events to the submitting user • Specify “notification” in your submit file to control which events: • Notification = complete (the default) • Notification = never • Notification = error • Notification = always
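By default the mail goes to the submitting user; you can redirect it with notify_user (a real submit command; the address below is illustrative):

Notification = Error
Notify_User  = frieda@example.edu

With this pair of settings, Condor emails frieda only when a job runs into trouble, rather than on every completion.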
Feedback on your job • Create a log of job events • Add to submit description file: log = sim.log • Becomes the Life Story of a Job • Shows all events in the life of a job • Always have a log file
Sample Condor User Log

000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
        (1) Normal termination (return value 0)
...
Example Submit Description File With Logging

# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = vanilla
Executable = /home/frieda/condor/my_job.condor
Log        = my_job.log      # Job log (from Condor)
Input      = my_job.in       # Program's standard input
Output     = my_job.out      # Program's standard output
Error      = my_job.err      # Program's standard error
Arguments  = -a1 -a2         # Command line arguments
InitialDir = /home/frieda/condor/run
Queue
Let’s run a job • First, you need a terminal emulator • http://www.putty.org (or similar) • Log in to chopin.cs.wisc.edu as cguserXX, with the given password
Logged In?

source /scratch/setup.sh
mkdir /scratch/your_name
cd /scratch/your_name
condor_q
condor_status
Create submit file

nano submit.your_initials

universe                = vanilla
executable              = /bin/echo
arguments               = hello world
should_transfer_files   = yes
when_to_transfer_output = on_exit
output                  = out
log                     = log
queue
And submit it… • condor_submit submit.your_initials • (wait… remember the HTC bit?) • condor_q xx • cat out
A Matlab (Octave) example

#!/s/std/bin/octave -qf
printf("Hello World\n");

Save as Hello.o
chmod 0755 Hello.o
./Hello.o
submit file

nano submit.your_initials

universe                = vanilla
executable              = Hello.o
should_transfer_files   = yes
when_to_transfer_output = on_exit
output                  = out
log                     = log
queue
“Clusters” and “Processes” • If your submit file describes multiple jobs, we call this a “cluster” • Each cluster has a unique “cluster number” • Each job in a cluster is called a “process” • Process numbers always start at zero • A Condor “Job ID” is the cluster number, a period, and the process number (e.g. 2.1) • A cluster can have a single process • Job ID = 20.0 (cluster 20, process 0) • Or, a cluster can have more than one process • Job IDs 21.0, 21.1, 21.2 (cluster 21, processes 0, 1, 2)
Submit File for a Cluster

# Example submit file for a cluster of 2 jobs
# with separate input, output, error and log files
Universe   = vanilla
Executable = my_job

Arguments  = -x 0
Log        = my_job_0.log
Input      = my_job_0.in
Output     = my_job_0.out
Error      = my_job_0.err
Queue                        # Job 2.0 (cluster 2, process 0)

Arguments  = -x 1
Log        = my_job_1.log
Input      = my_job_1.in
Output     = my_job_1.out
Error      = my_job_1.err
Queue                        # Job 2.1 (cluster 2, process 1)
Submitting The Job

% condor_submit my_job.submit-file
Submitting job(s).
2 job(s) submitted to cluster 2.

% condor_q

-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   4/15 06:52   0+00:02:11 R  0   0.0  my_job -a1 -a2
 2.0     frieda   4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 0
 2.1     frieda   4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 1

3 jobs; 2 idle, 1 running, 0 held
%
Organize your files and directories for big runs • Create subdirectories for each “run” • run_0, run_1, … run_599 • Create input files in each of these • run_0/simulation.in • run_1/simulation.in • … • run_599/simulation.in • The output, error & log files for each job will be created by Condor from your job’s output
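A shell loop is the quickest way to build this layout. A sketch, assuming a bash shell and a hypothetical template.in holding the common parts of the input:

$ for i in $(seq 0 599); do
>   mkdir run_$i
>   cp template.in run_$i/simulation.in   # customize per run as needed
> done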
Submit Description File for 600 Jobs

# Cluster of 600 jobs with different directories
Universe   = vanilla
Executable = sim
Log        = simulation.log
...
Arguments  = -x 0
InitialDir = run_0           # Log, input, output & error files -> run_0
Queue                        # Job 3.0 (cluster 3, process 0)

Arguments  = -x 1
InitialDir = run_1           # Log, input, output & error files -> run_1
Queue                        # Job 3.1 (cluster 3, process 1)

# ...do this 598 more times...
Submit File for a BigCluster of Jobs • We just submitted 1 cluster with 600 processes • All the input/output files will be in different directories • The submit file is pretty unwieldy (over 1200 lines) • Isn’t there a better way?
Submit File for a BigCluster of Jobs (the better way) #1 • We can queue all 600 in 1 “Queue” command • Queue 600 • Condor provides $(Process) and $(Cluster) • $(Process) will be expanded to the process number for each job in the cluster • 0, 1, … 599 • $(Cluster) will be expanded to the cluster number • Will be 4 for all jobs in this cluster
Submit File for a BigCluster of Jobs (the better way) #2 • The initial directory for each job can be specified using $(Process) • InitialDir = run_$(Process) • Condor will expand these to “run_0”, “run_1”, … “run_599” directories • Similarly, arguments can be variable • Arguments = -x $(Process) • Condor will expand these to “-x 0”, “-x 1”, … “-x 599”
Better Submit File for 600 Jobs

# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe   = vanilla
Executable = my_job
Log        = my_job.log
Input      = my_job.in
Output     = my_job.out
Error      = my_job.err
Arguments  = -x $(Process)       # -x 0, -x 1, ... -x 599
InitialDir = run_$(Process)      # run_0 ... run_599
Queue 600                        # Jobs 4.0 ... 4.599
Now, we submit it…

$ condor_submit my_job.submit
Submitting job(s)
................................................................  (600 dots, one per job)
Logging submit event(s)
................................................................  (600 dots, one per job)
600 job(s) submitted to cluster 4.
And, check the queue

$ condor_q

-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 4.0     frieda   4/20 12:08   0+00:00:05 R  0   9.8  my_job -arg1 -x 0
 4.1     frieda   4/20 12:08   0+00:00:03 I  0   9.8  my_job -arg1 -x 1
 4.2     frieda   4/20 12:08   0+00:00:01 I  0   9.8  my_job -arg1 -x 2
 4.3     frieda   4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 3
...
 4.598   frieda   4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 598
 4.599   frieda   4/20 12:08   0+00:00:00 I  0   9.8  my_job -arg1 -x 599

600 jobs; 599 idle, 1 running, 0 held
Removing jobs • If you want to remove a job from the Condor queue, you use condor_rm • You can only remove jobs that you own • A privileged user can remove any job • “root” on UNIX • “administrator” on Windows
Removing jobs (continued) • Remove an entire cluster: • condor_rm 4 (removes the whole cluster) • Remove a specific job from a cluster: • condor_rm 4.0 (removes a single job) • Or, remove all of your jobs with “-a”: • condor_rm -a (removes all of your jobs / clusters)
Submit cluster of 10 jobs

nano submit

universe                = vanilla
executable              = /bin/echo
should_transfer_files   = yes
when_to_transfer_output = on_exit
arguments               = hello world $(Process)
output                  = out.$(Process)
log                     = log
queue 10
And submit it… • condor_submit submit • (wait…) • condor_q xx • cat log • cat out.yy
My new jobs run for 20 days… • What happens when a job is forced off its CPU? • Preempted by a higher priority user or job • Vacated because of user activity • How can I add fault tolerance to my jobs?
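The job log from the earlier slides is where these interruptions first show up. A sketch of what an eviction event might look like, following the sample log format above (timestamps illustrative):

004 (0001.000.000) 05/26 02:31:40 Job was evicted.
        (0) Job was not checkpointed.
...

Whatever fault-tolerance strategy you choose, the log tells you when a job lost its machine.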