Condor by Example
Outline • Overview • Submitting Jobs, Getting Feedback • Setting Requirements with ClassAds • Using LOTS of Machines • Which Universe? • Conclusion
What is Condor? • Condor converts a collection of unrelated workstations into a high-throughput computing facility. • Condor uses matchmaking to make sure that everyone is happy.
What is High-Throughput Computing? • High-performance: CPU cycles/second under ideal circumstances. • “How fast can I run simulation X on this machine?” • High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. • “How many times can I run simulation X in the next week using all available machines?”
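The throughput question can be made concrete with a little arithmetic. The sketch below is only illustrative: the function name `runs_per_week` and every number in it (pool size, minutes per run, availability fraction) are assumptions, not measurements from a real pool.

```python
def runs_per_week(machines, minutes_per_run, availability):
    """Estimate how many runs complete in one week.

    machines        -- number of machines in the pool (assumed)
    minutes_per_run -- wall-clock minutes for one run on one machine (assumed)
    availability    -- fraction of the week each machine is free for jobs (assumed)
    """
    minutes_per_week = 7 * 24 * 60
    usable_minutes = machines * minutes_per_week * availability
    return int(usable_minutes // minutes_per_run)


# One dedicated machine versus 100 shared machines that are free half the time.
dedicated = runs_per_week(machines=1, minutes_per_run=60, availability=1.0)
shared_pool = runs_per_week(machines=100, minutes_per_run=60, availability=0.5)
print(dedicated, shared_pool)  # prints: 168 8400
```

Under these assumed numbers, a pool of part-time machines finishes far more runs per week than one dedicated machine, even though no single run gets faster.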
What is High-Throughput Computing? • Condor does whatever it takes to run your jobs, even if some machines… • Crash! • Are disconnected • Run out of disk space • Are removed from or added to the pool • Are put to other uses
What is Matchmaking? • Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners. • Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” • Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
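The two-sided nature of matchmaking can be sketched in a few lines. This is not how Condor implements it; it is a toy model in which each side carries a predicate over the other. The user name `alice`, the dictionary keys, and reading "256 MB RAM" as "at least 256 MB" are assumptions made for illustration.

```python
def matches(job, machine):
    """A match requires BOTH sides' requirements to accept the other side."""
    return job["requirements"](machine) and machine["requirements"](job)


# User constraint from the slide: an Alpha with 256 MB of RAM
# (interpreted here as "at least 256 MB", which is an assumption).
job = {
    "owner": "alice",  # hypothetical user name
    "requirements": lambda m: m["arch"] == "ALPHA" and m["memory_mb"] >= 256,
}


def owner_policy(job_ad, owner_is_away):
    """Owner constraint from the slide: only when away, and never Bob's jobs."""
    return owner_is_away and job_ad["owner"] != "bob"


machine = {
    "arch": "ALPHA",
    "memory_mb": 256,
    "requirements": lambda j: owner_policy(j, owner_is_away=True),
}

print(matches(job, machine))       # prints: True
bobs_job = dict(job, owner="bob")
print(matches(bobs_job, machine))  # prints: False
```

The same job description is accepted or rejected depending on who owns it, which is the point of letting owners state constraints too.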
“What can Condor do for me?” Condor can… • …increase your throughput. • …do your housekeeping. • …improve reliability. • …give performance feedback.
How many machines now? • The map is out of date! • The system is always changing. • First example: What machines (and of what kind) are in the pool now?
First Things First • Add Condor to your path (csh syntax): • setenv PATH ${PATH}:/library/condor_nfs/XXX/bin • Replace XXX with your platform: • OSF1, LINUX, SOLARIS26, HPUX10 …
How Many Machines?
    % condor_status
    Name           OpSys        Arch   State      Activity  LoadAv  Mem
    lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle      0.000    30
    axpd21.pd.inf  OSF1         ALPHA  Owner      Idle      0.266    96
    vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy      0.000   256
    . . .
                       Machines  Owner  Claimed  Unclaimed  Matched  Preempting
    ALPHA/OSF1              115     67       46          1        0           1
    INTEL/LINUX              53     18        0         35        0           0
    INTEL/LINUX-GLIBC        16      7        0          9        0           0
    SUN4u/SOLARIS251          1      1        0          0        0           0
    SUN4u/SOLARIS26           6      2        0          4        0           0
    SUN4u/SOLARIS27           1      1        0          0        0           0
    SUN4x/SOLARIS26           2      1        0          1        0           0
    Total                   194     97       46         50        0           1
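As a quick sanity check on how to read the summary, the per-platform rows should add up to the Total line. The sketch below copies the numbers from the condor_status output above. The variable names are illustrative, and no Condor command is invoked.

```python
# Per-platform summary rows copied from the condor_status output above:
# (platform, machines, owner, claimed, unclaimed, matched, preempting)
rows = [
    ("ALPHA/OSF1",        115, 67, 46,  1, 0, 1),
    ("INTEL/LINUX",        53, 18,  0, 35, 0, 0),
    ("INTEL/LINUX-GLIBC",  16,  7,  0,  9, 0, 0),
    ("SUN4u/SOLARIS251",    1,  1,  0,  0, 0, 0),
    ("SUN4u/SOLARIS26",     6,  2,  0,  4, 0, 0),
    ("SUN4u/SOLARIS27",     1,  1,  0,  0, 0, 0),
    ("SUN4x/SOLARIS26",     2,  1,  0,  1, 0, 0),
]

# Column-wise sums should reproduce the "Total" line.
totals = [sum(col) for col in zip(*(r[1:] for r in rows))]
print(totals)  # prints: [194, 97, 46, 50, 0, 1]

# Each row's state counts should add up to its machine count.
consistent = all(r[1] == sum(r[2:]) for r in rows)

# Fraction of the pool that is currently claimed by Condor jobs.
claimed_fraction = totals[2] / totals[0]
print(consistent, round(claimed_fraction, 3))
```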
Machine States • Most machines will be: • Owner: • The machine’s owner is busy at the console, so no Condor jobs may run. • Claimed: • Condor has selected the machine to run jobs for other users.
Machine States • Only a few should be: • Unclaimed: • The owner is gone, but Condor has not yet selected the machine. • Matched: • Between claimed and unclaimed. • Preempting: • Condor is busy removing a job.
More Examples
    % condor_status -help
    % condor_status -avail
    % condor_status -run
    % condor_status -total
    % condor_status -pool condor.cs.wisc.edu
Steps to Running a Job • Re-link for Condor. • Submit the job. • Watch the progress. • Receive email when done.
Example Job: Compute the nth Fibonacci number. Fib(40) takes about one minute to compute on an Alpha.
    % ./fib 40
    fib(40) = 102334155
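    #include <stdio.h>
    #include <stdlib.h>

    /* Naive recursive Fibonacci (exponential time). */
    int fib( int x )
    {
        if( x<=0 ) return 0;
        if( x==1 ) return 1;
        return fib(x-1) + fib(x-2);
    }

    int main( int argc, char *argv[] )
    {
        int n;
        /* Guard against a missing argument before reading argv[1]. */
        if( argc<2 ) {
            fprintf(stderr,"usage: %s n\n",argv[0]);
            return 1;
        }
        n = atoi(argv[1]);
        printf("fib(%d) = %d\n",n,fib(n));
        return 0;
    }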
Re-link for Condor • Normal compile: • gcc -c fib.c -o fib.o • Normal link: • gcc fib.o -o fib • Use the same command, but add condor_compile: • condor_compile gcc fib.o -o fib
Submit the Job • Create a submit file: • vi fib.submit • Submit the job: • condor_submit fib.submit
Contents of fib.submit:
    Executable = fib
    Arguments  = 40
    Output     = fib.out
    Log        = fib.log
    queue
Watch the Progress
    % condor_q
    -- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :
     ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
     5.0     thain   6/21 12:40  0+00:00:15 R  0   2.5  fib 40
• ID: Each job gets a unique number. • ST: Status: Unexpanded, Running or Idle. • SIZE: Size of program image (MB).
Receive E-mail When Done
    This is an automated email from the Condor system on machine "axpbo8.bo.infn.it". Do not reply.
    Your condor job /tmp_mnt/usr/users/ccl/thain/test/fib 40 exited with status 0.
    Submitted at:   Wed Jun 21 14:24:42 2000
    Completed at:   Wed Jun 21 14:36:36 2000
    Real Time:      0 00:11:54
    Run Time:       0 00:06:52
    Committed Time: 0 00:01:37
    . . .
Running Many Processes • 100 processes are almost as easy as 1. • Each condor_submit makes one cluster of one or more processes. • Add the number of processes to run to the Queue statement. • Use the $(PROCESS) variable to give each process slightly different instructions.
Running Many Processes • Compute Fib(0) through Fib(49): $(PROCESS) counts from 0, so Queue 50 yields processes 0 to 49. • Output goes in fib.out.0, fib.out.1, and so on…
    Executable = fib
    Arguments  = $(PROCESS)
    Output     = fib.out.$(PROCESS)
    Log        = fib.log
    Queue 50
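To see what each process receives, the macro expansion can be imitated by hand. This is a rough sketch of the substitution only, not Condor's own parser. The helper name `expand_queue` is hypothetical, and it assumes $(PROCESS) counts from 0, which is how Condor numbers the processes within a cluster.

```python
def expand_queue(template, count):
    """Return one dict per process with $(PROCESS) replaced by 0..count-1."""
    jobs = []
    for process in range(count):
        jobs.append({key: value.replace("$(PROCESS)", str(process))
                     for key, value in template.items()})
    return jobs


# The submit description from the slide, as key/value pairs.
template = {
    "Executable": "fib",
    "Arguments": "$(PROCESS)",
    "Output": "fib.out.$(PROCESS)",
    "Log": "fib.log",
}

jobs = expand_queue(template, 50)
print(jobs[0]["Output"], jobs[-1]["Output"])  # prints: fib.out.0 fib.out.49
```

Every process shares the same Log file, while Arguments and Output differ per process.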
Running Many Processes • Another approach: Each process gets its own directory (dir0, dir1, …) and sends the output to dirX/fib.out.
    Executable  = fib
    Arguments   = $(PROCESS)
    Initial_Dir = dir$(PROCESS)
    Output      = fib.out
    Log         = fib.log
    Queue 50
Running Many Processes
    % condor_q
    -- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038>
     ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
     9.3     thain   6/23 10:47  0+00:05:40 R  0   2.5  fib 3
     9.6     thain   6/23 10:47  0+00:05:11 R  0   2.5  fib 6
     9.7     thain   6/23 10:47  0+00:05:09 R  0   2.5  fib 7
    . . .
    21 jobs; 2 idle, 19 running, 0 held
• In an ID such as 9.3, the first part (9) is the cluster number and the second part (3) is the process number.
Where Are They Running?
    % condor_q -run
    -- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038> :
     ID      OWNER   SUBMITTED     RUN_TIME   HOST(S)
     9.47    thain   6/23 10:47  0+00:07:03   ax4bbt.bo.infn.it
     9.48    thain   6/23 10:47  0+00:06:51   pewobo1.bo.infn.it
     9.49    thain   6/23 10:47  0+00:06:30   osde01.pd.infn.it
• HOST(S): Current location of each process.
Help! I’m buried in Email! • By default, Condor sends one email for each completed process. • Add one of these to your submit file: • notification = error (email only when a process fails) • notification = never (no email at all) • To send it to someone else: • notify_user = mazzanti@bo.infn.it
Removing Processes • Remove one process: • condor_rm 9.47 • Remove a whole cluster: • condor_rm 9 • Remove everything! • condor_rm -a
What have I done? • The user log file (fib.log) shows a chronological list of everything important that happened to a job.
    001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>
    004 (007.035.000) 06/21 17:04:58 Job was evicted.
    009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.
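Because each event line follows a regular shape, the user log is easy to post-process. The sketch below parses only lines shaped like the three shown above (event code, cluster.process.subprocess, date, time, message). The regular expression and the name `parse_event` are assumptions, and a real fib.log may contain other lines that this pattern skips.

```python
import re

# Pattern for lines shaped like the user-log excerpt above:
# event code, (cluster.process.subprocess), date, time, free-text message.
EVENT = re.compile(
    r"(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<sub>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<message>.*)"
)


def parse_event(line):
    """Return a dict for one event line, or None if the line does not match."""
    m = EVENT.match(line.strip())
    if m is None:
        return None
    event = m.groupdict()
    for key in ("code", "cluster", "proc", "sub"):
        event[key] = int(event[key])  # "007" becomes 7, "035" becomes 35
    return event


log_lines = [
    "001 (007.035.000) 06/21 17:03:44 Job executing on host: <140.105.6.155:2219>",
    "004 (007.035.000) 06/21 17:04:58 Job was evicted.",
    "009 (007.035.000) 06/21 17:05:10 Job was aborted by the user.",
]
events = [parse_event(line) for line in log_lines]
for e in events:
    print(e["code"], f'{e["cluster"]}.{e["proc"]}', e["time"], e["message"])
```

All three events belong to process 35 of cluster 7, so filtering on those two fields is enough to follow one job through the log.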
What have I done?
    % condor_history
     ID      OWNER   SUBMITTED    CPU_USAGE  ST COMPLETED   CMD
     9.3     thain   6/23 10:47  0+00:00:00  C  6/23 10:58  fib 3
     9.40    thain   6/23 10:47  0+00:00:24  C  6/23 10:59  fib 40
     9.10    thain   6/23 10:47  0+00:00:00  C  6/23 11:01  fib 10
     9.47    thain   6/23 10:47  0+00:05:45  C  6/23 11:01  fib 47
     9.7     thain   6/23 10:47  0+00:00:00  C  6/23 11:01  fib 7
Brief I/O Summary
    % condor_q -io
    -- Schedd: c01.cs.wisc.edu : <128.105.146.101:2016>
     ID      OWNER       READ      WRITE  SEEK       XPUT   BUFSIZE  BLKSIZE
     756.15  joe     244.9 KB   379.8 KB    71   1.3 KB/s  512.0 KB  32.0 KB
     758.24  joe     198.8 KB   219.5 KB    78  45.0 B /s  512.0 KB  32.0 KB
     758.26  joe      44.7 KB    22.1 KB  2727  13.0 B /s  512.0 KB  32.0 KB
    3 jobs; 0 idle, 3 running, 0 held
Complete I/O Summary in Email
    Your condor job "/usr/joe/records.remote input output" exited with status 0.
    Total I/O:
        104.2 KB/s effective throughput
        5 files opened
        104 reads totaling 411.0 KB
        316 writes totaling 1.2 MB
        102 seeks
    I/O by File:
        buffered file /usr/joe/input
            opened 2 times
            100 reads totaling 398.6 KB
            311 write totaling 1.2 MB
            101 seeks
(Only since Condor Version 6.1.11)
Complete I/O Summary in Email • The summary helps identify performance problems. Even advanced users don't know exactly how their programs and libraries operate.
Complete I/O Summary in Email • Example: • CMSSIM - physics analysis program. • “Why is this job so slow?” • Data summary: • read 250 MB from 20 MB file. • Very high SEEK total -> random access. • Solution: Increase buffer to 20 MB.
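The diagnosis in this example comes down to two small checks: how many times the file was effectively re-read, and whether the whole file fits in the I/O buffer. The sketch below uses the figures from the slides (250 MB read, 20 MB file, and the 512 KB BUFSIZE shown in the brief summary). The function names are hypothetical.

```python
MB = 1024 * 1024
KB = 1024


def reread_factor(bytes_read, file_size):
    """How many times, on average, each byte of the file was read."""
    return bytes_read / file_size


def fits_in_buffer(file_size, buffer_size):
    """True if the buffer can hold the whole file, so re-reads stay local."""
    return buffer_size >= file_size


factor = reread_factor(250 * MB, 20 * MB)
print(factor)                                # prints: 12.5
print(fits_in_buffer(20 * MB, 512 * KB))     # prints: False
print(fits_in_buffer(20 * MB, 20 * MB))      # prints: True
```

A re-read factor well above 1 combined with a high seek count suggests random access over the same data, which is why enlarging the buffer to the file size helps here.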
Who Uses Condor?
    % condor_q -global
    -- Schedd: to02xd.to.infn.it : <192.84.137.2:1030>
     ID      OWNER     SUBMITTED      RUN_TIME ST PRI SIZE CMD
     127.0   garzelli  6/21 18:45   1+14:18:16 R  0   17.2 tosti2trisdn
    -- Schedd: quark.ts.infn.it : <140.105.6.101:3908>
     ID      OWNER     SUBMITTED      RUN_TIME ST PRI SIZE CMD
     600.0   dellaric  4/10 14:57  55+09:20:31 R  0   9.1  john p2.dat
     665.0   dellaric  6/2  11:14  20+03:27:30 R  0   9.2  john p1.dat
     788.0   pamela    6/20 09:27   3+04:41:43 R  0   15.4 montepamela
Who Uses Condor?
    % condor_status -submitters
    Name                 Machine     Running  IdleJobs  MaxJobsRunning
    rebuzzin@pv.infn.it  decux1.pv.       22        34             200
    pamela@ts.infn.it    quark.ts.i        6         1             200
    giunti@to.infn.it    to05xd.to.       21        49             200
    . . .
                         RunningJobs  IdleJobs
    cattaneo@pv.infn.it            0         1
    pamela@ts.infn.it              6         1
    rebuzzin@pv.infn.it           22        34
    Total                         59        86
Who Uses Condor?
    % condor_userprio
    Last Priority Update:  6/23 16:27
                                   Effective
    User Name                      Priority
    ------------------------------ ---------
    meucci@pv.infn.it                   0.50
    longof@ts.infn.it                   0.50
    thain@bo.infn.it                    0.50
    dellaric@ts.infn.it                 2.00
    clueoff@pd.infn.it                  3.00
    pamela@ts.infn.it                   5.81
    rebuzzin@pv.infn.it                18.18
    giunti@to.infn.it                  19.72
    ------------------------------ ---------
    Number of users shown: 8
Who Uses Condor? • The user priority is computed by Condor to estimate how much of the pool’s CPU resources have been used by each submitter. • A lower number is a better priority: lighter users receive lower values and will be allocated CPUs before heavy users. • Users consuming the same amount of CPU will be allocated an equal amount.
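The ordering rule can be shown with the priorities from the condor_userprio listing above. This sketch only sorts the values and does not model how Condor computes or decays them. Breaking ties alphabetically is an assumption made so the result is deterministic.

```python
# Effective priorities copied from the condor_userprio listing above.
priorities = {
    "meucci@pv.infn.it": 0.50,
    "longof@ts.infn.it": 0.50,
    "thain@bo.infn.it": 0.50,
    "dellaric@ts.infn.it": 2.00,
    "clueoff@pd.infn.it": 3.00,
    "pamela@ts.infn.it": 5.81,
    "rebuzzin@pv.infn.it": 18.18,
    "giunti@to.infn.it": 19.72,
}


def service_order(prio):
    """Users sorted best-first: the smallest priority value is served first.

    Ties are broken alphabetically (an assumption, for determinism only).
    """
    return sorted(prio, key=lambda user: (prio[user], user))


order = service_order(priorities)
print(order[0], order[-1])  # prints: longof@ts.infn.it giunti@to.infn.it
```

The heaviest recent user (19.72) ends up last in line, while the three users at 0.50 are treated as equals.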
Measuring Goodput • Goodput is the amount of time a workstation spends making forward progress on work assigned by Condor. • This is a big topic all by itself: http://www.cs.wisc.edu/condor/goodput
Measuring Goodput
    % condor_q -goodput
    -- Submitter: coral.cs.wisc.edu : <128.105.175.116:45697> : coral.cs.wisc.edu
     ID      OWNER   SUBMITTED     RUN_TIME  GOODPUT  CPU_UTIL  Mb/s
     719.74  thain   6/23 07:35  2+20:47:59   100.0%     87.6%  0.00
     719.75  thain   6/23 07:35  2+20:38:45    40.5%     99.8%  0.00
     719.76  thain   6/23 07:35  2+20:38:16    96.9%     98.7%  0.00
     719.77  thain   6/23 07:35  2+21:10:06   100.0%     99.8%  0.00
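To get a feel for the numbers, the GOODPUT percentage can be turned back into seconds of useful work. The sketch below assumes GOODPUT is the share of RUN_TIME spent making forward progress, which matches the definition of goodput given earlier but is not confirmed by the listing itself. The function names are hypothetical.

```python
def parse_run_time(text):
    """Convert a 'D+HH:MM:SS' run time, as printed by condor_q, to seconds."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds


def productive_seconds(run_time, goodput_percent):
    """Seconds of run time counted as forward progress (assumed meaning)."""
    return parse_run_time(run_time) * goodput_percent / 100.0


# Job 719.75 from the listing: long run time, but only 40.5% goodput.
total = parse_run_time("2+20:38:45")
useful = productive_seconds("2+20:38:45", 40.5)
print(total, useful)
```

Under that assumption, job 719.75 spent well over half of its nearly three days of run time on work that was later lost.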
Setting Requirements • We believe that Condor must allow both users (jobs) and owners (machines) to set requirements. • This is an absolute necessity in order to convince people to participate in the community.
ClassAds • ClassAds are a simple language for describing both the properties and the requirements of jobs and machines. • Condor stores nearly everything in ClassAds: use the -l option to condor_q and condor_status to get the full details.
ClassAd for a Machine • condor_status -l axpbo8
    MyType = "Machine"
    TargetType = "Job"
    Name = "axpbo8.bo.infn.it"
    START = TRUE
    VirtualMemory = 342696
    Disk = 28728536
    Memory = 160
    Cpus = 1
    Arch = "ALPHA"
    OpSys = "OSF1"
ClassAd for a Job • condor_q -l 9.49
    MyType = "Job"
    TargetType = "Machine"
    Owner = "thain"
    Cmd = "/tmp_mnt/usr/users/ccl/thain/test/fib"
    Out = "fib.out.49"
    Args = "49"
    ImageSize = 2544
    DiskUsage = 2544
    Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)
Default Requirements • By default, Condor assumes the requirements for your job are: “I need a machine with…” • The same operating system and architecture as my workstation. • Enough disk to store the program. • Enough virtual memory to run the program.
Default Requirements • Expressed in ClassAds as:
    Requirements = (Arch == "ALPHA") && (OpSys == "OSF1") && (Disk >= DiskUsage) && (VirtualMemory >= ImageSize)
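To see how the default expression plays out, it can be evaluated by hand against the machine and job ads shown earlier. The sketch below is a Python rendering of the expression, not the ClassAd evaluator. The function name `default_requirements` and the INTEL/LINUX machine are illustrative assumptions; the attribute values are copied from the listings for axpbo8 and job 9.49.

```python
# Machine attributes from the condor_status -l listing for axpbo8.
machine = {"Arch": "ALPHA", "OpSys": "OSF1", "Disk": 28728536,
           "VirtualMemory": 342696, "Memory": 160}

# Job attributes from the condor_q -l listing for job 9.49.
job = {"ImageSize": 2544, "DiskUsage": 2544}


def default_requirements(job, machine, arch="ALPHA", opsys="OSF1"):
    """Python rendering of the default Requirements expression."""
    return (machine["Arch"] == arch
            and machine["OpSys"] == opsys
            and machine["Disk"] >= job["DiskUsage"]
            and machine["VirtualMemory"] >= job["ImageSize"])


print(default_requirements(job, machine))    # prints: True

# A hypothetical machine with a different platform fails the first clause.
linux_box = dict(machine, Arch="INTEL", OpSys="LINUX")
print(default_requirements(job, linux_box))  # prints: False
```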
ClassAd Requirements • Similar to C/C++/Java expressions: • Symbols: Arch, OpSys, Memory, Mips • Values: 15, 6.5, "LINUX" • Operators: • ==, <, >, <=, >= • &&, || • ( )