Condor Users Tutorial National e-Science Centre Edinburgh, Scotland October 2003. The Condor Project (Established ‘85). Distributed High Throughput Computing research performed by a team of ~35 faculty, full time staff and students.
Condor Users Tutorial, National e-Science Centre, Edinburgh, Scotland, October 2003
The Condor Project (Established ‘85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full time staff and students who: • face software engineering challenges in a distributed UNIX/Linux/NT environment • are involved in national and international grid collaborations, • actively interact with academic and commercial users, • maintain and support large distributed production environments, • and educate and train students. Funding – US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …
A Multifaceted Project • Harnessing the power of clusters - opportunistic and/or dedicated (Condor) • Job management services for Grid applications (Condor-G, Stork) • Fabric management services for Grid resources (Condor, GlideIns, NeST) • Distributed I/O technology (Parrot, Kangaroo, NeST) • Job-flow management (DAGMan, Condor, Hawk) • Distributed monitoring and management (HawkEye) • Technology for Distributed Systems (ClassAd, MW) • Packaging and Integration (NMI, VDT)
Some software produced by the Condor Project • MW • NeST • Stork • Parrot • Condor-G • And others… all as open source • Condor System • ClassAd Library • DAGMan • Fault Tolerant Shell (FTSH) • HawkEye
Fault Tolerant Shell (FTSH) • The Grid is a hard environment. • FTSH • The ease of scripting with very precise error semantics. • Exception-like structure allows scripts to be both succinct and safe. • A focus on timed repetition simplifies the most common form of recovery in a distributed system. • A carefully-vetted set of language features limits the "surprises" that haunt system programmers.
Simple Bourne script…
    #!/bin/sh
    cd /work/foo
    rm -rf data
    cp -r /fresh/data .
What if ‘/work/foo’ is unavailable??
Getting Grid Ready…
    #!/bin/sh
    # Try the cd up to three times, pausing between attempts.
    ok=0
    for attempt in 1 2 3
    do
        if cd /work/foo
        then
            ok=1
            break
        fi
        echo "cd failed, trying again..."
        sleep 5
    done
    # Give up if no attempt succeeded.
    if [ "$ok" -ne 1 ]
    then
        echo "couldn't cd, giving up..."
        exit 1
    fi
Or with FTSH
    #!/usr/bin/ftsh
    try 5 times
        cd /work/foo
        rm -rf bar
        cp -r /fresh/data .
    end
Or with FTSH
    #!/usr/bin/ftsh
    try for 3 days or 100 times
        cd /work/foo
        rm -rf bar
        cp -r /fresh/data .
    end
Or with FTSH
    #!/usr/bin/ftsh
    try for 3 days every 1 hour
        cd /work/foo
        rm -rf bar
        cp -r /fresh/data .
    end
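For comparison, the "try N times" form can be approximated in plain Bourne shell with a small helper. This is only a sketch: the function name retry and the RETRY_DELAY variable are hypothetical and are not part of FTSH or Condor, and the sketch covers only the attempt-count form, not the "for 3 days" time limit.

```shell
#!/bin/sh
# retry MAX CMD [ARGS...]: run CMD until it succeeds, at most MAX times.
# RETRY_DELAY (seconds to wait between attempts) is a hypothetical knob;
# it defaults to 5, matching the hand-written loop on the earlier slide.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]
    do
        if "$@"
        then
            return 0
        fi
        echo "attempt $attempt of $max failed: $*" >&2
        attempt=$((attempt + 1))
        # Only sleep if another attempt is still to come.
        if [ "$attempt" -le "$max" ]
        then
            sleep "${RETRY_DELAY:-5}"
        fi
    done
    return 1
}
```

A script could then write "retry 5 cd /work/foo || exit 1". Because the command runs inside a shell function rather than a subshell, a successful cd still changes the directory of the calling script.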
Another quick example…
    hosts="mirror1.wisc.edu mirror2.wisc.edu mirror3.wisc.edu"
    forany h in ${hosts}
        echo "Attempting host ${h}"
        wget http://${h}/some-file
    end
    echo "Got file from ${h}"
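The same "first alternative that works" idea can be sketched in plain Bourne shell. The names first_success, fetch_from and chosen are hypothetical helpers introduced for illustration; this sketch tries the items in the order listed and makes no claim about how FTSH itself orders or schedules the alternatives.

```shell
#!/bin/sh
# first_success CMD ITEM...: run "CMD ITEM" for each ITEM in order and stop
# at the first one that succeeds. The winning item is left in $chosen;
# $chosen is empty if every item failed.
first_success() {
    cmd=$1
    shift
    chosen=
    for item in "$@"
    do
        if "$cmd" "$item"
        then
            chosen=$item
            return 0
        fi
    done
    return 1
}

# Hypothetical fetch step; the URL layout mirrors the slide's example.
fetch_from() {
    echo "Attempting host $1"
    wget "http://$1/some-file"
}
```

Usage would look like: first_success fetch_from mirror1.wisc.edu mirror2.wisc.edu mirror3.wisc.edu && echo "Got file from $chosen".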
FTSH • All the usual constructs • Redirection, loops, conditionals, functions, expressions, nesting, … • And more • Logging • Timeouts • Process Cancellation • Complete parsing at startup • File cleanup • Used on Linux, Solaris, Irix, Cygwin, … • Simplify your life!
More Software… • HawkEye • A monitoring tool • MW • Framework to create a master-worker style application in an opportunistic environment • NeST • Flexible Network Storage appliance • “Lots” : reserved space • Stork • A scheduler for grid data placement activities • Treat data movement as a “first class citizen”
More Software, cont. • Parrot • Useful in distributed batch systems where one has access to many CPUs, but no consistent distributed filesystem (BYOFS!). • Works with any program
    % gv /gsiftp/www.cs.wisc.edu/condor/doc/usenix_1.92.ps
    % grep Yahoo /http/www.yahoo.com
What is Condor? • Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility. • Condor manages both resources (machines) and resource requests (jobs) • Condor has several unique mechanisms such as: • ClassAd Matchmaking • Process checkpoint / restart / migration • Remote System Calls • Grid Awareness
Condor can manage a large number of jobs • Managing a large number of jobs • You specify the jobs in a file and submit them to Condor, which runs them all and keeps you notified on their progress • Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc. • Condor can handle inter-job dependencies (DAGMan) • Condor users can set job priorities • Condor administrators can set user priorities
Condor can manage Dedicated Resources… • Dedicated Resources • Compute Clusters • Manage • Node monitoring, scheduling • Job launch, monitor & cleanup
…and Condor can manage non-dedicated resources • Non-dedicated resources examples: • Desktop workstations in offices • Workstations in student labs • Non-dedicated resources are often idle --- ~70% of the time! • Condor can effectively harness the otherwise wasted compute cycles from non-dedicated resources
Mechanisms in Condor used to harness non-dedicated workstations • Transparent Process Checkpoint / Restart • Transparent Process Migration • Transparent Redirection of I/O (Condor’s Remote System Calls)
What else is Condor Good For? • Robustness • Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion • If an execute machine crashes, you only lose work done since the last checkpoint • Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover
What else is Condor Good For? (cont’d) • Giving you access to more computing resources • Dedicated compute cluster workstations • Non-dedicated workstations • Resources at other institutions • Remote Condor Pools via Condor Flocking • Remote resources via Globus Grid protocols
What is ClassAd Matchmaking? • Condor uses ClassAd Matchmaking to make sure that work gets done within the constraints of both users and owners. • Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” • Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.” • Semi-structured data --- no fixed schema
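The two-sided nature of a match can be illustrated with a toy shell sketch. Condor itself expresses such constraints as ClassAd expressions, not shell tests; the function names, the argument order, and the 15-minute definition of "away from my desk" are all assumptions made for this example.

```shell
#!/bin/sh
# Job side: "I need an Alpha with 256 MB RAM".
# Arguments: machine architecture, machine memory in MB.
job_requirements() {
    [ "$1" = "ALPHA" ] && [ "$2" -ge 256 ]
}

# Owner side: "only when I am away from my desk, and never for Bob".
# Arguments: job owner, minutes since the last keyboard activity.
# The 15-minute threshold is an assumed meaning of "away".
machine_requirements() {
    [ "$1" != "bob" ] && [ "$2" -ge 15 ]
}

# A match requires BOTH sides to accept the other.
# Arguments: arch, memory_mb, job_owner, idle_minutes.
is_match() {
    job_requirements "$1" "$2" && machine_requirements "$3" "$4"
}
```

With these definitions, "is_match ALPHA 512 frieda 30" succeeds, while the same machine rejects a job owned by bob and the job rejects a machine with only 128 MB.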
Some HTC Challenges • Condor does whatever it takes to run your jobs, even if some machines… • Crash (or are disconnected) • Run out of disk space • Don’t have your software installed • Are frequently needed by others • Are far away & managed by someone else
The Condor System • Unix and NT • Operational since 1986 • More than 400 pools installed, managing more than 17000 CPUs worldwide. • More than 1800 CPUs in 10 pools on our campus • Software available free on the web • Open license • Adopted by the “real world” (Galileo, Maxtor, Micron, Oracle, Tigr, CORE… )
Globus Toolkit • The Globus Toolkit is an open source implementation of Grid-related protocols & middleware services designed by the Globus Project and collaborators • Remote job execution, security infrastructure, directory services, data transfer, …
The Condor Project and the Grid … • Close collaboration and coordination with the Globus Project – joint development, adoption of common protocols, technology exchange, … • Partner in major national Grid R&D² (Research, Development and Deployment) efforts (GriPhyN, iVDGL, IPG, TeraGrid) • Close collaboration with Grid projects in Europe (EDG, GridLab, e-Science)
Remote Resource Access: Globus Globus JobManager Globus GRAM Protocol “globusrun myjob …” fork() Organization A Organization B
Remote Resource Access: Globus + Condor Globus JobManager Globus GRAM Protocol “globusrun myjob …” Submit to Condor Condor Pool Organization A Organization B
Remote Resource Access: Globus + Condor Globus JobManager Globus GRAM Protocol “globusrun …” Submit to Condor Condor Pool Organization A Organization B
Condor-G A Grid-enabled version of Condor that provides robust job management for Globus clients. • Robust replacement for globusrun • Provides extensive fault-tolerance • Can provide scheduling across multiple Globus sites • Brings Condor’s job management features to Globus jobs
Remote Resource Access: Condor-G + Globus + Condor Globus JobManager Condor-G Globus GRAM Protocol myjob1 myjob2 myjob3 myjob4 myjob5 … Submit to Condor Condor Pool Organization A Organization B
User/Application Grid Fabric (processing, storage, communication)
Condor Globus Toolkit Condor User/Application Grid Fabric (processing, storage, communication)
Condor-G Globus Toolkit Condor Pool User/Application Grid Fabric (processing, storage, communication)
The Idea: Computing power is everywhere, we try to make it usable by anyone.
Meet Frieda. She is a scientist. But she has a big problem.
Frieda’s Application … • Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600 combinations) • F takes an average of 3 hours to compute on a “typical” workstation (total = 1800 hours) • F requires a “moderate” (128MB) amount of memory • F performs “moderate” I/O - (x,y,z) is 5 MB and F(x,y,z) is 50 MB
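The totals on this slide can be checked with shell arithmetic. The day count and the aggregate data volumes are derived here from the slide's per-job figures and are not stated on the slide itself; the variable names are illustrative.

```shell
#!/bin/sh
# Per-job figures taken from the slide.
x_values=20
y_values=10
z_values=3
hours_per_job=3
input_mb=5
output_mb=50

# Derived totals.
njobs=$((x_values * y_values * z_values))
cpu_hours=$((njobs * hours_per_job))
cpu_days=$((cpu_hours / 24))
total_input_mb=$((njobs * input_mb))
total_output_mb=$((njobs * output_mb))

echo "jobs:            $njobs"
echo "CPU hours:       $cpu_hours"
echo "CPU days (1 CPU): $cpu_days"
echo "total input MB:  $total_input_mb"
echo "total output MB: $total_output_mb"
```

On a single workstation the 1800 hours amount to 75 days of continuous computing, which is the problem the rest of the tutorial sets out to address.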
Install a Personal Condor!
Installing Condor • Download Condor for your operating system • Available as a free download from http://www.cs.wisc.edu/condor • Stable vs. Developer Releases • Naming scheme similar to the Linux Kernel… • Available for most Unix platforms and Windows NT
So Frieda Installs Personal Condor on her machine… • What do we mean by a “Personal” Condor? • Condor on your own workstation, no root access required, no system administrator intervention needed • So after installation, Frieda submits her jobs to her Personal Condor…
personal Condor your workstation 600 Condor jobs
Personal Condor?! What’s the benefit of a Condor “Pool” with just one user and one machine?
Your Personal Condor will ... • … keep an eye on your jobs and will keep you posted on their progress • … implement your policy on the execution order of the jobs • … keep a log of your job activities • … add fault tolerance to your jobs • … implement your policy on when the jobs can run on your workstation
Getting Started: Submitting Jobs to Condor • Choosing a “Universe” for your job • Just use VANILLA for now • Make your job “batch-ready” • Creating a submit description file • Run condor_submit on your submit description file
Making your job ready • Must be able to run in the background: no interactive input, windows, GUI, etc. • Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices • Organize data files
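A quick way to check that a program is batch-ready is to run it by hand with all three standard streams attached to files. The sketch below assumes a hypothetical helper named run_batch; the program and file names passed to it are up to the user.

```shell
#!/bin/sh
# run_batch PROGRAM INPUT OUTPUT ERROR
# Run PROGRAM with stdin read from INPUT, stdout written to OUTPUT and
# stderr written to ERROR, so that no keyboard or screen is involved.
run_batch() {
    "$1" < "$2" > "$3" 2> "$4"
}
```

If a program completes correctly under "run_batch ./myprog in.txt out.txt err.txt" without prompting for anything, it has no hidden dependence on an interactive terminal.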
Creating a Submit Description File • A plain ASCII text file • Tells Condor about your job: • Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later) • Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.
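As an illustration, the shell sketch below writes a submit description file for a cluster of 600 jobs such as Frieda's. The helper name write_submit_file, the executable name sim, and the in/out/err/log file names are hypothetical; $(Process) is the Condor macro that expands to each job's index within the cluster. The here-document delimiter is quoted so the shell leaves $(Process) untouched.

```shell
#!/bin/sh
# write_submit_file FILE: write a sample submit description file to FILE.
write_submit_file() {
    cat > "$1" <<'EOF'
# Hypothetical submit description file: 600 vanilla-universe jobs.
universe   = vanilla
executable = sim
input      = in.$(Process)
output     = out.$(Process)
error      = err.$(Process)
log        = sim.log
queue 600
EOF
}
```

After "write_submit_file sim.submit", the file would be handed to Condor with "condor_submit sim.submit"; job 0 would read in.0 and write out.0, job 1 would read in.1, and so on.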