Explore the innovative Condor Project established in 1985, delving into its research, software, and funding while learning about its distributed computing advancements and unique mechanisms like ClassAd Matchmaking and fault-tolerant features. Discover how Condor converts clusters into high-throughput computing facilities, managing both resources and job requests efficiently.
Condor Users Tutorial, National e-Science Centre, Edinburgh, Scotland, October 2003
The Condor Project (Established ‘85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who: • face software engineering challenges in a distributed UNIX/Linux/NT environment • are involved in national and international grid collaborations • actively interact with academic and commercial users • maintain and support large distributed production environments • and educate and train students. Funding – US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …
A Multifaceted Project • Harnessing the power of clusters - opportunistic and/or dedicated (Condor) • Job management services for Grid applications (Condor-G, Stork) • Fabric management services for Grid resources (Condor, GlideIns, NeST) • Distributed I/O technology (Parrot, Kangaroo, NeST) • Job-flow management (DAGMan, Condor, Hawk) • Distributed monitoring and management (HawkEye) • Technology for Distributed Systems (ClassAd, MW) • Packaging and Integration (NMI, VDT)
Some software produced by the Condor Project • Condor System • ClassAd Library • DAGMan • Fault Tolerant Shell (FTSH) • Hawkeye • MW • NeST • Stork • Parrot • Condor-G • And others… all as open source
Fault Tolerant Shell (FTSH) • The Grid is a hard environment. • FTSH offers: • The ease of scripting with very precise error semantics. • Exception-like structure allows scripts to be both succinct and safe. • A focus on timed repetition simplifies the most common form of recovery in a distributed system. • A carefully vetted set of language features limits the "surprises" that haunt system programmers.
Simple Bourne script…

#!/bin/sh
cd /work/foo
rm -rf data
cp -r /fresh/data .

What if ‘/work/foo’ is unavailable??
Getting Grid Ready…

#!/bin/sh
success=0
for attempt in 1 2 3
do
   if cd /work/foo
   then
      success=1
      break
   else
      echo "cd failed, trying again..."
      sleep 5
   fi
done
if [ $success -ne 1 ]
then
   echo "couldn't cd, giving up..."
   exit 1
fi
Or with FTSH

#!/usr/bin/ftsh
try 5 times
   cd /work/foo
   rm -rf bar
   cp -r /fresh/data .
end
Or with FTSH

#!/usr/bin/ftsh
try for 3 days or 100 times
   cd /work/foo
   rm -rf bar
   cp -r /fresh/data .
end
Or with FTSH

#!/usr/bin/ftsh
try for 3 days every 1 hour
   cd /work/foo
   rm -rf bar
   cp -r /fresh/data .
end
Another quick example…

hosts="mirror1.wisc.edu mirror2.wisc.edu mirror3.wisc.edu"
forany h in ${hosts}
   echo "Attempting host ${h}"
   wget http://${h}/some-file
end
echo "Got file from ${h}"
FTSH • All the usual constructs • Redirection, loops, conditionals, functions, expressions, nesting, … • And more • Logging • Timeouts • Process Cancellation • Complete parsing at startup • File cleanup • Used on Linux, Solaris, Irix, Cygwin, … • Simplify your life!
More Software… • HawkEye • A monitoring tool • MW • Framework to create a master-worker style application in an opportunistic environment • NeST • Flexible Network Storage appliance • “Lots”: reserved space • Stork • A scheduler for grid data placement activities • Treats data movement as a “first class citizen”
More Software, cont. • Parrot • Useful in distributed batch systems where one has access to many CPUs, but no consistent distributed filesystem (BYOFS!). • Works with any program:

% gv /gsiftp/www.cs.wisc.edu/condor/doc/usenix_1.92.ps
% grep Yahoo /http/www.yahoo.com
What is Condor? • Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility. • Condor manages both resources (machines) and resource requests (jobs) • Condor has several unique mechanisms such as: • ClassAd Matchmaking • Process checkpoint / restart / migration • Remote System Calls • Grid Awareness
Condor can manage a large number of jobs • Managing a large number of jobs • You specify the jobs in a file and submit them to Condor, which runs them all and keeps you notified on their progress • Mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc. • Condor can handle inter-job dependencies (DAGMan; see the sketch below) • Condor users can set job priorities • Condor administrators can set user priorities
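To make the DAGMan bullet concrete, here is a minimal sketch of a two-node workflow; the file names (diamond.dag, a.submit, b.submit) are made up, and a.submit and b.submit are assumed to be ordinary submit description files:

# diamond.dag -- hypothetical DAGMan input file
# (a.submit and b.submit describe the two jobs)
JOB A a.submit
JOB B b.submit
# B is not submitted until A has finished successfully
PARENT A CHILD B

Handing the whole workflow to Condor is then a single command: % condor_submit_dag diamond.dag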
Condor can manage Dedicated Resources… • Dedicated Resources • Compute Clusters • Manage • Node monitoring, scheduling • Job launch, monitor & cleanup
…and Condor can manage non-dedicated resources • Non-dedicated resources examples: • Desktop workstations in offices • Workstations in student labs • Non-dedicated resources are often idle --- ~70% of the time! • Condor can effectively harness the otherwise wasted compute cycles from non-dedicated resources
Mechanisms in Condor used to harness non-dedicated workstations • Transparent Process Checkpoint / Restart • Transparent Process Migration • Transparent Redirection of I/O (Condor’s Remote System Calls)
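These mechanisms rely on the job being relinked with Condor's libraries (Condor's "standard universe"). As a hedged illustration, with made-up program and file names, the relink step looks roughly like this:

% condor_compile gcc -o my_sim my_sim.c

The resulting binary can then be checkpointed, migrated between machines, and have its file I/O routed back to the submit machine through Condor's remote system calls.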
What else is Condor Good For? • Robustness • Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion • If an execute machine crashes, you only lose work done since the last checkpoint • Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover
What else is Condor Good For? (cont’d) • Giving you access to more computing resources • Dedicated compute cluster workstations • Non-dedicated workstations • Resources at other institutions • Remote Condor Pools via Condor Flocking • Remote resources via Globus Grid protocols
What is ClassAd Matchmaking? • Condor uses ClassAd Matchmaking to make sure that work gets done within the constraints of both users and owners. • Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” • Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.” • Semi-structured data --- no fixed schema
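As an illustrative sketch only (the attribute values and the user name are invented), the two constraints above could be written as ClassAd Requirements expressions roughly like these:

# Job ad (from the user's submit description file):
Requirements = (Arch == "ALPHA") && (Memory >= 256)

# Machine ad (derived from the owner's policy configuration):
Requirements = (KeyboardIdle > 15 * 60) && (Owner != "bob")

The matchmaker pairs a job with a machine only when each side's Requirements evaluates to true against the other side's ad; because an ad is just a list of attribute = expression pairs, no fixed schema is needed.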
Some HTC Challenges • Condor does whatever it takes to run your jobs, even if some machines… • Crash (or are disconnected) • Run out of disk space • Don’t have your software installed • Are frequently needed by others • Are far away & managed by someone else
The Condor System • Unix and NT • Operational since 1986 • More than 400 pools installed, managing more than 17000 CPUs worldwide. • More than 1800 CPUs in 10 pools on our campus • Software available free on the web • Open license • Adopted by the “real world” (Galileo, Maxtor, Micron, Oracle, Tigr, CORE… )
Globus Toolkit • The Globus Toolkit is an open source implementation of Grid-related protocols & middleware services designed by the Globus Project and collaborators • Remote job execution, security infrastructure, directory services, data transfer, …
The Condor Project and the Grid … • Close collaboration and coordination with the Globus Project – joint development, adoption of common protocols, technology exchange, … • Partner in major national Grid R&D² (Research, Development and Deployment) efforts (GriPhyN, iVDGL, IPG, TeraGrid) • Close collaboration with Grid projects in Europe (EDG, GridLab, e-Science)
Remote Resource Access: Globus
[Diagram: a user in Organization A runs “globusrun myjob …”; the request travels via the Globus GRAM protocol to the Globus JobManager in Organization B, which fork()s the job on the remote resource.]
Remote Resource Access: Globus + Condor
[Diagram: the same “globusrun myjob …” request goes via the Globus GRAM protocol to the Globus JobManager in Organization B, which now submits the job to the local Condor pool instead of fork()ing it directly.]
Condor-G A Grid-enabled version of Condor that provides robust job management for Globus clients. • Robust replacement for globusrun • Provides extensive fault-tolerance • Can provide scheduling across multiple Globus sites • Brings Condor’s job management features to Globus jobs
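As a rough sketch of what this looks like from the user's side (the gatekeeper host name below is hypothetical, and the exact submit-file keywords have varied across Condor releases), a Condor-G job is described much like any other Condor job, plus a pointer to the remote Globus resource:

universe        = globus
globusscheduler = gatekeeper.example.org/jobmanager
executable      = myjob
output          = myjob.out
log             = myjob.log
queue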
Remote Resource Access: Condor-G + Globus + Condor
[Diagram: Condor-G in Organization A holds a queue of jobs (myjob1 … myjob5) and forwards them via the Globus GRAM protocol to the Globus JobManager in Organization B, which submits them to the local Condor pool.]
[Layered diagrams: the User/Application sits on top of the Grid middleware layer, which sits on top of the Fabric (processing, storage, communication). Built up step by step, the Grid layer contains Condor-G on the user’s side talking to the Globus Toolkit, which hands work to a Condor pool that manages the fabric.]
The Idea Computing power is everywhere; we try to make it usable by anyone.
Meet Frieda. She is a scientist. But she has a big problem.
Frieda’s Application … Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600 combinations) • F takes on average 3 hours to compute on a “typical” workstation (total = 1800 hours) • F requires a “moderate” (128 MB) amount of memory • F performs “moderate” I/O - (x,y,z) is 5 MB and F(x,y,z) is 50 MB
Install a Personal Condor!
Installing Condor • Download Condor for your operating system • Available as a free download from http://www.cs.wisc.edu/condor • Stable vs. Developer releases • Naming scheme similar to the Linux kernel… • Available for most Unix platforms and Windows NT
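The tarball and installer names below are placeholders (they vary by release and platform), but a typical download-and-install session looks roughly like this:

% tar xzf condor-x.y.z-linux-x86.tar.gz
% cd condor-x.y.z
% ./condor_install

The installer asks a few questions (install location, where Condor should keep its local state, and so on) and writes an initial configuration.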
So Frieda Installs Personal Condor on her machine… • What do we mean by a “Personal” Condor? • Condor on your own workstation, no root access required, no system administrator intervention needed • So after installation, Frieda submits her jobs to her Personal Condor…
[Diagram: Frieda’s 600 Condor jobs are submitted to the personal Condor running on her own workstation.]
Personal Condor?!What’s the benefit of a Condor “Pool” with just one user and one machine?
Your Personal Condor will ... • … keep an eye on your jobs and will keep you posted on their progress • … implement your policy on the execution order of the jobs • … keep a log of your job activities • … add fault tolerance to your jobs • … implement your policy on when the jobs can run on your workstation
Getting Started: Submitting Jobs to Condor • Choosing a “Universe” for your job • Just use VANILLA for now • Make your job “batch-ready” • Creating a submit description file • Run condor_submit on your submit description file
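Assuming the submit description file is called my_job.submit (a made-up name; its contents are covered below), submitting and then watching the job is just:

% condor_submit my_job.submit
% condor_q

condor_submit hands the job(s) to your local Condor queue, and condor_q shows their status as they wait and run.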
Making your job ready • Must be able to run in the background: no interactive input, windows, GUI, etc. • Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices • Organize data files
Creating a Submit Description File • A plain ASCII text file • Tells Condor about your job: • Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later) • Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.
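As a hedged example in which every name is invented, a minimal submit description file for one of Frieda's vanilla-universe runs might look like the following; the commented lines at the end sketch how the same file could queue a whole cluster of 600 jobs:

# my_job.submit -- hypothetical submit description file
Universe   = vanilla
Executable = sim_F
Arguments  = -x 1 -y 1 -z 1
Input      = input.dat
Output     = sim_F.out
Error      = sim_F.err
Log        = sim_F.log
Queue

# To queue many jobs from one file, give each its own directory and count:
#   InitialDir = run_$(Process)
#   Queue 600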