Explore the innovative Condor Project established in 1985, delving into its research, software, and funding while learning about its distributed computing advancements and unique mechanisms like ClassAd Matchmaking and fault-tolerant features. Discover how Condor converts clusters into high-throughput computing facilities, managing both resources and job requests efficiently.
Condor Users Tutorial, National e-Science Centre, Edinburgh, Scotland, October 2003
The Condor Project (Established ‘85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who: • face software engineering challenges in a distributed UNIX/Linux/NT environment • are involved in national and international grid collaborations • actively interact with academic and commercial users • maintain and support large distributed production environments • and educate and train students. Funding – US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …
A Multifaceted Project • Harnessing the power of clusters - opportunistic and/or dedicated (Condor) • Job management services for Grid applications (Condor-G, Stork) • Fabric management services for Grid resources (Condor, GlideIns, NeST) • Distributed I/O technology (Parrot, Kangaroo, NeST) • Job-flow management (DAGMan, Condor, Hawk) • Distributed monitoring and management (HawkEye) • Technology for Distributed Systems (ClassAd, MW) • Packaging and Integration (NMI, VDT)
Some software produced by the Condor Project • Condor System • ClassAd Library • DAGMan • Fault Tolerant Shell (FTSH) • Hawkeye • MW • NeST • Stork • Parrot • Condor-G • And others… all as open source
Fault Tolerant Shell (FTSH) • The Grid is a hard environment. • FTSH offers: • The ease of scripting with very precise error semantics. • Exception-like structure allows scripts to be both succinct and safe. • A focus on timed repetition simplifies the most common form of recovery in a distributed system. • A carefully vetted set of language features limits the "surprises" that haunt system programmers.
Simple Bourne script…

#!/bin/sh
cd /work/foo
rm -rf data
cp -r /fresh/data .

What if ‘/work/foo’ is unavailable??
Getting Grid Ready…

#!/bin/sh
success=0
for attempt in 1 2 3
do
   if cd /work/foo
   then
      success=1
      break
   else
      echo "cd failed, trying again..."
      sleep 5
   fi
done
if [ $success -ne 1 ]
then
   echo "couldn't cd, giving up..."
   exit 1
fi
Or with FTSH

#!/usr/bin/ftsh
try 5 times
   cd /work/foo
   rm -rf bar
   cp -r /fresh/data .
end
Or with FTSH

#!/usr/bin/ftsh
try for 3 days or 100 times
   cd /work/foo
   rm -rf bar
   cp -r /fresh/data .
end
Or with FTSH

#!/usr/bin/ftsh
try for 3 days every 1 hour
   cd /work/foo
   rm -rf bar
   cp -r /fresh/data .
end
Another quick example…

hosts="mirror1.wisc.edu mirror2.wisc.edu mirror3.wisc.edu"
forany h in ${hosts}
   echo "Attempting host ${h}"
   wget http://${h}/some-file
end
echo "Got file from ${h}"
FTSH • All the usual constructs • Redirection, loops, conditionals, functions, expressions, nesting, … • And more • Logging • Timeouts • Process Cancellation • Complete parsing at startup • File cleanup • Used on Linux, Solaris, Irix, Cygwin, … • Simplify your life!
More Software… • HawkEye • A monitoring tool • MW • Framework to create a master-worker style application in an opportunistic environment • NeST • Flexible Network Storage appliance • “Lots”: reserved space • Stork • A scheduler for grid data placement activities • Treats data movement as a “first class citizen”
More Software, cont. • Parrot • Useful in distributed batch systems where one has access to many CPUs, but no consistent distributed filesystem (BYOFS!). • Works with any program:

% gv /gsiftp/www.cs.wisc.edu/condor/doc/usenix_1.92.ps
% grep Yahoo /http/www.yahoo.com
What is Condor? • Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility. • Condor manages both resources (machines) and resource requests (jobs) • Condor has several unique mechanisms such as: • ClassAd Matchmaking • Process checkpoint / restart / migration • Remote System Calls • Grid Awareness
Condor can manage a large number of jobs • Managing a large number of jobs • You specify the jobs in a file and submit them to Condor, which runs them all and keeps you notified on their progress • Mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc. • Condor can handle inter-job dependencies (DAGMan; see the sketch below) • Condor users can set job priorities • Condor administrators can set user priorities
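To make the DAGMan bullet concrete, here is a minimal sketch of a two-node workflow; the file names (diamond.dag, a.submit, b.submit) are made up, and a.submit and b.submit are assumed to be ordinary submit description files:

# diamond.dag -- hypothetical DAGMan input file
# (a.submit and b.submit describe the two jobs)
JOB A a.submit
JOB B b.submit
# B is not submitted until A has finished successfully
PARENT A CHILD B

Handing the whole workflow to Condor is then a single command: % condor_submit_dag diamond.dag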
Condor can manage Dedicated Resources… • Dedicated Resources • Compute Clusters • Manage • Node monitoring, scheduling • Job launch, monitor & cleanup
…and Condor can manage non-dedicated resources • Non-dedicated resources examples: • Desktop workstations in offices • Workstations in student labs • Non-dedicated resources are often idle --- ~70% of the time! • Condor can effectively harness the otherwise wasted compute cycles from non-dedicated resources
Mechanisms in Condor used to harness non-dedicated workstations • Transparent Process Checkpoint / Restart • Transparent Process Migration • Transparent Redirection of I/O (Condor’s Remote System Calls)
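These mechanisms rely on the job being relinked with Condor's libraries (Condor's "standard universe"). As a hedged illustration, with made-up program and file names, the relink step looks roughly like this:

% condor_compile gcc -o my_sim my_sim.c

The resulting binary can then be checkpointed, migrated between machines, and have its file I/O routed back to the submit machine through Condor's remote system calls.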
What else is Condor Good For? • Robustness • Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion • If an execute machine crashes, you only lose work done since the last checkpoint • Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover
What else is Condor Good For? (cont’d) • Giving you access to more computing resources • Dedicated compute cluster workstations • Non-dedicated workstations • Resources at other institutions • Remote Condor Pools via Condor Flocking • Remote resources via Globus Grid protocols
What is ClassAd Matchmaking? • Condor uses ClassAd Matchmaking to make sure that work gets done within the constraints of both users and owners. • Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” • Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.” • Semi-structured data --- no fixed schema
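As an illustrative sketch only (the attribute values and the user name are invented), the two constraints above could be written as ClassAd Requirements expressions roughly like these:

# Job ad (from the user's submit description file):
Requirements = (Arch == "ALPHA") && (Memory >= 256)

# Machine ad (derived from the owner's policy configuration):
Requirements = (KeyboardIdle > 15 * 60) && (Owner != "bob")

The matchmaker pairs a job with a machine only when each side's Requirements evaluates to true against the other side's ad; because an ad is just a list of attribute = expression pairs, no fixed schema is needed.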
Some HTC Challenges • Condor does whatever it takes to run your jobs, even if some machines… • Crash (or are disconnected) • Run out of disk space • Don’t have your software installed • Are frequently needed by others • Are far away & managed by someone else
The Condor System • Unix and NT • Operational since 1986 • More than 400 pools installed, managing more than 17000 CPUs worldwide. • More than 1800 CPUs in 10 pools on our campus • Software available free on the web • Open license • Adopted by the “real world” (Galileo, Maxtor, Micron, Oracle, Tigr, CORE… )
Globus Toolkit • The Globus Toolkit is an open source implementation of Grid-related protocols & middleware services designed by the Globus Project and collaborators • Remote job execution, security infrastructure, directory services, data transfer, …
The Condor Project and the Grid … • Close collaboration and coordination with the Globus Project – joint development, adoption of common protocols, technology exchange, … • Partner in major national Grid R&D² (Research, Development and Deployment) efforts (GriPhyN, iVDGL, IPG, TeraGrid) • Close collaboration with Grid projects in Europe (EDG, GridLab, e-Science)
Remote Resource Access: Globus
[Diagram: a user in Organization A runs “globusrun myjob …”; the request travels via the Globus GRAM protocol to the Globus JobManager in Organization B, which fork()s the job on the remote resource.]
Remote Resource Access: Globus + Condor
[Diagram: the same “globusrun myjob …” request goes via the Globus GRAM protocol to the Globus JobManager in Organization B, which now submits the job to the local Condor pool instead of fork()ing it directly.]
Condor-G A Grid-enabled version of Condor that provides robust job management for Globus clients. • Robust replacement for globusrun • Provides extensive fault-tolerance • Can provide scheduling across multiple Globus sites • Brings Condor’s job management features to Globus jobs
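As a rough sketch of what this looks like from the user's side (the gatekeeper host name below is hypothetical, and the exact submit-file keywords have varied across Condor releases), a Condor-G job is described much like any other Condor job, plus a pointer to the remote Globus resource:

universe        = globus
globusscheduler = gatekeeper.example.org/jobmanager
executable      = myjob
output          = myjob.out
log             = myjob.log
queue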
Remote Resource Access: Condor-G + Globus + Condor
[Diagram: Condor-G in Organization A holds a queue of jobs (myjob1 … myjob5) and forwards them via the Globus GRAM protocol to the Globus JobManager in Organization B, which submits them to the local Condor pool.]
[Layered diagrams: the User/Application sits on top of the Grid middleware layer, which sits on top of the Fabric (processing, storage, communication). Built up step by step, the Grid layer contains Condor-G on the user’s side talking to the Globus Toolkit, which hands work to a Condor pool that manages the fabric.]
The Idea Computing power is everywhere; we try to make it usable by anyone.
Meet Frieda. She is a scientist. But she has a big problem.
Frieda’s Application … Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600 combinations) • F takes on average 3 hours to compute on a “typical” workstation (total = 1800 hours) • F requires a “moderate” (128 MB) amount of memory • F performs “moderate” I/O - (x,y,z) is 5 MB and F(x,y,z) is 50 MB
Install a Personal Condor!
Installing Condor • Download Condor for your operating system • Available as a free download from http://www.cs.wisc.edu/condor • Stable vs. Developer releases • Naming scheme similar to the Linux kernel… • Available for most Unix platforms and Windows NT
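The tarball and installer names below are placeholders (they vary by release and platform), but a typical download-and-install session looks roughly like this:

% tar xzf condor-x.y.z-linux-x86.tar.gz
% cd condor-x.y.z
% ./condor_install

The installer asks a few questions (install location, where Condor should keep its local state, and so on) and writes an initial configuration.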
So Frieda Installs Personal Condor on her machine… • What do we mean by a “Personal” Condor? • Condor on your own workstation, no root access required, no system administrator intervention needed • So after installation, Frieda submits her jobs to her Personal Condor…
[Diagram: Frieda’s 600 Condor jobs are submitted to the personal Condor running on her own workstation.]
Personal Condor?!What’s the benefit of a Condor “Pool” with just one user and one machine?
Your Personal Condor will ... • … keep an eye on your jobs and will keep you posted on their progress • … implement your policy on the execution order of the jobs • … keep a log of your job activities • … add fault tolerance to your jobs • … implement your policy on when the jobs can run on your workstation
Getting Started: Submitting Jobs to Condor • Choosing a “Universe” for your job • Just use VANILLA for now • Make your job “batch-ready” • Creating a submit description file • Run condor_submit on your submit description file
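Assuming the submit description file is called my_job.submit (a made-up name; its contents are covered below), submitting and then watching the job is just:

% condor_submit my_job.submit
% condor_q

condor_submit hands the job(s) to your local Condor queue, and condor_q shows their status as they wait and run.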
Making your job ready • Must be able to run in the background: no interactive input, windows, GUI, etc. • Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices • Organize data files
Creating a Submit Description File • A plain ASCII text file • Tells Condor about your job: • Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later) • Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.
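As a hedged example in which every name is invented, a minimal submit description file for one of Frieda's vanilla-universe runs might look like the following; the commented lines at the end sketch how the same file could queue a whole cluster of 600 jobs:

# my_job.submit -- hypothetical submit description file
Universe   = vanilla
Executable = sim_F
Arguments  = -x 1 -y 1 -z 1
Input      = input.dat
Output     = sim_F.out
Error      = sim_F.err
Log        = sim_F.log
Queue

# To queue many jobs from one file, give each its own directory and count:
#   InitialDir = run_$(Process)
#   Queue 600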