High-Throughput Computing With Condor
Who Are We?
The Condor Project (Established ‘85) Distributed systems CS research performed by a team that faces: • software engineering challenges in a Unix/Linux/NT environment, • active interaction with users and collaborators, • daily maintenance and support challenges of a distributed production environment, • and educating and training students. Funding - NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft and the UW Graduate School.
The Condor System • Unix and NT • Operational since 1986 • More than 1300 CPUs at UW-Madison • Available on the web • More than 150 clusters worldwide in academia and industry
What is Condor? • Condor converts collections of distributively owned workstations and dedicated clusters into a high-throughput computing facility. • Condor uses matchmaking to make sure that everyone is happy.
What is High-Throughput Computing? • High-performance: CPU cycles/second under ideal circumstances. • “How fast can I run simulation X on this machine?” • High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. • “How many times can I run simulation X in the next month using all available machines?”
What is High-Throughput Computing? • Condor does whatever it takes to run your jobs, even if some machines… • Crash! (or are disconnected) • Run out of disk space • Don’t have your software installed • Are frequently needed by others • Are far away & admin’ed by someone else
What is Matchmaking? • Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners. • Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” • Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
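To make matchmaking concrete, here is a minimal sketch of how those two sets of constraints are typically written as Condor ClassAd expressions. The attribute names (Arch, Memory, KeyboardIdle, Owner) follow standard ClassAd conventions, but the specific values are illustrative only.

    # Job side (submit description file): the user's requirements
    Requirements = (Arch == "ALPHA") && (Memory >= 256)

    # Machine side (condor_config): the owner's policy for starting jobs
    START = (KeyboardIdle > 15 * 60) && (Owner != "bob")

During matchmaking the central manager pairs a job with a machine only when both expressions evaluate to true against the other party's ClassAd.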
“What can Condor do for me?” Condor can… • …do your housekeeping. • …improve reliability. • …give performance feedback. • …increase your throughput!
Some Numbers: UW-CS Pool, 6/98-6/00: 4,000,000 hours (~450 years)
“Real” Users: 1,700,000 hours (~260 years)
 • CS-Optimization: 610,000 hours
 • CS-Architecture: 350,000 hours
 • Physics: 245,000 hours
 • Statistics: 80,000 hours
 • Engine Research Center: 38,000 hours
 • Math: 90,000 hours
 • Civil Engineering: 27,000 hours
 • Business: 970 hours
“External” Users: 165,000 hours (~19 years)
 • MIT: 76,000 hours
 • Cornell: 38,000 hours
 • UCSD: 38,000 hours
 • CalTech: 18,000 hours
Current CMS Activity • Simulation (CMSIM) for CalTech • provided >135,000 CPU hours to date • peak day ~ 4000 CPU hours • via NCSA Alliance, Condor has allocated 1,000,000 hours total to CalTech • Simulation and Reconstruction (CMSIM + ORCA) for HEP group at UW-Madison
INFN Condor Pool - Italy • Italian National Institute for Research in Nuclear and Subnuclear Physics • 19 locations, each running a Condor pool • ranging from a single CPU to more than 100 CPUs • each locally controlled • each “flocks” jobs to other pools when available
Particle Physics Data Grid • The PPDG Project is... • a software engineering effort to design, implement, experiment with, evaluate, and prototype HEP-specific data-transfer and caching software tools for Grid environments • For example...
Condor PPDG Work • Condor Data Manager • technology to automate & coordinate data movement from a variety of long-term repositories to available Condor computing resources & back again • keeping the pipeline full! • SRB (SDSC), SAM (Fermi), PPDG HRM
National Grid Efforts • GriPhyN (Grid Physics Network) • National Technology Grid - NCSA Alliance (NSF-PACI) • Information Power Grid - IPG (NASA) • close collaboration with the Globus project
My Application … Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600) • F takes, on average, 3 hours to compute on a “typical” workstation (total = 1800 hours) • F requires a “moderate” (128MB) amount of memory • F performs “moderate” I/O: (x,y,z) is 5 MB and F(x,y,z) is 50 MB
Step I - get organized! • Write a script that creates 600 input files for each of the (x,y,z) combinations • Write a script that will collect the data from the 600 output files • Turn your workstation into a “Personal Condor” • Submit a cluster of 600 jobs to your personal Condor • Go on a long vacation … (2.5 months)
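A minimal sketch of what the submit description file for that cluster of 600 jobs might look like; the executable and file names here are hypothetical, and $(Process) expands to 0 through 599:

    # submit_F.cmd -- illustrative submit description file
    universe     = vanilla
    executable   = run_F                # hypothetical wrapper that computes F(x,y,z)
    input        = in.$(Process)        # one of the 600 generated input files
    output       = out.$(Process)       # the corresponding F(x,y,z) result
    error        = err.$(Process)
    log          = F.log
    requirements = Memory >= 128        # F needs a "moderate" amount of memory
    queue 600                           # one cluster, 600 jobs

Submitting it is then a single command, condor_submit submit_F.cmd, and condor_q shows the cluster's progress.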
[Diagram: 600 Condor jobs submitted to the personal Condor running on your workstation]
Step II - build your personal Grid • Install Condor on the desktop machine next door • …and on the machines in the classroom. • Install Condor on the department’s Linux cluster or the O2K in the basement. • Configure these machines to be part of your Condor pool. • Go on a shorter vacation ...
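Adding a machine to the pool mostly amounts to pointing its local Condor configuration at your central manager. A sketch under the assumption that your workstation acts as the central manager (host names are hypothetical, and macro details can vary between Condor versions):

    # condor_config.local on each machine joining the pool
    CONDOR_HOST = your-workstation.cs.example.edu    # the pool's central manager
    DAEMON_LIST = MASTER, STARTD                     # execute-only node: no schedd needed

Machines that should also submit jobs would keep SCHEDD in their DAEMON_LIST.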
[Diagram: the personal Condor on your workstation now also draws on a Group Condor pool for the 600 jobs]
Step III - take advantage of your friends • Get permission from “friendly” Condor pools to access their resources • Configure your personal Condor to “flock” to these pools • reconsider your vacation plans ...
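Flocking is enabled on your submit machine by listing the central managers of the friendly pools. The FLOCK_TO macro below is standard Condor configuration, but the host names are made up, and the friendly pools have to admit you via FLOCK_FROM on their side:

    # condor_config.local on your submit machine
    FLOCK_TO = condor.friendly-dept.example.edu, condor.other-campus.example.edu

Jobs that cannot be matched locally are then automatically offered to those pools, in the order listed.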
[Diagram: the personal Condor now serves the 600 jobs from your workstation, the Group Condor pool, and a friendly Condor pool]
Upgrade to Condor-G A Grid-enabled version of Condor that uses the inter-domain services of Globus to bring Grid resources into the domain of your Personal Condor • Easy to use on different platforms • Robust • Supports SMPs & dedicated schedulers
Step IV - Go for the Grid • Get access (account(s) + certificate(s)) to a “Computational” Grid • Submit 599 “Grid Universe” Condor glide-in jobs to your personal Condor • Take the rest of the afternoon off ...
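As a rough sketch of such a submission, here is what a Grid Universe submit file might look like in later Condor-G syntax; the gatekeeper address, jobmanager, and glide-in startup script are all hypothetical, and in practice the glide-in tooling shipped with Condor fills in most of these details:

    # glidein.cmd -- illustrative Condor-G submission of glide-in pilots
    universe      = grid
    grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs   # Globus GRAM endpoint
    executable    = glidein_startup.sh   # hypothetical script that launches Condor daemons
    output        = glidein.$(Process).out
    error         = glidein.$(Process).err
    log           = glidein.log
    queue 599

Each glide-in that starts up joins your personal Condor as just another machine, so the 600 original jobs can be matched to Grid resources without being modified.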
[Diagram: the personal Condor runs the 600 jobs across your workstation, the Group Condor pool, friendly Condor pools, and, through 599 glide-ins, Globus Grid resources scheduled by LSF, PBS, and Condor]
What Have We Done with the Grid Already? • NUG30 • quadratic assignment problem • 30 facilities, 30 locations • minimize cost of transferring materials between them • posed as a challenge in 1968, long unsolved • but with a good pruning algorithm & high-throughput computing...
NUG30 Personal Condor Grid
For the run we will be flocking to:
 • the main Condor pool at Wisconsin (600 processors)
 • the Condor pool at Georgia Tech (190 Linux boxes)
 • the Condor pool at UNM (40 processors)
 • the Condor pool at Columbia (16 processors)
 • the Condor pool at Northwestern (12 processors)
 • the Condor pool at NCSA (65 processors)
 • the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA. We will use "hobble_in" to access the Chiba City Linux cluster and the Origin 2000 here at Argonne.
NUG30 - Solved!!! Sender: goux@dantec.ece.nwu.edu Subject: Re: Let the festivities begin. Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days ! More stats tomorrow !!! We are off celebrating ! condor rules ! cheers, JP.
Conclusion Computing power is everywhere; we try to make it usable by anyone.
Need more info? • Condor Web Page (http://www.cs.wisc.edu/condor) • Peter Couvares (pfc@cs.wisc.edu)