High-Throughput Computing With Condor

  1. High-Throughput Computing With Condor

  2. Who Are We?

  3. The Condor Project (Established ’85) Distributed systems CS research performed by a team that faces: • software engineering challenges in a Unix/Linux/NT environment, • active interaction with users and collaborators, • daily maintenance and support challenges of a distributed production environment, • and educating and training students. Funding - NSF, NASA, DoE, DoD, IBM, INTEL, Microsoft and the UW Graduate School.

  4. The Condor System

  5. The Condor System • Unix and NT • Operational since 1986 • More than 1300 CPUs at UW-Madison • Available on the web • More than 150 clusters worldwide in academia and industry

  6. What is Condor? • Condor converts collections of distributively owned workstations and dedicated clusters into a high-throughput computing facility. • Condor uses matchmaking to make sure that everyone is happy.

  7. What is High-Throughput Computing? • High-performance: CPU cycles/second under ideal circumstances. • “How fast can I run simulation X on this machine?” • High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. • “How many times can I run simulation X in the next month using all available machines?”

  8. What is High-Throughput Computing? • Condor does whatever it takes to run your jobs, even if some machines… • Crash! (or are disconnected) • Run out of disk space • Don’t have your software installed • Are frequently needed by others • Are far away & admin’ed by someone else

  9. What is Matchmaking? • Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners. • Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” • Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
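
  To make the two-sided constraints concrete, here is a minimal sketch of how they might be expressed, assuming the standard Condor ClassAd attributes Arch, Memory, KeyboardIdle, and Owner; the executable name, the 15-minute idle threshold, and the user name are illustrative choices, not taken from the talk.

      # Job side (submit-description file):
      # "I need an Alpha with 256 MB RAM"
      executable   = simulate
      requirements = (Arch == "ALPHA") && (Memory >= 256)
      queue

      # Owner side (the machine's Condor configuration):
      # only start jobs after the keyboard has been idle for 15 minutes,
      # and never start jobs owned by bob.
      START = (KeyboardIdle > 15 * 60) && (Owner != "bob")

  Matchmaking pairs a job with a machine only when both expressions are satisfied, which is what keeps users and owners happy at the same time.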

  10. “What can Condor do for me?” Condor can… • …do your housekeeping. • …improve reliability. • …give performance feedback. • …increase your throughput!

  11. Some Numbers: UW-CS Pool, 6/98 - 6/00
      Total                      4,000,000 hours   (~450 years)
      “Real” Users               1,700,000 hours   (~260 years)
        CS-Optimization            610,000 hours
        CS-Architecture            350,000 hours
        Physics                    245,000 hours
        Statistics                  80,000 hours
        Engine Research Center      38,000 hours
        Math                        90,000 hours
        Civil Engineering           27,000 hours
        Business                       970 hours
      “External” Users             165,000 hours   (~19 years)
        MIT                         76,000 hours
        Cornell                     38,000 hours
        UCSD                        38,000 hours
        CalTech                     18,000 hours

  12. Condor & Physics

  13. Current CMS Activity • Simulation (CMSIM) for CalTech • provided >135,000 CPU hours to date • peak day ~ 4000 CPU hours • via NCSA Alliance, Condor has allocated 1,000,000 hours total to CalTech • Simulation and Reconstruction (CMSIM + ORCA) for HEP group at UW-Madison

  14. INFN Condor Pool - Italy • Italian National Institute for Research in Nuclear and Subnuclear Physics • 19 locations, each running a Condor pool • as few as 1 CPU -- to >100 CPUs • each locally controlled • each “flocks” jobs to other pools when available

  15. Particle Physics Data Grid • The PPDG Project is... • a software engineering effort to design, implement, experiment, evaluate, and prototype HEP-specific data-transfer and caching software tools for Grid environments • For example...

  16. Condor PPDG Work • Condor Data Manager • technology to automate & coordinate data movement from a variety of long-term repositories to available Condor computing resources & back again • keeping the pipeline full! • SRB (SDSC), SAM (Fermi), PPDG HRM

  17. PPDG Collaborators

  18. National Grid Efforts • GriPhyN (Grid Physics Network) • National Technology Grid - NCSA Alliance (NSF-PACI) • Information Power Grid - IPG (NASA) • close collaboration with the Globus project

  19. I have 600 simulations to run. How can Condor help me?

  20. My Application … Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600) • F takes on average 3 hours to compute on a “typical” workstation (total = 1800 hours) • F requires a “moderate” (128 MB) amount of memory • F performs “moderate” I/O - (x,y,z) is 5 MB and F(x,y,z) is 50 MB

  21. Step I - get organized! • Write a script that creates 600 input files for each of the (x,y,z) combinations • Write a script that will collect the data from the 600 output files • Turn your workstation into a “Personal Condor” • Submit a cluster of 600 jobs to your personal Condor • Go on a long vacation … (2.5 months)
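
  As a rough illustration of the last two organizational steps, the sketch below shows what the submit-description file for the 600-job cluster might look like; the file names and the simulate executable are hypothetical, and $(Process) is the standard submit-file macro that numbers the jobs in a cluster.

      # Sketch of a submit-description file for the 600-job cluster;
      # file names and the executable are hypothetical.
      universe     = vanilla
      executable   = simulate
      # $(Process) takes the values 0..599 across the cluster, so each job
      # reads the input file created for it by the generator script.
      input        = in_$(Process).dat
      output       = out_$(Process).dat
      error        = err_$(Process).txt
      log          = simulate.log
      # Restates the "moderate" 128 MB memory need from slide 20.
      requirements = (Memory >= 128)
      queue 600

  Submitting this with condor_submit creates a single cluster of 600 jobs; your personal Condor works through them as machines become available, and the log file records where and when each one ran.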

  22. [Diagram: your personal Condor on your workstation, with the 600 Condor jobs submitted to it]

  23. Step II - build your personal Grid • Install Condor on the desktop machine next door • …and on the machines in the classroom. • Install Condor on the department’s Linux cluster or the O2K in the basement. • Configure these machines to be part of your Condor pool. • Go on a shorter vacation ...
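
  In rough outline, "configure these machines to be part of your Condor pool" means pointing each machine's local configuration at your central manager. A minimal sketch, assuming the standard CONDOR_HOST and DAEMON_LIST configuration macros and a placeholder host name:

      # condor_config.local on each machine joining your pool;
      # the central-manager host name is a placeholder.
      CONDOR_HOST = your-workstation.cs.wisc.edu
      # Run only the execute-side daemons on the borrowed machines.
      DAEMON_LIST = MASTER, STARTD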

  24. [Diagram: the 600 Condor jobs on your workstation's personal Condor, which now also draws on the group Condor pool]

  25. Step III - take advantage of your friends • Get permission from “friendly” Condor pools to access their resources • Configure your personal Condor to “flock” to these pools • reconsider your vacation plans ...
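
  Flocking is set up in the Condor configuration rather than per job. A minimal sketch, assuming the standard FLOCK_TO and FLOCK_FROM configuration macros and placeholder pool names:

      # Configuration on your submitting machine; host names are placeholders.
      # Jobs overflow to these pools when your own machines are all busy.
      FLOCK_TO = condor.friendly-dept.example.edu, condor.other-campus.example.edu
      # The friendly pools must, in turn, list your machine in their
      # FLOCK_FROM setting, and their START policies still apply to your jobs.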

  26. [Diagram: the personal Condor on your workstation flocking the 600 Condor jobs to the group Condor pool and to a friendly Condor pool]

  27. Think BIG. Go to the Grid.

  28. Upgrade to Condor-G: a Grid-enabled version of Condor that uses the inter-domain services of Globus to bring Grid resources into the domain of your Personal Condor • Easy to use on different platforms • Robust • Supports SMPs & dedicated schedulers

  29. Step IV - Go for the Grid • Get access (account(s) + certificate(s)) to a “Computational” Grid • Submit 599 “Grid Universe” Condor glide-in jobs to your personal Condor • Take the rest of the afternoon off ...
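
  One of the 599 glide-in submissions might look roughly like the sketch below, written with the grid-universe syntax of later Condor manuals; the gatekeeper address, job manager, and startup script name are placeholders, and the exact keywords depend on the Condor and Globus versions in use.

      # Sketch of a grid-universe submit file for the glide-in jobs;
      # the gatekeeper, job manager, and startup script are placeholders.
      universe      = grid
      grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
      executable    = glidein_startup.sh
      output        = glidein_$(Process).out
      log           = glidein.log
      queue 599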

  30. [Diagram: your workstation's personal Condor running the 600 Condor jobs across the group Condor pool, friendly Condor pools, and Globus Grid resources (Condor, LSF, PBS) reached through the 599 glide-ins]

  31. What Have We Done with the Grid Already? • NUG30 • quadratic assignment problem • 30 facilities, 30 locations • minimize cost of transferring materials between them • posed in 1968 as a challenge, long unsolved • but with a good pruning algorithm & high-throughput computing...

  32. NUG30 Personal Condor Grid • For the run we will be flocking to:
      -- the main Condor pool at Wisconsin (600 processors)
      -- the Condor pool at Georgia Tech (190 Linux boxes)
      -- the Condor pool at UNM (40 processors)
      -- the Condor pool at Columbia (16 processors)
      -- the Condor pool at Northwestern (12 processors)
      -- the Condor pool at NCSA (65 processors)
      -- the Condor pool at INFN (200 processors)
      We will be using glide_in to access the Origin 2000 (through LSF) at NCSA. We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.

  33. NUG30 - Solved!!! Sender: goux@dantec.ece.nwu.edu Subject: Re: Let the festivities begin. Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days ! More stats tomorrow !!! We are off celebrating ! condor rules ! cheers, JP.

  34. Conclusion Computing power is everywhere; we try to make it usable by anyone.

  35. Need more info? • Condor Web Page (http://www.cs.wisc.edu/condor) • Peter Couvares (pfc@cs.wisc.edu)
