
Condor Users Tutorial National e-Science Centre Edinburgh, Scotland October 2003


Presentation Transcript


  1. Condor Users Tutorial National e-Science Centre, Edinburgh, Scotland, October 2003

  2. The Condor Project (Established ‘85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full time staff and students.

  3. The Condor Project (Established ‘85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full time staff and students who: • face software engineering challenges in a distributed UNIX/Linux/NT environment • are involved in national and international grid collaborations, • actively interact with academic and commercial users, • maintain and support large distributed production environments, • and educate and train students. Funding – US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …

  4. A Multifaceted Project • Harnessing the power of clusters - opportunistic and/or dedicated (Condor) • Job management services for Grid applications (Condor-G, Stork) • Fabric management services for Grid resources (Condor, GlideIns, NeST) • Distributed I/O technology (Parrot, Kangaroo, NeST) • Job-flow management (DAGMan, Condor, Hawk) • Distributed monitoring and management (HawkEye) • Technology for Distributed Systems (ClassAD, MW) • Packaging and Integration (NMI, VDT)

  5. Some software produced by the Condor Project • Condor System • ClassAd Library • DAGMan • Fault Tolerant Shell (FTSH) • Hawkeye • MW • NeST • Stork • Parrot • Condor-G • And others… all as open source

  6. Fault Tolerant Shell (FTSH) • The Grid is a hard environment. • FTSH provides: • The ease of scripting with very precise error semantics. • Exception-like structure allows scripts to be both succinct and safe. • A focus on timed repetition simplifies the most common form of recovery in a distributed system. • A carefully-vetted set of language features limits the "surprises" that haunt system programmers.

  7. Simple Bourne script…

     #!/bin/sh
     cd /work/foo
     rm -rf data
     cp -r /fresh/data .

     What if ‘/work/foo’ is unavailable??

  8. Getting Grid Ready…

     #!/bin/sh
     ok=false
     for attempt in 1 2 3
     do
        if cd /work/foo
        then
           ok=true
           break
        else
           echo "cd failed, trying again..."
           sleep 5
        fi
     done
     if ! $ok
     then
        echo "couldn't cd, giving up..."
        exit 1
     fi

  9. Or with FTSH

     #!/usr/bin/ftsh
     try 5 times
        cd /work/foo
        rm -rf bar
        cp -r /fresh/data .
     end

  10. Or with FTSH

     #!/usr/bin/ftsh
     try for 3 days or 100 times
        cd /work/foo
        rm -rf bar
        cp -r /fresh/data .
     end

  11. Or with FTSH

     #!/usr/bin/ftsh
     try for 3 days every 1 hour
        cd /work/foo
        rm -rf bar
        cp -r /fresh/data .
     end

  12. Another quick example…

     hosts="mirror1.wisc.edu mirror2.wisc.edu mirror3.wisc.edu"
     forany h in ${hosts}
        echo "Attempting host ${h}"
        wget http://${h}/some-file
     end
     echo "Got file from ${h}"

  13. FTSH • All the usual constructs • Redirection, loops, conditionals, functions, expressions, nesting, … • And more • Logging • Timeouts • Process Cancellation • Complete parsing at startup • File cleanup • Used on Linux, Solaris, Irix, Cygwin, … • Simplify your life!

  14. More Software… • HawkEye • A monitoring tool • MW • Framework to create a master-worker style application in an opportunistic environment • NeST • Flexible Network Storage appliance • “Lots”: reserved space • Stork • A scheduler for grid data placement activities • Treats data movement as a “first class citizen”

  15. More Software, cont. • Parrot • Useful in distributed batch systems where one has access to many CPUs, but no consistent distributed filesystem (BYOFS!). • Works with any program % gv /gsiftp/www.cs.wisc.edu/condor/doc/usenix_1.92.ps % grep Yahoo /http/www.yahoo.com

  16. What is Condor? • Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility. • Condor manages both resources (machines) and resource requests (jobs) • Condor has several unique mechanisms such as: • ClassAd Matchmaking • Process checkpoint / restart / migration • Remote System Calls • Grid Awareness

  17. Condor can manage a large number of jobs • Managing a large number of jobs • You specify the jobs in a file and submit them to Condor, which runs them all and keeps you notified on their progress • Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc. • Condor can handle inter-job dependencies (DAGMan) • Condor users can set job priorities • Condor administrators can set user priorities

  18. Condor can manage Dedicated Resources… • Dedicated Resources • Compute Clusters • Manage • Node monitoring, scheduling • Job launch, monitor & cleanup

  19. …and Condor can manage non-dedicated resources • Non-dedicated resources examples: • Desktop workstations in offices • Workstations in student labs • Non-dedicated resources are often idle: ~70% of the time! • Condor can effectively harness the otherwise wasted compute cycles from non-dedicated resources

  20. Mechanisms in Condor used to harness non-dedicated workstations • Transparent Process Checkpoint / Restart • Transparent Process Migration • Transparent Redirection of I/O (Condor’s Remote System Calls)

  21. What else is Condor Good For? • Robustness • Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion • If an execute machine crashes, you only lose work done since the last checkpoint • Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover

  22. What else is Condor Good For? (cont’d) • Giving you access to more computing resources • Dedicated compute cluster workstations • Non-dedicated workstations • Resources at other institutions • Remote Condor Pools via Condor Flocking • Remote resources via Globus Grid protocols

  23. What is ClassAd Matchmaking? • Condor uses ClassAd Matchmaking to make sure that work gets done within the constraints of both users and owners. • Users (jobs) have constraints: • “I need an Alpha with 256 MB RAM” • Owners (machines) have constraints: • “Only run jobs when I am away from my desk and never run jobs owned by Bob.” • Semi-structured data: no fixed schema
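The two constraints above can be expressed as ClassAd expressions. A rough sketch (the attribute names Arch, Memory, KeyboardIdle, and Owner follow common Condor usage, but treat the exact expressions as illustrative rather than canonical):

```
# Job ClassAd requirement: "I need an Alpha with 256 MB RAM"
Requirements = (Arch == "ALPHA") && (Memory >= 256)

# Machine owner's policy: run only after 15 idle minutes, never for Bob
START = (KeyboardIdle > 15 * 60) && (Owner != "bob")
```

Matchmaking pairs a job with a machine only when each side's expression evaluates to true against the other side's ad.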

  24. Some HTC Challenges • Condor does whatever it takes to run your jobs, even if some machines… • Crash (or are disconnected) • Run out of disk space • Don’t have your software installed • Are frequently needed by others • Are far away & managed by someone else

  25. The Condor System • Unix and NT • Operational since 1986 • More than 400 pools installed, managing more than 17000 CPUs worldwide. • More than 1800 CPUs in 10 pools on our campus • Software available free on the web • Open license • Adopted by the “real world” (Galileo, Maxtor, Micron, Oracle, Tigr, CORE… )

  26. Globus Toolkit • The Globus Toolkit is an open source implementation of Grid-related protocols & middleware services designed by the Globus Project and collaborators • Remote job execution, security infrastructure, directory services, data transfer, …

  27. The Condor Project and the Grid … • Close collaboration and coordination with the Globus Project – joint development, adoption of common protocols, technology exchange, … • Partner in major national Grid R&D² (Research, Development and Deployment) efforts (GriPhyN, iVDGL, IPG, TeraGrid) • Close collaboration with Grid projects in Europe (EDG, GridLab, e-Science)

  28. Remote Resource Access: Globus. Diagram: in Organization A, “globusrun myjob …” speaks the Globus GRAM protocol to the Globus JobManager in Organization B, which fork()s the job.


  30. Remote Resource Access: Globus + Condor. Diagram: in Organization A, “globusrun myjob …” speaks the Globus GRAM protocol to the Globus JobManager in Organization B, which submits the job to a Condor Pool.


  32. Condor-G A Grid-enabled version of Condor that provides robust job management for Globus clients. • Robust replacement for globusrun • Provides extensive fault-tolerance • Can provide scheduling across multiple Globus sites • Brings Condor’s job management features to Globus jobs

  33. Remote Resource Access: Condor-G + Globus + Condor. Diagram: Condor-G in Organization A manages a queue (myjob1, myjob2, myjob3, myjob4, myjob5, …) and speaks the Globus GRAM protocol to the Globus JobManager in Organization B, which submits each job to a Condor Pool.

  34. Diagram: the User/Application layer sits directly on the Grid Fabric (processing, storage, communication).

  35. Diagram: the User/Application layer over a middleware row of Condor, Globus Toolkit, and Condor, all on the Grid Fabric (processing, storage, communication).

  36. Diagram: the User/Application layer submits through Condor-G to the Globus Toolkit and on to a Condor Pool, all on the Grid Fabric (processing, storage, communication).

  37. The Idea: Computing power is everywhere; we try to make it usable by anyone.

  38. Meet Frieda. She is a scientist. But she has a big problem.

  39. Frieda’s Application … Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600 combinations) • F takes on average 3 hours to compute on a “typical” workstation (total = 1800 hours) • F requires a “moderate” (128MB) amount of memory • F performs “moderate” I/O: (x,y,z) is 5 MB and F(x,y,z) is 50 MB
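Frieda's parameter sweep is easy to sanity-check with a few lines of shell; a sketch, where the loop bounds stand in for her actual x, y, and z values:

```shell
#!/bin/sh
# Count the (x,y,z) combinations: 20 * 10 * 3 should give 600.
count=0
for x in $(seq 1 20); do
  for y in $(seq 1 10); do
    for z in $(seq 1 3); do
      count=$((count + 1))
    done
  done
done
echo "$count combinations"    # 600 combinations, at ~3 hours each = 1800 CPU-hours
```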

  40. I have 600 simulations to run. Where can I get help?

  41. Install a Personal Condor!

  42. Installing Condor • Download Condor for your operating system • Available as a free download from http://www.cs.wisc.edu/condor • Stable vs. Developer Releases • Naming scheme similar to the Linux Kernel… • Available for most Unix platforms and Windows NT

  43. So Frieda Installs Personal Condor on her machine… • What do we mean by a “Personal” Condor? • Condor on your own workstation, no root access required, no system administrator intervention needed • So after installation, Frieda submits her jobs to her Personal Condor…

  44. Diagram: your workstation runs a personal Condor, managing 600 Condor jobs.

  45. Personal Condor?! What’s the benefit of a Condor “Pool” with just one user and one machine?

  46. Your Personal Condor will ... • … keep an eye on your jobs and will keep you posted on their progress • … implement your policy on the execution order of the jobs • … keep a log of your job activities • … add fault tolerance to your jobs • … implement your policy on when the jobs can run on your workstation

  47. Getting Started: Submitting Jobs to Condor • Choosing a “Universe” for your job • Just use VANILLA for now • Make your job “batch-ready” • Creating a submit description file • Run condor_submit on your submit description file

  48. Making your job ready • Must be able to run in the background: no interactive input, windows, GUI, etc. • Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices • Organize data files
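"Batch-ready" can be checked locally: run the program with files standing in for the keyboard and screen. A sketch, using tr as a hypothetical stand-in for the real job:

```shell
#!/bin/sh
# A batch-ready job reads STDIN and writes STDOUT/STDERR;
# files replace the actual devices.
printf 'frieda\n' > job.in
tr 'a-z' 'A-Z' < job.in > job.out 2> job.err
cat job.out    # prints FRIEDA
```

If the program insists on a terminal, a GUI, or interactive prompts, it is not yet batch-ready.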

  49. Creating a Submit Description File • A plain ASCII text file • Tells Condor about your job: • Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later) • Can describe many jobs at once (a “cluster”) each with different input, arguments, output, etc.
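A submit description file for one of Frieda's runs might look like the following minimal sketch (all filenames here are placeholders, not from the original slides):

```
# frieda.submit -- hypothetical submit description file
Universe   = vanilla
Executable = sim.exe
Input      = sim.in
Output     = sim.out
Error      = sim.err
Log        = sim.log
Queue
```

Submitting is then: condor_submit frieda.submit. Replacing the last line with "Queue 600" would enqueue 600 instances of the job as a single cluster.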
