
Reliability and Troubleshooting with Condor

Douglas Thain, Condor Project, University of Wisconsin. PPDG Troubleshooting Workshop, 12 December 2002.


Presentation Transcript


  1. Reliability and Troubleshooting with Condor
     Douglas Thain, Condor Project, University of Wisconsin
     PPDG Troubleshooting Workshop, 12 December 2002

  2. Condor Reliability
     • Condor was designed for idle machines:
       • Reclaim, reboot, crash, out of memory...
     • Sounds much like the grid!
       • US-CMS testbed
       • Distributed ownership, control, and resources.
       • (War stories abound.)
     • Condor tools add controlled reliability.
     • Not absolute reliability, but:
       • A finite amount of retry.
       • A notification/recovery strategy.
       • Logging and book-keeping.
       • Known state after a failure.

  3. US-CMS Physical Structure
     [Diagram labels: MOP Master; Public Internet; three sites, each with a Head Node and Workers on a Private Network.]

  4. US-CMS Logical Structure
     [Diagram labels: Master Site Worker; Impala, Globus, MOP, Condor, DAGMan, Real Work, Condor-G. The color coding is not preserved in this transcript.]
     • Red items expect a reliable environment.
     • Green items create a reliable environment.

  5. Condor-G (transaction interface)
     [Diagram labels: End-User Tools, Condor-G Submitter, Job Queue, Job Log, Grid Managers, GAHP-Server, GRAM, Head Node, Gatekeeper, Job Managers, System Log, Local Resource Manager; job states shown: Run (×4), Idle (×2).]

  6. Directed Acyclic Graph Manager (DAGMan)
     • Condor-G deals with system failures; DAGMan deals with application and user failures.
     • PRE and POST scripts may be used to validate inputs and outputs.
     • A “Rescue DAG” describes what is left unexecuted.
     • DAG nodes may themselves be DAGs.
     [Diagram labels: nodes A, B, C, D; pre.pl and post.pl attached to node C.]
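The PRE/POST and rescue behaviour on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DAGMan's implementation: the function `run_dag`, the dictionary layout, and the diamond shape (A before B and C, both before D) are hypothetical choices; the slide names the nodes and the scripts on C but the transcript does not record the edges.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_dag(parents, jobs, pre=None, post=None):
    """Run nodes in dependency order and return (done, rescue).

    parents: node -> set of nodes that must succeed first
    jobs, pre, post: node -> zero-argument callable returning True on success
    (all names here are illustrative, not part of any Condor API)
    """
    pre = pre or {}
    post = post or {}
    order = list(TopologicalSorter(parents).static_order())
    done = set()
    for node in order:
        # Skip a node if any parent failed or was itself skipped.
        if any(p not in done for p in parents.get(node, ())):
            continue
        steps = [pre.get(node), jobs[node], post.get(node)]
        # all() stops at the first failing step, so a failed PRE
        # prevents the job itself from running.
        if all(step() for step in steps if step is not None):
            done.add(node)
    # The "rescue" list is whatever is left unexecuted, in runnable order.
    rescue = [n for n in order if n not in done]
    return done, rescue


ok = lambda: True
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
jobs = {name: ok for name in parents}
# Pretend the POST check on C rejects its output.
done, rescue = run_dag(parents, jobs, pre={"C": ok}, post={"C": lambda: False})
```

In this run A and B complete, C is marked failed by its POST check, and D never starts, so the rescue list holds C and D. Rerunning with only those nodes is the idea behind a rescue DAG.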

  7. Fault Tolerant Shell (FTSH)
     • Standard shell scripts are very error-prone.
     • FTSH adds time limits, retry, logging, and clean termination.
     • “Exceptions for scripts”: unexpected errors cannot accidentally be ignored.

         try 10 times
           try for 15 minutes
             globus_url_copy A B
           end
           try for 1 hour
             run-simulation < B > C
             gzip < C > D
           end
           try for 15 minutes
             globus_url_copy D E
           end
         end
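For readers who do not have FTSH at hand, the nested `try` pattern on this slide can be approximated in Python. This is a rough sketch, not FTSH itself: `attempt`, `TryFailed`, and `flaky_copy` are hypothetical names, and unlike the "clean termination" the slide mentions, this version only checks the time limit between attempts and does not interrupt a body that is already running.

```python
import time


class TryFailed(Exception):
    """Raised when a retry block runs out of attempts or time."""


def attempt(body, times=None, seconds=None, delay=0.0):
    """Call body() until it succeeds or the attempt/time budget is spent."""
    if times is None and seconds is None:
        raise ValueError("give at least one limit: times or seconds")
    deadline = None if seconds is None else time.monotonic() + seconds
    tries = 0
    while True:
        tries += 1
        try:
            return body()
        except Exception as err:  # any failure counts, so none is ignored
            last = err
        out_of_tries = times is not None and tries >= times
        out_of_time = deadline is not None and time.monotonic() >= deadline
        if out_of_tries or out_of_time:
            raise TryFailed(f"gave up after {tries} attempt(s)") from last
        time.sleep(delay)


# Stand-in for a transfer that fails twice before succeeding.
calls = {"n": 0}


def flaky_copy():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transfer failed")
    return "ok"


# Outer block: up to 10 tries; inner block: up to 15 minutes.
result = attempt(lambda: attempt(flaky_copy, seconds=15 * 60), times=10)
```

Because `TryFailed` is an ordinary exception, an inner block that gives up is seen by the outer block as one failed attempt, which mirrors the nesting in the slide's script.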

  8. Hawkeye
     [Diagram labels: Hawkeye Manager (Example Hawkeye Page); Policy Manager; Trigger Exprs; ClassAd Queries; ClassAd Data; Probe Modules (×3); actions: Submit Repair Job, Contact Sysadmin, Log Event.]
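The slide's labels suggest that trigger expressions are evaluated against ClassAd data gathered by probe modules, and that a match leads to one of the listed actions. A minimal Python sketch of that idea follows. It does not use Hawkeye or the ClassAd language: plain dictionaries stand in for ClassAds, and the attribute names (`Machine`, `DiskFreeMB`, `LoadAvg`), thresholds, and the `Trigger` class are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trigger:
    name: str
    matches: Callable[[dict], bool]  # stand-in for a trigger expression
    action: str                      # e.g. one of the actions named on the slide


def evaluate(triggers, ads):
    """Return (machine, trigger name, action) for every trigger that fires."""
    events = []
    for ad in ads:
        for trigger in triggers:
            if trigger.matches(ad):
                events.append((ad["Machine"], trigger.name, trigger.action))
    return events


# Hypothetical data as probe modules might report it.
ads = [
    {"Machine": "node1", "DiskFreeMB": 120, "LoadAvg": 0.4},
    {"Machine": "node2", "DiskFreeMB": 50000, "LoadAvg": 9.5},
]

triggers = [
    Trigger("low-disk",
            lambda ad: "DiskFreeMB" in ad and ad["DiskFreeMB"] < 500,
            "submit repair job"),
    Trigger("high-load",
            lambda ad: "LoadAvg" in ad and ad["LoadAvg"] > 8.0,
            "contact sysadmin"),
]

events = evaluate(triggers, ads)
```

Each predicate checks that the attribute is present before comparing it, so a machine whose probe did not report a value does not fire the trigger by accident.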

  9. For More Info...
     • Condor-G: http://www.cs.wisc.edu/condor/condorg
     • DAGMan: http://www.cs.wisc.edu/condor/dagman
     • Fault Tolerant Shell: http://www.cs.wisc.edu/~thain/research/ftsh
     • Hawkeye: http://www.cs.wisc.edu/condor/hawkeye
     • Philosophy of Error Management: http://www.cs.wisc.edu/condor/doc/error-scope.pdf
     • The Condor Project: http://www.cs.wisc.edu/condor
