High Throughput Computing with Condor at Purdue XSEDE ECSS Monthly Symposium

  2. Topics • What is Condor? • What is High Throughput Computing? • Why Condor? Why not Condor? • Condor at Purdue • Submitting and managing jobs • Suitable jobs

  3. What is Condor? • A product of the University of Wisconsin-Madison • A job scheduler • A resource manager • A workflow management system • Focused on High Throughput Computing

  4. What is High Throughput Computing (HTC)? • Delivering large amounts of processing • Sustained over a long period of time

  5. HTC v. HPC • FLOPS extracted v. FLOPS • Distributed Ownership v. Central Ownership • Capturing Idle Cycles v. Losing Idle Cycles • Throughput v. Response Time • Distributed Memory v. Tightly-coupled Memory • 1,000 Jobs v. 1 Job

  6. Why Condor? • Wasted compute cycles • Scheduling of related jobs • Access to more cores

  7. Advantages of Condor • Many tasks running at once • Access to more powerful computers • Using wasted cycles • Minimal impact on remote computers • Security • Little or no code modification

  8. Disadvantages of Condor • Compete for access • Task may take longer to complete • Processing can be lost • Parallel jobs aren’t available • Large files can impact the remote computer • Heterogeneity of the remote computers • Few compatible compilers

  9. Condor at Purdue • Installed on large cyberinfrastructure clusters • Installed on distributed desktops • Used as a scavenger of free cycles • Parallel jobs not supported • ~27K Linux cores and 1K Windows cores • Several more kilocores at DiaGrid partner sites

  10. Condor at Purdue • Jobs are vacated when a PBS job starts • Long running jobs may never complete • Common home directory across clusters • Scratch directories roughly per-cluster • ~7 TB of checkpoint storage for standard universe jobs

  11. Job Universes • Vanilla universe: doesn't require a recompile; no native checkpoint mechanism • Standard universe: streams I/O (can overload the submit node); supports checkpointing; no fork(), shared memory, or pipes

  12. File transfer • A vanilla universe feature • Allows jobs to flow to other sites

  13. Compiling for Condor • A standard universe requirement • The condor_compile command wraps a limited set of compilers • Links against Condor libraries to add support for I/O streaming and checkpointing

  14. Checkpointing • Saves all state information • Transfers state information to Condor management • Deletes job from processor • Restarts interrupted job on another unused processor
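
The save-and-resume cycle above can be illustrated with a small application-level sketch in Python. This is only an analogy: in the standard universe the state is saved by the Condor libraries linked in through condor_compile, not by code like this. The checkpoint file name, the state layout, and the `run` function are hypothetical, chosen just to show the pattern.

```python
import json
import os
import tempfile

def run(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress after every step.

    stop_after simulates an eviction after that many steps in this call.
    Returns the final sum, or None if the run was interrupted.
    """
    # Resume from the checkpoint if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # "evicted"; progress is already on disk
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        # Write to a temp file, then rename, so a partial checkpoint is never read.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, checkpoint_path)
    return state["total"]

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    print(run(10, path, stop_after=4))  # interrupted after four steps
    print(run(10, path))                # picks up where it left off
```

A vanilla universe job, which has no native checkpoint mechanism, would need something along these lines inside the application itself if it is to survive eviction without starting over.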

  15. Job lifecycle • Job is submitted • Scheduler process contacts negotiator process • Negotiator matches job to an available slot • If no slots are available, scheduler contacts remote negotiator • Execute node runs job • If job gets evicted, scheduler process contacts negotiator process again

  16. Submitting a job • Create a submit file:
        # Simple Condor job file
        Executable = bin/simpletest
        Arguments = 600
        Universe = standard
        Log = log/$(Cluster).$(Process).log
        Error = log/$(Cluster).$(Process).err
        Output = log/$(Cluster).$(Process).out
        +TGProject = TG-STA060013N
        Queue 10
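
To make the $(Cluster).$(Process) naming concrete, here is a minimal Python sketch of how those macros expand for a "Queue 10" submission. It only imitates the substitution; the cluster ID 1234 and the helper names `expand` and `log_names` are made up for illustration, since the real cluster ID is assigned at submit time.

```python
def expand(template, cluster, process):
    """Substitute the $(Cluster) and $(Process) macros in a submit-file value."""
    return (template.replace("$(Cluster)", str(cluster))
                    .replace("$(Process)", str(process)))

def log_names(template, cluster, queue_count):
    # "Queue N" creates N jobs in one cluster, numbered 0 .. N-1.
    return [expand(template, cluster, p) for p in range(queue_count)]

if __name__ == "__main__":
    # 1234 is a hypothetical cluster ID.
    for name in log_names("log/$(Cluster).$(Process).log", 1234, 10):
        print(name)
```

Each of the ten jobs therefore gets its own log, error, and output file, so their results do not overwrite one another.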

  17. Submitting a job • With file transfer:
        # Simple Condor job file
        Executable = bin/process_files.sh
        Universe = vanilla
        ShouldTransferFiles = if_needed
        Transfer_input_files = input.dat
        Transfer_output_files = output.png
        Log = log/$(Cluster).$(Process).log
        +TGProject = TG-STA060013N
        Queue

  18. Submitting a job • Job submitted with the condor_submit command: condor_submit myjobfile.condor

  19. Managing jobs • Get all jobs in queue: condor_q • Get only user's jobs: condor_q user • Why isn't my job running? condor_q -better-analyze jobid • Remove a job: condor_rm jobid

  20. Getting the most cores: Requirements = ... • Condor tries to be helpful by inserting automatic job requirements: OpSys, Arch, FileSystemDomain, Memory >= ImageSize • This sometimes over-constrains jobs

  21. Getting the most cores: Requirements = ... • The Requirements attribute gives you the flexibility to add or remove execute nodes • Example: job files are in your home directory: Requirements = regexp("rcac.purdue.edu", FileSystemDomain) • Example: job executable is a Windows binary: Requirements = (OpSys == "WINNT61")
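
The effect of the two example Requirements expressions can be sketched in Python as a filter over slot advertisements. This is a simplified imitation of matchmaking, not Condor's implementation: the slot names, the example.edu and example.org domains, and the helper functions are all hypothetical.

```python
import re

# Hypothetical slot advertisements; real ones carry many more attributes.
SLOTS = [
    {"Name": "slot1@a", "OpSys": "LINUX", "Arch": "X86_64",
     "FileSystemDomain": "rcac.purdue.edu"},
    {"Name": "slot1@b", "OpSys": "WINNT61", "Arch": "INTEL",
     "FileSystemDomain": "desktop.example.edu"},
    {"Name": "slot1@c", "OpSys": "LINUX", "Arch": "X86_64",
     "FileSystemDomain": "partner.example.org"},
]

def shares_home(slot):
    # Mirrors: Requirements = regexp("rcac.purdue.edu", FileSystemDomain)
    return re.search("rcac.purdue.edu", slot["FileSystemDomain"]) is not None

def is_windows(slot):
    # Mirrors: Requirements = (OpSys == "WINNT61")
    # ClassAd == ignores case on strings; this sketch compares exactly.
    return slot["OpSys"] == "WINNT61"

def matching(slots, requirement):
    """Return the names of the slots that satisfy the requirement."""
    return [s["Name"] for s in slots if requirement(s)]

if __name__ == "__main__":
    print(matching(SLOTS, shares_home))
    print(matching(SLOTS, is_windows))
```

Each requirement shrinks the pool of eligible slots, which is why an unnecessary constraint costs cores.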

  22. A special note about Memory • Condor sometimes overestimates the memory usage of a job • Condor reports each slot's memory as total memory divided by cores, but jobs are not memory constrained • It’s best to put a dummy memory requirement in the submission file

  23. Getting the most out of your cores: Rank = ... • You can prefer a job land on particular nodes • Example: prefer 64-bit nodes with lots of memory: Rank = (ARCH == "X86_64") * 1000 + Memory
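
The arithmetic behind the example Rank expression can be worked through in Python. The three slots and their memory sizes are invented for illustration; only the formula comes from the slide.

```python
def rank(slot):
    # Mirrors: Rank = (ARCH == "X86_64") * 1000 + Memory
    # A true comparison counts as 1 and a false one as 0.
    return (slot["Arch"] == "X86_64") * 1000 + slot["Memory"]

# Hypothetical slots; Memory is in MB.
SLOTS = [
    {"Name": "old32", "Arch": "INTEL", "Memory": 2048},
    {"Name": "big64", "Arch": "X86_64", "Memory": 4096},
    {"Name": "small64", "Arch": "X86_64", "Memory": 512},
]

def by_preference(slots):
    """Order slot names from most to least preferred."""
    return [s["Name"] for s in sorted(slots, key=rank, reverse=True)]

if __name__ == "__main__":
    for s in SLOTS:
        print(s["Name"], rank(s))
    print(by_preference(SLOTS))
```

With these numbers the 32-bit slot with 2048 MB (rank 2048) outranks the 64-bit slot with 512 MB (rank 1512). The 1000-point bonus decides the order only when the memory difference is under 1000 MB, so a larger weight is needed if architecture should always win.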

  24. Workflow management with DAGman • Directed Acyclic Graph Manager • Defines parent-child relationships among jobs • Allows pre- and post-execution hooks • Submit with condor_submit_dag

  25. Diamond DAG • [Figure: diamond-shaped graph with nodes A, B1, B2, and C]

  26. Diamond DAG
        # Diamond-shaped DAG
        Job First p_00060.A.sub
        Job Second_1 p_00060.B1.sub
        Job Second_2 p_00060.B2.sub
        Job Third p_00060.C.sub
        PARENT First CHILD Second_1 Second_2
        PARENT Second_1 Second_2 CHILD Third
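
The ordering that the PARENT/CHILD lines imply can be sketched with Python's standard graphlib module (Python 3.9 or later). This is not how DAGMan is implemented; it only shows which nodes become runnable together. The parser handles just the PARENT ... CHILD form used above, and the function names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

DAG_TEXT = """\
PARENT First CHILD Second_1 Second_2
PARENT Second_1 Second_2 CHILD Third
"""

def parse_edges(text):
    """Map each node to the set of parents it must wait for."""
    waits_on = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].upper() != "PARENT":
            continue
        split = [w.upper() for w in words].index("CHILD")
        parents, children = words[1:split], words[split + 1:]
        for parent in parents:
            waits_on.setdefault(parent, set())
        for child in children:
            waits_on.setdefault(child, set()).update(parents)
    return waits_on

def waves(waits_on):
    """Group nodes into waves; every node in a wave can run at the same time."""
    sorter = TopologicalSorter(waits_on)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result

if __name__ == "__main__":
    for wave in waves(parse_edges(DAG_TEXT)):
        print(wave)
```

For the diamond, First runs alone, Second_1 and Second_2 can then run side by side, and Third starts only after both have finished.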

  27. More complex DAGs

  28. Who Benefits from Condor? • Monte Carlo simulations • Parameter sweeps • “Embarrassingly parallel” jobs

  29. Purdue’s Condor Users • Structural Biology • Education • Chemical Engineering • Bioinformatics • Climate Visualization • Distributed Rendering • High Energy Physics

  30. For more information • University of Wisconsin website: http://research.cs.wisc.edu/condor • Email: bcotton@purdue.edu, rcac-help@purdue.edu

