
Basic Grid Projects – Condor Part II

Learn about Condor Checkpointing process, including overview, details, limitations, restart summary, universes, commands, and DAGMan. Explore Condor File System and Transfer Mechanism. References included.



Presentation Transcript


  1. Basic Grid Projects – Condor Part II Sathish Vadhiyar Sources/Credits: Condor Project web pages

  2. Checkpointing • Checkpointing lets Condor vacate a job from one workstation and resume it on another idle workstation • A Condor checkpoint library is linked with the program’s code • The checkpoint library installs a signal handler for the SIGTSTP signal • Checkpoints are stored either on the local disk of the submitting machine or on checkpoint servers • A checkpoint stores the UNIX process’s state, including text, stack, and data segments, open files, file pointers, etc. • Condor also provides periodic checkpointing

  3. Checkpointing Overview • When the startd daemon detects a policy violation, it sends a signal to the process • The signal handler in the process is invoked and the process state is checkpointed • The checkpoint is sent to the shadow process, which stores it • When a new machine is chosen, the executable and the checkpoint are sent to the remote machine • When the job starts on the remote machine, it detects that it is a restart and reads the checkpoint; the process state at the time of the checkpoint is then restored • To the user code, it appears that the process has just returned from the signal handler

  4. Checkpointing Details (Refer to postscript file) • Preserving and restoring the text area (same executable), the data area (extent found using sbrk(0)), and the stack • Preserving stack state consists of storing and restoring 2 parts: stack context and stack space • Stack context is stored by setjmp and restored by longjmp • Stack space replacement is tricky; it is performed while using a secure data region as the stack • Open files: state is saved by augmenting open calls; lseek is performed during checkpointing to obtain offset information • Signals: sigaction, sigpending
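The open-file bookkeeping on this slide can be illustrated with a small sketch. This is not Condor's checkpoint library: the `TrackedFiles` class and its method names are hypothetical, and the sketch assumes files are opened read-only or read-write without truncation. It shows a wrapped open call recording how each file was opened, `lseek(fd, 0, SEEK_CUR)` reading the current offset at checkpoint time, and a restart reopening the file and seeking back.

```python
import os
import tempfile


class TrackedFiles:
    """Hypothetical table of open files, filled in by a wrapped open()."""

    def __init__(self):
        self.table = {}  # fd -> (path, flags)

    def open(self, path, flags):
        # The "augmented" open: do the real open, then remember how it was done.
        fd = os.open(path, flags)
        self.table[fd] = (path, flags)
        return fd

    def checkpoint(self):
        # lseek(fd, 0, SEEK_CUR) moves nothing and returns the current offset.
        return [(path, flags, os.lseek(fd, 0, os.SEEK_CUR))
                for fd, (path, flags) in self.table.items()]

    def restore(self, saved):
        # Reopen each file and seek back to the recorded offset.
        fds = []
        for path, flags, offset in saved:
            fd = self.open(path, flags)
            os.lseek(fd, offset, os.SEEK_SET)
            fds.append(fd)
        return fds


def demo():
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
    files = TrackedFiles()
    fd = files.open(tmp.name, os.O_RDONLY)
    os.read(fd, 4)                 # the job has consumed 4 bytes
    saved = files.checkpoint()
    os.close(fd)                   # the job is vacated

    restarted = TrackedFiles()     # stands in for the new machine
    (new_fd,) = restarted.restore(saved)
    data = os.read(new_fd, 3)      # reading continues where it left off
    os.close(new_fd)
    os.unlink(tmp.name)
    return saved[0][2], data
```

Calling `demo()` returns the offset recorded at checkpoint time and the bytes read after the simulated restart, so the two can be compared against the file contents.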

  5. Checkpoint Summary • The checkpoint library installs a signal handler called checkpoint() • It then calls main() • At the time of checkpoint, the SIGTSTP signal is sent and checkpoint() is invoked • checkpoint(): writes open files, signals, and stack context to the data area, then stores the data and stack segments

  6. Restart Summary • restore(): overwrites the data segment with the one in the checkpoint; restores file and signal information; switches to a temporary location in the data segment and replaces its stack space; performs longjmp() back into the checkpoint() signal handler • The checkpoint routine returns and restores the CPU registers
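The checkpoint and restart summaries can be mimicked at application level. The sketch below is only an analogy, under clearly different assumptions from Condor: it pickles a small Python state tuple instead of saving data and stack segments, and it has no equivalent of setjmp/longjmp. The `Job` class, the checkpoint file path, and the `signal_at` parameter are hypothetical. It requires a POSIX system (for SIGTSTP) and must run in the main thread.

```python
import os
import pickle
import signal
import tempfile


class Job:
    """Toy job: sums 0..n-1 and checkpoints its state when SIGTSTP arrives."""

    def __init__(self, ckpt_path):
        self.ckpt_path = ckpt_path
        self.state = (0, 0)      # (next index, running total)
        self.vacated = False

    def checkpoint(self, signum, frame):
        # Signal handler: save the state, then ask the main loop to stop.
        with open(self.ckpt_path, "wb") as f:
            pickle.dump(self.state, f)
        self.vacated = True

    def restore(self):
        # At start-up, the presence of a checkpoint means this is a restart.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path, "rb") as f:
                self.state = pickle.load(f)

    def run(self, n, signal_at=None):
        self.restore()
        # Install the handler first, then enter the job's main loop.
        signal.signal(signal.SIGTSTP, self.checkpoint)
        while self.state[0] < n and not self.vacated:
            i, total = self.state
            # A single assignment keeps any snapshot internally consistent.
            self.state = (i + 1, total + i)
            if i + 1 == signal_at:
                # Stands in for the signal that the startd would send.
                signal.raise_signal(signal.SIGTSTP)
        return self.state


def demo():
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        first = Job(path).run(10, signal_at=3)   # vacated part-way through
        second = Job(path).run(10)               # restarted from the checkpoint
    finally:
        signal.signal(signal.SIGTSTP, signal.SIG_DFL)
    return first, second
```

The second `Job` object plays the role of the process started on the new machine: it finds the checkpoint, reloads the state, and finishes the sum without redoing the earlier iterations.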

  7. Limitations • Cannot checkpoint programs that use fork()/exec() or run as multiple processes • Can checkpoint and restart only across homogeneous systems • Cannot checkpoint multiple processes that communicate with each other

  8. Condor Universes • The universe is specified during job submission • Types: • Standard: system calls are transferred to the submit machine; provides checkpointing and migration; the program is relinked with condor_compile • Vanilla: for programs that cannot be relinked; does not provide checkpointing and migration (WHY?); for file access, use the Condor File Transfer mechanism • Scheduler: for a job that should act as a metascheduler • MPI, PVM, Java, Globus

  9. Condor Commands • condor_compile: relinks source or object files with the Condor libraries; the Condor library provides checkpointing, migration, and remote system calls • condor_submit: takes a submit description file as input and produces a job ClassAd for further processing by the central manager • condor_status: shows information about the machines in the Condor pool • condor_q: shows job status

  10. DAGMan • Meta-scheduler for Condor • Manages dependencies between jobs at a higher level • Sits on top of Condor • Handles cases where the input of one program depends on the output of another • condor_submit_dag DAGInputFileName • A DAG within a DAG is supported

  11. Example input file for DAGMan
      # Filename: diamond.dag
      #
      Job A A.condor
      Job B B.condor
      Job C C.condor
      Job D D.condor
      PARENT A CHILD B C
      PARENT B C CHILD D
      Retry C 3
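To make the dependency structure of the diamond example concrete, here is a minimal sketch that parses the Job, PARENT ... CHILD, and Retry lines and groups the jobs into waves that could run together. It is not how DAGMan itself is implemented: `parse_dag` and `waves` are hypothetical helper names, only this small subset of the DAG syntax is handled, and the ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

DIAMOND = """\
# Filename: diamond.dag
#
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
"""


def parse_dag(text):
    """Return (jobs, parents, retries) for a small subset of the DAG syntax."""
    jobs, parents, retries = {}, {}, {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue                      # skip blank lines and comments
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]     # job name -> submit description file
            parents.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
        elif keyword == "RETRY":
            retries[words[1]] = int(words[2])
    return jobs, parents, retries


def waves(parents):
    """Group jobs into waves; a job appears only after all of its parents."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()                      # raises CycleError if not a DAG
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result
```

For the diamond file, `waves(parse_dag(DIAMOND)[1])` puts A first, B and C together in the middle, and D last, which matches the two PARENT lines.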

  12. Condor File System and File Transfer Mechanism • Applicable only to vanilla jobs • By default, a shared file system is assumed between the submitting machine and the executing machine • Machine ClassAd attributes: FileSystemDomain and UidDomain • To bypass the default, say something like:
      Requirements = UidDomain == "cs.wisc.edu" && \
                     FileSystemDomain == "cs.wisc.edu"

  13. Condor File System and File Transfer Mechanism • If the machines do not share a file system, or the file systems are not explicitly specified, enable the Condor File Transfer Mechanism:
      should_transfer_files = YES
      when_to_transfer_output = ON_EXIT
  • Any files that are generated or modified in the remote working directory are transferred back to the submit machine

  14. References / Sources / Credits • Condor manual • Condor web pages • Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", University of Wisconsin-Madison Computer Sciences Technical Report #1346, April 1997. • James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10) San Francisco, California, August 7-9, 2001. • Rajesh Raman, Miron Livny, and Marvin Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 28-31, 1998, Chicago, IL. • Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, pages 104-111, June, 1988.

  15. Submit description files • Direct the queuing of jobs • Contain: • Executable location • Command line arguments to the job • stdin, stderr, stdout • Initial working directory • should_transfer_files = <YES | NO | IF_NEEDED>. NO disables the Condor file transfer mechanism • when_to_transfer_output = <ON_EXIT | ON_EXIT_OR_EVICT>
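The three values of should_transfer_files can be read as a small decision rule. The sketch below is an illustration, not Condor code: `uses_file_transfer` is a hypothetical function, and it assumes that IF_NEEDED means "transfer only when the submit machine and the execute machine are in different FileSystemDomains", which is how the Condor manual describes that setting.

```python
def uses_file_transfer(should_transfer_files, submit_domain, machine_domain):
    """Decide whether the file transfer mechanism is used for one match.

    submit_domain and machine_domain stand for the FileSystemDomain values
    of the submitting machine and the matched executing machine.
    """
    setting = should_transfer_files.upper()
    if setting == "YES":
        return True                       # always transfer
    if setting == "NO":
        return False                      # rely on the shared file system
    if setting == "IF_NEEDED":
        # Assumption: transfer only when the domains differ.
        return submit_domain != machine_domain
    raise ValueError(f"unknown should_transfer_files value: {should_transfer_files}")
```

For example, `uses_file_transfer("IF_NEEDED", "cs.wisc.edu", "cs.wisc.edu")` is false because both machines report the same domain, while a machine in any other domain makes it true.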

  16. Submit description file • requirements = <ClassAd Boolean Expression> • By default, requirements on Arch, OpSys, Disk, VirtualMemory, and (for vanilla jobs) FileSystemDomain are set • +<attribute> = <value>

  17. Machine ClassAd Attributes • Activity • Arch • CondorLoadAvg, ConsoleIdle, Disk, Cpus, KeyboardIdle, LoadAvg, KFlops, Mips, Memory, OpSys • FileSystemDomain, Requirements, StartdIpAddr • ClientMachine, CurrentRank, RemoteOwner, LastPeriodicCheckpoint

  18. Job ClassAd Attributes • CompletionDate, RemoteIwd

  19. Heterogeneous job submission • Works well with the vanilla universe since no checkpoint is taken • For the standard universe:
      # Added by Condor
      CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
                         ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))
      Requirements = (<user specified policy>) && $(CkptRequirements)
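The CkptRequirements expression relies on the difference between `==` (which yields UNDEFINED when an operand is undefined) and `=?=` (which always yields true or false). A rough sketch of that logic follows. It is a simplification under stated assumptions: Python's `None` stands for UNDEFINED, string comparison is plain and case-sensitive, the helper names are hypothetical, and the attribute values in the usage note are only examples.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def eq(a, b):
    """Like ClassAd ==: the result is UNDEFINED if either side is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def meta_eq(a, b):
    """Like ClassAd =?=: always True or False, even for UNDEFINED operands."""
    return a == b


def or_(a, b):
    """Three-valued OR: True wins, otherwise UNDEFINED is contagious."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False


def and_(a, b):
    """Three-valued AND: False wins, otherwise UNDEFINED is contagious."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def ckpt_requirements(job, machine):
    """Evaluate the CkptRequirements expression for one job/machine pair."""
    arch_ok = or_(eq(job.get("CkptArch"), machine.get("Arch")),
                  meta_eq(job.get("CkptArch"), UNDEFINED))
    opsys_ok = or_(eq(job.get("CkptOpSys"), machine.get("OpSys")),
                   meta_eq(job.get("CkptOpSys"), UNDEFINED))
    return and_(arch_ok, opsys_ok)
```

A job that has never checkpointed has no CkptArch or CkptOpSys, so `ckpt_requirements({}, {"Arch": "INTEL", "OpSys": "LINUX"})` is true for any machine. Once a checkpoint exists, only machines with the same Arch and OpSys as the checkpoint satisfy the expression.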

  20. Submission steps • Job preparation • Choosing a universe • Submit description file • condor_submit

  21. Job Migration • SIGTSTP and the signal handler in the standard universe • SIGTERM in the vanilla universe

  22. Condor Security • The schedd starts the shadow with the effective UID of the job owner • Different authentication methods such as Kerberos and GSI, different encryption mechanisms, and authorization are supported between clients and daemons • Sockets and ports: the Condor collector and negotiator start on well-known ports; other daemons start on ephemeral ports

  23. Checkpointing • The CkptArch, CkptOpSys, LastCkptServer, LastCkptTime, and NumCkpts ClassAd attributes are generated automatically for the job
