Process Management Working Group Process Management “Meatball”. Dallas November 28, 2001. Subcomponents. Process/Job management What it includes and what it doesn’t include Current status of interface definition Demo Monitoring Examples Relationship to process management Checkpointing
Process Management Working Group: Process Management “Meatball”. Dallas, November 28, 2001
Subcomponents • Process/Job management • What it includes and what it doesn’t include • Current status of interface definition • Demo • Monitoring • Examples • Relationship to process management • Checkpointing • Is this a component? • Relationship to process management
One Meatball [Figure: component architecture diagram. Boxes shown: Meta Scheduler, Meta Monitor, Meta Manager, Access Control / Security Manager (interacts with all components), Node Configuration & Build Manager, System Monitor, Accounting, Scheduler, Resource Allocation Management, Process Manager, Queue Manager, User DB, Data Migration, High Performance Communication & I/O, Usage Reports, User Utilities, Checkpoint/Restart, File System, Application Environment]
Process Manager Responsibilities • Starts processes (and therefore knows hosts and pids) • Delivers arguments, environment, and limits (between fork and exec) • Starts other processes that need to know pids: monitoring (e.g. Paradyn), debugging (e.g. TotalView), other (e.g. Myrinet monitor) • Kills jobs • Signals processes • May be part of checkpointing • Reports on job start/termination • Provides return codes (job/process) • Handles stdio as directed • Services the application runtime layer • Implements PMI (put/get/barrier/spawn, others as discovered)
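The "between fork and exec" step can be illustrated with a short sketch. This is a minimal example, assuming a POSIX system, and is not the working group's implementation: the function name `start_process`, the `JOB_RANK` variable, and the choice of a CPU-time limit are all hypothetical. It uses Python's `subprocess.Popen`, whose `preexec_fn` hook runs in the child after fork and before exec.

```python
import os
import resource
import subprocess
import sys


def start_process(argv, env, cpu_seconds):
    """Start one process and return its Popen handle (which carries the pid).

    Hypothetical helper: arguments and environment are passed to exec,
    and the resource limit is applied in the child before exec.
    """

    def apply_limits():
        # Runs in the child, after fork() and before exec().
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        if hard == resource.RLIM_INFINITY:
            soft = cpu_seconds
        else:
            soft = min(cpu_seconds, hard)
        # Only the soft limit is changed; the hard limit is left as it was.
        resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))

    return subprocess.Popen(
        argv,
        env=env,
        preexec_fn=apply_limits,
        stdout=subprocess.PIPE,  # the manager "handles stdio as directed"
    )


if __name__ == "__main__":
    child = start_process(
        [sys.executable, "-c", "import os; print(os.environ['JOB_RANK'])"],
        env=dict(os.environ, JOB_RANK="0"),
        cpu_seconds=10,
    )
    out, _ = child.communicate()
    print("pid", child.pid, "return code", child.returncode)
```

Because the manager holds the handle, it knows the pid, can signal or kill the process, and can report the return code when the process terminates.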
P.M. Non-Responsibilities • Policy • Real-Time resource usage monitoring
Process Manager Component Interface to Other Components • Defined (i.e. a proposed XML schema exists): • Start-job • Start-job response • Kill-job • Kill-job response • To do: • Suspend-job, resume-job • Signal-job in general • Asynchronous notifications • Job started • Job terminated • Others
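To make the request/response pairing concrete, here is a minimal sketch of what a start-job exchange might look like. The slides do not reproduce the proposed schema, so every element and attribute name below (`start-job`, `job-id`, `executable`, `host`, `process`, `status`) is an assumption made for illustration only.

```python
import xml.etree.ElementTree as ET


def build_start_job(job_id, executable, hosts):
    """Build a hypothetical start-job request as an XML string."""
    root = ET.Element("start-job", {"job-id": job_id})
    ET.SubElement(root, "executable").text = executable
    for host in hosts:
        ET.SubElement(root, "host").text = host
    return ET.tostring(root, encoding="unicode")


def build_start_job_response(job_id, pids_by_host):
    """Build a hypothetical response reporting where each process started."""
    root = ET.Element(
        "start-job-response", {"job-id": job_id, "status": "started"}
    )
    for host, pid in pids_by_host.items():
        ET.SubElement(root, "process", {"host": host, "pid": str(pid)})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_start_job("42", "/bin/hostname", ["node1", "node2"]))
    print(build_start_job_response("42", {"node1": 1001, "node2": 1002}))
```

A kill-job request and response could follow the same pattern, carrying only the job identifier and a status.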
The Process Manager Interface to Application Libraries • A Prototype: PMI (formerly known as BNR) • Used by application libraries (e.g. MPI implementations, UPC implementations, common runtime systems for multiple languages and libraries) • Provided by process managers • Simple and general • Find out rank and size • Put and get into keyval space • Barrier • Spawn • Currently used by MPICH, provided by MPD
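The put/get/barrier pattern can be sketched in a few lines. This is an illustration of the idea only, not the PMI API: the class and method names are hypothetical, threads stand in for the processes of a parallel job, and spawn is omitted. Each rank publishes its own contact string, waits at the barrier, then reads a neighbour's entry.

```python
import threading


class KeyvalSpace:
    """Shared state for one job: a key-value store plus a barrier."""

    def __init__(self, size):
        self.size = size
        self.data = {}
        self.lock = threading.Lock()
        self.barrier = threading.Barrier(size)


class ProcessHandle:
    """What one process sees: its rank, the job size, put/get/barrier."""

    def __init__(self, space, rank):
        self._space = space
        self.rank = rank
        self.size = space.size

    def put(self, key, value):
        with self._space.lock:
            self._space.data[key] = value

    def get(self, key):
        with self._space.lock:
            return self._space.data[key]

    def barrier(self):
        self._space.barrier.wait()


def run_job(size):
    """Run `size` simulated processes; return what each one looked up."""
    space = KeyvalSpace(size)
    seen = [None] * size

    def worker(rank):
        pm = ProcessHandle(space, rank)
        pm.put(f"addr-{rank}", f"host{rank}:{5000 + rank}")
        pm.barrier()  # after the barrier, every rank's put has completed
        neighbour = (rank + 1) % pm.size
        seen[rank] = pm.get(f"addr-{neighbour}")

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(run_job(3))
```

The barrier is what makes the get safe: without it, a rank could look up a neighbour's key before the neighbour had published it.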
The Chiba City Testbed • Dedicated to scalability research in computer science rather than to applications • Currently 256 dual-processor nodes • Designed to promote experimentation with system software • SciDAC projects can get accounts: • Web form at http://www-accounts.mcs.anl.gov • Specify SCIDAC as Project Group • Specify closest Argonne SciDAC person as contact (Rusty or Narayan for SSS) • Future plans: • 1000 nodes, 8000 virtual nodes • VMware • User-mode Linux
A Demo • Start Service Directory component • Start Process Manager component • It registers itself with Service Directory • Start Proto-scheduler component • It queries Service Directory for access location (host,port) of process manager • It sends job-start requests from hard-coded queue to process manager • Process manager runs parallel jobs • All components communicate using XML • Use XML schema for process-manager requests, responses • Prototypes written in Python with built-in XML parser
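The demo's sequence of register, look up, then send can be sketched in-process. This is a simplified illustration rather than the demo code: the class names, the `process-manager` registration key, the message element names, and the `deliver` callable (which stands in for the network) are all assumptions.

```python
import xml.etree.ElementTree as ET


class ServiceDirectory:
    """Maps a component name to its access location (host, port)."""

    def __init__(self):
        self._locations = {}

    def register(self, component, host, port):
        self._locations[component] = (host, port)

    def lookup(self, component):
        return self._locations[component]


class ProcessManager:
    """Registers itself on start-up and answers start-job requests."""

    def __init__(self, directory, host, port):
        self.started = []
        directory.register("process-manager", host, port)

    def handle(self, request_xml):
        job_id = ET.fromstring(request_xml).get("job-id")
        self.started.append(job_id)  # a real manager would launch the job here
        return f'<start-job-response job-id="{job_id}"/>'


def proto_scheduler(directory, deliver, queue):
    """Find the process manager, then send one request per queued job."""
    location = directory.lookup("process-manager")
    return [deliver(location, f'<start-job job-id="{job}"/>') for job in queue]


if __name__ == "__main__":
    directory = ServiceDirectory()
    manager = ProcessManager(directory, "localhost", 9000)
    replies = proto_scheduler(
        directory,
        deliver=lambda location, message: manager.handle(message),
        queue=["1", "2", "3"],  # the hard-coded queue from the demo
    )
    print(replies)
```

The scheduler never hard-codes the process manager's address; it learns it from the service directory, which is the point of the demo's first three steps.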
A Modest Proposal • Multiple Wire Protocols are allowed. • Components declare a WP associated with a port when they register with the service directory. (They can register multiple ports.) • Other components learn the WP associated with a port when they find out the port. • The default protocol is the “basic” protocol. • TCP • A message consists of a complete XML document • After sending, the sender does shutdown on the socket, providing EOF to the receiver to signal the end of the message, but leaving the socket half-open to receive the response. • All components are required to support at least the basic protocol.
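The basic protocol is small enough to sketch directly. The example below is a minimal illustration under stated assumptions, not the prototype's code: the function names, the `<ack/>` reply, and the sample `kill-job` message are hypothetical. What it does show is the mechanism described above: the sender writes one XML document, calls `shutdown` on its write side so the receiver sees EOF, and then reads the response on the same socket.

```python
import socket
import threading


def recv_until_eof(sock):
    """Read until the peer shuts down its sending side."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # empty read means EOF: the message is complete
            return b"".join(chunks)
        chunks.append(chunk)


def send_request(host, port, xml_document):
    """Send one XML document and return the XML response."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(xml_document.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal end of message, keep reading
        return recv_until_eof(sock).decode("utf-8")


def serve_once(listener, handler):
    """Accept one connection, read one request, send one response."""
    conn, _ = listener.accept()
    with conn:
        request = recv_until_eof(conn).decode("utf-8")
        conn.sendall(handler(request).encode("utf-8"))
        conn.shutdown(socket.SHUT_WR)


def demo():
    """Run one request/response exchange over the loopback interface."""
    received = []

    def handler(request):
        received.append(request)
        return "<ack/>"

    listener = socket.create_server(("127.0.0.1", 0))  # port chosen by the OS
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener, handler))
    server.start()
    reply = send_request("127.0.0.1", port, '<kill-job job-id="7"/>')
    server.join()
    listener.close()
    return received[0], reply


if __name__ == "__main__":
    print(demo())
```

Using EOF as the delimiter means neither side needs a length prefix or a terminator sequence, at the cost of one connection per message.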
Advantages • Something easy to start with • No “framing problem” • No other software required • Does not preclude other protocols, which may add security, streaming, etc. • Can be used to bootstrap a switch to another protocol.