Process Management & Monitoring WG Quarterly Report January 15, 2004
Components • Process Management • Process Manager • Checkpoint Manager • Monitoring • Job Monitor • System/Node Monitors • Meta Monitoring
Group Progress • SSS-OSCAR release • Development continues on all three components • Largely proceeding on points requiring little interaction • Real deployment/testing of components • Chiba City, xtorc, others?
Component Progress • Checkpoint Manager (LBNL) • Monitoring (NCSA) • Process Manager (ANL)
Checkpoint Manager: Work at LBNL • Basic design • Pre-emption • cid = Suspend(pgid) • Resume(cid) • Checkpointing • cid = Checkpoint(pgid) • Restart(cid, where)
Checkpoint Manager: Work at LBNL • Basic design (continued) • Migration • Migrate(pgid, where) • Checkpoint file management • list = List() • Delete(cid) • Other TBD • Query “can I restart <this> job <here>?”
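The calls listed on the two Basic design slides can be read as one small interface. The following is a minimal sketch of that reading, not the LBNL implementation: it is a toy in-memory model that touches no real processes. The class name, the `running` table, the record layout, and the choice to build Migrate out of a suspend followed by a restart are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List


@dataclass
class CheckpointRecord:
    cid: int    # checkpoint id handed back to the caller
    pgid: int   # process group the checkpoint was taken from
    node: str   # node the process group was on when it was saved


class CheckpointManager:
    """Hypothetical in-memory model of the slide's interface."""

    def __init__(self) -> None:
        self._next_cid = count(1)
        self._records: Dict[int, CheckpointRecord] = {}
        self.running: Dict[int, str] = {}  # pgid -> node (assumed bookkeeping)

    def _save(self, pgid: int) -> int:
        cid = next(self._next_cid)
        self._records[cid] = CheckpointRecord(cid, pgid, self.running[pgid])
        return cid

    def checkpoint(self, pgid: int) -> int:
        """cid = Checkpoint(pgid): save state, job keeps running."""
        return self._save(pgid)

    def suspend(self, pgid: int) -> int:
        """cid = Suspend(pgid): save state and take the job off its node."""
        cid = self._save(pgid)
        del self.running[pgid]
        return cid

    def resume(self, cid: int) -> None:
        """Resume(cid): bring the job back on the node it was saved on."""
        rec = self._records[cid]
        self.running[rec.pgid] = rec.node

    def restart(self, cid: int, where: str) -> None:
        """Restart(cid, where): bring the job back on a chosen node."""
        rec = self._records[cid]
        self.running[rec.pgid] = where

    def migrate(self, pgid: int, where: str) -> None:
        """Migrate(pgid, where): assumed here to be suspend + restart."""
        cid = self.suspend(pgid)
        self.restart(cid, where)
        self.delete(cid)  # the temporary checkpoint is not kept

    def list_checkpoints(self) -> List[int]:
        """list = List(): ids of stored checkpoints."""
        return sorted(self._records)

    def delete(self, cid: int) -> None:
        """Delete(cid): drop a stored checkpoint."""
        del self._records[cid]
```

In this sketch Suspend and Checkpoint differ only in whether the process group stays in the running table, and Resume differs from Restart only in that the caller does not pick the node. The still-open "can I restart <this> job <here>?" query is deliberately left out.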
Checkpoint Manager: Work at LBNL • Progress since September ’03 meeting • Ported to RH9 kernel (@%^$#*!) • Wrote lampd to run checkpointable LAM/MPI jobs via MPD • Software released at SC2003 • SSS-OSCAR and stand-alone • Seeing deployment w/ LAM/MPI • Suspend/resume interface working with the Queue Manager qsig CLI
Checkpoint Manager: Work at LBNL • Current outstanding issues • Still need to design restart-time interaction(s) • Need to implement a full interface • Restriction syntax (“suspend Al’s jobs”) • Event generation • Error reporting • Have basic ideas on file management • Think of ls and rm
Monitoring: Work at NCSA • Warehouse in SSS-OSCAR • Scalability work • Implementing thread pool model for server • Internal protocol changes to better handle larger messages • Other • Correct Service Directory connections • Documentation
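For readers unfamiliar with the thread pool model mentioned under scalability work, the general pattern is a fixed set of worker threads pulling requests from one shared queue, rather than one thread per connection. The sketch below shows that pattern only; it is not the NCSA server code, and the class name, the `node_status` handler, and the string request format are hypothetical.

```python
import queue
import threading
from typing import Callable


class ThreadPoolServer:
    """Fixed pool of workers that pull requests from one shared queue."""

    def __init__(self, handler: Callable[[str], str], workers: int = 4) -> None:
        self._handler = handler
        self._requests: queue.Queue = queue.Queue()
        self._threads = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def _work(self) -> None:
        while True:
            item = self._requests.get()
            if item is None:  # sentinel: shut this worker down
                break
            request, reply = item
            reply.put(self._handler(request))

    def submit(self, request: str) -> queue.Queue:
        """Queue a request; the returned one-slot queue will hold the reply."""
        reply: queue.Queue = queue.Queue(maxsize=1)
        self._requests.put((request, reply))
        return reply

    def shutdown(self) -> None:
        """Send one sentinel per worker, then wait for all of them to exit."""
        for _ in self._threads:
            self._requests.put(None)
        for thread in self._threads:
            thread.join()


def node_status(request: str) -> str:
    """Hypothetical handler standing in for a real monitoring query."""
    return f"{request}: ok"
```

The number of threads stays constant no matter how many requests arrive, which is the property that makes the model attractive for a server polled by many nodes. A real server would also need to handle exceptions raised by the handler, which this sketch omits.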
Process Manager: Work at ANL • Already a deployed component • Exit codes • Improved queries • Fixed several bugs at/for SC03
Plans for Near Future • Integration becomes main focus • Systems like Chiba have forced the issue • Stable(?) software makes this possible • Software development continues