Process Management & Monitoring WG

Quarterly Report, January 15, 2004. Components: Process Management (Process Manager, Checkpoint Manager) and Monitoring (Job Monitor, System/Node Monitors, Meta Monitoring). Group progress: SSS-OSCAR release; development continues on all three components.

Presentation Transcript


  1. Process Management & Monitoring WG Quarterly Report January 15, 2004

  2. Components • Process Management • Process Manager • Checkpoint Manager • Monitoring • Job Monitor • System/Node Monitors • Meta Monitoring

  3. Group Progress • SSS-OSCAR release • Development continues on all three components • Largely proceeding on points requiring little interaction • Real deployment/testing of components • Chiba City, xtorc, others?

  4. Component Progress • Checkpoint Manager (LBNL) • Monitoring (NCSA) • Process Manager (ANL)

  5. Checkpoint Manager: Work at LBNL • Basic design • Pre-emption • cid = Suspend(pgid) • Resume(cid) • Checkpointing • cid = Checkpoint(pgid) • Restart(cid, where)

  6. Checkpoint Manager: Work at LBNL • Basic design (continued) • Migration • Migrate(pgid, where) • Checkpoint file management • list = List() • Delete(cid) • Other TBD • Query “can I restart <this> job <here>?”
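
The calls listed on slides 5 and 6 can be pictured with a small in-memory mock. This is only a sketch under stated assumptions, not the LBNL implementation: the class name `MockCheckpointManager`, the `placement` dictionary, the integer checkpoint IDs, and the choices that `resume` consumes its record and that `migrate` behaves like checkpoint, restart elsewhere, then delete are all hypothetical. No real processes are touched.

```python
import itertools
from dataclasses import dataclass


@dataclass
class CheckpointRecord:
    cid: int    # checkpoint ID returned to the caller
    pgid: int   # process group the record belongs to
    kind: str   # "suspend" or "checkpoint"


class MockCheckpointManager:
    """Hypothetical in-memory stand-in for the slide's interface."""

    def __init__(self):
        self._next_cid = itertools.count(1)
        self._records = {}    # cid -> CheckpointRecord
        self.placement = {}   # pgid -> node name (assumed bookkeeping)

    def _save(self, pgid, kind):
        cid = next(self._next_cid)
        self._records[cid] = CheckpointRecord(cid, pgid, kind)
        return cid

    # Pre-emption: cid = Suspend(pgid), Resume(cid)
    def suspend(self, pgid):
        return self._save(pgid, "suspend")

    def resume(self, cid):
        # Assumption: resuming consumes the suspend record.
        return self._records.pop(cid).pgid

    # Checkpointing: cid = Checkpoint(pgid), Restart(cid, where)
    def checkpoint(self, pgid):
        return self._save(pgid, "checkpoint")

    def restart(self, cid, where):
        rec = self._records[cid]
        self.placement[rec.pgid] = where
        return rec.pgid

    # Migration: Migrate(pgid, where)
    def migrate(self, pgid, where):
        # Assumption: modeled as checkpoint + restart + cleanup.
        cid = self.checkpoint(pgid)
        self.restart(cid, where)
        self.delete(cid)

    # Checkpoint file management: list = List(), Delete(cid)
    def list(self):
        return sorted(self._records)

    def delete(self, cid):
        del self._records[cid]
```

In this mock, `mgr.checkpoint(100)` hands back an ID that `mgr.restart(cid, "node7")` can later use, and `mgr.list()` / `mgr.delete(cid)` play the roles of `ls` and `rm` over stored checkpoints.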

  7. Checkpoint Manager: Work at LBNL • Progress since September ’03 meeting • Ported to RH9 kernel (@%^$#*!) • Wrote lampd to run checkpointable LAM/MPI jobs via MPD • Software released at SC2003 • SSS-OSCAR and stand-alone • Seeing deployment w/ LAM/MPI • Suspend/resume interface working with the Queue Manager qsig CLI

  8. Checkpoint Manager: Work at LBNL • Current outstanding issues • Still need to design restart-time interaction(s) • Need to implement a full interface • Restriction syntax (“suspend Al’s jobs”) • Event generation • Error reporting • Have basic ideas on file management • Think of ls and rm

  9. Monitoring: Work at NCSA • warehouse in SSS-OSCAR • Scalability work • Implementing thread pool model for server • Internal protocol changes for dealing better with larger messages • Other • Correct Service Directory connections • Documentation
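
The thread pool model mentioned on slide 9 can be illustrated generically as follows. This is a minimal sketch, not the NCSA server: the `handle_request` and `serve` functions, the `(node, metric)` request shape, and the placeholder reply are all hypothetical, and Python's standard `ThreadPoolExecutor` stands in for whatever pool the real server uses.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(request):
    """Hypothetical handler: answer one monitoring query."""
    node, metric = request
    # Placeholder value; a real server would look up collected data.
    return f"{node}:{metric}=unknown"


def serve(requests, workers=4):
    """Dispatch each request to a fixed-size pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though the
        # handlers run concurrently on the pool's threads.
        return list(pool.map(handle_request, requests))
```

The point of the pattern is that the number of threads stays bounded by `workers` regardless of how many requests arrive, instead of spawning one thread per connection.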

  10. Process Manager: Work at ANL • Already a deployed component • Exit codes • Improved queries • Fixed several bugs at/for SC03

  11. Plans for Near Future • Integration becomes main focus • Systems like Chiba have forced the issue • Stable(?) software makes this possible • Software development continues