Process Management & Monitoring WG

Process Management & Monitoring WG Quarterly Report August 26, 2004

Components • Process Management • Process Manager • Checkpoint Manager • Monitoring • Job Monitor • System/Node Monitors • Meta Monitoring

Component Progress • Checkpoint Manager (LBNL) • Monitoring (NCSA) • Process Manager (ANL)

Checkpoint Manager: BLCR Status • Full save and restore of • CPU registers • Memory • Signals (handlers & pending signals) • PID, PGID, etc. • Files (w/ limitations) • Communication (via LAM/MPI)

Checkpoint Manager: BLCR Status • Files • Files unmodified between checkpoint and restart • Files appended to between checkpoint and restart • Pipes between processes

Checkpoint Manager: BLCR Status • LAM/MPI over TCP (and GM) • Handles in-flight data (drains) • Linear scaling • Migratable

Checkpoint Manager: BLCR Status • Linux only • “Stock” 2.4.x • RedHat 7.2, 7.3, 8.0, 9 • SuSE 7.? and 9 • RHEL3/CentOS nearly ready • 2.6.x port has begun “in background” • x86 only • Alpha, PPC may be 95% ready • IA64 and x86_64 possible

Checkpoint Manager: BLCR Future Work • More on files • Mutable files • Directories • Misc. • Process groups and sessions • Terminal characteristics

Checkpoint Manager: SSS Work • Rudimentary Checkpoint Manager • Works with Bamboo and MPDPM • Long-delayed plans for “next gen” • Upgraded interface spec (what syntax?) • Management of “context files” • lampd • mpirun replacement for running LAM/MPI jobs under MPD

Process Manager Progress • Continued daily use on Chiba City, along with other components • At Brett’s request, added an option to signal either the entire (Unix) process group of a user process or just the process itself • Default is just the top-level user process • Example: <signal-process-group scope='global' signal='SIGINT'> <process-group user='desai'/> </signal-process-group> • Miscellaneous hardening of the MPD system, particularly in error conditions, prompted by Intel use
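
The difference between signalling only the top-level user process and signalling its whole Unix process group can be sketched in Python with the standard os.kill and os.killpg calls. This is not MPD code: signal_job, its whole_group parameter, and demo are hypothetical names used only for illustration, the sketch is POSIX-only, and the demo sends SIGTERM rather than the SIGINT of the slide's example to keep the outcome simple.

```python
import os
import signal
import subprocess
import sys


def signal_job(pid: int, sig: int, whole_group: bool = False) -> None:
    """Send sig to one process, or to every process in its Unix process group.

    Hypothetical helper: whole_group=False mirrors the slide's default
    (top-level user process only).
    """
    if whole_group:
        # Look up the group the process belongs to and signal all its members.
        os.killpg(os.getpgid(pid), sig)
    else:
        os.kill(pid, sig)


def demo() -> int:
    # Start the child in a new session so it leads its own process group;
    # signalling that group then cannot reach this script.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    signal_job(child.pid, signal.SIGTERM, whole_group=True)
    # subprocess reports death-by-signal as a negative return code.
    return child.wait(timeout=10)


if __name__ == "__main__":
    print(demo())
```

Any processes the child had forked into the same group would receive the signal as well, which is the behaviour the new option selects; with whole_group=False only the given PID is signalled.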

Monitoring Work at NCSA • A major fix has been implemented in warehouse. Previously, once network degradation exceeded a certain threshold, none of the nodes were monitored at all, because messages stacked up in the incoming sockets. The code now handles multiple messages per pass, so when that threshold is exceeded the nodes are simply monitored more slowly. • This code was tested in the "good" realm against Dave, Scott, and Brett in July, ahead of another release of the RMAP suite. It has not been tested in the "bad" realm, because that requires a dedicated test. • The bad news is that when I came back from vacation in Britain, the hard drive on my desktop had suffered a complete hardware failure. I had been backing up warehouse religiously, and since I had transported the code down to xtorc to create new rpms, I lost nothing on warehouse. • I did, however, lose a good deal of work on the SSSRMAP wire protocol, including annotated code that I would have liked to have kept. Fortunately, most of that work was figuring things out, and enough of it carried over in memory that reconstructing it the second time is much easier. • I have been working feverishly to get that project back on track. I am at the point where it will be useful to sit down with Narayan and Dave and ask "where does this go" install/Makefile sorts of questions.
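
The idea behind the warehouse fix on this slide, reading several queued messages from each incoming socket per pass instead of one, can be illustrated with a small Python sketch. It is not the warehouse code: drain_ready, max_per_socket, and the datagram socket pair in demo are assumptions made for the example.

```python
import select
import socket


def drain_ready(socks, max_per_socket=32, bufsize=4096):
    """One monitoring pass: read up to max_per_socket queued messages
    from every socket that currently has data (sockets must be non-blocking)."""
    collected = {}
    readable, _, _ = select.select(socks, [], [], 0)  # poll, do not wait
    for s in readable:
        msgs = []
        for _ in range(max_per_socket):
            try:
                msgs.append(s.recv(bufsize))
            except BlockingIOError:
                break  # queue is empty for now
        collected[s] = msgs
    return collected


def demo():
    # A local datagram pair stands in for a node reporting to the collector.
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    receiver.setblocking(False)
    for i in range(5):
        sender.send(f"node-report-{i}".encode())
    # With a cap of 3 per pass, the backlog is worked off over two passes.
    first = len(drain_ready([receiver], max_per_socket=3)[receiver])
    second = len(drain_ready([receiver], max_per_socket=3)[receiver])
    sender.close()
    receiver.close()
    return first, second


if __name__ == "__main__":
    print(demo())
```

The per-socket cap is a design choice in this sketch: it keeps one backlogged node from starving the others, so under heavy load every node is still visited, just more slowly.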
