Process Management & Monitoring WG: Quarterly Report, January 25, 2005
Components
• Process Management
  • Process Manager
  • Checkpoint Manager
• Monitoring
  • Job Monitor
  • System/Node Monitors
  • Meta Monitoring
Component Progress
• Checkpoint Manager (LBNL): BLCR
• Process Manager (ANL): MPDPM
• Monitoring (NCSA): Warehouse
Checkpoint Manager: BLCR Status
• Full save and restore of:
  • CPU registers
  • Memory
  • Signals (handlers & pending signals)
  • PID, PGID, etc.
  • Files (w/ limitations)
  • Communication (via MPI)
(A usage sketch follows this slide.)
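For context on how this save/restore is driven in practice, here is a minimal sketch using BLCR's command-line utilities (cr_run, cr_checkpoint, cr_restart). The application name, arguments, and context-file naming are illustrative assumptions, not taken from the report.

```python
import subprocess

# Launch an application under BLCR's checkpoint library (app name and args are illustrative).
app = subprocess.Popen(["cr_run", "./my_app", "--input", "data.txt"])

# Ask BLCR to checkpoint the running process: registers, memory, signal state,
# PID/PGID, and (with limitations) open files are written to a context file.
ctx = f"context.{app.pid}"
subprocess.run(["cr_checkpoint", "-f", ctx, str(app.pid)], check=True)

# Stop the original so its PID is free, then restart from the saved context,
# possibly after a reboot or on another node of the same architecture.
app.terminate()
app.wait()
subprocess.run(["cr_restart", ctx], check=True)
```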
Checkpoint Manager: BLCR Status (Files)
• Files
  • Files unmodified between checkpoint and restart
  • Files appended to between checkpoint and restart
  • Pipes between processes
Checkpoint Manager: BLCR Status (Comms)
• LAM/MPI 7.x over TCP and GM
  • Handles in-flight data (drains)
  • Linear scaling of time w/ job size
  • Migratable
• OpenMPI
  • Will inherit LAM/MPI’s support
• ChaMPIon/Pro (Verari)
Checkpoint Manager: BLCR Status (Ports)
• Linux only
  • “Stock” 2.4.x: RedHat 7.2 – 9, SuSE 7.2 – 9.0, RHEL3/CentOS 3.1
  • 2.6.x port in progress (FC2 & SuSE 9.2)
• x86 (IA32) only today
  • x86_64 (Opteron) will follow the 2.6.x port
  • Alpha, PPC and PPC64 may be trivial
  • No IA64 (Itanium) plans
Checkpoint Manager: BLCR Future Work
• Additional coverage
  • Process groups and sessions (next priority)
  • Terminal characteristics
  • Interval timers
  • Queued RT signals
• More on files
  • Mutable files
  • Directories
Checkpoint Manager: SSS Integration
• Rudimentary Checkpoint Manager
  • Works with Bamboo, Maui and MPDPM
• Long-delayed plans for “next gen”
  • Upgraded interface spec (using LRS)
  • Management of “context files”
• lampd
  • mpirun replacement for running LAM/MPI jobs under MPD
Checkpoint Manager: Non-SSS Integration
• Grid Engine: done by a 3rd party (online howto)
• Verari Command Center: in testing
• PBS family
  • Torque: Cluster Resources interested
  • PBSPro: Altair Engineering interested (if funded)
• SLURM: Mo Jette of LLNL interested (if funded)
• LoadLeveler: IBM may publish our URL in support documents
Process Manager Progress (ANL)
• Continued daily use on Chiba City, along with other components
• Miscellaneous hardening of the MPD implementation of the PM, particularly with respect to error conditions, prompted by Intel use and Chiba experience
• Conversion to LRS, in preparation for presentation of the interface at this meeting
• Preparation for BG/L
Monitoring at NCSA: Warehouse Status
• Network code has been revamped; that code is in CVS in oscar sss
  • Connections are now retried
  • Starting to monitor does not wait for all connections to finish
  • Connection and monitoring thread pools are independent (see the sketch after this slide)
  • No full reset: if many nodes are down, it continues blindly
  • Any component can be restarted; restart no longer depends on start order
• These features were intended for sss-oscar 1.0 (SC2004); they missed that release but made it into 1.01
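The Warehouse code itself is not shown here; the following is only a hedged Python sketch of the design points above: connection attempts retry on their own worker threads, and monitoring starts on whatever connections exist without waiting for the rest. All names (NODES, MONITOR_PORT, connector, monitor) are hypothetical, not Warehouse identifiers.

```python
import queue
import socket
import threading
import time

NODES = ["node001", "node002", "node003"]  # hypothetical node list
MONITOR_PORT = 9999                        # hypothetical monitor port

connected = queue.Queue()  # hands live connections from connectors to monitors

def connector(host):
    """Keep retrying one node until it answers; never blocks the monitors."""
    while True:
        try:
            sock = socket.create_connection((host, MONITOR_PORT), timeout=5)
            connected.put((host, sock))
            return
        except OSError:
            time.sleep(10)  # node down or unreachable: retry later

def monitor():
    """Consume whatever connections exist so far; don't wait for the rest."""
    while True:
        host, sock = connected.get()
        # ... poll the node's warehouse_monitor over `sock` here ...
        print(f"monitoring {host}")

# Independent pools: connection threads and monitoring threads don't share workers.
for host in NODES:
    threading.Thread(target=connector, args=(host,), daemon=True).start()
for _ in range(4):
    threading.Thread(target=monitor, daemon=True).start()

time.sleep(30)  # the real daemon would run indefinitely
```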
Monitoring at NCSA: Warehouse Testing
• Warehouse was run on the former Platinum cluster at NCSA
  • Node count kept dropping: 400 nodes originally, 200 in the post-cluster configuration, 120 available for testing
  • Ran on 120 nodes with no problems
• Have the head node, but cannot have the whole cluster, so sss-oscar was not tried
Monitoring at NCSA: Warehouse Testing (2)
• "Infinite" Itanium cluster (InfiniBand development machine)
  • Have root access
  • Will definitely run warehouse for long-range testing
  • Might try the whole suite (semi-production)
• T2 cluster (Dell Xeon, 500+ nodes)
  • May run warehouse across it (per Mike Showerman)
• Anecdote: went to test the new warehouse_monitor on xtorc: installed and started the new warehouse_monitors on the nodes, then called up warehouse_System_Monitor to make sure it wasn't running; the already running System Monitor had connected to all the new warehouse_monitors and everything was running fine.
Monitoring Work at NCSA
• David Boxer, an RA, has been working on warehouse
  • Craig worked bugs and fiddly things; David did the development heavy lifting
  • Revamped the network code (modularized)
  • Developed new info storage (more on this in the afternoon)
• New info store and logistics
  • Info store redesigned and updated: done
  • Protocol redesigned: done
  • Send protocol: done
  • Receive protocol: still to do
• IBM offered him real money; he's off to work for them.
Monitoring Work at NCSA (2)
• Wire protocol:
  • I (Craig) need a working knowledge of signature/hash functions; once I have it, I'll be back to coding on this (see the sketch after this slide)
  • Perilously close to being able to do useful stuff
• Documentation:
  • Most of a web site is written, covering the philosophy of warehouse and the debugging tools
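As a starting point for that signature/hash work, here is a generic standard-library illustration of signing a wire-protocol message with a keyed hash (HMAC-SHA256). It is not the Warehouse wire protocol; the key and message format are made up.

```python
import hashlib
import hmac

SHARED_KEY = b"example-shared-secret"   # made-up key for illustration

def sign(message: bytes) -> bytes:
    """Return a keyed hash that the receiver can recompute and compare."""
    return hmac.new(SHARED_KEY, message, hashlib.sha256).digest()

def verify(message: bytes, signature: bytes) -> bool:
    """Constant-time comparison avoids leaking how many bytes matched."""
    return hmac.compare_digest(sign(message), signature)

msg = b"node042 load 0.73 1106694000"   # hypothetical monitoring record
sig = sign(msg)
assert verify(msg, sig)
assert not verify(b"node042 load 9.99 1106694000", sig)
```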
Monitoring at NCSA: Future Work
• New interaction (to come): Node Build and Config Manager
  • On start-up, will talk to the Node State Manager and get the list of up nodes
  • Subscribe to Node State Manager events for updates (a sketch follows this slide)
  • For now, it can continue to store node state, then transition to the Scheduler obtaining state information itself
• Also to come:
  • Intelligent error handling (target-based vs. severity-based)
  • Command-line debugging/control?
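To make the planned interaction concrete, here is a hedged sketch of that logic: seed a local view from the Node State Manager's list of up nodes, then keep it current from node-state events. The event shape, names, and transport are invented for illustration; the real component would use the SSS Node State Manager and event interfaces.

```python
# Hedged sketch of the planned Warehouse <-> Node State Manager interaction.
# All names and message shapes here are invented for illustration.

up_nodes = set()

def on_startup(initial_list):
    """Seed the local view from the Node State Manager's list of up nodes."""
    up_nodes.update(initial_list)

def on_node_event(event):
    """Apply a node-state change event, e.g. {"node": "node042", "state": "down"}."""
    if event["state"] == "up":
        up_nodes.add(event["node"])
    else:
        up_nodes.discard(event["node"])

on_startup(["node001", "node002", "node003"])
on_node_event({"node": "node002", "state": "down"})
print(sorted(up_nodes))  # ['node001', 'node003']
```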