Narayan Desai desai@mcs.anl.gov Argonne National Laboratory SciDAC SSS Infrastructure and BCM Update 10/2002
Overview • Infrastructure • Service Directory • Event Manager • SSSlib • Bindings • Wire Protocol Modules • Build and Configuration Management • Cluster Hardware Infrastructure • Build System • Node State Manager • Abstraction
Service Directory Status • Static Schemas • Complete Error Handling • Deployed • Reliable at Moderate Scale
Event Manager • Usage Model • Stable Schemas • Relatively Complete Error Handling • Completely Rewritten • Received Moderate Testing • Potential Subscription Type Differentiation
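The slide does not spell out the usage model, so the following is only a minimal in-process sketch of one plausible reading: clients subscribe by event type, and a wildcard subscription stands in for the "subscription type differentiation" idea. The class name, method names, and the `"*"` wildcard are all assumptions for illustration; the real Event Manager is a network service with its own schemas.

```python
WILDCARD = "*"  # hypothetical marker for "all event types"


class EventManager:
    """Toy publish/subscribe hub keyed by event type (illustrative only)."""

    def __init__(self):
        self._subs = {}  # event type -> list of callbacks

    def subscribe(self, event_type, callback):
        """Register a callback for one event type, or WILDCARD for all."""
        self._subs.setdefault(event_type, []).append(callback)

    def emit(self, event_type, data):
        """Deliver an event to matching subscribers; return how many got it."""
        targets = self._subs.get(event_type, []) + self._subs.get(WILDCARD, [])
        for callback in targets:
            callback(event_type, data)
        return len(targets)
```

A subscriber interested only in, say, node failures would register for that one type, while a logger would register for `WILDCARD`.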
SSSlib • Robust • New C implementation • Bindings • C++ • Java • Python • Perl • Wire protocol modules • Basic • Challenge • http-rm (in development) • http (in development)
BCM Abstraction • “2nd try” Abstraction • Components • Node Manager • Cluster Control • Didn’t Work • “3rd is the charm” • Components • Cluster Hardware Infrastructure • Build System • Node State Manager • Why we think it will work
Cluster Hardware Infrastructure • Handles new node integration • Abstracts cluster hardware infrastructure • Power Controllers • Serial Consoles • BIOS Issues • Node Inventory • Node Identification
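One way to picture the hardware abstraction is a small interface that hides vendor-specific power controllers behind node names. This is a sketch under stated assumptions: `PowerController`, `FakePowerController`, and `HardwareInventory` are hypothetical names, and the slides do not describe the actual CHI interface.

```python
from abc import ABC, abstractmethod


class PowerController(ABC):
    """Hypothetical interface; real controllers differ per vendor."""

    @abstractmethod
    def set_power(self, port, on):
        """Switch one outlet on (True) or off (False)."""


class FakePowerController(PowerController):
    """In-memory stand-in used only for illustration."""

    def __init__(self):
        self.ports = {}  # port number -> last requested power state

    def set_power(self, port, on):
        self.ports[port] = on


class HardwareInventory:
    """Maps node names to the controller and port that feed them."""

    def __init__(self):
        self._nodes = {}  # node name -> (controller, port)

    def add_node(self, name, controller, port):
        self._nodes[name] = (controller, port)

    def power(self, name, on):
        """Power a node by name; callers never see controller details."""
        controller, port = self._nodes[name]
        controller.set_power(port, on)
```

Other components would then ask for `inventory.power("n1", False)` without knowing which controller or outlet is involved.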
Node State Manager • Node State Tracking • Basic state monitoring • Node Administrative State • Online/Offline • Corrective Action Facility • Pull the plug on bad nodes • Unknown criterion
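The pieces listed above can be sketched as a tracker with a pluggable failure criterion. Since the slide itself says the criterion is unknown, `is_bad` is left as a caller-supplied placeholder; `NodeStateManager`, `report`, and `power_off` are likewise hypothetical names, not the project's API.

```python
ADMIN_STATES = ("online", "offline")


class NodeStateManager:
    """Toy tracker for observed status and administrative state."""

    def __init__(self, is_bad, power_off):
        self.is_bad = is_bad        # criterion placeholder (unknown in the slides)
        self.power_off = power_off  # corrective action, e.g. cut node power
        self.admin = {}             # node name -> administrative state
        self.observed = {}          # node name -> last reported status

    def set_admin_state(self, name, state):
        if state not in ADMIN_STATES:
            raise ValueError(f"unknown administrative state: {state}")
        self.admin[name] = state

    def report(self, name, status):
        """Record a status report and pull the plug if the node looks bad."""
        self.observed[name] = status
        if self.is_bad(status):
            self.admin[name] = "offline"
            self.power_off(name)
```

Keeping the criterion as a callable means the corrective-action logic can change without touching the state tracking.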
Build System • Disk Setup • Software Configuration • Software Deployment
Example Node Introduction • New node added • CHI identifies node • CHI hands off control of node to BS • BS builds node into “proper” configuration • BS hands off control of node to NSM • NSM can set node administrative state • In case of errors, node can be rebuilt or other actions can be taken
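The hand-off sequence above can be sketched as a small ownership model, where exactly one component controls a node at a time. All names here (`Node`, `hand_off`, `introduce`, the `build` callback) are assumptions for illustration; the slides name the components (CHI, BS, NSM) but not any API.

```python
from dataclasses import dataclass, field

# Hypothetical owner labels taken from the component abbreviations.
CHI, BS, NSM = "CHI", "BS", "NSM"


@dataclass
class Node:
    name: str
    owner: str = CHI              # a newly identified node starts under CHI
    admin_state: str = "offline"
    history: list = field(default_factory=list)


def hand_off(node, src, dst):
    """Transfer control of a node, refusing hand-offs from a non-owner."""
    if node.owner != src:
        raise ValueError(f"{src} does not control {node.name}")
    node.history.append((src, dst))
    node.owner = dst


def introduce(node, build):
    """Walk a node through CHI -> BS -> NSM; `build` returns True on success."""
    hand_off(node, CHI, BS)
    if not build(node):
        # On error the node could be rebuilt or handled some other way;
        # this sketch simply leaves it under BS control.
        return False
    hand_off(node, BS, NSM)
    node.admin_state = "online"   # NSM may now set the administrative state
    return True
```

For example, `introduce(Node("n1"), build=lambda n: True)` ends with the node owned by NSM and marked online, while a failing build leaves it with BS for a retry.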
Soon • Start to standardize events • Enhance event data format? • Implement more wire protocols • Complete “hot-swap” tests for BCM components • Logic for the node state manager • Implement a modular cluster hardware infrastructure
Issues • Abstraction problems • BCM model unknown • Implementation feedback • Multiple implementations help