Rick Bradshaw, Narayan Desai, Andrew Lusk, Rusty Lusk, Brian Pellin Mathematics and Computer Science Division Argonne National Laboratory. Experiences with SSS software Architecture in a “Production” Environment. The “SSS on Chiba” Project.
The “SSS on Chiba” Project
This was a summer project launched shortly after the last face-to-face meeting in June.
Outline:
• Definition of Project
• Motivation
• Limitations
• Approach
• Experiences
• Status and Plans
• Distribution
Project Definition
• Chiba City consists of 256 dual-processor nodes running Linux, with Myrinet and Fast Ethernet
• Scalability testbed
• Project: determine whether the SSS component architecture could be used to replace the existing Chiba City system software, consisting of
  • PBS
  • Maui scheduler
  • Home-grown user software for distributing files and executables
    • No shared file system
  • Home-grown system software for managing nodes
Motivation
• Needed better systems software on Chiba City cluster
  • In general
  • For testing other SSS components (e.g. checkpointing)
  • For enabling Chiba as a testbed for scalable OS research
• Needed to more thoroughly test existing ANL-written components
  • Stand-alone components
    • Build-and-Config Manager, Process Manager, Event Manager
  • Infrastructure components
    • Service Directory, Communication Library
• Needed more experience with published XML interfaces
• Had extra programming muscle available over the summer
Limitations
• Needed to do this very fast, before summer resources evaporated
• Chiba is in constant use by research computer scientists (e.g., developing a parallel file system) and computational scientists (e.g., physics, biology)
Approach
• Utilize assets on hand
  • Some central components (SD, EM, PM, Comm Library)
  • Existing published XML interfaces for these
  • Python programmers
• Write stubs for other essential components
  • Scheduler
    • Nothing fancy
    • Only does FIFO with reservations and backfill
  • QM (Queue Manager)
    • Interface among user, scheduler, process manager
    • But some extra capabilities
      • Multiple job steps, e.g., to distribute files
      • Specify OS image to be loaded, to support testbed function
      • “PBS compatibility mode”, to allow users to reuse their job submission scripts
• Use restriction syntax for stubs for simplicity and speed
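The scheduler stub is described above only as "FIFO with reservations and backfill". As an illustration of that general idea, here is a minimal Python sketch. It is not the Chiba scheduler's code: the `Job` class, the `schedule_now` function, and the specific backfill rule (the head-of-queue job gets a reservation, and later jobs may jump ahead only if they do not delay it) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; field names are illustrative only.
    name: str
    nodes: int      # number of nodes requested
    walltime: int   # requested run time, in minutes


def schedule_now(queue, running, total_nodes, now=0):
    """Return the names of queued jobs to start at time `now`.

    queue:   list of Job in submission (FIFO) order
    running: list of (end_time, nodes) pairs for jobs already executing
    """
    free = total_nodes - sum(n for _, n in running)
    running = list(running)
    pending = list(queue)
    started = []

    # Plain FIFO: start jobs from the head of the queue while they fit.
    while pending and pending[0].nodes <= free:
        job = pending.pop(0)
        free -= job.nodes
        running.append((now + job.walltime, job.nodes))
        started.append(job.name)
    if not pending:
        return started

    # Reservation for the blocked head job: the earliest time at which
    # enough nodes will have been released for it to start.
    head = pending[0]
    avail, reserved_start = free, None
    for end, n in sorted(running):
        avail += n
        if avail >= head.nodes:
            reserved_start = end
            break
    if reserved_start is None:
        # Head job asks for more nodes than the machine has; no backfill.
        return started
    spare = avail - head.nodes  # nodes the head job will not need

    # Backfill: a later job may start now if it fits in the free nodes and
    # either finishes before the reservation or uses only spare nodes.
    for job in pending[1:]:
        if job.nodes > free:
            continue
        done_in_time = now + job.walltime <= reserved_start
        if done_in_time or job.nodes <= spare:
            free -= job.nodes
            if not done_in_time:
                spare -= job.nodes
            started.append(job.name)
    return started
```

For example, on an 8-node machine with 6 nodes busy for another 60 minutes, a queued 8-node job is blocked and reserved for minute 60. A 2-node, 30-minute job behind it can be backfilled, while a 2-node, 120-minute job cannot, because it would still hold nodes the reserved job needs.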
Experiences
• At end of summer, after a 2-week shakedown, we convinced Chiba management to go forward rather than reinstall the old software. (No more PBS.)
• Have been running the user job mix for about three weeks, with no disasters.
• Shook out some ambiguities in the XML specification for component interfaces
• Fixed bugs
• Found and fixed scalability problems
Status and Plans
• Status
  • Working
  • Collecting user experiences
• Plans
  • Short term
    • Incorporate other components from Process Management Working Group
      • Paul: kernel module, LAM support, and CP Manager
      • Craig: monitoring and data warehouse
  • Long term
    • Other components from rest of project, especially Resource Management Working Group components
    • Provide Chiba for OS experimentation as part of normal batch-scheduled jobs, e.g. Sandia group