Metronome and The NMI Lab: This subtitle included solely to steal the “longest title” award from Ewa, who thought she won it this morning with “Pegasus and DAGMan: From Concept to Execution Mapping Scientific Workflows onto the National Cyberinfrastructure”
Decision Time • Past: quick review (why, what, who) • Present: current status, new this year • Future: future plans, new next year
Why: The Problem • Good distributed computing (“grid”) software is… • badly needed • hard to find • hard to build and test
The Fix (Part of it, anyway) • Good build/test cycle • To be good, build/test process must be… • frequent • reliable • automatic • repeatable
The (Next) Problem • Building and testing distributed computing software requires… • Distributed resources • Not always in-house, not always dedicated to builds • I.e., shared, scheduled resources • Unless you have a spare Blue Gene lying around… and an old Alpha running RedHat 7.2… and an HPUX 11 box… and an Itanium running Scientific Linux 3 (CERN-flavored) … and… • Distributed testbeds, tests • Not: “the grid works on my machine… ship it!”
Grid Build and Test • Building and testing distributed computing software brings distributed challenges… • Complex workflows, cross-site/project/user scheduling priorities, data management, fault-tolerance, failure recovery • A lot like “real” distributed computing • Tinderbox or the latest Web 2.0 build system doesn’t cut it • Deep, integrated software stacks • Distributed providers
How We Do It • Use proven grid software to build and test new grid software • “Condor works, let’s use Condor” • Metronome is our second-generation build/test framework built on top of Condor, DAGMan, and other distributed computing technologies • NSF-funded
Metronome Principles • Tool-independent • Lightweight • Encourage explicit, well-controlled build/test environments • Central results repository • Fault-tolerance • Support platform-neutral and platform-specific tasks • Build/test separation
Metronome (architecture diagram) • INPUT: Spec File, Customer Source Code, Customer Build/Test Scripts • NMI Build & Test Software • DAGMan (DAG) • Condor Queue (build/test jobs) • Distributed Build/Test Pool (results) • OUTPUT: Finished Binaries, MySQL Results DB, Web Portal
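The spec-file-to-DAG step in the diagram above can be sketched roughly as follows. This is a minimal illustration, not Metronome's actual code: the function `make_dag`, the submit-file naming scheme, and the platform identifiers are all hypothetical. Only the `JOB` and `PARENT ... CHILD` lines are standard DAGMan input syntax. The sketch also assumes one build node and one test node per platform, in line with the build/test separation principle.

```python
from typing import List


def make_dag(name: str, platforms: List[str]) -> str:
    """Render DAGMan input text: one build node and one test node per platform."""
    lines: List[str] = []
    for platform in platforms:
        build = f"build_{platform}"
        test = f"test_{platform}"
        # JOB <node> <submit-file>: each node points at a Condor submit file.
        # The submit-file names below are made up for illustration.
        lines.append(f"JOB {build} {name}.{platform}.build.sub")
        lines.append(f"JOB {test} {name}.{platform}.test.sub")
        # PARENT/CHILD: the test node runs only after its build node succeeds.
        lines.append(f"PARENT {build} CHILD {test}")
    return "\n".join(lines) + "\n"


# Hypothetical project name and platform identifiers.
dag_text = make_dag("myproject", ["x86_rhas_3", "hppa_hpux_11"])
print(dag_text)
```

Keeping builds and tests as separate nodes means a failed build stops only its own platform's test, while the other platforms keep going.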
NMI Lab • Dedicated, heterogeneous distributed computing facility • Opposite extreme from typical “cluster” -- instead of thousands of identical CPUs, we have a handful of CPUs each for 50+ platforms. • Much harder to manage! You try finding a monitoring tool that works on 50 platforms! • Carefully controlled resources • No mystery meat
The Team • Subset of the Condor Team • Becky Gietzel, master of all things NMI • Todd Miller, new guy on the block • Andy Pavlo, part-timer, short-timer • Ken Hahn, sysadmin to the stars • Me
Dogfood and Hats • Eating our own dogfood… • Condor builds failed last weekend (true!) • Condor developers complained to NMI Lab (“your build system failed… fix it!”) • NMI Lab discovered Condor bug (“hmm…”) • NMI Lab complained to Condor developers (“your software failed… fix it!”) • Feel the love!
New Name! • Before: • NMI Build & Test System, NMI Build & Test Software, NMI Build & Test Framework, NMI Software, NMI Build & Test Lab, UW-Madison Build & Test Lab, Build & Test Lab at UW-Madison • After: • Metronome + the NMI Lab • Why? • Old names were a mouthful • Clear separation between the software framework (Metronome) and the facility (the NMI Lab)
Real Work • Extremely Productive Collaborations • TeraGrid: production Metronome deployment using dynamically provisioned resources • ETICS, OMII: building higher-level services to generate and manage build/test jobs across an international federation of Metronome deployments • Extremely Productive Users • Condor, TeraGrid, Open Science Grid / VDT, Globus, NCSA (MyProxy), SDSC (SRB), LIGO, many others in this room…
New Metronome Capabilities • “Productization”, customization for other sites • Parallel testing • Enables dynamic, co-scheduled, distributed testbeds! • Automatic cross-site job migration • Run your own local Metronome pool with access to ours for exotic platforms • Many smaller features and extensions for production users -- users drive development • More bugs fixed than introduced!
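The “local pool with access to ours for exotic platforms” idea above can be pictured as a simple routing decision. The sketch below is an assumption about how such a policy might look, not a description of how Metronome or Condor implements migration. The function `route_jobs` and the platform names are hypothetical.

```python
from typing import Dict, List, Set


def route_jobs(platforms: List[str], local: Set[str], remote: Set[str]) -> Dict[str, str]:
    """Map each requested platform to 'local', 'remote', or 'unavailable'."""
    routing: Dict[str, str] = {}
    for platform in platforms:
        if platform in local:
            # Prefer the site's own pool when it has the platform.
            routing[platform] = "local"
        elif platform in remote:
            # Fall back to the shared lab for platforms the site lacks.
            routing[platform] = "remote"
        else:
            routing[platform] = "unavailable"
    return routing


# Hypothetical platform identifiers.
routing = route_jobs(
    ["x86_linux", "hppa_hpux_11", "vax_vms"],
    local={"x86_linux"},
    remote={"x86_linux", "hppa_hpux_11"},
)
print(routing)
```

Preferring the local pool keeps common builds off the shared lab, so the lab's scarce exotic machines serve only the jobs that need them.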
New NMI Lab Capabilities • More platforms • “always with the platforms…” • new Itanium platforms, NLOTW (New Linux of the Week), additional vendor Unix machines, etc. • Now over 50 (!) platforms • Improved Lab Management • No, not me… better design and automation of systems & their administration
The Plan: Metronome • “Support, maintain, enhance” • VM--I mean slot--no wait, I mean VM support • Enhanced parallel testing support • Custom testbed environments (network, etc.) • Dynamic deployments (glide-in) • Advanced scheduling policies • Scalability testing enhancements • Better docs/installation/management
The Plan: NMI Lab • “Support, maintain, enhance” • More platforms, always with the platforms • More capacity • VM servers for… • Root-level testing • On-demand platforms • Federation with other Metronome labs • Better support, smoother management, less downtime • New sysadmin starting in June: take a bow, Ross!
You • Want to use it? • Metronome • The NMI Lab • http://nmi.cs.wisc.edu/
Feedback • When we started, the state of the art was unimpressive (almost nonexistent)… we had to build our own • More build tools now exist -- if you know & like one of them, what do you like about it? • We’d like to better understand what we do well, what we don’t, and how we can integrate with other systems you find useful…