Discover the importance of testing in ensuring the quality of grid software and why testing distributed software stacks can be challenging. Learn about Metronome, a framework and tool that provides explicit and well-controlled build/test environments for reproducibility. Explore the power of the NMI Lab, a facility built to use distributed resources, and how it can help you improve your software testing processes.
Keeping Your Software Ticking: Testing with Metronome and the NMI Lab
Background: Why (In a Slide!)
• Grid Software: Important to Science and Industry
• Quality of Grid Software: Not So Much
• Testing: Key to Quality
• Testing Distributed Software: Hard
• Testing Distributed Software Stacks: Harder
• Distributed Software Testing Tools: Nonexistent (before)
We Needed Help, We Built Something to Help Ourselves and Our Friends, We Think It Can Help Others
Background: What (In a Slide!)
• A Framework and Tool: Metronome
  • Lightweight, built atop Condor, DAGMan, and other proven distributed computing tools
  • Portable, open source
  • Language/harness independent
  • Assumes >1 user, >1 project, >1 environment needing resources at >1 site
  • Encourages explicit, well-controlled build/test environments for reproducibility
  • Central results repository
  • Fault-tolerant
  • Encourages build/test separation
• A Facility: The NMI Lab
  • 200+ cores, 50+ platforms @ UW (Noah’s Ark; the Anti-Cluster)
  • Built to use distributed resources at other sites, grids, etc.
  • 200 users, dozens of registered projects (most of them “real”)
  • 84k builds & tests managed by 1M Condor jobs, producing 6.5M tracked tasks in the DB
• A Team
  • Subset of Condor Team: Becky Gietzel, Todd Miller, Ross Oldenburg, myself. (More coming.)
• A Community
  • Working with TeraGrid, OSG, ETICS, and others towards a common international build/test infrastructure.
Metronome Architecture (In a Slide!)
[Architecture diagram. Labels: INPUT (Spec File, Customer Source Code, Customer Build/Test Scripts); Metronome; DAG; DAGMan; Condor Queue; build/test jobs; Distributed Build/Test Pool; results; OUTPUT (Web Status Pages, Finished Binaries, MySQL Results DB)]
Why Is This Architecture Powerful?
• Fault tolerance, resource management.
• Real scheduler, not a toy or afterthought.
• Flexible workflow tools.
• Nothing to deploy in advance on worker nodes except Condor
  • can harness “unprepared” resources.
• Advanced job migration capabilities
  • critical for goal of a common build/test infrastructure across projects, sites, countries.
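The "flexible workflow tools" point can be illustrated with a small dependency graph: fetch the source once, build on each platform, test each build, then publish results. The following is a minimal sketch using Python's standard `graphlib`; the task names and platform names are hypothetical, and this is not Metronome's spec-file format or DAGMan's input format.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_test_dag(platforms):
    """Return a mapping of task name -> set of tasks it depends on."""
    dag = {"fetch_source": set()}
    for p in platforms:
        dag[f"build_{p}"] = {"fetch_source"}   # every build needs the source
        dag[f"test_{p}"] = {f"build_{p}"}      # each test needs its own build
    # Results are published only after every test has finished.
    dag["publish_results"] = {f"test_{p}" for p in platforms}
    return dag


def run_order(dag):
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(dag).static_order())


platforms = ["x86_64_linux", "ppc_aix"]  # hypothetical platform names
order = run_order(build_test_dag(platforms))
print(order)
```

A real scheduler would run independent tasks (here, the per-platform builds) in parallel rather than in a single linear order; the sketch only shows the ordering constraints.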
10k Foot View
• Past:
  • humble beginnings: a ragtag crew of developers making building & testing easier for the projects around them (Condor, Globus, VDT, TeraGrid...)
• Present:
  • now we have tax money, and users should have higher expectations
  • good news: six months into a new three-year funding cycle, our "professionalism" has improved from our humble beginnings, with better hardware, better processes, and better staffing
  • bad news: we’re still a bit ragtag, with inconsistent tracking of support/development requests, inconsistent info on resource/lab improvements, issues, and resolutions, and a generally reactive approach to problems
  • we're clearly contributing to the build & test capabilities of the community, but we’d like to deliver much more, especially with respect to testing.
10k Foot View: Future
• Maintain Metronome and the NMI Lab
  • continue to professionalize lab infrastructure, improve availability, stability, uptime
  • Better monitoring -> more proactive response to issues
  • Better scheduling of jobs, better use of VMs to respond to uneven x86 platform demand
• Enhance Metronome and the NMI Lab
  • New features, new capabilities, but these might be less important than clarity, usability, fit & finish of existing features.
10k Foot View: Future
• Support Metronome and the NMI Lab
  • more systematic support operation (ticketing, etc.)
  • more utilization of basic testing capabilities by new users
  • more utilization of advanced testing capabilities by existing users
  • more & better information for users, admins, and pointy-haired bosses
  • better reporting on users, resources, usage, operations, etc.
• Nurture Distributed Software Testing Community
  • to identify common B&T needs to improve software quality.
  • to challenge and help us to provide software & services to help meet B&T needs.
  • Tuesday’s meeting was a good start, I hope…
Testing Opportunities
• more resources == more possibilities (just like science)
  • don’t just test under normal conditions, test the not-so-edge cases too (e.g., with CPU load!)
  • test everywhere your users run, not just where you develop
  • old/exotic/unique resources you don’t own (NMI Lab, TeraGrid)
• “black box”
  • run your existing test harness (Tinderbox, etc.) inside Metronome
• decoupled builds & tests
  • run new tests on old builds
  • cross-platform binary compatibility testing
• run quick smoke tests continuously, heavy tests nightly, performance/scalability tests before release
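The last bullet (smoke tests continuously, heavy tests nightly, performance tests before release) amounts to a mapping from a trigger to the test tiers it runs. Here is a minimal sketch in Python; the trigger names, the test names, and the choice to make the tiers cumulative (a nightly run also repeats the smoke tests) are assumptions for illustration, not something the slides or Metronome specify.

```python
# Hypothetical trigger names. The smoke/heavy/performance split follows the
# slide; making the tiers cumulative is an assumption of this sketch.
TIERS_BY_TRIGGER = {
    "continuous": {"smoke"},
    "nightly": {"smoke", "heavy"},
    "pre_release": {"smoke", "heavy", "performance"},
}


def select_tests(tests, trigger):
    """Return the sorted names of tests whose tier runs under this trigger."""
    allowed = TIERS_BY_TRIGGER[trigger]
    return sorted(name for name, tier in tests.items() if tier in allowed)


# Hypothetical test names, each tagged with its tier.
tests = {
    "starts_up": "smoke",
    "full_regression": "heavy",
    "ten_thousand_jobs": "performance",
}

nightly = select_tests(tests, "nightly")
print(nightly)
```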
Testing Opportunities
• managed (static) vs. “unmanaged” (auto-updating) platforms
  • isolate your changes from the OS vendors
  • test your changes against a fixed target
  • test your working code against a moving target
• root-level testing
• automated reports from testing tools
  • Valgrind, Purify, Coverity, etc.
• cross-platform binary testing (build on A, test on B)
Testing Opportunities
• Parameterized dependencies
  • build with multiple library versions, compilers, etc.
  • test against every Java VM, Maven, Ant version around
  • test against different DBs (MySQL, Postgres, Oracle, etc.), VM platforms (Xen, VMware, etc.), batch systems
  • make sure new versions of Condor, Globus, etc. don’t break your code
• Parallel scheduled testbeds
  • cross-platform testing (A to B)
  • deploy software stack across many hosts, test whole stack
  • multi-site testing (US to Europe)
  • network testing (cross-firewall, low-bandwidth, etc.)
  • scalability testing
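Parameterized dependencies boil down to expanding a set of axes (compiler, database, Java VM, and so on) into every combination to build and test. A minimal sketch in Python with `itertools.product` follows; the axis names and version strings are hypothetical placeholders, and this does not reflect how Metronome itself declares parameters.

```python
from itertools import product


def expand_matrix(axes):
    """Expand {axis: [values]} into a list of {axis: value} configurations."""
    names = list(axes)
    return [
        dict(zip(names, combo))
        for combo in product(*(axes[name] for name in names))
    ]


# Hypothetical axes and versions, chosen only for illustration.
axes = {
    "compiler": ["gcc-3.4", "gcc-4.1"],
    "db": ["mysql", "postgres"],
    "jvm": ["1.5", "1.6"],
}

configs = expand_matrix(axes)
print(len(configs))
print(configs[0])
```

The number of configurations is the product of the axis sizes, so the matrix grows quickly; that growth is one reason a shared pool of many machines helps here.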
Upshot
• This is all work we’d like to help this community do.
• Start small: automated builds are an excellent start.
• Think big: what kinds of testing would pay dividends?
• Let us know what we can do to help make it happen.