Keeping Your Software Ticking

Discover the importance of testing in ensuring the quality of grid software and why testing distributed software stacks can be challenging. Learn about Metronome, a framework and tool that provides explicit and well-controlled build/test environments for reproducibility. Explore the power of the NMI Lab, a facility built to use distributed resources, and how it can help you improve your software testing processes.


Presentation Transcript


  1. Keeping Your Software Ticking: Testing with Metronome and the NMI Lab

  2. Background: Why (In a Slide!)
     • Grid Software: Important to Science and Industry
     • Quality of Grid Software: Not So Much
     • Testing: Key to Quality
     • Testing Distributed Software: Hard
     • Testing Distributed Software Stacks: Harder
     • Distributed Software Testing Tools: Nonexistent (before)
     We Needed Help, We Built Something to Help Ourselves and Our Friends, We Think It Can Help Others

  3. Background: What (In a Slide!)
     • A Framework and Tool: Metronome
        • Lightweight, built atop Condor, DAGMan, and other proven distributed computing tools
        • Portable, open source
        • Language/harness independent
        • Assumes >1 user, >1 project, >1 environment needing resources at >1 site.
        • Encourages explicit, well-controlled build/test environments for reproducibility
        • Central results repository
        • Fault-tolerant
        • Encourages build/test separation
     • A Facility: The NMI Lab
        • 200+ cores, 50+ platforms @ UW (Noah’s Ark; the Anti-Cluster)
        • Built to use distributed resources at other sites, grids, etc.
        • 200 users, dozens of registered projects (most of them “real”)
        • 84k builds & tests managed by 1M Condor jobs, producing 6.5M tracked tasks in the DB
     • A Team
        • Subset of Condor Team: Becky Gietzel, Todd Miller, Ross Oldenburg, myself. (More coming.)
     • A Community
        • Working with TeraGrid, OSG, ETICS, others towards a common intl. build/test infrastructure.

  4. Metronome Architecture (In a Slide!)
     [Diagram. INPUT: spec file, customer source code, customer build/test scripts. Components: Metronome, DAG, DAGMan, Condor queue, distributed build/test pool (build/test jobs out, results back). OUTPUT: web status pages, finished binaries, MySQL results DB.]
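
     The architecture slide describes a spec file being turned into a DAG of build/test jobs whose results are collected at the end. As a rough illustration of that idea only, here is a minimal Python sketch that builds a dependency graph and derives a valid run order. The task names (`fetch_source`, `build_<platform>`, `test_<platform>`, `publish_results`) and the platform labels are hypothetical, and this is not Metronome's spec format or DAGMan's input syntax.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_dag(platforms):
    """Map each (hypothetical) task name to the set of tasks it depends on."""
    dag = {"fetch_source": set()}
    for platform in platforms:
        dag[f"build_{platform}"] = {"fetch_source"}
        dag[f"test_{platform}"] = {f"build_{platform}"}
    # Results are published only after every platform's tests have finished.
    dag["publish_results"] = {f"test_{platform}" for platform in platforms}
    return dag


def run_order(dag):
    """Return one ordering in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    # Platform labels are placeholders, not NMI Lab platform names.
    for task in run_order(build_dag(["linux", "macos"])):
        print(task)
```

     In a real deployment the scheduler would run independent branches (one per platform) in parallel rather than in a single linear order; the sketch only shows the dependency structure.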

  5. Why Is This Architecture Powerful?
     • Fault tolerance, resource management.
     • Real scheduler, not a toy or afterthought.
     • Flexible workflow tools.
     • Nothing to deploy in advance on worker nodes except Condor
        • can harness “unprepared” resources.
     • Advanced job migration capabilities
        • critical for goal of a common build/test infrastructure across projects, sites, countries.

  6. Example: NMI Lab / ETICS Site Federation with Condor-C

  7. 10k Foot View
     • Past:
        • humble beginnings, ragtag crew of developers making building & testing easier for the projects around them (Condor, Globus, VDT, TeraGrid...)
     • Present:
        • now we have tax money and users should have higher expectations
        • good news: six months into a new 3y funding cycle, our "professionalism" has improved from our humble beginnings -- better hardware, better processes, better staffing
        • bad news: we’re still a bit ragtag -- inconsistent support/development request tracking, inconsistent info on resource/lab improvements, issues, and resolution, generally reactive to problems
        • we're clearly contributing to the build & test capabilities of the community, but we’d like to deliver much more, especially WRT testing.

  8. 10k Foot View: Future
     • Maintain Metronome and the NMI Lab
        • continue to professionalize lab infrastructure, improve availability, stability, uptime
        • Better monitoring -> more proactive response to issues
        • Better scheduling of jobs, better use of VMs to respond to uneven x86 platform demand
     • Enhance Metronome and the NMI Lab
        • New features, new capabilities – but might be less important than clarity, usability, fit & finish of existing features.

  9. 10k Foot View: Future
     • Support Metronome and the NMI Lab
        • more systematic support operation (ticketing, etc.)
        • more utilization of basic testing capabilities by new users
        • more utilization of advanced testing capabilities by existing users
        • more & better information for users, admins, and pointy-haired bosses
        • better reporting on users, resources, usage, operations, etc.
     • Nurture Distributed Software Testing Community
        • to identify common B&T needs to improve software quality.
        • to challenge and help us to provide software & services to help meet B&T needs.
        • Tuesday’s meeting was a good start, I hope…

  10. Maslow’s Pyramid of Testing Needs

  11. Testing Opportunities
     • more resources == more possibilities (just like science)
        • don’t just test under normal conditions, test the not-so-edge cases too (e.g., with CPU load!)
        • test everywhere your users run, not just where you develop
        • old/exotic/unique resources you don’t own (NMI Lab, TeraGrid)
     • “black box”
        • run your existing test harness (tinderbox, etc.) inside Metronome
     • decoupled builds & tests
        • run new tests on old builds
        • cross-platform binary compatibility testing
        • run quick smoke tests continuously, heavy tests nightly, performance/scalability tests before release
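
     The last point on this slide (smoke tests continuously, heavy tests nightly, performance/scalability tests before release) can be sketched as a small lookup from trigger to test suites. This is a minimal Python sketch under assumptions: the trigger names (`commit`, `nightly`, `release`) and suite names are hypothetical, and the choice to also run the lighter suites at the heavier triggers is an assumption, not something the slide states.

```python
# Hypothetical mapping: suite name -> triggers on which it runs.
# Assumption: lighter suites also run whenever heavier ones do.
SUITE_TRIGGERS = {
    "smoke": {"commit", "nightly", "release"},
    "heavy": {"nightly", "release"},
    "performance": {"release"},
}

KNOWN_TRIGGERS = {"commit", "nightly", "release"}


def suites_for(trigger):
    """Return the sorted suite names that should run for a given trigger."""
    if trigger not in KNOWN_TRIGGERS:
        raise ValueError(f"unknown trigger: {trigger!r}")
    return sorted(
        name for name, triggers in SUITE_TRIGGERS.items() if trigger in triggers
    )


if __name__ == "__main__":
    for trigger in ("commit", "nightly", "release"):
        print(trigger, suites_for(trigger))
```

     Keeping the policy in one table makes it easy to move a suite between tiers as it gets faster or slower, without touching the harness that runs it.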

  12. Testing Opportunities
     • managed (static) vs. “unmanaged” (auto-updating) platforms
        • isolate your changes from the OS vendors
        • test your changes against a fixed target
        • test your working code against a moving target
     • root-level testing
     • automated reports from testing tools
        • Valgrind, Purify, Coverity, etc.
     • cross-platform binary testing (build on A, test on B)

  13. Testing Opportunities
     • Parameterized dependencies
        • build with multiple library versions, compilers, etc.
        • test against every Java VM, Maven, Ant version around
        • test against different DBs (MySQL, Postgres, Oracle, etc.), VM platforms (Xen, VMware, etc.), batch systems
        • make sure new versions of Condor, Globus, etc. don’t break your code
     • Parallel scheduled testbeds
        • cross-platform testing (A to B)
        • deploy software stack across many hosts, test whole stack
        • multi-site testing (US to Europe)
        • network testing (cross-firewall, low-bandwidth, etc.)
        • scalability testing
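
     The "parameterized dependencies" idea on this slide amounts to expanding a few parameter axes into one build/test configuration per combination. A minimal Python sketch, assuming hypothetical axis names and version strings; none of these values come from Metronome, and this is not its spec format.

```python
from itertools import product


def expand_matrix(parameters):
    """Return one dict per combination of the given parameter values.

    `parameters` maps an axis name to the list of values to try.
    Axes are processed in sorted-name order so the output is deterministic.
    """
    names = sorted(parameters)
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


if __name__ == "__main__":
    # Axis names and versions below are illustrative placeholders.
    matrix = expand_matrix({
        "compiler": ["gcc-4.1", "gcc-3.4"],
        "database": ["mysql", "postgres", "oracle"],
        "jvm": ["1.5", "1.6"],
    })
    print(len(matrix), "configurations")
    for config in matrix:
        print(config)
```

     The number of configurations is the product of the axis sizes, so a full matrix grows quickly; in practice one would likely prune it to the combinations users actually run.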

  14. Upshot
     • This is all work we’d like to help this community do.
     • Start small -- automated builds are an excellent start.
     • Think big -- what kinds of testing would pay dividends?
     • Let us know what we can do to help make it happen.
