The NMI Build and Test Framework
How Condor Got Started in the Build/Test Business: Prehistory • Oracle shamed ^H^H^H^H^H^H inspired us. • The Condor team was in the stone age, producing modern software to help people reliably automate their computing tasks -- with our bare hands. • Every Condor release took weeks/months to do. • Build by hand on each platform, discover lots of bugs introduced since the last release, track them down, re-build, etc.
What Did Oracle Do? • Oracle selected Condor as the resource manager underneath their Automated Integration Management Environment (AIME) • Decided to rely on Condor to perform automated build and regression testing of multiple components for Oracle's flagship Database Server product. • Oracle chose Condor because they liked the maturity of Condor's core components.
Doh! • Oracle used distributed computing to automate their build/test cycle, with huge success. • If Oracle can do it, why can’t we? • Use Condor to build Condor! • NSF Middleware Initiative (NMI) • right initiative at the right time! • opportunity to collaborate with others to do for production software developers like Condor what Oracle was doing for themselves • important service to the scientific computing community
NMI Statement • Purpose – to develop, deploy and sustain a set of reusable and expandable middleware functions that benefit many science and engineering applications in a networked environment • Program encourages open source software development and development of middleware standards
Why should you care? From our experience, the functionality, robustness and maintainability of a production-quality software component depends on the effort involved in building, deploying and testing the component. • If it is true for a component, it is definitely true for a software stack • Doing it right is much harder than it appears from the outside • Most of us had very little experience in this area
Goals of the NMI Build & Test System • Design, develop and deploy a complete build system (HW and SW) capable of performing daily builds and tests of a suite of disparate software packages on a heterogeneous (HW, OS, libraries, …) collection of platforms • And make it: • Dependable • Traceable • Manageable • Portable • Extensible • Schedulable • Distributed
The Build Challenge • Automation - “build the component at the push of a button!” • always more to it than just configure & make • e.g., ssh to the “right” host; cvs checkout; untar; setenv, etc. • Reproducibility – “build the version we released 2 years ago!” • Well-managed & comprehensive source repository • Know your “externals” and keep them around • Portability – “build the component on nodeX.cluster.net!” • No dependencies on magic “local” capabilities • Understand your hardware & software requirements • Manageability – “run the build daily on 20 platforms and email me the outcome!”
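The manual steps above (ssh to the right host, check out the sources, unpack externals, set the environment, configure & make) are exactly what a "push-button" build driver scripts away. A minimal sketch in Python; the host names, CVS tag, and externals tarball are hypothetical placeholders, not names from the NMI system:

```python
# Sketch of a push-button build driver: the manual per-platform steps
# become one reproducible, scripted command sequence. All names here
# (hosts, tag, externals tarball) are illustrative placeholders.

def build_commands(platform, tag, externals):
    """Return the ordered shell commands for one platform's build."""
    host = {"x86_linux": "buildhost-linux",
            "sparc_sol": "buildhost-sol"}[platform]   # hypothetical hosts
    return [
        f"ssh {host}",                                # build on the *right* host
        f"cvs checkout -r {tag} condor",              # exact tagged sources -> reproducibility
        f"tar xzf {externals}",                       # pinned external dependencies
        "env PATH=/usr/local/bin:$PATH ./configure",  # no reliance on magic local setup
        "make",
    ]

# Rebuilding a two-year-old release is just replaying the same
# sequence with the old tag -- nothing depends on ambient machine state.
cmds = build_commands("x86_linux", "V6_6_0", "externals-V6_6_0.tar.gz")
```

Because every input (host, tag, externals) is an explicit parameter, the same driver handles "run the build daily on 20 platforms" by iterating over platform specs.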
The Testing Challenge • All the same challenges as builds (automation, reproducibility, portability, manageability), plus: • Flexibility • “test our RHEL4 binaries on RHEL5!” • “run our new tests on our old binaries” • Important to decouple build & test functions • making tests just a part of a build -- instead of an independent step -- makes it difficult/impossible to: • run new tests against old builds • test one platform’s binaries on another platform • run different tests at different frequencies
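The decoupling argument above can be sketched in a few lines: builds deposit their output into a registry keyed by version and build platform, and test runs pick any recorded artifact and any test platform independently. The names and paths are illustrative, not the NMI implementation:

```python
# Sketch of decoupled build/test: builds record artifacts keyed by
# (version, build platform); test runs consume any artifact on any
# test platform. All names/paths are illustrative placeholders.

artifacts = {}  # (version, build_platform) -> path to binaries

def record_build(version, build_platform, path):
    """A build's final step: publish its output for later test runs."""
    artifacts[(version, build_platform)] = path

def run_tests(version, build_platform, test_platform, suite):
    """Run 'suite' on test_platform against a previously built artifact."""
    path = artifacts[(version, build_platform)]  # old builds stay available
    return f"{suite} on {test_platform} using {path}"

record_build("7.0.1", "rhel4", "/store/condor-7.0.1-rhel4.tar.gz")
# New tests against an old build, and one platform's binaries
# exercised on another platform:
r1 = run_tests("7.0.1", "rhel4", "rhel5", "full-suite")
```

Because the test step takes the build artifact as an explicit input rather than running inside the build, "test our RHEL4 binaries on RHEL5" and "run new tests on old binaries" are both just different arguments.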
“Eating Our Own Dogfood” • What Did We Do? • We built the NMI Build & Test Lab on top of Condor, DAGMan, and other distributed computing technologies to automate the build, deploy, and test cycle. • To support it, we’ve had to construct and manage a dedicated, heterogeneous distributed computing facility. • Opposite extreme from typical “cluster” -- instead of 1000’s of identical CPUs, we have a handful of CPUs for each of ~40 platforms. • Much harder to manage! You try finding a nifty system/network/cluster admin tool that works on 40 platforms! • We’re JABCU (just another big Condor user) • If Condor sucks, we feel the pain.
How does grid s/w help? • Build & Test jobs are a lot like scientific computing jobs. Same problems... • Resource management • Advertising machine capabilities (hw, OS, installed software, config, etc.) • Advertising job requirements (hw, OS, prereq software, config, etc.) • Matchmaking substitution -- replacing dynamic parameters in build (e.g., available ports to use) with specifics of matched machine • Fault tolerance & reliable job results reporting! • never ever ever have to "babysit" a build or test to deal with external failures -- submit & forget until done, even if the network goes down or the machine reboots • garbage collection -- we never have to clean up processes or disk droppings after a misbehaving build • DAGMan! • make dependencies explicit in a DAG, and get the same fault tolerance & reliability • Data management, file xfer, etc. • no shared filesystem! -- we need to make sure the build/test node gets the files it needs from the submit machine, and gets the results back • Authentication • "gateway to the grid" -- grid resource access • in theory we can build/test on any remote grid using resources we don't manage (e.g., ANL, OMII, SDSC, NCSA machines)
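The "make dependencies explicit in a DAG" point looks like this in practice. A hypothetical DAGMan input file for one platform's build/test run (the node names and submit-file names are placeholders); DAGMan's PARENT/CHILD lines encode the dependencies, and it supplies the retry and continue-on-failure behavior described above:

```
# Hypothetical DAGMan input file: each step is a node with its own
# Condor submit file; PARENT/CHILD lines make dependencies explicit.
JOB fetch    fetch.sub
JOB build    build.sub
JOB test     test.sub
JOB report   report.sub
PARENT fetch  CHILD build
PARENT build  CHILD test
PARENT test   CHILD report
RETRY build 2
```

Submitting this with condor_submit_dag gives the submit-and-forget behavior: if a node fails transiently, DAGMan retries it without human babysitting.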
NMI Build & Test Facility (architecture diagram) • INPUT: customer source code, spec file, and customer build/test scripts • The NMI Build & Test software turns the spec file into a DAGMan DAG and submits build/test jobs to the Condor queue, which runs them on the distributed build/test pool • OUTPUT: results flow into a MySQL results DB and the web portal, along with finished binaries
Numbers • 100+ CPUs • 40+ HW/OS “platforms” • 34+ OSes • 9 HW architectures • 3 sites • ~100 GB of results per day • ~1400 builds/tests per month • ~350 Condor jobs per day
Condor Build & Test • Automated Condor Builds • Two (sometimes three) separate Condor versions, each automatically built using NMI on 13-17 platforms nightly • Stable, developer, special release branches • Automated Condor Tests • Each nightly build’s output becomes the input to a new NMI run of our full Condor test suite • Ad-Hoc Builds & Tests • Each Condor developer can use NMI to submit ad-hoc builds & tests of their experimental workspaces or CVS branches to any or all platforms
More Condor Testing Work • Advanced Test Suite • Using binaries from each build, we deploy an entire self-contained Condor pool on each test machine • Runs a battery of Condor jobs and tests to verify critical features • Currently >150 distinct tests • each executed for each build, on each platform, for each release, every night • Flightworthy Initiative • Ensuring continued “core” Condor scalability, robustness • NSF funded, like NMI • Producing new tests all the time
NMI Build & Test Customers • NMI Build & Test Facility was built to serve all NMI projects • Who else is building and testing? • Globus • NMI Middleware Distribution • many “grid” tools, including Condor & Globus • Virtual Data Toolkit (VDT) for the Open Science Grid (OSG) • 40+ components • Soon TeraGrid, NEESgrid, others…
Recent Experience: SRB Client • Storage Resource Broker (SRB) • work done by Wayne Schroeder @ SDSC • started gently; it took a little while for Wayne to warm up to the system • ran into a few problems with bad matches before mastering how we use prereqs • Our challenge: better docs, better error messages • Wayne emailed Tolya with questions; Tolya responded “to shed some more general light on the system and help avoid or better debug such problems in the future” • soon he got pretty comfortable with the system • moved on to write his own glue scripts • expanded builds to 34 platforms (!)
SRB Client • But… couldn't get HP/UX build to work • at first we all thought it was a B&T system problem • once we looked closer Wayne realized that SRB in fact would not build there, so it was informative • Now with “one button” Wayne can test his SRB client build any time he wants, on 34 platforms, with no babysitting.
Build & Test Beyond NMI • We want to integrate with other, related software quality projects, and share build/test resources... • an international (US/Europe/China) federation of build/test grids… • Offer our tools as the foundation for other B&T systems • Leverage others’ work to improve our own B&T service
OMII-UK • Integrating software from multiple sources • Established open-source projects • Commissioned services & infrastructure • Deployment across multiple platforms • Verify interoperability between platforms & versions • Automatic Software Testing vital for the Grid • Build Testing – cross-platform builds • Unit Testing – local verification of APIs • Deployment Testing – deploy & run package • Distributed Testing – cross-domain operation • Regression Testing – compatibility between versions • Stress Testing – correct operation under real loads • Distributed Testbed • Need breadth & variety of resources, not raw power • Needs to be a managed resource – process
Next: ETICS — partner contributions • Build system, software configuration, service infrastructure, dissemination, EGEE, gLite, project coordination • Software configuration, service infrastructure, dissemination • NMI Build & Test Framework, Condor, distributed testing tools, service infrastructure • Web portals and tools, quality process, dissemination, DILIGENT • Test methods and metrics, unit testing tools, EBIT
ETICS Project Goals • ETICS will provide a multi-platform environment for building and testing middleware and applications for major European e-Science projects • “Strong point is automation: of builds, of tests, of reporting, etc. The goal is to simplify life when managing complex software management tasks” • One button to generate finished package (e.g., RPMs) for any chosen component • ETICS is developing a higher-level web service and DB to generate B&T jobs -- and use multiple, distributed NMI B&T Labs to execute & manage them • This work complements the existing NMI Build & Test system and is something we want to integrate & use to benefit other NMI users!
OMII-Japan • What They’re Doing • “…provide service which can use on-demand autobuild and test systems for Grid middlewares on on-demand virtual cluster. Developers can build and test their software immediately by using our autobuild and test systems” • Underlying B&T Infrastructure is NMI Build & Test Software
This was a Lot of Work… But It Got Easier Each Time • Deployments of the NMI B&T Software with international collaborators taught us how to export Build & Test as a service. • Tolya Karp: International B&T Hero • Improved (i.e., wrote) NMI install scripts • Improved configuration process • Debugged and solved a myriad of details that didn’t work in new environments
What We Don’t Do Well • Documentation • much better than ~6 months ago, but still incomplete • most existing users were walked through the system in person, and given lots of support • Submission/Specification API • we’re living comfortably in the 80’s: all command-line, all the time • we hope ETICS will improve this!
New Condor+NMI Users • Yahoo • First industrial user to deploy NMI B&T Framework to build/test custom Condor contributions • Hartford Financial • Deploying it as we speak…
What’s to Come • More US & international collaborations • More Industrial User/Developers… • New Features • Becky Gietzel: parallel testing! • Major new feature: multiple co-scheduled resources for individual tests • Going beyond multi-platform testing to cross-platform parallel testing • UW-Madison B&T Lab: ever more platforms • “it’s time to make the doughnuts” • Questions?