Condor Build & Test: NMI, OMII, ETICS
How the Condor Team Got Started in the Build/Test Business: Prehistory • Oracle shamed^H^H^H^H^H^Hinspired us. • The Condor team was in the stone age, producing modern software to help people reliably automate their computing tasks -- with our bare hands. • Every Condor release took weeks/months to do. • Build by hand on each platform, discover lots of bugs introduced since the last release, track them down, re-build, etc.
What Did Oracle Do? • Oracle selected Condor as the resource manager underneath their Automated Integration Management Environment (AIME) • AIME relied on Condor to perform automated build and regression testing of multiple components for Oracle's flagship Database Server product. • Oracle chose Condor because they liked the maturity of Condor's core components.
Doh! • Oracle used distributed computing to automate their build/test cycle, with huge success. • If Oracle can do it, why can’t we? • Use Condor to build Condor! • NSF Middleware Initiative (NMI) • the right initiative at the right time! • an opportunity to collaborate with others to do for production software developers like Condor what Oracle was doing for themselves • an important service to the scientific computing community
NMI Statement • Purpose – to develop, deploy and sustain a set of reusable and expandable middleware functions that benefit many science and engineering applications in a networked environment • Program encourages open source software development and development of middleware standards
Why should you care? • In our experience, the functionality, robustness and maintainability of a production-quality software component depend on the effort involved in building, deploying and testing the component. • If it is true for a component, it is definitely true for a software stack • Doing it right is much harder than it appears from the outside • Most of us had very little experience in this area
Goals of the NMI Build & Test System • Design, develop and deploy a complete build system (HW and SW) capable of performing daily builds and tests of a suite of disparate software packages on a heterogeneous (HW, OS, libraries, …) collection of platforms • And make it: • Dependable • Traceable • Manageable • Portable • Extensible • Schedulable
The Build Challenge • Automation – “build the component at the push of a button!” • there is always more to it than just “configure” & “make” • e.g., ssh to the right host; cvs checkout; untar; setenv, etc. • Reproducibility – “build the version we released 2 years ago!” • Well-managed & comprehensive source repository • Know your “externals” and keep them around • Portability – “build the component on nodeX.cluster.com!” • No dependencies on “local” capabilities • Understand your hardware & software requirements • Manageability – “run the build daily on 15 platforms and email me the outcome!”
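The automation and reproducibility points above can be sketched as a tiny build driver. This is a minimal illustration under stated assumptions, not the NMI Build & Test software: `BuildSpec`, `BuildResult` and `run_build` are hypothetical names, the spec format is invented for the example, and real steps (cvs checkout, untar, setenv, make) are replaced by portable Python one-liners. What it shows is the idea: a build is a recorded, ordered list of steps plus pinned externals, run with no human in the loop and stopped at the first failure.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class BuildSpec:
    """Hypothetical build spec (an assumption, not the NMI spec-file format)."""
    component: str
    externals: dict[str, str]   # external name -> exact pinned version
    steps: list[list[str]]      # ordered argv lists, e.g. checkout, configure, make


@dataclass
class BuildResult:
    ok: bool
    log: list[str] = field(default_factory=list)


def run_build(spec: BuildSpec) -> BuildResult:
    """Run each step in order and stop at the first failing one."""
    result = BuildResult(ok=True)
    # Record the pinned externals up front so the build can be reproduced later.
    for name, version in sorted(spec.externals.items()):
        result.log.append(f"external {name}=={version}")
    for number, argv in enumerate(spec.steps, start=1):
        proc = subprocess.run(argv, capture_output=True, text=True)
        result.log.append(f"step {number}: exit {proc.returncode}")
        if proc.returncode != 0:
            result.ok = False
            break
    return result


# Usage: the second step simulates a failing "make"; the third never runs.
py = sys.executable
spec = BuildSpec(
    component="example-component",   # hypothetical component name
    externals={"zlib": "1.2.3"},     # hypothetical pinned external
    steps=[
        [py, "-c", "print('configure')"],
        [py, "-c", "import sys; sys.exit(2)"],
        [py, "-c", "print('not reached')"],
    ],
)
result = run_build(spec)
print(result.ok, result.log)
```

Because the log records both the externals and every step's exit status, the same record answers "what happened last night?" (manageability) and "what exactly went into that build?" (reproducibility).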
The Testing Challenge • All the same challenges as builds (automation, reproducibility, portability, manageability), plus: • Flexibility • “test our RHEL4 binaries on RHEL5!” • “run our new tests on our old binaries” • important to decouple build & test functions • making tests just a part of a build -- instead of an independent step -- makes it difficult/impossible to: • run new tests against old builds • test one platform’s binaries on another platform • run different tests at different frequencies
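The decoupling argued for above can be sketched as a test planner that treats finished builds as stored inputs rather than as the last stage of a build. Everything here is hypothetical (`Build`, `TestRun`, `COMPATIBLE`, `suites_due`, `plan_tests`, the platform names and the schedule); the sketch only shows the three capabilities the slide lists: testing one platform's binaries on another, running any suite against any stored (including old) build, and running suites at different frequencies.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Build:
    build_id: str   # identifier of a stored, finished build (hypothetical)
    platform: str   # platform the binaries were built on


@dataclass(frozen=True)
class TestRun:
    build_id: str   # which stored build to test
    run_on: str     # platform the tests execute on (may differ from the build platform)
    suite: str      # which test suite to run


# Hypothetical compatibility table: binaries built on a key may run on each listed platform.
COMPATIBLE = {
    "rhel4": ["rhel4", "rhel5"],
    "rhel5": ["rhel5"],
}


def suites_due(day: int, schedule: dict[str, int]) -> list[str]:
    """Return the suites whose frequency (every N days) falls on this day."""
    return [suite for suite, every_n in schedule.items() if day % every_n == 0]


def plan_tests(builds: list[Build], suites: list[str]) -> list[TestRun]:
    """Cross each stored build with each compatible platform and each due suite."""
    runs = []
    for build in builds:
        targets = COMPATIBLE.get(build.platform, [build.platform])
        for run_on, suite in product(targets, suites):
            runs.append(TestRun(build.build_id, run_on, suite))
    return runs


# Usage: the RHEL4 build is also tested on RHEL5, and the weekly "full" suite
# joins the daily "quick" suite only on day 7.
schedule = {"quick": 1, "full": 7}
builds = [Build("b100", "rhel4"), Build("b200", "rhel5")]
day3 = plan_tests(builds, suites_due(3, schedule))
day7 = plan_tests(builds, suites_due(7, schedule))
print(len(day3), len(day7))
```

Since a test run only references a `build_id`, adding a new suite to the schedule automatically applies it to every build that is still stored, which is exactly what folding tests into the build step would prevent.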
“Eating Our Own Dogfood” • What Did We Do? • We built the NMI Build & Test Lab on top of Condor, DAGMan, and other distributed computing technologies to automate the build, deploy, and test cycle. • To support it, we’ve had to construct and manage a dedicated, heterogeneous distributed computing facility. • Opposite extreme from typical “cluster” -- instead of 1000’s of identical CPUs, we have a handful of CPUs each for ~40 platforms. • Much harder to manage! You try finding a sysadmin tool that works on 40 platforms! • We’re just another big Condor user • If Condor sucks, we feel the pain.
NMI Build & Test Facility (architecture diagram) • INPUT: spec file, customer source code, customer build/test scripts • NMI Build & Test Software turns the spec file into a DAGMan DAG • DAGMan sends build/test jobs through the Condor queue to the distributed build/test pool • OUTPUT: results (stored in a MySQL results DB and shown on a web portal) and finished binaries
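The "spec file → DAG → Condor queue" step in the diagram can be illustrated with a small generator of DAGMan input. This is a sketch under assumptions: the node names, the `*.sub` submit-file names, the retry count and the one-build-then-one-test shape per platform are all hypothetical and much simpler than whatever DAG the NMI software actually produces. Only the DAGMan keywords `JOB`, `PARENT ... CHILD` and `RETRY` are real.

```python
def make_dag(platforms: list[str], retries: int = 2) -> str:
    """Build DAGMan input text: for each platform, a build node followed by a test node.

    Node and submit-file names are hypothetical; each *.sub file would hold an
    ordinary Condor submit description for that step.
    """
    lines = []
    for p in platforms:
        lines.append(f"JOB build_{p} build_{p}.sub")
        lines.append(f"JOB test_{p} test_{p}.sub")
        # The test node may only start once its platform's build node succeeds.
        lines.append(f"PARENT build_{p} CHILD test_{p}")
        # Resubmit a failed build node a few times before giving up.
        lines.append(f"RETRY build_{p} {retries}")
    return "\n".join(lines) + "\n"


# Hypothetical platform labels.
dag_text = make_dag(["x86_rhel_5", "ppc_macos_10"])
print(dag_text)
```

Saved to a file, text like this is what `condor_submit_dag` takes as input; DAGMan then submits each node's job to the Condor queue once its parents have completed.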
Numbers • 100 CPUs • 39 HW/OS “Platforms” • 34 OSes • 9 HW architectures • 3 sites • ~100 GB of results per day • ~1400 builds/tests per month • ~350 Condor jobs per day
Condor Build & Test • Automated Condor Builds • Two (sometimes three) separate Condor versions, each automatically built using NMI on 13-17 platforms nightly • Stable, developer, special release branches • Automated Condor Tests • Each nightly build’s output becomes the input to a new NMI run of our full Condor test suite • Ad-Hoc Builds & Tests • Each Condor developer can use NMI to submit ad-hoc builds & tests of their experimental workspaces or CVS branches to any or all platforms
More Condor Testing Work • Advanced Test Suite • Using binaries from each build, we deploy an entire self-contained Condor pool on each test machine • Runs a battery of Condor jobs and tests to verify critical features • Currently >150 distinct tests • each executed for each build, on each platform, for each release, every night • Flightworthy Initiative • Ensuring continued “core” Condor scalability, robustness • NSF funded, like NMI • Producing new tests all the time
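For a rough sense of scale, combine the test count on this slide with the nightly build matrix on the previous one (two or three branches, 13-17 platforms). Assuming each test runs exactly once per build per platform, the Condor test suite alone accounts for at least

$$150 \times 13 \times 2 = 3{,}900 \qquad\text{and}\qquad 150 \times 17 \times 3 = 7{,}650$$

test executions per night at the small and large ends of that matrix; since the suite has more than 150 tests, both figures are lower bounds.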
NMI Build & Test Customers • NMI Build & Test Facility was built to serve all NMI projects • Who else is building and testing? • Globus • NMI Middleware Distribution • many “grid” tools, including Condor & Globus • Virtual Data Toolkit (VDT) for the Open Science Grid (OSG) • 40+ components • Soon TeraGrid, NEESgrid, others…
Build & Test Beyond NMI • We want to integrate with other, related software quality projects, and share build/test resources... • an international (US/Europe/China) federation of build/test grids… • Offer our tools as the foundation for other B&T systems • Leverage others’ work to improve our own B&T service
OMII-UK • Integrating software from multiple sources • Established open-source projects • Commissioned services & infrastructure • Deployment across multiple platforms • Verify interoperability between platforms & versions • Automatic software testing is vital for the Grid • Build Testing – Cross-platform builds • Unit Testing – Local verification of APIs • Deployment Testing – Deploy & run package • Distributed Testing – Cross-domain operation • Regression Testing – Compatibility between versions • Stress Testing – Correct operation under real loads • Distributed Testbed • Needs breadth & variety of resources, not raw power • Needs to be a managed resource – process
NMI/OMII-UK Collaboration • Phase I: OMII-UK developed automated builds & tests using the NMI Build & Test Lab at UW-Madison • Phase II: OMII-UK deployed their own instance of the NMI Build & Test Lab at Southampton University • Our lab at UW-Madison is well and good, but some collaborators want/need their own local facilities. • Phase III (in progress): Move jobs freely between UW and OMII-UK B&T labs as needed.
Next: ETICS. Partner contributions: • Build system, software configuration, service infrastructure, dissemination, EGEE, gLite, project coordination • Software configuration, service infrastructure, dissemination • NMI Build & Test Framework, Condor, distributed testing tools, service infrastructure • Web portals and tools, quality process, dissemination, DILIGENT • Test methods and metrics, unit testing tools, EBIT
ETICS Project Goals • ETICS will provide a multi-platform environment for building and testing middleware and applications for major European e-Science projects • “Strong point is automation: of builds, of tests, of reporting, etc. The goal is to simplify life when managing complex software management tasks” • One button to generate finished package (e.g., RPMs) for any chosen component • ETICS is developing a higher-level web service and DB to generate B&T jobs -- and use multiple, distributed NMI B&T Labs to execute & manage them • This work complements the existing NMI Build & Test system and is something we want to integrate & use to benefit other NMI users!
OMII-Japan • What They’re Doing • “…provide service which can use on-demand autobuild and test systems for Grid middlewares on on-demand virtual cluster. Developers can build and test their software immediately by using our autobuild and test systems” • Underlying B&T Infrastructure is NMI Build & Test Software
This was a Lot of Work… But It Got Easier Each Time • Deployments of the NMI B&T Software with international collaborators taught us how to export Build & Test as a service. • Tolya Karp: International B&T Hero • Improved (i.e., wrote) NMI install scripts • Improved configuration process • Debugged and solved a myriad of details that didn’t work in new environments
What This Means For You • NMI B&T Lab Deployment Experience + Improved Packaging + Improved Portability… • We now have a unique ability to give you not only source code, but a whole production build & test infrastructure to go along with it • … and we have done it for a number of users already
New Condor+NMI Users • Yahoo • First industrial user to deploy NMI B&T Framework to build/test custom Condor contributions • Hartford Financial • Deploying it as we speak…
What’s to Come • More US & international collaborations • OMII-Europe • More Industrial User/Developers… • New Features • Becky Gietzel: parallel testing! • Major new feature: multiple co-scheduled resources for individual tests • Going beyond multi-platform testing to cross-platform parallel testing • UW-Madison B&T Lab: ever more platforms • “it’s time to make the doughnuts” • Questions?