Attacking Intermittent Failures: Mozilla's War on Orange Mark Côté Software Artisan Mozilla Automation & Tools
Overview • Background: • Intermittent failures • Mozilla's automated-testing infrastructure • The War on Orange: • Data sources • Metric • Basic UI
Overview • The War on Orange, continued: • Advanced UI • Actions • Conclusion
Background – Intermittent failures • In any testing system of sufficient size, errors crop up that • occur infrequently but consistently • are not reliably reproducible • cannot be tied to a particular changeset • These are known as “intermittent failures”, and they're awful.
Background – buildbot • Buildbot: • Complex continuous-integration system • Triggered by commits to mozilla-central and other key branches • A few hundred builder slaves • Around a thousand tester slaves • Hundreds of thousands of tests executed against each build
Background – tbpl • Tinderbox Push Log (tbpl) presents buildbot results • Results are colour coded: • Green: passes • Red: fatal errors, including crashes • Orange: nonfatal errors • Blue: test restarted due to infrastructure error • Purple: unrecoverable infrastructure error
Background – Intermittent oranges • When an orange or a red occurs, the changeset is usually backed out... • Except that the orange might indicate an intermittent failure. • Intermittent failures are “starred” and marked with a comment, usually a Bugzilla bug ID.
Background – Intermittent oranges • Starring updates the Bugzilla bug with a comment about the occurrence. • Ultimately it has to be done by a human. • For the rest of this presentation, we refer to an “intermittent orange” as just an “orange”.
The War on Orange • Predictably, more and more oranges occurred over time • We had no way to know even how many oranges were occurring, let alone any of their characteristics • We needed a system to track oranges over time and extract data about their occurrences
The War on Orange • We created a web tool, known as both the War on Orange (WOO) and OrangeFactor (OF) • Rich HTML/CSS/JS client • Python back-end powered by web.py • Assorted Python helper scripts and modules
The War on Orange – Data sources • Using two distinct sources of data means they sometimes fall out of sync • This is noted on the UI • Could fall back to orange data, but it isn't completely accurate
The War on Orange – Metric • Basic metric is referred to as the “orange factor” (OF) • The orange factor is the ratio of oranges to test runs in a given period of time • An OF of 5 means 5 oranges per test run, on average • Ideal OF is 0!
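The ratio described on this slide can be written as a one-line calculation. The following is a minimal sketch, not the tool's actual code; the function name `orange_factor` and its parameters are hypothetical names chosen for illustration.

```python
def orange_factor(orange_count: int, test_run_count: int) -> float:
    """Return the average number of oranges per test run in a period.

    Both arguments are plain counts for the same period of time.
    The names are illustrative assumptions, not OrangeFactor's real API.
    """
    if test_run_count <= 0:
        raise ValueError("test_run_count must be positive")
    return orange_count / test_run_count


# 50 oranges over 10 test runs: 5 oranges per test run, on average.
print(orange_factor(50, 10))
```

A period with no oranges at all gives the ideal value of 0.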
The War on Orange – Advanced UI • Data is great; information is (way) better • Implement some of the common analyses • JSON data available via web API for further analysis
The War on Orange – Advanced UI • Orange Seed: estimate when an orange was introduced • Calculate average interval between occurrences • Extrapolate to point in the past • Estimate probable range based on interval variance
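One possible reading of the Orange Seed steps is sketched below. It assumes the orange was introduced roughly one average interval before its first recorded occurrence, and uses the standard deviation of the intervals as the probable range. Both choices are assumptions made for illustration, as is the function name `estimate_seed`; the slides do not give the exact formula.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def estimate_seed(occurrences):
    """Estimate when an orange was introduced from its occurrence times.

    Returns (earliest, estimate, latest) as datetimes.
    Assumption: the seed lies about one mean interval before the first
    occurrence, plus or minus one standard deviation of the intervals.
    """
    times = sorted(occurrences)
    if len(times) < 3:
        # Two intervals are the minimum needed to compute a spread.
        raise ValueError("need at least three occurrences")

    # Intervals between consecutive occurrences, in seconds.
    intervals = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    avg = mean(intervals)
    spread = stdev(intervals)

    # Extrapolate one average interval back from the first occurrence.
    estimate = times[0] - timedelta(seconds=avg)
    earliest = estimate - timedelta(seconds=spread)
    # The orange cannot have been introduced after it was first seen.
    latest = min(estimate + timedelta(seconds=spread), times[0])
    return earliest, estimate, latest


# Made-up occurrence dates, for illustration only.
seen = [datetime(2011, 1, 10), datetime(2011, 1, 12), datetime(2011, 1, 16)]
earliest, estimate, latest = estimate_seed(seen)
print(earliest, estimate, latest)
```

A wider spread between occurrences produces a wider probable range, which matches the intuition that irregular failures are harder to date.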
The War on Orange – Taking Action • Augment passive data interface with active alerts • Maintains project visibility and focus • Weekly progress reports • Notifications of significant events • Large increases/decreases in OF, new oranges
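The two kinds of significant event named on this slide could be detected by comparing one period with the previous one. This is a rough sketch under stated assumptions: the 25% threshold, the function names `of_change_alert` and `new_oranges`, and the bug IDs are all hypothetical and are not taken from the actual tool.

```python
def of_change_alert(previous_of, current_of, threshold=0.25):
    """Classify the change in orange factor between two periods.

    Returns "increase", "decrease", or None when the relative change
    stays below the threshold. The 25% default is an arbitrary
    illustrative value.
    """
    if previous_of == 0:
        # Any oranges after a clean period count as an increase.
        return "increase" if current_of > 0 else None
    change = (current_of - previous_of) / previous_of
    if change >= threshold:
        return "increase"
    if change <= -threshold:
        return "decrease"
    return None


def new_oranges(previous_bugs, current_bugs):
    """Return bug IDs starred in the current period but not the previous one."""
    return sorted(set(current_bugs) - set(previous_bugs))


# Placeholder values, for illustration only.
print(of_change_alert(4.0, 6.0))
print(new_oranges([101, 102], [102, 103]))
```

A weekly report could run both checks over the last two weeks of data and list only the events that are not `None` or empty.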
Conclusion • Intermittent failures essentially unavoidable • Cannot solve the problem without data • Automatic analysis even better than tracking data • Notifications to maintain visibility and focus
Links • Application: • http://brasstacks.mozilla.com/orangefactor/ • Project page: • https://wiki.mozilla.org/Auto-tools/Projects/WarOnOrange • Mozilla Automation & Tools' home: • https://wiki.mozilla.org/Auto-tools