Graphics Stability

Steve Morrow
Software Design Engineer, WGGT
stevemor @ microsoft.com
Microsoft Corporation

Gershon Parent
Software Swordsman, WGGT
gershonp @ microsoft.com
Microsoft Corporation
Session Outline
• Stability Benchmark History
• CRASH (Comparative Reliability Analyzer for Software and Hardware)
  • The CRASH Tool
  • The CRASH Plan
  • The Experiments
• CDER (Customer Driver Experience Rating)
  • Program Background and Description
  • High-level Statistics of the Program
  • Factors Examined in the Crash Data
  • Normalized Ratings
  • Customer Experience and Loyalty
Stability Benchmark History
• WinHEC – May ’04
  • CRASH 1.0 released
  • Web portal has 52 non-MS members from 16 companies
• November ’04
  • CRASH 1.1 released to the web; includes DB backend
• December ’04
  • Stability Benchmark components ship to 8,000 customers and normalizable OCA data begins flowing in
  • CRASH Lab completes first data collection pass
  • Web portal has over 60 non-MS members from 17 companies
CRASH Tool
• CRASH is a new dynamic software-loading tool designed to expose and easily reproduce reliability defects in drivers and hardware
• Answers the call from IHVs and OEMs for more reliability test tools
• Enables a wide range of endurance/load/stress testing
  • Configurable load profiles
  • Scheduled cycling (starting and stopping) of test applications (see the sketch below)
  • Replay-ability
  • Automatic failure-cause determination
  • Scripting for multiple passes with different scenarios
  • Creation of a final “score”
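A minimal sketch of the scheduled-cycling idea, in Python for illustration only: the executable names, timings, and loop structure are assumptions and do not reflect CRASH's actual scheduler or profile format.

```python
import subprocess
import time

# Hypothetical test applications; CRASH drives its own test content via profiles.
TEST_APPS = ["d3d_load_app.exe", "video_playback_app.exe"]

def cycle_apps(run_seconds=60, idle_seconds=10, loops=3):
    """Start the test applications, let them run, stop them, idle, repeat."""
    for _ in range(loops):
        procs = [subprocess.Popen([app]) for app in TEST_APPS]  # start the load
        time.sleep(run_seconds)                                 # sustain the load
        for proc in procs:
            proc.terminate()                                    # stop the load
            proc.wait()
        time.sleep(idle_seconds)                                # idle between cycles

if __name__ == "__main__":
    cycle_apps()
```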
CRASH Demo
CRASH: 4-Phase Plan
• Phase 1
  • Produce CRASH documentation for review by partners
  • Release 1.0 to our partners for feedback
• Phase 2
  • Release 1.1 with database functionality to our partners
  • Execute controlled baseline experiments on a fixed set of HW and SW to evaluate the tool’s effectiveness
• Phase 3
  • Execute a series of experiments and use the results to increase the accuracy and usefulness of the tool
• Phase 4
  • Create a CRASH-based tool for release to a larger audience
Experiment 1 Objectives
• Determine whether the CRASH data collected is sufficient to draw meaningful conclusions about part/driver stability differences
• Determine how machine configuration affects stability
• Evaluate how the different scenarios relate to conclusions about stability
• Find the minimum data set needed to make meaningful conclusions about part/driver stability
• Create a “baseline” from which to measure future experiments
• Identify other dimensions of stability not exposed in the CRASH score
Experiment 1 Details
• Standardize on one late-model driver/part from four IHVs
  • Part/Driver A, Part/Driver B, Part/Driver C, Part/Driver D
• Test them across 12 different flavors of over-the-counter PCs from 4 OEMs
  • OEM A, OEM B, OEM C, OEM D
  • High end and low end
  • Include at least two motherboard types (MB Type 1, MB Type 2)
• Clean install of XP SP2 plus latest WHQL drivers (drivers snapped 8/16/04)
• Use the 36 hr benchmark profile shipped with CRASH 1.1
Important Considerations
• Results apply only to these Part/Driver/System combinations
• Extrapolation of these results to other parts, drivers, or systems is not possible with this data
CRASH Terminology
• Profile
  • Represents a complete “run” of the CRASH tool against a driver
  • Contains one or more scenarios
• Scenario
  • Describes a session of CRASH testing
    • Load intensity/profile
    • What tests will be used
    • How many times to run this scenario (loops)
• Score
  • Always a number that represents the percentage of the testing completed before a system failure (hang or kernel break)
CRASH Terminology: Failures
• Failure
  • Hang
    • No minidump found and loop did not complete
  • Targeted failure
    • Minidump auto-analysis found the failure was in the display driver
  • Non-targeted failure
    • Minidump analysis found the failure was not in the display driver
    • Does not count against the score (see the sketch below)
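A toy Python sketch of the failure taxonomy above and of a score computed as the fraction of testing completed before a hang or targeted failure. The field names and the exact aggregation are assumptions; they are not CRASH's output schema or scoring formula.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LoopResult:
    completed_fraction: float       # portion of the loop that finished, 0.0 to 1.0
    faulting_module: Optional[str]  # module named by minidump auto-analysis, None if no minidump

def classify(result: LoopResult, display_driver: str = "displaydrv.sys") -> str:
    """Apply the hang / targeted / non-targeted taxonomy described above."""
    if result.completed_fraction >= 1.0:
        return "pass"
    if result.faulting_module is None:
        return "hang"            # no minidump found and the loop did not complete
    if result.faulting_module == display_driver:
        return "targeted"        # failure attributed to the display driver
    return "non-targeted"        # real failure, but does not count against the score

def score(results: List[LoopResult]) -> float:
    """Percentage of testing completed before a hang or targeted failure."""
    counted = [r for r in results if classify(r) != "non-targeted"]
    if not counted:
        return 100.0
    return 100.0 * sum(r.completed_fraction for r in counted) / len(counted)

# Example: one clean loop, one hang halfway through a second loop.
print(score([LoopResult(1.0, None), LoopResult(0.5, None)]))  # 75.0
```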
Experiment 1 Test Profile
• Real Life
  • Moderate load and application cycling
  • 9 max and 3 min load
• Tractor Pull
  • No load cycling
  • Moderate application cycling
  • Incrementally increasing load
• Intense
  • High-frequency load and application cycling
  • 9 max and 0 min load
• The three scenarios are summarized as data in the sketch below
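Illustratively, the three scenarios can be written down as plain data. This is only a sketch of the parameters listed above; the field names are assumptions, not CRASH's profile schema.

```python
# Field names are illustrative, not CRASH's profile schema.
SCENARIOS = [
    {"name": "Real Life",    "min_load": 3,    "max_load": 9,
     "load_cycling": "moderate",       "app_cycling": "moderate"},
    {"name": "Tractor Pull", "min_load": None, "max_load": None,
     "load_cycling": "none, load increases incrementally",
     "app_cycling": "moderate"},
    {"name": "Intense",      "min_load": 0,    "max_load": 9,
     "load_cycling": "high frequency", "app_cycling": "high frequency"},
]

for scenario in SCENARIOS:
    print(f'{scenario["name"]}: load {scenario["min_load"]}..{scenario["max_load"]}')
```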
Statistical Relevance Questions
• Question: How do I know that the difference between the averages of Result Set 1 and Result Set 2 is meaningful?
• Question: How can I find the smallest result set size that will give me 95% confidence?
• Answer: Use the “Randomization Test”
Randomization Test
[Diagram: Delta 1 is the difference between the means of Set 1 and Set 2. The two sets are pooled into a Combination Set, which is randomly re-split into Random Set 1 and Random Set 2; the difference between their means is Delta 2.]
• Repeat the random re-split 10,000 times. If Delta 1 is greater than Delta 2 in at least 95% of the trials, you can be assured the difference is meaningful (a Python sketch follows).
• Try smaller sample sizes until the confidence drops below 95%. That is your minimum sample size.
• Information on the “Randomization Test” can be found online at: http://www.uvm.edu/~dhowell/StatPages/Resampling/RandomizationTests.html
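A minimal Python sketch of the randomization test just described, assuming the scores being compared are simple lists of numbers; the example score lists are invented for illustration and are not experiment data.

```python
import random

def randomization_test(set1, set2, iterations=10_000, seed=1):
    """Return the fraction of random re-splits whose Delta 2 is smaller than
    the observed Delta 1 (difference of means between set1 and set2)."""
    rng = random.Random(seed)
    delta1 = abs(sum(set1) / len(set1) - sum(set2) / len(set2))
    combined = list(set1) + list(set2)
    n1 = len(set1)
    wins = 0
    for _ in range(iterations):
        rng.shuffle(combined)
        rand1, rand2 = combined[:n1], combined[n1:]
        delta2 = abs(sum(rand1) / len(rand1) - sum(rand2) / len(rand2))
        if delta1 > delta2:
            wins += 1
    return wins / iterations

# Invented CRASH-style scores for two part/driver combinations:
scores_a = [92, 88, 95, 90, 85, 93]
scores_b = [70, 74, 68, 80, 72, 75]
confidence = randomization_test(scores_a, scores_b)
print(f"Delta 1 exceeded Delta 2 in {confidence:.1%} of trials")
# If this is at least 95%, treat the difference as meaningful; shrink the
# sample size until it drops below 95% to find the minimum sample size.
```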
Scores and Confidence Intervals for Part/Driver/MB Combinations
The Experiment Matrix
With three experiments completed, we can now compare:
• One driver across two OS configurations
• Two versions of one driver across a single OS configuration

                  Clean Install    OEM Image
Aug ’04 Driver    Experiment 1     Experiment 3
Jan ’05 Driver    (not run)        Experiment 2
Old vs. New Drivers
This table compares the profile scores for old drivers vs. new drivers on the OEM image
• New drivers were noticeably better for Parts/Drivers C and D
• Parts/Drivers A and B were unchanged
OEM Image vs. Clean Install
This table compares profile scores for the OEM image vs. a clean install with old drivers
• Clean-install scores were universally better than OEM-image scores for Parts/Drivers C and D
• Parts/Drivers A and B were unchanged
Future Plans
• Collate with OCA data
  • CRASH failure to OCA bucket correlations
  • What buckets were fixed between the 1st and 2nd driver versions?
  • Do our results match field data?
    • Customer machines have hardware that is typically several years old
  • Can we find the non-display failure discrepancy in the field?
• Begin to tweak other knobs
  • Content
  • Driver versions
  • HW versions
• Windows codenamed “Longhorn” test bench
  • PCIe cards
Suggested Future Experiments
• Include more motherboard types
• Use newer drivers or a “control group” driver (Reference Rasterizer?)
• Disable AGP to isolate chipset errors from AGP errors
• Enable Driver Verifier
• Add non-graphics stress tests to the mix
• Modify loop times
IHV Feedback
• “There are definitely unique [driver] problems exposed through the use of CRASH and it is improving our driver stability greatly”
• “[CRASH is] producing real failures and identifying areas of the driver that we are improving on”
• “Thanks for a very useful tool”
CRASH 1.2 Features
• RunOnExit
  • User-specified command run upon the completion of a CRASH profile
• More logging
  • Logging to help troubleshoot problems with data flow
• More information output in XML
  • More system information
  • More failure details from minidumps
• More control over where files are put
• More robust handling of network issues
Customer Device Experience Rating (CDER) Program Background
• Started from a desire to rate display driver stability based on OCA crashes
• A controlled program addresses shortcomings of OCA data:
  • Unknown market share
  • Unknown crash reporting habits
  • Unknown info on non-crashing machines
• This allows normalization of OCA data to get an accurate “number of crashes per machine” stability rating
CDER Program Description & Status
• Program & tools
  • A panel of customers (Windows XP only)
  • User opt-in allows extensive data collection and a unique machine ID
  • System agent/scheduler
  • System configuration collector
  • OCA minidump collector
  • System usage tool (not yet in the analysis)
• Status
  • All tools for Windows XP in place and functioning
  • First set of data collected, parsed, analyzed
Overall Crash Statistics of Panel
• Machines
  • 8927 in panel
    • 49.9% experienced no crashes
    • 50.1% experienced crash(es)
  • 8580 have valid device & driver info
    • 82.2% have no display crashes
    • 17.8% have display crashes
• Crashes
  • 16.1% of valid crashes are in display
• Note: crashes occurred over a 4-year period
Crash Analysis Factors
• Examined several factors which may have an impact on stability ratings
  • Processor
  • Display resolution
  • Bit depth
  • Monitor refresh rate
  • Display memory
• Note: vendor & part naming does not correspond to that in the CRASH presentation
• Note: unless otherwise noted, data for these analyses are from the last 3 years
Normalized Crash Data
• The following data are normalized using the program’s known share of crashing and non-crashing machines (sketched below)
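A sketch of the crashes-per-machine normalization in Python: because the panel's composition is known, each vendor's display-crash count can be divided by the number of panel machines carrying that vendor's hardware, crashing or not. The vendor names and counts below are invented for illustration and are not CDER data.

```python
# vendor -> (panel machines with that vendor's display hardware, display crashes reported)
# All numbers below are invented; they are not CDER panel data.
PANEL = {
    "Display Vendor X": (3000, 450),
    "Display Vendor Y": (2500, 610),
    "Display Vendor Z": (1800, 220),
}

def crashes_per_machine(panel):
    """Normalize raw crash counts by the known panel population per vendor."""
    return {vendor: crashes / machines for vendor, (machines, crashes) in panel.items()}

for vendor, rate in sorted(crashes_per_machine(PANEL).items(), key=lambda item: item[1]):
    print(f"{vendor}: {rate:.3f} display crashes per machine")
```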
Crashes per Machine Ranking by Display Vendor for Last Year (2004)
‘Vendor A’ Normalized Crashes By Part/ASIC Family Over Last 3 Years
‘Display Vendor B’ Normalized Crashes By Part/ASIC Family Over Last 3 Years
‘Display Vendor C’ Normalized Crashes By Part/ASIC Family Over Last 3 Years
Ranking and Rating Conclusions
• This is a first look
  • Need to incorporate system usage data
  • Need to continue collecting configuration data to track driver and hardware changes
  • Need more panelists, and a higher proportion of newer parts
• With that said:
  • This is solid data
  • It demonstrates our tools work as designed
  • It shows the viability of a crash-based rating program
Customer Experience & Loyalty
• A closer look at the segment of panelists who:
  • Experienced display crashes, and
  • Switched or upgraded their display hardware or driver
Experience & Loyalty Highlights
• 19.4% of users who experienced display crashes upgraded their drivers or hardware, or changed to a different display vendor
• 7.9% of users (nearly 41% of the 19.4%) who experienced display crashes switched to a competitor’s product
• All users who switched to a competitor’s product had the same or better experience afterwards
• Only 91.3% of those who upgraded had the same or better experience afterwards, based on crashes
• Time clustering of crashes