Software Testing Doesn’t Scale
James Hamilton
JamesRH@microsoft.com
Microsoft SQL Server
Overview

• The Problem:
  • S/W size & complexity inevitable
  • Short cycles reduce S/W reliability
  • S/W testing is the real issue:
    • Testing doesn’t scale
    • Trading complexity for quality
• Cluster-based solution:
  • The Inktomi lesson
  • Shared-nothing cluster architecture
  • Redundant data & metadata
  • Fault isolation domains
S/W Size & Complexity Inevitable

• Successful S/W products grow large:
  • The # of features used by any given user is small
  • But the union of per-user feature sets is huge
• Reality of commodity, high-volume S/W:
  • Large feature sets
  • Same trend as consumer electronics
• Example mid-tier & server-side S/W stack:
  • SAP: ~47 mloc
  • DB: ~2 mloc
  • NT: ~50 mloc
• Testing all feature interactions is impossible (see the arithmetic sketched below)
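To make the impossibility concrete, here is a back-of-the-envelope sketch; the feature counts are hypothetical, not figures from the talk. Full feature subsets number 2^n, but even the pairwise and three-way interactions alone outgrow any plausible test matrix:

```python
from math import comb

# Illustrative only: feature counts are made up. Even ignoring the
# 2**n full subsets, just the 2- and 3-way interactions explode.
for n in (100, 1_000, 10_000):
    print(f"{n:>6} features: {comb(n, 2):>14,} pairs, "
          f"{comb(n, 3):>20,} triples, 2**{n} full subsets")
```

At 10,000 features that is roughly 5 × 10^7 pairs and 1.7 × 10^11 triples, before considering higher-order interactions.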
Short Cycles Reduce S/W Reliability

• Reliable TP systems typically evolve slowly & conservatively
• Modern ERP systems can go through 6+ minor revisions/year
• Many e-commerce sites change even faster
• Fast revisions are a competitive advantage
• Current testing and release methodology:
  • As much testing time as dev time
  • Significant additional beta-cycle time
• Unacceptable choice: reliable but slow-evolving, or fast-changing yet unstable and brittle
Testing the Real Issue

• 15 yrs ago, test teams were a tiny fraction of the dev group
• Now test teams are of similar size to dev & growing rapidly
• Current test methodology is improving incrementally:
  • Random grammar-driven test case generation
  • Fault injection
  • Code path coverage tools
• Testing remains effective at feature testing
• Ineffective at finding inter-feature interactions
• Only a tiny fraction of Heisenbugs are found in testing (www.research.microsoft.com/~gray/Talks/ISAT_Gray_FT_Avialiability_talk.ppt)
• Beta testing exists because testing is known to be inadequate
• Test team growth scales exponentially with system complexity
• Test and beta cycles are already intolerably long
The Inktomi Lesson

• Inktomi web search engine (SIGMOD’98)
• Quickly evolving software:
  • Memory leaks, race conditions, etc. considered normal
  • Don’t attempt to test & beta until quality is high
• System availability of paramount importance
• Individual node availability unimportant
• Shared-nothing cluster
• Exploit the ability to fail individual nodes (supervision loop sketched below):
  • Automatic reboots avoid memory leaks
  • Automatic restart of failed nodes
  • Fail fast: fail & restart when redundant checks fail
  • Replace failed hardware weekly (mostly disks)
• Dark machine room:
  • No panic midnight calls to admins
• Mask failures rather than futilely attempting to avoid them
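A minimal sketch of the fail-fast supervision loop this implies, assuming a hypothetical node binary, uptime threshold, and retry limit (none of which come from the talk):

```python
import subprocess
import time

# Run the node process; whenever it exits (fail-fast on a failed
# redundant check, or a scheduled reboot to clear memory leaks),
# just restart it. All constants here are illustrative.
NODE_CMD = ["./search-node"]   # hypothetical node binary
MAX_FAST_FAILURES = 5          # repeated instant crashes => stop restarting

fast_failures = 0
while fast_failures < MAX_FAST_FAILURES:
    started = time.time()
    result = subprocess.run(NODE_CMD)
    if time.time() - started > 60:
        fast_failures = 0      # node ran a while: treat as a fresh start
    else:
        fast_failures += 1
    print(f"node exited with {result.returncode}; restarting")

# The node keeps dying immediately: mark it dead for the weekly
# hardware-replacement pass instead of paging an admin at midnight.
print("node marked dead; awaiting replacement")
```

The point of the design is that restart is the *normal* recovery path, so individual node quality can be lower without hurting system availability.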
Apply to High-Value TP Data?

• Inktomi model:
  • Scales to 100’s of nodes
  • S/W evolves quickly
  • Low testing costs and no beta requirement
  • Exploits the ability to lose an individual node without impacting system availability
  • Ability to temporarily lose some data without significantly impacting query quality
• Can’t lose data availability in most TP systems
• Redundant data allows node loss without loss of data availability
• The Inktomi model with redundant data & metadata is a solution to the exploding test problem
Connection Model/Architecture

• All data & metadata multiply redundant
• Shared nothing
• Single system image
• Symmetric server nodes
• Any client connects to any server (sketched below)
• All nodes SAN-connected

[Diagram: a client connecting to any server node in the server cloud]
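A minimal sketch of the symmetric connection model, assuming hypothetical host names and port: because every server node is interchangeable, the client can pick any node at random and fail over to another on a connection error.

```python
import random
import socket

NODES = ["node1.cluster", "node2.cluster", "node3.cluster"]  # hypothetical
PORT = 5432                                                  # hypothetical

def connect_any(nodes, port, timeout=2.0):
    """Return a socket to any live server node; all nodes are symmetric."""
    for host in random.sample(nodes, len(nodes)):  # shuffled copy
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            continue  # node down: any other node serves equally well
    raise ConnectionError("no server node reachable")

# Usage: conn = connect_any(NODES, PORT)
```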
Compilation & Execution Model

• Query compilation on a server thread: lex analyze → parse → normalize → optimize → code generate (sketched below)
• Query execution on many subthreads, synchronized by a root thread

[Diagram: client sends a query to a server thread, which compiles it and executes it across the server cloud]
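A toy sketch of this pipeline; the stage bodies are trivial stand-ins (not SQL Server internals), and the four-way fan-out is arbitrary:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Stand-in compilation stages, named after the slide's pipeline.
def lex_analyze(sql):  return sql.split()
def parse(tokens):     return {"op": "scan", "args": tokens}
def normalize(tree):   return tree
def optimize(tree):    return tree

@dataclass
class Plan:
    fragments: list                      # one fragment per subthread

def code_generate(tree):
    return Plan(fragments=[tree] * 4)    # pretend: 4-way parallel plan

def run_fragment(fragment):
    return 1                             # placeholder partial result

def execute(plan):
    # Root thread fans fragments out to subthreads, then synchronizes.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_fragment, plan.fragments))
    return sum(partials)                 # root thread merges partials

plan = code_generate(optimize(normalize(parse(lex_analyze("select 1")))))
print(execute(plan))
```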
Node Loss/Rejoin

• Lose node (execution in progress; sketched below):
  • Recompile
  • Re-execute
• Rejoin:
  • Node local recovery
  • Rejoin cluster
  • Recover global data at the rejoining node

[Diagram: a client query re-routed within the server cloud as a node leaves and later rejoins]
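A minimal sketch of the recompile-and-re-execute behavior on node loss; every name here is a hypothetical stand-in and the simulated failure rate is arbitrary:

```python
import random

class NodeLost(Exception):
    """A participating server node failed mid-execution."""

def compile_plan(sql, nodes):
    return (sql, tuple(nodes))           # stand-in for a compiled plan

def execute_plan(plan):
    if random.random() < 0.3:            # simulate an occasional node loss
        raise NodeLost
    return f"result of {plan[0]} on {len(plan[1])} nodes"

def run_query(sql, live_nodes, max_attempts=3):
    for _ in range(max_attempts):
        plan = compile_plan(sql, live_nodes)
        try:
            return execute_plan(plan)
        except NodeLost:
            live_nodes = live_nodes[:-1]  # drop failed node, recompile
    raise RuntimeError("query failed after repeated node losses")

print(run_query("select 1", ["n1", "n2", "n3"]))
```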
Redundant Data Update Model

• Updates are standard parallel plans
• Optimizer knows all redundant data paths
• Generated plan updates all copies (sketched below)
• No significant new technology:
  • Like materialized view & index updates today

[Diagram: a client update fanned out to every redundant copy in the server cloud]
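A minimal sketch of the update model, assuming a hypothetical replica map: the plan step writes every redundant copy, analogous to maintaining an index or materialized view today.

```python
REPLICAS = {                     # partition -> nodes holding a copy (made up)
    "orders_p0": ["n1", "n3"],
    "orders_p1": ["n2", "n4"],
}

def apply_update(node, partition, row):
    print(f"update {partition} row {row} on {node}")

def update_row(partition, row):
    # Plan step generated by the optimizer: because it knows every
    # redundant data path, it writes all copies, not just one.
    for node in REPLICAS[partition]:
        apply_update(node, partition, row)

update_row("orders_p0", {"id": 42, "qty": 7})
```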
Fault Isolation Domains

• Trade single-node perf for redundant data checks:
  • Fairly common…but complex error-recovery code is even more likely to be wrong than the original forward-processing code
  • Many of the best redundant checks are compiled out of “retail” builds when shipped (when they are needed most)
• Fail fast rather than attempting to repair:
  • Bring down the node on memory-based data structure faults
  • Never patch inconsistent data…other copies keep the system available
• If anything goes wrong, “fire” the node and continue (escalation sketched below):
  • Attempt node restart
  • Auto-reinstall O/S, DB and recreate the DB partition
  • Mark node “dead” for later replacement
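A minimal sketch of that escalating recovery policy; the recovery steps are placeholders that merely report success or failure:

```python
def restart_node(node):
    print(f"restarting {node}")
    return False                      # pretend the restart didn't help

def reinstall_node(node):
    print(f"reinstalling O/S + DB on {node}, recreating its partition")
    return True

def mark_dead(node):
    print(f"{node} marked dead for later hardware replacement")

def fire_node(node):
    # Fail fast: the node is already down, and other copies keep the
    # system available while we escalate through heavier remedies.
    # Inconsistent data is never patched in place.
    for recover in (restart_node, reinstall_node):
        if recover(node):
            return                    # node rejoined the cluster; done
    mark_dead(node)

fire_node("n3")
```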
Summary

• 100 MLOC of server-side code and growing:
  • Can’t fight it & can’t test it…
  • Quality will continue to decline if we don’t do something different
• Can’t afford a 2- to 3-year dev cycle
• ’60s large-system mentality still prevails:
  • Optimizing precious machine resources is a false economy
• Continuing focus on single-system perf is dead wrong:
  • Scalability & system perf rather than individual node performance
• Why are we still incrementally attacking an exponential problem?
• Any reasonable alternatives to clusters?