Breaking databases for fun and publications: availability benchmarks

Aaron Brown, UC Berkeley ROC Group, HPTS 2001

Motivation
• Drinking the availability Kool-Aid
  • availability is the key metric for modern apps
• Database stack’s availability is especially important
  • guardians of the world’s hard state
  • almost any user’s request for electronic information hits a database stack
  • web services, directories, enterprise apps, ...
• Can we trust database software stacks in the face of failure?

Availability benchmarking 101
• Availability benchmarks quantify system behavior under failures, maintenance, recovery
• They require
  • a realistic workload for the system: TPC-C
  • quality of service metrics: txn rates, OK and aborted
  • fault-injection to simulate failures: single-disk errors
[Figure: QoS over time, marking the normal behavior band (99% conf.), QoS degradation after a failure, and repair time]
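The "normal behavior (99% conf.)" band and the repair time in the figure can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: the function names, the per-interval layout of the traces, and the normal-approximation multiplier `z=2.576` for a 99% band are all assumptions made for the example.

```python
import statistics


def normal_band(baseline_runs, z=2.576):
    """Build a per-interval band from fault-free runs.

    baseline_runs: list of throughput traces (one list per run, same length).
    z: assumed multiplier for a ~99% band under a normal approximation.
    Returns a list of (low, high) tuples, one per interval.
    """
    band = []
    for samples in zip(*baseline_runs):
        mean = statistics.mean(samples)
        spread = z * statistics.stdev(samples)
        band.append((mean - spread, mean + spread))
    return band


def degraded_intervals(trace, band):
    """Indices where throughput falls below the lower edge of the band."""
    return [i for i, (x, (low, _high)) in enumerate(zip(trace, band)) if x < low]


def repair_time(trace, band, fault_at, interval_minutes=1):
    """Minutes from fault injection until throughput stays inside the band."""
    bad = [i for i in degraded_intervals(trace, band) if i >= fault_at]
    if not bad:
        return 0  # the fault was tolerated without leaving the band
    return (bad[-1] + 1 - fault_at) * interval_minutes


# Three fault-free runs, then one run with a fault injected at interval 2.
baseline = [
    [100, 102, 98, 101, 100],
    [101, 99, 100, 100, 99],
    [99, 101, 102, 99, 101],
]
band = normal_band(baseline)
faulty = [100, 101, 40, 70, 100]
print(degraded_intervals(faulty, band), repair_time(faulty, band, fault_at=2))
```

Measuring against a band rather than a single mean keeps ordinary run-to-run noise from being counted as a QoS degradation.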

Well, what happens?
• Setup
  • 3-tier: Microsoft SQL Server/COM+/IIS & business logic
  • TPC-C-like workload; faults injected into DB data & log
• Results
  • DBMS tolerates transient and recoverable failures, reflecting errors back via transaction aborts
  • middleware highly unstable: degrades or crashes when DBMS fails or undergoes lengthy recovery
[Figure: two panels, "Sticky uncorrectable write error, log disk" and "Disk hang during write to data disk". Annotations: database fails, middleware degrades; middleware causes degraded performance; middleware crashes; database recovers]

Summary
• Database is pretty resilient
  • transaction abort == good error-reflection mechanism
• Middleware/applications suck (well, at least this instance of them)
• Robustness is end-to-end
  • user cannot distinguish DBMS and middleware failures
  • failure recovery must go beyond the DBMS
• Achievable Grand Challenges?
  • build and run availability benchmarks on your systems
  • tolerate and recover from non-failstop system-level faults
Does performance matter?

Backup slides

Experimental setup
• Database
  • Microsoft SQL Server 2000, default configuration
• Middleware/front-end software
  • Microsoft COM+ transaction monitor/coordinator
  • IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic
  • Microsoft BenchCraft remote terminal emulator
• TPC-C-like OLTP order-entry workload
  • 10 warehouses, 100 active users, ~860 MB database
• Measured metrics
  • throughput of correct NewOrder transactions/min
  • rate of aborted NewOrder transactions (txn/min)
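The two measured metrics can be derived from a per-transaction log. The sketch below assumes a hypothetical record format of `(seconds_since_start, txn_type, outcome)` with outcomes `"ok"` and `"aborted"`. The slides do not describe how BenchCraft reports results, so the format and the function name are illustrative only.

```python
from collections import Counter


def new_order_rates(records):
    """Return [(minute, ok_per_min, aborted_per_min), ...] for NewOrder only.

    records: iterable of (seconds_since_start, txn_type, outcome) tuples;
    this layout is an assumption made for the example.
    """
    ok = Counter()
    aborted = Counter()
    for t_seconds, txn_type, outcome in records:
        if txn_type != "NewOrder":
            continue  # other TPC-C transaction types are not counted here
        minute = int(t_seconds // 60)
        if outcome == "ok":
            ok[minute] += 1
        elif outcome == "aborted":
            aborted[minute] += 1
    last_minute = max(list(ok) + list(aborted), default=-1)
    # Counters return 0 for minutes with no matching transactions.
    return [(m, ok[m], aborted[m]) for m in range(last_minute + 1)]


log = [
    (5, "NewOrder", "ok"),
    (30, "NewOrder", "ok"),
    (45, "Payment", "ok"),
    (70, "NewOrder", "aborted"),
    (130, "NewOrder", "ok"),
]
rates = new_order_rates(log)
print(rates)
```

Reporting aborts as a separate series rather than folding them into throughput is what makes error reflection through transaction aborts visible.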

Experimental setup (2)
• Database installed in one of two configurations:
  • data on emulated disk, log on real (IBM) disk
  • data on real (IBM) disk, log on emulated disk
[Figure: testbed diagram. Disk emulator: Intel P-II/300, 128 MB DRAM, Windows NT 4.0; IDE system disk; Adaptec 2940; AdvStor ASC-U2W; ASC Virtual SCSI lib.; emulator backing disk (NTFS), IBM 18 GB 10k RPM; emulated disk over UltraSCSI. Front end and DB server, connected by 100 Mb Ethernet: MS BenchCraft RTE, IIS + MS tpcc.dll, MS COM+; SQL Server 2000; AMD K6-2/333, 128 MB DRAM, Windows 2000 AS; Intel P-III/450, 256 MB DRAM, Windows 2000 AS; SCSI system disks; Adaptec 3940; IBM 18 GB 10k RPM; DB data/log disks. Legend: Fast/Wide SCSI bus, 20 MB/sec]

Results
• All results are from single-fault micro-benchmarks
• 14 different fault types
  • injected once for each of data and log partitions
• 4 categories of behavior detected
  1) normal
  2) transient glitch
  3) degraded
  4) failed
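As a rough illustration of how one run might be sorted into the four categories, the sketch below applies simple rules to post-fault samples. The rules, the thresholds, and the `baseline_abort_rate` parameter are assumptions for the example. The talk does not say whether the categories were assigned by inspection or by an automated rule.

```python
def classify(ok_per_min, aborted_per_min, normal_floor, baseline_abort_rate=0):
    """Classify one single-fault run from its post-fault samples.

    ok_per_min, aborted_per_min: per-minute counts after fault injection.
    normal_floor: lower edge of the fault-free throughput band (assumed given).
    baseline_abort_rate: aborts/min expected without any fault (assumed given).
    """
    if ok_per_min[-1] == 0:
        return "failed"  # no correct transactions by the end of the run
    if ok_per_min[-1] < normal_floor:
        return "degraded"  # still serving, but below normal throughput
    extra_aborts = sum(max(0, a - baseline_abort_rate) for a in aborted_per_min)
    if extra_aborts > 0:
        return "transient glitch"  # some transactions aborted, then recovery
    return "normal"


print(classify([100, 100, 100], [1, 1, 1], normal_floor=90, baseline_abort_rate=1))
print(classify([100, 99, 100], [1, 2, 1], normal_floor=90, baseline_abort_rate=1))
print(classify([100, 50, 55], [1, 1, 1], normal_floor=90, baseline_abort_rate=1))
print(classify([100, 0, 0], [1, 30, 0], normal_floor=90, baseline_abort_rate=1))
```

The checks run from most to least severe, so a run that both aborts transactions and ends with no throughput is reported as failed rather than as a glitch.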

Type 1: normal behavior
• System tolerates fault
• Demonstrated for all sector-level faults except:
  • sticky uncorrectable read, data partition
  • sticky uncorrectable write, log partition

Type 2: transient glitch
• One transaction is affected, aborts with error
• Subsequent transactions using same data would fail
• Demonstrated for one fault only:
  • sticky uncorrectable read, data partition

Type 3: degraded behavior
• DBMS survives error after running log recovery
• Middleware partially fails, results in degraded perf.
• Demonstrated for one fault only:
  • sticky uncorrectable write, log partition

Type 4: failure
• Example behaviors (10 distinct variants observed)
[Figure: two panels, "Disk hang during write to data disk" and "Simulated log disk power failure"]
• DBMS hangs or aborts all transactions
• Middleware behaves erratically, sometimes crashing
• Demonstrated for all fatal disk-level faults
  • SCSI hangs, disk power failures
