This article discusses the importance of availability benchmarks for database software stacks. It explores the behavior of a database system under failures, maintenance, and recovery, using a realistic workload and fault injection. The results show the resilience of the database but highlight issues with middleware and applications. The article concludes by presenting the achievable grand challenges in availability benchmarking and the need to tolerate and recover from non-failstop system-level faults.
Breaking databases for fun and publications: availability benchmarks Aaron Brown UC Berkeley ROC Group HPTS 2001
Motivation • Drinking the availability Kool-Aid • availability is the key metric for modern apps. • Database stack’s availability is especially important • guardians of the world’s hard state • almost any user’s request for electronic information hits a database stack • web services, directories, enterprise apps, ... • Can we trust database software stacks in the face of failure?
Availability benchmarking 101 • Availability benchmarks quantify system behavior under failures, maintenance, recovery • They require • a realistic workload for the system: TPC-C • quality of service metrics: txn rates, OK and aborted • fault-injection to simulate failures: single-disk errors [Figure: QoS over time, annotated with normal behavior (99% conf.), failure, QoS degradation, and repair time]
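The "normal behavior (99% conf.)" band and the repair time in the figure can be illustrated with a small sketch. The slide does not say how the band was computed, so this assumes a normal approximation over per-minute throughput samples from fault-free runs; the function names and the sample numbers are hypothetical.

```python
import statistics

# Two-sided 99% quantile of the standard normal distribution.
Z_99 = 2.576


def normal_band(baseline):
    """Return (low, high) bounds for normal behavior.

    Assumption: per-minute throughput in fault-free runs is roughly
    normally distributed, so mean +/- 2.576 * stdev covers about 99%.
    """
    mean = statistics.mean(baseline)
    sd = statistics.stdev(baseline)
    return mean - Z_99 * sd, mean + Z_99 * sd


def degraded_minutes(samples, band):
    """Indices of the minutes whose throughput fell below the band."""
    low, _ = band
    return [i for i, v in enumerate(samples) if v < low]


def repair_time(samples, band, fault_minute):
    """Minutes from fault injection until throughput stays inside the band.

    Returns 0 if throughput never left the band, and None if the run
    ended while throughput was still below the band.
    """
    low, _ = band
    below = [i for i in range(fault_minute, len(samples)) if samples[i] < low]
    if not below:
        return 0
    last = below[-1]
    if last == len(samples) - 1:
        return None
    return last + 1 - fault_minute


if __name__ == "__main__":
    band = normal_band([100, 102, 98, 101, 99])
    run = [100, 99, 40, 60, 97, 101]  # fault injected at minute 2
    print(degraded_minutes(run, band), repair_time(run, band, fault_minute=2))
```

Using only the lower bound to flag degradation is a design choice for this sketch: a throughput spike above the band is not treated as a QoS loss.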
Well, what happens? • Setup • 3-tier: Microsoft SQL Server/COM+/IIS & bus. logic • TPC-C-like workload; faults injected into DB data & log • Results • DBMS tolerates transient and recoverable failures, reflecting errors back via transaction aborts • middleware highly unstable: degrades or crashes when DBMS fails or undergoes lengthy recovery [Figure, two panels: "sticky uncorrectable write error, log disk" and "disk hang during write to data disk"; annotations: database fails, middleware degrades; middleware causes degraded performance; middleware crashes; database recovers]
Summary • Database is pretty resilient • transaction abort == good error-reflection mechanism • Middleware/applications suck (well, at least this instance of them) • Robustness is end-to-end • user cannot distinguish DBMS and middleware failures • failure recovery must go beyond the DBMS • Achievable Grand Challenges? • build and run availability benchmarks on your systems • tolerate and recover from non-failstop system-level faults Does performance matter?
Experimental setup • Database • Microsoft SQL Server 2000, default configuration • Middleware/front-end software • Microsoft COM+ transaction monitor/coordinator • IIS 5.0 web server with Microsoft’s tpcc.dll HTML terminal interface and business logic • Microsoft BenchCraft remote terminal emulator • TPC-C-like OLTP order-entry workload • 10 warehouses, 100 active users, ~860 MB database • Measured metrics • throughput of correct NewOrder transactions/min • rate of aborted NewOrder transactions (txn/min)
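The two measured metrics can be derived from a per-transaction log, as in the minimal sketch below. The record layout `(timestamp_seconds, txn_type, status)` and the status strings are assumptions made for illustration; they are not the BenchCraft output format.

```python
from collections import Counter


def per_minute_rates(records, txn_type="NewOrder"):
    """Count correct and aborted transactions of one type per minute.

    records: iterable of (timestamp_seconds, txn_type, status) tuples,
    where status is "ok" or "aborted" (hypothetical layout).
    Returns a list of (minute, ok_count, aborted_count), including
    minutes in which nothing completed, so outages show up as zeros.
    """
    ok = Counter()
    aborted = Counter()
    last_minute = -1
    for ts, kind, status in records:
        if kind != txn_type:
            continue
        minute = int(ts // 60)
        last_minute = max(last_minute, minute)
        if status == "ok":
            ok[minute] += 1
        elif status == "aborted":
            aborted[minute] += 1
    return [(m, ok[m], aborted[m]) for m in range(last_minute + 1)]


if __name__ == "__main__":
    log = [
        (5, "NewOrder", "ok"),
        (30, "NewOrder", "ok"),
        (59, "Payment", "ok"),
        (61, "NewOrder", "aborted"),
        (130, "NewOrder", "ok"),
    ]
    for row in per_minute_rates(log):
        print(row)
```

Emitting a row for every minute, including empty ones, matters here because a hung DBMS produces no records at all during the outage.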
Experimental setup (2) • Database installed in one of two configurations: • data on emulated disk, log on real (IBM) disk • data on real (IBM) disk, log on emulated disk [Diagram: Front End (MS BenchCraft RTE, IIS + MS tpcc.dll, MS COM+) connected over 100 Mb Ethernet to DB Server (SQL Server 2000); DB data/log disks are a real IBM 18 GB 10k RPM disk and an emulated disk, attached over a Fast/Wide SCSI bus (20 MB/sec); Disk Emulator: Intel P-II/300, 128 MB DRAM, Windows NT 4.0, IDE system disk, NTFS emulator backing disk, AdvStor ASC-U2W UltraSCSI adapter, ASC Virtual SCSI lib. Other labels in the diagram: AMD K6-2/333, 128 MB DRAM, Windows 2000 AS; Intel P-III/450, 256 MB DRAM, Windows 2000 AS; Adaptec 2940; Adaptec 3940; SCSI system disks]
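The difference between a sticky and a transient sector fault can be shown with a toy in-memory disk. This is only an illustration of the fault semantics: the emulator in the talk is a SCSI-level device running on a separate PC, and the class and method names below are hypothetical.

```python
import errno


class EmulatedDisk:
    """Toy sector store with injectable read/write faults."""

    def __init__(self, num_sectors, sector_size=512):
        self.sector_size = sector_size
        self.sectors = [bytes(sector_size)] * num_sectors
        # Maps (op, sector) -> True for sticky, False for transient.
        self.faults = {}

    def inject(self, op, sector, sticky=True):
        """Arm a fault for op ("read" or "write") on one sector."""
        self.faults[(op, sector)] = sticky

    def _check(self, op, sector):
        key = (op, sector)
        if key not in self.faults:
            return
        if not self.faults[key]:
            # Transient fault: fires once, then clears itself.
            del self.faults[key]
        raise OSError(errno.EIO, f"uncorrectable {op} error, sector {sector}")

    def read(self, sector):
        self._check("read", sector)
        return self.sectors[sector]

    def write(self, sector, data):
        if len(data) != self.sector_size:
            raise ValueError("data must be exactly one sector long")
        self._check("write", sector)
        self.sectors[sector] = bytes(data)


if __name__ == "__main__":
    disk = EmulatedDisk(8)
    disk.inject("write", 3, sticky=True)
    for attempt in range(2):
        try:
            disk.write(3, bytes(512))
        except OSError as exc:
            print("attempt", attempt, "failed:", exc.strerror)
```

A sticky fault keeps failing on every retry, which is why it can force the software above it into recovery, while a transient fault disappears after the first error.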
Results • All results are from single-fault micro-benchmarks • 14 different fault types • injected once for each of data and log partitions • 4 categories of behavior detected 1) normal 2) transient glitch 3) degraded 4) failed
Type 1: normal behavior • System tolerates fault • Demonstrated for all sector-level faults except: • sticky uncorrectable read, data partition • sticky uncorrectable write, log partition
Type 2: transient glitch • One transaction is affected, aborts with error • Subsequent transactions using same data would fail • Demonstrated for one fault only: • sticky uncorrectable read, data partition
Type 3: degraded behavior • DBMS survives error after running log recovery • Middleware partially fails, results in degraded perf. • Demonstrated for one fault only: • sticky uncorrectable write, log partition
Type 4: failure • Example behaviors (10 distinct variants observed) [Figure, two panels: "Disk hang during write to data disk" and "Simulated log disk power failure"] • DBMS hangs or aborts all transactions • Middleware behaves erratically, sometimes crashing • Demonstrated for all fatal disk-level faults • SCSI hangs, disk power failures
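The four behavior types can be told apart mechanically from the two measured series. The sketch below is one possible rule set, not the procedure used in the talk: the tail window, the use of a lower band bound, and the baseline abort count are all assumptions.

```python
def classify_run(ok_per_min, aborted_per_min, band_low, fault_minute,
                 baseline_aborts=0, tail=3):
    """Map one single-fault run onto the four behavior types.

    ok_per_min / aborted_per_min: per-minute counts of correct and
    aborted transactions. band_low: lower bound of normal throughput.
    tail: number of final minutes used to judge the end state
    (hypothetical choice).
    """
    end_state = ok_per_min[-tail:]
    aborts_after_fault = aborted_per_min[fault_minute:]

    # Type 4: no correct transactions complete by the end of the run.
    if all(v == 0 for v in end_state):
        return "failed"
    # Type 3: the system still works but ends the run below the band.
    if sum(end_state) / len(end_state) < band_low:
        return "degraded"
    # Type 2: throughput is fine, but the fault surfaced as aborts.
    if any(a > baseline_aborts for a in aborts_after_fault):
        return "transient glitch"
    # Type 1: the fault was fully masked.
    return "normal"


if __name__ == "__main__":
    print(classify_run([100, 100, 0, 0, 0, 0], [0, 0, 50, 0, 0, 0], 95, 2))
```

The checks run from most to least severe, so a run that both aborts transactions and ends with zero throughput is reported as failed rather than as a glitch. A brief dip that recovers without extra aborts is reported as normal under these rules.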