Towards Benchmarks for Availability, Maintainability, and Evolutionary Growth (AME) A Case Study of Software RAID Systems. Aaron Brown 2000 Winter IRAM/ISTORE Retreat. Outline. Motivation: why new benchmarks, why AME? Availability benchmarks: a general approach
Towards Benchmarks for Availability, Maintainability, and Evolutionary Growth (AME): A Case Study of Software RAID Systems • Aaron Brown • 2000 Winter IRAM/ISTORE Retreat
Outline • Motivation: why new benchmarks, why AME? • Availability benchmarks: a general approach • Case study: availability of Software RAID • Conclusions and future directions
Why benchmark AME? • Most benchmarks measure performance • Today’s most pressing problems aren’t about performance • online service providers face new problems as they rapidly deploy, alter, and expand Internet services • providing 24x7 availability (A) • managing system maintenance (M) • handling evolutionary growth (E) • AME issues affect in-house providers and smaller installations as well • even UCB CS department
Why benchmarks? • Growing consensus that we need to focus on these problems in the research community • HotOS-7: “No-Futz Computing” • Hennessy at FCRC: • “other qualities [than performance] will become crucial: availability, maintainability, scalability” • “if access to services on servers is the killer app--availability is the key metric” • Lampson at SOSP ’99: • big challenge for systems research is building “systems that work: meet specifications, are always available, evolve while they run, grow without practical limits” • Benchmarks shape a field! • they define what we can measure
The components of AME • Availability • what factors affect the quality of service delivered by the system, and by how much/how long? • how well can systems survive typical failure scenarios? • Maintainability • how hard is it to keep a system running at a certain level of availability and performance? • how complex are maintenance tasks like upgrades, repair, tuning, troubleshooting, disaster recovery? • Evolutionary Growth • can the system be grown incrementally with heterogeneous components? • what’s the impact of such growth on availability, maintainability, and sustained performance?
Outline • Motivation: why new benchmarks, why AME? • Availability benchmarks: a general approach • Case study: availability of Software RAID • Conclusions and future directions
How can we measure availability? • Traditionally, percentage of time system is up • time-averaged, binary view of system state (up/down) • Traditional metric is too inflexible • doesn’t capture degraded states • a non-binary spectrum between “up” and “down” • time-averaging discards important temporal behavior • compare 2 systems with 96.7% traditional availability: • system A is down for 2 seconds per minute • system B is down for 1 day per month • Solution: measure variation in system quality of service metrics over time
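The comparison above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name is ours, and a 30-day month is assumed for system B.

```python
def traditional_availability(downtime_s: float, period_s: float) -> float:
    """Time-averaged, binary (up/down) availability over one period."""
    return 1.0 - downtime_s / period_s

MINUTE_S = 60
DAY_S = 24 * 3600
MONTH_S = 30 * DAY_S  # assumption: a 30-day month

# System A: down 2 seconds in every minute.
avail_a = traditional_availability(2, MINUTE_S)
# System B: down 1 day in every month.
avail_b = traditional_availability(DAY_S, MONTH_S)

# Both round to the same traditional figure...
print(f"A: {avail_a:.1%}  B: {avail_b:.1%}")
# ...but the longest single outage differs by a factor of 43,200.
print(f"longest outage: A = 2 s, B = {DAY_S} s")
```

The metric is identical for both systems, yet a user sees very different behavior, which is the temporal information that time-averaging discards.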
Example Quality of Service metrics • Performance • e.g., user-perceived latency, server throughput • Degree of fault-tolerance • Completeness • e.g., how much of relevant data is used to answer query • Accuracy • e.g., of a computation or decoding/encoding process • Capacity • e.g., admission control limits, access to non-essential services
Availability benchmark methodology • Goal: quantify variation in QoS metrics as events occur that affect system availability • Leverage existing performance benchmarks • to generate fair workloads • to measure & trace quality of service metrics • Use fault injection to compromise system • hardware faults (disk, memory, network, power) • software faults (corrupt input, driver error returns) • maintenance events (repairs, SW/HW upgrades) • Examine single-fault and multi-fault workloads • the availability analogues of performance micro- and macro-benchmarks
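The methodology above can be sketched as a small measurement loop. This is an illustrative toy, not the harness used in the study: `run_experiment`, `ToySystem`, and the throughput numbers are all hypothetical stand-ins for a real workload generator and a real fault injector.

```python
from typing import Callable, List, Optional

def run_experiment(measure_qos: Callable[[], float],
                   inject_fault: Optional[Callable[[], None]],
                   n_intervals: int,
                   fault_at: int) -> List[float]:
    """Record one QoS sample per interval, injecting a fault at `fault_at`.

    Passing inject_fault=None gives a no-fault baseline run.
    """
    trace = []
    for i in range(n_intervals):
        if inject_fault is not None and i == fault_at:
            inject_fault()
        trace.append(measure_qos())
    return trace

class ToySystem:
    """Hypothetical system under test: throughput drops once it is degraded."""
    def __init__(self) -> None:
        self.degraded = False

    def fail_disk(self) -> None:   # stands in for a real fault injector
        self.degraded = True

    def hits_per_second(self) -> float:  # stands in for a real workload probe
        return 80.0 if self.degraded else 100.0

system = ToySystem()
fault_trace = run_experiment(system.hits_per_second, system.fail_disk,
                             n_intervals=6, fault_at=3)
baseline = run_experiment(ToySystem().hits_per_second, None,
                          n_intervals=6, fault_at=3)
```

A single-fault workload calls the injector once, as here; a multi-fault workload would schedule several injection and repair events over the same trace.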
Methodology: reporting results • Results are most accessible graphically • plot change in QoS metrics over time • compare to “normal” behavior • 99% confidence intervals calculated from no-fault runs • Graphs can be distilled into numbers • quantify distribution of deviations from normal behavior, compute area under curve for deviations, ...
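One way to distill a graph into numbers, as described above, is sketched below. It is an assumption-laden example rather than the study's actual analysis: the function names are ours, the band is a normal-approximation confidence interval for the mean of the no-fault samples (a t-distribution would be more appropriate for very few runs), and only dips below the band are counted.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple

def normal_band(no_fault_samples: Sequence[float],
                confidence: float = 0.99) -> Tuple[float, float]:
    """Confidence interval for "normal" QoS, from no-fault runs."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for 99%
    m = mean(no_fault_samples)
    half_width = z * stdev(no_fault_samples) / sqrt(len(no_fault_samples))
    return m - half_width, m + half_width

def deviation_area(trace: Sequence[float],
                   band: Tuple[float, float],
                   interval_s: float) -> float:
    """Area between the trace and the lower edge of the normal band.

    With hits/sec samples, the result is in hits (work lost to the fault).
    """
    low, _high = band
    return sum((low - v) * interval_s for v in trace if v < low)

band = normal_band([100, 102, 98, 100, 101, 99])
lost = deviation_area([100, 80, 80, 100], band, interval_s=120)
```

The same two helpers can be applied to any QoS metric that is sampled at a fixed interval.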
Outline • Motivation: why new benchmarks, why AME? • Availability benchmarks: a general approach • Case study: availability of Software RAID • Conclusions and future directions
Case study • Software RAID-5 plus web server • Linux/Apache vs. Windows 2000/IIS • Why software RAID? • well-defined availability guarantees • RAID-5 volume should tolerate a single disk failure • reduced performance (degraded mode) after failure • may automatically rebuild redundancy onto spare disk • simple system • easy to inject storage faults • Why web server? • an application with measurable QoS metrics that depend on RAID availability and performance
Benchmark environment: metrics • QoS metrics measured • hits per second • roughly tracks response time in our experiments • degree of fault tolerance in storage system • Workload generator and data collector • SPECweb99 web benchmark • simulates realistic high-volume user load • mostly static read-only workload; some dynamic content • modified to run continuously and to measure average hits per second over each 2-minute interval
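The per-interval metric can be sketched as a simple bucketing step. This is not the modified benchmark's code; the function name and the input format (a list of hit completion times in seconds since the start of the run) are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence

def hits_per_second(hit_times_s: Sequence[float],
                    interval_s: float = 120.0) -> List[float]:
    """Average hits/sec for each consecutive interval (2 minutes by default)."""
    counts = Counter(int(t // interval_s) for t in hit_times_s)
    if not counts:
        return []
    n_intervals = max(counts) + 1
    # Intervals with no hits still appear, with a rate of 0.
    return [counts.get(i, 0) / interval_s for i in range(n_intervals)]

rates = hits_per_second([0.0, 1.0, 119.9, 120.0, 360.0])
```

Note that the last interval is treated as full length even if the run stopped partway through it, so a real collector would drop or rescale a trailing partial interval.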
Benchmark environment: faults • Focus on faults in the storage system (disks) • How do disks fail? • according to Tertiary Disk project, failures include: • recovered media errors • uncorrectable write failures • hardware errors (e.g., diagnostic failures) • SCSI timeouts • SCSI parity errors • note: no head crashes, no fail-stop failures
Disk fault injection technique • To inject reproducible failures, we replaced one disk in the RAID with an emulated disk • a PC that appears as a disk on the SCSI bus • I/O requests processed in software, reflected to local disk • fault injection performed by altering SCSI command processing in the emulation software • Types of emulated faults: • media errors (transient, correctable, uncorrectable) • hardware errors (firmware, mechanical) • parity errors • power failures • disk hangs/timeouts
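The idea of injecting faults by altering command processing can be illustrated with a toy in-memory disk. This sketch does not model real SCSI commands and is not the API of the emulation library used in the study; `EmulatedDisk`, `Fault`, and `DiskError` are hypothetical names, and only three of the fault types listed above are represented.

```python
from enum import Enum

class Fault(Enum):
    NONE = "none"
    TRANSIENT_MEDIA_ERROR = "transient media error"          # clears after one command
    UNCORRECTABLE_MEDIA_ERROR = "uncorrectable media error"  # persists
    TIMEOUT = "disk hang/timeout"                            # persists

class DiskError(Exception):
    """Raised in place of a real error status returned to the host."""

class EmulatedDisk:
    BLOCK_SIZE = 512

    def __init__(self) -> None:
        self.blocks = {}        # stands in for the emulator's backing disk
        self.fault = Fault.NONE

    def inject(self, fault: Fault) -> None:
        self.fault = fault

    def _check_fault(self) -> None:
        # Every command passes through here, so one hook covers reads and writes.
        if self.fault is Fault.NONE:
            return
        fault = self.fault
        if fault is Fault.TRANSIENT_MEDIA_ERROR:
            self.fault = Fault.NONE  # the next command succeeds
        raise DiskError(fault.value)

    def read(self, lba: int) -> bytes:
        self._check_fault()
        return self.blocks.get(lba, bytes(self.BLOCK_SIZE))

    def write(self, lba: int, data: bytes) -> None:
        self._check_fault()
        self.blocks[lba] = data
```

Because faults are applied in software at a single point, the same fault can be replayed at the same moment in every run, which is what makes the experiments reproducible.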
System configuration • [Diagram: a server (AMD K6-2-333, 64 MB DRAM, Linux or Win2000; IDE system disk; Adaptec 2940 UltraSCSI adapters) with IBM 18 GB 10k RPM RAID data disks, connected over Fast/Wide SCSI buses (20 MB/sec) to a disk emulator (AMD K6-2-350, Windows NT 4.0, ASC VirtualSCSI lib.; AdvStor ASC-U2W and Adaptec 2940 adapters; SCSI system disk; IBM 18 GB 10k RPM NTFS backing disk) that presents the emulated disk and the emulated spare disk] • RAID-5 Volume: 3GB capacity, 1GB used per disk • 3 physical disks, 1 emulated disk, 1 emulated spare disk • 2 web clients connected via 100Mb switched Ethernet
Results: single-fault experiments • One exp’t for each type of fault (15 total) • only one fault injected per experiment • no human intervention • system allowed to continue until stabilized or crashed • Four distinct system behaviors observed (A) no effect: system ignores fault (B) RAID system enters degraded mode (C) RAID system begins reconstruction onto spare disk (D) system failure (hang or crash)
System behavior: single-fault • [Figure: one panel per observed behavior: (A) no effect, (B) enter degraded mode, (C) begin reconstruction, (D) system failure]
System behavior: single-fault (2) • Windows ignores benign faults • Windows can’t automatically rebuild • Linux reconstructs on all errors • Both systems fail when disk hangs
Interpretation: single-fault exp’ts • Linux and Windows take opposite approaches to managing benign and transient faults • these faults do not necessarily imply a failing disk • Tertiary Disk: 368/368 disks had transient SCSI errors; 13/368 disks had transient hardware errors, only 2/368 needed replacing. • Linux is paranoid and stops using a disk on any error • fragile: system is more vulnerable to multiple faults • but no chance of slowly-failing disk impacting perf. • Windows ignores most benign/transient faults • robust: less likely to lose data, more disk-efficient • less likely to catch slowly-failing disks and remove them • Neither policy is ideal! • need a hybrid?
Results: multiple-fault experiments • Scenario (1) disk fails (2) data is reconstructed onto spare (3) spare fails (4) administrator replaces both failed disks (5) data is reconstructed onto new disks • Requires human intervention • to initiate reconstruction on Windows 2000 • simulate 6 minute sysadmin response time • to replace disks • simulate 90 seconds of time to replace hot-swap disks
System behavior: multiple-fault • [Figure: two panels, Windows 2000/IIS and Linux/Apache] • Windows reconstructs ~3x faster than Linux • Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
Interpretation: multi-fault exp’ts • Linux and Windows have different reconstruction philosophies • Linux uses idle bandwidth for reconstruction • little impact on application performance • increases length of time system is vulnerable to faults • Windows steals app. bandwidth for reconstruction • reduces application performance • minimizes system vulnerability • but must be manually initiated (or scripted) • Windows favors fault-tolerance over performance; Linux favors performance over fault-tolerance • the same design philosophies seen in the single-fault experiments
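The trade-off between the two philosophies can be made concrete with a deliberately crude model. This is an illustrative assumption, not a description of either implementation: it treats reconstruction as taking a fixed share of disk bandwidth (Linux actually uses idle bandwidth, which varies with load), and the shares passed in below are hypothetical. The 1 GB per disk and 20 MB/sec figures come from the system configuration.

```python
from typing import Tuple

def reconstruction_tradeoff(data_mb: float,
                            disk_mb_per_s: float,
                            rebuild_share: float) -> Tuple[float, float]:
    """Return (rebuild time in seconds, fraction of bandwidth left for the app).

    Toy model: the rebuild gets a fixed `rebuild_share` of the bandwidth
    and the application gets the rest.
    """
    if not 0.0 < rebuild_share <= 1.0:
        raise ValueError("rebuild_share must be in (0, 1]")
    rebuild_s = data_mb / (disk_mb_per_s * rebuild_share)
    app_fraction = 1.0 - rebuild_share
    return rebuild_s, app_fraction

# Hypothetical shares: a "gentle" rebuild versus an "aggressive" one.
gentle = reconstruction_tradeoff(1024, 20, rebuild_share=0.25)
aggressive = reconstruction_tradeoff(1024, 20, rebuild_share=0.75)
```

In this model, tripling the rebuild share cuts the window of vulnerability to a third while leaving the application a third of the bandwidth it had under the gentle setting, which mirrors the direction of the observed difference without claiming to reproduce its magnitude.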
Maintainability Observations • Scenario: administrator accidentally removes and replaces live disk in degraded mode • double failure; no guarantee on data integrity • theoretically, can recover if writes are queued • Windows recovers, but loses active writes • journalling NTFS is not corrupted • all data not being actively written is intact • Linux will not allow removed disk to be reintegrated • total loss of all data on RAID volume!
Maintainability Observations (2) • Scenario: administrator adds a new spare • a common task that can be done with hot-swap drive trays • Linux requires a reboot for the disk to be recognized • Windows can dynamically detect the new disk • Windows 2000 RAID is easier to maintain • easier GUI configuration • more flexible in adding disks • SCSI rescan and NTFS deal with administrator goofs • less likely to require administration due to transient errors • BUT must manually initiate reconstruction when needed
Outline • Motivation: why new benchmarks, why AME? • Availability benchmarks: a general approach • Case study: availability of Software RAID • Conclusions and future directions
Conclusions • AME benchmarks are needed to direct research toward today’s pressing problems • Our availability benchmark methodology is powerful • revealed undocumented design decisions in Linux and Windows 2000 software RAID implementations • transient error handling • reconstruction priorities • Windows & Linux SW RAIDs are imperfect • but Windows is easier to maintain, less likely to fail due to double faults, and less likely to waste disks • if spares are cheap and plentiful, Linux auto-reconstruction gives it the edge
Future Directions • Add maintainability • use methodology from availability benchmark • but include administrator’s response to faults • must develop model of typical administrator behavior • can we quantify administrative work needed to maintain a certain level of availability? • Expand availability benchmarks • apply to other systems: DBMSs, mail, distributed apps • use ISTORE-1 prototype • 80-node x86 cluster with built-in fault injection, diags. • Add evolutionary growth • just extend maintainability/availability techniques?
Availability Example: SW RAID • Win2k/IIS, Linux/Apache on software RAID-5 volumes • [Figure: two panels, Windows 2000/IIS and Linux/Apache] • Windows gives more bandwidth to reconstruction, minimizing fault vulnerability at cost of app performance • compare to Linux, which does the opposite