This article discusses the post-PC era, which will be driven by tiny embedded devices and the infrastructure needed to support them. It explores the motivation behind the ISTORE project and its research principles, as well as proposed techniques for achieving availability, maintainability, and evolutionary growth. The article also addresses the challenges and potential solutions for memory systems in microprocessors and the scalability problems in traditional server designs.
Computers for the Post-PC Era David Patterson University of California at Berkeley Patterson@cs.berkeley.edu UC Berkeley IRAM Group UC Berkeley ISTORE Group istore-group@cs.berkeley.edu 10 February 2000
Perspective on Post-PC Era • Post-PC Era will be driven by 2 technologies: 1) Tiny Embedded or Mobile Consumer Devices • e.g., successor to PDA, cell phone, wearable computers • ubiquitous: in everything 2) Infrastructure to Support such Devices • e.g., successor to Big Fat Web Servers, Database Servers
Outline 1) One instance of microprocessors for gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work
Revive Vector Architecture • Traditional objections to vectors, with IRAM's answers: • Cost: $1M each? → single-chip CMOS MPU/IRAM • Low latency, high-BW memory system? → IRAM • Code density? → much smaller than VLIW • Compilers? → for sale, mature (>20 years); we retarget Cray compilers • Performance? → easy to scale speed with technology • Power/Energy? → parallel lanes save energy, keep performance • Limited to scientific applications? → multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
V-IRAM1: Low Power v. High Perf. • [Block diagram: 2-way superscalar processor with vector instruction queue, vector registers, and load/store unit; 16K I-cache and 16K D-cache; vector datapath configurable as 4 x 64, 8 x 32, or 16 x 16 with +, x, and ÷ units; 4 x 64 paths through a memory crossbar switch to rows of DRAM macros; serial I/O]
VIRAM-1: System on a Chip • Prototype scheduled for tape-out mid-2000 • 0.18 um EDL process • 16 MB DRAM, 8 banks • MIPS scalar core and caches @ 200 MHz • 4 64-bit vector unit pipelines (lanes) @ 200 MHz • 4 100 MB/s parallel I/O lines • 17x17 mm, 2 Watts • 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar) • 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit) • [Die floorplan: two 64-Mbit / 8-MByte memory halves flanking the CPU+caches, 4 vector pipes/lanes, crossbar, and I/O]
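The peak figures on this slide follow from the clock and lane counts. A quick arithmetic check, assuming (my assumption, not stated on the slide) one 64-bit multiply-add per lane per cycle and 4-way 16-bit subword parallelism:

```python
# Sanity-check of the VIRAM-1 peak figures. Assumptions (mine, not on the
# slide): each 64-bit lane retires one fused multiply-add (2 flops) per
# cycle, and 16-bit operations pack 4 subwords per 64-bit lane.
LANES = 4
CLOCK_HZ = 200e6
OPS_PER_LANE_PER_CYCLE = 2                  # assumed multiply-add

gflops_64 = LANES * CLOCK_HZ * OPS_PER_LANE_PER_CYCLE / 1e9
gops_16 = gflops_64 * (64 // 16)            # 4-way subword parallelism

# Memory bandwidth: 6.4 GB/s per direction and per crossbar,
# times 2 directions, times 2 crossbars.
total_bw_gbs = 6.4 * 2 * 2

print(gflops_64, gops_16, total_bw_gbs)     # 1.6 GFLOPS, 6.4 GOPS, 25.6 GB/s
```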
Baseline system comparison • All numbers in cycles/pixel • MMX and VIS results assume all data in L1 cache • [Comparison table not reproduced]
IRAM Chip Challenges • Merged Logic-DRAM process: Cost of wafer, Impact on yield, testing cost of logic and DRAM • Price: on-chip DRAM v. separate DRAM chips? • Time delay of transistor speeds, memory cell sizes in Merged process vs. Logic only or DRAM only • DRAM block: flexibility via DRAM “compiler” (vary size, width, no. subbanks) vs. fixed block • Applications: advantages in memory bandwidth, energy, system size to offset above challenges?
Other examples: Sony Playstation 2 • Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5) • Superscalar MIPS core + vector coprocessor + graphics/DRAM • Claim: “Toy Story” realism brought to games!
Other examples: IBM Blue Gene • Blue Gene Chip • 20 x 20 mm • 32 Multithreaded RISC processors + ??MB Embedded DRAM + high speed Network Interface on single chip • 1 GFLOPS / processor • 2’ x 2’ Board = 64 chips • Tower = 8 Boards • System = 64 Towers • Total 1 million processors (2^5 x 2^6 x 2^3 x 2^6), in just 2000 sq. ft. • Cost: $100M • Goal: 1 PetaFLOPS in 2005? • Application: Protein Folding
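The processor count and the 1-PetaFLOPS goal follow directly from the slide's numbers:

```python
# Check the Blue Gene arithmetic: 32 processors/chip x 64 chips/board
# x 8 boards/tower x 64 towers = 2^20 processors; at 1 GFLOPS each,
# that is roughly 1 PetaFLOPS.
total_procs = 32 * 64 * 8 * 64
peak_pflops = total_procs * 1e9 / 1e15
print(total_procs, peak_pflops)   # 1048576 processors, ~1.05 PetaFLOPS
```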
Outline 1) One instance of microprocessors for gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work
The problem space: big data • Big demand for enormous amounts of data • today: high-end enterprise and Internet applications • enterprise decision-support, data mining databases • online applications: e-commerce, mail, web, archives • future: infrastructure services, richer data • computational & storage back-ends for mobile devices • more multimedia content • more use of historical data to provide better services • Today’s server designs can’t easily scale to meet these huge demands • bus bandwidth bottlenecks limit access to stored data • SMP designs are near their limits and don’t offer incremental growth path
One approach: traditional NAS • Network-attached storage makes storage devices first-class citizens on the network • network file server appliances (NetApp, SNAP, ...) • storage-area networks (CMU NASD, NSIC OOD, ...) • active disks (CMU, UCSB, Berkeley IDISK) • These approaches primarily target performance scalability • scalable networks remove bus bandwidth limitations • migration of layout functionality to storage devices removes overhead of intermediate servers • There are bigger scaling problems than scalable performance!
The real scalability problems: AME • Availability • systems should continue to meet quality of service goals despite hardware and software failures • Maintainability • systems should require only minimal ongoing human administration, regardless of scale or complexity • Evolutionary Growth • systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded • These are problems at today’s scales, and will only get worse as systems grow
The ISTORE project vision • Our goal: develop principles and investigate hardware/software techniques for building storage-based server systems that: • are highly available • require minimal maintenance • robustly handle evolutionary growth • are scalable to O(10000) nodes
Principles for achieving AME (1) • No single points of failure • Redundancy everywhere • Performance robustness is more important than peak performance • “performance robustness” implies that real-world performance is comparable to best-case performance • Performance can be sacrificed for improvements in AME • resources should be dedicated to AME • compare: biological systems spend > 50% of resources on maintenance • can make up performance by scaling system
Principles for achieving AME (2) • Introspection • reactive techniques to detect and adapt to failures, workload variations, and system evolution • proactive techniques to anticipate and avert problems before they happen • Benchmarking • quantification brings rigor • requires new AME benchmarks “what gets measured gets done” “benchmarks shape a field”
Outline 1) One instance of microprocessors for gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work
Hardware techniques • Fully shared-nothing cluster organization • truly scalable architecture • architecture that can tolerate partial failure • automatic hardware redundancy • Storage distributed with computation nodes • distributed processing reduces data movement and avoids network bottlenecks • nodes are responsible for the health of the storage that they own • if AME is important, must provide resources to be used for AME
Hardware techniques (2) • Heavily instrumented hardware • sensors for temp, vibration, humidity, power, intrusion • helps detect environmental problems before they can affect system integrity • Independent diagnostic processor on each node • provides remote control of power, remote console access to the node, selection of node boot code • collects, stores, processes environmental data for abnormalities • non-volatile “flight recorder” functionality • all diagnostic processors connected via independent diagnostic network
Hardware techniques (3) • Built-in fault injection capabilities • power control to individual node components • injectable glitches into I/O and memory busses • on-demand network partitioning/isolation • managed by diagnostic processor and network switches via diagnostic network • used for proactive hardware introspection • automated detection of flaky components • controlled testing of error-recovery mechanisms • important for AME benchmarking
ISTORE-1 hardware platform • 80-node x86-based cluster, 1.4TB storage • cluster nodes are plug-and-play, intelligent, network-attached storage “bricks” • a single field-replaceable unit to simplify maintenance • each node is a full x86 PC w/256MB DRAM, 18GB disk • more CPU than NAS; fewer disks/node than cluster • Intelligent Disk “Brick”: portable PC CPU (Pentium II/266 + DRAM), redundant NICs (4 100 Mb/s links), diagnostic processor, disk in half-height canister • ISTORE Chassis: 80 nodes, 8 per tray • 2 levels of switches: 20 100 Mb/s, 2 1 Gb/s • Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
ISTORE Brick Block Diagram • [Block diagram: Mobile Pentium II module (CPU, North Bridge, 256 MB DRAM) on a PCI bus with South Bridge, SCSI to an 18 GB disk, 4x100 Mb/s Ethernets, Super I/O, and BIOS; a separate Diagnostic Processor (Monitor & Control, dual UART, Flash, RTC, RAM) on its own Diagnostic Net] • Sensors for heat and vibration • Control over power to individual nodes
A glimpse into the future? • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • ISTORE HW in 5-7 years: • building block: 2006 MicroDrive integrated with IRAM • 9GB disk, 50 MB/sec from disk • connected via crossbar switch • 10,000+ nodes fit into one rack! • This scale is our ultimate design point
Software techniques • Fully-distributed, shared-nothing code • centralization breaks down as systems scale up to O(10000) nodes • avoids single-point-of-failure front ends • Redundant data storage • required for high availability, simplifies self-testing • replication at the level of application objects • application can control consistency policy • more opportunity for data placement optimization
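A minimal sketch of the replication idea (illustrative names, not the ISTORE implementation): each application object is written to several nodes, so a read succeeds as long as any replica survives. A real system would also expose the consistency policy to the application, as the slide notes.

```python
# Sketch of application-object replication: writes land on several
# nodes; reads survive node failures. Hypothetical class, not ISTORE code.
import random

class ReplicatedStore:
    def __init__(self, nodes, replication=3):
        self.nodes = {n: {} for n in nodes}   # node -> {key: value}
        self.replication = replication

    def put(self, key, value):
        # Place replicas on distinct, randomly chosen nodes.
        targets = random.sample(list(self.nodes), self.replication)
        for n in targets:
            self.nodes[n][key] = value

    def get(self, key, failed=()):
        # Any surviving replica can serve the read.
        for n, store in self.nodes.items():
            if n not in failed and key in store:
                return store[key]
        raise KeyError(key)
```

With 3-way replication, reads still succeed after two of the three replica holders fail.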
Software techniques (2) • “River” storage interfaces • NOW Sort experience: performance heterogeneity is the norm • disks: inner vs. outer track (50%), fragmentation • processors: load (1.5-5x) • Solution: demand-driven delivery of data to apps • via distributed queues and graduated declustering • for apps that can handle unordered data delivery • automatically adapts to variations in performance of producers and consumers
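The demand-driven delivery idea can be sketched with a shared queue: consumers pull at their own pace, so a slower consumer automatically receives less data and no central scheduler decides the split. This is a toy model, not the River implementation, and it assumes unordered delivery as the slide requires.

```python
# Sketch of demand-driven delivery: producers fill a shared queue,
# consumers of different speeds drain it at their own rate.
import queue
import threading
import time

def run_river(records, consumer_delays):
    """Distribute records on demand; returns what each consumer received."""
    q = queue.Queue()
    for r in records:
        q.put(r)
    results = {i: [] for i in range(len(consumer_delays))}

    def consume(i, delay):
        while True:
            try:
                r = q.get_nowait()
            except queue.Empty:
                return                     # queue drained
            time.sleep(delay)              # model heterogeneous performance
            results[i].append(r)

    threads = [threading.Thread(target=consume, args=(i, d))
               for i, d in enumerate(consumer_delays)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```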
Software techniques (3) • Reactive introspection • use statistical techniques to identify normal behavior and detect deviations from it • policy-driven automatic adaptation to abnormal behavior once detected • initially, rely on human administrator to specify policy • eventually, system learns to solve problems on its own by experimenting on isolated subsets of the nodes • one candidate: reinforcement learning
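The statistical step of reactive introspection can be sketched as follows (a deliberately simple Gaussian model, illustrative only; the slide leaves the actual techniques open): learn the mean and spread of a metric from normal operation, then flag samples more than k standard deviations away.

```python
# Sketch of "identify normal behavior and detect deviations from it".
import statistics

def make_detector(normal_samples, k=3.0):
    """Return a predicate flagging values > k standard deviations from normal."""
    mu = statistics.mean(normal_samples)
    sigma = statistics.stdev(normal_samples)
    def is_abnormal(x):
        return abs(x - mu) > k * sigma
    return is_abnormal
```

Trained on, say, hits/second during normal operation, such a detector could be the trigger for a policy-driven adaptation.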
Software techniques (4) • Proactive introspection • continuous online self-testing of HW and SW • in deployed systems! • goal is to shake out “Heisenbugs” before they’re encountered in normal operation • needs data redundancy, node isolation, fault injection • techniques: • fault injection: triggering hardware and software error handling paths to verify their integrity/existence • stress testing: push HW/SW to their limits • scrubbing: periodic restoration of potentially “decaying” hardware or software state • self-scrubbing data structures (like MVS) • ECC scrubbing for disks and memory
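The scrubbing technique can be sketched as a periodic checksum-and-repair pass over stored blocks (hypothetical data structures, not ISTORE code; this also shows why scrubbing needs the data redundancy listed above):

```python
# Sketch of proactive scrubbing: recompute each block's checksum and
# repair silently "decayed" blocks from a redundant copy.
import zlib

def scrub(blocks, checksums, replicas):
    """Repair blocks whose contents no longer match their checksum."""
    repaired = []
    for key, data in blocks.items():
        if zlib.crc32(data) != checksums[key]:
            blocks[key] = replicas[key]       # restore from redundant copy
            repaired.append(key)
    return repaired
```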
Applications • ISTORE is not one super-system that demonstrates all these techniques! • Initially provide library to support AME goals • Initial application targets • cluster web/email servers • self-scrubbing data structures, online self-testing • statistical identification of normal behavior • decision-support database query execution system • River-based storage, replica management • information retrieval for multimedia data • self-scrubbing data structures, structuring performance-robust distributed computation
Outline 1) One instance of microprocessors for gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work
Availability benchmarks • Questions to answer • what factors affect the quality of service delivered by the system, and by how much/how long? • how well can systems survive typical failure scenarios? • Availability metrics • traditionally, percentage of time system is up • time-averaged, binary view of system state (up/down) • traditional metric is too inflexible • doesn’t capture spectrum of degraded states • time-averaging discards important temporal behavior • Solution: measure variation in system quality of service metrics over time • performance, fault-tolerance, completeness, accuracy
Availability benchmark methodology • Goal: quantify variation in QoS metrics as events occur that affect system availability • Leverage existing performance benchmarks • to generate fair workloads • to measure & trace quality of service metrics • Use fault injection to compromise system • hardware faults (disk, memory, network, power) • software faults (corrupt input, driver error returns) • maintenance events (repairs, SW/HW upgrades) • Examine single-fault and multi-fault workloads • the availability analogues of performance micro- and macro-benchmarks
Methodology: reporting results • Results are most accessible graphically • plot change in QoS metrics over time • compare to “normal” behavior • 99% confidence intervals calculated from no-fault runs • Graphs can be distilled into numbers • quantify distribution of deviations from normal behavior, compute area under curve for deviations, ...
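A sketch of how the confidence band and the area-under-curve number might be computed (assuming roughly normal no-fault samples; the slide does not specify the exact statistics):

```python
# Sketch of the reporting step: derive a 99% confidence band from
# no-fault runs, then score a faulty run by the area of its QoS
# shortfall below that band (one time unit per sample).
import statistics

def deviation_area(no_fault_samples, faulty_trace, z=2.576):
    """Area of a faulty run's QoS shortfall below the 99% no-fault band."""
    mu = statistics.mean(no_fault_samples)
    sigma = statistics.stdev(no_fault_samples)
    lower = mu - z * sigma                        # 99% band, normal assumption
    return sum(max(0.0, lower - x) for x in faulty_trace)
```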
Example results: software RAID-5 • Test systems: Linux/Apache and Win2000/IIS • SpecWeb ’99 to measure hits/second as QoS metric • fault injection at disks based on empirical fault data • transient, correctable, uncorrectable, & timeout faults • 15 single-fault workloads injected per system • only 4 distinct behaviors observed: (A) no effect (B) system hangs (C) RAID enters degraded mode (D) RAID enters degraded mode & starts reconstruction • both systems hung (B) on simulated disk hangs • Linux exhibited (D) on all other errors • Windows exhibited (A) on transient errors and (C) on uncorrectable, sticky errors
Example results: multiple-faults • [Graphs: hits/second over time during reconstruction, Windows 2000/IIS vs. Linux/Apache] • Windows reconstructs ~3x faster than Linux • Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
Conclusions • IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth • Mobile consumer electronic devices • Scalable infrastructure • IRAM benchmarking result: faster than DSPs • ISTORE: hardware/software architecture for large-scale network services • Scaling systems requires • new continuous models of availability • performance not limited by the weakest link • self-* systems to reduce human interaction
Benchmark conclusions • Linux and Windows take opposite approaches to managing benign and transient faults • Linux is paranoid and stops using a disk on any error • Windows ignores most benign/transient faults • Windows is more robust except when disk is truly failing • Linux and Windows have different reconstruction philosophies • Linux uses idle bandwidth for reconstruction • Windows steals app. bandwidth for reconstruction • Windows rebuilds fault-tolerance more quickly • Win2k favors fault-tolerance over performance; Linux favors performance over fault-tolerance
ISTORE conclusions • Availability, Maintainability, and Evolutionary growth are key challenges for server systems • more important even than performance • ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers • via clusters of network-attached, computationally-enhanced storage nodes running distributed code • via hardware and software introspection • we are currently performing application studies to investigate and compare techniques • Availability benchmarks are a powerful tool • revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
Future work • ISTORE • implement AME-enhancing techniques in a variety of Internet, enterprise, and info retrieval applications • select the best techniques and integrate into a generic runtime system with “AME API” • AME benchmarks • expand availability benchmarks to distributed apps • add maintainability • use methodology from availability benchmark • but include administrator’s response to faults • must develop model of typical administrator behavior • can we quantify administrative work needed to maintain a certain level of availability?
The UC Berkeley ISTORE Project: bringing availability, maintainability, and evolutionary growth to storage-based clusters For more information: http://iram.cs.berkeley.edu/istore istore-group@cs.berkeley.edu
Backup Slides (mostly in the area of benchmarking)
Case study • Software RAID-5 plus web server • Linux/Apache vs. Windows 2000/IIS • Why software RAID? • well-defined availability guarantees • RAID-5 volume should tolerate a single disk failure • reduced performance (degraded mode) after failure • may automatically rebuild redundancy onto spare disk • simple system • easy to inject storage faults • Why web server? • an application with measurable QoS metrics that depend on RAID availability and performance
Benchmark environment: metrics • QoS metrics measured • hits per second • roughly tracks response time in our experiments • degree of fault tolerance in storage system • Workload generator and data collector • SpecWeb99 web benchmark • simulates realistic high-volume user load • mostly static read-only workload; some dynamic content • modified to run continuously and to measure average hits per second over each 2-minute interval
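The 2-minute averaging can be sketched as a simple bucketing pass over request-completion timestamps (illustrative only, not the modified SpecWeb99 code):

```python
# Sketch of the measurement loop: bucket request completions into fixed
# intervals and report average hits per second per interval. The slide
# uses 2-minute (120 s) intervals.
def hits_per_second(completion_times, interval=120.0):
    """Average hits/sec per fixed interval from completion timestamps."""
    if not completion_times:
        return []
    n_buckets = int(max(completion_times) // interval) + 1
    counts = [0] * n_buckets
    for t in completion_times:
        counts[int(t // interval)] += 1
    return [c / interval for c in counts]
```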
Benchmark environment: faults • Focus on faults in the storage system (disks) • How do disks fail? • according to Tertiary Disk project, failures include: • recovered media errors • uncorrectable write failures • hardware errors (e.g., diagnostic failures) • SCSI timeouts • SCSI parity errors • note: no head crashes, no fail-stop failures
Disk fault injection technique • To inject reproducible failures, we replaced one disk in the RAID with an emulated disk • a PC that appears as a disk on the SCSI bus • I/O requests processed in software, reflected to local disk • fault injection performed by altering SCSI command processing in the emulation software • Types of emulated faults: • media errors (transient, correctable, uncorrectable) • hardware errors (firmware, mechanical) • parity errors • power failures • disk hangs/timeouts
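A toy model of the emulator's fault-injection hook (command and fault names are illustrative, not the actual ASC VirtualSCSI interface): each incoming command is checked against the currently injected fault before normal processing, which is how altering the command path yields reproducible failures.

```python
# Sketch of a fault-injecting disk emulator: injected faults intercept
# command processing; otherwise reads and writes hit a backing store.
class EmulatedDisk:
    """Toy stand-in for the SCSI disk emulator (illustrative names)."""
    def __init__(self):
        self.blocks = {}
        self.fault = None                     # e.g. "media_error", "timeout"

    def inject(self, fault):
        self.fault = fault

    def handle(self, op, lba, data=None):
        # Fault check happens before normal command processing.
        if self.fault == "timeout":
            return ("no_response",)           # models a disk hang
        if self.fault == "media_error" and op == "read":
            self.fault = None                 # transient: clears after one error
            return ("check_condition", "medium_error")
        if op == "write":
            self.blocks[lba] = data
            return ("good",)
        return ("good", self.blocks.get(lba))
```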
System configuration • [Diagram: test server (AMD K6-2-333, 64 MB DRAM, Linux or Win2000) with IDE and SCSI system disks; three IBM 18 GB 10k RPM RAID data disks plus an emulated disk and an emulated spare disk on Fast/Wide SCSI busses (20 MB/sec) through Adaptec 2940 UltraSCSI adapters; the disk emulator is an AMD K6-2-350 running Windows NT 4.0 with the ASC VirtualSCSI library, an AdvStor ASC-U2W adapter, and an IBM 18 GB 10k RPM emulator backing disk (NTFS)] • RAID-5 Volume: 3GB capacity, 1GB used per disk • 3 physical disks, 1 emulated disk, 1 emulated spare disk • 2 web clients connected via 100Mb switched Ethernet
Results: single-fault experiments • One exp’t for each type of fault (15 total) • only one fault injected per experiment • no human intervention • system allowed to continue until stabilized or crashed • Four distinct system behaviors observed (A) no effect: system ignores fault (B) RAID system enters degraded mode (C) RAID system begins reconstruction onto spare disk (D) system failure (hang or crash)
System behavior: single-fault • [Graphs of hits/second over time illustrating each behavior: (A) no effect (B) enter degraded mode (C) begin reconstruction (D) system failure]
System behavior: single-fault (2) • Windows ignores benign faults • Windows can’t automatically rebuild • Linux reconstructs on all errors • Both systems fail when disk hangs