Computers for the Post-PC Era David Patterson, Katherine Yelick University of California at Berkeley Patterson@cs.berkeley.edu UC Berkeley IRAM Group UC Berkeley ISTORE Group istore-group@cs.berkeley.edu February 2000
Perspective on Post-PC Era • PostPC Era will be driven by 2 technologies: 1) “Gadgets”: Tiny Embedded or Mobile Devices • ubiquitous: in everything • e.g., successor to PDA, cell phone, wearable computers 2) Infrastructure to Support such Devices • e.g., successor to Big Fat Web Servers, Database Servers
Outline 1) Example microprocessor for PostPC gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work
Intelligent RAM: IRAM • Microprocessor & DRAM on a single chip: • 10X capacity vs. SRAM • on-chip memory latency 5-10X, bandwidth 50-100X • improve energy efficiency 2X-4X (no off-chip bus) • serial I/O 5-10X v. buses • smaller board area/volume • IRAM advantages extend to: • a single chip system • a building block for larger systems [Diagram: conventional system with a logic-fab processor chip (Proc, $, $, L2$) connected over buses to separate DRAM-fab memory chips and I/O, vs. IRAM with processor, DRAM, and serial I/O integrated on one chip]
New Architecture Directions • “…media processing will become the dominant force in computer arch. and microprocessor design.” • “...new media-rich applications ... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, 32-bit integer and Fl. Pt.” • Needs include real-time response, continuous media data types (no temporal locality), fine-grain parallelism, coarse-grain parallelism, memory bandwidth • “How Multimedia Workloads Will Change Processor Design”, Diefendorff & Dubey, IEEE Computer (9/97)
Revive Vector Architecture • Cost: $1M each? => Single-chip CMOS MPU/IRAM • Low latency, high BW memory system? => IRAM • Code density? => Much smaller than VLIW • Compilers? => For sale, mature (>20 years) (We retarget Cray compilers) • Performance? => Easy scale speed with technology • Power/Energy? => Parallel to save energy, keep performance • Limited to scientific applications? => Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
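To make “multimedia apps vectorizable” concrete, here is a sketch of the kind of kernel a vectorizing compiler maps onto vector hardware: a saturating add over packed 16-bit samples. It is written in NumPy purely to show the data-parallel structure; the sizes and function name are illustrative, and on VIRAM the equivalent C loop would be the target of the retargeted Cray vectorizing compiler.

```python
import numpy as np

def saturating_add_16bit(a, b):
    """Element-wise saturating add of 16-bit samples (e.g., audio mixing).

    Every element is independent, so the whole loop is one long vector
    operation: the N*16b case a vector unit executes in a handful of
    instructions regardless of N.
    """
    result = a.astype(np.int32) + b.astype(np.int32)        # widen to avoid overflow
    return np.clip(result, -32768, 32767).astype(np.int16)  # saturate back to 16 bits

# Illustrative use: mix two 1-second, 44.1 kHz audio buffers.
a = np.random.randint(-32768, 32767, size=44100, dtype=np.int16)
b = np.random.randint(-32768, 32767, size=44100, dtype=np.int16)
mixed = saturating_add_16bit(a, b)
```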
V-IRAM1: Low Power v. High Perf. [Block diagram: 2-way superscalar processor with 16K I cache and 16K D cache; vector instruction queue, vector registers, and load/store unit; arithmetic units (+, x, ÷) configurable as 4 x 64-bit, 8 x 32-bit, or 16 x 16-bit operations; serial I/O; 4 x 64-bit ports into a memory crossbar switch feeding many DRAM banks]
VIRAM-1: System on a Chip • Prototype scheduled for tape-out mid-2001 • 0.18 um embedded DRAM/logic (EDL) process • 16 MB DRAM, 8 banks • MIPS scalar core and caches @ 200 MHz • 4 x 64-bit vector unit pipelines @ 200 MHz • 4 parallel I/O lines @ 100 MB/s • 17x17 mm, 2 Watts • 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar) • 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit) [Floorplan: CPU + caches and 4 vector pipes/lanes, crossbar and I/O, flanked by two 64-Mbit (8-MByte) DRAM memory halves]
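As a sanity check, the quoted peak rates are consistent with the lane count and clock rate if one assumes a fused multiply-add (2 operations) per lane per cycle; that assumption is mine, not stated on the slide.

```latex
\begin{align*}
\text{64-bit peak} &= 4\ \text{lanes} \times 200\ \text{MHz} \times 2\ \tfrac{\text{ops}}{\text{lane}\cdot\text{cycle}} = 1.6\ \text{GFLOPS}\\
\text{16-bit peak} &= 4\ \text{lanes} \times 4\ \tfrac{\text{16-bit elements}}{\text{64-bit lane}} \times 200\ \text{MHz} \times 2 = 6.4\ \text{GOPS}
\end{align*}
```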
IRAM Chip Challenges • Merged logic-DRAM process cost: cost of wafer, impact on yield, testing cost of logic and DRAM • Price: on-chip DRAM v. separate DRAM chips? • Merged process lags logic-only or DRAM-only processes in transistor speed and memory cell size • DRAM block: flexibility via DRAM “compiler” (vary size, width, no. of subbanks) vs. fixed block • Apps: are the advantages in memory bandwidth, energy, and system size enough to offset these challenges?
Other examples: IBM “Blue Gene” • 1 PetaFLOPS in 2003 for $100M? • Application: Protein Folding • Blue Gene Chip • 25-32 multithreaded RISC processors + 0.5 MB embedded DRAM / processor + high-speed network interface on a 20 x 20 mm chip • 1 GFLOPS / processor • 2’ x 2’ Board = 64 chips (1.6K-2K CPUs) • Rack = 8 Boards (512 chips, 13K-16K CPUs) • System = 64-80 Racks (512 boards, 32-40K chips) • Total 1 million processors, 1 MW in just 2000 sq. ft. • Since single app, unbalanced system to save money • Traditional ratios: 1 MIPS, 1 MB, 1 Mbit/s I/O • Blue Gene ratios: 1 MIPS, 0.005 MB, 0.2 Mbit/s I/O
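A quick arithmetic check shows the per-level counts on the slide multiply out to the quoted system totals:

```latex
\begin{align*}
\text{chips} &\approx 64\ \tfrac{\text{chips}}{\text{board}} \times 8\ \tfrac{\text{boards}}{\text{rack}} \times 64\text{--}80\ \text{racks} \approx 33\text{K--}41\text{K}\\
\text{CPUs} &\approx 33\text{K--}41\text{K chips} \times 25\text{--}32\ \tfrac{\text{CPUs}}{\text{chip}} \approx 10^{6}\\
\text{peak} &\approx 10^{6}\ \text{CPUs} \times 1\ \tfrac{\text{GFLOPS}}{\text{CPU}} = 1\ \text{PFLOPS}
\end{align*}
```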
Other examples: Sony Playstation 2 • Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5) • Superscalar MIPS core + vector coprocessor + graphics/DRAM • Claim: “Toy Story” realism brought to games
Outline 1) Example microprocessor for PostPC gadgets 2) Motivation and the ISTORE project vision • AME: Availability, Maintainability, Evolutionary growth • ISTORE’s research principles • Proposed techniques for achieving AME • Benchmarks for AME • Conclusions and future work
The problem space: big data • Big demand for enormous amounts of data • today: high-end enterprise and Internet applications • enterprise decision-support, data mining databases • online applications: e-commerce, mail, web, archives • future: infrastructure services, richer data • computational & storage back-ends for mobile devices • more multimedia content • more use of historical data to provide better services • Today’s SMP server designs can’t easily scale • Bigger scaling problems than performance!
Lampson: Systems Challenges • Systems that work • Meeting their specs • Always available • Adapting to changing environment • Evolving while they run • Made from unreliable components • Growing without practical limit • Credible simulations or analysis • Writing good specs • Testing • Performance • Understanding when it doesn’t matter “Computer Systems Research-Past and Future” Keynote address, 17th SOSP, Dec. 1999 Butler Lampson Microsoft
Hennessy: What Should the “New World” Focus Be? • Availability • Both appliance & service • Maintainability • Two functions: • Enhancing availability by preventing failure • Ease of SW and HW upgrades • Scalability • Especially of service • Cost • per device and per service transaction • Performance • Remains important, but it’s not SPECint “Back to the Future: Time to Return to Longstanding Problems in Computer Systems?” Keynote address, FCRC, May 1999 John Hennessy Stanford
ISTORE as Storage System of the Future • Availability, Maintainability, and Evolutionary growth key challenges for storage systems • Maintenance Cost = 10X to 100X Purchase Cost, so even 2X purchase cost for 1/2 maintenance cost wins • AME improvement enables even larger systems • ISTORE has cost-performance advantages • Better space, power/cooling costs ($@colocation site) • More MIPS, cheaper MIPS, no bus bottlenecks • Compression reduces network $, encryption protects • Single interconnect, supports evolution of technology • Match to future software storage services • Future storage service software target clusters
Is Maintenance the Key? • Rule of thumb: maintenance costs 10X to 100X the HW • VAX crash data from ’85 and ’93 [Murp95]; extrapolated to ’01 • System management causes multiple crashes per problem via SysAdmin actions: bad parameter settings, bad configuration, bad application installs • HW/OS causes fell from 70% in ’85 to 28% in ’93; in ’01, 10%?
ISTORE-1 hardware platform • 80-node x86-based cluster, 1.4 TB storage • cluster nodes are plug-and-play, intelligent, network-attached storage “bricks” • a single field-replaceable unit to simplify maintenance • each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk • more CPU than NAS; fewer disks/node than cluster • Intelligent Disk “Brick”: portable-PC CPU (Pentium II/266) + DRAM, half-height disk canister, redundant NICs (4 x 100 Mb/s links), diagnostic processor • ISTORE Chassis: 80 nodes, 8 per tray • 2 levels of switches: 20 x 100 Mbit/s, 2 x 1 Gbit/s • Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
ISTORE-1 Brick • Webster’s Dictionary: “brick: a handy-sized unit of building or paving material typically being rectangular and about 2 1/4 x 3 3/4 x 8 inches” • ISTORE-1 Brick: 2 x 4 x 11 inches (1.3x) • Single physical form factor, fixed cooling required, compatible network interface to simplify physical maintenance, scaling over time • Contents should evolve over time: contains most cost effective MPU, DRAM, disk, compatible NI • If useful, could have special bricks (e.g., DRAM rich) • Suggests network that will last, evolve: Ethernet
A glimpse into the future? • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • ISTORE HW in 5-7 years: • 2006 brick: System On a Chip integrated with MicroDrive • 9GB disk, 50 MB/sec from disk • connected via crossbar switch • If low power, 10,000 nodes fit into one rack! • O(10,000) scale is our ultimate design point
ISTORE-2: Deltas from ISTORE-1 • Upgraded Storage Brick • Pentium III 650 MHz Processor • Two Gb Ethernet copper ports per brick • One 2.5" ATA disk (32 GB, 5400 RPM) • 2X DRAM memory • Geographically Dispersed Nodes, Larger System • O(1000) nodes at Almaden, O(1000) at Berkeley • Halve into O(500) nodes at each site to simplify the problem of finding space, and show that it works? • User-Supplied UPS Support
ISTORE-2 Improvements (1): Operator Aids • Every Field-Replaceable Unit (FRU) has a machine-readable unique identifier (UID) => introspective software determines whether the storage system is wired properly initially and has evolved properly • Can a switch failure disconnect both copies of data? • Can a power supply failure disable mirrored disks? • Computer checks for wiring errors and informs the operator, vs. management blaming the operator upon failure • Leverage IBM Vital Product Data (VPD) technology? • External status lights per brick • Disk active, Ethernet port active, redundant HW active, HW failure, software hiccup, ...
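Below is a minimal sketch of the kind of introspective wiring check described above, assuming a hypothetical inventory of FRU UIDs and links (none of these names or data structures are from ISTORE): it asks whether any single switch failure would disconnect every replica of some data item.

```python
from collections import deque

def reachable(adj, start, failed):
    """Nodes reachable from `start` once the `failed` components are removed."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr not in failed and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen

def single_switch_exposures(adj, switches, replicas, root="frontend"):
    """Report switches whose failure would disconnect every replica of some item."""
    problems = []
    for sw in switches:
        up = reachable(adj, root, failed={sw})
        for item, bricks in replicas.items():
            if not any(b in up for b in bricks):
                problems.append((sw, item))
    return problems

# Hypothetical wiring discovered from FRU UIDs: both replicas of "mail-db"
# hang off the same switch, so losing "switch-A" loses every copy.
adj = {
    "frontend": ["switch-A", "switch-B"],
    "switch-A": ["frontend", "brick-1", "brick-2"],
    "switch-B": ["frontend", "brick-3"],
    "brick-1": ["switch-A"], "brick-2": ["switch-A"], "brick-3": ["switch-B"],
}
replicas = {"mail-db": ["brick-1", "brick-2"], "web-logs": ["brick-1", "brick-3"]}
print(single_switch_exposures(adj, ["switch-A", "switch-B"], replicas))
# -> [('switch-A', 'mail-db')]
```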
ISTORE-2 Improvements (2): RAIN • In ISTORE-1, switches account for 1/3 of the space, power, and cost, and that is for just 80 nodes! • Redundant Array of Inexpensive Disks (RAID): replace large, expensive disks with many small, inexpensive disks, saving volume, power, cost • Redundant Array of Inexpensive Network switches (RAIN): replace large, expensive switches with many small, inexpensive switches, saving volume, power, cost? • ISTORE-1: replace 2 16-port 1-Gbit switches with a fat tree of 8 8-port switches, or 24 4-port switches?
ISTORE-2 Improvements (3): System Management Language • Define a high-level, intuitive, non-abstract system management language • Goal: large systems managed by part-time operators! • Language is interpreted for observation, but compiled and error-checked for configuration changes • Examples of tasks which should be made easy (see the sketch below): • Set alarm if any disk is more than 70% full • Backup all data in the Philippines site to the Colorado site • Split system into protected subregions • Discover & display present routing topology • Show correlation between brick temperatures and crashes
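As an illustration only, here is roughly what the compiled form of the “any disk more than 70% full” rule might reduce to at runtime; the mount points, threshold constant, and function name are hypothetical, and the management language itself is the research question, not this code.

```python
import shutil

ALARM_THRESHOLD = 0.70  # "set alarm if any disk is more than 70% full"

def check_disk_fullness(mount_points, threshold=ALARM_THRESHOLD):
    """Return (mount, fraction_used) for every disk over the threshold."""
    alarms = []
    for mount in mount_points:
        usage = shutil.disk_usage(mount)      # total, used, free in bytes
        fraction = usage.used / usage.total
        if fraction > threshold:
            alarms.append((mount, fraction))
    return alarms

# Hypothetical per-brick mount points; in ISTORE this check would run on each
# brick and feed a cluster-wide alarm aggregator.
for mount, frac in check_disk_fullness(["/", "/tmp"]):
    print(f"ALARM: {mount} is {frac:.0%} full")
```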
ISTORE-2 Improvements (4): Options to Investigate • TCP/IP Hardware Accelerator • Class 4: Hardware State Machine • ~10 microsecond latency, full Gbit bandwidth yet full TCP/IP functionality, TCP/IP APIs • Ethernet Sourced in Memory Controller (North Bridge) • Shelf of bricks on researchers’ desktops? • SCSI over TCP Support • Integrated UPS
Why is ISTORE-2 a big machine? • ISTORE is all about managing truly large systems - one needs a large system to discover the real issues and opportunities • target 1k nodes in UCB CS, 1k nodes in IBM ARC • Large systems attract real applications • Without real applications CS research runs open-loop • The geographical separation of ISTORE-2 sub-clusters exposes many important issues • the network is NOT transparent • networked systems fail differently, often insidiously
A Case for Intelligent Storage • Advantages: • Cost of bandwidth • Cost of space • Cost of storage system v. cost of disks • Physical repair, number of spare parts • Cost of processor complexity • Cluster advantages: dependability, scalability • 1 v. 2 networks
Cost of Space, Power, Bandwidth • Co-location sites (e.g., Exodus) offer space, expandable bandwidth, stable power • Charge ~$1000/month per rack ( ~ 10 sq. ft.) • Includes 1 20-amp circuit/rack; charges ~$100/month per extra 20-amp circuit/rack • Bandwidth cost: ~$500 per Mbit/sec/Month
Cost of Bandwidth, Safety • Network bandwidth cost is significant: 1000 Mbit/sec sustained => ~$6,000,000/year at the co-location rates above • Security will increase in importance for storage service providers => storage systems of the future need greater computing ability • Compress to reduce the cost of network bandwidth 3X; save ~$4M/year? • Encrypt to protect information in transit for B2B => increasing processing per disk for future storage apps
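The dollar figures follow directly from the co-location bandwidth price quoted on the previous slide:

```latex
\begin{align*}
1000\ \tfrac{\text{Mbit}}{\text{s}} \times \$500\ \tfrac{1}{(\text{Mbit/s})\cdot\text{month}} \times 12\ \tfrac{\text{months}}{\text{yr}} &= \$6\text{M/yr}\\
3\times\ \text{compression} \Rightarrow \tfrac{\$6\text{M}}{3} = \$2\text{M/yr}
&\Rightarrow \text{savings} \approx \$4\text{M/yr}
\end{align*}
```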
Cost of Space, Power • Sun Enterprise server/array (64 CPUs / 60 disks) • 10K Server (64 CPUs): 70 x 50 x 39 in. • A3500 Array (60 disks): 74 x 24 x 36 in. • 2 Symmetra UPS (11KW): 2 * 52 x 24 x 27 in. • ISTORE-1: 2X savings in space • ISTORE-1: 1 rack of (big) switches, 1 rack of (old) UPSs, 1 rack for 80 CPUs/disks (3/8 VME rack unit per brick) • ISTORE-2: 8X-16X space savings? • Space, power cost/year for 1000 disks: Sun $924K, ISTORE-1 $484K, ISTORE-2 $50K
Cost of Storage System v. Disks • Examples show the cost of the way we build current systems (2 networks, many buses, CPUs, ...):

  System       Date    Cost    Maint.   Disks   Disks/CPU   Disks/I/O bus
  NCR WM       10/97   $8.3M   --       1312    10.2        5.0
  Sun 10k      3/98    $5.2M   --       668     10.4        7.0
  Sun 10k      9/99    $6.2M   $2.1M    1732    27.0        12.0
  IBM Netinf   7/00    $7.8M   $1.8M    7040    55.0        9.0

=> Too complicated, too heterogeneous • And databases are often CPU- or bus-bound! • ISTORE disks per CPU: 1.0 • ISTORE disks per I/O bus: 1.0
Disk Limit: Bus Hierarchy • [Diagram: CPU and memory on the memory bus; internal I/O bus (PCI) to a RAID controller; external I/O bus (SCSI, 15 disks/bus) to the disk array; a storage area network (FC-AL) connects server and storage] • Data rate vs. disk rate: • SCSI: Ultra3 (80 MHz), Wide (16 bit): 160 MByte/s • FC-AL: 1 Gbit/s = 125 MByte/s • Use only 50% of a bus: • Command overhead (~20%) • Queuing theory (<70%)
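One way to motivate the “< 70%” rule of thumb is the standard M/M/1 queueing approximation (a textbook result, not from the talk): mean response time grows as 1/(1 - utilization), so it blows up well before a bus reaches 100% of its nominal bandwidth.

```latex
T(\rho) = \frac{T_{\text{service}}}{1-\rho}
\quad\Rightarrow\quad
T(0.5) = 2\,T_{\text{service}},\qquad
T(0.7) \approx 3.3\,T_{\text{service}},\qquad
T(0.9) = 10\,T_{\text{service}}
```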
Physical Repair, Spare Parts • ISTORE: compatible modules based on a hot-pluggable interconnect (LAN) with few Field-Replaceable Units (FRUs): node, power supplies, switches, network cables • Replace a node (disk, CPU, memory, NI) if any part of it fails • Conventional: heterogeneous system with many server modules (CPU, backplane, memory cards, ...) and disk array modules (controllers, disks, array controllers, power supplies, ...) • Must keep all components available somewhere as FRUs • Sun Enterprise 10k has ~100 types of spare parts • Sun 3500 Array has ~12 types of spare parts
ISTORE: Complexity v. Perf • Complexity increase: • HP PA-8500: issues 4 instructions per clock cycle, 56-instruction out-of-order window, 4-Kbit branch predictor, 9-stage pipeline, 512 KB I cache, 1024 KB D cache (>80M transistors just in caches) • Intel SA-110: 16 KB I$, 16 KB D$, 1 instruction per cycle, in-order execution, no branch prediction, 5-stage pipeline • Complexity costs in development time, development power, die size, cost: • 550 MHz HP PA-8500: 477 mm2, 0.25 micron/4M, $330, 60 Watts • 233 MHz Intel SA-110: 50 mm2, 0.35 micron/3M, $18, 0.4 Watts
ISTORE: Cluster Advantages • Architecture that tolerates partial failure • Automatic hardware redundancy • Transparent to application programs • Truly scalable architecture • Limits in size today are maintenance costs, floor space cost - generally NOT capital costs • As a result, it is THE target architecture for new software apps for Internet
ISTORE: 1 vs. 2 networks • Current systems all have a LAN plus a disk interconnect (SCSI, FC-AL) • LAN is improving fastest: most investment, most features • SCSI, FC-AL have poor network features, improve slowly, and are relatively expensive in switches and bandwidth • FC-AL switches don’t interoperate • Two sets of cables, wiring? • Why not a single network based on the best HW/SW technology? • Note: there can still be 2 instances of the network (e.g., external, internal), but only one technology
Initial Applications • ISTORE is not one super-system that demonstrates all these techniques! • Initially provide middleware, library to support AME • Initial application targets • information retrieval for multimedia data (XML storage?) • self-scrubbing data structures, structuring performance-robust distributed computation • Home video server via XML storage? • email service • self-scrubbing data structures, online self-testing • statistical identification of normal behavior
UCB ISTORE Continued Funding • New NSF Information Technology Research, larger funding (>$500K/yr) • 1400 Letters • 920 Preproposals • 134 Full Proposals Encouraged • 240 Full Proposals Submitted • 60 Funded • We are 1 of the 60; starts Sept 2000
NSF ITR Collaboration with Mills • Mills: small undergraduate liberal arts college for women; 8 miles south of Berkeley • Mills students can take 1 course/semester at Berkeley • Hourly shuttle between campuses • Mills also has re-entry MS program for older students • To increase women in Computer Science (especially African-American women): • Offer undergraduate research seminar at Mills • Mills Prof leads; Berkeley faculty, grad students help • Mills Prof goes to Berkeley for meetings, sabbatical • Goal: 2X-3X increase in Mills CS+alumnae to grad school • IBM people want to help?
Conclusion: ISTORE as Storage System of the Future • Availability, Maintainability, and Evolutionary growth key challenges for storage systems • Cost of Maintenance = 10X Cost of Purchase, so even 2X purchase cost for 1/2 maintenance cost is good • AME improvement enables even larger systems • ISTORE has cost-performance advantages • Better space, power/cooling costs ($@colocation site) • More MIPS, cheaper MIPS, no bus bottlenecks • Compression reduces network $, encryption protects • Single interconnect, supports evolution of technology • Match to future software service architecture • Future storage service software target clusters
Conclusions (1): ISTORE • Availability, Maintainability, and Evolutionary growth are key challenges for server systems • more important even than performance • ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers • via clusters of network-attached, computationally-enhanced storage nodes running distributed code • via hardware and software introspection • we are currently performing application studies to investigate and compare techniques • Availability benchmarks a powerful tool? • revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
Conclusions (2) • IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth • Gadgets: Embedded/Mobile devices • Infrastructure: Intelligent Storage and Networks • PostPC infrastructure requires • New Goals: Availability, Maintainability, Evolution • New Principles: Introspection, Performance Robustness • New Techniques: Isolation/fault insertion, Software scrubbing • New Benchmarks: measure, compare AME metrics
Berkeley Future work • IRAM: fab and test chip • ISTORE • implement AME-enhancing techniques in a variety of Internet, enterprise, and info retrieval applications • select the best techniques and integrate into a generic runtime system with “AME API” • add maintainability benchmarks • can we quantify administrative work needed to maintain a certain level of availability? • Perhaps look at data security via encryption? • Even consider denial of service?
The UC Berkeley IRAM/ISTORE Projects: Computers for the PostPC Era For more information: http://iram.cs.berkeley.edu/istore istore-group@cs.berkeley.edu
Backup Slides (mostly in the area of benchmarking)
Case study • Software RAID-5 plus web server • Linux/Apache vs. Windows 2000/IIS • Why software RAID? • well-defined availability guarantees • RAID-5 volume should tolerate a single disk failure • reduced performance (degraded mode) after failure • may automatically rebuild redundancy onto spare disk • simple system • easy to inject storage faults • Why web server? • an application with measurable QoS metrics that depend on RAID availability and performance
Benchmark environment: metrics • QoS metrics measured • hits per second • roughly tracks response time in our experiments • degree of fault tolerance in storage system • Workload generator and data collector • SpecWeb99 web benchmark • simulates realistic high-volume user load • mostly static read-only workload; some dynamic content • modified to run continuously and to measure average hits per second over each 2-minute interval
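A minimal sketch of the measurement the modified SpecWeb99 harness performs: bucket completed hits into 2-minute intervals and report the average rate per interval. The log format and names here are hypothetical, not the actual harness code.

```python
from collections import Counter

INTERVAL = 120  # seconds: average hits/sec over each 2-minute window

def hits_per_interval(completion_times, interval=INTERVAL):
    """Map each 2-minute bucket (by start time) to its average hits per second.

    `completion_times` is an iterable of request-completion timestamps,
    in seconds since the start of the run.
    """
    counts = Counter(int(t // interval) for t in completion_times)
    return {bucket * interval: n / interval for bucket, n in sorted(counts.items())}

# Hypothetical run: plotting these values over time shows the dip during an
# injected disk fault and the recovery afterwards.
timestamps = [0.5, 10.2, 119.9, 130.0, 150.7, 260.1]
print(hits_per_interval(timestamps))
# -> {0: 0.025, 120: 0.0166..., 240: 0.0083...}
```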
Benchmark environment: faults • Focus on faults in the storage system (disks) • How do disks fail? • according to Tertiary Disk project, failures include: • recovered media errors • uncorrectable write failures • hardware errors (e.g., diagnostic failures) • SCSI timeouts • SCSI parity errors • note: no head crashes, no fail-stop failures
Disk fault injection technique • To inject reproducible failures, we replaced one disk in the RAID with an emulated disk • a PC that appears as a disk on the SCSI bus • I/O requests processed in software, reflected to local disk • fault injection performed by altering SCSI command processing in the emulation software • Types of emulated faults: • media errors (transient, correctable, uncorrectable) • hardware errors (firmware, mechanical) • parity errors • power failures • disk hangs/timeouts
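A sketch of the fault-injection idea: the emulator’s command dispatch consults an armed fault before touching the backing store, so each listed fault type can be triggered reproducibly. The class, command names, and sense codes below are illustrative assumptions, not the actual emulation software.

```python
import time

class EmulatedDisk:
    """Toy model of a disk emulator whose command path supports fault injection."""

    def __init__(self, backing_file):
        self.backing_file = backing_file
        self.injected = None  # e.g., "media_error", "write_failure", "timeout"

    def inject(self, fault):
        """Arm a fault; the next matching command will observe it."""
        self.injected = fault

    def handle_command(self, opcode, lba, data=None):
        """Process one SCSI-like command, honoring any armed fault."""
        if self.injected == "timeout":
            time.sleep(30)                     # emulate a hung disk
        if self.injected == "media_error" and opcode == "READ":
            self.injected = None               # transient: clears after one hit
            return {"status": "CHECK_CONDITION", "sense": "MEDIUM_ERROR"}
        if self.injected == "write_failure" and opcode == "WRITE":
            self.injected = None
            return {"status": "CHECK_CONDITION", "sense": "WRITE_FAULT"}
        # Normal path: reflect the request to the local backing store (elided).
        return {"status": "GOOD"}

# Illustrative use: arm a transient media error, then issue reads.
disk = EmulatedDisk("/tmp/emulated.img")
disk.inject("media_error")
print(disk.handle_command("READ", lba=4096))   # -> CHECK_CONDITION / MEDIUM_ERROR
print(disk.handle_command("READ", lba=4096))   # -> GOOD (fault was transient)
```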