1 / 42

ISTORE: A server for the PostPC Era

ISTORE as the storage system of the future, addressing challenges like availability, maintainability, and cost-performance advantages. Learn about key principles for achieving AME and CPU hardware techniques for effective storage-oriented nodes.

shavonne
Download Presentation

ISTORE: A server for the PostPC Era

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ISTORE: A server for the PostPC Era Aaron Brown, Dave Martin, David Oppenheimer, Noah Trauhaft, Dave Patterson,Katherine Yelick University of California at Berkeley Patterson@cs.berkeley.edu UC Berkeley ISTORE Group istore-group@cs.berkeley.edu October 2000

  2. ISTORE as Storage System of the Future • Availability, Maintainability, and Evolutionary growth key challenges for storage systems • Maintenance Cost ~ >10X Purchase Cost per year, • Even 2X purchase cost for 1/2 maintenance cost wins • AME improvement enables even larger systems • ISTORE also cost-performance advantages • Better space, power/cooling costs ($@colocation site) • More MIPS, cheaper MIPS, no bus bottlenecks • Compression reduces network $, encryption protects • Single interconnect, supports evolution of technology, single network technology to maintain/understand • Match to future software storage services • Future storage service software target clusters

  3. Lampson: Systems Challenges • Systems that work • Meeting their specs • Always available • Adapting to changing environment • Evolving while they run • Made from unreliable components • Growing without practical limit • Credible simulations or analysis • Writing good specs • Testing • Performance • Understanding when it doesn’t matter “Computer Systems Research-Past and Future” Keynote address, 17th SOSP, Dec. 1999 Butler Lampson Microsoft

  4. Jim Gray: Trouble-Free Systems “What Next? A dozen remaining IT problems” Turing Award Lecture, FCRC, May 1999 Jim Gray Microsoft • Manager • Sets goals • Sets policy • Sets budget • System does the rest. • Everyone is a CIO (Chief Information Officer) • Build a system • used by millions of people each day • Administered and managed by a ½ time person. • On hardware fault, order replacement part • On overload, order additional equipment • Upgrade hardware and software automatically.

  5. Jim Gray: Trustworthy Systems • Build a system used by millions of people that • Only services authorized users • Service cannot be denied (can’t destroy data or power). • Information cannot be stolen. • Is always available: (out less than 1 second per 100 years = 8 9’s of availability) • 1950’s 90% availability, Today 99% uptime for web sites, 99.99% for well managed sites (50 minutes/year)3 extra 9s in 45 years. • Goal: 5 more 9s: 1 second per century. • And prove it.

  6. Hennessy: What Should the “New World” Focus Be? • Availability • Both appliance & service • Maintainability • Two functions: • Enhancing availability by preventing failure • Ease of SW and HW upgrades • Scalability • Especially of service • Cost • per device and per service transaction • Performance • Remains important, but its not SPECint “Back to the Future: Time to Return to Longstanding Problems in Computer Systems?” Keynote address, FCRC, May 1999 John Hennessy Stanford

  7. The real scalability problems: AME • Availability • systems should continue to meet quality of service goals despite hardware and software failures • Maintainability • systems should require only minimal ongoing human administration, regardless of scale or complexity: Today, cost of maintenance = 10-100 cost of purchase • Evolutionary Growth • systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded • These are problems at today’s scales, and will only get worse as systems grow

  8. VAX crashes ‘85, ‘93 [Murp95]; extrap. to ‘01 • Sys. Man.: N crashes/problem, SysAdminaction • Actions: set params bad, bad config, bad app install • HW/OS 70% in ‘85 to 28% in ‘93. In ‘01, 10%? Is Maintenance the Key? • Rule of Thumb: Maintenance 10X HW • so over 5 year product life, ~ 95% of cost is maintenance

  9. Principles for achieving AME • No single points of failure, lots of redundancy • Performance robustness is more important than peak performance • Performance can be sacrificed for improvements in AME • resources should be dedicated to AME • biological systems > 50% of resources on maintenance • can make up performance by scaling system • Introspection • reactive techniques to detect and adapt to failures, workload variations, and system evolution • proactive techniques to anticipate and avert problems before they happen

  10. CPU Hardware Techniques (1): SON • SON: Storage Oriented Nodes • Distribute processing with storage • If AME really important, provide resources! • Most storage servers limited by speed of CPUs!! • Amortize sheet metal, power, cooling, network for disk to add processor, memory, and a real network? • Embedded processors 2/3 perf, 1/10 cost, power? • Serial lines, switches also growing with Moore’s Law; less need today to centralize vs. bus oriented systems • Advantages of cluster organization • Truly scalable architecture • Architecture that tolerates partial failure • Automatic hardware redundancy

  11. Hardware techniques (2) • Heavily instrumented hardware • sensors for temp, vibration, humidity, power, intrusion • helps detect environmental problems before they can affect system integrity • Independent diagnostic processor on each node • provides remote control of power, remote console access to the node, selection of node boot code • collects, stores, processes environmental data for abnormalities • non-volatile “flight recorder” functionality • all diagnostic processors connected via independent diagnostic network

  12. Hardware techniques (3) • On-demand network partitioning/isolation • Internet applications must remain available despite failures of components, therefore can isolate a subset for preventative maintenance • Allows testing, repair of online system • Managed by diagnostic processor and network switches via diagnostic network • Built-in fault injection capabilities • Power control to individual node components • Injectable glitches into I/O and memory busses • Managed by diagnostic processor • Used for proactive hardware introspection • automated detection of flaky components • controlled testing of error-recovery mechanisms

  13. “Hardware” culture (4) • Benchmarking • One reason for 1000X processor performance was ability to measure (vs. debate) which is better • e.g., Which most important to improve: clock rate, clocks per instruction, or instructions executed? • Need AME benchmarks “what gets measured gets done” “benchmarks shape a field” “quantification brings rigor”

  14. Example single-fault result • Compares Linux and Solaris reconstruction • Linux: minimal performance impact but longer window of vulnerability to second fault • Solaris: large perf. impact but restores redundancy fast Linux Solaris

  15. Deriving ISTORE • What is the interconnect? • FC-AL? (Interoperability? Cost of switches?) • Infiniband? (When? Cost of switches? Cost of NIC?) • Gbit Ehthernet? • Pick Gbit Ethernet as commodity switch, link • As main stream, fastest improving in cost performance • We assume Gbit Ethernet switches will get cheap over time (Network Processors, volume, …)

  16. Deriving ISTORE • Number of Disks / Gbit port? • Bandwidth of 2000 disk • Raw bit rate: 427 Mbit/sec. • Data transfer rate: 40.2 MByte/sec • Capacity: 73.4 GB • Disk trends • BW: 40%/year • Capacity, Areal density,$/MB: 100%/year • 2003 disks • ~ 500 GB capacity (<8X) • ~ 110 MB/sec or 0.9 Gbit/sec (2.75X) • Number of Disks / Gbit port = 1

  17. ISTORE-1 Brick • Webster’s Dictionary: “brick: a handy-sized unit of building or paving material typically being rectangular and about 2 1/4 x 3 3/4 x 8 inches” • ISTORE-1 Brick: 2 x 4 x 11 inches (1.3x) • Single physical form factor, fixed cooling required, compatible network interface to simplify physical maintenance, scaling over time • Contents should evolve over time: contains most cost effective MPU, DRAM, disk, compatible NI • If useful, could have special bricks (e.g., DRAM rich) • Suggests network that will last, evolve: Ethernet

  18. Intelligent Disk “Brick” Portable PC CPU: Pentium II/266 + DRAM Redundant NICs (4 100 Mb/s links) Diagnostic Processor • ISTORE Chassis • 80 nodes, 8 per tray • 2 levels of switches • 20 100 Mbit/s • 2 1 Gbit/s • Environment Monitoring: • UPS, redundant PS, • fans, heat and vibration sensors... Disk Half-height canister ISTORE-1 hardware platform • 80-node x86-based cluster, 1.4TB storage • cluster nodes are plug-and-play, intelligent, network-attached storage “bricks” • a single field-replaceable unit to simplify maintenance • each node is a full x86 PC w/256MB DRAM, 18GB disk • more CPU than NAS; fewer disks/node than cluster

  19. Common Question: RAID? • Switched Network sufficient for all types of communication, including redundancy • Hierarchy of buses is generally not superior to switched network • Veritas, others offer software RAID 5 and software Mirroring (RAID 1) • Another use of processor per disk

  20. Advantages: Cost of Bandwidth Cost of Space Cost of Storage System v. Cost of Disks Physical Repair, Number of Spare Parts Cost of Processor Complexity Cluster advantages: dependability, scalability 1 v. 2 Networks A Case for Intelligent Storage

  21. Cost of Space, Power, Bandwidth • Co-location sites (e.g., Exodus) offer space, expandable bandwidth, stable power • Charge ~$1000/month per rack ( ~ 10 sq. ft.) • Includes 1 20-amp circuit/rack; charges ~$100/month per extra 20-amp circuit/rack • Bandwidth cost: ~$500 per Mbit/sec/Month

  22. Cost of Bandwidth, Safety • Network bandwidth cost is significant • 1000 Mbit/sec/month => $6,000,000/year • Security will increase in importance for storage service providers • XML => server format conversion for gadgets => Storage systems of future need greater computing ability • Compress to reduce cost of network bandwidth 3X; save $4M/year? • Encrypt to protect information in transit for B2B => Increasing processing/disk for future storage apps

  23. Cost of Space, Power • Sun Enterprise server/array (64CPUs/60disks) • 10K Server (64 CPUs): 70 x 50 x 39 in. • A3500 Array (60 disks): 74 x 24 x 36 in. • 2 Symmetra UPS (11KW): 2 * 52 x 24 x 27 in. • ISTORE-1: 2X savings in space • ISTORE-1: 1 rack (big) switches, 1 rack (old) UPSs, 1 rack for 80 CPUs/disks (3/8 VME rack unit/brick) • ISTORE-2: 8X-16X space? • Space, power cost/year for 1000 disks: Sun $924k, ISTORE-1 $484k, ISTORE2 $50k

  24. Cost of Storage System v. Disks • Hardware RAID box ~ 5X cost of disks

  25. Disk Limit: Bus Hierarchy Server Storage Area Network Memory bus CPU • Data rate vs. Disk rate • SCSI: Ultra3 (80 MHz), Wide (16 bit): 160 MByte/s • FC-AL: 1 Gbit/s = 125 MByte/s • Use only 50% of a bus • Command overhead (~ 20%) • Queuing Theory (< 70%) (FC-AL) Internal I/O bus Memory RAID bus (PCI) Mem External I/O bus Disk Array (SCSI) (15 disks/bus)

  26. Physical Repair, Spare Parts • ISTORE: Compatible modules based on hot-pluggable interconnect (LAN) with few Field Replacable Units (FRUs): Node, Power Supplies, Switches, network cables • Replace node (disk, CPU, memory, NI) if any fail • Conventional: Heterogeneous system with many server modules (CPU, backplane, memory cards, …) and disk array modules (controllers, disks, array controllers, power supplies, … ) • Store all components available somewhere as FRUs • Sun Enterprise 10k has ~ 100 types of spare parts • Sun 3500 Array has ~ 12 types of spare parts

  27. ISTORE: Complexity v. Perf • Complexity increase: • HP PA-8500: issue 4 instructions per clock cycle, 56 instructions out-of-order execution, 4Kbit branch predictor, 9 stage pipeline, 512 KB I cache, 1024 KB D cache (> 80M transistors just in caches) • Intel Xscale: 16 KB I$, 16 KB D$, 1 instruction, in order execution, no branch prediction, 6 stage pipeline • Complexity costs in development time, development power, die size, cost • 550 MHz HP PA-8500 477 mm2, 0.25 micron/4M $330, 60 Watts • 1000 MHz Intel StrongARM2 (“Xscale”) @ 1.5 Watts, 800 MHz at 0.9 W, … 50 Mhz @ 0.01W, 0.18 micron (old chip 50 mm2, 0.35 micron, $18) • => Count $ for system, not processors/disk

  28. ISTORE: Cluster Advantages • Architecture that tolerates partial failure • Automatic hardware redundancy • Transparent to application programs • Truly scalable architecture • Given maintenance is 10X-100X capital costs, clustersize limits today are maintenance, floor space cost - generally NOT capital costs • As a result, it is THE target architecture for new software apps for Internet

  29. ISTORE: 1 vs. 2 networks • Current systems all have LAN + Disk interconnect (SCSI, FCAL) • LAN is improving fastest, most investment, most features • SCSI, FC-AL poor network features, improving slowly, relatively expensive for switches, bandwidth • FC-AL switches don’t interoperate • Two sets of cables, wiring? • SysAdmin trained in 2 networks, SW interface, …??? • Why not single network based on best HW/SW technology? • Note: there can be still 2 instances of the network (e.g. external, internal), but only one technology

  30. Initial Applications • ISTORE-1 is not one super-system that demonstrates all these techniques! • Initially provide middleware, library to support AME • Initial application targets • information retrieval for multimedia data (XML storage?) • self-scrubbing data structures, structuring performance-robust distributed computation • Example: home video server using XML interfaces • email service • self-scrubbing data structures, online self-testing • statistical identification of normal behavior

  31. A glimpse into the future? • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • ISTORE HW in 5-7 years: • 2006 brick: System On a Chip integrated with MicroDrive • 9GB disk, 50 MB/sec from disk • connected via crossbar switch • From brick to “domino” • If low power, 10,000 nodes fit into one rack! • O(10,000) scale is our ultimate design point

  32. Conclusion: ISTORE as Storage System of the Future • Availability, Maintainability, and Evolutionary growth key challenges for storage systems • Maintenance Cost ~ 10X Purchase Cost per year, so over 5 year product life, ~ 95% of cost of ownership • Even 2X purchase cost for 1/2 maintenance cost wins • AME improvement enables even larger systems • ISTORE has cost-performance advantages • Better space, power/cooling costs ($@colocation site) • More MIPS, cheaper MIPS, no bus bottlenecks • Compression reduces network $, encryption protects • Single interconnect, supports evolution of technology, single network technology to maintain/understand • Match to future software storage services • Future storage service software target clusters

  33. Questions? Contact us if you’re interested:email: patterson@cs.berkeley.eduhttp://iram.cs.berkeley.edu/ “If it’s important, how can you say if it’s impossible if you don’t try?” Jean Morreau, a founder of European Union

  34. Clusters and TPC Software 8/’00 • TPC-C: 6 of Top 10 performance are clusters, including all of Top 5; 4 SMPs • TPC-H: SMPs and NUMAs • 100 GB All SMPs (4-8 CPUs) • 300 GB All NUMAs (IBM/Compaq/HP 32-64 CPUs) • TPC-R: All are clusters • 1000 GB :NCR World Mark 5200 • TPC-W: All web servers are clusters (IBM)

  35. Clusters and TPC-C Benchmark Top 10 TPC-C Performance (Aug. 2000) Ktpm 1. Netfinity 8500R c/s Cluster 441 2. ProLiant X700-96P Cluster 262 3. ProLiant X550-96P Cluster 230 4. ProLiant X700-64P Cluster 180 5. ProLiant X550-64P Cluster 162 6. AS/400e 840-2420 SMP 152 7. Fujitsu GP7000F Model 2000 SMP 139 8. RISC S/6000 Ent. S80 SMP 139 9. Bull Escala EPC 2400 c/s SMP 136 10. Enterprise 6500 Cluster Cluster 135

  36. Cost of Storage System v. Disks • Examples show cost of way we build current systems (2 networks, many buses, CPU, …) Disks Disks Date Cost Main. Disks /CPU /IObus • NCR WM: 10/97 $8.3M -- 1312 10.2 5.0 • Sun 10k: 3/98 $5.2M -- 668 10.4 7.0 • Sun 10k: 9/99 $6.2M $2.1M 1732 27.0 12.0 • IBM Netinf: 7/00 $7.8M $1.8M 7040 55.0 9.0 =>Too complicated, too heterogenous • And Data Bases are often CPU or bus bound! • ISTORE disks per CPU: 1.0 • ISTORE disks per I/O bus: 1.0

  37. Common Question: Why Not Vary Number of Processors and Disks? • Argument: if can vary numbers of each to match application, more cost-effective solution? • Alternative Model 1: Dual Nodes + E-switches • P-node: Processor, Memory, 2 Ethernet NICs • D-node: Disk, 2 Ethernet NICs • Response • As D-nodes running network protocol, still need processor and memory, just smaller; how much save? • Saves processors/disks, costs more NICs/switches: N ISTORE nodes vs. N/2 P-nodes + N D-nodes • Isn't ISTORE-2 a good HW prototype for this model? Only run the communication protocol on N nodes, run the full app and OS on N/2

  38. Common Question: Why Not Vary Number of Processors and Disks? • Alternative Model 2: N Disks/node • Processor, Memory, N disks, 2 Ethernet NICs • Response • Potential I/O bus bottleneck as disk BW grows • 2.5" ATA drives are limited to 2/4 disks per ATA bus • How does a research project pick N? What’s natural? • Is there sufficient processing power and memory to run the AME monitoring and testing tasks as well as the application requirements? • Isn't ISTORE-2 a good HW prototype for this model? Software can act as simple disk interface over network and run a standard disk protocol, and then run that on N nodes per apps/OS node. Plenty of Network BW available in redundant switches

  39. SCSI v. IDE $/GB • Prices from PC Magazine, 1995-2000

  40. Grove’s Warning “...a strategic inflection point is a time in the life of a business when its fundamentals are about to change. ... Let's not mince words: A strategic inflection point can be deadly when unattended to. Companies that begin a decline as a result of its changes rarely recover their previous greatness.” Only the Paranoid Survive, Andrew S. Grove, 1996

  41. Availability benchmark methodology • Goal: quantify variation in QoS metrics as events occur that affect system availability • Leverage existing performance benchmarks • to generate fair workloads • to measure & trace quality of service metrics • Use fault injection to compromise system • hardware faults (disk, memory, network, power) • software faults (corrupt input, driver error returns) • maintenance events (repairs, SW/HW upgrades) • Examine single-fault and multi-fault workloads • the availability analogues of performance micro- and macro-benchmarks

  42. Benchmark Availability?Methodology for reporting results • Results are most accessible graphically • plot change in QoS metrics over time • compare to “normal” behavior? • 99% confidence intervals calculated from no-fault runs

More Related