Scaling for the Future Katherine Yelick U.C. Berkeley, EECS http://iram.cs.berkeley.edu/{istore} http://www.cs.berkeley.edu/projects/titanium
Two Independent Problems • Building a reliable, scalable infrastructure • Scalable processor, cluster, and wide-area systems • IRAM, ISTORE, and OceanStore • One example application for the infrastructure • Microscale simulation of biological systems • Model signals from cell membrane to nucleus • Understand disease and support pharmacological and BioMEMS-mediated therapy
IRAM: Scaling within a Chip [Figure: chip block diagrams; labels: Logic fab, Proc, $, L2$, Bus, I/O, DRAM, DRAM fab] Microprocessor & DRAM on a single chip: • Avoids memory bus bottleneck • Addresses power limits by spreading logic over chip VIRAM chip: • Vector architecture • exploits bandwidth • preserves power & area advantages • Support for multimedia • IBM will fabricate Sp ’01 • 200 MHz, 3.2 Gflops, 2 W • 0.18 µm mixed logic/DRAM
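The three headline figures for the VIRAM chip imply two derived ratios. The sketch below only divides the numbers quoted on the slide; the variable names are illustrative, and the results are arithmetic consequences rather than additional measurements.

```python
# Figures quoted on the slide for the VIRAM chip.
clock_hz = 200e6      # 200 MHz
peak_flops = 3.2e9    # 3.2 Gflops
power_watts = 2.0     # 2 W

# Derived ratios: plain division of the quoted figures.
flops_per_cycle = peak_flops / clock_hz
gflops_per_watt = (peak_flops / 1e9) / power_watts

print(f"{flops_per_cycle:.0f} floating-point operations per cycle")
print(f"{gflops_per_watt:.1f} Gflops per watt")
```

A peak of 16 operations per cycle is consistent with the slide's point that a vector architecture is what exploits the on-chip bandwidth.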
ISTORE: Scaling Clusters • Design points • 2001: 80 nodes in 3 racks • 2002: 1000 nodes in 10 racks (?) • 2005: 10K nodes in 1 rack (?) • Add IRAM to 1-inch disk • Key problems are availability, maintainability, and evolutionary growth (AME) of thousand-node servers • Approach • Hardware built for availability: monitoring, diagnostics • New class of benchmarks for AME • Reliable systems from unreliable hardware/software components • Introspection: the system watches itself
OceanStore: Scaling to Utilities • Transparent data service provided by federation of companies: • Monthly fee paid to one service provider • Companies buy and sell capacity from each other • Assumptions: • Untrusted Infrastructure: only ciphertext in the infrastructure • Promiscuous Caching: cache anywhere, anytime • Optimistic Concurrency Control: avoid locking [Figure: federation of providers; labels: Canadian OceanStore, Sprint, AT&T, IBM, Pac Bell]
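Optimistic concurrency control can be sketched as version-checked commits: readers take a snapshot without locking, and a write succeeds only if nothing else was committed since that snapshot. This is a generic, single-threaded illustration, not OceanStore's actual protocol; `VersionedObject`, `read`, `try_commit`, and `ConflictError` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class VersionedObject:
    """A stored value plus a version counter (hypothetical structure)."""
    value: str
    version: int = 0


class ConflictError(Exception):
    """Raised when an update was based on a stale version."""


def read(obj):
    """Return a snapshot (value, version) without taking any lock."""
    return obj.value, obj.version


def try_commit(obj, new_value, based_on_version):
    """Apply the update only if nobody committed since the snapshot."""
    if obj.version != based_on_version:
        raise ConflictError(
            f"expected version {based_on_version}, found {obj.version}"
        )
    obj.value = new_value
    obj.version += 1


doc = VersionedObject("draft")
_, v_alice = read(doc)
_, v_bob = read(doc)

try_commit(doc, "alice's edit", v_alice)   # first writer wins
try:
    try_commit(doc, "bob's edit", v_bob)   # stale snapshot is rejected
    bob_committed = True
except ConflictError:
    bob_committed = False                  # Bob would re-read and retry
```

In a real concurrent system the version check and the write would have to happen atomically; the sketch omits that to keep the idea visible.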
The Real Scalability Problems: AME • Availability • systems should continue to meet quality of service goals despite failures and extreme load • Maintainability • minimize human administration • Evolutionary Growth • graceful evolution; dynamic scalability • These are problems for computation and storage services
Research Principles • Redundancy everywhere • Hardware: processors, networks, disks,… • Software: language, libraries, runtime,… • Introspection • reactive techniques to detect and adapt to failures, workload variations, and system evolution • proactive techniques to anticipate and avert problems before they happen • Benchmarking • Define quantitative AME measures • Benchmarks drive the field
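As a rough illustration of a reactive introspection technique combined with redundancy, the sketch below flags nodes whose heartbeat is overdue and chooses replicas from the remaining healthy nodes. The function names, the heartbeat table, and the timeout value are all assumptions made for this example and are not taken from ISTORE.

```python
def find_suspect_nodes(last_heartbeat, now, timeout):
    """Return nodes whose latest heartbeat is older than `timeout` seconds."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout
    )


def pick_replicas(last_heartbeat, suspects, copies):
    """Choose `copies` healthy nodes to hold redundant copies of the data."""
    healthy = sorted(n for n in last_heartbeat if n not in suspects)
    if len(healthy) < copies:
        raise RuntimeError("not enough healthy nodes for requested redundancy")
    return healthy[:copies]


# Hypothetical heartbeat timestamps, in seconds.
last_heartbeat = {"node-a": 100.0, "node-b": 91.0, "node-c": 99.5}

suspects = find_suspect_nodes(last_heartbeat, now=101.0, timeout=5.0)
replicas = pick_replicas(last_heartbeat, suspects, copies=2)
print(suspects, replicas)
```

A proactive variant would act on trends, for example rising error counts, before a heartbeat is ever missed.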
Benchmarks • Availability benchmarks • Measure QoS as fault events occur • Support for fault injection is key • Example: software RAID system • Maintainability benchmarks • Human factor is a challenge • Evolutionary growth benchmarks • Performance with heterogeneous hardware
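One way to turn "measure QoS as fault events occur" into numbers is to record a throughput trace, mark when faults were injected, and report how often the QoS target was met and how long each recovery took. The sketch below assumes that approach; the function name, the trace, and the target are synthetic and do not come from the talk.

```python
def availability_report(samples, fault_times, qos_target):
    """Summarise a trace of (time_s, throughput) samples.

    Returns the fraction of samples meeting `qos_target` and, per injected
    fault, the time until throughput is back at the target (0.0 if it never
    dropped, None if it had not recovered before the next fault or trace end).
    """
    fraction_met = sum(1 for _, rate in samples if rate >= qos_target) / len(samples)
    faults = sorted(fault_times)
    recovery = {}
    for i, fault_t in enumerate(faults):
        window_end = faults[i + 1] if i + 1 < len(faults) else float("inf")
        recovery[fault_t] = 0.0
        dipped = False
        for t, rate in samples:
            if not (fault_t <= t < window_end):
                continue
            if rate < qos_target:
                dipped = True
                recovery[fault_t] = None      # below target, not yet recovered
            elif dipped:
                recovery[fault_t] = t - fault_t
                break
    return fraction_met, recovery


# Synthetic trace: one sample per second, fault injected at t = 3.
rates = [100, 100, 100, 40, 50, 70, 95, 100, 100, 100]
trace = list(enumerate(rates))
fraction_met, recovery = availability_report(trace, fault_times=[3], qos_target=90)
print(fraction_met, recovery)
```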
Example: Faults in Software RAID • Compares Linux and Solaris reconstruction • Linux: minimal performance impact but longer window of vulnerability to second fault • Solaris: large performance impact but restores redundancy fast [Figure: two plots, labeled Linux and Solaris]
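The trade-off on this slide can be illustrated with a toy model in which reconstruction is given a fixed share of disk bandwidth: a small share keeps service fast but lengthens the window of vulnerability, and a large share does the opposite. The model and every number in it are assumptions chosen for illustration; they are not the measured Linux or Solaris results.

```python
def rebuild_tradeoff(data_gb, disk_mb_per_s, rebuild_share):
    """Toy model: reconstruction gets `rebuild_share` of the disk bandwidth.

    Returns (window_s, service_fraction): seconds until redundancy is
    restored, and the fraction of normal bandwidth left for user requests.
    """
    if not 0 < rebuild_share <= 1:
        raise ValueError("rebuild_share must be in (0, 1]")
    rebuild_rate = disk_mb_per_s * rebuild_share      # MB/s spent rebuilding
    window_s = data_gb * 1000 / rebuild_rate          # vulnerability window
    service_fraction = 1 - rebuild_share              # bandwidth left for users
    return window_s, service_fraction


# Hypothetical policies: a gentle rebuild and an aggressive one.
gentle_window, gentle_service = rebuild_tradeoff(10, 20, rebuild_share=0.25)
aggressive_window, aggressive_service = rebuild_tradeoff(10, 20, rebuild_share=0.75)

print(f"gentle:     {gentle_window:.0f} s exposed, {gentle_service:.0%} service")
print(f"aggressive: {aggressive_window:.0f} s exposed, {aggressive_service:.0%} service")
```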
Simulating Microscale Biological Systems • Large scale simulation useful for • Fundamental biological questions: cell behavior • Design of treatments, including BioMEMS • Simulations limited in part by • Machine complexity, e.g., memory hierarchies • Algorithmic complexity, e.g., adaptation • Old software model: • Hide the machine from the users • Implicit parallelism, hardware-controlled caching • Results were unusable • Witness the success of MPI (explicit message passing)
New Model for Scalable High Confidence Computing • Domain-specific language that judiciously exposes machine structure • Explicit parallelism, load balancing and locality control • Allows for construction of complex, distributed data structures • Current • Demonstration on higher level models • Heart simulation • Future plans • Algorithms and software that adapt to faults • Microscale systems
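To make "explicit locality control" concrete, the sketch below computes which cells of a 1D grid each process owns under a block distribution, and which neighboring cells it would need to fetch for a nearest-neighbor stencil. It is written in Python for illustration only and is not the syntax of the language discussed in the talk; `block_range` and `ghost_cells` are hypothetical helper names.

```python
def block_range(n_cells, n_procs, rank):
    """Half-open [start, stop) range of cells owned by `rank`.

    Cells are split into contiguous blocks whose sizes differ by at most one,
    which balances load while keeping each process's data contiguous.
    """
    base, extra = divmod(n_cells, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def ghost_cells(n_cells, n_procs, rank):
    """Remote cells a nearest-neighbor stencil needs at the block edges."""
    start, stop = block_range(n_cells, n_procs, rank)
    ghosts = []
    if start > 0:
        ghosts.append(start - 1)   # last cell of the left neighbor
    if stop < n_cells:
        ghosts.append(stop)        # first cell of the right neighbor
    return ghosts


blocks = [block_range(10, 3, r) for r in range(3)]
middle_ghosts = ghost_cells(10, 3, 1)
print(blocks, middle_ghosts)
```

Because ownership is computed explicitly, the programmer can see exactly which accesses are local and which require communication.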
Conclusions • Scaling at all levels • Processors, clusters, wide area • Application challenges • Both storage and compute intensive • Key challenges to future infrastructure are: • Availability and reliability • Complexity of the machine