This article discusses the challenges faced by computer architecture in keeping up with the pace of technological advancements, specifically the limitations of slow memory and the potential solution of implementing multithreading. It also explores the implications of these advancements on the future of computing.
Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading) Mark D. Hill Computer Sciences Dept. and Electrical & Computer Engineering Dept. University of Wisconsin—Madison Multifacet Project (www.cs.wisc.edu/multifacet) October 2004 Full Disclosure: Consult for Sun & US NSF
Executive Summary: Problem • Expect computer performance doubling every 2 years • Derives from Technology & Architecture • Technology will advance for ten or more years • But Architecture faces a Rock: Slow Memory • a.k.a. the Memory Wall [Wulf & McKee 1995] • Prediction: Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)
Executive Summary: Recommendation • Chip Multiprocessing (CMP) Can Help • Implement multiple processors per chip • >>10x cost-performance for multithreaded workloads • What about software with one apparent thread? • Go to Hard Place: Mainstream Multithreading • Make most workloads flourish with chip multiprocessing • Computer architects can help, but long run • Requires moving multithreading from CS fringe to center (algorithms, programming languages, …, hardware) • Necessary For Restoring Popular Moore’s Law
Outline • Executive Summary • Background • Moore’s Law • Architecture • Instruction Level Parallelism • Caches • Going Forward: Processor Architecture Hits Rock • Chip Multiprocessing to the Rescue? • Go to the Hard Place of Mainstream Multithreading
Society Expects A Popular Moore’s Law Computing critical: commerce, education, engineering, entertainment, government, medicine, science, … • Servers (> PCs) • Clients (= PCs) • Embedded (< PCs) • Come to expect a misnamed “Moore’s Law” • Computer performance doubles every two years (same cost) • Progress in next two years = All past progress • Important Corollary • Computer cost halves every two years (same performance) • In ten years, same performance for 3% (sales tax – Jim Gray) • Derives from Technology & Architecture
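The corollary above is simple compound arithmetic; a minimal sketch, assuming the slide's two-year halving period (five halvings in ten years give Jim Gray's "sales tax" figure of roughly 3%):

```python
def cost_fraction(years, halving_period=2):
    """Fraction of the original cost remaining after `years`,
    if cost halves every `halving_period` years at the same performance."""
    return 0.5 ** (years / halving_period)

# Ten years = five halvings: 0.5**5 = 0.03125, i.e., about 3% of today's cost.
print(cost_fraction(10))
```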
(Technologist’s) Moore’s Law Provides Transistors Number of transistors per chip doubles every two years (18 months) Merely a “Law” of Business Psychology
Performance from Technology & Architecture Reprinted from Hennessy and Patterson, "Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.
Architects Use Transistors To Compute Faster • Bit Level Parallelism (BLP) within Instructions • Instruction Level Parallelism (ILP) among Instructions • Scores of speculative instructions look sequential! (Figure: instructions executed vs. time)
Architects Use Transistors to Tolerate Slow Memory • Cache • Small, Fast Memory • Holds information (expected) to be used soon • Mostly Successful • Apply Recursively • Level-one cache(s) • Level-two cache • Most of microprocessor die area is cache!
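The recursive cache idea can be captured with the standard average-memory-access-time model; a minimal sketch with hypothetical latencies and miss rates (the numbers below are illustrative assumptions, not figures from the talk):

```python
# Hypothetical latencies in cycles -- illustrative, not from the slides.
L1_HIT, L2_HIT, MEM = 2, 12, 300

def amat(l1_miss_rate, l2_miss_rate):
    """Average memory access time for a two-level cache hierarchy.
    Each level's miss penalty is the AMAT of the level below it."""
    l2_penalty = L2_HIT + l2_miss_rate * MEM
    return L1_HIT + l1_miss_rate * l2_penalty

# Even with a 2% L1 miss rate and a 20% L2 miss rate, the average access
# stays close to the L1 hit time despite a 300-cycle memory:
print(amat(0.02, 0.20))  # 2 + 0.02 * (12 + 0.20 * 300) = 3.44 cycles
```

This is why "apply recursively" works: each level filters most accesses before they reach the slow memory below it.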
Outline • Executive Summary • Background • Going Forward: Processor Architecture Hits Rock • Technology Continues • Slow Memory • Implications • Chip Multiprocessing to the Rescue? • Go to the Hard Place of Mainstream Multithreading
Future Technology Implications • For (at least) ten years, Moore’s Law continues • More repeated doublings of number of transistors per chip • Faster transistors • But hard for processor architects to use • More transistors limited by global wire delays • Faster transistors limited by too much dynamic power • Moreover, hitting a Rock: Slow Memory • Memory access = 100s of floating-point multiplies! • a.k.a. the Memory Wall [Wulf & McKee 1995]
Rock: Memory Gets (Relatively) Slower Reprinted from Hennessy and Patterson, "Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.
Impact of Slow Memory (Rock) • Off-Chip Misses are now hundreds of cycles • Execution alternates between compute phases and memory phases (Figure: instructions vs. time with an instruction window of 4 (64); the more realistic case shows far longer memory phases than the good case)
Implications of Slow Memory (Rock) • Increasing memory latency dwarfs the compute phase • Near Term Implications • Reduce memory latency • Fewer memory accesses • More Memory Level Parallelism (MLP) • Longer Term Implications • What can single-threaded software do while waiting 100 instruction opportunities, 200, 400, … 1000? • What can amazing speculative hardware do?
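Why MLP is on the near-term list can be seen from a simple stall-time model; a sketch with assumed numbers (base CPI, miss rate, and penalty below are illustrative, not from the talk):

```python
def cpi(base_cpi, misses_per_instr, miss_penalty, mlp):
    """Effective cycles per instruction when, on average, `mlp` off-chip
    misses are outstanding at once, so their penalties overlap."""
    stall_per_instr = misses_per_instr * miss_penalty / mlp
    return base_cpi + stall_per_instr

# One miss per 100 instructions, 300-cycle penalty:
print(cpi(1.0, 0.01, 300, mlp=1))  # 4.0 -- memory stalls dominate
print(cpi(1.0, 0.01, 300, mlp=4))  # 1.75 -- overlapping misses recovers much of it
```

Serialized misses triple the CPI here; overlapping even four of them claws most of that back, which is exactly the leverage MLP offers when each miss costs hundreds of cycles.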
Assessment So Far • Appears • Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors) • Processor performance hitting Rock (Slow Memory) • No known way to overcome this, unless • Redefine performance in Popular Moore’s Law • From Processor Performance • To Chip Performance
Outline • Executive Summary • Background • Going Forward: Processor Architecture Hits Rock • Chip Multiprocessing to the Rescue? • Small & Large CMPs • CMP Systems • CMP Workload • Go to the Hard Place of Mainstream Multithreading
Performance for Chip, not Processor or Thread • Chip Multiprocessing (CMP) • Replicate Processor • Private L1 Caches • Low latency • High bandwidth • Shared L2 Cache • Larger than if private
Piranha Processing Node Next few slides from Luiz Barroso’s ISCA 2000 presentation of Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Built up incrementally across slides, the complete single-chip node: • Alpha core: 1-issue, in-order, 500MHz • L1 caches: I&D, 64KB, 2-way • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay • L2 cache: shared, 1MB, 8-way • Memory Controller (MC): RDRAM, 12.8GB/sec (8 banks @ 1.6GB/sec) • Protocol Engines (HE & RE): programmable, 1K instr., even/odd interleaving • System Interconnect: 4-port crossbar router, topology independent, 32GB/sec total bandwidth (4 links @ 8GB/sec) (Figure: eight CPUs with private I/D L1 caches around the ICS, shared L2 banks, memory controllers, and router on a single chip)
Single-Chip Piranha Performance • Piranha’s performance margin: 3x for OLTP and 2.2x for DSS • Piranha has more outstanding misses → better utilizes the memory system
Simultaneous Multithreading (SMT) • Multiplex S logical processors on each processor • Replicate registers, share caches, & manage other parts • Implementation factors keep S small, e.g., 2-4 • Cost-effective gain if threads available • E.g., S=2 → ~1.4x performance • Modest cost • Limits waste if additional logical processor(s) not used • Worthwhile CMP enhancement
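The cost-effectiveness argument can be made concrete with a back-of-the-envelope ratio; a sketch using the slide's ~1.4x speedup for S=2 and an assumed ~10% core-area overhead (the overhead figure is an illustrative assumption, not from the talk):

```python
def cost_performance_gain(speedup, area_overhead):
    """Relative cost-performance of an SMT core vs. the base core:
    performance gained divided by extra silicon paid for."""
    return speedup / (1.0 + area_overhead)

# ~1.4x throughput for ~10% more area: about 1.27x better per unit area,
# which is why SMT is called a worthwhile CMP enhancement.
print(cost_performance_gain(1.4, 0.10))
```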
Small CMP Systems • Use One CMP (with C cores of S-way SMT) • C=[2,16] & S=[2,4] → C*S = [4,64] • Size of a small PC! • Directly Connect CMP (C) to Memory Controller (M) or DRAM
Medium CMP Systems (Figure: processor-centric “dance hall” layout, CMPs (C) facing memory controllers (M)) • Use 2-16 CMPs (with C cores of S-way SMT) • Smaller: 2*4*4 = 32 • Larger: 16*16*4 = 1024 • In a single cabinet • Connecting CMPs & Memory Controllers/DRAM raises many issues
Inflection Points • An inflection point occurs when • A smooth input change leads to • A disruptive output change • Enough transistors for … • 1970s simple microprocessor • 1980s pipelined RISC • 1990s speculative out-of-order • 2000s … • CMP will be the Server Inflection Point • Expect >10x performance for less cost • Implying >>10x cost-performance • Early CMPs like old SMPs, but expect dramatic advances!
So What’s Wrong with CMP Picture? • Chip Multiprocessors • Allow profitable use of more transistors • Support modest to vast multithreading • Will be inflection point for commercial servers • But • Many workloads have single thread (available to run) • Even if single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing) • Go to a Hard Place • Make most workloads flourish with CMPs
Outline • Executive Summary • Background • Going Forward: Processor Architecture Hits Rock • Chip Multiprocessing to the Rescue? • Go to the Hard Place of Mainstream Multithreading • Parallel from Fringe to Center • For All of Computer Science!
Thread Parallelism from Fringe to Center • History • Automatic (vs. Human) Computer • Digital (vs. Analog) Computer • Must Change • Parallel (vs. Sequential) Computer • Parallel (vs. Sequential) Algorithm • Parallel (vs. Sequential) Programming • Parallel (vs. Sequential) Library • Parallel (vs. Sequential) X • Otherwise, repeated performance doublings unlikely
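The sequential-to-parallel shift the slide calls for can be sketched with the simplest parallel algorithm, a divide-and-conquer sum; a minimal Python example (the thread pool shows the program shape, not a speedup, since CPython's GIL serializes pure-Python arithmetic):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Divide-and-conquer sum: split the input into chunks,
    sum the chunks concurrently, then combine the partial sums."""
    if not data:
        return 0
    chunk = max(1, len(data) // workers)
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))

print(parallel_sum(list(range(1001))))  # 500500, same as sum(range(1001))
```

The point of "fringe to center" is that this decompose/combine shape, not the sequential loop, would become the default taught form across algorithms, languages, and libraries.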
Computer Architects Can Contribute • Chip Multiprocessor Design • Transcend pre-CMP multiprocessor design • Intra-CMP has lower latency & much higher bandwidth • Hide Multithreading (Helper Threads) • Assist Multithreading (Thread-Level Speculation) • Ease Multithreaded Programming (Transactions) • Provide a “Gentle Ramp to Parallelism” (Hennessy)
But All of Computer Science is Needed • Hide Multithreading (Libraries & Compilers) • Assist Multithreading (Development Environments) • Ease Multithreaded Programming (Languages) • Divide & Conquer Multithreaded Complexity (Theory & Abstractions) • Must Enable • 99% of programmers think sequentially while • 99% of instructions execute in parallel • Enable a “Parallelism Superhighway”
Summary • (Single-Threaded) Computing faces a Rock: Slow Memory • Popular Moore’s Law (doubling performance) will end soon • Chip Multiprocessing Can Help • >>10x cost-performance for multithreaded workloads • What about software with one apparent thread? • Go to Hard Place: Mainstream Multithreading • Make most workloads flourish with chip multiprocessing • Computer architects can help, but long run • Requires moving multithreading from CS fringe to center • Necessary For Restoring Popular Moore’s Law