
Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors


Presentation Transcript


  1. Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors Guri Sohi University of Wisconsin

  2. Outline • Waves of innovation in architecture • Innovation in uniprocessors • Opportunities for innovation • Example CMP innovations • Parallel processing models • CMP memory hierarchies

  3. Waves of Research and Innovation • A new direction is proposed or new opportunity becomes available • The center of gravity of the research community shifts to that direction • SIMD architectures in the 1960s • HLL computer architectures in the 1970s • RISC architectures in the early 1980s • Shared-memory MPs in the late 1980s • OOO speculative execution processors in the 1990s

  4. Waves • Wave is especially strong when coupled with a “step function” change in technology • Integration of a processor on a single chip • Integration of a multiprocessor on a chip

  5. Uniprocessor Innovation Wave • Integration of processor on a single chip • The inflexion point • Argued for different architecture (RISC) • More transistors allow for different models • Speculative execution • Then the rebirth of uniprocessors • Continue the journey of innovation • Totally rethink uniprocessor microarchitecture

  6. The Next Wave • Can integrate a simple multiprocessor on a chip • Basic microarchitecture similar to traditional MP • Rethink the microarchitecture and usage of chip multiprocessors

  7. Broad Areas for Innovation • Overcoming traditional barriers • New opportunities to use CMPs for parallel/multithreaded execution • Novel use of on-chip resources (e.g., on-chip memory hierarchies and interconnect)

  8. Remainder of Talk Roadmap • Summary of traditional parallel processing • Revisiting traditional barriers and overheads to parallel processing • Novel CMP applications and workloads • Novel CMP microarchitectures

  9. Multiprocessor Architecture • Take a state-of-the-art uniprocessor • Connect several together with a suitable network • Have to live with defined interfaces • Expend hardware to provide cache coherence and streamline inter-node communication

  10. Software Responsibilities • Reason about parallelism, execution times, and overheads • This is hard • Use synchronization to ease reasoning • Parallel code trends toward serial execution as synchronization is added • Very difficult to parallelize transparently

  11. Net Result • Difficult to get parallelism speedup • Computation is serial • Inter-node communication latencies exacerbate problem • Multiprocessors rarely used for parallel execution • Typical use: improve throughput • This will have to change • Will need to rethink “parallelization”

  12. Rethinking Parallelization • Speculative multithreading • Speculation to overcome other performance barriers • Revisiting computation models for parallelization • New parallelization opportunities • New types of workloads (e.g., multimedia) • New variants of parallelization models

  13. Speculative Multithreading • Speculatively parallelize an application • Use speculation to overcome ambiguous dependences • Use hardware support to recover from mis-speculation • E.g., Multiscalar • Use hardware to overcome barriers
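The idea can be sketched in software (hardware such as Multiscalar does the buffering and disambiguation itself); the loop body, the ambiguous cross-iteration dependence, and the checkpoint-and-redo recovery below are hypothetical stand-ins:

```cpp
// Minimal software sketch of speculative loop parallelization: assume the
// ambiguous dependence never fires, run chunks in parallel, and squash/redo
// serially if the assumption was wrong.  All names here are hypothetical.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static bool carries_dependence(int v) { return v < 0; }   // rare, data-dependent

int main() {
    std::vector<int> data(1000, 1);
    const std::vector<int> snapshot = data;     // checkpoint taken before speculating
    std::atomic<bool> misspeculated{false};

    // Sequential semantics: data[i] += data[i-1] whenever the dependence fires.
    auto serial = [&](std::vector<int>& d) {
        for (size_t i = 1; i < d.size(); ++i)
            if (carries_dependence(d[i])) d[i] += d[i - 1];
            else d[i] = d[i] * 2 + 1;
    };

    // Speculative task: assume the ambiguous dependence never fires in this range.
    auto spec_task = [&](size_t lo, size_t hi) {
        for (size_t i = lo; i < hi; ++i) {
            if (carries_dependence(data[i])) { misspeculated = true; return; }
            data[i] = data[i] * 2 + 1;
        }
    };

    std::thread t1(spec_task, 1, 500), t2(spec_task, 500, 1000);
    t1.join(); t2.join();

    if (misspeculated) {            // mis-speculation: squash and re-execute serially
        data = snapshot;
        serial(data);
    }
    std::printf("data[999] = %d\n", data[999]);
    return 0;
}
```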

  14. Overcoming Barriers: Memory Models • Weak models proposed to overcome performance limitations of SC • Speculation used to overcome “maybe” dependences • Series of papers showing SC can achieve performance of weak models

  15. Implications • Strong memory models not necessarily low performance • Programmer does not have to reason about weak models • More likely to have parallel programs written

  16. Overcoming Barriers: Synchronization • Synchronization to avoid “maybe” dependences • Causes serialization • Speculate to overcome serialization • Recent work on techniques to dynamically elide synchronization constructs
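One way to express the elision idea with today's ISA extensions is a sketch using Intel TSX RTM intrinsics; the spinlock, the counter it protects, and the abort policy are hypothetical, and the code needs TSX-capable hardware and -mrtm to run:

```cpp
// Sketch of speculative lock elision: enter a hardware transaction, read (but
// do not acquire) the lock, and fall back to real acquisition on abort.
#include <immintrin.h>
#include <atomic>

std::atomic<bool> lock_taken{false};   // a simple test-and-set spinlock
long counter = 0;                      // shared data the lock protects

void acquire() { while (lock_taken.exchange(true, std::memory_order_acquire)) { } }
void release() { lock_taken.store(false, std::memory_order_release); }

void increment() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Elided path: the lock word is only read, never written, so threads
        // whose critical sections do not conflict on data run concurrently.
        // A real acquisition by another thread aborts the transaction.
        if (lock_taken.load(std::memory_order_relaxed)) _xabort(0xff);
        ++counter;
        _xend();
        return;
    }
    acquire();                         // fallback: elision failed, take the lock
    ++counter;
    release();
}
```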

  17. Implications • Programmer can make liberal use of synchronization to ease programming • Little performance impact of synchronization • More likely to have parallel programs written

  18. Revisiting Parallelization Models • Transactions • simplify writing of parallel code • very high overhead to implement semantics in software • Hardware support for transactions will exist • Speculative multithreading is ordered transactions • No software overhead to implement semantics • More applications likely to be written with transactions
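A minimal sketch of the "ordered transactions" view: tasks may execute speculatively in any order, but their effects become visible strictly in program order. The task bodies and the ticket-style commit token are hypothetical:

```cpp
// Ordered-commit sketch: each task computes into private state, then waits for
// the commit token matching its program-order position before publishing.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> commit_token{0};      // index of the next task allowed to commit

void run_task(int id, std::vector<int>& results) {
    int local = id * id;               // speculative work into a private buffer

    while (commit_token.load(std::memory_order_acquire) != id) { }  // wait for turn
    results.push_back(local);          // commit: publish the side effect in order
    commit_token.store(id + 1, std::memory_order_release);
}

int main() {
    std::vector<int> results;
    std::vector<std::thread> threads;
    for (int id = 0; id < 4; ++id)
        threads.emplace_back(run_task, id, std::ref(results));
    for (auto& t : threads) t.join();
    for (int r : results) std::printf("%d ", r);   // always prints 0 1 4 9
    std::printf("\n");
    return 0;
}
```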

  19. New Opportunities for CMPs • New opportunities for parallelism • Emergence of multimedia workloads • Amenable to traditional parallel processing • Parallel execution of overhead code • Program demultiplexing • Separating user and OS

  20. New Opportunities for CMPs • New opportunities for microarchitecture innovation • Instruction memory hierarchies • Data memory hierarchies

  21. Reliable Systems • Software is unreliable and error-prone • Hardware will be unreliable and error-prone • Improving hardware/software reliability will result in significant software redundancy • Redundancy will be source of parallelism

  22. Software Reliability & Security • Reliability & security via dynamic monitoring • Many academic proposals for C/C++ code (CCured, Cyclone, SafeC, etc.) • VM performs checks for Java/C# • High overheads! Encourages use of unsafe code

  23. Software Reliability & Security: Monitoring Code • Divide the program into tasks • Fork a monitor thread to check the computation of each task • Instrument the monitor thread with safety-checking code (figure: program tasks A, B, C, D with monitor threads A', B', C')

  24. A B A’ B’ C’ C’ Software Reliability & Security - Commit/abort at task granularity - Precise error detection achieved by re-executing code w/ in-lined checks D C COMMIT D COMMIT ABORT

  25. Software Reliability & Security • Fine-grained instrumentation: flexible, arbitrary safety checking • Coarse-grained verification: amortizes thread startup latency over many instructions • Initial results: software-based fault isolation (Wahbe et al., SOSP 1993); assumes no mis-speculation due to traps, interrupts, or speculative storage overflow
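A hedged software sketch of the monitoring scheme, in the fault-isolation spirit: the main thread runs the unchecked task into a private copy of the state, a monitor thread re-runs the in-lined checks on another core, and the result commits only if the checks pass. The task, the sandbox bounds, and the buffering are hypothetical:

```cpp
// Task-granularity monitoring sketch: fast unchecked task + checking monitor,
// with commit/abort decided per task.
#include <cstdio>
#include <future>
#include <vector>

constexpr size_t kSandbox = 64;                      // logically permitted index range

// Unchecked task: writes into a private copy of the buffer ("speculative state").
std::vector<int> task(std::vector<int> buf, size_t idx, int val) {
    buf[idx] = val;                                  // no safety check here
    return buf;
}

// Monitor: the task's safety checks, in-lined, run concurrently on another core.
bool monitor(size_t idx) { return idx < kSandbox; }

int main() {
    std::vector<int> state(256, 0);                  // physically 256 entries, sandbox is 64
    size_t idx = 10;                                 // try 200 to see an abort

    auto ok = std::async(std::launch::async, monitor, idx);
    std::vector<int> speculative = task(state, idx, 42);   // runs ahead of the check

    if (ok.get()) state = std::move(speculative);    // commit at task granularity
    else std::printf("abort: write outside sandbox\n");

    std::printf("state[%zu] = %d\n", idx, state[idx]);
    return 0;
}
```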

  26. Software Reliability & Security

  27. Program De-multiplexing • New opportunities for parallelism • 2-4X parallelism • Program is a multiplexing of methods (or functions) onto single control flow • De-multiplex methods of program • Execute methods in parallel in dataflow fashion

  28. Data-flow Execution • Data-flow machines: dataflow in programs; no control flow or PC; nodes are instructions on FUs • OoO superscalar machines: sequential programs; limited dataflow via the ROB; nodes are instructions on FUs (figure: dataflow graph of numbered instruction nodes)

  29. Program Demultiplexing • Nodes are methods (M) • Processors act as the FUs (figure: demux'ed execution vs. the sequential program over the same method graph)

  30. Program Demultiplexing • Triggers: usually fire after the data dependences of M; chosen with software support; on trigger, execute M • Handlers generate M's parameters • On call, use the precomputed result • PD is data-flow based speculative execution; it differs from control-flow speculation (figure: method 6 is demux'ed and executed early while the sequential program continues)
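A rough analogue of the trigger/use pattern using C++ futures: once a method's inputs are ready (the trigger), a handler launches it on another core, and the original call site simply consumes the result. The method, the trigger point, and the intervening work are hypothetical:

```cpp
// Program-demultiplexing sketch: fire a side-effect-free method early, in
// dataflow fashion, and pick up its result at the original call site.
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

// The demultiplexed method M: depends only on its parameters.
long long expensive_sum(const std::vector<long long>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

void do_unrelated_work() { /* the rest of the sequential program runs here */ }

int main() {
    std::vector<long long> data(1000000);
    std::iota(data.begin(), data.end(), 0LL);

    // Trigger: data is fully produced here, so a handler launches M early,
    // well before the sequential control flow reaches the call site.
    std::future<long long> m =
        std::async(std::launch::async, expensive_sum, std::cref(data));

    do_unrelated_work();

    // Original call site ("on call, use"): consume the precomputed result.
    std::printf("sum = %lld\n", m.get());
    return 0;
}
```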

  31. Data-flow Execution of Methods (figure: earliest possible execution time of M vs. its execution time plus overheads)

  32. Impact of Parallelization • Expect different characteristics for code on each core • More reliance on inter-core parallelism • Less reliance on intra-core parallelism • May have specialized cores

  33. Microarchitectural Implications • Processor Cores • Skinny, less complex • Will SMT go away? • Perhaps specialized • Memory structures (i-caches, TLBs, d-caches) • Different organizations possible

  34. Microarchitectural Implications • Novel memory hierarchy opportunities • Use on-chip memory hierarchy to improve latency and attenuate off-chip bandwidth • Pressure on non-core techniques to tolerate longer latencies • Helper threads, pre-execution
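As an illustration of the pre-execution idea, a helper thread can run a stripped-down copy of the main loop and touch data ahead of the main thread so misses overlap with useful work. The array, run-ahead distance, and access pattern are hypothetical; __builtin_prefetch is a GCC/Clang builtin:

```cpp
// Helper-thread prefetching sketch: one thread does the real work, a second
// thread stays a fixed distance ahead issuing prefetches.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::vector<double> data(1 << 22, 1.0);
    std::atomic<size_t> main_pos{0};
    std::atomic<bool> done{false};

    std::thread helper([&] {                     // helper: no computation, just touches data
        const size_t kDistance = 4096;
        while (!done.load(std::memory_order_relaxed)) {
            size_t i = main_pos.load(std::memory_order_relaxed) + kDistance;
            if (i < data.size())
                __builtin_prefetch(&data[i]);
        }
    });

    double sum = 0.0;
    for (size_t i = 0; i < data.size(); ++i) {   // main thread: the real work
        sum += data[i] * data[i];
        main_pos.store(i, std::memory_order_relaxed);
    }
    done = true;
    helper.join();
    std::printf("sum = %f\n", sum);
    return 0;
}
```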

  35. Instruction Memory Hierarchy • Computation spreading

  36. Conventional CMP Organizations (figure: each processor has private L1 I- and D-caches and a private L2, all connected by an interconnection network)

  37. Code Commonality • A large fraction of the executed code is common to all processors, at both 8KB-page and 64-byte cache-block granularity • Poor utilization of the L1 I-cache due to replication

  38. Removing Duplication • Avoiding duplication significantly reduces misses (chart: per-benchmark MPKI with a 64K I-cache) • A shared L1 I-cache may be impractical

  39. Computation Spreading • Avoid reference spreading by distributing the computation based on code regions • Each processor is responsible for a particular region of code (for multiple threads) • References to a particular code region are localized • Computation from one thread is carried out on different processors
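A toy sketch of the spreading idea: each core's worker executes only one code region and work items migrate between per-region queues, so each core's instruction footprint stays small. The two regions, the channel, and the work items are hypothetical:

```cpp
// Computation-spreading sketch: core 0 runs only region A, core 1 only region B;
// work migrates between them instead of every core running all code.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class Channel {                       // simple blocking queue between cores
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [&] { return !q_.empty(); });
        T v = std::move(q_.front()); q_.pop(); return v;
    }
};

int region_a(int x) { return x * x; }         // code region A (e.g., parsing)
int region_b(int x) { return x + 100; }       // code region B (e.g., formatting)

int main() {
    Channel<int> a_to_b, results;

    std::thread core0([&] {                    // core 0 executes only region A
        for (int i = 1; i <= 4; ++i) a_to_b.push(region_a(i));
    });
    std::thread core1([&] {                    // core 1 executes only region B
        for (int i = 0; i < 4; ++i) results.push(region_b(a_to_b.pop()));
    });

    core0.join(); core1.join();
    for (int i = 0; i < 4; ++i) std::printf("%d ", results.pop());
    std::printf("\n");
    return 0;
}
```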

  40. Example (figure: execution over time under the canonical model vs. computation spreading)

  41. L1 Cache Performance • Significant reduction in instruction misses • Data cache performance deteriorates • Migration of computation disrupts the locality of data references

  42. Data Memory Hierarchy • Co-operative caching

  43. Conventional CMP Organizations (figure: each processor has private L1 I- and D-caches and a private L2, all connected by an interconnection network)

  44. Conventional CMP Organizations (figure: each processor has private L1 I- and D-caches; the interconnection network connects them to a shared L2)

  45. A Spectrum of CMP Cache Designs • Private caches: no capacity sharing; best latency • Shared caches: unconstrained capacity sharing; best capacity • In between: hybrid schemes (CMP-NUCA, victim replication, CMP-NuRAPID, etc.), configurable/malleable caches (Liu et al. HPCA'04, Huh et al. ICS'05, etc.), and cooperative caching

  46. CMP Cooperative Caching • Basic idea: private caches cooperatively form an aggregate global cache • Use private caches for fast access / isolation • Share capacity through cooperation • Mitigate interference via cooperation throttling • Inspired by cooperative file/web caches (figure: four cores with private L1s and L2s, with the private L2s acting as one shared pool)
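A toy model of the cooperation policy: on a local miss, sibling private caches are probed before going off-chip, and evicted blocks may be spilled to a neighbor rather than discarded. The cache sizes, replacement policy, and access trace are hypothetical:

```cpp
// Cooperative-caching sketch: private per-core caches that probe each other on
// a miss and spill evicted blocks to a neighbor.
#include <algorithm>
#include <array>
#include <cstdio>
#include <deque>

struct PrivateCache {
    std::deque<int> blocks;            // front = most recently used
    static constexpr size_t kCapacity = 4;

    bool hit(int addr) {
        auto it = std::find(blocks.begin(), blocks.end(), addr);
        if (it == blocks.end()) return false;
        blocks.erase(it); blocks.push_front(addr);   // LRU update
        return true;
    }
    int insert(int addr) {             // returns the evicted block, or -1
        int victim = -1;
        if (blocks.size() == kCapacity) { victim = blocks.back(); blocks.pop_back(); }
        blocks.push_front(addr);
        return victim;
    }
};

std::array<PrivateCache, 4> caches;    // one private cache per core
long off_chip_accesses = 0;

void access(int core, int addr) {
    if (caches[core].hit(addr)) return;                       // local hit
    for (int c = 0; c < 4; ++c)                               // cooperation: probe siblings
        if (c != core && caches[c].hit(addr)) {
            caches[core].insert(addr);                        // bring the block local
            return;
        }
    ++off_chip_accesses;                                      // true miss
    int victim = caches[core].insert(addr);
    if (victim != -1) caches[(core + 1) % 4].insert(victim);  // spill to a neighbor
}

int main() {
    for (int i = 0; i < 32; ++i) access(i % 4, i % 12);       // hypothetical trace
    std::printf("off-chip accesses: %ld\n", off_chip_accesses);
    return 0;
}
```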

  47. Reduction of Off-chip Accesses • Trace-based simulation: 1MB L2 cache per core • C-Clean: good for commercial workloads • C-Singlet: good for all benchmarks • C-1Fwd: good for heterogeneous workloads • Shared (LRU) gives the most reduction (chart: reduction in off-chip access rate per benchmark; private-cache MPKI baselines: 3.21, 11.63, 5.53, 1.67, 0.50, 5.97, 15.84, 5.28)

  48. Cycles Spent on L1 Misses • Latencies: 10 cycles local L2; 30 cycles next L2; 40 cycles remote L2; 300 cycles off-chip • Trace-based simulation: no miss overlapping • Schemes: P = Private, C = C-Clean, F = C-1Fwd, S = Shared (chart: cycles spent on L1 misses for OLTP, Apache, ECperf, JBB, Barnes, Ocean, Mix1, Mix2)

  49. Summary • Start of a new wave of innovation to rethink parallelism and parallel architecture • New opportunities for innovation in CMPs • New opportunities for parallelizing applications • Expect little resemblance between MPs today and CMPs in 15 years • We need to invent and define differences
