Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors Guri Sohi University of Wisconsin
Outline • Waves of innovation in architecture • Innovation in uniprocessors • Opportunities for innovation • Example CMP innovations • Parallel processing models • CMP memory hierarchies
Waves of Research and Innovation • A new direction is proposed or new opportunity becomes available • The center of gravity of the research community shifts to that direction • SIMD architectures in the 1960s • HLL computer architectures in the 1970s • RISC architectures in the early 1980s • Shared-memory MPs in the late 1980s • OOO speculative execution processors in the 1990s
Waves • Wave is especially strong when coupled with a “step function” change in technology • Integration of a processor on a single chip • Integration of a multiprocessor on a chip
Uniprocessor Innovation Wave • Integration of processor on a single chip • The inflexion point • Argued for different architecture (RISC) • More transistors allow for different models • Speculative execution • Then the rebirth of uniprocessors • Continue the journey of innovation • Totally rethink uniprocessor microarchitecture
The Next Wave • Can integrate a simple multiprocessor on a chip • Basic microarchitecture similar to traditional MP • Rethink the microarchitecture and usage of chip multiprocessors
Broad Areas for Innovation • Overcoming traditional barriers • New opportunities to use CMPs for parallel/multithreaded execution • Novel use of on-chip resources (e.g., on-chip memory hierarchies and interconnect)
Remainder of Talk Roadmap • Summary of traditional parallel processing • Revisiting traditional barriers and overheads to parallel processing • Novel CMP applications and workloads • Novel CMP microarchitectures
Multiprocessor Architecture • Take state-of-the-art uniprocessor • Connect several together with a suitable network • Have to live with defined interfaces • Expend hardware to provide cache coherence and streamline inter-node communication • Have to live with defined interfaces
Software Responsibilities • Reason about parallelism, execution times, and overheads • This is hard • Use synchronization to ease reasoning • Parallel execution trends toward serial as synchronization is added • Very difficult to parallelize transparently
Net Result • Difficult to get parallelism speedup • Computation is serial • Inter-node communication latencies exacerbate problem • Multiprocessors rarely used for parallel execution • Typical use: improve throughput • This will have to change • Will need to rethink “parallelization”
Rethinking Parallelization • Speculative multithreading • Speculation to overcome other performance barriers • Revisiting computation models for parallelization • New parallelization opportunities • New types of workloads (e.g., multimedia) • New variants of parallelization models
Speculative Multithreading • Speculatively parallelize an application • Use speculation to overcome ambiguous dependences • Use hardware support to recover from mis-speculation • E.g., multiscalar • Use hardware to overcome barriers
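Below is a minimal software sketch of the speculative-multithreading idea, assuming a simple loop with possibly ambiguous dependences; hardware such as multiscalar does the logging, conflict detection, and recovery transparently, and the names here (SpecTask, body) are purely illustrative.

```cpp
// Sketch: run loop iterations speculatively in parallel against private
// snapshots, log their memory footprints, and re-execute on a conflict.
#include <cstddef>
#include <set>
#include <thread>
#include <vector>

struct SpecTask {
    std::set<size_t> reads, writes;   // access log for this iteration
    std::vector<int> snapshot;        // private copy of the data
};

int main() {
    std::vector<int> data = {3, 1, 4, 1, 5, 9, 2, 6};

    // The loop body, written to log its memory footprint.
    auto body = [](std::vector<int>& d, size_t i, SpecTask& t) {
        t.reads.insert(i);
        t.writes.insert(i);
        d[i] = d[i] * 2;
    };

    // Speculatively execute every iteration in parallel.
    std::vector<SpecTask> tasks(data.size());
    std::vector<std::thread> threads;
    for (size_t i = 0; i < data.size(); ++i) {
        tasks[i].snapshot = data;
        threads.emplace_back([&, i] { body(tasks[i].snapshot, i, tasks[i]); });
    }
    for (auto& t : threads) t.join();

    // Commit in program order; an iteration that read something an earlier
    // iteration wrote used stale data, so it re-executes (mis-speculation).
    for (size_t i = 0; i < tasks.size(); ++i) {
        bool conflict = false;
        for (size_t j = 0; j < i && !conflict; ++j)
            for (size_t r : tasks[i].reads)
                if (tasks[j].writes.count(r)) { conflict = true; break; }
        if (conflict)
            body(data, i, tasks[i]);                       // redo sequentially
        else
            for (size_t w : tasks[i].writes) data[w] = tasks[i].snapshot[w];
    }
    return data[0] == 6 ? 0 : 1;
}
```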
Overcoming Barriers: Memory Models • Weak models proposed to overcome performance limitations of SC • Speculation used to overcome “maybe” dependences • Series of papers showing SC can achieve performance of weak models
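The classic store-buffering litmus test below (a hedged sketch, not from the talk) shows what the memory-model choice means to the programmer: with sequentially consistent atomics the outcome r1 == 0 and r2 == 0 cannot be observed, while relaxed orderings, like weak hardware models, allow it. The results cited above show that hardware speculation can preserve the SC outcome while still reordering underneath.

```cpp
// Store-buffering litmus test (illustrative). With seq_cst, the result
// r1 == 0 && r2 == 0 is forbidden; with memory_order_relaxed it is allowed.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_seq_cst);
        r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2);   // (0,0) is impossible under SC
}
```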
Implications • Strong memory models not necessarily low performance • Programmer does not have to reason about weak models • More likely to have parallel programs written
Overcoming Barriers: Synchronization • Synchronization to avoid “maybe” dependences • Causes serialization • Speculate to overcome serialization • Recent work on techniques to dynamically elide synchronization constructs
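One way to see synchronization "elision" in software is the sequence-lock-style sketch below (a hedged analogy, not the hardware mechanism the slide refers to): readers skip the lock entirely, validate a version counter, and only retry or fall back if a writer actually intervened, i.e., if there was a real rather than a "maybe" dependence.

```cpp
// Optimistic elision of a reader lock via a version counter (illustrative).
// Writers take the lock and bump the version; readers speculate lock-free
// and retry only when the version changed underneath them.
#include <atomic>
#include <mutex>

struct ElidedPair {
    std::mutex lock;                 // the "real" lock writers still take
    std::atomic<unsigned> version{0};
    std::atomic<int> a{0}, b{0};

    void write(int va, int vb) {
        std::lock_guard<std::mutex> g(lock);
        version.fetch_add(1);        // odd: update in progress
        a.store(va);
        b.store(vb);
        version.fetch_add(1);        // even: stable again
    }

    int sum() {                      // reader elides the lock and speculates
        for (;;) {
            unsigned v = version.load();
            if (v & 1) continue;     // writer active; could fall back to the lock
            int s = a.load() + b.load();
            if (version.load() == v) return s;   // no conflict: speculation succeeded
        }
    }
};

int main() {
    ElidedPair p;
    p.write(2, 3);
    return p.sum() == 5 ? 0 : 1;
}
```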
Implications • Programmer can make liberal use of synchronization to ease programming • Little performance impact of synchronization • More likely to have parallel programs written
Revisiting Parallelization Models • Transactions • simplify writing of parallel code • very high overhead to implement semantics in software • Hardware support for transactions will exist • Speculative multithreading is ordered transactions • No software overhead to implement semantics • More applications likely to be written with transactions
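A minimal sketch of the programming interface transactions provide, using a hypothetical atomically() wrapper; here the semantics are trivially implemented with a single global lock, whereas hardware support or ordered speculative multithreading would run such regions concurrently and roll back on conflict.

```cpp
// "atomically" as a programming interface (illustrative). The programmer
// marks a region atomic and never reasons about lock ordering; a real system
// would execute these regions speculatively in parallel.
#include <mutex>
#include <thread>
#include <vector>

std::mutex txn_lock;                       // stand-in for hardware transaction support

template <typename F>
void atomically(F&& body) {
    std::lock_guard<std::mutex> g(txn_lock);
    body();                                // commit is implicit in this stand-in
}

int main() {
    int from = 100, to = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&] {
            atomically([&] {               // no locks for the programmer to order
                if (from >= 10) { from -= 10; to += 10; }
            });
        });
    for (auto& t : threads) t.join();
    return (from + to == 100) ? 0 : 1;
}
```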
New Opportunities for CMPs • New opportunities for parallelism • Emergence of multimedia workloads • Amenable to traditional parallel processing • Parallel execution of overhead code • Program demultiplexing • Separating user and OS
New Opportunities for CMPs • New opportunities for microarchitecture innovation • Instruction memory hierarchies • Data memory hierarchies
Reliable Systems • Software is unreliable and error-prone • Hardware will be unreliable and error-prone • Improving hardware/software reliability will result in significant software redundancy • Redundancy will be source of parallelism
Software Reliability & Security • Reliability & security via dynamic monitoring • Many academic proposals for C/C++ code: CCured, Cyclone, SafeC, etc. • VM performs checks for Java/C# • High overheads! Encourages use of unsafe code
Software Reliability & Security • Monitoring code [diagram: program tasks A, B, C, D with corresponding monitor tasks A', B', C'] • Divide program into tasks • Fork a monitor thread to check the computation of each task • Instrument the monitor thread with safety-checking code
A B A’ B’ C’ C’ Software Reliability & Security - Commit/abort at task granularity - Precise error detection achieved by re-executing code w/ in-lined checks D C COMMIT D COMMIT ABORT
Software Reliability & Security • Fine-grained instrumentation: flexible, arbitrary safety checking • Coarse-grained verification: amortizes thread startup latency over many instructions • Initial results: software-based fault isolation (Wahbe et al., SOSP 1993) • Assumes no misspeculation due to traps, interrupts, or speculative storage overflow
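A minimal sketch of the monitoring scheme under the assumptions above, with hypothetical names (run_task, updates): the main thread runs a task unchecked and buffers its result, a monitor thread re-executes the task with in-lined bounds checks, and the task commits only if no violation is found and the results agree.

```cpp
// Illustrative monitor-thread sketch: unchecked main computation, checked
// re-execution on a monitor thread, commit/abort at task granularity.
#include <cstdio>
#include <stdexcept>
#include <thread>
#include <utility>
#include <vector>

// The task: write v into out[i] for each (i, v) pair.
std::vector<int> run_task(const std::vector<std::pair<size_t, int>>& updates,
                          size_t size, bool checked) {
    std::vector<int> out(size, 0);
    for (auto [i, v] : updates) {
        if (checked && i >= size)              // in-lined safety check (monitor only)
            throw std::out_of_range("out-of-bounds write detected");
        if (i < size) out[i] = v;
    }
    return out;
}

int main() {
    std::vector<std::pair<size_t, int>> updates = {{0, 7}, {3, 9}};
    std::vector<int> committed(4, 0);

    std::vector<int> speculative = run_task(updates, 4, /*checked=*/false);

    bool ok = true;
    std::thread monitor([&] {                  // monitor re-executes with checks
        try {
            std::vector<int> verified = run_task(updates, 4, /*checked=*/true);
            ok = (verified == speculative);
        } catch (const std::exception& e) {
            std::fprintf(stderr, "abort task: %s\n", e.what());
            ok = false;
        }
    });
    monitor.join();

    if (ok) committed = speculative;           // commit at task granularity
    return ok ? 0 : 1;
}
```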
Program De-multiplexing • New opportunities for parallelism • 2-4X parallelism • Program is a multiplexing of methods (or functions) onto single control flow • De-multiplex methods of program • Execute methods in parallel in dataflow fashion
Data-flow Execution • Data-flow machines • Dataflow in programs • No control flow, no PC • Nodes are instructions on FUs • OoO superscalar machines • Sequential programs • Limited dataflow with ROB • Nodes are instructions on FUs
Program Demultiplexing • Nodes are methods (M); processors are the FUs • [diagram: sequential program vs. demux'ed (dataflow) execution of its methods]
Program Demultiplexing • [diagram: a method's demux'ed execution starts on its trigger; its result is used at the original call] • Triggers • Usually fire after the data dependences of M • Chosen with software support • Handlers generate parameters • PD: data-flow based speculative execution • Differs from control-flow speculation
Data-flow Execution of Methods • [timeline: earliest possible execution time of M vs. execution time plus overheads of M]
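The sketch below conveys the dataflow flavor of executing methods using ordinary futures; it is a hedged analogy rather than the actual PD mechanism, which launches methods speculatively at triggers, ahead of the call site, with handler-generated parameters.

```cpp
// Dataflow-style execution of methods (illustrative): launch each method as
// soon as its inputs exist, consume the result where the sequential program
// would have called it.
#include <functional>
#include <future>
#include <numeric>
#include <vector>

int summarize(const std::vector<int>& v) { return std::accumulate(v.begin(), v.end(), 0); }
int scale(int x) { return 10 * x; }
int combine(int a, int b) { return a + b; }

int main() {
    std::vector<int> data = {1, 2, 3, 4};

    // "Triggers": both methods' inputs are available here, so both are
    // launched and execute on other cores, in dataflow rather than program order.
    auto f1 = std::async(std::launch::async, summarize, std::cref(data));
    auto f2 = std::async(std::launch::async, scale, 7);

    // ... the main thread keeps executing the sequential program ...

    // "On call, use": the original call site just consumes the finished results.
    int result = combine(f1.get(), f2.get());
    return result == 80 ? 0 : 1;
}
```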
Impact of Parallelization • Expect different characteristics for code on each core • More reliance on inter-core parallelism • Less reliance on intra-core parallelism • May have specialized cores
Microarchitectural Implications • Processor Cores • Skinny, less complex • Will SMT go away? • Perhaps specialized • Memory structures (i-caches, TLBs, d-caches) • Different organizations possible
Microarchitectural Implications • Novel memory hierarchy opportunities • Use on-chip memory hierarchy to improve latency and attenuate off-chip bandwidth • Pressure on non-core techniques to tolerate longer latencies • Helper threads, pre-execution
Instruction Memory Hierarchy • Computation spreading
Conventional CMP Organizations • [diagram: each processor has private L1I, L1D, and L2 caches, connected by an interconnection network]
Code Commonality • A large fraction of the code executed is common to all the processors, at both 8KB page and 64-byte cache block granularity • Poor utilization of L1 I-cache due to replication
Removing Duplication • [chart: instruction MPKI with 64K caches: 18.4, 17.9, 23.3, 3.91, 19.1, 26.8, 22.1, 4.211] • Avoiding duplication significantly reduces misses • Shared L1 I-cache may be impractical
Computation Spreading • Avoid reference spreading by distributing computation based on code regions • Each processor is responsible for a particular region of code (for multiple threads) • References to a particular code region are localized • Computation from one thread is carried out on different processors
Example • [diagram: canonical model vs. computation spreading over time]
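A hedged software analogy of computation spreading (the real proposal migrates computation between cores at a much finer grain, in hardware): each worker owns one code region and executes that region's work from all logical threads, so a region's instruction references stay on one core.

```cpp
// Computation-spreading analogy (illustrative): route work items to the core
// that "owns" their code region, so each core executes only one region's code.
#include <cstdio>
#include <thread>
#include <vector>

enum Region { PARSE = 0, COMPUTE = 1, NUM_REGIONS = 2 };

int parse(int x)   { return x + 1; }     // code region 0
int compute(int x) { return x * x; }     // code region 1

struct WorkItem { Region region; int value; };

int main() {
    // Work generated by several logical threads, tagged with the region it needs.
    std::vector<WorkItem> work = {
        {PARSE, 1}, {COMPUTE, 3}, {PARSE, 5}, {COMPUTE, 7},
    };

    // Canonical model: every thread would run both parse() and compute().
    // Computation spreading: one worker per region executes all of that
    // region's items, keeping that region's instructions hot in its L1I.
    std::vector<std::thread> workers;
    for (int r = 0; r < NUM_REGIONS; ++r) {
        workers.emplace_back([&, r] {
            for (const WorkItem& w : work) {
                if (w.region != r) continue;         // not this core's region
                int out = (r == PARSE) ? parse(w.value) : compute(w.value);
                std::printf("region %d: %d -> %d\n", r, w.value, out);
            }
        });
    }
    for (auto& t : workers) t.join();
}
```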
L1 Cache Performance • Significant reduction in instruction misses • Data cache performance deteriorates • Migration of computation disrupts the locality of data references
Data Memory Hierarchy • Co-operative caching
Conventional CMP Organizations • [diagram: each processor has private L1I, L1D, and L2 caches, connected by an interconnection network]
Conventional CMP Organizations • [diagram: each processor has private L1I and L1D caches; an interconnection network connects them to a shared L2]
A Spectrum of CMP Cache Designs • Private caches: no capacity sharing; best latency • Shared caches: unconstrained capacity sharing; best capacity • In between: hybrid schemes (CMP-NUCA, victim replication, CMP-NuRAPID, etc.) and configurable/malleable caches (Liu et al. HPCA'04, Huh et al. ICS'05, etc.) • Cooperative caching occupies this middle ground
CMP Cooperative Caching • [diagram: four processors with private L1 and L2 caches forming an aggregate on-chip cache] • Basic idea: private caches cooperatively form an aggregate global cache • Use private caches for fast access / isolation • Share capacity through cooperation • Mitigate interference via cooperation throttling • Inspired by cooperative file/web caches
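A toy model of the capacity-sharing side of the idea (hedged; the real mechanisms, such as the C-Clean, C-Singlet, and C-1Fwd policies on the next slides, are more refined): a local private-cache miss probes the other cores' private caches before going off-chip, so the private caches act as one aggregate cache.

```cpp
// Toy model of cooperative caching (illustrative): a local L2 miss first looks
// for the block in sibling private L2s; only a global miss goes off-chip.
#include <cstdio>
#include <unordered_set>
#include <vector>

struct Core {
    std::unordered_set<long> l2;          // block addresses resident in this private L2
};

struct CMP {
    std::vector<Core> cores;
    long local_hits = 0, cooperative_hits = 0, offchip = 0;

    explicit CMP(int n) : cores(n) {}

    void access(int core, long block) {
        if (cores[core].l2.count(block)) { ++local_hits; return; }
        for (int other = 0; other < (int)cores.size(); ++other) {
            if (other != core && cores[other].l2.count(block)) {
                ++cooperative_hits;       // served on-chip from a sibling's cache
                cores[core].l2.insert(block);
                return;
            }
        }
        ++offchip;                        // global miss: fetch from memory
        cores[core].l2.insert(block);
    }
};

int main() {
    CMP chip(4);
    chip.access(0, 0x100);                // off-chip fill into core 0
    chip.access(1, 0x100);                // cooperative hit in core 0's L2
    chip.access(1, 0x100);                // now a local hit
    std::printf("local=%ld coop=%ld offchip=%ld\n",
                chip.local_hits, chip.cooperative_hits, chip.offchip);
}
```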
Reduction of Off-chip Accesses • Trace-based simulation: 1MB L2 cache per core • C-Clean: good for commercial workloads • C-Singlet: good for all benchmarks • C-1Fwd: good for heterogeneous workloads • Shared (LRU): most reduction in off-chip access rate • [chart: reduction of off-chip access rate per benchmark; MPKI with private caches: 3.21, 11.63, 5.53, 1.67, 0.50, 5.97, 15.84, 5.28]
Cycles Spent on L1 Misses • Latencies: 10 cycles local L2; 30 cycles neighboring L2; 40 cycles remote L2; 300 cycles off-chip • Trace-based simulation: no miss overlapping • [chart: cycles spent on L1 misses for Private (P), C-Clean (C), C-1Fwd (F), and Shared (S) configurations across OLTP, Apache, ECperf, JBB, Barnes, Ocean, Mix1, and Mix2]
Summary • Start of a new wave of innovation to rethink parallelism and parallel architecture • New opportunities for innovation in CMPs • New opportunities for parallelizing applications • Expect little resemblance between MPs today and CMPs in 15 years • We need to invent and define differences