Advanced Microarchitecture Multi-This, Multi-That, …
Limits on IPC
• Lam92
  • This paper focused on the impact of control flow on ILP
  • Speculative execution can expose 10-400 IPC
  • assumes no machine limitations except for control dependencies and actual dataflow dependencies
• Wall91
  • This paper looked at limits more broadly
  • No branch prediction, no register renaming, no memory disambiguation: 1-2 IPC
  • ∞-entry bpred, 256 physical registers, perfect memory disambiguation: 4-45 IPC
  • perfect bpred, register renaming and memory disambiguation: 7-60 IPC
  • This paper did not consider "control independent" instructions
Practical Limits
• Today, 1-2 IPC sustained
  • far from the 10's-100's reported by limit studies
• Limited by:
  • branch prediction accuracy
  • underlying DFG (influenced by algorithms, compiler)
  • memory bottleneck
  • design complexity (implementation, test, validation, manufacturing, etc.)
  • power
  • die area
Differences Between Real Hardware and Limit Studies?
• Real branch predictors aren't 100% accurate
• Memory disambiguation is not perfect
• Physical resources are limited
  • can't have infinite register renaming without an infinite PRF
  • would need an infinite-entry ROB, RS, and LSQ
  • would need 10's-100's of execution units for 10's-100's of IPC
• Bandwidths/latencies are limited; the studies assumed
  • single-cycle execution
  • infinite fetch/commit bandwidth
  • infinite memory bandwidth (perfect caching)
Bridging the Gap
[Figure: Watts per IPC on a log scale (1, 10, 100) across Single-Issue Pipelined, Superscalar Out-of-Order (Today), Superscalar Out-of-Order (Hypothetical-Aggressive), and the Limits. Power has been growing exponentially as well; diminishing returns w.r.t. larger instruction windows and higher issue widths.]
Past the Knee of the Curve?
[Figure: Performance vs. "Effort" curve through Scalar In-Order, Moderate-Pipe Superscalar/OOO, and Very-Deep-Pipe Aggressive Superscalar/OOO. It made sense to go Superscalar/OOO: good ROI. Past the knee, very little gain for substantial effort.]
So how do we get more Performance?
• Keep pushing IPC and/or frequency?
  • possible, but too costly
  • design complexity (time to market), cooling (cost), power delivery (cost), etc.
• Look for other parallelism
  • ILP/IPC: fine-grained parallelism
  • Multi-programming: coarse-grained parallelism
    • assumes multiple user-visible processing elements
    • all parallelism up to this point was user-invisible
User Visible/Invisible
• All microarchitecture performance gains up to this point were "free"
  • free in that no user intervention was required beyond buying the new processor/system
  • recompilation/rewriting could provide even more benefit, but you get some even if you do nothing
• Multi-processing pushes the problem of finding the parallelism above the ISA interface
Workload Benefits
[Figure: runtime of Tasks A and B, back-to-back on one 3-wide OOO CPU (baseline) vs. on one 4-wide OOO CPU, two 3-wide OOO CPUs, or two 2-wide OOO CPUs; each alternative shows a benefit over the baseline.]
• This assumes you have two tasks/programs to execute…
… If Only One Task
[Figure: with only Task A, a 4-wide OOO CPU still shows a benefit over the 3-wide baseline; two 3-wide OOO CPUs give no benefit over 1 CPU (the second sits idle); two 2-wide OOO CPUs cause a performance degradation.]
Sources of (Coarse) Parallelism
• Different applications
  • MP3 player in the background while you work on Office
  • other background tasks: OS/kernel, virus check, etc.
• Piped applications
  • gunzip -c foo.gz | grep bar | perl some-script.pl
• Within the same application
  • Java (scheduling, GC, etc.)
  • explicitly coded multi-threading (pthreads, MPI, etc.)
(Execution) Latency vs. Bandwidth
• Desktop processing
  • typically want an application to execute as quickly as possible (minimize latency)
• Server/Enterprise processing
  • often throughput oriented (maximize bandwidth)
  • latency of an individual task is less important
  • ex. Amazon processing thousands of requests per minute: it's OK if an individual request takes a few seconds longer, so long as the total number of requests is processed in time
Benefit of MP Depends on Workload
[Figure: serial vs. parallelizable portions of a workload on 1, 2, 3, and 4 CPUs.]
• Limited number of parallel tasks to run on a PC
  • adding more CPUs than tasks provides zero performance benefit
• Even for parallel code, Amdahl's Law will likely result in sub-linear speedup
• In practice, the parallelizable portion may not be evenly divisible
Cache Coherency Protocols
• Not covered in this course
  • you should have seen a bunch of this in CS6290
• Many different protocols
  • different numbers of states
  • different bandwidth/performance/complexity tradeoffs
• Current protocols are usually referred to by their states
  • ex. MESI, MOESI, etc.
Shared Memory Focus
• Most small-medium multi-processors (these days) use some sort of shared memory
  • shared memory doesn't scale as well to larger numbers of nodes
    • communications are broadcast based; the bus becomes a severe bottleneck
    • or you have to deal with directory-based implementations
• Message passing doesn't need a centralized bus
  • can arrange the multi-processor like a graph (nodes = CPUs, edges = independent links/routes)
  • can have multiple communications/messages in transit at the same time
SMP Machines
• SMP = Symmetric Multi-Processing
• Symmetric = all CPUs are "equal"
  • equal = any process can run on any CPU
  • contrast with older parallel systems with a master CPU and multiple worker CPUs
[Figure: four equal CPUs (CPU0-CPU3) sharing memory.]
Hardware Modifications for SMP
• Processor
  • mainly support for cache coherence protocols
  • includes caches, write buffers, LSQ
  • control complexity increases, as memory latencies may be substantially more variable
• Motherboard
  • multiple sockets (one per CPU)
  • datapaths between CPUs and the memory controller
• Other
  • case: larger for a bigger mobo, better airflow
  • power: bigger power supply for N CPUs
  • cooling: need to remove N CPUs' worth of heat
Chip-Multiprocessing
• Simple SMP on the same chip
[Figure: Intel "Smithfield" block diagram and AMD Dual-Core Athlon FX die photo.]
Shared Caches
• Resources can be shared between CPUs
  • ex. IBM Power 5: the L2 cache is shared between both CPUs (no need to keep two copies coherent)
  • the L3 cache is also shared (only the tags are on-chip; the data are off-chip)
Benefits?
• Cheaper than mobo-based SMP
  • all/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
  • less power than mobo-based SMP as well (communication on-die is more power-efficient than chip-to-chip communication)
• Performance
  • on-chip communication is faster
• Efficiency
  • potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU
Performance vs. Power
• 2x CPUs is not necessarily equal to 2x performance
• 2x CPUs → ½ power for each
  • maybe a little better than ½ if resources can be shared
• Back-of-the-envelope calculation (sketched in code below):
  • 3.8 GHz CPU at 100W
  • dual-core: 50W per CPU
  • P ∝ V³: V_orig³/V_CMP³ = 100W/50W → V_CMP ≈ 0.8 V_orig
  • f ∝ V: f_CMP ≈ 0.8 × 3.8 GHz ≈ 3.0 GHz
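A minimal Python sketch of the back-of-the-envelope calculation above, assuming the classic P ∝ V²f and f ∝ V scaling (so P ∝ V³); real designs also have leakage power and voltage floors, so treat it as illustrative only:

```python
# Back-of-the-envelope CMP power scaling, assuming P ∝ V^2·f and f ∝ V
# (so P ∝ V^3). Ignores leakage, voltage floors, and shared resources.
def scale_for_power(p_orig_watts, p_target_watts, f_orig_ghz):
    power_ratio = p_target_watts / p_orig_watts   # e.g. 50W / 100W = 0.5
    v_scale = power_ratio ** (1.0 / 3.0)          # V_CMP / V_orig = 0.5^(1/3) ≈ 0.79
    return v_scale, f_orig_ghz * v_scale          # f scales linearly with V

v_scale, f_cmp = scale_for_power(100.0, 50.0, 3.8)
print(f"V_CMP ≈ {v_scale:.2f} x V_orig, f_CMP ≈ {f_cmp:.1f} GHz")  # ≈ 0.79, ≈ 3.0 GHz
```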
Simultaneous Multi-Threading
• Uni-processor: 4-6 wide, lucky if you get 1-2 IPC
  • poor utilization
• SMP: 2-4 CPUs, but need independent tasks
  • else poor utilization as well
• SMT: the idea is to use a single large uni-processor as a multi-processor
[Figure: hardware-cost comparison. A CMP is roughly 2x the hardware cost of a regular CPU, while a 2-thread or 4-thread SMT is approximately 1x.]
Overview of SMT Hardware Changes
• For an N-way (N threads) SMT, we need:
  • ability to fetch from N threads
  • N sets of registers (including PCs)
  • N rename tables (RATs)
  • N virtual memory spaces
• But we don't need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
SMT Fetch
• Cycle-multiplexed fetch logic: one fetch engine and I$ with one PC per thread (PC0, PC1, PC2, …); each cycle, thread "cycle % N" fetches, then instructions are decoded and sent to the RS (see the sketch below)
• Duplicate fetch logic: one fetch engine per thread, all sharing the I$ and feeding decode, rename, and dispatch
• Alternatives
  • multiplex the shared fetch logic by some policy other than cycle count
  • duplicate the I$ as well
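A minimal sketch of the cycle-multiplexed option, where a single fetch engine round-robins over the per-thread PCs; the thread count, starting PCs, and `fetch_from` helper are illustrative placeholders, not a real pipeline model:

```python
# Cycle-multiplexed SMT fetch: one fetch engine, per-thread PCs,
# thread selected by "cycle % N" as on the slide.
N_THREADS = 2
pcs = [0x1000, 0x2000]            # one architectural PC per hardware thread

def fetch_from(thread_id, pc):
    print(f"fetch: thread {thread_id} @ {pc:#x}")
    return pc + 4                 # pretend we fetched one 4-byte instruction

for cycle in range(4):
    tid = cycle % N_THREADS       # which thread owns this fetch cycle
    pcs[tid] = fetch_from(tid, pcs[tid])
```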
SMT Rename
• Thread #1's R12 != Thread #2's R12
  • separate name spaces; need to disambiguate
• Option 1: separate RATs (RAT0, RAT1), one per thread, each indexed by architectural register #, all mapping into a shared PRF
• Option 2: a single shared RAT indexed by the concatenation {thread-ID, register #} (sketched below)
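A minimal sketch of the thread-ID-concat option (Option 2); the table sizes and physical register numbers are illustrative assumptions:

```python
# One unified RAT indexed by {thread-id, architectural register},
# mapping into a shared physical register file.
N_THREADS, N_ARCH_REGS = 2, 32
rat = [None] * (N_THREADS * N_ARCH_REGS)

def rat_index(thread_id, arch_reg):
    # Concatenate thread-id with the register number so that
    # thread 0's R12 and thread 1's R12 never collide.
    return (thread_id << 5) | arch_reg   # 5 bits, since there are 32 arch regs

rat[rat_index(0, 12)] = 17   # thread 0: R12 -> physical register T17
rat[rat_index(1, 12)] = 42   # thread 1: R12 -> physical register T42
assert rat[rat_index(0, 12)] != rat[rat_index(1, 12)]
```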
SMT Issue, Exec, Bypass, …
• No change needed
• Thread 0, before renaming: Add R1 = R2 + R3; Sub R4 = R1 – R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3]
• Thread 0, after renaming: Add T12 = T20 + T8; Sub T19 = T12 – T16; Xor T14 = T12 ^ T19; Load T23 = 0[T14]
• Thread 1, before renaming: Add R1 = R2 + R3; Sub R4 = R1 – R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3]
• Thread 1, after renaming: Add T17 = T29 + T3; Sub T5 = T17 – T2; Xor T31 = T17 ^ T5; Load T25 = 0[T31]
• The shared RS entries hold both threads' renamed instructions intermixed; since the physical tags never collide, issue, execute, and bypass work unchanged
SMT Cache
• Each process has its own virtual address space
  • TLB must be thread-aware: translate (thread-id, virtual page) → physical page (see the sketch below)
  • virtual portions of caches must also be thread-aware
    • a VIVT cache must now be (virtual addr, thread-id)-indexed and (virtual addr, thread-id)-tagged
    • similar for a VIPT cache
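A minimal sketch of a thread-aware TLB lookup; the 4 KiB page size and the hard-coded translations are illustrative assumptions (miss handling is omitted):

```python
# Thread-aware TLB: keyed on (thread-id, virtual page number), so the same
# virtual page in two threads can map to different physical frames.
PAGE_SHIFT = 12   # assume 4 KiB pages

tlb = {
    (0, 0x00400): 0x1A2B,   # (thread 0, VPN) -> PPN
    (1, 0x00400): 0x3C4D,   # same VPN, different thread, different frame
}

def translate(thread_id, vaddr):
    vpn = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    return (tlb[(thread_id, vpn)] << PAGE_SHIFT) | offset

assert translate(0, 0x00400123) != translate(1, 0x00400123)
```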
SMT Commit
• One "commit PC" per thread
• Register file management
  • ARF/PRF organization: need one ARF per thread
  • unified PRF: need one "architected RAT" per thread
• Need to maintain interrupts, exceptions, faults on a per-thread basis
• Like OOO needs to appear to the outside world as if it is in-order, SMT needs to appear as if it is actually N CPUs
SMT Design Space
• Number of threads
• Full-SMT vs. hard-partitioned SMT
  • full-SMT: ROB entries can be allocated arbitrarily between the threads
  • hard-partitioned: if only one thread, use all ROB entries; if two threads, each is limited to one half of the ROB (even if the other thread uses only a few entries); possibly similar for RS, LSQ, PRF, etc.
• Amount of duplication
  • duplicate I$, D$, fetch engine, decoders, schedulers, etc.?
• There's a continuum of possibilities between SMT and CMP
  • ex. could have a CMP where the FP unit is shared SMT-style
SMT Performance
• When it works, it fills idle "issue slots" with work from other threads; throughput improves
• But sometimes it can cause performance degradation!
  • i.e., Time(finish one task, then do the other) < Time(do both at the same time using SMT)
How?
• Cache thrashing
  • Thread0 just fits in the Level-1 caches (I$, D$): it executes reasonably quickly due to high cache hit rates
  • Thread1 alone also fits nicely in the caches (ex. after a context switch to Thread1)
  • the caches were just big enough to hold one thread's data, but not two threads' worth
  • run both at once under SMT, and both threads have significantly higher cache miss rates
Fairness
• Consider two programs
• By themselves:
  • Program A: runtime = 10 seconds
  • Program B: runtime = 10 seconds
• On SMT:
  • Program A: runtime = 14 seconds
  • Program B: runtime = 18 seconds
• Standard Deviation of Speedups (lower = better)
  • A's speedup: 10/14 = 0.71
  • B's speedup: 10/18 = 0.56
  • SDS = 0.11
Fairness (2)
• SDS encourages everyone to be punished similarly
  • does not account for actual performance, so if everyone is 1000x slower, it's still "fair"
• Alternative: Harmonic Mean of Weighted IPCs (HMWIPC)
  • IPC_i = achieved IPC for thread i
  • SingleIPC_i = IPC when thread i runs alone
  • HMWIPC = N / (SingleIPC_1/IPC_1 + SingleIPC_2/IPC_2 + … + SingleIPC_N/IPC_N)
• Both metrics are sketched in code below
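A minimal sketch of both fairness metrics, using the Program A/B numbers from the previous slide; for a fixed amount of work, SingleIPC_i/IPC_i equals the ratio of SMT runtime to standalone runtime, which is what the code exploits:

```python
# Fairness metrics on the two-program example: alone = 10s each,
# on SMT = 14s and 18s.
import statistics

alone  = [10.0, 10.0]     # standalone runtimes (seconds)
on_smt = [14.0, 18.0]     # runtimes when co-scheduled on SMT

# Standard Deviation of Speedups (lower = "fairer")
speedups = [a / s for a, s in zip(alone, on_smt)]   # 0.71, 0.56
sds = statistics.stdev(speedups)                    # ≈ 0.11

# Harmonic Mean of Weighted IPCs: N / sum(SingleIPC_i / IPC_i),
# with SingleIPC_i / IPC_i = on_smt_i / alone_i for fixed work.
n = len(alone)
hmwipc = n / sum(s / a for a, s in zip(alone, on_smt))
print(f"SDS = {sds:.2f}, HMWIPC = {hmwipc:.3f}")    # SDS = 0.11, HMWIPC = 0.625
```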
This is all combinable
• Can have a system that supports SMP, CMP, and SMT at the same time
  • take a dual-socket SMP motherboard…
  • insert two chips, each a dual-core CMP…
  • where each core supports two-way SMT
• This example provides 8 threads' worth of execution, shared on 4 actual "cores", split across two physical packages
OS Confusion
• SMT/CMP is supposed to look like multiple CPUs to the software/OS
• Say the OS has two tasks to run and schedules them to (virtual) CPUs
  • ex. 2 cores (either SMP/CMP), each 2-way SMT, exposed as CPU0-CPU3
  • if tasks A and B land on CPU0/CPU1 (the two SMT contexts of the same core), the other core sits idle
  • performance is worse than if SMT was turned off and 2-way SMP was used only
OS Confusion (2)
• Asymmetries in the MP hierarchy can be very difficult for the OS to deal with
  • need to break the abstraction: the OS needs to know which CPUs are real physical processors (SMP), which are shared in the same package (CMP), and which are virtual (SMT)
• Distinct applications should be scheduled to physically different CPUs
  • no cache contention, no power contention
• Cooperative applications (different threads of the same program) should maybe be scheduled to the same physical chip (CMP)
  • reduce latency of inter-thread communication, possibly reduce duplication if a shared L2 is used
• Use SMT as the last choice (see the scheduling sketch below)
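A minimal sketch of that preference order; the (package, core, SMT-context) topology tuples and the greedy placement policy are illustrative assumptions, not how any particular OS actually schedules:

```python
# Greedy topology-aware placement: avoid sharing a core (SMT siblings) first,
# then avoid sharing a package; SMT contexts are used only as a last resort.
cpus = [(p, c, s) for p in range(2) for c in range(2) for s in range(2)]
busy = set()   # (package, core, smt) tuples already assigned

def pick_cpu():
    def crowding(cpu):
        pkg, core, _ = cpu
        same_core = sum(1 for p, c, _s in busy if (p, c) == (pkg, core))
        same_pkg  = sum(1 for p, _c, _s in busy if p == pkg)
        return (same_core, same_pkg)   # minimize core sharing, then pkg sharing
    choice = min((c for c in cpus if c not in busy), key=crowding)
    busy.add(choice)
    return choice

for task in ["A", "B", "C"]:
    print(task, "->", pick_cpu())   # A and B land on different packages
```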
Multi-* is Happening
• Intel Pentium 4 already had "Hyper-Threading" (SMT)
  • went away for a while, but is back in the Core i7
• IBM Power 5 and later have SMT
• Dual and quad core already here; octo-core soon
  • Intel Core i7: 4 cores, each with 2-thread SMT (8 threads total)
• So is single-thread performance dead?
• Is single-thread microarchitecture performance dead?
(The following slides are adapted from Mark Hill's HPCA'08 keynote talk.)
Recall Amdahl's Law
• Begins with a simple software assumption (limit argument)
  • fraction F of execution time is perfectly parallelizable
  • no overhead for scheduling, synchronization, communication, etc.
  • fraction 1 – F is completely serial
• Time on 1 core = (1 – F)/1 + F/1 = 1
• Time on N cores = (1 – F)/1 + F/N
Recall Amdahl's Law [1967]
• Amdahl's Speedup = 1 / [ (1 – F)/1 + F/N ]  (sketched in code below)
• For mainframes, Amdahl expected 1 – F = 35%
  • for a 4-processor speedup ≈ 2
  • for an infinite-processor speedup < 3
  • therefore, stay with mainframes with one/few processors
• Do multicore chips repeal Amdahl's Law?
  • Answer: No, but…
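A minimal sketch of the speedup formula, reproducing Amdahl's mainframe numbers; the 10**9 core count merely stands in for "infinite":

```python
# Amdahl's Law: speedup = 1 / ((1 - F) + F / N), where F is the perfectly
# parallelizable fraction and N is the number of processors.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Amdahl's mainframe estimate: 1 - F = 35%, i.e. F = 0.65
print(amdahl_speedup(0.65, 4))      # ≈ 1.95  ("4-processor speedup = 2")
print(amdahl_speedup(0.65, 10**9))  # ≈ 2.86  ("infinite-processor speedup < 3")
```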
Designing Multicore Chips Hard
• Designers must confront single-core design options
  • instruction fetch, wakeup, select
  • execution unit configuration & operand bypass
  • load/store queue(s) & data cache
  • checkpoint, log, runahead, commit
• As well as additional design degrees of freedom
  • How many cores? How big is each?
  • Shared caches: how many levels? How many banks?
  • Memory interface: how many banks?
  • On-chip interconnect: bus, switched, ordered?
Want Simple Multicore Hardware Model To Complement Amdahl's Simple Software Model
(1) Chip hardware roughly partitioned into:
  • multiple cores (with L1 caches)
  • The Rest (L2/L3 cache banks, interconnect, pads, etc.)
  • changing core size/number does NOT change The Rest
(2) Resources for multiple cores bounded:
  • bound of N resources per chip for cores
  • due to area, power, cost ($$$), or multiple factors
  • bound = power? (but our pictures use area)
Want Simple Multicore Hardware Model, cont.
(3) Micro-architects can improve single-core performance using more of the bounded resource
  • A simple base core
    • consumes 1 Base Core Equivalent (BCE) of resources
    • provides performance normalized to 1
  • An enhanced core (in the same process generation)
    • consumes R BCEs
    • performance is a function Perf(R)
  • What does the function Perf(R) look like?
More on Enhanced Cores
• (Performance Perf(R) consuming R BCEs of resources)
• If Perf(R) > R → always enhance the core
  • cost-effectively speeds up both sequential & parallel code
  • therefore, the equations assume Perf(R) < R
• Graphs assume Perf(R) = square root of R
  • 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
  • why? models diminishing returns with "no coefficients"
• How to speed up the enhanced core?
  • <insert favorite or TBD micro-architectural ideas here>
How Many (Symmetric) Cores per Chip?
• Each chip bounded to N BCEs (for all cores)
• Each core consumes R BCEs
• Assume Symmetric Multicore = all cores identical
• Therefore, N/R cores per chip, since (N/R)*R = N
• For an N = 16 BCE chip: sixteen 1-BCE cores, four 4-BCE cores, or one 16-BCE core
Performance of Symmetric Multicore Chips
• Serial fraction 1 – F uses 1 core at rate Perf(R)
  • serial time = (1 – F) / Perf(R)
• Parallel fraction F uses N/R cores at rate Perf(R) each
  • parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)
• Therefore, w.r.t. one base core:
  • Symmetric Speedup = 1 / [ (1 – F)/Perf(R) + F*R/(Perf(R)*N) ]
• Enhanced cores speed up both serial & parallel phases. Implications?
Symmetric Multicore Chip, N = 16 BCEs
• At F = 0.5, the optimal configuration is R = 16, cores = 1: speedup S = 4 = 1/(0.5/4 + 0.5*16/(4*16)), beating the 16-, 8-, 4-, and 2-core alternatives
• Need to increase parallelism to make multicore optimal! (see the model sketch below)
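A minimal sketch of the symmetric-multicore model from the last few slides, with the Perf(R) = sqrt(R) assumption; it reproduces the F = 0.5 and F = 0.9 design points quoted on these slides:

```python
# Symmetric multicore model: N BCEs per chip, R BCEs per core,
# Perf(R) = sqrt(R) (the "no coefficients" diminishing-returns assumption).
import math

def symmetric_speedup(f, n, r):
    perf = math.sqrt(r)     # single-core performance of an R-BCE core
    cores = n // r          # N/R identical cores per chip
    return 1.0 / ((1.0 - f) / perf + f / (perf * cores))

N = 16
for f in (0.5, 0.9, 0.999):
    best_r = max((1, 2, 4, 8, 16), key=lambda r: symmetric_speedup(f, N, r))
    s = symmetric_speedup(f, N, best_r)
    print(f"F={f}: best R={best_r}, cores={N // best_r}, speedup={s:.1f}")
# F=0.5   -> R=16, cores=1,  speedup 4.0
# F=0.9   -> R=2,  cores=8,  speedup 6.7
# F=0.999 -> R=1,  cores=16, speedup ~15.8
```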
Symmetric Multicore Chip, N = 16 BCEs
• At F = 0.9, multicore is optimal, but the speedup is limited
  • F = 0.9: R = 2, cores = 8, speedup = 6.7 (vs. F = 0.5: R = 16, cores = 1, speedup = 4)
• Need to obtain even more parallelism!
Symmetric Multicore Chip, N = 16 BCEs
• As F → 1: R = 1, cores = 16, speedup → 16
• F matters: Amdahl's Law applies to multicore chips
• Researchers should target parallelism F first
Symmetric Multicore Chip, N = 16 BCEs
• Recall F = 0.9: R = 2, cores = 8, speedup = 6.7
• As Moore's Law enables N to go from 16 to 256 BCEs: more core enhancements? More cores? Or both?