Learn about the motivation behind Simultaneous Multithreading (SMT) and how it improves the utilization of wide superscalar processors by exploiting both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP).
ECE 721: Simultaneous Multithreading (SMT), Spring 2019, Prof. Eric Rotenberg
SMT Motivation • Wide superscalar is profitable when program has high ILP • Wide superscalar is not profitable when program is memory-bound or has low ILP • Memory-bound: Processor may eventually idle if waiting for last-level cache miss (100s of cycles) • Low ILP: Frequent branch mispredictions or few data-independent instructions cause execution lanes to be wasted or unused, respectively
SMT Motivation (cont.) • Simultaneous multithreading (SMT) • Multiple programs simultaneously share resources of a wide superscalar • Thread-level parallelism (TLP) improves utilization of wide superscalar • While a thread is stalled for a last-level cache miss, other threads can use execution lanes • If a thread is in a low-ILP phase, other threads can use execution lanes
SMT Motivation (cont.) • SMT is powerful because it exploits both ILP and TLP • If only a single program is running, wide superscalar minimizes its execution time by accelerating high-ILP phases • If multiple programs are running (“multiprogrammed workload”): • Individual programs may be slowed down (compared to if they ran separately) because they must share resources during their high-ILP phases • Nonetheless, the multiprogrammed workload as a whole will finish sooner with SMT than if the constituent programs were run one at a time, due to higher utilization of processor resources
[Figure: execution lanes vs. cycles. Single-threaded execution leaves many lanes idle each cycle; with SMT, instructions from threads 1, 2, and 3 fill the idle lanes.]
ISCA-23 SMT Paper • Contributions • Implementable SMT microarchitecture • Leverage existing superscalar mechanisms • Single-program performance unimpacted • Exploit TLP better • Basic fetch/issue policies exploit TLP poorly • Utilization tops out at 50% • Increasing number of threads beyond 5 doesn’t help • Insight into fetch and issue bottlenecks • Novel fetch/issue policies
[Figure: superscalar pipeline (Fetch Unit with Instruction Cache and PC, Decode, Register Renaming, Int./FP queues, Int./FP Registers, Int. + load/store units, FP units, Data Cache) annotated with SMT changes. Fetch Unit: multiple PCs, thread selection, replicate architectural state (RAS, BHR). Register Renaming: multiple rename map tables, multiple arch. map tables, partition single active list, multiple GBMs. Issue queues and execution lanes: selective squash (no change). Register files: replicate architectural state. Load/store queues: partition LQ/SQ.]
Types of Changes • Types of changes required for SMT support • REP: Replicate hardware • SIZE: Resize hardware • CTL: Additional control
Instruction Fetch • Multiple program counters (REP) • Thread selection (CTL) • Per-thread return address stacks and global branch history registers (REP)
Register File Management • Per-thread rename map tables (REP) • Per-thread arch. map tables (REP) • Partition active list among active threads (CTL) • Larger register file to hold architectural state for all threads (SIZE)
Issue Queues • Selective squash in Issue Queues and Execution Lanes • No noticeable changes are needed in these stages • Each thread maintains its own global branch mask (GBM) in the Rename Stage, so a branch in a thread can only squash instructions in that thread • T threads and a maximum of B outstanding branches among all threads • T GBMs • B bits in each GBM • GBMs are mutually exclusive because threads share the pool of B branches • A given branch bit can be “1” in only one GBM • A branch bit is free if it is “0” in all GBMs • Example (T=4, B=8): GBM of Thread 0 = 11001000, Thread 1 = 00000010, Thread 2 = 00010100, Thread 3 = 00100001
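A minimal sketch of the GBM bookkeeping described above, assuming T=4 and B=8 as in the example; the data structures and function names are illustrative, not from the paper.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch: per-thread global branch masks (GBMs) drawing branch
 * bits from a shared pool of B = 8 outstanding branches across T = 4 threads. */
#define T 4
#define B 8

static uint8_t gbm[T];   /* gbm[t] has a '1' for each unresolved branch of thread t */

/* A branch bit is free only if it is '0' in every thread's GBM. */
static int alloc_branch_bit(int thread)
{
    uint8_t in_use = 0;
    for (int t = 0; t < T; t++)
        in_use |= gbm[t];

    for (int b = 0; b < B; b++) {
        if (!(in_use & (1u << b))) {      /* bit b unused by all threads */
            gbm[thread] |= (1u << b);     /* claim it for this thread    */
            return b;
        }
    }
    return -1;  /* no free branch bits: stall this thread in the rename stage */
}

/* On resolving branch bit b of 'thread', release the bit; a misprediction
 * selectively squashes only that thread's dependent instructions. */
static void resolve_branch_bit(int thread, int b, bool mispredicted)
{
    gbm[thread] &= ~(1u << b);
    (void)mispredicted;  /* selective squash of the thread's instructions omitted */
}
```

Because a bit can be set in at most one GBM, resolving or squashing a branch in one thread never affects instructions of the other threads.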
LQ/SQ • Partition LQ/SQ among active threads (CTL). Motivation: • Must only consider memory dependences within a thread. • Remember, we are talking about a speculative instruction window. • Globally-performed (committed) stores and loads among threads are handled differently (see below). • What about shared memory among threads? • Handle it identically to globally-performed stores from other cores: only snoop committed stores from other threads. • Do this for threads in the same core and for threads in other cores.
EXAMPLE: Register management changes for SMT • Suppose ISA defines 32 logical registers • Suppose superscalar core has: • Maximum of 64 active instructions • Maximum of 12 unresolved branches
EXAMPLE (cont.) • Superscalar core without SMT support • 1 rename map table • 1 architectural map table • 12 shadow map tables • 96 physical registers • 32 for committed state • 64 for speculative state • 64 entries in freelist • 64 entries in active list
EXAMPLE (cont.) • Same superscalar core with SMT support, 4 threads • 4 rename map tables • 4 architectural map tables • 12 shadow map tables • 192 physical registers • 4*32 = 128 for committed state • 64 for speculative state • 64 entries in freelist • 64 entries in active list • Partition among active threads • E.g., if only two threads are running, one thread gets the top half and the other gets the bottom half. Or use a more sophisticated algorithm to partition based on resource needs. • Four sets of head/tail pointers (one per thread's active-list partition)
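The sizing above can be captured in a short worked computation; the constant names are ours and the numbers are the example's.

```c
#include <stdio.h>

/* Worked sizing from the example: 32 logical registers, a 64-instruction
 * window, and 4 SMT threads. */
enum { LOGICAL_REGS = 32, MAX_ACTIVE_INSNS = 64, SMT_THREADS = 4 };

int main(void)
{
    int prf_no_smt = LOGICAL_REGS + MAX_ACTIVE_INSNS;                /*  96 */
    int prf_smt    = SMT_THREADS * LOGICAL_REGS + MAX_ACTIVE_INSNS;  /* 192 */

    /* Free list and active list track only speculative registers, so their
     * sizes are unchanged by SMT in this baseline design. */
    int freelist_entries    = MAX_ACTIVE_INSNS;                      /*  64 */
    int active_list_entries = MAX_ACTIVE_INSNS;                      /*  64 */

    printf("PRF without SMT: %d, with SMT: %d\n", prf_no_smt, prf_smt);
    printf("free list: %d, active list: %d\n",
           freelist_entries, active_list_entries);
    return 0;
}
```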
About extra physical registers • If the number of active threads < maximum number of threads, can’t we use the extra architectural register contexts for more speculative registers? • Yes, but it would require: • Sizing the free list and active list for the larger case (a single thread using all spare contexts speculatively) rather than the smaller case (all T threads holding architectural state), where T = maximum # threads • Non-reconfigurable SMT: active list size = free list size = PRF size – T*(# logical registers) • Reconfigurable SMT: active list size = free list size = PRF size – 1*(# logical registers) • Reconfiguration at the time of adding or removing active threads • Drain pipeline of all in-flight instructions (stall fetch while retiring all in-flight instructions). Must ensure enough free registers for instantiating additional architectural register contexts. • Add a thread: Pop free registers from free list to populate a blank-slate AMT and RMT. • Remove a thread: Push the newly-deallocated thread’s AMT registers back onto the free list.
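A sketch of the two sizing options and the add-a-thread step, using the example's numbers (PRF = 192, 32 logical registers, T = 4); the toy data structures and function name are assumptions for illustration only.

```c
#include <assert.h>

#define PRF_SIZE      192
#define LOGICAL_REGS  32
#define T_MAX         4

/* Non-reconfigurable SMT: lists sized assuming all T_MAX architectural contexts. */
enum { NONRECONFIG_LIST_SIZE = PRF_SIZE - T_MAX * LOGICAL_REGS };  /*  64 */
/* Reconfigurable SMT: lists sized assuming a single architectural context. */
enum { RECONFIG_LIST_SIZE    = PRF_SIZE - 1 * LOGICAL_REGS };      /* 160 */

/* Toy free list and map tables, just to show the add-thread step. */
static int free_list[RECONFIG_LIST_SIZE];
static int free_count;
static int amt[T_MAX][LOGICAL_REGS];   /* architectural map tables */
static int rmt[T_MAX][LOGICAL_REGS];   /* rename map tables        */

/* Adding a thread after the pipeline has been drained: pop free physical
 * registers to give the new context a blank-slate AMT and RMT. */
static void add_thread(int t)
{
    assert(free_count >= LOGICAL_REGS);
    for (int lr = 0; lr < LOGICAL_REGS; lr++) {
        int pr = free_list[--free_count];
        amt[t][lr] = pr;
        rmt[t][lr] = pr;
    }
}
```

Removing a thread is the reverse: the registers named by the departing thread's AMT are pushed back onto the free list.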
Nice Aspects of Implementation • OOO execution core unaware of multiple threads • Single-thread performance not impacted much • Mild degradation due to larger physical register file (described later…)
Potential Problem Areas • Predictor and cache pressure • Claims from paper • For chosen workload, predictor and cache conflicts not a problem • Any extra mispredictions and misses are cushioned by abundant TLP • Large benchmarks in practice • Cache conflicts may cause slowdown w.r.t. running programs serially • Serial execution exploits cache locality
Register File Impact • Why is register file larger with SMT? • 1 thread: Minimum of 1*32 integer registers • 8 threads: Minimum of 8*32 integer registers • I.e., need to store architectural state of all threads • Note that amount of speculative state is independent of number of threads (depends only on total number of active instructions) • Implication • Don’t want to increase cycle time • Expand register read stage into two stages; same with register write stage
Pipeline comparison (base vs. SMT):
• Base pipeline: Fetch, Decode, Rename, Queue, Reg Read, Exec., Reg. Wrt., Commit. Single bypass; register usage 4-cycle minimum; misfetch penalty 2 cycles; misprediction penalty 6-cycle minimum (may overlap).
• SMT pipeline (register read and write each expanded to two stages): Fetch, Decode, Rename, Queue, Reg Read, Reg Read, Exec., Reg. Wrt., Reg. Wrt., Commit. Double bypasses; register usage 6-cycle minimum; misfetch penalty 2 cycles; misprediction penalty 7-cycle minimum.
Performance of Base SMT • Configuration • 8-way issue superscalar, 8 threads • Positive results • Single-thread performance degrades only 2% due to additional pipestages • SMT throughput is 84% higher than superscalar • Negative results • Processor utilization still low at 50% (IPC = 4) • Throughput peaks at 5 or 6 threads (not 8)
SMT Bottlenecks • Fetch throughput • Sustaining only 4.2 useful instructions per cycle! • Base thread selection: Round-Robin, 1 thread at a time • “Horizontal waste” due to single-threaded fetch stage • Sources of waste include misalignment and taken branches
SMT Bottlenecks (cont.) • Lack of parallelism • 8 independent threads should provide plenty of parallelism • Perhaps have the wrong instructions in the issue queues!
Fetch Unit • Try other fetch models • Notation: alg.num1.num2 • alg => thread selection method (which thread(s) to fetch) • num1 => # of threads that can fetch in 1 cycle • num2 => max # of instructions fetched per thread per cycle • There are 8 instruction cache banks and conflicts are modeled
Fetch Unit: Partitioning • Keep thread selection (alg) fixed • Round Robin (RR) • Models • RR.1.8 • Base scheme: 1 thread at a time, has full b/w of 8 • RR.2.4 and RR.4.2 • Total # of fetched instructions remains 8 • If num1 is too high, suffer thread shortage problem: too few threads to achieve 8 instr./cycle • RR.2.8 • Eliminates thread shortage problem (each thread can fetch max b/w) while reducing horizontal waste
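A rough sketch of an RR.num1.num2-style fetch stage, shown with num1 = 2 and num2 = 4 (i.e., RR.2.4); the structure is an illustrative assumption rather than the paper's implementation, and the per-thread fetch is stubbed out.

```c
/* Illustrative sketch of the RR.num1.num2 fetch models: each cycle, up to
 * NUM1 threads are selected round-robin, and each selected thread may fetch
 * up to NUM2 instructions (total fetch bandwidth = 8 in all models shown). */
#define NUM_THREADS 8
#define NUM1 2          /* threads fetching per cycle (e.g., RR.2.4) */
#define NUM2 4          /* max instructions per thread per cycle     */

static int next_thread;  /* round-robin pointer */

/* Stub: a real fetch stops early at a taken branch, a cache-line boundary,
 * an I-cache bank conflict, or an I-cache miss (sources of horizontal waste). */
static int fetch_from_thread(int t, int max) { (void)t; return max; }

static int fetch_cycle(void)
{
    int fetched = 0;
    for (int i = 0; i < NUM1; i++) {
        int t = (next_thread + i) % NUM_THREADS;
        fetched += fetch_from_thread(t, NUM2);
    }
    next_thread = (next_thread + NUM1) % NUM_THREADS;
    return fetched;
}
```

With NUM1 = 1 and NUM2 = 8 this degenerates to the base RR.1.8 scheme; raising NUM1 too far (e.g., RR.4.2) exposes the thread-shortage problem when fewer than NUM1 threads have useful instructions to fetch.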
Fetch Unit: Thread Choice • Replace Round-Robin (RR) with more intelligent thread selection policies • BRCOUNT • MISSCOUNT • ICOUNT
BRCOUNT • Give high priority to those threads with fewest unresolved branches • Attacks wrong-path fetching
MISSCOUNT • Give priority to threads with fewest outstanding data cache misses • Attacks IQ clog
ICOUNT • Give priority to threads with fewest instructions in decode, rename, and issue queues • Threads that make good progress => high priority • Threads that make slow progress => low priority • Attacks IQ clog generally
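A minimal sketch of ICOUNT-style thread selection; the counters and names are illustrative assumptions. BRCOUNT and MISSCOUNT would substitute unresolved-branch or outstanding-D-cache-miss counts for the occupancy count.

```c
/* Illustrative ICOUNT-style selection: pick the active thread with the fewest
 * instructions currently in the decode, rename, and issue-queue stages. */
#define NUM_THREADS 8

static int icount[NUM_THREADS];   /* per-thread front-end + IQ occupancy */
static int active[NUM_THREADS];   /* 1 if the thread is running          */

static int select_fetch_thread(void)
{
    int best = -1;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (!active[t])
            continue;
        if (best < 0 || icount[t] < icount[best])
            best = t;
    }
    return best;   /* -1 if no thread is active */
}
```

A thread that clogs the issue queues accumulates a high count and automatically loses fetch priority, which is why ICOUNT attacks IQ clog in general.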
Helper Threads • Do work on behalf of program • Exploit spare threads in SMT or CMP
Two Helper Thread Research Thrusts • Performance • Pre-execution • Accelerate main program • Functionality • Augment program with other useful facets • Fault tolerance, software bug detection, security, profiling, garbage collection, dynamic optimizer, etc.
Other Threading Paradigms • Multiscalar (1992-1996), speculative multithreading (mid/late 90s) • Performance • Divide single program into multiple, speculatively parallel tasks • Different task execution substrates • Multiple tightly-coupled processing elements (PEs) – register + memory communication among tasks • Loosely-coupled PEs, i.e., CMP substrate – memory communication only • SMT substrate – register + memory communication • Multipath execution (1960s, 1990s) • Performance • Go down both paths of unconfident branches • Typically SMT substrate • Redundant threading • Transient fault tolerance: AR-SMT (1999), SRT (2000-2004) • Transient faults and hardware bugs: DIVA (1999) • Fault tolerance AND performance: Slipstream (1999-)
Multiple Processes • The following discussion applies to multiple processes sharing the processor and memory system • Applies to time sharing, i.e., context switching • Applies to space sharing, i.e., simultaneous multithreading • i.e., doesn’t matter whether threads are switching or simultaneous
Caches • Using physical addresses (PA) simplifies a lot • Respects separation (scenario #1) • Respects sharing (scenario #2) • Scenario #1 (separation): same VA, different PA. Process 0: VA(A) maps to PA(X); process 1: VA(A) maps to PA(Y). • Scenario #2 (sharing): different VA, same PA. Process 0: VA(A) maps to PA(X); process 1: VA(B) maps to PA(X).
Caches (cont.) • Three options for accessing cache • Serial: access cache after TLB search • Easily respects separation & sharing (physically indexed, physically tagged) • No constraints on cache organization • Increases hit time • Parallel, constrained: index cache in parallel with TLB search • Configure cache such that index bits fit entirely within page offset bits • ECE 521: assoc >= (cache size / page size) • Easily respects separation & sharing (physically indexed, physically tagged) • Does not increase hit time • Increasing cache size requires increasing set-associativity • Parallel, unconstrained: index cache in parallel with TLB search • Allow index bits to spill into virtual page number part of virtual address • Does not increase hit time • No constraints on cache organization • Does not easily respect separation & sharing (virtually indexed, physically tagged). Flush caches after context switch (doesn’t work for SMT) or take other sophisticated hardware/software measures.
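The "parallel, constrained" restriction can be checked with simple arithmetic; the cache and page parameters below are assumed example values.

```c
/* Check whether a cache can be indexed in parallel with the TLB lookup:
 * the block-offset plus set-index bits must fit inside the page offset.
 * Equivalently: associativity >= cache size / page size. */
#include <stdio.h>

int main(void)
{
    int cache_size = 32 * 1024;   /* example: 32 KB L1      */
    int block_size = 64;          /* 64-byte blocks         */
    int assoc      = 8;           /* 8-way set-associative  */
    int page_size  = 4 * 1024;    /* 4 KB pages             */

    int sets        = cache_size / (block_size * assoc);  /* 64 sets */
    int index_bytes = sets * block_size;  /* bytes covered by offset + index bits */

    printf("parallel-constrained OK: %s\n",
           (index_bytes <= page_size) ? "yes" : "no");
    printf("equivalent check (assoc >= size/page): %s\n",
           (assoc >= cache_size / page_size) ? "yes" : "no");
    return 0;
}
```

With these numbers the two checks agree (both "yes"); doubling the cache size to 64 KB at the same associativity would violate the constraint, which is why growing the cache forces higher associativity under this option.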
TLB • ISA defines TLB entry • A translation is usually a tuple with three fields: [process id, VA, PA] • Including the process id means the TLB doesn’t need to be flushed after context switches • Flush isn’t even an option for SMT
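A minimal sketch of a process-id-tagged TLB entry as described above; the field names and widths are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of a TLB entry tagged with a process id (address-space id), so
 * entries from different processes can coexist: no flush is needed on a
 * context switch, and simultaneously running SMT threads can share the TLB. */
typedef struct {
    bool     valid;
    uint16_t asid;   /* process id / address-space id */
    uint64_t vpn;    /* virtual page number           */
    uint64_t ppn;    /* physical page number          */
    uint8_t  perms;  /* protection bits               */
} tlb_entry_t;

/* A hit requires matching both the VPN and the ASID of the accessing thread. */
static bool tlb_hit(const tlb_entry_t *e, uint16_t asid, uint64_t vpn)
{
    return e->valid && e->asid == asid && e->vpn == vpn;
}
```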
LQ/SQ: virtual or physical addresses? • The following discussion applies to a core with or without SMT • SMT has no bearing on the discussion • Use physical addresses in both LQ and SQ, for these reasons: • Committing stores • Committed store traffic (invalidates or updates) from other cores or threads must search the LQ • For loads, how to search the SQ in parallel with TLB access? • Two SQ CAMs, one with virtual addresses (SQ-V) and the other with physical addresses (SQ) • Load searches SQ-V first with its virtual address, and SQ second with its physical address • Fast, speculative store-load forwarding from SQ-V • Recover if SQ disagrees in the next cycle
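A much-simplified sketch of the SQ-V/SQ search order described above; all names and structures are assumptions, and age ordering, partial overlap, and youngest-older-match selection are omitted.

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model of the two store-queue CAMs: SQ-V is tagged with virtual
 * addresses, SQ with physical addresses. */
#define SQ_SIZE 32

typedef struct {
    bool     valid;
    uint64_t vaddr;   /* SQ-V tag: virtual address  */
    uint64_t paddr;   /* SQ   tag: physical address */
    uint64_t data;
} sq_entry_t;

static sq_entry_t sq[SQ_SIZE];

/* Cycle 1: speculatively forward from the entry whose virtual address
 * matches the load's virtual address (in parallel with the TLB access). */
static int search_sq_v(uint64_t load_vaddr)
{
    for (int i = 0; i < SQ_SIZE; i++)
        if (sq[i].valid && sq[i].vaddr == load_vaddr)
            return i;
    return -1;   /* no forwarding */
}

/* Cycle 2: verify using the load's translated physical address; if the
 * physical search disagrees with the speculative forwarding decision,
 * the load (and its dependents) must recover. */
static bool verify_sq_p(uint64_t load_paddr, int forwarded_from)
{
    for (int i = 0; i < SQ_SIZE; i++)
        if (sq[i].valid && sq[i].paddr == load_paddr)
            return i == forwarded_from;
    return forwarded_from == -1;   /* no physical match: forwarding nothing was correct */
}
```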
Summary • The same discussion applies with SMT • The only difference is that the cache-flush solution (if relevant) does not make sense in the case of SMT • Summing up: • The problem is not caused by SMT • The problem is caused by the interplay of these factors: • Virtual memory • Multiple processes (whether conventional time-sharing or SMT) • Desire to access TLB and cache in parallel