Learn about the motivation behind Simultaneous Multithreading (SMT) and how it improves the utilization of wide superscalar processors by exploiting both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP).
ECE 721: Simultaneous Multithreading (SMT), Spring 2019, Prof. Eric Rotenberg
SMT Motivation • Wide superscalar is profitable when program has high ILP • Wide superscalar is not profitable when program is memory-bound or has low ILP • Memory-bound: Processor may eventually idle if waiting for last-level cache miss (100s of cycles) • Low ILP: Frequent branch mispredictions or few data-independent instructions cause execution lanes to be wasted or unused, respectively
SMT Motivation (cont.) • Simultaneous multithreading (SMT) • Multiple programs simultaneously share resources of a wide superscalar • Thread-level parallelism (TLP) improves utilization of wide superscalar • While a thread is stalled for a last-level cache miss, other threads can use execution lanes • If a thread is in a low-ILP phase, other threads can use execution lanes
SMT Motivation (cont.) • SMT is powerful because it exploits both ILP and TLP • If only a single program is running, wide superscalar minimizes its execution time by accelerating high-ILP phases • If multiple programs are running (“multiprogrammed workload”): • Individual programs may be slowed down (compared to if they ran separately) because they must share resources during their high-ILP phases • Nonetheless, the multiprogrammed workload as a whole will finish sooner with SMT than if the constituent programs were run one at a time, due to higher utilization of processor resources
[Figure: execution lanes vs. cycles. Single-threaded execution leaves many lanes idle each cycle; with SMT, instructions from threads 1, 2, and 3 fill the idle lanes.]
ISCA-23 SMT Paper • Contributions • Implementable SMT microarchitecture • Leverage existing superscalar mechanisms • Single-program performance unimpacted • Exploit TLP better • Basic fetch/issue policies exploit TLP poorly • Utilization tops out at 50% • Increasing number of threads beyond 5 doesn’t help • Insight into fetch and issue bottlenecks • Novel fetch/issue policies
[Figure: superscalar pipeline (Fetch Unit with Instruction Cache and PC, Decode, Register Renaming, Int./FP queues, Int./FP Registers, Int. + load/store units, FP units, Data Cache) annotated with SMT changes. Fetch Unit: multiple PCs, thread selection, replicate architectural state (RAS, BHR). Register Renaming: multiple rename map tables, multiple arch. map tables, partition single active list, multiple GBMs. Issue queues and execution lanes: selective squash (no change). Register files: replicate architectural state. Load/store queues: partition LQ/SQ.]
Types of Changes • Types of changes required for SMT support • REP: Replicate hardware • SIZE: Resize hardware • CTL: Additional control
Instruction Fetch • Multiple program counters (REP) • Thread selection (CTL) • Per-thread return address stacks and global branch history registers (REP)
Register File Management • Per-thread rename map tables (REP) • Per-thread arch. map tables (REP) • Partition active list among active threads (CTL) • Larger register file to hold architectural state for all threads (SIZE)
Issue Queues • Selective squash in Issue Queues and Execution Lanes • No noticeable changes are needed in these stages • Each thread maintains its own global branch mask (GBM) in the Rename Stage, so a branch in a thread can only squash instructions in that thread • T threads and a maximum of B outstanding branches among all threads • T GBMs • B bits in each GBM • GBMs are mutually exclusive because threads share the pool of B branches • A given branch bit can be “1” in only one GBM • A branch bit is free if it is “0” in all GBMs • Example (T=4, B=8): GBM of Thread 0 = 11001000, Thread 1 = 00000010, Thread 2 = 00010100, Thread 3 = 00100001
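A minimal sketch of the GBM bookkeeping described above, assuming T=4 and B=8 as in the example; the data structures and function names are illustrative, not from the paper.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch: per-thread global branch masks (GBMs) drawing branch
 * bits from a shared pool of B = 8 outstanding branches across T = 4 threads. */
#define T 4
#define B 8

static uint8_t gbm[T];   /* gbm[t] has a '1' for each unresolved branch of thread t */

/* A branch bit is free only if it is '0' in every thread's GBM. */
static int alloc_branch_bit(int thread)
{
    uint8_t in_use = 0;
    for (int t = 0; t < T; t++)
        in_use |= gbm[t];

    for (int b = 0; b < B; b++) {
        if (!(in_use & (1u << b))) {      /* bit b unused by all threads */
            gbm[thread] |= (1u << b);     /* claim it for this thread    */
            return b;
        }
    }
    return -1;  /* no free branch bits: stall this thread in the rename stage */
}

/* On resolving branch bit b of 'thread', release the bit; a misprediction
 * selectively squashes only that thread's dependent instructions. */
static void resolve_branch_bit(int thread, int b, bool mispredicted)
{
    gbm[thread] &= ~(1u << b);
    (void)mispredicted;  /* selective squash of the thread's instructions omitted */
}
```

Because a bit can be set in at most one GBM, resolving or squashing a branch in one thread never affects instructions of the other threads.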
LQ/SQ • Partition LQ/SQ among active threads (CTL). Motivation: • Must only consider memory dependences within a thread. • Remember, we are talking about a speculative instruction window. • Globally-performed (committed) stores and loads among threads are handled differently (see below). • What about shared memory among threads? • Handle it identically to globally-performed stores from other cores: only snoop committed stores from other threads. • Do this for threads in the same core and for threads in other cores.
EXAMPLE: Register management changes for SMT • Suppose ISA defines 32 logical registers • Suppose superscalar core has: • Maximum of 64 active instructions • Maximum of 12 unresolved branches
EXAMPLE (cont.) • Superscalar core without SMT support • 1 rename map table • 1 architectural map table • 12 shadow map tables • 96 physical registers • 32 for committed state • 64 for speculative state • 64 entries in freelist • 64 entries in active list
EXAMPLE (cont.) • Same superscalar core with SMT support, 4 threads • 4 rename map tables • 4 architectural map tables • 12 shadow map tables • 192 physical registers • 4*32 = 128 for committed state • 64 for speculative state • 64 entries in freelist • 64 entries in active list • Partition among active threads • E.g., if only two threads are running, one thread gets the top half and the other gets the bottom half. Or use a more sophisticated algorithm to partition based on resource needs. • Four sets of head/tail pointers (one per thread's active-list partition)
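The sizing above can be captured in a short worked computation; the constant names are ours and the numbers are the example's.

```c
#include <stdio.h>

/* Worked sizing from the example: 32 logical registers, a 64-instruction
 * window, and 4 SMT threads. */
enum { LOGICAL_REGS = 32, MAX_ACTIVE_INSNS = 64, SMT_THREADS = 4 };

int main(void)
{
    int prf_no_smt = LOGICAL_REGS + MAX_ACTIVE_INSNS;                /*  96 */
    int prf_smt    = SMT_THREADS * LOGICAL_REGS + MAX_ACTIVE_INSNS;  /* 192 */

    /* Free list and active list track only speculative registers, so their
     * sizes are unchanged by SMT in this baseline design. */
    int freelist_entries    = MAX_ACTIVE_INSNS;                      /*  64 */
    int active_list_entries = MAX_ACTIVE_INSNS;                      /*  64 */

    printf("PRF without SMT: %d, with SMT: %d\n", prf_no_smt, prf_smt);
    printf("free list: %d, active list: %d\n",
           freelist_entries, active_list_entries);
    return 0;
}
```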
About extra physical registers • If the number of active threads < maximum number of threads, can’t we use the extra architectural register contexts for more speculative registers? • Yes, but it would require: • Sizing the free list and active list for the larger case (a single thread using all spare contexts speculatively) rather than the smaller case (all T threads holding architectural state), where T = maximum # threads • Non-reconfigurable SMT: active list size = free list size = PRF size – T*(# logical registers) • Reconfigurable SMT: active list size = free list size = PRF size – 1*(# logical registers) • Reconfiguration at the time of adding or removing active threads • Drain pipeline of all in-flight instructions (stall fetch while retiring all in-flight instructions). Must ensure enough free registers for instantiating additional architectural register contexts. • Add a thread: Pop free registers from free list to populate a blank-slate AMT and RMT. • Remove a thread: Push the newly-deallocated thread’s AMT registers back onto the free list.
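A sketch of the two sizing options and the add-a-thread step, using the example's numbers (PRF = 192, 32 logical registers, T = 4); the toy data structures and function name are assumptions for illustration only.

```c
#include <assert.h>

#define PRF_SIZE      192
#define LOGICAL_REGS  32
#define T_MAX         4

/* Non-reconfigurable SMT: lists sized assuming all T_MAX architectural contexts. */
enum { NONRECONFIG_LIST_SIZE = PRF_SIZE - T_MAX * LOGICAL_REGS };  /*  64 */
/* Reconfigurable SMT: lists sized assuming a single architectural context. */
enum { RECONFIG_LIST_SIZE    = PRF_SIZE - 1 * LOGICAL_REGS };      /* 160 */

/* Toy free list and map tables, just to show the add-thread step. */
static int free_list[RECONFIG_LIST_SIZE];
static int free_count;
static int amt[T_MAX][LOGICAL_REGS];   /* architectural map tables */
static int rmt[T_MAX][LOGICAL_REGS];   /* rename map tables        */

/* Adding a thread after the pipeline has been drained: pop free physical
 * registers to give the new context a blank-slate AMT and RMT. */
static void add_thread(int t)
{
    assert(free_count >= LOGICAL_REGS);
    for (int lr = 0; lr < LOGICAL_REGS; lr++) {
        int pr = free_list[--free_count];
        amt[t][lr] = pr;
        rmt[t][lr] = pr;
    }
}
```

Removing a thread is the reverse: the registers named by the departing thread's AMT are pushed back onto the free list.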
Nice Aspects of Implementation • OOO execution core unaware of multiple threads • Single-thread performance not impacted much • Mild degradation due to larger physical register file (described later…)
Potential Problem Areas • Predictor and cache pressure • Claims from paper • For chosen workload, predictor and cache conflicts not a problem • Any extra mispredictions and misses are cushioned by abundant TLP • Large benchmarks in practice • Cache conflicts may cause slowdown w.r.t. running programs serially • Serial execution exploits cache locality
Register File Impact • Why is register file larger with SMT? • 1 thread: Minimum of 1*32 integer registers • 8 threads: Minimum of 8*32 integer registers • I.e., need to store architectural state of all threads • Note that amount of speculative state is independent of number of threads (depends only on total number of active instructions) • Implication • Don’t want to increase cycle time • Expand register read stage into two stages; same with register write stage
Pipeline comparison (base vs. SMT):
• Base pipeline: Fetch, Decode, Rename, Queue, Reg Read, Exec., Reg. Wrt., Commit. Single bypass; register usage 4-cycle minimum; misfetch penalty 2 cycles; misprediction penalty 6-cycle minimum (may overlap).
• SMT pipeline (register read and write each expanded to two stages): Fetch, Decode, Rename, Queue, Reg Read, Reg Read, Exec., Reg. Wrt., Reg. Wrt., Commit. Double bypasses; register usage 6-cycle minimum; misfetch penalty 2 cycles; misprediction penalty 7-cycle minimum.
Performance of Base SMT • Configuration • 8-way issue superscalar, 8 threads • Positive results • Single-thread performance degrades only 2% due to additional pipestages • SMT throughput is 84% higher than superscalar • Negative results • Processor utilization still low at 50% (IPC = 4) • Throughput peaks at 5 or 6 threads (not 8)
SMT Bottlenecks • Fetch throughput • Sustaining only 4.2 useful instructions per cycle! • Base thread selection: Round-Robin, 1 thread at a time • “Horizontal waste” due to single-threaded fetch stage • Sources of waste include misalignment and taken branches
SMT Bottlenecks (cont.) • Lack of parallelism • 8 independent threads should provide plenty of parallelism • Perhaps have the wrong instructions in the issue queues!
Fetch Unit • Try other fetch models • Notation: alg.num1.num2 • alg => thread selection method (which thread(s) to fetch) • num1 => # of threads that can fetch in 1 cycle • num2 => max # of instructions fetched per thread per cycle • There are 8 instruction cache banks and conflicts are modeled
Fetch Unit: Partitioning • Keep thread selection (alg) fixed • Round Robin (RR) • Models • RR.1.8 • Base scheme: 1 thread at a time, has full b/w of 8 • RR.2.4 and RR.4.2 • Total # of fetched instructions remains 8 • If num1 is too high, suffer thread shortage problem: too few threads to achieve 8 instr./cycle • RR.2.8 • Eliminates thread shortage problem (each thread can fetch max b/w) while reducing horizontal waste
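A rough sketch of an RR.num1.num2-style fetch stage, shown with num1 = 2 and num2 = 4 (i.e., RR.2.4); the structure is an illustrative assumption rather than the paper's implementation, and the per-thread fetch is stubbed out.

```c
/* Illustrative sketch of the RR.num1.num2 fetch models: each cycle, up to
 * NUM1 threads are selected round-robin, and each selected thread may fetch
 * up to NUM2 instructions (total fetch bandwidth = 8 in all models shown). */
#define NUM_THREADS 8
#define NUM1 2          /* threads fetching per cycle (e.g., RR.2.4) */
#define NUM2 4          /* max instructions per thread per cycle     */

static int next_thread;  /* round-robin pointer */

/* Stub: a real fetch stops early at a taken branch, a cache-line boundary,
 * an I-cache bank conflict, or an I-cache miss (sources of horizontal waste). */
static int fetch_from_thread(int t, int max) { (void)t; return max; }

static int fetch_cycle(void)
{
    int fetched = 0;
    for (int i = 0; i < NUM1; i++) {
        int t = (next_thread + i) % NUM_THREADS;
        fetched += fetch_from_thread(t, NUM2);
    }
    next_thread = (next_thread + NUM1) % NUM_THREADS;
    return fetched;
}
```

With NUM1 = 1 and NUM2 = 8 this degenerates to the base RR.1.8 scheme; raising NUM1 too far (e.g., RR.4.2) exposes the thread-shortage problem when fewer than NUM1 threads have useful instructions to fetch.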
Fetch Unit: Thread Choice • Replace Round-Robin (RR) with more intelligent thread selection policies • BRCOUNT • MISSCOUNT • ICOUNT
BRCOUNT • Give high priority to those threads with fewest unresolved branches • Attacks wrong-path fetching
MISSCOUNT • Give priority to threads with fewest outstanding data cache misses • Attacks IQ clog
ICOUNT • Give priority to threads with fewest instructions in decode, rename, and issue queues • Threads that make good progress => high priority • Threads that make slow progress => low priority • Attacks IQ clog generally
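A minimal sketch of ICOUNT-style thread selection; the counters and names are illustrative assumptions. BRCOUNT and MISSCOUNT would substitute unresolved-branch or outstanding-D-cache-miss counts for the occupancy count.

```c
/* Illustrative ICOUNT-style selection: pick the active thread with the fewest
 * instructions currently in the decode, rename, and issue-queue stages. */
#define NUM_THREADS 8

static int icount[NUM_THREADS];   /* per-thread front-end + IQ occupancy */
static int active[NUM_THREADS];   /* 1 if the thread is running          */

static int select_fetch_thread(void)
{
    int best = -1;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (!active[t])
            continue;
        if (best < 0 || icount[t] < icount[best])
            best = t;
    }
    return best;   /* -1 if no thread is active */
}
```

A thread that clogs the issue queues accumulates a high count and automatically loses fetch priority, which is why ICOUNT attacks IQ clog in general.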
Helper Threads • Do work on behalf of program • Exploit spare threads in SMT or CMP
Two Helper Thread Research Thrusts • Performance • Pre-execution • Accelerate main program • Functionality • Augment program with other useful facets • Fault tolerance, software bug detection, security, profiling, garbage collection, dynamic optimizer, etc.
Other Threading Paradigms • Multiscalar (1992-1996), speculative multithreading (mid/late 90s) • Performance • Divide single program into multiple, speculatively parallel tasks • Different task execution substrates • Multiple tightly-coupled processing elements (PEs) – register + memory communication among tasks • Loosely-coupled PEs, i.e., CMP substrate – memory communication only • SMT substrate – register + memory communication • Multipath execution (1960s, 1990s) • Performance • Go down both paths of unconfident branches • Typically SMT substrate • Redundant threading • Transient fault tolerance: AR-SMT (1999), SRT (2000-2004) • Transient faults and hardware bugs: DIVA (1999) • Fault tolerance AND performance: Slipstream (1999-)
Multiple Processes • The following discussion applies to multiple processes sharing the processor and memory system • Applies to time sharing, i.e., context switching • Applies to space sharing, i.e., simultaneous multithreading • i.e., doesn’t matter whether threads are switching or simultaneous
Caches • Using physical addresses (PA) simplifies a lot • Respects separation (scenario #1) • Respects sharing (scenario #2) • Scenario #1 (separation): same VA, different PA. Process 0: VA(A) maps to PA(X); process 1: VA(A) maps to PA(Y). • Scenario #2 (sharing): different VA, same PA. Process 0: VA(A) maps to PA(X); process 1: VA(B) maps to PA(X).
Caches (cont.) • Three options for accessing cache • Serial: access cache after TLB search • Easily respects separation & sharing (physically indexed, physically tagged) • No constraints on cache organization • Increases hit time • Parallel, constrained: index cache in parallel with TLB search • Configure cache such that index bits fit entirely within page offset bits • ECE 521: assoc >= (cache size / page size) • Easily respects separation & sharing (physically indexed, physically tagged) • Does not increase hit time • Increasing cache size requires increasing set-associativity • Parallel, unconstrained: index cache in parallel with TLB search • Allow index bits to spill into virtual page number part of virtual address • Does not increase hit time • No constraints on cache organization • Does not easily respect separation & sharing (virtually indexed, physically tagged). Flush caches after context switch (doesn’t work for SMT) or take other sophisticated hardware/software measures.
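The "parallel, constrained" restriction can be checked with simple arithmetic; the cache and page parameters below are assumed example values.

```c
/* Check whether a cache can be indexed in parallel with the TLB lookup:
 * the block-offset plus set-index bits must fit inside the page offset.
 * Equivalently: associativity >= cache size / page size. */
#include <stdio.h>

int main(void)
{
    int cache_size = 32 * 1024;   /* example: 32 KB L1      */
    int block_size = 64;          /* 64-byte blocks         */
    int assoc      = 8;           /* 8-way set-associative  */
    int page_size  = 4 * 1024;    /* 4 KB pages             */

    int sets        = cache_size / (block_size * assoc);  /* 64 sets */
    int index_bytes = sets * block_size;  /* bytes covered by offset + index bits */

    printf("parallel-constrained OK: %s\n",
           (index_bytes <= page_size) ? "yes" : "no");
    printf("equivalent check (assoc >= size/page): %s\n",
           (assoc >= cache_size / page_size) ? "yes" : "no");
    return 0;
}
```

With these numbers the two checks agree (both "yes"); doubling the cache size to 64 KB at the same associativity would violate the constraint, which is why growing the cache forces higher associativity under this option.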
TLB • ISA defines TLB entry • A translation is usually a tuple with three fields: [process id, VA, PA] • Including the process id means the TLB doesn’t need to be flushed after context switches • Flush isn’t even an option for SMT
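A minimal sketch of a process-id-tagged TLB entry as described above; the field names and widths are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of a TLB entry tagged with a process id (address-space id), so
 * entries from different processes can coexist: no flush is needed on a
 * context switch, and simultaneously running SMT threads can share the TLB. */
typedef struct {
    bool     valid;
    uint16_t asid;   /* process id / address-space id */
    uint64_t vpn;    /* virtual page number           */
    uint64_t ppn;    /* physical page number          */
    uint8_t  perms;  /* protection bits               */
} tlb_entry_t;

/* A hit requires matching both the VPN and the ASID of the accessing thread. */
static bool tlb_hit(const tlb_entry_t *e, uint16_t asid, uint64_t vpn)
{
    return e->valid && e->asid == asid && e->vpn == vpn;
}
```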
LQ/SQ: virtual or physical addresses? • The following discussion applies to a core with or without SMT • SMT has no bearing on the discussion • Use physical addresses in both LQ and SQ, for these reasons: • Committing stores • Committed store traffic (invalidates or updates) from other cores or threads must search the LQ • For loads, how to search the SQ in parallel with TLB access? • Two SQ CAMs, one with virtual addresses (SQ-V) and the other with physical addresses (SQ) • Load searches SQ-V first with its virtual address, and SQ second with its physical address • Fast, speculative store-load forwarding from SQ-V • Recover if SQ disagrees in the next cycle
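A much-simplified sketch of the SQ-V/SQ search order described above; all names and structures are assumptions, and age ordering, partial overlap, and youngest-older-match selection are omitted.

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model of the two store-queue CAMs: SQ-V is tagged with virtual
 * addresses, SQ with physical addresses. */
#define SQ_SIZE 32

typedef struct {
    bool     valid;
    uint64_t vaddr;   /* SQ-V tag: virtual address  */
    uint64_t paddr;   /* SQ   tag: physical address */
    uint64_t data;
} sq_entry_t;

static sq_entry_t sq[SQ_SIZE];

/* Cycle 1: speculatively forward from the entry whose virtual address
 * matches the load's virtual address (in parallel with the TLB access). */
static int search_sq_v(uint64_t load_vaddr)
{
    for (int i = 0; i < SQ_SIZE; i++)
        if (sq[i].valid && sq[i].vaddr == load_vaddr)
            return i;
    return -1;   /* no forwarding */
}

/* Cycle 2: verify using the load's translated physical address; if the
 * physical search disagrees with the speculative forwarding decision,
 * the load (and its dependents) must recover. */
static bool verify_sq_p(uint64_t load_paddr, int forwarded_from)
{
    for (int i = 0; i < SQ_SIZE; i++)
        if (sq[i].valid && sq[i].paddr == load_paddr)
            return i == forwarded_from;
    return forwarded_from == -1;   /* no physical match: forwarding nothing was correct */
}
```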
Summary • The same discussion applies with SMT • The only difference is that the cache-flush solution (if relevant) does not make sense in the case of SMT • Summing up: • The problem is not caused by SMT • The problem is caused by the interplay of these factors: • Virtual memory • Multiple processes (whether conventional time-sharing or SMT) • Desire to access TLB and cache in parallel