Implicitly-Multithreaded Processors Il Park, Babak Falsafi, and T. N. Vijaykumar Presented by: Ashay Rane Published in: SIGARCH Computer Architecture News, 2003
Agenda • Overview (IMT, state-of-art) • IMT enhancements • Key results • Critique • Relation to Term Project
Implicitly Multithreaded Processor (IMT) • SMT with speculation • Optimizations to basic SMT support • Average performance improvement of 24% (max: 69%)
State-of-the-art • Pentium 4 HT • IBM POWER5 • MIPS MT
Speculative SMT operation • When a branch is encountered, start executing the likely path "speculatively", i.e. allow for rollback (thread squash) under certain circumstances (misprediction, dependence violation) • The cost and overhead are offset by savings in execution time and power (worth the effort) • Complications: independent threads must commit separately (one buffer per thread); also issue bandwidth, register renaming, and cache & TLB conflicts • On a dependence violation, squash the thread and restart execution
How to buffer speculative data? • Load/Store Queue (LSQ) • Buffers data along with its address • Helps enforce dependence checking • Makes rollback possible • Cache-based approaches are an alternative
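The LSQ's three roles above (buffering, dependence checking, rollback) can be sketched in a few lines. This is a minimal illustrative model, not the paper's hardware design; the class and method names (`SpeculativeLSQ`, `violates`, `squash`) are assumptions for exposition.

```python
class SpeculativeLSQ:
    """Toy per-thread load/store queue for speculative buffering."""

    def __init__(self):
        self.stores = []   # buffered (address, value) pairs, in program order
        self.loads = []    # addresses this thread has loaded from

    def store(self, addr, value):
        self.stores.append((addr, value))   # buffer; don't write memory yet

    def load(self, addr, memory):
        self.loads.append(addr)
        # Forward the youngest buffered store to the same address, if any.
        for a, v in reversed(self.stores):
            if a == addr:
                return v
        return memory.get(addr, 0)

    def violates(self, store_addr):
        # An earlier thread storing to an address we already loaded means
        # we consumed a stale value: a dependence violation.
        return store_addr in self.loads

    def commit(self, memory):
        for a, v in self.stores:            # drain buffered stores to memory
            memory[a] = v
        self.stores.clear()
        self.loads.clear()

    def squash(self):
        # Rollback: discard all speculative state; memory was never touched.
        self.stores.clear()
        self.loads.clear()
```

Because stores are only buffered, a squash is cheap: memory is untouched, so discarding the queue restores the pre-speculation state.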
IMT: Most significant improvements • Assistance from the Multiscalar compiler • Resource- and dependence-aware fetch policy • Multiplexing threads on a single hardware context • Overlapping thread startup operations with the previous thread's execution
What does the Compiler do? • Extracts threads from the program (loops) • Generates thread descriptors recording which registers are read and written, and the control-flow exits (for rename tables) • Annotates instructions with special codes ("forward" & "release") for dependence checking
Fetch Policy • Hardware keeps track of resource utilization • Predicts resource requirements from the past four execution instances • When dependences exist (detected from compiler-generated data), bias fetch towards non-speculative threads • Goal: reduce the number of thread squashes
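The two ingredients on this slide — history-based resource prediction and a non-speculative bias — compose naturally. A hedged sketch (class and method names are illustrative, and the worst-case-over-four-instances estimate is one plausible reading of the slide, not the paper's exact heuristic):

```python
from collections import deque

class FetchPolicy:
    """Toy resource- and dependence-aware fetch arbiter."""

    def __init__(self, free_entries):
        self.free_entries = free_entries   # e.g. free speculation-buffer slots
        self.history = {}                  # thread id -> last 4 resource usages

    def record(self, tid, used):
        self.history.setdefault(tid, deque(maxlen=4)).append(used)

    def predicted_need(self, tid):
        h = self.history.get(tid)
        # Conservative (worst-case) estimate over the last four instances.
        return max(h) if h else self.free_entries

    def pick(self, threads):
        # threads: list of (tid, is_speculative) pairs, in priority order.
        # Only consider threads predicted to fit in the free resources...
        candidates = [t for t in threads
                      if self.predicted_need(t[0]) <= self.free_entries]
        if not candidates:
            return None
        # ...and prefer non-speculative threads among them.
        non_spec = [t for t in candidates if not t[1]]
        return (non_spec or candidates)[0][0]
```

Excluding threads whose predicted footprint exceeds the free resources is what avoids buffer overflows, and hence the squashes the slide mentions.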
Multiplexing threads on a single hardware context • Observations: • Threads are usually short • Number of contexts is small (2-8) • Hence frequent switching and little overlap
Multiplexing (contd.) • Larger threads can lead to: • Speculation buffer overflow • Increased dependence mis-speculation • Hence thread squashing • Each execution context can further support multiple threads (3-6)
Multiplexing: Required Hardware • Per context, per thread: • Program Counter • Register rename table • LSQ shared among all threads running on one execution context
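The per-thread versus per-context split above can be pictured as a small data structure: each context carries several thread slots (each with its own PC and rename table) and one shared LSQ. A sketch under those assumptions; the names and the round-robin assignment are illustrative only:

```python
class Context:
    """One hardware context hosting several multiplexed thread slots."""

    def __init__(self, slots):
        # Per thread slot: its own PC and rename table.
        self.threads = [{"pc": 0, "rename": {}} for _ in range(slots)]
        # Per context: one LSQ shared by all slots.
        self.shared_lsq = []

def assign(thread_ids, contexts):
    """Map thread ids onto (context index, slot index) pairs in order."""
    slots = [(c, s) for c, ctx in enumerate(contexts)
             for s in range(len(ctx.threads))]
    return {t: slot for t, slot in zip(thread_ids, slots)}
```

With, say, 2 contexts of 3 slots each (within the slide's 3-6 range), six short threads can be in flight at once, restoring the overlap that a small number of contexts alone cannot provide.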
Multiplexing: Implementation Issues • The LSQ is shared, but it must maintain loads and stores for each thread separately • Therefore, create "gaps" for yet-to-be-fetched instructions/data • If space falls short, squash the subsequent thread • What if threads from one program are mapped to different contexts? IMT searches through the other contexts • Multiple LSQs (one per thread per context) would be simpler, but costly in area and power
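The "gap" idea above — reserving LSQ slots for a later thread's not-yet-fetched memory operations so program order is preserved in the shared queue — can be sketched as follows. Sizes and names are assumptions for illustration, not the paper's parameters:

```python
class SharedLSQ:
    """Toy shared LSQ that reserves gaps for yet-to-be-fetched ops."""

    def __init__(self, size):
        self.slots = [None] * size

    def reserve(self, start, count):
        # Reserve a gap of `count` slots for a thread's predicted loads/stores.
        if start + count > len(self.slots):
            return False        # space falls short: caller squashes the thread
        for i in range(start, start + count):
            self.slots[i] = "gap"
        return True

    def fill(self, index, op):
        # A later fetch fills its previously reserved slot.
        assert self.slots[index] == "gap"
        self.slots[index] = op
```

Reserving in advance keeps each thread's operations contiguous and ordered in the shared queue, at the price of squashing a successor thread when a reservation cannot be made.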
Register renaming • Required because multiple threads may use the same registers • Separate rename tables: • Master Rename Table (global) • Local Rename Table (per thread) • Pre-assign table (per thread)
Register renaming: Flow • Thread invocation: • Copy from the Master table into the Local table (to reflect current mappings) • Also use the "create" and "use" masks from the thread descriptor (for dependence checking) • Before every subsequent thread invocation: • Pre-assign rename maps into the Pre-assign table • Copy from the Pre-assign table to the Master table and mark those registers as "busy", so no successor thread can use them before the current thread writes to them
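The table copies in this flow can be condensed into one sketch. For brevity it collapses the snapshot and pre-assignment steps into a single function; the function name, the table representation (a dict from architectural register to a (physical register, status) pair), and the physical-register numbering are all illustrative assumptions:

```python
def invoke_thread(master, create_mask, next_phys):
    """Toy model of rename-table setup at thread invocation.

    master:      global Master table, arch reg -> (phys reg, status)
    create_mask: arch registers this thread will write (from the descriptor)
    next_phys:   first free physical register number (assumed allocator)
    """
    # Local table starts as a snapshot of the master's current mappings.
    local = dict(master)
    # Pre-assign a fresh physical register for every register the thread
    # will create (write)...
    preassign = {r: next_phys + i for i, r in enumerate(sorted(create_mask))}
    # ...and publish them in the master marked "busy", so successor threads
    # wait for this thread's writes rather than reading stale values.
    for r, p in preassign.items():
        master[r] = (p, "busy")
    return local, preassign
```

The key point the sketch shows: the local snapshot is taken *before* the busy markings are published, so the invoking thread still sees its predecessors' values while successors see only the busy placeholders.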
Hiding thread startup delay • Rename tables must be set up before execution begins • This occupies table bandwidth, so it cannot be done for many threads in parallel • Hence, overlap rename-table setup with the previous thread's execution
Load/Store Queue • Per context • Speculative load/store: search the current and other contexts for dependences • No searching for non-speculative loads • Searching can take time, so load-dependent instructions are scheduled accordingly
Average improvement: 24% • Due to reduction in data-dependence stalls • The optimizations add little overhead • Improvement not seen across all benchmark programs
Assumes 2-3 threads per context and 6-8 LSQ entries per thread • Performance measured relative to an IMT with unlimited resources
ICOUNT: favor the thread with the fewest instructions remaining in the pipeline • Biased-ICOUNT: favor non-speculative threads • Uses worst-case resource estimation • Reduces thread squashing
TME: executes both paths of an unpredictable branch (but such branches are uncommon) • DMT: • Hardware selection of threads, so it spawns threads on backward branches or function calls instead of loops • Also spawns threads out of order, so branch-prediction accuracy is lower
Compiler Support • Improvement applies to applications compiled with the Multiscalar compiler • Geared toward scientific-computing applications, not desktop applications
LSQ Limitations • LSQ size determines the maximum size of a speculative thread • Pentium 4 (without SMT): 48 loads, 24 stores • Pentium 4 HT: 24 loads, 12 stores per thread • IBM POWER5: 32 loads, 32 stores per thread
LSQ Limitations: Alternative • Cache-based approach, i.e. partition the cache to support different speculative versions • Extra support required, but more scalable
Register file size • IMT considers register file sizes of 128 and up • Pentium 4 (as well as HT): register file size = 128 • IBM POWER5: register file size = 80
Searching the LSQ • Since loads and stores are organized per thread, a search must examine the entries of all other threads • If loads/stores were organized by address, fewer entries would need to be searched • Could exploit the associativity of a cache
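The contrast on this slide — scanning every other thread's queue versus an address-indexed organization — is easy to make concrete. A sketch under illustrative assumptions (entry format and function names are mine, not the paper's):

```python
from collections import defaultdict

def search_per_thread(queues, addr):
    """Per-thread organization: scan every entry of every queue."""
    return [e for q in queues for e in q if e[0] == addr]

def build_index(queues):
    """Reorganize entries by address, like a cache indexed by address bits."""
    index = defaultdict(list)
    for q in queues:
        for e in q:
            index[e[0]].append(e)
    return dict(index)

def search_by_address(index, addr):
    """Address-indexed organization: touch only the matching bucket."""
    return index.get(addr, [])
```

Both searches return the same matches, but the address-indexed version examines one bucket instead of every entry — the same idea that lets a set-associative cache check only one set per lookup.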
So how is performance still high? • Assistance from the compiler • Resource- and dependence-aware fetching • Multiple threads on an execution context • Overlapping rename-table creation with execution
Term project • “Cache-based throughput improvement techniques for Speculative SMT processors” • Optimizations from IMT • Increasing granularity to reduce number of thread squashes