Implicitly-Multithreaded Processors Il Park and Babak Falsafi and T. N. Vijaykumar

Implicitly-Multithreaded Processors Il Park and Babak Falsafi and T. N. Vijaykumar Presented by: Ashay Rane Published in: SIGARCH Computer Architecture News, 2003

Agenda • Overview (IMT, state of the art) • IMT enhancements • Key results • Critique • Relation to Term Project

Implicitly Multithreaded Processor (IMT) • SMT with speculation • Optimizations to basic SMT support • Average performance improvement of 24% • Maximum: 69%

State-of-the-art • Pentium 4 HT • IBM POWER5 • MIPS MT

Speculative SMT operation • When a branch is encountered, start executing the likely path "speculatively", i.e. allow for rollback (thread squash) in certain circumstances (misprediction, dependence violation) • Speculation adds cost and overhead, but the savings in execution time and power make it worth the effort • Complications: independent threads commit separately (a buffer is needed for each thread); issue, register renaming, and cache and TLB conflicts must also be handled • On a dependence violation, squash the thread and restart its execution

How to buffer speculative data? • Load/Store Queue (LSQ) • Buffers data (along with its address) • Helps enforce dependence checks • Makes rollback possible • Cache-based approaches

IMT: Most significant improvements • Assistance from the Multiscalar compiler • Resource- and dependence-aware fetch policy • Multiplexing threads on a single hardware context • Overlapping thread startup operations with the previous thread's execution

What does the compiler do? • Extracts threads from the program (loops) • Generates thread descriptor data about registers read and written and control flow exits (for rename tables) • Annotates instructions with special codes ("forward" and "release") for dependence checking

Fetch Policy • Hardware keeps track of resource utilization • Resource requirement prediction from past four execution instances • When dependencies exist (detected from compiler-generated data), bias towards non-speculative threads • Goal is to reduce number of thread squashes
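The prediction step on this slide can be illustrated with a minimal sketch. It assumes the predictor keeps the last four observed resource counts per static thread and predicts the maximum of them (one possible reading of "worst-case" estimation); `ResourcePredictor`, `may_fetch`, and the `free_entries` parameter are hypothetical names, not taken from the paper.

```python
from collections import defaultdict, deque


class ResourcePredictor:
    """Predicts a thread's resource needs from its last four executions."""

    HISTORY = 4  # number of past execution instances remembered

    def __init__(self):
        # One bounded history per static thread (keyed by its start PC).
        self._history = defaultdict(lambda: deque(maxlen=self.HISTORY))

    def record(self, thread_pc, used):
        """Store how many entries the thread actually used this time."""
        self._history[thread_pc].append(used)

    def predict(self, thread_pc, default=0):
        """Assumed policy: predict the largest recently observed usage."""
        past = self._history[thread_pc]
        return max(past) if past else default


def may_fetch(predictor, thread_pc, free_entries, speculative):
    """Non-speculative threads always fetch; speculative threads fetch
    only if their predicted need fits in the free entries."""
    if not speculative:
        return True
    return predictor.predict(thread_pc) <= free_entries


predictor = ResourcePredictor()
for used in [9, 3, 4, 5, 2]:   # the oldest sample (9) falls out of the window
    predictor.record(0x40, used)
```

Holding back a speculative thread whose predicted need does not fit is one way to avoid squashing it later for lack of resources.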

Multiplexing threads on a single hardware context • Observations: • Threads are usually short • The number of hardware contexts is small (2-8) • Hence frequent switching and less overlap

Multiplexing (contd.) • Larger threads can lead to: • Speculation buffer overflow • Increased dependence mis-speculation • Hence thread squashing • Each execution context can further support multiple threads (3-6)

Multiplexing: Required Hardware • Per context per thread: • Program Counter • Register rename table • LSQ shared among threads running on 1 execution context

Multiplexing: Implementation Issues • The LSQ is shared, but it must keep the loads and stores of each thread separate • Therefore, create "gaps" for yet-to-be-fetched instructions / data • If space falls short, squash the subsequent thread • What if threads from one program are mapped to different contexts? • IMT searches through the other contexts • It would be easier to have a separate LSQ per thread in each context, but that is worse for cost and power consumption
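The "gaps" idea can be sketched as a toy model: each thread reserves a region of the shared LSQ when it starts, and a thread that outgrows its region takes space from a later thread, which is squashed. This is an illustrative assumption about the mechanism, not the paper's hardware design; `SharedLSQ` and its methods are hypothetical, and squashing the youngest thread first is a simplification.

```python
class SharedLSQ:
    """Toy model of one context's LSQ shared by several threads."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.regions = []  # reserved regions, oldest thread first

    def free_slots(self):
        return self.capacity - sum(r["size"] for r in self.regions)

    def reserve(self, tid, size):
        """Reserve a gap of `size` slots for a newly started thread."""
        if size > self.free_slots():
            return False
        self.regions.append({"tid": tid, "size": size, "entries": []})
        return True

    def insert(self, tid, addr, is_store):
        """Add a load/store; return the ids of threads squashed for space."""
        idx = next(i for i, r in enumerate(self.regions) if r["tid"] == tid)
        region = self.regions[idx]
        squashed = []
        if len(region["entries"]) == region["size"]:
            # Region is full: free space by squashing later threads
            # (assumption: youngest first), then grow by one slot.
            while self.free_slots() < 1 and len(self.regions) > idx + 1:
                squashed.append(self.regions.pop()["tid"])
            if self.free_slots() < 1:
                raise OverflowError("LSQ full and no later thread to squash")
            region["size"] += 1
        region["entries"].append((addr, is_store))
        return squashed


lsq = SharedLSQ(capacity=4)
ok0 = lsq.reserve(tid=0, size=2)
ok1 = lsq.reserve(tid=1, size=2)
ok2 = lsq.reserve(tid=2, size=1)          # no room left for a third thread
first = lsq.insert(0, 0x10, is_store=False)
second = lsq.insert(0, 0x14, is_store=True)
third = lsq.insert(0, 0x18, is_store=False)  # overflows thread 0's gap
```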

Register renaming • Required because multiple threads may use the same registers • Separate rename tables • Master Rename Table (global) • Local Rename Table (per thread) • Pre-assign table (per thread)

Register renaming: Flow • Thread invocation: • Copy from the Master table into the Local table (to reflect current status) • Also use the "create" and "use" masks of the thread descriptor (for dependence checks) • Before every subsequent thread invocation: • Pre-assign rename maps into the Pre-assign table • Copy from the Pre-assign table to the Master table and mark the registers as "busy", so no successor thread can use them before the current thread writes to them
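The flow above can be sketched in a few lines. This is a simplified model under stated assumptions: physical registers are plain integers drawn from an unbounded counter, only the "create" mask is modeled, and `RenameState` with its methods are hypothetical names rather than structures from the paper.

```python
import itertools


class RenameState:
    """Toy model of the master / local / pre-assign rename tables."""

    def __init__(self, num_arch_regs):
        # Architectural register r initially maps to physical register r.
        self.master = {r: r for r in range(num_arch_regs)}
        self.busy = set()  # physical registers not yet written
        self._next_phys = itertools.count(num_arch_regs)

    def invoke_thread(self, create_mask):
        """Start a thread that will write the registers in create_mask."""
        # 1. Local table: snapshot of the master table at invocation.
        local = dict(self.master)
        # 2. Pre-assign table: a fresh physical register per written register.
        pre_assign = {r: next(self._next_phys) for r in sorted(create_mask)}
        # 3. Publish to the master table and mark busy, so successor
        #    threads see the new mapping but must wait for the value.
        for arch, phys in pre_assign.items():
            self.master[arch] = phys
            self.busy.add(phys)
        return local, pre_assign

    def write(self, pre_assign, arch):
        """The owning thread produces its value for `arch`."""
        self.busy.discard(pre_assign[arch])

    def can_read(self, local, arch):
        """A thread may read `arch` once its mapped register is written."""
        return local[arch] not in self.busy


state = RenameState(num_arch_regs=4)
local0, pre0 = state.invoke_thread(create_mask={1})
local1, pre1 = state.invoke_thread(create_mask={2})
blocked = state.can_read(local1, 1)   # thread 0 has not written r1 yet
state.write(pre0, 1)
unblocked = state.can_read(local1, 1)
```

Because the successor's local table already points at the pre-assigned register, it waits on the busy bit instead of reading a stale value.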

Hiding thread startup delay • Rename tables must be set up before execution begins • Setup occupies rename table bandwidth, so it cannot be done for many threads in parallel • Hence overlap the setup of rename tables with the previous thread's execution

Load/Store Queue • Per context • Speculative load / store: search through the current and other contexts for dependences • No searching for non-speculative loads • Searching can take time, so IMT schedules load-dependent instructions accordingly
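A minimal sketch of the cross-context search, under simplifying assumptions: each context's LSQ is a plain list, threads are ordered by an integer `thread_order` (0 is the oldest), ordering within a thread is ignored, and "no searching" for non-speculative loads is read as "no searching of other contexts". The names `LsqEntry`, `find_forwarding_store`, and `violated_threads` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LsqEntry:
    thread_order: int  # program order of the owning thread (0 = oldest)
    is_store: bool
    addr: int


def find_forwarding_store(lsqs, ctx, thread_order, addr, speculative):
    """Return the latest store to `addr` from this or an earlier thread.

    A speculative load searches every context's LSQ; a non-speculative
    load (assumption) looks only in its own context.
    """
    contexts = range(len(lsqs)) if speculative else [ctx]
    best = None
    for c in contexts:
        for e in lsqs[c]:
            if e.is_store and e.addr == addr and e.thread_order <= thread_order:
                if best is None or e.thread_order > best.thread_order:
                    best = e
    return best


def violated_threads(lsqs, store_thread_order, addr):
    """Later threads that already loaded `addr` and must be squashed."""
    return sorted({
        e.thread_order
        for q in lsqs
        for e in q
        if not e.is_store and e.addr == addr
        and e.thread_order > store_thread_order
    })


lsqs = [
    [LsqEntry(0, True, 0x100)],                             # context 0
    [LsqEntry(1, False, 0x100), LsqEntry(1, False, 0x200)], # context 1
]
hit = find_forwarding_store(lsqs, ctx=1, thread_order=1, addr=0x100,
                            speculative=True)
miss = find_forwarding_store(lsqs, ctx=1, thread_order=1, addr=0x100,
                             speculative=False)
```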

Key Results

Average improvement: 24% • Reduction in data dependence stalls • Little overhead from the optimizations • Improvement is not seen in all benchmark programs

Assuming 2-3 threads per context, 6-8 LSQ entries per thread. • Performance relative to IMT with unlimited resources

ICOUNT: Favor the thread with the fewest in-flight instructions waiting to execute • Biased-ICOUNT: Favor non-speculative threads • Worst-case resource estimation • Reduced thread squashing
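The difference between the two policies can be shown with a small sketch. It assumes each thread is described by a dict with hypothetical fields `tid`, `in_flight`, and `speculative`, and it models the bias as a strict priority for non-speculative threads, which may be stronger than the bias actually used in IMT.

```python
def icount_pick(threads):
    """ICOUNT: fetch from the thread with the fewest in-flight instructions."""
    return min(threads, key=lambda t: t["in_flight"])["tid"]


def biased_icount_pick(threads):
    """Biased ICOUNT (sketch): non-speculative threads come first,
    ICOUNT breaks ties within each group."""
    # False sorts before True, so non-speculative threads win the min().
    return min(threads, key=lambda t: (t["speculative"], t["in_flight"]))["tid"]


threads = [
    {"tid": 0, "in_flight": 12, "speculative": False},
    {"tid": 1, "in_flight": 3, "speculative": True},
    {"tid": 2, "in_flight": 7, "speculative": True},
]
```

With these numbers, plain ICOUNT picks speculative thread 1, while the biased variant picks non-speculative thread 0 even though it has more instructions in flight.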

TME: Executes both paths of an unpredictable branch (but such branches are uncommon) • DMT: • Hardware selection of threads, so it spawns threads on a backward branch or function call instead of on loops • Also spawns threads out of order, which lowers branch prediction accuracy

Critique

Compiler Support • Improvement is shown for applications compiled with the Multiscalar compiler • These are scientific computing applications, not desktop applications

LSQ Limitations • LSQ size determines the size of a speculative thread • Pentium 4 (without SMT): 48 loads, 24 stores • Pentium 4 HT: 24 loads, 12 stores per thread • IBM POWER5: 32 loads, 32 stores per thread

LSQ Limitations: Alternative • Cache-based approach, i.e. partition the cache to support different versions • Extra support required, but scalable

Register file size • IMT considers register file sizes of 128 and up • Pentium 4 (as well as HT): register file size = 128 • IBM POWER5: register file size = 80

Searching LSQ • Since loads and stores are organized per thread, a search involves all locations of the other threads • If loads and stores were organized by address, there would be fewer entries to search • Can make use of the associativity of the cache

Searching LSQ (contd.)

So how is performance still high? • Assistance from Compiler • Resource and dependency-aware fetching • Multiple threads on an execution context • Overlapping rename table creation with execution

Term project • “Cache-based throughput improvement techniques for Speculative SMT processors” • Optimizations from IMT • Increasing granularity to reduce number of thread squashes

Thank you
