Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor • Chong-Liang Ooi, Seon Wook Kim, Il Park, Rudolf Eigenmann, Babak Falsafi and T. N. Vijaykumar • Presented by: Ashok Venkatesan
Outline • Background • Thread-Level Parallelism (TLP) • Explicit & Implicit TLP • An Example • Multiplex • Threading Model • MUCS Protocol • Key Performance Factors • Performance Analysis • Conclusion
Thread-Level Parallelism • ILP wall • CPI increases as clock rates increase • Limited ILP in applications • Insufficient memory locality • Using TLP • Increased granularity of parallelism • Exploitation of multiple cores • Threads • A logical sub-process that carries its own state • State – instructions, data, PC, register file, stack, etc.
Explicit & Implicit TLP • Explicit TLP • The program is explicitly partitioned into threads by the programmer, and an API is used to dispatch them for execution on multiple cores • Static – defined in the program • Main overhead – thread dispatch • Implicit or speculative TLP • Threads are peeled off from the program's sequential execution stream by hardware prediction • Dynamic – predicted at run time • Main overhead – speculative state overflow
Example – Executing Explicit Threads • The data dependence is resolved here using a barrier • Threads are dispatched using a fork call (system API)
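The explicit-threading pattern on this slide can be sketched in software. The Python example below is a minimal illustration, not code from the paper: the arrays `a`, `b`, `c`, the constants `NUM_THREADS` and `N`, and the two-phase computation are all assumptions, chosen only to show a barrier resolving a cross-thread data dependence. `Thread.start()` stands in for the fork call.

```python
import threading
from typing import List

NUM_THREADS = 4   # assumed number of CPUs
N = 16            # assumed problem size (a multiple of NUM_THREADS)

a = list(range(N))
b = [0] * N
c = [0] * N
barrier = threading.Barrier(NUM_THREADS)


def worker(tid: int) -> None:
    chunk = N // NUM_THREADS
    lo, hi = tid * chunk, (tid + 1) * chunk
    # Phase 1: each thread writes only its own slice of b.
    for i in range(lo, hi):
        b[i] = a[i] * 2
    # Phase 2 reads elements of b written by another thread, so every
    # thread must finish phase 1 first. The barrier enforces that.
    barrier.wait()
    for i in range(lo, hi):
        c[i] = b[i] + b[(i + chunk) % N]


def run_explicit() -> List[int]:
    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(NUM_THREADS)]
    for t in threads:
        t.start()   # explicit dispatch, playing the role of fork
    for t in threads:
        t.join()
    return c
```

The partitioning and the synchronization point are both fixed in the program text, which is what makes this style static: the cost paid at run time is the dispatch itself.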
Example – Executing Implicit Threads • Both data dependence and thread dispatch are handled by a hardware predictor
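Speculative execution with squash and re-execution can be imitated in software to make the idea concrete. The sketch below is a sequential simulation under stated assumptions, not the hardware mechanism: the function `run_speculative`, the `window` parameter, the `load`/`store` callbacks, and the two loop bodies are all hypothetical. Each iteration runs as a speculative "thread" against a shared snapshot, buffers its stores, and commits in program order. A thread that read an address later written by an earlier thread is squashed and re-executed.

```python
from typing import Callable, Dict


def run_speculative(body: Callable, n_iters: int, memory: Dict[int, int],
                    window: int = 4) -> int:
    """Run loop iterations as speculative threads; return the squash count."""
    squashes = 0
    for start in range(0, n_iters, window):
        snapshot = dict(memory)  # every thread in the window starts from this state
        pending = []
        for it in range(start, min(start + window, n_iters)):
            reads, writes = set(), {}

            def load(addr, reads=reads, writes=writes):
                if addr in writes:        # satisfied by this thread's own store
                    return writes[addr]
                reads.add(addr)           # speculative read, possibly stale
                return snapshot.get(addr, 0)

            def store(addr, value, writes=writes):
                writes[addr] = value      # buffered until commit

            body(it, load, store)
            pending.append((it, reads, writes))

        committed = set()  # addresses written by earlier threads in this window
        for it, reads, writes in pending:
            if reads & committed:
                # Dependence violation: squash, then re-execute against
                # the committed memory state.
                squashes += 1
                writes = {}

                def load(addr, writes=writes):
                    return writes[addr] if addr in writes else memory.get(addr, 0)

                def store(addr, value, writes=writes):
                    writes[addr] = value

                body(it, load, store)
            memory.update(writes)         # in-order commit
            committed.update(writes)
    return squashes


def recurrence(i, load, store):
    store(i + 1, load(i) + 1)     # iteration i reads what iteration i-1 wrote


def independent(i, load, store):
    store(100 + i, load(i) * 2)   # no cross-iteration dependence
```

Running `recurrence` squashes every thread except the first in each window, while `independent` commits without any squash. That contrast is the reason speculation pays off only where dependences are rare.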
Multiplex • Unifies explicit and implicit threading on a CMP • Obviates the need for serializing unanalyzable program segments by using speculative TLP • Avoids implicit threading’s speculation overhead and performance loss in compiler-analyzable program segments by using explicit threading. • Implements a single snoopy bus protocol to unify cache coherence with memory renaming and disambiguation.
Threading Model • Thread selection • Partitioning code into distinct instruction sequences. • Thread dispatch • Assigning threads to execute on different CPUs • Data communication and speculation • Propagating data between independent threads.
Thread Selection in Multiplex • Methodology • The compiler chooses between threading models • Explicit threading is prioritized over implicit threading • Implicit threads are selected at run time by hardware speculation • However, software specifies implicit thread boundaries • Benefit – minimizes both explicit and implicit overheads • Scenarios • Loops with small bodies are executed implicitly • Tail ends of unevenly partitioned segments are executed implicitly
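The selection policy on this slide can be written as a small decision function. This is only a sketch of the stated priorities: the `Segment` fields, the `MIN_EXPLICIT_BODY` threshold, and the function name `choose_mode` are hypothetical, and the slides do not say how the compiler actually measures body size or detects uneven tails.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    analyzable: bool              # compiler could prove the threads independent
    body_size: int                # estimated instructions per loop body
    is_uneven_tail: bool = False  # tail end of an unevenly partitioned segment


MIN_EXPLICIT_BODY = 50  # hypothetical cutoff below which dispatch cost dominates


def choose_mode(seg: Segment) -> str:
    """Prefer explicit threading; fall back to implicit in the listed scenarios."""
    if not seg.analyzable:
        return "implicit"   # unanalyzable code is speculated, not serialized
    if seg.body_size < MIN_EXPLICIT_BODY:
        return "implicit"   # small loop bodies
    if seg.is_uneven_tail:
        return "implicit"   # tail ends of unevenly partitioned segments
    return "explicit"
```

For example, `choose_mode(Segment(analyzable=True, body_size=400))` selects explicit threading, while the same segment with `body_size=10` falls back to implicit threading.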
Thread Dispatch – An Overview • Dispatching conventional threads involves • Setting each CPU's PC to the address of the thread's first instruction • Assigning a private SP to each CPU • Copying stack and register values prior to dispatch • Thread Descriptor – holds thread information • Stores the addresses of possible subsequent dispatch target threads • Holds register dependency information
Thread Dispatch in Multiplex • Methodology • Predict subsequent threads based on current threads • Dispatch, execute and commit sequentially • Re-dispatch on squashing • Suspend dispatch upon mode switch to allow thread commits to complete • Instruction Set Changes - fork, stop and setsp • A Thread Predictor unit added to handle speculative prediction • A mode bit added to the Thread Descriptor • A TD Cache caches recently referenced descriptors
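The dispatch structures named on the last two slides can be modeled as plain data types. The sketch below is an assumption-heavy illustration: the field names, the LRU replacement policy of `TDCache`, and the last-successor rule in `predict_next` are all hypothetical, since the slides state only that a descriptor holds target addresses, register dependency information, and a mode bit, and that a TD cache holds recently referenced descriptors.

```python
from collections import OrderedDict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Tuple


class Mode(Enum):
    EXPLICIT = 0
    IMPLICIT = 1


@dataclass(frozen=True)
class ThreadDescriptor:
    start_pc: int               # address of the thread's first instruction
    targets: Tuple[int, ...]    # possible subsequent dispatch targets
    reg_deps: FrozenSet[str]    # register dependency information
    mode: Mode                  # the mode bit added by Multiplex


class TDCache:
    """Caches recently referenced descriptors (LRU policy is an assumption)."""

    def __init__(self, capacity: int, table: Dict[int, ThreadDescriptor]):
        self.capacity = capacity
        self.table = table      # stands in for descriptors stored in memory
        self.lines: "OrderedDict[int, ThreadDescriptor]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, pc: int) -> ThreadDescriptor:
        if pc in self.lines:
            self.hits += 1
            self.lines.move_to_end(pc)          # mark as most recently used
            return self.lines[pc]
        self.misses += 1
        td = self.table[pc]
        self.lines[pc] = td
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)      # evict least recently used
        return td


def predict_next(td: ThreadDescriptor, history: Dict[int, int]) -> int:
    """Predict the successor last seen after this thread, else the first target."""
    return history.get(td.start_pc, td.targets[0])
```

A mispredicted successor would be squashed and re-dispatched, which is why the descriptor lists every possible target rather than a single one.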
MUCS Protocol • Multiplex Unified Coherence and Speculation (MUCS) • Provides data coherence as well as versioning support • Key design objective – minimize speculation overheads in two respects • In the common case, dependence resolution should be handled within the cache, minimizing bus transactions • Thread commits and squashes should be performed en masse rather than one cache block at a time
MUCS Protocol • Six bits track the state of each cache block • Use – set per speculative load executed before a store • Dirty – set per speculative store in both modes • Commit – set en masse on commit of speculative blocks • Stale – set on a cache block when a newer version of the data is available in another CPU • Squash – set en masse in a cache touched by a squashed thread • Valid – set per cache fill upon misses in both modes to indicate validity of the tag (not the data)
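The six state bits can be represented as a flag set per cache block. The sketch below models only the conditions under which each bit is set, as listed on this slide. The class and method names are hypothetical, the treatment of blocks with `USE` or `DIRTY` as "speculative" is an assumption, and what the protocol does once a bit is set (bus transactions, invalidation, write-back) is deliberately left out. The loop in `_mark_speculative` stands in for what hardware would do as a single en masse operation.

```python
from enum import IntFlag
from typing import Dict


class Bits(IntFlag):
    USE = 1
    DIRTY = 2
    COMMIT = 4
    STALE = 8
    SQUASH = 16
    VALID = 32


SPECULATIVE = Bits.USE | Bits.DIRTY   # assumption: marks blocks a thread touched


class L1Cache:
    def __init__(self) -> None:
        self.blocks: Dict[int, Bits] = {}   # tag -> state bits

    def fill(self, tag: int) -> None:
        self.blocks[tag] = Bits.VALID       # set per cache fill on a miss

    def spec_load(self, tag: int) -> None:
        if tag not in self.blocks:
            self.fill(tag)
        if not self.blocks[tag] & Bits.DIRTY:
            self.blocks[tag] |= Bits.USE    # load executed before any store

    def spec_store(self, tag: int) -> None:
        if tag not in self.blocks:
            self.fill(tag)
        self.blocks[tag] |= Bits.DIRTY

    def remote_newer_version(self, tag: int) -> None:
        if tag in self.blocks:
            self.blocks[tag] |= Bits.STALE  # another CPU holds a newer version

    def _mark_speculative(self, bit: Bits) -> None:
        for tag, state in self.blocks.items():
            if state & SPECULATIVE:
                self.blocks[tag] = state | bit

    def commit(self) -> None:
        self._mark_speculative(Bits.COMMIT)  # en masse, not per block on the bus

    def squash(self) -> None:
        self._mark_speculative(Bits.SQUASH)
```

Keeping commit and squash as whole-cache operations reflects the second design objective: neither requires a separate action for each individual block.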
Key Performance Factors • Thread Size • Load Imbalance • Data Dependence • Thread dispatch/completion overhead • Speculative State Overflow
Performance Analysis – Best Case • Class 1 applications favor implicit-only CMPs • Class 2 applications favor explicit-only CMPs • Average speedup of a CMP with four dual-issue CPUs over one with a single dual-issue CPU • Implicit-only = 1.14, explicit-only = 2.17, Multiplex = 2.3
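The relative advantage implied by these averages follows from simple ratios. The snippet below uses only the three numbers on this slide; the dictionary and the helper name `relative_gain` are illustrative. On these figures Multiplex comes out roughly 6% ahead of explicit-only and roughly 102% ahead of implicit-only.

```python
# Average speedups as reported on this slide.
speedup = {"implicit-only": 1.14, "explicit-only": 2.17, "Multiplex": 2.3}


def relative_gain(new: float, base: float) -> float:
    """Fractional improvement of `new` over `base`."""
    return new / base - 1.0


gain_over_explicit = relative_gain(speedup["Multiplex"], speedup["explicit-only"])
gain_over_implicit = relative_gain(speedup["Multiplex"], speedup["implicit-only"])

print(f"Multiplex vs explicit-only: {gain_over_explicit:.1%}")
print(f"Multiplex vs implicit-only: {gain_over_implicit:.1%}")
```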
Performance Analysis – Overheads • Legend: I – implicit-only, m – Multiplex • fpppp: provably parallel code = 0%, low squash buffer hits • wave5, tomcatv and swim have control-flow irregularities in the inner loop, i.e., I/O stalls
Performance Analysis – Cache Size • Effect of increasing cache size – performance increases • Multiplex incurs less overflow than an implicit-only CMP • Effect of increasing data rates – performance decreases
Conclusion • Combining implicit and explicit multithreading yields a better speedup than either alone; Multiplex shows a speedup of 2.63 in simulation • The MUCS protocol makes this possible by mapping the coherence protocol needed for explicit threading onto a subset of the states required for implicit threading, eliminating the need for extra hardware • The dominant overheads of implicit and explicit threading are speculative state overflow and thread dispatch, respectively