Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor Chong-Liang Ooi, Seon Wook Kim, Il Park, Rudolf Eigenmann, Babak Falsafi and T. N. Vijaykumar Presented by: Ashok Venkatesan
Outline • Background • Thread-Level Parallelism (TLP) • Explicit & Implicit TLP • An Example • Multiplex • Threading Model • MUCS Protocol • Key Performance Factors • Performance Analysis • Conclusion
Thread-Level Parallelism • ILP Wall • Increasing CPI with increasing clock rates • Limited ILP in applications • Insufficient memory locality • Using TLP • Increased granularity of parallelism • Exploitation of multiple cores • Threads: • A logical sub-process that carries its own state • State – instructions, data, PC, register file, stack, etc.
Explicit & Implicit TLP • Explicit TLP • The program is explicitly partitioned into threads by the programmer, and an API is used to dispatch and execute them on multiple cores • Static – defined in the program • Main overhead – thread dispatch • Implicit or Speculative TLP • Threads are peeled off a sequential execution stream of the program by hardware prediction • Dynamic – runtime prediction • Main overhead – speculative state overflow
Example – Executing Explicit Threads • Data dependence is resolved using a barrier • Threads are dispatched using a fork (system API) call, as in the sketch below
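A minimal sketch of the explicit case in C, with POSIX threads standing in for the fork API; the loop, names, and partitioning are illustrative, not taken from the paper. The programmer statically splits the iteration space, and the barrier resolves the cross-thread dependence between the two phases.

```c
/* Sketch: explicit TLP. The programmer partitions the loop
 * statically; pthread_create plays the role of the fork API call,
 * and the barrier resolves the dependence between the phases. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1024

static double a[N], b[N];
static pthread_barrier_t bar;

static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;

    for (long i = lo; i < hi; i++)    /* phase 1: produce a[] */
        a[i] = i * 0.5;

    pthread_barrier_wait(&bar);       /* data dependence resolved here */

    for (long i = lo; i < hi; i++)    /* phase 2: consume a[], possibly */
        b[i] = a[(i + 1) % N] + 1.0;  /* written by another thread      */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (long id = 0; id < NTHREADS; id++)    /* explicit dispatch */
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (long id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    printf("b[0] = %f\n", b[0]);
    return 0;
}
```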
Example – Executing Implicit Threads • Both data dependence and dispatch are handled by a hardware predictor, as the sketch below illustrates
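For contrast, a sketch of a loop left sequential for implicit threading; the comments mark where a hardware predictor would peel off and commit speculative threads (illustrative only, not the paper's hardware interface).

```c
/* Sketch: implicit TLP. The loop stays sequential in the binary;
 * no threading API appears. A hardware predictor would peel each
 * iteration off as a speculative thread on an idle CPU. */
#include <stdio.h>
#define N 1024

int main(void) {
    double a[N], b[N], sum = 0.0;
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    for (int i = 0; i < N; i++) {
        /* <- predictor dispatches iteration i speculatively */
        sum += a[i] * b[i];  /* cross-iteration dependence on sum */
        /* <- thread commits in program order; a misspeculated
         *    read of sum squashes and re-executes the thread   */
    }
    printf("sum = %f\n", sum);
    return 0;
}
```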
Multiplex • Unifies explicit and implicit threading on a CMP • Obviates the need to serialize unanalyzable program segments, by using speculative TLP • Avoids implicit threading's speculation overhead and performance loss in compiler-analyzable program segments, by using explicit threading • Implements a single snoopy bus protocol that unifies cache coherence with memory renaming and disambiguation
Threading Model • Thread selection • Partitioning code into distinct instruction sequences • Thread dispatch • Assigning threads to execute on different CPUs • Data communication and speculation • Propagating data between threads (speculatively when dependences are unknown)
Thread Selection in Multiplex • Methodology (sketched below) • The compiler chooses between threading models, prioritizing explicit threading over implicit threading • Implicit threads are selected by runtime hardware speculation; however, software specifies the implicit thread boundaries • Pros – minimizes both explicit and implicit overheads • Scenarios • Executing loops with small bodies implicitly • Executing the tail ends of unevenly partitioned segments implicitly
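A rough source-level picture of the selection policy; the pragma names are hypothetical stand-ins for the compiler's internal marking, not Multiplex's actual syntax.

```c
/* Sketch: compiler-driven thread selection. Pragma names are
 * hypothetical; unknown pragmas are ignored by C compilers. */
#define N 1024
double a[N], b[N], c[N], x[N];
int idx[N];

void select_threads(void) {
    /* Provably independent loop: compiler picks explicit threads */
    #pragma explicit_thread
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Indirect subscript defeats compile-time dependence analysis:
     * software marks the thread boundaries, hardware speculates   */
    #pragma implicit_thread_boundary
    for (int i = 0; i < N; i++)
        x[idx[i]] += 1.0;
}
```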
Thread Dispatch – An Overview • Dispatching conventional threads involves • Assigning the address of the thread's first instruction to each CPU's PC • Assigning a private stack pointer (SP) to each CPU • Copying stack and register values prior to dispatch • Thread Descriptor – holds thread information • Stores the addresses of possible subsequent dispatch-target threads • Holds register dependence information
Thread Dispatch in Multiplex • Methodology • Predict subsequent threads based on the current thread • Dispatch, execute and commit sequentially • Re-dispatch on a squash • Suspend dispatch upon a mode switch to allow outstanding thread commits to complete • Instruction set changes – fork, stop and setsp • A thread predictor unit is added to handle speculative prediction • A mode bit is added to the thread descriptor (see the sketch below) • A TD cache holds recently referenced descriptors
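A hedged sketch of what such a thread descriptor might hold, per the bullets above; the field names and widths are assumptions for illustration, not the paper's exact layout.

```c
/* Sketch: a thread descriptor with Multiplex's added mode bit.
 * Field names and widths are illustrative assumptions. */
#include <stdint.h>

typedef struct {
    uint32_t start_pc;      /* first instruction of the thread       */
    uint32_t targets[2];    /* possible subsequent dispatch targets  */
    uint64_t reg_dep_mask;  /* register dependence information       */
    uint8_t  mode;          /* added by Multiplex: 0 = explicit,     */
                            /* 1 = implicit                          */
} thread_descriptor_t;      /* recently referenced descriptors are   */
                            /* cached in the TD cache                */
```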
MUCS Protocol • Multiplex Unified Coherence and Speculation (MUCS) • Offers data coherence as well as versioning support • Key design objectives – minimize speculation overheads in two respects • Dependence resolution in the common case should be handled within the cache, minimizing bus transactions • Thread commits/squashes should be performed en masse, not on individual cache blocks
MUCS Protocol • 6 bits monitor the state of each cache block (pictured as a bitfield below) • Use – set per speculative load executed before a store • Dirty – set per store in both modes • Commit – set en masse on commit of speculative blocks • Stale – set on a cache block when a newer version of the data is available in another CPU • Squash – set en masse on a cache touched by a squashed thread • Valid – set per cache fill upon a miss in both modes, to determine validity of the tag (not the data)
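The six per-block bits can be pictured as a C bitfield; this encoding is only an illustration of the state listed above, assuming one state word per cache block, not the hardware's actual layout.

```c
/* Sketch: the six MUCS state bits per cache block as a C bitfield.
 * The encoding is illustrative, not the hardware's actual layout. */
#include <stdint.h>

typedef struct {
    uint8_t use    : 1; /* speculative load executed before a store  */
    uint8_t dirty  : 1; /* store performed, in either mode           */
    uint8_t commit : 1; /* set en masse when speculative blocks      */
                        /* commit                                    */
    uint8_t stale  : 1; /* newer version of the data in another CPU  */
    uint8_t squash : 1; /* set en masse when the owning thread is    */
                        /* squashed                                  */
    uint8_t valid  : 1; /* tag (not data) valid after a cache fill   */
} mucs_block_state_t;
```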
Key Performance Factors • Thread Size • Load Imbalance • Data Dependence • Thread dispatch/completion overhead • Speculative State Overflow
Performance Analysis – Best Case • Class 1 applications favor implicit-only CMPs • Class 2 applications favor explicit-only CMPs • Average speedup of a four-CPU, dual-issue CMP over a single dual-issue CPU • Implicit-only = 1.14, explicit-only = 2.17, Multiplex = 2.3
Performance Analysis – Overheads • i – implicit-only, m – Multiplex • fpppp: provably parallel code = 0%, low squash-buffer hits • wave5, tomcatv and swim have control-flow irregularities in the inner loop, i.e., I/O stalls
Performance Analysis – Cache Size • Effect of increasing cache size – performance increases • Multiplex incurs less overflow than an implicit-only CMP • Effect of increasing data rates – performance decreases
Conclusion • The coexistence of implicit and explicit multithreading yields better speedup, reaching 2.63 in simulation • The MUCS protocol enables such an implementation by mapping the coherence protocol needed for explicit threading onto a subset of the states required for implicit threading, eliminating the need for extra hardware • The dominant overheads are speculative state overflow for implicit threading and thread dispatch for explicit threading