Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance

Presented by: Peyman, Nov 2007

Overview • Previous Architectures • New Hybrid Architecture • Possible Benefits • Scrutiny • Experimental Results • Relation to Project

Something Old • CMP (Chip Multiprocessors: single-chip multi-core processors) • Two or more independent cores • Single-ISA heterogeneous multiprocessors • Cores of varying size and performance • Same ISA • Improve throughput for multi-threaded workloads • Single-threaded?

Superscalar • Increase performance w/o recompiling • Efficiently handle runtime events • Branch Direction • Target Address • Load Latency • Memory Dependency • Limited ILP: Hardware Instruction Window

VLIW • Very Long Instruction Word • Shifts hardware complexity to the compiler • High clock frequency • Energy-efficient • Hardware need not analyze data dependencies • No hardware scheduling of independent instructions

Something New • Dual-Core Architecture [1] • Bus-based snooping • Communicate Using L2 • In Future: • Interconnections • Small operand transfer buffer

Potential Benefits • VLIW core can operate at high clock rate • Simple Superscalar core • More aggressive compiler optimization • Due to the superscalar speculative operations • Simple hardware • Energy Efficient • Scalable

Hybrid Compiler • At the TLP level, aware of: • Execution bandwidth • Frequencies • At the ILP level: • Architectural details of the superscalar core? • Number of functional units and latencies of the VLIW core • Helper threads

Optimization Phases • Phase 1 • Exploit speculative threads (helper threads) • Phase 2 • Extract non-speculative multi-grain parallelism • Partition source code • Predictable (static analysis or profiling) • Unpredictable (suitable for superscalar core) • A lot more …
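The partitioning step in Phase 2 can be sketched as follows. This is a minimal illustration, assuming that profiling yields a branch misprediction rate and an L2 miss rate per code region. The `Region` type, the `partition` function, and both cut-off values are hypothetical and are not taken from [1].

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    """A code region with hypothetical profile data (illustrative only)."""
    name: str
    branch_mispredict_rate: float  # fraction of mispredicted branches
    l2_miss_rate: float            # fraction of loads that miss in L2


def partition(regions: List[Region],
              max_mispredict: float = 0.05,
              max_miss: float = 0.01) -> Dict[str, List[str]]:
    """Assign predictable regions to the VLIW core, the rest to the superscalar core.

    The two thresholds are assumed values chosen for the example.
    """
    plan: Dict[str, List[str]] = {"vliw": [], "superscalar": []}
    for region in regions:
        predictable = (region.branch_mispredict_rate <= max_mispredict
                       and region.l2_miss_rate <= max_miss)
        plan["vliw" if predictable else "superscalar"].append(region.name)
    return plan
```

A region counts as predictable here only if both its control flow and its memory behavior are well behaved; a real compiler would likely weigh more factors than these two rates.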

Did that sound right? • Will the data be in the L2 cache when the VLIW core needs it?

What if?

Pre-Execution • Not a new idea • Using superscalar core to minimize L2 miss stalls • Stalling VLIW pipelines • Predictable load latencies? • Cache profiling

Definitions • Delinquent Loads • The small number of load operations that are responsible for the majority of data cache misses. • Delinquent Loads Threshold • A pre-set threshold on the number of allowable stall cycles caused by a static load instruction
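The definitions above can be illustrated with a small sketch. It assumes a hypothetical cache profile that maps each static load (identified by its program counter) to the total stall cycles it caused, and it treats a load as delinquent when its stall cycles exceed the threshold. The profile format and the names `find_delinquent_loads` and `miss_stall_coverage` are illustrative assumptions, not part of [1].

```python
from typing import Dict, List


def find_delinquent_loads(stall_cycles: Dict[int, int], threshold: int) -> List[int]:
    """Return the PCs of static loads whose profiled stall cycles exceed the threshold."""
    return sorted(pc for pc, cycles in stall_cycles.items() if cycles > threshold)


def miss_stall_coverage(stall_cycles: Dict[int, int], delinquent: List[int]) -> float:
    """Fraction of all profiled stall cycles attributable to the delinquent loads."""
    total = sum(stall_cycles.values())
    if total == 0:
        return 0.0
    return sum(stall_cycles[pc] for pc in delinquent) / total
```

Lowering the threshold marks more loads as delinquent, which raises coverage but also enlarges the set of loads the pre-execution thread has to handle.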

Pre-Execution Thread • Make load operations non-faulting • Remove all store operations
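The two rules above can be sketched on a toy instruction list. The `Instr` record and the function `make_pre_execution_thread` are hypothetical and stand in for whatever intermediate representation the hybrid compiler actually uses. The sketch applies only the two transformations named on the slide and does not attempt any further slicing of the thread.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Instr:
    """Toy instruction record (illustrative, not the compiler's real IR)."""
    op: str
    dest: str = ""
    srcs: Tuple[str, ...] = ()
    non_faulting: bool = False


def make_pre_execution_thread(main_thread: List[Instr]) -> List[Instr]:
    """Derive a helper thread: drop all stores, mark every load non-faulting."""
    helper: List[Instr] = []
    for ins in main_thread:
        if ins.op == "store":
            continue  # the helper thread must not modify memory
        if ins.op == "load":
            ins = replace(ins, non_faulting=True)  # a bad address must not trap
        helper.append(ins)
    return helper
```

Removing stores keeps the helper thread from changing architectural state, and non-faulting loads let it run ahead speculatively without raising exceptions on addresses the main thread might never touch.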

Evaluation • Simulated Cores [1]

Evaluation (2) • Hybrid compiler built upon the Trimaran compiler • A cycle-accurate model • Based on integration of • VLIW simulator from Trimaran • Superscalar simulator: SimpleScalar

Evaluation (3) • Seven single-threaded applications from • SPEC 2000 INT • SPEC 92 FP

Base, Pre-Execution, Prefetch

L2 Miss Latency

Delinquent Loads Threshold

Relation? • Relation to course project • Project focuses on scalability of optimization techniques • Relation to course • How multi-cores can help single-threaded applications

Reference • [1] Yan J., Zhang W., "Hybrid multi-core architecture for boosting single-threaded performance", ACM SIGARCH Computer Architecture News 35(1): 141-148, 2007
