Energy-Efficient Speculative Threads: Dynamic Thread Allocation in Same-ISA Heterogeneous Multicore System
Yangchun Luo*, Venkatesan Packirisamy†, Wei-Chung Hsu‡, and Antonia Zhai*
* University of Minnesota – Twin Cities   † NVIDIA Corporation   ‡ National Chiao Tung University, Taiwan
Background: A traditional (sequential) program occupies only one core of a multicore processor (P0–P3); multicore requires thread-level parallelism to use the remaining cores.
Speculative Parallelism: In traditional parallelization, a loop with a store through *p followed in a later iteration by a load from *q cannot be parallelized unless the compiler proves p != q, so execution stays sequential. Thread-Level Speculation (TLS) runs the iterations in parallel assuming p != q; if in fact p == q, the speculative thread has loaded a stale value (e.g., 20 instead of the stored 88), the dependence violation is detected, and the thread is squashed and re-executed. TLS thus exposes more potential parallelism, as in the sketch below.
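The following minimal sketch (hypothetical code, not from the talk) shows the kind of may-alias loop TLS targets; the function and variable names are illustrative assumptions.

```cpp
// The compiler cannot prove that p and q never point to the same location,
// so it must serialize the iterations; TLS runs them in parallel anyway and
// squashes a thread when the speculation fails.
void update(int* p, int* q, int n) {
    for (int i = 0; i < n; ++i) {
        int v = q[i];     // a later (speculative) iteration may load this early
        p[i] = v + 1;     // while an earlier iteration is still storing through p
        // If p[i] and q[j] alias for some pair of iterations, the speculative
        // load saw a stale value; the hardware detects the violation and
        // re-executes that thread. If they never alias, all iterations commit
        // in parallel.
    }
}
```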
Speculation vs. Energy Efficiency: Successful speculation (p != q) improves performance and shortens the time over which leakage power is paid. Failed speculation (p == q) wastes dynamic power on squashed work and keeps more components leaking while the thread is re-executed. Can we exploit the performance of speculation without compromising energy efficiency?
Impact from Underlying Hardware: TLS energy efficiency depends on the hardware configuration [Packirisamy ICCD'08]. Comparing an SMT architecture (shared core and L1 cache) against a CMP architecture, SMT delivers higher efficiency overall, while CMP is better in some cases.
Optimization Opportunities: Resource contention: failed threads compete with useful work on a shared SMT core, so a CMP organization can help. Low instruction-level parallelism: since TLS exploits both ILP and TLP, simpler (narrower) cores often suffice. Unique cache access patterns of TLS: multiple caches are activated, so smaller caches can suffice.
Our Proposal: Match program execution to the underlying hardware through on-chip heterogeneity and dynamic resource allocation.
Same-ISA Heterogeneity: What components should be integrated? (1) Multithreading execution mode: SMT or separate cores, with no mixed mode. (2) Core computing power: issue width and SMT support. (3) L1 cache size: number of sets and associativity; the L2 cache size is not changed.
Design Space Exploration: Start from an unbounded heterogeneous multicore offering 1-issue through 8-issue cores, with and without SMT, and L1 caches from 16K to 256K with 1-way to 8-way set associativity, assuming no power-on/off overheads and no cache warm-up cost; a sketch of this enumeration follows.
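An illustrative sketch of enumerating the unbounded design space; the exact value sets are assumptions based on the options named on the slides, not a definitive list.

```cpp
#include <vector>

struct CoreConfig {
    unsigned issue_width;   // 1- to 8-issue
    bool     smt;           // SMT-capable or not
    unsigned l1_kbytes;     // 16K .. 256K
    unsigned l1_ways;       // 1- to 8-way set associative
};

// Build every core/cache combination, with no power-on/off or warm-up cost modeled.
std::vector<CoreConfig> design_space() {
    std::vector<CoreConfig> space;
    for (unsigned issue : {1u, 2u, 4u, 6u, 8u})
        for (bool smt : {false, true})
            for (unsigned kb : {16u, 32u, 64u, 128u, 256u})
                for (unsigned ways : {1u, 2u, 4u, 8u})
                    space.push_back({issue, smt, kb, ways});
    return space;
}
```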
Component Usage: Sequential segments: L1 cache 16K-4way 65%, 32K-4way 5%, 64K-4way 25% of coverage; core width 2-issue 20%, 4-issue 60%, 6-issue 15%. Parallel segments: CMP-based execution (2-, 4-, and 6-issue cores with 16K/32K/64K L1) covers 15%, SMT-based execution 41%. A 4-way set-associative cache is always favored.
Proposed Integration: Guided by the component-usage results (CMP and SMT modes; 2-issue to 6-issue cores; 16K to 64K, 4-way set-associative caches), the proposed chip integrates a 4-issue SMT core and a 2-issue non-SMT core, each with a 64K 4-way L1 that is resizable by sets, on top of a unified level-2 cache. How much improvement does this give? A sketch of the set-resizable L1 follows.
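A minimal sketch of what "resizable by sets" means for the L1 indexing; the 64B line size is an assumption, while the 64K/4-way geometry and the 16K lower bound come from the slides.

```cpp
#include <cstdint>

// Shrinking the cache masks off high index bits, so some blocks map to
// different sets and become cold; dirty blocks in disabled sets must be
// written back to L2 before resizing.
struct ResizableL1 {
    static constexpr unsigned kLineBytes = 64;   // assumed line size
    static constexpr unsigned kWays      = 4;    // 4-way, per the proposal
    unsigned active_sets;                        // 256 sets -> 64KB, 64 sets -> 16KB

    unsigned set_index(uint64_t addr) const {
        // Fewer active sets -> shorter index mask -> smaller effective capacity.
        return static_cast<unsigned>((addr / kLineBytes) & (active_sets - 1));
    }
    unsigned capacity_kbytes() const {
        return active_sets * kWays * kLineBytes / 1024;
    }
};
```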
Improvement Estimation (baseline): Assuming no overheads and oracle thread allocation, relative to running the sequential program on the SMT baseline, the estimated improvements are 16%, 9%, and 7% for the individual components, giving a total improvement upper bound of 29%.
Improvement Estimation (continued): Under the same assumptions (no overheads, oracle thread allocation), the proposal achieves a 29% improvement over SMT, against 33% for the unbounded heterogeneous multicore; the proposal captures most of the benefit.
Overhead Sources: (1) Startup overhead: while a powered-off component is being powered on, static power is consumed before it does useful work. (2) Cache reconfiguration overhead: switching between the bigger and smaller size changes the set mapping, so remapped content becomes cold or is discarded, and dirty lines must be written back to L2.
Overhead Impacts: With oracle thread allocation but these overheads included, the heterogeneous design loses to the homogeneous one: the +29% benefit of heterogeneity is overwhelmed by an 80% overhead penalty.
Overhead Mitigation: Benchmark statistics show very fine-grained threads, on average fewer than 300 instructions per thread with overall coverage of about 75%, so throttling mechanisms are added: reduce the reconfiguration frequency, reconfigure only when the expected duration exceeds the overhead, and delay powering devices off. With oracle thread allocation, this turns the +29% / -80% balance into a net +13% improvement; a sketch of the throttling idea follows.
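A hedged sketch of the throttling idea from this slide; the struct, field names, and any thresholds are illustrative assumptions rather than the paper's mechanism in detail.

```cpp
#include <cstdint>

struct ReconfigThrottle {
    uint64_t switch_overhead_cycles;   // startup + cache-reconfiguration cost
    uint64_t power_off_delay_cycles;   // hysteresis before powering a device off

    // Reduce reconfiguration frequency: switch only when the upcoming program
    // segment is long enough to amortize the switching overhead.
    bool should_switch(uint64_t expected_duration_cycles) const {
        return expected_duration_cycles > switch_overhead_cycles;
    }

    // Delay device powering-off: tolerate short idle periods so a component
    // needed again soon does not pay the startup overhead twice.
    bool should_power_off(uint64_t idle_cycles) const {
        return idle_cycles > power_off_delay_cycles;
    }
};
```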
Outline: On-chip heterogeneity (design space exploration, heterogeneous integration, overhead impact, overhead mitigation); dynamic resource allocation.
Determining Resource Configuration: On the proposed architecture, the difficulty is that the factors are intertwined: multithreading mode (SMT vs. CMP), core issue width (2-issue vs. 4-issue), and L1 cache size (64KB vs. 16KB). Our solution is divide and conquer: monitor the program at runtime with hardware performance counters during a sampling run on the 4-issue SMT core with the 64K L1, then decide each dimension independently.
Decision Making: Each dimension is decided from hardware performance monitor counters. SMT vs. CMP: the cycles the non-speculative thread stalls due to resource contention. 4-issue vs. 2-issue core: IPC and the fraction of instructions issued from the second half of the ROB, a direct indicator of low ILP (a value as small as 0.1% means the narrower core loses little). L1 size: the number of reused cache blocks. These counters lead to the right decisions; a sketch of the combined logic follows.
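A sketch of the divide-and-conquer decision logic: a sampling run on the 4-issue SMT core with the 64K L1 fills the counters below, and each dimension is decided independently. The counter names and all thresholds are illustrative assumptions, not the calibrated values from the paper.

```cpp
#include <cstdint>

struct SampleCounters {
    uint64_t cycles;                   // length of the sampling window
    uint64_t contention_stall_cycles;  // non-speculative thread stalled by resource contention
    uint64_t insts_issued;             // for IPC
    uint64_t insts_from_upper_rob;     // instructions issued from the 2nd half of the ROB
    uint64_t blocks_touched;           // cache blocks brought into the L1
    uint64_t blocks_reused;            // of those, blocks that were reused
};

struct Allocation {
    bool     use_cmp;      // CMP (separate cores) vs. SMT (shared core)
    unsigned issue_width;  // 2 vs. 4
    unsigned l1_kbytes;    // 16 vs. 64
};

Allocation decide(const SampleCounters& c) {
    Allocation a{};
    // (1) Multithreading mode: if the non-speculative thread loses many cycles to
    //     contention on the shared SMT core, run the threads on separate cores (CMP).
    a.use_cmp = static_cast<double>(c.contention_stall_cycles) / c.cycles > 0.10;
    // (2) Issue width: if almost nothing issues from the second half of the ROB,
    //     the available ILP is low and a 2-issue core loses little performance.
    double upper = static_cast<double>(c.insts_from_upper_rob) / c.insts_issued;
    a.issue_width = (upper < 0.001) ? 2 : 4;
    // (3) L1 size: if few of the cached blocks are ever reused, the smaller cache suffices.
    double reuse = static_cast<double>(c.blocks_reused) / c.blocks_touched;
    a.l1_kbytes = (reuse < 0.5) ? 16 : 64;
    return a;
}
```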
Comparisons: Four configurations are compared: the sequential program on a single core (SEQ), the homogeneous SMT baseline, the proposed heterogeneous design, and the unbounded heterogeneous multicore; results are normalized.
Heterogeneous vs. Homogeneous: The heterogeneous design delivers 4% higher performance and 6% less energy than the homogeneous SMT baseline, a 13% ED2P improvement with respect to SMT (33% for the unbounded design). Heterogeneity is beneficial.
Execution Mode Breakdown: The coverage breakdown across execution modes (2-issue vs. 4-issue cores, 16K vs. 64K L1) shows frequent thread migration and cache resizing during execution; dynamic allocation is essential.
Comparisons (revisited): The same four configurations, now evaluated against the sequential (SEQ) baseline.
Heterogeneous vs. Sequential Baseline: Compared to SEQ, the heterogeneous design achieves 38% higher performance at the cost of 7% more energy, a 44% ED2P improvement (versus -54% for the baseline on the same chart); it improves performance efficiently.
Related Work: TLS energy efficiency and hardware configuration [Packirisamy, Luo, Zhai et al., ICCD'08]: CMP and SMT are favored differently; this work integrates them heterogeneously. Energy-efficient TLS on a CMP [Renau et al., ICS'05]: their techniques can complement our system, while ours matches threads with configurations. Same-ISA heterogeneous multicores [Kumar et al., MICRO'03] [Kumar et al., ISCA'04]: ours is different in targeting speculative threads, at fine granularity and with overhead mitigation. Dynamic performance tuning for TLS [Luo, Packirisamy, Zhai et al., ISCA'09]: integrated here to extract efficient threads.
Conclusion: Heterogeneous TLS approaches the performance of TLS multithreading at closer to the power of a uniprocessor, through on-chip heterogeneity and dynamic resource allocation. Evaluation summary: 44% better than the uniprocessor and 13% better than the homogeneous multicore.