Variable-Based Multi-Module Data Caches for Clustered VLIW Processors
Enric Gibert 1,2, Jaume Abella 1,2, Jesús Sánchez 1, Xavier Vera 1, Antonio González 1,2
1 Intel Barcelona Research Center, Intel Labs, Barcelona
2 Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona
Issue #1: Energy Consumption
• Energy is a first-class design goal
• Heterogeneity helps: ↓ supply voltage and/or ↑ threshold voltage
• Cache memory is a major consumer: in the ARM10, the D-cache accounts for 24% of dynamic energy and the I-cache for 22%
• Heterogeneity can be exploited in the D-cache of VLIW processors
[Figure: processor front-end/back-end, contrasting a higher-performance/higher-energy design with a lower-performance/lower-energy one]
Issue #2: Wire Delays
• Processors are moving from capacity-bound to communication-bound
• One possible solution: clustering
• A unified-cache clustered VLIW processor is used as the baseline throughout this work
[Figure: n clusters, each with FUs and a register file, sharing a unified cache over memory buses and global communication buses]
Contributions
• GOAL: exploit heterogeneity in the L1 D-cache of clustered VLIW processors
• Power-efficient distributed L1 data cache
  • Divide the data cache into two modules and assign each to a cluster
  • Modules may be heterogeneous
  • Map variables statically to cache modules
  • Develop instruction scheduling techniques
• Results summary
  • A heterogeneous distributed data cache is a good design point
  • Distributed vs. unified data cache: distributed schemes outperform unified ones in both EDD and ED
  • No single distributed cache configuration is best across applications
  • A reconfigurable distributed cache allows additional improvements
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Variable-Based Multi-Module Cache
• The logical address space is split into a FIRST SPACE and a SECOND SPACE; each space holds its own global data, heap data, and stack (distributed stack frames with separate stack pointers SP1 and SP2)
• Variables (e.g., var X, var Y) are mapped statically to one of the two spaces, and each space is cached by one module (FIRST MODULE in cluster 1, SECOND MODULE in cluster 2), both backed by the L2 D-cache
• A local access (e.g., load X) hits the local module; a remote access (e.g., load *p) must stall the clusters, empty the communication buses, send a request, access memory, send the reply back, and resume execution
• Memory instructions therefore have a preferred cluster: their cluster affinity
• A "wrong" cluster assignment affects performance, not correctness
[Figure: the two address spaces and cache modules, two clusters with FUs and register files connected by register buses, and the steps of a remote access]
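To make the mapping idea concrete, here is a minimal Python sketch (my own illustration; the variables and the map are hypothetical, not the paper's toolchain) of a static variable-to-space map and how a memory instruction derives its preferred cluster from it:

```python
# Illustrative sketch only: the variables and the map are hypothetical.
FIRST, SECOND = 0, 1  # the two address spaces, one per cache module/cluster

# Static, compiler-produced map: variable -> address space.
var_to_space = {"X": FIRST, "Y": SECOND}

def preferred_cluster(variables_accessed):
    """A memory instruction prefers the cluster whose module caches the
    variables it accesses; a wrong choice costs a remote access
    (performance), never correctness."""
    spaces = {var_to_space[v] for v in variables_accessed}
    if len(spaces) == 1:
        return spaces.pop()   # clear preference for one cluster
    return None               # touches both spaces: no single preference

assert preferred_cluster(["X"]) == FIRST   # e.g., "load X" prefers cluster 1
```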
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Distributed Cache Configurations
• Each cache module is 8KB with 1 R/W port and can be FAST, SLOW (latency ↑, energy ↓), or absent (NONE)
• Five configurations are considered: FAST+NONE, FAST+FAST, FAST+SLOW, SLOW+NONE, SLOW+SLOW
[Figure: the five two-cluster configurations, each cluster with FU+RF, modules connected to the L2 D-cache over register buses]
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Instructions-to-Variables Graph
• Bipartite graph linking each memory instruction (LD1–LD5, ST1, ST2) to the variables it accesses (V1–V4)
• Built with profiling information
• Variables = global, local, heap
[Figure: example IVG with the variables assigned to the FIRST and SECOND modules of a two-cluster processor]
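A minimal sketch of how such a graph could be assembled from a profile; the profile format and the helper name are my assumptions, not the paper's interface:

```python
from collections import defaultdict

def build_ivg(profile):
    """Bipartite instructions-to-variables graph: maps each memory
    instruction to the variables it touched, weighted by profiled
    access counts. profile: iterable of (instruction, variable, count)."""
    ivg = defaultdict(lambda: defaultdict(int))
    for inst, var, count in profile:
        ivg[inst][var] += count   # edge weight = number of profiled accesses
    return ivg

# Example in the spirit of the slide's figure (counts are made up):
ivg = build_ivg([("LD1", "V1", 100), ("LD2", "V1", 60), ("ST1", "V2", 40)])
```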
Greedy Mapping / Scheduling Algorithm
• Four phases: compute IVG → compute affinities using the IVG + propagate affinities → compute mapping → schedule code
• Initial mapping: all variables to the first address space
• Assign affinities to instructions: each memory instruction gets a preferred cluster, expressed as a value in [0,1]
• Propagate affinities from memory instructions to the other instructions
• Schedule code + refine the mapping
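A rough Python sketch of this pipeline, under the simplifying assumption (mine) that a memory instruction's affinity is the fraction of its profiled accesses that fall in the second space; the scheduling pass itself is left as a callback:

```python
FIRST, SECOND = 0, 1

def memory_affinity(ivg, var_to_space, inst):
    """Affinity in [0,1]: 0 means all accesses hit the first space
    (cluster 1), 1 means all hit the second space (cluster 2)."""
    total = sum(ivg[inst].values())
    in_second = sum(c for v, c in ivg[inst].items()
                    if var_to_space[v] == SECOND)
    return in_second / total if total else 0.5

def greedy_map_and_schedule(ivg, all_vars, schedule_code):
    # 1. initial mapping: every variable to the first address space
    var_to_space = {v: FIRST for v in all_vars}
    # 2. per-memory-instruction affinities from the IVG
    affinity = {inst: memory_affinity(ivg, var_to_space, inst) for inst in ivg}
    # 3./4. scheduling refines the mapping (stand-in for the real pass)
    return schedule_code(affinity, var_to_space)
```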
Computing and Propagating Affinity
• Memory instructions take their affinity directly from the variables they access in the IVG (e.g., LD1, LD2, and ST1 access variables mapped to the first module → affinity 0; LD3 and LD4 → affinity 1)
• Affinity then propagates along data-dependence edges, taking instruction latencies and slack into account, so the remaining instructions also obtain an affinity (e.g., 0.4 for an instruction fed by both sides)
[Figure: a dependence graph of adds, loads, a multiply (L=3), and a store, annotated with latencies (L=1) and slacks (0, 2, 5), above the two clusters and cache modules]
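One way to picture the propagation step, assuming for brevity that a non-memory instruction simply averages the affinities of its already-labelled neighbours in the dependence graph (the slide's slack/latency weighting is omitted here):

```python
def propagate_affinity(neighbours, affinity):
    """neighbours: instruction -> its dependence-graph neighbours.
    affinity starts with memory instructions only and is extended
    to the remaining instructions by repeated averaging."""
    changed = True
    while changed:
        changed = False
        for inst, neigh in neighbours.items():
            if inst in affinity:              # already labelled, keep it
                continue
            known = [affinity[n] for n in neigh if n in affinity]
            if known:
                affinity[inst] = sum(known) / len(known)
                changed = True
    return affinity

# add5 consumes LD3 (affinity 0) and LD4 (affinity 1) -> affinity 0.5
aff = propagate_affinity({"add5": ["LD3", "LD4"]}, {"LD3": 0.0, "LD4": 1.0})
assert aff["add5"] == 0.5
```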
Cluster Assignment
• Cluster affinity + an affinity range are used to define a preferred cluster and guide the instruction-to-cluster assignment process
• Strongly preferred cluster (affinity outside the range, e.g., ≤ 0.3 or ≥ 0.7): schedule the instruction in that cluster
• Weakly preferred cluster (affinity inside the range): schedule the instruction where global communications are minimized
• Example with range (0.3, 0.7): IA (affinity 0) goes to cluster 1, IB (affinity 0.9) to cluster 2, and IC (affinity 0.4) wherever communications are minimized
[Figure: instructions IA, IB, IC with their affinities, variables V1–V3 with profiled access counts (100, 60, 40), and the two clusters' caches]
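The assignment rule itself is short; below is a sketch with the (0.3, 0.7) range from the slide, where comm_cost is a hypothetical callback estimating global communications for each choice:

```python
LOW, HIGH = 0.3, 0.7   # affinity range from the slide

def assign_cluster(affinity, comm_cost):
    """comm_cost(cluster) -> estimated global communications if the
    instruction is scheduled on that cluster."""
    if affinity <= LOW:
        return 1          # strongly preferred: cluster 1
    if affinity >= HIGH:
        return 2          # strongly preferred: cluster 2
    return min((1, 2), key=comm_cost)   # weak preference: minimize comms.

assert assign_cluster(0.0, lambda c: 0) == 1    # IA
assert assign_cluster(0.9, lambda c: 0) == 2    # IB
assert assign_cluster(0.4, lambda c: c) == 1    # IC: cheaper on cluster 1
```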
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Evaluation Framework
• IMPACT compiler infrastructure + 16 Mediabench benchmarks
• Cache parameters from CACTI 3.0 + SIA projections + ARM10 datasheets
  • FAST module: 8KB, 1 R/W port, latency 2; SLOW module: 8KB, 1 R/W port, latency 4 (latency ×2, energy ×1/3)
  • The data cache consumes 1/3 of the processor energy
  • Leakage accounts for 50% of the total energy
• Results outline
  • Distributed cache schemes: F+Ø, F+F, F+S, S+S, S+Ø, with F+Ø as the baseline throughout the presentation
  • Affinity range
  • EDD and ED comparison (the lower, the better)
  • Comparison with unified cache schemes (FAST and SLOW), scheduled with state-of-the-art techniques
  • Reconfigurable distributed cache
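For reference, the two figures of merit used in the results are the usual energy-delay products, where E is the total energy and D the execution time (lower is better):

```latex
\mathrm{ED} = E \cdot D, \qquad \mathrm{EDD} = E \cdot D^{2}
```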
Affinity Range
• Affinity plays a key role in cluster assignment: 36-44% better in EDD and 32% better in ED than scheduling without affinity
• The (0,1) affinity range is the best: ~92% of memory instructions access a single variable, so binary affinity suffices for memory instructions
Comparison With Unified Cache
• Baselines: unified FAST and SLOW cache schemes with the instruction scheduling of Aletà et al. (PACT’02)
• Distributed schemes are better than unified schemes: 29-31% better in EDD and 19-29% better in ED
[Figure: two-cluster processors with a unified FAST cache and a unified SLOW cache shared by both clusters]
Reconfigurable Distributed Cache
• The OS can set each module in one of three states: FAST mode / SLOW mode / turned off
• The OS reconfigures the cache on a context switch, depending on the applications scheduled in and out
• Two different VDD and VTH values are available for the cache
• Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002]
• A simple heuristic shows the potential: for each application, choose the estimated best cache configuration (see the sketch below)
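A sketch of that heuristic; the per-application EDD estimates and the set_cache_modules interface are hypothetical placeholders for whatever the OS would actually use:

```python
CONFIGS = ("FAST+NONE", "FAST+FAST", "FAST+SLOW", "SLOW+SLOW", "SLOW+NONE")

def set_cache_modules(config):
    # placeholder for the OS/hardware interface that puts each module in
    # FAST mode, SLOW mode, or turns it off (1-2 cycle overhead)
    print("switching cache to", config)

def on_context_switch(app, estimated_edd):
    """estimated_edd: config -> this application's estimated EDD,
    e.g. gathered offline by profiling (made-up numbers below)."""
    best = min(CONFIGS, key=lambda c: estimated_edd[c])
    set_cache_modules(best)

on_context_switch("epic", {c: i for i, c in enumerate(CONFIGS)})
```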
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Conclusions
• Distributed Variable-Based Multi-Module Cache
  • Affinity is crucial for achieving good performance: 36-44% better in EDD and 32% better in ED than no-affinity
  • Heterogeneity (FAST+SLOW) is a good design point: 4-11% better in EDD, and from 6% worse to 10% better in ED
  • No single cache configuration is the best; reconfigurable cache modules exploit an additional 3-4%
• Distributed schemes vs. unified schemes
  • All distributed schemes outperform unified ones: 29-31% better in EDD, 19-29% better in ED