Variable-Based Multi-Module Data Caches for Clustered VLIW Processors
Enric Gibert 1,2, Jaume Abella 1,2, Jesús Sánchez 1, Xavier Vera 1, Antonio González 1,2
1 Intel Barcelona Research Center, Intel Labs, Barcelona
2 Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona
Issue #1: Energy Consumption
• Energy is a first-class design goal
• Heterogeneity helps: ↓ supply voltage and/or ↑ threshold voltage
• Cache memory is a major consumer: in the ARM10, the D-cache accounts for 24% of dynamic energy and the I-cache for 22%
• Heterogeneity can be exploited in the D-cache of VLIW processors
[Figure: processor front-end/back-end, contrasting a higher-performance/higher-energy design with a lower-performance/lower-energy one]
Issue #2: Wire Delays
• Processors are moving from capacity-bound to communication-bound
• One possible solution: clustering
• A unified-cache clustered VLIW processor is used as the baseline throughout this work
[Figure: n clusters, each with FUs and a register file, sharing a unified cache over memory buses and global communication buses]
Contributions
• GOAL: exploit heterogeneity in the L1 D-cache of clustered VLIW processors
• Power-efficient distributed L1 data cache
  • Divide the data cache into two modules and assign each to a cluster
  • Modules may be heterogeneous
  • Map variables statically to cache modules
  • Develop instruction scheduling techniques
• Results summary
  • A heterogeneous distributed data cache is a good design point
  • Distributed vs. unified data cache: distributed schemes outperform unified ones in both EDD and ED
  • No single distributed cache configuration is best across applications
  • A reconfigurable distributed cache allows additional improvements
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Variable-Based Multi-Module Cache
• The logical address space is split into a FIRST SPACE and a SECOND SPACE; each space holds its own global data, heap data, and stack (distributed stack frames with separate stack pointers SP1 and SP2)
• Variables (e.g., var X, var Y) are mapped statically to one of the two spaces, and each space is cached by one module (FIRST MODULE in cluster 1, SECOND MODULE in cluster 2), both backed by the L2 D-cache
• A local access (e.g., load X) hits the local module; a remote access (e.g., load *p) must stall the clusters, empty the communication buses, send a request, access memory, send the reply back, and resume execution
• Memory instructions therefore have a preferred cluster: their cluster affinity
• A "wrong" cluster assignment affects performance, not correctness
[Figure: the two address spaces and cache modules, two clusters with FUs and register files connected by register buses, and the steps of a remote access]
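To make the mapping idea concrete, here is a minimal Python sketch (my own illustration; the variables and the map are hypothetical, not the paper's toolchain) of a static variable-to-space map and how a memory instruction derives its preferred cluster from it:

```python
# Illustrative sketch only: the variables and the map are hypothetical.
FIRST, SECOND = 0, 1  # the two address spaces, one per cache module/cluster

# Static, compiler-produced map: variable -> address space.
var_to_space = {"X": FIRST, "Y": SECOND}

def preferred_cluster(variables_accessed):
    """A memory instruction prefers the cluster whose module caches the
    variables it accesses; a wrong choice costs a remote access
    (performance), never correctness."""
    spaces = {var_to_space[v] for v in variables_accessed}
    if len(spaces) == 1:
        return spaces.pop()   # clear preference for one cluster
    return None               # touches both spaces: no single preference

assert preferred_cluster(["X"]) == FIRST   # e.g., "load X" prefers cluster 1
```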
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Distributed Cache Configurations
• Each cache module is 8KB with 1 R/W port and can be FAST, SLOW (latency ↑, energy ↓), or absent (NONE)
• Five configurations are considered: FAST+NONE, FAST+FAST, FAST+SLOW, SLOW+NONE, SLOW+SLOW
[Figure: the five two-cluster configurations, each cluster with FU+RF, modules connected to the L2 D-cache over register buses]
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Instructions-to-Variables Graph
• Bipartite graph linking each memory instruction (LD1–LD5, ST1, ST2) to the variables it accesses (V1–V4)
• Built with profiling information
• Variables = global, local, heap
[Figure: example IVG with the variables assigned to the FIRST and SECOND modules of a two-cluster processor]
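A minimal sketch of how such a graph could be assembled from a profile; the profile format and the helper name are my assumptions, not the paper's interface:

```python
from collections import defaultdict

def build_ivg(profile):
    """Bipartite instructions-to-variables graph: maps each memory
    instruction to the variables it touched, weighted by profiled
    access counts. profile: iterable of (instruction, variable, count)."""
    ivg = defaultdict(lambda: defaultdict(int))
    for inst, var, count in profile:
        ivg[inst][var] += count   # edge weight = number of profiled accesses
    return ivg

# Example in the spirit of the slide's figure (counts are made up):
ivg = build_ivg([("LD1", "V1", 100), ("LD2", "V1", 60), ("ST1", "V2", 40)])
```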
Greedy Mapping / Scheduling Algorithm
• Four phases: compute IVG → compute affinities using the IVG + propagate affinities → compute mapping → schedule code
• Initial mapping: all variables to the first address space
• Assign affinities to instructions: each memory instruction gets a preferred cluster, expressed as a value in [0,1]
• Propagate affinities from memory instructions to the other instructions
• Schedule code + refine the mapping
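A rough Python sketch of this pipeline, under the simplifying assumption (mine) that a memory instruction's affinity is the fraction of its profiled accesses that fall in the second space; the scheduling pass itself is left as a callback:

```python
FIRST, SECOND = 0, 1

def memory_affinity(ivg, var_to_space, inst):
    """Affinity in [0,1]: 0 means all accesses hit the first space
    (cluster 1), 1 means all hit the second space (cluster 2)."""
    total = sum(ivg[inst].values())
    in_second = sum(c for v, c in ivg[inst].items()
                    if var_to_space[v] == SECOND)
    return in_second / total if total else 0.5

def greedy_map_and_schedule(ivg, all_vars, schedule_code):
    # 1. initial mapping: every variable to the first address space
    var_to_space = {v: FIRST for v in all_vars}
    # 2. per-memory-instruction affinities from the IVG
    affinity = {inst: memory_affinity(ivg, var_to_space, inst) for inst in ivg}
    # 3./4. scheduling refines the mapping (stand-in for the real pass)
    return schedule_code(affinity, var_to_space)
```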
Computing and Propagating Affinity
• Memory instructions take their affinity directly from the variables they access in the IVG (e.g., LD1, LD2, and ST1 access variables mapped to the first module → affinity 0; LD3 and LD4 → affinity 1)
• Affinity then propagates along data-dependence edges, taking instruction latencies and slack into account, so the remaining instructions also obtain an affinity (e.g., 0.4 for an instruction fed by both sides)
[Figure: a dependence graph of adds, loads, a multiply (L=3), and a store, annotated with latencies (L=1) and slacks (0, 2, 5), above the two clusters and cache modules]
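One way to picture the propagation step, assuming for brevity that a non-memory instruction simply averages the affinities of its already-labelled neighbours in the dependence graph (the slide's slack/latency weighting is omitted here):

```python
def propagate_affinity(neighbours, affinity):
    """neighbours: instruction -> its dependence-graph neighbours.
    affinity starts with memory instructions only and is extended
    to the remaining instructions by repeated averaging."""
    changed = True
    while changed:
        changed = False
        for inst, neigh in neighbours.items():
            if inst in affinity:              # already labelled, keep it
                continue
            known = [affinity[n] for n in neigh if n in affinity]
            if known:
                affinity[inst] = sum(known) / len(known)
                changed = True
    return affinity

# add5 consumes LD3 (affinity 0) and LD4 (affinity 1) -> affinity 0.5
aff = propagate_affinity({"add5": ["LD3", "LD4"]}, {"LD3": 0.0, "LD4": 1.0})
assert aff["add5"] == 0.5
```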
Cluster Assignment
• Cluster affinity + an affinity range are used to define a preferred cluster and guide the instruction-to-cluster assignment process
• Strongly preferred cluster (affinity outside the range, e.g., ≤ 0.3 or ≥ 0.7): schedule the instruction in that cluster
• Weakly preferred cluster (affinity inside the range): schedule the instruction where global communications are minimized
• Example with range (0.3, 0.7): IA (affinity 0) goes to cluster 1, IB (affinity 0.9) to cluster 2, and IC (affinity 0.4) wherever communications are minimized
[Figure: instructions IA, IB, IC with their affinities, variables V1–V3 with profiled access counts (100, 60, 40), and the two clusters' caches]
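The assignment rule itself is short; below is a sketch with the (0.3, 0.7) range from the slide, where comm_cost is a hypothetical callback estimating global communications for each choice:

```python
LOW, HIGH = 0.3, 0.7   # affinity range from the slide

def assign_cluster(affinity, comm_cost):
    """comm_cost(cluster) -> estimated global communications if the
    instruction is scheduled on that cluster."""
    if affinity <= LOW:
        return 1          # strongly preferred: cluster 1
    if affinity >= HIGH:
        return 2          # strongly preferred: cluster 2
    return min((1, 2), key=comm_cost)   # weak preference: minimize comms.

assert assign_cluster(0.0, lambda c: 0) == 1    # IA
assert assign_cluster(0.9, lambda c: 0) == 2    # IB
assert assign_cluster(0.4, lambda c: c) == 1    # IC: cheaper on cluster 1
```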
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Evaluation Framework
• IMPACT compiler infrastructure + 16 Mediabench benchmarks
• Cache parameters from CACTI 3.0 + SIA projections + ARM10 datasheets
  • FAST module: 8KB, 1 R/W port, latency 2; SLOW module: 8KB, 1 R/W port, latency 4 (latency ×2, energy ×1/3)
  • The data cache consumes 1/3 of the processor energy
  • Leakage accounts for 50% of the total energy
• Results outline
  • Distributed cache schemes: F+Ø, F+F, F+S, S+S, S+Ø, with F+Ø as the baseline throughout the presentation
  • Affinity range
  • EDD and ED comparison (the lower, the better)
  • Comparison with unified cache schemes (FAST and SLOW), scheduled with state-of-the-art techniques
  • Reconfigurable distributed cache
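For reference, the two figures of merit used in the results are the usual energy-delay products, where E is the total energy and D the execution time (lower is better):

```latex
\mathrm{ED} = E \cdot D, \qquad \mathrm{EDD} = E \cdot D^{2}
```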
Affinity Range
• Affinity plays a key role in cluster assignment: 36-44% better in EDD and 32% better in ED than scheduling without affinity
• The (0,1) affinity range is the best: ~92% of memory instructions access a single variable, so binary affinity suffices for memory instructions
Comparison With Unified Cache
• Baselines: unified FAST and SLOW cache schemes with the instruction scheduling of Aletà et al. (PACT’02)
• Distributed schemes are better than unified schemes: 29-31% better in EDD and 19-29% better in ED
[Figure: two-cluster processors with a unified FAST cache and a unified SLOW cache shared by both clusters]
Reconfigurable Distributed Cache
• The OS can set each module in one of three states: FAST mode / SLOW mode / turned off
• The OS reconfigures the cache on a context switch, depending on the applications scheduled in and out
• Two different VDD and VTH values are available for the cache
• Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002]
• A simple heuristic shows the potential: for each application, choose the estimated best cache configuration (see the sketch below)
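A sketch of that heuristic; the per-application EDD estimates and the set_cache_modules interface are hypothetical placeholders for whatever the OS would actually use:

```python
CONFIGS = ("FAST+NONE", "FAST+FAST", "FAST+SLOW", "SLOW+SLOW", "SLOW+NONE")

def set_cache_modules(config):
    # placeholder for the OS/hardware interface that puts each module in
    # FAST mode, SLOW mode, or turns it off (1-2 cycle overhead)
    print("switching cache to", config)

def on_context_switch(app, estimated_edd):
    """estimated_edd: config -> this application's estimated EDD,
    e.g. gathered offline by profiling (made-up numbers below)."""
    best = min(CONFIGS, key=lambda c: estimated_edd[c])
    set_cache_modules(best)

on_context_switch("epic", {c: i for i, c in enumerate(CONFIGS)})
```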
Talk Outline • Variable-Based Multi-Module Data Cache • Distributed Cache Configurations • Instruction Scheduling • Results • Conclusions
Conclusions
• Distributed Variable-Based Multi-Module Cache
  • Affinity is crucial for achieving good performance: 36-44% better in EDD and 32% better in ED than no-affinity
  • Heterogeneity (FAST+SLOW) is a good design point: 4-11% better in EDD, and from 6% worse to 10% better in ED
  • No single cache configuration is the best; reconfigurable cache modules exploit an additional 3-4%
• Distributed schemes vs. unified schemes
  • All distributed schemes outperform unified ones: 29-31% better in EDD, 19-29% better in ED