Clustered Data Cache Designs for VLIW Processors PhD Candidate: Enric Gibert Advisors: Antonio González, Jesús Sánchez
Motivation • Two major problems in processor design • Wire delays • Energy consumption • D. Matzke, "Will Physical Scalability Sabotage Performance Gains?", IEEE Computer 30(9), pp. 37-39, 1997 • Data from www.sandpile.org
Clustering • Figure: a 4-cluster VLIW datapath; each cluster (CLUSTER 1 … CLUSTER 4) has its own FUs and Reg. File, clusters exchange values over register-to-register communication buses, and all of them share a centralized L1 cache (backed by the L2 cache) through the memory buses
Data Cache • Latency • Energy • Leakage will soon dominate energy consumption • Cache memories will probably be the main source of leakage (S. Hill, Hot Chips 13) • In this Thesis: • Latency Reduction Techniques • Energy Reduction Techniques
Contributions of this Thesis • Memory hierarchy for clustered VLIW processors • Latency Reduction Techniques • Distribution of the Data Cache among clusters • Cost-effective cache coherence solutions • Word-Interleaved distributed data cache • Flexible Compiler-Managed L0 Buffers • Energy Reduction Techniques • Heterogeneous Multi-module Data Cache • Unified processors • Clustered processors
Evaluation Framework • IMPACT C compiler • Compile + optimize + memory disambiguation • Mediabench benchmark suite • Microarchitectural VLIW simulator
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Distributing the Data Cache • Figure: the L1 cache is now split into one L1 cache module per cluster; each cluster keeps its FUs and Reg. File, the cache modules connect to the L2 cache through the memory buses, and clusters still communicate over the register-to-register communication buses
MultiVLIW (Sánchez and González, MICRO-33) • Figure: distributed L1 data cache with one cache module per cluster; cache blocks are kept coherent across modules with an MSI cache coherence protocol
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Memory Coherence • Figure: cluster 1 updates X in its cache module while cluster 4 reads X; the new value of X must reach the other modules through the memory buses to the next memory level • Remote accesses, misses, replacements and other traffic compete for those buses → NON-DETERMINISTIC BUS LATENCY!!!
Coherence Solutions: Overview • Local scheduling solutions applied to loops • Memory Dependent Chains (MDC) • Data Dependence Graph Transformations (DDGT) • Store replication • Load-store synchronization • Software-based solutions with little hardware support • Applicable to different configurations • Word-interleaved cache • Replicated distributed cache • Flexible Compiler-Managed L0 Buffers
Scheme 1: Memory Dependent Chains (MDC) • Build sets of memory dependent instructions • Memory disambiguation by the compiler, with conservative assumptions • Assign all instructions in the same set to the same cluster • Figure: a store to X, a load from X and a later store to X (memory dependences), plus a register-dependent ADD, end up in the same cluster, so X is only accessed through that cluster's cache module (a sketch of the grouping follows)
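A minimal C sketch of the MDC idea, not the thesis' actual compiler code: memory-dependent instructions are grouped with union-find and every instruction of a set is forced onto one cluster. The instruction ids, dependence pairs and the round-robin cluster choice are invented for illustration.

/* Group memory-dependent instructions and assign each set to one cluster. */
#include <stdio.h>

#define N_INSTR 6
#define N_CLUSTERS 4

static int parent[N_INSTR];

static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
static void unite(int a, int b) { parent[find(a)] = find(b); }

int main(void) {
    for (int i = 0; i < N_INSTR; i++) parent[i] = i;

    /* Conservative memory dependences reported by the compiler's
     * disambiguator (pairs of instructions that may touch the same address). */
    int mem_dep[][2] = { {0, 2}, {2, 5} };   /* e.g. st X -> ld X -> st X */
    for (unsigned d = 0; d < sizeof mem_dep / sizeof mem_dep[0]; d++)
        unite(mem_dep[d][0], mem_dep[d][1]);

    /* Assign each memory-dependent set to a single cluster (round-robin here;
     * the real scheduler would use workload/communication heuristics). */
    int cluster_of_set[N_INSTR];
    for (int i = 0; i < N_INSTR; i++) cluster_of_set[i] = -1;
    int next = 0;
    for (int i = 0; i < N_INSTR; i++) {
        int s = find(i);
        if (cluster_of_set[s] < 0) cluster_of_set[s] = next++ % N_CLUSTERS;
        printf("instr %d -> cluster %d\n", i, cluster_of_set[s] + 1);
    }
    return 0;
}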
Scheme 2: DDG transformations (I) • Two transformations applied together, with little support from the hardware • Store replication overcomes memory-flow (MF) and memory-output (MO) dependences • Figure: the store to X is issued in every cluster (one local instance plus remote instances in the other clusters), so a later load from X finds the value in its local cache module
Scheme 2: DDG transformations (II) • Load-store synchronization overcomes memory-anti (MA) dependences • Figure: an explicit register dependence (MA SYNC) is added from the load from X to the store to X, so the store is not issued before the load has read the old value (a combined sketch of both transformations follows)
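A sketch of the two DDG transformations on a toy instruction list; the opcode and dependence representation are invented and not the thesis' IR. Store replication copies a store into every cluster so dependent loads become local; load-store synchronization turns a memory-anti dependence into an explicit register dependence.

#include <stdio.h>

#define N_CLUSTERS 4

typedef enum { LOAD, STORE, ALU } Op;
typedef struct { Op op; const char *addr; int cluster; int sync_with; } Instr;

/* Replicate a store into every cluster: the original is the "local instance",
 * the copies are the "remote instances". */
static void replicate_store(Instr *code, int *n, int st) {
    for (int c = 0; c < N_CLUSTERS; c++) {
        if (c == code[st].cluster) continue;
        code[*n] = code[st];
        code[*n].cluster = c;
        (*n)++;
    }
}

/* Add an explicit register dependence so the store waits for the load. */
static void sync_load_store(Instr *code, int ld, int st) {
    code[st].sync_with = ld;
}

int main(void) {
    Instr code[16] = {
        { STORE, "X", 0, -1 },   /* store to X in cluster 1 */
        { LOAD,  "X", 2, -1 },   /* load from X in cluster 3 */
        { STORE, "X", 3, -1 },   /* later store to X in cluster 4 */
    };
    int n = 3;
    replicate_store(code, &n, 0);   /* MF/MO: every cluster sees the value */
    sync_load_store(code, 1, 2);    /* MA: store 2 waits for load 1 */
    for (int i = 0; i < n; i++)
        printf("%s %s @cluster %d%s\n",
               code[i].op == STORE ? "st" : code[i].op == LOAD ? "ld" : "alu",
               code[i].addr, code[i].cluster + 1,
               code[i].sync_with >= 0 ? " (waits for earlier load)" : "");
    return 0;
}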
Results: Memory Coherence • Memory Dependent Chains (MDC) • Bad in general, since it restricts the assignment of instructions to clusters • Good when memory disambiguation is accurate • DDG Transformations (DDGT) • Good when there is pressure on the memory buses • Increases the number of local accesses • Bad when there is pressure on the register buses • Big increase in inter-cluster communications • Both solutions are useful, each for different cache schemes
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Word-Interleaved Cache • Simplify the hardware compared to MultiVLIW • Avoid replication • Strides of +1/-1 elements are predominant • Interleaving options: page interleaved, block interleaved, word interleaved • Word interleaving is best suited
Architecture • Figure: word-interleaved distributed cache; a cache block (words W0…W7, one TAG per module) is split into subblocks and consecutive words are interleaved across the four cache modules (cluster 1 holds W0/W4, cluster 2 W1/W5, cluster 3 W2/W6, cluster 4 W3/W7), all backed by the L2 cache • An access can thus be a local hit, local miss, remote hit or remote miss
Instruction Scheduling (I): Unrolling • Words a[0]…a[7] are interleaved across the four cache modules • Original loop: for (i=0; i<MAX; i++) { ld r3, @a[i] … } gives only 25% of local accesses, since the single load touches a different module each iteration • Unrolled by 4: for (i=0; i<MAX; i=i+4) { ld r3, @a[i]; ld r3, @a[i+1]; ld r3, @a[i+2]; ld r3, @a[i+3]; … } lets each load be assigned to the cluster owning its word, giving 100% of local accesses (a C-level sketch follows)
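A C-level sketch of the unrolling idea. Under word interleaving the cluster owning a[i] is assumed here to be (i mod 4); after unrolling by 4 the compiler can place each load on its owning cluster. MAX is assumed to be a multiple of 4 to keep the sketch short.

#include <stdio.h>

#define MAX 16

int main(void) {
    int a[MAX], sum = 0;
    for (int i = 0; i < MAX; i++) a[i] = i;

    /* Original loop: the single load touches a different cluster each
     * iteration, so only ~25% of its accesses are local. */
    for (int i = 0; i < MAX; i++)
        sum += a[i];

    /* Unrolled by 4: the load of a[i+k] always maps to cluster k, so each
     * of the four loads can be scheduled on "its" cluster (100% local). */
    int sum2 = 0;
    for (int i = 0; i < MAX; i += 4) {
        sum2 += a[i];      /* cluster 1 */
        sum2 += a[i + 1];  /* cluster 2 */
        sum2 += a[i + 2];  /* cluster 3 */
        sum2 += a[i + 3];  /* cluster 4 */
    }
    printf("%d %d\n", sum, sum2);
    return 0;
}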
Instruction Scheduling (II) • Assign an appropriate latency to each memory instruction • Small latencies: ILP ↑, stall time ↑ • Large latencies: ILP ↓, stall time ↓ • Heuristic: start with the largest latency (remote miss) and iteratively reassign more appropriate latencies (local miss, remote hit, local hit) • Figure: with a small assumed latency the dependent add is scheduled only a couple of cycles after the LD but may stall at run time; with a large assumed latency the add is scheduled several cycles later, lengthening the schedule but avoiding stalls (a sketch of the reassignment loop follows)
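A sketch of the iterative latency-assignment heuristic. The latency values and the schedule/stall cost model below are invented; the thesis' scheduler uses its own ILP and stall-time estimates. Every memory instruction starts at the worst latency class (remote miss) and is lowered one class at a time while the estimated total time keeps improving.

#include <stdio.h>

enum { LOCAL_HIT, REMOTE_HIT, LOCAL_MISS, REMOTE_MISS, N_CLASSES };
static const int latency[N_CLASSES] = { 1, 3, 6, 10 };   /* illustrative */

/* Placeholder estimate: longer assumed latencies lengthen the schedule,
 * shorter ones increase the risk of run-time stalls (weights are made up). */
static int estimated_time(const int cls[], int n) {
    int sched = 0, stalls = 0;
    for (int i = 0; i < n; i++) {
        sched  += latency[cls[i]];
        stalls += 2 * (N_CLASSES - 1 - cls[i]);
    }
    return sched + stalls;
}

int main(void) {
    int cls[4], n = 4;
    for (int i = 0; i < n; i++) cls[i] = REMOTE_MISS;   /* start pessimistic */

    int improved = 1;
    while (improved) {
        improved = 0;
        for (int i = 0; i < n; i++) {
            if (cls[i] == LOCAL_HIT) continue;
            int before = estimated_time(cls, n);
            cls[i]--;                                   /* try a lower class */
            if (estimated_time(cls, n) < before) improved = 1;
            else cls[i]++;                              /* keep the old class */
        }
    }
    for (int i = 0; i < n; i++)
        printf("mem instr %d: assumed latency %d\n", i, latency[cls[i]]);
    return 0;
}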
Instruction Scheduling (III) • Assign instructions to clusters • Non-memory instructions: minimize inter-cluster communications, maximize workload balance among clusters • Memory instructions: 2 heuristics • Preferred cluster (PrefClus): average preferred cluster of the memory dependent set • Minimize inter-cluster communications (MinComs): minimize communications for the 1st instruction of the memory dependent set
Memory Accesses • Sources of remote accesses: indirect accesses, restrictions from memory dependent chains, double-precision data, …
Attraction Buffers • Cost-effective mechanism to ↑ local accesses • Figure: a small Attraction Buffer (AB) per cluster keeps copies of remotely-owned words; for a loop doing load a[i] with i=i+4, the remotely-owned words a[0], a[4], … are attracted into the local AB, raising local accesses from 0% to 50% in the example • Results • ~15% INCREASE in local accesses • ~30-35% REDUCTION in stall time • 5-7% REDUCTION in overall execution time (a sketch follows)
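A sketch of an Attraction Buffer: a tiny per-cluster buffer that keeps copies of remotely-owned words so repeated accesses become local. The buffer size, the FIFO replacement and the ownership function are illustrative assumptions, not the thesis' exact design.

#include <stdio.h>
#include <stdbool.h>

#define N_CLUSTERS 4
#define AB_ENTRIES 4

typedef struct { long addr[AB_ENTRIES]; bool valid[AB_ENTRIES]; int next; } AB;
static AB ab[N_CLUSTERS];

static int owner_cluster(long word_addr) { return (int)(word_addr % N_CLUSTERS); }

/* Returns true if the access is served locally (own module or local AB). */
static bool access_word(int cluster, long word_addr) {
    if (owner_cluster(word_addr) == cluster) return true;        /* local module */
    AB *b = &ab[cluster];
    for (int i = 0; i < AB_ENTRIES; i++)
        if (b->valid[i] && b->addr[i] == word_addr) return true; /* AB hit */
    b->addr[b->next] = word_addr;                                /* attract word */
    b->valid[b->next] = true;
    b->next = (b->next + 1) % AB_ENTRIES;
    return false;                                                /* remote this time */
}

int main(void) {
    int local = 0, total = 0;
    for (int iter = 0; iter < 2; iter++)            /* traverse a[] twice */
        for (long i = 0; i < 8; i += 4, total++)    /* load a[i], stride 4 */
            local += access_word(2, i);             /* loads scheduled on cluster 3 */
    printf("local accesses: %d of %d\n", local, total);
    return 0;
}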
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Why L0 Buffers • Still keep the hardware simple, but… • …allow dynamic binding between addresses and clusters
L0 Buffers • Small number of entries → flexibility • Adaptive to the application + dynamic address-cluster binding • Controlled by software through load/store hints • Hints mark which instructions access the buffers and how • Flexible Compiler-Managed L0 Buffers • Figure: each cluster (INT/FP/MEM units + Register File) has its own L0 buffer, filled from the shared L1 cache through unpack logic
Mapping Flexibility • Figure: the words of an L1 block (16 bytes) can be brought into the L0 Buffers in two ways • Linear mapping: the block's words stay together in the requesting cluster's L0 Buffer, matching e.g. a load of a[0] with a 1-element stride • Interleaved mapping (1 cycle penalty in the unpack logic): the words are spread across the clusters' L0 Buffers (a[0]/a[4] with one cluster, a[1]/a[5] with the next, …), matching loads with a 4-element stride
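A sketch of the two L0-Buffer mapping styles. The exact mapping functions are assumptions for illustration, not the thesis' hardware definition: with linear mapping a whole L1 block lands in the requesting cluster's buffer; with interleaved mapping word i of the block goes to cluster i.

#include <stdio.h>

#define N_CLUSTERS    4
#define WORDS_PER_BLK 4   /* 16-byte L1 block, 4-byte words */

static int linear_cluster(int requesting_cluster, int word_idx) {
    (void)word_idx;
    return requesting_cluster;              /* block stays together */
}

static int interleaved_cluster(int requesting_cluster, int word_idx) {
    (void)requesting_cluster;
    return word_idx % N_CLUSTERS;           /* words spread over clusters */
}

int main(void) {
    for (int w = 0; w < 8; w++)             /* words a[0]..a[7], two blocks */
        printf("a[%d]: linear -> cluster %d, interleaved -> cluster %d\n",
               w, linear_cluster(1, w % WORDS_PER_BLK) + 1,
                  interleaved_cluster(1, w % WORDS_PER_BLK) + 1);
    return 0;
}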
Hints and L0-L1 Interface • Memory hints • Access or bypass the L0 Buffers • Data mapping: linear/interleaved • Prefetch hints for the next/previous blocks • L0 Buffers are write-through with respect to L1 • Simplifies replacements • Makes the hardware simple: no arbitration, no logic to pack data back correctly • Simplifies coherence among L0 Buffers
Instruction Scheduling • Selective loop unrolling • No unroll vs. unroll by N • Assign latencies to memory instructions • Critical instructions (small slack) use the L0 Buffers • Do not overflow the L0 Buffers • Keep a counter of free L0 Buffer entries per cluster • Do not schedule a critical instruction into a cluster whose counter is 0 • Memory coherence • Cluster assignment + instruction scheduling • Minimize global communications • Maximize workload balance • Critical instructions: priority to clusters where the L0 Buffer can be used • Explicit prefetching (a sketch of the counter check follows)
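A sketch of the "do not overflow the L0 Buffers" rule: a per-cluster counter of free L0 entries guides where a critical memory instruction may be scheduled. The data structures and the fallback ordering are invented for illustration.

#include <stdio.h>

#define N_CLUSTERS 4

static int free_l0_entries[N_CLUSTERS] = { 0, 2, 1, 0 };

/* Pick a cluster for a memory instruction. Critical instructions only go
 * where an L0 entry is available; non-critical ones may go anywhere and
 * simply bypass the L0 Buffer. */
static int pick_cluster(int preferred, int critical) {
    if (!critical || free_l0_entries[preferred] > 0) {
        if (critical) free_l0_entries[preferred]--;   /* reserve the entry */
        return preferred;
    }
    for (int c = 0; c < N_CLUSTERS; c++)
        if (free_l0_entries[c] > 0) { free_l0_entries[c]--; return c; }
    return preferred;   /* no entry anywhere: schedule it as a bypassing access */
}

int main(void) {
    printf("critical load -> cluster %d\n", pick_cluster(0, 1) + 1);
    printf("critical load -> cluster %d\n", pick_cluster(0, 1) + 1);
    printf("normal store  -> cluster %d\n", pick_cluster(3, 0) + 1);
    return 0;
}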
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Motivation • Energy consumption is a 1st-class design goal • Heterogeneity: ↓ supply voltage and/or ↑ threshold voltage in selected structures • Cache memory in the ARM10 • D-cache: 24% of dynamic energy • I-cache: 22% of dynamic energy • Can heterogeneity be exploited in the L1 D-cache? • Figure: some processor front-end/back-end structures are tuned for performance, others for energy
Multi-Module Data Cache • Instruction-Based Multi-Module (Abella and González, ICCD 2003): each access is steered, through a criticality table indexed by the instruction PC, to either a FAST or a SLOW cache module in front of the L2 D-cache • Variable-Based Multi-Module (this thesis): the L1 D-cache is split into a FAST and a SLOW module and the address space into a FAST and a SLOW space; heap and global data are placed in either space, and the stack uses two stack pointers (SP1/SP2) with distributed stack frames • It is possible to exploit heterogeneity!
Cache Configurations • Two 8KB cache modules, one per cluster (each cluster has FU+RF), both connected to the L2 D-CACHE; clusters communicate over register buses • FAST module: L=2, 1 R/W port • SLOW module: L=4 (latency x2), 1 R/W port, roughly 1/3 the energy • Configurations explored: FAST+FAST, FAST+SLOW, SLOW+SLOW, FAST+NONE and SLOW+NONE (NONE = second module turned off)
Instr.-to-Variable Graph (IVG) • Built with profiling information • Variables = global, local (stack), heap • Bipartite graph: memory instructions (LD1, LD2, ST1, LD3, ST2, LD4, LD5) on one side, variables (V1…V4) on the other; an edge means the instruction accesses the variable • The IVG tells which module (FIRST or SECOND) each instruction ends up accessing once variables are mapped
Greedy Mapping Algorithm • Initial mapping: all variables to the first address space • Assign affinities to memory instructions: a value in [0,1] expressing the preferred cluster • Propagate affinities to the other instructions • Schedule the code + refine the mapping • Flow: compute IVG → compute affinities + propagate affinities → compute mapping → schedule code (a sketch of the greedy loop follows)
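A sketch of a greedy variable-to-module mapping in the spirit of this slide; the real algorithm iterates with the scheduler, and the IVG sample, affinity values and 0.5 threshold below are illustrative assumptions. Each variable starts in the FIRST (fast) space and moves to the SECOND (slow) space when the instructions accessing it mostly prefer the slow cluster.

#include <stdio.h>

#define N_VARS  4
#define N_INSTR 7

/* ivg[i][v] = number of profiled accesses of instruction i to variable v. */
static const int ivg[N_INSTR][N_VARS] = {
    {100, 0, 0, 0}, {60, 40, 0, 0}, {0, 80, 0, 0},
    {0, 0, 50, 0},  {0, 0, 70, 0},  {0, 0, 0, 90}, {0, 0, 0, 30},
};
/* affinity[i] in [0,1]: 0 = prefers the fast cluster, 1 = prefers the slow one. */
static const double affinity[N_INSTR] = { 0.0, 0.2, 0.4, 0.9, 1.0, 0.8, 0.7 };

int main(void) {
    for (int v = 0; v < N_VARS; v++) {
        double weighted = 0.0; int total = 0;
        for (int i = 0; i < N_INSTR; i++) {
            weighted += affinity[i] * ivg[i][v];
            total    += ivg[i][v];
        }
        double var_aff = total ? weighted / total : 0.0;
        printf("V%d: affinity %.2f -> %s module\n",
               v + 1, var_aff, var_aff > 0.5 ? "SECOND (slow)" : "FIRST (fast)");
    }
    return 0;
}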
Computing and Propagating Affinity • Figure: a small data dependence graph (adds and loads with latency 1, a mul with latency 3, a store) annotated with each instruction's slack • Loads of variables mapped to the FIRST module get affinity 0, loads of SECOND-module variables get affinity 1, and affinities are propagated to the surrounding non-memory instructions (an add fed by both kinds of loads ends up with affinity 0.4)
Cluster Assignment • Cluster affinity + an affinity range (0.3, 0.7) are used to: • Define a preferred cluster • Guide the instruction-to-cluster assignment process • Strongly preferred cluster (affinity ≤ 0.3 or ≥ 0.7, like IA with affinity 0 or IB with affinity 0.9): schedule the instruction in that cluster • Weakly preferred cluster (like IC with affinity 0.4): schedule the instruction where global comms. are minimized (a sketch follows)
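A sketch of the affinity-range rule for cluster assignment. The (0.3, 0.7) range comes from the slide; the communication-cost tie-breaker is a stand-in for the real scheduler's heuristic, and the cost numbers are made up.

#include <stdio.h>

#define FAST_CLUSTER 0
#define SLOW_CLUSTER 1

/* comm_cost[c] would be the scheduler's estimate of inter-cluster
 * communications if the instruction were placed on cluster c. */
static int assign_cluster(double affinity, const int comm_cost[2]) {
    if (affinity <= 0.3) return FAST_CLUSTER;            /* strongly preferred */
    if (affinity >= 0.7) return SLOW_CLUSTER;            /* strongly preferred */
    return comm_cost[FAST_CLUSTER] <= comm_cost[SLOW_CLUSTER]
               ? FAST_CLUSTER : SLOW_CLUSTER;            /* weakly preferred */
}

int main(void) {
    int cost_a[2] = { 2, 5 }, cost_b[2] = { 4, 1 }, cost_c[2] = { 3, 2 };
    printf("IA (affinity 0.0) -> cluster %d\n", assign_cluster(0.0, cost_a) + 1);
    printf("IB (affinity 0.9) -> cluster %d\n", assign_cluster(0.9, cost_b) + 1);
    printf("IC (affinity 0.4) -> cluster %d\n", assign_cluster(0.4, cost_c) + 1);
    return 0;
}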
Other Results • Energy-delay (ED) results • The SLOW schemes are better • In all cases, these schemes are better than a unified cache: 29-31% better in EDD (energy x delay²), 19-29% better in ED (energy x delay) • No single configuration is best for all cases
Reconfigurable Cache Results • The OS can set each module in one state: • FAST mode / SLOW mode / Turned-off • The OS reconfigures the cache on a context switch • Depending on the applications scheduled in and scheduled out • Two different VDD and VTH for the cache • Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002] • Simple heuristic to show potential • For each application, choose the estimated best cache configuration
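A sketch of the OS-level reconfiguration heuristic: on a context switch the OS looks up the configuration estimated best for the incoming application and programs the two modules. The configuration table and application names are placeholders; only the FAST/SLOW/off modes and the 1-2 cycle reconfiguration overhead come from the slide.

#include <stdio.h>
#include <string.h>

typedef enum { MODE_FAST, MODE_SLOW, MODE_OFF } ModuleMode;
typedef struct { ModuleMode first, second; } CacheConfig;
typedef struct { const char *app; CacheConfig best; } BestConfig;

static const char *mode_name[] = { "FAST", "SLOW", "OFF" };

/* Per-application best configuration, as estimated off-line (placeholder data). */
static const BestConfig table[] = {
    { "appA",    { MODE_FAST, MODE_OFF  } },   /* FAST+NONE */
    { "appB",    { MODE_SLOW, MODE_SLOW } },   /* SLOW+SLOW */
    { "default", { MODE_FAST, MODE_SLOW } },   /* FAST+SLOW */
};

static CacheConfig config_for(const char *app) {
    for (unsigned i = 0; i + 1 < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].app, app) == 0) return table[i].best;
    return table[sizeof table / sizeof table[0] - 1].best;
}

static void context_switch(const char *incoming_app) {
    CacheConfig c = config_for(incoming_app);   /* reconfiguration: 1-2 cycles */
    printf("switch to %s: module1=%s module2=%s\n",
           incoming_app, mode_name[c.first], mode_name[c.second]);
}

int main(void) {
    context_switch("appA");
    context_switch("appB");
    return 0;
}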
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Conclusions • Cache partitioning is a good latency reduction technique • Cache heterogeneity can be exploited to improve energy efficiency • The most energy- and performance-efficient schemes are distributed data caches • Dynamic vs. static mapping between addresses and clusters • Dynamic is better for performance (L0 Buffers) • Static is better for energy consumption (Variable-Based mapping) • Hardware vs. software-based memory coherence solutions • Software solutions are viable
List of Publications • Distributed Data Cache Memories • ICS, 2002 • MICRO-35, 2002 • CGO-1, 2003 • MICRO-36, 2003 • IEEE Transactions on Computers, October 2005 • Concurrency & Computation: practice and experience • (to appear late ’05 / ’06) • Heterogeneous Data Cache Memories • Technical report UPC-DAC-RR-ARCO-2004-4, 2004 • PACT, 2005