Clustered Data Cache Designs for VLIW Processors PhD Candidate: Enric Gibert Advisors: Antonio González, Jesús Sánchez
Motivation • Two major problems in processor design • Wire delays • Energy consumption • D. Matzke, "Will Physical Scalability Sabotage Performance Gains?", IEEE Computer 30(9), pp. 37-39, 1997 • Data from www.sandpile.org
Clustering • Figure: a 4-cluster VLIW datapath; each cluster (CLUSTER 1 … CLUSTER 4) has its own FUs and Reg. File, clusters exchange values over register-to-register communication buses, and all of them share a centralized L1 cache (backed by the L2 cache) through the memory buses
Data Cache • Latency • Energy • Leakage will soon dominate energy consumption • Cache memories will probably be the main source of leakage (S. Hill, Hot Chips 13) • In this Thesis: • Latency Reduction Techniques • Energy Reduction Techniques
Contributions of this Thesis • Memory hierarchy for clustered VLIW processors • Latency Reduction Techniques • Distribution of the Data Cache among clusters • Cost-effective cache coherence solutions • Word-Interleaved distributed data cache • Flexible Compiler-Managed L0 Buffers • Energy Reduction Techniques • Heterogeneous Multi-module Data Cache • Unified processors • Clustered processors
Evaluation Framework • IMPACT C compiler • Compile + optimize + memory disambiguation • Mediabench benchmark suite • Microarchitectural VLIW simulator
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Distributing the Data Cache • Figure: the L1 cache is now split into one L1 cache module per cluster; each cluster keeps its FUs and Reg. File, the cache modules connect to the L2 cache through the memory buses, and clusters still communicate over the register-to-register communication buses
MultiVLIW (Sánchez and González, MICRO-33) • Figure: distributed L1 data cache with one cache module per cluster; cache blocks are kept coherent across modules with an MSI cache coherence protocol
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Memory Coherence • Figure: cluster 1 updates X in its cache module while cluster 4 reads X; the new value of X must reach the other modules through the memory buses to the next memory level • Remote accesses, misses, replacements and other traffic compete for those buses → NON-DETERMINISTIC BUS LATENCY!!!
Coherence Solutions: Overview • Local scheduling solutions applied to loops • Memory Dependent Chains (MDC) • Data Dependence Graph Transformations (DDGT) • Store replication • Load-store synchronization • Software-based solutions with little hardware support • Applicable to different configurations • Word-interleaved cache • Replicated distributed cache • Flexible Compiler-Managed L0 Buffers
Scheme 1: Memory Dependent Chains (MDC) • Build sets of memory dependent instructions • Memory disambiguation by the compiler, with conservative assumptions • Assign all instructions in the same set to the same cluster • Figure: a store to X, a load from X and a later store to X (memory dependences), plus a register-dependent ADD, end up in the same cluster, so X is only accessed through that cluster's cache module (a sketch of the grouping follows)
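A minimal C sketch of the MDC idea, not the thesis' actual compiler code: memory-dependent instructions are grouped with union-find and every instruction of a set is forced onto one cluster. The instruction ids, dependence pairs and the round-robin cluster choice are invented for illustration.

/* Group memory-dependent instructions and assign each set to one cluster. */
#include <stdio.h>

#define N_INSTR 6
#define N_CLUSTERS 4

static int parent[N_INSTR];

static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
static void unite(int a, int b) { parent[find(a)] = find(b); }

int main(void) {
    for (int i = 0; i < N_INSTR; i++) parent[i] = i;

    /* Conservative memory dependences reported by the compiler's
     * disambiguator (pairs of instructions that may touch the same address). */
    int mem_dep[][2] = { {0, 2}, {2, 5} };   /* e.g. st X -> ld X -> st X */
    for (unsigned d = 0; d < sizeof mem_dep / sizeof mem_dep[0]; d++)
        unite(mem_dep[d][0], mem_dep[d][1]);

    /* Assign each memory-dependent set to a single cluster (round-robin here;
     * the real scheduler would use workload/communication heuristics). */
    int cluster_of_set[N_INSTR];
    for (int i = 0; i < N_INSTR; i++) cluster_of_set[i] = -1;
    int next = 0;
    for (int i = 0; i < N_INSTR; i++) {
        int s = find(i);
        if (cluster_of_set[s] < 0) cluster_of_set[s] = next++ % N_CLUSTERS;
        printf("instr %d -> cluster %d\n", i, cluster_of_set[s] + 1);
    }
    return 0;
}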
Scheme 2: DDG transformations (I) • Two transformations applied together, with little support from the hardware • Store replication overcomes memory-flow (MF) and memory-output (MO) dependences • Figure: the store to X is issued in every cluster (one local instance plus remote instances in the other clusters), so a later load from X finds the value in its local cache module
Scheme 2: DDG transformations (II) • Load-store synchronization overcomes memory-anti (MA) dependences • Figure: an explicit register dependence (MA SYNC) is added from the load from X to the store to X, so the store is not issued before the load has read the old value (a combined sketch of both transformations follows)
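A sketch of the two DDG transformations on a toy instruction list; the opcode and dependence representation are invented and not the thesis' IR. Store replication copies a store into every cluster so dependent loads become local; load-store synchronization turns a memory-anti dependence into an explicit register dependence.

#include <stdio.h>

#define N_CLUSTERS 4

typedef enum { LOAD, STORE, ALU } Op;
typedef struct { Op op; const char *addr; int cluster; int sync_with; } Instr;

/* Replicate a store into every cluster: the original is the "local instance",
 * the copies are the "remote instances". */
static void replicate_store(Instr *code, int *n, int st) {
    for (int c = 0; c < N_CLUSTERS; c++) {
        if (c == code[st].cluster) continue;
        code[*n] = code[st];
        code[*n].cluster = c;
        (*n)++;
    }
}

/* Add an explicit register dependence so the store waits for the load. */
static void sync_load_store(Instr *code, int ld, int st) {
    code[st].sync_with = ld;
}

int main(void) {
    Instr code[16] = {
        { STORE, "X", 0, -1 },   /* store to X in cluster 1 */
        { LOAD,  "X", 2, -1 },   /* load from X in cluster 3 */
        { STORE, "X", 3, -1 },   /* later store to X in cluster 4 */
    };
    int n = 3;
    replicate_store(code, &n, 0);   /* MF/MO: every cluster sees the value */
    sync_load_store(code, 1, 2);    /* MA: store 2 waits for load 1 */
    for (int i = 0; i < n; i++)
        printf("%s %s @cluster %d%s\n",
               code[i].op == STORE ? "st" : code[i].op == LOAD ? "ld" : "alu",
               code[i].addr, code[i].cluster + 1,
               code[i].sync_with >= 0 ? " (waits for earlier load)" : "");
    return 0;
}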
Results: Memory Coherence • Memory Dependent Chains (MDC) • Bad in general, since it restricts the assignment of instructions to clusters • Good when memory disambiguation is accurate • DDG Transformations (DDGT) • Good when there is pressure on the memory buses • Increases the number of local accesses • Bad when there is pressure on the register buses • Big increase in inter-cluster communications • Both solutions are useful, each for different cache schemes
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Word-Interleaved Cache • Simplify the hardware compared to MultiVLIW • Avoid replication • Strides of +1/-1 elements are predominant • Interleaving options: page interleaved, block interleaved, word interleaved • Word interleaving is best suited
Architecture • Figure: word-interleaved distributed cache; a cache block (words W0…W7, one TAG per module) is split into subblocks and consecutive words are interleaved across the four cache modules (cluster 1 holds W0/W4, cluster 2 W1/W5, cluster 3 W2/W6, cluster 4 W3/W7), all backed by the L2 cache • An access can thus be a local hit, local miss, remote hit or remote miss
Instruction Scheduling (I): Unrolling • Words a[0]…a[7] are interleaved across the four cache modules • Original loop: for (i=0; i<MAX; i++) { ld r3, @a[i] … } gives only 25% of local accesses, since the single load touches a different module each iteration • Unrolled by 4: for (i=0; i<MAX; i=i+4) { ld r3, @a[i]; ld r3, @a[i+1]; ld r3, @a[i+2]; ld r3, @a[i+3]; … } lets each load be assigned to the cluster owning its word, giving 100% of local accesses (a C-level sketch follows)
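A C-level sketch of the unrolling idea. Under word interleaving the cluster owning a[i] is assumed here to be (i mod 4); after unrolling by 4 the compiler can place each load on its owning cluster. MAX is assumed to be a multiple of 4 to keep the sketch short.

#include <stdio.h>

#define MAX 16

int main(void) {
    int a[MAX], sum = 0;
    for (int i = 0; i < MAX; i++) a[i] = i;

    /* Original loop: the single load touches a different cluster each
     * iteration, so only ~25% of its accesses are local. */
    for (int i = 0; i < MAX; i++)
        sum += a[i];

    /* Unrolled by 4: the load of a[i+k] always maps to cluster k, so each
     * of the four loads can be scheduled on "its" cluster (100% local). */
    int sum2 = 0;
    for (int i = 0; i < MAX; i += 4) {
        sum2 += a[i];      /* cluster 1 */
        sum2 += a[i + 1];  /* cluster 2 */
        sum2 += a[i + 2];  /* cluster 3 */
        sum2 += a[i + 3];  /* cluster 4 */
    }
    printf("%d %d\n", sum, sum2);
    return 0;
}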
Instruction Scheduling (II) • Assign an appropriate latency to each memory instruction • Small latencies: ILP ↑, stall time ↑ • Large latencies: ILP ↓, stall time ↓ • Heuristic: start with the largest latency (remote miss) and iteratively reassign more appropriate latencies (local miss, remote hit, local hit) • Figure: with a small assumed latency the dependent add is scheduled only a couple of cycles after the LD but may stall at run time; with a large assumed latency the add is scheduled several cycles later, lengthening the schedule but avoiding stalls (a sketch of the reassignment loop follows)
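A sketch of the iterative latency-assignment heuristic. The latency values and the schedule/stall cost model below are invented; the thesis' scheduler uses its own ILP and stall-time estimates. Every memory instruction starts at the worst latency class (remote miss) and is lowered one class at a time while the estimated total time keeps improving.

#include <stdio.h>

enum { LOCAL_HIT, REMOTE_HIT, LOCAL_MISS, REMOTE_MISS, N_CLASSES };
static const int latency[N_CLASSES] = { 1, 3, 6, 10 };   /* illustrative */

/* Placeholder estimate: longer assumed latencies lengthen the schedule,
 * shorter ones increase the risk of run-time stalls (weights are made up). */
static int estimated_time(const int cls[], int n) {
    int sched = 0, stalls = 0;
    for (int i = 0; i < n; i++) {
        sched  += latency[cls[i]];
        stalls += 2 * (N_CLASSES - 1 - cls[i]);
    }
    return sched + stalls;
}

int main(void) {
    int cls[4], n = 4;
    for (int i = 0; i < n; i++) cls[i] = REMOTE_MISS;   /* start pessimistic */

    int improved = 1;
    while (improved) {
        improved = 0;
        for (int i = 0; i < n; i++) {
            if (cls[i] == LOCAL_HIT) continue;
            int before = estimated_time(cls, n);
            cls[i]--;                                   /* try a lower class */
            if (estimated_time(cls, n) < before) improved = 1;
            else cls[i]++;                              /* keep the old class */
        }
    }
    for (int i = 0; i < n; i++)
        printf("mem instr %d: assumed latency %d\n", i, latency[cls[i]]);
    return 0;
}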
Instruction Scheduling (III) • Assign instructions to clusters • Non-memory instructions: minimize inter-cluster communications, maximize workload balance among clusters • Memory instructions: 2 heuristics • Preferred cluster (PrefClus): average preferred cluster of the memory dependent set • Minimize inter-cluster communications (MinComs): minimize communications for the 1st instruction of the memory dependent set
Memory Accesses • Sources of remote accesses: indirect accesses, restrictions from memory dependent chains, double-precision data, …
Attraction Buffers • Cost-effective mechanism to ↑ local accesses • Figure: a small Attraction Buffer (AB) per cluster keeps copies of remotely-owned words; for a loop doing load a[i] with i=i+4, the remotely-owned words a[0], a[4], … are attracted into the local AB, raising local accesses from 0% to 50% in the example • Results • ~15% INCREASE in local accesses • ~30-35% REDUCTION in stall time • 5-7% REDUCTION in overall execution time (a sketch follows)
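A sketch of an Attraction Buffer: a tiny per-cluster buffer that keeps copies of remotely-owned words so repeated accesses become local. The buffer size, the FIFO replacement and the ownership function are illustrative assumptions, not the thesis' exact design.

#include <stdio.h>
#include <stdbool.h>

#define N_CLUSTERS 4
#define AB_ENTRIES 4

typedef struct { long addr[AB_ENTRIES]; bool valid[AB_ENTRIES]; int next; } AB;
static AB ab[N_CLUSTERS];

static int owner_cluster(long word_addr) { return (int)(word_addr % N_CLUSTERS); }

/* Returns true if the access is served locally (own module or local AB). */
static bool access_word(int cluster, long word_addr) {
    if (owner_cluster(word_addr) == cluster) return true;        /* local module */
    AB *b = &ab[cluster];
    for (int i = 0; i < AB_ENTRIES; i++)
        if (b->valid[i] && b->addr[i] == word_addr) return true; /* AB hit */
    b->addr[b->next] = word_addr;                                /* attract word */
    b->valid[b->next] = true;
    b->next = (b->next + 1) % AB_ENTRIES;
    return false;                                                /* remote this time */
}

int main(void) {
    int local = 0, total = 0;
    for (int iter = 0; iter < 2; iter++)            /* traverse a[] twice */
        for (long i = 0; i < 8; i += 4, total++)    /* load a[i], stride 4 */
            local += access_word(2, i);             /* loads scheduled on cluster 3 */
    printf("local accesses: %d of %d\n", local, total);
    return 0;
}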
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Why L0 Buffers • Still keep the hardware simple, but… • …allow dynamic binding between addresses and clusters
L0 Buffers • Small number of entries → flexibility • Adaptive to the application + dynamic address-cluster binding • Controlled by software through load/store hints • Hints mark which instructions access the buffers and how • Flexible Compiler-Managed L0 Buffers • Figure: each cluster (INT/FP/MEM units + Register File) has its own L0 buffer, filled from the shared L1 cache through unpack logic
Mapping Flexibility • Figure: the words of an L1 block (16 bytes) can be brought into the L0 Buffers in two ways • Linear mapping: the block's words stay together in the requesting cluster's L0 Buffer, matching e.g. a load of a[0] with a 1-element stride • Interleaved mapping (1 cycle penalty in the unpack logic): the words are spread across the clusters' L0 Buffers (a[0]/a[4] with one cluster, a[1]/a[5] with the next, …), matching loads with a 4-element stride
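A sketch of the two L0-Buffer mapping styles. The exact mapping functions are assumptions for illustration, not the thesis' hardware definition: with linear mapping a whole L1 block lands in the requesting cluster's buffer; with interleaved mapping word i of the block goes to cluster i.

#include <stdio.h>

#define N_CLUSTERS    4
#define WORDS_PER_BLK 4   /* 16-byte L1 block, 4-byte words */

static int linear_cluster(int requesting_cluster, int word_idx) {
    (void)word_idx;
    return requesting_cluster;              /* block stays together */
}

static int interleaved_cluster(int requesting_cluster, int word_idx) {
    (void)requesting_cluster;
    return word_idx % N_CLUSTERS;           /* words spread over clusters */
}

int main(void) {
    for (int w = 0; w < 8; w++)             /* words a[0]..a[7], two blocks */
        printf("a[%d]: linear -> cluster %d, interleaved -> cluster %d\n",
               w, linear_cluster(1, w % WORDS_PER_BLK) + 1,
                  interleaved_cluster(1, w % WORDS_PER_BLK) + 1);
    return 0;
}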
Hints and L0-L1 Interface • Memory hints • Access or bypass the L0 Buffers • Data mapping: linear/interleaved • Prefetch hints for the next/previous blocks • L0 Buffers are write-through with respect to L1 • Simplifies replacements • Makes the hardware simple: no arbitration, no logic to pack data back correctly • Simplifies coherence among L0 Buffers
Instruction Scheduling • Selective loop unrolling • No unroll vs. unroll by N • Assign latencies to memory instructions • Critical instructions (small slack) use the L0 Buffers • Do not overflow the L0 Buffers • Keep a counter of free L0 Buffer entries per cluster • Do not schedule a critical instruction into a cluster whose counter is 0 • Memory coherence • Cluster assignment + instruction scheduling • Minimize global communications • Maximize workload balance • Critical instructions: priority to clusters where the L0 Buffer can be used • Explicit prefetching (a sketch of the counter check follows)
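A sketch of the "do not overflow the L0 Buffers" rule: a per-cluster counter of free L0 entries guides where a critical memory instruction may be scheduled. The data structures and the fallback ordering are invented for illustration.

#include <stdio.h>

#define N_CLUSTERS 4

static int free_l0_entries[N_CLUSTERS] = { 0, 2, 1, 0 };

/* Pick a cluster for a memory instruction. Critical instructions only go
 * where an L0 entry is available; non-critical ones may go anywhere and
 * simply bypass the L0 Buffer. */
static int pick_cluster(int preferred, int critical) {
    if (!critical || free_l0_entries[preferred] > 0) {
        if (critical) free_l0_entries[preferred]--;   /* reserve the entry */
        return preferred;
    }
    for (int c = 0; c < N_CLUSTERS; c++)
        if (free_l0_entries[c] > 0) { free_l0_entries[c]--; return c; }
    return preferred;   /* no entry anywhere: schedule it as a bypassing access */
}

int main(void) {
    printf("critical load -> cluster %d\n", pick_cluster(0, 1) + 1);
    printf("critical load -> cluster %d\n", pick_cluster(0, 1) + 1);
    printf("normal store  -> cluster %d\n", pick_cluster(3, 0) + 1);
    return 0;
}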
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Motivation • Energy consumption is a 1st-class design goal • Heterogeneity: ↓ supply voltage and/or ↑ threshold voltage in selected structures • Cache memory in the ARM10 • D-cache: 24% of dynamic energy • I-cache: 22% of dynamic energy • Can heterogeneity be exploited in the L1 D-cache? • Figure: some processor front-end/back-end structures are tuned for performance, others for energy
Multi-Module Data Cache • Instruction-Based Multi-Module (Abella and González, ICCD 2003): each access is steered, through a criticality table indexed by the instruction PC, to either a FAST or a SLOW cache module in front of the L2 D-cache • Variable-Based Multi-Module (this thesis): the L1 D-cache is split into a FAST and a SLOW module and the address space into a FAST and a SLOW space; heap and global data are placed in either space, and the stack uses two stack pointers (SP1/SP2) with distributed stack frames • It is possible to exploit heterogeneity!
Cache Configurations • Two 8KB cache modules, one per cluster (each cluster has FU+RF), both connected to the L2 D-CACHE; clusters communicate over register buses • FAST module: L=2, 1 R/W port • SLOW module: L=4 (latency x2), 1 R/W port, roughly 1/3 the energy • Configurations explored: FAST+FAST, FAST+SLOW, SLOW+SLOW, FAST+NONE and SLOW+NONE (NONE = second module turned off)
Instr.-to-Variable Graph (IVG) • Built with profiling information • Variables = global, local (stack), heap • Bipartite graph: memory instructions (LD1, LD2, ST1, LD3, ST2, LD4, LD5) on one side, variables (V1…V4) on the other; an edge means the instruction accesses the variable • The IVG tells which module (FIRST or SECOND) each instruction ends up accessing once variables are mapped
Greedy Mapping Algorithm • Initial mapping: all variables to the first address space • Assign affinities to memory instructions: a value in [0,1] expressing the preferred cluster • Propagate affinities to the other instructions • Schedule the code + refine the mapping • Flow: compute IVG → compute affinities + propagate affinities → compute mapping → schedule code (a sketch of the greedy loop follows)
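A sketch of a greedy variable-to-module mapping in the spirit of this slide; the real algorithm iterates with the scheduler, and the IVG sample, affinity values and 0.5 threshold below are illustrative assumptions. Each variable starts in the FIRST (fast) space and moves to the SECOND (slow) space when the instructions accessing it mostly prefer the slow cluster.

#include <stdio.h>

#define N_VARS  4
#define N_INSTR 7

/* ivg[i][v] = number of profiled accesses of instruction i to variable v. */
static const int ivg[N_INSTR][N_VARS] = {
    {100, 0, 0, 0}, {60, 40, 0, 0}, {0, 80, 0, 0},
    {0, 0, 50, 0},  {0, 0, 70, 0},  {0, 0, 0, 90}, {0, 0, 0, 30},
};
/* affinity[i] in [0,1]: 0 = prefers the fast cluster, 1 = prefers the slow one. */
static const double affinity[N_INSTR] = { 0.0, 0.2, 0.4, 0.9, 1.0, 0.8, 0.7 };

int main(void) {
    for (int v = 0; v < N_VARS; v++) {
        double weighted = 0.0; int total = 0;
        for (int i = 0; i < N_INSTR; i++) {
            weighted += affinity[i] * ivg[i][v];
            total    += ivg[i][v];
        }
        double var_aff = total ? weighted / total : 0.0;
        printf("V%d: affinity %.2f -> %s module\n",
               v + 1, var_aff, var_aff > 0.5 ? "SECOND (slow)" : "FIRST (fast)");
    }
    return 0;
}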
Computing and Propagating Affinity • Figure: a small data dependence graph (adds and loads with latency 1, a mul with latency 3, a store) annotated with each instruction's slack • Loads of variables mapped to the FIRST module get affinity 0, loads of SECOND-module variables get affinity 1, and affinities are propagated to the surrounding non-memory instructions (an add fed by both kinds of loads ends up with affinity 0.4)
Cluster Assignment • Cluster affinity + an affinity range (0.3, 0.7) are used to: • Define a preferred cluster • Guide the instruction-to-cluster assignment process • Strongly preferred cluster (affinity ≤ 0.3 or ≥ 0.7, like IA with affinity 0 or IB with affinity 0.9): schedule the instruction in that cluster • Weakly preferred cluster (like IC with affinity 0.4): schedule the instruction where global comms. are minimized (a sketch follows)
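A sketch of the affinity-range rule for cluster assignment. The (0.3, 0.7) range comes from the slide; the communication-cost tie-breaker is a stand-in for the real scheduler's heuristic, and the cost numbers are made up.

#include <stdio.h>

#define FAST_CLUSTER 0
#define SLOW_CLUSTER 1

/* comm_cost[c] would be the scheduler's estimate of inter-cluster
 * communications if the instruction were placed on cluster c. */
static int assign_cluster(double affinity, const int comm_cost[2]) {
    if (affinity <= 0.3) return FAST_CLUSTER;            /* strongly preferred */
    if (affinity >= 0.7) return SLOW_CLUSTER;            /* strongly preferred */
    return comm_cost[FAST_CLUSTER] <= comm_cost[SLOW_CLUSTER]
               ? FAST_CLUSTER : SLOW_CLUSTER;            /* weakly preferred */
}

int main(void) {
    int cost_a[2] = { 2, 5 }, cost_b[2] = { 4, 1 }, cost_c[2] = { 3, 2 };
    printf("IA (affinity 0.0) -> cluster %d\n", assign_cluster(0.0, cost_a) + 1);
    printf("IB (affinity 0.9) -> cluster %d\n", assign_cluster(0.9, cost_b) + 1);
    printf("IC (affinity 0.4) -> cluster %d\n", assign_cluster(0.4, cost_c) + 1);
    return 0;
}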
Other Results • Energy-delay (ED) results • The SLOW schemes are better • In all cases, these schemes are better than a unified cache: 29-31% better in EDD (energy x delay²), 19-29% better in ED (energy x delay) • No single configuration is best for all cases
Reconfigurable Cache Results • The OS can set each module in one state: • FAST mode / SLOW mode / Turned-off • The OS reconfigures the cache on a context switch • Depending on the applications scheduled in and scheduled out • Two different VDD and VTH for the cache • Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002] • Simple heuristic to show potential • For each application, choose the estimated best cache configuration
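A sketch of the OS-level reconfiguration heuristic: on a context switch the OS looks up the configuration estimated best for the incoming application and programs the two modules. The configuration table and application names are placeholders; only the FAST/SLOW/off modes and the 1-2 cycle reconfiguration overhead come from the slide.

#include <stdio.h>
#include <string.h>

typedef enum { MODE_FAST, MODE_SLOW, MODE_OFF } ModuleMode;
typedef struct { ModuleMode first, second; } CacheConfig;
typedef struct { const char *app; CacheConfig best; } BestConfig;

static const char *mode_name[] = { "FAST", "SLOW", "OFF" };

/* Per-application best configuration, as estimated off-line (placeholder data). */
static const BestConfig table[] = {
    { "appA",    { MODE_FAST, MODE_OFF  } },   /* FAST+NONE */
    { "appB",    { MODE_SLOW, MODE_SLOW } },   /* SLOW+SLOW */
    { "default", { MODE_FAST, MODE_SLOW } },   /* FAST+SLOW */
};

static CacheConfig config_for(const char *app) {
    for (unsigned i = 0; i + 1 < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].app, app) == 0) return table[i].best;
    return table[sizeof table / sizeof table[0] - 1].best;
}

static void context_switch(const char *incoming_app) {
    CacheConfig c = config_for(incoming_app);   /* reconfiguration: 1-2 cycles */
    printf("switch to %s: module1=%s module2=%s\n",
           incoming_app, mode_name[c.first], mode_name[c.second]);
}

int main(void) {
    context_switch("appA");
    context_switch("appB");
    return 0;
}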
Presentation Outline • Latency reduction techniques • Software memory coherence in distributed caches • Word-interleaved distributed cache • Flexible Compiler-Managed L0 Buffers • Energy reduction techniques • Multi-Module cache for clustered VLIW processor • Conclusions
Conclusions • Cache partitioning is a good latency reduction technique • Cache heterogeneity can be exploited to improve energy efficiency • The most energy- and performance-efficient schemes are distributed data caches • Dynamic vs. static mapping between addresses and clusters • Dynamic is better for performance (L0 Buffers) • Static is better for energy consumption (Variable-Based mapping) • Hardware vs. software-based memory coherence solutions • Software solutions are viable
List of Publications • Distributed Data Cache Memories • ICS, 2002 • MICRO-35, 2002 • CGO-1, 2003 • MICRO-36, 2003 • IEEE Transactions on Computers, October 2005 • Concurrency & Computation: practice and experience • (to appear late ’05 / ’06) • Heterogeneous Data Cache Memories • Technical report UPC-DAC-RR-ARCO-2004-4, 2004 • PACT, 2005