Enric Gibert¹, Jesús Sánchez², Antonio González¹,²

Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache
Enric Gibert¹, Jesús Sánchez², Antonio González¹,²
¹ Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
² Intel Barcelona Research Center, Intel Labs, Barcelona

Motivation
• Capacity-bound vs. communication-bound
• Clustered microarchitectures
  • Simpler + faster
  • Power consumption
  • Communications not homogeneous
• Clustering → embedded/DSP domain

Clustered Microarchitectures
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with its own functional units (FUs), register file, and L1 cache module. The clusters are connected by register-to-register communication buses; the L1 cache modules are connected to the L2 cache by memory buses.]

Contributions
• Distribution of data cache
  • Architecture design + data mapping
  • Word-interleaved scheme [ICS'02]
  • Appropriate scheduling techniques [MICRO'02]
• Memory coherence
  • Scheduling techniques for memory coherence
  • Local software-based techniques
  • Applied to word-interleaved cache
    • Complex configuration (with Attraction Buffers; refer to paper)
    • Simple configuration (without Attraction Buffers)
  • Applicable to any other cache configuration

Talk Outline
• Architecture and Scheduling Algorithms
• Memory Coherence Problem
• Solutions
  • Memory Dependent Chains (MDC)
  • DDG Transformations (DDGT)
• Evaluation
• Conclusions

Word-Interleaved Distribution
[Figure: an L2 cache block (TAG, W0 to W7) is split into four subblocks, one per cache module: cluster 1 holds TAG W0 W4, cluster 2 holds TAG W1 W5, cluster 3 holds TAG W2 W6, cluster 4 holds TAG W3 W7. Each cluster has its own functional units and register file, and the clusters are connected by register-to-register communication buses. An access is classified as a local hit, remote hit, local miss, or remote miss.]
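The word-to-module mapping in the figure can be sketched as a modulo over the word index. This is a minimal illustration, not the authors' hardware description: the 4-byte word size and the names `home_cluster` and `is_local` are assumptions made for the example.

```python
WORD_BYTES = 4     # assumed word size (interleaving granularity)
NUM_CLUSTERS = 4   # four clusters, as in the figure


def home_cluster(addr: int) -> int:
    """Return the 1-based cluster whose cache module holds the word at byte address addr."""
    return (addr // WORD_BYTES) % NUM_CLUSTERS + 1


def is_local(addr: int, issuing_cluster: int) -> bool:
    """An access is local when it is issued from the cluster that holds the word."""
    return home_cluster(addr) == issuing_cluster


# Distribute words W0..W7 of one cache block over the four cache modules.
layout = {c: [] for c in range(1, NUM_CLUSTERS + 1)}
for w in range(8):
    layout[home_cluster(w * WORD_BYTES)].append(f"W{w}")
```

Under these assumptions `layout` reproduces the subblocks of the figure: W0 and W4 in cluster 1, W1 and W5 in cluster 2, and so on.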

Scheduling Techniques
• Modulo scheduling
• Loop unrolling
• Assignment of latencies
• Padding + Profiling

Original loop:
  for (i=0; i<MAX; i++) {
    ld r3, a[i]
    r4 = OP(r3)
    st r4, b[i]
  }

Unrolled by 4:
  for (i=0; i<MAX; i+=4) {
    ld r31, a[i]     (stride 16 bytes)
    ld r32, a[i+1]   (stride 16 bytes)
    ld r33, a[i+2]   (stride 16 bytes)
    ld r34, a[i+3]   (stride 16 bytes)
    ...
  }

[Figure: a[0] and a[4] reside in the cache module of cluster 1, a[1] and a[5] in cluster 2, a[2] and a[6] in cluster 3, a[3] and a[7] in cluster 4. In the original loop, ld r3, a[i] touches all four modules; after unrolling, ld r31 to ld r34 each map to a single module.]
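The effect of unrolling on the access pattern can be checked with a small sketch. It assumes 4-byte array elements, 4-byte interleaving, four clusters, and an array `a` aligned so that `a[0]` falls in cluster 1 (which is what padding is meant to arrange); the function name `clusters_touched` is hypothetical.

```python
def clusters_touched(offset, step, n_iters, elem_bytes=4, word_bytes=4, num_clusters=4):
    """Set of 1-based clusters accessed by a load of a[i + offset], for i = 0, step, 2*step, ..."""
    touched = set()
    for k in range(n_iters):
        i = k * step
        addr = (i + offset) * elem_bytes          # byte offset from the start of a
        touched.add((addr // word_bytes) % num_clusters + 1)
    return touched


# Original loop: one load walks a[i] with a 4-byte stride and hits every module.
original = clusters_touched(offset=0, step=1, n_iters=16)

# Unrolled by 4: each of the four loads has a 16-byte stride and stays in one module.
unrolled = [clusters_touched(offset=k, step=4, n_iters=16) for k in range(4)]
```

With a 16-byte stride every unrolled load has a single home cluster, so the scheduler can place it there and treat its accesses as local.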

Cluster Assignment
• Non-memory instructions
  • Minimize register communications
  • Maximize workload balance
• Memory instructions → 2 heuristics:
  • PrefClus Heuristic
    • Preferred Cluster = most accessed cluster
    • Profiling + Padding
  • MinComs Heuristic
    • Minimize register communications
    • Maximize workload balance
    • Post-pass phase to increase local accesses
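As an illustration of "Preferred Cluster = most accessed cluster", the sketch below derives a preferred cluster from a list of profiled byte addresses. The profile format, the 4-byte interleaving, and the name `preferred_cluster` are assumptions for the example, not the compiler's actual interface.

```python
from collections import Counter


def preferred_cluster(profiled_addresses, word_bytes=4, num_clusters=4):
    """Return the 1-based cluster that a memory instruction accessed most often in the profile."""
    counts = Counter((addr // word_bytes) % num_clusters + 1 for addr in profiled_addresses)
    # most_common(1) returns the cluster with the highest access count;
    # ties are resolved in favor of the cluster seen first in the profile.
    return counts.most_common(1)[0][0]


# A load that mostly walks with a 16-byte stride, plus one stray access to cluster 2.
profile = [0, 16, 32, 4]
pref = preferred_cluster(profile)
```

Here three of the four profiled accesses fall in cluster 1, so that is the preferred cluster.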

Memory Coherence Problem
[Figure: four clusters whose cache modules are connected by memory buses to the next memory level; the modules shown hold a[0], a[4] and a[3], a[7]. A store to a[0] issued from one cluster travels over the memory buses to update a[0], while a read of a[0] is issued from another cluster.]
• NON-DETERMINISTIC BUS LATENCY!!!
  • Remote accesses
  • Misses
  • Replacements
  • Others

Solutions Outline
• Local scheduling solutions → applied at a loop granularity
  • Memory Dependent Chains (MDC)
  • Data Dependence Graph Transformations (DDGT)
    • Store replication
    • Load-store synchronization
• Software-based solutions
• Applicable to other configurations
  • Replicated distributed cache
  • MultiVLIW [MICRO'00]
  • …

Memory Dependent Chains
• Sets of aliased instructions:
  • Memory Dependent Chains (MDC)
• Instructions in the same set:
  • Assigned to the same cluster
• Restrictions on cluster assignment
  • PrefClus: average preferred cluster
  • MinComs: minimize communications when scheduling the first node
[Figure: example DDG with nodes n1 load, n2 load, n3 add, n4 store, n6 load, n7 div, n8 add, connected by edges labeled MF (memory-flow), MA (memory-anti), and RF (register-flow).]
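One way to picture how the chains are formed is to group memory instructions that are connected through memory dependence edges. The sketch below does this with a union-find over a small, hypothetical DDG (the node names and edges are invented for the example and are not the graph on the slide); it only builds the sets and does not perform the cluster assignment itself.

```python
MEMORY_KINDS = {"MF", "MA", "MO"}   # memory-flow, memory-anti, memory-output


def memory_dependent_chains(mem_nodes, edges):
    """Group memory instructions connected by memory dependences into chains."""
    parent = {n: n for n in mem_nodes}

    def find(n):
        # Follow parent links to the representative, compressing the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for src, dst, kind in edges:
        if kind in MEMORY_KINDS:          # register-flow edges do not create chains
            parent[find(src)] = find(dst)

    chains = {}
    for n in mem_nodes:
        chains.setdefault(find(n), []).append(n)
    return sorted(sorted(chain) for chain in chains.values())


# Hypothetical DDG: ld1 -MA-> st1 -MF-> ld2, with ld3 independent of the others.
mem_nodes = ["ld1", "st1", "ld2", "ld3"]
edges = [
    ("ld1", "st1", "MA"),
    ("st1", "ld2", "MF"),
    ("ld1", "add", "RF"),
    ("add", "st1", "RF"),
]
chains = memory_dependent_chains(mem_nodes, edges)
```

In this example `ld1`, `st1`, and `ld2` form one chain and must share a cluster, while `ld3` remains free.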

Memory Dependent Chains
[Figure: four clusters whose cache modules are connected by memory buses to the next memory level; the modules shown hold a[0], a[4] and a[3], a[7]. The store to a[0] and the load from a[0] are placed in the same cluster.]

DDGT: Store Replication
• Overcome MEM_FLOW (MF) and MEM_OUT (MO)
[Figure: two before/after examples of store replication. In both, store A is replicated into store A, store A', store A'', store A''' (one local instance and the remaining remote instances). Left: store A has an MF dependence to load B. Right: store A has an MO dependence to store B, and store B is likewise replicated into store B, store B', store B'', store B'''.]
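A rough sketch of the idea follows. It assumes, based on the "local instance / remote instances" labels, that one instance of the store is placed in every cluster and that only the instance running in the cluster that holds the word performs the update; the function names, the 4-byte interleaving, and the dictionary-based cache modules are all hypothetical.

```python
def replicate_store(name, num_clusters=4):
    """Create one instance of the store per cluster: (instance name, 1-based cluster)."""
    return [(f"{name}_c{c}", c) for c in range(1, num_clusters + 1)]


def execute_replicated_store(instances, addr, value, modules, word_bytes=4):
    """Run every instance; only the one in the word's home cluster updates its module."""
    home = (addr // word_bytes) % len(modules) + 1
    performed = []
    for inst, cluster in instances:
        if cluster == home:               # local instance: performs the store
            modules[cluster][addr] = value
            performed.append(inst)
        # remote instances: assumed to be discarded without touching memory
    return performed


modules = {c: {} for c in range(1, 5)}    # one (empty) cache module per cluster
instances = replicate_store("stA")
performed = execute_replicated_store(instances, addr=8, value=42, modules=modules)
```

Because the update is always performed locally, it does not travel over the memory buses, at the cost of sending the store's operands to every cluster.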

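DDGT: Store Replication
[Figure: four clusters whose cache modules are connected by memory buses to the next memory level; the modules shown hold a[0], a[4] and a[3], a[7]. One store instance is local to the module that holds the data; the instances in the other clusters are remote.]
• Increases the number of register communications!!!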

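DDGT: ld-st Synchronization
• Overcome MEM_ANTI (MA) dependences
• Special cases:
  • Store is already REG_FLOW dependent on the load
  • Impossible recurrences
[Figure: two before/after examples of load-store synchronization. First: load A feeds add through RF and has an MA dependence to store B; after the transformation a SYNC edge links add to store B. Second: a fake consumer (fake cons) of load A is added and synchronized with store B through a SYNC edge; an MO dependence links store B and store C.]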
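The transformation in the figure can be sketched on an edge list. The sketch assumes that each MA edge is replaced by a SYNC edge from a register consumer of the load to the store, that a fake consumer is created when the load has none, and that nothing is added when the store already consumes the load's value. It does not handle the "impossible recurrences" case, and the function and node names are hypothetical.

```python
def sync_load_store(edges):
    """edges: list of (src, dst, kind). Return a new list with MA edges replaced by SYNC edges."""
    out = [e for e in edges if e[2] != "MA"]
    for load, store, kind in edges:
        if kind != "MA":
            continue
        consumers = [d for s, d, k in edges if s == load and k == "RF"]
        if store in consumers:
            continue                       # store already REG_FLOW dependent on the load
        if consumers:
            cons = consumers[0]            # reuse an existing consumer of the load
        else:
            cons = f"fake_cons({load})"    # no consumer: add a fake one
            out.append((load, cons, "RF"))
        out.append((cons, store, "SYNC"))  # store waits for the consumer, hence for the load
    return out


with_consumer = sync_load_store([("ldA", "add", "RF"), ("ldA", "stB", "MA")])
without_consumer = sync_load_store([("ldA", "stB", "MA")])
already_dependent = sync_load_store([("ldA", "stB", "RF"), ("ldA", "stB", "MA")])
```

The three calls correspond to the ordinary case, the fake-consumer case, and the first special case listed on the slide.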

MDC Solution: Case Study
[Figure: DDG with load A (always accesses data in cluster 1), load B (always accesses data in cluster 2), and store C, linked by MA and MF dependences. Modulo reservation tables (MRT) over clusters C1 to C4 are shown with IIres=2 and IIres=3. Latency of a local hit (LH) = 1 cycle; latency of a remote hit (RH) = 5 cycles.]
• Impact on compute time
  • May increase the IIres
• Impact on stall time
  • May increase remote accesses
  • Extra stall cycles = 3 cycles / iteration
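Why forcing a chain into one cluster may raise the IIres can be seen from a simple resource bound. The sketch assumes one memory unit per cluster and considers only the three memory instructions of the case study, so it gives the lower bound they contribute rather than the exact IIres values of the figure, which also depend on instructions not listed here. The name `ii_res` is hypothetical.

```python
from collections import Counter
from math import ceil


def ii_res(assignment, mem_units_per_cluster=1):
    """Lower bound on the initiation interval imposed by memory units.

    assignment maps each memory instruction to its 1-based cluster.
    """
    per_cluster = Counter(assignment.values())
    return max(ceil(n / mem_units_per_cluster) for n in per_cluster.values())


# Free assignment: each instruction goes to a different cluster.
free = ii_res({"load A": 1, "load B": 2, "store C": 3})

# MDC: the three aliased instructions are forced into the same cluster.
chained = ii_res({"load A": 1, "load B": 1, "store C": 1})
```

With the chain in one cluster, its single memory unit must serve all three instructions, so the bound grows from 1 to 3 cycles.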

DDGT Solution: Case Study
[Figure: DDG with load A, a set of memory instructions X, and store B, linked by MA and MF dependences. Modulo reservation tables (MRT) over clusters C1 to C4 are shown with IIres=2 and IIres=3.]
• Impact on compute time
  • More instructions (IIres)
    • Store replication
    • Fake consumers (few)
  • Register communications
• Impact on stall time
  • Small
  • New dependences may decrease slack of some memory instructions

Evaluation Framework
• IMPACT C compiler
  • Compile + optimize + memory disambiguation
• Mediabench benchmark suite

Evaluation Framework

Local vs. Remote Accesses

Execution Time

Other Configurations

• Configuration 1

                   # Buses   Latency
  Register buses      2         4
  Memory buses        4         2

  More pressure on register buses
  MDC outperforms DDGT in all cases → MDC requires fewer register communications

• Configuration 2

                   # Buses   Latency
  Register buses      4         2
  Memory buses        2         4

  More pressure on memory buses
  DDGT outperforms best MDC in several cases: epicdec 17%, pgpdec 20%, pgpenc 9%, rasta 7%…

Conclusions
• Memory coherence problem
• Two software-based solutions: MDC and DDGT
  • Applied to a word-interleaved cache clustered VLIW processor
• MDC vs. DDGT
  • Results depend on the architecture configuration
  • MDC outperforms DDGT in most cases
  • DDGT better by up to 20% in a specific configuration
• Sets of memory dependent instructions are small
• DDGT → freedom in cluster assignment
  • Increase local accesses by 15% → reduce stall time

Questions?
