An Interleaved Cache Clustered VLIW Processor

An Interleaved Cache Clustered VLIW Processor E. Gibert, J. Sánchez* and A. González* Dept. d’Arquitectura de Computadors Universitat Politècnica de Catalunya (UPC) *Also at Intel Barcelona Research Center June 2002

Motivation Motivation • Capacity-bound vs. Communication-bound • Solution: clustered microarchitectures • Partition some hardware resources • Simpler + faster • Power consumption • Communications not homogeneous • Goal: clustering the memory hierarchy in statically scheduled processors

Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions

Register-to-register buses Reg. File Reg. File Reg. File F.U. F.U. F.U. Cluster 1 Cluster 2 Cluster n ... L1 data cache L1 data cache L1 data cache Coherency network Next memory level State-of-the-art: MultiVLIW • Sánchez and González [MICRO’00]

TAG W0 W1 W2 W3 W4 W5 W6 W7 Subblock 1 TAG TAG TAG W2 W1 W3 W5 W7 W6 cache module cache module cache module Basic Interleaved Cache Clustered VLIW Processor NEXT MEMORY LEVEL cache block memory buses TAG W0 W4 cache module FUs FUs FUs FUs Reg. File Reg. File Reg. File Reg. File CLUSTER 2 CLUSTER 3 CLUSTER 4 CLUSTER 1 Register-to-register buses

Modulo Scheduling • Extract ILP from loops  overlap execution of iterations LOOP L A II A B A’ SC B C B’ A’’ Kernel C C’ B’’ C’’

Base Scheduling Algorithm • Used for Unified Cache II=II+1 0 >0 How Many? Select possible clusters START Next node Best profit in output edges 1 How Many? Sort nodes Schedule it >1 Least loaded

Interleaved Cache Scheduling Algorithm • Unroll loop to maximize instructions with a stride multiple of NxI  access ONE cache module • Assign latencies to memory instructions • Assign memory instructions to clusters: • IPBC (Interleaved Pre-Build Chains)  minimize stall time • IBC (Interleaved Build Chains)  minimize compute time

store load load store load store Memory Dependent Instructions IPBC  preferred info is used vs. IBC  minimize register comms. Preferred=1 add Preferred=2 load memory dependant chain 2 memory dependant chain 1 Preferred=1 add Preferred=2 store

ATTRACTION BUFFER ADDRESS TAG W TAG W2 W6 word select = hit data Enhacement: Attraction Buffers ADDRESS CACHE MODULE Local Data ABuffer data hit local logic data hit data hit

CLUSTER 3 CLUSTER 2 CLUSTER 1 a[3] a[0] a[7] a[4] An Example for (i=0; i<MAX; i++) { ld r3, a[i] r4 = OP(r3) st r4, b[i] } for (i=0; i<MAX; i+=4) { ld r31, a[i] (stride 16) ld r32, a[i+1] ld r33, a[i+2] ld r34, a[i+3] r41 = OP(r31) r42 = OP(r32) r43 = OP(r33) r44 = OP(r34) st r41, b[i] st r42, b[i+1] st r43, b[i+2] st r44, b[i+3] } 16 byte strides (NxI multiple) N = 4 clusters, I= 4 bytes Unroll x4 a[0] a[1] a[2] a[3] ... Local module ABuffer CLUSTER 4 ld r31, a[0]

Enhacement: Attraction Buffers • Why remote accesses? Why Attraction Buffers? • Double precision accesses  low benefit • Indirect accesses: a[b[i]] low benefit • “Unclear” preferred cluster  big benefit for (i=0; i<MAX; i++) for (k=i; k<i+MAX; k+=4) ld a[k], ld a[k+1], ld a[k+2], ld a[k+3] • Memory dependent chains big benefit • IBC: preferred cluster info is not used  big benefit

Experimental Framework • IMPACT C compiler • Modulo scheduling on hyperblock loops • BASE for a Unified Cache • IPBC and IBC for an Interleaved Cache • IPBC and IBC for the MultiVLIW • The same unrolling factor has been used for all architecture configurations! • Mediabench benchmark suite

Experimental Framework

Results (I) • IPBC vs IBC similar cycle count results • MultiVLIW vs Interleaved  similar results BUT… … lower complexity!

Results (II) • Memory dependent chains • Interleaved cacheworkload unbalance +  remote accesses • MultiVLIW  workload unbalance • Working on techniques to overcome scheduling restrictions

Results (III) • Local hits are increased by 15% • Stall time reduced by 30%

Conclusions • Scheduling Algorithms • Good latency assignment process (stall time accounts for 9% of execution time) • Coherence kept through memory dependent chains (5% cycle count degradation) • Attraction Buffers • Effective to increase local hits (15% average) + reduce stall time (30% average) • Reduce remote hits to previously accessed subblocks (70% average) • Cycle count results • similar to Unified Cache and MultiVLIW

Questions

An Interleaved Cache Clustered VLIW Processor