240 likes | 379 Views
An Interleaved Cache Clustered VLIW Processor. E. Gibert, J. Sánchez * and A. González * Dept. d’Arquitectura de Computadors Universitat Politècnica de Catalunya (UPC) * Also at Intel Barcelona Research Center June 2002. Motivation. Motivation. Capacity -bound vs. Communication -bound
E N D
An Interleaved Cache Clustered VLIW Processor E. Gibert, J. Sánchez* and A. González* Dept. d’Arquitectura de Computadors Universitat Politècnica de Catalunya (UPC) *Also at Intel Barcelona Research Center June 2002
Motivation Motivation • Capacity-bound vs. Communication-bound • Solution: clustered microarchitectures • Partition some hardware resources • Simpler + faster • Power consumption • Communications not homogeneous • Goal: clustering the memory hierarchy in statically scheduled processors
Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions
Register-to-register buses Reg. File Reg. File Reg. File F.U. F.U. F.U. Cluster 1 Cluster 2 Cluster n ... L1 data cache L1 data cache L1 data cache Coherency network Next memory level State-of-the-art: MultiVLIW • Sánchez and González [MICRO’00]
Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions
TAG W0 W1 W2 W3 W4 W5 W6 W7 Subblock 1 TAG TAG TAG W2 W1 W3 W5 W7 W6 cache module cache module cache module Basic Interleaved Cache Clustered VLIW Processor NEXT MEMORY LEVEL cache block memory buses TAG W0 W4 cache module FUs FUs FUs FUs Reg. File Reg. File Reg. File Reg. File CLUSTER 2 CLUSTER 3 CLUSTER 4 CLUSTER 1 Register-to-register buses
Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions
Modulo Scheduling • Extract ILP from loops overlap execution of iterations LOOP L A II A B A’ SC B C B’ A’’ Kernel C C’ B’’ C’’
Base Scheduling Algorithm • Used for Unified Cache II=II+1 0 >0 How Many? Select possible clusters START Next node Best profit in output edges 1 How Many? Sort nodes Schedule it >1 Least loaded
Interleaved Cache Scheduling Algorithm • Unroll loop to maximize instructions with a stride multiple of NxI access ONE cache module • Assign latencies to memory instructions • Assign memory instructions to clusters: • IPBC (Interleaved Pre-Build Chains) minimize stall time • IBC (Interleaved Build Chains) minimize compute time
store load load store load store Memory Dependent Instructions IPBC preferred info is used vs. IBC minimize register comms. Preferred=1 add Preferred=2 load memory dependant chain 2 memory dependant chain 1 Preferred=1 add Preferred=2 store
Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions
ATTRACTION BUFFER ADDRESS TAG W TAG W2 W6 word select = hit data Enhacement: Attraction Buffers ADDRESS CACHE MODULE Local Data ABuffer data hit local logic data hit data hit
CLUSTER 3 CLUSTER 2 CLUSTER 1 a[3] a[0] a[7] a[4] An Example for (i=0; i<MAX; i++) { ld r3, a[i] r4 = OP(r3) st r4, b[i] } for (i=0; i<MAX; i+=4) { ld r31, a[i] (stride 16) ld r32, a[i+1] ld r33, a[i+2] ld r34, a[i+3] r41 = OP(r31) r42 = OP(r32) r43 = OP(r33) r44 = OP(r34) st r41, b[i] st r42, b[i+1] st r43, b[i+2] st r44, b[i+3] } 16 byte strides (NxI multiple) N = 4 clusters, I= 4 bytes Unroll x4 a[0] a[1] a[2] a[3] ... Local module ABuffer CLUSTER 4 ld r31, a[0]
Enhacement: Attraction Buffers • Why remote accesses? Why Attraction Buffers? • Double precision accesses low benefit • Indirect accesses: a[b[i]] low benefit • “Unclear” preferred cluster big benefit for (i=0; i<MAX; i++) for (k=i; k<i+MAX; k+=4) ld a[k], ld a[k+1], ld a[k+2], ld a[k+3] • Memory dependent chains big benefit • IBC: preferred cluster info is not used big benefit
Talk Outline • State-of-the-art: multiVLIW • Interleaved Cache Clustered VLIW • Scheduling Algorithms • Enhancement: Attraction Buffers • Experimental Framework • Results • Conclusions
Experimental Framework • IMPACT C compiler • Modulo scheduling on hyperblock loops • BASE for a Unified Cache • IPBC and IBC for an Interleaved Cache • IPBC and IBC for the MultiVLIW • The same unrolling factor has been used for all architecture configurations! • Mediabench benchmark suite
Results (I) • IPBC vs IBC similar cycle count results • MultiVLIW vs Interleaved similar results BUT… … lower complexity!
Results (II) • Memory dependent chains • Interleaved cacheworkload unbalance + remote accesses • MultiVLIW workload unbalance • Working on techniques to overcome scheduling restrictions
Results (III) • Local hits are increased by 15% • Stall time reduced by 30%
Conclusions • Scheduling Algorithms • Good latency assignment process (stall time accounts for 9% of execution time) • Coherence kept through memory dependent chains (5% cycle count degradation) • Attraction Buffers • Effective to increase local hits (15% average) + reduce stall time (30% average) • Reduce remote hits to previously accessed subblocks (70% average) • Cycle count results • similar to Unified Cache and MultiVLIW