Mechanisms for flexible, compiler-controlled mapping of data to per-cluster L0 buffers in clustered VLIW processors, together with instruction scheduling techniques that exploit them.
Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors
Enric Gibert (1), Jesús Sánchez (2), Antonio González (1,2)
(1) Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs - UPC, Barcelona
Motivation
[Figure: two baseline memory-hierarchy options for a clustered VLIW processor, both with an L2 cache, memory buses, and register-to-register communication buses among clusters. Option 1: distribute the L1 cache into per-cluster L1 modules. Option 2: keep a unified L1 cache and add a small memory buffer to each cluster.]
Contributions
• Small L0 buffer in each cluster
• Flexible mechanisms to map data to the buffers
  • Compiler-controlled memory instruction hints
• Instruction scheduling techniques (VLIW)
  • Mark "critical" instructions to use the buffers
  • Use appropriate memory instruction hints
• Data coherence among buffers [CGO'03]
  • 3 mechanisms: same cluster, partial store replication, and not using the buffers
Talk Outline
• Flexible Compiler-Managed L0 Buffers
• Instruction Scheduling Techniques
• Evaluation
• Conclusions
L0 Buffers
[Figure: proposed organization. Each cluster has a register file, INT/FP/MEM functional units, and a small L0 buffer filled from the unified L1 cache through unpack logic; clusters communicate over register-to-register buses.]
Mapping Flexibility
[Figure: a 16-byte L1 block flows through the unpack logic into 4-byte words of the per-cluster L0 buffers. Linear mapping: a load of a[0] with a 1-element stride fills its cluster's entry with consecutive elements. Interleaved mapping (1-cycle penalty): loads of a[0], a[1], a[2], a[3], each with a 4-element stride and one per cluster, fill each cluster's entry with every fourth element (cluster 1: a[0], a[4]; cluster 2: a[1], a[5]; cluster 3: a[2], a[6]; cluster 4: a[3], a[7]).]
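To make the two mappings concrete, here is a minimal C sketch of what the unpack logic computes when it fills an L0 buffer entry. The sizes (4-byte words, 4-word entries, 4-cluster interleaving factor) and all names are illustrative assumptions, not the actual hardware design:

```c
#include <stdio.h>

/* Illustrative sketch of the unpack logic: filling a 4-word L0 buffer
 * entry from the word stream behind an L1 block, under the two mapping
 * hints. Sizes and names are assumptions for illustration. */

enum mapping { LINEAR, INTERLEAVED };

#define WORDS_PER_ENTRY 4
#define NUM_CLUSTERS    4   /* interleaving factor */

/* l1_words: words starting at the loaded block; entry: the requesting
 * cluster's L0 buffer entry; offset: word offset of the load. */
void unpack(const int *l1_words, int *entry, enum mapping m, int offset)
{
    for (int i = 0; i < WORDS_PER_ENTRY; i++) {
        if (m == LINEAR)
            entry[i] = l1_words[offset + i];                /* a[k], a[k+1], ... */
        else /* INTERLEAVED: every NUM_CLUSTERS-th word (1-cycle penalty) */
            entry[i] = l1_words[offset + i * NUM_CLUSTERS]; /* a[k], a[k+4], ... */
    }
}

int main(void)
{
    int a[16], entry[WORDS_PER_ENTRY];
    for (int i = 0; i < 16; i++) a[i] = i;

    unpack(a, entry, LINEAR, 0);      /* {0,1,2,3}: stride-1 loads hit */
    printf("linear:      %d %d %d %d\n", entry[0], entry[1], entry[2], entry[3]);

    unpack(a, entry, INTERLEAVED, 1); /* {1,5,9,13}: the unrolled copy with
                                         offset 1 and stride 4 always hits */
    printf("interleaved: %d %d %d %d\n", entry[0], entry[1], entry[2], entry[3]);
    return 0;
}
```

This is why unrolling by the number of clusters pairs well with interleaved mapping: each unrolled copy of a stride-1 load becomes a stride-4 load with a distinct offset, and each cluster's buffer ends up holding exactly the elements its copy will touch.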
Memory Hints
• Access directives: no access, sequential, parallel
• Mapping hints: linear, interleaved
• Prefetching hints: none, positive, negative
[Figure: in cycle i, load a[0] is marked as a sequential access and goes through its cluster's L0 buffer, while load *p is marked as no access and bypasses the buffers to reach L1 directly; in cycle i+1, a load marked as a parallel access executes.]
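One way to picture the hints is as small fields the compiler attaches to every memory instruction. The encoding below is hypothetical (the talk does not specify ISA bits); it only illustrates the three hint families listed above:

```c
/* Hypothetical hint fields for one memory instruction; the actual ISA
 * encoding is not specified in the talk. */
typedef enum { HINT_NO_ACCESS, HINT_SEQUENTIAL, HINT_PARALLEL } access_directive;
typedef enum { MAP_LINEAR, MAP_INTERLEAVED } mapping_hint;
typedef enum { PF_NONE, PF_POSITIVE, PF_NEGATIVE } prefetch_hint;

typedef struct {
    access_directive access; /* whether/how the instruction uses the L0 buffer */
    mapping_hint     map;    /* how the unpack logic fills the entry           */
    prefetch_hint    pf;     /* prefetch the next/previous block, or neither   */
} mem_hints;

/* Example settings: a stride-1 load in a hot loop vs. a load through an
 * unanalyzable pointer, as in the figure above. */
static const mem_hints strided_load = { HINT_SEQUENTIAL, MAP_LINEAR, PF_POSITIVE };
static const mem_hints pointer_load = { HINT_NO_ACCESS,  MAP_LINEAR, PF_NONE     };
```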
L0 - L1 Interaction
• L0 buffers are write-through
  • 1) Simplifies replacements: no bus arbitration or flush instruction needed
  • 2) No pack logic
  • 3) Data consistency with the L1 cache
[Figure: loads fill the L0 buffer through the unpack logic; stores are written through to L1, so replacement needs no pack logic or write-back path.]
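A tiny C sketch of the write-through behavior (illustrative only; the real mechanism is hardware). Because L1 is updated on every store, an L0 entry can simply be dropped on replacement, which is exactly why no pack logic, bus arbitration, or flush instruction is needed:

```c
/* Illustrative only: one store under the write-through policy. */
void store_word(int *l0_entry, int *l1_block, int word, int value)
{
    if (l0_entry)               /* the word may not be buffered at all   */
        l0_entry[word] = value; /* keep the L0 copy consistent           */
    l1_block[word] = value;     /* L1 always holds up-to-date data, so a
                                   replaced L0 entry is just discarded   */
}
```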
Talk Outline
• Flexible Compiler-Managed L0 Buffers
• Instruction Scheduling Techniques
• Evaluation
• Conclusions
Memory Coherence
[Figure: example with memory-dependent instructions load A, load B, store C, load D, store E on a 2-cluster schedule. With "not use buffers" (NB), the conflicting instructions bypass the L0 buffers and access L1 directly; with "same cluster" (1C), they are all assigned to one cluster and share its L0 buffer. Schedule in both cases: cycle i: load A, load B; cycle i+1: store C; cycle i+2: load D; cycle i+3: store E.]
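A sketch of how a scheduler might enforce two of the coherence solutions from the figure. The types and the alias oracle are assumptions for illustration; the third mechanism from the contributions slide, partial store replication, is omitted for brevity:

```c
/* Sketch: keeping possibly-aliasing memory instructions coherent.
 * Types and the alias check are hypothetical. */
typedef struct { int cluster; int uses_l0; } mem_inst;

enum policy { SAME_CLUSTER /* 1C */, NO_BUFFERS /* NB */ };

/* Conservative stub: assume aliasing whenever disambiguation fails. */
static int may_alias(const mem_inst *a, const mem_inst *b)
{
    (void)a; (void)b;
    return 1;
}

void enforce_coherence(mem_inst *a, mem_inst *b, enum policy p)
{
    if (!may_alias(a, b))
        return;                      /* independent: nothing to enforce  */
    if (p == SAME_CLUSTER)
        b->cluster = a->cluster;     /* both see the same L0 buffer      */
    else                             /* NO_BUFFERS                       */
        a->uses_l0 = b->uses_l0 = 0; /* hint "no access": straight to L1 */
}
```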
Scheduling Algorithm (I)
• Overview
  • Candidate instructions: strided memory instructions
  • Assign "critical" candidate instructions to the buffers
  • Do not overflow the buffers
  • Selection criteria: global communications + workload
• Loop unrolling
  • Factors: 1 or N
  • Unrolling by N may benefit from interleaved mapping
Scheduling Algorithm (II)
[Figure: flowchart of the modulo scheduler, built on top of Swing Modulo Scheduling. Sort the nodes; for the next node, compute the set P of possible clusters and its latencies from the slack; sort P by (1) L0 buffer availability (NFreeEntries), (2) minimum global communications, and (3) maximum workload; try to schedule the node in a cluster of P, respecting memory dependences and register-file constraints, and update that cluster's NFreeEntries and coherence state; if no cluster is possible, set II=II+1, recompute criticality, reassign latencies, and retry. The running example tracks NFreeEntries = {2, 2} shrinking to {1, 1} and {1, 0} as loads are mapped to the two clusters' L0 buffers.]
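The cluster-ordering step lends itself to a comparator. This C sketch sorts the candidate set P by the three criteria from the flowchart: most free L0 entries first, then fewest added global communications, then most workload (following the slide's "max workload" criterion). The struct fields are assumptions:

```c
#include <stdlib.h>

/* Per-cluster candidate state; field names are assumptions. */
typedef struct {
    int id;
    int n_free_entries; /* free L0 buffer entries (NFreeEntries)         */
    int global_comms;   /* register-to-register moves this choice adds   */
    int workload;       /* instructions already scheduled on the cluster */
} cluster_cand;

/* Order: max L0 availability, then min comms, then max workload. */
static int cmp_cluster(const void *pa, const void *pb)
{
    const cluster_cand *a = pa, *b = pb;
    if (a->n_free_entries != b->n_free_entries)
        return b->n_free_entries - a->n_free_entries;
    if (a->global_comms != b->global_comms)
        return a->global_comms - b->global_comms;
    return b->workload - a->workload;
}

void sort_possible_clusters(cluster_cand *P, int n)
{
    qsort(P, n, sizeof *P, cmp_cluster);
}
```

If no cluster in P admits the node (buffers full, or latencies cannot be met), the algorithm falls back to increasing the II, recomputing criticality, and reassigning latencies before retrying, as in Swing Modulo Scheduling.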
Talk Outline
• Flexible Compiler-Managed L0 Buffers
• Instruction Scheduling Techniques
• Evaluation
• Conclusions
Evaluation Framework (I)
• IMPACT C compiler
  • Compile + optimize + memory disambiguation
  • Extended with the proposed instruction scheduler
• Mediabench benchmark suite
Improving L0 Hit Rate
[Figure: timeline over loop iterations with II=2. Block a[2] is prefetched one block in advance, but it is needed before it reaches the L0 buffer.]
• Solution: prefetch two blocks in advance
  • Uses more L0 buffer entries
• Speedups: 1.12 in epicdec (+7% hit rate) and 1.04 in rasta (+12% hit rate)
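A back-of-the-envelope sketch of why two blocks ahead helps. The formula and the parameters are ours, not from the talk: a block must be requested early enough that the memory latency is hidden by the cycles the loop spends consuming the preceding blocks:

```c
/* Illustrative: how many blocks ahead to prefetch so a block reaches
 * the L0 buffer before its first use. Formula and numbers are assumed,
 * not taken from the talk. */
int prefetch_distance(int mem_latency_cycles, int ii, int elems_per_block)
{
    /* one element consumed per iteration, II cycles per iteration */
    int cycles_per_block = ii * elems_per_block;
    /* ceil(latency / cycles_per_block) */
    return (mem_latency_cycles + cycles_per_block - 1) / cycles_per_block;
}
```

With II=2, four elements per block, and a memory latency above 8 cycles, this yields a distance of 2, matching the "two blocks in advance" fix; the cost is the extra L0 entries the in-flight blocks occupy.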
[Figure: the two distributed-cache configurations used for comparison, both below an L2 cache. Word-interleaved [MICRO35]: the L1 cache is split into per-cluster modules, with the words of each L1 block (W0..W7) interleaved across modules (W0, W2, W4, W6 vs. W1, W3, W5, W7). MultiVLIW [MICRO33]: per-cluster L1 modules kept consistent by a cache-coherent protocol.]
Talk Outline
• Flexible Compiler-Managed L0 Buffers
• Instruction Scheduling Techniques
• Evaluation
• Conclusions
Conclusions
• Flexible compiler-managed L0 buffers
  • Mapping flexibility
  • Memory instruction hints
• Instruction scheduling techniques
  • Mark "critical" instructions + do not overflow the buffers
  • Memory coherence solutions [CGO'03]
• Performance results
  • 16% better than a unified L1 cache without buffers
  • Outperforms the word-interleaved cache [MICRO35]
  • Competitive with MultiVLIW [MICRO33]