Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors
Enric Gibert (1), Jesús Sánchez (2), Antonio González (1,2)
(1) Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs - UPC, Barcelona
Motivation
[Figure: a clustered VLIW processor (clusters 1..n, each with a register file and functional units, connected by register-to-register communication buses) and two options for the data memory hierarchy. Option 1, distribute the L1 cache: each cluster gets its own L1 cache module below the L2 cache. Option 2, memory buffers: the L1 cache stays unified and each cluster gets a small memory buffer, reached over the memory buses.]
Contributions
• Small L0 buffer in each cluster
  • Flexible mechanisms to map data to the buffers
  • Compiler-controlled memory instruction hints
• Instruction scheduling techniques (VLIW)
  • Mark "critical" instructions to use the buffers
  • Use the appropriate memory instruction hints
• Data coherence among buffers [CGO'03]
  • 3 mechanisms: same cluster, partial store replication, and not using the buffers
Talk Outline
• Flexible Compiler-Managed L0 Buffers
• Instruction Scheduling Techniques
• Evaluation
• Conclusions
L0 Buffers
[Figure: the proposed organization on a 4-cluster VLIW processor. Each cluster has a register file, INT/FP/MEM functional units, and a small L0 buffer fed from the shared L1 cache through unpack logic; clusters communicate over register-to-register buses.]
Mapping Flexibility
[Figure: a 16-byte L1 block of 4-byte elements and four 2-entry L0 buffers. Linear mapping: a load of a[0] with a 1-element stride copies the consecutive words a[0], a[1] into the requesting cluster's buffer. Interleaved mapping (1 cycle penalty): when all clusters issue loads with a 4-element stride, the unpack logic deals the words out so cluster 1 holds a[0], a[4]; cluster 2 holds a[1], a[5]; cluster 3 holds a[2], a[6]; and cluster 4 holds a[3], a[7].] A sketch of the two placement policies follows below.
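The two placement policies come down to simple address arithmetic. A minimal sketch, assuming 4-byte words, 16-byte L1 blocks, and 4 clusters; the macro and function names are illustrative, not taken from the paper:

```c
#include <stdio.h>

#define NCLUSTERS 4   /* one L0 buffer per cluster */

/* Linear mapping: word w of the fetched block lands in entry w of the
 * requesting cluster's buffer -- suits stride-1 access from one cluster. */
int linear_entry(int w) { return w; }

/* Interleaved mapping (1 cycle penalty): the unpack logic deals the words
 * round-robin across clusters, so each cluster keeps the elements of an
 * NCLUSTERS-element strided stream. */
int interleaved_cluster(int w) { return w % NCLUSTERS; }
int interleaved_entry(int w)   { return w / NCLUSTERS; }

int main(void) {
    /* Reproduces the slide's example: a[0], a[4] -> cluster 1;
     * a[1], a[5] -> cluster 2; and so on (clusters numbered from 1). */
    for (int w = 0; w < 8; w++)
        printf("a[%d] -> cluster %d, entry %d (interleaved)\n",
               w, interleaved_cluster(w) + 1, interleaved_entry(w));
    return 0;
}
```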
Memory Hints
• Access directives: no access, sequential, parallel
• Mapping hints: linear, interleaved
• Prefetching hints: none, positive, negative
[Figure: in cycle i, a load of a[0] marked sequential accesses its cluster's L0 buffer, while in cycle i+1 a load of *p marked no access bypasses the buffers and goes directly to the L1 cache.] An illustrative encoding of these hints is sketched below.
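One way to picture the three hint families is as a small record carried by each memory instruction. This is only an illustrative encoding; the enum and field names are mine, not the ISA's actual bit fields:

```c
/* Hints the compiler attaches to each load/store (illustrative). */
typedef enum { HINT_NO_ACCESS, HINT_SEQUENTIAL, HINT_PARALLEL } access_t;
typedef enum { MAP_LINEAR, MAP_INTERLEAVED } mapping_t;
typedef enum { PREF_NONE, PREF_POSITIVE, PREF_NEGATIVE } prefetch_t;

typedef struct {
    access_t   access;    /* use the L0 buffer? serially or in parallel with L1? */
    mapping_t  mapping;   /* how the unpack logic places the fetched L1 block    */
    prefetch_t prefetch;  /* whether to prefetch the next or previous block      */
} mem_hints_t;

/* E.g. a strided load in an unrolled loop might carry: */
const mem_hints_t strided_load = { HINT_SEQUENTIAL, MAP_INTERLEAVED, PREF_POSITIVE };
/* while a pointer access the compiler cannot disambiguate would carry: */
const mem_hints_t unknown_ptr  = { HINT_NO_ACCESS, MAP_LINEAR, PREF_NONE };
```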
L0 - L1 Interaction
• L0 buffers are write-through
• 1) Simplifies replacements: no bus arbitration and no flush instruction needed
• 2) No pack logic: stores go straight to the L1 cache
• 3) Data consistency
[Figure: loads fill the L0 buffer through the unpack logic; because stores write through to the L1 cache, a replaced buffer entry is simply discarded.] A sketch of this behavior follows below.
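A minimal sketch of the write-through behavior, assuming a toy word-addressed L1 and a small fully associative L0 buffer; all types and sizes are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t addr; uint32_t data; bool valid; } l0_entry_t;

#define L0_ENTRIES 8
l0_entry_t l0[L0_ENTRIES];
uint32_t   l1[1 << 20];   /* toy backing store standing in for the L1 cache */

/* Write-through: the L1 copy is always updated, and the L0 copy only if
 * the word happens to be buffered. */
void store_word(uint32_t addr, uint32_t data) {
    l1[addr >> 2] = data;
    for (int i = 0; i < L0_ENTRIES; i++)
        if (l0[i].valid && l0[i].addr == addr)
            l0[i].data = data;
}

/* Replacement is therefore trivial: invalidate the victim entry. No pack
 * logic, bus arbitration, or flush instruction is needed, because L1
 * already holds the up-to-date data. */
void l0_evict(int i) { l0[i].valid = false; }
```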
Memory Coherence
[Figure: a group of possibly dependent memory instructions (load A, load B, store C, load D, store E) scheduled on a 2-cluster machine under two of the coherence mechanisms. Not use buffers (NB): the dependent instructions skip the L0 buffers and access the L1 cache directly. 1 cluster (1C): all dependent instructions are assigned to the same cluster, so they share one L0 buffer. Both yield the schedule: cycle i: load A, load B; cycle i+1: store C; cycle i+2: load D; cycle i+3: store E.] A sketch of how dependent instructions can be grouped for the 1C mechanism follows below.
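One plausible way a compiler could form the groups for the 1C mechanism is union-find over the may-alias relation; the data structures and the `may_alias` oracle below are hypothetical stand-ins for IMPACT's memory disambiguation, not the paper's actual code:

```c
#define NINSTS 64
int parent[NINSTS];

int find(int x)  { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
void unite(int a, int b) { parent[find(a)] = find(b); }

/* Supplied by memory disambiguation: may instructions i and j touch the
 * same address? (hypothetical interface) */
extern int may_alias(int i, int j);

/* Group possibly-dependent memory instructions; the scheduler then pins
 * each group to a single cluster so its members share one L0 buffer. */
void build_coherence_groups(int n) {
    for (int i = 0; i < n; i++) parent[i] = i;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (may_alias(i, j)) unite(i, j);
}
```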
Scheduling Algorithm (I)
• Overview
  • Candidate instructions: strided memory instructions
  • Assign "critical" candidate instructions to the buffers
    • Do not overflow the buffers
    • Account for global communications + workload
• Loop unrolling
  • Factors: 1 or N
  • Unrolling by N may benefit from the interleaved mapping (see the sketch below)
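Why unrolling by N pairs with the interleaved mapping: after unrolling by the number of clusters (taking N = 4 here), each copy of the loop body becomes a strided stream that matches what one L0 buffer holds. An illustrative source-level view, not compiler output:

```c
/* Unrolled by 4 for a 4-cluster machine. If copy k is scheduled on
 * cluster k, every cluster's loads have a 4-element stride -- exactly
 * the access pattern the interleaved mapping captures in its L0 buffer. */
void scale(float *a, float *b, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        b[i]     = 2.0f * a[i];       /* cluster 1: a[0], a[4], a[8], ... */
        b[i + 1] = 2.0f * a[i + 1];   /* cluster 2: a[1], a[5], a[9], ... */
        b[i + 2] = 2.0f * a[i + 2];   /* cluster 3 */
        b[i + 3] = 2.0f * a[i + 3];   /* cluster 4 */
    }
    for (; i < n; i++) b[i] = 2.0f * a[i];   /* remainder iterations */
}
```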
Scheduling Algorithm (II)
[Figure: the scheduler is built on Swing Modulo Scheduling. Nodes are sorted; for each node the set P of possible clusters is computed, P is sorted and latencies are computed, and the node is scheduled in a cluster of P; if no cluster is possible, II is increased by one and scheduling restarts. Each scheduled memory instruction decrements its cluster's NFreeEntries count, e.g. from {2, 2} down to {1, 0}, and criticality and latencies (slack) are recomputed along the way.]
• Sort P by: L0 buffer availability, minimum global communications, maximum workload balance
• Compute latencies from each node's slack
• Track NFreeEntries so the L0 buffers do not overflow
• Enforce the memory coherence constraints
A sketch of the cluster-ordering criteria follows below.
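A minimal sketch of the cluster-ordering step using the criteria on this slide; the `cluster_t` fields and function names are mine, and the real scheduler runs inside a Swing Modulo Scheduling loop that raises II on failure:

```c
typedef struct {
    int free_l0_entries;   /* NFreeEntries: free L0 buffer slots left */
    int comms_cost;        /* global communications this choice adds  */
    int workload;          /* instructions already scheduled on it    */
} cluster_t;

/* Prefer clusters with a free L0 entry, then fewer added communications,
 * then a lighter workload. */
int better(const cluster_t *x, const cluster_t *y) {
    if ((x->free_l0_entries > 0) != (y->free_l0_entries > 0))
        return x->free_l0_entries > 0;
    if (x->comms_cost != y->comms_cost)
        return x->comms_cost < y->comms_cost;
    return x->workload < y->workload;
}

/* Pick the best cluster in P; if the node cannot be scheduled in any of
 * them, the modulo scheduler increments II and starts over. */
int pick_cluster(cluster_t p[], int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (better(&p[i], &p[best])) best = i;
    return best;
}
```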
Evaluation Framework (I)
• IMPACT C compiler
  • Compile + optimize + memory disambiguation
  • Extended with the proposed instruction scheduler
• Mediabench benchmark suite
Improving L0 Hit Rate
[Figure: a software-pipelined loop with II=2 on cluster 1; a prefetch of a[2] issued one block ahead reaches the L0 buffer only after a[2] is already needed.]
• Solution: prefetch two blocks in advance (sketched below)
  • Uses more L0 buffer entries
• Speedups: 1.12 in epicdec (+7% hit rate) and 1.04 in rasta (+12% hit rate)
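The general rule behind "prefetch two blocks in advance": the prefetch distance must cover the memory latency measured in loop iterations. A small sketch with illustrative parameter names:

```c
/* Blocks to prefetch ahead so data arrives before it is needed:
 * ceil(mem_latency / cycles_per_block), where a block is consumed over
 * iters_per_block iterations of II cycles each. With the slide's setup
 * (the fetch spans roughly two block lifetimes at II = 2), this yields 2. */
int prefetch_distance(int mem_latency, int ii, int iters_per_block) {
    int cycles_per_block = ii * iters_per_block;
    return (mem_latency + cycles_per_block - 1) / cycles_per_block;
}
```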
Distributed Cache Schemes
[Figure: two distributed data-cache organizations used for comparison. Word-interleaved [MICRO-35]: the words of an L1 cache block (W0..W7) are interleaved across per-cluster L1 modules, e.g. W0, W2, W4, W6 in one module and W1, W3, W5, W7 in another. MultiVLIW [MICRO-33]: each cluster has a full L1 module kept consistent by a cache-coherence protocol.]
Conclusions
• Flexible compiler-managed L0 buffers
  • Mapping flexibility
  • Memory instruction hints
• Instruction scheduling techniques
  • Mark "critical" instructions + do not overflow the buffers
  • Memory coherence solutions [CGO'03]
• Performance results
  • 16% better than a unified L1 cache without buffers
  • Outperforms the word-interleaved cache [MICRO-35]
  • Competitive with MultiVLIW [MICRO-33]