
Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Explore mechanisms for flexible mapping of data to L0 buffers in clustered VLIW processors, controlled by the compiler to optimize memory access and instruction scheduling.


Presentation Transcript


  1. Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors • Enric Gibert (1), Jesús Sánchez (2), Antonio González (1, 2) • (1) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona • (2) Intel Barcelona Research Center, Intel Labs - UPC, Barcelona

  2. Motivation • Option 1: distribute the L1 cache (L1 cache modules) • Option 2: memory buffers • [Figure: clustered organizations showing an L2 cache, memory buses, an L1 cache or L1 cache modules, memory buffers, and clusters 1 to n, each with FUs and a register file, linked by register-to-register communication buses]

  3. Contributions • Small L0 buffer in each cluster • Flexible mechanisms to map data to the buffers • Compiler-controlled through memory instruction hints • Instruction scheduling techniques (VLIW) • Mark “critical” instructions to use the buffers • Use appropriate memory instruction hints • Data coherence among buffers [CGO’03] • 3 mechanisms: same cluster, partial store replication, and not using the buffers

  4. Talk Outline • Flexible Compiler-Managed L0 Buffers • Instruction Scheduling Techniques • Evaluation • Conclusions

  5. L0 Buffers • [Figure: clusters 1 to 4, each with INT, FP and MEM units and a register file, joined by register-to-register communication buses; L0 buffers; unpack logic; L1 cache]

  6. Mapping Flexibility • L1 block (16 bytes) holding a[0] to a[7]; 4 bytes per L0 buffer through the unpack logic • Linear mapping: load a[0] with a stride of 1 element • Interleaved mapping (1 cycle penalty): all loads with a 4-element stride (load a[0], load a[1], load a[2], load a[3]); the L0 buffers hold the pairs a[0] a[4], a[1] a[5], a[2] a[6] and a[3] a[7] • [Figure: L1 cache, unpack logic and the L0 buffers of clusters 1 to 4]
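The two mappings on this slide can be sketched as follows. This is a minimal illustration, not the hardware's actual logic: it assumes an L1 block of eight elements and four clusters, as the slide suggests, and the function names `linear_subblock` and `interleaved_subblocks` are hypothetical.

```python
def linear_subblock(block, elem_index, elems_per_entry):
    """Linear mapping: return the contiguous subblock that contains elem_index."""
    start = (elem_index // elems_per_entry) * elems_per_entry
    return block[start:start + elems_per_entry]


def interleaved_subblocks(block, n_clusters):
    """Interleaved mapping: cluster c receives elements c, c + n, c + 2n, ..."""
    return [block[c::n_clusters] for c in range(n_clusters)]


# Hypothetical 16-byte L1 block seen as eight 2-byte elements.
block = [f"a[{i}]" for i in range(8)]
per_cluster = interleaved_subblocks(block, n_clusters=4)
contiguous = linear_subblock(block, elem_index=3, elems_per_entry=2)
```

With a 4-element stride, each cluster then finds all the elements it touches in its own buffer, which matches the pairs listed on the slide.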

  7. Memory Hints • Access directives: no access, sequential, parallel • Mapping hints: linear, interleaved • Prefetching hints: none, positive, negative • [Figure: timeline over cycles i and i+1 for load a[0] and load *p, comparing a load with no access, sequential access and parallel access; L1 cache, L0 buffers, and clusters 1 to 4 with INT, FP and MEM units]
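One way to picture the three hint families is as fields attached to each memory instruction. The sketch below is an assumption-laden model, not the instruction encoding from the talk: the class names and `probe_steps` are hypothetical, and the reading of the access directives (sequential probes L0 and then L1 on a miss, parallel probes both at once, no access bypasses L0) is an interpretation of the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    NO_ACCESS = "no access"
    SEQUENTIAL = "sequential"
    PARALLEL = "parallel"


class Mapping(Enum):
    LINEAR = "linear"
    INTERLEAVED = "interleaved"


class Prefetch(Enum):
    NONE = "none"
    POSITIVE = "positive"
    NEGATIVE = "negative"


@dataclass(frozen=True)
class MemoryHints:
    access: Access
    mapping: Mapping = Mapping.LINEAR
    prefetch: Prefetch = Prefetch.NONE


def probe_steps(hints, l0_hit):
    """Return the structures probed, grouped by step (assumed semantics)."""
    if hints.access is Access.NO_ACCESS:
        return [["L1"]]              # bypass the L0 buffer
    if hints.access is Access.PARALLEL:
        return [["L0", "L1"]]        # probe both in the same step
    # Sequential: try L0 first, go to L1 only on a miss.
    return [["L0"]] if l0_hit else [["L0"], ["L1"]]
```

Under this reading, a sequential access that misses in L0 pays an extra step before reaching L1, which is why the directive would be chosen per instruction.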

  8. L0 - L1 Interaction • L0 buffers are write-through to the L1 cache • 1) Simplifies replacements • no bus arbitration • flush instruction • 2) No pack logic • 3) Data consistency • [Figure: load, store and replacement paths between an L0 buffer and the L1 cache, with unpack logic and pack logic; clusters 1 to 4 with INT, FP and MEM units and a register file]
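The consequences of a write-through policy can be shown with a small software model. This is only a sketch under assumptions: the class `WriteThroughL0` is hypothetical, replacement is modeled as FIFO, and stores update the buffer only when the address is already present; none of these details are stated on the slide.

```python
class WriteThroughL0:
    """Toy write-through buffer in front of a dict that stands in for L1."""

    def __init__(self, l1, n_entries):
        self.l1 = l1
        self.n_entries = n_entries
        self.entries = {}  # insertion order doubles as FIFO order (assumption)

    def load(self, addr):
        if addr in self.entries:
            return self.entries[addr]
        value = self.l1[addr]
        if len(self.entries) >= self.n_entries:
            # L1 is always up to date, so the victim is simply dropped:
            # nothing has to be packed and written back.
            del self.entries[next(iter(self.entries))]
        self.entries[addr] = value
        return value

    def store(self, addr, value):
        self.l1[addr] = value            # every store reaches L1
        if addr in self.entries:
            self.entries[addr] = value   # keep the local copy consistent

    def flush(self):
        self.entries.clear()             # safe at any time, no dirty data


l1 = {0: 10, 4: 20, 8: 30}
buf = WriteThroughL0(l1, n_entries=2)
buf.load(0)
buf.store(0, 11)
buf.load(4)
buf.load(8)  # evicts address 0 without any write-back
```

Because L1 always holds the latest value, evicting or flushing an entry never loses data, which is the property the slide relies on to simplify replacements and drop the pack logic.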

  9. Talk Outline • Flexible Compiler-Managed L0 Buffers • Instruction Scheduling Techniques • Evaluation • Conclusions

  10. Memory Coherence • Not use buffers (NB) • 1 cluster (1C) • [Figure: for each scheme, clusters 1 and 2 with their L0 buffers and the L1 cache, plus an example schedule: load A and load B in cycle i, store C in cycle i+1, load D in cycle i+2, store E in cycle i+3]

  11. Scheduling Algorithm (I) • Overview • Candidate instructions: strided memory instructions • Assign “critical” candidate instructions to buffers • Do not overflow buffers • Global comms. + workload • Loop unrolling • Factors: 1 or N • Unroll N: may benefit from interleaved mapping
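The candidate selection described in this overview can be sketched as a simple greedy filter. This is an illustration rather than the scheduler from the talk: `MemInst` and `mark_for_buffers` are hypothetical names, criticality is approximated by slack (lower slack means more critical), and each marked instruction is assumed to consume one buffer entry.

```python
from dataclasses import dataclass


@dataclass
class MemInst:
    name: str
    strided: bool
    slack: int  # assumed criticality measure: lower slack = more critical


def mark_for_buffers(insts, total_free_entries):
    """Pick the most critical strided instructions without overflowing."""
    candidates = [inst for inst in insts if inst.strided]
    candidates.sort(key=lambda inst: inst.slack)  # stable sort keeps ties in order
    return [inst.name for inst in candidates[:total_free_entries]]


loop_body = [
    MemInst("load a[i]", strided=True, slack=0),
    MemInst("load *p", strided=False, slack=0),
    MemInst("load b[i]", strided=True, slack=3),
    MemInst("store c[i]", strided=True, slack=1),
]
marked = mark_for_buffers(loop_body, total_free_entries=2)
```

Here `load *p` is never a candidate because it is not strided, and `load b[i]` is left out because the two available entries go to the instructions with less slack.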

  12. Scheduling Algorithm (II) • Sort P • L0 availability • Min. global comms. • Max. workload • Compute latencies • NFreeEntries • Coherence • [Figure: flowchart with the boxes Initialize Data, Sort Nodes (Swing MS), Next Node, P = Possible Clusters, Sort P and Compute Latencies, Schedule in a Cluster of P, NFreeEntries + Recompute Criticality + Reassign Latencies, and II=II+1, with branches labelled possible / impossible and empty / !empty; worked example on clusters 1 and 2 with loads A to H, load a[i], load a[i+1], store C and an add, memory dependences, and NFreeEntries taking the values {2, 2}, {1, 1} and {1, 0}]
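The "Sort P" step can be read as a lexicographic ordering of the possible clusters. The sketch below is one hedged interpretation: `sort_possible_clusters` and its inputs are hypothetical, and "max. workload" is taken here to mean workload balance, so the less loaded cluster wins the final tie-break. The slide does not spell out that criterion.

```python
def sort_possible_clusters(possible, n_free_entries, comms, load):
    """Order clusters by L0 availability, then global comms, then load.

    possible        -- list of cluster indices the node may be scheduled in
    n_free_entries  -- free L0 entries per cluster (NFreeEntries on the slide)
    comms           -- global communications the node would need per cluster
    load            -- current workload per cluster (assumed tie-breaker)
    """
    return sorted(
        possible,
        key=lambda c: (n_free_entries[c] == 0, comms[c], load[c]),
    )


order = sort_possible_clusters(
    possible=[0, 1, 2],
    n_free_entries=[0, 1, 2],
    comms=[0, 1, 1],
    load=[5, 3, 2],
)
```

Cluster 0 needs the fewest communications but has no free L0 entry, so it is tried last; clusters 1 and 2 tie on communications and are separated by load.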

  13. Talk Outline • Flexible Compiler-Managed L0 Buffers • Instruction Scheduling Techniques • Evaluation • Conclusions

  14. Evaluation Framework (I) • IMPACT C compiler • Compile + optimize + memory disambiguation • Extended with proposed instruction scheduler • Mediabench benchmark suite

  15. Evaluation Framework (II)

  16. Number of L0 Entries

  17. L0 Hit Rate

  18. Improving L0 Hit Rate • Solution: prefetch two blocks in advance • Use more L0 buffer entries • Speedups: 1.12 in epicdec (+7% HR) and 1.04 in rasta (+12% HR) • [Figure: timeline for II=2 over iterations 1 to 4 in cluster 1; the L0 buffer holds a[0] a[1], and the events “prefetch a[2]”, “a[2] is needed” and “a[2] reaches L0” are marked]
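A back-of-the-envelope model shows why prefetching further ahead can help when the initiation interval (II) is small. It is a sketch under stated assumptions: the function names, the reading of "positive" and "negative" hints as the direction of the prefetched block, and every numeric value below are hypothetical and not taken from the talk.

```python
def prefetch_block(current_block, hint, distance=1):
    """Block to prefetch when current_block is accessed (assumed semantics)."""
    if hint == "none":
        return None
    if hint == "positive":
        return current_block + distance
    if hint == "negative":
        return current_block - distance
    raise ValueError(f"unknown prefetching hint: {hint}")


def arrives_in_time(ii, iters_per_block, distance, fill_latency):
    """True if the prefetched block reaches L0 before its first use.

    The loop consumes one block every iters_per_block iterations, and one
    iteration starts every ii cycles, so a prefetch issued `distance`
    blocks ahead has distance * iters_per_block * ii cycles to complete.
    """
    return distance * iters_per_block * ii >= fill_latency


# Hypothetical numbers: II=2, two iterations per block, 6-cycle fill.
one_ahead = arrives_in_time(ii=2, iters_per_block=2, distance=1, fill_latency=6)
two_ahead = arrives_in_time(ii=2, iters_per_block=2, distance=2, fill_latency=6)
```

With these made-up numbers a prefetch one block ahead arrives late while two blocks ahead arrives in time, at the cost of holding more blocks in the L0 buffer at once.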

  19. Word-interleaved and MultiVLIW • Distributed cache • Cache-coherent protocol • [MICRO33] [MICRO35] • [Figure: two organizations with an L2 cache, L1 modules, functional units and register files in clusters 1 and 2; an L1 cache block of words W0 to W7 is split into W0 W2 W4 W6 and W1 W3 W5 W7]

  20. Performance Results

  21. Talk Outline • Flexible Compiler-Managed L0 Buffers • Instruction Scheduling Techniques • Evaluation • Conclusions

  22. Conclusions • Flexible Compiler-Managed L0 Buffers • Mapping flexibility • Memory instruction hints • Instruction Scheduling Techniques • Mark “critical” insts. + do not overflow buffers • Memory coherence solutions [CGO’03] • Performance Results • 16% better than unified L1 cache without buffers • Outperforms word-interleaved cache [MICRO35] • Competitive compared to MultiVLIW [MICRO33]

  23. Questions?
