Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor


  1. Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor
  • Enric Gibert (1), Jesús Sánchez (1,2), Antonio González (1,2)
  • (1) Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
  • (2) Intel Barcelona Research Center, Intel Labs Barcelona

  2. Motivation
  • Capacity-bound vs. communication-bound designs
  • Clustered microarchitectures:
  • Simpler and faster structures
  • Lower power consumption
  • Communications are not homogeneous
  • Clustering → embedded/DSP domain

  3. Clustered Microarchitectures
  • GOAL: distribute the memory hierarchy!
  • [Figure: four clusters, each with functional units and a register file, joined by register-to-register communication buses; a single L1 cache and the L2 cache are reached over shared memory buses]

  4. Contributions
  • Distribution of the data cache:
  • Interleaved cache clustered VLIW processor
  • Hardware enhancement:
  • Attraction Buffers
  • Effective instruction scheduling techniques:
  • Modulo scheduling
  • Loop unrolling + smart assignment of latencies + padding

  5. Talk Outline • MultiVLIW • Interleaved-cache clustered VLIW processor • Instruction scheduling algorithms and techniques • Hardware enhancement: Attraction Buffers • Simulation framework • Results • Conclusions

  6. MultiVLIW
  • Each cluster has its own cache module holding complete cache blocks (TAG + STATE + DATA)
  • Requires a cache-coherence protocol!
  • [Figure: four clusters, each with functional units, a register file, and a local cache module; register-to-register communication buses between clusters; memory buses to the shared L2 cache]

  7. Talk Outline • MultiVLIW • Interleaved-cache clustered VLIW processor • Instruction scheduling algorithms and techniques • Hardware enhancement: Attraction Buffers • Simulation framework • Results • Conclusions

  8. Interleaved Cache
  • A cache block (words W0–W7) is word-interleaved across the clusters' cache modules: module i holds the subblock with words Wi and Wi+4, each with its own TAG
  • Four access types: local hit, local miss, remote hit, remote miss
  • [Figure: four clusters, each with functional units, a register file, and one interleaved cache module (TAG W0 W4 / TAG W1 W5 / TAG W2 W6 / TAG W3 W7); register-to-register communication buses between clusters; memory buses to the shared L2 cache]
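
A minimal sketch of the address-to-cluster mapping the figure implies; the 4-cluster, 4-byte-word parameters are illustrative assumptions, not fixed by the slide:

    #include <stdint.h>
    #include <stdio.h>

    #define WORD_SIZE  4   /* bytes per interleaved word (assumption) */
    #define N_CLUSTERS 4   /* four clusters, as in the figure */

    /* Home cluster of an address under word interleaving:
     * consecutive words rotate across the cache modules. */
    static unsigned home_cluster(uintptr_t addr)
    {
        return (unsigned)((addr / WORD_SIZE) % N_CLUSTERS);
    }

    int main(void)
    {
        /* Words 0..7 of a block map to modules 0,1,2,3,0,1,2,3:
         * module i holds Wi and Wi+4, matching the slide's layout. */
        for (uintptr_t w = 0; w < 8; w++)
            printf("W%lu -> cluster %u\n",
                   (unsigned long)w, home_cluster(w * WORD_SIZE));
        return 0;
    }

A load issued from cluster c to address a is then a local access exactly when home_cluster(a) == c, which is the property the scheduling techniques below try to maximize.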

  9. Talk Outline • MultiVLIW • Interleaved-cache clustered VLIW processor • Instruction scheduling algorithms and techniques • Hardware enhancement: Attraction Buffers • Simulation framework • Results • Conclusions

  10. BASE Scheduling Algorithm
  • START: sort the nodes, then take the next node
  • Select the possible clusters for it — how many?
  • 0 → II = II + 1 and start over
  • >0 → keep the cluster with the best profit in output edges; if more than one remains, pick the least loaded
  • Schedule it: if successful → next node; if not successful → II = II + 1
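
A compilable toy of the control flow in the flowchart above. Every heuristic is stubbed with a trivial stand-in; none of this is the authors' implementation, only the loop structure the slide describes:

    #include <stdbool.h>
    #include <stdio.h>

    #define N_NODES    4   /* toy DDG size (assumption) */
    #define N_CLUSTERS 4

    static int load[N_CLUSTERS];   /* instructions placed per cluster */

    /* Trivial stand-ins for the slide's heuristics; a real compiler
     * derives these from the DDG and the partial schedule. */
    static int candidate_count(int node, int ii)
    { (void)node; return ii >= 2 ? N_CLUSTERS : 0; }  /* toy: no slots at II=1 */

    static int least_loaded(void)
    {
        int best = 0;
        for (int c = 1; c < N_CLUSTERS; c++)
            if (load[c] < load[best]) best = c;
        return best;
    }

    static bool try_schedule(int node, int cluster, int ii)
    { (void)node; (void)ii; load[cluster]++; return true; }

    /* The flowchart's loop: place every (already sorted) node at the
     * current II, bumping II and restarting whenever a node has no
     * candidate cluster or cannot be scheduled. */
    static int base_schedule(void)
    {
        for (int ii = 1;; ii++) {                 /* "II = II + 1" on failure */
            for (int c = 0; c < N_CLUSTERS; c++) load[c] = 0;
            bool ok = true;
            for (int n = 0; n < N_NODES && ok; n++) {
                if (candidate_count(n, ii) == 0) { ok = false; break; }
                /* with >1 candidate: first "best profit in output edges"
                 * (omitted in this toy), then tie-break on least load */
                ok = try_schedule(n, least_loaded(), ii);  /* "Schedule it" */
            }
            if (ok) return ii;       /* all nodes placed at this II */
        }
    }

    int main(void) { printf("final II = %d\n", base_schedule()); return 0; }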

  11. Scheduling Algorithm
  • For word-interleaved cache clustered processors
  • Scheduling steps:
  • 1. Loop unrolling
  • 2. Assignment of latencies to memory instructions:
  • longer assigned latencies → less stall time + more compute time
  • shorter assigned latencies → more stall time + less compute time
  • 3. Order instructions (DDG nodes)
  • 4. Cluster assignment and scheduling

  12. STEP 1: Loop Unrolling
  • Original loop — ld r3, a[i] changes cache module every iteration, so only 25% of its accesses are local:

      for (i=0; i<MAX; i++) {
          ld r3, a[i]
          r4 = OP(r3)
          st r4, b[i]
      }

  • Unrolled by 4 — every load now strides by 16 bytes, a multiple of N×I, so each one always targets the same module (up to 100% local accesses):

      for (i=0; i<MAX; i+=4) {
          ld r31, a[i]     (stride 16 bytes)
          ld r32, a[i+1]   (stride 16 bytes)
          ld r33, a[i+2]   (stride 16 bytes)
          ld r34, a[i+3]   (stride 16 bytes)
          ...
      }

  • [Figure: a[0]/a[4] in cluster 1's module, a[1]/a[5] in cluster 2's, a[2]/a[6] in cluster 3's, a[3]/a[7] in cluster 4's]
  • Selective unrolling: no unrolling / unroll×N / unroll by the Optimum Unrolling Factor (OUF), derived from the strides (see the sketch below)
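
A minimal sketch of one way to compute the OUF named on the slide, under assumed parameters (N = 4 clusters, I = 4-byte interleaving). The slide only states the "strides multiple of N×I" condition, so the formula lcm(stride, N·I)/stride is an illustrative assumption, not the paper's exact definition:

    #include <stdio.h>

    /* Smallest unroll factor U such that U*stride is a multiple of
     * N*I, i.e. such that each copy of a strided load always falls
     * in the same cache module. */
    static unsigned gcd(unsigned a, unsigned b) { return b ? gcd(b, a % b) : a; }

    static unsigned ouf(unsigned stride, unsigned n, unsigned i)
    {
        unsigned ni = n * i;
        return ni / gcd(stride, ni);   /* = lcm(stride, N*I) / stride */
    }

    int main(void)
    {
        /* int accesses, stride 4 bytes, 4 clusters x 4-byte words:
         * unroll by 4, matching the slide's example. */
        printf("OUF = %u\n", ouf(4, 4, 4));
        return 0;
    }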

  13. STEP 2: Latency Assignment
  • Four possible latencies per memory instruction: local hit LH = 1 cycle, remote hit RH = 5 cycles, local miss LM = 10 cycles, remote miss RM = 15 cycles
  • [Figure: an example DDG with loads n1, n2, n6, adds n3, n8, store n4, sub n5 and div n7, connected by register-flow and memory dependences (some with distance = 1); two recurrences REC1 and REC2 are highlighted, and the MII ranges from 9 up to 33 depending on which of the four latencies is assumed for each load]
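
For reference, the bound the slide is exercising is the standard modulo-scheduling recurrence constraint (textbook background, not taken from the slide):

    \[
      \mathrm{MII} \;=\; \max(\mathrm{ResMII},\, \mathrm{RecMII}),
      \qquad
      \mathrm{RecMII} \;=\; \max_{c \,\in\, \mathrm{recurrences}}
        \left\lceil \frac{\sum_{e \in c} \mathrm{latency}(e)}
                         {\sum_{e \in c} \mathrm{distance}(e)} \right\rceil
    \]

So raising one load on a distance-1 recurrence from L = 1 (local hit) to L = 15 (remote miss) raises that recurrence's bound by 14 cycles, which is why the MII in the figure swings so widely with the latency assignment.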

  14. STEPS 3 and 4 • Step 3: Order instructions • Step 4: Cluster assignment and scheduling

  15. Scheduling Restrictions
  • Remote accesses travel over the shared memory buses, so their latency is NON-DETERMINISTIC!
  • [Figure: clusters 1–4 with cache modules holding a[0]/a[4] and a[3]/a[7]; memory buses connect the modules to the next memory level]

  16. STEPS 3 and 4
  • Step 3: Order instructions
  • Step 4: Cluster assignment and scheduling
  • Non-memory instructions → same as BASE: minimize register communications + maximize workload balance
  • Memory instructions: instructions in the same memory-dependent chain → same cluster
  • IPBC (Interleaved Preferred Build Chains): assign the chain to the average "preferred cluster" of its instructions (see the sketch below)
  • Padding of stack frames and dynamically allocated data to an N×I boundary → meaningful preferred-cluster information
  • IBC (Interleaved Build Chains): minimize the register communications of the first instruction of the chain
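
A minimal sketch of the "preferred cluster" idea behind IPBC, under the same assumed parameters (N = 4 clusters, 4-byte interleaving). The slide says "average preferred cluster"; this sketch approximates that with a plurality vote over the chain's accesses, which is an assumption, and the function names are illustrative, not the paper's:

    #include <stdint.h>
    #include <stdio.h>

    #define WORD_SIZE  4   /* interleaving granularity in bytes (assumption) */
    #define N_CLUSTERS 4

    /* Preferred cluster of one access: the home module of its
     * (statically known, after padding) address. */
    static int preferred_cluster(uintptr_t addr)
    {
        return (int)((addr / WORD_SIZE) % N_CLUSTERS);
    }

    /* IPBC-style choice for a whole memory-dependent chain. */
    static int chain_preferred_cluster(const uintptr_t *addrs, int n)
    {
        int votes[N_CLUSTERS] = {0};
        for (int i = 0; i < n; i++)
            votes[preferred_cluster(addrs[i])]++;
        int best = 0;
        for (int c = 1; c < N_CLUSTERS; c++)
            if (votes[c] > votes[best]) best = c;
        return best;
    }

    int main(void)
    {
        /* Chain touching words 1, 5 and 9: all live in module 1
         * (1 % 4 = 5 % 4 = 9 % 4 = 1) -> preferred cluster 1. */
        uintptr_t chain[] = { 1 * WORD_SIZE, 5 * WORD_SIZE, 9 * WORD_SIZE };
        printf("chain -> cluster %d\n", chain_preferred_cluster(chain, 3));
        return 0;
    }

Padding matters here precisely because preferred_cluster() needs a compile-time-known base address to return anything meaningful.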

  17. Memory Dependent Chains
  • [Figure: the example DDG again (LH = 1, RH = 5, LM = 10, RM = 15), with each memory instruction annotated with its assigned latency and its preferred cluster (values 1 and 2 in the example)]
  • Scheduling order = {n5, n4, n3, n2, n1, n8, n7, n6}

  18. Talk Outline • MultiVLIW • Interleaved-cache clustered VLIW processor • Instruction scheduling algorithms and techniques • Hardware enhancement: Attraction Buffers • Simulation framework • Results • Conclusions

  19. Attraction Buffers
  • Cost-effective mechanism → increases local accesses
  • A small Attraction Buffer (ABuffer) is added to each cluster beside its cache module
  • Example: a load with a 16-byte stride (ld r3, a[3]; ld r3, a[7]; ...) issued from cluster 1 always targets cluster 4's module → 0% local accesses; once those words are attracted into cluster 1's ABuffer, accesses to them become local → 50% local accesses in the example (a lookup sketch follows below)
  • [Figure: four clusters with cache modules holding a[0]/a[4], a[1]/a[5], a[2]/a[6], a[3]/a[7]; cluster 1's ABuffer holds a[3] and a[7]]
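
A toy sketch of the lookup order this implies (local module, then the local ABuffer, and only then a remote access whose result is attracted into the buffer). The structure, its size, the FIFO replacement, and the memory hooks are all illustrative assumptions, not the paper's hardware design:

    #include <stdbool.h>
    #include <stdint.h>

    #define WORD_SIZE  4
    #define N_CLUSTERS 4
    #define AB_ENTRIES 8    /* small, fully associative buffer (assumed size) */

    typedef struct {
        bool      valid[AB_ENTRIES];
        uintptr_t tag[AB_ENTRIES];
        uint32_t  data[AB_ENTRIES];
        int       next;     /* FIFO replacement (assumption) */
    } ABuffer;

    static unsigned home_cluster(uintptr_t a)
    { return (unsigned)((a / WORD_SIZE) % N_CLUSTERS); }

    /* Hypothetical hooks standing in for the real cache hierarchy. */
    static uint32_t local_cache_read(uintptr_t a) { return (uint32_t)a; }
    static uint32_t remote_access(uintptr_t a)    { return (uint32_t)a; }

    /* A load issued from cluster `me`. */
    uint32_t cluster_load(ABuffer *ab, unsigned me, uintptr_t addr)
    {
        if (home_cluster(addr) == me)
            return local_cache_read(addr);        /* local access */
        for (int i = 0; i < AB_ENTRIES; i++)
            if (ab->valid[i] && ab->tag[i] == addr)
                return ab->data[i];               /* attracted: now local */
        uint32_t v = remote_access(addr);         /* remote access */
        int e = ab->next;                         /* attract the word */
        ab->next = (e + 1) % AB_ENTRIES;
        ab->valid[e] = true; ab->tag[e] = addr; ab->data[e] = v;
        return v;
    }

Slide 33's coherence answer fits this sketch: only read-only data is replicated across ABuffers, and their contents are flushed at the end of the loop.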

  20. Talk Outline • MultiVLIW • Interleaved-cache clustered VLIW processor • Instruction scheduling algorithms and techniques • Hardware enhancement: Attraction Buffers • Simulation framework • Results • Conclusions

  21. Evaluation Framework • IMPACT C compiler • Mediabench benchmark suite

  22. Evaluation Framework
  • [Table of processor and memory configuration parameters: not recovered from the slide]

  23. Talk Outline • MultiVLIW • Interleaved-cache clustered VLIW processor • Instruction scheduling algorithms and techniques • Hardware enhancement: Attraction Buffers • Simulation framework • Results • Conclusions

  24. Local Accesses
  • [Chart: percentage of local accesses per benchmark and configuration; legend: OUF = Optimum Unrolling Factor, P = Padding, NC = No Chains]

  25. Why Remote Accesses?
  • Double precision accesses (mpeg2dec)
  • Unclear "preferred cluster" information:
  • Indirect accesses, e.g. a[b[i]] (jpegdec, jpegenc, pegwitdec, pegwitenc)
  • Different alignment (epicenc, jpegdec, jpegenc)
  • Strides not multiple of N×I (selective unrolling, ...)
  • Memory dependent chains (epicdec, pgpdec, pgpenc, rasta):

      for (k=0; k<MAX; k++) {
          for (i=k; i<MAX; i++)
              load a[i];
      }

  26. Stall Time
  • [Chart: memory stall time per benchmark and configuration]

  27. Cycle Count Results
  • [Chart: total cycle counts compared against MultiVLIW and a unified cache]

  28. Talk Outline • MultiVLIW • Interleaved-cache clustered VLIW processor • Instruction scheduling algorithms and techniques • Hardware enhancement: Attraction Buffers • Simulation framework • Results • Conclusions

  29. Conclusions
  • Interleaved cache clustered VLIW processor
  • Effective instruction scheduling techniques:
  • Smart assignment of latencies
  • Loop unrolling + padding (27% more local hits)
  • Sources of remote accesses and stall time identified
  • Attraction Buffers (reduce stall time by up to 34%)
  • Cycle count results:
  • vs. MultiVLIW: 7% slowdown, but simpler hardware
  • vs. a unified cache: 11% speedup

  30. Questions?

  31. Question: Latency Assignment
  • [Backup figure: an example where MII(REC1) = 20 while MII(DDG) = 10]

  32. Question: Padding

      void foo(int *array, int *accum) {
          int i;
          *accum = 0;
          for (i=0; i<MAX; i++)
              *accum += array[i];
      }

      void main() {
          int *a, value;
          a = malloc(MAX*sizeof(int));
          foo(a, &value);
      }

  • Without padding, the start addresses of the malloc'd array and of the stack slot holding value are arbitrary, so the compiler cannot tell which cluster's module each a[i] or *accum maps to
  • [Figure: a[0]/a[4]..., a[1]/a[5]..., a[2]/a[6]..., a[3]/a[7]... spread across the four clusters' modules, with accum landing in one of them]
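
A minimal sketch of the padding fix this backup slide motivates: force dynamically allocated data onto an N×I boundary so each word's home cluster becomes a compile-time fact. The parameters and the use of posix_memalign are illustrative assumptions, not the paper's mechanism:

    #define _POSIX_C_SOURCE 200112L
    #include <stdio.h>
    #include <stdlib.h>

    #define N_CLUSTERS 4
    #define WORD_SIZE  4
    #define NxI (N_CLUSTERS * WORD_SIZE)   /* interleaving period: 16 bytes */

    int main(void)
    {
        /* Unpadded malloc may return any address, so a[0]'s home
         * cluster -- and every load's preferred cluster -- is unknown
         * statically. Aligning the allocation to NxI makes a[i]'s
         * home cluster simply i % N_CLUSTERS. */
        int *a;
        if (posix_memalign((void **)&a, NxI, 64 * sizeof(int)) != 0)
            return 1;
        printf("a starts at %p (a multiple of %d bytes)\n", (void *)a, NxI);
        free(a);
        return 0;
    }

Stack frames need the analogous treatment (padding the frame so locals like accum sit at known offsets from an N×I boundary).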

  33. Question: Coherence
  • Memory dependent chains are mapped to a single cluster, so modified data is present in only one Attraction Buffer
  • Data present in multiple Attraction Buffers is replicated in a read-only manner
  • Local scheduling technique: at the end of the loop → flush the Attraction Buffers' contents
  • [Figure: a[2] replicated read-only in several clusters' ABuffers]
