
Spatial Computation


Presentation Transcript


  1. Spatial Computation Mihai Budiu CMU CS CALCM Seminar, Oct 21, 2003

  2. CPU Problems • Design Complexity • Power • Global Signals • Limited issue window ⇒ limited ILP

  3. Communication vs. Computation (figure: wire and gate delays, 5 ps and 20 ps). Power consumption on wires is also dominant.

  4. Global Communication (figure: instruction unit, registers, interconnection network)

  5. Our Approach: ASH (Application-Specific Hardware)

  6. 1) Unroll Pipeline (figure: the original processor's instruction unit, registers, and network, unrolled into repeated register/network stages)

  7. Resource Binding Time (figure: programs bound to a CPU vs. to ASH)

  8. 2) Specialize Pipeline (figure: fixed program; instruction unit; register/network pipeline stages)

  9. 2) Specialize Pipeline: Functional Units (figure: fixed program; instruction unit; register/network pipeline stages)

  10. 2) Specialize Pipeline: Interconnection Network (figure: fixed program; instruction unit; registers)

  11. 2) Specialize Pipeline: Register Files (figure: fixed program; instruction unit)

  12. 2) Specialize Pipeline: Shrink Wires (figure: fixed program; instruction unit)

  13. 2) Specialize Pipeline: No Instruction Fetch, Decode, Issue (figure)

  14. Loops (figure)

  15. Memory (figure: the spatial computation fabric accesses memory through a load-store queue, LSQ)

  16. Outline • Introduction • CASH: Compiling for ASH • ASH vs CPU • Analyzing the Results • Conclusions

  17. Application-Specific Hardware (figure: C program → Compiler → Dataflow IR, a dataflow machine, implemented in reconfigurable/custom hardware)

  18. Asynchronous Computation
      (figure: an adder feeding a latch, with "data valid" and "ack" handshake signals)
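
  A minimal software sketch of the data/valid/ack handshake the figure shows, modeled in plain C: a producer may overwrite the latch only after the consumer has acknowledged the previous value. The channel_t type and the function names are illustrative, not part of the CASH toolchain.

      #include <stdbool.h>
      #include <stdio.h>

      typedef struct {
          int  data;    /* value held in the output latch       */
          bool valid;   /* producer has latched a new value     */
          bool ack;     /* consumer has taken the latched value */
      } channel_t;

      /* Producer side: latch a new value only after the previous one
       * was acknowledged. Returns true if the value was accepted. */
      static bool channel_send(channel_t *ch, int value) {
          if (ch->valid && !ch->ack)
              return false;              /* still waiting for downstream */
          ch->data  = value;
          ch->valid = true;
          ch->ack   = false;
          return true;
      }

      /* Consumer side: take the latched value if one is available. */
      static bool channel_recv(channel_t *ch, int *out) {
          if (!ch->valid)
              return false;              /* nothing latched yet */
          *out      = ch->data;
          ch->ack   = true;              /* producer may now overwrite */
          ch->valid = false;
          return true;
      }

      int main(void) {
          channel_t ch = { 0, false, false };
          int v;
          channel_send(&ch, 42);
          if (channel_recv(&ch, &v))
              printf("received %d\n", v);
          return 0;
      }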

  19. Distributed Control Logic
      (figure: each operator, e.g. + and -, has its own FSM with rdy/ack handshake signals; more info in the Control Logic backup slide)

  20. Forward Branches
      if (x > 0) y = -x; else y = b*x;
      (figure: dataflow graph computing y from b and x, with *, -, > and ! nodes)
      Conditionals ⇒ Speculation
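
  A minimal C sketch of the speculation the slide illustrates: both arms of the conditional are computed unconditionally and the predicate only selects the result, as a multiplexer would in hardware. compute_y_spec is an illustrative name, not part of CASH.

      #include <stdio.h>

      /* Both arms are evaluated; the predicate selects one result. */
      static int compute_y_spec(int x, int b) {
          int y_then = -x;            /* speculative: computed even if x <= 0 */
          int y_else = b * x;         /* speculative: computed even if x > 0  */
          int p      = (x > 0);       /* predicate */
          return p ? y_then : y_else; /* select one result, like a mux        */
      }

      int main(void) {
          printf("%d %d\n", compute_y_spec(3, 5), compute_y_spec(-2, 5));
          return 0;
      }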

  21. Control Flow ⇒ Data Flow
      (figure: Split (branch), Merge, and Gateway nodes, operating on data and predicate values)
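
  A rough token-passing model in C of the usual semantics of such nodes; the struct and function names are illustrative, and the real operators are hardware dataflow nodes rather than software.

      #include <stdbool.h>
      #include <stdio.h>

      typedef struct {
          bool has_token;   /* is a value currently present on this port? */
          int  data;
      } port_t;

      /* Split (branch): steer the incoming data token to the true or the
       * false output, depending on the predicate. */
      static void split_node(port_t in, bool predicate,
                             port_t *out_true, port_t *out_false) {
          if (!in.has_token)
              return;
          if (predicate)
              *out_true = in;
          else
              *out_false = in;
      }

      /* Merge: forward whichever input carries a token (in a well-formed
       * graph at most one side is active at a time). */
      static port_t merge_node(port_t a, port_t b) {
          return a.has_token ? a : b;
      }

      int main(void) {
          port_t in = { true, 42 }, t = { false, 0 }, f = { false, 0 };
          split_node(in, true, &t, &f);    /* token steered to the true side */
          port_t out = merge_node(t, f);   /* merge forwards it              */
          printf("%d\n", out.data);        /* prints 42                      */
          return 0;
      }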

  22. Loops
      int sum=0, i;
      for (i=0; i < 100; i++)
          sum += i*i;
      return sum;
      (figure: dataflow graph with *, +, +1, and < 100 nodes computing sum around a loop back edge)

  23. Outline • Introduction • Compiling for ASH • ASH vs CPU • Analyzing the Results • Conclusions

  24. ASH vs: • 4- & 8-wide VLIWs • Superscalar, media kernels • Superscalar, SpecInt95

  25. OpenDIVX IDCT, Normalized Running Time

  26. OpenDIVX IDCT, Sustained IPC (chart; annotations: "includes speculative ops", "no data")

  27. Media Kernels, vs 4-way OOO

  28. Media Kernels, IPC

  29. Cost of Performance

  30. This Is Obvious! ASH runs at full dataflow speed, so the CPU cannot do any better (if the compilers are equally good). Wrong!

  31. SpecInt95, ASH vs 4-way OOO

  32. Outline • Introduction: spatial computation • CASH: Compiling for ASH • ASH vs CPU • Dissection • Conclusions

  33. The (Loop) Body
      for (i = 0; i < 64; i++) {
          for (j = 0; X[j].r != 0xF; j++)
              if (X[j].r == i) break;
          Y[i] = X[j].q;
      }
      (SpecInt95, 124.m88ksim, init_processor, stylized)

  34. Dynamic Critical Path (definition in the backup slides)
      for (j = 0; X[j].r != 0xF; j++)
          if (X[j].r == i) break;
      (figure: the critical path runs through the address computation sizeof(X[j]), the load, the load predicate, and the loop predicate)

  35. MIPS gcc Code
      LOOP:
      L1: beq   $v0,$a1,EXIT   ; X[j].r == i
      L2: addiu $v1,$v1,20     ; &X[j+1].r
      L3: lw    $v0,0($v1)     ; X[j+1].r
      L4: addiu $a0,$a0,1      ; j++
      L5: bne   $v0,$a3,LOOP   ; X[j+1].r == 0xF
      EXIT:
      Source: for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
      L1 → L2 → L3 → L5 → L1: 4-instruction loop-carried dependence

  36. If Branch Prediction Correct
      (same MIPS loop as slide 35: L1 beq, L2 addiu, L3 lw, L4 addiu, L5 bne)
      for (j = 0; X[j].r != 0xF; j++) if (X[j].r == i) break;
      Critical path: L1 → L2 → L3 → L5 → L1
      Superscalar is issue-limited! 2 cycles/iteration sustained
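
  The sustained rate quoted on the slide is consistent with a simple issue-width bound, assuming correct prediction and no other stalls: the loop body has 5 instructions (L1 through L5) and the machine issues at most 4 per cycle.

      $\left\lceil \frac{5\ \text{instructions/iteration}}{4\ \text{instructions/cycle}} \right\rceil = \left\lceil 1.25 \right\rceil = 2\ \text{cycles/iteration}$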

  37. SpecInt95, perfect prediction

  38. Critical Path with Prediction
      (figure: critical path of the loop below; loads are not speculative)
      for (j = 0; X[j].r != 0xF; j++)
          if (X[j].r == i) break;

  39. Prediction + Load Speculation
      (figure: the critical path now goes through an ack edge, ~4 cycles per iteration; the load is not pipelined because of a self-anti-dependence)
      for (j = 0; X[j].r != 0xF; j++)
          if (X[j].r == i) break;

  40. OOO Pipe Snapshot: Register Renaming
      (same MIPS loop as slide 35)
      (figure: snapshot of the out-of-order pipeline stages IF, DA, EX, WB, CT, each occupied by renamed copies of L1-L5 from several iterations)

  41. Unrolling?
      for (i = 0; i < 64; i++) {
          for (j = 0; X[j].r != 0xF; j += 2) {
              if (X[j].r == i) break;
              if (X[j+1].r == 0xF) break;
              if (X[j+1].r == i) break;
          }
          Y[i] = X[j].q;
      }
      (when 1 iteration)

  42. ASH Problems • Both branch and join not free • Static dataflow (no re-issue of same instr) • Memory is “far” • Fully static • No branch prediction • No dynamic unrolling • No register renaming • Calls/returns not lenient • No virtualization • No dynamic optimization

  43. Outline • Introduction: spatial computation • CASH: Compiling for ASH • ASH vs CPU • Result Analysis • Conclusions

  44. Conclusions • ASH promising for media processing; to evaluate • power • performance • cost • Prediction does much more than avoid issue stalls • von Neumann model of computation very powerful • hardware resources are not everything

  45. Backup Slides • Evaluation model • Control logic • Pipeline balancing • Lenient execution • Dynamic Critical Path

  46. How Performance Is Evaluated (figure: C program on the ASH fabric with unlimited ILP; LSQ with limited bandwidth, 2 words/cycle; 8K L1; 1/4M L2; main memory; 8- and 72-cycle latencies)

  47. Simulation Parameters • Compared to 4-wide OOO SimpleScalar • Same operation latencies • Same cache hierarchy • No measurements in library functions/OS • 3-cycle multiply, 20-cycle divide

  48. Control Logic
      (figure: handshake control around the data register (Reg): rdyin/rdyout and ackin/ackout signals, C elements, and D blocks between datain and dataout)
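
  A sketch of a Muller C-element in C, assuming that is what the "C" boxes in the figure denote (the standard primitive for asynchronous rdy/ack handshake control); the function name is illustrative. The output follows the inputs when they agree and otherwise holds its previous value.

      #include <stdbool.h>
      #include <stdio.h>

      /* Muller C-element: output follows the inputs when they agree,
       * otherwise it holds its previous state. */
      static bool c_element(bool a, bool b, bool prev_out) {
          if (a == b)
              return a;        /* inputs agree: output follows them    */
          return prev_out;     /* inputs disagree: hold previous state */
      }

      int main(void) {
          bool out = false;
          out = c_element(true,  false, out);   /* still false (hold) */
          out = c_element(true,  true,  out);   /* goes high          */
          out = c_element(false, true,  out);   /* stays high (hold)  */
          printf("%d\n", out);                  /* prints 1           */
          return 0;
      }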

  49. Outline • Introduction • Compiling for ASH • ASH at run-time • ASH vs CPU • Conclusions

  50. Critical Paths
      if (x > 0) y = -x; else y = b*x;
      (figure: the dataflow graph from slide 20 for this code)
