
Spatial Computation: Computing without General-Purpose Processors

Mihai Budiu, Microsoft Research – Silicon Valley. Joint work with Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein, Carnegie Mellon University. May 10, 2005.


Presentation Transcript


  1. Spatial Computation: Computing without General-Purpose Processors. Mihai Budiu, Microsoft Research – Silicon Valley. Joint work with Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein, Carnegie Mellon University. May 10, 2005

  2. Outline • Intro: Problems of current architectures • Compiling Application-Specific Hardware • ASH Evaluation • Conclusions [Figure: Performance on a log scale (1 to 1000) versus year (1980 to 2000)]

  3. Resources [Intel] • We do not worry about not having hardware resources • We worry about being able to use hardware resources

  4. Complexity • Cannot rely on global signals (clock is a global signal) [Figure: curves labeled Chip size and Designer productivity, with an ALUs label, on a log scale from 10^4 to 10^10 versus year (1981 to 2009); inset labels: gate, wire, 20ps, 5ps]

  5. Complexity • Automatic translation C → HW • Simple, short, unidirectional interconnect • Simple hw, mostly idle • No interpretation • Distributed control, asynchronous • Cannot rely on global signals (clock is a global signal) [Figure: curves labeled Chip size and Designer productivity, with an ALUs label, on a log scale from 10^4 to 10^10 versus year (1981 to 2009); inset labels: gate, wire, 20ps, 5ps]

  6. Our Proposal: Application-Specific Hardware • ASH addresses these problems • ASH is not a panacea • ASH is “complementary” to the CPU [Figure: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation), with $ and Memory]

  7. Outline • Problems of current architectures • CASH: Compiling Application-Specific Hardware • ASH Evaluation • Conclusions

  8. Application-Specific Hardware [Figure: C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw; label: Dataflow machine]

  9. Computation Dataflow • Program: x = a & 7; ... y = x >> 2; • IR: a and 7 feed an & node producing x; x and 2 feed a >> node • Circuits: a feeds an &7 unit followed by a >>2 unit • No interpretation
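The two-statement program on this slide can be mimicked in software as a tiny dataflow graph. The sketch below is only an illustration: the node kinds, the `Node` layout, and the function names are hypothetical and are not taken from the CASH compiler's IR. It also walks the graph in software, which is exactly the interpretation step that the hardware avoids.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical node kinds for a toy dataflow graph (not the CASH IR). */
typedef enum { NODE_INPUT, NODE_CONST, NODE_AND, NODE_SHR } NodeKind;

typedef struct {
    NodeKind kind;
    int lhs;        /* index of left operand node, -1 if unused  */
    int rhs;        /* index of right operand node, -1 if unused */
    uint32_t value; /* constant value, or the computed result    */
} Node;

/* Evaluate nodes in index order; every operand must precede its consumer. */
uint32_t eval_graph(Node *nodes, int count, uint32_t input)
{
    assert(nodes != 0 && count > 0);
    for (int i = 0; i < count; i++) {
        Node *n = &nodes[i];
        switch (n->kind) {
        case NODE_INPUT:
            n->value = input;
            break;
        case NODE_CONST:
            break; /* value already holds the constant */
        case NODE_AND:
            assert(n->lhs >= 0 && n->lhs < i && n->rhs >= 0 && n->rhs < i);
            n->value = nodes[n->lhs].value & nodes[n->rhs].value;
            break;
        case NODE_SHR:
            assert(n->lhs >= 0 && n->lhs < i && n->rhs >= 0 && n->rhs < i);
            /* Mask the shift amount so the shift is always well defined. */
            n->value = nodes[n->lhs].value >> (nodes[n->rhs].value & 31u);
            break;
        }
    }
    return nodes[count - 1].value;
}

/* Graph for: x = a & 7; y = x >> 2; */
uint32_t example_y(uint32_t a)
{
    Node g[] = {
        { NODE_INPUT, -1, -1, 0 }, /* a */
        { NODE_CONST, -1, -1, 7 }, /* 7 */
        { NODE_AND,    0,  1, 0 }, /* x = a & 7  */
        { NODE_CONST, -1, -1, 2 }, /* 2 */
        { NODE_SHR,    2,  3, 0 }, /* y = x >> 2 */
    };
    return eval_graph(g, 5, a);
}
```

For example, `example_y(29)` computes `29 & 7 = 5` and then `5 >> 2 = 1`. In the circuit version each node would be its own hardware unit, so no loop over nodes exists at run time.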

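  10. Basic Computation = Pipeline Stage [Figure: an adder (+) followed by a latch, with data, valid, and ack signals]

  11. Asynchronous Computation [Figure: animation in eight steps (1 to 8) of a chain of adders (+) with latches exchanging data, valid, and ack signals]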
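The data/valid/ack handshake between stages can be approximated in software. The following is a minimal single-threaded sketch under stated assumptions: the `Latch` and `Pipeline` types, the three-stage depth, and the choice that every stage adds 1 are all hypothetical. A latch's `valid` flag plays the role of the valid signal, and an empty downstream latch plays the role of the ack.

```c
#include <assert.h>
#include <stdbool.h>

#define STAGES 3 /* assumed pipeline depth for this sketch */

typedef struct {
    bool valid; /* latch holds a datum not yet taken by the next stage */
    int data;
} Latch;

typedef struct {
    Latch latch[STAGES];
} Pipeline;

/* Offer a value to stage 0. Returns true if the stage accepted (acked) it. */
bool pipeline_push(Pipeline *p, int value)
{
    assert(p != 0);
    if (p->latch[0].valid)
        return false;              /* stage busy: no ack */
    p->latch[0].data = value + 1;  /* stage 0 computes +1 */
    p->latch[0].valid = true;
    return true;
}

/* One round of local handshakes, visiting stages from tail to head. */
void pipeline_step(Pipeline *p)
{
    assert(p != 0);
    for (int i = STAGES - 1; i > 0; i--) {
        if (p->latch[i - 1].valid && !p->latch[i].valid) {
            p->latch[i].data = p->latch[i - 1].data + 1; /* compute +1 */
            p->latch[i].valid = true;
            p->latch[i - 1].valid = false;               /* ack upstream */
        }
    }
}

/* Take the result from the last stage, if one is available. */
bool pipeline_pop(Pipeline *p, int *out)
{
    assert(p != 0 && out != 0);
    Latch *last = &p->latch[STAGES - 1];
    if (!last->valid)
        return false;
    *out = last->data;
    last->valid = false;
    return true;
}
```

Each stage only inspects its immediate neighbors, which mirrors the point that no global clock or controller is needed. The real circuits are asynchronous; the ordered loop in `pipeline_step` is just a convenient way to simulate them.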

  12. Distributed Control Logic • Short, local wires [Figure labels: global FSM; ack, rdy; +, -]

  13. MUX: Forward Branches • if (x > 0) y = -x; else y = b*x; • Conditionals ⇒ Speculation • SSA = no arbitration [Figure: dataflow graph with inputs b, x, 0, operators *, -, >, !, and a mux producing y; the critical path is marked]
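The conditional on this slide can be written in C so that both sides are computed unconditionally and a multiplexer picks the result, which is the speculation the slide refers to. This is a sketch, not the compiler's output: `mux2` and `select_y` are hypothetical names, and the arithmetic is widened to 64 bits as an assumption so that evaluating `-x` speculatively can never overflow in C.

```c
#include <assert.h>
#include <stdint.h>

/* Two-input multiplexer driven by a one-bit predicate. */
int64_t mux2(int pred, int64_t if_true, int64_t if_false)
{
    assert(pred == 0 || pred == 1);
    return pred ? if_true : if_false;
}

/* Speculative form of: if (x > 0) y = -x; else y = b*x; */
int64_t select_y(int32_t b, int32_t x)
{
    int pred = x > 0;                      /* comparison node (>)      */
    int64_t neg = -(int64_t)x;             /* "then" side, always run  */
    int64_t prod = (int64_t)b * (int64_t)x; /* "else" side, always run */
    return mux2(pred, neg, prod);          /* mux selects one result   */
}
```

Because neither side has side effects, running both is safe, and the result is available as soon as the slowest of the three inputs to the mux arrives.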

  14. Control Flow ⇒ Data Flow [Figure: Split (branch): data steered by predicate p and its negation; Merge (label): joins data inputs; Gateway: data and predicate inputs]

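  15. Loops • int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure: dataflow graph of the loop with initial values 0, nodes i, *, +1, < 100, sum, +, a negated predicate (!) guarding ret, and back edges]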
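The loop on this slide can be restated so that the pieces of the dataflow graph are visible in the code: two loop-carried values (`i` and `sum`), one predicate, and a return that is steered by the negated predicate. The function name and the `limit` parameter are assumptions made for this sketch; the slide uses the constant 100.

```c
#include <assert.h>

int sum_of_squares(int limit)
{
    /* Bound chosen so the sum stays well inside the range of int. */
    assert(limit >= 0 && limit <= 1000);

    int i = 0;   /* merge node for i: initial value, later the back edge   */
    int sum = 0; /* merge node for sum: initial value, later the back edge */

    for (;;) {
        int pred = i < limit; /* loop predicate                          */
        if (!pred)
            return sum;       /* negated predicate lets sum reach return */
        int sq = i * i;       /* multiplier                              */
        sum = sum + sq;       /* back edge of sum's loop                 */
        i = i + 1;            /* back edge of i's loop                   */
    }
}
```

`sum_of_squares(100)` returns 328350, the same value as the original loop. Note that the update of `i` does not depend on `sum`, which is what allows the two loops to run at different rates in hardware.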

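  16. Pipelining (step 1) • int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; • Pipelined multiplier (8 stages) [Figure: the loop's dataflow graph with nodes i, 1, +, *, 100, <=, sum, +]

  17. Pipelining (step 2) [Figure: the loop's dataflow graph, next animation step]

  18. Pipelining (step 3) [Figure: the loop's dataflow graph, next animation step]

  19. Pipelining (step 4) [Figure: the loop's dataflow graph, next animation step]

  20. Pipelining (step 5) [Figure: the loop's dataflow graph with the values i=1 and i=0 in flight]

  21. Pipelining (step 6) [Figure: the loop's dataflow graph with the values i=1 and i=0 in flight]

  22. Pipelining (step 7) [Figure: the loop's dataflow graph; labels: i’s loop, sum’s loop, long-latency pipe, predicate]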

  23. Pipelining • The predicate ack edge is on the critical path. [Figure: the loop's dataflow graph with i’s loop, sum’s loop, and the critical path marked]

  24. Pipeline balancing (step 7) [Figure: the loop's dataflow graph with a decoupling FIFO; labels: i’s loop, sum’s loop]

  25. Pipeline balancing [Figure: the loop's dataflow graph with a decoupling FIFO; labels: i’s loop, sum’s loop, critical path]
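The effect of the decoupling FIFO can be explored with a toy timing model. Everything in this sketch is an assumption made for illustration: one abstract time unit per issue and per addition, a multiplier with a fixed latency, and a predicate edge that can hold `pred_capacity` values before the producer has to wait for an ack. It is not a model of the actual circuits or of measured results.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Returns the time at which the last iteration's addition completes.
 *   issue[k]   : time i's loop emits iteration k (value and predicate)
 *   consume[k] : time sum's loop consumes predicate k with product k
 */
long simulate_loop(int iterations, int mul_latency, int pred_capacity)
{
    assert(iterations > 0 && mul_latency > 0 && pred_capacity > 0);

    long *issue = malloc(sizeof *issue * (size_t)iterations);
    long *consume = malloc(sizeof *consume * (size_t)iterations);
    assert(issue != NULL && consume != NULL);

    for (int k = 0; k < iterations; k++) {
        /* i's loop can issue at most once per time unit. */
        long t = (k > 0) ? issue[k - 1] + 1 : 0;

        /* The predicate edge must have room: wait for the ack of the
         * predicate issued pred_capacity iterations earlier. */
        if (k >= pred_capacity && consume[k - pred_capacity] > t)
            t = consume[k - pred_capacity];
        issue[k] = t;

        /* The product arrives after the multiplier latency; the adder
         * handles at most one iteration per time unit. */
        long done = issue[k] + mul_latency;
        if (k > 0 && consume[k - 1] + 1 > done)
            done = consume[k - 1] + 1;
        consume[k] = done;
    }

    long total = consume[iterations - 1];
    free(issue);
    free(consume);
    return total;
}
```

Under these assumptions, 100 iterations with an 8-stage multiplier finish at time 800 when the predicate edge holds a single value, and at time 107 when it can hold 8. The numbers describe only this toy model, but they show why buffering the predicate lets i's loop run ahead and keep the multiplier pipeline full.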

  26. Procedures [Figure labels: Caller, Callee, Call, Argument, Return, Continuation]

  27. Memory Access • Local communication, global structures • Future work: fragment this! [Figure: LD and ST operations connected through a pipelined arbitrated network to a monolithic memory]

  28. Outline • Problems of current architectures • Compiling ASH • ASH Evaluation • Conclusions

  29. Evaluating ASH • Mediabench kernels (1 hot function/benchmark) • Flow: C → CASH core → Verilog back-end → commercial tools (Synopsys, Cadence P/R) → ASIC • 180nm std. cell library, 2V (~1999 technology) • ModelSim (Verilog simulation) with Mem produces the performance numbers

  30. Compile Time • C: 200 lines • CASH core: 20 seconds • Verilog back-end: 10 seconds • Commercial tools (Synopsys, Cadence P/R): 20 minutes, 1 hour [Figure: the same flow from C to ASIC, with Mem]

  31. ASH Area (mm²) [Figure: area chart with reference marks: P4: 217; minimal RISC core]

  32. ASH vs 600MHz CPU [4-wide OOO, .18 μm]

  33. Bottleneck: Memory Protocol • Enabling dependent operations requires a round-trip to memory. • Exploring novel memory access protocols. [Figure labels: LD, ST, LSQ, Memory]

  34. Power (mW) [Figure: power chart with reference marks: Xeon [+cache]: 67000; μP: 4000; DSP: 110]

  35. Energy-delay

  36. Energy Efficiency (op/nJ)

  37. Energy Efficiency [Operations/nJ] [Figure: log-scale axis comparing Dedicated hardware, ASH media kernels, Asynchronous μP, FPGA, General-purpose DSP, and Microprocessors; a 1000x span is marked]

  38. Outline • Problems of current architectures • Compiling ASH • Evaluation • Related work, Conclusions

  39. Bibliography • Dataflow: A Complement to Superscalar. Mihai Budiu, Pedro Artigas, and Seth Copen Goldstein. ISPASS 2005 • Spatial Computation. Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. ASPLOS 2004 • C to Asynchronous Dataflow Circuits: An End-to-End Toolflow. Girish Venkataramani, Mihai Budiu, Tiberiu Chelcea, and Seth Copen Goldstein. IWLS 2004 • Optimizing Memory Accesses For Spatial Computation. Mihai Budiu and Seth Copen Goldstein. CGO 2003 • Compiling Application-Specific Hardware. Mihai Budiu and Seth Copen Goldstein. FPL 2002

  40. Related Work • Optimizing compilers • High-level synthesis • Reconfigurable computing • Dataflow machines • Asynchronous circuits • Spatial computation We target an extreme point in the design space: no interpretation, fully distributed computation and control

  41. ASH Design Point • Design an ASIC in a day • Fully automatic synthesis to layout • Fully distributed control and computation (spatial computation) • Replicate computation to simplify wires • Energy/op rivals custom ASIC • Performance rivals superscalar • E×t 100 times better than any processor

  42. Conclusions Spatial computation strengths

  43. Backup Slides • Absolute performance • Control logic • Exceptions • Leniency • Normalized area • ASH weaknesses • Splitting memory • Recursive calls • Leakage • Why not compare to… • Targeting FPGAs

  44. Absolute Performance [Figure: performance chart with the CPU range marked]

  45. Pipeline Stage [Figure labels: ack_out, C, rdy_in, ack_in, rdy_out, =, D, Reg, data_out, data_in]

  46. Exceptions • Strictly speaking, C has no exceptions • In practice it is hard to accommodate exceptions in hardware implementations • An advantage of software flexibility: the PC is a single point of execution control [Figure: CPU (low-ILP computation + OS + VM + exceptions) and ASH (high-ILP computation), with $$$ and Memory]

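  47. Critical Paths • if (x > 0) y = -x; else y = b*x; [Figure: dataflow graph with inputs b, x, 0, operators *, -, >, !, and output y]

  48. Lenient Operations • if (x > 0) y = -x; else y = b*x; • Solves the problem of unbalanced paths [Figure: dataflow graph with inputs b, x, 0, operators *, -, >, !, and output y]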
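One way to see why leniency helps with unbalanced paths is to compare when a multiplexer's output becomes available under two policies. The sketch below is a simplified timing model built on assumptions: the `Arrival` struct, the function names, and the latencies used in the example are all hypothetical, and the model considers only the output time, not when the operation acknowledges its inputs.

```c
#include <assert.h>

/* Arrival times of the three mux inputs, in abstract time units. */
typedef struct {
    long pred;      /* predicate, e.g. x > 0    */
    long then_side; /* value of -x              */
    long else_side; /* value of b*x (slow path) */
} Arrival;

static long max2(long a, long b) { return a > b ? a : b; }

/* Strict mux: waits for every input before producing an output. */
long strict_mux_time(Arrival t)
{
    assert(t.pred >= 0 && t.then_side >= 0 && t.else_side >= 0);
    return max2(t.pred, max2(t.then_side, t.else_side));
}

/* Lenient mux: needs only the predicate and the input it selects. */
long lenient_mux_time(Arrival t, int pred_value)
{
    assert(t.pred >= 0 && t.then_side >= 0 && t.else_side >= 0);
    assert(pred_value == 0 || pred_value == 1);
    long selected = pred_value ? t.then_side : t.else_side;
    return max2(t.pred, selected);
}
```

With assumed arrival times of 1 for the predicate, 1 for the negation, and 8 for the multiplication, the strict mux always produces its output at time 8. The lenient mux produces it at time 1 when `x > 0` selects the fast side, and at time 8 only when the slow multiplication is actually needed.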

  49. Normalized Area

  50. ASH Weaknesses • Both branch and join not free • Static dataflow (no re-issue of same instr) • Memory is “far” • Fully static • No branch prediction • No dynamic unrolling • No register renaming • Calls/returns not lenient
