Spatial Computation

Presentation Transcript


  1. Spatial Computation: Computing without General-Purpose Processors. Mihai Budiu, mihaib@cs.cmu.edu, Carnegie Mellon University, July 8, 2004

  2. Spatial Computation • A computation model based on: • application-specific hardware • no interpretation • minimal resource sharing

  3. The Engine Behind This Talk main( ) { signal(SIGINT, welcome); while (slides( ) && time( )) { talk( ); } }

  4. Research Scope • Object: future architectures • Tool: compilers • Evaluation: simulators

  5. Research Methodology [Figure: constraint space with axes X (e.g., power) and Y (e.g., cost); labels: state-of-the-art, “reasonable limits”, incremental evolution, new solutions]

  6. Outline • Introduction: problems of current architectures • Compiling Application-Specific Hardware • Pipelining • ASH Evaluation • Conclusions [Figure: performance on a log scale from 1 to 1000, by year from 1980 to 2000]

  7. Resources [Intel] • We do not worry about not having hardware resources • We worry about being able to use hardware resources

  8. Design Complexity [Figure: chip size in transistors (10^4 to 10^10) and designer productivity, by year from 1981 to 2009]

  9. Communication vs. Computation [Figure: wire delay and gate delay, labeled 5ps and 20ps] • Power consumption on wires is also dominant

  10. Power Consumption Toasted CPU: about 2 sec after removing cooler. (Tom’s Hardware Guide)

  11. Energy Efficiency [Figure: Pentium 4, with the ALUs labeled]

  12. Clock Speed [Figure labels: 3GHz, 6GHz, 10GHz] • Cannot rely on global signals (clock is a global signal)

  13. Instruction-Set Architecture [Figure: Software, ISA, Hardware layers] • VERY rigid to changes (e.g., x86 vs. Itanium)

  14. Our Proposal • ASH addresses these problems • ASH is not a panacea • ASH “complementary” to CPU [Figure: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation), with $ and Memory]

  15. Outline • Problems of current architectures • CASH: Compiling ASH • program representation • compiling C programs • Pipelining • ASH Evaluation • Conclusions

  16. Application-Specific Hardware [Figure: C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw; other labels: SW, HW, ISA, Dataflow machine]

  17. Application-Specific Hardware [Figure, labeled “Soft”: C program → Compiler → Dataflow IR → SW backend → Machine code → CPU [predication]]

  18. Key: Intermediate Representation • Traditionally: may-dep., CFG, ..., def-use • Our IR: • SSA + predication + speculation • Uniform for scalars and memory • Explicitly encodes may-depend • Executable • Precise semantics • Dataflow IR • Close to asynchronous target

  19. Computation = Dataflow • Programs: x = a & 7; ... y = x >> 2; • Circuits: [Figure: a and 7 feed an & unit producing x; x and 2 feed a >> unit] • Operations ⇒ functional units • Variables ⇒ wires • No interpretation
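
The mapping on this slide can be mimicked in software as a rough sketch: each operation becomes its own small function standing in for a functional unit, and each variable becomes a value handed directly from producer to consumer, standing in for a wire. The names `unit_and`, `unit_shr`, and `circuit` are hypothetical, chosen only for illustration; actual ASH is hardware, not C functions.

```c
#include <assert.h>
#include <stdint.h>

/* One dedicated "functional unit" per operation in the program. */
static uint32_t unit_and(uint32_t a, uint32_t b) { return a & b; }

static uint32_t unit_shr(uint32_t a, uint32_t n) {
    assert(n < 32);            /* shifting a 32-bit value by 32 or more is undefined in C */
    return a >> n;
}

/* x = a & 7; y = x >> 2;  Each local below plays the role of a wire. */
uint32_t circuit(uint32_t a) {
    uint32_t x = unit_and(a, 7);   /* wire x: output of the & unit  */
    uint32_t y = unit_shr(x, 2);   /* wire y: output of the >> unit */
    return y;
}
```

For example, `circuit(13)` computes `13 & 7 = 5` and then `5 >> 2 = 1`.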

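  20. Basic Computation [Figure: an adder (+) followed by a latch, with data, valid, and ack signals]

  21. Asynchronous Computation [Figure: the adder and latch with data, valid, and ack signals, animated in eight numbered steps, 1 to 8]

  22. Distributed Control Logic [Figure: + and - units with rdy and ack signals; label: global FSM] • short, local wires • asynchronous control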
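
A minimal software sketch of the data/valid/ack handshake between a producer and a latch, assuming a one-place latch and ignoring signal timing altogether. The slides do not spell out the exact signaling protocol, so the type `Latch` and the functions `latch_put` and `latch_take` are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One-place latch: `valid` says whether `data` holds an unconsumed value. */
typedef struct {
    int  data;
    bool valid;
} Latch;

/* Producer side: offer a value. The return value plays the role of "ack":
 * true if the latch was empty and accepted the value, false if it is still full. */
bool latch_put(Latch *l, int value) {
    assert(l != NULL);
    if (l->valid) return false;    /* previous value not yet consumed */
    l->data  = value;
    l->valid = true;
    return true;
}

/* Consumer side: take the value if one is valid, which frees the latch. */
bool latch_take(Latch *l, int *out) {
    assert(l != NULL && out != NULL);
    if (!l->valid) return false;   /* nothing to consume yet */
    *out     = l->data;
    l->valid = false;
    return true;
}
```

In this sketch a second `latch_put` fails until `latch_take` has emptied the latch, which is the back-pressure that lets each stage run without a global clock.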

  23. Outline • Problems of current architectures • CASH: Compiling ASH • program representation • compiling C programs • Pipelining • ASH Evaluation • Conclusions

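  24. MUX: Forward Branches • if (x > 0) y = -x; else y = b*x; • SSA = no arbitration • Conditionals ⇒ Speculation [Figure: dataflow graph with inputs b, x, and 0; operators *, -, >, and ! feeding a mux that produces y; critical path marked]

  25. Control Flow ⇒ Data Flow [Figure: three node types, Merge (label), Gateway, and Split (branch); edges labeled data and predicate; Split labeled with p and !]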
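
A rough sketch of what "Conditionals ⇒ Speculation" means for the code on this slide: both arms are computed unconditionally, and the predicate only selects which result reaches `y`. The function names `branchy` and `speculative` are hypothetical. The sketch assumes inputs small enough that neither `-x` nor `b*x` overflows, since in C the speculated arm is evaluated even when it is not selected.

```c
#include <assert.h>
#include <limits.h>

/* Reference version: ordinary control flow, as written on the slide. */
int branchy(int b, int x) {
    int y;
    if (x > 0) y = -x; else y = b * x;
    return y;
}

/* Speculative version: every operation fires; a mux picks the result. */
int speculative(int b, int x) {
    assert(x != INT_MIN);      /* -x would overflow */
    int p   = x > 0;           /* predicate                         */
    int neg = -x;              /* computed regardless of p          */
    int mul = b * x;           /* computed regardless of p          */
    return p ? neg : mul;      /* mux: p selects neg, !p selects mul */
}
```

Both functions return the same value for any input in the assumed range, for instance `-5` for `(b, x) = (3, 5)` and `-6` for `(3, -2)`.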

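  26. Loops • int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure: dataflow graph with a loop for i (0, +1, < 100), a loop for sum (0, *, +), and ret driven by the negated predicate (!)]

  27. Predication and Side-Effects [Figure: Load node with addr, pred, and token inputs, data and token outputs, and a connection to memory] • no speculation • sequencing of side-effects

  28. Memory Access [Figure: LD and ST operations, a pipelined arbitrated network, and a monolithic memory; labels: local communication, global structures, complexity, related work] • Future work: fragment this!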
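
A rough sketch of how the loop on this slide splits into two coupled cycles, one carrying `i` and one carrying `sum`, both steered by the loop predicate. This is ordinary sequential C that only mirrors the shape of the graph; the function name `sum_of_squares_dataflow` and the `limit` parameter are hypothetical.

```c
#include <assert.h>

int sum_of_squares_dataflow(int limit) {
    assert(limit >= 0 && limit <= 1000);  /* keeps sum within int range */
    int i = 0, sum = 0;                   /* initial values entering the loops */
    for (;;) {
        int pred = i < limit;             /* loop predicate                      */
        if (!pred) return sum;            /* negated predicate drives the return */
        int sq = i * i;                   /* multiplier                          */
        sum = sum + sq;                   /* back edge of the sum loop           */
        i = i + 1;                        /* back edge of the i loop             */
    }
}
```

With `limit = 100` this matches the slide's program and returns 328350, the sum of squares from 0 to 99.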

  29. CASH Optimizations • SSA-based optimizations • unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining • Memory optimizations • dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling • Boolean optimizations • Espresso CAD tool, bitwidth analysis

  30. Outline • Problems of current architectures • Compiling ASH • Pipelining • Evaluation: CASH vs. clocked designs • Conclusions

  31. Pipelining • int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure, step 1: i's loop (+ 1, <= 100) feeding a pipelined multiplier (8 stages) and sum's loop (+)]

  32. Pipelining [Figure, step 2: same circuit]

  33. Pipelining [Figure, step 3: same circuit]

  34. Pipelining [Figure, step 4: same circuit]

  35. Pipelining [Figure, step 5: same circuit, with values i=1 and i=0 in flight]

  36. Pipelining [Figure, step 6: same circuit, with values i=1 and i=0 in flight]

  37. Pipelining [Figure, step 7: same circuit; labels: i’s loop, sum’s loop, long-latency pipe, predicate]

  38. Pipelining • Predicate ack edge is on the critical path. [Figure: same circuit; labels: i’s loop, sum’s loop, critical path]

  39. Pipeline balancing [Figure, step 7: same circuit with a decoupling FIFO added; labels: i’s loop, sum’s loop]

  40. Pipeline balancing [Figure: same circuit with the decoupling FIFO; labels: i’s loop, sum’s loop, critical path]
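
A toy timing model of why a decoupling FIFO helps, offered as a sketch under assumptions that are not taken from the talk: i's loop can issue one predicate per time unit, the multiplier result for iteration k arrives `mult_latency` units after issue, sum's loop can consume predicate k only once that result has arrived, and at most `fifo_slots` unconsumed predicates may be outstanding (1 models a single latch). The function name `finish_time` and all parameters are hypothetical.

```c
#include <assert.h>

enum { MAX_ITERS = 1024 };

/* Returns the time at which the last iteration's predicate is consumed. */
long finish_time(int iters, int mult_latency, int fifo_slots) {
    assert(iters > 0 && iters <= MAX_ITERS);
    assert(mult_latency >= 1 && fifo_slots >= 1);
    long issue[MAX_ITERS], consume[MAX_ITERS];
    for (int k = 0; k < iters; k++) {
        /* i's loop: one issue per time unit at best ... */
        long t = (k == 0) ? 0 : issue[k - 1] + 1;
        /* ... but it stalls until a slot frees up (the ack arrives). */
        if (k >= fifo_slots && consume[k - fifo_slots] > t)
            t = consume[k - fifo_slots];
        issue[k] = t;
        /* sum's loop: needs the multiplier result, one iteration per unit. */
        long c = issue[k] + mult_latency;
        if (k > 0 && consume[k - 1] + 1 > c)
            c = consume[k - 1] + 1;
        consume[k] = c;
    }
    return consume[iters - 1];
}
```

Under these assumptions, 100 iterations through an 8-stage multiplier finish at time 800 with a single slot and at time 107 with 8 slots, because i's loop no longer waits a full multiplier latency for each ack.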

  41. Outline • Problems of current architectures • Compiling ASH • Pipelining • Evaluation: CASH vs. clocked designs • Conclusions

  42. Evaluating ASH • Mediabench kernels (1 hot function/benchmark) • Flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC • 180nm std. cell library, 2V (~1999 technology) • ModelSim (Verilog simulation), with Mem → performance numbers

  43. ASH Area [Figure: normalized area; reference marks: minimal RISC core, P4: 217]

  44. ASH vs 600MHz CPU [0.18 μm]

  45. Bottleneck: Memory Protocol • Token release to dependents: requires round-trip to memory. • Limit study: round trip zero time ⇒ up to 6x speed-up. • Exploring protocol for in-order data delivery & fast token release. [Figure: LD and ST, LSQ, Memory]

  46. Power [Figure: power comparison; reference values: Xeon [+cache] 67000, μP 4000, DSP 110]

  47. Energy Efficiency [Figure: energy efficiency in operations/nJ on a log scale, for microprocessors, general-purpose DSP, asynchronous μP, FPGAs, ASH media kernels, and dedicated hardware; a 1000x gap is marked]

  48. Outline • Problems of current architectures • Compiling ASH • Pipelining • ASH Evaluation • Future/related work & conclusions

  49. Related Work • Asynchronous circuits • Nanotechnology • Dataflow machines • Embedded systems • High-level synthesis • Reconfigurable computing • Computer architecture • Compilation

  50. Future Work • Optimizations for area/speed/power • Memory partitioning • Concurrency • Compiler-guided layout • Explore extensible ISAs • Hybridization with superscalar mechanisms • Reconfigurable hardware support for ASH • Formal verification
