Spatial Computation: Computing without General-Purpose Processors • Mihai Budiu, mihaib@cs.cmu.edu • Carnegie Mellon University • July 8, 2004
Spatial Computation • A computation model based on: • application-specific hardware • no interpretation • minimal resource sharing
The Engine Behind This Talk main( ) { signal(SIGINT, welcome); while (slides( ) && time( )) { talk( ); } }
Research Scope • Object: future architectures • Tool: compilers • Evaluation: simulators
Research Methodology [Figure: constraint space with axes X (e.g., power) and Y (e.g., cost), marking the state of the art, the “reasonable limits”, incremental evolution, and new solutions]
Outline • Introduction: problems of current architectures • Compiling Application-Specific Hardware • Pipelining • ASH Evaluation • Conclusions [Figure: performance on a log scale (1 to 1000), 1980 to 2000]
Resources [Intel] • We do not worry about not having hardware resources • We worry about being able to use hardware resources
Design Complexity [Figure: chip size in transistors (log scale, 10^4 to 10^10) versus designer productivity, 1981 to 2009]
Communication vs. Computation [Figure: wire delay versus gate delay; labels: 5ps, 20ps] • Power consumption on wires is also dominant
Power Consumption Toasted CPU: about 2 sec after removing cooler. (Tom’s Hardware Guide)
Energy Efficiency [Figure: Pentium 4, with the ALUs labeled]
Clock Speed [Figure: labels 3GHz, 6GHz, 10GHz] • Cannot rely on global signals (clock is a global signal)
Instruction-Set Architecture [Figure: Software / ISA / Hardware layers] • VERY rigid to changes (e.g., x86 vs. Itanium)
Our Proposal [Figure: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation), both connected to $ and Memory] • ASH addresses these problems • ASH is not a panacea • ASH is “complementary” to the CPU
Outline • Problems of current architectures • CASH: Compiling ASH • program representation • compiling C programs • Pipelining • ASH Evaluation • Conclusions
Application-Specific Hardware [Figure: C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw (dataflow machine); side labels: SW, ISA, HW]
Application-Specific Hardware [Figure, variant labeled “Soft”: C program → Compiler → Dataflow IR → SW backend → Machine code → CPU [predication]]
Key: Intermediate Representation • Traditionally: CFG, def-use, may-dep., ... • Our IR: • SSA + predication + speculation • Uniform for scalars and memory • Explicitly encodes may-depend • Executable • Precise semantics • Dataflow IR • Close to asynchronous target
Computation = Dataflow • Programs: x = a & 7; ... y = x >> 2; • Circuits: [Figure: a and 7 feed an & unit producing x; x and 2 feed a >> unit] • Operations ⇒ functional units • Variables ⇒ wires • No interpretation
Basic Computation [Figure: an adder (+) followed by a latch, with data, valid, and ack signals]
Asynchronous Computation [Figure: eight animation steps (1 to 8) of adders (+) and latches exchanging data, valid, and ack signals]
Distributed Control Logic [Figure: a global FSM contrasted with asynchronous control, where + and - units exchange rdy and ack signals] • short, local wires • asynchronous control
Outline • Problems of current architectures • CASH: Compiling ASH • program representation • compiling C programs • Pipelining • ASH Evaluation • Conclusions
MUX: Forward Branches • if (x > 0) y = -x; else y = b*x; [Figure: b and x feed *, x feeds -, and x > 0 (with its negation !) selects the result y; the critical path is marked] • SSA = no arbitration • Conditionals ⇒ Speculation
Control Flow ⇒ Data Flow [Figure: three node types with data and predicate (p, !p) connections: Merge (label), Split (branch), and Gateway]
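Loops • int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure: loop circuit: i starts at 0, is incremented by +1, and is compared with < 100; sum starts at 0 and accumulates i * i; the negated predicate (!) releases sum to ret]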
Predication and Side-Effects [Figure: a Load node with addr, pred, and token inputs and data and token outputs, connected to memory] • pred: no speculation • token: sequencing of side-effects
Memory Access [Figure: LD and ST nodes reach a monolithic memory through a pipelined arbitrated network; labels: local communication, global structures, complexity, related work] • Future work: fragment this!
CASH Optimizations • SSA-based optimizations: unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining • Memory optimizations: dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling • Boolean optimizations: Espresso CAD tool, bitwidth analysis
Outline • Problems of current architectures • Compiling ASH • Pipelining • Evaluation: CASH vs. clocked designs • Conclusions
Pipelining • int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [Figure, step 1: loop circuit in which i's loop (1, +, 100, <=) feeds a pipelined multiplier (8 stages) and sum's loop (+)]
Pipelining [Figure, steps 2 to 6: the same circuit as successive values (i=0, i=1) move through it]
Pipelining [Figure, step 7: the same circuit with i's loop, sum's loop, the long-latency pipe, and the predicate labeled]
Pipelining [Figure: the same circuit with i's loop, sum's loop, and the critical path marked] • The predicate ack edge is on the critical path.
Pipeline balancing [Figure, step 7: the loop circuit with a decoupling FIFO added; i's loop and sum's loop are labeled]
Pipeline balancing [Figure: the circuit with the decoupling FIFO, i's loop, sum's loop, and the critical path marked]
Outline • Problems of current architectures • Compiling ASH • Pipelining • Evaluation: CASH vs. clocked designs • Conclusions
Evaluating ASH [Figure: tool flow] • Mediabench kernels in C (1 hot function/benchmark) • CASH core + Verilog back-end • Synopsys, Cadence P/R → ASIC + Mem • 180nm std. cell library, 2V (~1999 technology) • ModelSim (Verilog simulation) → performance numbers
ASH Area [Figure: normalized area, with reference marks for a minimal RISC core and for P4: 217]
Bottleneck: Memory Protocol [Figure: LD and ST nodes, LSQ, Memory] • Token release to dependents requires a round trip to memory. • Limit study: a zero-time round trip ⇒ up to 6x speed-up. • Exploring a protocol for in-order data delivery and fast token release.
Power [Figure: power, with reference marks: Xeon [+cache] 67000, μP 4000, DSP 110]
Energy Efficiency [Figure: energy efficiency in operations/nJ on a log scale, comparing microprocessors, general-purpose DSP, FPGAs, asynchronous μP, ASH media kernels, and dedicated hardware; annotation: 1000x]
Outline • Problems of current architectures • Compiling ASH • Pipelining • ASH Evaluation • Future/related work & conclusions
Related Work • Asynchronous circuits • Nanotechnology • Dataflow machines • Embedded systems • High-level synthesis • Reconfigurable computing • Computer architecture • Compilation
Future Work • Optimizations for area/speed/power • Memory partitioning • Concurrency • Compiler-guided layout • Explore extensible ISAs • Hybridization with superscalar mechanisms • Reconfigurable hardware support for ASH • Formal verification