Spatial Computation. Mihai Budiu, CMU CS. Thesis committee: Seth Goldstein, Peter Lee, Todd Mowry, Babak Falsafi, Nevin Heintze. Ph.D. thesis defense, December 8, 2003. SCS.
Spatial Computation: A model of general-purpose computation based on Application-Specific Hardware.
Thesis Statement Application-Specific Hardware (ASH): • can be synthesized by adapting software compilation for predicated architectures, • provides high performance for programs with high ILP, with very low power consumption, • is a more scalable and efficient computation substrate than monolithic processors. not!
Outline • Introduction • Compiling for ASH • Media processing on ASH • ASH vs. superscalar processors • Conclusions
CPU Problems • Complexity • Power • Global Signals • Limited ILP
Design Complexity [figure from Michael Flynn’s FCRC 2003 talk]
Communication vs. Computation [figure: wire delay vs. gate delay, labeled 5ps and 20ps] Power consumption on wires is also dominant.
Resource Binding Time [figure: numbered steps 1 and 2, showing when programs are bound to resources, for CPU and for ASH]
Hardware Interface [figure: two layer stacks. CPU: software, ISA, hardware. ASH: software, virtual ISA, gates, hardware.]
Application-Specific Hardware [figure: C program → Compiler → Dataflow IR → Reconfigurable/custom hw (dataflow machine)]
Contributions [figure: fields arranged between "systems" and "theory"] Computer architecture, Embedded systems, Reconfigurable computing, Compilation, Asynchronous circuits, High-level synthesis, Nanotechnology, Dataflow machines
Outline • Introduction • CASH: Compiling for ASH • Media processing on ASH • ASH vs. superscalar processors • Conclusions
Computation = Dataflow Programs: x = a & 7; ... y = x >> 2; Circuits: [figure: a and 7 feed an & node producing x; x and 2 feed a >> node] • Operations ⇒ functional units • Variables ⇒ wires • No interpretation
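The mapping on this slide can be sketched in C as a toy dataflow graph: each operation is a node (a functional unit), each variable is the value carried on a node's output (a wire), and a node fires once both producers have fired. The `Node` layout, the `OpKind` names, and `eval_graph` are illustrative assumptions, not part of CASH.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical node kinds for a toy dataflow graph. */
typedef enum { OP_INPUT, OP_CONST, OP_AND, OP_SHR } OpKind;

typedef struct {
    OpKind kind;
    int in0, in1;  /* indices of producer nodes; -1 when unused */
    int value;     /* value on the output wire, valid once fired */
    bool fired;
} Node;

enum { N_A, N_7, N_X, N_2, N_Y, N_NODES };

/* Fire every node whose inputs are ready until y is produced. */
int eval_graph(int a)
{
    Node g[N_NODES] = {
        [N_A] = { OP_INPUT, -1, -1, a, true },
        [N_7] = { OP_CONST, -1, -1, 7, true },
        [N_X] = { OP_AND, N_A, N_7, 0, false },  /* x = a & 7  */
        [N_2] = { OP_CONST, -1, -1, 2, true },
        [N_Y] = { OP_SHR, N_X, N_2, 0, false },  /* y = x >> 2 */
    };

    while (!g[N_Y].fired) {
        bool progress = false;
        for (int i = 0; i < N_NODES; i++) {
            Node *n = &g[i];
            if (n->fired)
                continue;  /* inputs and constants start out fired */
            if (!g[n->in0].fired || !g[n->in1].fired)
                continue;  /* an input wire has no value yet */
            int l = g[n->in0].value, r = g[n->in1].value;
            n->value = (n->kind == OP_AND) ? (l & r) : (l >> r);
            n->fired = true;
            progress = true;
        }
        assert(progress);  /* the graph is acyclic, so some node must fire */
    }
    return g[N_Y].value;
}
```

With a = 13, the & node produces 5 and the >> node produces 1. There is no instruction fetch or decode in this model, which is the sense of "no interpretation".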
Basic Operation [figure: a + unit followed by a latch, with data, valid, and ack signals]
Asynchronous Computation [figure: animation in steps 1 to 8 of + units and latches exchanging data, valid, and ack signals]
Distributed Control Logic [figure: + and - units, each with its own FSM, connected by rdy and ack signals] • short, local wires • asynchronous control
Forward Branches if (x > 0) y = -x; else y = b*x; [figure: b, x, and 0 feed the *, -, and > nodes; the predicate and its negation (!) select y; the critical path is marked] Conditionals ⇒ Speculation
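A minimal C sketch of "Conditionals ⇒ Speculation": both arms of the branch are computed unconditionally, and the predicate only selects which result becomes y. The function names are hypothetical; the sketch assumes both arms are side-effect free, and that x is not INT_MIN so -x does not overflow.

```c
/* Branching form, as written on the slide. */
int branchy(int b, int x)
{
    int y;
    if (x > 0)
        y = -x;
    else
        y = b * x;
    return y;
}

/* Speculated form: every operation runs; the predicate drives a mux. */
int speculated(int b, int x)
{
    int p   = x > 0;   /* predicate (the > node)               */
    int neg = -x;      /* "then" arm, computed unconditionally */
    int mul = b * x;   /* "else" arm, computed unconditionally */
    return p ? neg : mul;
}
```

Both functions return the same value for every input in range. In hardware the three operations can run in parallel, so the latency is set by the slowest input to the selection rather than by the branch.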
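Control Flow ⇒ Data Flow [figure: three node types. Merge (label): data inputs, one data output. Split (branch): data and predicate p in, data out. Gateway: data and predicate in, data out.]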
Loops int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [figure: dataflow graph of the loop; i and sum each start at 0; nodes +1, < 100, *, +, the negated predicate (!), and ret]
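The loop can be rewritten in C so that each statement corresponds to one node of such a graph. This is only a sketch: the merge is modeled with a `first` flag, the split with an `if` on the loop predicate, and the name `loop_dataflow` and the parameter `n` (100 on the slide) are assumptions.

```c
/* Sum of i*i for i in [0, n), written one "node" per statement. */
int loop_dataflow(int n)
{
    int i_back = 0, sum_back = 0;  /* values carried on the back edges */
    int first = 1;                 /* true only for the initial tokens */

    for (;;) {
        /* merge: take the initial value once, then the back-edge value */
        int i   = first ? 0 : i_back;
        int sum = first ? 0 : sum_back;
        first = 0;

        int p = i < n;             /* loop predicate (the < node) */

        /* split on !p: sum leaves the loop and reaches ret */
        if (!p)
            return sum;

        /* split on p: values flow around the back edges */
        sum_back = sum + i * i;
        i_back   = i + 1;
    }
}
```

For n = 100 this returns 328350, the same value as the original for loop.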
Predication and Side-Effects [figure: a Load node, connected to memory, with inputs addr, pred, and token and outputs data and token] • token: sequencing of side-effects • pred: no speculation
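One way to read this slide is sketched below in C. The load touches memory only when its predicate is true, so side-effects are never speculated, and it always emits a token so the next memory operation can be ordered after it. Modeling the token as a counter, and the names `LoadResult`, `pred_load`, and `mem_accesses`, are assumptions made for illustration.

```c
typedef struct {
    int data;        /* loaded value; meaningless when pred was false */
    unsigned token;  /* output token that releases the next memory op */
} LoadResult;

/*
 * Predicated load: memory is accessed only if pred is true.
 * token_in stands for "all earlier side-effects have happened".
 * mem_accesses is instrumentation counting real memory accesses.
 */
LoadResult pred_load(const int *addr, int pred, unsigned token_in,
                     unsigned *mem_accesses)
{
    LoadResult r = { 0, token_in + 1 };  /* a token is produced either way */
    if (pred) {
        r.data = *addr;                  /* the only access to memory */
        (*mem_accesses)++;
    }
    return r;
}
```

Because addr is never dereferenced when pred is false, an invalid address on the untaken path is harmless, which is what "no speculation" buys for operations with side-effects.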
Thesis Statement Application-Specific Hardware: • can be synthesized by adapting software compilation for predicated architectures, • provides high performance for programs with high ILP, with very low power consumption, • is a more scalable and efficient computation substrate than monolithic processors. not!
Outline • Introduction • CASH: Compiling for ASH • An optimization on the SIDE • Media processing on ASH • ASH vs. superscalar processors • Conclusions
Availability Dataflow Analysis y = a*b; ... if (x) { ... ... = a*b; } [annotation: the second a*b is marked with y]
Dataflow Analysis Is Conservative if (x) { ... y = a*b; } ... ... = a*b; [annotation: the last a*b is marked with "y?"]
Static Instantiation, Dynamic Evaluation flag = false; if (x) { ... y = a*b; flag = true; } ... ... = flag ? y : a*b;
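The slide's fragment can be made runnable as follows. The flag records at run time whether a*b has already been computed into y, so the second multiply executes only on the path where the first did not. The `mul` wrapper and the `mul_count` counter are instrumentation added here, not part of the original code.

```c
int mul_count = 0;  /* instrumentation: number of multiplies executed */

int mul(int a, int b)
{
    mul_count++;
    return a * b;
}

/* Returns the value of the final "... = a*b" use. */
int side_example(int x, int a, int b)
{
    int y = 0;
    int flag = 0;  /* run-time bit: "a*b is available in y" */

    if (x) {
        y = mul(a, b);
        flag = 1;
    }

    /* Reuse y when available; otherwise compute a*b here. */
    return flag ? y : mul(a, b);
}
```

Each call performs exactly one multiply whether or not x is true, where the untransformed code would perform two when x is true. The check is instantiated statically by the compiler and evaluated dynamically.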
SIDE Register Promotion Impact [chart: % reduction in loads and stores]
Outline • Introduction • CASH: Compiling for ASH • Media processing on ASH • ASH vs. superscalar processors • Conclusions
Performance Evaluation [figure: ASH connects through an LSQ with limited bandwidth to an 8K L1, a 1/4M L2, and memory] CPU: 4-way OOO. Assumption: all operations have the same latency.
Low-Level Evaluation [figure: tool flow C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC] CASH core: results shown so far; all results in thesis. ASIC: results in the next two slides. 180nm std. cell library, 2V, ~1999 technology.
Area [chart] Reference: P4 in 180nm has 217 mm²
Power [chart] vs. 4-way OOO superscalar, 600 MHz, with clock gating (Wattch), ~6 W
Thesis Statement Application-Specific Hardware: • can be synthesized by adapting software compilation for predicated architectures, • provides high performance for programs with high ILP, with very low power consumption, • is a more scalable and efficient computation substrate than monolithic processors. not!
Outline • Introduction • CASH: Compiling for ASH • Media processing on ASH • dataflow pipelining • ASH vs. superscalar processors • Conclusions
Pipelining int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum; [figure: dataflow graph of the loop (i, + 1, <= 100, *, sum +) with a pipelined multiplier (8 stages); animation frames for cycle=1 through cycle=4]
Pipelining [figure: same graph at cycle=5, with iterations i=0 and i=1 both in flight] pipeline balancing
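A cycle-level sketch in C of the effect these frames animate: the i loop issues a new multiply every cycle into an 8-stage multiplier, and sum consumes products as they emerge. The timing model (one issue per cycle, one cycle per stage, one cycle to accumulate) and the names `Run` and `run_pipelined` are assumptions, not measurements from the thesis.

```c
#define STAGES 8  /* multiplier depth, from the slide */

typedef struct {
    long sum;    /* final value of sum            */
    int cycles;  /* cycles until the last retire  */
} Run;

/* Simulate sum += i*i for i in [0, n) through a pipelined multiplier. */
Run run_pipelined(int n)
{
    long stage[STAGES] = { 0 };  /* product travelling through each stage */
    int valid[STAGES] = { 0 };   /* whether a stage holds a live product  */
    Run r = { 0, 0 };
    int i = 0, retired = 0;

    while (retired < n) {
        r.cycles++;

        /* The last stage delivers its product to the sum adder. */
        if (valid[STAGES - 1]) {
            r.sum += stage[STAGES - 1];
            retired++;
        }

        /* Every product in flight advances by one stage. */
        for (int s = STAGES - 1; s > 0; s--) {
            stage[s] = stage[s - 1];
            valid[s] = valid[s - 1];
        }

        /* The i loop runs ahead and issues the next multiply. */
        if (i < n) {
            stage[0] = (long)i * i;  /* the model computes the product at issue */
            valid[0] = 1;
            i++;
        } else {
            valid[0] = 0;
        }
    }
    return r;
}
```

In this model 100 iterations finish in 108 cycles with sum equal to 328350. A design that waited for each 8-stage multiply before starting the next would need roughly 8 cycles per iteration.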
Outline • Introduction • CASH: Compiling for ASH • Media processing on ASH • ASH vs. superscalar processors • Conclusions
This Is Obvious! ASH runs at full dataflow speed, so a CPU cannot do any better (if the compilers are equally good). wrong!
Branch Prediction for (i=0; i < N; i++) { ... if (exception) break; } [figure: dataflow graph of the loop (i, + 1, <, exception, !, &) with the ASH critical path and the CPU critical path marked] Annotations: • Predicted not taken: effectively a noop for the CPU! • Predicted taken • Result available before inputs
ASH Problems • Both branch and join not free • Static dataflow (no re-issue of same instr) • Memory is “far” • Fully static: no branch prediction, no dynamic unrolling, no register renaming • Calls/returns not lenient • ...
Thesis Statement Application-Specific Hardware: • can be synthesized by adapting software compilation for predicated architectures, • provides high performance for programs with high ILP, with very low power consumption, • is a more scalable and efficient computation substrate than monolithic processors. not!
Outline • Introduction • CASH: Compiling for ASH • Media processing on ASH • ASH vs. superscalar processors • Conclusions