Compiling Application-Specific Hardware
Mihai Budiu, Seth Copen Goldstein
Carnegie Mellon University
Problems
• Complexity
• Power
• Global signals
• Limited issue window => limited ILP
We propose a scalable architecture
Outline • Introduction • ASH: Application Specific Hardware • Compiling for ASH • Conclusions
Application-Specific Hardware
C program → Compiler → Dataflow IR → dataflow machine on reconfigurable hardware
Our Solution
• General: applicable to today’s software (programming languages, applications)
• Automatic: compiler-driven
• Scalable: at run-time (with clock, hardware) and at compile-time (with program size)
• Parallelism: exploit application parallelism
Asynchronous Computation
[figure: an adder whose inputs arrive with "data valid" signals and whose consumers return "ack" signals]
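The data/valid/ack handshake between operations can be sketched in software; the `channel`, `send`, and `receive` names below are illustrative, not from the talk, and the sketch models only the protocol, not timing.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the asynchronous handshake between two ASH operations:
 * the producer asserts "valid" with its datum; the consumer latches
 * the value and raises "ack", after which the producer may send the
 * next datum. Names are illustrative. */
typedef struct {
    int  data;
    bool valid;  /* producer -> consumer: data is ready  */
    bool ack;    /* consumer -> producer: data was taken */
} channel;

/* Producer: may send only when no datum is still in flight. */
bool send(channel *ch, int value) {
    if (ch->valid && !ch->ack) return false;  /* must stall */
    ch->data  = value;
    ch->valid = true;
    ch->ack   = false;
    return true;
}

/* Consumer: takes the datum and acknowledges it. */
bool receive(channel *ch, int *out) {
    if (!ch->valid) return false;             /* nothing to take */
    *out      = ch->data;
    ch->ack   = true;
    ch->valid = false;
    return true;
}
```

Because acknowledgments flow backward, no global clock is needed: each pair of connected operations synchronizes locally.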
New • Entire C applications • Dynamically scheduled circuits • Custom dataflow machines • - application-specific • - direct execution (no interpretation) • - spatial computation
Outline • Scalability • Application Specific Hardware • CASH: Compiling for ASH • Conclusions
CASH: Compiling for ASH
[figure: CASH translates a C program into circuits, a memory partitioning, and an interconnection net, targeting reconfigurable hardware (RH)]
Primitives
• Arithmetic/logic
• Multiplexors (data + predicates)
• Merge (data)
• Eta, the gateway (data + predicate)
• Memory (ld, st)
Forward Branches
if (x > 0) y = -x; else y = b*x;
[figure: dataflow graph with inputs b, x, 0, operators *, -, >, !, and a decoded mux producing y]
Conditionals => Speculation

Critical Paths
if (x > 0) y = -x; else y = b*x;
[figure: the same graph; the two paths into the mux have unbalanced latencies]

Lenient Operations
if (x > 0) y = -x; else y = b*x;
[figure: the same graph with lenient operators]
Lenient operations solve the problem of unbalanced paths
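In software terms, speculating this forward branch means computing both arms eagerly and letting a decoded mux pick the result once the predicate resolves. The function name below is illustrative.

```c
#include <assert.h>

/* Speculative evaluation of the branch
 *   if (x > 0) y = -x; else y = b*x;
 * Both arms run unconditionally; the predicate only selects. */
int branch_speculated(int x, int b) {
    int t = -x;        /* then-arm, computed speculatively */
    int f = b * x;     /* else-arm, computed speculatively */
    int p = (x > 0);   /* predicate */
    return p ? t : f;  /* decoded mux */
}
```

In hardware the two arms evaluate in parallel, so the mux, not the branch, bounds the latency.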
Loops
int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum;
[figure: dataflow graph with a merge/eta cycle: i incremented by +1 and compared i < 100; i*i accumulated into sum; ! feeding ret]
Control flow => data flow
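The loop above can be rewritten to expose the dataflow cycle that CASH builds: merge nodes admit the initial or looped-back values, and eta nodes release values out of the cycle when the predicate says so. This is a sketch; the node names in the comments are the paper's primitives, but the C structure is only an analogy.

```c
#include <assert.h>

/* Dataflow-style rendering of:
 *   int sum=0, i; for (i=0; i<100; i++) sum += i*i; return sum; */
int sum_squares(void) {
    int i = 0, sum = 0;       /* merge: initial values enter the cycle   */
    for (;;) {
        int p = (i < 100);    /* loop predicate                          */
        if (!p) return sum;   /* eta: sum exits when the predicate fails */
        sum = sum + i * i;    /* sum's loop (through the multiplier)     */
        i = i + 1;            /* i's loop                                */
    }
}
```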
Compilation
• Translate C to dataflow machines
• Optimizations: software-, hardware-, and dataflow-specific
• Expose parallelism
• predication
• speculation
• localized synchronization
• pipelining
Pipelining
sum += i*i with a pipelined multiplier: i = i+1 and the comparison i <= 100 form i's loop; the accumulation forms sum's loop; the multiplier is a long-latency pipe between them
• The predicate ack edge is on the critical path
• A decoupling FIFO on the predicate takes that edge off the critical path, letting i's loop run ahead of sum's loop
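The decoupling FIFO can be sketched as a bounded queue between the two loops: the index generator fills it at its own rate and the accumulator drains it at its own rate. The capacity, names, and scheduling policy below are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define CAP 8  /* illustrative FIFO depth */

typedef struct { int buf[CAP]; int head, tail, count; } fifo;

bool fifo_push(fifo *f, int v) {
    if (f->count == CAP) return false;   /* producer must stall */
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % CAP;
    f->count++;
    return true;
}

bool fifo_pop(fifo *f, int *v) {
    if (f->count == 0) return false;     /* consumer must wait */
    *v = f->buf[f->head];
    f->head = (f->head + 1) % CAP;
    f->count--;
    return true;
}

/* i's loop pushes i*i through the FIFO; sum's loop drains it.
 * Neither loop waits for the other's ack except through the FIFO. */
int pipelined_sum_squares(void) {
    fifo f = {0};
    int i = 0, sum = 0, v;
    while (i < 100 || f.count > 0) {
        if (i < 100 && fifo_push(&f, i * i)) i++;  /* i's loop   */
        if (fifo_pop(&f, &v)) sum += v;            /* sum's loop */
    }
    return sum;
}
```

The result is the same as the sequential loop; only the coupling between the two cycles changes.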
ASH Features
• What you code is what you get
• no hidden control logic
• lean hardware (no CAMs, multi-ported register files, etc.)
• no global signals
• Compiler has complete control
• Dynamic scheduling => latency tolerant
• Natural ILP and loop pipelining
Conclusions
• ASH: compiler-synthesized hardware from HLL
• Exposes program parallelism
• Dataflow techniques applied to hardware
• ASH promises to scale with: circuit speed, transistors, program size
Backup slides • Hyperblocks • Predication • Speculation • Memory access • Procedure calls • Recursive calls • Resources • Performance
Hyperblocks
[figure: a procedure's control-flow graph partitioned into hyperblocks]
Predication
Within a hyperblock, if (p) ... and if (!p) ... become straight-line operations guarded by the predicates p and !p
[figure: a hyperblock with the two predicated paths merging at exit q]
Speculation
Predicates are dropped from side-effect-free operations, which then execute speculatively; only operations with side effects stay predicated
[figure: the predicated hyperblock before and after speculation]
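The predication-then-speculation transformation can be sketched as follows: after if-conversion, the pure computations lose their guards and run unconditionally, while the one side-effecting operation keeps its predicate. The function and variable names are illustrative.

```c
#include <assert.h>

int g_count = 0;  /* global state touched by a side-effecting op */

/* Speculated form of:
 *   if (x > 0) { g_count++; y = a*2; } else { y = b - 3; } */
int speculated(int x, int a, int b) {
    int p = (x > 0);
    int t = a * 2;      /* no side effect: runs speculatively  */
    int f = b - 3;      /* no side effect: runs speculatively  */
    if (p) g_count++;   /* side effect: remains predicated     */
    return p ? t : f;   /* decoded mux selects the live result */
}
```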
Memory Access
[figure: load and store nodes, each carrying address, data, predicate, and token, reach memory through an interconnection network and a load-store queue; tokens enforce ordering between accesses]
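Token-based memory ordering can be sketched in software: each access consumes a token from the access it must follow and produces a token for its successors, so potentially conflicting operations serialize through the token chain. The `token` type and helper names below are illustrative, not the paper's interface.

```c
#include <assert.h>

typedef int token;   /* a pure synchronization value, carries no data */

static int mem[16];  /* toy memory */

/* Predicated store: waits on token t, writes only if pred holds,
 * and emits a token for dependent accesses. */
token store_op(int addr, int val, int pred, token t) {
    (void)t;                  /* consuming t enforces ordering */
    if (pred) mem[addr] = val;
    return 1;
}

/* Load: waits on token t, returns the datum, emits its own token. */
int load_op(int addr, token t, token *t_out) {
    (void)t;
    *t_out = 1;
    return mem[addr];
}
```

In the circuit the "wait" is real: a load's inputs are not valid until the token from the preceding store arrives.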
Procedure calls
[figure: the caller extracts arguments and sends them over the interconnection network to procedure P's call node; P's ret node sends the result back to the caller]
Recursion
A recursive call saves the hyperblock's live values on a stack before the call and restores them when it returns
[figure: a hyperblock with save/restore operations around the recursive call]
Resources
• Estimated for SpecINT95 and Mediabench
• Average < 100 bit-operations/line of code
• Routing resources are harder to estimate
• Detailed data in the paper
Performance
• Preliminary comparison with a 4-wide out-of-order superscalar
• Assumed the same functional-unit latencies
• Speed-up on kernels from Mediabench