Compiling Application-Specific Hardware
Mihai Budiu, Seth Copen Goldstein
Carnegie Mellon University
Problems
• Complexity
• Power
• Global signals
• Limited issue window => limited ILP
We propose a scalable architecture
Outline • Introduction • ASH: Application Specific Hardware • Compiling for ASH • Conclusions
Application-Specific Hardware
C program → Compiler → Dataflow IR → dataflow machine on reconfigurable hardware
Our Solution
• General: applicable to today’s software (programming languages, applications)
• Automatic: compiler-driven
• Scalable: at run-time (with clock, hardware) and at compile-time (with program size)
• Parallelism: exploit application parallelism
Asynchronous Computation
[figure: an adder whose inputs arrive with "data valid" signals and whose consumers return "ack" signals]
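The data/valid/ack handshake between operations can be sketched in software; the `channel`, `send`, and `receive` names below are illustrative, not from the talk, and the sketch models only the protocol, not timing.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the asynchronous handshake between two ASH operations:
 * the producer asserts "valid" with its datum; the consumer latches
 * the value and raises "ack", after which the producer may send the
 * next datum. Names are illustrative. */
typedef struct {
    int  data;
    bool valid;  /* producer -> consumer: data is ready  */
    bool ack;    /* consumer -> producer: data was taken */
} channel;

/* Producer: may send only when no datum is still in flight. */
bool send(channel *ch, int value) {
    if (ch->valid && !ch->ack) return false;  /* must stall */
    ch->data  = value;
    ch->valid = true;
    ch->ack   = false;
    return true;
}

/* Consumer: takes the datum and acknowledges it. */
bool receive(channel *ch, int *out) {
    if (!ch->valid) return false;             /* nothing to take */
    *out      = ch->data;
    ch->ack   = true;
    ch->valid = false;
    return true;
}
```

Because acknowledgments flow backward, no global clock is needed: each pair of connected operations synchronizes locally.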
New • Entire C applications • Dynamically scheduled circuits • Custom dataflow machines • - application-specific • - direct execution (no interpretation) • - spatial computation
Outline • Scalability • Application Specific Hardware • CASH: Compiling for ASH • Conclusions
CASH: Compiling for ASH
[figure: CASH translates a C program into circuits, a memory partitioning, and an interconnection net, targeting reconfigurable hardware (RH)]
Primitives
• Arithmetic/logic
• Multiplexors (data + predicates)
• Merge (data)
• Eta, the gateway (data + predicate)
• Memory (ld, st)
Forward Branches
if (x > 0) y = -x; else y = b*x;
[figure: dataflow graph with inputs b, x, 0, operators *, -, >, !, and a decoded mux producing y]
Conditionals => Speculation

Critical Paths
if (x > 0) y = -x; else y = b*x;
[figure: the same graph; the two paths into the mux have unbalanced latencies]

Lenient Operations
if (x > 0) y = -x; else y = b*x;
[figure: the same graph with lenient operators]
Lenient operations solve the problem of unbalanced paths
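In software terms, speculating this forward branch means computing both arms eagerly and letting a decoded mux pick the result once the predicate resolves. The function name below is illustrative.

```c
#include <assert.h>

/* Speculative evaluation of the branch
 *   if (x > 0) y = -x; else y = b*x;
 * Both arms run unconditionally; the predicate only selects. */
int branch_speculated(int x, int b) {
    int t = -x;        /* then-arm, computed speculatively */
    int f = b * x;     /* else-arm, computed speculatively */
    int p = (x > 0);   /* predicate */
    return p ? t : f;  /* decoded mux */
}
```

In hardware the two arms evaluate in parallel, so the mux, not the branch, bounds the latency.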
Loops
int sum=0, i; for (i=0; i < 100; i++) sum += i*i; return sum;
[figure: dataflow graph with a merge/eta cycle: i incremented by +1 and compared i < 100; i*i accumulated into sum; ! feeding ret]
Control flow => data flow
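The loop above can be rewritten to expose the dataflow cycle that CASH builds: merge nodes admit the initial or looped-back values, and eta nodes release values out of the cycle when the predicate says so. This is a sketch; the node names in the comments are the paper's primitives, but the C structure is only an analogy.

```c
#include <assert.h>

/* Dataflow-style rendering of:
 *   int sum=0, i; for (i=0; i<100; i++) sum += i*i; return sum; */
int sum_squares(void) {
    int i = 0, sum = 0;       /* merge: initial values enter the cycle   */
    for (;;) {
        int p = (i < 100);    /* loop predicate                          */
        if (!p) return sum;   /* eta: sum exits when the predicate fails */
        sum = sum + i * i;    /* sum's loop (through the multiplier)     */
        i = i + 1;            /* i's loop                                */
    }
}
```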
Compilation
• Translate C to dataflow machines
• Optimizations: software-, hardware-, and dataflow-specific
• Expose parallelism
• predication
• speculation
• localized synchronization
• pipelining
Pipelining
sum += i*i with a pipelined multiplier: i = i+1 and the comparison i <= 100 form i's loop; the accumulation forms sum's loop; the multiplier is a long-latency pipe between them
• The predicate ack edge is on the critical path
• A decoupling FIFO on the predicate takes that edge off the critical path, letting i's loop run ahead of sum's loop
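The decoupling FIFO can be sketched as a bounded queue between the two loops: the index generator fills it at its own rate and the accumulator drains it at its own rate. The capacity, names, and scheduling policy below are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define CAP 8  /* illustrative FIFO depth */

typedef struct { int buf[CAP]; int head, tail, count; } fifo;

bool fifo_push(fifo *f, int v) {
    if (f->count == CAP) return false;   /* producer must stall */
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % CAP;
    f->count++;
    return true;
}

bool fifo_pop(fifo *f, int *v) {
    if (f->count == 0) return false;     /* consumer must wait */
    *v = f->buf[f->head];
    f->head = (f->head + 1) % CAP;
    f->count--;
    return true;
}

/* i's loop pushes i*i through the FIFO; sum's loop drains it.
 * Neither loop waits for the other's ack except through the FIFO. */
int pipelined_sum_squares(void) {
    fifo f = {0};
    int i = 0, sum = 0, v;
    while (i < 100 || f.count > 0) {
        if (i < 100 && fifo_push(&f, i * i)) i++;  /* i's loop   */
        if (fifo_pop(&f, &v)) sum += v;            /* sum's loop */
    }
    return sum;
}
```

The result is the same as the sequential loop; only the coupling between the two cycles changes.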
ASH Features
• What you code is what you get
• no hidden control logic
• lean hardware (no CAMs, multi-ported register files, etc.)
• no global signals
• Compiler has complete control
• Dynamic scheduling => latency tolerant
• Natural ILP and loop pipelining
Conclusions
• ASH: compiler-synthesized hardware from HLL
• Exposes program parallelism
• Dataflow techniques applied to hardware
• ASH promises to scale with: circuit speed, transistors, program size
Backup slides • Hyperblocks • Predication • Speculation • Memory access • Procedure calls • Recursive calls • Resources • Performance
Hyperblocks
[figure: a procedure's control-flow graph partitioned into hyperblocks]
Predication
Within a hyperblock, if (p) ... and if (!p) ... become straight-line operations guarded by the predicates p and !p
[figure: a hyperblock with the two predicated paths merging at exit q]
Speculation
Predicates are dropped from side-effect-free operations, which then execute speculatively; only operations with side effects stay predicated
[figure: the predicated hyperblock before and after speculation]
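The predication-then-speculation transformation can be sketched as follows: after if-conversion, the pure computations lose their guards and run unconditionally, while the one side-effecting operation keeps its predicate. The function and variable names are illustrative.

```c
#include <assert.h>

int g_count = 0;  /* global state touched by a side-effecting op */

/* Speculated form of:
 *   if (x > 0) { g_count++; y = a*2; } else { y = b - 3; } */
int speculated(int x, int a, int b) {
    int p = (x > 0);
    int t = a * 2;      /* no side effect: runs speculatively  */
    int f = b - 3;      /* no side effect: runs speculatively  */
    if (p) g_count++;   /* side effect: remains predicated     */
    return p ? t : f;   /* decoded mux selects the live result */
}
```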
Memory Access
[figure: load and store nodes, each carrying address, data, predicate, and token, reach memory through an interconnection network and a load-store queue; tokens enforce ordering between accesses]
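Token-based memory ordering can be sketched in software: each access consumes a token from the access it must follow and produces a token for its successors, so potentially conflicting operations serialize through the token chain. The `token` type and helper names below are illustrative, not the paper's interface.

```c
#include <assert.h>

typedef int token;   /* a pure synchronization value, carries no data */

static int mem[16];  /* toy memory */

/* Predicated store: waits on token t, writes only if pred holds,
 * and emits a token for dependent accesses. */
token store_op(int addr, int val, int pred, token t) {
    (void)t;                  /* consuming t enforces ordering */
    if (pred) mem[addr] = val;
    return 1;
}

/* Load: waits on token t, returns the datum, emits its own token. */
int load_op(int addr, token t, token *t_out) {
    (void)t;
    *t_out = 1;
    return mem[addr];
}
```

In the circuit the "wait" is real: a load's inputs are not valid until the token from the preceding store arrives.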
Procedure calls
[figure: the caller extracts arguments and sends them over the interconnection network to procedure P's call node; P's ret node sends the result back to the caller]
Recursion
A recursive call saves the hyperblock's live values on a stack before the call and restores them when it returns
[figure: a hyperblock with save/restore operations around the recursive call]
Resources
• Estimated for SpecINT95 and Mediabench
• Average < 100 bit-operations/line of code
• Routing resources are harder to estimate
• Detailed data in the paper
Performance
• Preliminary comparison with a 4-wide out-of-order superscalar
• Assumed the same functional-unit latencies
• Speed-up on kernels from Mediabench