Liquid Metal's OPTIMUS: Synthesis of Efficient Streaming Hardware
Scott Mahlke (University of Michigan) and Rodric Rabbah (IBM)
DAC HLS Tutorial, San Francisco 2009
Dynamic Application-Specific Customization of HW
• JIT compiler configures the logic for your application
• Inspired by the ASIC paradigm:
  • High performance
  • Low power
Liquid Metal: "JIT the Hardware"
• Single language for programming HW & SW
• Run in a standard JVM, or synthesize in HW
• Fluidly move computation between HW & SW
• Do for HW (viz. FPGAs) what FORTRAN did for computing
• Address critical technology trends: ASIC-like efficiency on reconfigurable hardware
Lime: the Liquid Metal Language • Design Principles: • Object-oriented, Java-like, Java-compatible • Raise level of abstraction • Parallel constructs that simplify code • Target synthesis while retaining generality
Four reasons this is not just another *C-to-HDL approach
• Emphasis on programmer productivity
  • Leverage rich Java IDEs, libraries, and analysis
• Not an auto-parallelization approach
  • Lime is explicitly parallel and synthesizable
• Fast fail-safe mechanism
  • Lime may be refined into a parallel SW implementation
• Intrinsic opportunity for online optimizations
  • Static optimizations with dynamic refinement
Lime Overview: what HW (FPGA) needs → what Lime provides
• Computation is well encapsulated → tasks, value types
• Data-flow driven computation → streaming primitives
• Multiple "clock domains" → rate "matching" operators
• Memory usage statically determined before layout → ordinal-indexed arrays, bounded loops
• Bit-level control and reasoning → template-like generics
Abstract OO programming down to the bit level!
Streams: Exposing Computational Structure • Stream primitives are integral to the language • Tasks in streams are strongly isolated • Only the endpoints may perform side-effects • Provide macro-level functional programming abstraction… • … While allowing traditional imperative programming inside
A Brief Introduction to Stream Operations
• A finite stream literal: int stream s1 = { 1, 1, 2, 3, 5, 8 };
• An infinite stream of 3's: int stream s2 = task 3;
• Stream expressions:
  int stream s3 = s2 * 17;
  double stream s4 = Math.sin(s1);
  double stream s5 = s3 + s4;
• These operations create and connect tasks; execution occurs later (lazy, functional)
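Lime's stream expressions build a task graph first and run it later. As a rough analogue (not Lime itself), Java's java.util.stream pipelines are also constructed lazily and only execute at a terminal operation:

```java
import java.util.stream.IntStream;

// Rough Java analogue of lazy Lime stream expressions: the pipeline below
// is only a description of work until the terminal sum() forces evaluation.
class LazyStreams {
    public static void main(String[] args) {
        // Finite "stream literal" (cf. int stream s1 = { 1, 1, 2, 3, 5, 8 };)
        IntStream s1 = IntStream.of(1, 1, 2, 3, 5, 8);

        // Element-wise expression (cf. int stream s3 = s2 * 17;):
        // nothing executes until sum() is called.
        int total = s1.map(x -> x * 17).sum();

        System.out.println(total);  // prints 340 (17 * (1+1+2+3+5+8))
    }
}
```

The analogy is loose: Lime streams describe persistent task graphs with channels, whereas java.util.stream pipelines are one-shot, but both separate graph construction from execution.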
Simple Audio Processing

value int[] squareWave(int freq, int rate, int amplitude) {
  int wavelength = rate / freq;
  int[] samples = new int[wavelength];
  for (int s : 1::wavelength)
    samples[s] = (s <= wavelength/2) ? 0 : amplitude;
  return (value int[]) samples;
}

int stream sqwaves = task squareWave(1000, 44100, 80);
task AudioSink(44100).play(sqwaves);
Liquid Metal Tool Chain
• Lime source → Quicklime front-end compiler → Streaming IR
• Streaming IR → Optimus back-end compiler (guided by an FPGA model) → HDL → Xilinx VHDL compiler → Xilinx bitfile → Virtex-5 FPGA, managed by the LM VM
• Streaming IR → Crucible back-end compiler → C → Cell SDK → Cell binary → Cell BE, managed by the LM VM
Streaming Intermediate Representation (SIR)
• Graph constructs: Task, Pipeline, SplitJoin (splitter/joiner), Switch, Feedback Loop (joiner with a back edge)
• A task may be stateless or have state
• A task is mapped to a "module" with FIFO I/O
• Task graphs are hierarchical & structured
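A minimal sketch of how the hierarchical SIR constructs above might be modeled. The class names (StreamNode, Task, Pipeline, SplitJoin) are illustrative, not the actual Optimus IR:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of a hierarchical, structured stream graph:
// composite nodes (Pipeline, SplitJoin) contain other nodes; Task is a leaf.
abstract class StreamNode {
    abstract int leafCount();  // number of primitive tasks beneath this node
}

class Task extends StreamNode {
    final String name;
    final boolean stateful;   // tasks may be stateless or have state
    Task(String name, boolean stateful) { this.name = name; this.stateful = stateful; }
    int leafCount() { return 1; }
}

class Pipeline extends StreamNode {
    final List<StreamNode> stages = new ArrayList<>();
    Pipeline add(StreamNode n) { stages.add(n); return this; }
    int leafCount() {
        int c = 0;
        for (StreamNode n : stages) c += n.leafCount();
        return c;
    }
}

class SplitJoin extends StreamNode {
    final List<StreamNode> branches = new ArrayList<>();
    SplitJoin add(StreamNode n) { branches.add(n); return this; }
    int leafCount() {
        int c = 0;
        for (StreamNode n : branches) c += n.leafCount();
        return c;
    }
}
```

Because composites nest arbitrarily, a Pipeline can contain a SplitJoin whose branches are themselves pipelines, which is exactly the structured hierarchy the SIR relies on for stream-aware optimization.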
SIR Compiler Optimizations • Address FPGA compilation challenges • Finite, non-virtualizable device • Complex optimization space • Throughput, latency, power, area • Very long synthesis times (minutes-hours)
Preliminary Liquid Metal Results on Energy Consumption: FPGA vs PPC 405 • Liquid Metal on Virtex 4 FPGA, 1.6W • C reference implementation on PPC 405, 0.5W
Preliminary Liquid Metal Results on Parallelism: FPGA vs PPC 405 • Liquid Metal on Virtex 4 FPGA, 1.6W • C reference implementation on PPC 405, 0.5W
Handel-C Comparison • Compared DES and DCT with hand-optimized Handel-C implementation • Performance • 5% faster before optimizations • 12x faster after optimizations • Area • 66% larger before optimizations • 90% larger after optimizations
Overview • Compilation Flow • Scheduling • Optimizations • Results
Top-Level Compilation
• [Figure: the stream graph (Source → Round-Robin Splitter(8,8,8,8) → parallel Filters → Round-Robin Joiner(1,1,1,1) → Sink) is compiled so that each task becomes a module with Init and Work logic plus its own Controller, alongside memory modules M0…Mn and output ports O0…Om]
Filter Compilation
• Each basic block becomes a module; control passes between blocks as a control token with an acknowledgment back, live data in/out is carried in registers, and channel traffic goes through FIFO read/write and memory/queue ports
• Example filter body:
  bb1: sum = 0; i = 0
  bb2: temp = pop()                        (FIFO read)
  bb3: sum = sum + temp; i = i + 1; branch bb2 if i < 8
  bb4: push(sum)                           (FIFO write)
Operation Compilation
• [Figure: each operation in a basic block maps to a functional unit: the statements "sum = sum + temp; i = i + 1; branch bb2 if i < 8" become two ADD units and a CMP against the constant 8 that produces the branch predicate, with muxed inputs, registered outputs, and control in/out signals]
Static Stream Scheduling
• Each queue must be deep enough to hold the values generated by a single execution of the connected filter
• Double buffering is needed
• Buffer access is non-blocking
• A controller module orchestrates the schedule
• The controller uses a finite state machine to execute the steady-state schedule
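The steady-state firing counts that the controller's FSM executes can be derived from the filters' push and pop rates. A sketch under that assumption, with illustrative rates and hypothetical method names (not Optimus code):

```java
// Sketch: derive steady-state firing multiplicities for a producer/consumer
// pair from their push and pop rates, as a static scheduler like the one
// described above might. Illustrative only.
class SteadyState {
    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    // Returns {producerFirings, consumerFirings} per steady-state iteration:
    // the smallest counts such that items pushed equals items popped.
    static int[] multiplicities(int pushRate, int popRate) {
        int items = pushRate / gcd(pushRate, popRate) * popRate;  // lcm
        return new int[] { items / pushRate, items / popRate };
    }

    // Minimum single-buffer depth: items produced in one steady-state
    // iteration (doubled in practice, since double buffering is needed).
    static int queueDepth(int pushRate, int popRate) {
        return multiplicities(pushRate, popRate)[0] * pushRate;
    }
}
```

For example, a filter pushing 2 items per firing feeding one popping 3 per firing yields a steady state of 3 producer firings against 2 consumer firings, moving 6 items per iteration.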
Greedy Stream Scheduling
• Filters fire eagerly
• Channel access is blocking
• Allows for potentially smaller channels
• No controller is needed
• Results are produced with lower latency
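A runnable sketch of the greedy discipline for a hypothetical two-filter pipeline: each filter is an eager thread that simply blocks on channel access, so no controller sequences the firings (the ArrayBlockingQueue stands in for a hardware FIFO):

```java
import java.util.concurrent.ArrayBlockingQueue;

// Greedy scheduling sketch: filters run eagerly and synchronize only through
// blocking channel access. Channel capacity (2) is deliberately smaller than
// the 8 items transferred, so the producer stalls when the channel is full.
class Greedy {
    static int run() {
        ArrayBlockingQueue<Integer> chan = new ArrayBlockingQueue<>(2);
        int[] sum = new int[1];
        Thread filter1 = new Thread(() -> {           // producer fires eagerly
            try { for (int i = 1; i <= 8; i++) chan.put(i); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread filter2 = new Thread(() -> {           // consumer blocks when empty
            try {
                int s = 0;
                for (int i = 0; i < 8; i++) s += chan.take();
                sum[0] = s;
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        filter1.start(); filter2.start();
        try { filter1.join(); filter2.join(); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return sum[0];
    }
}
```

The contrast with the static scheme is the point: correctness comes from the blocking channel itself, at the cost of stalls, rather than from a precomputed FSM schedule.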
Optimizations
• Streaming optimizations (macro-functional)
  • Channel allocation, channel access fusion, critical path balancing, filter fission and fusion, etc.
  • These optimizations need global information about the stream graph
  • Typically performed manually when using existing tools
• Classic optimizations (micro-functional)
  • Flip-flop elimination, common subexpression elimination, constant folding, loop unrolling, etc.
  • Typically included in existing compilers and tools
Channel Allocation
• Larger channels mean more SRAM and more control logic, but fewer stalls
• Interlocking ensures that each filter gets the right data or blocks
• What is the right channel size?
Channel Allocation Algorithm • Set the size of the channels to infinity. • Warm-up the queues. • Record the steady state instruction schedules for each pair. • Unroll the schedules to have the same number of pushes and pops. • Find the maximum number of overlapping lifetimes.
Channel Allocation Example
• [Figure: interleaved producer/consumer schedules for a Source → Filter 1 → Filter 2 → Sink pipeline; value lifetimes on the channel overlap at most 3 at a time, so max overlap = 3 and the channel needs 3 slots]
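The "maximum number of overlapping lifetimes" step can be sketched as a single pass over the unrolled push/pop schedule of one channel. The event encoding here ('w' for push, 'r' for pop) is a hypothetical stand-in for the recorded instruction schedules:

```java
// Sketch of the channel-sizing step above: scan the unrolled steady-state
// schedule of pushes (+1 live value) and pops (-1 live value) on a channel
// and track the peak number of simultaneously live values.
class ChannelSizer {
    // events: 'w' = producer push, 'r' = consumer pop, in schedule order
    static int maxOverlap(String events) {
        int live = 0, peak = 0;
        for (char e : events.toCharArray()) {
            if (e == 'w') live++;
            else if (e == 'r') live--;
            if (live > peak) peak = live;
        }
        return peak;
    }
}
```

For the schedule "wwwrrwr" three values are live at once before the first pop, matching the "max overlap = 3" result in the example, so the channel would be allocated 3 slots.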
Channel Access Fusion
• Each channel access (push or pop) takes one cycle, which:
  • Raises the communication-to-computation ratio
  • Lengthens the critical path latency
  • Limits task-level parallelism
Channel Access Fusion Algorithm
• Cluster channel access operations using loop unrolling and code motion
• Balance the groups (similar to vectorization)
• Use wide channels
• [Figure: write/read multiplicity pairs 1/8, 8/8, and 4/1 illustrating how unrolling and wide channels rebalance access groups]
Access Fusion Example
Before fusion:
  int sum = 0;
  for (int i = 0; i < 32; i++)
    sum += pop();
  push(sum);
After unrolling by 4:
  int sum = 0;
  for (int i = 0; i < 8; i++) {
    sum += pop();
    sum += pop();
    sum += pop();
    sum += pop();
  }
  push(sum);
After fusion into wide accesses:
  int sum = 0;
  int t1, t2, t3, t4;
  for (int i = 0; i < 8; i++) {
    (t1, t2, t3, t4) = pop4();
    sum += t1 + t2 + t3 + t4;
  }
  push(sum);
• Some caveats: accesses that do not fit the fused width (e.g. a trailing pop(); pop();) remain scalar
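A runnable Java analogue of the transformation, assuming an ArrayDeque as a stand-in for the Lime channel and a pop4() method as the wide access; both versions consume the same 32 items and produce the same sum:

```java
import java.util.ArrayDeque;

// Runnable analogue of the access-fusion example. The ArrayDeque stands in
// for a Lime channel; pop4() models one wide (4-element) channel read
// replacing four scalar one-cycle pops.
class AccessFusion {
    final ArrayDeque<Integer> channel = new ArrayDeque<>();

    void fill() { for (int i = 1; i <= 32; i++) channel.addLast(i); }

    int pop() { return channel.removeFirst(); }

    int[] pop4() {  // one wide access instead of four scalar ones
        return new int[] { pop(), pop(), pop(), pop() };
    }

    int sumScalar() {  // before fusion: 32 scalar channel accesses
        int sum = 0;
        for (int i = 0; i < 32; i++) sum += pop();
        return sum;
    }

    int sumFused() {   // after fusion: 8 wide channel accesses
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            int[] t = pop4();
            sum += t[0] + t[1] + t[2] + t[3];
        }
        return sum;
    }
}
```

In software the two loops are equivalent; in hardware the fused version cuts the number of one-cycle channel accesses from 32 to 8, which is the point of the optimization.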
Critical Path Balancing
• The critical path is set by the longest combinational path in the filters
• Optimus uses its internal FPGA model to estimate how this impacts throughput and latency
• Balancing algorithm:
  • Optimus takes the target clock as input
  • Start with the smallest number of basic blocks
  • Form USE/DEF chains for the filter
  • Use the internal FPGA model to measure critical path latency
  • Break the paths whose latency exceeds the target
Critical Path Balancing Example
• [Figure: a combinational dataflow of Add, Mul, Sub, and Shift operations is cut into pipeline stages 1–4, breaking the paths whose estimated latency exceeds the target clock]
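The "break paths whose latency exceeds the target" step can be sketched as a greedy pass over a chain of combinational operation latencies. The latency values below are illustrative, not numbers from the Optimus FPGA model:

```java
import java.util.ArrayList;
import java.util.List;

// Greedy sketch of the balancing step above: walk a chain of combinational
// operation latencies (ns) and start a new pipeline stage (i.e. insert a
// register) whenever the accumulated delay would exceed the target clock
// period. Illustrative only; Optimus estimates latencies via its FPGA model.
class PathBalancer {
    // Returns the number of operations placed in each pipeline stage.
    static List<Integer> stageSizes(double[] opLatency, double targetPeriod) {
        List<Integer> stages = new ArrayList<>();
        double acc = 0;
        int opsInStage = 0;
        for (double lat : opLatency) {
            if (acc + lat > targetPeriod && opsInStage > 0) {
                stages.add(opsInStage);   // cut: register before this op
                acc = 0;
                opsInStage = 0;
            }
            acc += lat;
            opsInStage++;
        }
        if (opsInStage > 0) stages.add(opsInStage);
        return stages;
    }
}
```

The real pass works over USE/DEF chains in a dataflow graph rather than a single chain, but the cut criterion (accumulated latency versus target clock period) is the same.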
Interdisciplinary effort addressing the entire stack
• One language (Lime) for programming HW (FPGAs) and SW
• Liquid Metal VM: JIT the hardware!
• [Figure: a Liquid Metal program, written entirely in Lime, runs through the Liquid Metal VM on CPU, multicore, GPU, and FPGA]
Streaming IR • Expose structure: computation and communication • Uniform framework for pipeline and data parallelism • Canonical representation for stream-aware optimizations
Streaming Optimizations Macro-functional • Fold streaming IR graphs into FPGA… • Fusion, fission, replication • …subject to latency, area, and throughput constraints Micro-functional • Micro-pipelining • Channel allocation • Access fusion • Flip-flop elimination
Ongoing Effort • Application development • Streaming for enterprise and consumer • Real-time applications • Compiler and JIT • Pre-provisioning profitable HW implementations • Runtime opportunities to “JIT” the HW • Advanced dynamic reconfiguration support in VM • Predictive, hides latency • New platforms • Tightly coupled, higher bandwidth, lower latency communication • Heterogeneous MPSoC systems – FPGA + processors