
Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware



  1. Liquid Metal’s OPTIMUS: Synthesis of Efficient Streaming Hardware. Scott Mahlke (University of Michigan) and Rodric Rabbah (IBM). DAC HLS Tutorial, San Francisco, 2009

  2. Dynamic Application-Specific Customization of HW. A JIT compiler configures the logic for your application (“your app here”). Inspired by the ASIC paradigm: • High performance • Low power

  3. Liquid Metal: “JIT the Hardware” • Single language for programming HW & SW • Run in a standard JVM, or synthesize to HW • Fluidly move computation between HW & SW • Do for HW (viz. FPGAs) what FORTRAN did for computing • Address critical technology trends, spanning the spectrum from ASIC-like to reconfigurable

  4. Lime: the Liquid Metal Language • Design Principles: • Object-oriented, Java-like, Java-compatible • Raise level of abstraction • Parallel constructs that simplify code • Target synthesis while retaining generality

  5. Four reasons this is not just another *C-to-HDL approach • Emphasis on programmer productivity • Leverage rich Java IDEs, libraries, and analyses • Not an auto-parallelization approach • Lime is explicitly parallel and synthesizable • Fast fail-safe mechanism • Lime may be refined into a parallel SW implementation • Intrinsic opportunity for online optimizations • Static optimizations with dynamic refinement

  6. Lime Overview: how Lime features map onto HW (FPGA) properties • Tasks and value types → computation is well encapsulated • Streaming primitives → data-flow driven computation • Rate “matching” operators → multiple “clock domains” • Ordinal-indexed arrays and bounded loops → memory usage statically determined before layout • Template-like generics → bit-level control and reasoning. Abstract OO programming down to the bit level!

  7. Streams: Exposing Computational Structure • Stream primitives are integral to the language • Tasks in streams are strongly isolated • Only the endpoints may perform side-effects • Provide macro-level functional programming abstraction… • … While allowing traditional imperative programming inside

  8. A Brief Introduction to Stream Operations. A finite stream literal: int stream s1 = { 1, 1, 2, 3, 5, 8 }; An infinite stream of 3’s: int stream s2 = task 3; Stream expressions: int stream s3 = s2 * 17; double stream s4 = Math.sin(s1); double stream s5 = s3 + s4; These operations create and connect tasks; execution occurs later: lazy, functional computation.
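To make the "create tasks now, execute later" semantics concrete, here is a minimal Python sketch of lazy stream composition using iterators. This is not Lime: the helpers `literal`, `constant`, and `lift` are illustrative names, not Liquid Metal APIs.

```python
import itertools
import math

def literal(*values):
    """A finite stream literal, like { 1, 1, 2, 3, 5, 8 }."""
    return iter(values)

def constant(v):
    """An infinite stream of a constant, like `task 3`."""
    return itertools.repeat(v)

def lift(f, *streams):
    """Apply f elementwise across streams; connects 'tasks' lazily."""
    return map(f, *streams)

s1 = literal(1, 1, 2, 3, 5, 8)
s2 = constant(3)
s3 = lift(lambda x: x * 17, s2)        # int stream s3 = s2 * 17;
s4 = lift(math.sin, s1)                # double stream s4 = Math.sin(s1);
s5 = lift(lambda a, b: a + b, s3, s4)  # double stream s5 = s3 + s4;

# Nothing has executed yet; pulling values drives the graph. The
# combined stream ends with the shorter (finite) operand, as map does.
first_two = list(itertools.islice(s5, 2))  # both elements are 51 + sin(1)
```

The key point mirrored from the slide: building `s3`, `s4`, and `s5` only wires up the computation; work happens when a downstream consumer pulls.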

  9. Simple Audio Processing value int[] squareWave(int freq, int rate, int amplitude) { int wavelength = rate / freq; int[] samples = new int[wavelength]; for (int s: 1::wavelength) samples[s] = (s <= wavelength/2) ? 0 : amplitude; return (value int[]) samples; } int stream sqwaves = task squareWave(1000, 44100, 80); task AudioSink(44100).play(sqwaves);
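A Python rendering of the squareWave filter above makes the arithmetic concrete (shifted to 0-based indexing, where Lime's loop above uses ordinal indices 1::wavelength):

```python
def square_wave(freq, rate, amplitude):
    """One wavelength of a square wave: low half, then high half."""
    wavelength = rate // freq           # samples per period
    samples = [0] * wavelength
    for s in range(wavelength):         # 0-based, unlike Lime's 1::wavelength
        samples[s] = 0 if s < wavelength // 2 else amplitude
    return samples

# 44100 / 1000 -> 44 samples per period: 22 zeros, then 22 at amplitude 80.
one_period = square_wave(1000, 44100, 80)
```

In the Lime version, the returned array is cast to a value type, which is what lets the task's memory footprint be determined statically before layout.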

  10. Liquid Metal Tool Chain [toolchain diagram] Lime source is compiled by the Quicklime front-end compiler into the Streaming IR, guided by an FPGA model. From there, the Optimus back-end compiler emits HDL, which the Xilinx VHDL compiler turns into a Xilinx bitfile for a Virtex5 FPGA; the Crucible back-end compiler emits C, which the Cell SDK builds into a Cell binary for the Cell BE. The LM VM manages execution on both targets.

  11. Streaming Intermediate Representation (SIR) [diagram of the SIR constructs: Task, Pipeline, SplitJoin (splitter/joiner), Switch, and Feedback Loop] • A task may be stateless or have state • Each task maps to a “module” with FIFO I/O • Task graphs are hierarchical & structured
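The "hierarchical & structured" property can be sketched as a small data structure: every SIR node is either a leaf Task or a container (Pipeline, SplitJoin) whose children are themselves SIR nodes. This is a hypothetical Python model for illustration, not the actual compiler's representation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    stateful: bool = False   # a task may be stateless or carry state

@dataclass
class Pipeline:              # stages run in sequence, FIFO-connected
    stages: list

@dataclass
class SplitJoin:             # a splitter fans out, a joiner fans back in
    branches: list

def count_tasks(node):
    """Walk the hierarchical graph and count leaf tasks."""
    if isinstance(node, Task):
        return 1
    children = node.stages if isinstance(node, Pipeline) else node.branches
    return sum(count_tasks(c) for c in children)

graph = Pipeline([Task("source"),
                  SplitJoin([Task("f1"), Task("f2")]),
                  Task("sink")])
```

Because containers nest (a SplitJoin branch can itself be a Pipeline), structured graph-wide optimizations can recurse over the tree rather than pattern-match an arbitrary netlist.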

  12. SIR Compiler Optimizations • Address FPGA compilation challenges • Finite, non-virtualizable device • Complex optimization space • Throughput, latency, power, area • Very long synthesis times (minutes-hours)

  13. Preliminary Liquid Metal Results on Energy Consumption: FPGA vs PPC 405 • Liquid Metal on Virtex 4 FPGA, 1.6W • C reference implementation on PPC 405, 0.5W

  14. Preliminary Liquid Metal Results on Parallelism: FPGA vs PPC 405 • Liquid Metal on Virtex 4 FPGA, 1.6W • C reference implementation on PPC 405, 0.5W

  15. Handel-C Comparison • Compared DES and DCT with hand-optimized Handel-C implementation • Performance • 5% faster before optimizations • 12x faster after optimizations • Area • 66% larger before optimizations • 90% larger after optimizations

  16. Overview • Compilation Flow • Scheduling • Optimizations • Results

  17. Top Level Compilation [diagram] Each element of the stream graph is compiled to its own module (M0 … Mn), each containing a controller plus init/work logic: a Source feeds a Round-Robin Splitter(8,8,8,8), which distributes data to parallel Filters (B, C, D, E and F, G, H, I); a Round-Robin Joiner(1,1,1,1) collects their outputs into a final Filter (J) and Sink.

  18. Filter Compilation [diagram] A filter’s basic blocks are compiled into hardware stages linked by control tokens and acknowledgments, with live data carried in registers between blocks and memory/queue ports at the boundaries: bb1 initializes state (sum = 0; i = 0) behind muxes on the live data ins; bb2 performs a FIFO read (temp = pop()); bb3 computes sum = sum + temp; i = i + 1 and branches back to bb2 if i < 8; bb4 performs a FIFO write (push(sum)).

  19. Operation Compilation [diagram] Each operation in a basic block maps to a functional unit with registered inputs and outputs: sum = sum + temp and i = i + 1 become ADD units, while the branch (Branch bb2 if i < 8) becomes a CMP unit comparing i against 8 and producing the FU predicate that selects between the control outs.

  20. Static Stream Scheduling • Each queue must be deep enough to hold the values generated by a single execution of the connected filter • Double buffering is needed • Buffer access is non-blocking • A controller module orchestrates the schedule • The controller uses a finite state machine to execute the steady-state schedule [diagram: Filter 1 pushes 2 values per firing; Filter 2 pops 3]
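The steady-state arithmetic behind this can be sketched in a few lines of Python, using the example rates above (Filter 1 pushes 2 items per firing, Filter 2 pops 3). The firing-balance computation is standard synchronous-dataflow reasoning; the exact queue-sizing rule used by OPTIMUS may differ, so treat the doubling below as an assumption illustrating double buffering.

```python
from math import gcd

def steady_state(push_rate, pop_rate):
    """Firing multiplicities that balance items produced and consumed."""
    l = push_rate * pop_rate // gcd(push_rate, pop_rate)   # lcm of rates
    return l // push_rate, l // pop_rate

def queue_depth(push_rate, producer_firings, double_buffered=True):
    """The queue holds one full producer execution; double buffering lets
    the next execution fill one buffer while the consumer drains the other."""
    single = push_rate * producer_firings
    return 2 * single if double_buffered else single

# Filter 1 pushes 2/firing, Filter 2 pops 3/firing:
m1, m2 = steady_state(2, 3)    # 3 firings push 6 items; 2 firings pop 6
depth = queue_depth(2, m1)     # 6 items per execution, doubled to 12 slots
```

The controller's FSM then simply sequences these multiplicities (fire Filter 1 three times, Filter 2 twice, repeat), which is why buffer accesses can be non-blocking.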

  21. Greedy Stream Scheduling • Filters fire eagerly • Channel access is blocking • Allows potentially smaller channels • No controller is needed • Results are produced with lower latency

  22. Latency Comparison

  23. Area Comparison

  24. Optimizations • Streaming optimizations (macro-functional) • Channel allocation, channel access fusion, critical path balancing, filter fission and fusion, etc. • These optimizations need global information about the stream graph • Typically performed manually using existing tools • Classic optimizations (micro-functional) • Flip-flop elimination, common subexpression elimination, constant folding, loop unrolling, etc. • Typically included in existing compilers and tools

  25. Channel Allocation • Larger channels mean: • More SRAM • More control logic • Fewer stalls • Interlocking ensures that each filter gets the right data or blocks • What is the right channel size?

  26. Channel Allocation Algorithm • Set the size of the channels to infinity • Warm up the queues • Record the steady-state instruction schedule for each producer/consumer pair • Unroll the schedules so they have the same number of pushes and pops • Find the maximum number of overlapping lifetimes
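The last step of the algorithm reduces to a simple scan over the recorded schedule: count how many pushed values are alive (pushed but not yet popped) at each point, and take the maximum. A minimal Python sketch, with a made-up event trace for illustration:

```python
def max_overlap(schedule):
    """schedule: iterable of 'push'/'pop' events in recorded program order.
    Returns the peak channel occupancy, i.e. the required channel depth."""
    occupancy = 0
    peak = 0
    for ev in schedule:
        if ev == "push":
            occupancy += 1
            peak = max(peak, occupancy)
        else:  # "pop"
            occupancy -= 1
    return peak

# The producer gets three values ahead of the consumer at its deepest point:
trace = ["push", "push", "push", "pop", "push", "pop", "pop", "pop"]
```

Running `max_overlap(trace)` on this trace gives 3, which is the situation the next slide's example labels "Max overlap = 3": the channel needs exactly three slots, no more.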

  27. Channel Allocation Example [diagram: Source → Filter 1 → Filter 2 → Sink, with the producer and consumer push/pop schedules overlaid] Max overlap = 3

  28. Channel Allocation

  29. Channel Access Fusion • Each channel access (push or pop) takes one cycle • This raises the communication-to-computation ratio • Lengthens the critical path latency • Limits task-level parallelism

  30. Channel Access Fusion Algorithm • Cluster channel access operations • Loop unrolling • Code motion • Balance the groups • Similar to vectorization • Use wide channels [diagram: a producer/consumer pair with Write Mult. = 1, Read Mult. = 8 rebalanced to Write Mult. = 8, Read Mult. = 8, and another with Write Mult. = 4, Read Mult. = 1]

  31. Access Fusion Example. Original: int sum = 0; for (int i = 0; i < 32; i++) sum += pop(); push(sum); Unrolled by 4: int sum = 0; for (int i = 0; i < 8; i++) { sum += pop(); sum += pop(); sum += pop(); sum += pop(); } push(sum); Fused: int sum = 0; int t1, t2, t3, t4; for (int i = 0; i < 8; i++) { (t1, t2, t3, t4) = pop4(); sum += t1 + t2 + t3 + t4; } push(sum); • Some caveats: accesses that do not divide evenly into the fused width (e.g., trailing pop(); pop(); before the push(sum)) are left as scalar accesses.
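The transformation preserves semantics: the scalar loop and the fused loop consume the same 32 items and push the same sum. A Python model (not Lime) of the two shapes, using a deque in place of a hardware FIFO:

```python
from collections import deque

def scalar_sum(chan):
    """Baseline: one pop per cycle, 32 cycles of channel traffic."""
    s = 0
    for _ in range(32):
        s += chan.popleft()
    return s

def fused_sum(chan):
    """After 4x unrolling and fusion: 8 wide reads standing in for pop4()."""
    s = 0
    for _ in range(8):
        t1, t2, t3, t4 = (chan.popleft() for _ in range(4))
        s += t1 + t2 + t3 + t4
    return s

data = list(range(32))
assert scalar_sum(deque(data)) == fused_sum(deque(data)) == sum(data)
```

In hardware, the payoff is that the fused version issues 8 channel accesses over a 4-wide channel instead of 32 scalar ones, cutting communication cycles on the critical path.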

  32. Access Fusion

  33. Critical Path Balancing • The critical path is set by the longest combinational path in the filters • Optimus uses its internal FPGA model to estimate how this impacts throughput and latency • Balancing algorithm: • Optimus takes the target clock as input • Start with the smallest number of basic blocks • Form USE/DEF chains for the filter • Use the internal FPGA model to measure critical path latency • Break the paths whose latency exceeds the target
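The "break the paths" step can be sketched greedily: walk a chain of operations with estimated combinational delays (what the internal FPGA model would supply) and insert a register boundary whenever the accumulated delay would exceed the target clock period. The delay numbers and the greedy policy below are illustrative assumptions, not Optimus's actual model.

```python
def break_paths(op_delays, target_period):
    """Split a chain of (op, delay) pairs into pipeline stages so that no
    stage's total combinational delay exceeds target_period."""
    stages, current, acc = [], [], 0.0
    for op, delay in op_delays:
        if current and acc + delay > target_period:
            stages.append(current)       # register boundary: break the path
            current, acc = [], 0.0
        current.append(op)
        acc += delay
    if current:
        stages.append(current)
    return stages

# Hypothetical per-op latencies from the FPGA model, in ns:
chain = [("mul", 4.0), ("add", 1.5), ("add", 1.5), ("mul", 4.0), ("sub", 1.5)]
stages = break_paths(chain, target_period=5.0)
```

Here `stages` comes out as four stages ([mul], [add, add], [mul], [sub]): each multiplier nearly fills a cycle on its own, while the cheap adds share one, which is exactly the kind of restructuring the example on the next slide shows.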

  34. Critical Path Balancing Example [diagram: a dataflow graph of Add, Mul, Sub, and Shift operations, before and after being broken into pipeline stages 1–4]

  35. Interdisciplinary effort addressing the entire stack • One language for programming HW (FPGAs) and SW • Liquid Metal VM: JIT the hardware! [diagram: a Liquid Metal program, written entirely in Lime, running on the Liquid Metal VM over CPU, FPGA, GPU, and multicore targets]

  36. Streaming IR • Expose structure: computation and communication • Uniform framework for pipeline and data parallelism • Canonical representation for stream-aware optimizations

  37. Streaming Optimizations Macro-functional • Fold streaming IR graphs into FPGA… • Fusion, fission, replication • …subject to latency, area, and throughput constraints Micro-functional • Micro-pipelining • Channel allocation • Access fusion • Flip-flop elimination

  38. Ongoing Effort • Application development • Streaming for enterprise and consumer • Real-time applications • Compiler and JIT • Pre-provisioning profitable HW implementations • Runtime opportunities to “JIT” the HW • Advanced dynamic reconfiguration support in VM • Predictive, hides latency • New platforms • Tightly coupled, higher bandwidth, lower latency communication • Heterogeneous MPSoC systems – FPGA + processors
