High-Level Synthesis for FPGA-Based Processor/Accelerator Systems

This presentation discusses the benefits of high-level synthesis for designing FPGA-based processor/accelerator systems, including higher performance and better energy efficiency than software. It also introduces LegUp, a high-level synthesis tool, and explains the advantages of using FPGAs for circuit implementation.

Presentation Transcript


  1. From Software to Circuits: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems • Jason Anderson • Tools to Tackle Big Data – Big Data Workshop, 3 July 2014 • Dept. of Electrical and Computer Engineering, University of Toronto

  2. LegUp Research Team • Undergrad Researchers: Mathew Hall, Stefan Hadjis, Joy Chen • Faculty: Stephen Brown and myself • Industry Liaison: Tomasz Czajkowski, Altera • Andrew Canis, James Choi, Nazanin Calagar, Lanny Lian, Blair Fort

  3. Computations in Two Ways

  4. Computations in Two Ways Write Software

  8. Computations in Two Ways Write Software Design Custom Circuits

  17. Design Methodology Write software • Easy • Flexibility → lower performance Design custom circuits • Efficient, low power • Need specialized knowledge

  18. Hardware’s Potential • Implementing computations in FPGA hardware can have speed/energy advantages over software: • Lithography simulation: 15X speed-up [Cong & Zou, TRETS’09] • Linear system solver: 2.2X speed-up, 5X more energy efficient [Zhang, Betz, Rose, TRETS’12] • Monte Carlo simulation for photodynamic therapy: 80X faster, 45X more energy efficient [Lo et al., J. Biomed Optics’09] • Options pricing: 4.6X faster, 25X more energy efficient [Tse, Thomas, Luk, TVLSI’12]

  19. So Why Doesn’t Everybody Use Hardware? • Hardware design is difficult and skills are rare: • Requires use of hardware description languages: Verilog and VHDL • Low level of abstraction (individual bits) • 10 software engineers for every hardware engineer* • We need a CAD flow that simplifies hardware design for software engineers *US Bureau of Labor Statistics, 2012

  20. A Solution • High-Level Synthesis • Design circuits using software languages • From a software program, a high-level synthesis tool automatically “synthesizes” a circuit that does the same computations as the program • Benefits of software programmability and hardware performance

  21. LegUp High-Level Synthesis for FPGAs • LegUp is a high-level synthesis tool we have been developing since 2009. • Takes a C program as input, and produces a circuit. • 1000+ downloads of our tool since its first release in 2011. • http://legup.eecg.toronto.edu

  22. legup.eecg.toronto.edu

  23. Why Use FPGAs to Implement Circuits? • Building fully fabricated custom chips is hard • Very complex design process • Costs $millions to prototype a chip • Takes 2-3 months to fabricate • Only done for high volume applications or apps that require high speed or lowest power • Alternative: pre-fabricated, programmable chips: Field-Programmable Gate Arrays (FPGAs)

  24. Field-Programmable Gate Arrays • Pre-fabricated chip consists of an “array” of logic blocks surrounded by programmable interconnect • Hardware “becomes” what you want by programming blocks and interconnect (electrically) [Figure: FPGA floorplan. Labels: grid of configurable logic blocks (CLBs); Block RAM (SRAM block, e.g., 18 kbits); Hard IP Block (common blocks: multiplier, DSP, processor, PCI, ADC, DLL); channels of programmable interconnect]

  25. A Real FPGA – Altera Stratix III

  26. FPGA Advantages over “Hard” Chips • “Manufacture” takes seconds vs. months • Design, test and manufacture: $single-digit millions vs. $tens of millions • Giving: • Faster time-to-market for products • FPGA vendor handles difficult design & manufacture issues • FPGA vendor shares inventory risk across many customers • FPGA vendor does test • Two largest FPGA vendors: Xilinx and Altera

  27. FPGAs and High-Level Synthesis • FPGAs mainly accessible to HW engineers • Vendors want to expand user-base: make FPGAs usable as computing platforms • Area/power/delay gap between HLS-generated HW and manually crafted HW • In custom Si, user must “pay” for area gap • Power/performance one of main reasons to go custom • FPGAs are likely the IC medium through which HLS goes “mainstream”

  28. LegUp: Top-Level Vision • Program code, e.g.: int FIR(int ntaps, int sum) { int i; for (i=0; i < ntaps; i++) sum += h[i] * z[i]; return (sum); } .... [Figure: LegUp design flow. Labels: Program code; C Compiler; Processor (MIPS/ARM); Self-Profiling Processor; Profiling Data: Execution Cycles, Power, Cache Misses; Suggested program segments to target to HW; High-level synthesis; Altered SW binary (calls HW accelerators); P; Hardened program segments; FPGA fabric]
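The FIR fragment on the slide refers to arrays `h` and `z` that are not shown. The following is a minimal self-contained sketch, assuming `h` holds the filter coefficients and `z` the most recent input samples; the array length `NTAPS` and all values are illustrative and do not come from the talk.

```c
#include <assert.h>

#define NTAPS 4 /* illustrative filter length (assumption) */

/* Assumed meaning: h = filter coefficients, z = most recent input samples. */
static const int h[NTAPS] = {1, 2, 3, 4};
static int z[NTAPS] = {5, 6, 7, 8};

/* Same loop as on the slide: accumulate h[i] * z[i] into sum. */
int FIR(int ntaps, int sum) {
    int i;
    assert(ntaps >= 0 && ntaps <= NTAPS); /* stay inside the arrays */
    for (i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
    return sum;
}
```

Calling `FIR(NTAPS, 0)` returns the full dot product of the two arrays. A loop of this shape is the kind of program segment the flow above might suggest for hardware.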

  29. LegUp: Key Features • C to Verilog high-level synthesis • Many benchmarks (incl. 12 CHStone) • Automated verification tests • Support for four different FPGAs: • Altera Cyclone II, Stratix IV, Cyclone IV, Cyclone V-SoC • Open source, freely downloadable

  30. How Does High-Level Synthesis Work?

  34. Digital Circuits • Example: you buy a “1 GHz processor” • 1 GHz = 1 nanosecond time-steps • Some computation is done in each time step [Figure: time axis divided into 1 ns steps]

  35. Example Circuit • Inputs A and B feed an adder (+) • Calculate A+B in one 1 ns time step

  36. Example Circuit • Inputs A and B feed an adder (+) • Store computation after each step

  39. Example Circuit • Inputs: A, B, C, D, E, F • Time step 1 (1 ns): A+B, C–D, E*F • Time step 2 (1 ns): (A+B)*(C–D) • Time step 3 (1 ns): (A+B)*(C–D) – (E*F)
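The three-step circuit above can be mimicked in software. This is a rough sketch rather than generated hardware: the struct and function names are hypothetical, and carrying E*F through a register during step 2 is a modelling assumption.

```c
/* Values stored at the end of time step 1. */
typedef struct { int sum, diff, prod; } Step1Regs;

/* Values stored at the end of time step 2. */
typedef struct { int scaled, prod; } Step2Regs;

int example_circuit(int a, int b, int c, int d, int e, int f) {
    /* Time step 1: three independent operations run side by side. */
    Step1Regs s1 = { a + b, c - d, e * f };

    /* Time step 2: one multiply; E*F is simply carried forward (assumption). */
    Step2Regs s2 = { s1.sum * s1.diff, s1.prod };

    /* Time step 3: final subtraction gives (A+B)*(C-D) - (E*F). */
    return s2.scaled - s2.prod;
}
```

Each struct stands in for the storage elements that hold results between 1 ns time steps.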

  40. Scheduling: Key Aspect of HLS • How to assign the computations of a program into the hardware time steps? • C language snippet: z = a+b; x = c+d; q = z+x; q = q-2; r = q*2; • Programs do not contain the notion of “time steps”. • Here, we have: 3 add operations, 1 subtract operation, 1 multiplication operation

  41. Scheduling Questions: • Which operations can be scheduled in the same time step? • Which operations are dependent on others? • If addition takes 5ns, subtraction takes 5ns and multiplication takes 10ns, how to schedule? • Target clock step length is 10ns C language snippet: z = a+b; x = c+d; q = z+x; q = q-2; r = q*2;

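  42. Scheduling [Figure: schedule in three 10 ns time steps. Step 1: a+b and c+d. Step 2: z+x, then –2. Step 3: *2.]

  43. Scheduling [Same figure, annotated: the two additions in step 1 are parallel operations; the addition and subtraction in step 2 are chained.]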
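One way to make the scheduling questions concrete is to check a proposed schedule against the two rules the slides imply: an operation cannot run before the operations it depends on, and chained operations in one step must fit within the 10 ns clock period. The sketch below is a checker, not the scheduler used by any HLS tool; the `Op` struct, the table, and `schedule_is_valid` are hypothetical names, while the delays and dependencies come from the slides.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_OPS 5
#define MAX_DEPS 2
#define CLOCK_PERIOD_NS 10

typedef struct {
    const char *text;   /* statement from the slide */
    int delay_ns;       /* add/subtract 5 ns, multiply 10 ns */
    int deps[MAX_DEPS]; /* indices of producer operations, -1 = none */
} Op;

/* Listed in program order, so every producer precedes its consumers. */
static const Op ops[NUM_OPS] = {
    {"z = a+b", 5,  {-1, -1}},
    {"x = c+d", 5,  {-1, -1}},
    {"q = z+x", 5,  {0, 1}},
    {"q = q-2", 5,  {2, -1}},
    {"r = q*2", 10, {3, -1}},
};

/* step[i] is the time step assigned to ops[i]. */
bool schedule_is_valid(const int step[NUM_OPS]) {
    int finish_ns[NUM_OPS]; /* finish time of each op within its own step */
    for (int i = 0; i < NUM_OPS; i++) {
        int start_ns = 0;
        assert(step[i] >= 0);
        for (int k = 0; k < MAX_DEPS; k++) {
            int p = ops[i].deps[k];
            if (p < 0) continue;
            if (step[p] > step[i]) return false; /* producer runs too late */
            if (step[p] == step[i] && finish_ns[p] > start_ns)
                start_ns = finish_ns[p];         /* chained behind producer */
        }
        finish_ns[i] = start_ns + ops[i].delay_ns;
        if (finish_ns[i] > CLOCK_PERIOD_NS) return false;
    }
    return true;
}
```

The schedule on the slide corresponds to `{0, 0, 1, 1, 2}` and passes the check. Moving the multiply into the second step, `{0, 0, 1, 1, 1}`, fails because the chained delay would exceed 10 ns.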

  44. HLS Challenges • Performance of HLS-generated circuits not as good as human-designed circuits • However, HLS-generated circuits are already better than SW in many cases • Much of our research is aimed towards improving HLS quality

  45. Loop Pipelining

  46. Loop Pipelining • Cycles: 3N • Adders: 3 • Utilization: 33% • Code: for (int i = 0; i < N; i++) { sum[i] = a + b + c + d; } [Figure: one iteration without pipelining. Cycle 1: a + b. Cycle 2: + c. Cycle 3: + d.]

  47. Loop Pipelining Steady State • Cycles: N+2 (~1 cycle per iteration) • Adders: 3 • Utilization: 100% in steady state

  48. Loop Pipelining • Ideally, we could start a loop iteration every clock cycle • Initiation interval (II) = 1 • However, • Loops may have dependencies across iterations • There may be constraints on resources • e.g. only two memory accesses in a cycle • Loop pipelining seeks to minimize II subject to constraints
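The cycle counts quoted for the three-adder loop (3N without pipelining, N+2 with it) are consistent with the usual pipeline latency formula, (N − 1) × II + depth. The sketch below is illustrative only: the function names are hypothetical, and the formula is the standard textbook one rather than something stated on the slides.

```c
#include <assert.h>

/* Without pipelining, each iteration occupies the datapath for `depth` cycles. */
long cycles_sequential(long n, long depth) {
    return n * depth;
}

/* With pipelining, a new iteration starts every `ii` cycles and the last
 * iteration still needs `depth` cycles to finish. */
long cycles_pipelined(long n, long depth, long ii) {
    assert(n >= 0 && depth >= 1 && ii >= 1);
    if (n == 0) return 0;
    return (n - 1) * ii + depth;
}
```

With a depth of 3 and II = 1, 100 iterations take 102 cycles instead of 300. If a dependency or resource constraint forces II = 2, the same loop takes 201 cycles, which is why minimizing II matters.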

  49. Exploiting Spatial Parallelism

  50. Motivation • Speed benefits of HW arise from spatial parallelism • Extracting parallelism from a sequential program is difficult • Auto-parallelizing compilers do not work well! • Easier to start from parallel code • Pthreads/OpenMP can help!
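As a small illustration of "starting from parallel code", the sketch below uses an OpenMP pragma to state explicitly that the loop iterations are independent apart from the reduction on `sum`. The function `dot_product` is a hypothetical example, and nothing here describes how a particular HLS tool handles the pragma. Compilers without OpenMP enabled ignore the pragma and run the loop sequentially with the same result.

```c
/* Sum of element-wise products. The pragma declares that iterations can run
 * in parallel and that partial sums are combined at the end. */
long dot_product(const int *a, const int *b, int n) {
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}
```

The programmer supplies the parallelism here, so no tool has to extract it from sequential code.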
