This talk discusses the potential of FPGAs in next-generation embedded systems, and explores methods for hiding FPGAs from programmers through binary decompilation and just-in-time FPGA compilation. The goal is to make FPGAs more accessible and widely used in software development.
The New Software: Invisible Ubiquitous FPGAs that Enable Next-Generation Embedded Systems
Frank Vahid
Professor, Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, and Freescale
Contributing Students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), David Sheldon (3rd yr PhD), Ryan Mannion (2nd yr PhD), Scott Sirowy (1st yr PhD)

Outline
• FPGAs: The New Software
  • Why they’re great
  • Why they’re not ubiquitous yet
• Hiding FPGAs from programmers
  • Warp processing
  • Binary decompilation
  • Just-in-time FPGA compilation
• Towards Standard Binaries for FPGAs
Frank Vahid, UC Riverside

FPGAs
• FPGA: Field-Programmable Gate Array
• Implement circuit by downloading bits
  • Memory with N address inputs (“LUT”) implements any N-input combinational logic function
  • Register-controlled switch matrix (SM) connects LUTs
• FPGA fabric
  • Thousands of LUTs and SMs, increasingly with additional hard core components like multipliers, RAM, etc.
  • CAD tools automatically map desired circuit onto FPGA fabric
[Figure: a 4x2 memory used as a LUT with inputs a, b and outputs F, G; a 2x2 switch matrix; an FPGA fabric of LUTs connected by SMs. Caption: Implement circuit by downloading particular bits]

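The LUT idea on this slide can be sketched in a few lines of C. This is a minimal illustration, not the talk's tooling: the `Lut4x2` type and the `lut_program`, `lut_f`, and `lut_g` names are hypothetical, and the model only mirrors the slide's 4x2 memory (2 address inputs, 2 output bits per word).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the slide's 4x2 memory: inputs a and b form a 2-bit
   address, and each of the 4 words stores 2 output bits (F and G). */
typedef struct {
    uint8_t word[4]; /* bit 1 = F, bit 0 = G */
} Lut4x2;

/* "Download bits": fill the memory from two truth tables, each listing the
   output for ab = 00, 01, 10, 11. */
static Lut4x2 lut_program(const uint8_t f_table[4], const uint8_t g_table[4]) {
    Lut4x2 lut;
    for (int addr = 0; addr < 4; addr++) {
        assert(f_table[addr] <= 1 && g_table[addr] <= 1);
        lut.word[addr] = (uint8_t)((f_table[addr] << 1) | g_table[addr]);
    }
    return lut;
}

/* Evaluating the logic is a single memory read at address ab. */
static unsigned lut_f(const Lut4x2 *lut, unsigned a, unsigned b) {
    return (lut->word[((a & 1u) << 1) | (b & 1u)] >> 1) & 1u;
}

static unsigned lut_g(const Lut4x2 *lut, unsigned a, unsigned b) {
    return lut->word[((a & 1u) << 1) | (b & 1u)] & 1u;
}
```

Programming the tables with F = a AND b and G = a XOR b, for instance, turns the same memory into a half adder; changing the circuit means only changing the stored bits.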
FPGAs are "Programmable" like Microprocessors: Just Download Bits
• Microprocessor binaries ("software"): bits loaded into program memory
• FPGA "binaries" ("hardware"), more commonly known as "bitstream": bits loaded into LUTs and SMs
[Figure: the same kind of bit sequence (01110100...) downloaded either into a processor's program memory or into an FPGA]

FPGA: Why (Sometimes) Better than Microprocessor
C Code for Bit Reversal
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Binary (after compilation)
    sll $v1[3],$v0[2],0x10
    srl $v0[2],$v0[2],0x10
    or $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x8
    and $v1[3],$v1[3],$t5[13]
    sll $v0[2],$v0[2],0x8
    and $v0[2],$v0[2],$t4[12]
    or $v0[2],$v1[3],$v0[2]
    srl $v1[3],$v0[2],0x4
    and $v1[3],$v1[3],$t3[11]
    sll $v0[2],$v0[2],0x4
    and $v0[2],$v0[2],$t2[10]
    ...
• Processor: requires between 32 and 128 cycles
• FPGA: requires only 1 cycle (speedup of 32x to 128x)
[Figure: Circuit for Bit Reversal, wires connecting each bit of the original X value to the mirrored position of the bit-reversed X value]

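As a minimal sketch, the slide's five swap steps can be wrapped in a function and cross-checked against a one-bit-at-a-time reference. The function names are hypothetical, and `x` is assumed to be an unsigned 32-bit value so the right shifts do not sign-extend.

```c
#include <assert.h>
#include <stdint.h>

/* Swap-based reversal from the slide: halves, bytes, nibbles, bit pairs, bits. */
static uint32_t reverse_bits_swap(uint32_t x) {
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference: move bit i to position 31 - i, one bit per iteration. */
static uint32_t reverse_bits_loop(uint32_t x) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r |= ((x >> i) & 1u) << (31 - i);
    }
    return r;
}

/* Returns 1 when both versions agree on x. */
static int reversals_agree(uint32_t x) {
    uint32_t r = reverse_bits_swap(x);
    assert(reverse_bits_swap(r) == x); /* reversing twice restores x */
    return r == reverse_bits_loop(x);
}
```

Either version costs the processor many shift, mask, and OR instructions, whereas the FPGA circuit is pure wiring: each input bit is simply routed to its mirrored output position.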
FPGA: Why (Sometimes) Better than Microprocessor
C Code for FIR Filter
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i];
    ...
• Processor: 1000’s of instructions, several thousand cycles
• FPGA: ~ 7 cycles, speedup > 100x
• In general, FPGA better due to circuit's concurrency, from bit-level to task level
[Figure: Circuit for FIR Filter, a row of parallel multipliers (*) feeding a tree of adders (+)]

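One plausible reading of the "~ 7 cycles" figure, offered here as an assumption rather than something the slide states, is that the 128 products are computed in parallel and then summed by a balanced adder tree of depth log2(128) = 7. The sketch below models that structure sequentially in C; `fir_tree` and `TAPS` are hypothetical names, and the filter output is assumed to be a single sum of 128 products.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 128

/* Models the circuit's structure: one level of parallel multipliers, then a
   balanced adder tree. Returns the sum of products and reports how many
   adder levels the tree needed. */
static long fir_tree(const int c[TAPS], const int x[TAPS], int *tree_levels) {
    long partial[TAPS];
    int levels = 0;

    assert(tree_levels != NULL);

    /* In hardware, all 128 multiplies happen at the same time. */
    for (size_t i = 0; i < TAPS; i++) {
        partial[i] = (long)c[i] * x[i];
    }

    /* Each pass halves the number of partial sums; in hardware every add
       within a pass runs concurrently, so a pass is one tree level. */
    for (size_t width = TAPS; width > 1; width /= 2) {
        for (size_t i = 0; i < width / 2; i++) {
            partial[i] = partial[2 * i] + partial[2 * i + 1];
        }
        levels++;
    }

    *tree_levels = levels;
    return partial[0];
}
```

The software loop performs its 128 multiply-accumulate steps one after another, while the circuit's latency grows only with the tree depth, which is where the concurrency advantage comes from.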
Extensive Studies over Past Decade
• Large speedups on many important applications
  • See ACM/SIGDA Int. Symp. on FPGAs
• So why aren't FPGAs ubiquitous?

Why FPGAs aren’t Ubiquitous
• Cost (but improving yearly)
• Power (but improving yearly, and energy benefits too)
• Extra chip (but integration continues)
• Programming methodology
[Figure: 1 million system gate FPGA cost. Source: Xilinx]

Why FPGAs aren’t Mainstream
• Cost
• Power
• Extra chip
• Programming methodology
  • Though tremendous progress in past decade
[Figure: implementation flows. Application (C/C++/Java/SystemC/Handel-C/Streams-C/…) → Automated hardware/software partitioning, which feeds two paths.
Microprocessor path: C/C++/Java → Compilation (1960s, 1970s) → Assembly code → Assembling, linking (1950s, 1960s) → Microprocessor binary → Downloading → Microprocessors.
FPGA path: C/C++/Java/VHDL/Verilog/SystemC/Handel-C/Streams-C... → Behavioral synthesis (1990s) → Register transfers → RT synthesis (1980s, 1990s) → Logic equations / FSMs → Logic synthesis, physical design (1970s, 1980s) → FPGA binary → Downloading → FPGA circuits.]

So What’s the Holdup?
• FPGAs require special compilers
  • Limits adoption: desktop world dominates
  • 100 software writers for every CAD user
  • Millions of compiler seats worldwide, vs. 15,000 CAD seats
• Can't ignore "ecosystem" from separation of applications, tools, and architectures
  • Just consider history of popular processors
[Figure: ecosystem layers (Applications, Tools, Standard binaries, Architectures). A standard compiler turns an application into a microprocessor binary for the processor; a special compiler (includes synthesis, tech. map, place & route) produces a microprocessor binary plus an FPGA binary for processor and FPGA.]

Outline
• FPGAs: The New Software
  • Why they’re great
  • Why they’re not ubiquitous yet
• Hiding FPGAs from programmers
  • Warp processing
  • Binary decompilation
  • Just-in-time FPGA compilation
• Towards Standard Binaries for FPGAs

Can we Hide FPGAs from Programmers and Standard Tools?
• Example
  • Radically different x86 architectures hidden from programmers and tools
  • All execute standard x86 binaries
  • On-chip tools dynamically translate binary to particular architecture
• Idea: Hide FPGA from programmers and tools
  • Download standard binary
  • Have on-chip tools dynamically translate binary (portions) to FPGA
  • We call this Warp Processing
[Figure: SW → Standard Compiler → Binary. Translators map the same binary onto a RISC architecture or a VLIW architecture; for warp processing, Profiling and a Translator map it onto Proc. + FPGA. Annotation: "Traditional partitioning done here".]

Warp Processing Idea
Step 1: Initially, software binary loaded into instruction memory
Software Binary
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
[Figure: warp processor with Profiler, I Mem, µP, D$, FPGA, and On-chip CAD]

Warp Processing Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: warp processor with Profiler, I Mem, µP, D$, FPGA, and On-chip CAD; µP highlighted]

Warp Processing Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
• Critical Loop Detected
[Figure: warp processor with Profiler, I Mem, µP, D$, FPGA, and On-chip CAD; the Profiler observes the repeated add/beq instruction stream from the µP]

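One way a profiler could detect a critical loop is to count taken backward branches per target address, since a backward branch closes a loop. The sketch below assumes that approach; the slide does not specify the mechanism, and the `Profiler` type, the 16-slot table size, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 16 /* hypothetical table size */

typedef struct {
    uint32_t target[SLOTS]; /* loop-start address of each tracked loop */
    uint32_t count[SLOTS];  /* taken backward branches seen for that loop */
    int used;
} Profiler;

static void profiler_init(Profiler *p) {
    assert(p != NULL);
    p->used = 0;
}

/* Call on every taken branch. Forward branches are ignored. */
static void profiler_on_branch(Profiler *p, uint32_t pc, uint32_t target) {
    assert(p != NULL);
    if (target >= pc) {
        return; /* not a loop-closing branch */
    }
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < SLOTS) { /* new loops are dropped once the table is full */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Writes the most frequently executed loop's start address to *target_out.
   Returns 1 if any loop has been seen, 0 otherwise. */
static int profiler_hottest(const Profiler *p, uint32_t *target_out) {
    assert(p != NULL && target_out != NULL);
    if (p->used == 0) {
        return 0;
    }
    int best = 0;
    for (int i = 1; i < p->used; i++) {
        if (p->count[i] > p->count[best]) {
            best = i;
        }
    }
    *target_out = p->target[best];
    return 1;
}
```

In the slide's example, the branch at the end of the ten-iteration loop would dominate the counts, marking that loop as the region to hand to the on-chip CAD.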
Warp Processing Idea
Step 4: On-chip CAD reads in critical region
[Figure: warp processor with Profiler, I Mem, µP, D$, FPGA, and On-chip CAD; On-chip CAD highlighted]

Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
Decompiled critical region
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
[Figure: warp processor with Profiler, I Mem, µP, D$, FPGA, and On-chip CAD, labeled Dynamic Part. Module (DPM)]

Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: the decompiled loop next to the synthesized circuit, a tree of adders (+) that sums the memory values in parallel]

Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: the adder-tree circuit placed onto the FPGA's CLBs and routed through its SMs]

Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more
Updated Binary
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
[Figure: software-only execution compared with “warped” execution using the FPGA]

Warp Processing Challenges
• Two key challenges
  • Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  • Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
[Figure: on-chip CAD flow. Binary → Profiling & partitioning → Decompilation → Synthesis → Std. HW Binary → JIT FPGA compilation → FPGA binary; a Binary Updater produces the updated Micropr. Binary.]

Decompilation
• If we don't decompile
  • High-level information (e.g., loops, arrays) lost during compilation
  • Direct translation of assembly to circuit: big overhead
• Need to recover high-level information
[Figure: Overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone]

Decompilation
• Solution: Recover high-level information from binary (decompilation)
  • Adapted extensive previous work (for different purposes)
  • Developed new decompilation methods also
  • Ph.D. work of Greg Stitt (Ph.D. UCR 2006)
  • Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
Original C Code
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Corresponding Assembly
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Control/Data Flow Graph Creation
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Data Flow Analysis
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Function Recovery
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
      loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }
Control Structure Recovery
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }
Array Recovery
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
• Almost Identical Representations: the function after array recovery matches the original C code

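The claim that the recovered code is equivalent to the binary can be checked with a small sketch. It assumes `mem` is a byte-addressed array, `reg2` is a byte offset into it, and the load is a signed 16-bit read (matching the `<< 1` scaling and the `short` array); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Instruction-by-instruction view of the binary: explicit address arithmetic
   on byte-addressed memory and goto-based control flow. */
static long f_lowlevel(const uint8_t *mem, long reg2) {
    long reg1, reg3 = 0, reg4 = 0, reg5;
    int16_t reg6;

    assert(mem != NULL);
loop:
    reg1 = reg3 << 1;                       /* Shl reg1, reg3, 1     */
    reg5 = reg2 + reg1;                     /* Add reg5, reg2, reg1  */
    memcpy(&reg6, mem + reg5, sizeof reg6); /* Ld reg6, 0(reg5)      */
    reg4 = reg4 + reg6;                     /* Add reg4, reg4, reg6  */
    reg3 = reg3 + 1;                        /* Add reg3, reg3, 1     */
    if (reg3 < 10) goto loop;               /* loop-closing branch   */
    return reg4;
}

/* View after control structure and array recovery: a counted loop over a
   10-element array of 16-bit values. */
static long f_recovered(const int16_t array[10]) {
    long sum = 0;
    for (int i = 0; i < 10; i++) {
        sum += array[i];
    }
    return sum;
}
```

The recovered form exposes what synthesis needs to know: the trip count is a constant 10 and the accesses walk a contiguous array, so the ten loads and additions can be unrolled into a parallel circuit.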
Decompilation Results vs. C
• Compared with synthesis from C
  • Synthesis after decompilation often quite similar
  • Almost identical performance, small area overhead
[Figure: results chart, from FPGA 2005]

Decompilation Results on Optimized H.264: In-depth Study with Freescale
• Used highly-optimized benchmark
• Results: Binary approach competitive
  • Speedups compared to ARM9 software: Binary: 2.48, C: 2.53
• Decompilation recovered nearly all high-level information needed for partitioning and synthesis

Tangent: Simple Coding Guidelines Bring Speedups Closer to Ideal
• Interesting discovery during H.264 study: C coding style limited speedup
  • Orthogonal to binary vs. C issue, since coding style hurt both
• Developed simple coding guidelines
  • Rewriting the software took 20 minutes, and the rewritten software ran only ~3% slower than the original
• New speedups: Binary: 6.55, C: 6.56
  • Binary still competitive with C
• Following guidelines not required, but helps any approach targeting FPGAs

Warp Processing Challenges
• Two key challenges
  • Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs?
  • Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources?
[Figure: on-chip CAD flow. Binary → Profiling & partitioning → Decompilation → Synthesis → Std. HW Binary → JIT FPGA compilation → FPGA binary; a Binary Updater produces the updated Micropr. Binary.]

JIT FPGA Compilation
• Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping; simultaneously developed CAD-oriented FPGA
  • e.g., Our router (ROCR) 10x faster and 20x less memory than popular VPR tool, at cost of 30% longer critical path. Similar results for synth & placement
• Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
• Numerous publications: http://www.cs.ucr.edu/~vahid/pubs
[Figure (DAC’04): JIT FPGA compilation time and memory. Xilinx ISE: 9.1 s, 60 MB. Riverside JIT FPGA tools: 0.2 s, 3.6 MB. Riverside JIT FPGA tools on a 75 MHz ARM7: 1.4 s, 3.6 MB.]

Overall Warp Processing Results: Performance Speedup (Most Frequent Kernel Only)
• Average kernel speedup of 41, vs. 21 for Virtex-E
  • Simpler FPGA fabric yields faster HW circuits
• Overall application speedup average is 7.4
• Currently prototyping our simpler FPGA fabric with Intel, scheduled for Q3 shuttle
[Figure: per-benchmark speedup relative to SW Only Execution]

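The gap between kernel speedup and overall application speedup follows from Amdahl's law: only the fraction of run time spent in the accelerated kernel benefits. The sketch below is a generic illustration; the slides do not give per-benchmark kernel fractions, so `kernel_fraction` is a hypothetical input, and averages across benchmarks do not combine through this formula.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction of the original run time
   (kernel_fraction, between 0 and 1) is sped up by kernel_speedup and the
   remainder runs at its original speed. */
static double overall_speedup(double kernel_fraction, double kernel_speedup) {
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For example, a kernel that accounts for 90% of software run time (an invented figure) and is sped up 41x gives `overall_speedup(0.9, 41.0)`, about 8.2, because the remaining 10% still runs on the microprocessor.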
Outline
• FPGAs: The New Software
  • Why they’re great
  • Why they’re not ubiquitous yet
• Hiding FPGAs from programmers
  • Warp processing
  • Binary decompilation
  • Just-in-time FPGA compilation
• Towards Standard Binaries for FPGAs

FPGA Ubiquity via Obscurity
• Warp processing hides FPGA from languages and tools
  • ANY microprocessor platform extendible with FPGA
  • Maintains "ecosystem": application, tool, and architecture developers
• New platforms with FPGAs appearing
[Figure: ecosystem layers (Applications, Tools, Standard binaries, Architectures). SW → Standard Compiler → Standard Binary → Profiling and Translator → Proc. + FPGA. Caption: New processor platforms with FPGA evolving]

FPGA Standard Binaries?
• Microprocessor binary represents one form of a "standard binary for FPGAs"
  • Missing is explicit concurrency
  • Parallelism, pipelining, queues, etc.
• As FPGAs appear in more platforms, might a more general FPGA binary evolve?
[Figure: ecosystem layers (Applications, Tools, Standard binaries / Standard FPGA binaries, Architectures). SW → Standard Compiler → Standard Binary → Profiling and Translator → Proc. + FPGA, alongside SystemC? → Standard FPGA Compiler → Standard FPGA binary? → Translator → FPGA. Caption: Ecosystem for FPGAs presently sorely missing]

FPGA Standard Binaries?
• Translator makes best use of existing FPGA resources
• Can even add FPGA, like adding memory, to improve performance
  • Add more FPGA to your PDA to implement compute-intensive application?
[Figure: the same FPGA binary translated for two platforms. Low-end PDA with a small FPGA: 100 sec. High-end PDA with more FPGA, enough for the full multiplier and adder-tree circuit: 1 sec.]

Summary
• FPGAs may be the new software
• Hiding FPGA via warp processing is feasible
  • Decompilation can recover high-level constructs to yield speedups competitive with source-level
  • JIT FPGA compilation can be made sufficiently lean
• Future: Standard binaries for FPGAs?
• Extensive work to be done
Publications can be found at: http://www.cs.ucr.edu/~vahid/pubs