This presentation discusses the use of warp processing to convert binaries into circuits using FPGAs. It covers the techniques underlying warp processing, the overall results, and directions for future work.
Warp Processing -- Dynamic Transparent Conversion of Binaries to Circuits
Frank Vahid
Professor, Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, Motorola/Freescale
Contributing Students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd yr PhD), Ryan Mannion (2nd yr PhD), Scott Sirowy (1st yr PhD)
Outline
• FPGAs
  • Overview
  • Hard to program --> Binary-level partitioning
• Warp processing
  • Techniques underlying warp processing
  • Overall warp processing results
• Directions and Summary
Frank Vahid, UC Riverside
FPGAs
• FPGA -- Field-Programmable Gate Array
  • Off-the-shelf chip, evolved in early 1990s
  • Implements custom circuit just by downloading stream of bits (“software”)
• Basic idea: a memory with N address inputs (2^N words) can implement any N-input combinational logic function
  • (Note: no “gate array” inside)
  • Memory called Lookup Table, or LUT
• FPGA “fabric”
  • Thousands of small (~3-input) LUTs; larger LUTs are inefficient
  • Thousands of switch matrices (SM) for programming interconnections
  • Possibly additional hard core components, like multipliers, RAM, etc.
• CAD tools automatically map desired circuit onto FPGA fabric
[Figure: a 4x2 memory used as a lookup table (LUT), with address inputs a, b (a1, a0) and data outputs F, G (d1, d0); an FPGA fabric of LUTs, switch matrices (SM), multipliers, and adders. Caption: Implement particular circuit just by downloading particular bits]
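The LUT idea can be sketched in a few lines of C. This is only an illustration: the table contents (a half adder, F = a XOR b and G = a AND b) and the names `lut_half_adder`, `lut_read`, `lut_f`, and `lut_g` are hypothetical and are not taken from the slide's figure.

```c
#include <assert.h>
#include <stdint.h>

/* A 4x2 memory: 4 words (one per value of the address bits a, b),
 * 2 data bits per word. Bit 1 of a word is output F, bit 0 is output G.
 * Hypothetical contents: F = a XOR b, G = a AND b (a half adder). */
static const uint8_t lut_half_adder[4] = {
    0x0, /* a=0 b=0 -> F=0 G=0 */
    0x2, /* a=0 b=1 -> F=1 G=0 */
    0x2, /* a=1 b=0 -> F=1 G=0 */
    0x1, /* a=1 b=1 -> F=0 G=1 */
};

/* "Evaluating" the circuit is a memory read: the inputs form the address
 * and the stored word is the output. */
static uint8_t lut_read(const uint8_t lut[4], unsigned a, unsigned b)
{
    assert(a <= 1u && b <= 1u);
    return lut[(a << 1) | b];
}

static unsigned lut_f(const uint8_t lut[4], unsigned a, unsigned b)
{
    return (lut_read(lut, a, b) >> 1) & 1u;
}

static unsigned lut_g(const uint8_t lut[4], unsigned a, unsigned b)
{
    return lut_read(lut, a, b) & 1u;
}
```

Changing the four stored words changes the implemented function without changing any "hardware", which is the sense in which downloading bits programs the circuit.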
FPGAs: "Programmable" like Microprocessors -- Download Bits
• Microprocessor binaries: bits loaded into program memory
• FPGA binaries: bits loaded into LUTs, CLBs, and SMs
• Configurable logic block (CLB) -- LUT plus flip-flops
• SM -- Switch Matrix
[Figure: a microprocessor binary (01110100...) loaded into a processor's program memory, beside an FPGA binary loaded into an FPGA's CLBs and SMs]
FPGAs as Coprocessors
• Coprocessor -- Accelerates application kernel by implementing it as a circuit
• ASIC coprocessor known to speed up many application kernels
  • Energy advantages too (e.g., Henkel’98, Rabaey’98, Stitt/Vahid’04)
• FPGA coprocessor also gives speedup/energy benefits (Stitt/Vahid IEEE D&T’02, IEEE TECS’04)
  • Con: more silicon (~20x), ~4x performance overhead (Rose FPGA'06)
  • Pro: platform fully programmable
    • Shorter time-to-market, smaller non-recurring engineering (NRE) cost, low cost devices available, late changes (even in-product)
[Figure: application processor paired with an ASIC, and application processor paired with an FPGA]
FPGAs as Coprocessors Surprisingly Competitive with ASIC
• FPGA 34% energy savings versus ASIC’s 48% (Stitt/Vahid IEEE D&T’02, IEEE TECS’04)
• A jet isn’t as fast as a rocket, but it sure beats driving
FPGA – Why (Sometimes) Better than Microprocessor
C Code for Bit Reversal
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Binary (after compilation)
sll $v1[3],$v0[2],0x10
srl $v0[2],$v0[2],0x10
or $v0[2],$v1[3],$v0[2]
srl $v1[3],$v0[2],0x8
and $v1[3],$v1[3],$t5[13]
sll $v0[2],$v0[2],0x8
and $v0[2],$v0[2],$t4[12]
or $v0[2],$v1[3],$v0[2]
srl $v1[3],$v0[2],0x4
and $v1[3],$v1[3],$t3[11]
sll $v0[2],$v0[2],0x4
and $v0[2],$v0[2],$t2[10]
...
• Processor: requires between 32 and 128 cycles
• FPGA (hardware for bit reversal): requires only 1 cycle (speedup of 32x to 128x)
• In general, because of concurrency, from bit-level to task level
[Figure: hardware for bit reversal, wiring each bit of the original X value to the mirrored position of the bit-reversed X value]
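For readers who want to run the slide's swap sequence, a minimal sketch follows. It wraps the five statements in a function on an unsigned 32-bit value and adds a one-bit-per-iteration reference for comparison; the function names are hypothetical, and the choice of `uint32_t` is an assumption (the slide does not state the type of x).

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word by swapping progressively
 * smaller groups: halves, bytes, nibbles, bit pairs, single bits. */
static uint32_t bit_reverse32(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference version: move one bit per iteration (32 iterations). */
static uint32_t bit_reverse32_naive(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

In hardware the same function needs no logic at all, only wires from bit i to bit 31-i, which is why the slide counts a single cycle.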
FPGAs: Why (Sometimes) Better than Microprocessor
C Code for FIR Filter
for (i=0; i < 128; i++) y[i] += c[i] * x[i];
.. .. ..
• Processor: 1000’s of instructions, several thousand cycles
• FPGA (hardware for FIR filter): ~ 7 cycles, speedup > 100x
[Figure: hardware for FIR filter, a row of multipliers feeding a tree of adders]
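One plausible reading of the ~7-cycle figure is that the 128 products are computed in parallel and then summed by a balanced adder tree with log2(128) = 7 levels. The sketch below models that structure sequentially in C. It is an assumption-laden illustration: it treats the filter as a single sum of products (the slide's loop is written element-wise), and `fir_tree` and `TAPS` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 128

/* Sum of products organized the way a parallel circuit could compute it:
 * all multiplies are independent, then values are added pairwise,
 * level by level. Returns the sum; *levels receives the adder-tree depth. */
static long fir_tree(const int c[TAPS], const int x[TAPS], int *levels)
{
    long p[TAPS];
    int depth = 0;

    assert(levels != NULL);

    /* Multiplier row: no iteration depends on another. */
    for (size_t i = 0; i < TAPS; i++)
        p[i] = (long)c[i] * x[i];

    /* Adder tree: each pass halves the number of live values. */
    for (size_t n = TAPS; n > 1; n /= 2) {
        for (size_t i = 0; i < n / 2; i++)
            p[i] = p[2 * i] + p[2 * i + 1];
        depth++;
    }

    *levels = depth;
    return p[0];
}
```

On a processor every multiply and add is a separate instruction; in a circuit each tree level can complete in about one step, so the depth rather than the operation count sets the latency.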
FPGAs are Hard to Program
• Synthesis from hardware description languages (HDLs)
  • VHDL, Verilog
  • Great for parallelism
  • But non-standard languages, manual partitioning
  • SystemC a good step
• C/C++ partitioning compilers
  • Use language subset
  • Growing in importance
  • But special compiler limits adoption
• 100 software writers for every CAD user
• Only about 15,000 CAD seats worldwide; millions of compiler seats
[Figure: tool flow with a special compiler and profiling producing a microprocessor binary and a netlist; producing the FPGA binary from the netlist includes synthesis, tech. map, place & route]
Binary-Level Partitioning Helps
• Binary-level partitioning
  • Stitt/Vahid, ICCAD’02
  • Recent commercial product: Critical Blue [www.criticalblue.com]
  • Partition and synthesize starting from SW binary
• Advantages
  • Any compiler, any language, multiple sources, assembly/object support, legacy code support
  • Better incorporation into toolflow
    • Less disruptive, back-end tool
• Disadvantage
  • Quality loss due to lack of high-level language constructs? (More later)
[Figure: tool flow in which a standard compiler produces the SW binary; profiling and a binary-level partitioner then produce a modified binary for the processor and a netlist for the FPGA. Traditional partitioning is done at the compiler stage. Producing the FPGA binary includes synthesis, tech. map, place & route]
Warp Processing
• Observation: Dynamic binary recompilation to a different microprocessor architecture is a mature commercial technology
  • e.g., Modern Pentiums translate x86 to VLIW
• Question: If we can recompile binaries to FPGA circuits, can we dynamically recompile binaries to FPGA circuits?
Warp Processing Idea
Software Binary
Mov reg3, 0
Mov reg4, 0
loop: Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
1 Initially, software binary loaded into instruction memory
2 Microprocessor executes instructions in software binary
3 Profiler monitors instructions and detects critical regions in binary (here the repeated add/beq pattern of the loop: critical loop detected)
4 On-chip CAD reads in critical region
5 On-chip CAD decompiles critical region into control data flow graph (CDFG)
reg3 := 0
reg4 := 0
loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
6 On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
7 On-chip CAD maps circuit onto FPGA
8 On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more
Mov reg3, 0
Mov reg4, 0
loop: // instructions that interact with FPGA
Ret reg4
[Figure: platform with µP, instruction memory (I Mem), data cache (D$), profiler, FPGA, and on-chip CAD, also labeled Dynamic Part. Module (DPM); the synthesized circuit is a tree of adders mapped onto CLBs and SMs; chart comparing software-only and “warped” FPGA execution]
Warp Processing Idea
• Likely multiple microprocessors per chip, serviced by one on-chip CAD block
[Figure: several µPs sharing one profiler, I Mem, D$, FPGA, and on-chip CAD block]
Warp Processing: Trend Towards Processor/FPGA Programmable Platforms
• FPGAs with hard core processors
• FPGAs with soft core processors
• Computer boards with FPGAs
[Figures: Xilinx Virtex II Pro (source: Xilinx); Altera Excalibur (source: Altera); Xilinx Spartan (source: Xilinx); Cray XD1 (source: FPGA journal, Apr’05)]
Warp Processing: Trend Towards Processor/FPGA Programmable Platforms
• Programming a key challenge
  • Soln 1: Compile high-level language to custom binaries using both microprocessor and FPGA
  • Soln 2: Use standard microprocessor binaries, dynamically re-compile (warp)
• Cons:
  • Less high-level information when compiling, less optimization
• Pros:
  • Available to all software developers, not just specialists
  • Data dependent optimization
  • Most importantly, standard binaries enable “ecosystem” among tools, architecture, and applications
• Standard binary (and ecosystem) concept presently absent in FPGAs and other new programmable platforms
[Figure: a standard compiler produces the SW binary, with traditional partitioning done at the compiler stage; standard binaries sit between applications, tools (profiling, CAD tools), and architectures (processor plus FPGA)]
Warp Processing Steps (On-Chip CAD)
• Profiling & partitioning
• Decompilation
• Synthesis
• JIT FPGA compilation: technology mapping, placement, and routing
• Binary updater
[Figure: on-chip CAD flow from the software binary through these steps, with an intermediate Std. HW Binary, producing an FPGA binary and an updated microprocessor binary; platform with profiler, µP, I$, D$, FPGA, and on-chip CAD]
Warp Processing – Profiling and Partitioning
• Applications spend much time in small amount of code
  • 90-10 rule
  • Observed 75-4 rule for MediaBench, NetBench
• Developed efficient hardware profiler
  • Gordon-Ross/Vahid, CASES'04, IEEE Trans. on Comp 06
• Partitioning straightforward
  • Try most critical code first
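A common way to find critical loops is to count taken backward branches, since each one marks a loop iteration. The sketch below shows that idea with a small direct-mapped table. It is a simplified illustration and not the Gordon-Ross/Vahid profiler design; the table size, the indexing scheme, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 16 /* hypothetical table size */

typedef struct { uint32_t target; uint32_t count; } LoopSlot;
typedef struct { LoopSlot slot[SLOTS]; } LoopProfiler;

static void profiler_init(LoopProfiler *p)
{
    assert(p != NULL);
    for (int i = 0; i < SLOTS; i++) {
        p->slot[i].target = 0;
        p->slot[i].count = 0;
    }
}

/* Called for every taken branch. A branch whose target is below its own
 * address is treated as the closing branch of a loop. */
static void profiler_observe(LoopProfiler *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target >= pc)
        return; /* forward branch: not a loop iteration */

    /* Direct-mapped lookup by word address of the loop head. */
    LoopSlot *s = &p->slot[(target >> 2) % SLOTS];
    if (s->target != target) { /* a different loop maps here: evict it */
        s->target = target;
        s->count = 0;
    }
    s->count++;
}

/* Loop head with the highest iteration count, or 0 if none was seen. */
static uint32_t profiler_hottest(const LoopProfiler *p)
{
    uint32_t best_target = 0, best_count = 0;
    assert(p != NULL);
    for (int i = 0; i < SLOTS; i++) {
        if (p->slot[i].count > best_count) {
            best_count = p->slot[i].count;
            best_target = p->slot[i].target;
        }
    }
    return best_target;
}
```

With counts like these, "try most critical code first" amounts to handing the hottest loop head to the partitioner before any other region.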
Warp Processing – Decompilation
• Synthesis from binary has a key challenge
  • High-level information (e.g., loops, arrays) lost during compilation
  • Direct translation of assembly to circuit: huge overheads
  • Need to recover high-level information
[Figure: Overhead of microprocessor/FPGA solution WITHOUT decompilation, vs. microprocessor alone]
Warp Processing – Decompilation
• Solution: recover high-level information from binary (decompilation)
  • Extensive previous work (for different purposes)
  • Adapted
  • Developed new decompilation methods also
Original C Code
long f( short a[10] ) { long accum = 0; for (int i=0; i < 10; i++) { accum += a[i]; } return accum; }
Corresponding Assembly
Mov reg3, 0
Mov reg4, 0
loop: Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
Control/Data Flow Graph Creation
reg3 := 0
reg4 := 0
loop: reg1 := reg3 << 1
reg5 := reg2 + reg1
reg6 := mem[reg5 + 0]
reg4 := reg4 + reg6
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
Data Flow Analysis
reg3 := 0
reg4 := 0
loop: reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
Function Recovery
long f( long reg2 ) { int reg3 = 0; int reg4 = 0; loop: reg4 = reg4 + mem[reg2 + (reg3 << 1)]; reg3 = reg3 + 1; if (reg3 < 10) goto loop; return reg4; }
Control Structure Recovery
long f( long reg2 ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += mem[reg2 + (reg3 << 1)]; } return reg4; }
Array Recovery
long f( short array[10] ) { long reg4 = 0; for (long reg3 = 0; reg3 < 10; reg3++) { reg4 += array[reg3]; } return reg4; }
• Recovered code and original C code: almost identical representations
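The claim that the recovered forms are equivalent can be checked with a small sketch. It assumes a byte-addressed memory and a 16-bit load for `Ld` (suggested by the shift by 1, but not stated on the slide); `f_regs` mirrors the function-recovery form and `f_array` the array-recovery form. Both names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Form after function recovery: registers, raw memory, and a goto.
 * mem is a byte-addressed memory image; reg2 is the array's byte address. */
static long f_regs(const unsigned char *mem, long reg2)
{
    long reg3 = 0;
    long reg4 = 0;
    assert(mem != NULL);
loop:
    {
        int16_t v; /* assumed 16-bit element, loaded from reg2 + 2*reg3 */
        memcpy(&v, mem + reg2 + (reg3 << 1), sizeof v);
        reg4 = reg4 + v;
    }
    reg3 = reg3 + 1;
    if (reg3 < 10) goto loop;
    return reg4;
}

/* Form after control structure and array recovery. */
static long f_array(const int16_t array[10])
{
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++)
        reg4 += array[reg3];
    return reg4;
}
```

The second form exposes the loop bound and the array access pattern directly, which is the information a synthesis tool needs in order to unroll or pipeline the loop.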
New Decompilation Method: Loop Rerolling
• Problem: Compiler unrolling of loops (to expose parallelism) causes synthesis problems:
  • Huge input (slow), can’t unroll to desired amount, can’t use advanced loop methods (loop pipelining, fusion, splitting, ...)
• Solution: New decompilation method: Loop Rerolling
  • Identify unrolled iterations, compact into one iteration
Original loop
for (int i=0; i < 3; i++) accum += a[i];
After loop unrolling (binary)
Ld reg2, 100(0)
Add reg1, reg1, reg2
Ld reg2, 100(1)
Add reg1, reg1, reg2
Ld reg2, 100(2)
Add reg1, reg1, reg2
After loop rerolling
for (int i=0; i<3;i++) reg1 += array[i];
Loop Rerolling: Identify Unrolled Iterations
• Find consecutively repeating instruction sequences
Original C Code
x = x + 1; for (i=0; i < 2; i++) a[i]=b[i]+1; y=x;
Unrolled Loop
x = x + 1; a[0] = b[0]+1; a[1] = b[1]+1; y = x;
Binary, mapped to string
Add r3, r3, 1 => B
Ld r0, b(0) => A
Add r1, r0, 1 => B
St a(0), r1 => C
Ld r0, b(1) => A
Add r1, r0, 1 => B
St a(1), r1 => C
Mov r4, r3 => D
String Representation: BABCABCD
• Build suffix tree of the string (derived from bioinformatics techniques)
• Find consecutive repeating substrings: adjacent nodes with same substring
• Result: 2 unrolled iterations, each iteration = abc (Ld, Add, St)
[Figure: suffix tree of the string, with branches such as abcabcd, abcd, bc, c, d]
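The search for consecutively repeating substrings can be illustrated without a suffix tree. The sketch below uses a brute-force scan instead, which is a deliberately simpler technique than the one on the slide and is only practical for short strings; `Repeat` and `find_unrolled` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A run of `reps` back-to-back copies of a `len`-character substring
 * beginning at index `start`. reps == 0 means nothing was found. */
typedef struct { size_t start, len, reps; } Repeat;

/* Brute-force search for the consecutive repeat that covers the most
 * characters (len * reps), requiring at least two copies. */
static Repeat find_unrolled(const char *s)
{
    Repeat best = {0, 0, 0};
    size_t n;

    assert(s != NULL);
    n = strlen(s);

    for (size_t start = 0; start < n; start++) {
        for (size_t len = 1; start + 2 * len <= n; len++) {
            size_t reps = 1;
            /* Extend while the next block equals the first block. */
            while (start + (reps + 1) * len <= n &&
                   memcmp(s + start, s + start + reps * len, len) == 0)
                reps++;
            if (reps >= 2 && len * reps > best.len * best.reps) {
                best.start = start;
                best.len = len;
                best.reps = reps;
            }
        }
    }
    return best;
}
```

For the slide's string BABCABCD this reports a 3-character body repeated twice starting at index 1, matching the two unrolled (Ld, Add, St) iterations. A real rerolling pass would still need to check that the operands differ only by the loop index.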
Warp Processing – Decompilation
• Study (FPGA 2005)
  • Synthesis after decompilation often quite similar
  • Almost identical performance, small area overhead
2. Deriving high-level constructs from binaries
• Recent study of decompilation robustness
  • In presence of compiler optimizations, and instruction sets
• Energy savings of 77%/76%/87% for MIPS/ARM/Microblaze
• Publications: ICCAD’05, DATE’04
Decompilation is Effective Even with High Compiler-Optimization Levels
• Speedups similar on MIPS for –O1 and –O3 optimizations
• Speedups similar on ARM for –O1 and –O3 optimizations
• Speedups similar between ARM and MIPS
  • Complex instructions of ARM didn’t hurt synthesis
• MicroBlaze speedups much larger
  • MicroBlaze is a slower microprocessor
  • -O3 optimizations were very beneficial to hardware
[Figure: Average Speedup of 10 Examples]
Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Decompilation Effectiveness In-Depth Study
• Performed in-depth study with Freescale
  • H.264 video decoder
  • Highly-optimized proprietary code, not reference code
    • Huge difference
• Research question: Is synthesis from binaries competitive on highly-optimized code?
  • Several-month study
[Figure: MPEG 2 versus H.264. H.264: Better quality, or smaller files, using more computation]
Optimized H.264
• Larger than most benchmarks
  • H.264: 16,000 lines
  • Previous work: 100 to several thousand lines
• Highly-optimized
  • H.264: Many man-hours of manual optimization
  • 10x faster than reference code used in previous works
• Different profiling results
  • Previous examples: ~90% time in several loops
  • H.264: ~90% time in ~45 functions
  • Harder to speed up
C vs. Binary Synthesis on Opt. H.264
• Binary partitioning competitive with source partitioning
  • Speedups compared to ARM9 software
  • Binary: 2.48, C: 2.53
• Decompilation recovered nearly all high-level information needed for partitioning and synthesis
Warp Processing – Synthesis
• ROCM - Riverside On-Chip Minimizer
  • Standard register-transfer synthesis
  • Logic synthesis: make it lean
    • Combination of approaches from Espresso-II [Brayton, et al., 1984][Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979]
    • Cost/benefit analysis of operations
• Result
  • Single expand phase instead of multiple iterations
  • Eliminate need to compute off-set, which reduces memory usage
  • On average only 2% larger than optimal solution
[Figure: minimization phases Expand, Reduce, Irredundant operating on the on-set, dc-set, and off-set]
Warp Processing – JIT FPGA Compilation
• Hard: Routing is extremely compute/memory intensive
• Solution: Jointly design CAD and FPGA architecture
  • Cost/benefit analysis
  • Highly iterative process
Warp-Targeted FPGA Architecture (DATE’04)
• CAD-specialized configurable logic fabric
• Simplified switch matrices
  • Directly connected to adjacent CLB
  • All nets are routed using only a single pair of channels
  • Allows for efficient routing
  • Routing is by far the most time-consuming on-chip CAD task
• Simplified CLBs
  • Two 3 input, 2 output LUTs
  • Each CLB connected to adjacent CLB to simplify routing of carry chains
• Currently being prototyped by Intel (scheduled for 2006 Q3 shuttle)
[Figure: switch matrix with channels 0-3 and long channels 0L-3L; CLB with two LUTs, inputs a-f, outputs o1-o4, and connections to adjacent CLBs]
Warp Processing – Technology Mapping
• JIT FPGA compilation stages: Logic Synthesis, Tech. Mapping/Packing, Placement, Routing
• ROCTM - Technology Mapping/Packing
  • Decompose hardware circuit into DAG
    • Nodes correspond to basic 2-input logic gates (AND, OR, XOR, etc.)
  • Hierarchical bottom-up graph clustering algorithm
    • Breadth-first traversal combining nodes to form single-output LUTs
    • Combine LUTs with common inputs to form final 2-output LUTs
    • Pack LUTs in which output from one LUT is input to second LUT
Publications: Dynamic Hardware/Software Partitioning: A First Approach, DAC’03. A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE’04.
Warp Processing – Placement
• ROCPLACE - Placement
  • Dependency-based positional placement algorithm
    • Identify critical path, placing critical nodes in center of configurable logic fabric (CLF)
    • Use dependencies between remaining CLBs to determine placement
    • Attempt to use adjacent CLB routing whenever possible
[Figure: grid of CLBs with the critical path placed at the center]
Publications: Dynamic Hardware/Software Partitioning: A First Approach, DAC’03. A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE’04.
Warp Processing – Routing
• ROCR - Riverside On-Chip Router
  • Requires much less memory than VPR as resource graph is smaller
  • 10x faster execution time than VPR (Timing driven)
  • Produces circuits with critical path 10% shorter than VPR (Routability driven)
Publication: Dynamic FPGA Routing for Just-in-Time FPGA Compilation, DAC’04.
Experiments with Warp Processing
• Warp Processor
  • ARM/MIPS plus our fabric
  • Riverside on-chip CAD tools to map critical region to configurable fabric
  • Requires less than 2 seconds on lean embedded processor to perform synthesis and JIT FPGA compilation
• Traditional HW/SW Partitioning
  • ARM/MIPS plus Xilinx Virtex-E FPGA
  • Manually partitioned software using VHDL
  • VHDL synthesized using Xilinx ISE 4.1
[Figure: warp processor (profiler, ARM, I$, D$, FPGA, on-chip CAD) beside the traditional platform (ARM, I$, D$, Xilinx Virtex-E FPGA)]
Warp Processors: Performance Speedup (Most Frequent Kernel Only)
• Average kernel speedup of 41, vs. 21 for Virtex-E
• WCLA (warp configurable logic architecture) simplicity results in faster HW circuits
[Figure: per-benchmark kernel speedup relative to SW-only execution]
Warp Processors: Performance Speedup (Overall, Multiple Kernels)
• Average speedup of 7.4
• Energy reduction of 38% - 94%
• Assuming 100 MHz ARM, and fabric clocked at rate determined by synthesis
[Figure: per-benchmark overall speedup relative to SW-only execution]
Warp Processors - Results: Execution Time and Memory Requirements
• Xilinx ISE: 9.1 s, 60 MB
• DPM (CAD): 0.2 s, 3.6 MB
• DPM (CAD) on 75 MHz ARM7: 1.4 s, 3.6 MB
Direction: Coding Guidelines for Partitioning?
• In-depth H.264 study led to a question: Why aren’t speedups (from binary or C) closer to “ideal” (0-time per fct)?
• We thus examined dozens of benchmarks in more detail
• Are there simple coding guidelines that result in better speedups when kernels are synthesized to circuits?
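The “ideal” bound mentioned above (accelerated functions taking zero time) follows from Amdahl's law. The sketch below is a generic illustration, not a calculation from the study; the function names are hypothetical, and any fraction passed in (for example 0.9, echoing the earlier "~90% time in ~45 functions") is an input assumption rather than a measured result.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction f of the original run time
 * (0 <= f <= 1) is accelerated by a factor s (> 0) and the rest is unchanged. */
static double overall_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* "Ideal" case: the accelerated part takes zero time, so only the
 * remaining (1 - f) of the run time is left. Requires f < 1. */
static double ideal_speedup(double f)
{
    assert(f >= 0.0 && f < 1.0);
    return 1.0 / (1.0 - f);
}
```

If 90% of the time is spent in code that moves to hardware, the ideal overall speedup is about 10x no matter how fast the circuits are, so measured speedups well below that bound point to overheads inside the partitioned kernels.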
Synthesis-Oriented Coding Guidelines
• Pass by value-return
  • Declare a local array and copy in all data needed by a function (makes lack of aliases explicit)
• Function specialization
  • Create function version having frequent parameter-values as constants
Original
void f(int width, int height ) { . . . . for (i=0; i < width; i++) for (j=0; j < height; j++) . . . . . . }
Rewritten
void f_4_4() { . . . . for (i=0; i < 4; i++) for (j=0; j < 4; j++) . . . . . . }
• Bounds are explicit so loops are now unrollable
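Both guidelines can be combined in one small sketch. The kernel (summing a block of an image), the `stride` parameter, and the names `sum_block`, `sum_block_4_4`, and `sum_block_dispatch` are hypothetical and chosen only for illustration; the slide's function bodies are elided and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* General version: loop bounds are run-time parameters, so a synthesis
 * tool cannot know how far the loops may be unrolled. */
static long sum_block(const int *p, int stride, int width, int height)
{
    long sum = 0;
    assert(p != NULL && width >= 0 && height >= 0);
    for (int i = 0; i < width; i++)
        for (int j = 0; j < height; j++)
            sum += p[j * stride + i];
    return sum;
}

/* Specialized version for the frequent 4x4 case.
 * Pass by value-return: the data is first copied into a local array, which
 * makes it explicit that nothing else aliases it.
 * Function specialization: the bounds are constants, so the loops can be
 * fully unrolled. */
static long sum_block_4_4(const int *p, int stride)
{
    int local[4][4];
    long sum = 0;
    assert(p != NULL);
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            local[j][i] = p[j * stride + i];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            sum += local[j][i];
    return sum;
}

/* Call the specialized version when the frequent parameter values occur. */
static long sum_block_dispatch(const int *p, int stride, int width, int height)
{
    if (width == 4 && height == 4)
        return sum_block_4_4(p, stride);
    return sum_block(p, stride, width, height);
}
```

The dispatcher keeps the general function available for rare sizes, while the specialized one becomes the candidate for hardware.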