Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
Greg Stitt, Department of Electrical and Computer Engineering, University of Florida
Introduction
• Improved performance enables new applications
• Past decade: MP3 players, portable game consoles, cell phones, etc.
• Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.
Introduction
• FPGAs (Field Programmable Gate Arrays) implement custom circuits
• Speedups of 10x, 100x, even 1000x for scientific and embedded apps
• [Najjar 04][He, Lu, Sun 05][Levine, Schmit 03][Prasanna 06][Stitt, Vahid 05], …
• But FPGAs are not mainstream
• Warp Processing goal: bring FPGAs into the mainstream
• Make FPGAs “invisible”
[Figure: performance of uP vs. FPGA; FPGAs are capable of large performance improvements]
Introduction – Hardware/Software Partitioning
• Hardware/software partitioning selects performance-critical regions for hardware implementation [Ernst, Henkel 93] [Gupta, DeMicheli 97] [Vahid, Gajski 94] [Eles et al. 97] [Sangiovanni-Vincentelli 94]
• Designer creates custom hardware using a hardware description language (HDL)
C code for FIR filter:
    for (i=0; i < 16; i++) y[i] += c[i] * x[i]
    .. .. ..
    for (i=0; i < 128; i++) y[i] += c[i] * x[i]
    .. .. ..
• Compiled for the processor: ~1000 cycles
• Hardware for loop on the FPGA: ~10 cycles
• Speedup = 1000 cycles / 10 cycles = 100x
[Figure: hardware for the loop, drawn as a row of multipliers feeding adders, alongside the processor]
Introduction – High-level Synthesis
• Problem: Describing a circuit using HDL is time-consuming and difficult
• Solution: High-level synthesis
  • Create circuit from high-level code
  • [Gupta, DeMicheli 92][Camposano, Wolf 91][Rabaey 96][Gajski, Dutt 92]
• Allows developers to use higher-level specification
• Potentially enables synthesis for software developers
Example input:
    for (i=0; i < 16; i++) y[i] += c[i] * x[i]
[Figure: tool flow from high-level code through compiler, high-level synthesis, and hw/sw partitioning to software (linked with libraries/object code into an updated binary for the uP) and hardware (a bitstream for the FPGA); the example loop is drawn as multipliers feeding adders]
Outline • Introduction • Warp Processing Overview • Enabling Technology – Binary Synthesis • Key techniques for synthesis from binaries • Decompilation • Current and Future Directions • Multi-threaded Warp Processing • Custom Communication
Problems with High-Level Synthesis
• Problem: High-level synthesis is unattractive to software developers
• Requires specialized language
  • SystemC, NapaC, HandelC, …
• Requires specialized compiler
  • Spark, ROCCC, CatapultC, …
• Limited commercial success
  • Software developers reluctant to change tools
[Figure: non-standard software tool flow; a specialized language and specialized compiler with synthesis produce an updated binary for the uP and a bitstream for the FPGA]
Warp Processing – “Invisible” Synthesis
• Solution: Make synthesis “invisible”
• 2 Requirements
  • Standard software tool flow
    • Perform compilation before synthesis
  • Hide synthesis tool
    • Move synthesis on chip
    • Similar to dynamic binary translation [Transmeta]
    • But, translate to hw
• Warp processor looks like standard uP but invisibly synthesizes hardware
[Figure: standard software tool flow with compilation moved before synthesis. High-level code is compiled and linked with libraries/object code into a software binary; synthesis (with decompilation) produces an updated binary for the uP and a bitstream for the FPGA]
Warp Processing – “Invisible” Synthesis
• Advantages
  • Supports all languages, compilers, IDEs (e.g. C, C++, Java, Matlab; gcc, g++, javac, keil)
  • Supports synthesis of assembly code
  • Supports synthesis of library code
  • Also enables dynamic optimizations
• Warp processor looks like standard uP but invisibly synthesizes hardware
[Figure: same tool flow as the previous slide, annotated with example languages and compilers]
Warp Processing Background: Basic Idea
[Figure: warp processor containing a profiler, instruction memory (I Mem), µP, data cache (D$), FPGA, and on-chip CAD, also labelled Dynamic Part. Module (DPM)]
Software binary:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
1. Initially, software binary loaded into instruction memory
2. Microprocessor executes instructions in software binary
3. Profiler monitors instructions and detects critical regions in binary (the figure shows a repeating add/beq stream and the callout "Critical Loop Detected")
4. On-chip CAD reads in critical region
5. On-chip CAD converts critical region into control data flow graph (CDFG):
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
6. On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit (drawn as a tree of adders)
7. On-chip CAD maps circuit onto FPGA (drawn as CLB and SM blocks)
8. On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
[Figure: bar comparison of software-only vs. “warped” FPGA execution]
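The profiling step (step 3 above) can be illustrated with a small software model. This is only a sketch under assumptions: the slides do not describe how the profiler works internally, so the backward-branch counting scheme, the table size, and the names `profiler_t`, `profiler_init`, `profiler_observe`, and `profiler_hottest` are all hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define PROFILER_SLOTS 16  /* assumed table size, not taken from the slides */

/* Hypothetical profiler state: one counter per observed loop head. */
typedef struct {
    uint32_t target[PROFILER_SLOTS]; /* loop-head address of each tracked loop */
    uint32_t count[PROFILER_SLOTS];  /* times the loop-closing branch was taken */
    int used;                        /* number of slots in use */
} profiler_t;

static void profiler_init(profiler_t *p) { memset(p, 0, sizeof *p); }

/* Observe one taken branch from address pc to address target.
   Only backward branches (target <= pc) are counted, since those close loops. */
static void profiler_observe(profiler_t *p, uint32_t pc, uint32_t target) {
    if (target > pc) return;                 /* forward branch: ignore */
    for (int i = 0; i < p->used; i++) {
        if (p->target[i] == target) { p->count[i]++; return; }
    }
    if (p->used < PROFILER_SLOTS) {          /* start tracking a new loop */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the loop-head address with the highest count (0 if none seen). */
static uint32_t profiler_hottest(const profiler_t *p) {
    uint32_t best = 0, best_count = 0;
    for (int i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best;
}
```

Feeding this model the taken branches of the example loop would make the address of `loop:` the hottest entry, which is the region the on-chip CAD would then read in step 4.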
• Expandable RAM – System detects RAM during start, improves performance invisibly
• Expandable Logic – Warp tools detect amount of FPGA, invisibly adapt application to use less/more hardware
[Figure: a uP with RAM, DMA, and cache (expandable RAM) next to a warp processor with profiler, µP, warp tools, and several FPGAs (expandable logic); performance axis]
Expandable Logic
• Allows for customization of platforms
• User can select FPGAs based on used applications
• User can customize FPGAs to the desired amount of performance
• Performance improvement is invisible – doesn’t require new binary from the developer
• Platform designer doesn’t have to decide on fixed amount of FPGA
• User doesn’t have to pay for FPGA that isn’t needed
[Figure: performance per application. Portable gaming: unacceptable performance until FPGAs are added. Web browser: acceptable performance with no FPGA]
Warp Processing Background: Basic Technology
• Challenge: CAD tools normally require powerful workstations
• Develop extremely efficient on-chip CAD tools
  • Requires efficient synthesis
  • Requires specialized FPGA, physical design tools (JIT FPGA compilation)
  • [Lysecky FCCM05/DAC04], University of Arizona
[Figure: on-chip CAD flow from binary to hardware and updated binary: synthesis, logic optimization, technology mapping, placement & routing (JIT FPGA compilation); system blocks are profiler, uP, I$, D$, FPGA, on-chip CAD]
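Warp Processing Background: On-Chip CAD
• Flow stages: RT synthesis, logic optimization, technology mapping, placement, routing
• Xilinx ISE: 9.1 s, 60 MB
• On-chip CAD: 0.2 s, 3.6 MB
• 46x improvement, 30% perf. penalty
• On a 75 MHz ARM7: only 1.4 s
[Figure: CAD flow diagram; part of the flow is labelled "Manually performed"]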
Warp Processing: Initial Results - Embedded Applications • Average speedup of 6.3x • Achieved completely transparently • Also, energy savings of 66%
Outline • Introduction • Warp Processing Overview • Enabling Technology – Binary Synthesis • Key techniques for synthesis from binaries • Decompilation • Current and Future Directions • Multi-threaded Warp Processing • Custom Communication
Binary Synthesis
• Warp processors perform synthesis from software binary – “binary synthesis”
• Problem: No high-level information
  • Synthesis needs high-level constructs
  • > 10x slowdown
• Can we recover high-level information for synthesis?
  • Make binary synthesis (and Warp processing) competitive with high-level synthesis
Source code:
    for (i=0; i < 128; i++) y[i] += c[i] * x[i]
    .. ..
Binary produced by the compiler:
    Addi r1, r0, 0
    Ld r3, 256(r1)
    Ld r4, 512(r1)
    Subi r2, r1, 128
    Jnz r2, -5
• The binary has no high-level constructs (arrays, loops, etc.)
• Hardware can be > 10x to 100x slower
[Figure: compiler output feeding binary synthesis, targeting processor and FPGA]
Decompilation • We realized decompilation recovers high-level information • But, generally used for binary translation or source-code recovery • May not be suitable for synthesis • We studied existing approaches • [Cifuentes 94, 99, 01][Mycroft 99,01] • DisC, dcc, Boomerang, Mocha, SourceAgain • Determined relevant techniques • Adapted existing techniques for synthesis
Decompilation – Control/Data Flow Graph Recovery
• Recovery of control/data flow graph (CDFG)
  • Format used by synthesis
• Difficult because of indirect jumps
  • Cannot statically analyze control flow
  • But, heuristics are over 99% successful on standard benchmarks
  • [Cifuentes 99, 00]
Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
After control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Decompilation – Data Flow Analysis
• Original purpose - remove temporary registers
• Area overhead – 130%
• Need new techniques for binary synthesis
After data flow analysis (same original C code and assembly as the running example f):
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
Decompilation – Data Flow Analysis
• Strength Reduction – Compare-with-zero instructions
    Sub reg3, reg4, reg5
    Bz reg3, -5
  • Unoptimized DFG: reg3 = reg4 Sub reg5, then reg3 = 0 decides the branch; the Sub is not needed and wastes area
  • Optimized DFG: reg4 = reg5 decides the branch directly
• Operator Size Reduction
    Lb reg4, 0(reg1)
    Mvi reg5, 16
    Add reg3, reg4, reg5
  • Unoptimized DFG: 32-bit reg4 + 32-bit reg5 gives 32-bit reg3
  • Optimized DFG: 8-bit reg4 (load byte) + 5-bit reg5 (constant 16) gives 8-bit reg3; only 8-bit adder needed
• Area overhead reduced to 10%
Decompilation – Function Recovery
• Recover parameters and return values
• Def-use analysis of prologue/epilogue
• 100% success rate
After function recovery (same original C code and assembly as the running example f):
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
      loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }
Decompilation – Control Structure Recovery
• Recover loops, if statements
• Uses interval analysis techniques
• [Cifuentes 94]
• 100% success rate
After control structure recovery (same original C code and assembly as the running example f):
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }
Decompilation – Array Recovery
• Detect linear memory patterns and row-major ordering calculations
• ~ 95% success rate
• [Stitt, Guo, Najjar, Vahid 05]
• [Cifuentes 00]
After array recovery (same original C code and assembly as the running example f):
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
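To make the array-recovery step concrete, the sketch below writes both forms as ordinary C and lets them be compared on the same data. It is an illustration only, not the tool's implementation: the names `sum_raw` and `sum_array` are hypothetical, memory is modelled as a byte buffer, and a 2-byte `short` is assumed so that the `<< 1` stride matches the element size.

```c
#include <stddef.h>
#include <string.h>

/* Form before array recovery: raw byte addressing, mirroring the decompiled
   expression mem[reg2 + (reg3 << 1)]. Assumes sizeof(short) == 2. */
static long sum_raw(const unsigned char *mem, size_t reg2) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        short v;
        /* memcpy avoids unaligned or aliasing-violating loads */
        memcpy(&v, mem + reg2 + ((size_t)reg3 << 1), sizeof v);
        reg4 += v;
    }
    return reg4;
}

/* Form after array recovery: the linear access pattern (base + 2 * index)
   has been recognized as indexing an array of short. */
static long sum_array(const short array[10]) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
    }
    return reg4;
}
```

Copying a 10-element `short` array into the byte buffer at some offset and passing that offset as `reg2` should give the same sum from both functions, which is the equivalence the recovery step relies on.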
Comparison of Decompiled Code and Original Code
• Decompiled code almost identical to original code
• Only difference is variable names
• Binary synthesis is competitive with high-level synthesis
Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Decompiled code:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
Binary Synthesis Tool Flow
• Initially, high-level source is compiled and linked (with libraries/object code) to form a binary
• Decompilation recovers high-level information needed for synthesis
• Binary updater modifies binary to use synthesized hardware
• Stages shown in the figure: compiler, profiling, decompilation, hw/sw estimation, hw/sw partitioning, synthesis, binary updater
• Outputs: updated binary (software, for the uP) and hardware netlists / bitstream (hardware, for the FPGA)
• ~30,000 lines of C code
Binary Synthesis is Competitive with High-Level Synthesis
• Binary speedup: 8x, High-level speedup: 8.2x
• High-level synthesis only 2.5% better
• Commercial products beginning to appear
  • Critical Blue, Binachip
[Figure: speedup chart; small difference in speedup]
Binary Synthesis with Software Compiler Optimizations
• But, binaries were generated with few optimizations
• Optimizations for software may hurt hardware
  • Binary is optimized for software
  • Hardware synthesized from optimized binary may be inefficient
• Need new decompilation techniques
[Figure: C code goes through the SW compiler to an optimized binary, then through binary synthesis to uP and FPGA]
Loop Rerolling
• Problem: Loop unrolling may cause inefficient hardware
• Longer synthesis times
  • Super-linear heuristics
  • Unrolling 100 times => synthesis time is 100^2 times longer
• Larger area requirements
• Unrolling by compiler unlikely to match unrolling by synthesis
• Loop structure needed for advanced synthesis techniques
• Solution: We introduce loop rerolling to undo loop unrolling
Unrolled loop (the loop body below appears four times in a row, followed by the branch):
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    ...
    Beq reg3, 10, -5
[Figure: synthesis execution times for the unrolled loop vs. the non-unrolled loop]
Loop Rerolling – Identifying Unrolled Loops
• Idea - Identify consecutively repeating instruction sequences
Original C code:
    x = x + 1;
    for (i=0; i < 2; i++)
      a[i]=b[i]+1;
    y=x;
Unrolled loop:
    x = x + 1;
    a[0] = b[0]+1;
    a[1] = b[1]+1;
    y = x;
Binary, with each instruction mapped to a character:
    Add r3, r3, 1  => B
    Ld r0, b(0)    => A
    Add r1, r0, 1  => B
    St a(0), r1    => C
    Ld r0, b(1)    => A
    Add r1, r0, 1  => B
    St a(1), r1    => C
    Mov r4, r3     => D
String representation: BABCABCD
• Build a suffix tree of the string [Ukkonen 95]
• Find consecutive repeating substrings: adjacent nodes with same substring
• Result: unrolled loop with 2 unrolled iterations; each iteration = abc (Ld, Add, St)
[Figure: suffix tree for the string, with edges such as abcabcd, abcd, c, and d]
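The identification step can be sketched in C on the string BABCABCD. Note that this sketch swaps in a naive scan for the suffix tree the slide cites [Ukkonen 95]; it finds the same kind of consecutive repeat but is far less efficient on long instruction streams. The names `repeat_t` and `find_unrolled` are hypothetical.

```c
#include <string.h>

/* Hypothetical result type: where the repeat starts, how many characters
   (instructions) one iteration spans, and how many iterations repeat. */
typedef struct { int start; int period; int reps; } repeat_t;

/* Naive scan (not a suffix tree): for every start position and period,
   count how many times the substring repeats back to back, and keep the
   repeat covering the most characters. Returns start == -1 if no
   substring repeats at least twice consecutively. */
static repeat_t find_unrolled(const char *s) {
    int n = (int)strlen(s);
    repeat_t best = { -1, 0, 0 };
    for (int i = 0; i < n; i++) {
        for (int p = 1; i + 2 * p <= n; p++) {
            int reps = 1;
            while (i + (reps + 1) * p <= n &&
                   memcmp(s + i, s + i + reps * p, (size_t)p) == 0) {
                reps++;
            }
            if (reps >= 2 && reps * p > best.reps * best.period) {
                best.start = i;
                best.period = p;
                best.reps = reps;
            }
        }
    }
    return best;
}
```

For the slide's string, `find_unrolled("BABCABCD")` reports a repeat starting at index 1 with period 3 and 2 repetitions, that is, the Ld/Add/St body unrolled twice.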
Loop Rerolling
Original C code:
    x = x + 1;
    for (i=0; i < 2; i++)
      a[i]=b[i]+1;
    y=x;
1) Unrolled loop identification
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
2) Determine relationship of constants
3) Replace constants with induction variable expression
    Add r3, r3, 1
    i=0
    loop:
    Ld r0, b(i)
    Add r1, r0, 1
    St a(i), r1
    Bne i, 2, loop
    Mov r4, r3
Rerolled, decompiled code:
    reg3 = reg3 + 1;
    for (i=0; i < 2; i++)
      array1[i]=array2[i]+1;
    reg4=reg3;
• Average speedup of 1.6x
Strength Promotion
• Problem: Strength reduction may cause inefficient hardware
• Identify strength-reduced subgraphs
• Replace with multiplication
• Synthesis reapplies strength reduction to get optimal DFG
• However, some of the strength reduction was beneficial
• Strength promotion lets synthesis decide on strength reduction, not software compiler
• Average speedup of 1.5x
[Figure: data flow graph computing A[i] from B[i], B[i+1], B[i+2], B[i+3] with shifts (<<) and adds; the strength-reduced subgraphs are replaced by multiplications with constants (values such as 10, 18, 34, and 66 appear in the figure), after which synthesis reapplies strength reduction]
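As a small worked illustration of the promotion step, the sketch below takes a shift-and-add subgraph and recovers the single multiplier constant it represents. The decomposition of 18 into shifts by 4 and 1, and the names `times18_reduced` and `promote_shifts`, are assumptions for illustration; the slides do not give the exact subgraphs.

```c
#include <stdint.h>

/* Strength-reduced form a software compiler might emit for 18 * b:
   two shifts and an add instead of a multiply. Unsigned arithmetic is
   used so the shifts are well defined for all inputs. */
static uint32_t times18_reduced(uint32_t b) {
    return (b << 4) + (b << 1);
}

/* Strength promotion, sketched: given the shift amounts applied to the same
   operand and then summed, recover the multiplier constant so the subgraph
   can be replaced by one multiplication. */
static uint32_t promote_shifts(const unsigned *shifts, int n) {
    uint32_t k = 0;
    for (int i = 0; i < n; i++) {
        k += (uint32_t)1 << shifts[i];
    }
    return k;
}
```

With shift amounts {4, 1} the promoted constant is 18, and `18 * b` equals `times18_reduced(b)` for any `b`; once the multiplication is explicit, the synthesis tool rather than the software compiler can choose how to implement it.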
Multiple ISA/Optimization Results
• What about aggressive software compiler optimizations?
  • May obscure binary, making decompilation impossible
• What about different instruction sets?
  • Side effects may degrade hardware performance
Results:
• Speedups similar on MIPS for –O1 and –O3 optimizations
• Speedups similar on ARM for –O1 and –O3 optimizations
• -O3 optimizations were very beneficial to hardware
• MicroBlaze speedups much larger
  • MicroBlaze is a slower microprocessor
• Speedups similar between ARM and MIPS
  • Complex instructions of ARM didn’t hurt synthesis
[Figure: speedup chart by ISA and optimization level]
High-level vs. Binary Synthesis: Proprietary H.264 Decoder
• High-level synthesis vs. binary synthesis
  • Collaboration with Freescale Semiconductor
• H.264 Decoder
  • MPEG-4 Part 10 Advanced Video Coding (AVC)
  • 3x smaller than MPEG-2
  • Better quality
[Figure: H.264 vs. MPEG2 image comparison]
High-level vs. Binary Synthesis: Proprietary H.264 Decoder
• Binary synthesis was competitive with high-level synthesis
  • High-level speedup – 6.56x
  • Binary speedup – 6.55x
Outline • Introduction • Warp Processing Overview • Enabling Technology – Binary Synthesis • Key techniques for synthesis from binaries • Decompilation • Current and Future Directions • Multi-Threaded Warp Processing • Custom Communication
Thread Warping - Overview
• Architectural trend – include more cores on chip
• Result – more multi-threaded applications
Example:
    Function a( )
      for (i=0; i < 10; i++)
        createThread( b );
• OS can only schedule 2 threads
• Remaining 8 threads placed in thread queue
• Warp tools create custom accelerators for b( )
• OS schedules 4 threads to custom accelerators
• 3x more thread parallelism
[Figure: multi-core chip with µPs, OS, profiler, warp tools, and warp FPGA; a( ) runs on a µP while b( ) threads run on a µP, on the FPGA accelerators, or wait in the thread queue]
Thread Warping - Overview
Example:
    Function a( )
      for (i=0; i < 10; i++)
        createThread( b );
• Profiler detects performance-critical loop in b( )
• Warp tools create larger/faster accelerators
• Potentially > 100x speedup
[Figure: same multi-core chip with µPs, OS, profiler, warp tools, and warp FPGA; the b( ) threads run on the accelerators]