Warp Processors
Frank Vahid (Task Leader)
Department of Computer Science and Engineering, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID: 1331.001, July 2005 – June 2008
Ph.D. students:
• Greg Stitt, Ph.D. expected June 2006
• Ann Gordon-Ross, Ph.D. expected June 2006
• David Sheldon, Ph.D. expected 2009
• Ryan Mannion, Ph.D. expected 2009
• Scott Sirowy, Ph.D. expected 2010
Industrial Liaisons:
• Brian W. Einloth, Motorola
• Serge Rutman, Dave Clark, Intel
• Jeff Welser, IBM
Task Description
• Warp processing background
  • Two seed SRC CSR grants (2002-2005) showed feasibility
  • Idea: Transparently move critical binary regions from microprocessor to FPGA, for 10x perf./energy gains or more
• Task: Mature warp technology
  • Years 1/2 (in progress)
    • Automatic high-level construct recovery from binaries
    • In-depth case studies (with Freescale)
      • Also discovered unanticipated problem, developed solution
    • Warp-tailored FPGA prototype (with Intel)
  • Years 2/3
    • Reduce memory bottleneck by using smart buffer
    • Investigate domain-specific-FPGA concepts (with Freescale)
    • Consider desktop/server domains (with IBM)
Warp Processing Background: Basic Idea, Step 1
Initially, software binary loaded into instruction memory
Software binary:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
[Figure: architecture diagram with Profiler, I Mem, µP, D$, FPGA, On-chip CAD]
Warp Processing Background: Basic Idea, Step 2
Microprocessor executes instructions in software binary
[Figure: architecture diagram with Profiler, I Mem, µP, D$, FPGA, On-chip CAD]
Warp Processing Background: Basic Idea, Step 3
Profiler monitors instructions and detects critical regions in binary
[Figure: Profiler observes the repeating add/beq instruction stream from the µP and flags "Critical Loop Detected"]
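The profiling step can be illustrated in software. The sketch below is only an assumption about how a frequent-loop detector might behave: it counts taken backward branches per loop head in a small table. The names `LoopProfiler`, `profiler_observe`, `profiler_hottest` and the table size are hypothetical, and the real profiler is non-intrusive on-chip hardware, not C code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_ENTRIES 16            /* hypothetical table size */

typedef struct {
    uint32_t target;              /* loop head: address the branch jumps back to */
    uint32_t count;               /* times the backward branch was taken */
    int      valid;
} LoopEntry;

typedef struct {
    LoopEntry entries[NUM_ENTRIES];
} LoopProfiler;

void profiler_init(LoopProfiler *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < NUM_ENTRIES; i++) {
        p->entries[i].target = 0;
        p->entries[i].count = 0;
        p->entries[i].valid = 0;
    }
}

/* Record one taken branch from address pc to address target. */
void profiler_observe(LoopProfiler *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target >= pc)
        return;                   /* forward branch: not a loop back edge */

    /* Known loop head: bump its counter. */
    for (size_t i = 0; i < NUM_ENTRIES; i++) {
        if (p->entries[i].valid && p->entries[i].target == target) {
            p->entries[i].count++;
            return;
        }
    }

    /* New loop head: take the first empty slot, else evict the coldest entry. */
    size_t victim = 0;
    for (size_t i = 0; i < NUM_ENTRIES; i++) {
        if (!p->entries[i].valid) { victim = i; break; }
        if (p->entries[i].count < p->entries[victim].count) victim = i;
    }
    p->entries[victim].target = target;
    p->entries[victim].count = 1;
    p->entries[victim].valid = 1;
}

/* Return the most frequently re-entered loop head; *count receives its tally. */
uint32_t profiler_hottest(const LoopProfiler *p, uint32_t *count)
{
    assert(p != NULL && count != NULL);
    uint32_t best = 0;
    *count = 0;
    for (size_t i = 0; i < NUM_ENTRIES; i++) {
        if (p->entries[i].valid && p->entries[i].count > *count) {
            best = p->entries[i].target;
            *count = p->entries[i].count;
        }
    }
    return best;
}
```

For a 10-iteration loop such as the example binary, the back edge is taken 9 times, so its loop head would be reported as the critical region once its count dominates the table.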
Warp Processing Background: Basic Idea, Step 4
On-chip CAD reads in critical region
[Figure: architecture diagram with Profiler, I Mem, µP, D$, FPGA, On-chip CAD]
Warp Processing Background: Basic Idea, Step 5
On-chip CAD decompiles critical region into control data flow graph (CDFG)
Decompiled region:
    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
[Figure: architecture diagram; the On-chip CAD block is labeled Dynamic Part. Module (DPM)]
Warp Processing Background: Basic Idea, Step 6
On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: the decompiled loop synthesized into a tree of adders (+); architecture diagram with Dynamic Part. Module (DPM)]
Warp Processing Background: Basic Idea, Step 7
On-chip CAD maps circuit onto FPGA
[Figure: the adder circuit mapped onto an FPGA fabric of CLBs and SMs; architecture diagram with Dynamic Part. Module (DPM)]
Warp Processing Background: Basic Idea, Step 8
On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to “warp” by an order of magnitude or more
Updated binary:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    // instructions that interact with FPGA
    Ret reg4
[Figure: architecture diagram with the circuit on the FPGA; chart comparing Software-only and “Warped”]
Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms
• FPGAs with hard core processors
• FPGAs with soft core processors
• Computer boards with FPGAs
[Figures: Xilinx Virtex II Pro (source: Xilinx); Altera Excalibur (source: Altera); Xilinx Spartan (source: Xilinx); Cray XD1 (source: FPGA Journal, Apr ’05)]
Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms
• Programming a key challenge
  • Soln 1: Compile high-level language to custom binaries
  • Soln 2: Use standard binaries, dynamically re-map (warp)
    • Cons:
      • Less high-level information, less optimization
    • Pros:
      • Available to all software developers, not just specialists
      • Data dependent optimization
      • Most importantly, standard binaries enable “ecosystem” among tools, architecture, and applications
• Standard binaries: most significant concept presently absent in FPGAs and other new programmable platforms
[Figure: Standard binaries linking Architectures, Applications, and Tools; Xilinx Virtex II Pro (source: Xilinx)]
Warp Processing Background: Basic Technology
• Warp processing
  • On-chip profiler
  • Warp-tuned FPGA
  • On-chip CAD, including Just-in-Time FPGA compilation
[Figure: architecture diagram (Profiler, uP, I$, D$, FPGA, On-chip CAD) and the on-chip CAD flow from Binary to Updated Binary plus HW, with stages Decompilation, Partitioning, Behav./RT Synthesis, Logic Synthesis, Technology Mapping, Placement & Routing, and Binary Updater; part of the flow is labeled JIT FPGA compilation]
Warp Processing Background: Initial Results
[Figure: on-chip CAD flow stages (Decomp., Partitioning, RT Syn., Log. Syn., Tech. Map, Place, Route), with some stages annotated "Manually performed"]
• Xilinx ISE: 9.1 s, 60 MB
• ROCCAD: 0.2 s, 3.6 MB
  • 46x improvement, 30% perf. penalty
  • On a 75 MHz ARM7: only 1.4 s
Warp Processing Background: Publications 2002-2005
• On-chip profiler
  • Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. ACM/IEEE Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2003.
  • Extended version of above in special issue “Best of CASES/MICRO” of IEEE Trans. on Comp., Oct 2005.
• Warp-tuned FPGA
  • A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe Conf. (DATE), Feb 2004.
• On-chip CAD, including Just-in-Time FPGA compilation
  • A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2005.
  • A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
  • Dynamic FPGA Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid, and S. Tan. Design Automation Conf. (DAC), June 2004.
  • A Codesigned On-Chip Logic Minimizer. R. Lysecky and F. Vahid. ISSS/CODES conf., Oct 2003.
  • Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
  • On-Chip Logic Minimization. R. Lysecky and F. Vahid. Design Automation Conf. (DAC), 2003.
  • The Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic. G. Stitt and F. Vahid. IEEE Design and Test of Computers, Nov./Dec. 2002.
  • Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), Nov. 2002.
• Related
  • A Self-Tuning Cache Architecture for Embedded Systems. C. Zhang, F. Vahid and R. Lysecky. ACM Transactions on Embedded Computing Systems (TECS), Vol. 3, Issue 2, May 2004.
  • Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
Automatic High-Level Construct Recovery from Binaries
• Challenge: Binary lacks high-level constructs (loops, arrays, ...)
• Decompilation can help recover
  • Extensive previous work (e.g., [Cifuentes 93, 94, 99])

Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i=0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }

Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
  loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4

Control/Data Flow Graph Creation:
    reg3 := 0
    reg4 := 0
  loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Data Flow Analysis:
    reg3 := 0
    reg4 := 0
  loop:
    reg4 := reg4 + mem[ reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4

Function Recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
    loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }

Control Structure Recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }

Array Recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }

The recovered code and the original C code are almost identical representations.
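To make the array-recovery step concrete, the sketch below writes the form before recovery (byte-address arithmetic with a shift by 1) and the form after recovery (indexing a 16-bit array) as two C functions that compute the same sum. It is only an illustration: the names `sum_mem` and `sum_array` are hypothetical, `mem` stands in for data memory, and `int16_t` is used in place of `short` to make the 2-byte element size explicit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Before array recovery: a 16-bit load from byte address reg2 + (reg3 << 1). */
long sum_mem(const uint8_t *mem, long reg2)
{
    assert(mem != NULL);
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        int16_t value;
        memcpy(&value, mem + reg2 + (reg3 << 1), sizeof value);
        reg4 += value;
    }
    return reg4;
}

/* After array recovery: the shift by 1 is recognized as indexing 2-byte elements. */
long sum_array(const int16_t array[10])
{
    assert(array != NULL);
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
    }
    return reg4;
}
```

Calling `sum_mem((const uint8_t *)a, 0)` and `sum_array(a)` on the same 10-element `int16_t` array gives the same result, which is the sense in which the two representations are equivalent.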
New Method: Loop Rerolling
• Problem: Compiler unrolling of loops (to expose parallelism) causes synthesis problems:
  • Huge input (slow), can’t unroll to desired amount, can’t use advanced loop methods (loop pipelining, fusion, splitting, ...)
• Solution: New decompilation method: Loop Rerolling
  • Identify unrolled iterations, compact into one iteration

Original C code:
    for (int i=0; i < 3; i++)
      accum += a[i];

After compiler loop unrolling:
    Ld reg2, 100(0)
    Add reg1, reg1, reg2
    Ld reg2, 100(1)
    Add reg1, reg1, reg2
    Ld reg2, 100(2)
    Add reg1, reg1, reg2

After loop rerolling:
    for (int i=0; i<3;i++)
      reg1 += array[i];
Loop Rerolling: Identify Unrolled Iterations
• Find consecutively repeating instruction sequences

Original C code:
    x = x + 1;
    for (i=0; i < 2; i++)
      a[i]=b[i]+1;
    y=x;

Unrolled loop:
    x = x + 1;
    a[0] = b[0]+1;
    a[1] = b[1]+1;
    y = x;

Binary, with each instruction mapped to a character:
    Add r3, r3, 1   => B
    Ld r0, b(0)     => A
    Add r1, r0, 1   => B
    St a(0), r1     => C
    Ld r0, b(1)     => A
    Add r1, r0, 1   => B
    St a(1), r1     => C
    Mov r4, r3      => D

String representation: BABCABCD

Suffix tree (derived from bioinformatics techniques)
[Figure: suffix tree of the string, with edges such as b, c, abc, d, abcabcd, abcd]
• Find consecutive repeating substrings: adjacent nodes with same substring
• Result: 2 unrolled iterations, each iteration = abc (Ld, Add, St)
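The search for consecutively repeating substrings can be sketched without a suffix tree. The C function below swaps in a brute-force scan, which is slower than the suffix-tree method on the slide but returns the same answer on small inputs. The `Repeat` struct and `find_unrolled` name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t start;    /* index where the first iteration begins */
    size_t length;   /* characters (instructions) per iteration; 0 if none found */
    size_t repeats;  /* number of back-to-back copies */
} Repeat;

/* Find the back-to-back repeated substring that covers the most characters. */
Repeat find_unrolled(const char *s)
{
    assert(s != NULL);
    Repeat best = {0, 0, 1};
    size_t n = strlen(s);

    for (size_t start = 0; start < n; start++) {
        for (size_t len = 1; start + 2 * len <= n; len++) {
            /* Count how many consecutive copies of s[start .. start+len) follow. */
            size_t reps = 1;
            while (start + (reps + 1) * len <= n &&
                   memcmp(s + start, s + start + reps * len, len) == 0) {
                reps++;
            }
            if (reps >= 2 && reps * len > best.repeats * best.length) {
                best.start = start;
                best.length = len;
                best.repeats = reps;
            }
        }
    }
    return best;
}
```

On the slide's string `"BABCABCD"` this reports a repeat starting at index 1 with length 3 and 2 copies, matching the two unrolled iterations of Ld, Add, St.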
Loop Rerolling: Compacting Iterations

Original C code:
    x = x + 1;
    for (i=0; i < 2; i++)
      a[i]=b[i]+1;
    y=x;

1) Unrolled loop identification
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3

2) Determine relationship of constants

3) Replace constants with induction variable expression
    Add r3, r3, 1
    i=0
  loop:
    Ld r0, b(i)
    Add r1, r0, 1
    St a(i), r1
    Bne i, 2, loop
    Mov r4, r3

Rerolled, decompiled code:
    reg3 = reg3 + 1;
    for (i=0; i < 2; i++)
      array1[i]=array2[i]+1;
    reg4=reg3;
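Step 2, determining the relationship of the constants, can be sketched as a check that the constant taken from the same operand position in each unrolled iteration forms an arithmetic progression `base + stride * i`. This is an assumption about one simple relationship a rerolling pass could look for; the function name `induction_form` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * consts[i] is the constant seen in iteration i at one operand position,
 * e.g. the offsets 0 and 1 in "Ld r0, b(0)" and "Ld r0, b(1)".
 * Returns 1 and fills *base and *stride if consts[i] == base + stride * i
 * for every i; returns 0 otherwise.
 */
int induction_form(const long *consts, size_t n, long *base, long *stride)
{
    assert(consts != NULL && base != NULL && stride != NULL);
    if (n < 2)
        return 0;                 /* need at least two iterations to compare */

    long step = consts[1] - consts[0];
    for (size_t i = 2; i < n; i++) {
        if (consts[i] - consts[i - 1] != step)
            return 0;             /* not a linear function of the iteration number */
    }
    *base = consts[0];
    *stride = step;
    return 1;
}
```

For the offsets `{0, 1}` in the example, the result is base 0 and stride 1, so each constant can be replaced by the induction variable `i`.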
Method: Strength Promotion
• Problem: Compiler’s strength reduction (replacing multiplies by shifts and adds) prevents synthesis from using hard-core multipliers, sometimes hurting circuit performance
[Figure: FIR Filter computing A[i] from B[i]*10, B[i+1]*18, B[i+2]*34, B[i+3]*66 using multipliers and adders]
[Figure: Strength-Reduced FIR Filter, with each strength-reduced multiplication replaced by shifts (<<) and adds]
Strength Promotion
• Solution: Promote strength-reduced code to muls
  • Identify strength-reduced subgraphs
  • Replace with multiplication
• Strength promotion lets synthesis decide on strength reduction based on available resources
  • Synthesis can of course apply strength reduction itself
[Figure: strength-reduced FIR filter graph; the shift-and-add subgraphs are identified and replaced with multiplications by 10, 18, 34, and 66 to produce A[i]]
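The arithmetic behind promotion can be checked with a small sketch. A multiplication by a constant that was strength-reduced into a sum of shifted copies of the same operand can be promoted back by summing the powers of two: shifts of 3 and 1 give 8 + 2 = 10, which matches the first FIR coefficient. The function names below are hypothetical, and this covers only the sum-of-shifts pattern, not everything a real promotion pass would need to recognize.

```c
#include <assert.h>
#include <stddef.h>

/* Recover the constant c such that sum over k of (x << shifts[k]) equals x * c. */
long promoted_constant(const unsigned *shifts, size_t n)
{
    assert(shifts != NULL);
    long c = 0;
    for (size_t k = 0; k < n; k++) {
        assert(shifts[k] < 31);   /* keep 1L << shift well defined */
        c += 1L << shifts[k];
    }
    return c;
}

/* The strength-reduced form, for comparison: shifts and adds only. */
unsigned long shift_add(unsigned long x, const unsigned *shifts, size_t n)
{
    assert(shifts != NULL);
    unsigned long sum = 0;
    for (size_t k = 0; k < n; k++) {
        sum += x << shifts[k];
    }
    return sum;
}
```

With the promoted constant in hand, synthesis can choose between a hard-core multiplier (`x * 10`) and the shift-add form, depending on which resources are available.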
New Decompilation Methods’ Benefits
• Rerolling
  • Speedups from better use of smart buffers
  • Other potential benefits: faster synthesis, less area
• Strength promotion
  • Speedups from fewer cycles
  • Speedups from faster clock
• New methods to be developed
  • e.g., pointer DS to arrays
[Chart: Speedups from Loop Rerolling]
[Chart: Y axis = speedup, X axis = x_y_z => x adder constraint, y multiplier constraint, z = adders needed for reduction]
[Chart: Y axis = clock frequency, X axis = adders needed for reduction]
Decompilation is Effective Even with High Compiler-Optimization Levels
[Chart: Average Speedup of 10 Examples]
• Speedups similar on MIPS for -O1 and -O3 optimizations
• Speedups similar on ARM for -O1 and -O3 optimizations
• Speedups similar between ARM and MIPS
  • Complex instructions of ARM didn’t hurt synthesis
• MicroBlaze speedups much larger
  • MicroBlaze is a slower microprocessor
  • -O3 optimizations were very beneficial to hardware
• Publication: New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Research Problem: Make Synthesis from Binaries Competitive with Synthesis from High-Level Languages
• Performed in-depth study with Freescale
  • H.264 video decoder
  • Highly-optimized proprietary code, not reference code
    • Huge difference
  • A benefit of SRC collaboration
• Research question: Is synthesis from binaries competitive on highly-optimized code?
  • Several-month study
[Figure: MPEG 2 compared with H.264. H.264: Better quality, or smaller files, using more computation]
Optimized H.264
• Larger than most benchmarks
  • H.264: 16,000 lines
  • Previous work: 100 to several thousand lines
• Highly-optimized
  • H.264: Many man-hours of manual optimization
  • 10x faster than reference code used in previous works
• Different profiling results
  • Previous examples: ~90% time in several loops
  • H.264: ~90% time in ~45 functions
  • Harder to speed up
C vs. Binary Synthesis on Opt. H.264
• Binary partitioning competitive with source partitioning
  • Speedups compared to ARM9 software
  • Binary: 2.48, C: 2.53
• Decompilation recovered nearly all high-level information needed for partitioning and synthesis
• Discovered another research problem: Why aren’t speedups (from binary or C) closer to “ideal” (0-time per fct)?
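One way to put a number on the “ideal” is Amdahl's law. The sketch below is an assumption about how the bound could be computed, not the study's own analysis: if a fraction of software time is moved to hardware and that part takes zero time, the overall speedup is 1 / (1 - fraction). The function names are hypothetical, and the ~90% figure is borrowed from the profiling slide purely as an illustration.

```c
#include <assert.h>

/* Overall speedup when `fraction` of the run time is sped up by `kernel_speedup`. */
double partition_speedup(double fraction, double kernel_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup);
}

/* "Ideal" bound: the partitioned functions take zero time. */
double ideal_speedup(double fraction)
{
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```

Under these assumptions, `ideal_speedup(0.9)` is about 10, so measured speedups near 2.5 leave a wide gap to the zero-time bound, which is the question the slide raises.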
Coding Guidelines
• Are there C-coding guidelines to improve partitioning speedups?
  • Orthogonal to C vs. binary question
  • Guidelines may help both
• Examined H.264 code further
  • Several phone conferences with Freescale liaisons, also several email exchanges and reports
[Chart annotations: "Competitive, but both could be better"; "Coding guidelines get closer to ideal"]
Synthesis-Oriented Coding Guidelines
• Pass by value-return
  • Declare a local array and copy in all data needed by a function (makes lack of aliases explicit)
• Function specialization
  • Create function version having frequent parameter-values as constants

Original:
    void f(int width, int height ) {
      . . . .
      for (i=0; i < width; i++)
        for (j=0; j < height; j++)
          . . . . . .
    }

Rewritten:
    void f_4_4() {
      . . . .
      for (i=0; i < 4; i++)
        for (j=0; j < 4; j++)
          . . . . . .
    }

Bounds are explicit so loops are now unrollable
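A runnable sketch of function specialization follows. The loop body is elided on the slide, so the block-sum body, the `pix` and `stride` parameters, and the function names here are all hypothetical; only the pattern of fixing `width` and `height` to 4 comes from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* General version: loop bounds are only known at run time. */
long sum_block(const int *pix, int stride, int width, int height)
{
    assert(pix != NULL);
    long sum = 0;
    for (int i = 0; i < width; i++)
        for (int j = 0; j < height; j++)
            sum += pix[j * stride + i];
    return sum;
}

/* Specialized version: constant bounds, so both loops can be fully unrolled. */
long sum_block_4_4(const int *pix, int stride)
{
    assert(pix != NULL);
    long sum = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            sum += pix[j * stride + i];
    return sum;
}

/* Callers route the frequent 4x4 case to the specialized version. */
long sum_block_dispatch(const int *pix, int stride, int width, int height)
{
    if (width == 4 && height == 4)
        return sum_block_4_4(pix, stride);
    return sum_block(pix, stride, width, height);
}
```

The dispatch wrapper keeps behavior identical for every parameter combination while letting the partitioner target `sum_block_4_4` as the hardware candidate.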
Synthesis-Oriented Coding Guidelines
• Algorithmic specialization
  • Use parallelizable hardware algorithms when possible
• Hoisting and sinking of error checking
  • Keep error checking out of loops to enable unrolling
• Lookup table avoidance
  • Use expressions rather than lookup tables

Original:
    int clip[512] = { . . . }
    void f() {
      . . .
      for (i=0; i < 10; i++)
        val[i] = clip[val[i]];
      . . .
    }

Rewritten:
    void f() {
      . . .
      for (i=0; i < 10; i++)
        if (val[i] > 255)
          val[i] = 255;
        else if (val[i] < 0)
          val[i] = 0;
      . . .
    }

Comparisons can now be parallelized
[Figure: val[0] and val[1] each compared against 0 (<) and 255 (>) in parallel, each pair of comparisons feeding a 3x1 selector]
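The rewritten clip can be isolated as a small function and checked on a few values. This is a minimal sketch of the lookup-table-avoidance guideline; the names `clip_expr` and `clip_all` are hypothetical, and the contents and indexing of the original 512-entry table are not shown on the slide, so no table version is reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp v to the range 0..255 with two comparisons instead of a table read. */
int clip_expr(int v)
{
    if (v > 255)
        return 255;
    if (v < 0)
        return 0;
    return v;
}

/* Each element is independent, so the comparisons can run in parallel in hardware. */
void clip_all(int *val, size_t n)
{
    assert(val != NULL);
    for (size_t i = 0; i < n; i++)
        val[i] = clip_expr(val[i]);
}
```

Because no iteration reads shared table memory, an unrolled circuit needs only comparators and selectors per element.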
Synthesis-Oriented Coding Guidelines
• Use explicit control flow
  • Replace function pointers with if statements and static function calls

Original:
    void (*funcArray[]) (char *data) = { func1, func2, . . . };
    void f(char *data) {
      . . .
      funcPointer = funcArray[i];
      (*funcPointer) (data);
      . . .
    }

Rewritten:
    void f(char *data) {
      . . .
      if (i == 0)
        func1(data);
      else if (i==1)
        func2(data);
      . . .
    }
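The two styles can be compared side by side in a compilable sketch. The bodies of `func1` and `func2` are not given on the slide, so the ones below are placeholders, and `call_indirect` and `call_explicit` are hypothetical wrapper names.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder callees: each just tags the buffer so the call is observable. */
static void func1(char *data) { data[0] = 'a'; }
static void func2(char *data) { data[0] = 'b'; }

/* Original style: the callee is chosen through a function-pointer table. */
void call_indirect(int i, char *data)
{
    static void (*const funcArray[])(char *) = { func1, func2 };
    assert(data != NULL);
    assert(i >= 0 && i < 2);
    funcArray[i](data);
}

/* Rewritten style: explicit control flow with static call targets. */
void call_explicit(int i, char *data)
{
    assert(data != NULL);
    if (i == 0)
        func1(data);
    else if (i == 1)
        func2(data);
}
```

Both wrappers produce the same effect for valid `i`, but in the explicit version every call target is visible statically, so a synthesis tool can inline or partition each callee.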
Coding Guideline Results on H.264
• Simple coding guidelines made large improvement
• Rewritten software only ~3% slower than original
• And, binary partitioning still competitive with C partitioning
  • Speedups: Binary: 6.55, C: 6.56
  • Small difference caused by switch statements that used indirect jumps
Studied More Benchmarks, Developed More Guidelines
• Studied guidelines further on standard benchmarks
  • Further synthesis speedups (again, independent of C vs. binary issue)
• Publications
  • Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005 (joint publication with Freescale).
  • Submitted: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar, 2006.
• More guidelines to be developed
Warp-Tailored FPGA Prototype
• Developed FPGA fabric tailored to fast/small-memory on-chip CAD
• Building chip prototype with Intel
  • Created synthesizable VHDL models, running through Intel shuttle tool flow
  • Plan to incorporate with ARM processor and other IP on shuttle seat
  • Bi-weekly phone meetings with Intel engineers since summer 2005, ongoing, scheduled tapeout 2006 Q3
[Figure: Configurable Logic Fabric with DADG, LCH, 32-bit MAC, and an array of SMs and CLBs; CLB detail (LUTs, inputs a-f, outputs o1-o4, connections to adjacent CLBs); SM detail (channels 0-3 and 0L-3L)]
Industrial Interactions
• Freescale
  • Numerous phone conferences, emails, and reports, on technical subjects
  • Co-authored paper (CODES/ISSS’05), another pending
  • Summer internship: Scott Sirowy (new UCR graduate student), summer 2005, Austin
• Intel
  • Three visits by PI, one by graduate student Roman Lysecky, to Intel Research in Santa Clara
  • PI presented at Intel System Design Symposium, Nov. 2005
  • PI served on Intel Research Silicon Prototyping Workshop panel, May 2005
  • Participating in Intel’s Research Shuttle (chip prototype), bi-weekly phone conferences since summer 2005 involving PI, Intel engineers, and Roman Lysecky (now Prof. at UA)
• IBM
  • Embarking on studies of warp processing results on server applications
  • UCR group to receive Cell-based prototyping platform (w/ Prof. Walid Najjar)
• Several interactions with Xilinx also
Task Description: Coming Up
• Warp processing background
  • Two seed SRC CSR grants (2002-2005) showed feasibility
  • Idea: Transparently move critical binary regions from microprocessor to FPGA, for 10x perf./energy gains or more
• Task: Mature warp technology
  • Years 1/2 (in progress)
    • Automatic high-level construct recovery from binaries
    • In-depth case studies (with Freescale)
      • Also discovered unanticipated problem, developed solution
    • Warp-tailored FPGA prototype (with Intel)
  • Years 2/3: all three sub-tasks just now underway
    • Reduce memory bottleneck by using smart buffer
    • Investigate domain-specific-FPGA concepts (with Freescale)
    • Consider desktop/server domains (with IBM)
Recent Publications
• New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2005.
• Fast Configurable-Cache Tuning with a Unified Second-Level Cache. A. Gordon-Ross, F. Vahid, N. Dutt. Int. Symp. on Low-Power Electronics and Design (ISLPED), 2005.
• Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005. (Co-authored paper with Freescale)
• Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware. A. Gordon-Ross and F. Vahid. IEEE Trans. on Computers, Special Issue: Best of Embedded Systems, Microarchitecture, and Compilation Techniques in Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
• A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. R. Lysecky, F. Vahid and S. Tan. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
• A First Look at the Interplay of Code Reordering and Configurable Caches. A. Gordon-Ross, F. Vahid, N. Dutt. Great Lakes Symposium on VLSI (GLSVLSI), April 2005.
• A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. R. Lysecky and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.
• A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. Design Automation and Test in Europe (DATE), March 2005.