Explore the usage of FPGA in app development through spatial algorithms, bytecode optimization, and dynamic accelerator management. Learn how to leverage FPGA for efficient processing.
JIT FPGA Ideas
Frank Vahid
Dept. of CS&E, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Contributing Ph.D. Students
• Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona)
• Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville)
• Scotty Sirowy (current)
• David Sheldon (current)
• Chen Huang (current)
This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx.
SystemC Bytecode for FPGAs • Demo
FPGA Common Presence
• Caches, FPUs, GPUs, FPGAs
• App developers may expect FPGA presence
• How to create and distribute apps that make good use of an FPGA if present?
[Figure labels: Binary, µP, Cache, FPU, GPU, FPGA]
“Spatial” Algorithms for FPGAs
• Example: count patterns
• Sequential algorithm
  • Hash table
  • 10s of cycles per pattern
• Spatial algorithm
  • Pipelined stages
  • Essence is the connectivity of components, not the sequencing of instructions

Sequential algorithm (pseudocode):

    int patterns[1000];
    int counts[1000];
    while (1) {
        WaitForPattern();
        CurrPattern = X;                            // latch the incoming pattern
        hash = HashFct(CurrPattern);
        item = Find(patterns, CurrPattern, hash);   // hash-table lookup
        if (item) {
            counts[item]++;
        }
    }

[Figure: spatial version, with labels bus, CurrPattern, and count pattern logic at Level 1, Level 2, ..., Level m]
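To make the contrast concrete, here is a minimal software model of the spatial version, assuming each level stores one pattern and its own counter and that the input moves one level per clock. The names `CountStage`, `PatternPipeline`, `clock`, and `drain` are hypothetical; the slide only shows the block diagram, so this is a sketch of the idea and not the authors' circuit.

```cpp
#include <cstddef>
#include <vector>

// One pipeline level: a stored pattern, a match counter, and a register
// holding the input value currently passing through this level.
struct CountStage {
    explicit CountStage(int p) : pattern(p) {}
    int pattern;
    unsigned count = 0;
    int held = 0;        // value latched in this level's register
    bool valid = false;  // true when `held` is a real input, not a bubble
};

class PatternPipeline {
public:
    explicit PatternPipeline(const std::vector<int>& patterns) {
        for (int p : patterns) stages_.push_back(CountStage(p));
    }

    // One clock tick: every level latches its predecessor's old value and
    // compares it with its own pattern. Iterating from the last level to
    // the first models all registers updating at the same time.
    void clock(bool input_valid, int input) {
        for (std::size_t i = stages_.size(); i-- > 0;) {
            CountStage& s = stages_[i];
            if (i == 0) {
                s.held = input;
                s.valid = input_valid;
            } else {
                s.held = stages_[i - 1].held;
                s.valid = stages_[i - 1].valid;
            }
            if (s.valid && s.held == s.pattern) ++s.count;
        }
    }

    // Feed bubbles until the last real input has left the final level.
    void drain() {
        for (std::size_t i = 1; i < stages_.size(); ++i) clock(false, 0);
    }

    unsigned count(std::size_t level) const { return stages_.at(level).count; }

private:
    std::vector<CountStage> stages_;
};
```

In this model one new pattern enters per clock regardless of how many levels exist, which is the point of the spatial formulation: throughput comes from the connectivity, while the sequential version spends tens of cycles per pattern on hashing and lookup.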
Bytecode
• Modern portability approach
• Java, C#
• Virtual Machine (VM): program that executes bytecode
  • May JIT compile to native architecture
[Figure: Compiler -> bytecode -> VMs on Pentium, Opteron, Atom]
SystemC Bytecode?
[Figure: SystemC -> Compiler -> SystemC bytecode -> VMs; target labels: Opteron + FPGA, Pentium, FPGA]
UCR SystemC Bytecode and Compiler

SystemC (input to UCR’s SystemC-to-bytecode compiler):

    class EDGE_DETECTOR : public sc_module {
        // signal declarations
        …
        EDGE_DETECTOR() {
            SC_METHOD(mainComp);
            sensitive << dataReady;
            SC_METHOD(getPixel);
            sensitive << clock.pos();
        }
        void getPixel() {
            …
            dataReady.write(1);
        }
        void mainComp() {
            int i, j;
            for (i = 0; i < 3; i++) {
                for (j = 0; j < 3; j++) {
                    sumX = sumX + mem.read() * GX[i][j];
                }
            }
            …
            edge.write(sumX + sumY);
        }
    };

UCR’s SystemC bytecode (output):

    --header
    signal clock : 1
    signal reset : 1
    signal memory_in : 32
    signal fb_data : 32
    signal leds : 4

    process(clock)
    READ $1 memory_in
    ADD $2 $0 3
    ADD $3 $2 $1
    WRITE $3 s1
    ADDI $1 $0 1
    WRITE $1 dataReady
    END

    process(dataReady)
    READ $5 val6
    SW $5 24($0)
    READ $5 val7
    …
    ADDI $10 $0 0
    ADDI $7 $0 0
    ADDI $13 $0 8
    …
    END

[Figure annotations on the bytecode: Spatial Constructs; MIPS-like sequential instructions]
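As an illustration of how a VM might execute one process body, the sketch below interprets a small subset of the opcodes that appear in the listing (READ, ADD, ADDI, WRITE, END). The operand semantics are assumptions inferred from the MIPS-like mnemonics, not a documented specification: READ loads a signal into a register, WRITE stores a register to a signal, and `$0` always reads as zero. The listing's `ADD $2 $0 3` is written as ADDI here because its third operand looks like an immediate. `Instr`, `Signals`, `run_process`, and `example_clock_process` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

enum class Op { READ, ADD, ADDI, WRITE, END };

// One decoded instruction (assumed layout, for illustration only).
struct Instr {
    Op op;
    int a;               // destination register, or source register for WRITE
    int b;               // first source register
    std::int32_t c;      // second source register (ADD) or immediate (ADDI)
    std::string signal;  // signal name for READ / WRITE, empty otherwise
};

using Signals = std::map<std::string, std::int32_t>;

// Runs one process body until END and returns the number of instructions
// executed. Registers are local to this activation of the process.
std::size_t run_process(const std::vector<Instr>& code, Signals& signals) {
    std::array<std::int32_t, 32> reg{};  // all registers start at zero
    std::size_t executed = 0;
    for (const Instr& in : code) {
        if (in.op == Op::END) break;
        switch (in.op) {
            case Op::READ:  reg[in.a] = signals[in.signal];    break;
            case Op::ADD:   reg[in.a] = reg[in.b] + reg[in.c]; break;
            case Op::ADDI:  reg[in.a] = reg[in.b] + in.c;      break;
            case Op::WRITE: signals[in.signal] = reg[in.a];    break;
            case Op::END:   break;  // handled above
        }
        reg[0] = 0;  // assumption: $0 is hardwired to zero, as in MIPS
        ++executed;
    }
    return executed;
}

// The process(clock) body from the listing, hand-decoded.
std::vector<Instr> example_clock_process() {
    return {
        {Op::READ,  1, 0, 0, "memory_in"},
        {Op::ADDI,  2, 0, 3, ""},
        {Op::ADD,   3, 2, 1, ""},
        {Op::WRITE, 3, 0, 0, "s1"},
        {Op::ADDI,  1, 0, 1, ""},
        {Op::WRITE, 1, 0, 0, "dataReady"},
        {Op::END,   0, 0, 0, ""},
    };
}
```

A full VM would also need the header's signal widths and a scheduler that re-runs a process whenever a signal in its sensitivity list changes; both are left out of this sketch.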
SystemC Bytecode Emulator
• Bytecode uploadable via USB drive
• Accelerators speed up emulation
[Figure: SystemC bytecode loaded into an emulator on the FPGA. Emulator blocks: Main Processor, Instruction Memory, Input Memory, Output Memory, Read Signal Memory, Write Signal Memory, UART, USB Interface, Buttons, LEDs]
SystemC Bytecode Accelerators
• Implementation
  • MIPS-like multicycle RISC datapath
  • 100 MHz clock
  • ~33 million instr/sec (about 3 cycles per instruction at 100 MHz)
  • Communicates with the core emulator through memory-mapped registers
  • Area: ~5000 slices
  • Number of accelerators limited to the number of masters allowed on the bus
  • ~1200 lines of VHDL
[Figure: the emulator (Main Processor, Instruction Memory, Input Memory, Output Memory, Read Signal Memory, Write Signal Memory, UART, USB Interface, Buttons, LEDs) plus Accelerators 1-3 on the FPGA. Accelerator blocks: Register File; Bus, start, load logic; RISC Datapath; Local Mem]
Dynamic SystemC Accelerator Management
• Only a limited number of SystemC accelerators can fit on an FPGA fabric
• Dynamically map processes to accelerators based on process usage
• Involves online algorithms
[Figure: image filter example. SystemC bytecode with numeric labels 42, 11, 12, 43, 10, 44, the emulator, and Accelerators 1-3 on the FPGA]
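One simple way to map processes to a fixed number of accelerators by usage is sketched below. The slide does not say which online algorithm is used, so the policy here (keep the most-executed processes resident, then halve the counters so old behavior fades) is an assumption chosen for illustration, and `AcceleratorManager`, `record_execution`, and `remap` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

constexpr int kEmptySlot = -1;  // marks an accelerator with no process loaded

class AcceleratorManager {
public:
    explicit AcceleratorManager(std::size_t slots) : slots_(slots, kEmptySlot) {}

    // Called by the emulator each time a process runs.
    void record_execution(int pid) { ++usage_[pid]; }

    bool is_accelerated(int pid) const {
        return std::find(slots_.begin(), slots_.end(), pid) != slots_.end();
    }

    // Re-map: the most-used processes get the accelerators. Processes that
    // are already resident stay in place. Returns how many accelerators had
    // to be (re)loaded, which is the cost an online policy tries to limit.
    std::size_t remap() {
        std::vector<std::pair<int, unsigned>> ranked(usage_.begin(), usage_.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const std::pair<int, unsigned>& x,
                     const std::pair<int, unsigned>& y) {
                      if (x.second != y.second) return x.second > y.second;
                      return x.first < y.first;  // deterministic tie-break
                  });
        if (ranked.size() > slots_.size()) ranked.resize(slots_.size());

        // Evict residents that are no longer among the top-ranked processes.
        for (int& slot : slots_) {
            bool keep = false;
            for (const auto& entry : ranked) keep = keep || (entry.first == slot);
            if (!keep) slot = kEmptySlot;
        }

        // Load newly selected processes into free accelerators. A free slot
        // always exists because ranked.size() <= slots_.size().
        std::size_t loads = 0;
        for (const auto& entry : ranked) {
            if (is_accelerated(entry.first)) continue;
            *std::find(slots_.begin(), slots_.end(), kEmptySlot) = entry.first;
            ++loads;
        }

        // Age the counters so that recent usage dominates the next decision.
        for (auto& kv : usage_) kv.second /= 2;
        return loads;
    }

private:
    std::vector<int> slots_;         // process id held by each accelerator
    std::map<int, unsigned> usage_;  // executions seen per process id
};
```

How often `remap` runs and how strongly the counters decay are tuning knobs; a real policy would weigh the cost of loading an accelerator against the expected gain.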
Just-in-Time Synthesis
• Send SystemC bytecode to synthesis server
• Dynamically reconfigure some or all of the FPGA
• Possible to even perform synthesis on-chip: “warp processing” (previous UCR work)
[Figure: SystemC bytecode goes to a synthesis server; label: FPGA Specific Bitstream; the emulator and Accelerators 1-3 on the FPGA]
Transmuting Coprocessors • Demo
FPGA is a Size-Limited Coprocessing Resource
App executions change.
Must decide which coprocessors should be FPGA-resident at a given time: transmuting coprocessors.
[Figure labels: Speedup with previous apps; Upload app profile info; Select coproc. set, generate new FPGA bitstream; Send back new bitstream, re-program FPGA; FPGA implements coprocessors]
Transmuting Coprocessor Demo
• Three image filters, each in small and large versions (S/L):
  • Blur filter (S/L): blur the image
  • Sobel filter (S/L): find the edges in the image
  • Emboss filter (S/L): emboss the image
• Platform:
  • Virtex 2P (XC2VP30): PPC + coprocessors
  • PPC frequency: 100 MHz
  • Coproc. frequency: 50 MHz
[Figure labels: 30x, 120x]
Demo architecture
Image (128*128 pixels, 24-bit color): 24 BRAMs
Soft version: Read (Image BRAM) -> Execution (PPC) -> Write (Display BRAM)
Coprocessor version: Read (Image BRAM) -> Execution (Coproc) -> Write (Display BRAM)
Dock: send the profile information through UART.
[Figure labels: UART, Push button, Image BRAM, PLB, PPC, Peripherals, Coproc, Interface to external, Instruction BRAM, Display BRAM, VGA control, VGA display, EDK, ISE]
Coprocessor configurations
• Microprocessor only
• Small blur + small sobel
• Small blur + small emboss
• Small sobel + small emboss
• Large blur
• Large sobel
• Large emboss
• Choose the configuration according to app profile info.
[Figure: Virtex2P with PPC, Peripherals, Memory, and a coprocessor region holding one of: Blur (S) + Sobel (S); Blur (S) + Emboss (S); Sobel (S) + Emboss (S); Blur (L); Sobel (L); Emboss (L)]
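Choosing a configuration from profile information can be sketched as a small cost model: estimate the total run time under each of the seven configurations and take the minimum. Everything numeric here is an assumption. The profile is taken to be the software time spent in each filter, and the `small` and `large` speedup arguments are placeholders supplied by the caller, not measurements from the demo. `Config`, `demo_configs`, `estimated_time`, and `choose_config` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum Filter { BLUR = 0, SOBEL = 1, EMBOSS = 2, FILTER_COUNT = 3 };

using Profile = std::array<double, FILTER_COUNT>;  // software time per filter

struct Config {
    std::string name;
    std::array<double, FILTER_COUNT> speedup;  // 1.0 means "runs on the PPC"
};

// The seven configurations from the slide, parameterized by assumed
// speedups for the small and large coprocessor versions.
std::vector<Config> demo_configs(double small, double large) {
    return {
        Config{"Microprocessor only",        {{1.0,   1.0,   1.0}}},
        Config{"Small blur + small sobel",   {{small, small, 1.0}}},
        Config{"Small blur + small emboss",  {{small, 1.0,   small}}},
        Config{"Small sobel + small emboss", {{1.0,   small, small}}},
        Config{"Large blur",                 {{large, 1.0,   1.0}}},
        Config{"Large sobel",                {{1.0,   large, 1.0}}},
        Config{"Large emboss",               {{1.0,   1.0,   large}}},
    };
}

// Estimated run time: each filter's software time divided by its speedup.
double estimated_time(const Config& cfg, const Profile& sw_time) {
    double total = 0.0;
    for (std::size_t f = 0; f < FILTER_COUNT; ++f) {
        total += sw_time[f] / cfg.speedup[f];
    }
    return total;
}

// Index of the configuration with the lowest estimated time.
// Ties go to the earliest entry in the list.
std::size_t choose_config(const std::vector<Config>& configs,
                          const Profile& sw_time) {
    assert(!configs.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < configs.size(); ++i) {
        if (estimated_time(configs[i], sw_time) <
            estimated_time(configs[best], sw_time)) {
            best = i;
        }
    }
    return best;
}
```

With placeholder speedups of 4 (small) and 10 (large), a profile split evenly between blur and sobel favors the two small coprocessors, while a profile dominated by one filter favors that filter's large version. That is the trade-off the configuration list expresses.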