Just-in-Time Compilation for FPGA Processor Cores

Andrew Becker (1), Scott Sirowy (2), Frank Vahid. Department of Computer Science and Engineering, University of California, Riverside. {abecker | ssirowy | vahid}@cs.ucr.edu. (1) Now at EPFL. (2) Now at ESRI.

Presentation Transcript


  1. Just-in-Time Compilation for FPGA Processor Cores
     Andrew Becker (1), Scott Sirowy (2), Frank Vahid
     Department of Computer Science and Engineering, University of California, Riverside
     {abecker | ssirowy | vahid}@cs.ucr.edu
     (1) Now at EPFL. (2) Now at ESRI.
     This work was supported in part by the National Science Foundation (CNS1016792) and by the Semiconductor Research Corporation (GRC 2143.001).

  2. Motivation
     • SystemC useful capture language
       • Concurrency, structure, timing
     • Simulation typical, but in-system I/O often useful
     • Design/synthesis to FPGA may take hours/days and require advanced tools
     [Figure: simulation vs. in-system I/O (switches/LEDs, cameras/displays)]

  3. Background
     • Want rapid design iteration with in-system I/O
     • Compile design description; avoid design/synthesis
     • Previously: hybrid approach (SystemC bytecode)
     [Figure: SystemC code is compiled to bytecode; alternatives shown are a simulator (no in-system I/O) and design/synthesis (time-consuming)]
     SystemC code (excerpt):
       class CLK_GEN : public sc_module {
         sc_in<bool> clock;
         …
         CLK_GEN(){ …
     Bytecode (excerpt):
       process(clock)
         READ $1 dataRdy
         BGT $1 $0 Start
         J Done
       Start:
         ADDI $2 $2 1
         ADDI $3 $0 7
         …
     Portable SystemC-on-a-chip – Sirowy [CODES+ISSS ’09]

  4. Background
     • Emulate bytecode in engine on FPGA
     • Fast compilation
     • Bytecode also portable (FPGA-device independent)
     [Figure: SystemC code (CLK_GEN excerpt) is compiled to bytecode (process(clock) excerpt), which runs on an emulation engine on the FPGA with in-system I/O]
     Portable SystemC-on-a-chip – Sirowy [CODES+ISSS ’09]
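To make "emulate bytecode" concrete, here is a minimal C sketch of a fetch-decode-dispatch loop over the instructions shown in the bytecode excerpt (READ, BGT, J, ADDI). Several details are assumptions for illustration and do not come from the paper's actual SystemC-bytecode format: the opcode names, the Instr layout, a HALT opcode standing in for the Done label, and treating $0 as a hardwired zero register. The per-instruction fetch and switch dispatch is the overhead that the JIT slides later aim to remove.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes modeled on the slide's bytecode (not the real format). */
typedef enum { OP_READ, OP_BGT, OP_J, OP_ADDI, OP_HALT } Opcode;

typedef struct {
    Opcode  op;
    int     a;    /* destination or first source register */
    int     b;    /* second source register */
    int32_t imm;  /* immediate, signal index, or branch target (instruction index) */
} Instr;

#define NUM_REGS 8

/* Interpret one process body until HALT. Returns the number of instructions
   executed, or -1 on an unknown opcode. $0 is assumed hardwired to zero. */
static long emulate(const Instr *code, int32_t *regs, const int32_t *read_signals) {
    assert(code != NULL && regs != NULL && read_signals != NULL);
    long executed = 0;
    int pc = 0;
    for (;;) {
        const Instr *in = &code[pc];       /* fetch */
        executed++;
        switch (in->op) {                  /* decode and dispatch */
        case OP_READ: regs[in->a] = read_signals[in->imm]; pc++; break;
        case OP_BGT:  pc = (regs[in->a] > regs[in->b]) ? (int)in->imm : pc + 1; break;
        case OP_J:    pc = (int)in->imm; break;
        case OP_ADDI: regs[in->a] = regs[in->b] + in->imm; pc++; break;
        case OP_HALT: return executed;
        default:      return -1;
        }
        regs[0] = 0;                       /* keep $0 at zero even if written */
    }
}
```

Consider the program READ $1 dataRdy; BGT $1 $0 3; J 5; ADDI $2 $2 1; ADDI $3 $0 7; HALT, with dataRdy = 1. The loop executes five instructions and leaves $2 = 1 and $3 = 7.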

  5. Emulation Engine
     • Discrete event simulator
     • C code on a processor
       • (Currently MicroBlaze soft core; could be hard core)
     • Support circuits for architectural features, peripheral I/O
     [Figure: block diagram with Processor Core (Event Kernel), Instruction Mem., Peripheral Bus, UART, LEDs, Buttons, Read Signal Memory, Write Signal Memory, Frame Buffer]
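The separate Read Signal Memory and Write Signal Memory blocks suggest the usual SystemC evaluate/update split. Processes read current values and write new ones, and the kernel commits pending writes between delta cycles. Below is a minimal C sketch of that bookkeeping, assuming a fixed signal count and a deduplicated queue of written signal indices. All names are hypothetical, and this is not the engine's actual code. Work of this kind is plausibly what slide 14 calls signal-queue maintenance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_SIGNALS 4

/* Hypothetical kernel state; names and sizes are illustrative only. */
typedef struct {
    int32_t read_mem[NUM_SIGNALS];   /* values processes see in the current delta cycle */
    int32_t write_mem[NUM_SIGNALS];  /* values written during the current delta cycle */
    int     queue[NUM_SIGNALS];      /* signal queue: indices written this cycle */
    bool    queued[NUM_SIGNALS];     /* prevents duplicate queue entries */
    int     queue_len;
} Kernel;

static void kernel_init(Kernel *k) { memset(k, 0, sizeof *k); }

/* Evaluate phase: a process writes a signal; the new value stays invisible until update. */
static void signal_write(Kernel *k, int sig, int32_t value) {
    assert(sig >= 0 && sig < NUM_SIGNALS);
    k->write_mem[sig] = value;
    if (!k->queued[sig]) {
        k->queued[sig] = true;
        k->queue[k->queue_len++] = sig;
    }
}

/* Update phase: commit queued writes and return how many signals actually changed;
   processes sensitive to those signals would run in the next delta cycle. */
static int signal_update(Kernel *k) {
    int changed = 0;
    for (int i = 0; i < k->queue_len; i++) {
        int sig = k->queue[i];
        if (k->read_mem[sig] != k->write_mem[sig]) {
            k->read_mem[sig] = k->write_mem[sig];
            changed++;
        }
        k->queued[sig] = false;
    }
    k->queue_len = 0;
    return changed;
}
```

Suppose signal 1 is written twice (5, then 6) and signal 2 is written once with its current value. The queue then holds two entries. signal_update reports one changed signal, and read_mem[1] becomes 6.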

  6. Caveat Emptor
     • Emulation is slow
     • On a soft core, even slower than PC simulation
     • Won't meet many real-time constraints

  7. This work: Speed up emulator
     • First analyzed emulator performance

  8. Low-Hanging Fruit
     • 69% of time spent emulating bytecode
     • Two strategies to reduce it:
       • Reduce each instruction’s emulation time
       • Reduce instruction memory latency
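The 69% figure bounds what the two strategies above can deliver on their own. By Amdahl's law, if a fraction f of runtime is sped up by a factor s and the rest is unchanged, the overall speedup is S(s). The calculation below takes f = 0.69 from the profile and assumes the remaining 31% stays fixed.

```latex
S(s) = \frac{1}{(1 - f) + f/s}, \qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - f} = \frac{1}{0.31} \approx 3.23
```

Under that assumption, even infinitely fast instruction emulation would cap gains confined to the 69% share at about 3.2x.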

  9. First Step
     • Reduce instruction emulation time
       • Optimize event kernel?
     [Figure: emulation engine block diagram, as on slide 5]

  10. First Step
     • Reduce instruction emulation time
       • Optimize event kernel?
       • Just-in-time (JIT) compile bytecode to native processor code, done transparently by the event kernel
     [Figure: emulation engine block diagram, as on slide 5]

  11. Just-in-Time Compilation of Bytecode
     • Implemented SystemC-bytecode to MicroBlaze JIT compiler
     • 3x speedup; still portable
     • Tunable delay/jitter
     • Still want more speed
     [Figure: a JIT in the emulation engine's event kernel translates bytecode into machine code]
     Bytecode (excerpt):
       process(clock)
         READ $1 dataRdy
         BGT $1 $0 Start
         J Done
       Start:
         ADDI $2 $2 1
         ADDI $3 $0 7
         …
     Machine code (excerpt):
         IMM 0xDEAD
         LWI $11 $0 0xBEEF
         BGTI $11 Start
         BRAI Done
       Start:
         …
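The bytecode/machine-code pair on this slide suggests a template-style translation. A READ of a signal becomes an IMM/LWI pair that loads from the signal's 32-bit address, a BGT against $0 becomes BGTI, and J becomes BRAI. Below is a minimal C sketch of such a translator that emits assembly text in the slide's notation.

The following are assumptions for illustration:
- the register mapping ($n to $(10+n));
- the signal-address table;
- the L<index> labels;
- the ADDI template;
- the restriction to BGT against $0 and to 16-bit ADDI immediates.

A real JIT would write binary MicroBlaze encodings into memory rather than text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bytecode subset modeled on the slide (not the real format). */
typedef enum { OP_READ, OP_BGT, OP_J, OP_ADDI } Opcode;
typedef struct { Opcode op; int a, b; int32_t imm; } Instr;

/* Assumed register mapping: bytecode $0 -> native $0 (zero), $n -> native $(10+n). */
static int native_reg(int r) { return r == 0 ? 0 : 10 + r; }

/* Append one line of assembly text, never overrunning the cap-byte buffer. */
static void emit(char *out, size_t cap, const char *line) {
    strncat(out, line, cap - strlen(out) - 1);
    strncat(out, "\n", cap - strlen(out) - 1);
}

/* Translate one instruction; returns 0 on success, -1 if this sketch does not handle it. */
static int translate(const Instr *in, const uint32_t *signal_addr, char *out, size_t cap) {
    char line[64];
    switch (in->op) {
    case OP_READ: {
        /* IMM supplies the upper 16 address bits; LWI loads from the full 32-bit address. */
        uint32_t addr = signal_addr[in->imm];
        snprintf(line, sizeof line, "IMM 0x%04X", (unsigned)(addr >> 16));
        emit(out, cap, line);
        snprintf(line, sizeof line, "LWI $%d $0 0x%04X", native_reg(in->a), (unsigned)(addr & 0xFFFFu));
        emit(out, cap, line);
        return 0;
    }
    case OP_BGT:
        if (in->b != 0) return -1;        /* only "BGT $x $0 L", i.e. compare with zero */
        snprintf(line, sizeof line, "BGTI $%d L%d", native_reg(in->a), (int)in->imm);
        emit(out, cap, line);
        return 0;
    case OP_J:
        snprintf(line, sizeof line, "BRAI L%d", (int)in->imm);
        emit(out, cap, line);
        return 0;
    case OP_ADDI:
        if (in->imm < -32768 || in->imm > 32767) return -1;  /* would need an IMM prefix */
        snprintf(line, sizeof line, "ADDI $%d $%d %d", native_reg(in->a), native_reg(in->b), (int)in->imm);
        emit(out, cap, line);
        return 0;
    }
    return -1;
}

/* Translate a whole process body. Every bytecode index i gets label Li (plus an end
   label Ln), so bytecode branch targets map directly onto native labels. */
static int jit_translate(const Instr *code, int n, const uint32_t *signal_addr, char *out, size_t cap) {
    assert(out != NULL && cap > 0);
    char label[16];
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        snprintf(label, sizeof label, "L%d:", i);
        emit(out, cap, label);
        if (translate(&code[i], signal_addr, out, cap) != 0) return -1;
    }
    snprintf(label, sizeof label, "L%d:", n);
    emit(out, cap, label);
    return 0;
}
```

Given the slide's bytecode and a signal table that maps dataRdy to 0xDEADBEEF, the output contains IMM 0xDEAD followed by LWI $11 $0 0xBEEF, then BGTI $11 L3 and BRAI L5. This matches the shape of the slide's machine code. Labeling every bytecode index keeps branch targets valid without a separate relocation pass.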

  12. Further Improvement
     • Reduce instruction memory latency
     • Add dedicated small, fast memory for JIT code on a fast, local bus
     • Unique JIT possibility due to FPGA configurability

  13. Architecture Changes
     [Figure: emulation engine block diagram with a Local Memory Bus added; blocks: Processor Core, JIT Mem., Instr. Mem., Peripheral Bus, UART, LEDs, Buttons, Read Signal Memory, Write Signal Memory, Frame Buffer]

  14. Even Further Improvement
     • 23% of time spent maintaining signal queue
     • What can be done?
       • Optimize signal queue maintenance code?

  15. [Figure: a "common denominator" emulation engine on one FPGA vs. an emulation engine on an FPGA with extra resources]
     • FPGA offers configurability
     • Engine designer can make tradeoffs
       • Trade hardware resources for speed

  16. [Figure: a "common denominator" emulation engine on one FPGA vs. an emulation engine on an FPGA with extra resources]
     • FPGA offers configurability
     • Engine designer can make tradeoffs
       • Trade hardware resources for speed
       • Add another soft core?

  17. Even Further Improvement
     • 23% of time spent maintaining signal queue
     • What can be done?
       • Optimize signal queue maintenance code?
       • Offload job to coprocessor
         • Again, unique JIT option due to FPGA configurability
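Amdahl's law also bounds what offloading can deliver by itself. The calculation below makes two assumptions. First, the 23% share is measured on the engine at the point the coprocessor is added. Second, offloading removes that work from the processor's critical path entirely (full overlap). Under those assumptions, this step alone yields at most:

```latex
S_{\text{offload}} \le \frac{1}{1 - f} = \frac{1}{1 - 0.23} = \frac{1}{0.77} \approx 1.30
```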

  18. Architecture Changes
     [Figure: emulation engine block diagram with a Signal Queue block and an Emulation Memory Controller added; other blocks: Processor Core, Local Memory Bus, JIT Mem., Instr. Mem., Peripheral Bus, UART, LEDs, Buttons, Read Signal Memory, Write Signal Memory, Frame Buffer]

  19. Experimental Results

  20. Conclusions
     • Approach: rapid design iteration with in-system I/O
     • Uses
       • Education (typically loose timing constraints)
       • System prototypes that can tolerate real-time slowdown (e.g., slow frame rate)
     • Portable and flexible
       • Engine design sets speed, not compiler or CAD flow
     • This work: 15x speedup from normal JIT (3x) combined with FPGA-specific JIT (5x)
       • But still orders of magnitude slower than a circuit produced by design/synthesis
     • Future work: bytecode accelerators, JIT synthesis
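The 15x figure is the product of the two stage speedups, not their sum. Here T_0 is the runtime of the original bytecode emulator, T_1 the runtime after normal JIT, and T_2 the runtime after the FPGA-specific changes:

```latex
S_{\text{total}} = \frac{T_0}{T_2} = \frac{T_0}{T_1} \cdot \frac{T_1}{T_2} = 3 \times 5 = 15
```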
