1- ADVANCED FPGA PRODUCTS APPENDIX
Heterogeneous Programmable Platforms Centered around an FPGA: Xilinx Virtex-II Pro • FPGA fabric • Embedded memories • Embedded PowerPC • Hardwired multipliers • High-speed I/O (3.125 Gbps transceivers) (Courtesy Xilinx)
Soft Cores • MicroBlaze embedded processor • Soft core: a RISC processor optimized for implementation on Xilinx FPGAs • Completely implemented in the field in the general-purpose memory and logic fabric of the FPGA (Concept figure, not a real device)
Berkeley Pleiades Processor • Centered around an ARM8 core acting as system manager • Intensive computations offloaded to a reconfigurable datapath (adders, multipliers, ASIPs, ...) • FPGA for bit manipulation • 0.25 µm, 6-level-metal CMOS • 5.2 mm x 6.7 mm • 1.2 million transistors • 40 MHz at 1 V • 2 extra supplies: 0.4 V and 1.5 V • 1.5-2 mW power dissipation (Diagram: ARM8 core, reconfigurable datapath, FPGA, interface)
Xilinx UltraScale MPSoC: an all-programmable heterogeneous MPSoC ... and of course programmable logic
2- ADVANCED PROTOTYPING EXAMPLE APPENDIX
Heterogeneous Parallel Computing • Template features: • Host processor core (ARM big.LITTLE) • Programmable multi-core accelerator (GPPA) • Hierarchical interconnect fabric: CCI-400 (a crossbar) and a system NoC (a network-on-chip) • GOAL: prototype an innovative GPPA capable of running multiple concurrent offload applications by means of isolated and reserved computation partitions (a minimal offload sketch follows below)
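To make the offload model concrete, here is a minimal sketch of how a host application could hand a kernel to the GPPA through the OpenMP runtime. The `target` directive is standard OpenMP, but the device number `GPPA_DEVICE` and the assumption that the GPPA is exposed as an OpenMP device are illustrative, not the project's documented interface.

```c
#include <stdio.h>

#define N 256

/* Hypothetical device id under which the OpenMP runtime exposes the GPPA. */
#define GPPA_DEVICE 1

int main(void)
{
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Data and code are shipped to the accelerator; the OpenMP RTE
     * packages them into an offload request for the GPPA driver. */
    #pragma omp target device(GPPA_DEVICE) map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}
```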
MULTI-CORE ACCELERATOR PROTOTYPING • Virtex-7 evaluation board VC707 • XC7VX485T chip: 486k logic cells, 76k slices, 36 Mb BRAM • Advanced GHz-range transceivers, on-board RAM and flash, display, Ethernet, etc.
FPGA HARDWARE PLATFORM OVERVIEW (block diagram): a 4x4 mesh of MicroBlaze (μB) tiles, each attached through a network interface (NI) and local memory to the Dual NoC (Ferrara IP); an AXI bus connects the Fabric Controller MicroBlaze (master), the Dual NoC receiver (master) and driver (slave), and the Xilinx IP peripherals: DRAM controller, memory, programmable fault injector, interrupt controller, UART (driving the GUI from Valencia), debug module, traffic sniffers, timer, and GPIO.
ACCELERATOR ARCHITECTURE • Computation clusters with distributed L2 banks • Dual NoC for routing reconfiguration • Fabric Controller (NoC reconfiguration, partition setup, application start, ...) • GPPA I/O interface
MESH NODE • MicroBlaze in place of clusters • Hardware sniffers for user-accessible link traffic monitoring (see the sketch below) (Diagram: a MicroBlaze and a BRAM memory, each with its own NI, attached to the node switch; NORTH, SOUTH, EAST, WEST and local switch ports, with a sniffer on each link)
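A hedged sketch of how software could sample one of the link sniffers from a mesh MicroBlaze. The base address, register offsets, and counter semantics (`SNIFFER_NORTH_BASE`, `SNIFFER_FLIT_COUNT`, `SNIFFER_CLEAR`) are assumptions made for illustration; the real sniffer interface may differ.

```c
#include <stdint.h>

/* Hypothetical memory-mapped layout of one link sniffer:
 * a flit counter and a clear/control register. */
#define SNIFFER_NORTH_BASE  0x44A00000u   /* assumed base address */
#define SNIFFER_FLIT_COUNT  0x0u
#define SNIFFER_CONTROL     0x4u
#define SNIFFER_CLEAR       0x1u

static inline uint32_t reg_read(uintptr_t base, uint32_t off)
{
    return *(volatile uint32_t *)(base + off);
}

static inline void reg_write(uintptr_t base, uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(base + off) = val;
}

/* Sample the number of flits observed on the NORTH link since the last clear. */
uint32_t sample_north_link_traffic(void)
{
    uint32_t flits = reg_read(SNIFFER_NORTH_BASE, SNIFFER_FLIT_COUNT);
    reg_write(SNIFFER_NORTH_BASE, SNIFFER_CONTROL, SNIFFER_CLEAR);
    return flits;
}
```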
SWITCH ARCHITECTURE (diagram): each switch holds a hardwired "set of registers", programmed from the Dual-Bus, for OSR partition setup (full mesh); it contains three 6x6 switch planes: a LOCAL plane for inter-processor requests (L2 accesses plus routing reconfiguration), a LOCAL plane for responses (L2 accesses plus routing reconfiguration), and a GLOBAL plane (no circuit, no routing reconfiguration); 3x1 and 2x1 VC arbiters with vc_id and stall signals sit on the local and global links toward the µP and the L2; dashed links are currently unused (a hedged register-programming sketch follows below).
(Diagram: µP, network-on-chip, L2)
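A hedged sketch of how the "set of registers" might be programmed over the Dual-Bus to steer a partition's routes. The base address, per-switch stride, and register encoding (`DUALBUS_BASE`, `SWITCH_STRIDE`, `ROUTE_REG`) are hypothetical placeholders, not the actual OSR programming model.

```c
#include <stdint.h>

/* Hypothetical Dual-Bus window onto the per-switch routing registers. */
#define DUALBUS_BASE        0x50000000u       /* assumed */
#define SWITCH_STRIDE       0x100u            /* assumed: one register bank per switch */
#define ROUTE_REG(sw, port) (DUALBUS_BASE + (sw) * SWITCH_STRIDE + 4u * (port))

static inline void dualbus_write(uint32_t addr, uint32_t val)
{
    *(volatile uint32_t *)(uintptr_t)addr = val;
}

/* Steer traffic entering 'in_port' of switch 'sw' toward 'out_port'.
 * In the real design this information lives in the hardwired
 * "set of registers" written during OSR partition setup. */
void set_route(unsigned sw, unsigned in_port, unsigned out_port)
{
    dualbus_write(ROUTE_REG(sw, in_port), out_port);
}
```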
KEY PERFORMED TESTS (1) • Initial NoC testing (for stuck-at faults) and configuration • Detection of a link failure • NoC is configured to route around it • Matrix Multiply benchmark starts on the 16 mesh MicroBlazes (see the sketch below) • Objective: verify configuration, rerouting, and benchmark execution
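A minimal sketch of how the Matrix Multiply benchmark could be partitioned row-wise across the 16 mesh MicroBlazes. The matrix size, the shared placement of `A`, `B`, `C`, and the `my_core_id()` helper are assumptions for illustration, not the benchmark's actual code.

```c
#include <stdint.h>

#define N        64           /* assumed matrix size */
#define N_CORES  16           /* 4x4 mesh of MicroBlazes */

float A[N][N], B[N][N], C[N][N];   /* assumed to live in shared memory */
extern uint32_t my_core_id(void);  /* assumed platform helper */

/* Each MicroBlaze computes a contiguous block of rows of C = A * B. */
void matmul_worker(void)
{
    uint32_t id    = my_core_id();
    uint32_t rows  = N / N_CORES;
    uint32_t first = id * rows;

    for (uint32_t i = first; i < first + rows; i++)
        for (uint32_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (uint32_t k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}
```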
KEY PERFORMED TESTS (2) • On a button press, the fabric controller (supervision MicroBlaze) initiates dynamic space-division multiplexing (SDM) • The MicroBlazes start new SDM-aware tasks • Objective: prove partition isolation and differentiated, partition-shape-dependent execution time (see the sketch below)
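A hedged sketch of what the fabric controller does in this test: carve two SDM partitions of different shapes and start an SDM-aware task on each. The `partition_t` descriptor and the `gppa_configure_partition()` / `gppa_start_partition()` helpers are hypothetical names standing in for the real partition-setup and task-start mechanisms.

```c
#include <stdint.h>

/* Hypothetical partition descriptor: a rectangular region of the 4x4 mesh. */
typedef struct {
    uint8_t x0, y0;   /* top-left tile     */
    uint8_t x1, y1;   /* bottom-right tile */
    uint8_t task_id;  /* SDM-aware task to launch on the partition */
} partition_t;

/* Assumed fabric-controller helpers: program the Dual NoC routing so the
 * partition is isolated, then signal its MicroBlazes to start the task. */
extern void gppa_configure_partition(const partition_t *p);
extern void gppa_start_partition(const partition_t *p);

void on_button_press(void)
{
    /* Two differently shaped partitions: execution time is expected to
     * depend on the shape, which is what the test checks. */
    partition_t left  = { 0, 0, 1, 3, /*task*/ 1 };   /* 2x4 tiles */
    partition_t right = { 2, 0, 3, 1, /*task*/ 2 };   /* 2x2 tiles */

    gppa_configure_partition(&left);
    gppa_configure_partition(&right);
    gppa_start_partition(&left);
    gppa_start_partition(&right);
}
```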
PLATFORM IMPLEMENTATION • Beyond 90% FPGA resource utilization! (Chart: resource breakdown among the 4x4 mesh, mesh NIs, Dual NoC, MicroBlazes and other)
Accelerator Offload • Example: offload packet (data and binary for the GPPA) • The OpenMP RTE forwards the offload request to the guest GPPA driver • The guest GPPA driver forwards it to the GPPA emulation device • The GPPA emulation device forwards the request to the GPPA bridge and copies data and binary from guest memory space to host memory space • The GPPA bridge forwards the packet to the host GPPA driver and copies data and binary from host virtual memory to contiguous memory shared with the GPPA (L3 memory) (Diagram: guest app and OpenMP API over /dev/GPPAv in the GUEST; GPPAv emulation device under QEMU/KVM; GPPAbrctl bridge with ioctl-based resource allocation/management and a POSIX queue toward /dev/GPPA on the HOST; iowrite into the contiguous memory seen by the GPPA; task descriptors and task data move from guest memory to host memory to contiguous memory; a hedged guest-side sketch follows below)
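A hedged sketch of the guest-side entry point of this chain: the runtime hands the offload request to the guest GPPA driver through the /dev/GPPAv node shown in the diagram. The ioctl command, the `gppa_offload_req` layout, and the field names are assumptions; only the device node and the use of ioctl come from the slide.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical request handed from the OpenMP RTE to the guest GPPA driver. */
struct gppa_offload_req {
    uint64_t data_addr;     /* guest-virtual address of the task data   */
    uint64_t data_size;
    uint64_t binary_addr;   /* guest-virtual address of the GPPA binary */
    uint64_t binary_size;
};

/* Assumed ioctl command; the real driver interface may differ. */
#define GPPA_IOC_OFFLOAD _IOW('G', 1, struct gppa_offload_req)

int gppa_guest_offload(const void *data, size_t data_size,
                       const void *binary, size_t binary_size)
{
    int fd = open("/dev/GPPAv", O_RDWR);    /* guest GPPA driver node */
    if (fd < 0)
        return -1;

    struct gppa_offload_req req = {
        .data_addr   = (uintptr_t)data,
        .data_size   = data_size,
        .binary_addr = (uintptr_t)binary,
        .binary_size = binary_size,
    };

    /* The guest driver forwards this to the GPPA emulation device,
     * which copies data and binary toward the host side. */
    int ret = ioctl(fd, GPPA_IOC_OFFLOAD, &req);
    close(fd);
    return ret;
}
```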
GPPA Offload • Currently aiming at a unique prototyping platform • The offload procedure relies on copies into a non-paged, contiguous memory range (seen as mmap-ed I/O) • COPIES ARE AVOIDED IN REAL SYSTEMS BY MEANS OF AN IOMMU! • Validated on ODROID • The offload packet flow and diagram are the same as on the previous slide (a hedged host-side copy sketch follows below)
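A hedged sketch of the host-side step highlighted above: copying the offload packet into the non-paged contiguous region shared with the GPPA, exposed as mmap-ed I/O. The packet layout, the window size, and the mmap offset are assumptions; only /dev/GPPA and the contiguous-memory copy itself come from the slides, and a real system with an IOMMU would avoid this copy.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define GPPA_L3_SIZE (4 * 1024 * 1024)   /* assumed size of the shared L3 window */

/* Hypothetical offload packet: task descriptor followed by data and binary. */
struct offload_packet {
    uint32_t task_id;
    uint32_t data_size;
    uint32_t binary_size;
    uint8_t  payload[];      /* data followed by binary */
};

/* Copy the packet into the contiguous memory seen by the GPPA. */
int gppa_push_offload(const struct offload_packet *pkt)
{
    int fd = open("/dev/GPPA", O_RDWR);          /* host GPPA driver */
    if (fd < 0)
        return -1;

    /* The contiguous, non-paged L3 window is exposed as mmap-ed I/O. */
    void *l3 = mmap(NULL, GPPA_L3_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    if (l3 == MAP_FAILED) {
        close(fd);
        return -1;
    }

    size_t total = sizeof(*pkt) + pkt->data_size + pkt->binary_size;
    memcpy(l3, pkt, total);                      /* the copy an IOMMU would avoid */

    munmap(l3, GPPA_L3_SIZE);
    close(fd);
    return 0;
}
```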
OFFLOAD – the accelerator side • The Fabric Controller (FC) sets the partition, generates a task, pushes it onto the OpenMP offload task queue, and triggers task execution • Threads on the mesh MicroBlazes (UB_0, UB_1, UB_2, ...) pick up and run the offloaded tasks (Diagram: AXI-bus I/O ports, fabric controller, per-tile NIs and switches, BRAM controller and BRAM, UART; OpenMP offload data support and task queue; address map as shown: 0x10000000-0x10000FFF 4 KB test-and-set region, 0x10030000-0x1003FFFF 64 KB L2_0, 0x100C0000-0x100CFFFF 64 KB L2_1, 0x100F0000-0x100FFFFF 64 KB L2_2, each L2 bank holding OpenMP support data; a hedged address-map sketch follows below)
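A hedged header-style sketch of the address map above, together with a possible use of the 4 KB test-and-set region as a spin lock for the OpenMP runtime. The lock semantics and the per-lock word layout are assumptions; only the base addresses and sizes come from the slide.

```c
#include <stdint.h>

/* Address map as shown on the slide (region uses partly inferred). */
#define TEST_AND_SET_BASE 0x10000000u   /* 4 KB test-and-set region   */
#define L2_0_BASE         0x10030000u   /* 64 KB, OpenMP support data */
#define L2_1_BASE         0x100C0000u   /* 64 KB, OpenMP support data */
#define L2_2_BASE         0x100F0000u   /* 64 KB, OpenMP support data */

/* Assumed semantics of the test-and-set region: reading a word returns 0
 * and atomically sets it if the lock was free, non-zero if already taken;
 * writing 0 releases it. */
static inline void gppa_lock(unsigned lock_id)
{
    volatile uint32_t *lock =
        (volatile uint32_t *)(TEST_AND_SET_BASE + 4u * lock_id);
    while (*lock != 0u)
        ;   /* spin until the hardware grants the lock */
}

static inline void gppa_unlock(unsigned lock_id)
{
    volatile uint32_t *lock =
        (volatile uint32_t *)(TEST_AND_SET_BASE + 4u * lock_id);
    *lock = 0u;
}
```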
Going to ASIC: the synthesis flow • 28 nm, 12-track library, regular threshold voltage • Supply voltages 1.0 V / 0.9 V / 0.8 V and temperatures 125 °C / 25 °C / 125 °C (best/typical/worst corners) • Target design: two switches, radix 7x7, 32-bit flit width, 3 VCs, 2-slot input buffers, 6-slot output buffers, 3 NIs per cluster • Tight boundary constraints: 400 ps input transition slope, output capacitance equal to 1000 times the input capacitance of the biggest inverter in the library • Flow: Design Compiler, IC Compiler, SoC Encounter • Post-synthesis MAX speed: 800 MHz (Diagram: Switch 0 and Switch 1, each with 3 NIs)
Floorplanning • Link length based on an estimated tile size of 2 mm in 28 nm • Hard fences defined for the floorplanning blocks • Row utilization set to 60% (Floorplan blocks: cluster CPU network interfaces, cluster L1 network interfaces, switches, L2 bank network interfaces)
Post-Layout Analysis • Post-layout: 800 MHz (highly predictable), 213,515 µm² • Critical path: inside the FSM of the virtual-channel flit-level arbiter • The link was not on the critical path