Mapping CMPs to Xilinx FPGAs

Jan Gray, Architect, Office of the CTO, Microsoft (fpgacpu.org, fpga-cpu list)
CMPs on FPGAs

Outline
• Why am I here? (1)
• FPGA CMPs: a brief personal history (1)
• A methodology for great quality of results (7)
• Mapping a scalar RISC PE to an FPGA (5)
• CMP and RAMP Comments (4)

Why Am I Here?
• End of rapid clock freq scaling – parallelism imperative – we get it…
• The vast design space of 1-billion-transistor SoCs, ~2010
• Enables dozens of cores, integration, 100s of GFLOPS, but how can the millions of real-world developers (not grad students) exploit it?
• Particularly a challenge in ‘client personal computing’ settings
• ~2010, industry must deliver loveable (mainstream, evolutionary) concurrency programming models
• RAMP promises rapid iteration → rapid innovation in tools and architectural support for loveable concurrency models
• [How] can we study mainstream commercial workloads, tools, and platforms on RAMP CMPs?

My Journey into FPGA CMPs
• Inspired by comp.arch, many Hot Chips conferences, H&P
• 91: Freidin’s RISC4005: 20 MHz |4| 16-bit RISC in 4005
• 95: jr32: 33÷2 MHz |4|, 32-bit RISC + SoC in 4010
• 98: XSOC/xr16: 40 MHz |3|, 16-bit RISC in 260 LCs in S10
• Lcc, sims, SoC, Circuit Cellar series, fpgacpu.org, fpga-cpu list
• 00: Altera NIOS
• 01: gr1040: 200 LUT+1 BRAM, 80 MHz |2|; 60 in V600E [P]
• 01: Xilinx MicroBlaze (125 MHz |3| in 2Vxxx)
• 04: 10 MicroBlaze in 2V2000 via EDK 6.3i
• 04: 24 multithreaded-’MB’ in 4VLX25 [PD]
• 06: 24 ‘PowerPC’ in 4VLX25 [PD], 200 MHz |4|, 133 MHz |2|
• [P] = PAR/TRCE only; [D] = PAR/TRCE of datapath only; |x| = x stg pipe, no FP

A Methodology for Great Quality of Results: It’s Essential for CMPs!
• It’s the golden age of FPGA development
• Was: timing whack-a-mole, synthesis pushing on a rope
• Now: good fast tools, fast computers, better fabrics
• But: 2-10x better delay×area by tailoring ISA, HW/SW partitioning, datapath, pipeline, tech mapping, floorplanning to the FPGA
• Prefer forty 200 MHz processors per die to twenty 100 MHz ones
• Example: always @(posedge clk) q <= add ? q + a : b;
• Hand tech mapped and floorplanned: 1 LUT/bit
• Synthesis: 2 LUT/bit, +0.5 ns delay
• 5X faster place and route → rapid (methodical) expts
• Efficiency of ASIC CPU models on FPGAs?
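The 1 LUT/bit claim for `q <= add ? q + a : b` can be checked with a small bit-level model. The Python sketch below is one plausible mapping, not necessarily the one used in the talk: it assumes each bit uses a single 4-input LUT, an AND gate beside the LUT (MULT_AND-style) that forces the carry to zero when `add` is 0, and a carry mux plus XOR on the carry chain (MUXCY/XORCY-style). The function names are illustrative.

```python
def lut4(add, q_bit, a_bit, b_bit):
    """The single 4-input LUT per bit: adder propagate term, or b passed through."""
    return (q_bit ^ a_bit) if add else b_bit


def addmux_step(q, a, b, add, width=32):
    """Next value of q for `q <= add ? q + a : b`, built bit by bit."""
    carry = 0            # carry-in to bit 0
    nxt = 0
    for i in range(width):
        q_bit, a_bit, b_bit = (q >> i) & 1, (a >> i) & 1, (b >> i) & 1
        s = lut4(add, q_bit, a_bit, b_bit)   # LUT output
        di = add & q_bit                     # AND gate beside the LUT, not a second LUT
        nxt |= (s ^ carry) << i              # XOR with carry-in (XORCY-style)
        carry = carry if s else di           # carry mux (MUXCY-style)
    return nxt
```

With `add` = 0 the AND gate holds every carry at 0, so the XOR passes `b` through unchanged; with `add` = 1 the same LUT and carry chain form an ordinary ripple-carry adder. That is why no separate mux LUT is needed in this model.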

The Art of High Performance FPGA Design / How to Hack FPGAs like Ray Andraka
• Great FPGA designers have The Knowledge
• Best practices for great datapath QoR:
• Choose the datapath’s technology mapping, pipeline regime, and floorplan, and then write the HDL
• Bottom up experiments
• Where it matters, use technology mapping tricks
• Build up libraries of optimal datapath elements
• Floorplan datapaths via Relationally Placed Macros (RPMs)
• VHDL (generate + attributes) or Python + Verilog
• Synthesize 95% of control unit – life is too short
• Careful timing constraints, grok TRCE reports
• Tune architecture and implementation together
• Sweat the muxes
• To iterate is divine

The Knowledge in a Nutshell
• The LUT and its DFF
• Tech mapping opts to quash mux LUTs
• The BRAM
• The DSP48
• (The DCM)

The LUT and its D-FF
• 4-input LUT
• Ripple-carry adder MUXCY and XORCY: ~2.5 ns 32-bit adder
• MULTAND
• P[i] = B[i] ? (P[i-1] + (A<<i)) : P[i-1];
• Mux cascades – MUXF5 etc.
• 16x1-bit LUT-RAM; sp, dp
• SRL16 – 16-bit tapped shift reg
• D-Flip-Flops
• Clock enable, synchronous reset, system reset regime
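The MULTAND recurrence above is a shift-and-add multiplier: each bit of B selects whether a shifted copy of A is added into the running product. A minimal word-level Python sketch of that recurrence follows. The function name and the `width` parameter are illustrative, and the model says nothing about how the rows are packed into LUTs.

```python
def shift_add_multiply(a, b, width=16):
    """Unsigned multiply via P[i] = B[i] ? (P[i-1] + (A << i)) : P[i-1]."""
    p = 0                                   # P[-1], the empty partial product
    for i in range(width):
        b_bit = (b >> i) & 1
        p = (p + (a << i)) if b_bit else p  # one conditional-add row per bit of B
    return p
```

Each loop iteration corresponds to one conditional-add row. In the fabric, the AND of an A bit with a B bit is presumably what the MULTAND gate supplies to the carry chain, so a row can stay at 1 LUT/bit.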

1 LUT/bit Technology Mapping Opts
• ADDSUB: o <= add ? (a+b) : (a-b);
• MUX2K: o <= k ? sel : (sel ? a : b);
• MULTAND + carry-chain:
• ADDMUX: o <= add ? (a+b) : c;
• MUXADD: o <= addb ? (a+b) : (a+c);
• ALU: o <= s1 ? (s2 ? (a+b) : (a-b)) : (s2 ? (a&b) : (a^b));
• Fast carry-chain-logic
• EQV: o[i/2] <= a[i+1:i] == b[i+1:i]
• EQZ: o[i/4] <= a[i+3:i] == 0
• C, V conditions
• Other cheap mux ideas
• 4-1 MUX using 2 2-1 MUXES and a MUXF5
• LUT-RAM / SRL16 is a 16-1 MUX
• 4-input OR of 4 clearable registers
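The EQV and EQZ entries use the carry chain as a wide AND. Each LUT compares a small slice, and the chain passes a 1 along only while every slice matches. The Python sketch below models that idea under stated assumptions: a carry-in of 1, and a carry mux whose other input is tied to 0. The helper names are hypothetical.

```python
def carry_and(lut_outputs):
    """AND a list of LUT outputs on the carry chain: carry-in 1, any 0 kills it."""
    carry = 1
    for lut in lut_outputs:
        carry = carry if lut else 0         # carry mux with its data input tied to 0
    return carry


def eqz(a, width=32):
    """EQZ: one LUT per 4 bits tests a[i+3:i] == 0."""
    luts = [int(((a >> i) & 0xF) == 0) for i in range(0, width, 4)]
    return carry_and(luts)


def eqv(a, b, width=32):
    """EQV: one LUT per 2 bits tests a[i+1:i] == b[i+1:i] (4 LUT inputs)."""
    luts = [int(((a >> i) & 3) == ((b >> i) & 3)) for i in range(0, width, 2)]
    return carry_and(luts)
```

In this model a 32-bit zero test takes 8 LUTs and a 32-bit equality test takes 16, with the combining done on the carry chain rather than in a tree of extra LUTs.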

BRAM
• 18 Kb dual port synch SRAM
• Up to dual x32+x4 D/Q
• 0 cycle: tAS ~0.5 ns; tCO ~2.1 ns
• BRAM … adder … BRAM → 6 ns
• Virtex-4
• Opt. 1 cycle: DO*_REG: 0.9 ns
• 400+ MHz
• Byte write enables
• FIFOs for ser/des rate matching
• The Myriad Uses of BRAMs on fpgacpu.org
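The 6 ns figure for a BRAM … adder … BRAM path can be roughly reconciled with the quoted component delays. The Python sketch below is back-of-envelope arithmetic only. It assumes the adder in the path is the ~2.5 ns 32-bit ripple-carry adder from the LUT slide and attributes whatever is left of the 6 ns to routing and other overhead; the talk states neither assumption.

```python
T_CO_NS = 2.1      # BRAM clock-to-out (from the slide)
T_ADD_NS = 2.5     # 32-bit ripple-carry adder (from the LUT slide; assumed to be this adder)
T_AS_NS = 0.5      # BRAM address setup (from the slide)
PERIOD_NS = 6.0    # quoted BRAM ... adder ... BRAM path


def timing_budget(period_ns=PERIOD_NS):
    """Split the path into component delays and the remainder left for routing."""
    logic_ns = T_CO_NS + T_ADD_NS + T_AS_NS
    return {
        "logic_ns": round(logic_ns, 2),
        "routing_ns": round(period_ns - logic_ns, 2),
        "fmax_mhz": round(1000.0 / period_ns, 1),
    }
```

Under these assumptions the component delays sum to about 5.1 ns, leaving roughly 0.9 ns of slack for routing, and a 6 ns cycle corresponds to about 167 MHz.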

MULT/DSP48
• Dozens to hundreds in V-4
• Pipeline at 400+ MHz
• Faster adders than the fabric
• Basis for interesting fast simulated FPUs

QoR Examples
• xr16 core
• ISA codesigned with datapath
• Elide 1 result forwarding mux, compensate in SW
• Map result mux, shifter to TBUFs
• gr1040 core – 200 LUTs + 1 BRAM
• 2 stage pipeline – elide all result forwarding muxes
• BRAM for instructions and data
• Use 1 LUT/bit ‘ALU’ – delete OR operator – 67% smaller
• Use ADDMUX – faster, 30% smaller
• C, V, branch, and i-cache tag check in carry-chain-logic

Mapping a Scalar RISC PE to an FPGA
• Instruction cache, data cache
• Cache lines – 1+ BRAMs
• Cache tags – LUT RAM or BRAM
• Read first mode for write-back caches
• Register file
• Single or dual ported LUT RAM
• Multicontext reg files in BRAM
• ALU
• Tech mapping tricks; DSP48?
• Result forwarding muxes
• Multithreading – MicroUnity, HEP → deep pipelines OK
• Clock pipeline faster than operand regs → ALU → forwarding → operand regs recurrence
• LUT RAM PCs, SPRs, PSWs; BRAM reg files
• But probably too much pressure on tiny i-caches and d-caches?
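The multithreading bullets can be illustrated with a simplified round-robin ("barrel") issue model. If a thread issues only once every N cycles, its previous result has already reached the register file when N is at least the result latency, so the forwarding recurrence leaves the critical path. The Python sketch below assumes strict round-robin issue and a fixed result latency, and it ignores branches and memory stalls. The names are illustrative, not taken from any of the cited designs.

```python
def issue_order(num_threads, cycles):
    """Round-robin (barrel) issue: the thread id that issues in each cycle."""
    return [cycle % num_threads for cycle in range(cycles)]


def needs_forwarding(num_threads, result_latency):
    """True if a thread's next instruction could read its operands before the
    previous result is written back, result_latency cycles after issue."""
    # Consecutive instructions of one thread issue num_threads cycles apart.
    return num_threads < result_latency
```

For example, a single-threaded pipeline with a 3-cycle result latency needs forwarding muxes in this model, while four interleaved threads on the same pipeline do not, at the cost of more register-file and cache state per PE.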

Simple Is Beautiful
• Simpler is smaller
• Smaller is cheaper
• More PEs per part
• Smaller can be faster
• Interconnect is slow, so the less, the better
• Easier to optimize (retiming, floorplanning, technology mapping)
• Smaller is more power frugal
• Simpler is easier to verify
• Move complexity out of the ISA, trap into software, or use dynamic translation to the simpler ISA
• (WCED?)

“Jan’s Razor”
• In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.
• Small clusters of cores share mul/div, barrel shift, FPU, TLB, even d-cache port

Silly Example: 70 ‘PowerPC-lite’ datapaths in a 2VP70

Which ISAs for RAMP PEs?
• Best fit in an FPGA fabric (==austerity)
• MicroBlaze, MIPS, SPARC, PowerPC, x86
• x86 + PC periphs via dynamic translation?
• Extant soft cores: MB, SPARC
• 2VP/4VFX + EDK (CoreConnect *) bonus
• MB, PowerPC
• Commercial workloads and tools
• PowerPC!

PE Figures of Merit
• Area: #[LUTs, BRAMs, DSPs, DCMs]
• Frequency, power, floorplanned? (fast PAR)
• Simplicity / ease of modification
• Some experiments will augment base CPU ISAs
• Facilities
• Validation
• Debug support
• Tools integration
• Workloads
• IP Rights

Speculation: How to Experiment Upon Commercial X86 Workloads on RAMP
• X86 HW seems too complex for an area- and time-efficient large-n FPGA CMP
• 386, x64, v8086, x87, MMX, SSE2/3, SMM, hypervisor exts, …
• Don’t underestimate complexity of rest of system components / cores
• Build a ‘PowerPC’ CMP, run a port of the Virtual PC for Mac x86 dynamic translation engine upon it, run apps on that
• Save/restore PC workloads to VHD images
• (When you have many cores, you don’t mind if your simulator spends a few on dyn translation)

Other Thoughts
• Compose optimized building blocks into synthesized (floorplanned?) system architectures
• Synplify Pro has a great RTL viewer
• MicroBlaze is an excellent, Type B core
• EDK is a great framework
• Can plug in HW and SW components, bus masters and slaves, new CPU cores and OS and periphs, BSPs
• EDK ships with a broad complement of cores
• Don’t reinvent all that! EDK vs. RDL?
• QinetiQ (?) FPU IP

Comments? Thanks.
