Mapping CMPs to Xilinx FPGAs
Jan Gray
Architect, Office of the CTO, Microsoft
(fpgacpu.org, fpga-cpu list)
CMPs on FPGAs
Outline
• Why am I here? (1)
• FPGA CMPs: a brief personal history (1)
• A methodology for great quality of results (7)
• Mapping a scalar RISC PE to an FPGA (5)
• CMP and RAMP Comments (4)
Why Am I Here?
• End of rapid clock frequency scaling, hence the parallelism imperative (we get it…)
• The vast design space of 1 B transistor SoCs, ~2010
  • Enables dozens of cores, integration, 100s of GFLOPS, but how can the millions of real-world developers (not grad students) exploit it?
  • Particularly a challenge in ‘client personal computing’ settings
• ~2010, industry must deliver loveable (mainstream, evolutionary) concurrency programming models
• RAMP promises rapid iteration → rapid innovation in tools and architectural support for loveable concurrency models
• [How] can we study mainstream commercial workloads, tools, and platforms on RAMP CMPs?
My Journey into FPGA CMPs
• Inspired by comp.arch, many Hot Chips conferences, H&P
• 91: Freidin’s RISC4005: 20 MHz |4| 16-bit RISC in 4005
• 95: jr32: 33÷2 MHz |4|, 32-bit RISC + SoC in 4010
• 98: XSOC/xr16: 40 MHz |3|, 16-bit RISC in 260 LCs in S10
  • Lcc, sims, SoC, Circuit Cellar series, fpgacpu.org, fpga-cpu list
• 00: Altera NIOS
• 01: gr1040: 200 LUT + 1 BRAM, 80 MHz |2|; 60 in V600E [P]
• 01: Xilinx MicroBlaze (125 MHz |3| in 2Vxxx)
• 04: 10 MicroBlaze in 2V2000 via EDK 6.3i
• 04: 24 multithreaded-’MB’ in 4VLX25 [PD]
• 06: 24 ‘PowerPC’ in 4VLX25 [PD], 200 MHz |4|, 133 MHz |2|
• Key: [P] = PAR/TRCE only; [D] = PAR/TRCE of datapath only; |x| = x-stage pipeline; no FP
A Methodology for Great Quality of Results: It’s Essential for CMPs!
• It’s the golden age of FPGA development
  • Was: timing whack-a-mole, synthesis pushing on a rope
  • Now: good fast tools, fast computers, better fabrics
  • But: 2-10x better delay×area by tailoring ISA, HW/SW partitioning, datapath, pipeline, tech mapping, and floorplanning to the FPGA
• Prefer 40 processors per die at 200 MHz to 20 processors per die at 100 MHz
• Example: always @(posedge clk) q <= add ? q + a : b;
  • Hand tech mapped and floorplanned: 1 LUT/bit
  • Synthesis: 2 LUT/bit, +0.5 ns delay
  • 5X faster place and route → rapid (methodical) experiments
• Efficiency of ASIC CPU models on FPGAs?
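The 1 LUT/bit figure for the example `q <= add ? q + a : b` can be sanity-checked with a bit-level model. The sketch below is one plausible mapping, not necessarily the one used in the talk: it assumes each bit uses a single 4-input LUT (inputs `add`, `q[i]`, `a[i]`, `b[i]`), the slice's MULT_AND gate to zero the carry-generate term when not adding, and the MUXCY/XORCY carry chain. The function name `addmux_1lut_per_bit` is made up for illustration.

```python
def addmux_1lut_per_bit(add, q, a, b, width=32):
    """Model q_next = add ? q + a : b with one 4-input LUT per bit.

    Assumed mapping (illustrative, not taken from the slides):
      LUT     : s  = add ? (q[i] ^ a[i]) : b[i]
      MULT_AND: di = a[i] & add          (carry generate, 0 when not adding)
      MUXCY   : c[i+1] = s ? c[i] : di
      XORCY   : out[i] = s ^ c[i]
    """
    carry = 0  # carry-in to bit 0 is tied low
    result = 0
    for i in range(width):
        qi = (q >> i) & 1
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s = (qi ^ ai) if add else bi      # the single LUT for this bit
        di = ai & add                     # MULT_AND shares two LUT inputs
        result |= (s ^ carry) << i        # XORCY
        carry = carry if s else di        # MUXCY
    return result


if __name__ == "__main__":
    print(hex(addmux_1lut_per_bit(1, 0x1234, 0x0101, 0xBEEF)))  # add path
    print(hex(addmux_1lut_per_bit(0, 0x1234, 0x0101, 0xBEEF)))  # mux path
```

With `add` low, the generate term is forced to 0, so the carry chain stays at 0 and the XORCY passes `b[i]` through unchanged; that is what lets the mux share the adder's LUT instead of costing a second one.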
The Art of High Performance FPGA Design / How to Hack FPGAs like Ray Andraka
• Great FPGA designers have The Knowledge
• Best practices for great datapath QoR:
  • Choose the datapath’s technology mapping, pipeline regime, and floorplan, and then write the HDL
  • Bottom-up experiments
  • Where it matters, use technology mapping tricks
  • Build up libraries of optimal datapath elements
  • Floorplan datapaths via Relationally Placed Macros (RPMs)
  • VHDL (generate + attributes) or Python + Verilog
• Synthesize 95% of control unit: life is too short
• Careful timing constraints, grok TRCE reports
• Tune architecture and implementation together
• Sweat the muxes
• To iterate is divine
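The "Python + Verilog" route to RPMs can be as small as a script that prints primitive instances with `RLOC` attributes. The following is a minimal sketch under stated assumptions: it emits a column of Xilinx `FDE` flip-flops, assumes two bits per slice (as in Virtex-II/Virtex-4 style slices), and uses made-up signal names (`clk`, `ce`, `d`, `q`) and a made-up helper name `rpm_register_column`. The exact coordinate grid and attribute handling depend on the device family and tool version, so treat the output as a starting point to check against the tools.

```python
def rpm_register_column(name, width, x=0, bits_per_slice=2):
    """Emit Verilog for a vertical column of FDE flip-flops with RLOCs.

    Bit i is placed at relative slice location X{x}Y{i // bits_per_slice},
    so the register stacks bottom-up alongside a carry chain.
    Signal names (clk, ce, d, q) are illustrative assumptions.
    """
    lines = []
    for i in range(width):
        y = i // bits_per_slice
        lines.append(
            f'(* RLOC = "X{x}Y{y}" *) FDE {name}_{i} '
            f"(.C(clk), .CE(ce), .D(d[{i}]), .Q(q[{i}]));"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(rpm_register_column("q_reg", 4))
```

Generating the placement from a loop keeps the floorplan and the datapath width in one place, which makes the bottom-up experiments on the slide cheap to repeat.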
The Knowledge in a Nutshell
• The LUT and its DFF
• Tech mapping opts to quash mux LUTs
• The BRAM
• The DSP48
• (The DCM)
The LUT and its D-FF
• 4-input LUT
• Ripple-carry adder with MUXCY and XORCY: ~2.5 ns 32-bit adder
• MULTAND
  • P[i] = B[i] ? (P[i-1] + (A<<i)) : P[i-1];
• Mux cascades: MUXF5 etc.
• 16x1-bit LUT-RAM; sp, dp
• SRL16: 16-bit tapped shift reg
• D-Flip-Flops
  • Clock enable, synchronous reset, system reset regime
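The MULTAND recurrence on this slide, `P[i] = B[i] ? (P[i-1] + (A<<i)) : P[i-1]`, is a shift-and-add multiply with one conditional adder row per multiplier bit. A minimal behavioral sketch follows; the function name `multand_multiply` and the default width are assumptions for illustration, and the model says nothing about how the rows are packed into slices.

```python
def multand_multiply(a, b, width=16):
    """Unsigned multiply via the slide's recurrence.

    Row i either adds the shifted multiplicand (when bit i of b is set)
    or passes the running partial product through unchanged.
    """
    p = 0  # P[-1]
    for i in range(width):
        b_i = (b >> i) & 1
        p = p + (a << i) if b_i else p
    return p


if __name__ == "__main__":
    print(multand_multiply(1234, 567))
```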
1 LUT/bit Technology Mapping Opts
• ADDSUB: o <= add ? (a+b) : (a-b);
• MUX2K: o <= k ? sel : (sel ? a : b);
• MULTAND + carry-chain:
  • ADDMUX: o <= add ? (a+b) : c;
  • MUXADD: o <= addb ? (a+b) : (a+c);
  • ALU: o <= s1 ? (s2 ? (a+b) : (a-b)) : (s2 ? (a&b) : (a^b));
• Fast carry-chain-logic
  • EQV: o[i/2] <= a[i+1:i] == b[i+1:i]
  • EQZ: o[i/4] <= a[i+3:i] == 0
  • C, V conditions
• Other cheap mux ideas
  • 4-1 MUX using two 2-1 MUXes and a MUXF5
  • LUT-RAM / SRL16 is a 16-1 MUX
  • 4-input OR of 4 clearable registers
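Two of the idioms above can be checked with small bit-level models. This is a sketch under assumptions, not the talk's netlists: ADDSUB is modeled as one LUT per bit computing `a ^ b ^ sub` with the carry-in tied to `sub`, and the EQV-style compare is modeled as one LUT per two bits with the carry chain acting as a wide AND. The function names are made up.

```python
def addsub_1lut_per_bit(add, a, b, width=32):
    """o = add ? a + b : a - b, using a - b = a + ~b + 1."""
    sub = 0 if add else 1
    carry = sub                        # carry-in supplies the "+ 1"
    out = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        s = a_i ^ b_i ^ sub            # LUT: propagate, with b inverted on sub
        out |= (s ^ carry) << i        # XORCY
        carry = carry if s else a_i    # MUXCY, generate term is a[i]
    return out


def equal_via_carry_chain(a, b, width=32):
    """Wide a == b: each LUT compares 2 bits, the carry chain ANDs them."""
    carry = 1                          # chain starts at 1 ("equal so far")
    for i in range(0, width, 2):
        pair_equal = ((a >> i) & 3) == ((b >> i) & 3)  # 4-input LUT
        carry = carry if pair_equal else 0             # MUXCY with DI = 0
    return carry


if __name__ == "__main__":
    print(addsub_1lut_per_bit(0, 10, 3, width=8))   # subtract path
    print(equal_via_carry_chain(0xCAFE, 0xCAFE))
```

In the compare, a 32-bit equality costs 16 LUTs plus the carry chain, rather than a tree of LUTs; EQZ follows the same pattern with four bits of `a` per LUT.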
BRAM
• 18 Kb dual port synch SRAM
• Up to dual x32+x4 D/Q
• 0 cycle: tAS ~0.5 ns; tCO ~2.1 ns
• BRAM … adder … BRAM: 6 ns
• Virtex-4
  • Opt. 1 cycle: DO*_REG: 0.9 ns
  • 400+ MHz
• Byte write enables
• FIFOs for ser/des rate matching
• The Myriad Uses of BRAMs on fpgacpu.org
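The 6 ns BRAM-to-BRAM figure can be read as a simple path budget. The arithmetic below is a sketch that assumes the adder in the path is the ~2.5 ns 32-bit ripple-carry adder quoted on the LUT slide, and it attributes whatever is left over to routing; that attribution is an inference, not something the slides state.

```python
T_CO_NS = 2.1     # BRAM clock-to-out, from the slide
T_AS_NS = 0.5     # BRAM address setup, from the slide
ADDER_NS = 2.5    # assumed: the ~2.5 ns 32-bit adder from the LUT slide
CYCLE_NS = 6.0    # BRAM ... adder ... BRAM, from the slide


def routing_allowance_ns(cycle=CYCLE_NS):
    """Time left in the cycle after BRAM clock-to-out, adder, and setup."""
    return cycle - (T_CO_NS + ADDER_NS + T_AS_NS)


def fmax_mhz(cycle=CYCLE_NS):
    """Clock frequency implied by a cycle time in nanoseconds."""
    return 1000.0 / cycle


if __name__ == "__main__":
    print(f"left for routing: {routing_allowance_ns():.1f} ns")
    print(f"implied clock:    {fmax_mhz():.1f} MHz")
```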
MULT/DSP48
• Dozens to hundreds in V-4
• Pipeline at 400+ MHz
• Faster adders than the fabric
• Basis for interesting fast simulated FPUs
QoR Examples
• Xr16 core
  • ISA codesigned with datapath
  • Elide 1 result forwarding mux, compensate in SW
  • Map result mux, shifter to TBUFs
• gr1040 core: 200 LUTs + 1 BRAM
  • 2 stage pipeline: elide all result forwarding muxes
  • BRAM for instructions and data
  • Use 1 LUT/bit ‘ALU’ (delete OR operator): 67% smaller
  • Use ADDMUX: faster, 30% smaller
  • C, V, branch, and i-cache tag check in carry-chain-logic
Mapping a Scalar RISC PE to an FPGA
• Instruction cache, data cache
  • Cache lines: 1+ BRAMs
  • Cache tags: LUT RAM or BRAM
  • Read-first mode for write-back caches
• Register file
  • Single or dual ported LUT RAM
  • Multicontext reg files in BRAM
• ALU
  • Tech mapping tricks; DSP48?
• Result forwarding muxes
• Multithreading (as in MicroUnity, HEP): deep pipelines OK
  • Clock the pipeline faster than the operand regs → ALU → forwarding → operand regs recurrence
  • LUT RAM PCs, SPRs, PSWs; BRAM reg files
  • But probably too much pressure on tiny i-caches and d-caches?
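The cache bullets can be pictured with a small behavioral model. This sketch makes several simplifying assumptions that are not in the slides: the cache is direct-mapped, word-addressed, and holds one word per line, and the class and method names are invented. It illustrates why read-first mode helps a write-back cache: the same BRAM access that writes the new data also returns the old contents, which is the victim to write back.

```python
class ReadFirstRam:
    """Synchronous RAM model where a write also returns the old contents."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def read(self, addr):
        return self.mem[addr]

    def write(self, addr, data):
        old = self.mem[addr]      # read-first: old data appears on the output
        self.mem[addr] = data
        return old


class DirectMappedCache:
    """Write-back, direct-mapped, one word per line (for brevity)."""

    def __init__(self, index_bits=9):
        self.index_bits = index_bits
        lines = 1 << index_bits
        self.data = ReadFirstRam(lines)   # cache lines: BRAM
        self.tags = [None] * lines        # cache tags: LUT RAM or BRAM
        self.dirty = [False] * lines

    def _split(self, addr):
        return addr >> self.index_bits, addr & ((1 << self.index_bits) - 1)

    def load(self, addr):
        """Return the cached word, or None on a miss."""
        tag, index = self._split(addr)
        return self.data.read(index) if self.tags[index] == tag else None

    def store(self, addr, value):
        """Write a word; return (victim_addr, victim_data) if a dirty line
        was evicted, else None."""
        tag, index = self._split(addr)
        hit = self.tags[index] == tag
        old = self.data.write(index, value)   # one access: write + victim read
        victim = None
        if not hit and self.dirty[index]:
            victim = ((self.tags[index] << self.index_bits) | index, old)
        self.tags[index] = tag
        self.dirty[index] = True
        return victim


if __name__ == "__main__":
    cache = DirectMappedCache(index_bits=4)
    cache.store(0x13, 111)
    print(cache.store(0x23, 222))   # conflicts with 0x13, evicts it
```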
Simple Is Beautiful
• Simpler is smaller
• Smaller is cheaper
  • More PEs per part
• Smaller can be faster
  • Interconnect is slow, so the less, the better
  • Easier to optimize (retiming, floorplanning, technology mapping)
• Smaller is more power frugal
• Simpler is easier to verify
• Move complexity out of the ISA, trap into software, or use dynamic translation to the simpler ISA
• (WCED?)
“Jan’s Razor”
• In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.
• Small clusters of cores share mul/div, barrel shift, FPU, TLB, even d-cache port
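The effect of sharing can be put in rough numbers with a LUT budget. The sketch below is purely illustrative: the function name and every figure in the example (die size, core size, cost of the shared units) are hypothetical placeholders, not measurements from the talk, and it ignores BRAM, routing, and interconnect overhead.

```python
def pes_per_die(die_luts, core_luts, shared_luts, cluster_size):
    """PEs that fit when each cluster of cores shares one copy of the
    expensive units (mul/div, shifter, FPU, ...)."""
    cluster_luts = cluster_size * core_luts + shared_luts
    return (die_luts // cluster_luts) * cluster_size


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    DIE_LUTS, CORE_LUTS, SHARED_LUTS = 20_000, 300, 600
    for size in (1, 2, 4):
        print(size, pes_per_die(DIE_LUTS, CORE_LUTS, SHARED_LUTS, size))
```

With these placeholder numbers, giving every core its own copy of the shared units yields 22 PEs, while clusters of four yield 44; the point is the shape of the trade-off, not the specific counts.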
Silly Example: 70 ‘PowerPC-lite’ datapaths in a 2VP70
Which ISAs for RAMP PEs?
• Best fit in an FPGA fabric (== austerity)
  • MicroBlaze, MIPS, SPARC, PowerPC, x86
  • x86 + PC periphs via dynamic translation?
• Extant soft cores: MB, SPARC
• 2VP/4VFX + EDK (CoreConnect *) bonus
  • MB, PowerPC
• Commercial workloads and tools
  • PowerPC!
PE Figures of Merit
• Area: #[LUTs, BRAMs, DSPs, DCMs]
• Frequency, power, floorplanned? (fast PAR)
• Simplicity / ease of modification
  • Some experiments will augment base CPU ISAs
• Facilities
• Validation
• Debug support
• Tools integration
• Workloads
• IP Rights
Speculation: How to Experiment Upon Commercial X86 Workloads on RAMP
• X86 HW seems too complex for an area- and time-efficient large-n FPGA CMP
  • 386, x64, v8086, x87, MMX, SSE2/3, SMM, hypervisor exts, …
  • Don’t underestimate complexity of rest of system components / cores
• Build a ‘PowerPC’ CMP, run a port of the Virtual PC for Mac x86 dynamic translation engine upon it, run apps on that
• Save/restore PC workloads to VHD images
• (When you have many cores, you don’t mind if your simulator spends a few on dynamic translation)
Other Thoughts
• Compose optimized building blocks into synthesized (floorplanned?) system architectures
• Synplify Pro has a great RTL viewer
• MicroBlaze is an excellent, Type B core
• EDK is a great framework
  • Can plug in HW and SW components, bus masters and slaves, new CPU cores and OS and periphs, BSPs
  • EDK ships with a broad complement of cores
  • Don’t reinvent all that! EDK vs. RDL?
• QinetiQ (?) FPU IP
Comments? Thanks.