Field-Programmable Technology: Today’s and Tomorrow’s

Field-Programmable Technology: Today’s and Tomorrow’s Wayne Luk Imperial College London TWEPP-07 Prague 4 September 2007 Imperial College London April 2005

Outline: technology = devices + design 1. overview: motivation and vision 2. field-programmable devices: today - Xilinx Virtex-4, Virtex-5; Stretch S5 3. field-programmable design : today - enhance optimality and re-use 4. field-programmable devices: tomorrow - hybrid FPGA, die stacking 5. field-programmable design : tomorrow - guided synthesis, representation, upgradability 6. summary Thanks to colleagues, students and collaborators from Imperial College, University of British Columbia, Chinese University of Hong Kong, University of Massachusetts Amherst, UK Engineering and Physical Sciences Research Council, Stretch, Xilinx

1. Motivation: good - Moore’s Law • rising • capacity • speed • falling • power/MHz • price Source: Xilinx

Not so good: design productivity gap 10,000,000 100,000,000 Logic (Moore’s Law) Transistors/Chip .10m Transistor/Staff Month 1,000,000 10,000,000 58%/Yr. compound Complexity growth rate 100,000 1,000,000 Productivity Trans./Staff - Month 10,000 100,000 .35m Logic Transistors per Chip`(K) 1,000 10,000 x 100 1,000 x x x x x x 100 21%/Yr. compound Productivity growth rate 10 2.5m 10 1 1991 1999 2001 2003 2007 1987 1989 1993 1995 1997 2005 2009 1983 1985 1981 Source: SEMATECH challenge: reduce the design productivity gap

Our vision: unified synthesis + analysis application description /Progol Java/ manual/automatic partition, compile, analysis, validate system-specific programming interface FPGAs + sensors CPUs + DSPs system-specific architecture

2a. Devices today: Virtex-4 FPGA 450MHz PowerPC™ Processors with Auxiliary Processor Unit 500MHz multi-port Distributed RAM 500MHz DCM Digital Clock Management 500MHz Flexible Soft Logic Architecture 0.6-11.1Gbps Serial Transceivers 1Gbps Differential I/O 500MHz Programmable DSP Execution Units Source: Xilinx • fine-grain fabric + special function units • challenge: use resources effectively

2b. Virtex-5 FPGA • capacity • rises • logic speed • rises • I/O speed • rises • all good news? Source: Xilinx

Growing gap: amount of gates vs I/O • I/O off-chip • serial • I/O on-chip • scalable • flexible • easy to use • interconnect • heterogeneous • customisable Source: Xilinx

2c. Software configurable engine: S5 I-CACHE 32KB D-CACHE 32KB SRAM 256KB DATA RAM 32KB MMU 32-BIT RF 128-BIT WRF 32-BIT RF CONTROL ISEF Instruction-Set Extension Fabric ALU FPU ALU FPU S5 ENGINE • Wide Register File (WRF) • 32 Wide Registers (WR) • 128-bit Wide • Load/Store Unit • 128-bit Load/Store • Auto Increment/Decrement • Immediate, Indirect, Circular • Variable-byte Load/Store • Variable-bit Load/Store • RISC Processor • Tensilica – Xtensa V • 32 KB I & D Cache • On-Chip Memory, MMU • 24 Channels of DMA, FPU • ISEF • Instruction Specialization Fabric • Compute Intensive • Arbitrary Bit-width Operations • 3 Inputs and 2 Outputs • Pipelined, Bypassed, Interlocked • Random Logic Support • Internal State Registers Source: Stretch

3. Design today: overview • structural or register-transfer level (RTL) • e.g. VHDL, Verilog • low-level, little automation, small designs • behavioural, system-level descriptions • e.g. SystemC (public-domain: systemc.org) • MARTES: +UML for real-time embedded systems • general-purpose software languages • e.g. C, Java; with hardware support: Handel-C • high-level, large automation, large designs • special-purpose descriptions • e.g. System Generator (signal processing) • high-level, domain-specific optimisations

3a. Enhance optimality and re-use • design optimality: quality • select algorithm and devices: meet requirements • mapping: regular to systolic, rest to processor • I/O: dictates on-chip parallelism, buffering schemes • control speed/area/power: pipelining, layout plan • partitioning: coarse vs fine grain logic and memory • design re-use: productivity • separate aspects specific to application/technology • library of customisable components with trade-offs • compose and customise to meet requirements • uniform interface to memory and I/O: hide details • pre-verified parts: ease system verification

Example: systolic summation tree • n-input adder • tree of (n-1) 2-input adders • each adder • has k stages • each stage has s-bit adder • figure shows • k = 3 • s = 3 • high k, low s • less cycle time • more area • less power

Finance application: value-at-risk • sampling from multivariate Gaussian distribution • DSP units: matrix multiplication • systolic tree: accumulate result • RC2000 board • Virtex-4 xc4vsx55 • 400MHz 33 times faster than 2.2GHz quad Opteron (including all IO overheads, PC-FPGA communications, and using AMD optimised BLAS for software)

3b. Stretch program development CONTROL • profile code • identify hotspots • special instruction • implement ‘C’ functions in single instructions • bit-width optim. • software compiler • generate instruction • schedule instruction • multiple data (WR) • perform operations in parallel • efficient data movement • intrinsic load store operations • 20+ DMA channels COMPILED MACHINE CODE APPLICATION C/C++ NEW INSTRUCTIONS Instruction Definition Compiler TAILOR ISEF TO APPLICATION INSTRUCTION GENERATION AUTOMATIC Source: Stretch

4. Devices tomorrow: more diversed 450MHz PowerPC™ Processors with Auxiliary Processor Unit 500MHz multi-port Distributed RAM 500MHz DCM Digital Clock Management 500MHz Flexible Soft Logic Architecture 0.6-11.1Gbps Serial Transceivers 1Gbps Differential I/O Source: Xilinx 500MHz Programmable DSP Execution Units Replaced by other functional units, e.g. floating-point units

4a. Hybrid FPGA: architecture • most digital circuits • datapath: regular, word-based logic • control: irregular, bit-based logic • hybrid FPGA • customised coarse-grained block: domain-specific requirements • fine-grained blocks: existing FPGA architecture • good match to computing applications for given domain

Coarse-grained fabric library D=9, M=4, R=3, F=3, 2 add, 2 mul: best density over benchmarks

Evaluation • 6 benchmark circuits • digital signal processing kernels: e.g. bfly (for FFT) • linear algebra: e.g. matrix multiplication • complete application: e.g. bgm (financial model) • circuits: partitioned to control + datapath • control: vendor tools to fine-grained units • datapath: manually map to coarse-grained units • comparison • directly synthesized to Xilinx Virtex-II devices

Results

4b. On-chip memory bandwidth • storage hierarchy: registers, LUT RAM, block RAM • processor cache: address lack of I/O bandwidth Source: Xilinx

High density die stacking Source: Xilinx

Programmable circuit board • die stacking: 3D interchip connections • customisable system-in-package: productivity gap? Source: Xilinx

5a. Design tomorrow: guided synthesis • guided transformation of design descriptions • automate tedious and error-prone steps • applicable to various levels of abstraction • focus: two timing models • strict timing model: cycle-accurate - efficiency • flexible timing model: behavioural - productivity • combine cycle-accurate and behavioural models • rapid development with high quality • design maintainability and portability • based on high-level language • library developer: provide optimised building blocks • application developer: customise building blocks

Timing models: strict vs flexible 2 a c MUX MUX b * b * << - 0 > true false c c 2 0 = == true false c c 0 num_sol = = 1 num_sol num_sol scheduling unscheduling b a c 2 * << * - executed at cycle1 delta strict timing(Handel-C) > flexible timing 0 == 0 2 1 executed at cycle 2 0 { delta = b*b - ((a*c) << 2); if (delta > 0) num_sol = 2; else if (delta == 0) num_sol = 1; else num_sol = 0; } num_sol behavioural model - partial-ordering - abstract operations cycle accurate model - total-ordering - resource-bound

2 a c MUX MUX b a c b * b stage 1-7 * pmult[0] pmult[1] << stage 8 << 2 - tmp0 tmp1 0 > - stage 9 cycle1 true false tmp2 c c 2 0 = == stage 10 true false c c > 0 0 num_sol = = 1 == 2 1 0 num_sol num_sol num_sol delta Rapid design: automated scheduling scheduling par { // ================== [stage 1] pipe_mult[0].in(b,b); pipe_mult[1].in(a,c); // ==================[stage 8] tmp0 = pipe_mult[0].q; tmp1 = pipe_mult[1].q << 2; // ==================[stage 9] tmp2 = tmp0 - tmp1; // ==================[stage 10] if (tmp2 > 0) num_sol = 2; else if (tmp2 == 0) num_sol = 1; else num_sol = 0; delta = tmp2; } { delta = b*b - ((a*c) << 2); if (delta > 0) num_sol = 2; else if (delta == 0) num_sol = 1; else num_sol = 0; } constraints + unscheduling(flexible timing) synthesis(strict timing) • support combination of manual and automatic scheduling

Maintainability: retarget design 2 a c MUX MUX MUX MUX b a c b * b stage 1-7 * pmult [0] pmult [1] << stage 8 << 2 - tmp0 tmp1 0 > - stage 9 cycle1 true false tmp2 c c 2 0 = == stage 10 true false c c > 0 0 num_sol = = 1 == 2 1 0 num_sol num_sol num_sol delta scheduling par { { // ================= [stage 1] pipe_mult[0].in(b,b); pipe_mult[0].in(a,c); } .... } par { // ================== [stage 1] pipe_mult[0].in(b,b); pipe_mult[1].in(a,c); // ==================[stage 8] tmp0 = pipe_mult[0].q; tmp1 = pipe_mult[1].q << 2; // ==================[stage 9] tmp2 = tmp0 - tmp1; // ==================[stage 10] if (tmp2 > 0) num_sol = 2; else if (tmp2 == 0) num_sol = 1; else num_sol = 0; delta = tmp2; } synthesis(strict timing) constraints + b a c stage 1-4 cycle 1 pmult[0] pmult[0] tmp0 cycle 2 << stage 5 2 cycle 1 tmp1 - cycle 2 tmp2 stage 6 unscheduling(flexible timing) 0 > cycle 1 == 2 1 0 num_sol cycle 2

Implementation

Automatically generated results with respect to software with respect to smallest • ffd: free-form deformation; dct: discrete cosine transform

5b. Data representation optimisation • In1..In5 • known width • Out1..Out2 • width determines accuracy • defined by user • find representation • minimise width of nodes, e.g. X, Y • trade-off in speed, area, power, error

Fixed-point - range: integer - precision: fraction Floating-point - range: exponent - precision: mantissa

FIR filter and DFT: area vs error 1% more error: 65% less area

FIR filter and DFT: speed vs error 2% more error: 20% higher speed

5c. Upgradable design Number of new users upgrade 2 upgrade 1 initial release upgradable design non-upgradable design 30 20 10 Time (months) • upgradability: minimise time-to-market • maximise time-in-market • add new functions, fix bugs • very rapid upgrade? Source: Xilinx

Dynamic upgrade: turbo coder • error correction code: add redundancy • need: fast, low-power, adapt to noise level • recursive systematic convolutional (RSC) encoder, decoder, interleaver In Out Decoder Decide RSC noisy commun. channel Interleaver Interleaver Interleaver De-Interleaver Decoder RSC Turbo encoder Turbo decoder Source: Liang, Tessier, Goeckel

Self-tuning: run-time reconfiguration controller Nmax Configure FPGA LUT =? SNR current Nmax • adapt: less channel noise, so lower power • larger Nmax: better correction, more area/power • sample channel noise every 250K bits • find Signal to Noise Ratio (SNR), select Nmax • if Nmax  current Nmax, configure new bitstream • configuration overhead: about 30ms, 54.8mW Source: Liang, Tessier, Goeckel

Performance of dynamic upgrade • up to 0.5x power, 2x speed over static decoder • 100 times faster than processor decoder Source: Liang, Tessier, Goeckel

Other directions • domain-specific design automation • languages + tools: for particle physics systems? • multi-core, sensor network co-design • multiple hardware/software: FPGA + CPU + sensors • extending processor and compiler capabilities • static and dynamic optimizations, self-tuning • power-aware, radiation-aware design • transforms e.g. pipelining, damage monitoring • rapid and informative design validation • simulation + FPGA prototype + formal verification

6. Summary • good: Moore’s Law, bad: productivity gap • vision: unified design synthesis and analysis • devices and design today • growing gap: amount of I/O and amount of logic • enhance optimality and re-use: I/O driven • devices tomorrow • hybrid FPGA: multi-granularity fabric • 3D FPGA: customisable system-in-package • design tomorrow • guided synthesis: optimised and portable design • data representation optimisation • upgradable and self-tuned design

Field-Programmable Technology: Today’s and Tomorrow’s