Automatic Generation of Customized Discrete Fourier Transform IPs

Automatic Generation of Customized Discrete Fourier Transform IPs Grace Nordin, Peter A. Milder, James C. Hoe, Markus Püschel Carnegie Mellon University This project is supported in part by NSF awards ITR/NGS-0325687 and SYS-0310941 and a DARPA DESA program www.spiral.net

The Paradox of Reusable IPs • Boon to productivity • zero effort required • zero knowledge required • zero chance to introduce new bugs Why repeat what has already been done? • Bane to optimality • finding the right functionality with the right interface • design tradeoff -- performance, area, power, accuracy ..... Are you getting what you really wanted? • Solution: parameterized automatic IP generators • zero effort, knowledge or bugs • allows application specific customization • facilitates design exploration

Our Work: Discrete Fourier Transform IPs • Discrete Fourier Transform (DFT) • important building block in DSP applications • numerous design “cores” available • Current IP libraries support: • various sizes, number formats, data orderings • only a small number of microarchitecture choices • (Xilinx LogiCore DFT gives 3 choices) • We generate IPs with custom design tradeoffs • degree of parallelism in microarchitecture (min  max) • resource preference (e.g. BRAM vs. slices in FPGAs) Extensible to other common linear DSP transforms

Outline • Introduction • Formula-Driven Design Generation • Microarchitecture Parameterization • Generator User Interface • Experimental Results • Conclusions

Transforms as Formulas [www.spiral.net] • Transform computation is represented as matrix-vector multiplication • Matrix-vector multiplication is O(n2) operations • “Fast” algorithms factor the transform into a sequence of structured sparse matrices • O(n log n) operations DFT: FFT: Datapath easily formed from factorized formulas

A A B A ×7 ×8 ×2 ×4 Formula to Datapath • Given where is: • apply , then • is a permutation permute • apply , times in parallel • is a diagonal scale

stage 3 stage 2 stage 1 Pease DFT butterfly • Simple regular structure embodied in formula • Example: k stages permutation diagonal parallel

x x x x x x x x x x x x Pease DFT Example: DFT8 (datapath is built left to right) stage 3 stage 2 stage 1 Repeating column structure  hardware reuse without performance penalty (formula is applied from right to left)

x x x x x x x x x x x x Horizontal folding • our baseline design • degree of freedom: vertical parallelism • parameter p p register inputbypass

Vertical (V-)folding according to p latency cost Fine-grained control over cost/latency tradeoff

common DFT options customization options User Interface http://www.spiral.net/hardware/dftgen.html

Evaluation • We compare against Xilinx LogiCore DFT Ver. 3.1 • radix-4 burst I/O interface We compare Xilinx’s fixed design against our variable generated designs • Comparison • DFT n = {64, 1024, 2048}; width = 16; bit-reversed output • Xilinx ISE ver. 6.1, Xilinx Virtex2-Pro XC2VP100-6

DFT1024 relative to Xilinx storage logic performance 1.0 = 7 BRAMs 1.0 = 1 / 5.6 µsec 1.0 = 1955 slices Xilinx Performance and resources scale with p

35 14 30 12 25 10 20 8 relative BRAMs relative slices 15 6 10 4 5 2 0 0 1 2 4 8 16 32 1 2 4 8 16 32 p p Resource usage preferences storage logic performance 1.0 = 7 BRAMs 1.0 = 1 / 5.6 µsec 1.0 = 1955 slices 6 4 speedup 2 Xilinx 0 1 2 4 8 16 32 p

Resource usage preferences storage logic performance 1.0 = 7 BRAMs 1.0 = 1 / 5.6 µsec 1.0 = 1955 slices Xilinx • exchange BRAM for slices • very little change in performance Can control tradeoff between slices and BRAMs

2048 64 DFT64 and DFT2048 1.0 = 1 transform / 0.648 µsec 1.0 = 8 BRAMs 1.0 = 1743 slices Xilinx 1.0 = 7 BRAMs 1.0 = 1 transform / 24.578 µsec 1.0 = 2140 slices Xilinx Trends hold for sizes 64, 2048

Related Work • Kumhom, Johnson, Nagvajara, ASIC/SOC 2000 • universal FFT processor microarchitecture based on processing elements interconnected by on-chip reconfigurable network • microarchitecture is scalable in the number of elements • supports both Cooley Tukey and Pease • Choi, Scrofano, Prasanna, Jang, FPGA’2003 • mapped radix-4 Cooley-Tukey algorithm onto log2(n)/2 DFT4 primitives • scalable datapath between 1 element and 4 elements at a time • show energy and performance improvements from scaling

Conclusions • Parameterized DFT IP generator • matrix formula-driven synthesis • performance/cost tradeoff • fine-grained control over resources vs. latency • resource usage preference • can balance tradeoff between slices and BRAM • Key results • efficient: the Xilinx design point can be matched • customizable: design tradeoffs directly controllable • easy to use: simple yet powerful web interface

Web Generator http://www.spiral.net/hardware/dftgen.html • This work is part of the SPIRAL project, which aims to push the limits of automation in software and hardware development for DSP algorithms.For more information visit: www.spiral.net http://www.spiral.net/hardware/dftgen.html

Automatic Generation of Customized Discrete Fourier Transform IPs