Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform
Fang Fang, James C. Hoe, Markus Püschel, Smarahara Misra
Carnegie Mellon University
Conventional Approach: Static IP Cores
• IP cores improve productivity and reduce time-to-market.
• e.g. Xilinx LogiCore library: FFT for N = 16, 64, 256 and 1024 on 16-bit complex numbers
• May not match the application's needs:
  • parameters, speed, power, area and their trade-off.
Figure: a chip taken from the library and placed into the application.
Alternative Approach: IP Core Generation
• Generate IP cores to match specific application requirements (speed, area, power, numerical accuracy, I/O bandwidth, …)
Figure: application parameters and speed / area / power requirements feed a Generator + Evaluator, which produces optimized IP cores.
Design Space
• DSP transform design can be studied at several levels: Math, Algorithm, Architecture, Gate, Circuit.
• The more math knowledge is involved, the bigger the design space to explore.
Figure: the design space across these levels, with the designer's focus marked.
Problem
• Problem: gap between transform mathematics and hardware design
• A hardware engineer, what I know: finite state machine, pipelining, systolic array, …
• A math guy, what I know: linear algebra, digital signal processing, adaptive filter theory, …
Bridge: Formula
• Solution:
  - Formula representation of DSP transforms
  - Automated formula manipulation and mapping
• A hardware engineer, what I know: finite state machine, pipelining, systolic array, …
• A math guy, what I know: linear algebra, digital signal processing, adaptive filter theory, …
Figure: a formula example (representation, manipulation, mapping) bridging the hardware engineer and the math guy.
Outline
• Introduction
• Technical Details (illustrated by the WHT)
  • What are the degrees of design freedom?
  • How do we explore this design space?
• Experimental Results
• Summary and Future work
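Walsh-Hadamard Transform
• Why WHT?
  • Typical access pattern for a DSP transform
  • Close to the power-of-two FFT
  • Lets us study important constructs
• Definition: $\mathrm{WHT}_{2^n} = F_2 \otimes F_2 \otimes \cdots \otimes F_2$ (n-fold), with $F_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$
• Tensor product: $A \otimes B = [\, a_{k,l} \cdot B \,]$, where $A = [a_{k,l}]$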
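The definition above can be checked numerically. The following is a minimal sketch in plain Python; the function names `kron`, `wht_matrix`, and `apply_matrix` are illustrative choices and do not come from the slides.

```python
# F2 is the 2-point butterfly: one addition and one subtraction.
F2 = [[1, 1], [1, -1]]


def kron(a, b):
    """Tensor (Kronecker) product: A (x) B = [a[k][l] * B]."""
    return [
        [a_kl * b_ij for a_kl in a_row for b_ij in b_row]
        for a_row in a
        for b_row in b
    ]


def wht_matrix(n):
    """Build WHT_{2^n} as the n-fold tensor product of F2."""
    m = [[1]]
    for _ in range(n):
        m = kron(m, F2)
    return m


def apply_matrix(m, x):
    """Matrix-vector product y = M x."""
    return [sum(c * v for c, v in zip(row, x)) for row in m]


if __name__ == "__main__":
    for row in wht_matrix(2):
        print(row)
```

For example, `wht_matrix(3)` is the 8 x 8 matrix used in the slides that follow.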
From Formula to Architecture
Figure: datapath for $\mathrm{WHT}_{2^3}$ built from $F_2$ blocks; each $F_2$ block performs one addition and one subtraction.
Pease Algorithm
Figure: Pease formula for the WHT, built around the stride permutation $L^N_2$.
Pease Algorithm
• Regular routing
• Possibility for vertical folding
• Possibility for horizontal folding
• 12 $F_2$ blocks in total
Figure: Pease dataflow on inputs 0 to 7, with one $F_2$ block highlighted.
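A small sketch can make the regular routing concrete. It assumes each stage is $L^8_2 (I_4 \otimes F_2)$, the form shown on the folding slide, and it assumes that $L^N_2$ collects the even-indexed wires first and the odd-indexed wires second; that index convention is an assumption, since conventions differ between papers. The helper names are illustrative. The result is compared against the entry-wise form of the WHT, $(-1)^{\mathrm{popcount}(k \,\&\, l)}$.

```python
def f2_block(a, b):
    """One F2 block: one addition and one subtraction."""
    return a + b, a - b


def pease_stage(x):
    """One stage: F2 on neighbouring pairs, then the stride-2 routing.

    Sums land in the top half and differences in the bottom half
    (assumed convention for the stride permutation).
    """
    half = len(x) // 2
    y = [0] * len(x)
    for j in range(half):
        y[j], y[j + half] = f2_block(x[2 * j], x[2 * j + 1])
    return y


def wht_pease(x):
    """Apply the same stage n times; also count the F2 blocks used."""
    n = len(x).bit_length() - 1
    assert len(x) == 1 << n, "length must be a power of two"
    f2_blocks = 0
    for _ in range(n):
        x = pease_stage(x)
        f2_blocks += len(x) // 2
    return x, f2_blocks


def wht_reference(x):
    """Direct evaluation: entry (k, l) of the WHT is (-1)^popcount(k & l)."""
    return [
        sum(v if bin(k & l).count("1") % 2 == 0 else -v for l, v in enumerate(x))
        for k in range(len(x))
    ]


if __name__ == "__main__":
    y, blocks = wht_pease([3, 1, 4, 1, 5, 9, 2, 6])
    print(y, blocks)
```

Because every stage is identical, the flattened 8-point design needs 3 stages of 4 $F_2$ blocks each, which is where the 12 blocks come from, and the same stage hardware can be reused when folding.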
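Folding
• Horizontal Folding (HF): one stage $L^8_2 (I_4 \otimes F_2)$ with 8 ports and 4 $F_2$ blocks, repeated 3 times
• HF + Vertical Folding (VF): 1 $F_2$ block with 2 ports and a folded $L^8_2$, repeated 3 times
• Open question: how to fold $L^8_2$?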
Challenge in Vertical Folding
• How to fold these wires? $L^N_2$ has N ports; the folded $L^8_2$ has only Q ports.
• Straightforward approach: memory-based reordering
  • Extra control logic to reorder addresses
  • Computation speed is limited by memory speed
• Ad-hoc approach: register routing
  • Hard to automate the process
• Our approach: formula-based matrix factorization
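The straightforward memory-based approach can be sketched as follows. This only illustrates the baseline the slide argues against, not the authors' factorization. It assumes $L^N_2$ reads its input at stride 2 (even indices first, then odd indices), and the function names are hypothetical.

```python
def stride2_read_address(k, n_points):
    """Read address for output position k under the assumed L^N_2 convention.

    For k < N/2 this is 2k; for k >= N/2 it is 2k - N + 1, which is the
    "extra control logic" a memory-based design has to provide.
    """
    return (2 * k) % n_points + (2 * k) // n_points


def memory_reorder(stream_in, n_points, q_ports):
    """Reorder N values that arrive Q per cycle, using a buffer of N words."""
    memory = [None] * n_points
    # Write phase: store values in arrival order, Q words per cycle.
    for cycle, word in enumerate(stream_in):
        for port, value in enumerate(word):
            memory[cycle * q_ports + port] = value
    # Read phase: fetch Q words per cycle from the permuted addresses.
    stream_out = []
    for cycle in range(n_points // q_ports):
        base = cycle * q_ports
        stream_out.append(
            [memory[stride2_read_address(base + port, n_points)]
             for port in range(q_ports)]
        )
    return stream_out


if __name__ == "__main__":
    print(memory_reorder([[0, 1], [2, 3], [4, 5], [6, 7]], n_points=8, q_ports=2))
```

Every value passes through the buffer once on the way in and once on the way out, which is why the slide notes that computation speed is bounded by memory speed.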
Factorization of Stride Permutation
• $L^Q_2$ has Q input ports, with $Q = 2^q$ and $N = 2^n$
• $J_N$ can be easily folded [1]
• Example of $(L^{64}_2)_4$ (N = 64, Q = 4), with factors labelled $L^4_2$, $(J_4)_4$, $(J_{64})_4$, $(J_{32})_4$, $(J_{16})_4$, $(J_8)_4$
Figure: wiring diagram of $J_8$ on ports 0 to 7.
[1] J. H. Takala et al., "Multi-Port Interconnection Networks for Radix-R Algorithms," ICASSP 2001.
Freedom in Horizontal Folding
• $\mathrm{WHT}_{2^n}$ has n horizontal stages in the flattened design
• The divisors of n are all the possible folding degrees
• Example: HF degrees of $\mathrm{WHT}_{2^6}$ can be 1, 2, 3, 6
• Effects of a higher horizontal folding degree:
  • Less pipeline depth
  • Lower throughput
Freedom in Vertical Folding
• $\mathrm{WHT}_{2^n}$ has $2^n$ vertical ports in the flattened design
• 1, 2, 4, …, $2^{n-1}$ are all the possible folding degrees
• Example: VF degrees of $\mathrm{WHT}_{2^6}$ can be 1, 2, 4, …, 32
• Effects of a higher vertical folding degree:
  • Less I/O bandwidth
  • Longer computation time
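The two folding rules are easy to enumerate. Here is a minimal sketch with illustrative function names: horizontal degrees are the divisors of n, and vertical degrees are the powers of two up to $2^{n-1}$.

```python
def horizontal_folding_degrees(n):
    """HF degrees: all divisors of the number of stages n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def vertical_folding_degrees(n):
    """VF degrees: 1, 2, 4, ..., 2^(n-1)."""
    return [1 << k for k in range(n)]


if __name__ == "__main__":
    n = 6  # WHT on 2^6 = 64 points
    for hf in horizontal_folding_degrees(n):
        print(f"HF degree {hf}: {n // hf} stage(s) remain in hardware")
    for vf in vertical_folding_degrees(n):
        print(f"VF degree {vf}: {(1 << n) // vf} port(s) remain")
```

With n = 6 the largest vertical folding degree, 32, leaves 2 ports, which is exactly what a single $F_2$ block needs.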
Outline
• Introduction
• Technical Details
• Experimental Results
• Summary and Future work
Design Space Exploration
• Inputs to the WHT generator: transform size (64), bit-width (8), HF factor (1, 2, 3, 6), VF factor (1, 2, 4, …, 32)
• 4 HF factors × 6 VF factors = 24 different designs
• Flow: WHT Generator → Xilinx FPGA synthesis (with technology library) → Xilinx FPGA place & route → Evaluator (checked against the performance requirement)
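The 24 design points can be listed as a cross product of the folding factors given on the slide. The $F_2$-block count below is only an assumed first-order resource model (remaining stages times remaining $F_2$ blocks per stage); it is not the slice count reported by synthesis, and the dictionary keys are hypothetical names.

```python
from itertools import product

N_STAGES = 6                        # WHT on 2^6 = 64 points
HF_FACTORS = (1, 2, 3, 6)           # from the slide
VF_FACTORS = (1, 2, 4, 8, 16, 32)   # from the slide


def design_points():
    """Enumerate every (HF, VF) combination with a rough F2-block estimate."""
    points = []
    for hf, vf in product(HF_FACTORS, VF_FACTORS):
        stages = N_STAGES // hf          # stages left after horizontal folding
        ports = (1 << N_STAGES) // vf    # ports left after vertical folding
        points.append({
            "hf": hf,
            "vf": vf,
            "stages": stages,
            "ports": ports,
            # Assumed model: one F2 block per pair of ports in each stage.
            "f2_blocks": stages * (ports // 2),
        })
    return points


if __name__ == "__main__":
    points = design_points()
    print(len(points), "design points")
    for p in points:
        print(p)
```

In the flow on the slide, each generated point would be synthesized, placed and routed, and only then compared against the performance requirement; the estimate here merely shows how widely the hardware cost can vary across the space.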
Area vs. Folding Degrees
• To achieve the same area, multiple folding options are available.
Latency vs. Folding Degrees ($\mathrm{WHT}_{64}$)
• Latency is almost unaffected by HF, except when comparing the flattened design with the folded designs.
Throughput vs. Folding Degrees
• Folding always lowers throughput.
Comparison with an Existing Design
• $\mathrm{WHT}_8$
• 8-bit fixed-point
• FPGA: Xilinx Virtex xcv1000e-fg680, speed grade -8
• Compare our fastest generated design against the results reported by Amira et al. [2]
• Result: 60% more area, 80% reduction in latency, 13 times higher throughput
Figure: area (# of slices), latency (ns), and throughput (MOP/s) for our fastest design and the design in [2].
[2] A. Amira et al., "Novel FPGA Implementations of Walsh-Hadamard Transforms for Signal Processing," IEE Proceedings - Vision, Image and Signal Processing, vol. 148, no. 6, Dec. 2001.
Comparison with an Existing Design
• $\mathrm{WHT}_8$
• 8-bit fixed-point
• FPGA: Xilinx Virtex xcv1000e-fg680, speed grade -8
• Compare our smallest generated design against the results reported by Amira et al. [2]
• Result: less area, shorter latency, higher throughput
Figure: area (# of slices), latency (ns), and throughput (MOP/s) for our smallest design and the design in [2].
Summary
• Large performance variations over the design space of horizontal and vertical folding
• Automatic design space exploration through formula manipulation and mapping can find the best trade-off
Figure: performance (throughput / latency / area …) plotted over vertical folding and horizontal folding.
Future work
• More DSP transforms: DFT, DCT, DST, DWT, …
• More design decisions: pipelining, systolic array, distributed arithmetic, fixed-point vs. floating-point, …
Figure: the formula (representation, manipulation, mapping) connecting the transforms to the design decisions.
Thank you!
Contact: Fang Fang
Email: ffang@cmu.edu
URL: www.ece.cmu.edu/~ffang