Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform

Fang Fang, James C. Hoe, Markus Püschel, Smarahara Misra. Carnegie Mellon University.



  1. Generation of Custom DSP Transform IP Cores: Case Study Walsh-Hadamard Transform. Fang Fang, James C. Hoe, Markus Püschel, Smarahara Misra. Carnegie Mellon University

  2. Conventional Approach: Static IP Cores
     • IP cores improve productivity and reduce time-to-market.
     • E.g. Xilinx LogiCORE library: FFT for N = 16, 64, 256 and 1024 on 16-bit complex numbers.
     • May not match the application's needs: parameters, speed, power, area and their trade-off.
     [Figure: chip, library, application]

  3. Alternative Approach: IP Core Generation
     • Generate IP cores to match specific application requirements (speed, area, power, numerical accuracy, I/O bandwidth, …).
     [Figure: application parameters and speed / area / power requirements go into a Generator + Evaluator, which produces optimized IP cores]

  4. Design Space
     • DSP transform design can be studied at several levels.
     • More math knowledge involved → bigger design space to explore.
     [Figure: design levels Math, Algorithm, Architecture, Gate, Circuit, annotated with "Designer's focus" and "Design space"]

  5. Problem
     • Problem: gap between transform mathematics and hardware design.
     • A hardware engineer: "What I know: finite state machines, pipelining, systolic arrays, …"
     • A math guy: "What I know: linear algebra, digital signal processing, adaptive filter theory, …"

  6. Bridge: Formula
     • Solution:
       - Formula representation of DSP transforms
       - Automated formula manipulation and mapping
     • A hardware engineer: "What I know: finite state machines, pipelining, systolic arrays, …"
     • A math guy: "What I know: linear algebra, digital signal processing, adaptive filter theory, …"
     [Figure: formula example; the formula (representation, manipulation, mapping) bridges the two]

  7. Outline
     • Introduction
     • Technical Details (illustrated by the WHT)
       - What are the degrees of design freedom?
       - How do we explore this design space?
     • Experimental Results
     • Summary and Future Work

  8. Walsh-Hadamard Transform
     • Why WHT?
       - Typical access pattern for a DSP transform
       - Close to the power-of-two FFT
       - Lets us study an important construct: the tensor product ⊗
     • Definition: WHT_{2^n} = F_2 ⊗ F_2 ⊗ … ⊗ F_2 (n-fold)
     • Tensor product: A ⊗ B = [a_{k,l} · B], where A = [a_{k,l}]
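The n-fold tensor-product definition can be checked numerically. The following is a minimal sketch, not the authors' generator: it assumes the standard butterfly matrix F_2 = [[1, 1], [1, -1]] (the slide text does not spell F_2 out), and the helper names `kron`, `wht_matrix` and `apply` are made up for illustration.

```python
# Assumption: F_2 is the usual 2x2 butterfly (one addition, one subtraction).
F2 = [[1, 1],
      [1, -1]]


def kron(a, b):
    """Tensor (Kronecker) product: A (x) B = [a[k][l] * B]."""
    return [
        [a_kl * b_ij for a_kl in row_a for b_ij in row_b]
        for row_a in a
        for row_b in b
    ]


def wht_matrix(n):
    """Build WHT_{2^n} as the n-fold tensor product of F_2."""
    m = [[1]]  # 1x1 identity, the neutral element of the tensor product
    for _ in range(n):
        m = kron(m, F2)
    return m


def apply(matrix, x):
    """Plain matrix-vector product."""
    return [sum(c * v for c, v in zip(row, x)) for row in matrix]


if __name__ == "__main__":
    for row in wht_matrix(2):
        print(row)
    print(apply(wht_matrix(2), [1, 2, 3, 4]))
```

Building the full matrix costs O(N^2) operations and is only useful as a reference; a hardware-oriented algorithm factors it into stages of F_2 blocks instead.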

  9. From Formula to Architecture
     [Figure: dataflow of WHT_{2^3}; an F_2 block consists of one addition and one subtraction]

  10. Pease Algorithm
     [Figure: Pease formula for the WHT, with the stride permutation L^N_2 highlighted]

  11. Pease Algorithm
     • Regular routing
     • Possibility for vertical folding
     • Possibility for horizontal folding
     [Figure: Pease dataflow on inputs 0 to 7; an F_2 block is marked; 12 F_2 blocks total]
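The regular structure of the Pease algorithm can be sketched in software: every stage is the same column of F_2 blocks followed by the same stride permutation. This is an illustrative sketch only. It assumes the convention that L^N_2 gathers the even-indexed elements first and then the odd-indexed ones, and the function names are hypothetical.

```python
def f2_column(x):
    """I_{N/2} (x) F_2: one F_2 block (add, subtract) per adjacent pair."""
    out = []
    for i in range(0, len(x), 2):
        out += [x[i] + x[i + 1], x[i] - x[i + 1]]
    return out


def stride_permute(x):
    """Assumed convention for L^N_2: even-indexed elements, then odd-indexed."""
    return x[0::2] + x[1::2]


def pease_wht(x):
    """WHT of a length-2^n list using n identical Pease stages."""
    size = len(x)
    if size == 0 or size & (size - 1):
        raise ValueError("length must be a power of two")
    n = size.bit_length() - 1
    y = list(x)
    for _ in range(n):  # every stage has the same structure
        y = stride_permute(f2_column(y))
    return y


def f2_block_count(n):
    """F_2 blocks in the flattened design: n stages of 2^(n-1) blocks."""
    return n * 2 ** (n - 1)


if __name__ == "__main__":
    print(pease_wht([1, 2, 3, 4]))
    print(f2_block_count(3))  # the WHT_{2^3} dataflow uses 12 F_2 blocks
```

Because all stages are identical, the loop body corresponds to one hardware stage that could be reused across iterations, which is what makes this formulation attractive for folding.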

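  12. Folding
     • Horizontal Folding (HF): one stage of 4 F_2 blocks (I_4 ⊗ F_2) with L^8_2 on 8 ports, repeated 3 times
     • HF + Vertical Folding (VF): 1 F_2 block with a folded L^8_2 on 2 ports, repeated 3 times
     [Figure: flattened design folded horizontally, then vertically]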

  13. Challenge in Vertical Folding
     • How to fold these wires? L^N_2 has N ports; the folded version has only Q ports.
     • Straightforward approach: memory-based reordering
       - Extra control logic to reorder addresses
       - Computation speed is limited by memory speed
     • Ad-hoc approach: register routing
       - Hard to automate the process
     • Our approach: formula-based matrix factorization
     [Figure: L^N_2 (N ports) next to a folded L^8_2 (Q ports)]

  14. Factorization of Stride Permutation
     • Q = 2^q, N = 2^n
     • L^Q_2 has Q input ports
     • J_N can be easily folded [1]
     • Example of (L^64_2)_4 (N = 64, Q = 4). Factors shown: L^4_2, (J_4)_4, (J_8)_4, (J_16)_4, (J_32)_4, (J_64)_4
     [Figure: wiring of J_8 on ports 0 to 7]
     [1] J. H. Takala et al., "Multi-Port Interconnection Networks for Radix-R Algorithms," ICASSP 2001

  15. Freedom in Horizontal Folding
     • WHT_{2^n} has n horizontal stages in the flattened design
     • The divisors of n are all the possible folding degrees
     • Example: HF degrees of WHT_{2^6} can be 1, 2, 3, 6
     • Effect of a higher horizontal folding degree: less pipeline depth → lower throughput
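The set of horizontal folding degrees can be enumerated directly from the divisor rule. This small sketch assumes that a folding degree d means each physical stage is reused d times, so n / d stages remain in hardware; the function names are illustrative only.

```python
def hf_degrees(n):
    """Possible horizontal folding degrees of WHT_{2^n}: the divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def stages_after_hf(n, degree):
    """Physical stages left after folding (assumes each is reused `degree` times)."""
    if n % degree:
        raise ValueError("folding degree must divide n")
    return n // degree


if __name__ == "__main__":
    print(hf_degrees(6))
    print([stages_after_hf(6, d) for d in hf_degrees(6)])
```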

  16. Freedom in Vertical Folding
     • WHT_{2^n} has 2^n vertical ports in the flattened design
     • 1, 2, 4, …, 2^(n-1) are all the possible folding degrees
     • Example: VF degrees of WHT_{2^6} can be 1, 2, 4, …, 32
     • Effect of a higher vertical folding degree: less I/O bandwidth → longer computation
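The vertical folding degrees follow the power-of-two rule, and the remaining port count shrinks accordingly. A minimal sketch, assuming a folding degree v divides the 2^n ports of the flattened design down to 2^n / v ports (function names are hypothetical):

```python
def vf_degrees(n):
    """Possible vertical folding degrees of WHT_{2^n}: 1, 2, 4, ..., 2^(n-1)."""
    return [2 ** k for k in range(n)]


def ports_after_vf(n, degree):
    """I/O ports left after folding (assumes 2^n ports shared `degree` ways)."""
    if degree not in vf_degrees(n):
        raise ValueError("folding degree must be a power of two below 2^n")
    return 2 ** n // degree


if __name__ == "__main__":
    print(vf_degrees(6))
    print([ports_after_vf(6, v) for v in vf_degrees(6)])
```

At the maximum degree 2^(n-1), two ports remain, which is the width of a single F_2 block.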

  17. Outline • Introduction • Technical Details • Experimental Results • Summary and Future work

  18. Design Space Exploration
     • Generator inputs: transform size (64), bit-width (8)
     • HF factor (1, 2, 3, 6) × VF factor (1, 2, 4, …, 32) = 24 different designs
     [Figure: flow with WHT Generator, Technology Library, Xilinx FPGA Synthesis, Xilinx FPGA Place & Route, Evaluator, Performance requirement]
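The 24 design points for the size-64 transform can be enumerated as the Cartesian product of the two folding factors. The sketch below also attaches a rough F_2 block count per design; that count is a first-order model assumed here for illustration (n / h stages of 2^(n-1) / v blocks), not the slice counts measured after synthesis and place & route.

```python
from itertools import product


def hf_degrees(n):
    """Horizontal folding degrees: the divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def vf_degrees(n):
    """Vertical folding degrees: 1, 2, 4, ..., 2^(n-1)."""
    return [2 ** k for k in range(n)]


def design_space(n):
    """Map every (HF, VF) pair to an estimated number of F_2 blocks.

    Assumed model: n / hf physical stages, each with 2^(n-1) / vf blocks.
    """
    return {
        (hf, vf): (n // hf) * (2 ** (n - 1) // vf)
        for hf, vf in product(hf_degrees(n), vf_degrees(n))
    }


if __name__ == "__main__":
    designs = design_space(6)  # WHT of size 64
    print(len(designs), "designs")
    for (hf, vf), blocks in sorted(designs.items()):
        print(f"HF={hf:<2} VF={vf:<2} F2 blocks ~ {blocks}")
```

Under this model several (HF, VF) pairs yield the same block count, for example (2, 1) and (1, 2), so designs of similar size can still differ in latency and throughput.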

  19. Area vs. Folding Degrees: to achieve the same area, multiple folding options are available.

  20. Latency vs. Folding Degrees (WHT64)

  21. Latency vs. Folding Degrees (WHT64)

  22. Latency vs. Folding Degrees (WHT64): latency is almost unaffected by HF, except when comparing the flattened design with the folded designs.

  23. Throughput vs. Folding Degrees: folding always lowers throughput.

  24. Comparison with an Existing Design
     • WHT8, 8-bit fixed-point
     • FPGA: Xilinx Virtex xcv1000e-fg680, speed grade -8
     • Compare our fastest generated design against the results reported by Amira et al. [2]
     • Result: 60% more area, 80% reduction in latency, 13 times higher throughput
     [Table: rows "Our fastest design" and "Design in [2]"; columns Area (# of slices), Latency (ns), Throughput (MOP/s)]
     [2] A. Amira et al., "Novel FPGA Implementations of Walsh-Hadamard Transforms for Signal Processing," IEE Proceedings - Vision, Image and Signal Processing, vol. 148, no. 6, Dec. 2001

  25. Comparison with an Existing Design
     • WHT8, 8-bit fixed-point
     • FPGA: Xilinx Virtex xcv1000e-fg680, speed grade -8
     • Compare our smallest generated design against the results reported by Amira et al. [2]
     • Result: less area, shorter latency, higher throughput
     [Table: rows "Our smallest design" and "Design in [2]"; columns Area (# of slices), Latency (ns), Throughput (MOP/s)]

  26. Summary
     • Large performance variations (throughput / latency / area …) over the design space of horizontal and vertical folding
     • Automatic design space exploration through formula manipulation and mapping can find the best trade-off
     [Figure: performance plotted over vertical folding and horizontal folding]

  27. Future Work
     • More DSP transforms: DFT, DCT, DST, DWT, …
     • More design decisions: pipelining, systolic array, distributed arithmetic, fixed-point vs. floating-point, …
     [Figure: formula (representation, manipulation, mapping) connecting transforms and design decisions]

  28. Thank you! Contact: Fang Fang Email: ffang@cmu.edu URL: www.ece.cmu.edu/~ffang
