
FlexFilm An Image Processor for Digital Film Processing

Schloss Dagstuhl, Germany, April 02 – 07, 2006. Amilcar do Carmo Lucas, Sven Heithecker, Rolf Ernst. Technische Universität Braunschweig (Technical University of Braunschweig), Germany.


Presentation Transcript


  1. FlexFilm: An Image Processor for Digital Film Processing • Schloss Dagstuhl, Germany, April 02 – 07, 2006 • Amilcar do Carmo Lucas, Sven Heithecker, Rolf Ernst • Technische Universität Braunschweig (Technical University of Braunschweig), Germany

  2. Outline • Motivation • Film Grain Reduction • FlexWAFE Architecture / Library • FlexFilm Hardware • SDRAM-Controller: QoS, detailed view, configurability • Conclusion

  3. Motivation • Application: digital film image processing • Features • High data volumes • Low resolution (2K): 2048 x 2048 pixels at 30 bpp = 120 Mbit per frame • High resolution (4K): 4096 x 4096 pixels at 48 bpp = 768 Mbit per frame • Low latency (to allow user interactivity) • Real time is 24 frames per second → up to 18 Gbit/s • First application: film grain noise reduction
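The frame sizes and the peak data rate on this slide can be checked with a few lines of arithmetic. The sketch below assumes binary prefixes (1 Mbit = 2^20 bit, 1 Gbit = 2^30 bit), which is the convention under which the slide's round numbers come out exactly; the function and constant names are illustrative only.

```python
MBIT = 2**20  # assumption: the slide uses binary prefixes
GBIT = 2**30


def frame_bits(width, height, bits_per_pixel):
    """Size of one uncompressed frame in bits."""
    return width * height * bits_per_pixel


def stream_bits_per_second(width, height, bits_per_pixel, fps=24):
    """Raw data rate of an uncompressed stream at the given frame rate."""
    return frame_bits(width, height, bits_per_pixel) * fps


frame_2k = frame_bits(2048, 2048, 30)
frame_4k = frame_bits(4096, 4096, 48)
rate_4k = stream_bits_per_second(4096, 4096, 48)

print(frame_2k / MBIT)  # 120.0 Mbit per 2K frame
print(frame_4k / MBIT)  # 768.0 Mbit per 4K frame
print(rate_4k / GBIT)   # 18.0 Gbit/s for 4K at 24 fps
```

With decimal prefixes the same frames come to roughly 126 Mbit and 805 Mbit, so the slide's figures are best read as binary-prefixed.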

  4. Film grain noise reduction application • Required as a first step in other image processing algorithms • Next-generation NR requires more processing power than state-of-the-art systems provide [Images: frame with film grain; de-noised frame]

  5. Film grain noise reduction application [Block diagram, flattened in extraction: bi-directional motion compensation with frame delays, a temporal 1D DWT, 2D DWTs built from H FIR and V FIR stages with buffers and sync, noise reduction (NR) on the HL, LH and HH subbands, and the matching inverse transforms (2D DWT-1, temporal 1D DWT-1) producing the de-noised image. Annotations: 500 Mop/s, 100 Kbit, 8 Mbit, 3 Mbit, 36 BPP, 30 BPP, interframe, 7 DWT, 7 IDWT, 640 Mbit, 160 Gop/s]

  6. Film grain noise reduction application • Complexity (2K @ 24 fps) • 180 G add/s • 11 G mul/s • 12 G comp/s • 660 Mbit • 3 Gbit/s in • 3 Gbit/s out • Too complex for state-of-the-art PC or DSP based systems • Texas Instruments fixed-point DSP: max 0.8 Gop/s @ 1 GHz • Stanford Imagine ASIC: 18 Gop/s average @ 400 MHz • Memory footprint too big for ASIC or FPGA • Imagine ASIC has 1 Mbit • Xilinx Virtex-II Pro 50 FPGA has 4.1 Mbit • Algorithm not appropriate for GPUs • NVIDIA can only do motion estimation on standard-resolution video in real time
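To put the complexity figures in proportion, the sketch below adds up the quoted operation rates and compares them with the two platform throughputs and on-chip memory sizes from the slide. It assumes that an add, a multiply and a compare each count as one operation and that work scales perfectly across devices; both are simplifications, and all names are illustrative.

```python
import math

# Figures quoted on the slide, in G operations per second (2K @ 24 fps).
required_gops = {"add": 180, "mul": 11, "comp": 12}

# Throughput quoted on the slide for the two reference platforms, in Gop/s.
platform_gops = {
    "TI fixed-point DSP @ 1 GHz": 0.8,
    "Stanford Imagine ASIC @ 400 MHz": 18,
}

# Assumption: one add, one mul and one compare each count as one operation.
total_gops = sum(required_gops.values())


def devices_needed(total, per_device):
    """Lower bound on the device count, assuming perfect parallel scaling."""
    return math.ceil(total / per_device)


for name, gops in platform_gops.items():
    print(f"{name}: at least {devices_needed(total_gops, gops)} devices")

# On-chip memory quoted on the slide (Mbit) against the 660 Mbit footprint.
footprint_mbit = 660
on_chip_mbit = {"Imagine ASIC": 1, "Xilinx Virtex-II Pro 50": 4.1}
shortfall = {name: footprint_mbit / mbit for name, mbit in on_chip_mbit.items()}
for name, factor in shortfall.items():
    print(f"{name}: footprint is about {factor:.0f}x the on-chip memory")
```

Even under these optimistic assumptions the workload needs hundreds of DSPs or about a dozen Imagine-class ASICs, and the working set exceeds on-chip memory by two orders of magnitude, which is the slide's argument for external SDRAM.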

  7. Project Goal Develop a reusable image processing platform for digital film production • Provide the bandwidth and processing power required by future algorithms • Shorten the design cycle of new applications Project members and tasks • Grass Valley Germany - Board design • TU-Braunschweig - FPGA programming and internal architecture • TU-Ilmenau - Example algorithm: Film grain noise reduction

  8. System-Wide Communication • PCI Express is packet based and time-driven (TDMA) • Scalable and standard in PCs [Diagram: four Image Engine Cores, each with I/O and an I/O bridge, connected through a PCIe switch (PCI Express network) to the host interface (PCI Express)]

  9. Inter FPGA Communication • 4 x 16 bit, 250 MHz DDR • Separation of logical and physical channels [Diagram: two FPGAs at 125 MHz, each with a TDMA send/receive unit; a PCI Express schedule and an FPGA schedule with slot labels 2 1 1 2 2 2 1 3 3 3 2 2]

  10. FlexWAFE Architecture • FlexWAFE: Flexible Weakly-programmable Advanced Film Engine • Real-time SDRAM controller with priorities and traffic shaping • PowerPC 405 CPU, 16 kB cache, MMU • Flexible, configurable data paths with local memories and address generators • Blocks with local connections facilitate user-assisted floorplanning (e.g. PlanAhead) [Diagram: physical FPGA structure. Each FlexWAFE FPGA contains an SDRAM controller attached to external SDRAM, caches, 2x CPU, data paths (DP) with local RAMs, and an I/O processor / PCIe bridge]

  11. LMC for MC-DPU streams • CMC address and data stream: large address space • Local DPU data stream: small address space • Central controller plus fast local controllers allow a one-cycle context switch • Programming via a “slow” parameter bus • Low routing effort required • Similar to Prof. Hartenstein's self-addressed memories [Diagram: off-chip SDRAM and memory controller feed LMCs on the FPGA; each LMC has ingress and egress address generators (AG), dual-ported block RAM, parameter registers and a local controller, and serves a DPU; a VLIW algorithm controller drives the parameter bus (data + address) and receives feedback (“done” signals). Legend: address bus, data bus, parameter bus]

  12. 3D wavelet NR: mapping to FlexFilm board • Smaller, cheaper, faster than existing solutions • FlexWAFE FPGAs: Virtex-II Pro V50-6, 23616 slices, 4.1 Mbit RAM, 2 PPC [Diagram: a router FPGA with a PCIe 4x link to the host PC, and FlexWAFE FPGAs running bi-directional motion estimation, motion compensation (next, current and previous frames), Haar filter, wavelet transform, noise reduction, inverse wavelet transform and inverse Haar filter; 2 Gbit SDRAMs on 14 Gb/s links; 8 Gb/s links between FPGAs; labels 90%, 84%, 86%]

  13. Motion estimation/compensation example • The harmless 3 Gbit/s data input grows to 20 Gbit/s memory I/O • Loose floorplanning is required to achieve the desired speed • Real-time capable MCs obviate the memory bottleneck • Weakly programmable LMCs • allow design reuse • low-overhead run-time reconfiguration • Tight floorplanning was used in the ME (facilitated by its regular structure) • Small ME algorithm controller • 575 parameters for bidir-ME [Diagram: FPGA floorplan with logic channel in/out (TDMA), RGB -> Y conversion, ME A and ME B (512 x), motion compensation (MC), LMCs, memory controllers, algorithm controllers and two PowerPCs; SDRAMs of 480 Mbit and 120 Mbit; link labels 1, 3, 6 and 9 Gbit/s]

  14. FlexFilm vs. Intel Pentium 4 • This shows that an FPGA full of adders can actually do useful work (motion estimation). [Comparison data not captured in the transcript]

  15. Memory Controller • External SDRAM: • internal FPGA memory too small • one 2K image: ~20 MByte • XC2VP50: ~0.5 MByte • high bandwidth required • data path and CPU access the same memory • different access patterns! [Diagram: FlexWAFE FPGA with SDRAM controller, caches, 2x CPU and data paths (DP), attached to external SDRAM]

  16. Different Memory Access Patterns • CPU • irregular access patterns • prefetch not possible • buffers don’t help • stalls at memory access • system performance loss • short latency needed • every nanosecond counts • in case of real-time: latency must be bounded • QoS requirement: minimum possible latency • Data Path • regular access patterns • prefetch possible • long latency allowed • compensated by buffers • real-time operation: latency must be bounded • QoS requirement: guaranteed minimum throughput at guaranteed maximum latency

  17. Related Work • Designs from • Sonics MemMax controller [Sonics Inc., Weber] • MediaTek Corp. [Lee, Lin] / Chiao Tung University [Jen] • different service levels • similar architectures • complex ASIC-based approach

  18. QoS Implementation • 2 QoS components • access prioritization (presented at SiPS 2003, Seoul) • low-latency requests get a higher priority • high-priority requests are executed before low-priority requests • traffic shaping (presented at DAC 2005, Anaheim) • restrict bandwidth of high-priority requests • to prevent starvation of low-priority requests • allow complex traffic shaping patterns to support bursts • N requests in a window of T clock cycles, “Leaky Bucket”
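The interaction of the two QoS components can be sketched in software. The model below is a sliding-window reading of "N requests in a window of T clock cycles" combined with strict two-level priority; it is not the hardware leaky-bucket implementation from the cited papers, and the class and function names (`WindowShaper`, `arbitrate`) are hypothetical.

```python
from collections import deque


class WindowShaper:
    """Admit at most n requests in any window of t consecutive clock cycles."""

    def __init__(self, n, t):
        self.n = n
        self.t = t
        self.grants = deque()  # cycles at which requests were admitted

    def may_issue(self, cycle):
        # Forget grants that have left the window ending at this cycle.
        while self.grants and self.grants[0] <= cycle - self.t:
            self.grants.popleft()
        return len(self.grants) < self.n

    def record(self, cycle):
        self.grants.append(cycle)


def arbitrate(cycles, high_pending, low_pending, shaper):
    """Serve one request per cycle; high priority wins unless the shaper blocks it."""
    schedule = []
    for cycle in range(cycles):
        if high_pending > 0 and shaper.may_issue(cycle):
            shaper.record(cycle)
            high_pending -= 1
            schedule.append("H")
        elif low_pending > 0:
            low_pending -= 1
            schedule.append("L")
        else:
            schedule.append("-")
    return schedule


# Both queues are saturated; the shaper allows 2 high-priority requests per 4 cycles.
print("".join(arbitrate(8, 8, 8, WindowShaper(n=2, t=4))))  # HHLLHHLL
# With the shaper effectively disabled (n == t), low priority is starved.
print("".join(arbitrate(8, 8, 8, WindowShaper(n=4, t=4))))  # HHHHHHHH
```

The first run shows the intended effect: high-priority requests still go first and may arrive as a short burst, but low-priority traffic keeps a guaranteed share of the cycles.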

  19. Memory Controller Main Features • Main Features • full burst accesses, burst length 8 (4 clock cycles) • uses auto-precharge access mode • rows are closed automatically after access • much simpler (and faster) design • much simpler address generation • access optimization • bank interleaving (hides row activation latency) • request bundling (minimizes bus direction switches) • possible out-of-order execution of memory requests to improve bandwidth • round-robin scheduling • guarantees fairness
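The fairness property of round-robin scheduling is easy to see in a small model. The sketch below is a generic round-robin arbiter over request ports, not the controller's actual scheduler; `RoundRobinArbiter` and its interface are assumptions made for illustration.

```python
class RoundRobinArbiter:
    """Grant one requesting port per call, rotating the starting point."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.next_port = 0  # port with the highest priority on the next call

    def grant(self, requesting):
        """Return the granted port index, or None if no port is requesting."""
        for offset in range(self.num_ports):
            port = (self.next_port + offset) % self.num_ports
            if port in requesting:
                # The port after the winner gets first pick next time.
                self.next_port = (port + 1) % self.num_ports
                return port
        return None


arbiter = RoundRobinArbiter(num_ports=3)
print([arbiter.grant({0, 1, 2}) for _ in range(6)])  # [0, 1, 2, 0, 1, 2]

arbiter = RoundRobinArbiter(num_ports=3)
print([arbiter.grant({0, 2}) for _ in range(4)])     # [0, 2, 0, 2]
```

Because the search always starts one position past the last winner, a port that keeps requesting waits at most `num_ports - 1` grants, which is what bounds its latency within a priority level.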

  20. Memory Controller Block Diagram

  21. Simulation Environment • CPU: PowerPC, ARM, with caches, running pegwit decode • Image data path: 2048 x 1556 x 24 FPS, 16 bit grayscale; image input and output; 3-level Discrete Wavelet Transformation (DWT) • SDRAM controller with flow control, DDR-SDRAM I/O • Written in SystemC at transaction level, (mostly) cycle accurate

  22. Simulation Results: Traffic Shaping • n: n consecutive requests • Tø = T / n: avg. clock cycles between requests • Evaluation • Traffic shaping very efficient • n > 1 hardly more efficient than n = 1; reason: blocking read cache transactions • Required memory accesses per cacheline fill: PPC 1, ARM 4 [Result plots not captured in the transcript]

  23. SDRAM Controller Resources • Configuration • application ports • high priority: 1 read, 1 write; 32 bit • standard priority: 2 read, 5 write; 32 bit • 32 bit DDR-SDRAM, 4 banks • flow control for high-priority ports • Performance: 125 MHz

  24. Configurability • Configuration • required to support multiple applications • DWT based Noise Reduction requires 9 controllers and 5 different controller configurations • done at synthesis time • usage of generics • only a few general constants • no code changes required

  25. Configurability • Configuration Options • SDRAM timing and layout • data bus sizes, address bus sizes, number of chip selects, … • application ports • number of ports, up to 8 per priority level • per port data and address width • port address translation • per-port configuration of address translation tables • number of entries per table and table contents • global physical / bank-row-column translation style • flow control • T (window size) and N (number of requests) • disable
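How a physical-to-bank/row/column translation style enables bank interleaving can be shown with a small address decoder. The sketch below places the bank bits directly above the burst offset, so that consecutive bursts of 8 words fall into different banks; the bit widths and the function name `decode` are assumptions for illustration (only the 4 banks and burst length 8 come from the slides), not the controller's real layout.

```python
def decode(addr, burst_bits=3, bank_bits=2, col_bits=10):
    """Split a word address into (row, bank, column).

    Assumed layout, from least to most significant bits:
    burst offset | bank | remaining column bits | row.
    """
    burst = addr & ((1 << burst_bits) - 1)
    bank = (addr >> burst_bits) & ((1 << bank_bits) - 1)
    upper = addr >> (burst_bits + bank_bits)
    col_hi_bits = col_bits - burst_bits
    # Recombine the upper column bits with the burst offset.
    column = ((upper & ((1 << col_hi_bits) - 1)) << burst_bits) | burst
    row = upper >> col_hi_bits
    return row, bank, column


# A linear stream of burst-aligned addresses cycles through the four banks.
for addr in range(0, 40, 8):
    print(addr, decode(addr))
# 0  -> (0, 0, 0)
# 8  -> (0, 1, 0)
# 16 -> (0, 2, 0)
# 24 -> (0, 3, 0)
# 32 -> (0, 0, 8)
```

With auto-precharge closing each row after its burst, this ordering lets the controller activate the next bank while the current one is still transferring data, which is the latency hiding that bank interleaving relies on.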

  26. Configurability • Example: Motion Estimation / Compensation FPGA • Controller ME 1, Controller ME 2 • 32 bit • 1 write port • 3 read ports • Controller MC • 64 bit • 1 write port • 2 read ports [Diagram: FPGA floorplan with TDMA in/out, RGB-Y, ME A, ME B, MC, LMCs and two PowerPCs]

  27. FlexFilm / FlexWAFE conclusions • Communication Centric Architecture • PCI Express network interconnect allows easy scaling • separation of logical and physical channels for real-time inter-FPGA communication • small reconfigurable self-addressed local memories • FlexWAFE Architecture • weakly-programmable paradigm allows fast, context-switch-like dynamic reconfiguration • library of hand-optimized data path modules allows fast application development • SDRAM Controller • QoS increases overall system performance • lean but efficient controller design • easy configuration

  28. Thank you for your attention. For more information please visit: http://www.flexfilm.org
