FlexFilm: An Image Processor for Digital Film Processing
Schloss Dagstuhl, Germany, April 02 – 07, 2006
Amilcar do Carmo Lucas, Sven Heithecker, Rolf Ernst
Technical University of Braunschweig, Germany (Technische Universität Braunschweig)
Outline
• Motivation
• Film Grain Reduction
• FlexWAFE Architecture / Library
• FlexFilm Hardware
• SDRAM Controller: QoS, detailed view, configurability
• Conclusion
Motivation
Application
• Digital film image processing
Features
• High data volumes
  • Low resolution (2K): 2048 x 2048 / 30 bpp = 120 Mbit per frame
  • High resolution (4K): 4096 x 4096 / 48 bpp = 768 Mbit per frame
• Low latency (allow user interactivity)
• Real time is 24 frames per second → up to 18 Gbit/s (see the quick check below)
First application
• Film grain noise reduction
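A quick check of the quoted peak rate, derived only from the 4K figures on this slide:

\[
4096 \times 4096\ \text{px} \times 48\ \text{bpp} = 768\ \text{Mbit/frame}, \qquad
768\ \text{Mbit/frame} \times 24\ \text{fps} \approx 18\ \text{Gbit/s}.
\]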
Film grain noise reduction application
• Required as a first step before other image processing algorithms
• Next-generation NR requires more processing power than state-of-the-art systems provide
[Image pair: with film grain / de-noised]
Film grain noise reduction application
[Dataflow diagram: temporal 1D DWT and its inverse (with frame delays), 2D DWT and inverse 2D DWT built from horizontal and vertical FIR / inverse FIR filters, per-subband (HL, LH, HH) noise reduction, bi-directional motion compensation, sync and buffer stages, producing the de-noised image. Annotated figures: 500 Mop/s, 100 Kbit, 8 Mbit and 3 Mbit buffers, 36 BPP / 30 BPP data paths, 640 Mbit interframe storage, 7 DWT + 7 IDWT, 160 Gop/s.]
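The core of the diagram is a DWT → coefficient noise reduction → inverse DWT chain. Purely as an illustration of that idea (not the FlexFilm datapath), the sketch below applies a single-level 1D Haar DWT, soft-thresholds the detail coefficients and reconstructs; the toy signal, the threshold value and the helper names are assumptions.

```cpp
// Minimal sketch of the DWT -> coefficient thresholding -> inverse DWT idea
// behind wavelet-based grain/noise reduction. Uses a single-level 1D Haar
// transform for brevity; the real pipeline uses separable 2D FIR filter banks
// plus a temporal (Haar) transform and per-subband noise reduction.
#include <cmath>
#include <cstdio>
#include <vector>

// Single-level Haar DWT: split the signal into approximation (low-pass)
// and detail (high-pass) coefficients.
static void haar_dwt(const std::vector<double>& x,
                     std::vector<double>& approx, std::vector<double>& detail) {
    for (std::size_t i = 0; i + 1 < x.size(); i += 2) {
        approx.push_back((x[i] + x[i + 1]) / std::sqrt(2.0));
        detail.push_back((x[i] - x[i + 1]) / std::sqrt(2.0));
    }
}

// Soft thresholding: shrink small detail coefficients (assumed to be noise).
static double soft_threshold(double c, double t) {
    if (c >  t) return c - t;
    if (c < -t) return c + t;
    return 0.0;
}

// Inverse single-level Haar DWT.
static void haar_idwt(const std::vector<double>& approx,
                      const std::vector<double>& detail, std::vector<double>& x) {
    for (std::size_t i = 0; i < approx.size(); ++i) {
        x.push_back((approx[i] + detail[i]) / std::sqrt(2.0));
        x.push_back((approx[i] - detail[i]) / std::sqrt(2.0));
    }
}

int main() {
    std::vector<double> noisy = {10, 11, 10, 12, 50, 51, 50, 52};  // toy signal
    std::vector<double> a, d, out;
    haar_dwt(noisy, a, d);
    for (double& c : d) c = soft_threshold(c, 1.0);   // threshold value is an assumption
    haar_idwt(a, d, out);
    for (double v : out) std::printf("%.2f ", v);
    std::printf("\n");
    return 0;
}
```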
Film grain noise reduction application
• Complexity (2K @ 24 fps)
  • 180 G add/s, 11 G mul/s, 12 G comp/s
  • 660 Mbit memory, 3 Gbit/s in, 3 Gbit/s out
• Too complex for state-of-the-art PC- or DSP-based systems
  • Texas Instruments fixed-point DSP: max 0.8 Gop/s @ 1 GHz
  • Stanford Imagine ASIC: 18 Gop/s average @ 400 MHz
• Memory footprint too big for ASIC or FPGA on-chip memory
  • Imagine ASIC has 1 Mbit
  • Xilinx Virtex-II Pro 50 FPGA has 4.1 Mbit
• Algorithm not appropriate for GPUs
  • NVIDIA can only do motion estimation on standard-resolution video in real time
Project Goal Develop a reusable image processing platform for digital film production • Provide the bandwidth and processing power required by future algorithms • Shorten the design cycle of new applications Project members and tasks • Grass Valley Germany - Board design • TU-Braunschweig - FPGA programming and internal architecture • TU-Ilmenau - Example algorithm: Film grain noise reduction
System-Wide Communication
[Diagram: four Image Engine Cores, each with I/O bridges and I/O ports, connected through a PCIe switch; PCI Express network with host interface (PCI Express).]
• PCI Express is time-driven, packet based (TDMA)
• Scalable and standard in PCs
Inter FPGA Communication
[Diagram: two FPGAs running at 125 MHz connected by 4 x 16 bit links at 250 MHz DDR plus PCI Express; TDMA send/receive blocks driven by per-FPGA schedules that assign numbered logical channels (1, 2, 3) to time slots.]
• Separation of logical and physical channels
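To illustrate the separation of logical and physical channels, here is a minimal software sketch: a fixed slot schedule decides which logical channel may use the shared physical link in each time slot. The schedule contents, channel count and payload values are invented; the real hardware does this per clock cycle on the DDR links.

```cpp
// Minimal sketch of TDMA-style separation of logical and physical channels:
// several logical channels share one physical inter-FPGA link according to a
// static slot schedule. Schedule, FIFO depths and payloads are assumptions.
#include <cstdio>
#include <queue>
#include <vector>

int main() {
    // One transmit FIFO per logical channel, all sharing one physical link.
    std::vector<std::queue<unsigned>> tx_fifo(4);
    for (unsigned i = 0; i < 8; ++i) tx_fifo[1].push(0x100 + i);  // channel 1 traffic
    for (unsigned i = 0; i < 4; ++i) tx_fifo[2].push(0x200 + i);  // channel 2 traffic
    for (unsigned i = 0; i < 2; ++i) tx_fifo[3].push(0x300 + i);  // channel 3 traffic

    // Static TDMA schedule: logical channel ID per slot (repeats every 6 slots).
    const int schedule[] = {2, 1, 1, 2, 3, 2};

    for (int slot = 0; slot < 12; ++slot) {
        int ch = schedule[slot % 6];
        if (!tx_fifo[ch].empty()) {
            std::printf("slot %2d: ch %d -> 0x%03X\n", slot, ch, tx_fifo[ch].front());
            tx_fifo[ch].pop();
        } else {
            std::printf("slot %2d: ch %d idle\n", slot, ch);
        }
    }
    return 0;
}
```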
FlexWAFE Architecture
[Diagram: physical FPGA structure; FlexWAFE FPGA with SDRAM controller, external RAM/SDRAM, 2x CPU with caches, data paths built from DP blocks, PCIe bridge / I/O processor and I/O.]
• Flexible Weakly-programmable Advanced Film Engine
• Real-time SDRAM controller with priorities and traffic shaping
• PowerPC 405 CPU, 16 kB caches, MMU
• Flexible, configurable data paths with local memories and address generators
• Blocks with local connections facilitate the usage of user-assisted floorplanning (e.g. PlanAhead)
LMC
[Diagram: LMCs for MC-DPU streams; CMC address and data stream (large address space, off-chip SDRAM) versus local DPU data stream (small address space, dual-ported block RAM); LMC ingress and egress address generators (AGs); parameter bus (data + address) feeding parameter registers and local controllers; address bus; feedback ("done" signals); data bus; memory controller and VLIW algorithm controller.]
• A central controller plus fast local controllers allows a one-cycle context switch
• Programming uses a "slow" parameter bus
• Low routing effort required
• Similar to Prof. Hartenstein's self-addressed memories
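As a software illustration of the weakly-programmable idea (parameter registers written once over the parameter bus, then an autonomous address generator with "done" feedback), here is a hedged sketch; the register set and the 2D block pattern are assumptions, not the FlexWAFE parameter format.

```cpp
// Minimal sketch of a weakly-programmable address generator (AG): a central
// algorithm controller writes a small set of parameter registers over a
// parameter bus, then the local AG streams addresses on its own and raises
// a "done" flag as feedback. Register names and the pattern are assumptions.
#include <cstdio>

struct AgParams {          // parameter registers, written over the parameter bus
    unsigned base;         // start address of the block
    unsigned width;        // block width in words
    unsigned height;       // block height in lines
    unsigned line_stride;  // distance between consecutive lines in memory
};

class AddressGenerator {
public:
    void configure(const AgParams& p) { prm_ = p; row_ = col_ = 0; done_ = false; }
    bool done() const { return done_; }
    unsigned next() {                       // one address per clock cycle
        unsigned addr = prm_.base + row_ * prm_.line_stride + col_;
        if (++col_ == prm_.width) { col_ = 0; ++row_; }
        if (row_ == prm_.height) done_ = true;   // "done" feedback to the controller
        return addr;
    }
private:
    AgParams prm_{};
    unsigned row_ = 0, col_ = 0;
    bool done_ = false;
};

int main() {
    AddressGenerator ag;
    ag.configure({0x1000, 4, 3, 64});       // 4x3 block, line stride 64 words
    while (!ag.done()) std::printf("0x%04X\n", ag.next());
    return 0;
}
```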
3D wavelet NR - mapping to the FlexFilm board
[Diagram: processing chain of bi-directional motion estimation, motion compensation, Haar filter and its inverse, wavelet transform, noise reduction and inverse wavelet transform mapped onto the FlexWAFE FPGAs plus a router FPGA; 2 Gbit SDRAM per FPGA; inter-stage data rates of 8 and 14 Gb/s; utilization annotations of 84%, 86% and 90%; PCIe 4x link to the host PC.]
• Smaller, cheaper, faster than existing solutions
• FlexWAFE FPGAs: Virtex-II Pro V50-6, 23616 slices, 4.1 Mbit RAM, 2 PPC
Motion estimation/compensation example
[FPGA floorplan diagram: TDMA logic channels in and out, RGB → Y conversion, two motion estimators (ME A, ME B) with algorithm controllers, motion compensation (MC) blocks, LMCs, two PowerPC cores, 120 and 480 Mbit SDRAM banks; the harmless 3 Gbit/s data input grows to 20 Gbit/s of memory I/O (internal streams of 1, 3, 6 and 9 Gbit/s).]
• Loose floorplanning is required to achieve the desired speed
• Real-time capable MCs obviate the memory bottleneck issue
• Weakly programmable LMCs allow design reuse and low-overhead run-time reconfiguration
• Tight floorplanning was used in the ME (facilitated by its regular structure)
• Small ME algorithm controller: 575 parameters for bidirectional ME
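For readers unfamiliar with motion estimation, the sketch below shows the generic full-search, SAD-based block matching formulation, not the FlexFilm ME architecture; frame size, block size, search range and the synthetic test frames are assumed values.

```cpp
// Minimal sketch of block-matching motion estimation using the sum of
// absolute differences (SAD): for one block of the current frame, search a
// small window in the reference frame for the best-matching displacement.
#include <climits>
#include <cstdio>
#include <cstdlib>
#include <vector>

constexpr int W = 64, H = 64, BLK = 8, RANGE = 4;   // assumed sizes

static int sad(const std::vector<unsigned char>& cur, const std::vector<unsigned char>& ref,
               int bx, int by, int dx, int dy) {
    int s = 0;
    for (int y = 0; y < BLK; ++y)
        for (int x = 0; x < BLK; ++x)
            s += std::abs(int(cur[(by + y) * W + bx + x]) -
                          int(ref[(by + dy + y) * W + bx + dx + x]));
    return s;
}

int main() {
    std::vector<unsigned char> cur(W * H), ref(W * H);
    for (int i = 0; i < W * H; ++i) ref[i] = rand() % 256;
    // Current frame: reference shifted by (2, 1), so the expected vector is known.
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            cur[y * W + x] = ref[((y + 1) % H) * W + (x + 2) % W];

    int bx = 24, by = 24, best = INT_MAX, bdx = 0, bdy = 0;
    for (int dy = -RANGE; dy <= RANGE; ++dy)         // full search over the window
        for (int dx = -RANGE; dx <= RANGE; ++dx) {
            int s = sad(cur, ref, bx, by, dx, dy);
            if (s < best) { best = s; bdx = dx; bdy = dy; }
        }
    std::printf("best motion vector: (%d, %d), SAD = %d\n", bdx, bdy, best);
    return 0;
}
```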
FlexFilm vs. Intel Pentium 4
[Performance comparison chart.]
This shows that an FPGA full of adders can actually do useful work (motion estimation).
Memory Controller
[Diagram: FlexWAFE FPGA with the SDRAM controller shared by the data paths and the 2x CPUs with caches, connected to external SDRAM.]
• External SDRAM:
  • internal FPGA memory too small: one 2K image ~ 20 MByte, XC2VP50 ~ 0.5 MByte
  • high bandwidth required
  • data path and CPU access the same memory, with different access patterns!
Different Memory Access Patterns
• CPU
  • irregular access patterns: prefetch not possible, buffers don't help
  • stalls at memory access → system performance loss
  • short latency needed: every nanosecond counts
  • in case of real-time: latency must be bounded
  • QoS requirement: minimum possible latency
• Data Path
  • regular access patterns: prefetch possible
  • long latency allowed, compensated by buffers
  • real-time operation: latency must be bounded
  • QoS requirement: guaranteed minimum throughput at guaranteed maximum latency
Related Work
• Designs from Sonics Memmax Controller [Sonics Inc., Weber], MediaTek Corp. [Lee, Lin] / Chiao Tung University [Jen]
• different service levels, similar architectures
• complex ASIC-based approach
QoS Implementation • 2 QoS components • access prioritization • low latency requests get a higher priority • high priority requests are executed before low priority requests • SIPS 2003, Seoul • traffic shaping • restrict bandwidth of high priority requests • to prevent starvation of low priority requests • allow complex traffic shaping patterns to support bursts • N requests in a window of T clock cycles, “Leaky Bucket” • DAC 2005, Anaheim (L.A.)
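A minimal software model of the "N requests in a window of T clock cycles" shaper (leaky-bucket style) follows; the sliding-window bookkeeping and the chosen N and T values are illustrative, not the hardware implementation.

```cpp
// Minimal sketch of the traffic shaper that keeps high-priority traffic from
// starving low-priority requests: at most N grants per sliding window of T
// clock cycles. Window parameters and the per-cycle request stream are assumptions.
#include <cstdio>
#include <deque>

class TrafficShaper {
public:
    TrafficShaper(unsigned n, unsigned t) : n_(n), t_(t) {}
    // Returns true if a high-priority request may be issued in this cycle.
    bool allow(unsigned cycle) {
        // Drop grant timestamps that have left the sliding window of T cycles.
        while (!issued_.empty() && cycle - issued_.front() >= t_) issued_.pop_front();
        if (issued_.size() < n_) { issued_.push_back(cycle); return true; }
        return false;   // budget exhausted: request is deferred
    }
private:
    unsigned n_, t_;
    std::deque<unsigned> issued_;   // cycles in which requests were granted
};

int main() {
    TrafficShaper shaper(3, 10);    // at most N = 3 requests per T = 10 cycles
    for (unsigned cycle = 0; cycle < 20; ++cycle) {
        bool granted = shaper.allow(cycle);   // assume one request every cycle
        std::printf("cycle %2u: %s\n", cycle, granted ? "grant" : "defer");
    }
    return 0;
}
```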
Memory Controller Main Features
• full burst accesses, burst length 8 (4 clock cycles)
• uses auto-precharge access mode
  • rows are closed automatically after access
  • much simpler (and faster) design
  • much simpler address generation
• access optimization
  • bank interleaving (hides row activation latency; see the sketch below)
  • request bundling (minimizes bus direction switches)
  • possible out-of-order execution of memory requests to improve bandwidth
• round-robin scheduling
  • guarantees fairness
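To illustrate bank interleaving, the sketch below maps a physical address to row/bank/column with the bank bits placed below the row bits, so a linear stream moves to the next bank before wrapping to the next row and that bank's row activation can be overlapped. The field widths (4 banks, 1024 columns) are assumptions, not the controller's actual mapping.

```cpp
// Minimal sketch of a physical-address to row/bank/column mapping that
// supports bank interleaving: a linear stream finishes one row in bank 0,
// then continues in bank 1, so the next row activation can be overlapped
// with accesses to the current bank. Field widths are assumptions.
#include <cstdio>

struct SdramAddr { unsigned row, bank, col; };

static SdramAddr map_address(unsigned phys) {
    SdramAddr a;
    a.col  =  phys        & 0x3FF;   // 10 column bits -> 1024 columns per row
    a.bank = (phys >> 10) & 0x3;     // 2 bank bits    -> 4 banks
    a.row  =  phys >> 12;            // remaining bits select the row
    return a;
}

int main() {
    // Step through a linear address stream one row (1024 words) at a time.
    for (unsigned addr = 0; addr <= 5 * 1024; addr += 1024) {
        SdramAddr a = map_address(addr);
        std::printf("phys 0x%05X -> row %u, bank %u, col %u\n", addr, a.row, a.bank, a.col);
    }
    return 0;
}
```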
Simulation Environment
[Diagram: SDRAM controller connected to DDR-SDRAM, a CPU with caches (PowerPC or ARM, running pegwit decode), flow control, and an image data path performing image input/output plus a 3-level discrete wavelet transformation (DWT) on 2048 x 1556, 16-bit grayscale images at 24 FPS.]
• Written in SystemC at transaction level, (mostly) cycle accurate
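For flavor, here is a tiny transaction-level SystemC sketch in the same spirit: a data-path model pushes memory requests through an sc_fifo to a memory model that serves them after a fixed delay. The module structure, timing values and the Request fields are assumptions, not the FlexFilm simulation code.

```cpp
// Tiny transaction-level SystemC sketch: requests flow over an sc_fifo from a
// producer (data path) to a consumer (memory model) with a fixed service delay.
#include <systemc.h>

struct Request { unsigned addr; bool write; };

// sc_fifo<T> needs a stream operator for user-defined payload types.
inline std::ostream& operator<<(std::ostream& os, const Request& r) {
    return os << (r.write ? "W@0x" : "R@0x") << std::hex << r.addr << std::dec;
}

SC_MODULE(Datapath) {
    sc_fifo_out<Request> out;
    void run() {
        for (unsigned i = 0; i < 4; ++i) {
            out.write(Request{i * 0x40, false});   // issue a read every 10 ns
            wait(10, SC_NS);
        }
    }
    SC_CTOR(Datapath) { SC_THREAD(run); }
};

SC_MODULE(MemoryModel) {
    sc_fifo_in<Request> in;
    void run() {
        for (;;) {
            Request r = in.read();                 // blocks until a request arrives
            wait(25, SC_NS);                       // assumed service latency
            std::cout << sc_time_stamp() << ": served " << r << std::endl;
        }
    }
    SC_CTOR(MemoryModel) { SC_THREAD(run); }
};

int sc_main(int, char*[]) {
    sc_fifo<Request> chan(2);                      // bounded channel models back-pressure
    Datapath dp("dp");
    MemoryModel mem("mem");
    dp.out(chan);
    mem.in(chan);
    sc_start(200, SC_NS);
    return 0;
}
```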
Simulation Results: Traffic Shaping
Parameters: n = number of consecutive requests; Tø = T / n = average clock cycles between requests
• Evaluation
  • traffic shaping very efficient
  • n > 1 hardly more efficient than n = 1; reason: blocking read cache transactions
• Required memory accesses per cache-line fill: PPC 1, ARM 4
SDRAM Controller Resources
• Configuration
  • application ports
    • high priority: 1 read, 1 write; 32 bit
    • standard priority: 2 reads, 5 writes; 32 bit
  • 32 bit DDR-SDRAM, 4 banks
  • flow control for high priority ports
• Performance: 125 MHz
Configurability
• Configuration
  • required to support multiple applications
  • DWT-based noise reduction requires 9 controllers and 5 different controller configurations
  • done at synthesis time
  • usage of generics: only a few general constants, no code changes required
Configurability
• Configuration Options
  • SDRAM timing and layout: data bus sizes, address bus sizes, number of chip selects, …
  • application ports: number of ports (up to 8 per priority level), per-port data and address width
  • port address translation: per-port configuration of address translation tables, number of entries per table and table contents, global physical / bank-row-column translation style
  • flow control: T (window size) and N (number of requests), can be disabled
Configurability
[Floorplan diagram as on the motion estimation/compensation slide: TDMA channels, RGB → Y, ME A, ME B, MCs, LMCs, two PowerPC cores.]
• Example: Motion Estimation / Compensation FPGA
  • Controller ME 1, Controller ME 2: 32 bit, 1 write port, 3 read ports
  • Controller MC: 64 bit, 1 write port, 2 read ports
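The real controller is parameterized with VHDL generics at synthesis time; purely as an illustration of the kind of parameter set involved, here is a C++ analogue using the port counts and data widths from this slide. Field names and the shaping values are assumptions.

```cpp
// Illustrative C++ analogue of the synthesis-time controller configuration
// (the real design uses VHDL generics). The two instances use the port counts
// and data widths given on the slide; T and N are placeholder values.
#include <cstdio>

struct ControllerConfig {
    unsigned data_width;   // application-port data width in bits
    unsigned write_ports;  // number of write ports
    unsigned read_ports;   // number of read ports
    unsigned shaper_T;     // traffic-shaping window size in clock cycles
    unsigned shaper_N;     // max. high-priority requests per window
    bool     shaper_on;    // flow control / shaping can also be disabled
};

int main() {
    // Motion estimation controllers (ME 1 / ME 2): 32 bit, 1 write, 3 read ports.
    ControllerConfig me{32, 1, 3, /*T=*/64, /*N=*/8, true};   // T, N are assumptions
    // Motion compensation controller (MC): 64 bit, 1 write, 2 read ports.
    ControllerConfig mc{64, 1, 2, /*T=*/64, /*N=*/8, true};

    const ControllerConfig* cfgs[] = {&me, &mc};
    for (const ControllerConfig* c : cfgs)
        std::printf("%u-bit controller: %u write / %u read ports, shaping %s (N=%u, T=%u)\n",
                    c->data_width, c->write_ports, c->read_ports,
                    c->shaper_on ? "on" : "off", c->shaper_N, c->shaper_T);
    return 0;
}
```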
FlexFilm / FlexWAFE conclusions
• Communication-centric architecture
  • PCI Express network interconnect allows easy scaling
  • separation of logical and physical channels for real-time inter-FPGA communication
  • small, reconfigurable, self-addressed local memories
• FlexWAFE architecture
  • the weakly-programmable paradigm allows fast, context-switch-like dynamic reconfiguration
  • a library of hand-optimized data path modules allows fast application development
• SDRAM controller
  • QoS increases overall system performance
  • lean but efficient controller design
  • easy configuration
Thank you for your attention. For more information please visit: http://www.flexfilm.org