FlexFilm: An Image Processor for Digital Film Processing
Schloss Dagstuhl, Germany, April 02 – 07, 2006
Amilcar do Carmo Lucas, Sven Heithecker, Rolf Ernst
Technical University of Braunschweig, Germany (Technische Universität Braunschweig)
Outline
• Motivation
• Film Grain Reduction
• FlexWAFE Architecture / Library
• FlexFilm Hardware
• SDRAM Controller: QoS, detailed view, configurability
• Conclusion
Motivation
Application
• Digital film image processing
Features
• High data volumes
  • Low resolution (2K): 2048 x 2048 / 30 bpp = 120 Mbit per frame
  • High resolution (4K): 4096 x 4096 / 48 bpp = 768 Mbit per frame
• Low latency (allow user interactivity)
• Real time is 24 frames per second → up to 18 Gbit/s (see the quick check below)
First application
• Film grain noise reduction
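A quick check of the quoted peak rate, derived only from the 4K figures on this slide:

\[
4096 \times 4096\ \text{px} \times 48\ \text{bpp} = 768\ \text{Mbit/frame}, \qquad
768\ \text{Mbit/frame} \times 24\ \text{fps} \approx 18\ \text{Gbit/s}.
\]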
Film grain noise reduction application
• Required as a first step before other image processing algorithms
• Next-generation NR requires more processing power than state-of-the-art systems provide
[Image pair: with film grain / de-noised]
Film grain noise reduction application
[Dataflow diagram: temporal 1D DWT and its inverse (with frame delays), 2D DWT and inverse 2D DWT built from horizontal and vertical FIR / inverse FIR filters, per-subband (HL, LH, HH) noise reduction, bi-directional motion compensation, sync and buffer stages, producing the de-noised image. Annotated figures: 500 Mop/s, 100 Kbit, 8 Mbit and 3 Mbit buffers, 36 BPP / 30 BPP data paths, 640 Mbit interframe storage, 7 DWT + 7 IDWT, 160 Gop/s.]
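The core of the diagram is a DWT → coefficient noise reduction → inverse DWT chain. Purely as an illustration of that idea (not the FlexFilm datapath), the sketch below applies a single-level 1D Haar DWT, soft-thresholds the detail coefficients and reconstructs; the toy signal, the threshold value and the helper names are assumptions.

```cpp
// Minimal sketch of the DWT -> coefficient thresholding -> inverse DWT idea
// behind wavelet-based grain/noise reduction. Uses a single-level 1D Haar
// transform for brevity; the real pipeline uses separable 2D FIR filter banks
// plus a temporal (Haar) transform and per-subband noise reduction.
#include <cmath>
#include <cstdio>
#include <vector>

// Single-level Haar DWT: split the signal into approximation (low-pass)
// and detail (high-pass) coefficients.
static void haar_dwt(const std::vector<double>& x,
                     std::vector<double>& approx, std::vector<double>& detail) {
    for (std::size_t i = 0; i + 1 < x.size(); i += 2) {
        approx.push_back((x[i] + x[i + 1]) / std::sqrt(2.0));
        detail.push_back((x[i] - x[i + 1]) / std::sqrt(2.0));
    }
}

// Soft thresholding: shrink small detail coefficients (assumed to be noise).
static double soft_threshold(double c, double t) {
    if (c >  t) return c - t;
    if (c < -t) return c + t;
    return 0.0;
}

// Inverse single-level Haar DWT.
static void haar_idwt(const std::vector<double>& approx,
                      const std::vector<double>& detail, std::vector<double>& x) {
    for (std::size_t i = 0; i < approx.size(); ++i) {
        x.push_back((approx[i] + detail[i]) / std::sqrt(2.0));
        x.push_back((approx[i] - detail[i]) / std::sqrt(2.0));
    }
}

int main() {
    std::vector<double> noisy = {10, 11, 10, 12, 50, 51, 50, 52};  // toy signal
    std::vector<double> a, d, out;
    haar_dwt(noisy, a, d);
    for (double& c : d) c = soft_threshold(c, 1.0);   // threshold value is an assumption
    haar_idwt(a, d, out);
    for (double v : out) std::printf("%.2f ", v);
    std::printf("\n");
    return 0;
}
```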
Film grain noise reduction application
• Complexity (2K @ 24 fps)
  • 180 G add/s, 11 G mul/s, 12 G comp/s
  • 660 Mbit memory, 3 Gbit/s in, 3 Gbit/s out
• Too complex for state-of-the-art PC- or DSP-based systems
  • Texas Instruments fixed-point DSP: max 0.8 Gop/s @ 1 GHz
  • Stanford Imagine ASIC: 18 Gop/s average @ 400 MHz
• Memory footprint too big for ASIC or FPGA on-chip memory
  • Imagine ASIC has 1 Mbit
  • Xilinx Virtex-II Pro 50 FPGA has 4.1 Mbit
• Algorithm not appropriate for GPUs
  • NVIDIA can only do motion estimation on standard-resolution video in real time
Project Goal Develop a reusable image processing platform for digital film production • Provide the bandwidth and processing power required by future algorithms • Shorten the design cycle of new applications Project members and tasks • Grass Valley Germany - Board design • TU-Braunschweig - FPGA programming and internal architecture • TU-Ilmenau - Example algorithm: Film grain noise reduction
System-Wide Communication
[Diagram: four Image Engine Cores, each with I/O bridges and I/O ports, connected through a PCIe switch; PCI Express network with host interface (PCI Express).]
• PCI Express is time-driven, packet based (TDMA)
• Scalable and standard in PCs
Inter FPGA Communication
[Diagram: two FPGAs running at 125 MHz connected by 4 x 16 bit links at 250 MHz DDR plus PCI Express; TDMA send/receive blocks driven by per-FPGA schedules that assign numbered logical channels (1, 2, 3) to time slots.]
• Separation of logical and physical channels
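To illustrate the separation of logical and physical channels, here is a minimal software sketch: a fixed slot schedule decides which logical channel may use the shared physical link in each time slot. The schedule contents, channel count and payload values are invented; the real hardware does this per clock cycle on the DDR links.

```cpp
// Minimal sketch of TDMA-style separation of logical and physical channels:
// several logical channels share one physical inter-FPGA link according to a
// static slot schedule. Schedule, FIFO depths and payloads are assumptions.
#include <cstdio>
#include <queue>
#include <vector>

int main() {
    // One transmit FIFO per logical channel, all sharing one physical link.
    std::vector<std::queue<unsigned>> tx_fifo(4);
    for (unsigned i = 0; i < 8; ++i) tx_fifo[1].push(0x100 + i);  // channel 1 traffic
    for (unsigned i = 0; i < 4; ++i) tx_fifo[2].push(0x200 + i);  // channel 2 traffic
    for (unsigned i = 0; i < 2; ++i) tx_fifo[3].push(0x300 + i);  // channel 3 traffic

    // Static TDMA schedule: logical channel ID per slot (repeats every 6 slots).
    const int schedule[] = {2, 1, 1, 2, 3, 2};

    for (int slot = 0; slot < 12; ++slot) {
        int ch = schedule[slot % 6];
        if (!tx_fifo[ch].empty()) {
            std::printf("slot %2d: ch %d -> 0x%03X\n", slot, ch, tx_fifo[ch].front());
            tx_fifo[ch].pop();
        } else {
            std::printf("slot %2d: ch %d idle\n", slot, ch);
        }
    }
    return 0;
}
```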
FlexWAFE Architecture
[Diagram: physical FPGA structure; FlexWAFE FPGA with SDRAM controller, external RAM/SDRAM, 2x CPU with caches, data paths built from DP blocks, PCIe bridge / I/O processor and I/O.]
• Flexible Weakly-programmable Advanced Film Engine
• Real-time SDRAM controller with priorities and traffic shaping
• PowerPC 405 CPU, 16 kB caches, MMU
• Flexible, configurable data paths with local memories and address generators
• Blocks with local connections facilitate the usage of user-assisted floorplanning (e.g. PlanAhead)
LMC
[Diagram: LMCs for MC-DPU streams; CMC address and data stream (large address space, off-chip SDRAM) versus local DPU data stream (small address space, dual-ported block RAM); LMC ingress and egress address generators (AGs); parameter bus (data + address) feeding parameter registers and local controllers; address bus; feedback ("done" signals); data bus; memory controller and VLIW algorithm controller.]
• A central controller plus fast local controllers allows a one-cycle context switch
• Programming uses a "slow" parameter bus
• Low routing effort required
• Similar to Prof. Hartenstein's self-addressed memories
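As a software illustration of the weakly-programmable idea (parameter registers written once over the parameter bus, then an autonomous address generator with "done" feedback), here is a hedged sketch; the register set and the 2D block pattern are assumptions, not the FlexWAFE parameter format.

```cpp
// Minimal sketch of a weakly-programmable address generator (AG): a central
// algorithm controller writes a small set of parameter registers over a
// parameter bus, then the local AG streams addresses on its own and raises
// a "done" flag as feedback. Register names and the pattern are assumptions.
#include <cstdio>

struct AgParams {          // parameter registers, written over the parameter bus
    unsigned base;         // start address of the block
    unsigned width;        // block width in words
    unsigned height;       // block height in lines
    unsigned line_stride;  // distance between consecutive lines in memory
};

class AddressGenerator {
public:
    void configure(const AgParams& p) { prm_ = p; row_ = col_ = 0; done_ = false; }
    bool done() const { return done_; }
    unsigned next() {                       // one address per clock cycle
        unsigned addr = prm_.base + row_ * prm_.line_stride + col_;
        if (++col_ == prm_.width) { col_ = 0; ++row_; }
        if (row_ == prm_.height) done_ = true;   // "done" feedback to the controller
        return addr;
    }
private:
    AgParams prm_{};
    unsigned row_ = 0, col_ = 0;
    bool done_ = false;
};

int main() {
    AddressGenerator ag;
    ag.configure({0x1000, 4, 3, 64});       // 4x3 block, line stride 64 words
    while (!ag.done()) std::printf("0x%04X\n", ag.next());
    return 0;
}
```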
3D wavelet NR - mapping to the FlexFilm board
[Diagram: processing chain of bi-directional motion estimation, motion compensation, Haar filter and its inverse, wavelet transform, noise reduction and inverse wavelet transform mapped onto the FlexWAFE FPGAs plus a router FPGA; 2 Gbit SDRAM per FPGA; inter-stage data rates of 8 and 14 Gb/s; utilization annotations of 84%, 86% and 90%; PCIe 4x link to the host PC.]
• Smaller, cheaper, faster than existing solutions
• FlexWAFE FPGAs: Virtex-II Pro V50-6, 23616 slices, 4.1 Mbit RAM, 2 PPC
Motion estimation/compensation example
[FPGA floorplan diagram: TDMA logic channels in and out, RGB → Y conversion, two motion estimators (ME A, ME B) with algorithm controllers, motion compensation (MC) blocks, LMCs, two PowerPC cores, 120 and 480 Mbit SDRAM banks; the harmless 3 Gbit/s data input grows to 20 Gbit/s of memory I/O (internal streams of 1, 3, 6 and 9 Gbit/s).]
• Loose floorplanning is required to achieve the desired speed
• Real-time capable MCs obviate the memory bottleneck issue
• Weakly programmable LMCs allow design reuse and low-overhead run-time reconfiguration
• Tight floorplanning was used in the ME (facilitated by its regular structure)
• Small ME algorithm controller: 575 parameters for bidirectional ME
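For readers unfamiliar with motion estimation, the sketch below shows the generic full-search, SAD-based block matching formulation, not the FlexFilm ME architecture; frame size, block size, search range and the synthetic test frames are assumed values.

```cpp
// Minimal sketch of block-matching motion estimation using the sum of
// absolute differences (SAD): for one block of the current frame, search a
// small window in the reference frame for the best-matching displacement.
#include <climits>
#include <cstdio>
#include <cstdlib>
#include <vector>

constexpr int W = 64, H = 64, BLK = 8, RANGE = 4;   // assumed sizes

static int sad(const std::vector<unsigned char>& cur, const std::vector<unsigned char>& ref,
               int bx, int by, int dx, int dy) {
    int s = 0;
    for (int y = 0; y < BLK; ++y)
        for (int x = 0; x < BLK; ++x)
            s += std::abs(int(cur[(by + y) * W + bx + x]) -
                          int(ref[(by + dy + y) * W + bx + dx + x]));
    return s;
}

int main() {
    std::vector<unsigned char> cur(W * H), ref(W * H);
    for (int i = 0; i < W * H; ++i) ref[i] = rand() % 256;
    // Current frame: reference shifted by (2, 1), so the expected vector is known.
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            cur[y * W + x] = ref[((y + 1) % H) * W + (x + 2) % W];

    int bx = 24, by = 24, best = INT_MAX, bdx = 0, bdy = 0;
    for (int dy = -RANGE; dy <= RANGE; ++dy)         // full search over the window
        for (int dx = -RANGE; dx <= RANGE; ++dx) {
            int s = sad(cur, ref, bx, by, dx, dy);
            if (s < best) { best = s; bdx = dx; bdy = dy; }
        }
    std::printf("best motion vector: (%d, %d), SAD = %d\n", bdx, bdy, best);
    return 0;
}
```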
FlexFilm vs. Intel Pentium 4
[Performance comparison chart.]
This shows that an FPGA full of adders can actually do useful work (motion estimation).
Memory Controller
[Diagram: FlexWAFE FPGA with the SDRAM controller shared by the data paths and the 2x CPUs with caches, connected to external SDRAM.]
• External SDRAM:
  • internal FPGA memory too small: one 2K image ~ 20 MByte, XC2VP50 ~ 0.5 MByte
  • high bandwidth required
  • data path and CPU access the same memory, with different access patterns!
Different Memory Access Patterns
• CPU
  • irregular access patterns: prefetch not possible, buffers don't help
  • stalls at memory access → system performance loss
  • short latency needed: every nanosecond counts
  • in case of real-time: latency must be bounded
  • QoS requirement: minimum possible latency
• Data Path
  • regular access patterns: prefetch possible
  • long latency allowed, compensated by buffers
  • real-time operation: latency must be bounded
  • QoS requirement: guaranteed minimum throughput at guaranteed maximum latency
Related Work
• Designs from Sonics Memmax Controller [Sonics Inc., Weber], MediaTek Corp. [Lee, Lin] / Chiao Tung University [Jen]
• different service levels, similar architectures
• complex ASIC-based approach
QoS Implementation • 2 QoS components • access prioritization • low latency requests get a higher priority • high priority requests are executed before low priority requests • SIPS 2003, Seoul • traffic shaping • restrict bandwidth of high priority requests • to prevent starvation of low priority requests • allow complex traffic shaping patterns to support bursts • N requests in a window of T clock cycles, “Leaky Bucket” • DAC 2005, Anaheim (L.A.)
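A minimal software model of the "N requests in a window of T clock cycles" shaper (leaky-bucket style) follows; the sliding-window bookkeeping and the chosen N and T values are illustrative, not the hardware implementation.

```cpp
// Minimal sketch of the traffic shaper that keeps high-priority traffic from
// starving low-priority requests: at most N grants per sliding window of T
// clock cycles. Window parameters and the per-cycle request stream are assumptions.
#include <cstdio>
#include <deque>

class TrafficShaper {
public:
    TrafficShaper(unsigned n, unsigned t) : n_(n), t_(t) {}
    // Returns true if a high-priority request may be issued in this cycle.
    bool allow(unsigned cycle) {
        // Drop grant timestamps that have left the sliding window of T cycles.
        while (!issued_.empty() && cycle - issued_.front() >= t_) issued_.pop_front();
        if (issued_.size() < n_) { issued_.push_back(cycle); return true; }
        return false;   // budget exhausted: request is deferred
    }
private:
    unsigned n_, t_;
    std::deque<unsigned> issued_;   // cycles in which requests were granted
};

int main() {
    TrafficShaper shaper(3, 10);    // at most N = 3 requests per T = 10 cycles
    for (unsigned cycle = 0; cycle < 20; ++cycle) {
        bool granted = shaper.allow(cycle);   // assume one request every cycle
        std::printf("cycle %2u: %s\n", cycle, granted ? "grant" : "defer");
    }
    return 0;
}
```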
Memory Controller Main Features
• full burst accesses, burst length 8 (4 clock cycles)
• uses auto-precharge access mode
  • rows are closed automatically after access
  • much simpler (and faster) design
  • much simpler address generation
• access optimization
  • bank interleaving (hides row activation latency; see the sketch below)
  • request bundling (minimizes bus direction switches)
  • possible out-of-order execution of memory requests to improve bandwidth
• round-robin scheduling
  • guarantees fairness
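To illustrate bank interleaving, the sketch below maps a physical address to row/bank/column with the bank bits placed below the row bits, so a linear stream moves to the next bank before wrapping to the next row and that bank's row activation can be overlapped. The field widths (4 banks, 1024 columns) are assumptions, not the controller's actual mapping.

```cpp
// Minimal sketch of a physical-address to row/bank/column mapping that
// supports bank interleaving: a linear stream finishes one row in bank 0,
// then continues in bank 1, so the next row activation can be overlapped
// with accesses to the current bank. Field widths are assumptions.
#include <cstdio>

struct SdramAddr { unsigned row, bank, col; };

static SdramAddr map_address(unsigned phys) {
    SdramAddr a;
    a.col  =  phys        & 0x3FF;   // 10 column bits -> 1024 columns per row
    a.bank = (phys >> 10) & 0x3;     // 2 bank bits    -> 4 banks
    a.row  =  phys >> 12;            // remaining bits select the row
    return a;
}

int main() {
    // Step through a linear address stream one row (1024 words) at a time.
    for (unsigned addr = 0; addr <= 5 * 1024; addr += 1024) {
        SdramAddr a = map_address(addr);
        std::printf("phys 0x%05X -> row %u, bank %u, col %u\n", addr, a.row, a.bank, a.col);
    }
    return 0;
}
```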
Simulation Environment
[Diagram: SDRAM controller connected to DDR-SDRAM, a CPU with caches (PowerPC or ARM, running pegwit decode), flow control, and an image data path performing image input/output plus a 3-level discrete wavelet transformation (DWT) on 2048 x 1556, 16-bit grayscale images at 24 FPS.]
• Written in SystemC at transaction level, (mostly) cycle accurate
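For flavor, here is a tiny transaction-level SystemC sketch in the same spirit: a data-path model pushes memory requests through an sc_fifo to a memory model that serves them after a fixed delay. The module structure, timing values and the Request fields are assumptions, not the FlexFilm simulation code.

```cpp
// Tiny transaction-level SystemC sketch: requests flow over an sc_fifo from a
// producer (data path) to a consumer (memory model) with a fixed service delay.
#include <systemc.h>

struct Request { unsigned addr; bool write; };

// sc_fifo<T> needs a stream operator for user-defined payload types.
inline std::ostream& operator<<(std::ostream& os, const Request& r) {
    return os << (r.write ? "W@0x" : "R@0x") << std::hex << r.addr << std::dec;
}

SC_MODULE(Datapath) {
    sc_fifo_out<Request> out;
    void run() {
        for (unsigned i = 0; i < 4; ++i) {
            out.write(Request{i * 0x40, false});   // issue a read every 10 ns
            wait(10, SC_NS);
        }
    }
    SC_CTOR(Datapath) { SC_THREAD(run); }
};

SC_MODULE(MemoryModel) {
    sc_fifo_in<Request> in;
    void run() {
        for (;;) {
            Request r = in.read();                 // blocks until a request arrives
            wait(25, SC_NS);                       // assumed service latency
            std::cout << sc_time_stamp() << ": served " << r << std::endl;
        }
    }
    SC_CTOR(MemoryModel) { SC_THREAD(run); }
};

int sc_main(int, char*[]) {
    sc_fifo<Request> chan(2);                      // bounded channel models back-pressure
    Datapath dp("dp");
    MemoryModel mem("mem");
    dp.out(chan);
    mem.in(chan);
    sc_start(200, SC_NS);
    return 0;
}
```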
Simulation Results: Traffic Shaping
Parameters: n = number of consecutive requests; Tø = T / n = average clock cycles between requests
• Evaluation
  • traffic shaping very efficient
  • n > 1 hardly more efficient than n = 1; reason: blocking read cache transactions
• Required memory accesses per cache-line fill: PPC 1, ARM 4
SDRAM Controller Resources
• Configuration
  • application ports
    • high priority: 1 read, 1 write; 32 bit
    • standard priority: 2 reads, 5 writes; 32 bit
  • 32 bit DDR-SDRAM, 4 banks
  • flow control for high priority ports
• Performance: 125 MHz
Configurability
• Configuration
  • required to support multiple applications
  • DWT-based noise reduction requires 9 controllers and 5 different controller configurations
  • done at synthesis time
  • usage of generics: only a few general constants, no code changes required
Configurability
• Configuration Options
  • SDRAM timing and layout: data bus sizes, address bus sizes, number of chip selects, …
  • application ports: number of ports (up to 8 per priority level), per-port data and address width
  • port address translation: per-port configuration of address translation tables, number of entries per table and table contents, global physical / bank-row-column translation style
  • flow control: T (window size) and N (number of requests), can be disabled
Configurability
[Floorplan diagram as on the motion estimation/compensation slide: TDMA channels, RGB → Y, ME A, ME B, MCs, LMCs, two PowerPC cores.]
• Example: Motion Estimation / Compensation FPGA
  • Controller ME 1, Controller ME 2: 32 bit, 1 write port, 3 read ports
  • Controller MC: 64 bit, 1 write port, 2 read ports
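The real controller is parameterized with VHDL generics at synthesis time; purely as an illustration of the kind of parameter set involved, here is a C++ analogue using the port counts and data widths from this slide. Field names and the shaping values are assumptions.

```cpp
// Illustrative C++ analogue of the synthesis-time controller configuration
// (the real design uses VHDL generics). The two instances use the port counts
// and data widths given on the slide; T and N are placeholder values.
#include <cstdio>

struct ControllerConfig {
    unsigned data_width;   // application-port data width in bits
    unsigned write_ports;  // number of write ports
    unsigned read_ports;   // number of read ports
    unsigned shaper_T;     // traffic-shaping window size in clock cycles
    unsigned shaper_N;     // max. high-priority requests per window
    bool     shaper_on;    // flow control / shaping can also be disabled
};

int main() {
    // Motion estimation controllers (ME 1 / ME 2): 32 bit, 1 write, 3 read ports.
    ControllerConfig me{32, 1, 3, /*T=*/64, /*N=*/8, true};   // T, N are assumptions
    // Motion compensation controller (MC): 64 bit, 1 write, 2 read ports.
    ControllerConfig mc{64, 1, 2, /*T=*/64, /*N=*/8, true};

    const ControllerConfig* cfgs[] = {&me, &mc};
    for (const ControllerConfig* c : cfgs)
        std::printf("%u-bit controller: %u write / %u read ports, shaping %s (N=%u, T=%u)\n",
                    c->data_width, c->write_ports, c->read_ports,
                    c->shaper_on ? "on" : "off", c->shaper_N, c->shaper_T);
    return 0;
}
```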
FlexFilm / FlexWAFE conclusions
• Communication-centric architecture
  • PCI Express network interconnect allows easy scaling
  • separation of logical and physical channels for real-time inter-FPGA communication
  • small, reconfigurable, self-addressed local memories
• FlexWAFE architecture
  • the weakly-programmable paradigm allows fast, context-switch-like dynamic reconfiguration
  • a library of hand-optimized data path modules allows fast application development
• SDRAM controller
  • QoS increases overall system performance
  • lean but efficient controller design
  • easy configuration
Thank you for your attention. For more information please visit: http://www.flexfilm.org