Embedded OpenCV Acceleration Dario Pennisi
Introduction • Open-Source Computer Vision Library • Over 2500 algorithms and functions • Cross-platform, portable API • Windows, Linux, OS X, Android, iOS • Real-time performance • BSD license • Professionally developed and maintained
History • Launched in 1999 by Intel • Showcasing Intel Performance Library • First alpha released in 2000 • Version 1.0 released in 2006 • Corporate support by Willow Garage in 2008 • Version 2.0 released in 2009 • Improved C++ interfaces • Releases every 6 months • In 2014 taken over by Itseez • 3.0 in beta now • Drops C API support
Application structure • Building blocks to ease vision applications • Stages: Image Retrieval, Pre-Processing, Feature Extraction, Object Detection, Recognition, Reconstruction, Analysis, Decision Making • OpenCV modules: imgproc, highgui, objdetect, features2d, ml, stitching, calib3d, video
Environment • Application: C++, Java, Python • OpenCV: cv::parallel_for_ • Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB • OS • Acceleration: CUDA, SSE/AVX/NEON, OpenCL
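The idea behind cv::parallel_for_ is that a range of work (typically image rows) is split into stripes and a loop body is called once per sub-range on whatever threading backend is available. The sketch below is only a Python analogy of that pattern, not the OpenCV API: `split_range`, `parallel_for`, and `invert_rows` are hypothetical names, and a thread pool stands in for the backend. It illustrates the range splitting rather than any real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, end, nstripes):
    """Split [start, end) into contiguous sub-ranges of near-equal size."""
    total = end - start
    nstripes = max(1, min(nstripes, total))
    base, extra = divmod(total, nstripes)
    ranges = []
    lo = start
    for i in range(nstripes):
        # The first `extra` stripes take one additional element.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def parallel_for(start, end, body, nstripes=4):
    """Call body(sub_start, sub_end) for every stripe, one task per stripe."""
    with ThreadPoolExecutor(max_workers=nstripes) as pool:
        # list() forces completion and re-raises any exception from a worker.
        list(pool.map(lambda r: body(*r), split_range(start, end, nstripes)))


# Toy 6x8 "image"; each stripe writes only its own rows, so no locking is needed.
image = [[x + y for x in range(8)] for y in range(6)]
out = [[0] * 8 for _ in range(6)]


def invert_rows(row_start, row_end):
    for y in range(row_start, row_end):
        out[y] = [255 - v for v in image[y]]


parallel_for(0, len(image), invert_rows, nstripes=4)
```

Because each stripe owns a disjoint set of rows, the body needs no synchronization, which is the property that makes this pattern easy to map onto different backends.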
System Engineering • Dimensioning the system is fundamental • Understand your algorithm • Carefully choose your toolbox • Embedded means no chance for “one size fits all”
Acceleration Strategies • Optimize algorithms: profile, optimize, partition (CPU/GPU/DSP) • FPGA acceleration: high-level synthesis, custom DSP, RTL coding • Brute force: increase number of CPUs, increase CPU frequency • Accelerated libraries: NEON, OpenCL/CUDA
Bottlenecks • Know your enemy
Memory • Access to external memory is expensive • CPU load instructions are slow • Memory has latency • Memory bandwidth is shared among CPUs • Cache • Keeps the CPU from accessing external memory • For both data and instructions
Disordered accesses • What happens when we have a cache miss? • Fetching data from the same memory row: 13 clocks • Fetching data from a different row: 23 clocks • A cache line is usually 32 bytes • 8 clocks to fill a line (32-bit data bus) • Memory bandwidth efficiency • 38% on the same row • 26% on a different row
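The efficiency figures on this slide can be reproduced with a short calculation. The sketch assumes that efficiency means the share of clocks spent actually transferring the line out of the total clocks spent on the miss (fetch latency plus line fill); that formula is an assumption, while the clock counts come from the slide.

```python
# Clock counts are taken from the slide; the efficiency formula is an assumption.
CACHE_LINE_BYTES = 32
BUS_BYTES = 4  # 32-bit data bus
LINE_FILL_CLOCKS = CACHE_LINE_BYTES // BUS_BYTES  # 8 clocks to fill a line


def bandwidth_efficiency(fetch_latency_clocks, fill_clocks=LINE_FILL_CLOCKS):
    """Fraction of the clocks spent on a cache miss that move useful data."""
    return fill_clocks / (fetch_latency_clocks + fill_clocks)


same_row = bandwidth_efficiency(13)       # 8 / (13 + 8)
different_row = bandwidth_efficiency(23)  # 8 / (23 + 8)

print(f"same row: {same_row:.0%}, different row: {different_row:.0%}")
# prints "same row: 38%, different row: 26%"
```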
Bottlenecks - Cache • 1920x1080 YCbCr 4:2:2 (Full HD): 4 MB • Double the size of the biggest ARM L2 cache • 1280x720 YCbCr 4:2:2 (HD): 1.8 MB • Just fits in the L2 cache… OK if reading and writing to the same frame • 720x576 YCbCr 4:2:2 (SD): 800 KB • 2 images fit in the L2 cache…
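The frame sizes above follow from 2 bytes per pixel for 8-bit YCbCr 4:2:2. The sketch below recomputes them and counts how many whole frames fit in the cache; the 2 MiB L2 size is an assumption inferred from the "double the size" remark, not a figure stated on the slide.

```python
BYTES_PER_PIXEL_422 = 2            # 8-bit YCbCr 4:2:2: one Y plus a shared Cb/Cr sample
ASSUMED_L2_BYTES = 2 * 1024 ** 2   # assumption: 2 MiB L2 cache

FORMATS = {"Full HD": (1920, 1080), "HD": (1280, 720), "SD": (720, 576)}


def frame_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL_422):
    """Size of one frame in bytes."""
    return width * height * bytes_per_pixel


def frames_in_cache(width, height, cache_bytes=ASSUMED_L2_BYTES):
    """Number of whole frames that fit in the cache."""
    return cache_bytes // frame_bytes(width, height)


for name, (w, h) in FORMATS.items():
    size = frame_bytes(w, h)
    print(f"{name}: {size} bytes, {frames_in_cache(w, h)} frame(s) fit in L2")
```

Under that assumption a Full HD frame does not fit at all, an HD frame fits once, and two SD frames fit, which matches the slide.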
OpenCV Algorithms • Mostly designed for PCs • Well structured • General purpose • Optimized functions for SSE/AVX • Relatively optimized • Small number of accelerated functions • NEON • CUDA (NVIDIA GPU/Tegra) • OpenCL (GPU, multicore processors)
Multicore ARM/NEON • NEON SIMD instructions work on vectors of registers • Load-process-store philosophy • Load/store costs 1 cycle only if in L1 cache • 4-12 cycles if in L2 • 25 to 35 cycles on an L2 cache miss • SIMD instructions can take from 1 to 5 clocks • A fast clock is useless on big datasets with little computation
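A rough cost model shows why a fast clock does not help when the data does not sit in L1. The sketch below is a deliberately naive model under stated assumptions: one load, one SIMD operation, and one store per vector, executed back to back with no pipelining or prefetch. The L2 figures are midpoints picked from the ranges on the slide.

```python
# Cycles per load or store; the L2 values are assumed midpoints of the slide's ranges.
ACCESS_CYCLES = {"L1 hit": 1, "L2 hit": 8, "L2 miss": 30}


def compute_fraction(access_cycles, simd_cycles):
    """Share of cycles spent computing in a load-process-store sequence."""
    total = access_cycles + simd_cycles + access_cycles  # load + process + store
    return simd_cycles / total


for where, cycles in ACCESS_CYCLES.items():
    share = compute_fraction(cycles, simd_cycles=1)
    print(f"{where}: {share:.1%} of cycles do useful computation")
```

In this model the SIMD unit is busy about a third of the time when data is in L1 and under 2% of the time on an L2 miss, so raising the core clock mostly speeds up waiting.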
Generic DSP • Very similar to ARM/NEON • High speed pipeline impaired by inefficient memory access subsystem • When smart DMA is available it is very complex to program • When DSP is integrated in SoC it shares ARM’s bandwidth
OpenCL on GPU • OpenCL on Vivante GC2000 • Claimed capability up to 16 GFLOPS • Real Applications • only on internal registers: 13.8 GFLOPS • computing 1000x1000 matrix: 600 MFLOPS • Bandwidth and inefficiencies: • Only 1K local memory and 64 byte memory cache
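To put the three throughput figures side by side, the short sketch below expresses each measurement as a fraction of a reference value. The numbers come from the slide; `fraction_of` is a hypothetical helper.

```python
CLAIMED_GFLOPS = 16.0        # vendor claim
REGISTER_ONLY_GFLOPS = 13.8  # measured, operands in internal registers only
MATRIX_GFLOPS = 0.6          # measured on the 1000x1000 matrix (600 MFLOPS)


def fraction_of(measured, reference):
    """Measured throughput as a fraction of a reference throughput."""
    return measured / reference


print(f"register-only vs claimed: {fraction_of(REGISTER_ONLY_GFLOPS, CLAIMED_GFLOPS):.2f}")
print(f"matrix vs claimed:        {fraction_of(MATRIX_GFLOPS, CLAIMED_GFLOPS):.4f}")
print(f"matrix vs register-only:  {fraction_of(MATRIX_GFLOPS, REGISTER_ONLY_GFLOPS):.4f}")
```

The matrix workload reaches under 4% of the claimed peak, which is consistent with the slide's point that bandwidth and the tiny local memory, not arithmetic capacity, set the limit.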
OpenCL on FPGA • Same code can run on FPGA and GPU • Transforms selected functions into hardware • Automated memory access coalescing • Each function requires dedicated logic • Large FPGAs required • Partial reconfiguration may solve this • Significant compilation time
HLS on FPGA • High-Level Synthesis • Converts C to hardware • HLS requires code to be heavily modified • Pragmas to instruct the compiler • Code restructuring • Not portable anymore • Each function requires dedicated logic • Large FPGAs required • Partial reconfiguration may solve this • Significant compilation time
A different approach • Demanding algorithms on low cost/power HW • [Diagram: Algorithm Analysis splits into Memory Access Pattern, Data-Intensive Processing, Decision Making; implementation blocks: DSP, NEON, Custom Instruction (RTL), ARM program, DMA]
External co-processing • [Diagram blocks: ARM, Memory, GPU, FPGA, PCIe, FPGA-side Memory]
Co-processor details • FPGA co-processor • Separate memory • Adds bandwidth • Reduces access conflicts • Algorithm-aware DMA • Accesses memory in an ordered way • Adds caching through embedded RAM • Algorithm-specific processors • HLS/OpenCL synthesized IP blocks • DSP with custom instructions • Hardcoded IP blocks • [Diagram blocks: ARM, Memory, DMA Processor, Block capture, DPRAM(s), DSP core(s)/IP Block]
Co-processor details • Flex DMA • Dedicated processor with DMA custom instruction • Software-defined memory access pattern • Block Capture • Extracts data for each tile • DPRAM • Local, high-speed cache • DSP Core • Dedicated processor with algorithm-specific custom instructions • [Diagram blocks: ARM, Memory, Flex DMA, Block capture, DPRAM(s), DSP core(s)/IP Block]
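The block-capture step can be pictured in software as cutting the frame into tiles small enough for local memory, processing each tile independently, and writing the result back. The sketch below is a plain-Python illustration of that data flow, not the co-processor's actual interface: `capture_tiles`, `process_tile`, and `run_tiled` are hypothetical names, and the doubling kernel stands in for a DSP core.

```python
def capture_tiles(image, tile_h, tile_w):
    """Yield (y, x, tile) for every tile, clipping tiles at the image border."""
    height, width = len(image), len(image[0])
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            tile = [row[x:x + tile_w] for row in image[y:y + tile_h]]
            yield y, x, tile


def process_tile(tile):
    """Hypothetical per-tile kernel standing in for a DSP core."""
    return [[value * 2 for value in row] for row in tile]


def run_tiled(image, tile_h, tile_w):
    """Process the image tile by tile and reassemble the output frame."""
    out = [[0] * len(image[0]) for _ in image]
    for y, x, tile in capture_tiles(image, tile_h, tile_w):
        result = process_tile(tile)
        for dy, row in enumerate(result):
            out[y + dy][x:x + len(row)] = row
    return out


frame = [[x + 10 * y for x in range(7)] for y in range(5)]
doubled = run_tiled(frame, tile_h=2, tile_w=3)
```

Only one tile needs to be resident at a time, which is what lets a small, fast local memory replace repeated trips to external memory.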
Environment • Application: C++, Java, Python • OpenCV: cv::parallel_for_ • Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB • OS • OpenVX • Acceleration: SSE/AVX/NEON, OpenCL, CUDA, FPGA
OpenVX Graph Manager • Graph Construction • Allocates resources • Logical representation of algorithm • Graph Execution • Concatenates nodes, avoiding memory storage • Tiling extensions • Single node execution can be split into multiple tiles • Multiple accelerators executing a single task in parallel • [Diagram: Memory → Node1 → Memory → Node2 → Memory, versus Memory → Node1 → Node2 → Memory]
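The benefit of concatenating nodes can be shown with a toy two-node pipeline. This is a conceptual sketch and not the OpenVX API: `node1`, `node2`, and both runner functions are hypothetical, and the nodes are pointwise so that tiles need no overlap with their neighbours.

```python
def node1(pixels):
    """Hypothetical first node: brightness offset."""
    return [p + 10 for p in pixels]


def node2(pixels):
    """Hypothetical second node: gain with saturation at 255."""
    return [min(255, p * 2) for p in pixels]


def run_unfused(frame):
    """Memory -> Node1 -> Memory -> Node2 -> Memory: a full intermediate frame is stored."""
    intermediate = node1(frame)
    return node2(intermediate)


def run_fused_tiled(frame, tile_size):
    """Memory -> Node1 -> Node2 -> Memory: only one tile of intermediate data exists at a time."""
    out = []
    for start in range(0, len(frame), tile_size):
        tile = frame[start:start + tile_size]
        out.extend(node2(node1(tile)))
    return out


frame = list(range(100))
assert run_unfused(frame) == run_fused_tiled(frame, tile_size=16)
```

Both paths give the same result, but the fused path never writes the full intermediate frame, and each tile could in principle be handed to a different accelerator.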
Summary: what we learnt • OpenCV today is mainly PC oriented • ARM, CUDA, OpenCL support growing • Existing acceleration only on selected functions • Embedded CV requires good partitioning among resources • When ASSPs are not enough, FPGAs are key • OpenVX provides a consistent HW acceleration platform, not only for OpenCV