Evaluation of LBP (and other local descriptors) computational performance in multiple (computing) architectures Center for Machine Vision Research Miguel Bordallo López, Alejandro Nieto, Jani Boutellier, Jari Hannuksela, Olli Silvén Sami Varjo, Henri Nykänen, Abdenour Hadid Center for Machine Vision Research, University of Oulu, Finland
A sentence I read somewhere... ”LBP features are desirable because of their extremely high computational performance”... when connected to the power grid, on a high-end CPU, measured per pixel, using the basic LBP, and with no interpolation needed.
Contents • Introduction • Computational complexity of local descriptors • LBP in desktop computers • LBP in mobile devices • LBP in dedicated computing devices
Why should we care? • Evaluation of descriptors/features is done in terms of accuracy • Computational performance is (sometimes) disregarded: measured in Matlab, not processor specific, based on libraries, or not measured at all ... But ... • Faster methods can process larger amounts of input • Applications running at lower framerates might perform worse than at higher ones • Computational performance is a KEY measurement for application performance
... an example... (face recognition) • Method A: lower accuracy • Method B: higher accuracy
... an example... (face recognition) • Method A: lower accuracy, 100ms/frame • Method B: higher accuracy, 300ms/frame
What descriptor to choose? [Chart: accuracy (%) vs. computation time]
Contents • Introduction • Computational complexity of local descriptors • LBP in desktop computers • LBP in mobile devices • LBP in dedicated computing devices
LBPs are essentially local descriptors HD1080@60fps: 1920 x 1080 x 60 ≈ 125 Mpix/s UHD: up to 2 Gpix/s !!! That’s a lot of throughput !!
Linear complexity of local descriptors [Chart: computation time (ms) vs. number of pixels, with 1280x720 and 1920x1080 marked] Time grows linearly with the resolution
LBP variants [Chart: computation times for Census (8x8), LBP(24,3), VLBP(8,1), LBP-TOP(8,1), LBP(8,1), LOCP(8,1), CLBP(8,1)] Time grows linearly with the number of points
Implications Time = K * n_pixels * n_points K is implementation dependent K is platform dependent Allows for platform comparison: • CPP metric (cycles per pixel): time normalized by resolution and clock frequency
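The normalization above can be sketched as a small helper; the function name and the example figures (10 ms per frame, 2 GHz clock) are my own illustrative assumptions, not measurements from the slides.

```python
# CPP (cycles per pixel): measured time normalized by resolution and clock
# frequency, so implementations on different platforms can be compared.
def cycles_per_pixel(time_s, clock_hz, n_pixels):
    """Return how many clock cycles were spent per processed pixel."""
    return time_s * clock_hz / n_pixels

# Hypothetical example: 10 ms for one 1920x1080 frame on a 2 GHz core
cpp = cycles_per_pixel(0.010, 2.0e9, 1920 * 1080)  # ~9.6 cycles/pixel
```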
Local descriptor computational breakdown • Filtering • Quantization • Feature composition • Histogramming
Local descriptor computational breakdown • Filtering (LBP) • Quantization • Feature composition • Histogramming [Slide shows eight 3x3 difference kernels f1 ... f8, each with -1 at the center and +1 at one of the eight neighbors]
Local descriptor computational breakdown • Filtering (BSIF) • Quantization • Feature composition • Histogramming [Slide shows eight learned 3x3 filters f1 ... f8 with real-valued coefficients]
Local descriptor computational breakdown • Filtering • Quantization (LBP, LPQ, BSIF) q1 = f1 > 0, q2 = f2 > 0, ..., q8 = f8 > 0 • Feature composition • Histogramming
Local descriptor computational breakdown • Filtering • Quantization • Feature composition (LBP, LPQ, BSIF) LBP = q1*1 + q2*2 + q3*4 + ... + q8*128, or with shifts: LBP = q1 + (q2<<1) + (q3<<2) + ... + (q8<<7) • Histogramming
Local descriptor computational breakdown • Filtering • Quantization • Feature composition • Histogramming (LBP, LPQ, BSIF) if LBP = 1 then bin1++ if LBP = 2 then bin2++ ... if LBP = 255 then bin255++
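The four stages above can be sketched end-to-end for the basic LBP(8,1). This is a minimal, unoptimized reference sketch, not the authors' implementation; the function name, the list-of-lists grayscale image layout, and the neighbor ordering are my own choices.

```python
# Basic LBP(8,1) over interior pixels: filtering, quantization,
# feature composition and histogramming in one pass.
def lbp_histogram(img):
    h, w = len(img), len(img[0])
    # Eight neighbor offsets, each paired by index k with binary weight 2**k.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y][x]
            code = 0
            for k, (dy, dx) in enumerate(offsets):
                # 1. Filtering: difference between neighbor and center
                d = img[y + dy][x + dx] - c
                # 2. Quantization: keep only the sign of the difference
                q = 1 if d >= 0 else 0
                # 3. Feature composition: weighted sum of the 8 bits
                code += q << k
            # 4. Histogramming: one of 256 bins per pixel
            hist[code] += 1
    return hist
```

On a constant image every difference is zero, so every interior pixel produces code 255 and the histogram collapses into a single bin.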
LBP computational breakdown • Filtering • Quantization • Feature composition • Histogramming
Local descriptor computational breakdown • Filtering • Quantization • Feature composition • Histogramming [Chart: per-stage share of computation time (14%–83%) and relative platform speedups 1x, 2.80x and 3.95x]
Local descriptor computational breakdown 0. Interpolation • Filtering • Quantization • Feature composition • Histogramming [Chart: with interpolation added as a stage, relative speedups drop to 1x, 1.25x and 1.40x; stage shares range from 4.6% to 86%]
Contents • Introduction • Computational complexity of local descriptors • LBP in desktop computers • LBP in mobile devices • LBP in dedicated computing devices
Personal (desktop) computers High performance applications Not constrained (almost) by power Numerous available technologies: libraries, programming languages, support software Short development times !!!
Personal (desktop) computer applications • Main goal: Maximize performance • High speed • High framerate • Low latency • High resolutions • Best quality
Personal (desktop) computers Computing devices: CPUs (single core or multicore) GPUs (single GPU or multiple GPUs)
General Purpose Processors (GPPs) • Essentially SISD machines • Optimized for low latency • Single or multiple cores • Include SIMD units
CPU implementation strategies for LBP • Avoiding conditional branching • Using SIMD units • Using all cores
Avoiding conditional branching • Reduces the number of conditional branches whose outcome cannot be predicted (data dependent) • Substitutes comparisons with subtractions • Uses the two’s complement representation to read the sign of the subtraction • In practice equivalent to a comparison • Needs a sufficient number of bits to avoid overflows Up to 3 times faster !!! Mäenpää, T., Turtinen, M., Pietikäinen, M.: Real-time surface inspection by texture. Real Time Imaging 9(5), 289-296 (2003)
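The sign-bit trick can be sketched as follows; this is an illustrative reconstruction assuming 32-bit two's complement words (Python's arbitrary-precision integers happen to emulate the arithmetic right shift), not the cited paper's code.

```python
# Branchless LBP bit: instead of "1 if neighbor >= center else 0",
# subtract and extract the sign bit of the result.
def lbp_bit_branchless(neighbor, center):
    # (neighbor - center) >> 31 yields all ones (-1) for a negative
    # difference and 0 otherwise; masking with 1 isolates the sign bit,
    # and 1 - sign inverts it into the desired comparison result.
    return 1 - (((neighbor - center) >> 31) & 1)
```

On a real CPU this compiles to a subtract and a shift with no branch, which is why it avoids misprediction penalties on data-dependent comparisons.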
Use of SIMD units • Included in every modern CPU core • Exploited using inline assembly, specific functions, array annotations, pragmas or enabled compilers • Computes several pixels at the same time • Not independent units (shared control code with CPU) • Requires preprocessing for maximum efficiency • About 7% overhead Up to 7x speedup Juránek, R., Herout, A., Zemĉik, P.: Implementing local binary patterns with SIMD instructions of CPU.
Exploiting multiple cores • Posix threads, Intel TBB, OpenMP, OpenCL • Divide the image into multiple overlapping stripes • Assign one stripe per core • Overlaps cause contention on data reading and overhead For N cores, up to 0.9*N times faster 2 cores = 1.8x 4 cores = 3.7x 8 cores = 6.8x Humenberger, M., Zinner, C., Kubinger, W.: Performance evaluation of a census-based stereo matching algorithm on embedded and multi-core hardware.
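The striping strategy can be sketched like this: split the image into horizontal stripes that overlap by one row (so each stripe's interior pixels are disjoint but cover the whole image), run the descriptor on each stripe in parallel, then merge the partial histograms. The function names and the thread-pool choice are my own; a C implementation would use Posix threads or OpenMP as the slide notes.

```python
from concurrent.futures import ThreadPoolExecutor

# Divide an image into overlapping stripes, one per worker, apply a
# per-stripe descriptor routine, and merge the partial histograms.
def stripe_histograms(img, n_workers, descriptor):
    h = len(img)
    rows = h // n_workers
    stripes = []
    for i in range(n_workers):
        # Overlap one row with the previous/next stripe so border pixels
        # of each stripe's interior still see all their neighbors.
        start = max(i * rows - 1, 0)
        stop = h if i == n_workers - 1 else min((i + 1) * rows + 1, h)
        stripes.append(img[start:stop])
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(descriptor, stripes))
    # Merge: element-wise sum of the partial histograms.
    return [sum(bins) for bins in zip(*partial)]
```

The overlapping rows are read by two workers, which is the contention overhead the slide mentions; in Python threads this sketch only illustrates the decomposition, since true speedup needs native code.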
Graphics processing units • Independent units (work concurrently with CPUs) • Essentially SIMD machines • Many simpler cores (hundreds) • Operating at lower clock rates • Operating on floating-point data • Built-in graphics primitives • Ideal for interpolation and filtering • Flow control, looping and branching restricted
GPU implementations • Stream processing • Exploiting shared and texture memory • Multi-platform code • Data transfer consideration
Stream processing [Diagram: input stream → processor array → output stream]
Exploiting shared and texture memory [Diagrams: shared memory model vs. stream processing model]
Exploiting shared and texture memory • Shared memory acts as a practical L2 cache • Texture memory as read-only shared memory • Textures have ”free” bilinear interpolation Up to 5x speedup
Multi-platform code • GPU can be used concurrently with CPU • OpenCL allows the use of the same code • Concurrent implementations surpass GPU-only
Multi-platform code [Diagram: input data processed concurrently by CPU and GPU running the same code, producing output data]
Data transfer • Data needs to be transferred to GPU memory: it can be a bottleneck • Data transfers can overlap computations: latency can be hidden • Long imaging pipelines preferred: more computations per transfer
Data transfer (LBP case) • LBP is memory bound • Most time consumed in memory accesses • Graphics memory bandwidth vs. graphics bus bandwidth • Transfer time about 4 times smaller than computation time • Data transfer can be hidden (affects latency but not throughput)