Evaluation of LBP (and other local descriptors) computational performance in multiple (computing) architectures Center for Machine Vision Research Miguel Bordallo López, Alejandro Nieto, Jani Boutellier, Jari Hannuksela, Olli Silvén Sami Varjo, Henri Nykänen, Abdenour Hadid Center for Machine Vision Research, University of Oulu, Finland
A sentence I read somewhere... ”LBP features are desirable because of their extremely high computational performance”... when connected to the power grid, on a high-end CPU, measured per pixel, using the basic LBP, and with no interpolation needed.
Contents • Introduction • Computational complexity of local descriptors • LBP in desktop computers • LBP in mobile devices • LBP in dedicated computing devices
Why should we care? • Evaluation of descriptors/features is done in terms of accuracy • Computational performance is (sometimes) disregarded: measured in Matlab, not processor specific, based on libraries, or not measured at all ... But ... • Faster methods can process larger amounts of input • Applications running at lower framerates might perform worse than at higher ones • Computational performance is a KEY measurement for application performance
... an example... (face recognition) • Method A: lower accuracy • Method B: higher accuracy
... an example... (face recognition) • Method A: lower accuracy, 100ms/frame • Method B: higher accuracy, 300ms/frame
What descriptor to choose? [Chart: accuracy (%) vs. computation time]
Contents • Introduction • Computational complexity of local descriptors • LBP in desktop computers • LBP in mobile devices • LBP in dedicated computing devices
LBPs are essentially local descriptors HD1080@60fps: 1920 x 1080 x 60 ≈ 125 Mpix/s UHD: up to 2 Gpix/s !!! That’s a lot of throughput !!
Linear complexity of local descriptors [Chart: computation time (ms) vs. number of pixels, with 1280x720 and 1920x1080 marked] Time grows linearly with the resolution
LBP variants [Chart: computation times for Census (8x8), LBP(24,3), VLBP(8,1), LBP-TOP(8,1), LBP(8,1), LOCP(8,1), CLBP(8,1)] Time grows linearly with the number of points
Implications Time = K * n_pixels * n_points K is implementation dependent K is platform dependent Allows for platform comparison: • CPP metric (cycles per pixel): time normalized by resolution and clock frequency
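The normalization above can be sketched as a small helper; the function name and the example figures (10 ms per frame, 2 GHz clock) are my own illustrative assumptions, not measurements from the slides.

```python
# CPP (cycles per pixel): measured time normalized by resolution and clock
# frequency, so implementations on different platforms can be compared.
def cycles_per_pixel(time_s, clock_hz, n_pixels):
    """Return how many clock cycles were spent per processed pixel."""
    return time_s * clock_hz / n_pixels

# Hypothetical example: 10 ms for one 1920x1080 frame on a 2 GHz core
cpp = cycles_per_pixel(0.010, 2.0e9, 1920 * 1080)  # ~9.6 cycles/pixel
```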
Local descriptor computational breakdown • Filtering • Quantization • Feature composition • Histogramming
Local descriptor computational breakdown • Filtering (LBP) • Quantization • Feature composition • Histogramming [Slide shows eight 3x3 difference kernels f1 ... f8, each with -1 at the center and +1 at one of the eight neighbors]
Local descriptor computational breakdown • Filtering (BSIF) • Quantization • Feature composition • Histogramming [Slide shows eight learned 3x3 filters f1 ... f8 with real-valued coefficients]
Local descriptor computational breakdown • Filtering • Quantization (LBP, LPQ, BSIF) q1 = f1 > 0, q2 = f2 > 0, ..., q8 = f8 > 0 • Feature composition • Histogramming
Local descriptor computational breakdown • Filtering • Quantization • Feature composition (LBP, LPQ, BSIF) LBP = q1*1 + q2*2 + q3*4 + ... + q8*128, or with shifts: LBP = q1 + (q2<<1) + (q3<<2) + ... + (q8<<7) • Histogramming
Local descriptor computational breakdown • Filtering • Quantization • Feature composition • Histogramming (LBP, LPQ, BSIF) if LBP = 1 then bin1++ if LBP = 2 then bin2++ ... if LBP = 255 then bin255++
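The four stages above can be sketched end-to-end for the basic LBP(8,1). This is a minimal, unoptimized reference sketch, not the authors' implementation; the function name, the list-of-lists grayscale image layout, and the neighbor ordering are my own choices.

```python
# Basic LBP(8,1) over interior pixels: filtering, quantization,
# feature composition and histogramming in one pass.
def lbp_histogram(img):
    h, w = len(img), len(img[0])
    # Eight neighbor offsets, each paired by index k with binary weight 2**k.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            c = img[y][x]
            code = 0
            for k, (dy, dx) in enumerate(offsets):
                # 1. Filtering: difference between neighbor and center
                d = img[y + dy][x + dx] - c
                # 2. Quantization: keep only the sign of the difference
                q = 1 if d >= 0 else 0
                # 3. Feature composition: weighted sum of the 8 bits
                code += q << k
            # 4. Histogramming: one of 256 bins per pixel
            hist[code] += 1
    return hist
```

On a constant image every difference is zero, so every interior pixel produces code 255 and the histogram collapses into a single bin.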
LBP computational breakdown • Filtering • Quantization • Feature composition • Histogramming
Local descriptor computational breakdown • Filtering • Quantization • Feature composition • Histogramming [Chart: per-stage share of computation time (14%–83%) and relative platform speedups 1x, 2.80x and 3.95x]
Local descriptor computational breakdown 0. Interpolation • Filtering • Quantization • Feature composition • Histogramming [Chart: with interpolation added as a stage, relative speedups drop to 1x, 1.25x and 1.40x; stage shares range from 4.6% to 86%]
Contents • Introduction • Computational complexity of local descriptors • LBP in desktop computers • LBP in mobile devices • LBP in dedicated computing devices
Personal (desktop) computers High performance applications Not constrained (almost) by power Numerous available technologies: libraries, programming languages, support software Short development times !!!
Personal (desktop) computer applications • Main goal: Maximize performance • High speed • High framerate • Low latency • High resolutions • Best quality
Personal (desktop) computers Computing devices: CPUs (single core or multicore) GPUs (single GPU or multiple GPUs)
General Purpose Processors (GPPs) • Essentially SISD machines • Optimized for low latency • Single or multiple cores • Include SIMD units
CPU implementation strategies for LBP • Avoiding conditional branching • Using SIMD units • Using all cores
Avoiding conditional branching • Reduces the number of conditional branches whose outcome cannot be predicted (data dependent) • Substitutes comparisons with subtractions • Uses the two’s complement representation to read the sign of the subtraction • In practice equivalent to a comparison • Needs a sufficient number of bits to avoid overflows Up to 3 times faster !!! Mäenpää, T., Turtinen, M., Pietikäinen, M.: Real-time surface inspection by texture. Real Time Imaging 9(5), 289-296 (2003)
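The sign-bit trick can be sketched as follows; this is an illustrative reconstruction assuming 32-bit two's complement words (Python's arbitrary-precision integers happen to emulate the arithmetic right shift), not the cited paper's code.

```python
# Branchless LBP bit: instead of "1 if neighbor >= center else 0",
# subtract and extract the sign bit of the result.
def lbp_bit_branchless(neighbor, center):
    # (neighbor - center) >> 31 yields all ones (-1) for a negative
    # difference and 0 otherwise; masking with 1 isolates the sign bit,
    # and 1 - sign inverts it into the desired comparison result.
    return 1 - (((neighbor - center) >> 31) & 1)
```

On a real CPU this compiles to a subtract and a shift with no branch, which is why it avoids misprediction penalties on data-dependent comparisons.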
Use of SIMD units • Included in every modern CPU core • Exploited using inline assembly, specific functions, array annotations, pragmas or enabled compilers • Computes several pixels at the same time • Not independent units (shared control code with CPU) • Requires preprocessing for maximum efficiency • About 7% overhead Up to 7x speedup Juránek, R., Herout, A., Zemĉik, P.: Implementing local binary patterns with SIMD instructions of CPU.
Exploiting multiple cores • Posix threads, Intel TBB, OpenMP, OpenCL • Divide the image into multiple overlapping stripes • Assign one stripe per core • Overlaps cause contention on data reading and overhead For N cores, up to 0.9*N times faster 2 cores = 1.8x 4 cores = 3.7x 8 cores = 6.8x Humenberger, M., Zinner, C., Kubinger, W.: Performance evaluation of a census-based stereo matching algorithm on embedded and multi-core hardware.
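The striping strategy can be sketched like this: split the image into horizontal stripes that overlap by one row (so each stripe's interior pixels are disjoint but cover the whole image), run the descriptor on each stripe in parallel, then merge the partial histograms. The function names and the thread-pool choice are my own; a C implementation would use Posix threads or OpenMP as the slide notes.

```python
from concurrent.futures import ThreadPoolExecutor

# Divide an image into overlapping stripes, one per worker, apply a
# per-stripe descriptor routine, and merge the partial histograms.
def stripe_histograms(img, n_workers, descriptor):
    h = len(img)
    rows = h // n_workers
    stripes = []
    for i in range(n_workers):
        # Overlap one row with the previous/next stripe so border pixels
        # of each stripe's interior still see all their neighbors.
        start = max(i * rows - 1, 0)
        stop = h if i == n_workers - 1 else min((i + 1) * rows + 1, h)
        stripes.append(img[start:stop])
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(descriptor, stripes))
    # Merge: element-wise sum of the partial histograms.
    return [sum(bins) for bins in zip(*partial)]
```

The overlapping rows are read by two workers, which is the contention overhead the slide mentions; in Python threads this sketch only illustrates the decomposition, since true speedup needs native code.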
Graphics processing units • Independent units (work concurrently with CPUs) • Essentially SIMD machines • Many simpler cores (hundreds) • Operating at lower clock rates • Operating on floating-point data • Built-in graphics primitives • Ideal for interpolation and filtering • Flow control, looping and branching restricted
GPU implementations • Stream processing • Exploiting shared and texture memory • Multi-platform code • Data transfer consideration
Stream processing [Diagram: input stream → processor array → output stream]
Exploiting shared and texture memory [Diagrams: shared memory model vs. stream processing model]
Exploiting shared and texture memory • Shared memory acts as a practical L2 cache • Texture memory as read-only shared memory • Textures have ”free” bilinear interpolation Up to 5x speedup
Multi-platform code • GPU can be used concurrently with CPU • OpenCL allows the use of the same code • Concurrent implementations surpass GPU-only
Multi-platform code [Diagram: input data processed concurrently by CPU and GPU running the same code, producing output data]
Data transfer • Data needs to be transferred to GPU memory: it can be a bottleneck • Data transfers can overlap computations: latency can be hidden • Long imaging pipelines preferred: more computations per transfer
Data transfer (LBP case) • LBP is memory bound • Most time consumed in memory accesses • Graphics memory bandwidth vs. graphics bus bandwidth • Transfer time about 4 times smaller than computation time • Data transfer can be hidden (affects latency but not throughput)