Miguel Bordallo López, Alejandro Nieto, Jani Boutellier, Jari Hannuksela, Olli Silvén

Evaluation of LBP (and other local descriptors) computational performance in multiple (computing) architectures. Center for Machine Vision Research. Miguel Bordallo López, Alejandro Nieto, Jani Boutellier, Jari Hannuksela, Olli Silvén

Presentation Transcript


  1. Evaluation of LBP (and other local descriptors) computational performance in multiple (computing) architectures. Miguel Bordallo López, Alejandro Nieto, Jani Boutellier, Jari Hannuksela, Olli Silvén, Sami Varjo, Henri Nykänen, Abdenour Hadid. Center for Machine Vision Research, University of Oulu, Finland

  2. A sentence I read somewhere... "LBP features are desirable because of their extremely high computational performance..." ...connected to the power grid, ...on a high-end CPU, ...per pixel, if we use the basic LBP and we don’t need interpolation.

  3. Contents • Introduction • Computational complexity of local descriptors • LBP in desktop computers • LBP in mobile devices • LBP in dedicated computing devices

  4. Why should we care? • Evaluation of descriptors/features is done in terms of accuracy • Computational performance is (sometimes) disregarded • In Matlab, not processor specific, based on libraries, not measured, ... But ... • Faster methods are able to process larger amounts of input • Applications at lower framerates might perform worse than at higher rates • Computational performance is a KEY measurement for application performance

  5. ... an example... (face recognition) • Method A: lower accuracy • Method B: higher accuracy

  7. ... an example... (face recognition) • Method A: lower accuracy, 100ms/frame • Method B: higher accuracy, 300ms/frame

  10. What descriptor to choose? [Plot: accuracy (%) vs. computation time]

  11. Contents • Introduction • Computational complexity of local descriptors • LBP in desktop computers • LBP in mobile devices • LBP in dedicated computing devices

  12. LBPs are essentially local descriptors • HD1080@60fps: 1920x1080x60 ≈ 125 Mpix/s • UHD: up to 2 Gpix/s • That’s a lot of throughput!
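The throughput figures on this slide are plain arithmetic and can be checked with a short sketch. The function name `pixel_rate` and the 8K UHD resolution used for the "up to 2 Gpix/s" case are assumptions made for illustration; the slide does not say which UHD mode it has in mind.

```python
def pixel_rate(width, height, fps):
    """Pixels per second that a per-pixel descriptor has to process."""
    return width * height * fps

hd1080 = pixel_rate(1920, 1080, 60)
uhd_8k = pixel_rate(7680, 4320, 60)  # assumed: 8K UHD at 60 fps

print(f"HD1080@60fps: {hd1080 / 1e6:.1f} Mpix/s")
print(f"8K UHD@60fps: {uhd_8k / 1e9:.2f} Gpix/s")
```

The exact HD figure is about 124.4 Mpix/s, which the slide rounds to 125 Mpix/s.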

  13. Linear complexity of local descriptors [Plot: time (ms) vs. number of pixels; marked resolutions 1280x720 and 1920x1080]

  14. Linear complexity of local descriptors [Plot: time (ms) vs. number of pixels; marked resolutions 1280x720 and 1920x1080] Time grows linearly with the resolution

  15. Linear complexity of local descriptors [Plot: time (ms) vs. number of pixels; marked resolution 1920x1080] Time grows linearly with the resolution

  16. LBP variants [Plot: computation time of Census (8x8), LBP(24,3), VLBP(8,1), LBP-TOP(8,1), LBP(8,1), LOCP(8,1), CLBP(8,1)] Time grows linearly with the number of points

  17. Implications • Time = K * n_pixels * n_points • K is implementation dependent • K is platform dependent • Allows for platform comparison: • CPP metric (cycles per pixel) • Time normalized by resolution and clock frequency
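A minimal sketch of the two relations on this slide follows. It assumes that CPP is computed as measured time multiplied by clock frequency and divided by the number of pixels, which is one reading of "time normalized by resolution and clock frequency". The function names and the measurement (10 ms per 1920x1080 frame on a 2 GHz processor) are hypothetical.

```python
def cycles_per_pixel(time_s, clock_hz, n_pixels):
    """CPP: clock cycles spent per pixel (assumed definition, see lead-in)."""
    return time_s * clock_hz / n_pixels

def predicted_time(k, n_pixels, n_points):
    """Linear model from the slide: Time = K * n_pixels * n_points."""
    return k * n_pixels * n_points

# Hypothetical measurement: 10 ms per 1920x1080 frame at 2 GHz, 8 sampling points.
measured_s = 0.010
full_hd = 1920 * 1080
cpp = cycles_per_pixel(measured_s, 2.0e9, full_hd)

# Fit K from that single measurement, then extrapolate to 1280x720.
k = measured_s / (full_hd * 8)
hd720_s = predicted_time(k, 1280 * 720, 8)

print(f"CPP = {cpp:.2f} cycles/pixel")
print(f"predicted 1280x720 time = {hd720_s * 1e3:.2f} ms")
```

Because K is platform and implementation dependent, a value fitted on one device says nothing about another; CPP is what makes the devices comparable.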

  18. Local descriptor computational breakdown • Filtering • Quantization • Feature composition • Histogramming

  19. Local descriptor computational breakdown • Filtering (LBP) • Quantization • Feature composition • Histogramming
  Example filter kernels (f1, f2, f3, ..., f8):
       0  0  0      1  0  0      0  1  0      0  0  1
       0 -1  0      0 -1  0      0 -1  0      0 -1  0
       1  0  0      0  0  0      0  0  0      0  0  0    ...

  20. Local descriptor computational breakdown • Filtering (BSIF) • Quantization • Feature composition • Histogramming
  Example filter kernels (f1, f2, f3, ..., f8):
      -0.18  0.19 -0.19      2.50 -2.22  0.29      0.03 -0.48  0.63     -0.67  0.69  0.13
      -1.56  1.46  0.05     -0.01 -0.14  0.25     -3.16  3.29 -0.79      0.22  0.19  0.08
      -0.35  2.74 -2.68     -0.67  0.95 -0.38      0.60 -2.72  2.20      0.75 -1.07  0.40    ...

  21. Local descriptor computational breakdown • Filtering • Quantization (LBP, LPQ, BSIF): q1 = f1 > 0, q2 = f2 > 0, ..., q8 = f8 > 0 • Feature composition • Histogramming

  22. Local descriptor computational breakdown • Filtering • Quantization • Feature composition (LBP, LPQ, BSIF): LBP = q1*1 + q2*2 + q3*4 + ... + q8*128, or equivalently LBP = q1 + (q2<<1) + (q3<<2) + ... + (q8<<7) • Histogramming

  23. Local descriptor computational breakdown • Filtering • Quantization • Feature composition • Histogramming (LBP, LPQ, BSIF): if LBP = 1 then bin1++, if LBP = 2 then bin2++, ..., if LBP = 255 then bin255++
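The four stages of the breakdown (filtering, quantization, feature composition, histogramming) can be put together in a small pure-Python sketch for the basic 3x3 LBP. The neighbour ordering in `OFFSETS` and the function names are assumptions; the slides do not fix which neighbour corresponds to f1. Border pixels are skipped for simplicity.

```python
# Neighbour offsets (dy, dx) for f1..f8; this ordering is an assumption.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, y, x):
    """LBP code of the pixel at (y, x), which must not lie on the border."""
    center = img[y][x]
    code = 0
    for i, (dy, dx) in enumerate(OFFSETS):
        f = img[y + dy][x + dx] - center  # 1. filtering: neighbour minus center
        q = 1 if f > 0 else 0             # 2. quantization: q_i = f_i > 0
        code += q << i                    # 3. feature composition: q_i * 2^i
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1  # 4. histogramming
    return hist

img = [[9, 1, 9],
       [1, 5, 1],
       [9, 1, 9]]
hist = lbp_histogram(img)
print(lbp_code(img, 1, 1))  # corners are brighter than the center: bits 0, 2, 4, 6
```

Indexing the histogram directly with the code replaces the chain of "if LBP = k then bin_k++" tests with a single memory access.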

  24. LBP computational breakdown • Filtering • Quantization • Feature composition • Histogramming

  26. Local descriptor computational breakdown • Filtering • Quantization • Feature composition • Histogramming [Chart values: 1x, 2.80x, 3.95x; 76%, 83%, 56%, 20%, 14%]

  27. Local descriptor computational breakdown • Interpolation (step 0) • Filtering • Quantization • Feature composition • Histogramming [Chart values: 1x, 2.80x, 3.95x; 76%, 83%, 56%, 20%, 14%]

  28. Local descriptor computational breakdown • Interpolation (step 0) • Filtering • Quantization • Feature composition • Histogramming [Chart values: 1x, 2.80x, 3.95x]

  29. Local descriptor computational breakdown • Interpolation (step 0) • Filtering • Quantization • Feature composition • Histogramming [Chart values: 1x, 1.25x, 1.40x; 86%, 70%, 62%, 56%, 4.6%, 23%, 31%]

  30. Contents • Introduction • Computational complexity of local descriptors • LBP in desktop computers • LBP in mobile devices • LBP in dedicated computing devices

  31. Personal (desktop) computers • High performance applications • Not constrained (almost) by power • Numerous available technologies: libraries, programming languages, support software • Short development times!

  32. Personal (desktop) computer applications • Main goal: Maximize performance • High speed • High framerate • Low latency • High resolutions • Best quality

  33. Personal (desktop) computers Computing devices: CPUs (single core or multicore) GPUs (single GPU or multiple GPUs)

  34. General Purpose Processors (GPPs) • Essentially SISD machines • Optimized for low latency • Single or multiple cores • Include SIMD units

  35. CPU implementation strategies for LBP • Avoiding conditional branching • Using SIMD units • Using all cores

  36. Avoiding conditional branching • Reduces the number of conditional branches • The branch outcome cannot be predicted • Replaces comparisons with subtractions • Uses the "two's complement" numeric representation to read the sign of the subtraction • In practice equivalent to a comparison • Needs enough bits to avoid overflow • Up to 3 times faster! Mäenpää, T., Turtinen, M., Pietikäinen, M.: Real-time surface inspection by texture. Real Time Imaging. 9(5), 289-296 (2003)
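The subtraction trick can be sketched as follows. Python integers have no fixed width, so the sketch emulates a `bits`-wide two's complement register by masking; the function names and the choice of 16 bits are assumptions. Python itself gains nothing from avoiding the branch, so this only shows the arithmetic that a C or assembly implementation would rely on.

```python
def greater_branchy(neighbour, center):
    """Reference version: a comparison, which normally compiles to a branch."""
    return 1 if neighbour > center else 0

def greater_sign_bit(neighbour, center, bits=16):
    """Branch-free version: read the sign bit of (center - neighbour).

    The difference is negative exactly when neighbour > center, and in
    two's complement a negative number has its top bit set.
    """
    mask = (1 << bits) - 1
    diff = (center - neighbour) & mask  # emulate a fixed-width register
    return diff >> (bits - 1)           # top bit: 1 if negative, else 0

# With 8-bit pixels the difference spans -255..255, so 16 bits are enough.
agree = all(greater_sign_bit(n, c) == greater_branchy(n, c)
            for n in range(256) for c in range(256))
print(agree)

# With only 8 bits the difference 200 - 0 wraps into the sign bit: overflow.
print(greater_sign_bit(0, 200, bits=8), greater_branchy(0, 200))
```

The last line shows why the slide asks for enough bits: with an 8-bit register the sign bit no longer reflects the true sign of the difference.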

  37. Use of SIMD units • Included in every modern CPU core • Exploited using inline assembly, specific functions, array annotations, pragmas or enabled compilers • Compute several pixels at the same time • Not independent units (shared control code with CPU) • Require preprocessing for maximum efficiency • About 7% overhead • Up to 7x speedup. Juránek, R., Herout, A., Zemčík, P.: Implementing local binary patterns with SIMD instructions of CPU.

  38. Exploiting multiple cores • POSIX threads, Intel TBB, OpenMP, OpenCL • Divide the image into multiple overlapping stripes • Assign one stripe per core • Overlaps cause contention on data reads and overhead • For N cores, up to about 0.9*N times faster: 2 cores = 1.8x, 4 cores = 3.7x, 8 cores = 6.8x. Humenberger, M., Zinner, C., Kubinger, W.: Performance evaluation of a census-based stereo matching algorithm on embedded and multi-core hardware.
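The stripe partitioning can be sketched with Python's standard thread pool. This only illustrates how the image is divided and why neighbouring stripes overlap; CPython threads will not deliver the speedups quoted on the slide, which come from native implementations. All function names, the neighbour ordering and the test image are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Neighbour offsets (dy, dx); this ordering is an assumption.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_rows(img, first, last):
    """LBP codes for rows first..last-1; reads one extra row above and below."""
    width = len(img[0])
    out = []
    for y in range(first, last):
        row = []
        for x in range(1, width - 1):
            c = img[y][x]
            code = 0
            for i, (dy, dx) in enumerate(OFFSETS):
                code |= (img[y + dy][x + dx] > c) << i
            row.append(code)
        out.append(row)
    return out

def stripe_ranges(height, n_stripes):
    """Split the interior rows 1..height-2 into contiguous output ranges."""
    base, extra = divmod(height - 2, n_stripes)
    ranges, start = [], 1
    for i in range(n_stripes):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def lbp_striped(img, n_workers):
    """One stripe per worker; stripes share the rows at their boundaries."""
    ranges = stripe_ranges(len(img), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda r: lbp_rows(img, *r), ranges))
    return [row for part in parts for row in part]

# Deterministic synthetic image, 10 rows by 12 columns.
img = [[(7 * x + 13 * y * y) % 256 for x in range(12)] for y in range(10)]
same = lbp_striped(img, 4) == lbp_rows(img, 1, len(img) - 1)
print(stripe_ranges(len(img), 4), same)
```

Each stripe writes a disjoint set of output rows but reads one row beyond each end, so adjacent workers read the same boundary rows; this is the overlap that the slide names as a source of contention and overhead.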

  39. Comparative performance

  40. Comparative performance

  41. Graphics processing units • Independent units (work concurrently with CPUs) • Essentially SIMD machines • Many simpler cores (hundreds) • Operating at lower clock rates • Operating on floating-point data • Built-in graphics primitives • Ideal for interpolation and filtering • Flow control, looping and branching restricted

  42. GPU implementations • Stream processing • Exploiting shared and texture memory • Multi-platform code • Data transfer consideration

  43. Stream processing [Diagram: input stream → processor array → output stream]

  44. Exploiting shared and texture memory [Diagram: shared memory model vs. stream processing model]

  45. Exploiting shared and texture memory • Shared memory acts as a practical L2 cache • Texture memory as read-only shared memory • Textures have "free" bilinear interpolation • Up to 5x speedup
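To make concrete what "free" bilinear interpolation saves, the sketch below performs in software the computation that a GPU texture unit performs in hardware when a texture is sampled at a non-integer coordinate. It is applied to sampling a point on a circle around a pixel, as circular LBP variants need. The function names and the angle convention are assumptions.

```python
import math

def bilinear(img, y, x):
    """Bilinearly interpolated value of img at a fractional (y, x) position."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(img) - 1)     # clamp at the image border
    x1 = min(x0 + 1, len(img[0]) - 1)
    fy, fx = y - y0, x - x0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def circular_sample(img, cy, cx, radius, n_points, i):
    """Value at the i-th of n_points positions on a circle around (cy, cx)."""
    angle = 2 * math.pi * i / n_points  # assumed: counter-clockwise from +x
    return bilinear(img, cy - radius * math.sin(angle),
                    cx + radius * math.cos(angle))

small = [[0, 10],
         [20, 30]]
mid = bilinear(small, 0.5, 0.5)

ramp = [[10 * y + x for x in range(3)] for y in range(3)]
diagonal = circular_sample(ramp, 1, 1, 1.0, 8, 1)  # 45 degree neighbour
print(mid, diagonal)
```

On a CPU this costs several multiplications and additions per sampled point, which is why interpolation appears as an extra stage in the computational breakdown.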

  46. Multi-platform code • GPU can be used concurrently with CPU • OpenCL allows the use of the same code • Concurrent implementations surpass GPU-only

  47. Multi-platform code [Diagram: input data → CPU and GPU → output data] • CPU and GPU used concurrently • Same code for both devices

  48. Data transfer • Data needs to be transferred to GPU memory: it can be a bottleneck • Data transfers can overlap computations: latency can be hidden • Long imaging pipelines preferred: more computations per transfer

  49. Data transfer (LBP case) • LBP is memory bound • Most time is consumed in memory accesses • Graphics memory bandwidth vs. graphics bus bandwidth • Transfer time about 4 times smaller than computation time • Data transfer can be hidden (affects latency but not throughput)
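The claim that a hidden transfer affects latency but not throughput can be checked with a simple timing model. The model is an assumption made for illustration: one upload per frame, an ideal two-stage pipeline in which the transfer of the next frame overlaps the computation of the current one, and no download cost. The time units are arbitrary, with the 1:4 ratio taken from the slide.

```python
def serial_time(n_frames, transfer, compute):
    """Total time when every frame is transferred first and then computed."""
    return n_frames * (transfer + compute)

def overlapped_time(n_frames, transfer, compute):
    """Total time when transfer i+1 overlaps computation i (ideal pipeline).

    The first transfer cannot be hidden; after that the slower of the two
    stages sets the pace, and the last computation finishes the run.
    """
    return transfer + (n_frames - 1) * max(transfer, compute) + compute

transfer, compute = 1, 4   # transfer about 4 times shorter than computation
n = 100

print(serial_time(n, transfer, compute))      # every frame pays for both
print(overlapped_time(n, transfer, compute))  # close to n * compute
print(transfer + compute)                     # per-frame latency is unchanged
```

With overlap the run approaches n times the computation time, so throughput is set by computation alone, while each individual frame still takes transfer plus computation to come out.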

  50. Comparative performance