1 / 37

Nearest Neighbor POINT LOCATION Using KDTree on GPU

By Shawn Brown. Nearest Neighbor POINT LOCATION Using KDTree on GPU. Terrain Visualization: Want to compare results of 2 different surface generation methods against baseline for error Baseline (C) contains ~100 million terrain points Method A & B each generate ~100 million points

javen
Download Presentation

Nearest Neighbor POINT LOCATION Using KDTree on GPU

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. By Shawn Brown Nearest Neighbor POINT LOCATIONUsing KDTree on GPU

  2. Terrain Visualization: Want to compare results of 2 different surface generation methods against baseline for error • Baseline (C) contains ~100 million terrain points • Method A & B each generate ~100 million points • Map A, B into interpolated C to get relative error • Need efficient nearest neighbor tool for this purpose Research GOAL

  3. Investigate using KDTree Nearest Neighbor on GPU • Single query point • Multiple query points • Can it be made to work? • Is it fast enough to be useful? GPU Project GOAL

  4. Nearest Neighbor Searches • Brute Force Method, KD Tree Methods • GPU Brute Force Method & Results • GPU KD Tree Method & Results • Limitations & Conclusions • Future Work • Appendices (kNN Search, Skew Grids, CUDA tips) Overview

  5. Given query point ‘q’ and a set of points ‘P’ • Find nearest point ‘p[i]’ to ‘q’ from a set of points ‘P’ under 2-norm measure of distance Nearest Neighbor SEARCH

  6. Linear Search • Compare query point ‘q’ against each of the ‘n’ points in point set “P” using 2-norm distance measure. Brute Force (NN Search)

  7. Linear NN

  8. An axis aligned spatial partitioning data structure for organizing data in k-dimensional space • Balanced vs. Unbalanced • Axis (Cycled vs. optimal) • Leaf vs. Node storage • # nodes in leafs • # dimensions KD Tree (special case of BSP Tree) * Image from Wikipedia.com

  9. Balanced • Pro: O(n log n) build time • Pro: Height is log2(N) • Con: Non-optimal Spatial subdivision • Cycled (x,y,x,y,…) • Pro: reduces storage • Con: Non-optimal split KD Tree (My Choices) • Node Storage • 1 point per node • Pro: O(n) storage • Pro: No special case logic for leafs • Con: May be faster to handle ‘k’ points in leaf • 2D (for now)

  10. Take original point set P0 and partition into three sets {pm}, P1 , and P2 on Kth dimension. • Save partition value {pm} in current KD tree node • as split point • P1 = left child set, P2 = right child set • Uses similar logic to q-sort partition function • Recursively partition each child set into 3 approximately equal size child sets alternating over k dimensions in turn until each child point set contains ‘1’ point. • The hierarchal construction of left and right children sets from the parent set form the KD tree. KD Tree (Creation)

  11. Depth first search on KDTree for query point • Always turn in direction of containing node • 1D Interval containing current splitValue of query point • Keep track of trim distance • Best (closest) distance found so far • Always add containing node (1D Interval) to stack • Optionally, Add other child node to stack • Don’t add if 1D Interval outside trim distance • Pop nodes off stack & process • Discard immediately if outside 1D interval trim distance • Index of node with closest distance is NN. KD TREE (NN Search)

  12. KD TREE (PER THREAD STACK) • DFS needs a stack • Multi threads → multi stacks • Do we have enough room? • 16K of __shared__ memory • 16K/32 threads • 512 bytes per thread • 32 element stack (2 fields) • 2 * 4 * 32 = 256 bytes • Just enough Space • A 32 high stack can traverse a balanced binary tree of 2^32 = 4 billion + element KD Tree Thread 1 Thread 2 … 32 Elements

  13. KDTree NN *Trim Distance used to eliminate portions of KD Tree from consideration

  14. Distance Kernel • Computes distance dist[i] from p[i] to query point q • O(n/k) where k = # of threads in parrallel • Reduction Kernel • Finds min distance dist[i] & index ‘i’ by reduction • Find min distance in each thread block store in smaller result vector • Keep looping until we have final singleton value in final thread block • O(n log n) = n + n/16 + n/256 + … + 1 GPU BRUTE FORCE IMPLEMENTATION

  15. Optimal Thread block size • 32x1, 64x1, or 128x1 • Distance kernel quite fast • 49.88 GFLOP/s • Reduction kernel much slower • 1.46 GFLOP/s • Bus Transfer (CPU -> GPU) • 1.66 GB/s • GPU faster than CPU • Cost: Transfer vector onto GPU • Cost: Slow reduction kernel GPU Brute Force Results GPU Throughput: 288 queries/s CPU Throughput: 45 queries/s

  16. KD Tree Build done on CPU • Transfer KD Tree Nodes onto GPU • Transfer query vector onto GPU • KD NN Search algorithm implemented as single GPU Kernel • Transfer NN result vector from GPU • Compare GPU results against similar CPU algorithm results. GPU KD Tree Implementation

  17. Use __shared__ memory • __align__ data structures (4,8,16 bytes) • 37% speedup in my algorithm • Use pinned memory I/O • 40%-60% speedup in my algorithm • Compress or Break apart data structures • 3x-5x speedup in my algorithm • Simplify code (final code much smaller) GPU KD TREE DESIGN WINS

  18. KD Tree RESULTS (STATS) Environment: All GPU results are from a NVidia GeForce 8800 GTX All CPU results are from a Pentium Dual Core 6700 chipset (2.66GHZ) Software environment is Windows XP, SP2

  19. 1 Million Search Points, 1 Million Query Points • Uniform Search Set (1x1) = Uniform Query Set • GPU kernel is 6x-7x times faster than CPU code • Optimal Thread Block size • 16x1 or 64x1 • Cut over point (10,000 points) • < 10,000 queries use CPU algorithm • > 10,000 queries use GPU kernel KD NN Search (RESULTS)

  20. Bus Transfer • Onto GPU: ~2.26 GB/s • From GPU: ~2.18 GB/s • Query Throughput (Compute Throughput) • GPU: 4.79 million queries/s (16.84 GFLOP/s) • CPU:0.68 million queries/s (2.15 GFLOP/s) • GPU Faster than CPU • Cost: Transfer KD Nodes onto GPU • Cost: Non-Uniform Data KD NN Search (RESULTS cont.)

  21. WHAT ABOUT Non-Uniform Data? Teapot in a Stadium? Environment: All GPU results are from a NVidia GeForce 8800 GTX All CPU results are from a Pentium Dual Core 6700 chipset (2.66GHZ) Software environment is Windows XP, SP2

  22. Numerical Precision (GPU & CPU results off slightly) • Interior vs. Exterior queries • Works best on interior queries • Uniform data vs. Non-Uniform • Works best on uniform data (search & query) • Need large datasets to get performance benefit • Overcome transfer costs of KDTree Nodes • Requires Balanced KD Tree • minimize height of tree (for per thread stack) • Static Data Sets • Dynamic changes → rebuild/reload KD Tree (expensive) Limitations:

  23. A GPU Kernel for KDTree NN Search can be made to work and works well within certain limitations • Correctness: Gives Correct solution (within precision) • Performance: Faster than CPU • 3x-7x faster than CPU algorithm for large datasets • Useful: uniform data, interior queries, large datasets • Borderline:Non-uniform, interior queries, medium data • Not so useful:exterior queries, small data sets CoNCLUSIONS:

  24. Add visualization tools (point set, KD Tree, etc.) Terrain Error Visualization kNN Search (see appendix) Performance: Can we optimize algorithm further Limitations: more testing of varying data sets Handle Huge Data Sets (100 Million+ points)(see apx) KD Tree variants, Hierarchical Bounding Volumes, Voronoi Diagrams Performance: CPU algorithm using SSE instructions Future WORK

  25. [DEB00] M.de Berg et al, Computational Geometry, Algorithms and Applications, 2nd Edition, Springer Verlag, New York, 2000. [FOL05] T. Foley and J Sugerman, KD-Tree Acceleration Structures for a GPU Raytracer, ACM SIGGRAPH/Eurographics conference on Graphics Hardware (HWWS), July 2005, pp. 15-22 [JEN01] H. Jensen, Realistic Image Synthesis Using Photon Mapping, A K Peters, Natick MA, 2001 [THR05] N. Thrane and L. Simonsen, A Comparison of Acceleration Structures for GPU Assisted Ray Tracing, Masters Thesis, University of Aarhus, August 2005 Kudos To Christian Lauterbach REferences

  26. Any Questions? THANK YOU

  27. Search for ‘k’ nearest points to query point We added stack support. Can we make enough room for a small heap as well? Yes, we can! Max heap can be used to track ‘k’ best points Trim distance takes longer to shrink resulting in more nodes visited. I just finished a working version on Sunday Appendix A: ‘k’ NN Search

  28. Closest Point Heap • acts like simple array on first ‘k-1’ insertions • Trim distance remains HUGE until… • Turns into a max heap on ‘k’th insertion • Heapify array (takes O(k) time) • Trim distance = root of max heap • Replace top of heap on subsequent updates • Demote heap top to proper position in heap • Takes O( log(k) ) time • Update trim distance to root of max heap APPENDIX A: ‘K’ NN Search

  29. Optimal Thread block size • 4x1, 32x1 • Query Throughput • GPU: 0.46 million queries/s • CPU: 0.20 million queries/s • Bus Transfer (CPU -> GPU) • Onto GPU: 2.25 GB/s • From GPU: 2.20 GB/s • GPU only slightly faster than CPU • Cost: Transfer result off GPU • Cost: More nodes visited in KDTree Appendix A: “K” NN Search (Stats)

  30. Need to handle 100 million (or larger) datasets • Could use a regular k-Grid (bin search points) • Bin query point (do KD tree search on cell) • Is circle (qp+dist) inside cell, return result • Otherwise, find neighboring cells that overlap • Return min result of all tests • How to handle empty cells? • How to handle exterior query points? • Doesn’t adapt well to data Appendix B: (Large Data Sets)

  31. Skew Set = irregular square grid • Given point set with ‘n’ points • Choose ‘k’ points per cell in grid • Grid will contain c=n/k cells • ~ length of each grid dimension will be approximately l=d-root(c=n/k) • Each cell will contain k points APPENDIX B: Skew Set

  32. 2D Skew Grid ConstructionPartition on x-dim into l=sqrt(n/k) slabseach slab contains approximately k*sqrt(n/k) points

  33. 2D Skew Grid constructionpartition each slab into cellseach cell contains approximately ‘k’ points

  34. Skew Grid NN search • Binary search to find slab containing query point • Binary search to find cell containing query point • KD Tree NN Search on current cell (get trim dist) • Is circle (qp+trim dist)completely contained? • Otherwise, Find other neighboring Skew cells which overlap circle centered at query point. • Run KDTree NN Search on each such cell • Return minimum result of all tests done • Exterior qp’s require different algorithm SKEW GriD

  35. Use __shared__ memory • Use simple data structures in shared memory • __align__ data structures (4,8,16 bytes) • 37% speedup in my algorithm • Use pinned memory I/O • 40%-60% speedup in my algorithm • Compress or Break apart data structures • 3x-5x speedup in my algorithm • Develop Methodically Appendix C: CUDA TIPS

  36. Use Different Memory Models • Use __constant__ variables • Use __constant__ Memory (Read only, 64K, 8K cache) • Use Texture Memory (Read only, 8K cache) • Use branch predication instead of conditionals • Use fast math operations • FMAD, __mul24, __fdividef( x, y), etc. • Avoid division, modulus for integers APPENDIX C: (More CUDA TIPS)

  37. Debugging: Divide & Conquer • Approach: Comment out half of code until offending code is found • Bug: No references on input parameters • Work around: Use device pointers or pass by value • Bug: Alignment issues • Work around: Rewrite code component wise • Bug: References to __shared__ memory • Work around: Rewrite without references Appendix C: CUDA Bugs (emulated mode works, Kernel Doesn’t)

More Related