370 likes | 565 Views
By Shawn Brown. Nearest Neighbor POINT LOCATION Using KDTree on GPU. Terrain Visualization: Want to compare results of 2 different surface generation methods against baseline for error Baseline (C) contains ~100 million terrain points Method A & B each generate ~100 million points
E N D
By Shawn Brown Nearest Neighbor POINT LOCATIONUsing KDTree on GPU
Terrain Visualization: Want to compare results of 2 different surface generation methods against baseline for error • Baseline (C) contains ~100 million terrain points • Method A & B each generate ~100 million points • Map A, B into interpolated C to get relative error • Need efficient nearest neighbor tool for this purpose Research GOAL
Investigate using KDTree Nearest Neighbor on GPU • Single query point • Multiple query points • Can it be made to work? • Is it fast enough to be useful? GPU Project GOAL
Nearest Neighbor Searches • Brute Force Method, KD Tree Methods • GPU Brute Force Method & Results • GPU KD Tree Method & Results • Limitations & Conclusions • Future Work • Appendices (kNN Search, Skew Grids, CUDA tips) Overview
Given query point ‘q’ and a set of points ‘P’ • Find nearest point ‘p[i]’ to ‘q’ from a set of points ‘P’ under 2-norm measure of distance Nearest Neighbor SEARCH
Linear Search • Compare query point ‘q’ against each of the ‘n’ points in point set “P” using 2-norm distance measure. Brute Force (NN Search)
An axis aligned spatial partitioning data structure for organizing data in k-dimensional space • Balanced vs. Unbalanced • Axis (Cycled vs. optimal) • Leaf vs. Node storage • # nodes in leafs • # dimensions KD Tree (special case of BSP Tree) * Image from Wikipedia.com
Balanced • Pro: O(n log n) build time • Pro: Height is log2(N) • Con: Non-optimal Spatial subdivision • Cycled (x,y,x,y,…) • Pro: reduces storage • Con: Non-optimal split KD Tree (My Choices) • Node Storage • 1 point per node • Pro: O(n) storage • Pro: No special case logic for leafs • Con: May be faster to handle ‘k’ points in leaf • 2D (for now)
Take original point set P0 and partition into three sets {pm}, P1 , and P2 on Kth dimension. • Save partition value {pm} in current KD tree node • as split point • P1 = left child set, P2 = right child set • Uses similar logic to q-sort partition function • Recursively partition each child set into 3 approximately equal size child sets alternating over k dimensions in turn until each child point set contains ‘1’ point. • The hierarchal construction of left and right children sets from the parent set form the KD tree. KD Tree (Creation)
Depth first search on KDTree for query point • Always turn in direction of containing node • 1D Interval containing current splitValue of query point • Keep track of trim distance • Best (closest) distance found so far • Always add containing node (1D Interval) to stack • Optionally, Add other child node to stack • Don’t add if 1D Interval outside trim distance • Pop nodes off stack & process • Discard immediately if outside 1D interval trim distance • Index of node with closest distance is NN. KD TREE (NN Search)
KD TREE (PER THREAD STACK) • DFS needs a stack • Multi threads → multi stacks • Do we have enough room? • 16K of __shared__ memory • 16K/32 threads • 512 bytes per thread • 32 element stack (2 fields) • 2 * 4 * 32 = 256 bytes • Just enough Space • A 32 high stack can traverse a balanced binary tree of 2^32 = 4 billion + element KD Tree Thread 1 Thread 2 … 32 Elements
KDTree NN *Trim Distance used to eliminate portions of KD Tree from consideration
Distance Kernel • Computes distance dist[i] from p[i] to query point q • O(n/k) where k = # of threads in parrallel • Reduction Kernel • Finds min distance dist[i] & index ‘i’ by reduction • Find min distance in each thread block store in smaller result vector • Keep looping until we have final singleton value in final thread block • O(n log n) = n + n/16 + n/256 + … + 1 GPU BRUTE FORCE IMPLEMENTATION
Optimal Thread block size • 32x1, 64x1, or 128x1 • Distance kernel quite fast • 49.88 GFLOP/s • Reduction kernel much slower • 1.46 GFLOP/s • Bus Transfer (CPU -> GPU) • 1.66 GB/s • GPU faster than CPU • Cost: Transfer vector onto GPU • Cost: Slow reduction kernel GPU Brute Force Results GPU Throughput: 288 queries/s CPU Throughput: 45 queries/s
KD Tree Build done on CPU • Transfer KD Tree Nodes onto GPU • Transfer query vector onto GPU • KD NN Search algorithm implemented as single GPU Kernel • Transfer NN result vector from GPU • Compare GPU results against similar CPU algorithm results. GPU KD Tree Implementation
Use __shared__ memory • __align__ data structures (4,8,16 bytes) • 37% speedup in my algorithm • Use pinned memory I/O • 40%-60% speedup in my algorithm • Compress or Break apart data structures • 3x-5x speedup in my algorithm • Simplify code (final code much smaller) GPU KD TREE DESIGN WINS
KD Tree RESULTS (STATS) Environment: All GPU results are from a NVidia GeForce 8800 GTX All CPU results are from a Pentium Dual Core 6700 chipset (2.66GHZ) Software environment is Windows XP, SP2
1 Million Search Points, 1 Million Query Points • Uniform Search Set (1x1) = Uniform Query Set • GPU kernel is 6x-7x times faster than CPU code • Optimal Thread Block size • 16x1 or 64x1 • Cut over point (10,000 points) • < 10,000 queries use CPU algorithm • > 10,000 queries use GPU kernel KD NN Search (RESULTS)
Bus Transfer • Onto GPU: ~2.26 GB/s • From GPU: ~2.18 GB/s • Query Throughput (Compute Throughput) • GPU: 4.79 million queries/s (16.84 GFLOP/s) • CPU:0.68 million queries/s (2.15 GFLOP/s) • GPU Faster than CPU • Cost: Transfer KD Nodes onto GPU • Cost: Non-Uniform Data KD NN Search (RESULTS cont.)
WHAT ABOUT Non-Uniform Data? Teapot in a Stadium? Environment: All GPU results are from a NVidia GeForce 8800 GTX All CPU results are from a Pentium Dual Core 6700 chipset (2.66GHZ) Software environment is Windows XP, SP2
Numerical Precision (GPU & CPU results off slightly) • Interior vs. Exterior queries • Works best on interior queries • Uniform data vs. Non-Uniform • Works best on uniform data (search & query) • Need large datasets to get performance benefit • Overcome transfer costs of KDTree Nodes • Requires Balanced KD Tree • minimize height of tree (for per thread stack) • Static Data Sets • Dynamic changes → rebuild/reload KD Tree (expensive) Limitations:
A GPU Kernel for KDTree NN Search can be made to work and works well within certain limitations • Correctness: Gives Correct solution (within precision) • Performance: Faster than CPU • 3x-7x faster than CPU algorithm for large datasets • Useful: uniform data, interior queries, large datasets • Borderline:Non-uniform, interior queries, medium data • Not so useful:exterior queries, small data sets CoNCLUSIONS:
Add visualization tools (point set, KD Tree, etc.) Terrain Error Visualization kNN Search (see appendix) Performance: Can we optimize algorithm further Limitations: more testing of varying data sets Handle Huge Data Sets (100 Million+ points)(see apx) KD Tree variants, Hierarchical Bounding Volumes, Voronoi Diagrams Performance: CPU algorithm using SSE instructions Future WORK
[DEB00] M.de Berg et al, Computational Geometry, Algorithms and Applications, 2nd Edition, Springer Verlag, New York, 2000. [FOL05] T. Foley and J Sugerman, KD-Tree Acceleration Structures for a GPU Raytracer, ACM SIGGRAPH/Eurographics conference on Graphics Hardware (HWWS), July 2005, pp. 15-22 [JEN01] H. Jensen, Realistic Image Synthesis Using Photon Mapping, A K Peters, Natick MA, 2001 [THR05] N. Thrane and L. Simonsen, A Comparison of Acceleration Structures for GPU Assisted Ray Tracing, Masters Thesis, University of Aarhus, August 2005 Kudos To Christian Lauterbach REferences
Any Questions? THANK YOU
Search for ‘k’ nearest points to query point We added stack support. Can we make enough room for a small heap as well? Yes, we can! Max heap can be used to track ‘k’ best points Trim distance takes longer to shrink resulting in more nodes visited. I just finished a working version on Sunday Appendix A: ‘k’ NN Search
Closest Point Heap • acts like simple array on first ‘k-1’ insertions • Trim distance remains HUGE until… • Turns into a max heap on ‘k’th insertion • Heapify array (takes O(k) time) • Trim distance = root of max heap • Replace top of heap on subsequent updates • Demote heap top to proper position in heap • Takes O( log(k) ) time • Update trim distance to root of max heap APPENDIX A: ‘K’ NN Search
Optimal Thread block size • 4x1, 32x1 • Query Throughput • GPU: 0.46 million queries/s • CPU: 0.20 million queries/s • Bus Transfer (CPU -> GPU) • Onto GPU: 2.25 GB/s • From GPU: 2.20 GB/s • GPU only slightly faster than CPU • Cost: Transfer result off GPU • Cost: More nodes visited in KDTree Appendix A: “K” NN Search (Stats)
Need to handle 100 million (or larger) datasets • Could use a regular k-Grid (bin search points) • Bin query point (do KD tree search on cell) • Is circle (qp+dist) inside cell, return result • Otherwise, find neighboring cells that overlap • Return min result of all tests • How to handle empty cells? • How to handle exterior query points? • Doesn’t adapt well to data Appendix B: (Large Data Sets)
Skew Set = irregular square grid • Given point set with ‘n’ points • Choose ‘k’ points per cell in grid • Grid will contain c=n/k cells • ~ length of each grid dimension will be approximately l=d-root(c=n/k) • Each cell will contain k points APPENDIX B: Skew Set
2D Skew Grid ConstructionPartition on x-dim into l=sqrt(n/k) slabseach slab contains approximately k*sqrt(n/k) points
2D Skew Grid constructionpartition each slab into cellseach cell contains approximately ‘k’ points
Skew Grid NN search • Binary search to find slab containing query point • Binary search to find cell containing query point • KD Tree NN Search on current cell (get trim dist) • Is circle (qp+trim dist)completely contained? • Otherwise, Find other neighboring Skew cells which overlap circle centered at query point. • Run KDTree NN Search on each such cell • Return minimum result of all tests done • Exterior qp’s require different algorithm SKEW GriD
Use __shared__ memory • Use simple data structures in shared memory • __align__ data structures (4,8,16 bytes) • 37% speedup in my algorithm • Use pinned memory I/O • 40%-60% speedup in my algorithm • Compress or Break apart data structures • 3x-5x speedup in my algorithm • Develop Methodically Appendix C: CUDA TIPS
Use Different Memory Models • Use __constant__ variables • Use __constant__ Memory (Read only, 64K, 8K cache) • Use Texture Memory (Read only, 8K cache) • Use branch predication instead of conditionals • Use fast math operations • FMAD, __mul24, __fdividef( x, y), etc. • Avoid division, modulus for integers APPENDIX C: (More CUDA TIPS)
Debugging: Divide & Conquer • Approach: Comment out half of code until offending code is found • Bug: No references on input parameters • Work around: Use device pointers or pass by value • Bug: Alignment issues • Work around: Rewrite code component wise • Bug: References to __shared__ memory • Work around: Rewrite without references Appendix C: CUDA Bugs (emulated mode works, Kernel Doesn’t)