Speeding Up Large-Scale Geospatial Polygon Rasterization on GPGPUs • Jianting Zhang, Department of Computer Science, the City College of New York • jzhang@cs.ccny.cuny.edu
Outline • Introduction and Motivations • Background and Related Works • The Serial Scan-Line Fill Algorithm • Preprocessing Polygon Collections • Efficient Polygon Rasterization on GPGPUs • Experiments and Results • Conclusion and Future Work
Introduction: Personal HPC-G • A. Clematis, M. Mineter, and R. Marciano. High performance computing with geographical data. Parallel Computing, 29(10):1275–1279, 2003: “Despite all these initiatives the impact of parallel GIS research has remained slight…” “…fundamental problem remains the fact that creating parallel GIS operations is non-trivial and there is a lack of parallel GIS algorithms, application libraries and toolkits.” • Marrying GPGPU with GIS – The next generation High-Performance GIS in a Personal Computing Environment (Zhang 2010, HPDGIS) • Every personal computer is now a parallel machine: CMPs and GPUs • Multi-core CPUs have become mainstream; the more cores they have, the more GPU features they have • NVIDIA alone shipped almost 220 million CUDA-capable GPUs from 2006 to 2010 (CACM 2010/11)
Introduction – Personal HPC-G • Chip-Multiprocessors (CMP): • http://en.wikipedia.org/wiki/Multi-core_processor • Cores per chip: dual-core, quad-core, six-core, 8/10/12 • Chips per node: 1->24/8 • Intel MIC (32 cores) • UIUC Rigel Design (1024 cores) • Massively parallel GPGPU computing: hundreds of GPU cores in a GPU card • Nvidia GTX480 (03/2010): 480 cores, 1.4 GHz, 1.5 GB, 177.4 GB/s memory bandwidth, 1.35 TFlops • Nvidia GTX590 (03/2011): 1024 cores, 1.2 GHz, 3 GB, 327.74 GB/s memory bandwidth, 2.49 TFlops • Parallel hardware is more affordable than ever before …
Introduction – Personal HPC-G • Geospatial data volumes never stop growing • Satellite: e.g., from GOES to GOES-R (2016) • http://www.goes-r.gov/downloads/GOES-R-Tri.pdf • Spectral (3X) * spatial (4X) * temporal (5X) = 60X • Derived thematic data products (vector) • http://www.goes-r.gov/products/baseline.html • http://www.goes-r.gov/products/option2.html • Species distributions and movement data • E.g., 300+ million occurrence records (GBIF) • E.g., 717,057 polygons and 78,929,697 vertices in the distribution data for 4148 birds (NatureServe) • Animals can move across space and time • Event locations, trajectories and O-D data • E.g., taxi trip records (traces or O-D locations) • 0.5 million in NYC and 1.2 million in Beijing per day • From O-D to shortest paths to flow patterns • [Venue callouts on the slide: COM.GEO’10, SSDBM’10, ACMGIS’10, ACMGIS’11; ACMGIS’08, ACMGIS’09, GeoInformatics’09; HPDGIS’11, COM.GEO’10, HPDGIS’10, ???]
Motivations • GPU-based parallel algorithm design to efficiently manage large-scale species distribution data (overlapped polygons) • Part 1: Extended quadtree to represent overlapped polygons (GeoInformatics’09 and ACMGIS’09) • Part 2: Efficient conversion from real-world geospatial polygons to quadtrees • Step 1: From polygons to scan-line segments • Step 2: From scan-line segments to quadtrees • Part 3: Query-driven visual exploration (ACMGIS’08 and ACMGIS’09)
Background and Related Works • Polygon rasterization on GPUs • State of the art: OpenGL GL_POLYGON • Problems • Fixed-function, proprietary, black-box • Does not support complex (e.g., concave) polygons – results may be incorrect (although acceptable for display purposes) • GL_POLYGON is much slower than GL_TRIANGLES • Requires a hardware context to read back rasterization results • Accuracy is limited by screen resolution • Difficult for GIS developers to implement using graphics languages • GPGPU comes to the rescue • Able to use GPU parallel computing power • Using C/C++ languages is more intuitive • Directly generating spatial data structures can be more efficient (than using rasterized images to construct quadtrees) • More client-server computing friendly • No previous work on polygon rasterization on GPGPUs for geospatial applications
Background and Related Works • Spatial data structures on GPUs for computer graphics applications • KD-Tree (Zhou et al 2008, Hou et al 2001), Octree (Zhou 2011) • They are designed to efficiently render triangles, not to query polygons • Software rasterization of triangles • (Laine and Karras 2011), (Pantaleoni 2011), (Schwarz and Seidel 2011) • Results are encouraging when compared to hardware rasterization (2-8x gap) • Again, they are designed for rasterizing/rendering triangles, not for querying polygons
Background and Related Works • Geospatial Data Processing on GPUs • Pre-GPGPU: • Using graphics data structures and primitives for spatial selection and spatial join queries (Sun et al 2003) • Difficult and unintuitive • Post-GPGPU • Spatial similarity join (Lieberman et al 2008) • Density-based spatial clustering (Bohm et al 2009) • Min-Max quadtree for large-scale raster data (Zhang et al 2010) • Decoding quad-tree encoded bitplane bitmaps of large-scale raster data (Zhang et al 2011)
The Serial Scan-Line Fill Algorithm • For each scan line y from ymin to ymax • Compute the intersection points with all edges • Sort the intersection points and form the scan-line segments • (Fill the raster cells in the scan-line segments) • End • Intersection point between scan line y=y’ and the edge from (x1,y1) to (x2,y2): x’ = x1 + (y’-y1)/(y2-y1)*(x2-x1) • Reference: GDAL/GRASS codebases
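The serial loop above can be sketched in plain C++ as follows. This is a minimal illustration, not the GDAL or GRASS code: the names `Point`, `scanlineIntersections` and `scanlineSegments` are hypothetical, and the half-open crossing test (an edge counts only if ylo <= y < yhi) is an assumed convention for handling vertices that lie exactly on a scan line.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Point { double x, y; };  // hypothetical vertex type

// Sorted x-coordinates where the horizontal line at height y crosses the ring.
// Assumed convention: an edge counts only if ylo <= y < yhi, so a vertex shared
// by two edges is counted once and horizontal edges are skipped.
std::vector<double> scanlineIntersections(const std::vector<Point>& ring, double y) {
    std::vector<double> xs;
    const std::size_t n = ring.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Point& a = ring[i];
        const Point& b = ring[(i + 1) % n];  // last vertex connects back to the first
        const bool crosses = (a.y <= y && y < b.y) || (b.y <= y && y < a.y);
        if (!crosses) continue;
        // x' = x1 + (y' - y1) / (y2 - y1) * (x2 - x1)
        xs.push_back(a.x + (y - a.y) / (b.y - a.y) * (b.x - a.x));
    }
    std::sort(xs.begin(), xs.end());
    return xs;
}

// Pair the sorted intersections into [xStart, xEnd] scan-line segments.
std::vector<std::pair<double, double>> scanlineSegments(const std::vector<Point>& ring, double y) {
    const std::vector<double> xs = scanlineIntersections(ring, y);
    assert(xs.size() % 2 == 0);  // a closed ring is entered and left in pairs
    std::vector<std::pair<double, double>> segs;
    for (std::size_t i = 0; i + 1 < xs.size(); i += 2) {
        segs.emplace_back(xs[i], xs[i + 1]);
    }
    return segs;
}
```

Calling `scanlineSegments` once per scan line from ymin to ymax reproduces the loop on the slide; filling the raster cells inside each returned segment is the optional last step.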
Polygon Rasterization on GPGPUs - Challenges • Unique hardware characteristics (e.g., Nvidia Tesla C2050) • Large number of threads (1024 per SM, 14 SMs) • Limited shared memory: 48 KB per SM (shared by 1024 threads) • Limited registers: 32768 per SM, i.e., 32 per thread • Need explicit shared memory management to fully utilize the memory hierarchy • Parallelizing the Scan-Line Fill Algorithm • Mimicking the CPU algorithm (assigning a polygon to a thread) will NOT work • Uncoalesced accesses to global memory are extremely inefficient • Insufficient registers and shared memory • How to assign computing blocks and threads to scan lines and polygon edges?
Polygon Rasterization on GPGPUs – Design • The GPU SMs are divided into 14*4 computing blocks • A computing block has 256 threads and processes one polygon • All threads in a computing block loop through scan lines cooperatively • [Figure: GPU memory hierarchy: global memory, L2, L1, SM1, SM2, …, SMn]
Polygon Rasterization on GPGPUs – Design • [Figure: a polygon with vertices 1-6 and edges a-f; the X/Y coordinates of vertices 1-6 are copied from global memory to shared memory; for each scan line y from ymin to ymax, a row of X/O marks is shown after the Intersection step (X O O X O) and after the Sorting step (X X O O O)] • X/Y coordinates in shared memory are re-used (ymax-ymin-1) times
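A sequential C++ sketch of the per-scan-line step may help make the design concrete. It reads the X/O row in the figure as "edge intersects / does not intersect the current scan line", which is an interpretation of the figure rather than something the slide states. The names `EdgeHit` and `testEdges` are hypothetical, the loop over `t` stands in for the threads of one computing block, and the two input vectors stand in for the shared-memory copy of the vertex coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct EdgeHit {   // hypothetical per-thread result
    bool hit;      // X (true) or O (false) in the figure, as interpreted here
    double x;      // intersection x-coordinate, meaningful only when hit is true
};

// One scan line handled by one computing block: lane t tests edge t, which runs
// from vertex t to vertex (t + 1) % n. On the GPU all lanes would run in
// parallel with t = threadIdx.x; here they run one after another.
std::vector<EdgeHit> testEdges(const std::vector<double>& xs,
                               const std::vector<double>& ys,
                               double y) {
    assert(xs.size() == ys.size());
    const std::size_t n = xs.size();
    std::vector<EdgeHit> out(n, EdgeHit{false, 0.0});
    for (std::size_t t = 0; t < n; ++t) {
        const std::size_t u = (t + 1) % n;
        // Assumed half-open rule: the edge counts only if ylo <= y < yhi.
        const bool hit = (ys[t] <= y && y < ys[u]) || (ys[u] <= y && y < ys[t]);
        if (hit) {
            out[t] = EdgeHit{true, xs[t] + (y - ys[t]) / (ys[u] - ys[t]) * (xs[u] - xs[t])};
        }
    }
    return out;
}
```

Because `xs` and `ys` are read again for every scan line, keeping them in fast on-chip memory is what makes the re-use noted on the slide pay off.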
Polygon Rasterization on GPGPUs – Sorting • GPGPUs are extremely good at sorting • Sorting in shared memory is extremely fast
[Figure: per-thread values at Step 0, Step 1, Step 2 and Step 3, ending with the result of the exclusive scan]
__device__ inline ushort scan4(ushort num)
{
    __shared__ ushort ptr[2 * MAX_PT];
    ushort val = num;
    uint idx = threadIdx.x;
    ptr[idx] = 0;        // lower half is zero padding, so ptr[idx - k] below stays in bounds
    idx += Tn;           // upper half holds the per-thread values
    ptr[idx] = num;
    SYNC                 // SYNC: barrier macro (presumably __syncthreads())
    val += ptr[idx - 1]; SYNC
    ptr[idx] = val;      SYNC
    val += ptr[idx - 2]; SYNC
    ptr[idx] = val;      SYNC
    val += ptr[idx - 4]; SYNC
    ptr[idx] = val;      SYNC
    …
    val = ptr[idx - 1];  // left neighbour's inclusive sum = this thread's exclusive sum
    return val;
}
• Benefits • Only true intersection results are written back to global memory • Saves GPU memory footprint and I/O costs
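To show how an exclusive scan lets each thread write only true intersections, here is a host-side C++ sketch. It emulates the doubling pattern of `scan4` (offsets 1, 2, 4, …) sequentially, then uses the scan result as the output slot for each hit. The names `exclusiveScan` and `compact` are hypothetical, the flags are assumed to be 0 or 1, and the double buffer stands in for the barriers between steps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum using the doubling pattern: after the pass with offset d,
// element i holds the sum of up to 2*d consecutive inputs ending at i.
std::vector<unsigned> exclusiveScan(const std::vector<unsigned>& flags) {
    const std::size_t n = flags.size();
    std::vector<unsigned> cur(flags), next(n);
    for (std::size_t d = 1; d < n; d *= 2) {
        for (std::size_t i = 0; i < n; ++i) {  // each i plays the role of one thread
            next[i] = cur[i] + (i >= d ? cur[i - d] : 0u);
        }
        cur.swap(next);  // reading cur and writing next replaces the SYNC barriers
    }
    std::vector<unsigned> excl(n, 0u);
    for (std::size_t i = 1; i < n; ++i) {
        excl[i] = cur[i - 1];  // shift right by one: inclusive sum becomes exclusive
    }
    return excl;
}

// Stream compaction: lane i writes xs[i] to slot excl[i] only if flags[i] is set,
// so the output holds the hits densely packed and in their original order.
std::vector<double> compact(const std::vector<unsigned>& flags,
                            const std::vector<double>& xs) {
    assert(flags.size() == xs.size());
    const std::vector<unsigned> excl = exclusiveScan(flags);
    const std::size_t total = flags.empty() ? 0 : excl.back() + flags.back();
    std::vector<double> out(total);
    for (std::size_t i = 0; i < flags.size(); ++i) {
        if (flags[i] != 0u) out[excl[i]] = xs[i];
    }
    return out;
}
```

Since no two hits share a slot, the writes need no atomics, and the output buffer can be sized to the number of true intersections rather than the number of edges.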
Experiments and Results • Data: • NatureServe Western Hemisphere bird species distributions: http://www.natureserve.org/getData/birdMaps.jsp • 4148 birds: http://geoteci.engr.ccny.cuny.edu/geoteci/SPTestMap.html • 717,057 polygons, 1,199,799 rings • 78,929,697 vertices (1.3 GB of shp files) • Total number of scan-line/polygon-edge intersections: 200+ billion
Discussions - handling large polygons • The current implementation cannot process polygons with more than a few thousand vertices • 8n bytes for x coordinates • 8n bytes for y coordinates • 4n bytes for x coordinates of the intersections • ~100 extra bytes • (20n+100) < 48K, so n ~ 2000 (using a whole SM as a computing block) • We have limited the number of points to the number of threads (1024) - having one thread process a few vertices is not scalable • We need a better way to handle scalability
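The shared-memory budget on this slide is simple arithmetic, sketched here in C++. The function names are hypothetical; the default of 20 bytes per vertex and 100 bytes of overhead are the slide's figures, and 48K is taken to mean 48 * 1024 bytes, which is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Shared memory needed for an n-vertex polygon under the slide's layout:
// 8n (x) + 8n (y) + 4n (intersection x) = 20n bytes, plus ~100 extra bytes.
std::size_t sharedBytesNeeded(std::size_t n) {
    return 20 * n + 100;
}

// Largest vertex count that fits in a given shared-memory budget.
std::size_t maxVertices(std::size_t budgetBytes,
                        std::size_t bytesPerVertex = 20,
                        std::size_t overheadBytes = 100) {
    assert(bytesPerVertex > 0);
    if (budgetBytes <= overheadBytes) return 0;
    return (budgetBytes - overheadBytes) / bytesPerVertex;
}
```

With these figures, `maxVertices(48 * 1024)` evaluates to 2452, the same order as the slide's n ~ 2000. If the 48 KB were split evenly among four computing blocks per SM (an assumption about how the 14*4 layout shares memory), `maxVertices(12 * 1024)` gives 609 vertices per block.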
Discussions - handling large polygons • Proposed solution: chunking the edge list, computing separately, and then assembling • [Figure: vertices 1-6 in global memory and in shared memory; stages labeled Chunking, Computing, Assembling, and Sorting using a separate kernel; example edge endpoints (x1,y1) (x3,y1), (x1,y1) (x2,y2), (x2,y2) (x4,y2), (x3,y1) (x4,y2)]
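The proposed chunking scheme can be sketched in sequential C++ as follows. This is an illustration under assumptions, not the planned GPU implementation: `Edge`, `intersectChunk` and `chunkedIntersections` are hypothetical names, each edge stores both endpoints so that a chunk needs no data from its neighbours, and a single `std::sort` stands in for the separate sorting kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Edge { double x1, y1, x2, y2; };  // self-contained edge: both endpoints stored

// Work of one chunk: intersect scan line y with edges [begin, end) and append
// the x-coordinates of the hits to out (assumed half-open crossing rule).
static void intersectChunk(const std::vector<Edge>& edges, std::size_t begin,
                           std::size_t end, double y, std::vector<double>& out) {
    for (std::size_t i = begin; i < end; ++i) {
        const Edge& e = edges[i];
        const bool hit = (e.y1 <= y && y < e.y2) || (e.y2 <= y && y < e.y1);
        if (hit) out.push_back(e.x1 + (y - e.y1) / (e.y2 - e.y1) * (e.x2 - e.x1));
    }
}

// Chunk the edge list, compute each chunk separately, then assemble and sort.
std::vector<double> chunkedIntersections(const std::vector<Edge>& edges,
                                         std::size_t chunkSize, double y) {
    assert(chunkSize > 0);
    std::vector<double> all;
    for (std::size_t b = 0; b < edges.size(); b += chunkSize) {
        const std::size_t e = std::min(edges.size(), b + chunkSize);
        intersectChunk(edges, b, e, y, all);  // chunks do not depend on each other
    }
    std::sort(all.begin(), all.end());  // stands in for the separate sort kernel
    return all;
}
```

The result does not depend on `chunkSize`, which is the property that would let the chunk size be chosen to fit the shared memory of one computing block.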
Summary and Conclusion • Introduced a GPGPU-accelerated software rasterization framework to rasterize and index large-scale geospatial polygons • Provided a GPGPU-based design and implementation for computing intersection points • Achieved about 20X speedup for groups of polygons with between 64 and 1024 vertices, using the bird species distribution data in the Western Hemisphere, which has about 3/4 million polygons and more than 78 million vertices • Discussed extending the current implementation to support polygons with arbitrarily large numbers of vertices by extensively using efficient sorting • The work reported is preliminary - several important components in realizing a dynamically integrated vector-raster data model for high-performance geospatial analysis on GPGPUs are still under development.
Future Work • Extend our current implementation to support large polygons with arbitrary numbers of vertices • Implement the quadtree construction (Step 2) based on the GPGPU-computed scan-line segments (CPU/GPU) • Perform a comprehensive performance comparison with commercial spatial database indexing • Integrate with front-end modules in spatial databases (e.g., query parser and optimizer)
Q&A jzhang@cs.ccny.cuny.edu