Small-Scale Raster Map Projection using the Compute Unified Device Architecture (CUDA). Michael P. Finn, Jing Li, and David Mattli. ISPRS Technical Commission IV Symposium on Geospatial Databases and Location Based Services, Suzhou, China, 14 – 16 May 2014.
HPC Research / Motivation • Prime test case: map projection/reprojection for large raster datasets (“Big” Data?) • pRasterBlaster: mapIMG in an HPC environment • Solve problems using multiple processors • Currently testing within the NSF CyberGIS Project, leveraging XSEDE (a more traditional supercomputing (SC) environment) • How does the same problem compare, in a computational sense, between a CPU-dominated SC environment and a more lightweight, general-purpose GPU-dominated environment?
CUDA • A parallel computing platform and programming model invented by NVIDIA • Allows GPUs to be used for general-purpose processing (not exclusively graphics) • GPUs have a parallel throughput architecture that executes many concurrent threads, each relatively slowly (rather than executing a single thread very quickly) • Accessible to software developers through libraries, compiler directives, and extensions to programming languages, including C, C++, and Fortran
Accurate Raster Reprojection in Three (primary) Steps • Step 1: Calculate and Partition Output Space • Step 2: Read Input and Reproject • Step 3: Combine Temporary Files
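Step 1 can be sketched as a simple tiling of the output raster. This is a minimal illustration, not the pRasterBlaster partitioner: the function name `partition_output` and the choice of square tiles are assumptions, and the default of 1024 only echoes the "chunk dimension" quoted in the results slides.

```python
def partition_output(rows, cols, chunk_dim=1024):
    """Split a rows x cols output raster into tiles.

    Returns a list of (row_off, col_off, n_rows, n_cols) tuples.
    Tiles on the bottom and right edges may be smaller than chunk_dim.
    """
    chunks = []
    for row_off in range(0, rows, chunk_dim):
        for col_off in range(0, cols, chunk_dim):
            n_rows = min(chunk_dim, rows - row_off)
            n_cols = min(chunk_dim, cols - col_off)
            chunks.append((row_off, col_off, n_rows, n_cols))
    return chunks


if __name__ == "__main__":
    for chunk in partition_output(2500, 1500):
        print(chunk)
```

Each tile can then be reprojected independently (Step 2) and written to a temporary file for Step 3.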
The Equations • Projection Transformation Process: Framing • The frame of a raster dataset defines the extent of the dataset in the projection space. It also defines the alignment of projection space with the input (often) image coordinate system. • X = ULprojX + ((sample - 1) * pixelSizeX) (1) • Y = ULprojY - ((line - 1) * pixelSizeY) (2) • Alternatively, solving for the image coordinates: • sample = ((X - ULprojX) / pixelSizeX) + 1 (3) • line = ((ULprojY - Y) / pixelSizeY) + 1 (4)
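Equations (1) to (4) translate directly into a pair of functions. The sketch below is only an illustration: the function and parameter names are assumptions, and sample/line are 1-based as in the equations.

```python
def pixel_to_proj(sample, line, ul_x, ul_y, pixel_size_x, pixel_size_y):
    """Equations (1) and (2): image (sample, line) -> projected (X, Y)."""
    x = ul_x + (sample - 1) * pixel_size_x
    y = ul_y - (line - 1) * pixel_size_y
    return x, y


def proj_to_pixel(x, y, ul_x, ul_y, pixel_size_x, pixel_size_y):
    """Equations (3) and (4): projected (X, Y) -> image (sample, line)."""
    sample = (x - ul_x) / pixel_size_x + 1
    line = (ul_y - y) / pixel_size_y + 1
    return sample, line


if __name__ == "__main__":
    # Hypothetical frame: upper-left corner at (-180, 90), 0.5-unit pixels.
    frame = (-180.0, 90.0, 0.5, 0.5)
    x, y = pixel_to_proj(11, 21, *frame)
    print(x, y)
    print(proj_to_pixel(x, y, *frame))
```

The inverse returns fractional sample/line values in general; how they are rounded depends on the resampling method.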
CUDA implementation • Four-corner-point-based map projection using CUDA
Raster Chunk Handling • Output chunks cannot be merged because of limited computing resources
Results • Configuration of the testing machine • Intel quad-core CPU (i5-3450 @ 3.10 GHz) • GeForce GT 640, 384 GPU cores • 8 GB RAM • NVIDIA CUDA SDK 5.5 • Visual Studio 2010
Results • Equirectangular to Albers • CUDA configuration • Block size: 256*1 • Chunk dimension: 1024 • Resample GLC (original: ~900 MB) • NA = out of memory (8 GB) on test machine
Results • Albers to Equirectangular • CUDA configuration • Block size: 256*1 • Chunk dimension: 1024 • Resample NLCD (original: 15.6 GB)
Issues (1 of 2) • The inverse/forward map projection for Mollweide is not accurate • Need to find the reasons why (should be a minor fix) • Therefore, restricted the current testing to Equirectangular and Albers • The results of map projection were inaccurate due to a misapplied resampling method (minor fix) • Retrieving the input data chunk from the bounding box of the output chunk may not be quite accurate • Problem identified: chunks near the edges of the dataset need to have some overlap retrieved (negative coordinates)
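One way to handle the edge-chunk problem is to pad the input window and clamp it to the input raster. The sketch below is an assumption about how such a fix could look, not the authors' code: `input_window` and its `pad` parameter are hypothetical, and the corner coordinates are the output chunk's corners already mapped into 1-based input sample/line space. For strongly curved projections, four corners alone may still underestimate the needed extent.

```python
import math


def input_window(corner_pixels, in_rows, in_cols, pad=1):
    """Bounding window (s0, l0, s1, l1), 1-based inclusive, for an output chunk.

    corner_pixels: iterable of (sample, line) pairs in input image space;
    values may be fractional, negative, or beyond the raster.
    Returns None when the chunk falls entirely outside the input raster.
    """
    samples = [s for s, _ in corner_pixels]
    lines = [l for _, l in corner_pixels]
    # Expand to whole pixels and add an overlap margin.
    s0 = math.floor(min(samples)) - pad
    s1 = math.ceil(max(samples)) + pad
    l0 = math.floor(min(lines)) - pad
    l1 = math.ceil(max(lines)) + pad
    # Clamp to the valid input range, so negative coordinates become 1.
    s0, l0 = max(1, s0), max(1, l0)
    s1, l1 = min(in_cols, s1), min(in_rows, l1)
    if s0 > s1 or l0 > l1:
        return None
    return s0, l0, s1, l1


if __name__ == "__main__":
    corners = [(-2.3, 0.4), (10.2, 0.4), (-2.3, 8.9), (10.2, 8.9)]
    print(input_window(corners, in_rows=100, in_cols=100))
```

Output cells whose source location still falls outside the clamped window would be written as no-data.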
Issues (2 of 2) • Needs better memory management • CPU: out-of-memory error even with chunking • Suspect the test machine is not releasing memory in a timely fashion • GPU: not always stable; kernels may fail during execution; grid/block setup • Workload may not be well balanced • Kernels can fail when too much data is sent • Using remote desktop to manipulate the data may cause issues
Conclusion • CUDA provides a lightweight, less expensive alternative to CPU parallel environments like supercomputers • Raster map projection behaves similarly in initial tests to established pRasterBlaster testing in CPU-dominated HPC environments • Greater than one order of magnitude faster • More work necessary; issues remain
Other Collaborators (primarily on the CyberGIS project) • Shaowen Wang, Anand Padmanabhan, Yan Liu • University of Illinois at Urbana-Champaign (UIUC), CyberInfrastructure and Geospatial Information Laboratory • David M. Mattli, Jeff Wendel, E. Lynn Usery, Michael Stramel • USGS, Center of Excellence for Geospatial Information Science (CEGIS) • Kristina H. Yamamoto • USGS, National Geospatial Technical Operations Center • Babak Behzad • UIUC, Department of Computer Science • Eric Shook • Kent State University, Department of Geography • Qingfeng (Gene) Guan • China University of Geosciences
Disclaimer • Any use of trade, product, or firm names in this paper is for descriptive purposes only and does not imply endorsement by the U.S. Government.
QUESTIONS?
The block size is not directly related to the chunking concept. • Block size is the number of threads within each block on the GPU. Another concept is grid size, the number of blocks in a launch. CUDA can run many threads at the same time (e.g., 512 threads). All threads in a block are sent to the GPU processors together but may not all execute at the same time (depending on how many GPU cores are available). • In my implementation, I assign each cell of the output image/chunk to a thread. • If the output image has a dimension of 256*256 and the block size is 16*16, then the grid size is (256/16)*(256/16) = 16*16. • If the output image has a dimension of 250*250 and the block size is 16*16, then the grid size is (250/16)*(250/16) = 15.x*15.x, rounded up to 16*16. This implies that the blocks along the last row and column of the grid have less data (e.g., 16*10). • So the selection of the block size is determined by the number of GPU cores as well as the dimension of the image. • When dealing with a large image that cannot be read into CPU main memory all at once, the image should be divided into chunks. Each chunk then becomes an input image, and the GPU processes it.
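The grid-size arithmetic in these notes amounts to a ceiling division of the image dimensions by the block dimensions. The host-side sketch below is illustrative only: `grid_size` and `active_threads` are hypothetical helper names, and in an actual CUDA kernel the same bounds check would be done per thread so that threads beyond the image edge do no work.

```python
def grid_size(width, height, block_w=16, block_h=16):
    """Number of blocks needed in x and y so every cell gets a thread."""
    return ((width + block_w - 1) // block_w,
            (height + block_h - 1) // block_h)


def active_threads(block_x, block_y, width, height, block_w=16, block_h=16):
    """Columns and rows of block (block_x, block_y) that map to real cells."""
    cols = max(0, min(block_w, width - block_x * block_w))
    rows = max(0, min(block_h, height - block_y * block_h))
    return cols, rows


if __name__ == "__main__":
    print(grid_size(256, 256))                # full blocks everywhere
    print(grid_size(250, 250))                # same grid, partial edge blocks
    print(active_threads(15, 0, 250, 250))    # right-edge block
    print(active_threads(15, 15, 250, 250))   # corner block
```

For a 250*250 image, the right-edge blocks cover only 10 of their 16 columns, and the corner block covers 10*10 cells.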