220 likes | 235 Views
This paper discusses the experiences and results of using MapReduce for processing spatial data, specifically for R-Tree index construction and aerial image processing.
E N D
Experiences on Processing Spatial Data with MapReduce Ariel Cary, Zhengguo Sun, VagelisHristidis, Naphtali Rishe Florida International University School of Computing and Information Sciences 11200 SW 8th St, Miami, FL 33199 {acary001,sunz,vagelis,rishen}@cis.fiu.edu Sponsored by: NSF Cluster Exploratory (CluE)
Agenda • Introduction • Solving Spatial Problems in MapReduce • R-Tree Index Construction • Aerial Image Processing • Experimental Results • Related Work • Conclusion Florida International University
Introduction • Spatial databases mainly store: • Raster data (satellite/aerial digital images), and • Vector data (points, lines, polygons). • Traditional sequential computing models may take excessive time to process large and complex spatial repositories. • Emerging parallel computing models, such as MapReduce, provide a potential for scaling data processing in spatial applications. Florida International University
Introduction (cont.) • MapReduce is an emerging massively parallel computing model (Google) composed of two functions: • Map: takes a key/value pair, executes some computation, and emits a set of intermediate key/value pairs as output. • Reduce: merges its intermediate values, executes some computation on them, and emits the final output. • In this work, we present our experiences in applying the MapReduce model to: • Bulk-construct R-Trees (vector) and • Compute aerial image quality (raster) Florida International University
Introduction (cont.) • Apache's Hadoop • Linux operating system • XEN hypervisor • HadoopDistributed File System (HDFS) • SOCKS proxy server • 480 computers (nodes), each half terabytes storage Florida International University
2. Solving Spatial Problems in MapReduce R-Tree Index Construction Aerial Image Processing
MapReduce (MR) R-Tree Construction • R-Tree Bulk-Construction • Every object oin database D has two attributes: • o.id - the object’s unique identifier. • o.P - the object’s location in some spatial domain. • The goal is to build an R-Tree index on D. • MapReduce Algorithm • Database partitioning function computation (MR). • A small R-Tree is created for each partition (MR). • The small R-Trees are merged into the final R-Tree. Florida International University
Phase 1 – Partitioning Function • Goal: compute f to assign objects of D into one of R possible partitions s.t.: • R (ideally) equally-sized partitions are generated (minimal variance is acceptable). • Objects close in the spatial domain are placed within the same partition. • Proposed solution: • Use Z-order space-filling curve to map spatial coordinate samples (3%) into an sorted sequence. • Collect splitting points that partition the sequence in R ranges. • Where: • o is an spatial object in the database. • C which is a constant that helps in sending Mappers' outputs to a single Reducer. • U is a space-filling curve, e.g. Z-order value. • S' is an array containing R-1 splitting points. Florida International University
Phase 2 - R-Tree Construction in MR • Mappers compute f() values for objects. • Reducers compute an R-Tree for each group of objects with identical f() value • Where: • o is an spatial object in the database. • f is the partitioning function computed in phase 1. • Tree.root is the R-Tree root node. Florida International University
Phase 3 - R-Tree Consolidation • sequential process Florida International University
Image Processing in MapReduce • Aerial Image Quality Computation • Let d be a orthorectified aerial photography (DOQQ) file and t be a tile inside d, d.name is d’s file name and t.q is the quality information of tile t. • The goal is to compute a quality bitmap for d. • MapReduce Algorithm • A customized InputFormatter partitions each DOQQ file d into several splits containing multiple tiles. • The Mappers compute the quality bitmap for each tile inside a split. • The Reducers merge all the bitmaps that belongs to a file d and write them to an output file. Florida International University
Image Processing in MapReduce • MapReduce Algorithm • Where: • d is a DOQQ file. • t is a tile in d. • t.q is the quality bitmap of t. Florida International University
Experimental Results: Setting • Data Set • Environment • The cluster was provided by the Google and IBM Academic Cluster Computing Initiative. • The cluster contains around 480 computers running Hadoop- open source MapReduce. Florida International University
Experimental Results: R-Tree • R-Tree Construction Performance Metrics Florida International University
Experimental Results: R-Tree • MapReduce R-Trees vs. Single Process (SP) Florida International University
Experimental Results: Imagery • Tile Quality Computation Florida International University
Related Work • Previous works on R-Tree parallel construction faced intrinsic distributed computing problems: load balancing, process scheduling, fault tolerance, etc. • Schnitzer and Leutenegger [16] proposed a Master-Client R-Tree, where the data set is first partitioned using Hilbert packing sort algorithm, then the partitions are declustered into a number of processors, where individual trees are built. At the end, a master process combines the individual trees into the final R-Tree. • Papadopoulos and Manolopoulos [17] proposed a methodology for sampling-based space partitionining, load balancing, and partition assignment into a set of processors in parallely building R-Trees. Florida International University
Conclusion • We used the MapReduce model to solve two spatial problems on a Google&IBM cluster: • (a) Bulk-construction of R-Trees and • (b) Aerial image quality computation • MapReduce can dramatically improve task completion times. Our experiments show close to linear scalability. • Our experience in this work shows MapReduce has the potential to be applicable to more complex spatial problems. Florida International University
References [1] Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD 1984:47-57. [2] NSF Cluster Exploratory Program: http://www.nsf.gov/pubs/2008/nsf08560/nsf08560.htm [3] Google&IBM Academic Cluster Computing Initiative: http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html [4] Apache Hadoop project: http://hadoop.apache.org [6] Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation, USENIX Association, Volume 6, pp. 10-10, December 2004. [12] High Performance Database Research Center (HPDRC), Research Division of the Florida International University, School of Computing and Information Sciences, University Park, Telephone: (305) 348-1706, FIU ECS-243, Miami, FL 33199. [16] Schnitzer B., Leutenegger S.T.: Master-client R-trees: a new parallel R-tree architecture, In Proceedings of the 11th International Conference on Scientific and Statistical Database Management, pp. 68-77, August 1999. [17] Apostolos Papadopoulos, Yannis Manolopoulos: Parallel bulk-loading of spatial data, Parallel Computing, Volume 29, Issue 10, pp. 1419 - 1444, October 2003. Florida International University