190 likes | 344 Views
Query Processing of Massive Trajectory Data based on MapReduce. Qiang Ma, Bin Yang ( Fudan University) Weining Qian , Aoying Zhou (ECNU) Presented By: Xin Cao (Aalborg University). Outline. Introduction Preliminary Trajectory Processing Execution Overview Storage Indexing Methods
E N D
Query Processing of Massive Trajectory Data based on MapReduce Qiang Ma, Bin Yang (Fudan University) WeiningQian, Aoying Zhou (ECNU) Presented By: Xin Cao (Aalborg University)
Outline • Introduction • Preliminary • Trajectory Processing • Execution Overview • Storage • Indexing Methods • Query Processing • Experimental Study • Future Works
Introduction • Location-based services are playing important roles. • Large volumes of diverse formats of trajectory data have been accumulated. • Traditional centralized technologies may not deal with the large amount of trajectories. • Cloud computing, such as GFS and MapReduce, provides a promising paradigm to conquer the explosion of trajectory data.
Challenge • Huge volume, updates frequently, rapidly increasing. • Trajectory data is “continuous”, i.e. ordered sequentially. • Highly skewed. • MapReduce is good at offline data analysis, but not efficient for online query.
Our Contributions • Extend the MapReduce framework to manage massive sequential data, such as trajectories of moving objects. • Study what kind of query processing methods are appropriate for large clusters. • Provide two scalable indexing methods to facilitate query processing efficiently.
Preliminary • Data Model - line segments model • A polylinein three-dimensional space. • Query Types • Spatio-temporal Range Query: • Q(Es, Et) → {Sk} • Trajectory-based Query: • Q(O, Et) → {Sk}
Trajectory Processing • Execution Overviews
Storage • Data are grouped with key and organized in data chunks in GFS-style storage. • The whole data set is divided into several parts, and each part is called a partition and assigned to one data chunk to store. • Each trajectory data is assigned to at least one partition according to spatio-temporal information
Storage • A good spatio-temporal partitioning makes the size of data per chunk is fairly uniform. • Static partitioning strategies are easy to control and suitable for distributed scheduling, but may lead to load imbalance. • Dynamic strategies can resolve load imbalance, but re-split data can cause distantly migration of large volume of data in clusters. • Appropriate strategies should be trained
PMI (Partition based Multilevel Index) • Aim to speed up spatio-temporal range queries. • Generate all candidate partitions by invoking space partition strategy. • Store together as key/value. • <PartitionID, Sk> • Each data chunk only contains trajectory segments that belong to the same partition. • Multilevel index for each node can be built local. (using traditional centralized methods)
OII (Object Inverted Index) • Aim to speed up trajectory based queries. • Collect each object's all historical trajectories. • Store together as key/value. • <OID, { PartitionID, T}> • Access according to key(object identifier).
Query Processing • Query Processing • Trajectory based Queries • Given any object ID, the system can locate the object's trajectory according to OII. • Range Queries
Experimental Study • Settings • Hadoop version 0.19.0 • 8 PC nodes • Ubuntu Linux version 8.04 • Pentium IV 1.7GHz CPU • 512M memory • Java SDK 1.42 • Experiment data: Network-based Generator
Experiments – Load Balance Standard Deviation of Partitioning Load Balance of PRADASE
Experiments – Data Importing and Index Creating Data Importing with PMI Data Importing with OII
Experiments – Query Processing Spatio-temporal Range Query Processing with PMI Trajectory Base Query Processing with OII
Future Works • More heuristic partitioning methods. • Reducing data migration between nodes. • Efficient real-time query processing on Cloud infrastructure.