Query Processing of Massive Trajectory Data based on MapReduce

Qiang Ma, Bin Yang (Fudan University) Weining Qian, Aoying Zhou (ECNU) Presented By: Xin Cao (Aalborg University)

Outline • Introduction • Preliminary • Trajectory Processing • Execution Overview • Storage • Indexing Methods • Query Processing • Experimental Study • Future Works

Introduction • Location-based services are playing an increasingly important role. • Large volumes of trajectory data in diverse formats have been accumulated. • Traditional centralized technologies may not cope with such a large volume of trajectories. • Cloud computing infrastructure, such as GFS and MapReduce, provides a promising paradigm for handling the explosion of trajectory data.

Challenges • Huge volume, frequent updates, rapid growth. • Trajectory data is “continuous”, i.e. ordered sequentially. • Highly skewed. • MapReduce is good at offline data analysis, but not efficient for online queries.

Our Contributions • Extend the MapReduce framework to manage massive sequential data, such as trajectories of moving objects. • Study which kinds of query processing methods are appropriate for large clusters. • Provide two scalable indexing methods that make query processing efficient.

Preliminary • Data Model - line segments model • A polyline in three-dimensional space. • Query Types • Spatio-temporal Range Query: • Q(Es, Et) → {Sk} • Trajectory-based Query: • Q(O, Et) → {Sk}
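The two query signatures above can be illustrated with a minimal Python sketch. It rests on assumptions the slide does not state: the three dimensions are taken to be (x, y, t), Es is read as a spatial rectangle, Et as a time interval, and the `Segment` fields and function names are hypothetical. The range query uses a bounding-box test, which is a conservative filter rather than an exact segment/box intersection.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One line segment S_k of a trajectory, from (x1, y1, t1) to (x2, y2, t2)."""
    oid: str
    x1: float
    y1: float
    t1: float
    x2: float
    y2: float
    t2: float  # assumption: t1 <= t2


def _overlap(lo_a, hi_a, lo_b, hi_b) -> bool:
    # Two closed intervals overlap when each starts before the other ends.
    return lo_a <= hi_b and lo_b <= hi_a


def range_query(segments: List[Segment],
                es: Tuple[float, float, float, float],
                et: Tuple[float, float]) -> List[Segment]:
    """Q(Es, Et) -> {Sk}: segments whose bounding box overlaps Es during Et."""
    xmin, ymin, xmax, ymax = es
    tmin, tmax = et
    return [
        s for s in segments
        if _overlap(min(s.x1, s.x2), max(s.x1, s.x2), xmin, xmax)
        and _overlap(min(s.y1, s.y2), max(s.y1, s.y2), ymin, ymax)
        and _overlap(s.t1, s.t2, tmin, tmax)
    ]


def trajectory_query(segments: List[Segment], oid: str,
                     et: Tuple[float, float]) -> List[Segment]:
    """Q(O, Et) -> {Sk}: segments of object O that are active during Et."""
    tmin, tmax = et
    return [s for s in segments
            if s.oid == oid and _overlap(s.t1, s.t2, tmin, tmax)]
```

Both functions scan every segment; the point of the indexes described later in the deck is to avoid exactly this full scan.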

Trajectory Processing • Execution Overview

Storage • Data are grouped by key and organized into data chunks in GFS-style storage. • The whole data set is divided into several parts; each part is called a partition and is stored in one data chunk. • Each piece of trajectory data is assigned to at least one partition according to its spatio-temporal information.
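As an illustration of assigning data to "at least one partition", here is a small sketch that assumes a static uniform grid over (x, y, t). The grid itself, the cell sizes, and the function name are assumptions for illustration; the slides do not say which partitioning strategy is used.

```python
import math
from itertools import product
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, t)


def partitions_for_segment(p1: Point, p2: Point,
                           cell: Point = (10.0, 10.0, 60.0)) -> List[Tuple[int, int, int]]:
    """Return the IDs of every grid cell overlapped by the segment's bounding box.

    A segment that crosses a cell boundary is assigned to several partitions,
    so every segment ends up in at least one.
    """
    axes = []
    for a, b, size in zip(p1, p2, cell):
        lo, hi = min(a, b), max(a, b)
        axes.append(range(math.floor(lo / size), math.floor(hi / size) + 1))
    return list(product(*axes))
```

For example, a segment from (8, 1, 5) to (12, 2, 10) straddles the boundary at x = 10 and is therefore assigned to two partitions.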

Storage • A good spatio-temporal partitioning keeps the size of data per chunk fairly uniform. • Static partitioning strategies are easy to control and suitable for distributed scheduling, but may lead to load imbalance. • Dynamic strategies can resolve load imbalance, but re-splitting data can cause large volumes of data to migrate between distant nodes in the cluster. • Appropriate strategies should be trained.
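One simple way to quantify how uniform the chunk sizes are is the standard deviation of the per-partition record counts, the kind of measure named in the load balance experiment later in the deck. The sketch below is only an illustration; the function name and inputs are hypothetical.

```python
from collections import Counter
from statistics import pstdev
from typing import Hashable, Iterable, Sequence


def load_stddev(assigned: Iterable[Hashable],
                all_partitions: Sequence[Hashable]) -> float:
    """Population standard deviation of record counts across all partitions.

    `assigned` holds one partition ID per stored record; partitions that
    received no records still count, with a load of zero.
    """
    counts = Counter(assigned)
    return pstdev([counts.get(p, 0) for p in all_partitions])
```

A value of 0 means perfectly even chunks; the more skewed the data under a static partitioning, the larger the value.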

PMI (Partition based Multilevel Index) • Aims to speed up spatio-temporal range queries. • Generate all candidate partitions by invoking the space partitioning strategy. • Stored together as key/value pairs. • <PartitionID, Sk> • Each data chunk only contains trajectory segments that belong to the same partition. • A multilevel index for each node can be built locally (using traditional centralized methods).
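The <PartitionID, Sk> layout can be sketched as a tiny in-memory imitation of a map phase and a reduce phase, followed by a range query that reads only the candidate partitions. This is not the authors' implementation: the uniform grid, the cell sizes, the tuple layout of a segment, and all function names are assumptions, and the local multilevel index inside each chunk is replaced by a plain scan.

```python
import math
from collections import defaultdict
from itertools import product

CELL = (10.0, 10.0, 60.0)  # hypothetical cell size in x, y, t


def cells(lo, hi):
    """All grid cells overlapped by the box [lo, hi] in (x, y, t)."""
    axes = [range(math.floor(l / c), math.floor(h / c) + 1)
            for l, h, c in zip(lo, hi, CELL)]
    return list(product(*axes))


def bbox(seg):
    """Bounding box of a segment given as (oid, x1, y1, t1, x2, y2, t2)."""
    _, x1, y1, t1, x2, y2, t2 = seg
    return ((min(x1, x2), min(y1, y2), min(t1, t2)),
            (max(x1, x2), max(y1, y2), max(t1, t2)))


def map_phase(segments):
    """Emit one <PartitionID, Sk> pair per partition the segment overlaps."""
    for seg in segments:
        for pid in cells(*bbox(seg)):
            yield pid, seg


def reduce_phase(pairs):
    """Group segments by partition ID; each group stands for one data chunk."""
    chunks = defaultdict(list)
    for pid, seg in pairs:
        chunks[pid].append(seg)
    return dict(chunks)


def range_query(chunks, lo, hi):
    """Read only the candidate partitions, then filter by bounding-box overlap."""
    hits = set()
    for pid in cells(lo, hi):
        for seg in chunks.get(pid, []):
            s_lo, s_hi = bbox(seg)
            if all(a <= d and c <= b
                   for a, b, c, d in zip(s_lo, s_hi, lo, hi)):
                hits.add(seg)
    return hits
```

The set in `range_query` removes duplicates, since a segment stored in several partitions could otherwise be reported more than once.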

OII (Object Inverted Index) • Aims to speed up trajectory-based queries. • Collect all historical trajectories of each object. • Stored together as key/value pairs. • <OID, { PartitionID, T}> • Accessed by key (object identifier).
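A minimal sketch of the <OID, {PartitionID, T}> idea follows. It assumes that T is the time interval an object spends in a partition, which the slide does not spell out; the record layout and function names are hypothetical.

```python
from collections import defaultdict


def build_oii(records):
    """Build an inverted index: object ID -> list of (partition ID, t_start, t_end).

    `records` is an iterable of (oid, partition_id, t_start, t_end) tuples.
    Postings are kept sorted by start time.
    """
    index = defaultdict(list)
    for oid, pid, t_start, t_end in records:
        index[oid].append((pid, t_start, t_end))
    for postings in index.values():
        postings.sort(key=lambda p: p[1])
    return dict(index)


def lookup(index, oid, t_from, t_to):
    """Partition IDs that hold data of `oid` overlapping [t_from, t_to]."""
    return [pid for pid, t_start, t_end in index.get(oid, [])
            if t_start <= t_to and t_from <= t_end]
```

With such an index, a trajectory-based query touches only the partitions listed for the requested object instead of scanning every chunk.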

Data Insertion

Query Processing • Trajectory-based Queries • Given any object ID, the system can locate the object's trajectory using the OII. • Range Queries

Experimental Study • Settings • Hadoop version 0.19.0 • 8 PC nodes • Ubuntu Linux version 8.04 • Pentium IV 1.7 GHz CPU • 512 MB memory • Java SDK 1.42 • Experiment data: Network-based Generator

Experiments – Load Balance • Standard Deviation of Partitioning • Load Balance of PRADASE

Experiments – Data Importing and Index Creating • Data Importing with PMI • Data Importing with OII

Experiments – Query Processing • Spatio-temporal Range Query Processing with PMI • Trajectory-based Query Processing with OII

Future Works • More heuristic partitioning methods. • Reducing data migration between nodes. • Efficient real-time query processing on Cloud infrastructure.

Thanks!
