Efficient Query Processing On Massive Multi-dimension Data

Presenter: Ying ZHANG, DBG@UNSW

Outline
• Background
• Problems investigated
• Data stream computation
• Skyline computation
• Spatial keyword search

Background
Massive multi-dimensional data are collected every day.
• Location data from various observational mechanisms:
  - Smart phones: 0.36 billion this year in China, the largest smart phone market, with 0.45 billion expected next year. The Baidu location-based service receives 3.5 billion location requests per day on average.
  - Sensors
  - Radio Frequency Identification (RFID)
  - Global Positioning System (GPS)

Background
• Other multi-dimensional data from various applications:
  - Environment monitoring: measurements of light, temperature, humidity, …
  - Finance and economic data: purchase transactions, stock transactions, …
  - User behavior data: click streams, shopping records, …
  - Network data: network monitoring data
  - etc.

Problems Investigated
Given a large number of multi-dimensional objects, we investigate the following representative and fundamental queries.
• Rank-based queries: top-k query, quantile query
• Dominance-based queries: skyline query, representative skyline query
• Proximity-based queries: range search, nearest neighbor search, and reverse nearest neighbor search

Rank-based queries
1. Top-k query
[Figure: points p1 to p8 plotted on X: academic score and Y: research score, ranked by the scoring function f(p) = x + y]
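
As a minimal sketch of the top-k query, the snippet below ranks points by f(p) = x + y and keeps the k highest. The coordinates and the helper name `top_k` are hypothetical, since the values in the slide's figure are not recoverable.

```python
import heapq

def top_k(points, k, score=lambda p: p[0] + p[1]):
    """Return the k points with the largest score, best first."""
    return heapq.nlargest(k, points, key=score)

# Hypothetical (academic score, research score) pairs.
points = [(1, 2), (3, 4), (0, 9), (5, 1)]
best = top_k(points, 2)
```

A heap-based selection avoids sorting the whole data set when k is much smaller than the number of points.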

Rank-based queries (cont.)
2. Quantile computation (order statistics)
• Φ-quantile: summarizes the score distribution. It is the first element in a sorted list whose cumulative weight is not smaller than Φ, where Φ is a number in (0, 1].
• Sorted elements: 3 3 6 7 8 9 12 13 15 20
[Figure: the 0.5-quantile (median) and the 0.8-quantile marked on the sorted list]
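
The Φ-quantile definition can be sketched as follows, assuming every element carries the same weight 1/n (the slide does not state the weights). The function name `phi_quantile` is hypothetical, and this is the exact offline computation, not the approximate streaming technique.

```python
import math
from fractions import Fraction

def phi_quantile(elements, phi):
    """First element of the sorted list whose cumulative weight is >= phi.

    Assumes equal weight 1/n per element and 0 < phi <= 1.
    """
    ordered = sorted(elements)
    n = len(ordered)
    # Exact rational arithmetic avoids float rounding in phi * n.
    rank = math.ceil(Fraction(str(phi)) * n)
    return ordered[rank - 1]

data = [3, 3, 6, 7, 8, 9, 12, 13, 15, 20]
median = phi_quantile(data, 0.5)
q80 = phi_quantile(data, 0.8)
```

Under this equal-weight reading, the 0.5-quantile of the slide's list is 8 and the 0.8-quantile is 13.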

Rank-based queries (cont.)
• Other statistics
  - Top-k most frequent elements
  - Find all elements with frequency > 0.1%
  - What is the frequency of element 3?
  - What is the total frequency of elements between 8 and 14?
  - How many elements have non-zero frequency?
[Figure: frequencies of elements 1 to 20]

Dominance-based queries
• n-dimensional numeric space D = (D1, …, Dn)
• On each dimension, a user preference ≺ is defined
• For two points u and v, u dominates v (u ≺ v) if
  - ∀ Di (1 ≤ i ≤ n): u.Di ⪯ v.Di
  - ∃ Dj (1 ≤ j ≤ n): u.Dj ≺ v.Dj
[Figure: points p1 to p8 plotted on X: academic score and Y: research score]

Dominance-based queries (cont.)
Skyline: the points that are not dominated by any other point.
  - Candidates for the best options in multi-criteria decision applications.
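
The dominance test and the skyline definition can be sketched with a brute-force check. This assumes that smaller values are preferred on every dimension (the slides leave the preference to the user), and the sample points are hypothetical.

```python
def dominates(u, v):
    """u dominates v: no worse on every dimension, strictly better on one.

    Assumption: smaller values are preferred on each dimension.
    """
    no_worse = all(a <= b for a, b in zip(u, v))
    strictly_better = any(a < b for a, b in zip(u, v))
    return no_worse and strictly_better

def skyline(points):
    """Points that no other point dominates (quadratic-time sketch)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

points = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
result = skyline(points)
```

Here (3, 3) and (4, 4) drop out because (2, 2) dominates both; the other three points are mutually incomparable.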

Proximity-based queries
• Range search
• Nearest neighbor search
• Reverse nearest neighbor search
[Figure: query point q among data points p2 to p7]
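
The three proximity queries can be illustrated with brute-force scans under Euclidean distance. The point coordinates and function names below are hypothetical; a practical system would answer these queries with a spatial index.

```python
import math

def range_search(points, q, r):
    """All points within distance r of the query q."""
    return [p for p in points if math.dist(p, q) <= r]

def nearest_neighbor(points, q):
    """The point closest to q."""
    return min(points, key=lambda p: math.dist(p, q))

def reverse_nearest_neighbors(points, q):
    """Points that have q as their nearest neighbor (ties count for q)."""
    result = []
    for p in points:
        d_q = math.dist(p, q)
        if all(d_q <= math.dist(p, o) for o in points if o != p):
            result.append(p)
    return result

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
q = (1.5, 0.0)
in_range = range_search(points, q, 1.0)
nn = nearest_neighbor(points, q)
rnn = reverse_nearest_neighbors(points, q)
```

Note that the reverse nearest neighbor result is not simply the nearest neighbor in general: it depends on how close each data point is to the other data points, not only to q.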

New Challenges (1)
Massive streaming data
• Data arrive at high speed and in extremely large volume.
  - Twitter: 140 million users and over 340 million tweets per day
  - 200 Mb/sec from a single sensor node reading weather data
  - AT&T collects 600-800 gigabytes of NetFlow data each day
  - Square Kilometre Array (SKA) project: a few exabytes (10^18 bytes) of data per day for a single beam per square kilometre

Streaming Algorithm
[Figure: data streams flow into a stream processing engine, which maintains synopses in memory and returns an (approximate) answer]
• One scan only
• Processing time (fast)
• Synopsis size (small)
• Accuracy (a good tradeoff with synopsis size)
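
To make the one-scan, small-synopsis idea concrete, here is a sketch of the classic Misra-Gries frequent-items summary. It is chosen purely as an illustration and is not claimed to be the technique used in the presenter's work. With parameter k it keeps at most k - 1 counters, and any element occurring more than n/k times in a stream of length n is guaranteed to remain in the synopsis.

```python
def misra_gries(stream, k):
    """One-pass frequent-items synopsis with at most k - 1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream: element 1 appears 6 times out of 10.
stream = [1, 1, 1, 2, 1, 3, 1, 4, 1, 5]
synopsis = misra_gries(stream, 2)
```

The stored counts underestimate the true frequencies by at most n/k, which is the accuracy versus synopsis size tradeoff listed on the slide.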

New Challenges (2)
The data may be uncertain for various reasons.
• Limits of the measuring devices
• Noise
• Delay or loss in data transfer
• Privacy
• Data integration
The uncertainty of the data may be described continuously or discretely.

New Challenges (3)
Enriched spatial data
• Textual data
  - Twitter, Weibo, Foursquare
• User profiles
  - age, gender, preferences, etc.
• Multimedia data
  - photos, videos

Rank-based queries
  - Top-k computation: find the top k objects for a given scoring/preference function.
  - Quantile computation: focus on approximate solutions in the context of data streams.
  - Others: counting the number of distinct objects.
Research outcome: ICDE’06, ICDE’07, TKDE’10, ADC’10 (best paper award), etc.

Dominance-based queries
• Skyline computation
• Skyline computation on uncertain data
• Skyline computation over uncertain data streams
Research outcome: ICDE’09, Info.Sys’11, ICDE’11, TODS’12, etc.

Dominance-based queries (cont.)
• Representative skyline computation (ICDE’07)
  - Find k skyline objects that together dominate the largest number of distinct objects.
• Top-k dominating query on uncertain data (VLDBJ’10)
  - Rank objects by their dominance power, i.e., the number of objects they dominate.
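
Dominance power can be sketched for the simpler case of certain (exact) objects; the uncertain-data version studied in the cited work is more involved and is not reproduced here. The sketch assumes smaller values are preferred, and the points are hypothetical.

```python
def dominates(u, v):
    """Assumption: smaller is better on every dimension."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def top_k_dominating(points, k):
    """Rank points by how many other points they dominate; keep the top k."""
    power = {p: sum(dominates(p, q) for q in points) for p in points}
    return sorted(points, key=lambda p: power[p], reverse=True)[:k]

points = [(1, 1), (2, 2), (3, 3), (1, 4), (4, 1)]
ranked = top_k_dominating(points, 2)
```

In this example (1, 1) dominates the four other points and (2, 2) dominates only (3, 3), so they rank first and second.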

Proximity-based queries
• Range search on uncertain data
  - Report objects whose appearance probability with respect to a search region r is larger than p.
  - The query is uncertain and the targets are certain objects (ICDE’09, TKDE’12)
  - The objects are uncertain (TKDE’10, EDBT’12, TKDE’13)
• Nearest neighbor search
  - Top-k nearest neighbor search on uncertain data (ICDE’10)
  - Top-k spatial keyword search (ICDE’13, EDBT’14)
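
The range search on uncertain objects can be sketched for the discrete case, assuming each object is a list of (location, probability) instances whose probabilities sum to 1 and that the search region is an axis-aligned rectangle. The object names and values are hypothetical.

```python
def appearance_probability(instances, region):
    """Total probability of the instances that fall inside the rectangle."""
    (x1, y1), (x2, y2) = region
    return sum(prob for (x, y), prob in instances
               if x1 <= x <= x2 and y1 <= y <= y2)

def probabilistic_range_search(objects, region, p):
    """Names of objects whose appearance probability exceeds p."""
    return [name for name, instances in objects.items()
            if appearance_probability(instances, region) > p]

objects = {
    "A": [((1, 1), 0.5), ((5, 5), 0.5)],
    "B": [((1, 2), 0.25), ((2, 1), 0.5), ((9, 9), 0.25)],
}
region = ((0, 0), (3, 3))
hits = probabilistic_range_search(objects, region, 0.6)
```

Object A appears in the region with probability 0.5 and B with probability 0.75, so only B passes the threshold of 0.6.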

Proximity-based queries (cont.)
• Reverse nearest neighbor search
  - Reverse nearest neighbor search (ICDE’11, ICDE’14)
  - Continuous monitoring of reverse nearest neighbors (VLDB’09, 2 VLDBJ’12)

Two recent research topics
• Skyline computation on uncertain data
• Spatial keyword search

Dominance Relation
Easy for certain objects, non-trivial for uncertain objects.

Uncertain Skyline Computation
(1) Probabilistic skyline
(2) Stochastic order
Non-trivial for uncertain objects.
[Figure: two uncertain objects A and B with instance values 1K, 500K, 100K, 80K, 50K, 10K]
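
One common discrete formulation of the probabilistic skyline assigns each uncertain object the probability that none of the other objects dominates it. The slide does not spell out its definition, so the sketch below is an assumption: objects are independent, each is a list of (point, probability) instances, and smaller values are preferred. The sample data are hypothetical.

```python
def dominates(u, v):
    """Assumption: smaller is better on every dimension."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline_probability(name, objects):
    """Probability that the named object is not dominated by any other object."""
    total = 0.0
    for u, p_u in objects[name]:
        not_dominated = 1.0
        for other, instances in objects.items():
            if other == name:
                continue
            # Probability mass of the other object that dominates instance u.
            dominating_mass = sum(p_v for v, p_v in instances if dominates(v, u))
            not_dominated *= 1.0 - dominating_mass
        total += p_u * not_dominated
    return total

objects = {
    "A": [((1, 1), 0.5), ((3, 3), 0.5)],
    "B": [((2, 2), 1.0)],
}
pr_a = skyline_probability("A", objects)
pr_b = skyline_probability("B", objects)
```

In this example neither object dominates the other outright: each ends up with skyline probability 0.5, which illustrates why dominance is non-trivial for uncertain objects.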

Spatial keyword search
Spatial-textual objects
• An enormous number of spatio-textual objects are available in many applications
  - Online local search, e.g., online yellow pages
  - Social network services, e.g., Facebook, Flickr, Twitter

Top-k spatial keyword search (ICDE’13)
[Figure: a query with keywords (pizza, coffee) among objects p1 (pizza, coffee, sushi), p2 (pizza, coffee, steak), p3 (pizza, sushi), p4 (coffee, sushi), p5 (pizza, steak, seafood)]
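
A brute-force sketch of the query in the figure: return the k objects closest to the query location among those containing all query keywords. The AND semantics, the coordinates, and the function name are assumptions; only the keyword sets come from the slide, and the indexing technique of the cited work is not reproduced.

```python
import math

def top_k_spatial_keyword(objects, q_loc, q_keywords, k):
    """k nearest objects whose keyword set contains every query keyword."""
    wanted = set(q_keywords)
    matches = [(math.dist(loc, q_loc), name)
               for name, (loc, keywords) in objects.items()
               if wanted <= set(keywords)]
    return [name for _, name in sorted(matches)[:k]]

# Keyword sets follow the slide; locations are hypothetical.
objects = {
    "p1": ((1.0, 1.0), {"pizza", "coffee", "sushi"}),
    "p2": ((4.0, 0.0), {"pizza", "coffee", "steak"}),
    "p3": ((0.5, 0.5), {"pizza", "sushi"}),
    "p4": ((2.0, 2.0), {"coffee", "sushi"}),
    "p5": ((3.0, 3.0), {"pizza", "steak", "seafood"}),
}
top1 = top_k_spatial_keyword(objects, (0.0, 0.0), ["pizza", "coffee"], 1)
top2 = top_k_spatial_keyword(objects, (0.0, 0.0), ["pizza", "coffee"], 2)
```

Although p3 is the closest object in this layout, it lacks "coffee", so only p1 and p2 qualify.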

Diversified spatial keyword search on road networks (EDBT’14)
• Consider the road network distance
  - Develop new signature techniques to improve I/O efficiency
• Consider the diversity of the results (spatial dispersion)
  - Ranking score: a linear combination of the distances from the objects to the query object (relevance) and the sum of pairwise distances among the resulting objects (diversity)
  - Develop incremental diversified top-k search algorithms
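
The ranking score can be sketched as below. Several details are assumptions, not the paper's exact formulation: Euclidean distance stands in for road network distance, the weight `lam` and the sign convention (closer objects and more dispersed results score higher) are illustrative, and the exhaustive search over subsets replaces the incremental algorithm.

```python
import math
from itertools import combinations

def diversified_score(subset, q, lam):
    """Linear combination of relevance (closeness to q) and diversity."""
    relevance = sum(math.dist(o, q) for o in subset)
    diversity = sum(math.dist(a, b) for a, b in combinations(subset, 2))
    return -lam * relevance + (1 - lam) * diversity

def best_diversified(objects, q, k, lam=0.5):
    """Exhaustively pick the size-k subset with the highest score."""
    return max(combinations(objects, k),
               key=lambda s: diversified_score(s, q, lam))

# Hypothetical object locations; all are assumed to match the keywords.
objects = [(1.0, 0.0), (1.0, 0.5), (-1.0, 0.0)]
chosen = best_diversified(objects, (0.0, 0.0), 2)
```

The two objects on opposite sides of the query are chosen over the two nearly co-located ones, because the diversity term rewards spatial spread.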

Streaming spatial keyword search
Spatial-textual objects arrive in a streaming fashion in many applications (e.g., Twitter and Weibo).
• Size estimation for spatial keyword search
• Continuous monitoring of local hot spots
• Continuous spatial keyword queries

Summary
• Massive multi-dimensional data in various applications
• Three fundamental problems for massive multi-dimensional data analysis
• New challenges and research opportunities

Thanks!
