EFFICIENT RANK BASED K-NN QUERY PROCESSING OVER UNCERTAIN DATA

Presented by: Duong, Huu Kinh Luan, March 14th, 2011

Outline • Authors of paper • What is the problem? • Why is there a problem? • Introduction • Background information • Problem Definition • Handling Technique • Algorithm • Experimental Results • Rank Based K-NN • Related Papers on the same topic • Top-k Properties • Problem Definition • Notations used • Exact Algorithm • Randomized Algorithm

Introduction • Authors of the paper: • Xuemin Lin: Professor, The University of New South Wales; PhD in C.S. from the U. Queensland (Australia), 1992 • Ying Zhang: Research Fellow; PhD, 01.2008 • Wenjie Zhang: Post-doc research fellow; PhD, 2010 • Gaoping Zhu: PhD Candidate • Qianlu Lin: PhD Candidate

Introduction • What is the problem? Uncertain data, e.g. from a sensor network or a GPS tracking device

Introduction • What is the problem? k-Nearest Neighbor query

Background information Rank based k-NN

Background information • Rank based k-NN is not a new problem: • G. Cormode, F. Li, and K. Yi, “Semantics of ranking queries for probabilistic data and expected ranks” • R. Cheng, L. Chen, J. Chen, and X. Xie, “Evaluating probability threshold k-nn queries over uncertain data” • V. Ljosa and A. K. Singh, “APLA: Indexing arbitrary probability distributions”

Outline • Introduction • Background information • Problem Definition • Handling Technique • Algorithm • Experimental Results

Problem definition • Set of objects: U = {U1, U2, …, Un}, e.g. U = {U1, U2, U3, U4} • Possible world: W = {u1, u2, u3, …, un}, e.g. W1 = {U1, U2} • Definition 1: Rank (rank of an object U in one possible world W) • [Figure: query point q and uncertain objects U1, U2, U3, U4]

Problem definition Definition 2: Expected Rank Definition 3: Median Rank

Problem definition • Example (shown on the board): • Possible worlds? • Rank for A, i.e. r(a1), r(a2), r(a3)? • Expected rank for A, i.e. er(A)? • Median rank for A, i.e. mr(A)?
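The formulas for Definitions 1 to 3 did not survive in this transcript, so the following is only a minimal sketch under stated assumptions: the rank r(u) of an instance u is taken as the expected number of other objects that lie closer to q than u; er(U) is the probability-weighted average of r(u) over U's instances; and mr(U) is r(u) of the instance at which the cumulative probability, with instances sorted by r(u), first reaches 0.5. The toy objects A, B, C and all function names are hypothetical.

```python
from itertools import product

# Hypothetical toy data: each object is a list of (distance_to_q, probability).
objects = {
    "A": [(1.0, 0.5), (4.0, 0.5)],
    "B": [(2.0, 1.0)],
    "C": [(3.0, 0.5), (5.0, 0.5)],
}

def instance_ranks(objects, name):
    """Return [(r(u), p(u))] for every instance u of object `name`."""
    others = [k for k in objects if k != name]
    result = []
    for d_u, p_u in objects[name]:
        r = 0.0
        # Enumerate every possible world that contains this instance:
        # one instance is chosen from each of the other objects.
        for choice in product(*(objects[k] for k in others)):
            p_world = 1.0
            closer = 0
            for d_v, p_v in choice:
                p_world *= p_v
                closer += d_v < d_u  # counts objects strictly closer to q
            r += p_world * closer
        result.append((r, p_u))
    return result

def expected_rank(objects, name):
    """Assumed Definition 2: probability-weighted average of instance ranks."""
    return sum(r * p for r, p in instance_ranks(objects, name))

def median_rank(objects, name):
    """Assumed Definition 3: rank at which cumulative probability reaches 0.5."""
    acc = 0.0
    for r, p in sorted(instance_ranks(objects, name)):
        acc += p
        if acc >= 0.5:
            return r
    raise ValueError("instance probabilities sum to less than 0.5")

def knn_by_expected_rank(objects, k):
    """Top-k query: the k objects with the smallest expected rank."""
    return sorted(objects, key=lambda name: expected_rank(objects, name))[:k]

print(instance_ranks(objects, "A"))      # [(0.0, 0.5), (1.5, 0.5)]
print(expected_rank(objects, "A"))       # 0.75
print(median_rank(objects, "A"))         # 0.0
print(knn_by_expected_rank(objects, 2))  # ['B', 'A']
```

Enumerating possible worlds is exponential in the number of objects, so this is only suitable for a board-sized example; it is the cost that the bound-based and randomized techniques later in the talk aim to avoid.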

Problem definition • Top–K Query: Find k nearest neighbors for a given query q based on the expected (median) ranks of n objects.

Problem definition • Top–K Properties: • Exact-k: a K-NN query answer should return exactly k objects • Containment: the (K+1)-NN answer should contain all objects in the K-NN answer • Unique ranking: the same object should not be listed multiple times in the K-NN answer • Value invariance: the distance values only determine the relative order of the objects • Stability: making an item in the top-k list more likely or more important should not remove it from the list

Problem definition • Top–K Properties: • The proof that expected rank satisfies all 5 top-k properties is not a major concern of this paper. • It is given in the paper “Semantics of ranking queries for probabilistic data and expected ranks”, by G. Cormode, F. Li, and K. Yi

Problem challenge • Overcome the previous paper's difficulties: • Reduce the number of objects accessed • Pre-computed expected scores of objects • Expected score might change with different queries • Approximation of the KNN query answer

Handling technique • Lemma 2: Let ui and uj be the instances which determine the median rank and the median distance of U, respectively. Then r(ui) = r(uj).

Handling technique • Finding Minimal Set for Selection Problem (Using Bound Based Approach) • Motivation for the Algorithm

Notations in the Algo.

Algorithm • Uncertain objects: R-Tree • Query q: also represented in R-Tree • e(I): from d-(I) to d+(I)

Algorithm • Example of calculating r-(I) and r+(I) [Figure annotations: "smaller than"; "sum up for r-(I)"]

Algorithm • Example of calculating r-(I) and r+(I) [Figure annotations: "smaller than"; "sum up for r+(I)"]

Algorithm • Exact Algorithm: • accrmin: accumulation of the probability values of the intervals {I ∈ I} with d+(I) <= d • Uarmin(d): accumulation of the probability values of the intervals {I ∈ I_U} with d+(I) <= d

Algorithm • Exact Algorithm, cost: • Initial procedure: O(n log n + np0 x cio) • One round: O(n x m log(n x m)) • Total time cost: T = O(h x n x m log(n x m)) + sum over i = 0..h of (npi x cio) • n: number of objects • m: number of intervals in one object • h: max height of local R-Tree • np0: number of IOs in the initial procedure • npi: number of IOs in the ith round • cio: cost of each IO

Algorithm • Randomized Algorithm: Sample the possible worlds such that the expected rank and median rank can be approximately computed in an efficient way.

Algorithm • Randomized Algorithm: Estimate the expected rank of an object U from the sampled possible worlds, where ri(U) is the rank of U in sample Si
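The estimator formula itself is missing from the transcript. A natural reading, assumed here, is the sample mean of ri(U) over N independently sampled possible worlds. The sketch below illustrates that idea on hypothetical toy data; it does not include the R-Tree candidate selection or the pruning bounds, and all names are illustrative.

```python
import random

# Hypothetical toy data: each object is a list of (distance_to_q, probability).
objects = {
    "A": [(1.0, 0.5), (4.0, 0.5)],
    "B": [(2.0, 1.0)],
    "C": [(3.0, 0.5), (5.0, 0.5)],
}

def sample_world(objects, rng):
    """Draw one possible world: one instance distance per object."""
    world = {}
    for name, instances in objects.items():
        dists = [d for d, _ in instances]
        probs = [p for _, p in instances]
        world[name] = rng.choices(dists, weights=probs, k=1)[0]
    return world

def rank_in_world(world, name):
    """ri(U): number of other objects strictly closer to q in this world."""
    return sum(1 for other, d in world.items()
               if other != name and d < world[name])

def estimate_expected_rank(objects, name, n_samples, seed=0):
    """Assumed estimator: average of ri(U) over n_samples sampled worlds."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0
    for _ in range(n_samples):
        total += rank_in_world(sample_world(objects, rng), name)
    return total / n_samples

# The estimate should approach the exact expected rank of A (0.75 for this
# toy data) as the number of samples grows.
print(estimate_expected_rank(objects, "A", 20000))
```

The estimate's error shrinks roughly with the square root of the number of samples, so accuracy is traded directly against sampling cost.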

Algorithm • Randomized Algorithm: • Find candidate objects C for the KNN query based on the global R-Tree • Compute the minimal/maximal expected rank for each object using a sweep-line algorithm • l and r: values used to prune or validate objects for the KNN query

Algorithm • Randomized Algorithm, cost: T = O(n log n + n' log n + n1 x cio) • Component costs shown on the slide: O(n log n), O(log n), O(n' log n + n1 x cio)

Algorithm • What is n’?

Experimental Results

Experimental Results • Comparison with the other paper [Figures: this paper vs. the other paper]

Q&A
