Top-k Queries on Uncertain Data • Advisor: Prof. 陳良弼 • Presenter: 鄧雅文 (97753034)
Outline • Introduction • Related Work • Problem Formulation • Future Work
Introduction • Top-k query on certain data • Rank results according to a user-defined score • Important for exploring large databases • E.g., top-2 = {T1, T2}
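As a point of reference for the uncertain case, a top-k query on certain data can be sketched in a few lines. The rows and the scoring function below are hypothetical (the slide's own table is not reproduced here); they are chosen only so that the top-2 answer is {T1, T2}.

```python
import heapq

def top_k(rows, k, score):
    """Return the k rows with the highest user-defined score, best first."""
    return heapq.nlargest(k, rows, key=score)

# Hypothetical certain data: every tuple exists and has one exact score.
rows = [
    {"id": "T1", "score": 95},
    {"id": "T2", "score": 90},
    {"id": "T3", "score": 70},
    {"id": "T4", "score": 60},
]

top2 = [r["id"] for r in top_k(rows, 2, score=lambda r: r["score"])]
print(top2)  # ['T1', 'T2']
```

With certain data the answer is a single, unambiguous list; the remaining slides deal with what happens when tuple existence is probabilistic.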
Introduction (cont.) • Uncertain database • How to define top-k on uncertain data? • Mutual exclusion rules: tuples that cannot co-exist in the same possible world • E.g., T1 ⊕ T4
Related Work • C. C. Aggarwal and P. S. Yu. A Survey of Uncertain Data Algorithms and Applications. In TKDE, 2009. • Causes of uncertainty: • Sensor networks, privacy, trajectory prediction… • The main research areas on uncertain data: • Modeling of uncertain data • Uncertain data management • Top-k query, range query, NN query… • Uncertain data mining • Clustering, classification, frequent patterns, outliers…
Related Work (cont.) • M. Soliman, I. Ilyas, and K. Chang. Top-k Query Processing in Uncertain Databases. In ICDE, 2007. • Possible Worlds
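A minimal sketch of possible-worlds enumeration, assuming a simple model in which each tuple has a membership probability and tuples in the same group are mutually exclusive (as with a rule such as T1 ⊕ T4). The probabilities and the function name `possible_worlds` are hypothetical and are not taken from the slides' table.

```python
from itertools import product

def possible_worlds(groups):
    """Yield (world, probability) pairs.

    groups: list of groups; each group is a list of (tuple_id, probability).
    Tuples in one group are mutually exclusive; different groups are
    assumed independent. An independent tuple is a group of size one.
    """
    options = []
    for group in groups:
        choices = list(group)
        none_prob = 1.0 - sum(p for _, p in group)
        if none_prob > 1e-12:          # the group may contribute no tuple
            choices.append((None, none_prob))
        options.append(choices)
    for combo in product(*options):    # pick one choice per group
        world = frozenset(tid for tid, _ in combo if tid is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob

# Hypothetical data: T1 and T4 exclude each other; T2 and T3 are independent.
groups = [[("T1", 0.4), ("T4", 0.6)], [("T2", 0.7)], [("T3", 0.5)]]
worlds = list(possible_worlds(groups))
for world, prob in sorted(worlds, key=lambda wp: -wp[1]):
    print(sorted(world), round(prob, 3))
```

The number of worlds grows exponentially with the number of groups, so this brute-force enumeration is only suitable for illustrating the semantics on small examples.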
Related Work (cont.) • U-Topk query • Return the k tuples that co-exist as the top-k of a possible world with the highest probability • E.g., {T1, T2} as U-Top2 • U-kRanks query • Return k tuples, each of which is the most probable tuple at its rank over all possible worlds • E.g., {T2, T6} as U-2Ranks
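The two query semantics can be sketched by brute force over an explicit list of possible worlds. The worlds below are hypothetical, and each one is written as a ranking already sorted by descending score. As a simplifying assumption, worlds with fewer than k tuples are ignored when computing U-Topk.

```python
from collections import defaultdict

def u_topk(worlds, k):
    """Return (vector, probability) for the most probable top-k vector."""
    vector_prob = defaultdict(float)
    for ranking, prob in worlds:
        if len(ranking) >= k:                      # assumption: skip short worlds
            vector_prob[tuple(ranking[:k])] += prob
    return max(vector_prob.items(), key=lambda item: item[1])

def u_kranks(worlds, k):
    """Return, for each rank 1..k, the (tuple, probability) most likely at that rank."""
    rank_prob = [defaultdict(float) for _ in range(k)]
    for ranking, prob in worlds:
        for i, tid in enumerate(ranking[:k]):
            rank_prob[i][tid] += prob
    # Assumes every rank 1..k is occupied in at least one world.
    return [max(d.items(), key=lambda item: item[1]) for d in rank_prob]

# Hypothetical possible worlds: (score-ordered ranking, probability).
worlds = [
    (("T1", "T2", "T3"), 0.2),
    (("T2", "T3"), 0.5),
    (("T1", "T3"), 0.2),
    (("T3",), 0.1),
]

best_vector, best_prob = u_topk(worlds, 2)
rank_winners = u_kranks(worlds, 2)
print(best_vector, best_prob)
print(rank_winners)
```

U-Topk scores whole vectors, so its answer is guaranteed to co-exist in some world; U-kRanks picks each rank separately, so its winners need not appear together in any single world.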
Related Work (cont.) • M. Hua, J. Pei, W. Zhang, X. Lin. Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach. In SIGMOD, 2008. • PT-k query • Return all tuples whose probability of being in the top-k is at least a threshold p • E.g., {T1, T2, T5} as PT-2 (with p=0.4)
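A brute-force sketch of the PT-k semantics: a tuple's top-k probability is the total probability of the possible worlds in which it ranks among the first k. The worlds below are hypothetical and each is given as a score-ordered ranking; the function name `pt_k` is likewise an illustrative choice, not the paper's algorithm.

```python
from collections import defaultdict

def pt_k(worlds, k, p):
    """Return the set of tuples whose top-k probability is at least p."""
    topk_prob = defaultdict(float)
    for ranking, prob in worlds:
        for tid in ranking[:k]:        # tuples ranked in the top k of this world
            topk_prob[tid] += prob
    return {tid for tid, pr in topk_prob.items() if pr >= p}

# Hypothetical possible worlds: (score-ordered ranking, probability).
worlds = [
    (("T1", "T2", "T3"), 0.2),
    (("T2", "T3"), 0.5),
    (("T1", "T3"), 0.2),
    (("T3",), 0.1),
]

answer = pt_k(worlds, k=2, p=0.5)
print(sorted(answer))  # ['T2', 'T3']
```

Unlike U-Topk and U-kRanks, the size of the answer is not fixed at k: it depends on the threshold p and can be smaller or larger than k.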
Related Work (cont.) • T. Ge, S. Zdonik, and S. Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. In SIGMOD, 2009. • The tradeoff between reporting high-scoring tuples and tuples with a high probability of being in the top-k • Return a number of typical vectors that efficiently sample the distribution of all potential top-k tuple vectors
Problem Formulation • Example: • In an International Tenpin Bowling Championship, the events include singles, doubles, and trios. Because of a limited budget, the coach can choose only 3 players to attend. We therefore want these 3 players to have a relatively high probability of performing well across all 3 types of events.
Problem Formulation (cont.) • U-Top3 = {T2, T5, T6} • But U-Top2 = {T1, T2} and U-Top1 = {T1}, so T1 is missing from U-Top3 • Should {T1, T2, T5} also be considered as a top-3 answer?
Problem Formulation (cont.) • We choose the answers of a top-k query based not only on the probability (P) but also on the confidence (C). • Confidence: expresses the top-(k-1) probabilities of the sets formed by any k-1 tuples of a possible top-k answer • E.g., k=3: {T1, T2, T3} is a possible top-3 answer with P=0.0356. C is composed in some way of Pr({T1, T2} is top-2)=0.2542 and its confidence, Pr({T1, T3} is top-2)=0.0218 and its confidence, and Pr({T2, T3} is top-2)=0.0512 and its confidence
Problem Formulation (cont.) • Since every possible top-k answer has two features, probability (P) and confidence (C), we return only the non-dominated answers as the result set. • E.g., {T1, T3, T5}: P=0.8, C=0.4; {T1, T4, T7}: P=0.5, C=0.7; {T2, T6, T7}: P=0.3, C=0.2 (dominated by both of the others, so it will not be returned)
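The non-dominated filter over (P, C) pairs can be sketched as follows, using the three candidate answers from the example above. The dominance test used here is the usual Pareto definition (at least as good on both features and strictly better on one), which is an assumption, since the slides do not spell it out; the function names are illustrative.

```python
def dominates(a, b):
    """True if (P, C) pair a is at least as good as b on both features
    and strictly better on at least one (assumed Pareto dominance)."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def non_dominated(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return {
        answer: pc
        for answer, pc in candidates.items()
        if not any(
            dominates(other_pc, pc)
            for other, other_pc in candidates.items()
            if other != answer
        )
    }

# Candidate top-3 answers with (probability P, confidence C) from the example.
candidates = {
    ("T1", "T3", "T5"): (0.8, 0.4),
    ("T1", "T4", "T7"): (0.5, 0.7),
    ("T2", "T6", "T7"): (0.3, 0.2),
}

result = non_dominated(candidates)
print(sorted(result))  # [('T1', 'T3', 'T5'), ('T1', 'T4', 'T7')]
```

This pairwise comparison is quadratic in the number of candidates, which is fine for illustration; generating the candidates and computing C efficiently is left open by the slides.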
Future Work • Formulate the confidence function • Find an algorithm to generate the result set • Try to calculate the confidence in an efficient way • Carry out an empirical study on datasets