Probabilistic Similarity Query on Dimension Incomplete Data. Wei Cheng (1), Xiaoming Jin (1), and Jian-Tao Sun (2). (1) Intelligent Data Engineering Group, School of Software, Tsinghua University; (2) Microsoft Research Asia. ICDM 2009, Miami.
Outline • Motivation & Problem • Our Solution • Experiments • Related Work • Summary and Future Work
Motivation • Multidimensional data are everywhere • Time series • stock data • data collected from sensor monitors • Feature vectors extracted from images or texts • … • Similarity query on multidimensional data is important • data mining • databases • information retrieval
Similarity query is challenging when the data is incomplete • Data incompleteness happens when: • Sensors do not work properly • Certain features are missing from particular feature vectors • … • To process a similarity query, imputation is necessary, i.e. “completing” the missing data by filling in specific values. • [Figure: a query compared with a sensor data vector (2, X, 3, 12, …), a text vector (4, 1, Y, 9, …), and an image vector (Z, 5, 2, 11, …), with X, Y, and Z in place of missing values]
Dimension incomplete data • Dimension incomplete data satisfies: • (a) At least one of its data elements is missing; • (b) The dimension of the missing data element cannot be determined. • E.g. • Observed data: (3, 6) • But we know the complete data should have three dimensions • The missing element might be in the first, second, or third dimension.
Causes of dimension incompleteness • Dimension incompleteness happens when: • Data elements go missing while their order serves as the implicit dimension indicator • The dimension indicator itself is lost • …
Similarity query is more challenging when the dimension is incomplete • To measure the similarity between the query and a dimension incomplete data object, we must first recover the incomplete data. • Enumerating all combination cases? Too time-consuming • For an m-dimensional data object with n missing elements, there are $C_m^n$ cases to recover it. • E.g. $X_{obs} = (3, 6)$ has lost one dimension, giving 3 possible results after data recovery, where X is the imputed element: (X, 3, 6), (3, X, 6), (3, 6, X).
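The enumeration on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `enumerate_recoveries` and the use of `None` as a placeholder for an element that still has to be imputed are assumptions introduced here, not part of the slides.

```python
from itertools import combinations
from math import comb

def enumerate_recoveries(x_obs, m):
    """Yield every order-preserving placement of x_obs into m dimensions.

    Missing positions are marked with None (a hypothetical placeholder
    for an element that still has to be imputed).
    """
    n_missing = m - len(x_obs)
    for missing in combinations(range(m), n_missing):
        missing = set(missing)
        observed = iter(x_obs)
        yield tuple(None if i in missing else next(observed) for i in range(m))

recoveries = list(enumerate_recoveries((3, 6), 3))
print(recoveries)      # the three candidate recoveries of X_obs = (3, 6)
print(comb(251, 25))   # case count for a 251-dimensional object missing 25 elements
```

The second print shows why plain enumeration does not scale: the number of cases grows combinatorially with the number of missing elements.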
Two assumptions: • Every recovery result is equally probable. • The missing values follow a normal distribution.
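To make the two assumptions concrete, the sketch below estimates by sampling how likely a dimension incomplete object is to lie within distance r of a query. It assumes that the confidence of an object is Pr[δ(Q, X) ≤ r], which is how the slides appear to use r and the confidence threshold but is not spelled out in this transcript. The function name `estimate_confidence` and the single global normal distribution with parameters `mu` and `sigma` are hypothetical simplifications.

```python
import math
import random

def estimate_confidence(q, x_obs, r, mu, sigma, trials=20000, seed=0):
    """Monte Carlo estimate of Pr[dist(q, X) <= r] for a dimension incomplete x_obs.

    Assumptions (illustrative only): the missing positions are chosen
    uniformly at random, so every recovery result is equally probable, and
    every missing value is drawn from Normal(mu, sigma).
    """
    rng = random.Random(seed)
    m, n_missing = len(q), len(q) - len(x_obs)
    hits = 0
    for _ in range(trials):
        missing = set(rng.sample(range(m), n_missing))
        observed = iter(x_obs)
        x = [rng.gauss(mu, sigma) if i in missing else next(observed)
             for i in range(m)]
        if math.dist(q, x) <= r:
            hits += 1
    return hits / trials

print(estimate_confidence((1, 4, 5, 6, 7), (2, 8, 7), r=4.0, mu=5.0, sigma=1.0))
```

Sampling like this is only a baseline for intuition; it costs many distance computations per object, which is the expense the pruning methods in the following slides aim to avoid.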
Efficient approach for PSQ-DID (probabilistic similarity query on dimension incomplete data) • A gradual refinement search strategy including two pruning methods: • Lower/upper bounds of confidence • Probability triangle inequality • [Figure: our overall query process]
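A gradual refinement strategy of this kind is commonly organised as a cheap filter followed by an expensive verification step. The sketch below shows that general pattern under stated assumptions; it is not the authors' query process, and the callbacks `lower_bound`, `upper_bound`, and `verify` are hypothetical stand-ins for the pruners and for exact confidence computation.

```python
def gradual_refinement(objects, c, lower_bound, upper_bound, verify):
    """Filter-then-verify sketch; the three callbacks are hypothetical.

    lower_bound(x) and upper_bound(x) cheaply bound the confidence of x;
    verify(x) returns the exact (expensive) confidence.
    Returns the query results and the number of objects the bounds decided.
    """
    results, n_definite = [], 0
    for x in objects:
        if lower_bound(x) >= c:      # certainly a result
            results.append(x)
            n_definite += 1
        elif upper_bound(x) < c:     # certainly a dismissal
            n_definite += 1
        elif verify(x) >= c:         # undecided: fall back to verification
            results.append(x)
    return results, n_definite
```

The tighter the bounds, the fewer objects reach `verify`, which is where most of the cost would otherwise go.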
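Lower and upper bounds of confidence • The missing part and the observed part of the dimension incomplete data are treated separately. • Since we use Euclidean distance, the squared distance is the sum of a term over the observed dimensions and a term over the missing dimensions, so each part can be bounded on its own: • Lower/upper bounds of the observed part, denoted by $\delta_{LBobs}$ and $\delta_{UBobs}$. • Lower/upper bounds of the missing part, denoted by $\delta_{LBmis}$ and $\delta_{UBmis}$.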
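The separation can be written out as follows. This is a sketch in notation introduced here for illustration: $O$ and $M$ stand for the sets of observed and missing dimensions of one recovery version $X_{rv}$, and the bounds on the observed part are read as holding over all recovery versions, which is an interpretation of the slide rather than a statement taken from it.

```latex
\begin{align}
\delta^2(Q, X_{rv})
  &= \underbrace{\sum_{i \in O} (q_i - x_i)^2}_{\text{observed part}}
   + \underbrace{\sum_{j \in M} (q_j - x_j)^2}_{\text{missing part}}, \\
\delta^2_{LBobs}(Q, X_{obs})
  &\le \sum_{i \in O} (q_i - x_i)^2
  \le \delta^2_{UBobs}(Q, X_{obs}).
\end{align}
```

Only the missing part involves random variables, so the observed part can be bounded deterministically without enumerating recovery versions one by one.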
E.g. • $X_{obs} = (2, 8, 7)$, $Q = (1, 4, 5, 6, 7)$ • $\delta^2_{LBobs}(Q, X_{obs}) = (2-1)^2 + (8-6)^2 + (7-7)^2 = 5$, corresponding recovery version: $(2, 8, 7, x_1, x_2)$ • For the imputed random variables $X_{mis} = \{x_1, x_2\}$, if the imputation policy takes the mean of the two adjacent observed elements as the expectation of each imputed random variable, then $\delta^2_{LBmis}(Q, X_{mis}) = (4 - x_1)^2 + (5 - x_2)^2$ with $E(x_1) = E(x_2) = 5$, corresponding to $X_{rv} = (2, x_1, x_2, 8, 7)$, where both imputed positions have expectation 5.
Lower and upper bounds of confidence • We prove a lower and an upper bound on the confidence, each given a shorthand notation [equations not preserved]
Probability triangle inequality • Given a query Q and a multidimensional data object R with |Q| = |R|, for a dimension incomplete data object $X_{obs}$ whose underlying complete version is X, two inequalities, (1) and (2), hold [inequalities not preserved] • Terms calculated in advance and stored in the database: $O(|X_{obs}|(|Q| - |X_{obs}|)^2)$ • Terms calculated during query processing: $O(|Q|)$
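Inequalities (1) and (2) did not survive in this transcript. As a sketch of the kind of bound the slide title suggests, and not necessarily the authors' exact statement, the ordinary triangle inequality on the metric $\delta$ already yields two probability bounds through an assistant object $R$, assuming the confidence is $\Pr[\delta(Q, X) \le r]$ for a distance threshold $r$:

```latex
\begin{align}
\delta(Q, X) \le r
  &\;\Rightarrow\; \delta(R, X) \le \delta(R, Q) + \delta(Q, X) \le r + \delta(Q, R), \\
\text{hence}\quad \Pr[\delta(Q, X) \le r]
  &\le \Pr[\delta(R, X) \le r + \delta(Q, R)]; \\[4pt]
\delta(R, X) \le r - \delta(Q, R)
  &\;\Rightarrow\; \delta(Q, X) \le \delta(Q, R) + \delta(R, X) \le r, \\
\text{hence}\quad \Pr[\delta(Q, X) \le r]
  &\ge \Pr[\delta(R, X) \le r - \delta(Q, R)].
\end{align}
```

In bounds of this form the only quantity that depends on both the query and the assistant object is $\delta(Q, R)$, a distance between two complete vectors that costs $O(|Q|)$, which is consistent with the split between precomputed and query-time terms noted on the slide.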
Experiments • Data sets: • Standard & Poor's 500 index historical stock data (S&P500), 251 dimensions • A new data set with 30 dimensions, obtained by segmenting the S&P500 data set, resulting in 4328 data objects • Corel Color Histogram data (IMAGE): 68040 images, 32 dimensions • Dimension incomplete data sets: created by randomly removing some dimensions of each data object.
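The construction of the dimension incomplete data sets can be sketched as below. The function name `make_dimension_incomplete`, the `missing_ratio` parameter, and the rounding of the number of removed elements are assumptions made for illustration; the slides only state that some dimensions are removed at random.

```python
import random

def make_dimension_incomplete(x, missing_ratio, rng):
    """Drop a random subset of elements from x and forget where they were.

    The surviving elements keep their relative order, but their original
    dimension indices are not returned, which is what makes the result
    dimension incomplete rather than merely incomplete.
    """
    n_missing = round(len(x) * missing_ratio)
    drop = set(rng.sample(range(len(x)), n_missing))
    return tuple(v for i, v in enumerate(x) if i not in drop)

rng = random.Random(0)
complete = tuple(range(30))
print(make_dimension_incomplete(complete, 0.1, rng))
```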
Experiment Setup • Ground truth: • Similarity query results on the complete data • Performance measures: • Precision, recall, pruning power • Pruning power = $N_{definite} / N_{processed}$ • $N_{processed}$: number of all data objects • $N_{definite}$: number of data objects judged as dismissals or search results by the pruner • Queries: 100 data objects randomly sampled from the data set
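The three measures can be computed as in the following sketch. Precision and recall use their standard definitions against the ground truth, and pruning power follows the ratio given on the slide; the function names and the convention of returning 1.0 for an empty denominator are assumptions made here.

```python
def precision_recall(returned, ground_truth):
    """Standard precision and recall of a result set against the ground truth."""
    returned, ground_truth = set(returned), set(ground_truth)
    hits = len(returned & ground_truth)
    precision = hits / len(returned) if returned else 1.0       # assumed convention
    recall = hits / len(ground_truth) if ground_truth else 1.0  # assumed convention
    return precision, recall

def pruning_power(n_definite, n_processed):
    """Fraction of processed objects that the pruner decided on its own."""
    return n_definite / n_processed

print(precision_recall([1, 2, 3, 4], [2, 3, 5]))
print(pruning_power(30, 40))
```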
Effectiveness of probabilistic similarity query on dimension incomplete data • [Figures: query precision and query recall on the S&P500 data set]
Effectiveness of probabilistic similarity query on dimension incomplete data • [Figures: query precision and query recall on the IMAGE data set]
Effect of the confidence threshold • Missing ratio = 0.1; r = 60 for S&P500, r = 0.7 for IMAGE data • [Figure: confidence threshold vs. precision and recall]
Effectiveness of different pruners • [Figure: pruning power of probability triangle inequality]
Pruning Power of Four Pruners • Pruner1: probability triangle inequality using the confidence lower bound; Pruner2: probability triangle inequality using the confidence upper bound; Pruner3: confidence lower bound; Pruner4: confidence upper bound • Missing ratio = 10%, c = 0.1, number of assistant objects = 20 • [Figure: pruning power of four pruners]
Comparison of query quality when neglecting naïve verification • For data objects that the four pruners cannot judge, Pos simply outputs them as query results; Neg, by contrast, judges them as dismissals. • c = 0.1 • [Figure: comparison of query quality]
Performance analysis • [Figure: time cost]
Related Work • Few research papers discuss similarity search on dimension incomplete data • Incomplete data • Recovery: D. Williams et al. [ICML’05], K. Lakshminarayan et al. [Applied Intelligence’99], … • Indexing: G. Canahuate et al. [EDBT’06], B. C. Ooi et al. [VLDB’98], … • Uncertain data: J. Pei et al. [SIGMOD’08], D. Burdick et al. [VLDB’05], … • Dimension incomplete data • Symbolic sequences: J. Gu et al. [DEXA’07]
Summary and Future Work • Problem: • Tackle similarity query on a new form of uncertain data (dimension incomplete data) • Solution: • Lower and upper bounds of confidence, so that we avoid enumerating all $C_{|Q|}^{|X_{mis}|}$ recovery cases • Probability triangle inequality, which further boosts performance during query processing • Future work: • Other similarity measures • Indexing dimension incomplete data