An Efficient Distance Calculation Method for Uncertain Objects Edward Hung csehung@comp.polyu.edu.hk Hong Kong Polytechnic University 2007 CIDM, Hawaii, USA, Apr 1-5, 2007
Outline • Why do we care about uncertain objects and their distances? • Analytic Solutions for Uniform and Gaussian Distributions • Five Approximation Methods (DM, PRS, GAPS, PGM, ASG) for Arbitrary Distributions • Equivalence of PRS, PGM and ASG • Performance Study • Conclusion
Uncertain Objects: From Where? • Sources • Readings from sensors • Classification results of image processing using statistical classifiers • Results from predictive programs used for the stock market • Weather prediction • Etc.
Uncertain Objects: How to Represent? • Representation • An exact value with margins of error • E.g., 156±0.5, [23.8, 24.9] • An uncertainty domain with a probability distribution/density function (PDF/pdf) • Discrete E.g., for object o1, UD(o1) = {5.1,5.2,5.3}, P1(5.1) = 0.3, P1(5.2)=0.4, P1(5.3)=0.3 • Continuous E.g., for object o2 with uniform distribution, UD(o2) = [6,11], p2(x) = 0.2 where 6 ≤ x ≤ 11
Uncertain Objects Handled Traditionally • Transformed into exact values to store in traditional databases • Weighted average or mean • Value of highest frequency or possibility • Why is this bad? • Intermediate and final results of mining or queries will also be approximate and may be wrong • E.g., deviation of cluster centroids and wrong assignment of some data • Shown in experimental results later
Distance: Why Important? • Various queries and data mining tasks, e.g., • Nearest-neighbor queries • Clustering (e.g., K-means clustering)
Distance: Why Expensive? • An uncertain object has more than one possible location • Discrete: e.g., o1 (o2) has n1 (n2) possible locations • n1 × n2 possible pair-wise combinations of their locations to calculate distances • Probability of each location may be different
Distance: Why Expensive? • Continuous: e.g., take n samples on each uncertain object • More samples in regions of higher probability density • Each sample has the same probability
Distance: Why Expensive? • Approximation by a grid of a finite number of cells formed on the uncertainty domain (region), as used in, e.g., Ngai et al., “Efficient clustering of uncertain data”, 2006 IEEE International Conference on Data Mining (ICDM) • A grid of 14 × 14 cells • Probability of each cell determined by sampling • All combinations of cells of two objects: 196 × 196 distance calculations
Why Expected Distance? • Using all possible pair-wise combinations yields a distance function $d_{i,j}(x)$ that returns the probability (or density) that the distance between objects $o_i$ and $o_j$ is $x$ • VERY expensive (previous slides) • Expected distance: weighted average of the distances of all combinations • Could be much cheaper IF we do NOT need to try all combinations • Squared Euclidean distance chosen • Easier to integrate than Euclidean distance or Manhattan distance
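The "weighted average of all combinations" can be sketched for discrete objects as follows. This is a minimal illustration, not the author's code: the function name `expected_sq_distance_discrete` is hypothetical, o1 reuses the discrete example from the representation slide, and o2 is an invented object added only for the demonstration.

```python
import numpy as np

def expected_sq_distance_discrete(locs_i, probs_i, locs_j, probs_j):
    """Expected squared Euclidean distance between two discrete uncertain objects.

    Every one of the n_i * n_j location pairs is visited and weighted by the
    product of the two location probabilities (objects assumed independent).
    """
    xi = np.asarray(locs_i, dtype=float).reshape(len(locs_i), -1)  # (n_i, d)
    xj = np.asarray(locs_j, dtype=float).reshape(len(locs_j), -1)  # (n_j, d)
    pi = np.asarray(probs_i, dtype=float)
    pj = np.asarray(probs_j, dtype=float)
    diff = xi[:, None, :] - xj[None, :, :]   # all pair-wise differences
    sq = np.sum(diff ** 2, axis=-1)          # (n_i, n_j) squared distances
    return float(pi @ sq @ pj)               # probability-weighted average

# o1 is the discrete example from the slides; o2 is a hypothetical object.
o1_locs, o1_probs = [5.1, 5.2, 5.3], [0.3, 0.4, 0.3]
o2_locs, o2_probs = [6.0, 7.0], [0.5, 0.5]
ed = expected_sq_distance_discrete(o1_locs, o1_probs, o2_locs, o2_probs)
print(ed)
```

The cost grows with the product n_i × n_j, which is the expense the slides set out to avoid.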
Analytic Solutions • Uniform pdf • Gaussian pdf
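Uniform pdf • (1) $c^2 + (a^2 - ab + b^2)/3$ • (2) $C^2 + r^2/2$ • (3) $C^2 + 3r^2/5$ • (4) $C^2 + (r_1^2 + r_2^2)/3$ • (5) $C^2 + (r_1^2 + r_2^2)/2$ • (6) $C^2 + r_1^2/2 + 3r_2^2/5$ • (7) $C^2 + 3(r_1^2 + r_2^2)/5$
Gaussian pdf • For objects $o_i$ with Gaussian pdf $N(\mu_i, \Sigma_i)$, where $\mu_i$ is a $d \times 1$ mean vector and $\Sigma_i$ is a $d \times d$ covariance matrix • The expected distance between objects $o_i$ and $o_j$ is • $ED_{AS}(o_i, o_j) = \|\mu_i - \mu_j\|^2 + \mathrm{trace}(\Sigma_i) + \mathrm{trace}(\Sigma_j)$ • where $\mathrm{trace}(\Sigma_i)$ is the sum of all diagonal elements of $\Sigma_i$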
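The transcript does not say which geometric configuration each formula belongs to. As a sanity check under a loudly stated assumption, the sketch below treats the fifth formula, C² + (r1² + r2²)/2, as the expected squared distance between two independent objects that are uniform over discs of radii r1 and r2 whose centers are C apart, and compares it with a Monte Carlo estimate. The helper `sample_uniform_disc` and the chosen values of C, r1 and r2 are hypothetical.

```python
import numpy as np

def sample_uniform_disc(center, radius, n, rng):
    """Draw n points uniformly from a 2D disc (sqrt trick keeps area density uniform)."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    rho = radius * np.sqrt(rng.uniform(0.0, 1.0, size=n))
    offsets = np.stack([rho * np.cos(theta), rho * np.sin(theta)], axis=1)
    return np.asarray(center, dtype=float) + offsets

rng = np.random.default_rng(0)
C, r1, r2 = 5.0, 1.0, 2.0                      # illustrative values only
x = sample_uniform_disc([0.0, 0.0], r1, 200_000, rng)
y = sample_uniform_disc([C, 0.0], r2, 200_000, rng)

# Monte Carlo estimate from independent sample pairs
mc_estimate = float(np.mean(np.sum((x - y) ** 2, axis=1)))
# Closed form, ASSUMING formula (5) describes two uniform discs
closed_form = C ** 2 + (r1 ** 2 + r2 ** 2) / 2
print(mc_estimate, closed_form)
```

If the assumption about the configuration is right, the two printed values should agree up to sampling noise.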
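The Gaussian formula above translates almost directly into code. The sketch below is illustrative: the function name `ed_gaussian` and the example parameters are hypothetical, and the two objects are assumed to be independent, so no cross-covariance term appears.

```python
import numpy as np

def ed_gaussian(mu_i, cov_i, mu_j, cov_j):
    """Expected squared Euclidean distance between two independent Gaussian objects."""
    mu_i = np.asarray(mu_i, dtype=float)
    mu_j = np.asarray(mu_j, dtype=float)
    cov_i = np.asarray(cov_i, dtype=float)
    cov_j = np.asarray(cov_j, dtype=float)
    # ||mu_i - mu_j||^2 + trace(Sigma_i) + trace(Sigma_j)
    return float(np.sum((mu_i - mu_j) ** 2) + np.trace(cov_i) + np.trace(cov_j))

# Hypothetical 2D example
ed = ed_gaussian([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0]],
                 [3.0, 4.0], [[0.5, 0.0], [0.0, 0.5]])
print(ed)
```

Only the diagonals of the covariance matrices matter, so off-diagonal correlations within an object do not change the result.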
Approximation Methods for Arbitrary pdf • 5 methods proposed: • Distance between Means (DM) • Pair-wise between Random Samples (PRS) • Grid Approximation and Pair-wise between Samples (GAPS) • Pair-wise between Gaussian Mixture (PGM) • Approximation by Single Gaussian (ASG)
1. Distance between Means (DM) • $ED_{DM}(o_i, o_j) = \|\mu_i - \mu_j\|^2$
2. Pair-wise between Random Samples (PRS) • Take n samples on each uncertain object • More samples in regions of higher probability density; each sample has the same probability
3. Grid Approximation and Pair-wise between Samples (GAPS) • Approximation by a grid of √s × √s cells formed on the uncertainty domain • Probability of each cell determined by sampling
4. Pair-wise between Gaussian Mixture (PGM) • Approximate an uncertain object $o_i$ by a mixture of Gaussian distributions $\sum_{u \in C_i} A_{i,u} N(\mu_{i,u}, \Sigma_{i,u})$ (use K-means to cluster the samples into a few clusters) • $ED_{PGM}(o_i, o_j) = \sum_{u \in C_i} \sum_{v \in C_j} A_{i,u} A_{j,v} (\|\mu_{i,u} - \mu_{j,v}\|^2 + \mathrm{trace}(\Sigma_{i,u}) + \mathrm{trace}(\Sigma_{j,v}))$
5. Approximation by Single Gaussian (ASG) • Approximate an uncertain object $o_i$ by a single Gaussian distribution $N(\mu_i, \Sigma_i)$ • $ED_{ASG}(o_i, o_j) = \|\mu_i - \mu_j\|^2 + \mathrm{trace}(\Sigma_i) + \mathrm{trace}(\Sigma_j)$ • Complexity = $O((n_i + n_j)d)$
Equivalence of PRS, PGM and ASG • Theorem: • Given any uncertain objects $o_i$, $o_j$ and their samples $x_{i,1}, \ldots, x_{i,n_i}$ and $x_{j,1}, \ldots, x_{j,n_j}$, $ED_{PRS}(o_i, o_j) = ED_{PGM}(o_i, o_j) = ED_{ASG}(o_i, o_j)$ • Theoretically, ASG is the least expensive of all methods (except DM) while giving the same results as PRS and PGM • How does it compare with DM and GAPS?
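A minimal sketch of PRS, assuming the estimate is the unweighted average of the squared distances over all sample pairs (the slide states only that every sample carries the same probability). The function name `ed_prs` and the sample values are hypothetical.

```python
import numpy as np

def ed_prs(samples_i, samples_j):
    """Average squared Euclidean distance over all n_i * n_j sample pairs."""
    xi = np.asarray(samples_i, dtype=float)   # (n_i, d)
    xj = np.asarray(samples_j, dtype=float)   # (n_j, d)
    diff = xi[:, None, :] - xj[None, :, :]    # (n_i, n_j, d) pair-wise differences
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

# Hypothetical samples for two objects in 2D
samples_i = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
samples_j = [[1.0, 1.0], [1.0, 3.0]]
ed = ed_prs(samples_i, samples_j)
print(ed)
```

Both time and memory scale with n_i × n_j × d in this direct form.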
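A rough 2D sketch of the grid idea follows. Several choices here are assumptions rather than details from the slides: the uncertainty domain is taken to be the bounding box of the samples, each cell is represented by its center, and the cell probability is the fraction of samples falling in it. The function names and the synthetic Gaussian samples are hypothetical.

```python
import numpy as np

def grid_approximation(samples, cells_per_side):
    """Return (cell centers, cell probabilities) for a square grid over 2D samples."""
    x = np.asarray(samples, dtype=float)
    counts, xedges, yedges = np.histogram2d(x[:, 0], x[:, 1], bins=cells_per_side)
    cx = (xedges[:-1] + xedges[1:]) / 2.0
    cy = (yedges[:-1] + yedges[1:]) / 2.0
    gx, gy = np.meshgrid(cx, cy, indexing="ij")        # matches counts[i, j] layout
    centers = np.stack([gx.ravel(), gy.ravel()], axis=1)
    probs = counts.ravel() / counts.sum()              # probability by sampling
    return centers, probs

def ed_gaps(samples_i, samples_j, cells_per_side):
    """Probability-weighted squared distance over all cell pairs of two grids."""
    ci, pi = grid_approximation(samples_i, cells_per_side)
    cj, pj = grid_approximation(samples_j, cells_per_side)
    diff = ci[:, None, :] - cj[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)                    # (s, s) with s = cells_per_side**2
    return float(pi @ sq @ pj)

rng = np.random.default_rng(0)
samples_i = rng.normal([0.0, 0.0], 1.0, size=(500, 2))   # synthetic object
samples_j = rng.normal([3.0, 4.0], 1.0, size=(500, 2))   # synthetic object
ed = ed_gaps(samples_i, samples_j, cells_per_side=14)    # 196 x 196 cell pairs
print(ed)
```

With 14 cells per side this evaluates 196 × 196 cell pairs, the count quoted earlier in the slides.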
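A sketch of PGM under stated assumptions: the cluster labels are taken as given (for example from K-means, which is not implemented here), each mixture weight is the fraction of samples in the cluster, and each component covariance is normalized by the cluster size. The function names and the toy data are hypothetical.

```python
import numpy as np

def mixture_from_labels(samples, labels):
    """One Gaussian component per cluster: (weight, mean, trace of covariance)."""
    x = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    components = []
    for u in np.unique(labels):
        xu = x[labels == u]
        weight = len(xu) / len(x)                  # A_{i,u}
        mean = xu.mean(axis=0)                     # mu_{i,u}
        trace = float(np.sum(xu.var(axis=0)))      # trace(Sigma_{i,u}), ddof=0
        components.append((weight, mean, trace))
    return components

def ed_pgm(components_i, components_j):
    """Weighted sum of the Gaussian closed form over all component pairs."""
    total = 0.0
    for a_u, mu_u, tr_u in components_i:
        for a_v, mu_v, tr_v in components_j:
            total += a_u * a_v * (np.sum((mu_u - mu_v) ** 2) + tr_u + tr_v)
    return float(total)

# Toy samples with hand-assigned cluster labels (stand-in for K-means output)
samples_i = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
labels_i = [0, 0, 1, 1]
samples_j = [[1.0, 1.0], [1.0, 3.0]]
labels_j = [0, 0]
ed = ed_pgm(mixture_from_labels(samples_i, labels_i),
            mixture_from_labels(samples_j, labels_j))
print(ed)
```

The cost depends on the number of component pairs rather than the number of sample pairs, once the components have been fitted.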
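A sketch of ASG that stays within the stated O((n_i + n_j)d) cost by computing the trace as the sum of per-dimension variances instead of forming the full d × d covariance matrix. It assumes the covariance is normalized by the number of samples (NumPy's default `ddof=0`); the function name `ed_asg` and the toy data are hypothetical.

```python
import numpy as np

def ed_asg(samples_i, samples_j):
    """Fit one Gaussian per object and apply the Gaussian closed form."""
    xi = np.asarray(samples_i, dtype=float)   # (n_i, d)
    xj = np.asarray(samples_j, dtype=float)   # (n_j, d)
    mu_i, mu_j = xi.mean(axis=0), xj.mean(axis=0)
    # trace(Sigma) equals the sum of per-dimension variances; one pass per object
    trace_i = np.sum(xi.var(axis=0))
    trace_j = np.sum(xj.var(axis=0))
    return float(np.sum((mu_i - mu_j) ** 2) + trace_i + trace_j)

# Toy samples for two objects in 2D
samples_i = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
samples_j = [[1.0, 1.0], [1.0, 3.0]]
ed = ed_asg(samples_i, samples_j)
print(ed)
```

Each object is scanned once, and no pair of samples from different objects is ever compared directly.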
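One way to see why PRS and ASG agree is sketched below. It is not the proof from the talk, and it assumes that ED_PRS is the unweighted average over all sample pairs and that μ and Σ are the sample mean and the sample covariance normalized by the number of samples.

```latex
\begin{align*}
ED_{PRS}(o_i, o_j)
  &= \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} \| x_{i,a} - x_{j,b} \|^2 \\
  &= \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j}
     \| (x_{i,a} - \mu_i) - (x_{j,b} - \mu_j) + (\mu_i - \mu_j) \|^2 \\
  &= \| \mu_i - \mu_j \|^2
     + \frac{1}{n_i} \sum_{a=1}^{n_i} \| x_{i,a} - \mu_i \|^2
     + \frac{1}{n_j} \sum_{b=1}^{n_j} \| x_{j,b} - \mu_j \|^2 \\
  &= \| \mu_i - \mu_j \|^2 + \mathrm{trace}(\Sigma_i) + \mathrm{trace}(\Sigma_j)
   = ED_{ASG}(o_i, o_j)
\end{align*}
```

The cross terms drop out in the third line because the deviations from the sample mean sum to zero within each object. Applying the same decomposition within each cluster gives the corresponding identity for PGM.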
Performance Study • Experimental results also show that ASG is: • much more accurate than DM with comparable speed • much faster than GAPS with higher or comparable accuracy • # grid cells = # samples
Experiment 1 • 100 uncertain objects (4 Gaussian pdfs, variances in [1,10])
Experiment 1 • ASG: ~ 0.02ms
Experiment 2 • Data generated in the same way as in Ngai et al., “Efficient clustering of uncertain data”, 2006 IEEE International Conference on Data Mining (ICDM) • A grid of 14 × 14 cells • Probability of each cell randomly generated and normalized • GAPS produces the correct solution, but how close is ASG?
Experiment 2 • ASG: ~ 0.02ms
Experiment 3 • ASG also approximates objects with uniform pdf well • 10 objects with radius in [1,5], randomly located in a 100 × 100 2D space • ASG takes 100 samples, repeated 6 times • Accuracy • Worst case > 0.98 • Average > 0.99
Experiment 4 • Scalability w.r.t. # Dimensions • 2/3/4-D • 256/216/256 samples/cells • ASG • Accuracy: 0.97 – 0.99 • Time: ~ 0.02ms or less
Conclusion • Importance of expected distance calculation in queries and data mining applications on uncertain data • Analytic solutions of special cases (uniform/Gaussian pdf) • ASG can obtain highly accurate results quickly • ASG can replace GAPS used in recent research work