The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries — Yufei Tao (U. Hong Kong), Christos Faloutsos (CMU), Dimitris Papadias (Hong Kong UST)
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Target query types DB = set of m-d points. • Range search (RS) • k nearest neighbor (kNN) • Regional distance (self-) join (RDJ) • e.g., in Louisiana, find all pairs of music stores within 1 mi of each other
Target problem Estimate • Query selectivity • Query (I/O) cost • for any Lp metric • using a single method
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Older query estimation approaches • Vast literature: sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc. • BUT: they target specific cases (mostly range-search selectivity under the L∞ norm), and their extensions to other problems are unclear
Main competitors • Local methods (the estimate depends on the query location) • Representative methods: histograms • Global methods • Provide a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations • Representative methods: fractal and power-law techniques
Rationale and problems of histograms • Partition the data space into a set of buckets and assume (local) uniformity • Problems • the uniformity assumption rarely holds • tricky/slow estimations, for all but the L∞ norm
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Inherent defect of histograms • Density trap – what is the density in the vicinity of q, a point on a line? • window diameter = 10: 10 points / area 100 = 0.1 • window diameter = 100: 100 points / area 10,000 = 0.01 • Q: What is going on? • A: we are asking a silly question – roughly, “what is the area of a line?”
“Density Trap” • Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly behaving Euclidean object! • This ‘trap’ will appear for any non-uniform dataset • Almost ALL real point-sets are non-uniform -> the trap is real
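The arithmetic behind the density trap fits in a few lines (an illustrative sketch, not from the paper; `density_on_line` is a hypothetical helper):

```python
# Illustrative sketch of the "density trap": for points spaced one unit
# apart along a line, the 2-d density count/area depends on the window
# size, so "local density" is not a well-defined quantity here.

def density_on_line(diameter):
    """Count/area inside a diameter x diameter window centered on the line."""
    count = diameter          # roughly one point per unit of line length
    area = diameter ** 2
    return count / area

print(density_on_line(10))    # 0.1
print(density_on_line(100))   # 0.01 -- shrinks as the window grows
```

No single number answers “what is the density near q?”; the answer changes with the window, which is exactly the trap.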
“Density Trap” • In short: ‘density’ (count / area) is meaningless • What should we do instead? • A: look at log(count_of_neighbors) vs log(area)
Local power law • In more detail, the ‘local power law’: nb_p(r) = c_p · r^n_p • nb_p(r): # neighbors of point p, within radius r • c_p: ‘local constant’ • n_p: ‘local exponent’ (= local intrinsic dimensionality)
Local power law • Intuitively: to avoid the ‘density trap’, use • n_p: the local intrinsic dimensionality • instead of density
Does LPL make sense? • For point ‘q’ on the line: LPL gives nb_q(r) = <constant> · r^1 (no need for ‘density’, nor uniformity) • matching the earlier counts: diameter 10 → 10 points, diameter 100 → 100 points
Local power law and Lx • If a point obeys the L.P.L. under L∞, ditto for any other Lx metric, with the same ‘local exponent’ • -> LPL works easily, for ANY Lx metric
Examples • [log-log plot: #neighbors(≤ r) vs radius, for two points p1 and p2] • p1 has a higher ‘local exponent’ = ‘local intrinsic dimensionality’ than p2
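The local constant and exponent can be fitted per point by least squares on the log-log plot of neighbor counts vs radius (a minimal sketch under my own assumptions, not the paper’s code; `fit_lpl`, `linf`, and the example data are hypothetical):

```python
import math

def linf(a, b):
    """L-infinity distance between equal-length tuples."""
    return max(abs(x - y) for x, y in zip(a, b))

def fit_lpl(points, p, radii):
    """Fit the local power law nb_p(r) ~ c_p * r^n_p at point p by
    least squares on log(nb) vs log(r); returns (c_p, n_p)."""
    xs, ys = [], []
    for r in radii:
        nb = sum(1 for q in points if q != p and linf(p, q) <= r)
        if nb > 0:                      # skip radii with no neighbors
            xs.append(math.log(r))
            ys.append(math.log(nb))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), slope   # (c_p, n_p)

# Points one unit apart on a line: local intrinsic dimensionality ~1,
# and ~2 neighbors per unit of radius (c_p ~ 2).
pts = [(float(i), 0.0) for i in range(101)]
c_p, n_p = fit_lpl(pts, pts[50], [5, 10, 20])
print(round(n_p, 3), round(c_p, 3))   # 1.0 2.0
```

A point inside a 2-d cluster would instead come out with a slope near 2, which is the contrast the p1/p2 example illustrates.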
Roadmap • Problem – motivation • Survey • Proposed method – main idea • Proposed method – details • Experiments • Conclusions
Proposed method • Main idea: if we know (or can approximate) the c_p and n_p of every point p, we can solve all the posed problems:
Target Problem • for any Lp metric (Lemma 3.2) • using a single method
Theoretical results • Interesting observation (Thm 3.4): the cost of a kNN query q depends • only on the ‘local exponent’ • and NOT on the ‘local constant’ • nor on the cardinality of the dataset
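One ingredient such an analysis builds on is that the local power law can be inverted: setting nb_p(r) = k yields an estimated distance to the k-th nearest neighbor. This is only a sketch of that inversion step (names mine), not the paper’s Theorem 3.4 — note the distance estimate itself does depend on c_p; the theorem’s cost claim is a separate result:

```python
def knn_distance_estimate(k, c_p, n_p):
    """Invert nb_p(r) = c_p * r^n_p at nb_p(r) = k to estimate the
    distance from p to its k-th nearest neighbor."""
    return (k / c_p) ** (1.0 / n_p)

# A point on a line with ~2 neighbors per unit radius (c_p=2, n_p=1):
print(knn_distance_estimate(10, 2.0, 1.0))   # 5.0
```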
Implementation • Given a query point q, we need its local exponent and constant to perform estimation • but: too expensive to store them for every point • Q: What to do? • A: exploit locality:
Implementation • nearby points: usually have similar local constants and exponents. Thus, one solution: • ‘anchors’: pre-compute the LPLaw for a set of representative points (anchors) – use nearest ‘anchor’ to q
Implementation • choose anchors: with random sampling, density-biased sampling (DBS), or any other method.
Implementation • (In addition to ‘anchors’, we also tried ‘patches’ of near-constant c_p and n_p – similar accuracy, at the cost of a more complicated implementation)
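The anchor scheme described above can be sketched as follows (illustrative only; the L∞ metric choice, the selectivity formula nb/N, and all names are my assumptions, not the paper’s code):

```python
def linf(a, b):
    """L-infinity distance between equal-length tuples."""
    return max(abs(x - y) for x, y in zip(a, b))

def estimate_selectivity(query, r, anchors, N):
    """Estimate range-search selectivity at `query` with radius r:
    evaluate the local power law of the nearest anchor.
    anchors: list of (point, c, n) triples with precomputed LPL
    coefficients; N: dataset cardinality."""
    point, c, n = min(anchors, key=lambda a: linf(query, a[0]))
    return min(1.0, c * r ** n / N)

# Two hypothetical anchors: one in a ~1-d region, one in a ~2-d region.
anchors = [((0.0, 0.0), 2.0, 1.0),
           ((9.0, 9.0), 3.0, 2.0)]
print(estimate_selectivity((0.5, 0.0), 5.0, anchors, 1000))   # 0.01
```

Storing only a few hundred (point, c, n) triples instead of per-point coefficients is the whole point of the locality argument.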
Experiments - Settings • Datasets: • SC: 40k points representing the coast lines of Scandinavia • LB: 53k points corresponding to locations in Long Beach county • Structure: R*-tree • We compare the Power method to: • Minskew (local method) • Global method (fractal)
Experiments - Settings • The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods • Queries: biased (following the data distribution) • A query workload contains 500 queries • We report the average error Σ_i |act_i − est_i| / Σ_i act_i
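The reported workload error is straightforward to compute (a trivial sketch; the function name is mine):

```python
def avg_error(act, est):
    """Workload error: sum_i |act_i - est_i| / sum_i act_i."""
    return sum(abs(a - e) for a, e in zip(act, est)) / sum(act)

print(avg_error([10, 20], [12, 19]))   # 0.1  (= (2 + 1) / 30)
```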
Range search selectivity • the LPL method wins
Regional distance join selectivity • No known global method for this case • The LPL method wins, by a wider margin
Conclusions • We spotted the “density trap” problem of the local uniformity assumption (<- histograms) • we showed how to resolve it, using the ‘local intrinsic dimension’ instead (-> ‘Local Power Law’) • and we solved all the posed problems:
Conclusions – cont’d • for any Lp metric (Lemma 3.2) • using a single method (LPL & ‘anchors’)