Learn about geometric data streams for analyzing spatial structures efficiently. Explore SVM approaches, coresets, and merging techniques for managing massive geometric object sets sequentially. Discover applications in learning, clustering, and more.
Algorithms for geometric data streams Christian Sohler, TU Dortmund
Introduction Data streams • Massive data set arriving sequentially • Different ways of "arriving" Examples • Network traffic • Query logs • … Approach • Find algorithms that process the data sequentially in a single pass (or a few passes)
Introduction Geometric data streams • Massive sets of geometric objects arriving sequentially • Objects are typically points • Different forms of arrival: a sequence of points, or a sequence of updates Questions • Find ways to analyze the geometric structure of the input data using small space
Introduction Motivation • Many computational tasks can be interpreted geometrically • Geometric features may be useful in learning and classification • Geometry plays an important role in the application Examples • Learning • Clustering • How 'clusterable' is a data set? • Road traffic prediction
Introduction A basic learning problem • We have two classes of objects • We are given examples from both classes • Learn from the examples to which class future objects belong • Map each object's description to Euclidean space SVM approach • Compute the maximum-margin hyperplane • Classify points according to the side of the hyperplane they lie on
Introduction SVM and SEB (smallest enclosing balls) • The dual of a certain SVM formulation is SEB [Tax, Duin, Pattern Recognition Letters, '99] • Geometric streaming SEB can be used as an SVM heuristic [Rai, Daume III, Venkatasubramanian, IJCAI'09] • Also: coresets have been used to construct CSVMs [Tsang, Kwok, Cheung, Journal of Machine Learning Research, '05]
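To make the streaming SEB idea concrete, here is a minimal one-pass sketch. It is a simple well-known heuristic (grow the current ball just enough to cover each new point) and is not claimed to be the algorithm of the cited papers; the function names are hypothetical. The result always encloses the stream but is generally not the smallest enclosing ball.

```python
import math

def update_ball(center, radius, p):
    """Return the smallest ball containing both the old ball and point p."""
    dist = math.dist(center, p)
    if dist <= radius:
        return center, radius                  # p is already covered
    new_radius = (radius + dist) / 2.0
    step = (dist - radius) / (2.0 * dist)      # fraction of the way toward p
    new_center = tuple(c + step * (x - c) for c, x in zip(center, p))
    return new_center, new_radius

def streaming_enclosing_ball(stream):
    """One pass over the stream, storing only a center and a radius."""
    it = iter(stream)
    center, radius = tuple(next(it)), 0.0
    for p in it:
        center, radius = update_ball(center, radius, p)
    return center, radius
```

The state is a single center and radius, so the space used is independent of the stream length.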
Introduction Outline • Merge & Reduce • Embeddings into tree metrics • Estimation of distribution of local neighborhoods • Balanced partitions • Approximating properties of balanced partitions
Merge & Reduce Insertion-only streams • Sequence of points $p_1, \dots, p_n$ from $\mathbb{R}^d$
Merge & Reduce Definition [k-median clustering] Given a weighted set $P$ of points in $\mathbb{R}^d$, the k-median problem is to find a set $C \subseteq \mathbb{R}^d$ of $k$ points (centers) such that $\mathrm{cost}(P,C) = \sum_{p \in P} w_p \min_{c \in C} \|p-c\|$ is minimized, where $w_p > 0$ is the weight of point $p$.
Merge & Reduce Coreset [Har-Peled, Mazumdar, STOC'04] A weighted point set $S$ is a $(k,\varepsilon)$-coreset of a weighted point set $P$ if, for every set $C$ of $k$ centers, $|\mathrm{cost}(P,C) - \mathrm{cost}(S,C)| \le \varepsilon \cdot \mathrm{cost}(P,C)$. [Figure: coreset points labeled with weights 3, 3, 3, 3, 4, 3, 4]
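The cost function in this definition can be written out directly. The sketch below only evaluates the cost of a given center set; it does not solve the k-median problem. The function name `kmedian_cost` is an assumption made for illustration.

```python
import math

def kmedian_cost(points, weights, centers):
    """Weighted sum over all points of the distance to the nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))
```

For example, with points (0,0), (4,0), (10,0), weights 1, 2, 1 and centers (0,0) and (10,0), only the middle point contributes to the cost.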
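A small sketch of what the coreset condition asks for: the relative cost error must be small for every center set, not just for some. The helper names are hypothetical, and the code only measures the error for one given center set; it does not construct or certify a coreset.

```python
import math

def cost(points, weights, centers):
    """Weighted k-median cost of a fixed set of centers."""
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def relative_error(P, wP, S, wS, centers):
    """|cost(P,C) - cost(S,C)| / cost(P,C) for one center set C."""
    full = cost(P, wP, centers)
    approx = cost(S, wS, centers)
    if full == 0:
        return 0.0 if approx == 0 else math.inf
    return abs(full - approx) / full

# Two unit-weight points summarized by their midpoint with weight 2.
P, wP = [(0.0,), (1.0,)], [1.0, 1.0]
S, wS = [(0.5,)], [2.0]
far_error = relative_error(P, wP, S, wS, [(10.0,)])   # center far away
near_error = relative_error(P, wP, S, wS, [(0.5,)])   # center between the points
```

The summary is exact for the far center but has relative error 1 for the center at 0.5, which is why the definition quantifies over every set of k centers.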
Merge & Reduce Observation • The union of a $(k,\varepsilon)$-coreset of $P_1$ and a $(k,\varepsilon)$-coreset of $P_2$ is a $(k,\varepsilon)$-coreset of $P_1 \cup P_2$ • Can compute a coreset of a coreset [Figure: coresets built over the input stream; two coresets are replaced by a coreset of their union]
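The two observations combine into the usual merge-and-reduce bookkeeping, sketched below. The class and the toy `reduce_by_grid` reduction are assumptions for illustration: snapping 1D points to a grid is a stand-in, not one of the coreset constructions from the talk. Any function that maps a weighted point list to a smaller weighted point list can be plugged in.

```python
from collections import defaultdict

def reduce_by_grid(weighted_points, cell):
    """Toy reduction (assumption): snap 1D points to a grid, summing weights."""
    buckets = defaultdict(float)
    for x, w in weighted_points:
        buckets[round(x / cell) * cell] += w
    return sorted(buckets.items())

class MergeReduce:
    def __init__(self, block_size, reduce_fn):
        self.block_size = block_size
        self.reduce_fn = reduce_fn
        self.buffer = []     # raw points of the current, incomplete block
        self.levels = {}     # level j -> summary covering 2**j blocks

    def insert(self, x, w=1.0):
        self.buffer.append((x, w))
        if len(self.buffer) < self.block_size:
            return
        summary, level = self.reduce_fn(self.buffer), 0
        self.buffer = []
        # Merge equal-level summaries like carries in a binary counter.
        while level in self.levels:
            summary = self.reduce_fn(self.levels.pop(level) + summary)
            level += 1
        self.levels[level] = summary

    def summary(self):
        """Union of all stored summaries plus the unreduced buffer."""
        out = list(self.buffer)
        for s in self.levels.values():
            out.extend(s)
        return out
```

At any time there is at most one summary per level, so about log2(n / block_size) summaries are stored. Each reduction step adds its own error, so in a real construction the per-level error parameter has to be chosen smaller than the target error.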
Merge & Reduce Coresets by pre-clustering [Guha, Mishra, Motwani, O'Callaghan, FOCS'00; Har-Peled, Mazumdar, STOC'04; Frahling, S., STOC'05] • Compute a pre-clustering $S$ with more than $k$ centers and $\mathrm{cost}(P,S) \le \varepsilon \cdot \mathrm{Opt}_k$ • Size exponential in $d$
Merge & Reduce Coresets by sampling [Chen, SICOMP'09; Feldman, Monemizadeh, S., SoCG'07] • Compute a random non-uniform sample • Show that the sample approximates all solutions from a net • Size polynomial in $d$
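The non-uniform sampling step can be sketched generically as importance sampling with inverse-probability weights. This is an assumption-laden illustration, not the construction of the cited papers: how the sampling scores are chosen is exactly what those papers contribute, and here they are simply an input.

```python
import random

def importance_sample(points, scores, m, rng):
    """Draw m points with probability proportional to scores.

    Each sampled point gets weight 1 / (m * probability), so the weighted
    sample cost is an unbiased estimate of the full cost for any fixed
    set of centers.
    """
    total = sum(scores)
    idx = rng.choices(range(len(points)), weights=scores, k=m)
    return [(points[i], total / (m * scores[i])) for i in idx]

rng = random.Random(0)
points = [(float(i),) for i in range(10)]
sample = importance_sample(points, [1.0] * 10, 5, rng)
```

With uniform scores every sampled point receives weight n/m, so the total sample weight equals the number of input points.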
Merge & Reduce Coresets by reduction to 1D [Har-Peled, Kushal, DCG'07; Feldman, Fiat, Sharir, FOCS'06] • Uses geometric arguments to solve 1D • Combine with pre-clustering using line centers • For k-median: size independent of $n$ (but exponential in $d$)
Merge & Reduce Open problems • Coresets for k-median of size independent of $n$ and $d$? (Partial result in [Feldman, Monemizadeh, S., SoCG'07]) • Coresets for k-median of size $O(d/\varepsilon^2)$ • Coresets for k-median of size $\mathrm{poly}(d, \log n)/\varepsilon^{2-c}$ for constant $c = c(d) > 0$ • Coresets for j-subspace 1-median of size $\mathrm{poly}(\varepsilon, d, j, \log n)$? • Same questions for the k-means objective function Remark: Open questions refer to the definition of coresets from this talk.
Geometric update streams Insertion/deletion model • Stream consists of Insert(p), Delete(p) operations • Points are from $\{1,\dots,D\}^d$ • Stream is consistent, i.e., there is no Delete(p) if p is not present and no Insert(p) if p is already present in the current set
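The consistency condition can be stated as code. The sketch below stores the current set explicitly, which a streaming algorithm cannot afford; it only pins down what a valid update stream looks like. The class name and the parameter name `grid_size` (standing for D) are hypothetical.

```python
class ConsistentPointSet:
    """Reference model of the insertion/deletion stream (not small space)."""

    def __init__(self, grid_size, dim):
        self.grid_size = grid_size
        self.dim = dim
        self.current = set()

    def _check(self, p):
        if len(p) != self.dim or not all(1 <= x <= self.grid_size for x in p):
            raise ValueError(f"{p} is not in {{1..{self.grid_size}}}^{self.dim}")

    def insert(self, p):
        self._check(p)
        if p in self.current:
            raise ValueError(f"inconsistent stream: {p} is already present")
        self.current.add(p)

    def delete(self, p):
        self._check(p)
        if p not in self.current:
            raise ValueError(f"inconsistent stream: {p} is not present")
        self.current.remove(p)
```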
Streaming algorithms via embeddings into tree metrics Embeddings in tree metrics [Figure: points p, q, r, s, t in nested grids and the corresponding tree over grid cells, with edges of length $2^i$, $2^{i-1}$, $2^{i-2}$ on successive levels] Tree metric $D(\cdot,\cdot)$ • $\|p-q\| \le D(p,q)$ • $E[D(p,q)] = O(\log D) \cdot \|p-q\|$ [Bartal, FOCS'96; Charikar, Chekuri, Goel, Guha, Plotkin, FOCS'98]
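A minimal sketch of a tree metric built from nested, randomly shifted grids. Several details are assumptions not taken from the slides: integer points in {1,…,D}^d, an integer shift drawn from {0,…,D-1}^d, and edge lengths scaled by sqrt(d) so that the tree distance provably dominates the Euclidean distance. The slides label edges with powers of two only.

```python
import math
import random

def cell(p, shift, level):
    """Index of the grid cell with side 2**level that contains p."""
    return tuple((x + s) // 2 ** level for x, s in zip(p, shift))

def tree_distance(p, q, shift, top_level):
    """Length of the tree path between the level-0 cells of p and q."""
    d = len(p)
    for j in range(top_level + 1):
        if cell(p, shift, j) == cell(q, shift, j):
            # The path climbs j edges from each leaf; the edge leaving
            # level i has length sqrt(d) * 2**(i+1) (assumed weighting).
            return 2 * sum(math.sqrt(d) * 2 ** (i + 1) for i in range(j))
    raise ValueError("top-level cell does not contain both points")

grid_size, dim = 64, 2                 # stands for D and d
top_level = 7                          # 2**7 >= 2 * grid_size
rng = random.Random(1)
shift = tuple(rng.randrange(grid_size) for _ in range(dim))
```

If p and q first share a cell at level j, their Euclidean distance is at most sqrt(d) * 2**j, while the path length is at least that, which gives the first property. The expected stretch comes from the random shift and is not verified by this sketch.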
Streaming algorithms via embeddings into tree metrics Estimator for cost of Euclidean minimum spanning tree (EMST) [Indyk, STOC'04] • Write EMST for the cost of the EMST • Write $\mathrm{MST}_D$ for the cost of a minimum spanning tree in the tree metric $D$ • $E[\mathrm{MST}_D] = O(\log D) \cdot \mathrm{EMST}$ (linearity of expectation) • Use $\mathrm{MST}_D$ as the estimator
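The expectation bound follows from the two properties of the tree metric stated earlier. A short derivation, where $T^*$ (notation introduced here) denotes the edge set of a Euclidean minimum spanning tree:

```latex
\begin{align*}
\mathrm{MST}_D &\le \sum_{(p,q) \in T^*} D(p,q)
  && \text{($T^*$ is a spanning tree, hence a candidate under $D$)} \\
E[\mathrm{MST}_D] &\le \sum_{(p,q) \in T^*} E[D(p,q)]
  && \text{(linearity of expectation)} \\
&= O(\log D) \sum_{(p,q) \in T^*} \|p-q\| = O(\log D) \cdot \mathrm{EMST} \\
\mathrm{MST}_D &\ge \mathrm{EMST}
  && \text{(since $\|p-q\| \le D(p,q)$ for every pair)}
\end{align*}
```

The last line shows that the estimator never underestimates, and the expectation bound shows that it overestimates by at most an $O(\log D)$ factor on average.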
Streaming algorithms via embeddings into tree metrics Observation [Indyk, STOC'04] • The MST of $D(\cdot,\cdot)$ is given by the tree defining the tree metric • #edges of length $2^i$ = #non-empty cells in the corresponding grid
Streaming algorithms via embeddings into tree metrics Euclidean minimum spanning tree 1. Use $O(\log D)$ nested grids $G(i)$ with side length $2^i$ 2. For each grid, approximate $|G(i)|$ := #nonempty cells in $G(i)$ using an $F_0$ sketch 3. Return $\sum_i 2^i \, |G(i)|$ Theorem [Indyk, STOC'04] The above algorithm computes an $O(\log D)$-approximation to the cost of the minimum spanning tree.
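The estimator can be sketched as follows. One loud assumption: exact per-cell counters stand in for the $F_0$ (distinct elements) sketch, so this version supports insertions and deletions but does not use small space. The class name and the fixed `shift` argument are hypothetical.

```python
from collections import Counter

class GridCellCounter:
    """Sum over levels i of 2**i times the number of nonempty cells in G(i)."""

    def __init__(self, levels, shift):
        self.shift = shift
        self.cells = [Counter() for _ in range(levels)]   # one counter per grid

    def _update(self, p, delta):
        for i, counter in enumerate(self.cells):
            key = tuple((x + s) // 2 ** i for x, s in zip(p, self.shift))
            counter[key] += delta
            if counter[key] == 0:
                del counter[key]          # the cell became empty

    def insert(self, p):
        self._update(p, +1)

    def delete(self, p):
        self._update(p, -1)

    def estimate(self):
        return sum(2 ** i * len(c) for i, c in enumerate(self.cells))
```

A usage example: create `GridCellCounter(levels=4, shift=(0, 0))`, feed it a consistent stream of `insert` and `delete` calls on integer points, and read `estimate()` at the end. Replacing each `Counter` by a distinct-elements sketch that tolerates deletions is what brings the space down in the streaming setting.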
Streaming algorithms via embeddings into tree metrics Results using a similar approach [Indyk, STOC'04] Problem | Approx. factor
Streaming algorithms via estimating the distribution of local neighborhoods Distribution of neighborhoods • Grids $G(i)$ as before • R-neighborhood of a cell $C$: cells within distance at most $R$ from $C$ • $m_{C,R}(i)$ is the number of points in the $i$-th cell of the R-neighborhood of $C$ [Figure: a cell and its 2-neighborhood]
Streaming algorithms via estimating the distribution of local neighborhoods EMST estimator • Define $Z_{C,R}(i) = (m_{C,R}(i) > 0)$ • EMST can be approximated from the $Z_{C,R}(i)$ • Approx. ratio goes to 1 as $R$ goes to $\infty$
Streaming algorithms via estimating the distribution of local neighborhoods EMST estimator • $K$: size of the R-neighborhood • The $Z_{C,R}$ are functions from $\{1,\dots,K\}$ to $\{0,1\}$ • A random (nonempty) cell $C$ defines a distribution over neighborhoods, i.e., over functions $Z: \{1,\dots,K\} \to \{0,1\}$ • Can still estimate EMST from this distribution
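To illustrate the occupancy indicator, the sketch below computes the point counts and the 0/1 pattern of a cell's R-neighborhood. It assumes, without support from the slides, that distance is measured in cells along each axis (so the neighborhood is a (2R+1)^d block) and that cells are enumerated in lexicographic offset order. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

def neighborhood_counts(points, side, center_cell, R):
    """Point counts for every cell of the R-neighborhood of center_cell."""
    counts = Counter(tuple(x // side for x in p) for p in points)
    offsets = product(range(-R, R + 1), repeat=len(center_cell))
    return [counts[tuple(c + o for c, o in zip(center_cell, off))]
            for off in offsets]

def occupancy(m):
    """Indicator vector: 1 where the cell is nonempty, 0 otherwise."""
    return [1 if v > 0 else 0 for v in m]

points = [(0, 0), (1, 1), (4, 0)]
m1 = neighborhood_counts(points, side=2, center_cell=(0, 0), R=1)
z1 = occupancy(m1)
z2 = occupancy(neighborhood_counts(points, side=2, center_cell=(0, 0), R=2))
```

Here the 1-neighborhood has K = 9 cells and only the center is occupied; the 2-neighborhood has K = 25 cells and also sees the point in cell (2, 0).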
Streaming algorithms via estimating the distribution of local neighborhoods Algorithm • Sample a certain number of nonempty grid cells and maintain the number of points for each cell in their neighborhood • The sample gives an estimate of the distribution of the $Z_{C,R}(\cdot)$ • Obtain an estimate for EMST from the estimated distribution Theorem [Frahling, Indyk, S., IJCGA'07] Let $\varepsilon > 0$ and $d$ be constants. The cost of a Euclidean minimum spanning tree of a point set in $\mathbb{R}^d$ given as an update stream can be estimated within a factor of $1 \pm \varepsilon$ using polylog(D) space.
Streaming algorithms via estimating the distribution of local neighborhoods Open Problems • $(1+\varepsilon)$-approximation for matching and/or earth mover's distance • Other problems? The approach is not very well understood • General characterization of problems solvable via approximation of the distribution of local neighborhoods
Streaming algorithms via balanced partitions Estimating the distribution [Frahling, S., STOC'05] • Divide space into regions • For each region maintain #points inside • Balance "error" among regions • Notion of error depends on problem Example • 1-Median in 1D • Error per cell: at most cell width × #points in cell
Streaming algorithms via balanced partitions Small space? • Problem dependent • Need to show that a decomposition into few regions with sufficiently small error exists
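For the 1-median example, the sketch below shows why a cell's error is controlled by its width times its point count. The partition is fixed here, and representing each cell by its midpoint is an assumption for illustration; how to choose and rebalance the regions is the actual content of the cited work and is not reproduced.

```python
import bisect

def cell_counts(points, boundaries):
    """Count points per cell [boundaries[j], boundaries[j+1])."""
    counts = [0] * (len(boundaries) - 1)
    for x in points:
        counts[bisect.bisect_right(boundaries, x) - 1] += 1
    return counts

def true_cost(points, c):
    """Exact 1-median cost of center c."""
    return sum(abs(x - c) for x in points)

def approx_cost(boundaries, counts, c):
    """Cost estimate that places every point at the midpoint of its cell."""
    return sum(n * abs((lo + hi) / 2 - c)
               for lo, hi, n in zip(boundaries, boundaries[1:], counts))

def error_bound(boundaries, counts):
    """Sum over cells of cell width times number of points in the cell."""
    return sum(n * (hi - lo)
               for lo, hi, n in zip(boundaries, boundaries[1:], counts))

points = [0.5, 1.0, 1.5, 7.0, 9.0]
boundaries = [0, 2, 4, 8, 16]
counts = cell_counts(points, boundaries)
```

Moving a point inside its cell changes its distance to any center by at most the cell width, so the estimate differs from the exact cost by at most `error_bound`, whatever the center. Narrow cells can therefore hold many points and wide cells only a few, which is the balancing idea.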