Efficient classification for metric data
Lee-Ad Gottlieb (Hebrew U.), Aryeh Kontorovich (Ben-Gurion U.), Robert Krauthgamer (Weizmann Institute)
Classification problem
• A fundamental problem in learning: a point space X and a probability distribution P on X × {-1,1}
• The learner observes a sample S of n points (x,y) drawn i.i.d. from P, and wants to predict the labels of other points in X
• It produces a hypothesis h: X → {-1,1} with empirical error (the fraction of sample points it mislabels) and true error P{(x,y) : h(x) ≠ y}
• Goal: the true error should be close to the empirical error, uniformly over h, in probability
Generalization bounds
• How do we upper bound the true error? Use a generalization bound. Roughly speaking (and w.h.p.): true error ≤ empirical error + (complexity of h)/n
• More complex classifier ↔ "easier" to fit to arbitrary data
• VC-dimension: the size of the largest point set that can be shattered by the hypothesis class
A standard form of such a bound is sketched below.
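For orientation, here is a classical VC bound in the realizable case, in the same complexity-over-n shape as the rough statement above. This is a textbook form, not the specific bound used later in this talk: if h is consistent with the sample and is drawn from a hypothesis class of VC-dimension d, then with probability at least 1 − δ,

```latex
\operatorname{err}(h) \;\le\; O\!\left(\frac{d\,\log(n/d) + \log(1/\delta)}{n}\right)
```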
Popular approach for classification
• Assume the points lie in Euclidean space!
• Pros: existence of an inner product; efficient algorithms (SVM); good generalization bounds (max margin)
• Cons: many natural settings are non-Euclidean; Euclidean structure is a strong assumption
• Recent popular focus: metric space data
Metric space
• (X,d) is a metric space if X is a set of points and d(·,·) is a distance function that is nonnegative, symmetric, and satisfies the triangle inequality
• Inner product → norm, and norm → metric, but the reverse implications do not hold
(Figure: map with Haifa, Tel Aviv and Be'er Sheva and pairwise distances 95 km, 113 km and 208 km, illustrating the triangle inequality.)
Classification for metric data?
• Advantage: often much more natural, and a much weaker assumption
• Examples: strings; images (earthmover distance)
• Problem: no vector representation, so no notion of dot product (and no kernel)
• What to do?
• Invent a kernel (e.g. embed into Euclidean space)? Possibly high distortion!
• Use some nearest-neighbor heuristic? The NN classifier has infinite VC-dimension!
Preliminaries: Lipschitz constant
• The Lipschitz constant L of a function f: X → ℝ measures its smoothness
• It is the smallest value L satisfying |f(x_i) − f(x_j)| ≤ L · d(x_i, x_j) for all points x_i, x_j in X; denoted ‖f‖_Lip
• Suppose hypothesis h: S → {-1,1} is consistent with the sample S
• The Lipschitz constant of h is determined by the closest pair of differently labeled points: equivalently, ‖h‖_Lip ≥ 2/d(S+, S−)
(A small sketch computing this quantity follows below.)
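As a concrete illustration (not code from the paper), a minimal sketch that computes the Lipschitz constant of a labeled sample by brute force over pairs; for a {-1,+1}-labeled consistent sample this is exactly 2/d(S+, S−). The point set and the `lipschitz_constant` helper are hypothetical examples.

```python
import numpy as np

def lipschitz_constant(points, values, metric):
    """Smallest L with |f(x_i) - f(x_j)| <= L * d(x_i, x_j) over all sample pairs."""
    L = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = metric(points[i], points[j])
            if d > 0:
                L = max(L, abs(values[i] - values[j]) / d)
    return L

euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
X = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]   # toy sample
y = [-1, -1, +1]
# For labels in {-1,+1} this equals 2 / d(S+, S-) = 2 / 2 = 1.0
print(lipschitz_constant(X, y, euclidean))
```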
Preliminaries: Lipschitz extension
• Lipschitz extension: a classic problem in analysis. Given a function f: S → ℝ for S ⊆ X, extend f to all of X without increasing the Lipschitz constant
• Example: points on the real line, f(1) = 1, f(−1) = −1
(figure credit: A. Oberman)
Classification for metric data
• A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04)
• Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions
• Estimation of h on X: the problem of evaluating h for new points in X reduces to finding a Lipschitz function consistent with h (the Lipschitz extension problem)
• For example, f(x) = min_i [f(x_i) + 2·d(x, x_i)/d(S+, S−)] over all (x_i, y_i) in S (a sketch of this evaluation follows below)
• Evaluation of h reduces to exact nearest neighbor search
• Strong theoretical motivation for the NNS classification heuristic
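A minimal sketch of evaluating the extension above at a new point, following the formula on this slide (an illustrative implementation, not the authors' code; the `lipschitz_extension` name and the toy points are made up here).

```python
import numpy as np

def lipschitz_extension(query, sample_points, labels, metric):
    """f(x) = min_i [ y_i + 2 * d(x, x_i) / d(S+, S-) ]; classify by sign(f(x))."""
    pos = [p for p, y in zip(sample_points, labels) if y > 0]
    neg = [p for p, y in zip(sample_points, labels) if y < 0]
    margin = min(metric(p, q) for p in pos for q in neg)      # d(S+, S-)
    return min(y + 2.0 * metric(query, p) / margin
               for p, y in zip(sample_points, labels))

euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
X = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
y = [-1, -1, +1]
print(np.sign(lipschitz_extension((2.8, 0.0), X, y, euclidean)))  # +1: query is near the positive point
```

Note that the minimization over same-labeled points is always achieved by the nearest one, which is why evaluating the extension reduces to nearest neighbor search.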
Two new directions
• The framework of [vLB '04] leaves open two further questions:
• Constructing h: handling noise (bias-variance tradeoff). Which sample points in S should h ignore?
• Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?
Doubling dimension
• Definition: ball B(x,r) = all points within distance r from x
• The doubling constant λ (of a metric M) is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius
• First used by [Assouad '83], algorithmically by [Clarkson '97]
• The doubling dimension is ddim(M) = log₂ λ(M); a metric is doubling if its doubling dimension is constant
• Euclidean: ddim(ℝ^d) = O(d)
• Packing property of doubling spaces: a set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points
(In the figure, λ ≥ 7. A rough numerical sketch of the doubling constant follows below.)
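To make the definition concrete, a rough numerical sketch (purely illustrative, with made-up helper names) that estimates the doubling constant of a finite Euclidean point set by greedily covering sampled balls with half-radius balls centered at data points; a greedy cover only upper-bounds the optimal one, so this is a heuristic, not an exact computation.

```python
import numpy as np

def doubling_constant_estimate(points, trials=200, seed=0):
    """Sample balls B(x, r) and greedily cover each with balls of radius r/2
    centered at data points; return the largest number of half-radius balls used."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    worst = 1
    for _ in range(trials):
        i = rng.integers(len(pts))
        r = rng.uniform(0.1, 1.0) * dist[i].max()
        uncovered = set(np.flatnonzero(dist[i] <= r).tolist())   # points of B(x, r)
        count = 0
        while uncovered:                     # greedy cover by half-radius balls
            c = uncovered.pop()
            uncovered -= set(np.flatnonzero(dist[c] <= r / 2.0).tolist())
            count += 1
        worst = max(worst, count)
    return worst

grid = [(x, y) for x in range(8) for y in range(8)]   # a 2-D grid
print(doubling_constant_estimate(grid), "-> ddim estimate is log2 of this")
```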
Applications of doubling dimension
• Major application to databases: recall that exact NNS requires Θ(n) time in an arbitrary metric space, but there exists a linear-size structure that supports approximate nearest neighbor search in time 2^O(ddim) log n
• Database/network structures and tasks analyzed via the doubling dimension:
• Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
• Image recognition (vision) [KG --]
• Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
• Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
• Clustering [Tal '04, ABS '08, FM '10]
• Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
• Further applications: travelling salesperson [Tal '04], embeddings [Ass '84, ABN '08, BRS '07, GK '11], machine learning [BLL '09, KKL '10, KKL --]
• Note: the above algorithms can be extended to nearly-doubling spaces [GK '10]
• Message: this is an active line of research…
Our dual use of doubling dimension
• Interestingly, the doubling dimension contributes in two different areas:
• Statistical (function complexity): we bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h
• Computational: efficient approximate NNS
Statistical contribution
• We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension
• vLB provided similar bounds using covering numbers and Rademacher averages
• Fat-shattering analysis: if L-Lipschitz functions shatter a set, then its inter-point distance is at least 2/L
• Packing property → the set has at most (diam·L)^O(ddim) points
• This is the fat-shattering dimension of the classifier on the space, and is a good measure of its complexity
Statistical contribution
• [BST '99]: For any f that classifies a sample of size n correctly, we have with probability at least 1−δ:
P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log₂(578n) + log(4/δ))
• Likewise, if f is correct on all but k examples, we have with probability at least 1−δ:
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^{1/2}
• In both cases, d is bounded by the fat-shattering dimension, d ≤ (diam·L)^ddim + 1 (a small numerical sketch follows below)
• Done with the statistical contribution… on to the computational contribution
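A minimal numerical sketch of the second (k-error) bound above, with the fat-shattering dimension plugged in as d ≤ (diam·L)^ddim + 1; the helper name and the sample parameters are made up for illustration.

```python
import math

def generalization_bound(n, k, diam, L, ddim, delta=0.05):
    """Agnostic form of the [BST '99] bound quoted above:
    true error <= k/n + sqrt( (2/n) * (d*ln(34*e*n/d)*log2(578*n) + ln(4/delta)) ),
    with the fat-shattering dimension bounded as d <= (diam*L)^ddim + 1."""
    d = (diam * L) ** ddim + 1
    slack = math.sqrt((2.0 / n) *
                      (d * math.log(34 * math.e * n / d) * math.log2(578 * n)
                       + math.log(4.0 / delta)))
    return k / n + slack

# Smoother classifiers (smaller L) give a smaller complexity term:
for L in (0.5, 1.0, 2.0):
    print(L, round(generalization_bound(n=10_000, k=100, diam=1.0, L=L, ddim=3), 3))
```

This is the knob behind the bias-variance tradeoff discussed below: decreasing L shrinks the complexity term but typically forces more sample errors k.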
Computational contribution
• Evaluation of h for new points in X: the Lipschitz extension function f(x) = min_i [y_i + 2·d(x, x_i)/d(S+, S−)] requires exact nearest neighbor search, which can be expensive!
• New tool: (1+ε)-approximate nearest neighbor search, in 2^O(ddim) log n + ε^−O(ddim) time [KL '04, HM '06, BKL '06, CG '06]
• If we evaluate f(x) using an approximate NNS, we can show that the result agrees with (the sign of) at least one of g(x) = (1+ε)·f(x) + ε and e(x) = (1+ε)·f(x) − ε
• Note that g(x) ≥ f(x) ≥ e(x)
• g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the approximate evaluation, generalize well
(A numeric sketch of the sandwich follows below.)
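A small numeric check of the sandwich claim, under the ε-reconstruction above (the dropped symbols are read as ε): the approximate nearest neighbor search is simulated by inflating exact distances by a factor of at most (1+ε), and the resulting value always lies between f(x) and g(x). This is an illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def extension_value(d_pos, d_neg, L):
    # f(x) = min( +1 + L*d(x, S+),  -1 + L*d(x, S-) )
    return min(1.0 + L * d_pos, -1.0 + L * d_neg)

S_pos = [(3.0, 0.0), (4.0, 1.0)]
S_neg = [(0.0, 0.0), (1.0, 0.0)]
L = 2.0 / min(euclid(p, q) for p in S_pos for q in S_neg)   # 2 / d(S+, S-)
eps = 0.1

for _ in range(1000):
    x = rng.uniform(-1.0, 5.0, size=2)
    d_pos = min(euclid(x, p) for p in S_pos)
    d_neg = min(euclid(x, q) for q in S_neg)
    f = extension_value(d_pos, d_neg, L)
    # Simulate a (1+eps)-approximate NNS: each distance may be overestimated
    # by a factor of at most (1+eps).
    f_tilde = extension_value(d_pos * rng.uniform(1.0, 1.0 + eps),
                              d_neg * rng.uniform(1.0, 1.0 + eps), L)
    g = (1.0 + eps) * f + eps          # the upper sandwich function from the slide
    assert f - 1e-9 <= f_tilde <= g + 1e-9
print("approximate evaluation stays within [f(x), g(x)]")
```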
Final problem: bias-variance tradeoff
• Which sample points in S should h ignore?
• If f is correct on all but k examples, we have with probability at least 1−δ:
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^{1/2}
where d ≤ (diam·L)^ddim + 1
Structural Risk Minimization
• Algorithm: fix a target Lipschitz constant L (O(n²) possibilities)
• Locate all pairs of points from S+ and S− whose distance is less than 2/L: at least one point of each such pair has to be counted as an error
• Goal: remove as few points as possible
• This is minimum vertex cover: NP-complete in general, but it admits a 2-approximation in O(E) time
• Here it is minimum vertex cover on a bipartite graph: equivalent to maximum matching (König's theorem), which admits an exact solution in O(n^2.376) randomized time [MS '04]
(A small matching-based sketch follows below.)
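A minimal sketch of the exact step via bipartite matching (using Kuhn's augmenting-path algorithm rather than the O(n^2.376) algebraic algorithm cited above); by König's theorem the maximum matching size equals the minimum number of points that must be discarded. The function name and the toy points are made up.

```python
import numpy as np

def min_points_to_remove(pos, neg, metric, L):
    """Minimum number of sample points to discard so that no remaining
    oppositely-labeled pair is closer than 2/L.  The conflicting pairs form
    a bipartite graph; by Konig's theorem, min vertex cover = max matching,
    computed here with Kuhn's augmenting-path algorithm."""
    conflicts = [[j for j, q in enumerate(neg) if metric(p, q) < 2.0 / L]
                 for p in pos]
    match_of_neg = {}                          # neg index -> matched pos index

    def try_augment(i, seen):
        for j in conflicts[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_of_neg or try_augment(match_of_neg[j], seen):
                match_of_neg[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(pos)))

euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
pos = [(0.0, 0.0), (2.0, 0.0)]
neg = [(0.5, 0.0), (5.0, 0.0)]
# Both positive points conflict only with the negative point at (0.5, 0),
# so removing that single point suffices.
print(min_points_to_remove(pos, neg, euclid, L=1.0))  # -> 1
```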
Efficient SRM
• Algorithm: for each of the O(n²) values of L, run the matching algorithm to find the minimum error, then evaluate the generalization bound for this value of L
• O(n^4.376) randomized time in total
• Better algorithm: binary search over the O(n²) values of L; for each value, run the greedy 2-approximation and evaluate the approximate generalization bound
• This yields an approximate minimum error in O(n² log n) time
(A sketch of this selection loop follows below.)
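A sketch of the overall selection loop, reusing generalization_bound from the earlier sketch; for simplicity it scans all candidate values of L rather than binary-searching, and uses the greedy 2-approximation mentioned above. All names and parameters here are illustrative.

```python
import numpy as np

def greedy_conflict_cover(pos, neg, metric, L):
    """Greedy 2-approximation to minimum vertex cover: repeatedly take a
    conflicting pair (distance < 2/L) and discard both of its endpoints."""
    removed_pos, removed_neg = set(), set()
    for i, p in enumerate(pos):
        if i in removed_pos:
            continue
        for j, q in enumerate(neg):
            if j in removed_neg:
                continue
            if metric(p, q) < 2.0 / L:
                removed_pos.add(i)
                removed_neg.add(j)
                break
    return len(removed_pos) + len(removed_neg)

def select_lipschitz_constant(pos, neg, metric, n, diam, ddim, delta=0.05):
    """For each candidate L (one per oppositely-labeled pair), approximate the
    number of errors k and evaluate the generalization bound; return the best.
    (generalization_bound is the helper defined in the earlier sketch.)"""
    candidates = sorted({2.0 / metric(p, q) for p in pos for q in neg})
    best = None
    for L in candidates:
        if (diam * L) ** ddim + 1 > n:   # skip values where the bound is vacuous
            continue
        k = greedy_conflict_cover(pos, neg, metric, L)
        bound = generalization_bound(n=n, k=k, diam=diam, L=L, ddim=ddim, delta=delta)
        if best is None or bound < best[0]:
            best = (bound, L, k)
    return best   # (bound value, chosen L, number of ignored points)
```

The slide's binary search over the sorted candidate list replaces the full scan here, bringing the total cost down to O(n² log n).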
Conclusion
• Results:
• Generalization bounds for Lipschitz classifiers in doubling spaces
• Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
• Efficient structural risk minimization
• Continuing research: continuous labels
• Risk bound via the doubling dimension
• Classifier h determined via an LP
• Faster LP: low-hop low-stretch spanners [GR '08a, GR '08b] → fewer constraints, and each variable appears in a bounded number of constraints
Application: earthmover distance
(Figure: two point sets S and T.)