Geometry-based Sampling in Learning and Classification

This research outlines the use of geometry-based sampling techniques, such as ε-nets and ε-approximators, in learning and classification. It also discusses the failure of naive sampling approaches and the importance of small-variance estimators.

aandress

Presentation Transcript


  1. Geometry-based sampling in learning and classification. Or, Universal $\varepsilon$-approximators for integration of some nonnegative functions. Leonard J. Schulman (Caltech). Joint with Michael Langberg (Open U. Israel). Work in progress.

  2. Outline. Vapnik-Chervonenkis (VC) method / PAC learning; $\varepsilon$-nets, $\varepsilon$-approximators. Shatter function as cover code. $\varepsilon$-approximators (core-sets) for clustering; universal approximation of integrals of families of unbounded nonnegative functions. Failure of naive sampling approach. Small-variance estimators. Sensitivity and total sensitivity. Some results on sensitivity; MIN operator on families. Sensitivity for k-medians. Covering code for k-median.

  5. PAC Learning. Integrating {0,1}-valued functions. If $F$ is a family of “concepts” (functions $f: X \to \{0,1\}$) of finite VC dimension $d$, then for every input distribution $\mu$ there exists a distribution $\nu$ with support of size $O((d/\varepsilon^2) \log(1/\varepsilon))$, s.t. for every $f \in F$, $|\int f\,d\mu - \int f\,d\nu| < \varepsilon$. Method: Create $\nu$ by repeated independent sampling $x_1, \ldots, x_m$ from $\mu$. This creates an estimator $T = (1/m) \sum_{i=1}^{m} f(x_i)$ which w.h.p. $(1+\varepsilon)$-approximates the integral of any specific $f$. Example: $F$ = characteristic functions of intervals $[a,b]$ on $\mathbb{R}$. These are {0,1}-valued functions with VC dimension 2.
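The sampling method on this slide can be sketched in a few lines. This is a minimal illustration, not the speaker's code: it assumes $\mu$ is the uniform distribution on $[0,1]$, and the helper names `interval_indicator` and `empirical_estimate` are hypothetical.

```python
import random

def interval_indicator(a, b):
    """Return the {0,1}-valued concept f(x) = 1 if a <= x <= b, else 0."""
    return lambda x: 1 if a <= x <= b else 0

def empirical_estimate(f, samples):
    """T = (1/m) * sum_i f(x_i): the integral of f against the uniform
    distribution nu supported on the sampled points."""
    return sum(f(x) for x in samples) / len(samples)

rng = random.Random(0)                       # fixed seed for repeatability
m = 20000
samples = [rng.random() for _ in range(m)]   # assumed mu: uniform on [0, 1]
f = interval_indicator(0.2, 0.5)             # integral of f against mu is 0.3
estimate = empirical_estimate(f, samples)
```

For a single fixed interval the estimate concentrates around the true value; the following slides address why this holds for all intervals simultaneously.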

  6. PAC Learning. Integrating {0,1}-valued functions (cont.). It is easy to see that for any particular $f \in F$, w.h.p. $|\int f\,d\mu - \int f\,d\nu| < \varepsilon$. But how do we argue this is simultaneously achieved for all the functions $f \in F$? Can’t take a union bound over infinitely many “bad events”. Need to express that there are few “types” of bad events. To conquer the infinite union bound, apply the “Red & Green Points” argument. Sample $m = O((d/\varepsilon^2) \log(1/\varepsilon))$ “green” points $G$ from $\mu$. Will use $\nu = \nu_G$ = uniform dist. on $G$, and write $\mu(f) = \int f\,d\mu$, $\nu_G(f) = \int f\,d\nu_G$. Then $P(G \text{ not an } \varepsilon\text{-approximator}) = P(\exists f \in F \text{ badly counted by } G) \le P(\exists f \in F: |\mu(f) - \nu_G(f)| > \varepsilon)$. Suppose $G$ is not an $\varepsilon$-approximator: $\exists f: |\mu(f) - \nu_G(f)| > \varepsilon$. Sample another $m = O((d/\varepsilon^2) \log(1/\varepsilon))$ “red” points $R$ from $\mu$. With probability > ½, $|\mu(f) - \nu_R(f)| < \varepsilon/2$ (Markov ineq.). So: $P(\exists f \in F: |\mu(f) - \nu_G(f)| > \varepsilon) < 2\,P(\exists f \in F: |\nu_R(f) - \nu_G(f)| > \varepsilon/2)$.

  7. PAC Learning. Integrating {0,1}-valued functions (cont.). $P(\exists f \in F: |\mu(f) - \nu_G(f)| > \varepsilon) < 2\,P(\exists f \in F: |\nu_R(f) - \nu_G(f)| > \varepsilon/2)$. $\mu$ has vanished from our expression! Our failure event depends only on the restriction of $f$ to $R \cup G$. Key: for finite-VC-dimension $F$, every $f \in F$ is identical on $R \cup G$ to one of a small (much less than $2^m$) set of functions. These are a “covering code” for $F$ on $R \cup G$. Cardinality $\Phi(m) = m^d \ll 2^m$.
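The "few types" claim can be checked directly for the interval family from the earlier example. The sketch below enumerates the distinct restrictions of interval indicators to a finite point set; the function name `interval_restrictions` and the choice of 20 integer points are assumptions made for illustration.

```python
def interval_restrictions(points):
    """Distinct {0,1} labelings of `points` induced by indicators of
    closed intervals [a, b] (including the all-zero labeling)."""
    pts = sorted(set(points))
    n = len(pts)
    patterns = {tuple([0] * n)}              # interval that misses every point
    for i in range(n):
        for j in range(i, n):
            # interval covering exactly the i-th through j-th smallest points
            patterns.add(tuple(1 if i <= t <= j else 0 for t in range(n)))
    return patterns

points = list(range(20))                     # stands in for R union G
count = len(interval_restrictions(points))   # n(n+1)/2 + 1 labelings
```

With 20 points there are 211 distinct restrictions, against $2^{20}$ arbitrary labelings, so a union bound over restrictions is affordable.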

  8. Integrating functions into larger ranges (still additive approximation). Extensions of VC-dimension notions to families of $f: X \to \{0, \ldots, n\}$: Pollard 1984; Natarajan 1989; Vapnik 1989; Ben-David, Cesa-Bianchi, Haussler, Long 1997. Families of functions $f: X \to [0,1]$: extension of the VC-dimension notion (analogous to the discrete definitions but insists on quantitative separation of values), “fat-shattering”: Alon, Ben-David, Cesa-Bianchi, Haussler 1993; Kearns, Schapire 1994; Bartlett, Long, Williamson 1996; Bartlett, Long 1998. Function classes with finite “dimension” (as above) possess small core-sets for additive approximation of integrals. The same method still works: simply construct $\nu$ by repeatedly sampling from $\mu$. This does not solve multiplicative approximation of nonnegative functions.

  9. What classes of $\mathbb{R}^+$-valued functions possess core-sets for integration? Why multiplicative approximation? In optimization we often wish to minimize a nonnegative loss function. It makes sense to settle for $(1+\varepsilon)$-multiplicative approximation (and this is often unavoidable because of hardness). Example: important optimization problems arise in classification. Choose $c_1, \ldots, c_k$ to minimize: the k-median function, $\mathrm{cost}(f_{c_1,\ldots,c_k}) = \int \|x - \{c_1,\ldots,c_k\}\| \, d\mu(x)$; the k-means function, $\mathrm{cost}(f_{c_1,\ldots,c_k}) = \int \|x - \{c_1,\ldots,c_k\}\|^2 \, d\mu(x)$; or, for any $\alpha > 0$, $\mathrm{cost}(f_{c_1,\ldots,c_k}) = \int \|x - \{c_1,\ldots,c_k\}\|^\alpha \, d\mu(x)$. Here $\|x - \{c_1,\ldots,c_k\}\|$ denotes the distance from $x$ to the nearest center.
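For a finitely supported $\mu$ these costs are weighted sums, which the following sketch computes. The function names and the four-point example are hypothetical; `alpha=1` gives the k-median cost and `alpha=2` the k-means cost.

```python
import math

def dist_to_centers(x, centers):
    """Distance from point x to the nearest of the given centers."""
    return min(math.dist(x, c) for c in centers)

def clustering_cost(points, weights, centers, alpha=1.0):
    """Integral of ||x - {c_1..c_k}||^alpha against the discrete
    distribution that puts mass weights[i] on points[i]."""
    return sum(w * dist_to_centers(x, centers) ** alpha
               for x, w in zip(points, weights))

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
weights = [0.25, 0.25, 0.25, 0.25]
centers = [(0.5, 0.0), (10.5, 0.0)]
k_median_cost = clustering_cost(points, weights, centers, alpha=1.0)
k_means_cost = clustering_cost(points, weights, centers, alpha=2.0)
```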

  12. Core-sets can be useful because of the standard algorithmic approach: Replace the input $\mu$ (empirical distribution on a huge number of points, or even a continuous distribution given via certain “oracles”) by an $\varepsilon$-approximator (aka core-set) $\nu$ supported on a small set. Find an optimal (or near-optimal) $c_1, \ldots, c_k$ for $\nu$. Infer that it is near-optimal for $\mu$. In this lecture we focus solely on existence/non-existence of finite-cardinality core-sets, not on how to find them. Theorems will hold for any “input” distribution $\mu$ regardless of how it is presented.
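The final inference step can be spelled out as a short chain of inequalities. This is a sketch under an assumed form of the multiplicative guarantee, namely $(1-\varepsilon)\int f\,d\mu \le \int f\,d\nu \le (1+\varepsilon)\int f\,d\mu$ for every $f \in F$; the names $\hat c$ (optimal centers for $\nu$) and $c^*$ (optimal centers for $\mu$) are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{cost}_\mu(\hat c)
  &\le \frac{1}{1-\varepsilon}\,\mathrm{cost}_\nu(\hat c)
  && \text{($\nu$ approximates $\mu$ on $f_{\hat c}$)} \\
  &\le \frac{1}{1-\varepsilon}\,\mathrm{cost}_\nu(c^*)
  && \text{($\hat c$ is optimal for $\nu$)} \\
  &\le \frac{1+\varepsilon}{1-\varepsilon}\,\mathrm{cost}_\mu(c^*)
  && \text{($\nu$ approximates $\mu$ on $f_{c^*}$)}
\end{align*}
```

For small $\varepsilon$ the factor $(1+\varepsilon)/(1-\varepsilon)$ is $1 + O(\varepsilon)$, so $\hat c$ is near-optimal for $\mu$.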

  13. Known core-sets. In numerical analysis: quadrature methods for exact integration over canonical measures $\mu$ (constant on interval or ball; Gaussian; etc.). In CS previously: very different approaches from what I’ll present today. (If $\mu$ is uniform on a finite set, let $n = |\mathrm{Support}(\mu)|$.) Unbounded (dependent on $n$): Har-Peled, Mazumdar ’04: k-medians & k-means: $O(k \varepsilon^{-d} \log n)$. Chen ’06: k-medians & k-means: $\tilde O(d k^2 \varepsilon^{-2} \log n)$. Har-Peled ’06: in one dimension, integration of other families of functions (e.g., monotone): $\tilde O([\text{family-specific}] \log n)$. Bounded (independent of $n$): Effros, Schulman ’04: k-means: $(k(d/\varepsilon)^d)^{O(k)}$, deterministically. Har-Peled, Kushal ’05: k-medians: $O(k^2/\varepsilon^d)$; k-means: $O(k^3/\varepsilon^{d+1})$. Our general goal is to find out what families of functions have, for every $\mu$, bounded core-sets for integration. In particular our method shows existence (but no algorithm) of core-sets for k-median, of size $\mathrm{poly}(d,k,1/\varepsilon)$: $\tilde O(d k^3 \varepsilon^{-2})$.

  17. Why doesn’t “sample from $\mu$” work? Ex. 1. For additive approximation, the learning theory approach, “construct $\nu$ by repeatedly sampling from $\mu$”, was sufficient to obtain a core-set. Why does it fail now? Simple example: 1-means functions, $F = \{(x-a)^2\}_{a \in \mathbb{R}}$. Let $\mu$ be: [figure: the distribution $\mu$, with nearly all of its mass at a left-hand singularity]. Construct $\nu$ by sampling repeatedly from $\mu$: almost surely all samples will lie in the left-hand singularity. If $a$ is at the left-hand singularity, $\int f\,d\mu > 0$, but w.h.p. $\int f\,d\nu = 0$. No multiplicative approximation. Underlying problem: the estimator of $\int f\,d\mu$ has large variance.
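A two-point distribution makes the failure concrete. The slide's figure is not available, so the distribution below (mass $1-\delta$ at 0 and mass $\delta$ at a far point) is an assumed stand-in, and all names are hypothetical.

```python
def one_means(a):
    """The 1-means function f(x) = (x - a)^2."""
    return lambda x: (x - a) ** 2

def integral(f, points, weights):
    """Integral of f against the discrete distribution (points, weights)."""
    return sum(w * f(x) for x, w in zip(points, weights))

def prob_all_samples_at_left(delta, m):
    """Probability that m independent samples all land on the heavy left point."""
    return (1.0 - delta) ** m

delta, far, m = 1e-9, 1e6, 1000
mu_points, mu_weights = [0.0, far], [1.0 - delta, delta]   # assumed mu
f = one_means(0.0)                           # center at the left-hand point
true_value = integral(f, mu_points, mu_weights)            # delta * far^2 > 0
typical_sample_value = integral(f, [0.0], [1.0])           # nu sits at 0 only
p_miss = prob_all_samples_at_left(delta, m)
```

Here the true integral is strictly positive, while a sample of 1000 points almost always yields the estimate 0, which is not within any multiplicative factor of the truth.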

  20. Why doesn’t “sample from $\mu$” work? Ex. 2. For additive approximation, the learning theory approach, “construct $\nu$ by repeatedly sampling from $\mu$”, was sufficient to obtain a core-set. Why does it fail now? Even simpler example: interval functions, $F = \{f_{a,b}\}$ where $f_{a,b}(x) = 1$ for $x \in [a,b]$, 0 otherwise. These are {0,1}-valued functions with VC dimension 2. Let $\mu$ be: [figure: the distribution $\mu$]. For the uniform distribution on $n$ points and an interval containing exactly one of them, $\bar f = 1/n$, while for $x$ sampled from $\mu$, $\mathrm{StDev}(f(x)) \sim 1/\sqrt{n}$. Actually nothing works for this family $F$. For an accurate multiplicative estimate of $\bar f = \int f\,d\mu$, a core-set would need to contain the entire support of $\mu$.
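The claim that a core-set must contain the whole support can be illustrated as follows. This is a sketch with hypothetical names, assuming $\mu$ is uniform on the integers 0 to 9 and $\nu$ is any distribution that omits one of them.

```python
def interval_indicator(a, b):
    """f_{a,b}(x) = 1 if a <= x <= b, else 0."""
    return lambda x: 1 if a <= x <= b else 0

def integral(f, points, weights):
    """Integral of f against the discrete distribution (points, weights)."""
    return sum(w * f(x) for x, w in zip(points, weights))

n = 10
support = list(range(n))                     # assumed mu: uniform on 0..9
mu_weights = [1.0 / n] * n
missing = 7                                  # support point omitted by nu
nu_points = [x for x in support if x != missing]
nu_weights = [1.0 / len(nu_points)] * len(nu_points)

# A short interval that isolates the omitted point.
f = interval_indicator(missing - 0.5, missing + 0.5)
mu_value = integral(f, support, mu_weights)  # equals 1/n
nu_value = integral(f, nu_points, nu_weights)
```

Whatever weights $\nu$ uses, it assigns this interval the value 0 while $\mu$ assigns it $1/n$, so no multiplicative guarantee is possible.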

  23. Why doesn’t “sample from $\mu$” work? Ex. 3: step functions. Even the simpler family of step functions doesn’t have finite core-sets: need geometrically spaced points (factor $(1+\varepsilon)$) in the support.

  24. Return to Ex. 1: show $\exists$ small-variance estimator for $\bar f = \int f\,d\mu$. General approach: weighted sampling. Sample not from $\mu$ but from a distribution $q$ which depends on both $\mu$ and $F$. Weighted sampling has long been used for clustering algorithms [Fernandez de la Vega, Kenyon ’98; Kleinberg, Papadimitriou, Raghavan ’98; Schulman ’98; Alon, Sudakov ’99; ...], to reduce the size of the data set. What we’re trying to explain is (a) for what classes of functions weighted sampling can provide an $\varepsilon$-approximator (core-set); (b) what the connection is with the VC proof of existence of $\varepsilon$-approximators in learning theory.

  25. Return to Ex. 1: show $\exists$ small-variance estimator for $\bar f = \int f\,d\mu$. General approach: weighted sampling. Sample not from $\mu$ but from a distribution $q$ which depends on both $\mu$ and $F$. Sample $x$ from $q$. The random variable $T = \mu_x f_x / q_x$ is an unbiased estimator of $\bar f$. Can we design $q$ so that $\mathrm{Var}(T)$ is small $\forall f \in F$? Ideally: $\mathrm{Var}(T) \in O(\bar f^2)$. For the case of “1-means in one dimension”, the optimization “given $\mu$, choose $q$ to minimize $\max_{f \in F} \mathrm{Var}(T)$” can be solved (with mild pain) by Lagrange multipliers. Solution: Let $\sigma^2 = \mathrm{Var}(\mu)$. Center $\mu$ at 0. Then sample from $q_x = \mu_x (\sigma^2 + x^2)/(2\sigma^2)$. (Note: this heavily weights the tails of $\mu$.) Calculation: $\mathrm{Var}(T) \le \bar f^2$. (Now average $O(1/\varepsilon^2)$ samples. For any specific $f$, only $1 \pm \varepsilon$ error.)
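The variance claim can be checked exactly for a finitely supported $\mu$, since $T$ then takes finitely many values. The sketch below is an illustration with hypothetical function names; it computes $\mathrm{Var}(T)$ under the weighted distribution $q$ from the slide and, for contrast, the variance of $f(x)$ when $x$ is drawn from $\mu$ itself.

```python
def weighted_sampling_variance(points, weights, a):
    """Exact (Var(T), f_bar) for f(x) = (x - a)^2, where x ~ q and
    T = mu_x f_x / q_x with q_x = mu_x (sigma^2 + x^2) / (2 sigma^2)."""
    mean = sum(w * x for x, w in zip(points, weights))
    xs = [x - mean for x in points]          # center mu at 0
    a0 = a - mean                            # shift the parameter accordingly
    sigma2 = sum(w * x * x for x, w in zip(xs, weights))
    f = [(x - a0) ** 2 for x in xs]
    f_bar = sum(w * fx for w, fx in zip(weights, f))
    q = [w * (sigma2 + x * x) / (2 * sigma2) for x, w in zip(xs, weights)]
    second_moment = sum((w * fx) ** 2 / qx for w, fx, qx in zip(weights, f, q))
    return second_moment - f_bar ** 2, f_bar

def naive_variance(points, weights, a):
    """Exact (Var(f(x)), f_bar) when x is sampled from mu directly."""
    f = [(x - a) ** 2 for x in points]
    f_bar = sum(w * fx for w, fx in zip(weights, f))
    return sum(w * fx * fx for w, fx in zip(weights, f)) - f_bar ** 2, f_bar

# Assumed skewed two-point distribution, as in the naive-sampling example.
pts, wts = [0.0, 1e6], [1.0 - 1e-9, 1e-9]
var_q, f_bar = weighted_sampling_variance(pts, wts, a=0.0)
var_mu, _ = naive_variance(pts, wts, a=0.0)
ratio_weighted = var_q / f_bar ** 2
ratio_naive = var_mu / f_bar ** 2
```

On this skewed distribution the naive ratio $\mathrm{Var}/\bar f^2$ is enormous, while the weighted ratio stays at most 1 (up to rounding), which is what lets $O(1/\varepsilon^2)$ samples suffice for a fixed $f$.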

  26. Can we generalize the success of Ex. 1? For what classes $F$ of nonnegative functions does there exist, for all $\mu$, an estimator $T$ with $\mathrm{Var}(T) \in O(\bar f^2)$? E.g., what about nonnegative quartics, $f_x = (x-a)^2 (x-b)^2$? We shouldn’t have to do Lagrange multipliers each time. Key notion: sensitivity. Define the sensitivity of $x$ w.r.t. $(F, \mu)$: $s_x = \sup_{f \in F} f_x / \bar f$. Define the total sensitivity of $F$: $S(F) = \sup_\mu \int s_x \, d\mu$. Sample from the distribution $q_x = \mu_x s_x / S$. (This emphasizes sensitive $x$’s.) Theorem 1: $\mathrm{Var}(T) \le (S-1) \bar f^2$. Proof omitted. Exercise: For “parabolas”, $F = \{(x-a)^2\}$, show $S = 2$. Corollary: $\mathrm{Var}(T) \le \bar f^2$ (as previously obtained via Lagrange multipliers). Theorem 2 (slightly harder): $T$ has a Chernoff bound (its distribution has exponential tails). Don’t need this today.
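The slide omits the proof of Theorem 1 and leaves the parabola case as an exercise. One possible argument for each is sketched below, written for a discrete $\mu$ and with $S$ read as $\sum_x \mu_x s_x$ for that $\mu$; this is an illustrative derivation, not necessarily the route the speaker had in mind.

```latex
% Theorem 1. With q_x = \mu_x s_x / S we have T = \mu_x f_x / q_x = S f_x / s_x,
% and s_x \ge f_x / \bar f gives f_x / s_x \le \bar f. Hence
\begin{align*}
\mathbb{E}[T^2] &= \sum_x q_x \Bigl(\frac{\mu_x f_x}{q_x}\Bigr)^{2}
                 = S \sum_x \mu_x f_x \cdot \frac{f_x}{s_x}
                 \le S \,\bar f \sum_x \mu_x f_x = S \bar f^{2}, \\
\mathrm{Var}(T) &= \mathbb{E}[T^2] - \bar f^{2} \le (S-1)\,\bar f^{2}.
\end{align*}

% Exercise. Center \mu at 0 with variance \sigma^2, so that for f = (x-a)^2
% we have \bar f = \sigma^2 + a^2. By Cauchy--Schwarz,
\begin{align*}
(x-a)^2 &= \Bigl(\tfrac{x}{\sigma}\cdot\sigma + (-1)\cdot a\Bigr)^{2}
         \le \Bigl(\tfrac{x^2}{\sigma^2} + 1\Bigr)\bigl(\sigma^2 + a^2\bigr), \\
s_x &= \sup_a \frac{(x-a)^2}{\sigma^2 + a^2} = 1 + \frac{x^2}{\sigma^2}
      \quad\text{(attained at } a = -\sigma^2/x \text{ for } x \neq 0\text{)}, \\
S &= \int s_x \, d\mu = 1 + \frac{\sigma^2}{\sigma^2} = 2 .
\end{align*}
```

Substituting $s_x = 1 + x^2/\sigma^2$ and $S = 2$ into $q_x = \mu_x s_x / S$ gives $q_x = \mu_x(\sigma^2 + x^2)/(2\sigma^2)$, which agrees with the distribution found by Lagrange multipliers.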

  28. Can we calculate $S$ for more examples? Example 1. Let $V$ be a real or complex vector space of dimension $d$. For each $v = (\ldots v_x \ldots) \in V$ define an $f \in F$ by $f_x = |v_x|^2$. Theorem 3: $S(F) = d$. Proof omitted. Corollary (again): Quadratics in 1 dimension have $S(F) = 2$. Quartics in 1 dimension have $S(F) = 3$. Quadratics in $r$ dimensions have $S(F) = r+1$.

  29. Can we calculate $S$ for more examples? Example 2. Let $F+G = \{f+g: f \in F, g \in G\}$. Theorem 4 (easy): $S(F+G) \le S(F) + S(G)$. Corollary: bounded sums of squares of bounded-degree polynomials have finite $S$. Example 3. Parabolas on $k$ disjoint regions. Direct sum of vector spaces, so $S \le 2k$.

  32. Can we calculate $S$ for more examples? But all these examples don’t even handle the 1-median functions [figure], and certainly not the k-median functions [figure]. Will return to this...

  34. What about MIN(F,G)? Question: If $F$ and $G$ have finite total sensitivity, is the same true of $\mathrm{MIN}(F,G) = \{\min(f,g): f \in F, g \in G\}$? We want this for optimization: e.g., k-means or k-median functions are constructed by MIN out of simple families. We know $S(\text{Parabolas}) = 2$; what is $S(\mathrm{MIN}(\text{Parabolas}, \text{Parabolas}))$? Answer: unbounded. So total sensitivity does not remain finite under the MIN operator.

  37. Idea for counterexample: Roughly, on a suitable distribution $\mu$, a sequence of “pairs of parabolas” can mimic a sequence of step functions. $\mu = e^{-|x|}$. And recall from earlier: step functions have unbounded total sensitivity. This counterexample relies on scaling the two parabolas differently. What if we only allow translates of a single “base function”?

  38. Finite total sensitivity of clustering functions. Let $M$ be any metric space. Let $F_\alpha = \{\|x-a\|^\alpha\}_{a \in M}$ (1-median is $\alpha = 1$, 1-means is $\alpha = 2$). Theorem 5: For any $\alpha > 0$, $S(F_\alpha) < \infty$. Note: the bound is independent of $M$. $S$ is not an analogue of VC dimension / cover function; it is a new parameter needed for unbounded functions. Theorem 6: For any $\alpha > 0$, $S(\mathrm{MIN}(F_\alpha, \ldots (k \text{ times}) \ldots, F_\alpha)) \in O(k)$. But remember this is only half the story: bounded $S$ ensures only good approximation of $\bar f = \int f\,d\mu$ for each individual function $f \in F$. We also need to handle all $f$ simultaneously, the “VC” aspect. [Figure: a 2-median function in $M = \mathbb{R}^2$]

  39. Cover codes for families of functions $F$. Recall the “Red and Green Points” argument in VC theory: after picking $2m$ points from $\mu$, all the {0,1}-valued functions in the concept class fall into just $m^{O(1)}$ equivalence classes by their restriction to $R \cup G$. (The “shatter function” is $\Phi(m) = O(m^{\text{VC-dim}})$.) These restrictions are a covering code for the concept class. For $\mathbb{R}^+$-valued functions use a more general definition. First try: $f \in F$ is “covered” by $g$ if $\forall x \in R \cup G$, $f_x = (1 \pm \varepsilon) g_x$. But this definition neglects the role of sensitivity. Corrected definition: $f \in F$ is “covered” by $g$ if $\forall x \in R \cup G$, $|f_x - g_x| < \varepsilon \bar f s_x / 8S$. Notes: (1) Error can scale with $\bar f$ rather than $f_x$. (2) Tolerates more error on high-sensitivity points. A “covering code” (for $\mu$, $R \cup G$) is a small ($\Phi(m, \varepsilon)$ subexponential in $m$) family $\mathcal{G}$, such that every $f \in F$ is covered by some $g \in \mathcal{G}$.

  40. Cover codes for families of functions $F$. So (now focusing on k-median) we need to prove two things. Theorem 6: $S(\mathrm{MIN}(F_1, \ldots (k \text{ times}) \ldots, F_1)) \in O(k)$. Theorem 7: (a) In $\mathbb{R}^d$, $\Phi(\mathrm{MIN}(F_1, \ldots (k \text{ times}) \ldots, F_1)) \in m^{\mathrm{poly}(k, 1/\varepsilon, d)}$; (b) Chernoff bound for $\int f_x/s_x \, d\nu_G$ as an estimator of $\int f_x/s_x \, d\nu_{R \cup G}$. (Recall $\nu_G$ = uniform dist. on $G$.) Today we talk only about: Theorem 6; Theorem 7 in the case $k = 1$, $d$ arbitrary.

  41. Thm 6: Total sensitivity of k-median functions. Theorem 6: $S(\mathrm{MIN}(F_1, \ldots (k \text{ times}) \ldots, F_1)) \in O(k)$. Proof: Given $\mu$, let $f^*$ be the optimal clustering function, with centers $u^*_1, \ldots, u^*_k$, so $h = \text{k-median-cost}(\mu) = \int \|x - \{u^*_1, \ldots, u^*_k\}\| \, d\mu$. For any $x$, we need to upper bound $s_x$. Let $U_i$ = Voronoi region of $u^*_i$, $p_i = \int_{U_i} d\mu$, and $h_i = (1/p_i) \int_{U_i} \|x - u^*_i\| \, d\mu$, so that $h = \sum_i p_i h_i$. Suppose $x \in U_1$. Let $f$ be any k-median function, with centers $u_1, \ldots, u_k$. Closest to $u^*_1$ is wlog $u_1$. Let $a = \|u^*_1 - u_1\|$. By Markov inequality, at least $p_1/2$ mass is within $2h_1$ of $u^*_1$. So: $\bar f \ge (p_1/2) \max(0, a - 2h_1)$ and $\bar f \ge h$, and so $\bar f \ge h/2 + (p_1/4) \max(0, a - 2h_1)$.

  42. Thm 6: Total sensitivity of k-median functions. $\bar f \ge h/2 + (p_1/4) \max(0, a - 2h_1)$. From the definition of sensitivity, $s_x = \max_f f_x / \bar f \le \max_f \|x - \{u_1, \ldots, u_k\}\| / \bar f \le \max_f \|x - u_1\| / \bar f \le \ldots$ (can show the worst case is either $a = 2h_1$ or $a = \infty$) $\ldots \le 4h_1/h + 2\|x - u^*_1\|/h + 4/p_1$. Thus $S = \int s_x \, d\mu = \sum_i \int_{U_i} s_x \, d\mu \le \sum_i \int_{U_i} [4h_i/h + 2\|x - u^*_i\|/h + 4/p_i] \, d\mu = (4/h) \sum_i p_i h_i + (2/h) \int \|x - \{u^*_1, \ldots, u^*_k\}\| \, d\mu + \sum_i 4 = 4 + 2 + 4k = 6 + 4k$. $\blacksquare$ (Best possible up to constants.)
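The per-point bound from the proof is straightforward to evaluate on a weighted point set, and integrating it recovers $6 + 4k$ exactly. The sketch below uses hypothetical names, assumes the supplied centers play the role of the optimal centers $u^*_i$, and assumes every Voronoi region has positive mass and $h > 0$.

```python
import math

def kmedian_sensitivity_bounds(points, weights, centers):
    """Upper bound 4 h_i / h + 2 ||x - u_i|| / h + 4 / p_i on the sensitivity
    of each point x, where i indexes the Voronoi region containing x."""
    k = len(centers)
    assign = [min(range(k), key=lambda i: math.dist(x, centers[i]))
              for x in points]
    dists = [math.dist(x, centers[i]) for x, i in zip(points, assign)]
    p = [0.0] * k                            # mass p_i of each region
    cost = [0.0] * k                         # p_i * h_i for each region
    for w, d, i in zip(weights, dists, assign):
        p[i] += w
        cost[i] += w * d
    h = sum(cost)                            # total k-median cost
    h_i = [c / pi for c, pi in zip(cost, p)]
    return [4 * h_i[i] / h + 2 * d / h + 4 / p[i]
            for d, i in zip(dists, assign)]

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
weights = [0.25, 0.25, 0.25, 0.25]
centers = [(0.5, 0.0), (11.0, 0.0)]          # assumed optimal for this mu
bounds = kmedian_sensitivity_bounds(points, weights, centers)
total = sum(w * b for w, b in zip(weights, bounds))        # 6 + 4k for k = 2
q = [w * b / total for w, b in zip(weights, bounds)]       # sampling weights
```

Normalizing the bounds gives a sampling distribution `q` that favors points far from their center and points in light clusters, which is the weighting the sensitivity framework prescribes.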

  43. Thm 7: Chernoff bound for “VC” argument; cover code. Theorem 7(b): Consider $R \cup G$ as having been chosen already; now select $G$ randomly within $R \cup G$. We need a Chernoff bound for the random variable $\int f_x/s_x \, d\nu_G$ as an estimator of $\int f_x/s_x \, d\nu_{R \cup G}$. Proof: Recall $f_x/s_x \le \bar f$, so $0 \le \int f_x/s_x \, d\nu_{R \cup G} \le \bar f$. We need error $O(\varepsilon \bar f)$ in the estimator, but not necessarily $O(\varepsilon \int f_x/s_x \, d\nu_{R \cup G})$; so standard Chernoff bounds for bounded random variables suffice. Theorem 7(a): Start with the case $k = 1$, i.e. the family $F_1 = \{\|x-a\|\}_{a \in \mathbb{R}^d}$. (Wlog shift $\mu$ so the minimum $h$ is achieved at $a = 0$.) By Markov ineq., $\ge$ ½ the mass lies in $B(0, 2h)$. Cover code: two “clouds” of $f$’s. Inner cloud: centers “$a$” sprinkled so the balls $B(a, \varepsilon h/mS)$ cover $B(0, 3h)$. Outer cloud: geometrically spaced, factor $(1 + \varepsilon/mS)$, to cover $B(0, hmS/\varepsilon)$. NB: Size of the cover code $\sim (mS/\varepsilon)^d$. Poly in $m$, so the “Red/Green” argument works.

  44. Thm 7: Chernoff bound for “VC” argument; cover code. Why is every $f = \|x - a\|$ covered by this code? In cases 1 & 2, $f$ is covered by the $g$ whose root $b$ is closest to $a$. Case 1: $a \in$ inner ball $B(0, 3h)$. Then for all $x$, $|f_x - g_x|$ is bounded by the Lipschitz property. Case 2: $a \in$ outer ball $B(0, hmS/\varepsilon)$. This forces $\bar f$ to be large (proportional to $\|a\|$ rather than $h$), which makes it easier to achieve $|f_x - g_x| \le \varepsilon \bar f$; again use the Lipschitz property. Case 3: $a \notin$ outer ball $B(0, hmS/\varepsilon)$. In this case $f$ is covered by the constant function $g_x = \|a\|$. Again this forces $\bar f$ to be large (proportional to $\|a\|$ rather than $h$), but for $x$ far from 0 this is not enough. Use the inequality $h \ge \|x\|/s_x$: distant points have high sensitivity. Take advantage of the extra tolerance for error on high-sensitivity points.

  45. Thm 7: Chernoff bound for “VC” argument; cover code, $k > 1$. For $k > 1$, use a similar construction, but start from the optimal clustering $f^*$ with centers $u^*_1, \ldots, u^*_k$. Surround each $u^*_i$ by a cloud of appropriate radius. Given a k-median function $f$ with centers $u_i$, cover it by a function $g$ which, in each Voronoi region $U_i$ of $f^*$, is either a constant or a 1-median function centered at a cloud point nearest $u_i$. This produces a covering code for k-median functions, with $\log |\text{covering code}| \in \tilde O(kd \log (S/\varepsilon))$. Need $m$ (number of samples from $\mu$) to be: $\sim (\log |\text{covering code}|) \times (\mathrm{Var}(T)/\bar f^2) \sim k d S^2 \varepsilon^{-2} \sim d k^3 \varepsilon^{-2}$.

  46. Some open questions. Efficient algorithm to find a small $\varepsilon$-approximator? (Suppose $\mathrm{Support}(\mu)$ is finite.) For {0,1}-valued functions there was a finitary characterization of whether the cover function of $F$ was exponential or sub-exponential: the largest set shattered by $F$. Question: Is there an analogous finitary characterization for the cover function for multiplicative approximation of $\mathbb{R}^+$-valued functions? (It is not sufficient that level sets have low VC dimension; step functions are a counterexample.)
