
A PAC Model for Learning from Labeled and Unlabeled Data




Presentation Transcript


  1. A PAC Model for Learning from Labeled and Unlabeled Data Maria-Florina Balcan & Avrim Blum Carnegie Mellon University, Computer Science Department Maria-Florina Balcan

  2. Outline of the talk • Supervised Learning • PAC Model • Sample Complexity • Algorithm Design • Semi-supervised Learning • A PAC Style Model • Examples of results in our model • Sample Complexity • Algorithmic Issues: Co-training of linear separators • Conclusions • Implications of our Analysis

  3. Usual Supervised Learning Problem • Imagine you want a computer program to help you decide which email messages are spam and which are important. • Might represent each message by n features (e.g., return address, keywords, spelling, etc.). • Take a sample S of data, labeled according to whether they were/weren't spam. • Goal of the algorithm is to use the data seen so far to produce a good prediction rule (a "hypothesis") h for future data.

  4. The concept learning setting E.g., • Given data, some reasonable rules might be: • Predict SPAM if unknown AND (sex OR sales) • Predict SPAM if sales + sex – known > 0. • ...

  5. Supervised Learning, Big Questions • Algorithm Design • How might we automatically generate rules that do well on observed data? • Sample Complexity/Confidence Bound • What kind of confidence do we have that they will do well in the future?

  6. Supervised Learning: Formalization (PAC) • PAC model – nice/standard model for learning from labeled data. • X – instance space • S = {(x, l)} – set of labeled examples • examples – assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c* • labels ∈ {-1, 1} – binary classification • Want to do optimization over S to find some hypothesis h, but we want h to have small error over D. • err(h) = Pr_{x ∼ D}[h(x) ≠ c*(x)]

  7. Basic PAC Learning Definitions • Algorithm A PAC-learns concept class C if for any target c* in C, any distribution D over X, any accuracy ε > 0 and confidence δ > 0: • A uses at most poly(n, 1/ε, 1/δ, size(c*)) examples and running time. • With probability 1-δ, A produces h in C of error at most ε. • Notation: err(h) – true error of h; err_S(h) – empirical error of h.

  8. Sample Complexity: Uniform Convergence, Finite Hypothesis Spaces • Realizable Case 1. Prob. a bad hypothesis (one with err > ε) is consistent with m examples is at most (1-ε)^m 2. So, prob. there exists a bad consistent hypothesis is at most |C|(1-ε)^m 3. Set this to δ and solve: # examples needed is at most (1/ε)[ln|C| + ln(1/δ)] • If not too many rules to choose from, then unlikely some bad one will fool you just by chance.
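The realizable-case bound in step 3 can be computed directly. A minimal sketch (the function name `pac_sample_bound` is just an illustrative choice):

```python
import math

def pac_sample_bound(hyp_space_size, epsilon, delta):
    """Realizable-case PAC bound: m >= (1/epsilon) * (ln|C| + ln(1/delta))
    labeled examples suffice so that, with prob. 1-delta, every consistent
    hypothesis has true error at most epsilon."""
    return math.ceil((1.0 / epsilon) *
                     (math.log(hyp_space_size) + math.log(1.0 / delta)))

# e.g. |C| = 2^20 boolean rules, 5% error, 95% confidence
m = pac_sample_bound(2**20, epsilon=0.05, delta=0.05)
print(m)  # -> 338
```

Note how the dependence on |C| is only logarithmic, which is what makes the later replacement of |C| by a smaller compatible subclass pay off.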

  9. Sample Complexity: Uniform Convergence, Finite Hypothesis Spaces • Realizable Case vs. Agnostic Case • What if there is no perfect h? Then ask for uniform convergence: with m = O((1/ε²)[ln|C| + ln(1/δ)]) examples, whp every h in C has |err(h) - err_S(h)| ≤ ε, so the empirical error minimizer is near-optimal. • Gives hope for local optimization over the training data.

  10. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces • C[S] – the set of splittings of dataset S using concepts from C. • C[m] – maximum number of ways to split m points using concepts in C, i.e. C[m] = max_{|S|=m} |C[S]|. • C[m, D] – expected number of splits of m points drawn from D with concepts in C. • Neat Fact #1: previous results still hold if we replace |C| with C[2m]. • Neat Fact #2: can even replace with C[2m, D].
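As a small sanity check of these definitions, the growth function of a concrete class can be counted by brute force. The sketch below (hypothetical helper `num_splittings`, my own construction) uses 1-D threshold functions h_t(x) = +1 iff x > t, for which C[m] = m + 1:

```python
def num_splittings(points):
    """Count distinct labelings ("splittings") of a 1-D point set
    achievable by threshold functions h_t(x) = +1 iff x > t."""
    labelings = set()
    xs = sorted(points)
    # candidate thresholds: below all points, between consecutive points,
    # and above all points -- these realize every achievable labeling
    cands = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    for t in cands:
        labelings.add(tuple(x > t for x in points))
    return len(labelings)

print(num_splittings([0.5, 1.3, 2.7, 4.0]))  # C[4] = 4 + 1 = 5
```

This matches Sauer's Lemma on the next slide: thresholds have VC-dimension 1, so C[m] = O(m^1).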

  11. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces • For instance, Sauer's Lemma, C[m] = O(m^{VCdim(C)}), implies that m = O((1/ε)[VCdim(C) ln(1/ε) + ln(1/δ)]) examples suffice in the realizable case.

  12. Outline of the talk • Supervised Learning • PAC Model • Sample Complexity • Algorithms • Semi-supervised Learning • Proposed Model • Examples of results in our model • Sample Complexity • Algorithmic Issues: Co-training of linear separators • Conclusions • Implications of our Analysis

  13. Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning) • Hot topic in recent years in Machine Learning. • Many applications have lots of unlabeled data, but labeled data is rare or expensive: • Web page, document classification • OCR, Image classification

  14. Combining Labeled and Unlabeled Data • Several methods have been developed to try to use unlabeled data to improve performance, e.g.: • Transductive SVM • Co-training • Graph-based methods

  15. Can we extend the PAC model to deal with Unlabeled Data? • PAC model – nice/standard model for learning from labeled data. • Goal – extend it naturally to the case of learning from both labeled and unlabeled data. • Different algorithms are based on different assumptions about how data should behave. • Question – how to capture many of the assumptions typically used?

  16. Example of a "typical" assumption [Figure: the same point set separated with labeled data only (SVM) vs. Transductive SVM] • The separator goes through low-density regions of the space / has large margin. • assume we are looking for a linear separator • belief: there should exist one with large separation

  17. Another Example • Agreement between two parts: co-training. • examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x) • for example, if we want to classify web pages: x = ⟨x1, x2⟩, where x1 is the link info and x2 is the text info [Figure: a "My Advisor" page linking to Prof. Avrim Blum's page]

  18. Co-training [Figure: the "My Advisor" example split into its link-info and text-info views, with + and - labels]

  19. Proposed Model • Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution. • "learn C" becomes "learn (C, χ)" (i.e. learn class C under compatibility notion χ) • Express relationships that one hopes the target function and underlying distribution will possess. • Goal: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.

  20. Proposed Model, cont. • Goal: use unlabeled data & our belief to reduce size(C) down to size(highly compatible functions in C) in the previous bounds. • Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well. • Require that the degree of compatibility be something that can be estimated from a finite sample.

  21. Proposed Model, cont. • Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution. • Require that the degree of compatibility be something that can be estimated from a finite sample. • Require χ to be an expectation over individual examples: • χ(h, D) = E_{x ∼ D}[χ(h, x)] – compatibility of h with D, where χ(h, x) ∈ [0, 1] • err_unl(h) = 1 - χ(h, D) – incompatibility of h with D (unlabeled error rate of h)
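Because χ is an expectation over individual examples, err_unl(h) can be estimated from a finite unlabeled sample by a simple average. A minimal sketch (the 1-D threshold hypothesis and the hard margin-style χ with γ = 0.5 are toy assumptions of mine, not from the talk):

```python
def estimate_unlabeled_error(chi, h, unlabeled):
    """Empirical incompatibility: err_unl(h) ~= 1 - (1/|U|) * sum_x chi(h, x)."""
    return 1.0 - sum(chi(h, x) for x in unlabeled) / len(unlabeled)

# Toy 1-D margin compatibility (hypothetical): h is a threshold, and a point
# is compatible iff it lies at distance more than gamma = 0.5 from it.
chi = lambda h, x: 1.0 if abs(x - h) > 0.5 else 0.0
U = [-2.0, -1.0, -0.4, 0.3, 1.0, 2.0]
print(estimate_unlabeled_error(chi, 0.0, U))  # 2 of 6 points are too close
```

The "estimable from a finite sample" requirement on the slide is exactly what makes this average a legitimate stand-in for χ(h, D).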

  22. Margins, Compatibility [Figure: a highly compatible large-margin separator] • Margins: belief is that there should exist a large-margin separator. • Incompatibility of h and D (unlabeled error rate of h) – the probability mass within distance γ of h. • Can be written as an expectation over individual examples, χ(h, D) = E_{x ∼ D}[χ(h, x)], where: • χ(h, x) = 0 if dist(x, h) ≤ γ • χ(h, x) = 1 if dist(x, h) > γ

  23. Margins, Compatibility [Figure: a highly compatible large-margin separator] • Margins: belief is that there should exist a large-margin separator. • If we do not want to commit to γ in advance, define χ(h, x) to be a smooth function of dist(x, h), e.g. χ(h, x) = min(dist(x, h)/γ, 1). • Illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h (this is not an expectation over individual examples, so it cannot be estimated from a finite sample).

  24. Co-training, Compatibility • Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩. • Hope is that the two parts of the example are consistent. • Legal (and natural) notion of compatibility: • the compatibility of ⟨h1, h2⟩ and D: Pr_{⟨x1, x2⟩ ∼ D}[h1(x1) = h2(x2)] • can be written as an expectation over examples: χ(⟨h1, h2⟩, ⟨x1, x2⟩) = 1 iff h1(x1) = h2(x2)
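Under this notion, the compatibility of a pair ⟨h1, h2⟩ is just the fraction of unlabeled pairs on which the two views agree, so it is trivially estimable. A sketch with hypothetical toy views (numeric x1, x2 and the sign-threshold hypotheses are my own illustrative assumptions):

```python
def cotraining_compatibility(h1, h2, pairs):
    """chi(<h1,h2>, D) ~= fraction of unlabeled pairs <x1,x2>
    on which the two view-hypotheses agree."""
    agree = sum(1 for x1, x2 in pairs if h1(x1) == h2(x2))
    return agree / len(pairs)

# Toy views: x1 and x2 are numbers; each hypothesis is a sign threshold.
h1 = lambda x1: 1 if x1 > 0 else -1
h2 = lambda x2: 1 if x2 > 10 else -1
pairs = [(1.0, 12.0), (-2.0, 3.0), (0.5, 9.0), (-1.0, 15.0)]
print(cotraining_compatibility(h1, h2, pairs))  # agree on 2 of 4 pairs -> 0.5
```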

  25. Examples of results in our model: Sample Complexity – Uniform convergence bounds, Finite Hypothesis Spaces, Doubly Realizable Case • Assume χ(h, x) ∈ {0, 1}; define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}. • Theorem (informal): O((1/ε)[ln|C| + ln(1/δ)]) unlabeled examples and O((1/ε)[ln|C_{D,χ}(ε)| + ln(1/δ)]) labeled examples suffice so that, whp, any h ∈ C consistent with the labeled data and fully compatible with the unlabeled data has err(h) ≤ ε. • Bounds the number of labeled examples as a measure of the helpfulness of D w.r.t. χ • a helpful distribution is one in which C_{D,χ}(ε) is small

  26. Semi-Supervised Learning: Natural Formalization (PAC) • We will say an algorithm "PAC-learns" if it runs in poly time using samples poly in the respective bounds. • E.g., can think of ln|C| as # bits to describe the target without knowing D, and ln|C_{D,χ}(ε)| as # bits to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.

  27. Examples of results in our model: Sample Complexity – Uniform convergence bounds • Finite Hypothesis Spaces – c* not fully compatible: Theorem

  28. Examples of results in our model: Sample Complexity – Uniform convergence bounds, Infinite Hypothesis Spaces • Assume χ(h, x) ∈ {0, 1} and define χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h, x).

  29. Examples of results in our model: Sample Complexity, ε-cover-based bounds • For algorithms that behave in a specific way: • first use the unlabeled data to choose a representative set of compatible hypotheses • then use the labeled sample to choose among these • Theorem • Can result in a much better bound than uniform convergence!

  30. Examples of results in our model • Let’s look at some algorithms.

  31. Examples of results in our model – Algorithmic Issues: Algorithm for a simple (C, χ) • X = {0,1}^n, C – class of disjunctions, e.g. h = x1 ∨ x2 ∨ x3 ∨ x4 ∨ x7 • For x ∈ X, let vars(x) be the set of variables set to 1 by x • For h ∈ C, let vars(h) be the set of variables disjoined by h • χ(h, x) = 1 if either vars(x) ⊆ vars(h) or vars(x) ∩ vars(h) = ∅ • Strong notion of margin: • every variable is either a positive indicator or a negative indicator • no example should contain both positive and negative indicators • Can give a simple PAC-learning algorithm for this pair (C, χ).

  32. Examples of results in our model – Algorithmic Issues: Algorithm for a simple (C, χ) • Use unlabeled sample U to build a graph G on n vertices: • put an edge between i and j if ∃ x in U with i, j ∈ vars(x). • Use labeled data L to label the connected components. • Output h such that vars(h) is the union of the positively-labeled components. • If c* is fully compatible, then no component will get both positive and negative labels. • If |U| and |L| are as given in the bounds, then whp err(h) ≤ ε. [Figure: U = {011000, 101000, 001000, 000011, 100100} with components {1,2,3,4} and {5,6}; L = {(100000, +), (000011, -)} gives h = x1 ∨ x2 ∨ x3 ∨ x4]
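The component-based algorithm on this slide can be sketched as follows. A minimal illustration using union-find (the function name and 0-indexed variables are my own choices; the data reproduces the slide's example, and full compatibility of c* is assumed, so no component receives both labels):

```python
def learn_disjunction(n, unlabeled, labeled):
    """Variables that co-occur in an unlabeled example must agree, so group
    them into connected components, then label components with labeled data.
    Examples are 0/1 tuples of length n; labels are +1/-1."""
    parent = list(range(n))          # union-find over the n variables
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for x in unlabeled:              # union all variables on in the same example
        on = [i for i in range(n) if x[i] == 1]
        for i in on[1:]:
            parent[find(i)] = find(on[0])
    positive_roots = set()           # components touched by a positive example
    for x, label in labeled:
        if label == +1:
            for i in range(n):
                if x[i] == 1:
                    positive_roots.add(find(i))
    vars_h = sorted(i for i in range(n) if find(i) in positive_roots)
    h = lambda x: 1 if any(x[i] == 1 for i in vars_h) else -1
    return h, vars_h

# The slide's example: should recover h = x1 v x2 v x3 v x4 (indices 0..3).
U = [(0,1,1,0,0,0), (1,0,1,0,0,0), (0,0,1,0,0,0), (0,0,0,0,1,1), (1,0,0,1,0,0)]
L = [((1,0,0,0,0,0), +1), ((0,0,0,0,1,1), -1)]
h, vars_h = learn_disjunction(6, U, L)
print(vars_h)  # -> [0, 1, 2, 3]
```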

  33. Examples of results in our model – Algorithmic Issues: Algorithm for a simple (C, χ) • Especially non-helpful distribution – the uniform distribution over all examples x with |vars(x)| = 1 • get n components; still needs Ω(n) labeled examples • Helpful distribution – one such that w.h.p. the # of components is small • need fewer labeled examples

  34. Examples of results in our model – Algorithmic Issues: Co-training of linear separators • Examples ⟨x1, x2⟩ ∈ R^n × R^n. • Target functions c1 and c2 are linear separators; assume c1 = c2 = c*, and that no pair crosses the target plane. • For f a linear separator in R^n, err_unl(f) – the fraction of the pairs that "cross f's boundary". • Consistency problem: given a set of labeled and unlabeled examples, find a separator that is consistent with the labeled examples and compatible with the unlabeled ones. • It is NP-hard [Abie].

  35. Examples of results in our model – Algorithmic Issues: Co-training of linear separators • Assume independence given the label (both points drawn from D+ or both from D-). • [Blum & Mitchell ’98] show one can co-train (in polynomial time) given enough labeled data to produce a weakly-useful hypothesis to begin with. • We show one can learn with only a single labeled example. • Key point: independence given the label implies that the functions with low err_unl rate are: • close to c* • close to ¬c* • close to the all-positive function • close to the all-negative function

  36. Examples of results in our model – Algorithmic Issues: Co-training of linear separators • Nice Tool: a "super simple algorithm" for weak learning a large-margin separator: • pick c at random • If the margin is 1/poly(n), then a random c has at least a 1/poly(n) chance of being a weak predictor.
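The "super simple algorithm" amounts to drawing one random direction and predicting by its sign. A sketch (the Gaussian sampling and function name are my own illustrative assumptions; the weak-learning guarantee under a 1/poly(n) margin is the slide's claim, not something the code verifies):

```python
import random

def random_halfspace_predictor(dim, seed=None):
    """Pick a random direction c (spherically symmetric via i.i.d. Gaussians)
    and predict with the halfspace sign(c . x)."""
    rng = random.Random(seed)
    c = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    def predict(x):
        s = sum(ci * xi for ci, xi in zip(c, x))
        return 1 if s >= 0 else -1
    return predict

p = random_halfspace_predictor(3, seed=0)
print(p((1.0, 2.0, 3.0)))
```

Repeating this poly(n) times and keeping the best candidate is what the next slide's procedure builds on.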

  37. Examples of results in our model – Algorithmic Issues: Co-training of linear separators • Assume independence given the label. • Draw a large unlabeled sample S = {(x1_i, x2_i)}. • If we also assume a large margin: • run the "super-simple alg" poly(n) times • feed each c into the [Blum & Mitchell] booster • examine all the hypotheses produced, and pick one h with small err_unl that is far from the all-positive and all-negative functions • use the labeled example to choose either h or ¬h • w.h.p. one random c was a weakly-useful predictor; so on at least one of these steps we end up with a hypothesis h with small err(h), and so with small err_unl(h) • If we don't assume a large margin: • use the Outlier Removal Lemma to make sure that at least a 1/poly fraction of the points in S1 = {x1_i} have margin at least 1/poly; this is sufficient.

  38. Implications of our analysis: Ways in which unlabeled data can help • If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low). • By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest ε-cover).

  39. Questions?

  40. Thank you!
