A PAC Model for Learning from Labeled and Unlabeled Data
Maria-Florina Balcan & Avrim Blum
Carnegie Mellon University, Computer Science Department
Maria-Florina Balcan

Outline of the talk
• Supervised Learning
  • PAC Model
  • Sample Complexity
  • Algorithm Design
• Semi-supervised Learning
  • A PAC Style Model
  • Examples of results in our model
    • Sample Complexity
    • Algorithmic Issues: Co-training of linear separators
• Conclusions
  • Implications of our Analysis

Usual Supervised Learning Problem
• Imagine you want a computer program to help you decide which email messages are spam and which are important.
• Might represent each message by n features (e.g., return address, keywords, spelling, etc.).
• Take a sample S of data, labeled according to whether they were/weren't spam.
• Goal of the algorithm is to use data seen so far to produce a good prediction rule (a "hypothesis") h for future data.

The concept learning setting
• E.g., given data, some reasonable rules might be:
  • Predict SPAM if unknown AND (sex OR sales)
  • Predict SPAM if sales + sex - known > 0.
  • ...

Supervised Learning, Big Questions
• Algorithm Design
  • How might we automatically generate rules that do well on observed data?
• Sample Complexity/Confidence Bound
  • What kind of confidence do we have that they will do well in the future?

Supervised Learning: Formalization (PAC)
• PAC model - nice/standard model for learning from labeled data.
• X - instance space
• S = {(x, l)} - set of labeled examples
  • examples - assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c*
  • labels $l \in \{-1, 1\}$ - binary classification
• Want to do optimization over S to find some hypothesis h, but we want h to have small error over D.
  • $err(h) = \Pr_{x \sim D}[h(x) \neq c^*(x)]$

Basic PAC Learning Definitions
• Algorithm A PAC-learns concept class C if for any target c* in C, any distribution D over X, any $\epsilon, \delta > 0$:
  • A uses at most $poly(n, 1/\epsilon, 1/\delta, size(c^*))$ examples and running time.
  • With probability $1-\delta$ (confidence), A produces h in C of error at most $\epsilon$ (accuracy).
• Notation: $err(h)$ - true error of h; $\widehat{err}(h)$ - empirical error of h

Sample Complexity: Uniform Convergence
Finite Hypothesis Spaces
• Realizable Case
  1. The probability that a bad hypothesis (one with $err(h) > \epsilon$) is consistent with m examples is at most $(1-\epsilon)^m$.
  2. So, the probability that there exists a bad consistent hypothesis is at most $|C|(1-\epsilon)^m$.
  3. Set this to $\delta$ and solve to get the number of examples needed: at most $\frac{1}{\epsilon}[\ln|C| + \ln(1/\delta)]$.
• If there are not too many rules to choose from, then it is unlikely that some bad one will fool you just by chance.

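The three steps of the realizable-case argument can be turned into a small calculator. This is a minimal sketch; the function name `realizable_sample_bound` and the example numbers are illustrative assumptions, not part of the talk.

```python
import math


def realizable_sample_bound(class_size: int, eps: float, delta: float) -> int:
    """Smallest integer m with m >= (1/eps) * (ln|C| + ln(1/delta)).

    With this many i.i.d. examples, the chance that some hypothesis of
    true error above eps is still consistent with the sample is at most
    |C| * (1 - eps)^m <= delta.
    """
    if class_size < 1 or not (0 < eps < 1) or not (0 < delta < 1):
        raise ValueError("need |C| >= 1 and eps, delta in (0, 1)")
    return math.ceil((math.log(class_size) + math.log(1.0 / delta)) / eps)


# Illustrative numbers: 1000 candidate rules, 10% error, 95% confidence.
m = realizable_sample_bound(1000, eps=0.1, delta=0.05)
failure_probability_bound = 1000 * (1 - 0.1) ** m
```

The dependence on the class is only logarithmic, which is why replacing C by a much smaller set of compatible hypotheses later in the talk translates directly into fewer labeled examples.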
Sample Complexity: Uniform Convergence
Finite Hypothesis Spaces
• Realizable Case
• Agnostic Case
  • What if there is no perfect h?
  • Gives hope for local optimization over the training data.

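For reference, the textbook agnostic-case statement for a finite class follows from Hoeffding's inequality plus a union bound over C. The form below is the standard one and is offered as a sketch; the constants are an assumption about what the slide displayed.

```latex
% Hoeffding for a single fixed h, then a union bound over all h in C:
\Pr\Big[\exists h \in C : \big|err(h) - \widehat{err}(h)\big| > \epsilon\Big]
  \;\le\; 2\,|C|\,e^{-2m\epsilon^{2}} .
% Setting the right-hand side to \delta and solving for m:
m \;\ge\; \frac{1}{2\epsilon^{2}}\Big[\ln|C| + \ln\frac{2}{\delta}\Big]
\;\Longrightarrow\;
\text{with probability at least } 1-\delta,\ \
\big|err(h) - \widehat{err}(h)\big| \le \epsilon \ \text{ for all } h \in C .
```

Because empirical and true error are uniformly close, a hypothesis that does well on the training data also does well on D, which is what justifies optimizing over the sample.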
Sample Complexity: Uniform Convergence
Infinite Hypothesis Spaces
• C[S] - the set of splittings of dataset S using concepts from C.
• C[m] - maximum number of ways to split m points using concepts in C; i.e. $C[m] = \max_{|S|=m} |C[S]|$.
• C[m,D] - expected number of splits of m points from D with concepts in C.
• Neat Fact #1: previous results still hold if we replace |C| with C[2m].
• Neat Fact #2: can even replace |C| with C[2m,D].

Sample Complexity: Uniform Convergence
Infinite Hypothesis Spaces
For instance: Sauer's Lemma, $C[m] = O(m^{\text{VC-dim}(C)})$, implies:

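To see how quickly the growth function is capped, Sauer's Lemma in its exact form says $C[m] \le \sum_{i=0}^{d} \binom{m}{i}$ for $d$ = VC-dim(C). The sketch below evaluates that sum; the function name `sauer_bound` is an illustrative assumption.

```python
import math


def sauer_bound(m: int, vc_dim: int) -> int:
    """Upper bound on C[m] from Sauer's Lemma: sum_{i=0}^{d} (m choose i).

    For m <= d the sum equals 2^m (every splitting may be realizable);
    for m > d it grows only polynomially, as O(m^d).
    """
    if m < 0 or vc_dim < 0:
        raise ValueError("m and vc_dim must be non-negative")
    return sum(math.comb(m, i) for i in range(vc_dim + 1))


# With d = 2, five points can be split in at most 1 + 5 + 10 = 16 ways,
# compared with 2^5 = 32 possible labelings.
splits = sauer_bound(5, 2)
```

Substituting such a polynomial bound for |C| in the finite-class results replaces $\ln|C|$ with a term of order $d \ln m$.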
Outline of the talk
• Supervised Learning
  • PAC Model
  • Sample Complexity
  • Algorithms
• Semi-supervised Learning
  • Proposed Model
  • Examples of results in our model
    • Sample Complexity
    • Algorithmic Issues: Co-training of linear separators
• Conclusions
  • Implications of our Analysis

Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning)
• Hot topic in recent years in Machine Learning.
• Many applications have lots of unlabeled data, but labeled data is rare or expensive:
  • Web page, document classification
  • OCR, image classification

Combining Labeled and Unlabeled Data
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
  • Transductive SVM
  • Co-training
  • Graph-based methods

Can we extend the PAC model to deal with Unlabeled Data?
• PAC model - nice/standard model for learning from labeled data.
• Goal - extend it naturally to the case of learning from both labeled and unlabeled data.
  • Different algorithms are based on different assumptions about how data should behave.
  • Question - how to capture many of the assumptions typically used?

Example of “typical” assumption
• The separator goes through low density regions of the space/large margin.
  • assume we are looking for a linear separator
  • belief: there should exist one with large separation
[Figure: SVM on labeled data only vs. Transductive SVM]

Another Example
• Agreement between two parts: co-training.
  • examples contain two sufficient sets of features, i.e. an example is $x = \langle x_1, x_2 \rangle$ and the belief is that the two parts of the example are consistent, i.e. $\exists c_1, c_2$ such that $c_1(x_1) = c_2(x_2) = c^*(x)$
  • for example, if we want to classify web pages: $x = \langle x_1, x_2 \rangle$
[Figure: the web page "Prof. Avrim Blum" reached through a link labeled "My Advisor"; $x_1$ - link info, $x_2$ - text info, $x$ - link info & text info]

Co-training
[Figure: + and - examples shown in the text info view and the link info view of the "My Advisor" example]

Proposed Model
• Augment the notion of a concept class C with a notion of compatibility $\chi$ between a concept and the data distribution.
• “learn C” becomes “learn $(C, \chi)$” (i.e. learn class C under compatibility notion $\chi$)
• Express relationships that one hopes the target function and underlying distribution will possess.
• Goal: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.

Proposed Model, cont
• Goal: use unlabeled data & our belief to reduce size(C) down to size(highly compatible functions in C) in the previous bounds.
• Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well.
• Require that the degree of compatibility be something that can be estimated from a finite sample.

Proposed Model, cont
• Augment the notion of a concept class C with a notion of compatibility $\chi$ between a concept and the data distribution.
• Require that the degree of compatibility be something that can be estimated from a finite sample.
• Require $\chi$ to be an expectation over individual examples:
  • $\chi(h, D) = E_{x \sim D}[\chi(h, x)]$ - compatibility of h with D, $\chi(h, x) \in [0, 1]$
  • $err_{unl}(h) = 1 - \chi(h, D)$ - incompatibility of h with D (unlabeled error rate of h)

Margins, Compatibility
• Margins: belief is that there should exist a large margin separator.
• Incompatibility of h and D (unlabeled error rate of h) - the probability mass within distance $\gamma$ of h.
• Can be written as an expectation over individual examples $\chi(h, D) = E_{x \sim D}[\chi(h, x)]$ where:
  • $\chi(h, x) = 0$ if $dist(x, h) \le \gamma$
  • $\chi(h, x) = 1$ if $dist(x, h) \ge \gamma$
[Figure: a highly compatible separator passing between the + and - points]

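Since the margin-based compatibility is an average over individual examples, it can be estimated from an unlabeled sample alone. The sketch below assumes h is a hyperplane given by a weight vector `w` and offset `b`; the function name is hypothetical, and points at distance exactly $\gamma$ are counted as incompatible, which is one way to resolve the boundary case.

```python
import math
from typing import Sequence


def margin_unlabeled_error(w: Sequence[float], b: float,
                           points: Sequence[Sequence[float]],
                           gamma: float) -> float:
    """Empirical err_unl(h): fraction of unlabeled points within gamma of h.

    h is the hyperplane {x : w.x + b = 0}. A point x has chi(h, x) = 0
    (incompatible) when its Euclidean distance to h is at most gamma.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0 or not points:
        raise ValueError("need a non-zero w and at least one point")
    too_close = 0
    for x in points:
        dist = abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        if dist <= gamma:
            too_close += 1
    return too_close / len(points)


# Two of these four points lie within distance 1 of the plane 3x + 4y = 0.
sample = [(3.0, 4.0), (0.3, 0.4), (-4.0, 3.0), (-3.0, -4.0)]
estimate = margin_unlabeled_error((3.0, 4.0), 0.0, sample, gamma=1.0)
```

No labels appear anywhere in the computation, which is the property the model requires of a legal compatibility notion.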
Margins, Compatibility
• Margins: belief is that there should exist a large margin separator.
• If we do not want to commit to $\gamma$ in advance, define $\chi(h, x)$ to be a smooth function of $dist(x, h)$, e.g.:
• Illegal notion of compatibility: the largest $\gamma$ s.t. D has probability mass exactly zero within distance $\gamma$ of h.
[Figure: a highly compatible separator passing between the + and - points]

Co-training, Compatibility
• Co-training: examples come as pairs $\langle x_1, x_2 \rangle$ and the goal is to learn a pair of functions $\langle h_1, h_2 \rangle$
• Hope is that the two parts of the example are consistent.
• Legal (and natural) notion of compatibility:
  • the compatibility of $\langle h_1, h_2 \rangle$ and D:
  • can be written as an expectation over examples:

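One way to write this agreement-based notion out, consistent with the requirement that compatibility be an expectation over individual examples, is sketched below. It assumes that a pair of hypotheses is compatible with an example exactly when the two views receive the same label.

```latex
% Per-example compatibility: do the two views agree?
\chi(\langle h_1, h_2\rangle, \langle x_1, x_2\rangle) =
  \begin{cases}
    1 & \text{if } h_1(x_1) = h_2(x_2) \\
    0 & \text{otherwise}
  \end{cases}
% Compatibility with D is the expectation, i.e. the agreement probability:
\chi(\langle h_1, h_2\rangle, D)
  = E_{\langle x_1, x_2\rangle \sim D}\big[\chi(\langle h_1, h_2\rangle, \langle x_1, x_2\rangle)\big]
  = \Pr_{\langle x_1, x_2\rangle \sim D}\big[h_1(x_1) = h_2(x_2)\big]
% Unlabeled error rate is the disagreement probability:
err_{unl}(\langle h_1, h_2\rangle) = 1 - \chi(\langle h_1, h_2\rangle, D)
  = \Pr_{\langle x_1, x_2\rangle \sim D}\big[h_1(x_1) \neq h_2(x_2)\big]
```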
Examples of results in our model
Sample Complexity - Uniform convergence bounds
Finite Hypothesis Spaces, Doubly Realizable Case
• Assume $\chi(h, x) \in \{0, 1\}$; define $C_{D,\chi}(\epsilon) = \{h \in C : err_{unl}(h) \le \epsilon\}$.
Theorem
• Interpret the bound on the number of labeled examples as a measure of the helpfulness of D w.r.t. $\chi$
  • a helpful distribution is one in which $C_{D,\chi}(\epsilon)$ is small

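A bound of the following shape can be derived by applying the finite-class realizable argument twice, once to the unlabeled sample and once to the labeled one. It is a sketch under that assumption: the constants and the symbols $m_u$, $m_l$ (numbers of unlabeled and labeled examples) and $\widehat{err}_{unl}$ (empirical unlabeled error) are introduced here for illustration and may differ from the theorem shown in the talk.

```latex
% Step 1 (unlabeled data): with probability 1 - \delta/2, every h with
% err_unl(h) > \epsilon shows an incompatibility on the unlabeled sample.
m_u \;\ge\; \frac{1}{\epsilon}\Big[\ln|C| + \ln\frac{2}{\delta}\Big]
% Step 2 (labeled data): a union bound over the survivors C_{D,\chi}(\epsilon) only.
m_l \;\ge\; \frac{1}{\epsilon}\Big[\ln\big|C_{D,\chi}(\epsilon)\big| + \ln\frac{2}{\delta}\Big]
% Conclusion: with probability at least 1 - \delta,
\forall h \in C:\quad
\widehat{err}(h) = 0 \ \text{ and } \ \widehat{err}_{unl}(h) = 0
\;\Longrightarrow\; err(h) \le \epsilon .
```

The labeled requirement depends on $\ln|C_{D,\chi}(\epsilon)|$ rather than $\ln|C|$, which is where a helpful distribution pays off.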
Semi-Supervised Learning
Natural Formalization (PAC)
• We will say an algorithm "PAC-learns" if it runs in polynomial time using sample sizes polynomial in the respective bounds.
• E.g., can think of $\ln|C|$ as the number of bits to describe the target without knowing D, and $\ln|C_{D,\chi}(\epsilon)|$ as the number of bits to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.

Examples of results in our model
Sample Complexity - Uniform convergence bounds
Finite Hypothesis Spaces - c* not fully compatible:
Theorem

Examples of results in our model
Sample Complexity - Uniform convergence bounds
Infinite Hypothesis Spaces
Assume $\chi(h, x) \in \{0, 1\}$ and $\chi(C) = \{\chi_h : h \in C\}$ where $\chi_h(x) = \chi(h, x)$.

Examples of results in our model
Sample Complexity, $\epsilon$-Cover-based bounds
• For algorithms that behave in a specific way:
  • first use the unlabeled data to choose a representative set of compatible hypotheses
  • then use the labeled sample to choose among these
Theorem
• Can result in a much better bound than uniform convergence!

Examples of results in our model
• Let’s look at some algorithms.

Examples of results in our model
Algorithmic Issues: Algorithm for a simple $(C, \chi)$
• $X = \{0,1\}^n$, C - class of disjunctions, e.g. $h = x_1 \vee x_2 \vee x_3 \vee x_4 \vee x_7$
• For $x \in X$, let vars(x) be the set of variables set to 1 by x
• For $h \in C$, let vars(h) be the set of variables disjoined by h
• $\chi(h, x) = 1$ if either $vars(x) \subseteq vars(h)$ or $vars(x) \cap vars(h) = \emptyset$
• Strong notion of margin:
  • every variable is either a positive indicator or a negative indicator
  • no example should contain both positive and negative indicators
• Can give a simple PAC-learning algorithm for this pair $(C, \chi)$.

Examples of results in our model
Algorithmic Issues: Algorithm for a simple $(C, \chi)$
• Use unlabeled sample U to build a graph G on n vertices:
  • put an edge between i and j if $\exists x$ in U with $i, j \in vars(x)$.
• Use labeled data L to label the connected components.
• Output h s.t. vars(h) is the union of the positively-labeled components.
• If c* is fully compatible, then no component will get both positive and negative labels.
• If |U| & |L| are as given in the bounds, then w.h.p. $err(h) \le \epsilon$.
[Figure: example on vertices 1-6 with unlabeled set U = {011000, 101000, 001000, 000011, 100100} and labeled set L = {100000 (+), 000011 (-)}, giving $h = x_1 \vee x_2 \vee x_3 \vee x_4$]

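The graph-based procedure can be sketched in a few lines using a union-find structure for the connected components. The function names, the bit-string encoding of examples, and the choice to leave components with no labeled example out of vars(h) are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def vars_of(x: str) -> Set[int]:
    """1-based indices of the variables set to 1 in the bit string x."""
    return {i + 1 for i, bit in enumerate(x) if bit == "1"}


def learn_disjunction(n: int, unlabeled: List[str],
                      labeled: List[Tuple[str, int]]) -> Set[int]:
    """Return vars(h): the union of the positively-labeled components."""
    parent = list(range(n + 1))  # union-find forest; index 0 is unused

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Variables that co-occur in an unlabeled example share a component.
    for x in unlabeled:
        vs = sorted(vars_of(x))
        for v in vs[1:]:
            parent[find(vs[0])] = find(v)

    # Each labeled example labels every component it touches.
    label_of: Dict[int, int] = {}
    for x, y in labeled:
        for v in vars_of(x):
            root = find(v)
            if label_of.setdefault(root, y) != y:
                raise ValueError("a component received both labels, "
                                 "so the target is not fully compatible")

    # Components never touched by a labeled example are left out of h.
    return {v for v in range(1, n + 1) if label_of.get(find(v)) == 1}


# The example from the slide's figure: components {1,2,3,4} and {5,6}.
U = ["011000", "101000", "001000", "000011", "100100"]
L = [("100000", +1), ("000011", -1)]
h_vars = learn_disjunction(6, U, L)
```

On the figure's data this recovers vars(h) = {1, 2, 3, 4}, i.e. $h = x_1 \vee x_2 \vee x_3 \vee x_4$, from just two labeled examples, because the unlabeled sample has already merged the six variables into two components.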
Examples of results in our model
Algorithmic Issues: Algorithm for a simple $(C, \chi)$
• Especially non-helpful distribution - the uniform distribution over all examples x with |vars(x)| = 1
  • get n components; still needs $\Omega(n)$ labeled examples
• Helpful distribution - one such that w.h.p. the number of components is small
  • needs fewer labeled examples

Examples of results in our model
Algorithmic Issues: Co-training of linear separators
• Examples $\langle x_1, x_2 \rangle \in R^n \times R^n$.
• Target functions $c_1$ and $c_2$ are linear separators; assume $c_1 = c_2 = c^*$, and that no pair crosses the target plane.
• For f a linear separator in $R^n$, $err_{unl}(f)$ - the fraction of the pairs that “cross f’s boundary”
• Consistency problem: given a set of labeled and unlabeled examples, want to find a separator that is consistent with the labeled examples and compatible with the unlabeled ones.
• It is NP-hard - Abie.
[Figure: pairs of + points and pairs of - points on either side of a separator]

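The unlabeled error rate of a candidate separator is straightforward to measure on a sample of pairs. The sketch below assumes f is given by a weight vector `w` and offset `b`, and that a point exactly on the boundary counts as positive; the function names are hypothetical.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def side(w: Vector, b: float, x: Vector) -> int:
    """Which side of the hyperplane w.x + b = 0 the point x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def pair_unlabeled_error(w: Vector, b: float,
                         pairs: Sequence[Tuple[Vector, Vector]]) -> float:
    """Fraction of pairs <x1, x2> whose two halves land on opposite sides of f."""
    if not pairs:
        raise ValueError("need at least one pair")
    crossing = sum(1 for x1, x2 in pairs if side(w, b, x1) != side(w, b, x2))
    return crossing / len(pairs)


# Separator x = 0 in the plane: the first two pairs agree, the last two cross.
pairs = [
    ((1.0, 1.0), (2.0, -1.0)),
    ((-1.0, 0.0), (-2.0, 3.0)),
    ((1.0, 0.0), (-1.0, 0.0)),
    ((0.5, 2.0), (-0.5, 2.0)),
]
rate = pair_unlabeled_error((1.0, 0.0), 0.0, pairs)
```

Evaluating a given separator is cheap; the hardness noted above lies in searching for one that is simultaneously consistent with the labels and compatible with the pairs.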
Examples of results in our model
Algorithmic Issues: Co-training of linear separators
• Assume independence given the label (both points from D+ or from D-).
• [Blum & Mitchell ’98] show we can co-train (in polynomial time) if we have enough labeled data to produce a weakly-useful hypothesis to begin with.
• We show we can learn with only a single labeled example.
• Key point: independence given the label implies that the functions with low $err_{unl}$ rate are:
  • close to $c^*$
  • close to $\neg c^*$
  • close to the all positive function
  • close to the all negative function

Examples of results in our model
Algorithmic Issues: Co-training of linear separators
• Nice Tool: a “super simple algorithm” for weak learning a large-margin separator:
  • pick c at random
• If margin = 1/poly(n), then a random c has at least a 1/poly(n) chance of being a weak predictor

Examples of results in our model
Algorithmic Issues: Co-training of linear separators
• Assume independence given the label.
• Draw a large unlabeled sample $S = \{(x_1^i, x_2^i)\}$.
• If we also assume large margin,
  • run the “super-simple alg” poly(n) times
  • feed each c into the [Blum & Mitchell] booster
  • examine all the hypotheses produced, and pick one h with small $err_{unl}$ that is far from the all-positive and all-negative functions
  • use the labeled example to choose either h or $\neg h$
• w.h.p. one random c was a weakly-useful predictor; so on at least one of these steps we end up with a hypothesis h with small err(h), and so with small $err_{unl}(h)$
• If we don’t assume large margin,
  • use the Outlier Removal Lemma to make sure that at least a 1/poly fraction of the points in $S_1 = \{x_1^i\}$ have margin at least 1/poly; this is sufficient.

Implications of our analysis
Ways in which unlabeled data can help
• If the target is highly compatible with D and we have enough unlabeled data to estimate $\chi$ over all $h \in C$, then we can reduce the search space (from C down to just those $h \in C$ whose estimated unlabeled error rate is low).
• By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest $\epsilon$-cover).

Questions?

Thank you!