
A PAC Model for Learning from Labeled and Unlabeled Data




Presentation Transcript


  1. A PAC Model for Learning from Labeled and Unlabeled Data Maria-Florina Balcan & Avrim Blum Carnegie Mellon University, Computer Science Department Maria-Florina Balcan

  2. Outline of the talk • Supervised Learning • PAC Model • Sample Complexity • Algorithm Design • Semi-supervised Learning • A PAC Style Model • Examples of results in our model • Sample Complexity • Algorithmic Issues: Co-training of linear separators • Conclusions • Implications of our Analysis

  3. Usual Supervised Learning Problem • Imagine you want a computer program to help you decide which email messages are spam and which are important. • Might represent each message by n features (e.g., return address, keywords, spelling, etc.). • Take a sample S of data, labeled according to whether they were/weren't spam. • Goal of the algorithm is to use the data seen so far to produce a good prediction rule (a "hypothesis") h for future data.

  4. The concept learning setting E.g., • Given data, some reasonable rules might be: • Predict SPAM if unknown AND (sex OR sales) • Predict SPAM if sales + sex – known > 0. • ...

  5. Supervised Learning, Big Questions • Algorithm Design • How might we automatically generate rules that do well on observed data? • Sample Complexity/Confidence Bound • What kind of confidence do we have that they will do well in the future?

  6. Supervised Learning: Formalization (PAC) • PAC model – nice/standard model for learning from labeled data. • X – instance space • S = {(x, l)} – set of labeled examples • examples – assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c* • labels ∈ {-1, 1} – binary classification • Want to do optimization over S to find some hypothesis h, but we want h to have small error over D. • err(h) = Pr_{x ∼ D}[h(x) ≠ c*(x)]

  7. Basic PAC Learning Definitions • Algorithm A PAC-learns concept class C if for any target c* in C, any distribution D over X, any accuracy ε > 0 and confidence δ > 0: • A uses at most poly(n, 1/ε, 1/δ, size(c*)) examples and running time. • With probability 1-δ, A produces h in C of error at most ε. • Notation: err(h) – true error of h; err_S(h) – empirical error of h.

  8. Sample Complexity: Uniform Convergence, Finite Hypothesis Spaces • Realizable Case 1. Prob. a bad hypothesis (one with err > ε) is consistent with m examples is at most (1-ε)^m 2. So, prob. there exists a bad consistent hypothesis is at most |C|(1-ε)^m 3. Set this to δ and solve: # examples needed is at most (1/ε)[ln|C| + ln(1/δ)] • If not too many rules to choose from, then unlikely some bad one will fool you just by chance.
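The realizable-case bound in step 3 can be computed directly. A minimal sketch (the function name `pac_sample_bound` is just an illustrative choice):

```python
import math

def pac_sample_bound(hyp_space_size, epsilon, delta):
    """Realizable-case PAC bound: m >= (1/epsilon) * (ln|C| + ln(1/delta))
    labeled examples suffice so that, with prob. 1-delta, every consistent
    hypothesis has true error at most epsilon."""
    return math.ceil((1.0 / epsilon) *
                     (math.log(hyp_space_size) + math.log(1.0 / delta)))

# e.g. |C| = 2^20 boolean rules, 5% error, 95% confidence
m = pac_sample_bound(2**20, epsilon=0.05, delta=0.05)
print(m)  # -> 338
```

Note how the dependence on |C| is only logarithmic, which is what makes the later replacement of |C| by a smaller compatible subclass pay off.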

  9. Sample Complexity: Uniform Convergence, Finite Hypothesis Spaces • Realizable Case vs. Agnostic Case • What if there is no perfect h? Then ask for uniform convergence: with m = O((1/ε²)[ln|C| + ln(1/δ)]) examples, whp every h in C has |err(h) - err_S(h)| ≤ ε, so the empirical error minimizer is near-optimal. • Gives hope for local optimization over the training data.

  10. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces • C[S] – the set of splittings of dataset S using concepts from C. • C[m] – maximum number of ways to split m points using concepts in C, i.e. C[m] = max_{|S|=m} |C[S]|. • C[m, D] – expected number of splits of m points drawn from D with concepts in C. • Neat Fact #1: previous results still hold if we replace |C| with C[2m]. • Neat Fact #2: can even replace with C[2m, D].
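As a small sanity check of these definitions, the growth function of a concrete class can be counted by brute force. The sketch below (hypothetical helper `num_splittings`, my own construction) uses 1-D threshold functions h_t(x) = +1 iff x > t, for which C[m] = m + 1:

```python
def num_splittings(points):
    """Count distinct labelings ("splittings") of a 1-D point set
    achievable by threshold functions h_t(x) = +1 iff x > t."""
    labelings = set()
    xs = sorted(points)
    # candidate thresholds: below all points, between consecutive points,
    # and above all points -- these realize every achievable labeling
    cands = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    for t in cands:
        labelings.add(tuple(x > t for x in points))
    return len(labelings)

print(num_splittings([0.5, 1.3, 2.7, 4.0]))  # C[4] = 4 + 1 = 5
```

This matches Sauer's Lemma on the next slide: thresholds have VC-dimension 1, so C[m] = O(m^1).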

  11. Sample Complexity: Uniform Convergence, Infinite Hypothesis Spaces • For instance, Sauer's Lemma, C[m] = O(m^{VCdim(C)}), implies that m = O((1/ε)[VCdim(C) ln(1/ε) + ln(1/δ)]) examples suffice in the realizable case.

  12. Outline of the talk • Supervised Learning • PAC Model • Sample Complexity • Algorithms • Semi-supervised Learning • Proposed Model • Examples of results in our model • Sample Complexity • Algorithmic Issues: Co-training of linear separators • Conclusions • Implications of our Analysis

  13. Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning) • Hot topic in recent years in Machine Learning. • Many applications have lots of unlabeled data, but labeled data is rare or expensive: • Web page, document classification • OCR, Image classification

  14. Combining Labeled and Unlabeled Data • Several methods have been developed to try to use unlabeled data to improve performance, e.g.: • Transductive SVM • Co-training • Graph-based methods

  15. Can we extend the PAC model to deal with Unlabeled Data? • PAC model – nice/standard model for learning from labeled data. • Goal – extend it naturally to the case of learning from both labeled and unlabeled data. • Different algorithms are based on different assumptions about how data should behave. • Question – how to capture many of the assumptions typically used?

  16. Example of a "typical" assumption [Figure: the same point set separated with labeled data only (SVM) vs. Transductive SVM] • The separator goes through low-density regions of the space / has large margin. • assume we are looking for a linear separator • belief: there should exist one with large separation

  17. Another Example • Agreement between two parts: co-training. • examples contain two sufficient sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c*(x) • for example, if we want to classify web pages: x = ⟨x1, x2⟩, where x1 is the link info and x2 is the text info [Figure: a "My Advisor" page linking to Prof. Avrim Blum's page]

  18. Co-training [Figure: the "My Advisor" example split into its link-info and text-info views, with + and - labels]

  19. Proposed Model • Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution. • "learn C" becomes "learn (C, χ)" (i.e. learn class C under compatibility notion χ) • Express relationships that one hopes the target function and underlying distribution will possess. • Goal: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.

  20. Proposed Model, cont. • Goal: use unlabeled data & our belief to reduce size(C) down to size(highly compatible functions in C) in the previous bounds. • Want to be able to analyze how much unlabeled data is needed to uniformly estimate compatibilities well. • Require that the degree of compatibility be something that can be estimated from a finite sample.

  21. Proposed Model, cont. • Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution. • Require that the degree of compatibility be something that can be estimated from a finite sample. • Require χ to be an expectation over individual examples: • χ(h, D) = E_{x ∼ D}[χ(h, x)] – compatibility of h with D, where χ(h, x) ∈ [0, 1] • err_unl(h) = 1 - χ(h, D) – incompatibility of h with D (unlabeled error rate of h)
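Because χ is an expectation over individual examples, err_unl(h) can be estimated from a finite unlabeled sample by a simple average. A minimal sketch (the 1-D threshold hypothesis and the hard margin-style χ with γ = 0.5 are toy assumptions of mine, not from the talk):

```python
def estimate_unlabeled_error(chi, h, unlabeled):
    """Empirical incompatibility: err_unl(h) ~= 1 - (1/|U|) * sum_x chi(h, x)."""
    return 1.0 - sum(chi(h, x) for x in unlabeled) / len(unlabeled)

# Toy 1-D margin compatibility (hypothetical): h is a threshold, and a point
# is compatible iff it lies at distance more than gamma = 0.5 from it.
chi = lambda h, x: 1.0 if abs(x - h) > 0.5 else 0.0
U = [-2.0, -1.0, -0.4, 0.3, 1.0, 2.0]
print(estimate_unlabeled_error(chi, 0.0, U))  # 2 of 6 points are too close
```

The "estimable from a finite sample" requirement on the slide is exactly what makes this average a legitimate stand-in for χ(h, D).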

  22. Margins, Compatibility [Figure: a highly compatible large-margin separator] • Margins: belief is that there should exist a large-margin separator. • Incompatibility of h and D (unlabeled error rate of h) – the probability mass within distance γ of h. • Can be written as an expectation over individual examples, χ(h, D) = E_{x ∼ D}[χ(h, x)], where: • χ(h, x) = 0 if dist(x, h) ≤ γ • χ(h, x) = 1 if dist(x, h) > γ

  23. Margins, Compatibility [Figure: a highly compatible large-margin separator] • Margins: belief is that there should exist a large-margin separator. • If we do not want to commit to γ in advance, define χ(h, x) to be a smooth function of dist(x, h), e.g. χ(h, x) = min(dist(x, h)/γ, 1). • Illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h (this is not an expectation over individual examples, so it cannot be estimated from a finite sample).

  24. Co-training, Compatibility • Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩. • Hope is that the two parts of the example are consistent. • Legal (and natural) notion of compatibility: • the compatibility of ⟨h1, h2⟩ and D: Pr_{⟨x1, x2⟩ ∼ D}[h1(x1) = h2(x2)] • can be written as an expectation over examples: χ(⟨h1, h2⟩, ⟨x1, x2⟩) = 1 iff h1(x1) = h2(x2)
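Under this notion, the compatibility of a pair ⟨h1, h2⟩ is just the fraction of unlabeled pairs on which the two views agree, so it is trivially estimable. A sketch with hypothetical toy views (numeric x1, x2 and the sign-threshold hypotheses are my own illustrative assumptions):

```python
def cotraining_compatibility(h1, h2, pairs):
    """chi(<h1,h2>, D) ~= fraction of unlabeled pairs <x1,x2>
    on which the two view-hypotheses agree."""
    agree = sum(1 for x1, x2 in pairs if h1(x1) == h2(x2))
    return agree / len(pairs)

# Toy views: x1 and x2 are numbers; each hypothesis is a sign threshold.
h1 = lambda x1: 1 if x1 > 0 else -1
h2 = lambda x2: 1 if x2 > 10 else -1
pairs = [(1.0, 12.0), (-2.0, 3.0), (0.5, 9.0), (-1.0, 15.0)]
print(cotraining_compatibility(h1, h2, pairs))  # agree on 2 of 4 pairs -> 0.5
```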

  25. Examples of results in our model: Sample Complexity – Uniform convergence bounds, Finite Hypothesis Spaces, Doubly Realizable Case • Assume χ(h, x) ∈ {0, 1}; define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}. • Theorem (informal): O((1/ε)[ln|C| + ln(1/δ)]) unlabeled examples and O((1/ε)[ln|C_{D,χ}(ε)| + ln(1/δ)]) labeled examples suffice so that, whp, any h ∈ C consistent with the labeled data and fully compatible with the unlabeled data has err(h) ≤ ε. • Bounds the number of labeled examples as a measure of the helpfulness of D w.r.t. χ • a helpful distribution is one in which C_{D,χ}(ε) is small

  26. Semi-Supervised Learning: Natural Formalization (PAC) • We will say an algorithm "PAC-learns" if it runs in poly time using samples poly in the respective bounds. • E.g., can think of ln|C| as # bits to describe the target without knowing D, and ln|C_{D,χ}(ε)| as # bits to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.

  27. Examples of results in our model: Sample Complexity – Uniform convergence bounds • Finite Hypothesis Spaces – c* not fully compatible: Theorem

  28. Examples of results in our model: Sample Complexity – Uniform convergence bounds, Infinite Hypothesis Spaces • Assume χ(h, x) ∈ {0, 1} and define χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h, x).

  29. Examples of results in our model: Sample Complexity, ε-cover-based bounds • For algorithms that behave in a specific way: • first use the unlabeled data to choose a representative set of compatible hypotheses • then use the labeled sample to choose among these • Theorem • Can result in a much better bound than uniform convergence!

  30. Examples of results in our model • Let’s look at some algorithms.

  31. Examples of results in our model – Algorithmic Issues: Algorithm for a simple (C, χ) • X = {0,1}^n, C – class of disjunctions, e.g. h = x1 ∨ x2 ∨ x3 ∨ x4 ∨ x7 • For x ∈ X, let vars(x) be the set of variables set to 1 by x • For h ∈ C, let vars(h) be the set of variables disjoined by h • χ(h, x) = 1 if either vars(x) ⊆ vars(h) or vars(x) ∩ vars(h) = ∅ • Strong notion of margin: • every variable is either a positive indicator or a negative indicator • no example should contain both positive and negative indicators • Can give a simple PAC-learning algorithm for this pair (C, χ).

  32. Examples of results in our model – Algorithmic Issues: Algorithm for a simple (C, χ) • Use unlabeled sample U to build a graph G on n vertices: • put an edge between i and j if ∃ x in U with i, j ∈ vars(x). • Use labeled data L to label the connected components. • Output h such that vars(h) is the union of the positively-labeled components. • If c* is fully compatible, then no component will get both positive and negative labels. • If |U| and |L| are as given in the bounds, then whp err(h) ≤ ε. [Figure: U = {011000, 101000, 001000, 000011, 100100} with components {1,2,3,4} and {5,6}; L = {(100000, +), (000011, -)} gives h = x1 ∨ x2 ∨ x3 ∨ x4]
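The component-based algorithm on this slide can be sketched as follows. A minimal illustration using union-find (the function name and 0-indexed variables are my own choices; the data reproduces the slide's example, and full compatibility of c* is assumed, so no component receives both labels):

```python
def learn_disjunction(n, unlabeled, labeled):
    """Variables that co-occur in an unlabeled example must agree, so group
    them into connected components, then label components with labeled data.
    Examples are 0/1 tuples of length n; labels are +1/-1."""
    parent = list(range(n))          # union-find over the n variables
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for x in unlabeled:              # union all variables on in the same example
        on = [i for i in range(n) if x[i] == 1]
        for i in on[1:]:
            parent[find(i)] = find(on[0])
    positive_roots = set()           # components touched by a positive example
    for x, label in labeled:
        if label == +1:
            for i in range(n):
                if x[i] == 1:
                    positive_roots.add(find(i))
    vars_h = sorted(i for i in range(n) if find(i) in positive_roots)
    h = lambda x: 1 if any(x[i] == 1 for i in vars_h) else -1
    return h, vars_h

# The slide's example: should recover h = x1 v x2 v x3 v x4 (indices 0..3).
U = [(0,1,1,0,0,0), (1,0,1,0,0,0), (0,0,1,0,0,0), (0,0,0,0,1,1), (1,0,0,1,0,0)]
L = [((1,0,0,0,0,0), +1), ((0,0,0,0,1,1), -1)]
h, vars_h = learn_disjunction(6, U, L)
print(vars_h)  # -> [0, 1, 2, 3]
```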

  33. Examples of results in our model – Algorithmic Issues: Algorithm for a simple (C, χ) • Especially non-helpful distribution – the uniform distribution over all examples x with |vars(x)| = 1 • get n components; still needs Ω(n) labeled examples • Helpful distribution – one such that w.h.p. the # of components is small • need fewer labeled examples

  34. Examples of results in our model – Algorithmic Issues: Co-training of linear separators • Examples ⟨x1, x2⟩ ∈ R^n × R^n. • Target functions c1 and c2 are linear separators; assume c1 = c2 = c*, and that no pair crosses the target plane. • For f a linear separator in R^n, err_unl(f) – the fraction of the pairs that "cross f's boundary". • Consistency problem: given a set of labeled and unlabeled examples, find a separator that is consistent with the labeled examples and compatible with the unlabeled ones. • It is NP-hard [Abie].

  35. Examples of results in our model – Algorithmic Issues: Co-training of linear separators • Assume independence given the label (both points drawn from D+ or both from D-). • [Blum & Mitchell ’98] show one can co-train (in polynomial time) given enough labeled data to produce a weakly-useful hypothesis to begin with. • We show one can learn with only a single labeled example. • Key point: independence given the label implies that the functions with low err_unl rate are: • close to c* • close to ¬c* • close to the all-positive function • close to the all-negative function

  36. Examples of results in our model – Algorithmic Issues: Co-training of linear separators • Nice Tool: a "super simple algorithm" for weak learning a large-margin separator: • pick c at random • If the margin is 1/poly(n), then a random c has at least a 1/poly(n) chance of being a weak predictor.
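The "super simple algorithm" amounts to drawing one random direction and predicting by its sign. A sketch (the Gaussian sampling and function name are my own illustrative assumptions; the weak-learning guarantee under a 1/poly(n) margin is the slide's claim, not something the code verifies):

```python
import random

def random_halfspace_predictor(dim, seed=None):
    """Pick a random direction c (spherically symmetric via i.i.d. Gaussians)
    and predict with the halfspace sign(c . x)."""
    rng = random.Random(seed)
    c = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    def predict(x):
        s = sum(ci * xi for ci, xi in zip(c, x))
        return 1 if s >= 0 else -1
    return predict

p = random_halfspace_predictor(3, seed=0)
print(p((1.0, 2.0, 3.0)))
```

Repeating this poly(n) times and keeping the best candidate is what the next slide's procedure builds on.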

  37. Examples of results in our model – Algorithmic Issues: Co-training of linear separators • Assume independence given the label. • Draw a large unlabeled sample S = {(x1_i, x2_i)}. • If we also assume a large margin: • run the "super-simple alg" poly(n) times • feed each c into the [Blum & Mitchell] booster • examine all the hypotheses produced, and pick one h with small err_unl that is far from the all-positive and all-negative functions • use the labeled example to choose either h or ¬h • w.h.p. one random c was a weakly-useful predictor; so on at least one of these steps we end up with a hypothesis h with small err(h), and so with small err_unl(h) • If we don't assume a large margin: • use the Outlier Removal Lemma to make sure that at least a 1/poly fraction of the points in S1 = {x1_i} have margin at least 1/poly; this is sufficient.

  38. Implications of our analysis: Ways in which unlabeled data can help • If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low). • By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest ε-cover).

  39. Questions?

  40. Thank you!
