Active Learning of Binary Classifiers
Presenters: Nina Balcan and Steve Hanneke
Outline
• What is Active Learning?
• Active Learning of Linear Separators
• General Theories of Active Learning
• Open Problems
Supervised Passive Learning
Data Source → unlabeled examples → Expert / Oracle → labeled examples → Learning Algorithm → algorithm outputs a classifier
Incorporating Unlabeled Data in the Learning Process
• In many settings, unlabeled data is cheap and easy to obtain, while labeled data is much more expensive.
• Web page and document classification
• OCR, image classification
Semi-Supervised Passive Learning
Data Source → unlabeled examples → Learning Algorithm, which also receives some labeled examples from the Expert / Oracle → algorithm outputs a classifier
Semi-Supervised Passive Learning
Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
• Transductive SVM [Joachims '98]
• Co-training [Blum & Mitchell '98], [BBY '04]
• Graph-based methods [Blum & Chawla '01], [ZGL '03]
(Figure: separator learned by an SVM from the labeled data only vs. the separator learned by a Transductive SVM that also uses the unlabeled data.)
Semi-Supervised Passive Learning (continued)
Workshops: ICML '03, ICML '05
Books: Semi-Supervised Learning, O. Chapelle, B. Scholkopf and A. Zien (eds.), MIT Press, 2006
Theoretical models: [Balcan-Blum '05]
Active Learning
Data Source → unlabeled examples → Learning Algorithm
Learning Algorithm ↔ Expert / Oracle: request for the label of an example, a label for that example (repeated as needed)
Algorithm outputs a classifier
What Makes a Good Algorithm?
• Guaranteed to output a relatively good classifier for most learning problems.
• Doesn't make too many label requests: choose the label requests carefully, to get informative labels.
Can It Really Do Better Than Passive?
• YES! (sometimes)
• We often need far fewer labels for active learning than for passive learning.
• This is predicted by theory and has been observed in practice.
Active Learning in Practice
• Active SVM (Tong & Koller, ICML 2000) seems to be quite useful in practice.
• At any time during the algorithm, we have a "current guess" of the separator: the max-margin separator of all labeled points so far.
• E.g., strategy 1: request the label of the example closest to the current separator (see the sketch below).
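A minimal pool-based sketch of strategy 1 (not the authors' implementation): it assumes scikit-learn, a NumPy array X_pool of unlabeled points, and a hypothetical oracle query_label(x) returning the true label; it also assumes the random seed set happens to contain points from both classes.

```python
import numpy as np
from sklearn.svm import SVC

def active_svm(X_pool, query_label, n_queries=20, n_seed=4, rng_seed=0):
    """Pool-based active SVM sketch: repeatedly query the unlabeled point
    closest to the current max-margin separator (strategy 1 above)."""
    rng = np.random.default_rng(rng_seed)
    idx = list(rng.choice(len(X_pool), size=n_seed, replace=False))  # random seed set
    y = {i: query_label(X_pool[i]) for i in idx}                     # assumed to hit both classes

    clf = SVC(kernel="linear", C=1e6)     # large C ~ nearly hard-margin separator
    for _ in range(n_queries):
        clf.fit(X_pool[idx], [y[i] for i in idx])
        dist = np.abs(clf.decision_function(X_pool))   # distance of each pool point to separator
        dist[idx] = np.inf                             # never re-query an already-labeled point
        j = int(np.argmin(dist))                       # the most "uncertain" point
        y[j] = query_label(X_pool[j])
        idx.append(j)

    clf.fit(X_pool[idx], [y[i] for i in idx])          # final classifier on all queried labels
    return clf
```

The large C value just mimics a hard-margin fit; the essential step is querying the pool point with the smallest |decision_function| value.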
When Does It Work? And Why?
• The algorithms currently used in practice are not well understood theoretically.
• We don't know if/when they output a good classifier, nor can we say how many labels they will need.
• So we seek algorithms that we can understand and state formal guarantees for.
Rest of this talk: a survey of recent theoretical results.
Standard Supervised Learning Setting
• S = {(x, l)}: a set of labeled examples
• drawn i.i.d. from some distribution D over X and labeled by some target concept c* ∈ C.
• Want to do optimization over S to find some hypothesis h, but we want h to have small error over D:
  err(h) = Pr_{x ~ D}[h(x) ≠ c*(x)]
• Sample complexity (finite hypothesis space, realizable case): m = O((1/ε)(ln |C| + ln 1/δ)) labeled examples suffice to get err(h) ≤ ε with probability at least 1 − δ.
Sample Complexity: Uniform Convergence Bounds
• Infinite hypothesis case: e.g., if C is the class of linear separators in R^d, then we need roughly O(d/ε) examples to achieve generalization error ε.
• Non-realizable case: replace ε with ε² (i.e., roughly O(d/ε²) examples).
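For reference, the standard uniform-convergence bounds behind these statements (VC dimension d, confidence 1 − δ) take the following form up to constants; the O(d/ε) above suppresses the logarithmic factors:

```latex
m_{\text{realizable}}(\epsilon,\delta) = O\!\Big(\tfrac{1}{\epsilon}\big(d\,\log\tfrac{1}{\epsilon} + \log\tfrac{1}{\delta}\big)\Big)
\qquad
m_{\text{agnostic}}(\epsilon,\delta) = O\!\Big(\tfrac{1}{\epsilon^{2}}\big(d + \log\tfrac{1}{\delta}\big)\Big)
```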
Active Learning
• We get to see unlabeled data first, and there is a charge for every label.
• The learner has the ability to choose specific examples to be labeled: the learner works harder, in order to use fewer labeled examples.
• How many labels can we save by querying adaptively?
Can Adaptive Querying Help? [CAL '92, Dasgupta '04]
• Consider threshold functions on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ R}.
• Sample O(1/ε) unlabeled examples.
• Binary search: need just O(log 1/ε) labels.
Active setting: O(log 1/ε) labels to find an ε-accurate threshold. Supervised learning needs O(1/ε) labels.
Exponential improvement in sample complexity.
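A short sketch of this binary-search learner, assuming the realizable noise-free setting; the labeling oracle is hypothetical and is simulated here with a known threshold of 0.37.

```python
import numpy as np

def learn_threshold(xs, label):
    """Active learning of a 1-D threshold h_w(x) = 1(x >= w) by binary search
    over the sorted unlabeled sample xs: O(log n) label queries instead of n."""
    xs = np.sort(xs)
    lo, hi = 0, len(xs)               # search for the first index whose label is 1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label(xs[mid]) == 1:       # target threshold is at or below xs[mid]
            hi = mid
        else:                         # target threshold is above xs[mid]
            lo = mid + 1
    w_hat = xs[lo] if lo < len(xs) else np.inf   # any threshold in the remaining gap works
    return w_hat, queries

# With ~1/epsilon unlabeled draws, w_hat is epsilon-accurate while using only
# O(log 1/epsilon) label queries (versus ~1/epsilon labels for passive learning).
xs = np.random.default_rng(0).uniform(0, 1, size=1000)
w_hat, q = learn_threshold(xs, label=lambda x: int(x >= 0.37))
print(w_hat, q)   # roughly 0.37, about 10 queries
```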
Active Learning Might Not Help [Dasgupta '04]
In general, the number of queries needed depends on C and also on D.
• C = {linear separators in R^1}: active learning reduces sample complexity substantially.
• C = {linear separators in R^2}: there are some target hypotheses for which no improvement can be achieved, no matter how benign the input distribution!
(Figure: example hypotheses h0, h1, h2, h3.)
In this case: learning to accuracy ε requires 1/ε labels.
Examples Where Active Learning Helps
In general, the number of queries needed depends on C and also on D.
• C = {linear separators in R^1}: active learning reduces sample complexity substantially, no matter what the input distribution is.
• C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere: need only O(d log 1/ε) labels to find a hypothesis with error rate < ε.
  [Dasgupta, Kalai, Monteleoni, COLT 2005], [Freund et al. '97], [Balcan-Broder-Zhang, COLT '07]
Region of Uncertainty [CAL '92]
• Current version space: the part of C consistent with labels so far.
• "Region of uncertainty" = the part of data space about which there is still some uncertainty (i.e., disagreement within the version space).
• Example: data lies on a circle in R^2 and hypotheses are homogeneous linear separators.
(Figure: current version space and the corresponding region of uncertainty in data space.)
Region of Uncertainty [CAL '92]
Algorithm: pick a few points at random from the current region of uncertainty and query their labels.
Region of Uncertainty [CAL '92]
After the queried labels arrive, the version space shrinks, giving a new region of uncertainty in data space.
Region of Uncertainty [CAL '92]: Guarantees
Algorithm: pick a few points at random from the current region of uncertainty and query their labels.
[Balcan, Beygelzimer, Langford, ICML '06] analyze a version of this algorithm which is robust to noise:
• C = linear separators on the line, low noise: exponential improvement.
• C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere:
  • low noise: need only d² log(1/ε) labels to find a hypothesis with error rate < ε;
  • realizable case: d^{3/2} log(1/ε) labels;
  • supervised learning: d/ε labels.
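As an illustration of the region-of-uncertainty idea, here is a toy streaming (CAL-style) sketch with an explicit finite version space of thresholds; the grid of candidate thresholds and the simulated oracle are assumptions made purely for the example, not part of the original algorithm's analysis.

```python
import numpy as np

def cal_thresholds(stream, label, grid):
    """CAL-style disagreement-based learning: maintain the version space
    explicitly and query a point only if surviving hypotheses disagree on it."""
    version_space = list(grid)                  # candidate thresholds w for h_w(x) = 1(x >= w)
    queries = 0
    for x in stream:
        preds = {int(x >= w) for w in version_space}
        if len(preds) > 1:                      # x is in the region of uncertainty: query it
            y_true = label(x)
            queries += 1
            version_space = [w for w in version_space if int(x >= w) == y_true]
        # otherwise all consistent hypotheses agree on x, so its label costs nothing
    return version_space, queries

stream = np.random.default_rng(1).uniform(0, 1, size=2000)
vs, q = cal_thresholds(stream, label=lambda x: int(x >= 0.37), grid=np.linspace(0, 1, 1001))
print(len(vs), q)   # version space shrinks to a tiny interval using few label queries
```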
Margin-Based Active Learning Algorithm [Balcan-Broder-Zhang, COLT '07]
Use O(d) examples to find w_1 with error at most 1/8.
• iterate k = 2, …, log(1/ε)
  • rejection sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k;
  • label them;
  • find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples.
• end iterate
(Figure: w_k, w_{k+1}, w*, and the margin band of width γ_k.)
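Below is a simplified sketch of this margin-based scheme, assuming D is uniform on the unit sphere; the band width γ_k = 2^{-k}/√d, the per-round sample size, and the use of a linear SVM fit in place of the constrained search for w_k ∈ B(w_{k-1}, 1/2^k) are illustrative choices, not the exact BBZ parameters.

```python
import numpy as np
from sklearn.svm import LinearSVC

def margin_based_active(d, label, eps, m_k=200, rng_seed=0):
    """BBZ-style margin-based active learning sketch: after an initial passive
    round, labels are requested only for points inside a shrinking margin band."""
    rng = np.random.default_rng(rng_seed)

    def sample_sphere(n):                            # draw n points uniform on the unit sphere
        x = rng.normal(size=(n, d))
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    def fit_direction(X, y):                         # unit-norm linear separator for (X, y)
        w = LinearSVC(C=100.0, fit_intercept=False).fit(X, y).coef_.ravel()
        return w / np.linalg.norm(w)

    X0 = sample_sphere(10 * d)                       # round 1: plain passive learning, O(d) labels
    w = fit_direction(X0, [label(x) for x in X0])

    for k in range(2, int(np.ceil(np.log2(1.0 / eps))) + 1):
        gamma_k = 2.0 ** (-k) / np.sqrt(d)           # shrinking band width (illustrative schedule)
        band = []
        while len(band) < m_k:                       # rejection sample from the band around w
            x = sample_sphere(1)[0]
            if abs(w @ x) <= gamma_k:
                band.append(x)
        yb = [label(x) for x in band]                # labels are requested only inside the band
        if len(set(yb)) == 2:                        # need both classes present to refit
            w = fit_direction(np.array(band), yb)
    return w

w_star = np.eye(20)[0]                               # hypothetical target direction in R^20
w_hat = margin_based_active(20, label=lambda x: 2 * int(w_star @ x >= 0) - 1, eps=0.05)
print(w_hat @ w_star)                                # close to 1, i.e., small angle to the target
```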
BBZ '07, Proof Idea
• iterate k = 2, …, log(1/ε)
  • rejection sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k;
  • ask for labels and find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples.
• end iterate
Assume w_k has error ≤ ε_k. We are done if, for a suitable γ_k, w_{k+1} has error ≤ ε_k/2 while we need only O(d log 1/ε) labels in round k.
BBZ '07, Proof Idea (continued)
Key point: under the uniform distribution assumption, for an appropriately chosen γ_k, the part of w_{k+1}'s error coming from points outside the sampling band around w_k is at most ε_k/4.
Key point: so it is enough to ensure that w_{k+1} also errs little inside the band, and we can do so using only O(d log 1/ε) labels in round k.
(Figure: w_k, w_{k+1}, w*, and the band of width γ_k.)
General Theories of Active Learning
General Concept Spaces
• In the general learning problem, there is a concept space C, and we want to find an ε-optimal classifier h ∈ C with high probability 1 − δ.
How Many Labels Do We Need?
• In passive learning, we know of an algorithm (empirical risk minimization) that needs only roughly d/ε labels (for realizable learning), and roughly d/ε² labels if there is noise, where d is the VC dimension of C.
• We also know this is close to the best we can expect from any passive algorithm.
How Many Labels Do We Need?
• As before, we want to explore the analogous idea for active learning, but now for a general concept space C.
• How many label requests are necessary and sufficient for active learning?
• What are the relevant complexity measures (i.e., the active-learning analogue of VC dimension)?
What ARE the Interesting Quantities?
• Generally speaking, we want examples whose labels are highly controversial among the set of remaining concepts.
• The likelihood of drawing such an informative example is an important quantity to consider.
• But there are many ways to define "informative" in general.
What Do You Mean By "Informative"?
• Want examples that reduce the version space. But how do we measure progress?
  • A problem-specific measure P on C?
  • The diameter?
  • The measure of the region of disagreement?
  • Cover size? (see e.g., Hanneke, COLT 2007)
All of these seem to have interesting theories associated with them. As an example, let's take a look at the diameter in detail.
Diameter (Dasgupta, NIPS 2005)
• Define the distance d(g, h) = Pr(g(X) ≠ h(X)). Imagine each pair of concepts separated by distance > ε has an edge between them.
• One way to guarantee our classifier is within ε of the target classifier is to (safely) reduce the diameter to size ε. We have to rule out at least one of the two concepts for each edge.
• Each unlabeled example X partitions the concepts into two sets (by their prediction on X); every edge crossing that partition will have at least one endpoint contradicted, no matter which label X turns out to have.
Diameter
• Theorem (Dasgupta, NIPS 2005): If, for any finite subset V ⊆ C, Pr_X(X eliminates a ρ fraction of the edges) ≥ τ, then (assuming no noise) we can reduce the diameter to ε using a number of label requests on the order of 1/ρ, up to logarithmic factors. Furthermore, there is an algorithm that does this which, with high probability, requires on the order of 1/τ unlabeled examples per label request, again up to logarithmic factors.
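To make the quantity in the theorem concrete, the toy computation below estimates, for a finite set of threshold concepts and a sample standing in for D, the probability that a random example eliminates at least a ρ fraction of the edges; the specific grid, ε, and ρ values are made up for illustration.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
pool = rng.uniform(0, 1, size=500)                  # sample standing in for the distribution D
V = np.linspace(0, 1, 41)                           # finite concept set: thresholds h_w(x) = 1(x >= w)
preds = (pool[None, :] >= V[:, None]).astype(int)   # preds[i, t] = h_{V[i]}(pool[t])

eps, rho = 0.05, 0.25
# d(g, h) = Pr(g(X) != h(X)), estimated on the pool; edges join concepts with distance > eps
edges = [(i, j) for i, j in combinations(range(len(V)), 2)
         if np.mean(preds[i] != preds[j]) > eps]

# x "eliminates" edge (g, h) when g(x) != h(x): whichever label x receives,
# at least one endpoint of that edge is contradicted and can be ruled out
elim_frac = np.array([np.mean([preds[i, t] != preds[j, t] for (i, j) in edges])
                      for t in range(len(pool))])
print("estimated Pr_X(X eliminates a rho fraction of edges):", np.mean(elim_frac >= rho))
```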
Open Problems in Active Learning
• Efficient (correct) learning algorithms for linear separators provably achieving significant improvements on many distributions.
• What about binary feature spaces?
• Tight general-purpose sample complexity bounds, for both the realizable and agnostic cases.
• An optimal active learning algorithm?