Explore the use of unlabeled data in the learning process, with applications such as web page and document classification at Yahoo! Research. Maria-Florina Balcan surveys semi-supervised passive learning and active learning, in which the learner chooses which examples to label in order to improve classifier performance with fewer labels, and presents margin-based active learning algorithms for linear separators in the realizable and bounded-noise settings.
Margin-Based Active Learning
Maria-Florina Balcan, Carnegie Mellon University
Joint work with Andrei Broder & Tong Zhang, Yahoo! Research
Incorporating Unlabeled Data in the Learning Process
• OCR, image classification
• Web page and document classification
• All the classification problems at Yahoo! Research
Unlabeled data is cheap and easy to obtain; labeled data is much more expensive.
Semi-Supervised Passive Learning
• Several SSL methods have been developed that use unlabeled data to improve performance, e.g.:
  • Transductive SVM [Joachims '98]
  • Co-training [Blum & Mitchell '98]
  • Graph-based methods [Blum & Chawla '01]
• Unlabeled data allows the learner to focus on a priori reasonable classifiers.
See Avrim's talk at the "Open Problems" session.
Active Learning
Setting
• P is a distribution over X × Y; hypothesis class C.
• The learner gets a set of unlabeled examples drawn from P_X and can interactively request the labels of any of these examples, i.e., the learner chooses specific examples to be labeled.
• Goal: find h with small error over P while minimizing the number of label requests.
• The learner works harder in order to use fewer labeled examples.
This talk: linear separators.
Can Adaptive Querying Help? [CAL '92, Dasgupta '04]
• C = {linear separators in R^1} (thresholds), realizable case. In the active setting, binary search over the unlabeled pool finds an ε-accurate threshold with only O(log 1/ε) label requests, an exponential improvement in sample complexity over the Ω(1/ε) labels needed passively (see the sketch below).
• In general, the number of queries needed depends on C and P.
• C = {linear separators in R^2}: for some target hypotheses no improvement can be achieved; learning to accuracy ε still requires Ω(1/ε) labels.
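To make the one-dimensional example concrete, here is a minimal Python sketch of active threshold learning by binary search over a pool of unlabeled points. It is illustrative only: the target threshold, pool size, and function names are assumptions, not anything from the slides.

import random

def threshold_label(x, w_star=0.3):
    # hypothetical realizable 1D target: sign(x - w_star)
    return +1 if x >= w_star else -1

def active_learn_threshold(xs, oracle):
    # Binary search over the sorted unlabeled pool: O(log n) label queries,
    # where a pool of n ~ 1/eps points suffices for an eps-accurate threshold.
    xs = sorted(xs)
    lo, hi = -1, len(xs)      # invariant: xs[:lo+1] carry label -1, xs[hi:] carry label +1
    queries = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == +1:
            hi = mid
        else:
            lo = mid
    # every threshold between xs[lo] and xs[hi] is consistent with all n implied labels
    guess = xs[hi] if hi < len(xs) else xs[lo]
    return guess, queries

pool = [random.random() for _ in range(10000)]   # ~1/eps unlabeled examples
w_hat, q = active_learn_threshold(pool, threshold_label)
print(f"estimated threshold {w_hat:.4f} using {q} label queries on a pool of {len(pool)}")

With 10,000 unlabeled points this issues about 14 label queries, versus labeling essentially the whole pool in the passive setting.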
When Active Learning Helps
In general, the number of queries needed depends on C and P.
C = homogeneous (through-the-origin) linear separators in R^d, P_X = uniform distribution over the unit sphere.
• Realizable case: O(d log 1/ε) labels suffice to find a hypothesis with error ε. [Freund et al. '97; Dasgupta, Kalai, Monteleoni '05]
• Agnostic case: under low noise, O(d^2 log 1/ε) labels suffice to find a hypothesis with error ε. A^2 algorithm [Balcan, Beygelzimer, Langford '06]; [Hanneke '07]
An Overview of Our Results
We analyze a class of margin-based active learning algorithms for learning linear separators.
• For C = homogeneous linear separators in R^d and P_X the uniform distribution over the unit sphere, we obtain an exponential improvement in the realizable case.
• The analysis extends naturally to the bounded noise setting.
• We obtain dimension-independent bounds when we have a good margin distribution.
Margin-Based Active Learning, Realizable Case
Algorithm
• Draw m_1 unlabeled examples, label them, and add them to W(1).
• iterate k = 2, …, s
  • find a hypothesis w_{k-1} consistent with W(k-1)
  • set W(k) = W(k-1)
  • sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1}
  • label them and add them to W(k)
• end iterate
(Figure: the algorithm samples from progressively narrower margin bands of widths γ_1, γ_2, … around the successive hypotheses w_1, w_2, w_3; a runnable sketch of the procedure is given below.)
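The following Python sketch implements the procedure above in the uniform-sphere setting. It is only illustrative: the consistent-hypothesis step uses a simple perceptron, and the band-width schedule gamma(k), sample size m_k, and number of rounds s are placeholder choices, not the constants from the analysis.

import numpy as np

def sample_sphere(n, d, rng):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def consistent_separator(X, y, iters=500):
    # run the perceptron until it is consistent with (X, y) (the data is separable here)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * xi.dot(w) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            break
    return w / (np.linalg.norm(w) + 1e-12)

def margin_based_active(oracle, d, s, m_k, gamma, rng):
    X = sample_sphere(m_k, d, rng)                    # round 1: label m_1 random points
    y = np.array([oracle(x) for x in X])
    for k in range(2, s + 1):
        w = consistent_separator(X, y)                # hypothesis consistent with W(k-1)
        band = []
        while len(band) < m_k:                        # sample m_k points with |w . x| <= gamma_{k-1}
            cand = sample_sphere(10 * m_k, d, rng)
            band.extend(cand[np.abs(cand @ w) <= gamma(k - 1)])
        X_new = np.array(band[:m_k])
        y_new = np.array([oracle(x) for x in X_new])  # label them and add them to W(k)
        X, y = np.vstack([X, X_new]), np.concatenate([y, y_new])
    return consistent_separator(X, y)

rng = np.random.default_rng(0)
d = 10
w_star = np.eye(d)[0]                                 # hypothetical target separator
oracle = lambda x: 1 if x.dot(w_star) >= 0 else -1
w_hat = margin_based_active(oracle, d, s=6, m_k=100,
                            gamma=lambda k: 2.0 ** (-k) * np.pi / np.sqrt(d), rng=rng)
print("angle to target (radians):", float(np.arccos(np.clip(w_hat.dot(w_star), -1, 1))))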
Margin-Based Active Learning, Realizable Case
Theorem. Assume P_X is uniform over S^d. If the band widths γ_k shrink geometrically (on the order of 2^{-k}/√d) and the sample sizes m_k are chosen appropriately, then after s = O(log 1/ε) iterations w_s has error ≤ ε, using exponentially fewer labels overall than passive learning.
The analysis rests on two geometric facts about the uniform distribution on the sphere:
Fact 1. For unit vectors u and v at angle θ(u, v), the disagreement probability Pr_x[sign(u · x) ≠ sign(v · x)] = θ(u, v)/π.
Fact 2. For small γ, the band {x : |u · x| ≤ γ} has probability mass Θ(γ √d).
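A quick Monte Carlo sanity check of these two facts (illustrative code, not from the slides; the dimension, sample size, and band width are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
d, n = 25, 200_000
u, v = rng.standard_normal(d), rng.standard_normal(d)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # uniform points on the sphere

# Fact 1: empirical disagreement vs. theta / pi
disagree = np.mean(np.sign(X @ u) != np.sign(X @ v))
theta = np.arccos(np.clip(u.dot(v), -1.0, 1.0))
print(f"disagreement {disagree:.4f}  vs  theta/pi {theta / np.pi:.4f}")

# Fact 2: the mass of a thin band is roughly proportional to gamma * sqrt(d)
gamma = 0.2 / np.sqrt(d)
print(f"band mass {np.mean(np.abs(X @ u) <= gamma):.4f}"
      f"  vs  gamma*sqrt(2d/pi) {gamma * np.sqrt(2 * d / np.pi):.4f}")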
Margin-Based Active Learning, Realizable Case: Proof Idea
• iterate k = 2, …, s
  • find a hypothesis w_{k-1} consistent with W(k-1)
  • set W(k) = W(k-1)
  • sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1}
  • label them and add them to W(k)
Induction: all w consistent with W(k) have error ≤ 1/2^k; in particular, w_k has error ≤ 1/2^k. The task at the next round is to guarantee that every w consistent with W(k+1) has error ≤ 1/2^{k+1}.
Proof Idea (continued)
Decompose the error of a candidate w consistent with W(k+1) into its disagreement with w* inside and outside the band {x : |w_{k-1} · x| ≤ γ_{k-1}} (written in symbols below).
• Outside the band: under the uniform distribution, both w and w* are at a small angle to w_{k-1}, so the region where they disagree outside the band has probability mass at most roughly 2^{-(k+2)}.
• Inside the band: the band has probability mass on the order of 2^{-k}, so it is enough to ensure that the error of w conditioned on the band is at most a small constant; this keeps the inside-band contribution below roughly 2^{-(k+2)} as well.
Since only a constant conditional error is needed inside the band, this can be done with a number of labels per round that depends on the dimension d but not on the target error ε.
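In symbols, with S := {x : |w_{k-1} · x| ≤ γ_{k-1}}, the decomposition sketched above can be written as (a reconstruction, not the slide's exact formula):

\mathrm{err}(w) \;=\; \Pr_x\big[\, w(x) \neq w^*(x),\; x \notin S \,\big] \;+\; \Pr_x[\, x \in S \,] \cdot \Pr_x\big[\, w(x) \neq w^*(x) \,\big|\, x \in S \,\big]

Keeping each of the two terms below 2^{-(k+2)} gives err(w) ≤ 2^{-(k+1)}, as required by the induction.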
Realizable Case, a Suboptimal Alternative
One could instead require the error outside the band to be exactly zero, i.e., choose the band wide enough that every hypothesis still consistent with the labeled data agrees with w_{k-1} outside of it. This is suboptimal: it forces a band of width on the order of 2^{-k} rather than 2^{-k}/√d, whose probability mass is roughly √d times larger, so each round needs correspondingly more labels to reach a hypothesis with error ε. Such a conservative choice is similar in spirit to [CAL '92, BBL '06, H '07].
Margin-Based Active Learning, Non-realizable Case
Guarantee. Assume P_X is uniform over S^d, the noise is bounded, i.e., |P(Y=1|x) − P(Y=−1|x)| ≥ λ for all x and some constant λ > 0, and w* is the Bayes classifier. Then the previous algorithm and its proof extend naturally, and we again obtain an exponential improvement in the dependence on 1/ε (the label complexity now also depends on the noise level λ).
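For concreteness, here is a bounded-noise label oracle of the kind assumed above that could be plugged into the earlier algorithm sketch. It is illustrative only; lam stands for the assumed lower bound on |P(Y=1|x) − P(Y=−1|x)|, and in the noisy setting the consistent-hypothesis step of the realizable sketch would have to be replaced by empirical-error (or hinge-loss) minimization.

import numpy as np

def bounded_noise_oracle(w_star, lam, rng):
    # Each label agrees with sign(w_star . x) with probability (1 + lam) / 2,
    # so |P(Y=1|x) - P(Y=-1|x)| = lam for every x and w_star is the Bayes classifier.
    def oracle(x):
        clean = 1 if np.dot(w_star, x) >= 0 else -1
        return clean if rng.random() < (1 + lam) / 2 else -clean
    return oracle

rng = np.random.default_rng(2)
w_star = np.eye(10)[0]                     # hypothetical target / Bayes classifier
noisy_oracle = bounded_noise_oracle(w_star, lam=0.6, rng=rng)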
Summary
• We analyzed a class of margin-based active learning algorithms for learning linear separators.
Open Problems
• Analyze a wider class of distributions, e.g. log-concave distributions.
• Characterize the right sample complexity terms for the active learning setting.
Thank you! Also, special thanks to Alina Beygelzimer, Sanjoy Dasgupta, and John Langford for useful discussions.