Learn about active learning and its impact on reducing the number of labeled examples needed for effective classification. Maria Florina Balcan shares insights on adaptive querying, region of uncertainty, and margin-based algorithms.
Active Learning Lecture 26th Maria-Florina Balcan
Active Learning (Diagram: a data source supplies unlabeled examples to the learning algorithm; the algorithm repeatedly sends the expert / oracle a request for the label of an example and receives a label for that example; finally the algorithm outputs a classifier.) • The learner can choose specific examples to be labeled. • The learner works harder in order to use fewer labeled examples.
What Makes a Good Algorithm? • Guaranteed to output a relatively good classifier for most learning problems. • Doesn’t make too many label requests. • Chooses the label requests carefully, to get informative labels.
Can It Really Do Better Than Passive? • YES! (sometimes) • We often need far fewer labels for active learning than for passive. • This is predicted by theory and has been observed in practice.
Can adaptive querying help? [CAL92, Dasgupta04] • Threshold functions on the real line: $h_w(x) = \mathbf{1}(x \ge w)$, $C = \{h_w : w \in \mathbb{R}\}$. (Figure: points on a line, labeled − to the left of the threshold w and + to the right.) • Active algorithm: sample $1/\epsilon$ unlabeled examples; do binary search. • Binary search needs just $O(\log 1/\epsilon)$ labels. • Passive supervised: $\Omega(1/\epsilon)$ labels to find an $\epsilon$-accurate threshold. • Active: only $O(\log 1/\epsilon)$ labels. Exponential improvement. • Other interesting results as well.
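The binary-search idea on this slide can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name `active_threshold`, the target value `true_w`, and the uniform pool on [0, 1] are assumptions made for the example.

```python
import math
import random


def active_threshold(unlabeled, oracle):
    """Binary-search a sorted pool for the leftmost positively labeled point.

    Returns (estimated threshold, number of label queries made).
    """
    pool = sorted(unlabeled)
    lo, hi = 0, len(pool)  # invariant: pool[:lo] are negative, pool[hi:] are positive
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(pool[mid]) == 1:  # label 1 means x >= w
            hi = mid
        else:
            lo = mid + 1
    # The first positive point is a threshold consistent with every pool label;
    # if no point is positive, any threshold above the pool works.
    w_hat = pool[lo] if lo < len(pool) else math.inf
    return w_hat, queries


rng = random.Random(0)
eps = 0.001
true_w = 0.37  # hypothetical target threshold, hidden behind the oracle
pool = [rng.random() for _ in range(round(1 / eps))]
w_hat, queries = active_threshold(pool, lambda x: int(x >= true_w))
print(f"estimate={w_hat:.4f} using {queries} labels out of {len(pool)} points")
```

Labeling the whole pool would cost 1000 queries here, whereas the search stops after about log2(1000) of them.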
Active Learning might not help [Dasgupta04] • In general, the number of queries needed depends on C and also on D. • C = {linear separators in $\mathbb{R}^1$}: active learning reduces sample complexity substantially. • C = {linear separators in $\mathbb{R}^2$}: there are some target hypotheses for which no improvement can be achieved, no matter how benign the input distribution. (Figure: candidate separators $h_0, h_1, h_2, h_3$.) • In this case, learning to accuracy $\epsilon$ requires $1/\epsilon$ labels.
Examples where Active Learning helps • In general, the number of queries needed depends on C and also on D. • C = {linear separators in $\mathbb{R}^1$}: active learning reduces sample complexity substantially, no matter what the input distribution is. • C = homogeneous linear separators in $\mathbb{R}^d$, D = uniform distribution over the unit sphere: need only $d \log 1/\epsilon$ labels to find a hypothesis with error rate $< \epsilon$. • Dasgupta, Kalai, Monteleoni, COLT 2005. • Freund et al., ’97. • Balcan-Broder-Zhang, COLT 07.
Region of uncertainty [CAL92] • Current version space: the part of C consistent with the labels so far. • “Region of uncertainty”: the part of data space about which there is still some uncertainty (i.e., disagreement within the version space). • Example: data lies on a circle in $\mathbb{R}^2$ and hypotheses are homogeneous linear separators. (Figure: current version space; region of uncertainty in data space.)
Region of uncertainty [CAL92] • Algorithm: pick a few points at random from the current region of uncertainty and query their labels. (Figure: current version space; region of uncertainty.)
Region of uncertainty [CAL92] (Figure, after the queried labels are received: new version space; new region of uncertainty in data space.)
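The region-of-uncertainty loop can be sketched on the circle example. To keep the version space easy to represent, this sketch assumes a finite grid of 360 candidate separators that contains the target, and it processes random points one at a time, querying only those on which the surviving candidates disagree. The names `predict`, `cal`, `grid`, and `target` are illustrative, not from the slides.

```python
import math
import random


def predict(phi, theta):
    """Homogeneous separator with normal direction phi, applied to the circle point at angle theta."""
    return 1 if math.cos(theta - phi) >= 0 else -1


def cal(stream, oracle, hypotheses):
    """Query a label only when the current version space disagrees on the point."""
    version_space = list(hypotheses)
    queries = 0
    for theta in stream:
        votes = {predict(phi, theta) for phi in version_space}
        if len(votes) == 1:
            continue  # outside the region of uncertainty: the label is already implied
        queries += 1
        y = oracle(theta)
        # Shrink the version space to the hypotheses consistent with the new label.
        version_space = [phi for phi in version_space if predict(phi, theta) == y]
    return version_space, queries


rng = random.Random(0)
grid = [2 * math.pi * i / 360 for i in range(360)]  # assumed finite hypothesis class
target = grid[123]                                   # hypothetical target, realizable case
stream = [rng.uniform(0, 2 * math.pi) for _ in range(2000)]
version_space, queries = cal(stream, lambda t: predict(target, t), grid)
print(f"{queries} labels requested for {len(stream)} points; "
      f"{len(version_space)} hypotheses remain")
```

Every query removes at least one candidate, so with this grid the number of label requests can never exceed 359, however long the stream is.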
Region of uncertainty [CAL92], Guarantees • Algorithm: pick a few points at random from the current region of uncertainty and query their labels. • [Balcan, Beygelzimer, Langford, ICML’06] analyze a version of this algorithm which is robust to noise. • C = linear separators on the line, low noise: exponential improvement. • C = homogeneous linear separators in $\mathbb{R}^d$, D = uniform distribution over the unit sphere: low noise, need only $d^2 \log 1/\epsilon$ labels to find a hypothesis with error rate $< \epsilon$; realizable case, $d^{3/2} \log 1/\epsilon$ labels; supervised, $d/\epsilon$ labels.
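To get a feel for the three rates on this slide, the sketch below simply evaluates them for one choice of d and ε. It ignores the constants and any lower-order factors the slide leaves out, so only the relative growth is meaningful; the function name `label_bounds` and the sample values are assumptions for illustration.

```python
import math


def label_bounds(d, eps):
    """Evaluate the slide's rates with all hidden constants set to 1."""
    log_term = math.log(1 / eps)
    return {
        "active_low_noise": d ** 2 * log_term,     # d^2 log(1/eps)
        "active_realizable": d ** 1.5 * log_term,  # d^(3/2) log(1/eps)
        "supervised": d / eps,                     # d / eps
    }


bounds = label_bounds(d=10, eps=1e-3)
for name, value in bounds.items():
    print(f"{name:>18}: {value:10.1f}")
```

For d = 10 and ε = 0.001 the supervised rate is in the thousands while both active rates stay in the hundreds, and the gap widens as ε shrinks.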
Margin Based Active-Learning Algorithm [Balcan-Broder-Zhang, COLT 07] • Use O(d) examples to find $w_1$ of error 1/8. • Iterate $k = 2, \ldots, \log(1/\epsilon)$: rejection sample $m_k$ samples x from D satisfying $|w_{k-1}^\top x| \le \gamma_k$; label them; find $w_k \in B(w_{k-1}, 1/2^k)$ consistent with all these examples. • End iterate. (Figure: $w_k$, $w_{k+1}$, $w^*$, and the margin $\gamma_k$.)
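The rejection-sampling step of the loop can be sketched in isolation. The sketch assumes D is uniform over the unit sphere (the setting of the guarantees) and uses hypothetical values for the current vector `w`, the band half-width `gamma`, and the sample count `m`; it does not implement the step that finds a consistent $w_k$.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a point uniformly from the unit sphere by normalizing a Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0:
            return [v / norm for v in g]


def rejection_sample_band(w, gamma, m, rng):
    """Keep drawing unlabeled points until m of them satisfy |w . x| <= gamma."""
    kept, draws = [], 0
    while len(kept) < m:
        x = sample_unit_sphere(len(w), rng)
        draws += 1
        if abs(sum(wi * xi for wi, xi in zip(w, x))) <= gamma:
            kept.append(x)  # only these points would be sent to the labeler
    return kept, draws


rng = random.Random(0)
w = [1.0, 0.0, 0.0, 0.0, 0.0]  # hypothetical current hypothesis w_{k-1}
gamma = 0.1                    # hypothetical band half-width
kept, draws = rejection_sample_band(w, gamma, 50, rng)
print(f"kept {len(kept)} points from {draws} unlabeled draws")
```

Only the kept points cost a label; the discarded draws are unlabeled and free, which is where the label savings come from.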
Margin Based Active-Learning, Realizable Case • Theorem: $P_X$ is uniform over $S^d$. If […] and […] then after […] iterations $w_s$ has error $\le \epsilon$. • Fact 1: […] • Fact 2: […] (Figure: vectors u and v and the angle $\theta(u,v)$ between them.)
Margin Based Active-Learning, Realizable Case • Theorem: $P_X$ is uniform over $S^d$. If […] and […] then after […] iterations $w_s$ has error $\le \epsilon$. • Fact 1: […] • Fact 3: If […] and […] (Figure: vectors u and v and the angle $\theta(u,v)$ between them.)
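The figure on these slides shows the angle θ(u,v) between two weight vectors. As background (the slides' own facts did not survive as text, so this is not a reconstruction of them), a standard property of homogeneous separators under a distribution that is uniform over the sphere is that u and v disagree on a random point with probability θ(u,v)/π. The sketch below checks this numerically; the vectors and sample size are arbitrary choices.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def angle(u, v):
    """Angle between u and v in radians, clamped against rounding error."""
    c = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    return math.acos(max(-1.0, min(1.0, c)))


def disagreement_rate(u, v, n, rng):
    """Monte Carlo estimate of Pr[sign(u.x) != sign(v.x)] for a uniformly random direction x."""
    count = 0
    for _ in range(n):
        # A Gaussian vector has a uniformly random direction; its length does not affect signs.
        x = [rng.gauss(0.0, 1.0) for _ in range(len(u))]
        if (dot(u, x) >= 0) != (dot(v, x) >= 0):
            count += 1
    return count / n


rng = random.Random(0)
u = [1.0, 0.0, 0.0]
v = [math.cos(0.5), math.sin(0.5), 0.0]  # at angle 0.5 radians from u
estimate = disagreement_rate(u, v, 20000, rng)
print(f"empirical {estimate:.4f} vs theta/pi {angle(u, v) / math.pi:.4f}")
```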
BBZ’07, Proof Idea • Iterate $k = 2, \ldots, \log(1/\epsilon)$: rejection sample $m_k$ samples x from D satisfying $|w_{k-1}^\top x| \le \gamma_k$; ask for labels and find $w_k \in B(w_{k-1}, 1/2^k)$ consistent with all these examples. • End iterate. • Assume $w_k$ has error $\le \alpha$. We are done if $\exists \gamma_k$ s.t. $w_{k+1}$ has error $\le \alpha/2$ and we only need $O(d \log(1/\epsilon))$ labels in round k. (Figure: $w_k$, $w_{k+1}$, $w^*$, and the margin $\gamma_k$.)
BBZ’07, Proof Idea • Key Point: under the uniform distribution assumption, for […] we have […] $\le \alpha/4$. • Key Point: so, it’s enough to ensure that […]. We can do so by using only $O(d \log(1/\epsilon))$ labels in round k. (Figure: $w_k$, $w_{k+1}$, $w^*$, and the margin $\gamma_k$.)
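One way to read the two key points is to split the error of $w_{k+1}$ by whether x falls inside the band around $w_k$. This is a sketch under an assumption that the surviving text does not confirm: that the first bound concerns the errors of $w_{k+1}$ outside the band $|w_k^\top x| \le \gamma_k$. Here $\alpha$ denotes the error bound assumed for $w_k$.

```latex
\begin{align*}
\operatorname{err}(w_{k+1})
  &= \Pr\bigl[\, w_{k+1} \text{ errs},\ |w_k^\top x| > \gamma_k \,\bigr]
   + \Pr\bigl[\, w_{k+1} \text{ errs},\ |w_k^\top x| \le \gamma_k \,\bigr] \\
  &\le \frac{\alpha}{4}
   + \Pr\bigl[\, w_{k+1} \text{ errs} \,\bigm|\, |w_k^\top x| \le \gamma_k \,\bigr]
     \cdot \Pr\bigl[\, |w_k^\top x| \le \gamma_k \,\bigr].
\end{align*}
```

Under that reading, $\operatorname{err}(w_{k+1}) \le \alpha/2$ follows as soon as the second term is also at most $\alpha/4$, which is a requirement on the error inside the band only, the one region where labels are requested.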