Active Learning
Lecture 26
Maria Florina Balcan
Active Learning

Data Source → unlabeled examples → Learning Algorithm. The algorithm sends the expert/oracle a request for the label of an example and receives a label for that example, over and over; at the end, the algorithm outputs a classifier.
• The learner can choose which specific examples get labeled.
• It works harder, in order to use fewer labeled examples.
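A minimal sketch of this query loop (the choose_query, fit_classifier, and oracle names are illustrative placeholders, not part of the lecture):

```python
# Sketch of the pool-based active learning protocol from the slide:
# the learner repeatedly picks an example, requests its label from the
# oracle, and finally outputs a classifier.

def active_learning_loop(unlabeled_pool, oracle, choose_query, fit_classifier, budget):
    labeled = []                          # (example, label) pairs collected so far
    pool = list(unlabeled_pool)           # examples we may still query
    for _ in range(budget):               # each pass is one label request
        if not pool:
            break
        x = choose_query(pool, labeled)   # learner picks the example to label
        pool.remove(x)
        y = oracle(x)                     # "A Label for that Example"
        labeled.append((x, y))
    return fit_classifier(labeled)        # algorithm outputs a classifier
```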
What Makes a Good Algorithm?
• Guaranteed to output a relatively good classifier for most learning problems.
• Doesn't make too many label requests.
• Chooses its label requests carefully, to get informative labels.
Can It Really Do Better Than Passive?
• YES! (sometimes)
• We often need far fewer labels for active learning than for passive learning.
• This is predicted by theory and has been observed in practice.
Can adaptive querying help? [CAL92, Dasgupta04]

• Threshold functions on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ R}.

Active Algorithm: sample O(1/ε) unlabeled examples; do binary search on them.
• Binary search needs just O(log 1/ε) labels.

Passive supervised: Ω(1/ε) labels to find an ε-accurate threshold.
Active: only O(log 1/ε) labels. Exponential improvement.
Other interesting results as well.
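A sketch of the binary-search learner for this threshold class (the oracle callable stands in for the expert; names are illustrative):

```python
import random

def learn_threshold(pool, oracle):
    """Binary-search active learner for C = {h_w : h_w(x) = 1(x >= w)}.
    Issues O(log |pool|) label requests instead of labeling all
    |pool| = O(1/eps) draws, as a passive learner would."""
    xs = sorted(pool)
    lo, hi = -1, len(xs)              # invariant: xs[i] has label 0 for i <= lo,
                                      # label 1 for i >= hi
    while hi - lo > 1:
        mid = (lo + hi) // 2          # each query halves the uncertain stretch
        if oracle(xs[mid]) == 1:      # x >= w: boundary is at or left of mid
            hi = mid
        else:                         # x < w: boundary is right of mid
            lo = mid
    # Any threshold in (xs[lo], xs[hi]] is consistent with every queried label.
    return xs[hi] if hi < len(xs) else xs[-1] + 1.0

# Example: ~1000 unlabeled draws, true threshold 0.3 -> about 10 label requests.
pool = [random.random() for _ in range(1000)]
w_hat = learn_threshold(pool, oracle=lambda x: int(x >= 0.3))
```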
Active Learning might not help [Dasgupta04]

In general, the number of queries needed depends on C and also on D.

• C = {linear separators in R^1}: active learning reduces sample complexity substantially.
• C = {linear separators in R^2}: there are some target hypotheses for which no improvement can be achieved, no matter how benign the input distribution. In this case, learning to accuracy ε requires Ω(1/ε) labels.

[Figure: hypotheses h_0, h_1, h_2, h_3 on the circle]
Examples where Active Learning helps

In general, the number of queries needed depends on C and also on D.

• C = {linear separators in R^1}: active learning reduces sample complexity substantially, no matter what the input distribution is.
• C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere: need only O(d log 1/ε) labels to find a hypothesis with error rate < ε.
  • Freund et al., '97
  • Dasgupta, Kalai, Monteleoni, COLT 2005
  • Balcan-Broder-Zhang, COLT 2007
Region of uncertainty [CAL92]

• Current version space: the part of C consistent with the labels so far.
• "Region of uncertainty" = the part of the data space about which there is still some uncertainty (i.e., disagreement within the version space).
• Example: data lies on a circle in R^2 and hypotheses are homogeneous linear separators.

[Figure: current version space; region of uncertainty in data space]
Region of uncertainty [CAL92]

Algorithm: pick a few points at random from the current region of uncertainty and query their labels.

[Figure: current version space; region of uncertainty]
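A sketch of this query rule for the 1-D threshold class from earlier, where the version space is an interval (lo, hi] for w and the region of uncertainty is exactly the set of unlabeled points inside it (names are illustrative):

```python
import random

def cal_thresholds(pool, oracle, num_rounds=50, queries_per_round=2):
    """CAL-style sampling for thresholds h_w(x) = 1(x >= w)."""
    lo, hi = float("-inf"), float("inf")           # version space for w: (lo, hi]
    for _ in range(num_rounds):
        uncertain = [x for x in pool if lo < x < hi]   # region of uncertainty
        if not uncertain:
            break                                  # all remaining labels are implied
        k = min(queries_per_round, len(uncertain))
        for x in random.sample(uncertain, k):      # query a few random points there
            if oracle(x) == 1:                     # label 1: x >= w, so w <= x
                hi = min(hi, x)
            else:                                  # label 0: x < w, so w > x
                lo = max(lo, x)
    lo = lo if lo > float("-inf") else min(pool)
    hi = hi if hi < float("inf") else max(pool)
    return (lo + hi) / 2                           # any w in (lo, hi] is consistent
```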
Region of uncertainty [CAL92]

As labels arrive, the version space shrinks to the hypotheses still consistent with them, which in turn yields a new, smaller region of uncertainty in the data space.

[Figure: new version space; new region of uncertainty in data space]
Region of uncertainty [CAL92], Guarantees

Algorithm: pick a few points at random from the current region of uncertainty and query their labels.

[Balcan, Beygelzimer, Langford, ICML'06] analyze a version of this algorithm which is robust to noise:
• C = linear separators on the line, low noise: exponential improvement.
• C = homogeneous linear separators in R^d, D = uniform distribution over the unit sphere:
  • low noise: need only O(d^2 log 1/ε) labels to find a hypothesis with error rate < ε;
  • realizable case: O(d^{3/2} log 1/ε) labels;
  • passive supervised: O(d/ε) labels.
Margin Based Active-Learning Algorithm [Balcan-Broder-Zhang, COLT 07]

Use O(d) examples to find w_1 of error ≤ 1/8.
iterate k = 2, …, log(1/ε):
  • rejection-sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k;
  • label them;
  • find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples.
end iterate

[Figure: w_k, w_{k+1}, w*, band of width γ_k]
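A schematic rendering of this iteration (the schedules gamma_k and m_k and the constrained consistency step find_consistent are abstracted as caller-supplied callables; a sketch under those assumptions, not the paper's implementation):

```python
import numpy as np

def margin_based_al(w1, sample_from_D, oracle, num_rounds, m_k, gamma_k, find_consistent):
    """Margin-based active learning in the style of Balcan-Broder-Zhang '07.
    w1 is assumed to already have error <= 1/8 (found from O(d) labeled draws)."""
    w = np.asarray(w1, dtype=float)
    w /= np.linalg.norm(w)
    for k in range(2, num_rounds + 1):
        band = []
        while len(band) < m_k(k):               # rejection-sample from D, ...
            x = sample_from_D()
            if abs(w @ x) <= gamma_k(k):        # ... keeping only x in the band
                band.append((x, oracle(x)))     # label requests happen only here
        # Find w_k in the ball B(w_{k-1}, 1/2^k) consistent with the labeled
        # band; this constrained step is supplied by the caller.
        w = find_consistent(band, center=w, radius=2.0 ** (-k))
    return w
```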
Margin Based Active-Learning, Realizable Case

Theorem: P_X is uniform over S^d. If the band widths γ_k and sample sizes m_k are set as in the algorithm, then after s = log(1/ε) iterations, w_s has error ≤ ε.

[Facts 1-3: geometric properties of the uniform distribution over S^d, relating the angle between unit vectors u and v to the probability mass of their disagreement region inside and outside a margin band]
BBZ'07, Proof Idea

iterate k = 2, …, log(1/ε):
  • rejection-sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k;
  • ask for labels and find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples.
end iterate

Assume w_k has error ≤ ε. We are done if we can show that w_{k+1} has error ≤ ε/2 while using only O(d log(1/ε)) labels in round k.

[Figure: w_k, w_{k+1}, w*, band of width γ_k]
BBZ'07, Proof Idea (continued)

Key Point 1: under the uniform distribution assumption, for the chosen band width γ_k, the error that w_{k+1} makes outside the band |w_k · x| ≤ γ_k is at most ε/4.

Key Point 2: so it is enough to ensure that the error of w_{k+1} inside the band is also at most ε/4, and we can do so using only O(d log(1/ε)) labels in round k.
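The two key points combine in a short error decomposition over the band of w_k; a sketch in LaTeX, with ε denoting the error of w_k:

```latex
% err(w_{k+1}) splits over the margin band of w_k: the labeled sample
% controls the inside term, the choice of gamma_k controls the outside term.
\begin{align*}
\operatorname{err}(w_{k+1})
 &= \Pr_x\bigl[w_{k+1}\ \text{errs},\ |w_k\cdot x| \le \gamma_k\bigr]
  + \Pr_x\bigl[w_{k+1}\ \text{errs},\ |w_k\cdot x| > \gamma_k\bigr] \\
 &\le \underbrace{\Pr_x\bigl[w_{k+1}\ \text{errs},\ |w_k\cdot x| \le \gamma_k\bigr]}_{
        \le\, \varepsilon/4 \text{ via } O(d\log(1/\varepsilon)) \text{ labels in the band}}
  + \underbrace{\Pr_x\bigl[w_{k+1}\ \text{errs},\ |w_k\cdot x| > \gamma_k\bigr]}_{
        \le\, \varepsilon/4 \text{ by the choice of } \gamma_k}
  \;\le\; \varepsilon/2 .
\end{align*}
```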