The True Sample Complexity of Active Learning. Steve Hanneke, Machine Learning Department, Carnegie Mellon University, shanneke@cs.cmu.edu. Joint work with Maria-Florina Balcan (Computer Science Department, Carnegie Mellon University, ninamf@cs.cmu.edu) and Jennifer Wortman (Computer and Information Science, University of Pennsylvania, wortmanj@seas.upenn.edu).
Passive Learning: The data source provides raw unlabeled data, the expert / oracle labels it, and the learning algorithm receives the labeled examples. The algorithm outputs a classifier.
Active Learning: The data source provides raw unlabeled data to the learning algorithm. The algorithm sends the expert / oracle a request for the label of an example and receives the label of that example, then requests the label of another example and receives that label, and so on. The algorithm outputs a classifier.
Active Learning: The learning algorithm receives raw unlabeled data and repeatedly requests the label of an example from the expert / oracle, then outputs a classifier. How many label requests are required to learn? This is the sample complexity (e.g., Das04, Das05, DKM05, BBL06, Kaa06, Han07a&b, BBZ07, DHM07).
Active Learning Sometimes Helps. An example: 1-dimensional threshold functions (points on one side of the threshold are labeled -, points on the other side +).
Active Learning Sometimes Helps. An example: 1-dimensional threshold functions. Take m unlabeled examples. Repeatedly request the label of the median point between the current -/+ boundaries. Take any threshold consistent with the observed labels. This uses only O(log m) label requests, yet yields a classifier consistent with all m examples: an exponential improvement over passive learning!
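The binary search over the m unlabeled examples can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `learn_threshold`, the `oracle` callback, and the convention that a point is labeled +1 exactly when it lies at or above the target threshold are all assumptions made for the example.

```python
import random


def learn_threshold(points, oracle):
    """Find a threshold consistent with every point using a binary search.

    points: iterable of unlabeled real values (the m examples).
    oracle: hypothetical labeling callback; assumed to return +1 when a point
            is at or above the unknown target threshold, and -1 otherwise.
    Returns (threshold, number_of_label_requests).
    """
    xs = sorted(points)
    lo, hi = 0, len(xs)  # invariant: xs[:lo] are all -1, xs[hi:] are all +1
    requests = 0
    while lo < hi:
        mid = (lo + hi) // 2  # median of the still-unresolved points
        requests += 1
        if oracle(xs[mid]) == 1:
            hi = mid          # xs[mid] and everything to its right is +1
        else:
            lo = mid + 1      # xs[mid] and everything to its left is -1
    # lo is now the index of the first +1 point (or len(xs) if there is none)
    threshold = xs[lo] if lo < len(xs) else float("inf")
    return threshold, requests


# Example: 1000 unlabeled points, target threshold 0.37 (arbitrary choice)
rng = random.Random(0)
pool = [rng.random() for _ in range(1000)]
target_oracle = lambda x: 1 if x >= 0.37 else -1
threshold, requests = learn_threshold(pool, target_oracle)
```

The returned classifier is h(x) = +1 if x >= threshold, else -1. With m = 1000 points the loop makes at most 10 label requests, whereas labeling the whole pool would take 1000.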
Outline • Formal Model • An Illustrative Example • Exciting New Results • Conclusions & Open Problems
Formal Model
Sample Complexity target-dependent budget
A More Subtle Example: Intervals. Suppose D is uniform on [0,1]. [Figure: the segment from 0 to 1, labeled -, +, - from left to right.]
A More Subtle Example: Intervals. Suppose D is uniform on [0,1], and suppose the target labels everything “-1”. How many labels are needed to verify that a given h has er(h) < ε? [Figure: the segment from 0 to 1 divided into many small intervals.] Need Ω(1/ε) label requests to guarantee the target isn’t one of these. Ω(1/ε) sample complexity?
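One way to see why verification is expensive is a pigeonhole argument, sketched below for a deterministic algorithm. The interval width ε and the notation I_j, h_∅, h_{I_j} are assumptions made for this sketch; the widths drawn in the original figure are not recoverable.

```latex
% Assumed notation: h_\emptyset labels everything -1; h_{I_j} labels I_j as +1.
\begin{align*}
  &\text{Partition } [0,1) \text{ into } k = \lfloor 1/\varepsilon \rfloor
    \text{ disjoint intervals } I_j = [(j-1)\varepsilon,\; j\varepsilon),
    \quad j = 1,\dots,k. \\
  &\text{If the target is } h_\emptyset, \text{ every label request returns } -1. \\
  &\text{With fewer than } k \text{ requests, some } I_j
    \text{ contains no queried point (pigeonhole).} \\
  &\text{Then } h_{I_j} \text{ is consistent with all observed labels, and }
    \Pr_{x \sim D}\bigl[h_\emptyset(x) \neq h_{I_j}(x)\bigr] = \varepsilon. \\
  &\text{So certifying } \mathrm{er}(h_\emptyset) < \varepsilon
    \text{ requires at least } k = \Omega(1/\varepsilon) \text{ label requests.}
\end{align*}
```

The argument bounds the cost of verifying that the empty interval is good, not the cost of finding a good classifier.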
A More Subtle Example: Intervals. Algorithm: Take a large number of unlabeled examples. Phase 1: Query random examples until we find a +1 example. (If all n label requests are used before finding a +1 example, return the empty interval.) Phase 2: Do binary searches to the left and right of the +1 point. After n total label requests, return any consistent h ∈ C.
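The two phases can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `learn_interval` and `oracle` are hypothetical, the target is assumed to be a closed interval, and the sketch stops early once both boundaries are pinned down instead of always spending all n requests.

```python
import random


def learn_interval(points, oracle, n, rng=None):
    """Two-phase active learner for intervals with a budget of n label requests.

    points: unlabeled pool; oracle(x): hypothetical callback returning +1 or -1.
    Returns (a, b), meaning h(x) = +1 iff a <= x <= b, or None for the
    empty interval (everything labeled -1).
    """
    rng = rng or random.Random(0)
    xs = sorted(points)
    budget = n

    # Phase 1: query random examples until a +1 example is found.
    order = list(range(len(xs)))
    rng.shuffle(order)
    pos = None
    for i in order:
        if budget == 0:
            break
        budget -= 1
        if oracle(xs[i]) == 1:
            pos = i
            break
    if pos is None:
        return None  # budget exhausted without a +1: return the empty interval

    # Phase 2a: binary search left of pos for the leftmost +1 point.
    lo, hi = 0, pos  # xs[hi] is known to be +1
    while lo < hi and budget > 0:
        mid = (lo + hi) // 2
        budget -= 1
        if oracle(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    left = hi

    # Phase 2b: binary search right of pos for the rightmost +1 point.
    lo, hi = pos, len(xs) - 1  # xs[lo] is known to be +1
    while lo < hi and budget > 0:
        mid = (lo + hi + 1) // 2
        budget -= 1
        if oracle(xs[mid]) == 1:
            lo = mid
        else:
            hi = mid - 1
    right = lo

    # Any interval consistent with the observed labels would do; this one
    # spans the known +1 points.
    return xs[left], xs[right]
```

If the target interval has probability mass w, Phase 1 needs about 1/w requests in expectation, and Phase 2 needs only logarithmically many in the pool size. The 1/w cost depends on the target, and it becomes unbounded only for the all-negative target, where returning the empty interval is correct anyway.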
Can Active Learning Always Help?
Active Learning Always Helps! The [HLW94] passive algorithm has O(1/ε) sample complexity.
Proof Outline. Claim 1: The result is certainly true for “threshold-esque” problems, where the problem gets easier the longer we work at it (based on [Hanneke07], “disagreement coefficient” analysis). Claim 2: Any C can be partitioned into C1, C2, C3, … with this property. Claim 3: There is an aggregation algorithm that uses all of C1, C2, C3, … but is never much worse than using just the Ci that contains the target f.
How Much Better? In a sense, the o(1/ε) result is the strongest possible result at this level of generality. But can we prove stronger results for specific pairs (C,D)?
Exponential Improvements. It is often possible to achieve polylogarithmic sample complexity for all targets. For example: • Linear separators, under uniform distributions on an r-sphere • Axis-aligned rectangles, under uniform distributions on [0,1]^r • Decision trees with axis-aligned splits, under uniform distributions on [0,1]^r • Finite unions of intervals on the real line (arbitrary distributions). Polylog sample complexities can also be preserved under some transformations: unions, “close” distributions, mixtures of distributions.
Conclusions • Active learning can always achieve a strictly superior asymptotic sample complexity compared to passive learning. • For most natural learning problems, the improvements are exponential. • Often the amount of improvement is not detectable from observable quantities (e.g., for intervals).
Open Problems • Generalizing to agnostic learning • General conditions for learnability at an exponential rate • Reversing the order of quantifiers regarding D in the model
Thank You. Special thanks to Yishay Mansour for suggesting this perspective, and to Eyal Even-Dar, Michael Kearns, Larry Wasserman & Eric Xing for discussions and feedback.
Models of Learning: the number of labels needed before we can find a good classifier, versus the number of labels needed before we can both find a good classifier and know we have found one.
Exponential Improvements. We can extend these results by altering the hypothesis class or distribution in ways that preserve learnability at an exponential rate.
Proof Sketch
Proof Sketch (continued): Maximum recursion depth ≤ d