Active Learning
COMS 6998-4: Learning and Empirical Inference
Irina Rish, IBM T.J. Watson Research Center
Outline
• Motivation
• Active learning approaches
  • Membership queries
  • Uncertainty Sampling
  • Information-based loss functions
  • Uncertainty-Region Sampling
  • Query by committee
• Applications
  • Active Collaborative Prediction
  • Active Bayes net learning
Standard supervised learning model
Given m labeled points, want to learn a classifier with misclassification rate ≤ ε, chosen from a hypothesis class H with VC dimension d < ∞.
VC theory: need m to be roughly d/ε, in the realizable case.
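For reference, the realizable-case bound is usually written out as below; this is the standard textbook form (constants and log factors vary by source), not a formula taken from the slides:

```latex
% Realizable case: with probability >= 1 - \delta, error <= \epsilon once
m \;=\; O\!\left(\frac{1}{\epsilon}\left(d\,\log\frac{1}{\epsilon} \;+\; \log\frac{1}{\delta}\right)\right)
```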
Active learning
In many situations, such as speech recognition and document retrieval, unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?
What is Active Learning?
• Unlabeled data are readily available; labels are expensive
• Want to use adaptive decisions to choose which labels to acquire for a given dataset
• Goal is an accurate classifier at minimal cost
Active learning warning
• The choice of data is only as good as the model itself
• Assume a linear model: then two data points suffice to determine it
• But what happens when the data are not linear?
Active Learning Flavors
• Membership queries vs. selective sampling
• Selective sampling: pool-based vs. sequential
• Myopic vs. batch queries
Active Learning Approaches
• Membership queries
• Uncertainty Sampling
• Information-based loss functions
• Uncertainty-Region Sampling
• Query by committee
Problem
Many results exist in this framework, even for complicated hypothesis classes, but the synthesized queries can be meaningless to human labelers.
[Baum and Lang, 1991] tried fitting a neural net to handwritten characters: the synthetic instances created were incomprehensible to humans.
[Lewis and Gale, 1992] tried training text classifiers: “an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher.”
Uncertainty Sampling [Lewis & Gale, 1994]
• Query the event that the current classifier is most uncertain about
• Used trivially in SVMs, graphical models, etc.
If uncertainty is measured in Euclidean distance, the query is the unlabeled point closest to the decision boundary:
[figure: scatter of labeled points with the query chosen nearest the boundary]
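A minimal sketch of margin-based uncertainty sampling with a linear SVM (scikit-learn and the toy data are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

def query_most_uncertain(clf, X_pool):
    """Return the index of the unlabeled point closest to the
    decision boundary (smallest absolute decision value)."""
    margins = np.abs(clf.decision_function(X_pool))
    return int(np.argmin(margins))

# Illustrative data: a few labeled points and an unlabeled pool.
X_labeled = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_labeled = np.array([0, 0, 1, 1])
X_pool = np.array([[0.5, 0.5], [0.1, 0.0], [1.2, 1.0]])

clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
t_star = query_most_uncertain(clf, X_pool)
print("query point:", X_pool[t_star])   # expect the mid-point [0.5, 0.5]
```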
Information-based Loss Functions [MacKay, 1992]
• Maximize KL-divergence between posterior and prior
• Maximize reduction in model entropy between posterior and prior
• Minimize cross-entropy between posterior and prior
• All of these are notions of information gain
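As a sketch of why these notions coincide (a standard identity, not shown on the slide): writing p(w) for the prior over models and p(w | y) for the posterior after observing label y, the expected KL gain equals the expected entropy reduction, i.e. the mutual information between model and label:

```latex
\mathbb{E}_{y}\!\left[\mathrm{KL}\!\left(p(w \mid y)\,\|\,p(w)\right)\right]
  \;=\; H(W) \;-\; \mathbb{E}_{y}\!\left[H(W \mid y)\right]
  \;=\; I(W; Y)
```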
Query by Committee [Seung et al. 1992, Freund et al. 1997]
• Prior distribution over hypotheses
• Sample a set of classifiers from the distribution
• Query an example based on the degree of disagreement between the committee of classifiers
[figure: committee members A, B, C with differing boundaries; disagreement is highest on points between them]
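A minimal sketch of query-by-committee with vote entropy; here the committee is approximated by bootstrap-trained classifiers rather than true posterior samples, and scikit-learn plus the synthetic data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def qbc_query(committee, X_pool):
    """Return the index of the pool point with the highest vote entropy."""
    votes = np.stack([clf.predict(X_pool) for clf in committee])  # (C, N)
    entropies = []
    for col in votes.T:                       # votes for one pool point
        _, counts = np.unique(col, return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log(p)).sum())
    return int(np.argmax(entropies))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_pool = rng.normal(size=(20, 2))

# Approximate sampling from the posterior with bootstrap resamples.
committee = []
for _ in range(5):
    idx = rng.integers(0, len(X), len(X))
    committee.append(LogisticRegression().fit(X[idx], y[idx]))

print("query index:", qbc_query(committee, X_pool))
```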
Notation
We have:
• Dataset, D
• Model parameter space, W
• Query algorithm, q
Model Example
Probabilistic classifier notation (graphical model: St → Ot):
• T : number of examples
• Ot : vector of features of example t
• St : class of example t
Model Example
Patient state (St):
• St : DiseaseState
Patient observations (Ot):
• Ot1 : Gender
• Ot2 : Age
• Ot3 : TestA
• Ot4 : TestB
• Ot5 : TestC
Possible Model Structures
[figure: two candidate network structures over S and the observations Gender, Age, TestA, TestB, TestC, differing in which arcs connect the variables]
Model Space
Model (graphical model: St → Ot)
Model parameters: P(St), P(Ot | St)
Generative model: must be able to compute P(St = i, Ot = ot | w)
Model Parameter Space (W)
• W = space of possible parameter values
• Prior on parameters: P(W)
• Posterior over models: P(W | D) ∝ P(D | W) P(W)
Notation
We have:
• Dataset, D
• Model parameter space, W
• Query algorithm, q
q(W, D) returns t*, the next sample to label
Game
while NotDone:
• Learn P(W | D)
• q chooses the next example to label
• Expert adds the label to D
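A minimal Python sketch of this loop; the learner, query function, and oracle are hypothetical placeholders standing in for the slide's P(W | D), q, and the human expert:

```python
def active_learning_loop(learn, q, expert, D, pool, budget):
    """Generic pool-based active learning 'game'.

    learn(D)        -> model (stands in for learning P(W | D))
    q(model, pool)  -> index t* of the next example to label
    expert(x)       -> label for example x (the human oracle)
    """
    for _ in range(budget):          # "while NotDone", with a label budget
        model = learn(D)             # learn P(W | D)
        t_star = q(model, pool)      # q chooses the next example
        x = pool.pop(t_star)
        D.append((x, expert(x)))     # expert adds the label to D
    return learn(D)
```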
Simulation
[figure: seven examples with observed features O1–O7; only S2 = false, S5 = true, and S7 = false are labeled, the remaining St are “?”; q ponders which unknown label to query next]
Active Learning Flavors
• Pool (“random access” to patients)
• Sequential (must decide as patients walk in the door)
q?
• Recall: q(W, D) returns the “most interesting” unlabelled example.
• Well, what makes a doctor curious about a patient?
Uncertainty Sampling Example
[figure: predicted probabilities of FALSE vs. TRUE for each unlabeled example; the example closest to 50/50 is the one queried]
Uncertainty Sampling
• GOOD: couldn’t be easier
• GOOD: often performs pretty well
• BAD: H(St) measures information gain about the samples, not the model
• BAD: sensitive to noisy samples
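A minimal sketch of entropy-based uncertainty sampling, with H(St) computed from a single model's predictive distribution; the function names and the probabilistic-classifier interface are illustrative assumptions:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def uncertainty_query(predict_proba, X_pool):
    """Pick the pool example whose predicted class distribution P(St)
    has the highest entropy under the current model."""
    scores = [entropy(predict_proba(x)) for x in X_pool]
    return int(np.argmax(scores))

# Example: three pool points with predictive distributions over {FALSE, TRUE}.
probs = {0: np.array([0.9, 0.1]), 1: np.array([0.5, 0.5]), 2: np.array([0.7, 0.3])}
print(uncertainty_query(lambda i: probs[i], [0, 1, 2]))   # -> 1 (most uncertain)
```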
Model Entropy
[figure: three posteriors P(W | D) over W, from nearly flat (H(W) high), through intermediate (“better”), to sharply peaked (H(W) = 0)]
Information-Gain
• Choose the example that is expected to most reduce H(W)
• I.e., maximize H(W) − H(W | St), where H(W) is the current model-space entropy and H(W | St) is the expected model-space entropy if we learn St
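Spelling the criterion out (a standard expansion, not shown on the slide): the expected posterior entropy averages over the possible values of the label,

```latex
\mathrm{IG}(t) \;=\; H(W) - H(W \mid S_t)
  \;=\; H(W) - \sum_{i} P(S_t = i \mid D)\, H(W \mid S_t = i)
```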
By symmetry of mutual information, H(W) − H(W | St) = H(St) − H(St | W). We usually can’t sum over all models to get H(St | W) exactly… but we can sample from P(W | D).
Score Function
Score(t) = H(St) − H(St | W)
Familiar? The first term, H(St), is exactly the uncertainty-sampling criterion; subtracting H(St | W) discounts examples that remain noisy under every model.
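A hedged sketch of estimating this score by sampling models from the posterior; the sampler and per-model predictive distribution are assumed interfaces, not part of the slides:

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def info_gain_score(sample_posterior, predict_proba, x, n_models=50):
    """Monte Carlo estimate of H(St) - H(St | W) for candidate x.

    sample_posterior(k)      -> list of k models drawn from P(W | D)
    predict_proba(model, x)  -> P(St | x, model) as a probability vector
    """
    models = sample_posterior(n_models)
    per_model = np.array([predict_proba(w, x) for w in models])   # (k, classes)
    h_marginal = entropy(per_model.mean(axis=0))       # H(St) of the averaged P
    h_conditional = np.mean([entropy(p) for p in per_model])  # approx. H(St | W)
    return h_marginal - h_conditional

def best_query(sample_posterior, predict_proba, X_pool):
    scores = [info_gain_score(sample_posterior, predict_proba, x) for x in X_pool]
    return int(np.argmax(scores))
```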
If our objective is to reduce the prediction error, then “the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries”