1 / 108

Active Learning COMS 6998-4: Learning and Empirical Inference

Active Learning COMS 6998-4: Learning and Empirical Inference. Irina Rish IBM T.J. Watson Research Center. Outline. Motivation Active learning approaches Membership queries Uncertainty Sampling Information-based loss functions Uncertainty-Region Sampling Query by committee

clara
Download Presentation

Active Learning COMS 6998-4: Learning and Empirical Inference

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Active LearningCOMS 6998-4: Learning and Empirical Inference Irina Rish IBM T.J. Watson Research Center

  2. Outline • Motivation • Active learning approaches • Membership queries • Uncertainty Sampling • Information-based loss functions • Uncertainty-Region Sampling • Query by committee • Applications • Active Collaborative Prediction • Active Bayes net learning

  3. Standard supervised learning model Given m labeled points, want to learn a classifier with misclassification rate <, chosen from a hypothesis class H with VC dimension d < 1. VC theory: need m to be roughly d/, in the realizable case.

  4. Active learning In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?

  5. What is Active Learning? • Unlabeled data are readily available; labels are expensive • Want to use adaptive decisions to choose which labels to acquire for a given dataset • Goal is accurate classifier with minimal cost

  6. Active learning warning • Choice of data is only as good as the model itself • Assume a linear model, then two data points are sufficient • What happens when data are not linear?

  7. Active Learning Flavors Selective Sampling Membership Queries Pool Sequential Myopic Batch

  8. Active Learning Approaches • Membership queries • Uncertainty Sampling • Information-based loss functions • Uncertainty-Region Sampling • Query by committee

  9. Problem Many results in this framework, even for complicated hypothesis classes. [Baum and Lang, 1991] tried fitting a neural net to handwritten characters. Synthetic instances created were incomprehensible to humans! [Lewis and Gale, 1992] tried training text classifiers. “an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher.”

  10. Uncertainty Sampling [Lewis & Gale, 1994] • Query the event that the current classifier is most uncertain about • Used trivially in SVMs, graphical models, etc. If uncertainty is measured in Euclidean distance: x x x x x x x x x x

  11. Information-based Loss Function [MacKay, 1992] • Maximize KL-divergence between posterior and prior • Maximize reduction in model entropy between posterior and prior • Minimize cross-entropy between posterior and prior • All of these are notions of information gain

  12. Query by Committee [Seung et al. 1992, Freund et al. 1997] • Prior distribution over hypotheses • Samples a set of classifiers from distribution • Queries an example based on the degree of disagreement between committee of classifiers x x x x x x x x x x C B A

  13. Infogain-based Active Learning

  14. Notation • We Have: • Dataset, D • Model parameter space, W • Query algorithm, q

  15. Dataset (D) Example

  16. Notation • We Have: • Dataset, D • Model parameter space, W • Query algorithm, q

  17. Model Example St Ot Probabilistic Classifier Notation T : Number of examples Ot : Vector of features of example t St : Class of example t

  18. Model Example Patient state (St) St : DiseaseState Patient Observations (Ot) Ot1 : Gender Ot2 : Age Ot3 : TestA Ot4 : TestB Ot5 : TestC

  19. Gender TestB Age Gender S TestA S TestA Age TestB TestC TestC Possible Model Structures

  20. Model Space St Ot Model: Model Parameters: P(St) P(Ot|St) Generative Model: Must be able to compute P(St=i, Ot=ot | w)

  21. Model Parameter Space (W) • W = space of possible parameter values • Prior on parameters: • Posterior over models:

  22. Notation • We Have: • Dataset, D • Model parameter space, W • Query algorithm, q q(W,D) returns t*, the next sample to label

  23. Game while NotDone • Learn P(W | D) • q chooses next example to label • Expert adds label to D

  24. Simulation ? ? S7=false S2=false S5=true ? O1 O2 O3 O4 O5 O6 O7 S1 S2 S3 S4 S5 S6 S7 hmm… q

  25. Active Learning Flavors • Pool (“random access” to patients) • Sequential (must decide as patients walk in the door)

  26. q? • Recall: q(W,D) returns the “most interesting” unlabelled example. • Well, what makes a doctor curious about a patient?

  27. 1994

  28. Score Function

  29. Uncertainty Sampling Example FALSE

  30. Uncertainty Sampling Example FALSE TRUE

  31. Uncertainty Sampling GOOD: couldn’t be easier GOOD: often performs pretty well BAD: H(St) measures information gain about the samples, not the model Sensitive to noisy samples

  32. Can we do better thanuncertainty sampling?

  33. Informative with respect to what? 1992

  34. Model Entropy P(W|D) P(W|D) P(W|D) W W W H(W) = high …better… H(W) = 0

  35. Current model space entropy Expected model space entropy if we learn St Information-Gain • Choose the example that is expected to most reduce H(W) • I.e., Maximize H(W) – H(W | St)

  36. Score Function

  37. We usually can’t just sum over all models to get H(St|W) …but we can sample from P(W | D)

  38. Conditional Model Entropy

  39. Score Function

  40. Score Function Familiar?

  41. Uncertainty Sampling & Information Gain

  42. But there is a problem…

  43. If our objective is to reduce the prediction error, then “the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries”

More Related