Active Learning
COMS 6998-4: Learning and Empirical Inference
Irina Rish, IBM T.J. Watson Research Center
Outline
• Motivation
• Active learning approaches
  • Membership queries
  • Uncertainty Sampling
  • Information-based loss functions
  • Uncertainty-Region Sampling
  • Query by committee
• Applications
  • Active Collaborative Prediction
  • Active Bayes net learning
Standard supervised learning model
Given m labeled points, want to learn a classifier with misclassification rate ≤ ε, chosen from a hypothesis class H with VC dimension d < ∞.
VC theory: need m to be roughly d/ε, in the realizable case.
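For reference, the realizable-case bound is usually written out as below; this is the standard textbook form (constants and log factors vary by source), not a formula taken from the slides:

```latex
% Realizable case: with probability >= 1 - \delta, error <= \epsilon once
m \;=\; O\!\left(\frac{1}{\epsilon}\left(d\,\log\frac{1}{\epsilon} \;+\; \log\frac{1}{\delta}\right)\right)
```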
Active learning
In many situations, such as speech recognition and document retrieval, unlabeled data is easy to come by, but there is a charge for each label. What is the minimum number of labels needed to achieve the target error rate?
What is Active Learning?
• Unlabeled data are readily available; labels are expensive
• Want to use adaptive decisions to choose which labels to acquire for a given dataset
• Goal is an accurate classifier at minimal cost
Active learning warning
• The choice of data is only as good as the model itself
• Assume a linear model: then two data points suffice to determine it
• But what happens when the data are not linear?
Active Learning Flavors
• Membership queries vs. selective sampling
• Selective sampling: pool-based vs. sequential
• Myopic vs. batch queries
Active Learning Approaches
• Membership queries
• Uncertainty Sampling
• Information-based loss functions
• Uncertainty-Region Sampling
• Query by committee
Problem
Many results exist in this framework, even for complicated hypothesis classes, but the synthesized queries can be meaningless to human labelers.
[Baum and Lang, 1991] tried fitting a neural net to handwritten characters: the synthetic instances created were incomprehensible to humans.
[Lewis and Gale, 1992] tried training text classifiers: “an artificial text created by a learning algorithm is unlikely to be a legitimate natural language expression, and probably would be uninterpretable by a human teacher.”
Uncertainty Sampling [Lewis & Gale, 1994]
• Query the event that the current classifier is most uncertain about
• Used trivially in SVMs, graphical models, etc.
If uncertainty is measured in Euclidean distance, the query is the unlabeled point closest to the decision boundary:
[figure: scatter of labeled points with the query chosen nearest the boundary]
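A minimal sketch of margin-based uncertainty sampling with a linear SVM (scikit-learn and the toy data are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

def query_most_uncertain(clf, X_pool):
    """Return the index of the unlabeled point closest to the
    decision boundary (smallest absolute decision value)."""
    margins = np.abs(clf.decision_function(X_pool))
    return int(np.argmin(margins))

# Illustrative data: a few labeled points and an unlabeled pool.
X_labeled = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_labeled = np.array([0, 0, 1, 1])
X_pool = np.array([[0.5, 0.5], [0.1, 0.0], [1.2, 1.0]])

clf = SVC(kernel="linear").fit(X_labeled, y_labeled)
t_star = query_most_uncertain(clf, X_pool)
print("query point:", X_pool[t_star])   # expect the mid-point [0.5, 0.5]
```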
Information-based Loss Functions [MacKay, 1992]
• Maximize KL-divergence between posterior and prior
• Maximize reduction in model entropy between posterior and prior
• Minimize cross-entropy between posterior and prior
• All of these are notions of information gain
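As a sketch of why these notions coincide (a standard identity, not shown on the slide): writing p(w) for the prior over models and p(w | y) for the posterior after observing label y, the expected KL gain equals the expected entropy reduction, i.e. the mutual information between model and label:

```latex
\mathbb{E}_{y}\!\left[\mathrm{KL}\!\left(p(w \mid y)\,\|\,p(w)\right)\right]
  \;=\; H(W) \;-\; \mathbb{E}_{y}\!\left[H(W \mid y)\right]
  \;=\; I(W; Y)
```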
Query by Committee [Seung et al. 1992, Freund et al. 1997]
• Prior distribution over hypotheses
• Sample a set of classifiers from the distribution
• Query an example based on the degree of disagreement between the committee of classifiers
[figure: committee members A, B, C with differing boundaries; disagreement is highest on points between them]
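A minimal sketch of query-by-committee with vote entropy; here the committee is approximated by bootstrap-trained classifiers rather than true posterior samples, and scikit-learn plus the synthetic data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def qbc_query(committee, X_pool):
    """Return the index of the pool point with the highest vote entropy."""
    votes = np.stack([clf.predict(X_pool) for clf in committee])  # (C, N)
    entropies = []
    for col in votes.T:                       # votes for one pool point
        _, counts = np.unique(col, return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log(p)).sum())
    return int(np.argmax(entropies))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_pool = rng.normal(size=(20, 2))

# Approximate sampling from the posterior with bootstrap resamples.
committee = []
for _ in range(5):
    idx = rng.integers(0, len(X), len(X))
    committee.append(LogisticRegression().fit(X[idx], y[idx]))

print("query index:", qbc_query(committee, X_pool))
```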
Notation
We have:
• Dataset, D
• Model parameter space, W
• Query algorithm, q
Model Example
Probabilistic classifier notation (graphical model: St → Ot):
• T : number of examples
• Ot : vector of features of example t
• St : class of example t
Model Example
Patient state (St):
• St : DiseaseState
Patient observations (Ot):
• Ot1 : Gender
• Ot2 : Age
• Ot3 : TestA
• Ot4 : TestB
• Ot5 : TestC
Possible Model Structures
[figure: two candidate network structures over S and the observations Gender, Age, TestA, TestB, TestC, differing in which arcs connect the variables]
Model Space
Model (graphical model: St → Ot)
Model parameters: P(St), P(Ot | St)
Generative model: must be able to compute P(St = i, Ot = ot | w)
Model Parameter Space (W)
• W = space of possible parameter values
• Prior on parameters: P(W)
• Posterior over models: P(W | D) ∝ P(D | W) P(W)
Notation
We have:
• Dataset, D
• Model parameter space, W
• Query algorithm, q
q(W, D) returns t*, the next sample to label
Game
while NotDone:
• Learn P(W | D)
• q chooses the next example to label
• Expert adds the label to D
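A minimal Python sketch of this loop; the learner, query function, and oracle are hypothetical placeholders standing in for the slide's P(W | D), q, and the human expert:

```python
def active_learning_loop(learn, q, expert, D, pool, budget):
    """Generic pool-based active learning 'game'.

    learn(D)        -> model (stands in for learning P(W | D))
    q(model, pool)  -> index t* of the next example to label
    expert(x)       -> label for example x (the human oracle)
    """
    for _ in range(budget):          # "while NotDone", with a label budget
        model = learn(D)             # learn P(W | D)
        t_star = q(model, pool)      # q chooses the next example
        x = pool.pop(t_star)
        D.append((x, expert(x)))     # expert adds the label to D
    return learn(D)
```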
Simulation
[figure: seven examples with observed features O1–O7; only S2 = false, S5 = true, and S7 = false are labeled, the remaining St are “?”; q ponders which unknown label to query next]
Active Learning Flavors
• Pool (“random access” to patients)
• Sequential (must decide as patients walk in the door)
q?
• Recall: q(W, D) returns the “most interesting” unlabelled example.
• Well, what makes a doctor curious about a patient?
Uncertainty Sampling Example
[figure: predicted probabilities of FALSE vs. TRUE for each unlabeled example; the example closest to 50/50 is the one queried]
Uncertainty Sampling
• GOOD: couldn’t be easier
• GOOD: often performs pretty well
• BAD: H(St) measures information gain about the samples, not the model
• BAD: sensitive to noisy samples
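A minimal sketch of entropy-based uncertainty sampling, with H(St) computed from a single model's predictive distribution; the function names and the probabilistic-classifier interface are illustrative assumptions:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def uncertainty_query(predict_proba, X_pool):
    """Pick the pool example whose predicted class distribution P(St)
    has the highest entropy under the current model."""
    scores = [entropy(predict_proba(x)) for x in X_pool]
    return int(np.argmax(scores))

# Example: three pool points with predictive distributions over {FALSE, TRUE}.
probs = {0: np.array([0.9, 0.1]), 1: np.array([0.5, 0.5]), 2: np.array([0.7, 0.3])}
print(uncertainty_query(lambda i: probs[i], [0, 1, 2]))   # -> 1 (most uncertain)
```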
Model Entropy
[figure: three posteriors P(W | D) over W, from nearly flat (H(W) high), through intermediate (“better”), to sharply peaked (H(W) = 0)]
Information-Gain
• Choose the example that is expected to most reduce H(W)
• I.e., maximize H(W) − H(W | St), where H(W) is the current model-space entropy and H(W | St) is the expected model-space entropy if we learn St
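Spelling the criterion out (a standard expansion, not shown on the slide): the expected posterior entropy averages over the possible values of the label,

```latex
\mathrm{IG}(t) \;=\; H(W) - H(W \mid S_t)
  \;=\; H(W) - \sum_{i} P(S_t = i \mid D)\, H(W \mid S_t = i)
```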
By symmetry of mutual information, H(W) − H(W | St) = H(St) − H(St | W). We usually can’t sum over all models to get H(St | W) exactly… but we can sample from P(W | D).
Score Function
Score(t) = H(St) − H(St | W)
Familiar? The first term, H(St), is exactly the uncertainty-sampling criterion; subtracting H(St | W) discounts examples that remain noisy under every model.
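A hedged sketch of estimating this score by sampling models from the posterior; the sampler and per-model predictive distribution are assumed interfaces, not part of the slides:

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return float(-(p * np.log(p)).sum())

def info_gain_score(sample_posterior, predict_proba, x, n_models=50):
    """Monte Carlo estimate of H(St) - H(St | W) for candidate x.

    sample_posterior(k)      -> list of k models drawn from P(W | D)
    predict_proba(model, x)  -> P(St | x, model) as a probability vector
    """
    models = sample_posterior(n_models)
    per_model = np.array([predict_proba(w, x) for w in models])   # (k, classes)
    h_marginal = entropy(per_model.mean(axis=0))       # H(St) of the averaged P
    h_conditional = np.mean([entropy(p) for p in per_model])  # approx. H(St | W)
    return h_marginal - h_conditional

def best_query(sample_posterior, predict_proba, X_pool):
    scores = [info_gain_score(sample_posterior, predict_proba, x) for x in X_pool]
    return int(np.argmax(scores))
```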
If our objective is to reduce the prediction error, then “the expected information gain of an unlabeled sample is NOT a sufficient criterion for constructing good queries”