Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles

Pinar Donmez and Jaime Carbonell • Language Technologies Institute, School of Computer Science, Carnegie Mellon University • CIKM ’08, Napa Valley, October 2008

Active Learning Assumptions and Real World • Active learning assumes: a unique oracle; a perfect oracle (always right, never tired); an oracle that works for free or charges uniformly • Real world: multiple sources of information; imperfect oracles (unreliable, reluctant); oracles that are expensive or charge non-uniformly

Solution: Proactive Learning • Proactive learning is a generalization of active learning that relaxes these assumptions • A decision-theoretic framework to jointly optimize the instance-oracle pair • A utility optimization problem under a fixed budget constraint

Outline • Methodology • 3 Scenarios • Reluctance • Fallibility • Variable and Fixed Cost • Evaluation • Problem Setup • Datasets • Results • Conclusion

Scenario 1: Reluctance • 2 oracles: • reliable oracle: expensive but always answers with a correct label • reluctant oracle: cheap but may not respond to some queries • Define a utility score as expected value of information at unit cost
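A minimal sketch of how such a utility score could drive the joint choice of instance and oracle, assuming the score has the form (answer probability × information value) / cost. The slide does not preserve the exact formula, so `utility`, `best_pair`, and the example numbers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def utility(p_answer: float, value: float, cost: float) -> float:
    """Expected value of information per unit cost (assumed form)."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return p_answer * value / cost


def best_pair(
    values: List[float],
    p_answer: Dict[str, List[float]],
    costs: Dict[str, float],
) -> Optional[Tuple[int, str, float]]:
    """Return (instance index, oracle name, utility) of the best pair."""
    best = None
    for oracle, cost in costs.items():
        for i, v in enumerate(values):
            u = utility(p_answer[oracle][i], v, cost)
            if best is None or u > best[2]:
                best = (i, oracle, u)
    return best


# Hypothetical numbers: the reliable oracle always answers but costs 3x more.
values = [0.2, 0.9, 0.5]
p_answer = {"reliable": [1.0, 1.0, 1.0], "reluctant": [0.9, 0.3, 0.8]}
costs = {"reliable": 3.0, "reluctant": 1.0}
choice = best_pair(values, p_answer, costs)
```

With these numbers the most informative instance (index 1) is not selected, because the reluctant oracle is unlikely to answer it and the reliable oracle is expensive; the cheaper, likely-answered instance 2 wins.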

How to simulate oracle unreliability? • Unreliability may depend on factors such as query difficulty (hard to classify) or complexity of the data (requires long and time-consuming analysis) • In this work, we model it based on query difficulty • Assumptions • Perfect oracle ~ classifier having zero training error on the entire data • Imperfect oracle ~ weak classifier trained on a subset of the entire data • Train a logistic regression classifier on the subset to obtain • Identify instances with • These are the unreliable instances • Challenge: tradeoff between the information value of an instance and the reliability of the oracle

How to estimate the reluctant oracle's answer probability? • Cluster unlabeled data using k-means • Ask the reluctant oracle for the label of each cluster centroid • If a label is received: increase the estimate for nearby points • If no label: decrease the estimate for nearby points • The update sign equals 1 when a label is received, -1 otherwise • The number of clusters depends on the clustering budget and the oracle fee
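The centroid-based update can be illustrated with a small sketch. The update rule here is an assumption, not the formula from the talk (which the transcript does not preserve): each point's estimate moves up or down by a step scaled by its closeness to the queried centroid. The names `update_answer_prob` and `step` are hypothetical, and the centroid is taken as given rather than computed by k-means.

```python
import math
from typing import List, Sequence


def update_answer_prob(
    points: List[Sequence[float]],
    p_answer: List[float],
    centroid: Sequence[float],
    got_label: bool,
    step: float = 0.3,
) -> List[float]:
    """Shift answer-probability estimates of points near a queried centroid.

    Assumed rule: sign is +1 if the oracle answered, -1 otherwise, and the
    shift shrinks linearly with distance from the centroid.
    """
    sign = 1 if got_label else -1
    dists = [math.dist(x, centroid) for x in points]
    max_d = max(dists) or 1.0  # avoid division by zero if all points coincide
    updated = []
    for p, d in zip(p_answer, dists):
        closeness = 1.0 - d / max_d  # 1 at the centroid, 0 at the farthest point
        updated.append(min(1.0, max(0.0, p + sign * step * closeness)))
    return updated


points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
prior = [0.5, 0.5, 0.5]
after_answer = update_answer_prob(points, prior, (0.0, 0.0), got_label=True)
after_silence = update_answer_prob(points, prior, (0.0, 0.0), got_label=False)
```

Points close to the centroid move the most, and the farthest point keeps its prior estimate.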

The algorithm works in rounds until the budget is exhausted • At each round, sampling continues until a label is obtained • Be careful: you may spend the entire budget on a single attempt • If no label is obtained, decrease the utility of the remaining instances; this is adaptive penalization of the reluctant oracle
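The round structure can be sketched as a budgeted loop over two oracles. This is an illustration under stated assumptions, not the authors' algorithm: the utility is taken as (answer probability × value) / cost, the penalty is a multiplicative discount on the reluctant oracle that persists across rounds, and `run_rounds`, `penalty`, and the oracle names are hypothetical.

```python
import random


def run_rounds(value, p_answer, costs, budget, penalty=0.5, rng=None):
    """Query (instance, oracle) pairs until the budget cannot pay any oracle.

    value:    instance id -> information value
    p_answer: instance id -> estimated answer probability of the reluctant oracle
    costs:    {"reliable": fee, "reluctant": fee}
    """
    if rng is None:
        rng = random.Random(0)
    remaining = set(value)
    discount = 1.0  # shrinks each time the reluctant oracle stays silent
    labeled = []
    while remaining and budget >= min(costs.values()):
        best = None
        for x in sorted(remaining):  # sorted for deterministic tie-breaking
            for oracle, cost in costs.items():
                if cost > budget:
                    continue  # skip oracles we can no longer afford
                p = 1.0 if oracle == "reliable" else discount * p_answer[x]
                u = p * value[x] / cost
                if best is None or u > best[0]:
                    best = (u, x, oracle)
        _, x, oracle = best
        budget -= costs[oracle]  # the fee is paid whether or not a label arrives
        if oracle == "reliable" or rng.random() < p_answer[x]:
            labeled.append((x, oracle))
            remaining.discard(x)
        else:
            discount *= penalty  # adaptive penalization of the reluctant oracle
    return labeled, budget
```

After enough unanswered queries the discounted utility of the reluctant oracle drops below that of the reliable one, so the loop switches oracles instead of draining the budget on repeated failed attempts.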

Algorithm for Scenario 1

Scenario 2: Fallibility • 2 oracles: • One perfect but expensive oracle • One fallible but cheap oracle that always answers • The algorithm is similar to Scenario 1, with slight modifications • During exploration: • Fallible oracle provides the label with its confidence • Confidence = of fallible oracle • If then we don’t use the label but we still update
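One way to read the exploration rule is as a confidence filter: a low-confidence label is discarded, yet the query still yields a signal for updating the oracle's reliability estimate. The sketch below assumes a fixed cutoff; the transcript does not preserve the actual condition, so `threshold`, the ±1 signal, and `handle_fallible_answer` are hypothetical.

```python
from typing import Optional, Tuple


def handle_fallible_answer(
    label: int, confidence: float, threshold: float = 0.6
) -> Tuple[Optional[int], int]:
    """Return (label to keep or None, signal for the reliability update).

    Assumption: answers below `threshold` are treated as unreliable, so the
    label is dropped while the -1 signal still feeds the reliability estimate.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence < threshold:
        return None, -1
    return label, 1


kept = handle_fallible_answer(label=1, confidence=0.9)
dropped = handle_fallible_answer(label=1, confidence=0.55)
```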

Outline of Scenario 2

Scenario 3: Non-uniform Cost • Uniform cost: Fraud detection, face recognition, etc. • Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc. • 2 oracles: • Fixed-cost Oracle • Variable-cost Oracle

Outline of Scenario 3

Evaluation • Datasets: Face detection, UCI Letter (V-vs-Y), Spambase, and UCI Adult

Oracle Properties and Costs • The cost is inversely proportional to reliability • Higher costs for the fallible oracle since a noisy label should be penalized more than no label at all • Cost ratio creates an incentive to choose between oracles

Underlying Sampling Strategy • Conditional entropy based sampling, weighted by a density measure • Captures the information content of a close neighborhood (the close neighbors of x)
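A rough sketch of a density-weighted uncertainty score for a binary problem. It is a simplified stand-in, not the formula from the talk: the score adds the entropy of each close neighbor, weighted by a Gaussian-style proximity term, to the entropy of x itself. The names `density_weighted_score` and `posterior` are hypothetical.

```python
import math
from typing import Callable, Iterable, Sequence


def entropy(p: float) -> float:
    """Binary entropy in bits of a positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def density_weighted_score(
    x: Sequence[float],
    neighbors: Iterable[Sequence[float]],
    posterior: Callable[[Sequence[float]], float],
) -> float:
    """Uncertainty of x plus proximity-weighted uncertainty of its neighbors."""
    score = entropy(posterior(x))
    for n in neighbors:
        weight = math.exp(-math.dist(x, n) ** 2)  # nearer neighbors count more
        score += weight * entropy(posterior(n))
    return score


# Hypothetical classifier that is maximally uncertain everywhere.
uncertain = lambda z: 0.5
dense = density_weighted_score((0.0, 0.0), [(0.0, 0.0)], uncertain)
sparse = density_weighted_score((0.0, 0.0), [(10.0, 0.0)], uncertain)
```

An uncertain point surrounded by nearby uncertain points scores higher than an equally uncertain but isolated one, which is the effect the density weighting is meant to capture.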

Results: Overall and Reluctance on Spambase Data

Results: Reluctance

Cost varies non-uniformly • Statistically significant results (p<0.01)

More light on the clustering step • Run each baseline without the clustering step • Entire budget is spent in rounds for data elicitation • No separate clustering budget • Results on Spambase under Scenario 1, cost 1:3

Conclusion • Address issues with the assumptions of active learning • Introduction to a Proactive Learning framework • Analysis of imperfect oracles with differing properties and costs • Expected utility maximization across oracle-instance pairs • Effective against exploitation of a single oracle
