Active Learning in Comp Bio 02-750. Jaime Carbonell, Language Technologies Institute, Carnegie Mellon University, www.cs.cmu.edu/~jgc, 4 September 2012
Why is Active Learning Important? • Labeled data volumes << unlabeled data volumes • 1.2% of all proteins have known structures • < .01% of all galaxies in the Sloan Sky Survey have consensus type labels • < .0001% of all web pages have topic labels • << E-10% of all internet sessions are labeled as to fraudulence (malware, etc.) • < .0001 of all financial transactions are investigated w.r.t. fraudulence • < .01% of all monolingual text is reliably bilingual • If labeling is costly, or limited, select the instances with maximal impact for learning Jaime G. Carbonell, Language Technologies Institute
Active Learning Relevant to Computational Biology • Protein-structure Learning • Classification into protein family, … • Inferring 3D structure from 1D sequence • Protein-protein interactions (PPIs) • Within-species (e.g. Human) pathways • Host-pathogen (e.g. Human-HIV1) • Evidence-Based Medicine • Cardiology (selecting diagnosis method) • Transplant (selecting immunosuppression)
Active Learning Relevant to Computational Biology • Instance Selection in Classification • Protein family/sub-family classification • Protein structure prediction • Motif/promoter-sequence prediction • Experiment/source Selection • Protein structure: MRI vs X-ray Crystallography • Granularity of Molecular Dynamics • Source of info for PPI network induction • Target/reactant selection in microarrays • Cascaded active learning (based on results of last experimental cycle)
Active Learning • Training data: • Special case: • Functional space: • Fitness Criterion: • a.k.a. loss function • Sampling Strategy:
“Myopic” Sampling Strategies • Random sampling (preserves distribution) • Uncertainty sampling (Lewis, 1996; Tong & Koller, 2000) • proximity to decision boundary • maximal distance to labeled x’s • Density sampling (kNN-inspired; McCallum & Nigam, 2004) • Representative sampling (Xu et al., 2003) • Instability sampling (probability-weighted) • x’s that maximally change decision boundary • Ensemble Strategies • Boosting-like ensemble (Baram, 2003) • DUAL (Donmez & Carbonell, 2007) • Dynamically switches strategies • [See Settles 2010 review of Active Learning]
Which point to sample? [Figure legend: grey = unlabeled, red = class A, brown = class B]
Density-Based Sampling: centroid of largest unsampled cluster
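One way to read "centroid of largest unsampled cluster" is sketched below. It assumes the cluster assignments are already available (for example from k-means, which is not shown), and that "unsampled" means a cluster with no labeled point yet. The function name `density_pick` and its parameters are illustrative, not taken from the slides.

```python
from collections import Counter

def density_pick(points, cluster_ids, labeled):
    """Return the index of the point nearest the centroid of the largest
    cluster that contains no labeled point yet.

    points      -- list of coordinate tuples
    cluster_ids -- cluster id for each point (assumed precomputed)
    labeled     -- set of indices that already have labels
    """
    sampled_clusters = {cluster_ids[i] for i in labeled}
    sizes = Counter(c for c in cluster_ids if c not in sampled_clusters)
    if not sizes:
        raise ValueError("every cluster already contains a labeled point")
    target = max(sizes, key=sizes.get)
    members = [i for i, c in enumerate(cluster_ids) if c == target]
    dim = len(points[0])
    centroid = [sum(points[i][d] for i in members) / len(members)
                for d in range(dim)]

    def sq_dist(i):
        # Squared Euclidean distance from point i to the cluster centroid.
        return sum((points[i][d] - centroid[d]) ** 2 for d in range(dim))

    return min(members, key=sq_dist)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (10, 11)]
cluster_ids = [0, 0, 0, 0, 0, 1, 1]
choice = density_pick(points, cluster_ids, labeled={5})
```

Since the true centroid is rarely a data point, the sketch returns the cluster member closest to it, which is the instance one would actually send for labeling.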
Uncertainty Sampling: closest to decision boundary
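As a minimal sketch of "closest to decision boundary", assume the current classifier is linear, with a hypothetical weight vector `w` and bias `b`. The slides do not fix a classifier, so this is only one instance of the idea.

```python
import math

def uncertainty_pick(unlabeled, w, b):
    """Index of the unlabeled point closest to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))

    def margin(x):
        # Geometric distance from x to the decision boundary.
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

    return min(range(len(unlabeled)), key=lambda i: margin(unlabeled[i]))

unlabeled = [(3.0, 0.0), (0.2, 5.0), (-2.0, 1.0)]
choice = uncertainty_pick(unlabeled, w=(1.0, 0.0), b=0.0)
```

For a probabilistic classifier, the same selection would use the point whose predicted class probability is nearest 0.5.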
Maximal Diversity Sampling: maximally distant from labeled x’s
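"Maximally distant from labeled x's" can be made concrete in more than one way. The sketch below assumes a max-min rule with Euclidean distance: pick the unlabeled point whose nearest labeled neighbour is farthest away. Both choices are assumptions, not something the slide specifies.

```python
import math

def diversity_pick(unlabeled, labeled):
    """Index of the unlabeled point whose nearest labeled point is farthest."""
    def nearest_labeled(x):
        return min(math.dist(x, z) for z in labeled)

    return max(range(len(unlabeled)),
               key=lambda i: nearest_labeled(unlabeled[i]))

labeled = [(0.0, 0.0), (4.0, 0.0)]
unlabeled = [(1.0, 0.0), (2.0, 0.0), (2.0, 3.0)]
choice = diversity_pick(unlabeled, labeled)
```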
Ensemble-Based Possibilities: uncertainty + diversity criteria; density + uncertainty criteria
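A simple way to combine two criteria is a weighted sum of normalized scores. The sketch below is one such combination under stated assumptions: min-max normalization and a fixed, hypothetical `weight` parameter. The slide does not say how the criteria are combined, and published ensembles such as DUAL adapt the weight over time.

```python
def minmax(scores):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combined_pick(uncertainty, diversity, weight=0.5):
    """Index maximizing weight * uncertainty + (1 - weight) * diversity."""
    u, d = minmax(uncertainty), minmax(diversity)
    total = [weight * ui + (1.0 - weight) * di for ui, di in zip(u, d)]
    return max(range(len(total)), key=total.__getitem__)

uncertainty = [0.9, 0.5, 0.1]
diversity = [0.1, 0.8, 1.0]
balanced = combined_pick(uncertainty, diversity, weight=0.5)
```

With `weight=1.0` the rule reduces to pure uncertainty sampling, and with `weight=0.0` to pure diversity sampling.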
Strategy Selection: No Universal Optimum • Optimal operating range for AL sampling strategies differs • How to get the best of both worlds? • (Hint: ensemble methods, e.g. DUAL)
How does DUAL do better? • Runs DWUS until it estimates a cross-over • Monitors the change in expected error at each iteration to detect when it is stuck in a local minimum • DUAL uses a mixture model after the cross-over (saturation) point • Our goal should be to minimize the expected future error • If we knew the future error of Uncertainty Sampling (US) to be zero, then we’d force • But in practice, we do not know it
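The cross-over check can be sketched as a stall detector on the sequence of expected-error estimates. The threshold `delta` and the single-step comparison are assumptions made for illustration; the slide only says that the change in expected error is monitored at each iteration.

```python
def crossover_iteration(expected_errors, delta=1e-3):
    """Return the first iteration t at which the expected error changed by
    less than delta relative to iteration t - 1, or None if it never stalls."""
    for t in range(1, len(expected_errors)):
        if abs(expected_errors[t] - expected_errors[t - 1]) < delta:
            return t
    return None

errors = [0.40, 0.31, 0.25, 0.2495, 0.2494]
switch_at = crossover_iteration(errors)
```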
More on DUAL [ECML 2007] • After cross-over, US does better => uncertainty score should be given more weight • should reflect how well US performs • can be calculated by the expected error of US on the unlabeled data* => • Finally, we have the following selection criterion for DUAL: * US is allowed to choose data only from among the already sampled instances, and is calculated on the remaining unlabeled set to
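The selection formula itself did not survive on this slide, so the following is only a sketch consistent with the bullets: a convex mixture of a density-weighted uncertainty score and a plain uncertainty score, where the weight on plain uncertainty grows as the estimated error of US falls. The names `dwus_scores`, `us_scores`, and `us_error` are hypothetical; see the ECML 2007 paper for the authors' exact criterion.

```python
def dual_style_pick(dwus_scores, us_scores, us_error):
    """Index maximizing us_error * DWUS(x) + (1 - us_error) * US(x).

    us_error is an estimate in [0, 1] of the error of uncertainty sampling;
    when it is 0 the rule reduces to pure uncertainty sampling.
    """
    if not 0.0 <= us_error <= 1.0:
        raise ValueError("us_error must lie in [0, 1]")
    mix = [us_error * d + (1.0 - us_error) * u
           for d, u in zip(dwus_scores, us_scores)]
    return max(range(len(mix)), key=mix.__getitem__)

dwus_scores = [0.9, 0.2]
us_scores = [0.1, 0.8]
pick_when_us_perfect = dual_style_pick(dwus_scores, us_scores, us_error=0.0)
```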
Results: DUAL vs DWUS
Beyond DUAL • Paired Sampling with Geodesic Density Estimation • Donmez & Carbonell, SIAM 2008 • Active Rank Learning • Search results: Donmez & Carbonell, WWW 2008 • In general: Donmez & Carbonell, ICML 2008 • Structure Learning • Inferring 3D protein structure from 1D sequence • Remains an open problem
Issues in Active Learning • Abundance of unlabeled examples • Paucity of labeled examples • High cost of labeling (experimentation, expert) • Selection of appropriate sampling strategies • Dependency on underlying ML method • What if labeling noise, variable costs, …? • Applications abound, including in Comp Bio • Tertiary/quaternary protein structure prediction • Protein-protein interaction prediction • Drug target selection
Readings • Settles, B. “Comprehensive Survey of AL” http://www.cs.cmu.edu/~bsettles/pub/settles.activelearning.pdf • Donmez, P., Carbonell, J. and Bennett, P. “Dual-Strategy Active Learning” http://www.cs.cmu.edu/~jgc/publication/Dual_Strategy_ECML_2007.pdf • Cohn, D., Ghahramani, Z. and Jordan, M. “Active Learning with Statistical Models” http://dspace.mit.edu/bitstream/handle/1721.1/7192/AIM-1522.pdf;jsessionid=13C2A9BF0DEC1567B9CA33F0C43BC3C3?sequence=2
THANK YOU!
Active Sampling for RankSVM I • Consider a candidate • Assume is added to training set with • Total loss on pairs that include is: • n is the # of training instances with a different label than • Objective function to be minimized becomes:
Active Sampling for RankSVM II • Assume the current ranking function is • There are two possible cases: • Assume • Derivative w.r.t at a single point or
Active Sampling for RankSVM III • Substitute in the previous equation to estimate • Magnitude of the total derivative • estimates the ability of to change the current ranker if added into training • Finally,
Active Sampling for RankBoost I • Again, estimate how the current ranker would change if was in the training set • Estimate this change by the difference in ranking loss before and after is added • Ranking loss w.r.t is (Freund et al., 2003):
Active Sampling for RankBoost II • Difference in the ranking loss between the current and the enlarged set: • indicates how much the current ranker needs to change to compensate for the loss introduced by the new instance • Finally, the instance with the highest loss differential is sampled:
Performance Measures • MAP (Mean Average Precision) • MAP is the average of AP values for all queries • NDCG (Normalized Discounted Cumulative Gain) • The impact of each relevant document is discounted as a function of rank position
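Both measures can be computed in a few lines. The sketch below assumes binary relevance for AP, assumes each ranked list contains all relevant documents for its query, and uses the common NDCG variant with gain 2^rel - 1 and a log2(rank + 1) discount. Other variants exist, and the slides do not say which one was used.

```python
import math

def average_precision(relevance):
    """AP for one query; relevance is a binary list in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """MAP: the mean of AP over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

def dcg(gains):
    """Discounted cumulative gain of graded relevance values in ranked order."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

map_score = mean_average_precision([[1, 0, 1, 0], [0, 1]])
perfect = ndcg([3, 2, 1])
```

Here the first query has AP = (1/1 + 2/3) / 2 and the second has AP = 1/2, so `map_score` is their mean.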
Results on TREC03