Machine Learning Challenges Comp Bio 02-750

Machine Learning Challenges Comp Bio 02-750
Jaime Carbonell, Language Technologies Institute
Carnegie Mellon University
www.cs.cmu.edu/~jgc
6 September 2012

Today’s topics
• Active Learning Beyond Classification
  • Rank Learning
  • Active Rank Learning
• Coping with Missing Values
  • Imputation to the mean
  • More Advanced Imputation
• Coping with imbalanced classes
  • Minority class discovery & classification
• Protein-Protein Interactions: Case in point
Jaime G. Carbonell, Language Technologies Institute

Active Sampling for RankSVM I
• Consider a candidate […]
• Assume […] is added to the training set with […]
• Total loss on pairs that include […] is: […]
• n is the number of training instances with a different label than […]
• Objective function to be minimized becomes: […]

Active Sampling for RankSVM II
• Assume the current ranking function is […]
• There are two possible cases: […]
• Assume […]
• Derivative w.r.t. […] at a single point: […] or […]

Active Sampling for RankSVM III
• Substitute in the previous equation to estimate […]
• Magnitude of the total derivative […]
• […] estimates the ability of […] to change the current ranker if added into training
• Finally, […]
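The slide formulas did not survive in this transcript, so the following is only a minimal sketch of the idea stated in the bullets: score a candidate by the magnitude of the total derivative of the pairwise hinge loss it would add. The linear ranker `w`, the standard pairwise hinge loss, the single `assumed_label` argument, and all function names are assumptions made for illustration; they are not taken from the slides.

```python
import math

def hinge_pair_gradient(w, x_high, x_low):
    """Subgradient w.r.t. w of the pairwise hinge loss [1 - <w, x_high - x_low>]_+."""
    diff = [a - b for a, b in zip(x_high, x_low)]
    margin = sum(wi * di for wi, di in zip(w, diff))
    if margin >= 1.0:
        # Case 1: the pair is already ranked with enough margin, so no gradient.
        return [0.0] * len(w)
    # Case 2: the pair violates the margin.
    return [-d for d in diff]

def gradient_magnitude(w, candidate, assumed_label, train):
    """Norm of the summed subgradients over all pairs that include the candidate.

    train is a list of (features, label); only instances whose label differs
    from assumed_label form a pair with the candidate.
    """
    total = [0.0] * len(w)
    for x, y in train:
        if y == assumed_label:
            continue
        if assumed_label > y:
            g = hinge_pair_gradient(w, candidate, x)
        else:
            g = hinge_pair_gradient(w, x, candidate)
        total = [t + gi for t, gi in zip(total, g)]
    return math.sqrt(sum(t * t for t in total))

def select_candidate(w, pool, assumed_label, train):
    """Index of the pool instance with the largest gradient magnitude."""
    return max(range(len(pool)),
               key=lambda i: gradient_magnitude(w, pool[i], assumed_label, train))
```

A candidate that the current ranker already places correctly with a wide margin gets a score of zero, so it would not be queried; candidates that violate the margin against many training instances score highest.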

Active Sampling for RankBoost I
• Again, estimate how the current ranker would change if […] were in the training set
• Estimate this change by the difference in ranking loss before and after […] is added
• Ranking loss w.r.t. […] is (Freund et al., 2003): […]

Active Sampling for RankBoost II
• Difference in the ranking loss between the current and the enlarged set: […]
• […] indicates how much the current ranker needs to change to compensate for the loss introduced by the new instance
• Finally, the instance with the highest loss differential is sampled: […]
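As a rough illustration of the loss differential described above, the sketch below measures ranking loss as the fraction of misordered pairs and compares it before and after a candidate is added. Uniform pair weights, the hypothetical scoring function `H`, and the `assumed_label` argument are assumptions; the exact weighting on the slides was not recovered.

```python
def ranking_loss(scores, labels):
    """Fraction of pairs (i, j) with labels[i] < labels[j] that the scores misorder."""
    n = len(scores)
    pairs = [(i, j) for i in range(n) for j in range(n) if labels[i] < labels[j]]
    if not pairs:
        return 0.0
    # A pair is misordered when the higher-labelled item does not score strictly higher.
    bad = sum(1 for i, j in pairs if scores[j] <= scores[i])
    return bad / len(pairs)

def loss_differential(H, train_x, train_y, candidate, assumed_label):
    """Ranking loss on the enlarged set minus ranking loss on the current set."""
    scores = [H(x) for x in train_x]
    before = ranking_loss(scores, train_y)
    after = ranking_loss(scores + [H(candidate)], train_y + [assumed_label])
    return after - before

def select_by_loss_differential(H, train_x, train_y, pool, assumed_label):
    """Index of the pool instance with the highest loss differential."""
    return max(range(len(pool)),
               key=lambda i: loss_differential(H, train_x, train_y,
                                               pool[i], assumed_label))
```

The ranker is held fixed here, so the differential only reflects the new pairs the candidate introduces.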

Performance Measures
• MAP (Mean Average Precision)
  • MAP is the average of AP values for all queries
• NDCG (Normalized Discounted Cumulative Gain)
  • The impact of each relevant document is discounted as a function of rank position
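A small sketch of both measures may help make the definitions concrete. It assumes that every relevant document appears somewhere in the ranked list (so AP is normalized by the number of relevant documents seen) and uses the common gain 2^rel - 1 with a log2 rank discount for NDCG; the slide does not specify the gain or discount, so those are assumptions.

```python
import math

def average_precision(rels):
    """AP for one query; rels is a ranked list of 0/1 relevance judgments."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / hits if hits else 0.0

def mean_average_precision(queries):
    """MAP: the mean of AP over all queries."""
    return sum(average_precision(r) for r in queries) / len(queries)

def dcg(gains):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2(rank + 1) discount."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

For example, `average_precision([1, 0, 1])` averages the precision at ranks 1 and 3, that is (1/1 + 2/3) / 2.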

Results on TREC03

What is Missing?
• In active learning the category label is missing, and we can query an oracle, mindful of cost
• What else can be missing?
  • Features: we may not have enough for prediction
  • Feature combinations: beyond those the classifier is able to generate automatically (e.g. XOR, ratios)
  • Values of features: not all instances have values for all their features
  • Feature relevance: some features are noisy or irrelevant
  • Feature redundancy: e.g. high feature covariance

Reducing the Feature Space
• Feature selection
  • Subsample features using IG, MI, …
  • Well studied, e.g. Yang & Pedersen ICML 1997
• Wrapper methods
  • Inefficient but accurate, less studied
• Feature projection (to lower dimensions)
  • LDA, SVD, LSI
  • Slow, well studied, e.g. Falluchi et al 2009
• Kernel functions on feature sub-spaces

Missing Feature Values
• Active learning of features
  • Not as extensively studied as active instance learning (see Saar-Tsechansky et al, 2007)
  • Determines which feature values to seek for given instances, or which features across the board
  • Can be combined with active instance learning
• But, what if there is no oracle?
  • Impossible to get feature values
  • Too costly or too time consuming
• Do we ignore instances with missing features?

Missing Data

How to Cope with Missing Features
• ML training assumes feature completeness
  • Filter out features that are mostly missing
  • Filter out instances with missing features
  • Impute values for missing features
  • Radically change ML algorithms
• When do we do each of the above?
  • With lots of data and few missing features…
  • With sparse training data and few missing…
  • With sparse data and mostly missing features…

Missing Feature Imputation
• How do we estimate missing feature values?
  • Infer the mean value across all instances
  • Infer the mean value in neighborhood
  • Apply a classifier with other features as input and missing feature value as y (label)
• How do we know if it makes a difference?
  • Sensitivity analysis (extrema, perturbations)
  • Train without instances with missing features vs instances with imputed values for missing features
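The first two estimation strategies can be sketched in a few lines. This is a minimal illustration, assuming missing values are stored as `None`, only one column has gaps, the remaining columns are complete, and "neighborhood" means the k nearest rows by squared Euclidean distance; the function names are made up for the example.

```python
def impute_mean(rows, col):
    """Fill missing entries of one column with the mean of its observed values."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [[mean if (j == col and v is None) else v for j, v in enumerate(r)]
            for r in rows]

def impute_neighborhood(rows, col, k=2):
    """Fill missing entries with the mean over the k nearest rows that have a value.

    Distance is computed on all columns except the one being imputed.
    """
    donors = [r for r in rows if r[col] is not None]
    out = []
    for r in rows:
        r = list(r)
        if r[col] is None:
            def dist(d, r=r):
                return sum((a - b) ** 2
                           for j, (a, b) in enumerate(zip(r, d)) if j != col)
            nearest = sorted(donors, key=dist)[:k]
            r[col] = sum(d[col] for d in nearest) / len(nearest)
        out.append(r)
    return out
```

Training once on the rows without gaps and once on each imputed version gives the with/without comparison suggested in the last bullet.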

More on Missing Values
• Missing Completely at Random (MCAR)
  • It is generally impossible to prove MCAR or MAR
• Missing at Random (MAR)
  • Statisticians assume MAR as default
• Missing values that depend on observables
  • Imputation via classification/regression
• Missing values that depend on unobservables
• Missing depending on the value itself

Imputation – Example [From: Fan 2008]
• How to impute the missing SCL for patient #5?
  • Sample mean: (3.8 + 0.6 + 1.1 + 1.3)/4 = 1.7
  • By age: (3.8 + 0.6)/2 = 2.2
  • By sex: 1.1
  • By education: 1.3
  • By race: (3.8 + 0.6 + 1.3)/3 = 1.9
  • By ADL: (1.1 + 1.3)/2 = 1.2
• Who is/are in the same “slice” with #5?
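The slice means above can be checked mechanically. The patient table itself is not in the transcript, so the sketch below only uses the SCL values that each bullet lists for its slice; the dictionary keys are labels chosen for this example.

```python
# SCL values of the patients that fall in the same slice as patient #5,
# copied from the bullets above (the underlying patient table is not shown).
slices = {
    "sample mean": [3.8, 0.6, 1.1, 1.3],
    "age": [3.8, 0.6],
    "sex": [1.1],
    "education": [1.3],
    "race": [3.8, 0.6, 1.3],
    "ADL": [1.1, 1.3],
}

# The imputed value for each slicing variable is the mean SCL within the slice.
imputed = {name: sum(values) / len(values) for name, values in slices.items()}

for name, value in imputed.items():
    print(f"{name}: {value:.1f}")
```

The spread of the results (from 1.1 to 2.2) shows how strongly the imputed value depends on which variable defines the slice.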

Further Reading
• Saar-Tsechansky & Provost: http://www.springerlink.com/content/k5m57475n1658723/fulltext.pdf
• Yang, Y., Pedersen, J.O. “A Comparative Study on Feature Selection in Text Categorization.” ICML 1997, pp. 412-420.
• Gelman chapter: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
• Applications in biomed: Lavori, P., R. Dawson and D. Shera (1995) “A Multiple Imputation Strategy for Clinical Trials with Truncation of Patient Data.” Statistics in Medicine 14: 1913-1925.

Unbalanced Classes in ML
[Diagram labels: Raw Data; Relational; Temporal; Feature Extraction; Feature Representation; Unbalanced Unlabeled Data Set; Rare Category Detection; Learning in Unbalanced Settings; Classifier]

Minority Class Discovery Method
[Flowchart; formulas not recovered]
1. Calculate problem-specific similarity
2. […]
3. […]
4. Query
5. A new class? (Yes / No)
6. Output
7. Budget exhausted? (No)
Other labels: Relevance Feedback; Increase t by 1

Scoring Function
• The estimated density: […]
• Scoring function: norm of the gradient […] where […]
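Since the density formula is missing from the transcript, the following sketch substitutes a plain Gaussian kernel density estimate to show what "norm of the gradient" scoring could look like. The Gaussian kernel, the fixed `bandwidth`, and the function names are assumptions, not the estimator used on the slide.

```python
import math

def density_gradient_norm(x, data, bandwidth=1.0):
    """Norm of the gradient at x of an (unnormalized) Gaussian kernel density estimate.

    The constant normalizing factor of the kernel is dropped because it does
    not change the ordering of the scores.
    """
    h2 = bandwidth ** 2
    grad = [0.0] * len(x)
    for p in data:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
        k = math.exp(-sq_dist / (2.0 * h2))
        # d/dx exp(-||x - p||^2 / (2 h^2)) = k * (p - x) / h^2
        for j in range(len(x)):
            grad[j] += k * (p[j] - x[j]) / h2
    return math.sqrt(sum(g * g for g in grad)) / len(data)

def pick_query(pool, data, bandwidth=1.0):
    """Index of the pool point where the estimated density changes fastest."""
    return max(range(len(pool)),
               key=lambda i: density_gradient_norm(pool[i], data, bandwidth))
```

The intuition is that a compact minority class produces a sharp local change in density, so points with a large gradient norm are promising places to ask the oracle for a label.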

Summary of Real Data Sets
• Abalone
  • 4177 examples
  • 7-dimensional features
  • 20 classes
  • Largest class: 16.50%
  • Smallest class: 0.34%
• Shuttle
  • 4515 examples
  • 9-dimensional features
  • 7 classes
  • Largest class: 75.53%
  • Smallest class: 0.13%

Results on Real Data Sets
[Figure: two panels, Abalone and Shuttle, each comparing MALICE, Interleave, and random sampling]

Computational Virology via PPI’s
• Degree distribution / Hub analysis / Disease checking
• Graph modules analysis (from bi-clustering study)
• Protein-family based graph patterns (receptors / receptor subclasses / ligands / etc.)

HIV-1 host protein interactions
• HIV-1 depends on the cellular machinery in every aspect of its life cycle.
[Figure: HIV-1 life cycle stages: fusion, reverse transcription, transcription, maturation, budding. Source: Peterlin and Trono, Nature Rev Immu 2003.]

PPIs: Protein-Protein Interactions
• The cell machinery is run by the proteins
  • Enzymatic activities, replication, translation, transport, signaling, structural
• Proteins interact with each other to perform these functions
  • Through physical contact
  • Indirectly in a protein complex
  • Indirectly in a pathway
http://www.cellsignal.com/reference/pathway/Apoptosis_Overview.html

Interactions reported in NIAID
http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/
• Example: “Nef binds hemopoietic cell kinase isoform p61HCK” (Nef is the HIV-1 protein; hemopoietic cell kinase isoform p61HCK is the human protein)
• Group 1: more likely direct
  • Keywords: binds, cleaves, interacts with, methylated by, myristoylated by, etc.
  • 1063 interactions
  • 721 human proteins
  • 17 HIV-1 proteins
• Group 2: could be indirect
  • Keywords: activates, associates with, causes accumulation of, etc.
  • 1454 interactions
  • 914 human proteins
  • 16 HIV-1 proteins

Sources of Labels
• Literature
• Lab Experiments
• Human Experts
[Diagram labels: Feature Importance; Active Selection of Instances and Reliable Labelers]

Estimating expert labeling accuracies
• Solve this through expectation maximization
• Assuming experts are conditionally independent given true label
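The EM procedure can be sketched for the simplest setting. The code below assumes binary labels, every expert labelling every item, and one symmetric accuracy per expert (the same probability of being right whether the true label is 0 or 1); these simplifications and the function name are assumptions, since the slide's model details were not recovered.

```python
def em_expert_accuracy(votes, n_iter=50, eps=1e-6):
    """Estimate expert accuracies and label posteriors with EM.

    votes[i][j] is the 0/1 label expert j gave to item i. Experts are treated
    as conditionally independent given the true label.
    Returns (accuracies, posteriors of the true label being 1).
    """
    n_items = len(votes)
    n_experts = len(votes[0])
    # Initialize the posteriors with the fraction of experts voting 1.
    q = [sum(row) / n_experts for row in votes]
    acc = [0.5] * n_experts
    for _ in range(n_iter):
        # M-step: re-estimate the class prior and each expert's accuracy.
        prior = min(max(sum(q) / n_items, eps), 1.0 - eps)
        for j in range(n_experts):
            agree = sum(q[i] if votes[i][j] == 1 else 1.0 - q[i]
                        for i in range(n_items))
            acc[j] = min(max(agree / n_items, eps), 1.0 - eps)
        # E-step: posterior of the true label given all expert votes.
        for i in range(n_items):
            p1, p0 = prior, 1.0 - prior
            for j in range(n_experts):
                if votes[i][j] == 1:
                    p1 *= acc[j]
                    p0 *= 1.0 - acc[j]
                else:
                    p1 *= 1.0 - acc[j]
                    p0 *= acc[j]
            q[i] = p1 / (p1 + p0)
    return acc, q
```

Experts who frequently disagree with the consensus of the others end up with lower estimated accuracy, and their votes then count for less when the posteriors are recomputed.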

Refined interactome
• Solid line: probability of being a direct interaction is ≥0.5
• Dashed line: probability of being a direct interaction is <0.5
• Edge thickness indicates confidence in the interaction

THANK YOU!
