310 likes | 426 Views
Understanding the Language of Virus Proteins to Automatically Detect Drug Resistance. Betty Cheng, Jaime Carbonell Language Technologies Institute, School of Computer Science Carnegie Mellon University. Outline. HIV & Drug Resistance Phenotype Prediction Models Machine Learning
E N D
Understandingthe Language of Virus Proteins toAutomatically Detect Drug Resistance Betty Cheng, Jaime Carbonell Language Technologies Institute, School of Computer Science Carnegie Mellon University
Outline • HIV & Drug Resistance • Phenotype Prediction Models • Machine Learning • Language of Proteins • Document Classification of HIV Genotypes • Comparison to state-of-the-art & human experts • Other Area of Application: GPCR • Conclusions
Drug Resistance & HIV • Drug resistance is an obstacle in treatment and control of many infectious diseases • 33.2 million living with AIDS in 2007 • 2.1 million died from AIDS in 2007 • High mutation rate of HIV leads to quasi-species of virus strains inside each patient 4 % 25% diversity
HAART & Genotype Testing • Currently ~25 drugs in 4 main drug classes • Treatments with 3+ drugs (HAART) used to cover as many virus strains as possible in quasi-species Personalized Medicine • Trial-and-error not an option due to cross resistance • Goal: Optimize treatment to take longest for virus population to develop resistance • Current: Phenotype predicted from genotype test results to identify resistance present now
Rule-based Prediction • Problem: Predict resistance (high/low/none) to each drug given patient’s HIV genotype • Example: Rega and ANRS systems • If at least Z of <list of mutations> are present, then predict resistance level Y to drug X. • Example: HIVdb • Sum the penalty scores from each mutation. • Advantage: easy to understand reason for prediction • Disadvantage: impossible to maintain as more data and drugs become available
Database Search Prediction • Find db sequence most similar to test sequence at all selected mutation positions • Does not interpolate between partial matches • Example: VirtualPhenotypeTM [from Virco] • Advantage: no rules to maintain • Disadvantages: • Human experts still needed to identify mutation positions • Large amount of data needed to ensure a db match
Machine Learning Systems • Systems can “learn” by detecting patterns in training data and deduction • Enables knowledge discovery • Varies in the type of features and learning algorithm • Features: • Presence of mutation • Mutation • Structure-based • Maintenance is just re-running learning algorithm on new data • Takes minutes ~ hours to train, seconds ~ minutes to test a sequence Sufficient for Protease Inhibitors Majority of studies
Glass-box vs. Black-box Learning MUTATIONS RESISTANCE • Glass-box alg. allows knowledge discovery • Black-box alg. more tolerant of extra features • Existing systems trade-off between black-box systems and expert-selected mutations Decision tree for EFV(Beerenwinkel, ‘02) Neural Network: 27 Mutations
Text Document Classification touchdown • Classify document by topic based on words • Trade-off between using all English words or select keyword • Chi-square feature selection found to be best at selecting keywords in text [Yang et al. ‘97] glove hoop ball the the to ball to a ball a basket tackle a to the bat
Doc Classification of HIV Genotypes • View target virus proteins as documents • Alphabet size: 20 amino acids • No word/motif boundaries (e.g. Thai, Japanese) • Features: position-independent n-grams, position-dependent n-grams (mutations) • Extract n-grams from every reading frame • Represent as vector of n-gram counts G S V E R D S V E E V L K A F R L F D D G N S G T… G S G M R M S R E Q L L N A W R L F C K D N S H T… G S G E R D S R E E I L K A F R L F D D D N S G T…
Computing Chi-Square Chi-square feature selection is the best for document classification. (Yang & Pedersen, 1997) Observed # of sequences with feature x Expected # of seqs with feature x and resistance level c # of sequences with feature x Total # of sequences # of sequences with resistance level c
Selecting the Features 30.2 29.9 45.1 AAA ( ≥ 1) … A ( ≥ 10) … AA ( ≥ 20)
Overview of Framework G S V E R D S V E E V L K A F R L F D D G N S G T… G S G M R M S R E Q L L N A W R L F C K D N S H T… G S G E R D S R E E I L K A F R L F D D D N S G T… N-grams extracted at every reading frame of protein sequence 12 25 7 15 5 ……… 1 0 0 1 0 Counts of all n-grams Chi-Square Feature Selection Selected n-grams occurring more frequently than their most discriminative thresholds F F T F T …… T F F Classifier
Comparison of Feature Sets • Previous study (Rhee et al., 2006) compared performance of 3 feature sets: • Expert-selected mutations • Treatment-selected mutations (TSM) • Mutations occurring more than 2x in dataset • TSM trained from additional database of patients treated with a given drug class but no drugs targeting same protein • Not possible to be specific to each drug • Found human experts or TSM to perform best
χ2 Features vs. TSM & Expert Features • Using same dataset and classifier (decision tree), our X2-selected features performed comparably to TSM and expert-selected mutations
Performance of χ2 features • Evaluated on several learning algorithms • Glass-box: decision tree, naïve Bayes, random forest • Black-box: SVM • Average 100-120 X2 features • Choice of classifier did not make much difference
χ2 vs. State-of-the-Art • Used regression algorithms to predict resistance factor (IC50 ratio) • Comparing the best models from each study for each drug, our model matched or outperformed Rhee et al. on 12 of 16 drugs • Average difference < 0.01
Overlap betweenχ2 and Expert-Selected Feature Set 53 of 54 expert-selected mutations for PIranked 108th or higher by χ2
Overlap betweenχ2 and Expert-Selected Feature Set 20 of 21 expert-selected mutations for NRTIranked 120th or higher by χ2 All 15 expert-selected mutations for NNRTIranked 107th or higher by χ2
in vivo Response to HAART • Phenotype systems predict drug resistance the detected genotype has currently • Not a summation of resistance to individual drugs • Mutations can cause resistance to one drug while increasing sensitivity to another • Minor strains not detected by genotype testing Treatment history • Variation in human host affects response • Adherence [Ying et al., 2007] • Haplotype? Gender? State of health? • Lifestyle habits?
Future work: χ2 on Multi-type Features • Model impact of interaction between all these factors using a feature for each combination • χ2reduces to manageable number of important features before applying to glass-box model • Amortized optimization of HAART requires short-term and long-term response model
GPCR Protein Classification • Given a new protein sequence, classify it into the correct category at each level in the hierarchy • Subfamily classification based on function • G-Protein Coupled Receptors (GPCR) is target of 60% of current drugs
GPCR Protein Classification Complex • Previous classification studies rely on alignment-based features • Karchin et al.(2002) evaluated performance of classifiers at varying levels of complexity and concluded SVMs were necessary to attain 85%+ accuracy • Document classification approach with χ2 features and naïve Bayes or decision tree SVM, Neural Nets, Clustering Hidden MarkovModels (HMM) K-NearestNeighbours Decision Trees, Naïve Bayes Simple
GPCR Classification: Level I Subfamily Naïve Bayes with chi-square attained 39.7% reduction in residual error. Position-independent n-grams outperformed position-specific ones because diversity of GPCR seqs made sequence alignment difficult.
GPCR Classification: Level II Subfamily Naïve Bayes with chi-square attained 44.5% reduction in residual error.
N-grams selected by chi-square joined to form motifs found in literature.
Conclusions • Current phenotype prediction systems require human experts to maintain – either rules or resistance-associated mutations • Text document classification approach led to fully automatic prediction model with comparable results to state-of-the-art yet requiring no human expertise • χ2 identified mutations overlap strongly with human experts • Similar approach had found success in previous work on GPCR proteins • Aim: An automatic prediction model for short-term and long-term viral load response to HAART so that amortized treatment optimization is possible
Thank you!Questions? Betty Cheng (ymcheng@cs.cmu.edu) Jaime Carbonell (jgc@cs.cmu.edu)