
Understanding the Language of Virus Proteins for Drug Resistance Detection

This study explores the language of virus proteins to automatically detect drug resistance in HIV, using machine learning and document classification techniques. The goal is to optimize treatment and improve personalized medicine for patients with HIV.

Presentation Transcript


  1. Understanding the Language of Virus Proteins to Automatically Detect Drug Resistance. Betty Cheng, Jaime Carbonell. Language Technologies Institute, School of Computer Science, Carnegie Mellon University

  2. Outline • HIV & Drug Resistance • Phenotype Prediction Models • Machine Learning • Language of Proteins • Document Classification of HIV Genotypes • Comparison to the state of the art & human experts • Another Area of Application: GPCR • Conclusions

  3. Drug Resistance & HIV • Drug resistance is an obstacle in the treatment and control of many infectious diseases • 33.2 million people living with HIV/AIDS in 2007 • 2.1 million AIDS deaths in 2007 • The high mutation rate of HIV leads to a quasi-species of virus strains inside each patient (Figure: quasi-species diversity; values of 4% and 25% shown)

  4. HAART & Genotype Testing • Currently ~25 drugs in 4 main drug classes • Treatments with 3+ drugs (HAART) are used to cover as many virus strains as possible in the quasi-species → personalized medicine • Trial-and-error is not an option because of cross-resistance • Goal: optimize treatment so that the virus population takes as long as possible to develop resistance • Current practice: the phenotype is predicted from genotype test results to identify the resistance present now

  5. Rule-based Prediction • Problem: predict resistance (high/low/none) to each drug given the patient's HIV genotype • Example: Rega and ANRS systems • If at least Z of <list of mutations> are present, then predict resistance level Y to drug X. • Example: HIVdb • Sum the penalty scores from each mutation (see the sketch below). • Advantage: the reason for a prediction is easy to understand • Disadvantage: impossible to maintain as more data and drugs become available
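
To make the scoring idea concrete, here is a minimal sketch of HIVdb-style penalty scoring in Python. The mutation names, penalty values, and level cut-offs below are hypothetical placeholders for illustration, not the actual HIVdb tables.

```python
# Minimal sketch of HIVdb-style penalty scoring (illustrative only).
# The mutation list, penalty scores, and thresholds are hypothetical,
# not the actual HIVdb values.

# Hypothetical penalty table for one drug: {mutation: penalty score}
PENALTIES = {"M46I": 10, "I54V": 15, "V82A": 30, "L90M": 30}

def resistance_level(mutations, penalties=PENALTIES):
    """Sum penalty scores for observed mutations and bin into a level."""
    total = sum(penalties.get(m, 0) for m in mutations)
    if total >= 60:
        return "high"
    if total >= 30:
        return "low"
    return "none"

print(resistance_level({"M46I", "V82A", "L90M"}))  # -> "high"
print(resistance_level({"I54V"}))                  # -> "none"
```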

  6. Database Search Prediction • Find the database sequence most similar to the test sequence at all selected mutation positions • Does not interpolate between partial matches (a minimal matching sketch follows) • Example: VirtualPhenotype™ (from Virco) • Advantage: no rules to maintain • Disadvantages: • Human experts are still needed to identify the mutation positions • A large amount of data is needed to ensure a database match
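
The matching idea can be sketched as follows. The selected positions, sequences, and phenotype labels are hypothetical; a real system such as VirtualPhenotype relies on a large proprietary genotype-phenotype database.

```python
# Minimal sketch of database-search phenotype prediction (illustrative).
# Positions and sequences are hypothetical.

# Hypothetical expert-selected mutation positions (0-based) in the protein
SELECTED_POSITIONS = [3, 7, 12]

def matches(db_seq, test_seq, positions=SELECTED_POSITIONS):
    """True if db_seq agrees with test_seq at every selected position."""
    return all(db_seq[p] == test_seq[p] for p in positions)

def predict_phenotype(test_seq, database):
    """Return phenotypes of database entries that match exactly.

    Note: no interpolation between partial matches, as on the slide.
    """
    hits = [pheno for seq, pheno in database if matches(seq, test_seq)]
    return hits or None  # None when the database has no exact match

db = [("GSVERDSVEEVLKAFR", "resistant"), ("GSGMRMSREQLLNAWR", "susceptible")]
print(predict_phenotype("GSVERDSVEEVLKAFR", db))
```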

  7. Machine Learning Systems • Systems "learn" by detecting patterns in training data • Enables knowledge discovery • Systems vary in the type of features and the learning algorithm • Features: • Presence of mutation • Mutation • Structure-based • Maintenance is just re-running the learning algorithm on new data • Takes minutes to hours to train, seconds to minutes to test a sequence (Slide callouts: "Majority of studies", "Sufficient for Protease Inhibitors")

  8. Glass-box vs. Black-box Learning • Glass-box algorithms allow knowledge discovery • Black-box algorithms are more tolerant of extra features • Existing systems trade off between black-box algorithms and expert-selected mutations (Figures: mutations → resistance; decision tree for EFV (Beerenwinkel, '02); neural network with 27 mutations)

  9. Language of Proteins

  10. Text Document Classification • Classify a document by topic based on its words • Trade-off between using all English words or selected keywords • Chi-square feature selection found to be best at selecting keywords in text [Yang & Pedersen '97] (Figure: example documents as bags of words, e.g. "touchdown", "tackle" vs. "hoop", "basket")

  11. Doc Classification of HIV Genotypes • View the target virus proteins as documents • Alphabet size: 20 amino acids • No word/motif boundaries (cf. Thai, Japanese) • Features: position-independent n-grams, position-dependent n-grams (mutations) • Extract n-grams from every reading frame (a minimal extraction sketch follows) • Represent as a vector of n-gram counts (Figure: example protein sequences, e.g. GSVERDSVEEVLKAFRLFDDGNSGT…, GSGMRMSREQLLNAWRLFCKDNSHT…, GSGERDSREEILKAFRLFDDDNSGT…)
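
A minimal sketch of the two n-gram feature types, assuming "reading frame" here means every starting offset in the amino-acid sequence; the example sequence is illustrative.

```python
# Minimal sketch of n-gram feature extraction from a protein sequence:
# position-independent n-grams (plain substring counts) and
# position-dependent n-grams (residues anchored at a position).
from collections import Counter

def position_independent_ngrams(seq, n):
    """Counts of every length-n substring, taken at every offset."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def position_dependent_ngrams(seq, n):
    """n-grams tagged with their starting position (mutation-style features)."""
    return Counter(f"{i}:{seq[i:i + n]}" for i in range(len(seq) - n + 1))

seq = "GSVERDSVEEVLKAFR"
print(position_independent_ngrams(seq, 2).most_common(3))
print(position_dependent_ngrams(seq, 1).most_common(3))
```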

  12. Transforming Numeric Features into Binary Features
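
Slides 14 and 15 suggest that each numeric n-gram count is compared against a set of candidate thresholds, yielding binary features of the form "count(n-gram) ≥ t". Here is a minimal sketch under that reading; the threshold values are assumptions.

```python
# Minimal sketch of turning numeric n-gram counts into binary features.
# Each candidate feature is "count(ngram) >= t" for a set of candidate
# thresholds; the exact thresholds below are illustrative assumptions.

def binarize(counts, thresholds=(1, 5, 10, 20)):
    """Expand each numeric count into binary (ngram, >= t) features."""
    features = {}
    for ngram, count in counts.items():
        for t in thresholds:
            features[f"{ngram}>={t}"] = count >= t
    return features

print(binarize({"AA": 12, "AAA": 2}))
# e.g. {'AA>=1': True, 'AA>=5': True, 'AA>=10': True, 'AA>=20': False, ...}
```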

  13. Computing Chi-Square • Chi-square feature selection is the best for document classification (Yang & Pedersen, 1997) • For a feature x: \chi^2(x) = \sum_c (O_{x,c} - E_{x,c})^2 / E_{x,c}, where O_{x,c} is the observed number of sequences with feature x and resistance level c, E_{x,c} = N_x N_c / N is the expected number, N_x is the number of sequences with feature x, N_c the number of sequences with resistance level c, and N the total number of sequences (a computation sketch follows)
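
A minimal sketch of the statistic using the observed and expected counts defined above. Following the slide's labels, it sums only over the resistance levels for sequences containing the feature; a full 2×|C| contingency version would also include the feature-absent cells. The data is hypothetical.

```python
# Minimal sketch of the chi-square score for one binary feature, using
# the observed/expected counts from the slide. Data below is hypothetical.

def chi_square(feature_present, labels, classes=("none", "low", "high")):
    """chi^2(x) = sum_c (O_xc - E_xc)^2 / E_xc over resistance levels c."""
    n = len(labels)
    n_x = sum(feature_present)                  # sequences with feature x
    score = 0.0
    for c in classes:
        n_c = sum(1 for y in labels if y == c)  # sequences with level c
        o_xc = sum(1 for f, y in zip(feature_present, labels)
                   if f and y == c)             # observed co-occurrences
        e_xc = n_x * n_c / n                    # expected under independence
        if e_xc > 0:
            score += (o_xc - e_xc) ** 2 / e_xc
    return score

present = [1, 1, 0, 1, 0, 0]                    # feature x present per sequence
levels = ["high", "high", "none", "low", "none", "low"]
print(chi_square(present, levels))
```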

  14. Selecting the Features (Figure: thresholded n-gram features such as AAA ≥ 1, A ≥ 10, AA ≥ 20, each scored by χ²; values 30.2, 29.9, 45.1 shown)

  15. Overview of Framework • N-grams extracted at every reading frame of the protein sequence (e.g. GSVERDSVEEVLKAFRLFDDGNSGT…) • Counts of all n-grams (e.g. 12, 25, 7, 15, 5, …) • Chi-square feature selection • Selected n-grams occurring more frequently than their most discriminative thresholds become binary features (e.g. F, F, T, F, T, …) • Classifier (see the end-to-end sketch below)
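
Putting the stages together, here is a minimal end-to-end sketch with scikit-learn's chi2 selector and a decision tree. The sequences, labels, thresholds, and the choice of k are illustrative assumptions, not the study's configuration.

```python
# Minimal end-to-end sketch of the framework: n-gram counts ->
# binary threshold features -> chi-square selection -> classifier.
from collections import Counter

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

THRESHOLDS = (1, 2, 5)  # illustrative count thresholds

def ngram_counts(seq, n=2):
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def binary_features(seqs, vocab):
    """Matrix of {0,1} features 'count(ngram) >= t' for each sequence."""
    rows = []
    for seq in seqs:
        counts = ngram_counts(seq)
        rows.append([int(counts.get(g, 0) >= t) for g, t in vocab])
    return np.array(rows)

# Hypothetical training data: protein fragments and resistance labels
seqs = ["GSVERDSVEEVLKAFR", "GSGMRMSREQLLNAWR",
        "GSGERDSREEILKAFR", "GSVERDSVEEVLKAWR"]
labels = ["high", "none", "high", "low"]

# Candidate vocabulary: every bigram seen in training, at each threshold
grams = sorted({g for s in seqs for g in ngram_counts(s)})
vocab = [(g, t) for g in grams for t in THRESHOLDS]

X = binary_features(seqs, vocab)
keep = X.sum(axis=0) > 0                 # drop features never observed
vocab = [v for v, k in zip(vocab, keep) if k]
X = X[:, keep]

model = make_pipeline(SelectKBest(chi2, k=10), DecisionTreeClassifier())
model.fit(X, labels)
print(model.predict(binary_features(["GSVERDSVEEILKAFR"], vocab)))
```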

  16. Comparison of Feature Sets • A previous study (Rhee et al., 2006) compared the performance of 3 feature sets: • Expert-selected mutations • Treatment-selected mutations (TSM) • Mutations occurring more than twice in the dataset • TSMs are derived from an additional database of patients treated with a given drug class but no drugs targeting the same protein • So they cannot be specific to each drug • The study found expert-selected mutations or TSMs to perform best

  17. χ² Features vs. TSM & Expert Features • Using the same dataset and classifier (decision tree), our χ²-selected features performed comparably to TSM and expert-selected mutations

  18. Performance of χ² Features • Evaluated on several learning algorithms • Glass-box: decision tree, naïve Bayes, random forest • Black-box: SVM • On average, 100-120 χ² features were selected • Choice of classifier did not make much difference

  19. χ² vs. State of the Art • Used regression algorithms to predict the resistance factor (IC50 ratio) • Comparing the best models from each study for each drug, our model matched or outperformed Rhee et al. on 12 of 16 drugs • Average difference < 0.01 (a regression sketch follows)
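
A minimal sketch of the regression variant: the same binary features feed a regressor that predicts the IC50 fold change. Using the log of the fold change and a random forest regressor here are assumptions for illustration, not necessarily the algorithms compared in the study.

```python
# Minimal sketch of predicting the resistance factor (IC50 fold change)
# instead of a class. Features and fold changes are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical binary feature matrix (selected χ² features) and fold changes
X = np.array([[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0], [0, 1, 0, 0]])
fold_change = np.array([45.0, 1.2, 12.0, 0.8])   # IC50 ratio per isolate

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, np.log10(fold_change))              # regress on log fold change

pred = model.predict(np.array([[1, 0, 1, 0]]))
print(10 ** pred[0])                             # predicted IC50 ratio
```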

  20. Overlap between χ² and Expert-Selected Feature Sets • 53 of 54 expert-selected mutations for PI ranked 108th or higher by χ²

  21. Overlap between χ² and Expert-Selected Feature Sets • 20 of 21 expert-selected mutations for NRTI ranked 120th or higher by χ² • All 15 expert-selected mutations for NNRTI ranked 107th or higher by χ²

  22. Top 30 χ²-Ranked Mutations

  23. in vivo Response to HAART • Phenotype systems predict the drug resistance the detected genotype confers now • In vivo response is not a summation of resistance to the individual drugs • Mutations can cause resistance to one drug while increasing sensitivity to another • Minor strains are not detected by genotype testing → use treatment history • Variation in the human host affects response • Adherence [Ying et al., 2007] • Haplotype? Gender? State of health? • Lifestyle habits?

  24. Future work: χ² on Multi-type Features • Model the impact of interactions between all these factors using a feature for each combination (see the sketch below) • χ² reduces these to a manageable number of important features before applying a glass-box model • Amortized optimization of HAART requires short-term and long-term response models
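
A minimal sketch of the proposed combination features: one binary feature per combination of factor values, then chi-square ranking to keep only the most informative combinations. The factors and data are hypothetical illustrations of the future-work idea, not study results.

```python
# Minimal sketch of multi-type combination features plus chi-square ranking.
from itertools import combinations

import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical per-patient factors (already discretised to binary)
patients = [{"mutM46I": 1, "priorNRTI": 1, "adherent": 0},
            {"mutM46I": 0, "priorNRTI": 1, "adherent": 1},
            {"mutM46I": 1, "priorNRTI": 0, "adherent": 1},
            {"mutM46I": 0, "priorNRTI": 0, "adherent": 1}]
response = ["poor", "good", "good", "good"]   # short-term response labels

names = sorted(patients[0])
pairs = list(combinations(names, 2))
# Combination feature = 1 when both factors in the pair are 1
X = np.array([[p[a] & p[b] for a, b in pairs] for p in patients])

scores, _ = chi2(X, response)
ranked = sorted(zip(pairs, scores), key=lambda kv: -kv[1])
print(ranked)  # most informative factor combinations first
```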

  25. GPCR Protein Classification • Given a new protein sequence, classify it into the correct category at each level of the hierarchy • Subfamily classification is based on function • G-protein coupled receptors (GPCRs) are the target of 60% of current drugs

  26. GPCR Protein Classification • Previous classification studies rely on alignment-based features • Karchin et al. (2002) evaluated classifiers of varying complexity and concluded SVMs were necessary to attain 85%+ accuracy • Our document classification approach uses χ² features with naïve Bayes or a decision tree (a level-wise sketch follows) (Figure: classifier complexity scale, from complex to simple: SVMs, neural nets, clustering; hidden Markov models (HMMs); k-nearest neighbours; decision trees, naïve Bayes)
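
A minimal sketch of level-wise hierarchical classification, assuming one naïve Bayes model for the level-I subfamily and one per subfamily for level II. The sequences, labels, and bigram count features are illustrative, not the study's dataset.

```python
# Minimal sketch of hierarchical GPCR classification: level I first,
# then a per-subfamily classifier for level II. Data is hypothetical.
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

def ngram_counts(seq, n=2):
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

# Hypothetical training data: (sequence, level-I label, level-II label)
train = [("MNGTEGPNFYVPFSNATGVV", "ClassA", "Rhodopsin"),
         ("MNGTEGLNFYVPFSNKTGVV", "ClassA", "Rhodopsin"),
         ("MDVLSPGQGNNTTSPPAPFE", "ClassA", "Amine"),
         ("MAGAGPKRRALAAPAAEEKE", "ClassB", "Secretin"),
         ("MAGSGPKRRELAAPAAEEKQ", "ClassB", "Secretin")]

vec = DictVectorizer()
X = vec.fit_transform([ngram_counts(s) for s, _, _ in train])

level1 = MultinomialNB().fit(X, [l1 for _, l1, _ in train])
level2 = {}
for fam in {l1 for _, l1, _ in train}:
    idx = [i for i, (_, l1, _) in enumerate(train) if l1 == fam]
    level2[fam] = MultinomialNB().fit(X[idx], [train[i][2] for i in idx])

def classify(seq):
    x = vec.transform([ngram_counts(seq)])
    fam = level1.predict(x)[0]           # level-I subfamily
    return fam, level2[fam].predict(x)[0]  # then level-II subfamily

print(classify("MNGTEGPNFYVPFSNATGVV"))
```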

  27. GPCR Classification: Level I Subfamily Naïve Bayes with chi-square attained a 39.7% reduction in residual error. Position-independent n-grams outperformed position-specific ones because the diversity of GPCR sequences made sequence alignment difficult.

  28. GPCR Classification: Level II Subfamily Naïve Bayes with chi-square attained a 44.5% reduction in residual error.

  29. N-grams selected by chi-square, when joined, form motifs found in the literature.

  30. Conclusions • Current phenotype prediction systems require human experts to maintain either rules or resistance-associated mutations • A text document classification approach led to a fully automatic prediction model with results comparable to the state of the art while requiring no human expertise • χ²-identified mutations overlap strongly with those chosen by human experts • A similar approach had found success in previous work on GPCR proteins • Aim: an automatic prediction model for short-term and long-term viral load response to HAART so that amortized treatment optimization becomes possible

  31. Thank you! Questions? Betty Cheng (ymcheng@cs.cmu.edu), Jaime Carbonell (jgc@cs.cmu.edu)
