310 likes | 316 Views
This study explores the language of virus proteins to automatically detect drug resistance in HIV, using machine learning and document classification techniques. The goal is to optimize treatment and improve personalized medicine for patients with HIV.
E N D
Understandingthe Language of Virus Proteins toAutomatically Detect Drug Resistance Betty Cheng, Jaime Carbonell Language Technologies Institute, School of Computer Science Carnegie Mellon University
Outline • HIV & Drug Resistance • Phenotype Prediction Models • Machine Learning • Language of Proteins • Document Classification of HIV Genotypes • Comparison to state-of-the-art & human experts • Other Area of Application: GPCR • Conclusions
Drug Resistance & HIV • Drug resistance is an obstacle in treatment and control of many infectious diseases • 33.2 million living with AIDS in 2007 • 2.1 million died from AIDS in 2007 • High mutation rate of HIV leads to quasi-species of virus strains inside each patient 4 % 25% diversity
HAART & Genotype Testing • Currently ~25 drugs in 4 main drug classes • Treatments with 3+ drugs (HAART) used to cover as many virus strains as possible in quasi-species Personalized Medicine • Trial-and-error not an option due to cross resistance • Goal: Optimize treatment to take longest for virus population to develop resistance • Current: Phenotype predicted from genotype test results to identify resistance present now
Rule-based Prediction • Problem: Predict resistance (high/low/none) to each drug given patient’s HIV genotype • Example: Rega and ANRS systems • If at least Z of <list of mutations> are present, then predict resistance level Y to drug X. • Example: HIVdb • Sum the penalty scores from each mutation. • Advantage: easy to understand reason for prediction • Disadvantage: impossible to maintain as more data and drugs become available
Database Search Prediction • Find db sequence most similar to test sequence at all selected mutation positions • Does not interpolate between partial matches • Example: VirtualPhenotypeTM [from Virco] • Advantage: no rules to maintain • Disadvantages: • Human experts still needed to identify mutation positions • Large amount of data needed to ensure a db match
Machine Learning Systems • Systems can “learn” by detecting patterns in training data and deduction • Enables knowledge discovery • Varies in the type of features and learning algorithm • Features: • Presence of mutation • Mutation • Structure-based • Maintenance is just re-running learning algorithm on new data • Takes minutes ~ hours to train, seconds ~ minutes to test a sequence Sufficient for Protease Inhibitors Majority of studies
Glass-box vs. Black-box Learning MUTATIONS RESISTANCE • Glass-box alg. allows knowledge discovery • Black-box alg. more tolerant of extra features • Existing systems trade-off between black-box systems and expert-selected mutations Decision tree for EFV(Beerenwinkel, ‘02) Neural Network: 27 Mutations
Text Document Classification touchdown • Classify document by topic based on words • Trade-off between using all English words or select keyword • Chi-square feature selection found to be best at selecting keywords in text [Yang et al. ‘97] glove hoop ball the the to ball to a ball a basket tackle a to the bat
Doc Classification of HIV Genotypes • View target virus proteins as documents • Alphabet size: 20 amino acids • No word/motif boundaries (e.g. Thai, Japanese) • Features: position-independent n-grams, position-dependent n-grams (mutations) • Extract n-grams from every reading frame • Represent as vector of n-gram counts G S V E R D S V E E V L K A F R L F D D G N S G T… G S G M R M S R E Q L L N A W R L F C K D N S H T… G S G E R D S R E E I L K A F R L F D D D N S G T…
Computing Chi-Square Chi-square feature selection is the best for document classification. (Yang & Pedersen, 1997) Observed # of sequences with feature x Expected # of seqs with feature x and resistance level c # of sequences with feature x Total # of sequences # of sequences with resistance level c
Selecting the Features 30.2 29.9 45.1 AAA ( ≥ 1) … A ( ≥ 10) … AA ( ≥ 20)
Overview of Framework G S V E R D S V E E V L K A F R L F D D G N S G T… G S G M R M S R E Q L L N A W R L F C K D N S H T… G S G E R D S R E E I L K A F R L F D D D N S G T… N-grams extracted at every reading frame of protein sequence 12 25 7 15 5 ……… 1 0 0 1 0 Counts of all n-grams Chi-Square Feature Selection Selected n-grams occurring more frequently than their most discriminative thresholds F F T F T …… T F F Classifier
Comparison of Feature Sets • Previous study (Rhee et al., 2006) compared performance of 3 feature sets: • Expert-selected mutations • Treatment-selected mutations (TSM) • Mutations occurring more than 2x in dataset • TSM trained from additional database of patients treated with a given drug class but no drugs targeting same protein • Not possible to be specific to each drug • Found human experts or TSM to perform best
χ2 Features vs. TSM & Expert Features • Using same dataset and classifier (decision tree), our X2-selected features performed comparably to TSM and expert-selected mutations
Performance of χ2 features • Evaluated on several learning algorithms • Glass-box: decision tree, naïve Bayes, random forest • Black-box: SVM • Average 100-120 X2 features • Choice of classifier did not make much difference
χ2 vs. State-of-the-Art • Used regression algorithms to predict resistance factor (IC50 ratio) • Comparing the best models from each study for each drug, our model matched or outperformed Rhee et al. on 12 of 16 drugs • Average difference < 0.01
Overlap betweenχ2 and Expert-Selected Feature Set 53 of 54 expert-selected mutations for PIranked 108th or higher by χ2
Overlap betweenχ2 and Expert-Selected Feature Set 20 of 21 expert-selected mutations for NRTIranked 120th or higher by χ2 All 15 expert-selected mutations for NNRTIranked 107th or higher by χ2
in vivo Response to HAART • Phenotype systems predict drug resistance the detected genotype has currently • Not a summation of resistance to individual drugs • Mutations can cause resistance to one drug while increasing sensitivity to another • Minor strains not detected by genotype testing Treatment history • Variation in human host affects response • Adherence [Ying et al., 2007] • Haplotype? Gender? State of health? • Lifestyle habits?
Future work: χ2 on Multi-type Features • Model impact of interaction between all these factors using a feature for each combination • χ2reduces to manageable number of important features before applying to glass-box model • Amortized optimization of HAART requires short-term and long-term response model
GPCR Protein Classification • Given a new protein sequence, classify it into the correct category at each level in the hierarchy • Subfamily classification based on function • G-Protein Coupled Receptors (GPCR) is target of 60% of current drugs
GPCR Protein Classification Complex • Previous classification studies rely on alignment-based features • Karchin et al.(2002) evaluated performance of classifiers at varying levels of complexity and concluded SVMs were necessary to attain 85%+ accuracy • Document classification approach with χ2 features and naïve Bayes or decision tree SVM, Neural Nets, Clustering Hidden MarkovModels (HMM) K-NearestNeighbours Decision Trees, Naïve Bayes Simple
GPCR Classification: Level I Subfamily Naïve Bayes with chi-square attained 39.7% reduction in residual error. Position-independent n-grams outperformed position-specific ones because diversity of GPCR seqs made sequence alignment difficult.
GPCR Classification: Level II Subfamily Naïve Bayes with chi-square attained 44.5% reduction in residual error.
N-grams selected by chi-square joined to form motifs found in literature.
Conclusions • Current phenotype prediction systems require human experts to maintain – either rules or resistance-associated mutations • Text document classification approach led to fully automatic prediction model with comparable results to state-of-the-art yet requiring no human expertise • χ2 identified mutations overlap strongly with human experts • Similar approach had found success in previous work on GPCR proteins • Aim: An automatic prediction model for short-term and long-term viral load response to HAART so that amortized treatment optimization is possible
Thank you!Questions? Betty Cheng (ymcheng@cs.cmu.edu) Jaime Carbonell (jgc@cs.cmu.edu)