Understanding the Language of Virus Proteins for Drug Resistance Detection

Understandingthe Language of Virus Proteins toAutomatically Detect Drug Resistance Betty Cheng, Jaime Carbonell Language Technologies Institute, School of Computer Science Carnegie Mellon University

Outline • HIV & Drug Resistance • Phenotype Prediction Models • Machine Learning • Language of Proteins • Document Classification of HIV Genotypes • Comparison to state-of-the-art & human experts • Other Area of Application: GPCR • Conclusions

Drug Resistance & HIV • Drug resistance is an obstacle in treatment and control of many infectious diseases • 33.2 million living with AIDS in 2007 • 2.1 million died from AIDS in 2007 • High mutation rate of HIV leads to quasi-species of virus strains inside each patient 4 % 25% diversity

HAART & Genotype Testing • Currently ~25 drugs in 4 main drug classes • Treatments with 3+ drugs (HAART) used to cover as many virus strains as possible in quasi-species  Personalized Medicine • Trial-and-error not an option due to cross resistance • Goal: Optimize treatment to take longest for virus population to develop resistance • Current: Phenotype predicted from genotype test results to identify resistance present now

Rule-based Prediction • Problem: Predict resistance (high/low/none) to each drug given patient’s HIV genotype • Example: Rega and ANRS systems • If at least Z of <list of mutations> are present, then predict resistance level Y to drug X. • Example: HIVdb • Sum the penalty scores from each mutation. • Advantage: easy to understand reason for prediction • Disadvantage: impossible to maintain as more data and drugs become available

Database Search Prediction • Find db sequence most similar to test sequence at all selected mutation positions • Does not interpolate between partial matches • Example: VirtualPhenotypeTM [from Virco] • Advantage: no rules to maintain • Disadvantages: • Human experts still needed to identify mutation positions • Large amount of data needed to ensure a db match

Machine Learning Systems • Systems can “learn” by detecting patterns in training data and deduction • Enables knowledge discovery • Varies in the type of features and learning algorithm • Features: • Presence of mutation • Mutation • Structure-based • Maintenance is just re-running learning algorithm on new data • Takes minutes ~ hours to train, seconds ~ minutes to test a sequence Sufficient for Protease Inhibitors Majority of studies

Glass-box vs. Black-box Learning MUTATIONS RESISTANCE • Glass-box alg. allows knowledge discovery • Black-box alg. more tolerant of extra features • Existing systems trade-off between black-box systems and expert-selected mutations Decision tree for EFV(Beerenwinkel, ‘02) Neural Network: 27 Mutations

Language of Proteins

Text Document Classification touchdown • Classify document by topic based on words • Trade-off between using all English words or select keyword • Chi-square feature selection found to be best at selecting keywords in text [Yang et al. ‘97] glove hoop ball the the to ball to a ball a basket tackle a to the bat

Doc Classification of HIV Genotypes • View target virus proteins as documents • Alphabet size: 20 amino acids • No word/motif boundaries (e.g. Thai, Japanese) • Features: position-independent n-grams, position-dependent n-grams (mutations) • Extract n-grams from every reading frame • Represent as vector of n-gram counts G S V E R D S V E E V L K A F R L F D D G N S G T… G S G M R M S R E Q L L N A W R L F C K D N S H T… G S G E R D S R E E I L K A F R L F D D D N S G T…

Transforming Numeric Features into Binary Features

Computing Chi-Square Chi-square feature selection is the best for document classification. (Yang & Pedersen, 1997) Observed # of sequences with feature x Expected # of seqs with feature x and resistance level c # of sequences with feature x Total # of sequences # of sequences with resistance level c

Selecting the Features 30.2 29.9 45.1 AAA ( ≥ 1)  …  A ( ≥ 10)  …  AA ( ≥ 20)

Overview of Framework G S V E R D S V E E V L K A F R L F D D G N S G T… G S G M R M S R E Q L L N A W R L F C K D N S H T… G S G E R D S R E E I L K A F R L F D D D N S G T… N-grams extracted at every reading frame of protein sequence 12 25 7 15 5 ……… 1 0 0 1 0 Counts of all n-grams Chi-Square Feature Selection Selected n-grams occurring more frequently than their most discriminative thresholds F F T F T …… T F F Classifier

Comparison of Feature Sets • Previous study (Rhee et al., 2006) compared performance of 3 feature sets: • Expert-selected mutations • Treatment-selected mutations (TSM) • Mutations occurring more than 2x in dataset • TSM trained from additional database of patients treated with a given drug class but no drugs targeting same protein • Not possible to be specific to each drug • Found human experts or TSM to perform best

χ2 Features vs. TSM & Expert Features • Using same dataset and classifier (decision tree), our X2-selected features performed comparably to TSM and expert-selected mutations

Performance of χ2 features • Evaluated on several learning algorithms • Glass-box: decision tree, naïve Bayes, random forest • Black-box: SVM • Average 100-120 X2 features • Choice of classifier did not make much difference

χ2 vs. State-of-the-Art • Used regression algorithms to predict resistance factor (IC50 ratio) • Comparing the best models from each study for each drug, our model matched or outperformed Rhee et al. on 12 of 16 drugs • Average difference < 0.01

Overlap betweenχ2 and Expert-Selected Feature Set 53 of 54 expert-selected mutations for PIranked 108th or higher by χ2

Overlap betweenχ2 and Expert-Selected Feature Set 20 of 21 expert-selected mutations for NRTIranked 120th or higher by χ2 All 15 expert-selected mutations for NNRTIranked 107th or higher by χ2

Top 30 χ2-Ranked Mutations

in vivo Response to HAART • Phenotype systems predict drug resistance the detected genotype has currently • Not a summation of resistance to individual drugs • Mutations can cause resistance to one drug while increasing sensitivity to another • Minor strains not detected by genotype testing  Treatment history • Variation in human host affects response • Adherence [Ying et al., 2007] • Haplotype? Gender? State of health? • Lifestyle habits?

Future work: χ2 on Multi-type Features • Model impact of interaction between all these factors using a feature for each combination • χ2reduces to manageable number of important features before applying to glass-box model • Amortized optimization of HAART requires short-term and long-term response model

GPCR Protein Classification • Given a new protein sequence, classify it into the correct category at each level in the hierarchy • Subfamily classification based on function • G-Protein Coupled Receptors (GPCR) is target of 60% of current drugs

GPCR Protein Classification Complex • Previous classification studies rely on alignment-based features • Karchin et al.(2002) evaluated performance of classifiers at varying levels of complexity and concluded SVMs were necessary to attain 85%+ accuracy • Document classification approach with χ2 features and naïve Bayes or decision tree SVM, Neural Nets, Clustering Hidden MarkovModels (HMM) K-NearestNeighbours Decision Trees, Naïve Bayes Simple

GPCR Classification: Level I Subfamily Naïve Bayes with chi-square attained 39.7% reduction in residual error. Position-independent n-grams outperformed position-specific ones because diversity of GPCR seqs made sequence alignment difficult.

GPCR Classification: Level II Subfamily Naïve Bayes with chi-square attained 44.5% reduction in residual error.

N-grams selected by chi-square joined to form motifs found in literature.

Conclusions • Current phenotype prediction systems require human experts to maintain – either rules or resistance-associated mutations • Text document classification approach led to fully automatic prediction model with comparable results to state-of-the-art yet requiring no human expertise • χ2 identified mutations overlap strongly with human experts • Similar approach had found success in previous work on GPCR proteins • Aim: An automatic prediction model for short-term and long-term viral load response to HAART so that amortized treatment optimization is possible

Thank you!Questions? Betty Cheng (ymcheng@cs.cmu.edu) Jaime Carbonell (jgc@cs.cmu.edu)

Understanding the Language of Virus Proteins for Drug Resistance Detection

Understanding the Language of Virus Proteins for Drug Resistance Detection

Presentation Transcript

Drug Resistance Monitoring

ARV Drug Resistance

Drug Resistance Surveillance

Antiretroviral Drug Resistance

Antimicrobial Drug Resistance

Drug Resistance

Understanding Proteins

HIV Drug Resistance

Antiretroviral Drug Resistance

Antiretroviral Drug Resistance

Understanding the Language of Virus Proteins to Automatically Detect Drug Resistance

Drug Resistance:

Antiretroviral Therapy in Children With Drug Resistance Virus

Transmitted drug resistance

Bacterial Drug Resistance

Understanding HIV Drug Resistance Testing For Case Managers

The Problem of Drug Resistance

ARV Drug Resistance

Drug Test Cassette as the Most Convenient Tool to Detect Drug

Hair Drug Testing – A Best Way To Detect The Drug Level

Drug Resistance

Drug Resistance