Protein-Ligand Interaction Prediction: An Improved Chemogenomics Approach • Laurent Jacob, Jean-Philippe Vert
Introduction • Predicting interactions between small molecules and proteins • Vital to the drug discovery process • Key to understanding biological processes • 3 classes of drug targets • G-protein-coupled receptors (GPCRs) • Enzymes • Ion channels
Classical Methods • Consider each target independently from other proteins • Ligand-based approach • Compare to known ligands of the target • Requires knowledge about other ligands of a given target • Structure-based or docking approaches • Uses 3D structure of the target to determine how well a ligand can bind • Requires 3D structure of the target • Very time consuming • Cannot apply if no ligand or 3D structure is known for a given target
Chemogenomics • Chemical space: • set of all small molecules • Biological space: • set of all proteins or protein families • Mine the entire chemical space for interactions with the biological space • Knowledge of some ligands for a target can help to predict ligands for similar targets
Chemogenomic Approaches • Ligand-based chemogenomics • Look at families or subfamilies of proteins • Model ligands at the level of a family • Target-based chemogenomics • Cluster receptors based on ligand binding site similarity • Use known ligands for each cluster to infer shared ligands • Target-ligand approach • Use binding information for targets to predict ligands for another target in a single step
Previous Experiments • Bock and Gough (2005) • Describe ligand-receptor complexes by merging ligand and target descriptors • Use machine learning methods to predict whether a ligand-receptor pair forms a complex • Erhan et al. (2006) • Merge a set of ligand descriptors with a set of receptor descriptors in a framework of neural networks and support vector machines • Offers great flexibility in the choice of descriptors
Proposed Method • Investigates different types of descriptors • Builds upon recent developments in kernel methods • In bio- and cheminformatics • Tests different methods for prediction of ligands • For 3 major classes of targets • Shows that the choice of representation greatly affects accuracy • New kernel based on hierarchies of receptors outperforms all other descriptors • Performs especially well for targets with few or no known ligands
Learning Problem • Given n target/molecule pairs $(t_1,c_1), \dots, (t_n,c_n)$ known to form complexes or not • Each pair is represented by a vector $\Phi(t,c)$ • Estimate a linear function • $f(t,c) = w^\top \Phi(t,c)$ • Whose sign is used to predict whether a chemical c can bind to a target t • The vector w is estimated from the training set
Vector Representation • Represent a molecule c by a vector $\Phi_{\mathrm{lig}}(c) \in \mathbb{R}^{d_c}$ • Encodes physicochemical and structural properties • Typically used to model interactions between small molecules and a single target • Represent a protein t by a vector $\Phi_{\mathrm{tar}}(t) \in \mathbb{R}^{d_t}$ • Captures properties of the protein's sequence or structure • Typically used to infer models that predict the structural or functional class of a protein • Need to represent a pair (c,t) in a single vector • Should capture interactions between features of the molecule and of the protein that can be useful predictors • Multiply a descriptor of c with a descriptor of t
Tensor Product • $\Phi(c,t) = \Phi_{\mathrm{lig}}(c) \otimes \Phi_{\mathrm{tar}}(t)$ • Represents the set of all possible products of features of c and t • A vector of dimension $d_c \times d_t$ • The (i,j)-th entry is the product of the i-th entry of $\Phi_{\mathrm{lig}}(c)$ and the j-th entry of $\Phi_{\mathrm{tar}}(t)$ • Size may be prohibitively large • Use kernel methods
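To make the size of the tensor product concrete, here is a minimal Python sketch. The descriptor values, the weight vector `w`, and the function names are hypothetical and chosen only for illustration; they do not come from the original study.

```python
def tensor_product(lig_vec, tar_vec):
    """Flattened tensor product: entry (i, j) is lig_vec[i] * tar_vec[j]."""
    return [a * b for a in lig_vec for b in tar_vec]


def linear_score(w, lig_vec, tar_vec):
    """Linear function f = w^T Phi(c, t) on the explicit pair representation."""
    phi = tensor_product(lig_vec, tar_vec)
    return sum(w_k * phi_k for w_k, phi_k in zip(w, phi))


# Hypothetical descriptors: d_c = 3 ligand features, d_t = 2 target features.
lig = [1.0, 0.0, 2.0]
tar = [0.5, -1.0]

phi = tensor_product(lig, tar)   # d_c * d_t = 6 entries
w = [1.0] * len(phi)             # hypothetical weights, not a trained model

# The sign of the score gives the predicted class (bind / not bind).
prediction = 1 if linear_score(w, lig, tar) > 0 else -1
```

The explicit vector has d_c × d_t entries, so it grows quickly with realistic descriptor sizes, which is the motivation for working with kernels instead.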
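Kernel Trick • Can process large- or infinite-dimensional patterns if the inner product between any two patterns can be computed • The inner product between two tensor product vectors factorizes • $(\Phi_{\mathrm{lig}}(c) \otimes \Phi_{\mathrm{tar}}(t))^\top (\Phi_{\mathrm{lig}}(c') \otimes \Phi_{\mathrm{tar}}(t')) = \Phi_{\mathrm{lig}}(c)^\top \Phi_{\mathrm{lig}}(c') \times \Phi_{\mathrm{tar}}(t)^\top \Phi_{\mathrm{tar}}(t')$ • So the kernel between two pairs is the product of a ligand kernel and a target kernel • $K((c,t),(c',t')) = K_{\mathrm{ligand}}(c,c') \times K_{\mathrm{target}}(t,t')$ • $K_{\mathrm{ligand}}(c,c') = \Phi_{\mathrm{lig}}(c)^\top \Phi_{\mathrm{lig}}(c')$ • $K_{\mathrm{target}}(t,t') = \Phi_{\mathrm{tar}}(t)^\top \Phi_{\mathrm{tar}}(t')$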
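The factorization can be checked numerically. The sketch below compares the inner product of two explicit tensor product vectors with the product of the ligand and target inner products. All vectors and function names are made-up illustrations, not data from the study.

```python
def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def tensor_product(u, v):
    """Flattened tensor product of u and v."""
    return [a * b for a in u for b in v]


def pair_kernel(lig_c, tar_t, lig_c2, tar_t2):
    """Kernel between two (ligand, target) pairs without building the tensor product."""
    return dot(lig_c, lig_c2) * dot(tar_t, tar_t2)


# Hypothetical descriptors for two ligand-target pairs.
lig_c, lig_c2 = [1.0, 2.0, 0.0], [0.0, 1.0, 3.0]
tar_t, tar_t2 = [1.0, -1.0], [2.0, 0.5]

explicit = dot(tensor_product(lig_c, tar_t), tensor_product(lig_c2, tar_t2))
factorized = pair_kernel(lig_c, tar_t, lig_c2, tar_t2)
```

Both quantities agree, but the factorized form only needs d_c + d_t multiplications instead of d_c × d_t.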
Kernels For Ligands • There have been impressive advances in the use of SVMs in chemoinformatics • Kernels have been designed using: • Physicochemical properties of molecules • 2D or 3D fingerprints • Comparison of 2D and 3D structures of molecules • Detection of common substructures in 2D graphs • Encoding various properties of 3D structures • Used in single-target virtual screening and prediction of pharmacokinetics and toxicity
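Tanimoto Kernel • Classical choice • State-of-the-art performance • $K_{\mathrm{ligand}}(c,c') = \dfrac{\Phi_{\mathrm{lig}}(c)^\top \Phi_{\mathrm{lig}}(c')}{\Phi_{\mathrm{lig}}(c)^\top \Phi_{\mathrm{lig}}(c) + \Phi_{\mathrm{lig}}(c')^\top \Phi_{\mathrm{lig}}(c') - \Phi_{\mathrm{lig}}(c)^\top \Phi_{\mathrm{lig}}(c')}$ • $\Phi_{\mathrm{lig}}(c)$ is a binary vector • Each bit indicates whether a given linear path of length l or less occurs as a subgraph of the 2D structure of c • Choose l = 8 • Computed with the ChemCPP software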
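For binary fingerprints the Tanimoto kernel reduces to the size of the intersection divided by the size of the union of the set bits. A minimal sketch, assuming each fingerprint is stored as a Python set of active bit indices; the example fingerprints are hypothetical and the path enumeration done by ChemCPP is not reproduced here.

```python
def tanimoto_kernel(fp_a, fp_b):
    """Tanimoto kernel between two binary fingerprints given as sets of set bits."""
    common = len(fp_a & fp_b)                  # Phi(c)^T Phi(c')
    denom = len(fp_a) + len(fp_b) - common     # size of the union
    # Assumption: two empty fingerprints get similarity 0 to avoid dividing by zero.
    return common / denom if denom else 0.0


# Hypothetical fingerprints (indices of linear paths present in each molecule).
mol_1 = {1, 2, 3}
mol_2 = {2, 3, 4}

similarity = tanimoto_kernel(mol_1, mol_2)
```

The value is 1 for identical non-empty fingerprints and 0 when no path is shared.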
Kernels For Targets • SVM and kernel methods are widely used in bioinformatics • Various kernels have been proposed based on: • Amino-acid sequence of proteins • 3D structures of proteins • Pattern of occurrences of proteins in multiple sequenced genomes • Used for various tasks related to structural or functional classification of proteins
Dirac Kernel • $K_{\mathrm{Dirac}}(t,t') = 1$ if $t = t'$, and $0$ otherwise • Represents different targets as orthonormal vectors • Orthogonality between two proteins t and t' implies orthogonality between all pairs (c,t) and (c',t') for any two molecules c and c' • Learning is performed independently for each target protein • Does not share any information about known ligands between different targets
Multitask Kernel • $K_{\mathrm{multitask}}(t,t') = 1 + K_{\mathrm{Dirac}}(t,t')$ • Removes the orthogonality between targets • Combines target-specific properties of the ligands with general properties shared across all targets • Allows sharing of information during learning • Preserves the specificities of the ligands for each target • Cannot weight differently how much known interactions with each other target t' should contribute to predictions for t
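The difference between the Dirac and multitask kernels is easiest to see on their Gram matrices. The sketch below uses hypothetical target identifiers; the helper `gram` is introduced only for illustration.

```python
def dirac_kernel(t, t2):
    """1 if the two targets are identical, 0 otherwise."""
    return 1.0 if t == t2 else 0.0


def multitask_kernel(t, t2):
    """Dirac kernel plus a constant term shared by all pairs of targets."""
    return 1.0 + dirac_kernel(t, t2)


def gram(kernel, targets):
    """Gram matrix of a target kernel over a list of targets."""
    return [[kernel(a, b) for b in targets] for a in targets]


targets = ["target_A", "target_B", "target_C"]   # hypothetical identifiers

k_dirac = gram(dirac_kernel, targets)            # identity matrix
k_multitask = gram(multitask_kernel, targets)    # 2 on the diagonal, 1 elsewhere
```

With the Dirac kernel every off-diagonal entry is 0, so nothing learned on one target transfers to another. The multitask kernel gives every pair of distinct targets the same nonzero similarity, so all other targets contribute equally.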
Mismatch and Local Alignment Kernels • Empirical observations suggest that molecules that bind to t are likely to bind to t' only if t and t' are similar in terms of structure or evolutionary history • Such similarity can be detected by comparing protein sequences • Mismatch kernel: • compares proteins in terms of the short amino-acid sequences they share, up to some number of mismatches • Choose 3-mers with a maximum of one mismatch • Local alignment kernel: • uses the alignment score between the primary sequences of proteins to measure their similarity
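A minimal, unoptimized sketch of a mismatch kernel with 3-mers and at most one mismatch follows. It assumes sequences over the 20 standard amino acids, applies no normalization, and uses toy sequences; it illustrates the idea rather than the implementation used in the study.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def neighbors(kmer, alphabet=AMINO_ACIDS):
    """All k-mers that differ from kmer in at most one position (kmer included)."""
    result = {kmer}
    for i in range(len(kmer)):
        for letter in alphabet:
            result.add(kmer[:i] + letter + kmer[i + 1:])
    return result


def mismatch_features(seq, k=3):
    """Count, for every k-mer, how many k-mers of seq lie within one mismatch of it."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for nb in neighbors(seq[i:i + k]):
            feats[nb] += 1
    return feats


def mismatch_kernel(seq_a, seq_b, k=3):
    """Inner product of the two mismatch feature maps."""
    f_a, f_b = mismatch_features(seq_a, k), mismatch_features(seq_b, k)
    return sum(count * f_b[kmer] for kmer, count in f_a.items())


# Toy sequences: "ACD" and "ACE" differ in one position, "WWW" shares nothing.
k_close = mismatch_kernel("ACD", "ACE")
k_far = mismatch_kernel("ACD", "WWW")
```

Each 3-mer activates 58 features (itself plus 3 × 19 single substitutions), so two sequences can have a positive kernel value even when they share no exact 3-mer.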
Hierarchy Kernel • $K_{\mathrm{hierarchy}}(t,t') = \langle \Phi_h(t), \Phi_h(t') \rangle$ • $\Phi_h(t)$ has one feature for each node in the hierarchy • Set to 1 if the node is part of t's hierarchy • Set to 0 otherwise • Plus one feature constantly set to 1 • Uses data from the target itself and data from other targets, giving the latter a smaller weight • Performed the best in the experiments
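Because the hierarchy features are binary, the kernel simply counts the nodes two targets share, plus one for the constant feature. The sketch below assumes each target is described by its path from the top level of the hierarchy down to the target itself (for example the four fields of an EC number) and that the target's own node counts as a feature. The helper names are hypothetical.

```python
def hierarchy_features(path):
    """Active features of a target: the constant feature plus every node on its path."""
    feats = {()}                          # empty prefix stands for the constant feature
    for depth in range(1, len(path) + 1):
        feats.add(tuple(path[:depth]))    # node reached after `depth` levels
    return feats


def hierarchy_kernel(path_a, path_b):
    """Inner product of binary feature vectors = number of shared active features."""
    return float(len(hierarchy_features(path_a) & hierarchy_features(path_b)))


def ec_path(ec_number):
    """Split an EC number such as '1.2.2.1' into its hierarchy levels."""
    return tuple(ec_number.split("."))


same = hierarchy_kernel(ec_path("1.2.2.1"), ec_path("1.2.2.1"))     # all 4 levels + constant
sibling = hierarchy_kernel(ec_path("1.2.2.1"), ec_path("1.2.3.4"))  # 2 levels + constant
distant = hierarchy_kernel(ec_path("1.2.2.1"), ec_path("3.1.1.1"))  # constant only
```

Targets that are close in the hierarchy get a larger kernel value, so their known ligands weigh more in the prediction than those of distant targets.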
Enzyme Hierarchy • Enzyme Commission numbers • International Union of Biochemistry and Molecular Biology (1992) • Classifies enzymes by the chemical reactions they catalyze • Four-level hierarchy • For example, • EC 1 includes oxidoreductases • EC 1.2 includes oxidoreductases that act on the aldehyde or oxo group of donors • EC 1.2.2 has NAD+ or NADP+ as an acceptor • EC 1.2.2.1 catalyzes the oxidation of formate to bicarbonate • Enzymes that are close in the hierarchy should have similar ligands
GPCR Hierarchy • GPCRs are grouped into four classes • Group A: rhodopsin family • Group B: secretin family • Group C: metabotropic family • Group D: gathers more diverse receptors • KEGG database subdivides the rhodopsin family into three subgroups • Amine receptors • Peptide receptors • Other receptors • And adds a second level of classification based on the type of ligands or known subdivisions
Ion Channel Hierarchy • The KEGG database divides ion channels into 8 classes • Cys-loop superfamily • Glutamate-gated cation channels • Epithelial and related Na+ channels • Voltage-gated cation channels • Related to voltage-gated cation channels • Related to inward rectifier K+ channels • Chloride channels • Related to ATPase-linked transporters • Each class is further subdivided • By, for example, the type of ligands or type of ion passing through the channel
Data Extraction • Extracted compound interaction data from KEGG BRITE database • Known compounds for each target • Type of interaction • Enzymes: inhibitor, cofactor, effector • GPCR: antagonist, full/partial agonist • Ion Channels: pore blocker, positive/negative allosteric modulator, agonist, antagonist • Did not take into account • Orthologs of targets • Enzymes with same EC number • Compounds with no molecular descriptor • Primarily peptides • Targets with no known compounds
Data Points • Generated as many negative ligand-target pairs as known ligand-target pairs • Negative ligands were chosen at random • May produce false negatives, since a randomly chosen ligand could in fact bind the target • Experimentally confirmed negative pairs would be preferable • 2436 data points for enzymes • 675 enzymes, 524 compounds • 798 data points for GPCRs • 100 receptors, 219 compounds • 2230 data points for ion channels • 114 channels, 462 compounds
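One possible way to generate the negative pairs is sketched below. It assumes that the random ligand is drawn from the pool of all ligands in the dataset, that ligands already known to bind the target are excluded, and that negatives are not repeated. These details, the function name, and the toy data are assumptions made for illustration.

```python
import random


def sample_negatives(positives, seed=0):
    """Draw one random negative (target, ligand) pair per known positive pair."""
    rng = random.Random(seed)
    ligands = sorted({ligand for _, ligand in positives})
    taken = set(positives)            # pairs that must not be drawn as negatives
    negatives = []
    for target, _ in positives:
        candidates = [c for c in ligands if (target, c) not in taken]
        if not candidates:            # target already paired with every ligand
            continue
        choice = rng.choice(candidates)
        taken.add((target, choice))
        negatives.append((target, choice))
    return negatives


# Toy dataset of known interactions (hypothetical identifiers).
positives = [("t1", "a"), ("t1", "b"), ("t2", "c"), ("t3", "d")]
negatives = sample_negatives(positives)
```

Pairs produced this way are only presumed negatives: nothing guarantees that the sampled ligand does not bind the target.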
Known Ligands • Figure: distribution of the number of known ligands per target for the enzyme, GPCR, and ion channel datasets • Each bar indicates the proportion of targets for which a given number of training points are available • Few compounds are known for most targets • Source: Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409
Experiments • Experiment 1 • Trained an SVM classifier on • all points involving other targets of the family • plus a fraction of points involving t • Tested on the remaining data points for t • Assesses the accuracy for a given target when using ligands for other targets for training • Experiment 2 • Trained an SVM classifier using only interactions that did not involve t • Tested on data points that did involve t • Simulated making predictions for targets with no known ligands • Measured performance using the area under the ROC curve (AUC)
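The two protocols differ only in which data points for the held-out target t enter the training set. The sketch below builds both splits and computes the AUC from classifier scores in plain Python. The `train_fraction` parameter, the toy data, and the function names are assumptions, and the SVM training step is omitted.

```python
import random


def experiment1_split(data, target, train_fraction=0.8, seed=0):
    """Train on all other targets plus a fraction of the points of `target`."""
    rng = random.Random(seed)
    own = [d for d in data if d[0] == target]
    others = [d for d in data if d[0] != target]
    rng.shuffle(own)
    n_train = int(len(own) * train_fraction)
    return others + own[:n_train], own[n_train:]


def experiment2_split(data, target):
    """Train only on other targets, test on every point of `target`."""
    train = [d for d in data if d[0] != target]
    test = [d for d in data if d[0] == target]
    return train, test


def auc(labels, scores):
    """Area under the ROC curve: probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: (target, ligand, label) triples with hypothetical identifiers.
data = [("t1", "a", 1), ("t1", "b", -1), ("t1", "c", 1), ("t1", "d", -1), ("t1", "e", 1),
        ("t2", "a", 1), ("t2", "f", -1), ("t3", "g", 1), ("t3", "h", -1)]

train1, test1 = experiment1_split(data, "t1")
train2, test2 = experiment2_split(data, "t1")
example_auc = auc([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1])
```

In the second split the training set contains no point for the held-out target, which mimics a target with no known ligands.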
Results: Experiment 1 Mean AUC on each dataset with various target kernels • Hierarchy kernel shows significant improvements • Sharing information for known ligands of different targets • Incorporating prior information into the kernels
Gram Matrices • Figure: target kernel Gram matrices ($K_{\mathrm{tar}}$) for ion channels with the multitask, hierarchy, and local alignment kernels • Hierarchy kernel adds structure information • Local alignment kernel retains some substructures • For GPCRs and enzymes, almost no structure is found by the sequence kernels • Source: Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409
Relative Improvement • Figure: relative improvement of the hierarchy kernel over the Dirac kernel as a function of the number of known ligands for the enzyme, GPCR, and ion channel datasets • Strong improvement when few ligands are known • Improvement decreases as more training points become available • Beyond a certain number of known ligands, the hierarchy kernel performs worse than the Dirac kernel • Source: Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409
Results: Experiment 2 Mean AUC on each dataset with various target kernels • Dirac kernel performs at chance level • It has no training data for the held-out target • Hierarchy kernel still gives reasonable results • 1.7%, 5.1%, and 7.2% loss for enzymes, GPCRs, and ion channels compared to the first experiment
References • Rognan D: Chemogenomic approaches to rational drug design. Br J Pharmacol 2007, 152:38-52. • Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucleic Acids Res 2002, 30:42-46. • Jacob L, Vert JP: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24:2149-2156. • Erhan D, L'Heureux P, Yue SY, Bengio Y: Collaborative filtering on a family of biological targets. J Chem Inf Model 2006, 46:626-635. • Bock JR, Gough DA: Virtual screen for ligands of orphan G protein-coupled receptors. J Chem Inf Model 2005, 45:1402-1414.