390 likes | 700 Views
Improving Protein-Ligand Binding Affinity Prediction using Random Forest. Hongjian Li Department of Computer Science and Engineering Chinese University of Hong Kong 1 April 2014 http :// istar.cse.cuhk.edu.hk JackyLeeHongJian@Gmail.com. Protein-Ligand Docking.
E N D
Improving Protein-Ligand Binding Affinity Prediction using Random Forest Hongjian Li Department of Computer Science and Engineering Chinese University of Hong Kong 1 April 2014http://istar.cse.cuhk.edu.hk JackyLeeHongJian@Gmail.com
Protein-Ligand Docking • Protein: large pharmaceutical target • Ligand: small drug candidate Hongjian Li, Kwong-Sak Leung, TakanoriNakane and Man-Hon Wong. iview: an interactive WebGL visualizer for protein-ligand complex. BMC Bioinformatics, 15(1):56, 2014.
Docking Problem Definition • Given a protein and a ligand, predict • binding conformation (pose generation) • binding affinity (scoring) Pose 1 Pose 2 8.274 pKd 8.011 pKd
Redocking Success Rates • idock and Vina predicted up to 9 conformations • RMSDi : ith highest binding affinity, i= 1,2,…, 9 • RMSDmin : closest conformation, min(RMSDi) Hongjian Li, Kwong-Sak Leung, Pedro J. Ballester and Man-Hon Wong. istar: A Web Platform for Large-Scale Protein-Ligand Docking. PLoS ONE, 9(1):e85678, 2014.
Redocking Success Rates • PDBbind v2013 refined set (N = 2959) • N = 2959 – 66 (size_xyz < 3) – 2 (Sr,Cs) = 2893
Accuracy of Vina Scoring Function • SD=2.85 kcal/mol on training set (N = 1300) • SD=2.75 kcal/mol on test set (N=116) • 104x in disassociation constant Kd Oleg Trott and Arthur J. Olson. AutoDockVina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31 (2):455–461, 2010.
Scoring Functions • Force field based • AMBER • CHARMM • MedusaScore • VDWScore • Knowledge based • ITScore/SE • DrugScore • SMoG • STScore • Empirically based • GoldScore • ChemScore • ASP • X-Score • AutoDockVina • ID-Score • Cyscore Martiniano Bello, MarletMartínez-Archundia, and Jos Correa-Basurto. Automated docking for novel drug discovery. Expert Opinion on Drug Discovery, pages 1–14, 2013.
Assumed Functional Form • Vina
Assumed Functional Form • Predetermined, theory-inspired, additive
Machine-Learning Scoring Functions • No modelling assumptions, non-parametric • Implicitly capture binding effects that are hard to model explicitly • Random forest • RF-Score • Support vector regression • SVR-Score, MD-SVR Pedro J. Ballester and John B. O. Mitchell. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics, 26(9):1169–1175, 2010. Liwei Li, Bo Wang, and Samy O. Meroueh. Support Vector Regression Scoring of Receptor–Ligand Complexes for Rank-Ordering and Virtual Screening of Chemical Libraries. Journal of Chemical Information and Modeling, 51(9):2132-2138, 2011. Wenhu Zhan, Daqiang Li, JinxinChe, Liangren Zhang, Bo Yang, Yongzhou Hu, Tao Liu, Xiaowu Dong. Integrating docking scores, interaction profiles and molecular descriptors to improve the accuracy of molecular docking: Toward the discovery of novel Akt1 inhibitors. European Journal of Medicinal Chemistry, 75:11-20, 2014.
The Question • Q1: Does circumventing an assumed functional form lead to better predictions?
Four Models • AutoDockVina • Undisclosed nonlinear optimization • MLR::Vina • Multiple linear regression with Vina features • RF::Vina • Random forest with Vina features • RF::VinaElem • Random forest with Vina and RF-Score features
Model 1: AutoDockVina • Vina’s score for the kthconformer: • k=1 for crystal pose only
Model 2: MLR::Vina • Quasi linear • Grid search on w6 • 101 values samples from 0 to 1 with a step size of 0.01 • 16 values sampled from 0.005 to 0.020 with a step size of 0.001
Model 3: RF::Vina • 6 Vina features • P=500 trees • mtry = 1 to 6 • Best mtry on Out-of-Bag (OOB) data
RF Construction & Prediction • RF trains binary trees using the CART algorithm • RF grows tree without pruning from a bootstrap sample of the training data • RF selects the best split at each node from a small number (mtry) of randomly chosen features. mtryϵ {1,…,36} • RF stops splitting a node with <=5 samples • Prediction from an individual tree is the arithmetic mean of its samples in a leaf node • P = 500 trees, RF prediction is arithmetic mean
RF Construction & Prediction • Out-of-bag (OOB) data as internal validation • Possible mtryvalues cover all the feature subset sizes, i.e. {1,…,36}
Model 4: RF::VinaElem • 6 Vina features + 36 RF-Score features • 4 and 9 atom types for protein and ligand, resp. • Occurrences for a particular j–iatom type pair • dcutoff=12Å,Z(C)=6,Z(N)=7,Z(O)=8,Z(P)=15,Z(S)=16
Visualization of Feature Vector • Protein-ligand atomic contacts as blue lines Pedro J. Ballester. Machine Learning Scoring Functions Based on Random Forest and Support Vector Regression. Pattern Recognition in Bioinformatics, Lecture Notes in Computer Science, 7632:14–25, 2012.
PDBbind v2012 • A: experimentally determined structures as of 2012 • B: protein-ligand, DNA/RNA-ligand, protein-DNA/RNA, prot-prot complexes • C: Complexes with Kd, Ki, or IC50 • 7,121 protein-ligand complexes • D: Prot-lig complexes with Kd or Ki • E: 67 non-redundant clusters • BLAST cutoff of 90% sequence identity • With highest, lowest, medium affinity Renxiao Wang, Xueliang Fang, Yipin Lu, and ShaomengWang. The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures. Journal of Medicinal Chemistry, 47(12):2977–2980, 2004. Renxiao Wang, Xueliang Fang, Yipin Lu, Chao-YieYang, and Shaomeng Wang. The PDBbind Database: Methodologies and Updates. Journal of Medicinal Chemistry, 48(12):4111–4119, 2005.
PDBbind v2007 Benchmark • Refined set (N = 1300) • Training set: 1105 complexes • Test set: 195 complexes • Measured binding affinities span >12 orders of magnitude • 65 non-redundant clusters
PDBbind v2013 Blind Benchmark • Test set: refined13\refined12 (N = 382) • 4 training sets • refined02 (N = 792) • refined07 (N = 1300) • refined10 (N = 2059) • refined12 (N = 2897) • | Train ∩ Test | = 0
Performance Measures • Root Mean Square Error RMSE • Standard deviation in linear correlation SD • Pearson correlation Rp • Spearman rank-correlation Rs
Results and Discussion • Vina vs. MLR::Vina vs RF::Vina vs RF::VinaElem • Overfitting • MLSFs are remarkably more accurate than Empirical SFs • The impact of data representation on the applicability domain of SFs • Machine-learning SFs assimilate data better than Empirical SFs • Achieved improvement over AutoDockVina using Random Forest • MLSFs are a largely unexplored research area
MLR::Vina Better than Vina • w6=0.010 for MLR::Vina
RF::Vina Better than MLR::Vina • w6=0.010 for MLR::Vina • seed=99789 for RF::Vina
RF::VinaElem Better than RF::Vina • seed=99789 for RF::Vina • seed=72551 for RF::VinaElem
Ligand- and Protein-Only Features • With and without Nrot • QSAR models
Overfitting • Machine-learning SFs tend to overfit training data • Challenged by Cao and Li • MLR::Vina • SD=1.71 on training set (N = 1105) • SD=1.91 on test set (N = 195) • RF::Vina • SD=0.60 on training set (N = 1105) • SD=1.62 on test set (N = 195) Yang Cao and Lei Li. Improved protein–ligand binding affinity prediction by using a curvature-dependent surface-area model. Bioinformatics, 2014.
MLSFs More Accurate • AutoDockVina with and without coupling a linear model
MLSFs More Accurate • RF::VinaElem with and without coupling a linear model
MLSFs More Accurate • 22 SFs on the PDBbindv2007 benchmark
Applicability Domain • Same training set • Same test set • Same features • … implies • Same applicability domain • For predictive purposes
Rescoring Vina • Available as rf-score-3.tgz for Linux & Windows • N = 1300 vs N = 2897
Largely Unexplored Research Area • Protein family-specific SFs • Pose generation error • Data selection criteria • Waters and ions • Fingerprint descriptors
Conclusion • Q1: Does circumventing an assumed functional form lead to better predictions? • A1: Yes.
Ongoing Directions • Q2: Under what conditions? • Q3: How to improve predictions of docked poses?