Improving Protein-Ligand Binding Affinity Prediction using Random Forest

Improving Protein-Ligand Binding Affinity Prediction using Random Forest Hongjian Li Department of Computer Science and Engineering Chinese University of Hong Kong 1 April 2014http://istar.cse.cuhk.edu.hk JackyLeeHongJian@Gmail.com

Protein-Ligand Docking • Protein: large pharmaceutical target • Ligand: small drug candidate Hongjian Li, Kwong-Sak Leung, TakanoriNakane and Man-Hon Wong. iview: an interactive WebGL visualizer for protein-ligand complex. BMC Bioinformatics, 15(1):56, 2014.

Docking Problem Definition • Given a protein and a ligand, predict • binding conformation (pose generation) • binding affinity (scoring) Pose 1 Pose 2 8.274 pKd 8.011 pKd

Redocking Success Rates • idock and Vina predicted up to 9 conformations • RMSDi : ith highest binding affinity, i= 1,2,…, 9 • RMSDmin : closest conformation, min(RMSDi) Hongjian Li, Kwong-Sak Leung, Pedro J. Ballester and Man-Hon Wong. istar: A Web Platform for Large-Scale Protein-Ligand Docking. PLoS ONE, 9(1):e85678, 2014.

Redocking Success Rates • PDBbind v2013 refined set (N = 2959) • N = 2959 – 66 (size_xyz < 3) – 2 (Sr,Cs) = 2893

Accuracy of Vina Scoring Function • SD=2.85 kcal/mol on training set (N = 1300) • SD=2.75 kcal/mol on test set (N=116) • 104x in disassociation constant Kd Oleg Trott and Arthur J. Olson. AutoDockVina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31 (2):455–461, 2010.

Scoring Functions • Force field based • AMBER • CHARMM • MedusaScore • VDWScore • Knowledge based • ITScore/SE • DrugScore • SMoG • STScore • Empirically based • GoldScore • ChemScore • ASP • X-Score • AutoDockVina • ID-Score • Cyscore Martiniano Bello, MarletMartínez-Archundia, and Jos Correa-Basurto. Automated docking for novel drug discovery. Expert Opinion on Drug Discovery, pages 1–14, 2013.

Assumed Functional Form • Vina

Assumed Functional Form • Predetermined, theory-inspired, additive

Machine-Learning Scoring Functions • No modelling assumptions, non-parametric • Implicitly capture binding effects that are hard to model explicitly • Random forest • RF-Score • Support vector regression • SVR-Score, MD-SVR Pedro J. Ballester and John B. O. Mitchell. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics, 26(9):1169–1175, 2010. Liwei Li, Bo Wang, and Samy O. Meroueh. Support Vector Regression Scoring of Receptor–Ligand Complexes for Rank-Ordering and Virtual Screening of Chemical Libraries. Journal of Chemical Information and Modeling, 51(9):2132-2138, 2011. Wenhu Zhan, Daqiang Li, JinxinChe, Liangren Zhang, Bo Yang, Yongzhou Hu, Tao Liu, Xiaowu Dong. Integrating docking scores, interaction profiles and molecular descriptors to improve the accuracy of molecular docking: Toward the discovery of novel Akt1 inhibitors. European Journal of Medicinal Chemistry, 75:11-20, 2014.

The Question • Q1: Does circumventing an assumed functional form lead to better predictions?

Four Models • AutoDockVina • Undisclosed nonlinear optimization • MLR::Vina • Multiple linear regression with Vina features • RF::Vina • Random forest with Vina features • RF::VinaElem • Random forest with Vina and RF-Score features

Model 1: AutoDockVina • Vina’s score for the kthconformer: • k=1 for crystal pose only

Model 2: MLR::Vina • Quasi linear • Grid search on w6 • 101 values samples from 0 to 1 with a step size of 0.01 • 16 values sampled from 0.005 to 0.020 with a step size of 0.001

Model 3: RF::Vina • 6 Vina features • P=500 trees • mtry = 1 to 6 • Best mtry on Out-of-Bag (OOB) data

RF Construction & Prediction • RF trains binary trees using the CART algorithm • RF grows tree without pruning from a bootstrap sample of the training data • RF selects the best split at each node from a small number (mtry) of randomly chosen features. mtryϵ {1,…,36} • RF stops splitting a node with <=5 samples • Prediction from an individual tree is the arithmetic mean of its samples in a leaf node • P = 500 trees, RF prediction is arithmetic mean

RF Construction & Prediction • Out-of-bag (OOB) data as internal validation • Possible mtryvalues cover all the feature subset sizes, i.e. {1,…,36}

Model 4: RF::VinaElem • 6 Vina features + 36 RF-Score features • 4 and 9 atom types for protein and ligand, resp. • Occurrences for a particular j–iatom type pair • dcutoff=12Å,Z(C)=6,Z(N)=7,Z(O)=8,Z(P)=15,Z(S)=16

Visualization of Feature Vector • Protein-ligand atomic contacts as blue lines Pedro J. Ballester. Machine Learning Scoring Functions Based on Random Forest and Support Vector Regression. Pattern Recognition in Bioinformatics, Lecture Notes in Computer Science, 7632:14–25, 2012.

PDBbind v2012 • A: experimentally determined structures as of 2012 • B: protein-ligand, DNA/RNA-ligand, protein-DNA/RNA, prot-prot complexes • C: Complexes with Kd, Ki, or IC50 • 7,121 protein-ligand complexes • D: Prot-lig complexes with Kd or Ki • E: 67 non-redundant clusters • BLAST cutoff of 90% sequence identity • With highest, lowest, medium affinity Renxiao Wang, Xueliang Fang, Yipin Lu, and ShaomengWang. The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures. Journal of Medicinal Chemistry, 47(12):2977–2980, 2004. Renxiao Wang, Xueliang Fang, Yipin Lu, Chao-YieYang, and Shaomeng Wang. The PDBbind Database: Methodologies and Updates. Journal of Medicinal Chemistry, 48(12):4111–4119, 2005.

PDBbind v2007 Benchmark • Refined set (N = 1300) • Training set: 1105 complexes • Test set: 195 complexes • Measured binding affinities span >12 orders of magnitude • 65 non-redundant clusters

PDBbind v2013 Blind Benchmark • Test set: refined13\refined12 (N = 382) • 4 training sets • refined02 (N = 792) • refined07 (N = 1300) • refined10 (N = 2059) • refined12 (N = 2897) • | Train ∩ Test | = 0

Performance Measures • Root Mean Square Error RMSE • Standard deviation in linear correlation SD • Pearson correlation Rp • Spearman rank-correlation Rs

Results and Discussion • Vina vs. MLR::Vina vs RF::Vina vs RF::VinaElem • Overfitting • MLSFs are remarkably more accurate than Empirical SFs • The impact of data representation on the applicability domain of SFs • Machine-learning SFs assimilate data better than Empirical SFs • Achieved improvement over AutoDockVina using Random Forest • MLSFs are a largely unexplored research area

MLR::Vina Better than Vina • w6=0.010 for MLR::Vina

RF::Vina Better than MLR::Vina • w6=0.010 for MLR::Vina • seed=99789 for RF::Vina

RF::VinaElem Better than RF::Vina • seed=99789 for RF::Vina • seed=72551 for RF::VinaElem

Ligand- and Protein-Only Features • With and without Nrot • QSAR models

Overfitting • Machine-learning SFs tend to overfit training data • Challenged by Cao and Li • MLR::Vina • SD=1.71 on training set (N = 1105) • SD=1.91 on test set (N = 195) • RF::Vina • SD=0.60 on training set (N = 1105) • SD=1.62 on test set (N = 195) Yang Cao and Lei Li. Improved protein–ligand binding affinity prediction by using a curvature-dependent surface-area model. Bioinformatics, 2014.

MLSFs More Accurate • AutoDockVina with and without coupling a linear model

MLSFs More Accurate • RF::VinaElem with and without coupling a linear model

MLSFs More Accurate • 22 SFs on the PDBbindv2007 benchmark

Applicability Domain • Same training set • Same test set • Same features • … implies • Same applicability domain • For predictive purposes

MLSFs Assimilate Data Better

Rescoring Vina • Available as rf-score-3.tgz for Linux & Windows • N = 1300 vs N = 2897

Largely Unexplored Research Area • Protein family-specific SFs • Pose generation error • Data selection criteria • Waters and ions • Fingerprint descriptors

Conclusion • Q1: Does circumventing an assumed functional form lead to better predictions? • A1: Yes.

Ongoing Directions • Q2: Under what conditions? • Q3: How to improve predictions of docked poses?

Improving Protein-Ligand Binding Affinity Prediction using Random Forest

Improving Protein-Ligand Binding Affinity Prediction using Random Forest

Presentation Transcript

Random Forest

Random Forest

PROTEIN BINDING

Ligand configurational entropy and protein binding

Protein-Ligand Docking

Using DOCK to characterize protein ligand interactions

Protein-Ligand Interaction Prediction: An Improved Chemogenomics Approach

Computational Design of Ligand-binding Proteins with High Affinity and Selectivity

Protein-ligand geometry

Predicting ligand binding sites on protein surface

Improving Web Search Results Using Affinity Graph

An Atomic Four-Body Potential for the Prediction of Protein-Ligand Binding Affinity

Improving Web Search Results Using Affinity Graph

Ligand Binding Site Prediction for HIV-1 Protease using Shape Comparison Techniques

Chapter 5.1: Protein Function - Reversible Binding of Protein to a Ligand

Domain-Based Protein-Protein Interaction Prediction Using Random Decision Forest Framework

Ligand-binding site prediction based on 3D protein modeling

Protein-protein and Protein-ligand Docking

Q- SiteFinder : an energy-based method for the prediction of protein- ligand binding sites

Weather Prediction Model using Random Forest Algorithm and Apache Spark

Protein-Ligand Docking