Realizing Prospective QSAR through Data Fusion and Modern Descriptors

Curt Breneman*, N. Sukumar, Mark Embrechts, Jack Huang, Kristin Bennett and C. Matt Sundling • August 20, 2007

Prospective QSAR: What are we missing?

Alignment-free Molecular Property Descriptors • Multi-Latent Analysis Modeling Tools • Multi-Objective Learning • Non-linear Model Building and Validation Methods • A Noble Goal: Mapping Chemistry to Biology [Figure: per-pH molecule lists: pH 4 (Mol 1 to Mol 14), pH 5 (Mol 1 to Mol 17), pH 6 (Mol 1 to Mol 17), pH 7 (Mol 1 to Mol 15), pH 8 (Mol 1 to Mol 14)]

Classic statements that led to the “Rule of One”… • “Who wants to hear actors talk?” – H.M. Warner, 1927 • “Forget it – no civil war picture ever made a nickel” – MGM executive, in 1937, advising against production of “Gone with the Wind” • “I think there might be a market for maybe five computers” – Thomas Watson, IBM, 1943 • “Computers in the future may weigh no more than 1.5 tons” – Popular Mechanics, 1949 • “There is no reason anyone would want a computer in their home” – Ken Olsen, founder of Digital Equipment Corporation, 1977

Model Parsimony “Rules” • “Simple models are better” • “Interpretable models are better” • Reality: need to balance predictive ability and interpretability

QSAR/QSPR Observations • Even though we have thousands of descriptors at our disposal, we are still missing something… • Different descriptor types may not play well together, but each type may carry valuable information in different forms • Special methods are needed to collect and combine intrinsic information • Some descriptor types vary linearly with modeling endpoints • (many/most do not! This forces a choice…) • Non-linear modeling, classification, pattern recognition, or…? • When can particular models be successfully applied? • Reality: Model Applicability Domain knowledge and Data Fusion are key to future success

The Data Mining Process [Pyramid, top to bottom: WISDOM, UNDERSTANDING, KNOWLEDGE, INFORMATION, DATA]

Data Mining Domains • Domains: Database Marketing; Finance; Medicine: Signal Processing, Diagnostics; Manufacturing; QSAR; Homeland Security; Detection of contraband; Intrusion Detection; Fraud Detection; Text mining; Bioinformatics • Process phases: Problem understanding; Data understanding; Data preparation; Modeling; Evaluation; Deployment

Knowledge Discovery and Data Fusion [Diagram labels: Database #1 … Database #n; FUSED DATA; Domain expert; molecular understanding]

Data Fusion: A Combination of Strengths

Data Fusion: Definition • Data fusion is the process of combining data to refine state estimates and predictions (Alan N. Steinberg, Christopher L. Bowman, Franklin E. White, 1999) • Data fusion refers to the process of combining data from disparate databases/sources such that the resulting information is in some sense “better” than would be possible when these databases/sources were used individually. • - Synergism: the whole is more than the sum of the parts • - The term “better” can mean more accurate, more complete or more dependable; it can also mean more understandable, transparent or comprehensive • - Data sources for a fusion process do not have to originate from the same/similar sources • - Ability to deal with conflicting data • - Key issue is being able to produce interim results that the algorithm can revise as more data becomes available

Different Definitions: European and American Views • European definition: emphasizes different databases • American definition: emphasizes different fusion levels • Both emphasize that fusion should exceed sum of parts • Recent trends: • - Data fusion taxonomies that fit all possible application domains • - Auto-fusion • - Mixing of fusion levels • - New paradigms for level-3 fusion (e.g., meta-learning)

JDL Data Fusion Model: Background • Data Fusion Model developed by the Joint Directors of Laboratories (JDL) data group (a US DoD government committee overseeing US defense technology, 1987) • Purpose: • - categorize different types of fusion processes • - provide a common frame of reference for fusion discussions • - facilitate understanding of problem types where data fusion is applicable • - codify the commonality among problems • - aid in the extension of previous solutions • - provide a framework for investment in automation

Original JDL Data Fusion Model (1992) [Diagram: Sources feed the Data Fusion Domain, which contains Source Pre-Processing; Level One: Object Refinement; Level Two: Situation Refinement; Level Three: Threat Refinement; Level Four: Process Refinement; and a Database Management System with a Support Database and a Fusion Database]

Auto-Fusion model for QSAR (2006)

Auto-Fusion Levels for QSAR (2007) • Level I: Database (Descriptor) Level • Fuse different descriptors in a flat file • Descriptor selection with filters • Descriptor selection with wrappers • Level II: Feature Level • Generate new features and integrate features • PCA (principal component analysis) features • PLS (partial least squares, also read as projection to latent structures) features • ICA (independent component analysis) features • Kohonen, SVM … features • Level III: Model Fusion • Consensus models (e.g., LOO and BOO) • LOO (leave-one-out models) • BOO (bootstrap models) • Random forest models • Voting schemes • ROC boosting schemes (classification only) • Meta-Learning models

Albumin Binding: Example Dataset • Binding affinities to human serum albumin (originally 95 molecules) • 95 → 93 molecules, activity from consistent experiments (2 molecules were dropped) • 83 training data, 10 test data (same as in original paper) • Fusion of MOE and PEST descriptors

Descriptor sets used in sample analysis • MOE (i3D, 2D) • PEST / TAE • - shape descriptors • - histogram descriptors • - wavelet descriptors • - statistical descriptors

Descriptor Types and Information Content • Hierarchy of descriptors (data content): Electronic wavefunction or simulation-based ‘3D descriptors’ (e.g. shape/property hybrids); ‘2D descriptors’ (e.g. connectivity information); Molecular formulae / simple descriptive information [Diagram axis labels: INFORMATION CONTENT, OBFUSCATION, COMPLEXITY, COMPUTATION TIME] • Workflow: Molecular Structures → Descriptors → Model → Activity

Surface Property Distribution Histograms (RECON/TAE) Descriptors • Molecular surface property distributions can be represented as RECON/TAE histogram bin descriptors
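As a rough illustration of the histogram idea, the sketch below bins a list of surface property values into equal-width bins and reports the fraction of surface points in each bin. It is not the RECON/TAE implementation: the function name `histogram_descriptors`, the bin range, and the sample values are hypothetical, chosen only to show how a property distribution becomes a fixed-length descriptor vector.

```python
def histogram_descriptors(values, lo, hi, n_bins):
    """Fraction of surface points falling in each of n_bins equal-width bins."""
    if not values:
        raise ValueError("need at least one surface property value")
    if n_bins <= 0 or hi <= lo:
        raise ValueError("need n_bins > 0 and hi > lo")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / width)
        # Clamp so that out-of-range values land in the first or last bin.
        idx = max(0, min(n_bins - 1, idx))
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


# Hypothetical electrostatic potential values sampled on a molecular surface.
surface_ep = [-0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.7, 0.95]
bins = histogram_descriptors(surface_ep, lo=-1.0, hi=1.0, n_bins=4)
print(bins)
```

Because the bins are fractions rather than raw counts, molecules with different surface areas (different numbers of sampled points) yield comparable vectors; that normalization is a design choice of this sketch.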

Wavelet Representations of Molecular Surface Properties

Molecular Surface Encoding: Wavelet Coefficients
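A minimal sketch of how a sampled property distribution could be turned into wavelet coefficients is given below. The slides do not say which wavelet family is used, so the orthonormal Haar wavelet here is an assumption made for simplicity, and `haar_dwt` is a hypothetical helper name.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar decomposition of a power-of-two-length signal.

    Returns [final approximation, coarsest details, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    s = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / s for a, b in pairs]
        approx = [(a + b) / s for a, b in pairs]
        details = detail + details  # coarser levels go in front of finer ones
    return approx + details


# Hypothetical 8-bin surface property distribution.
distribution = [1.0, 3.0, -2.0, 0.5, 4.0, 4.0, 0.0, -1.0]
coefficients = haar_dwt(distribution)
print(coefficients)
```

The transform is orthonormal, so the sum of squared coefficients equals the sum of squared inputs. Keeping only the largest-magnitude coefficients would then give a compact encoding, which is the usual motivation for wavelet descriptors.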

PEST: Molecular Shape/Property Hybrid Encoding • A TAE property-encoded surface is subjected to internal ray reflection analysis. • A ray is reflected repeatedly inside the electron density isosurface until the molecular surface is adequately sampled. • Molecular shape information is obtained by recording the ray-path segment lengths, reflection angles and property values at each point of incidence. • Adds shape information that encodes the spatial relationships of surface properties • Alignment-free • Curt M. Breneman, C. Matthew Sundling, N. Sukumar, Lingling Shen, William P. Katt and Mark J. Embrechts, “New developments in PEST shape/property hybrid descriptors” J. Computer-Aided Mol. Design, 17, 231–240 (2003) • Karthigeyan Nagarajan, Randy Zauhar, and William J. Welsh, “Enrichment of Ligands for the Serotonin Receptor Using the Shape Signatures Approach” J. Chem. Inf. Model., 45, 49–57 (2005)

Models • Models • - PLS (leave-one-out aggregated bagging model) • - K-PLS (leave-one-out aggregated bagging model) • Procedure • - Eliminate non-changing features • - Eliminate descriptors with 4-5 sigma outliers • - Eliminate cousin descriptors • Metrics • - r2_LOO and R2_LOO provide anticipated performance on test data • - q2_BAG and Q2_BAG give actual performance on (stratified) test data (same as in paper)
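The three-step descriptor clean-up in the procedure above can be sketched with NumPy as follows. Two readings are assumptions, since the slide does not define them: a "4-5 sigma outlier" descriptor is taken to be a column containing any value more than `sigma_cut` standard deviations from its mean, and "cousin descriptors" are taken to be columns whose absolute Pearson correlation with an already kept column reaches `corr_cut`. The function name and both thresholds are hypothetical.

```python
import numpy as np


def filter_descriptors(X, sigma_cut=4.0, corr_cut=0.95):
    """Return indices of descriptor columns that survive three filters."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    # 1. Drop non-changing (constant) descriptors.
    keep = [j for j in range(X.shape[1]) if std[j] > 1e-12]
    # 2. Drop descriptors containing an extreme outlier (assumed reading).
    keep = [j for j in keep
            if np.max(np.abs(X[:, j] - X[:, j].mean())) / std[j] <= sigma_cut]
    # 3. Drop "cousins": greedily keep the first column of each correlated group.
    selected = []
    for j in keep:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < corr_cut
               for k in selected):
            selected.append(j)
    return selected


# Hypothetical 30-compound descriptor table.
t = np.arange(30.0)
spike = np.zeros(30)
spike[0] = 100.0
X = np.column_stack([np.full(30, 7.0),  # constant
                     t,                 # informative
                     2 * t + 1,         # cousin of column 1
                     spike,             # single extreme value
                     t % 5])            # informative, weakly correlated with t
kept = filter_descriptors(X)
print(kept)
```

The greedy cousin filter depends on column order, so in practice one might first sort columns by some relevance score; that refinement is left out here.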

Characteristics of High-Quality Predictive Models • All descriptors used in the model are significant • None of the descriptors account for single peculiarities • No leverage or outlier compounds in the training set (Gisbert, 2006) • Cross-validation performance should show: • Significantly better performance than that of randomized tests • Training set and external test set homogeneity

AutoFusion Levels for Albumin Data • Level 1: Data level fusion - The combination of several sources of raw data generated by different descriptor generation algorithms, such as PEST and MOE, to produce new (raw) data. • There is empirical evidence that merely concatenating multidimensional descriptors often improves model performance only marginally. Data alignment is necessary before implementing predictive modeling for multi-source data. • Level 2: Feature level fusion - Each descriptor generation algorithm provides calculated descriptors from which a feature vector is extracted right after the data alignment. • Feature level fusion combines the feature vectors from different descriptor sets into a single new data set, followed by a common model selection, modeling and prediction procedure. The feature extraction can be based on PCA, PLS and/or independent component analysis (ICA). • Level 3: Hybrid data-decision level fusion - This architecture follows a hybrid scheme, combining data and decision level fusion. • The procedure from the beginning through model selection is identical to the data level fusion. Predictive models are based on bagged models.
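Level 2 fusion as described above can be sketched in a few lines: extract a small number of features from each descriptor block separately, then concatenate the feature vectors. The sketch uses PCA via a singular value decomposition; the block names, sizes, and component count are hypothetical, the random arrays only stand in for real MOE and PEST descriptor tables, and constant columns are assumed to have been removed beforehand so that autoscaling is well defined.

```python
import numpy as np


def pca_scores(X, n_components):
    """Project autoscaled data onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)  # assumes no constant columns
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def feature_level_fusion(blocks, n_components):
    """Concatenate per-block PCA scores into one fused feature matrix."""
    return np.hstack([pca_scores(B, n_components) for B in blocks])


# Stand-ins for two aligned descriptor blocks (same compounds, same row order).
rng = np.random.default_rng(0)
moe_block = rng.normal(size=(20, 8))
pest_block = rng.normal(size=(20, 12))
fused = feature_level_fusion([moe_block, pest_block], n_components=3)
print(fused.shape)
```

Extracting features per block, rather than running one PCA on the concatenated table, keeps a large block from dominating the leading components. Scores from different blocks are not orthogonal to each other, so a downstream model still has to cope with some redundancy.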

Albumin Data: Level 1 Data Fusion • Data level fusion of descriptors marginally outperforms MOE descriptors alone • Linear models tend to be more robust with better performance

MOE Descriptors: Albumin PLS & K-PLS, Level 1 [Plot panels: K-PLS (5 LVs, s = 7); PLS (5 LVs)]

Level 1: Bootstrapped Y-scrambling evaluation
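The idea behind a Y-scrambling test can be sketched as follows: refit the same model many times after randomly permuting the response, and check that the real model scores far better than the scrambled ones. This sketch is a simplification and not the procedure used for the slide: it uses ordinary least squares and the fitted R² instead of bagged PLS/K-PLS models, and all names and data are hypothetical.

```python
import numpy as np


def r_squared(X, y):
    """Fitted R^2 of an ordinary least-squares model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def y_scrambling(X, y, n_trials=50, seed=0):
    """Return R^2 for the real response and for n_trials permuted responses."""
    rng = np.random.default_rng(seed)
    real = r_squared(X, y)
    scrambled = np.array([r_squared(X, rng.permutation(y))
                          for _ in range(n_trials)])
    return real, scrambled


# Synthetic data with a genuine linear structure-activity relationship.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
real_r2, scrambled_r2 = y_scrambling(X, y)
print(real_r2, scrambled_r2.mean(), scrambled_r2.max())
```

A model whose score is not clearly separated from the scrambled distribution is probably fitting chance correlations. For a stricter test, the same comparison can be made with a cross-validated statistic in place of the fitted R².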

Albumin Data: Level 2 – Feature Level Fusion • Based on fusing Principal Components (PCAs or ICAs) • Level-2 data fusion gives improved results • Best fused models tend to be nonlinear K-PLS models • Level 2 fusion procedure requires tuning

Albumin Data: Level 3 – Decision Level Fusion • All our bagged predictions are a form of level-3 decision level fusion • Bagging predictions brings a slight improvement in Q2, but vastly reduces the variance • Other level-3 fusion schemes are possible: • - Consensus modeling: based on results from many vastly different models • - Random Forest (RF) models: a form of consensus modeling • - (RF) ROC boosting (Evangelista, Embrechts & Szymanski): for classification only • - MetaLearning: new proposed concept (see next slide)
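Bagging as a form of decision-level fusion can be sketched like this: train one model per bootstrap resample of the training set and average the predictions. The base learner below is ordinary least squares, standing in for the PLS and K-PLS models of the talk; the function name, ensemble size, and data are hypothetical.

```python
import numpy as np


def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
    """Average predictions of least-squares models fit on bootstrap resamples.

    Returns the ensemble mean and the spread (std) across ensemble members.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_train)
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        A = np.column_stack([np.ones(n), X_train[idx]])
        coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
        preds.append(A_test @ coef)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)


# Synthetic, noise-free linear data so the expected predictions are known.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(30, 2))
y_train = 3.0 + X_train @ np.array([1.0, 2.0])
X_test = rng.normal(size=(5, 2))
mean_pred, spread = bagged_predict(X_train, y_train, X_test)
print(mean_pred, spread)
```

The per-compound spread across ensemble members is a by-product worth keeping: a large spread flags predictions that depend strongly on which training compounds happened to be sampled.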

Level 3: Decision Level Fusion via Meta Learning • Conceptual Idea: • Consider many different models (K-PLS, SVM, PLS, neural networks, K-NN) • - models could be level-2 data fusion models • Use predictions from hundreds of different models as new Meta Descriptors • Build a meta-model for Level 3 Data Fusion Predictions • Refinements: • Let tuning parameters vary freely (e.g., 2 – 10 latent variables) • Take predictions for all parameter settings as new Meta Fingerprint Descriptors • Discussion • Elegant way to perform consensus modeling • Tuning is not necessary → robust models • Practical implementation requires extensive software development/integration • Next Steps • New descriptor types that add complementary information to the fusion model, such as…
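The meta-learning concept above resembles what is commonly called stacking, and a minimal sketch under that assumption follows. Ridge regression with several penalty values stands in for the "many different models" and their freely varying tuning parameters. Each column of the meta-descriptor matrix holds out-of-fold predictions for one setting, so the meta-model never sees a base prediction made on a compound that base model was trained on. The out-of-fold step is a design choice of this sketch, not something the slide specifies, and all function names are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge regression with an unpenalized intercept; returns coefficients."""
    A = np.column_stack([np.ones(len(y)), X])
    penalty = lam * np.eye(A.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)


def ridge_predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def meta_descriptors(X, y, lams, n_folds=5):
    """Out-of-fold predictions, one column per base-model setting."""
    n = len(y)
    folds = np.arange(n) % n_folds
    Z = np.zeros((n, len(lams)))
    for f in range(n_folds):
        test = folds == f
        for j, lam in enumerate(lams):
            coef = ridge_fit(X[~test], y[~test], lam)
            Z[test, j] = ridge_predict(coef, X[test])
    return Z


# Synthetic data standing in for a descriptor table and measured activities.
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

lams = [0.01, 1.0, 10.0, 100.0]   # the "freely varying" tuning parameter
Z = meta_descriptors(X, y, lams)  # meta fingerprint: one column per setting
meta_coef = ridge_fit(Z, y, 1e-3)  # the meta-model
meta_pred = ridge_predict(meta_coef, Z)
meta_r2 = 1.0 - np.sum((y - meta_pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(Z.shape, meta_r2)
```

To predict new compounds, each base setting would be refit on the full training set and its predictions passed through the meta-model. Mixing genuinely different learners, as the slide proposes, only changes how the columns of `Z` are produced.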

First hydration shell descriptors of CspB protein • Hydration-based descriptors developed and implemented by Shekhar Garde, Rahul Godawat and Ishita Manjrekar at RPI/RECCR [Figure panels: protein amino acids (green = hydrophobic, blue = positively charged, red = negatively charged); local electron density projected onto the triangulated protein surface; local water-O density; local water-H density]

New Representations: Simulation-based hydration descriptors [Figure panels: Structure; Water O fluctuation; Electron density] • Statistical analysis of the dynamics of water distributions solvating proteins is used to create a set of regional property descriptors: • average local water density, • water density fluctuations, • local water orientations, • electron density profile due to water packing and orientations (polarization), • electrostatic potential on protein surface induced by the vicinal water structuring, • dynamics of local water.

Hydration descriptors through PMF expansion • Developing an efficient alternative to full simulations by means of a potentials-of-mean-force expansion employing a library of lower-order correlation functions derived from explicit simulations to predict the average equilibrium density and the orientation profile of water in the space surrounding biomolecules or ligands. [Figure: Water density values in space surrounding an alpha-helix (left) and a protein X (right) predicted using the PMF expansion (cyan) and obtained from exact simulation (magenta)]

Summary • Data Fusion in Prospective Modeling: • Enables multiple relationships within data to be quantified • Level 2 Fusion outperforms combined descriptor fields • Current Methods: • Some already employ specific aspects of Data Fusion • More can be done to extract maximum information from datasets • Descriptor deficits require more attention • The Way Forward • Additional descriptor types that incorporate entropy terms and probe dynamics • Robust modeling through Level 3 Fusion and Meta-Learning • Rigorous model (and method) validation protocols

ACKNOWLEDGMENTS • Current and Former members of the DDASSL group • Breneman Research Group (RPI Chemistry) • N. Sukumar • M. Sundling • Min Li • Long Han • Jed Zaretski • Theresa Hepburn • Mike Krein • Steve Mulick • Shiina Akasaka • Hongmei Zhang • C. Whitehead (Pfizer Global Research) • L. Shen (BNPI) • L. Lockwood (Syracuse Research Corporation) • M. Song (Synta Pharmaceuticals) • D. Zhuang (Simulations Plus) • W. Katt (Yale University chemistry graduate program) • Q. Luo (J & J) • Embrechts Research Group (RPI DSES) • Tropsha Research Group (UNC Chapel Hill) • Bennett Research Group (RPI Mathematics) • Collaborators: • Tropsha Group (UNC Chapel Hill - CECCR) • Cramer Research Group (RPI Chemical Engineering) • Funding • NIH (GM047372-07) • NIH (1P20HG003899-01) • NSF (BES-0214183, BES-0079436, IIS-9979860) • GE Corporate R&D Center • Millennium Pharmaceuticals • Concurrent Pharmaceuticals • Pfizer Pharmaceuticals • ICAGEN Pharmaceuticals • Eastman Kodak Company • Chemical Computing Group (CCG)

Reserve Slides

(Con)Fusion of Terminology: Data Mining and Data Fusion [Diagram labels: Data Fusion; Data Mining] • Reference: Alan N. Steinberg, Christopher L. Bowman, Franklin E. White, “Revisions to the JDL Data Fusion Model,” Proc. of the SPIE Sensor Fusion: Architectures, Algorithms, and Applications III, pp. 430–441, 1999.
