Toxicological Relationships Between Proteins Obtained From a Molecular Spam Filter

Florian Nigsch & John Mitchell
F. Nigsch, et al., J. Chem. Inf. Model., 48, 306-318 (2008)
F. Nigsch, et al., Toxicology and Applied Pharmacology, 231, 225-234 (2008)
F. Nigsch, et al., J. Chem. Inf. Model., 48, 2313-2325 (2008)

Florian Nigsch: now at Novartis Institutes, Boston
John Mitchell: soon moving to University of St Andrews

Spam
• Unsolicited (commercial) email
• Approx. 90% of all email traffic is spam
• Where are the legitimate messages?
• Filtering

Analogy to Drug Discovery
• Huge number of possible candidates
• Virtual screening to help in selection process

Properties of Drugs
• High affinity to protein target
• Soluble
• Permeable
• Absorbable
• High bioavailability
• Specific rate of metabolism
• Renal/hepatic clearance?
• Volume of distribution?
• Low toxicity
• Plasma protein binding?
• Blood-Brain-Barrier penetration?
• Dosage (once/twice daily?)
• Synthetic accessibility
• Formulation (important in development)

Multiobjective Optimisation
A drug has to satisfy several objectives at once: synthetic accessibility, bioactivity, solubility, toxicity, permeability and metabolism.
Huge number of candidates … most of which are useless!

Winnow Algorithm
• Invented in the late 1980s by Nick Littlestone to learn Boolean functions
• Name from the verb “to winnow”
• High-dimensional input data
• Natural Language Processing (NLP), text classification, bioinformatics
• Different varieties (regularised, Sparse Network of Winnows - SNoW, …)
• Error-driven, linear threshold, online algorithm
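To make "error-driven, linear threshold, online" concrete, here is a minimal sketch of one common Winnow variant (multiplicative promotion and demotion, often called Winnow2). The promotion factor of 2, the threshold equal to the number of features, and the toy target concept are assumptions for illustration; the slides do not say which variant or parameters were used.

```python
from itertools import combinations


class Winnow:
    """Linear threshold classifier with multiplicative, error-driven updates."""

    def __init__(self, n_features, alpha=2.0, threshold=None):
        self.weights = [1.0] * n_features          # all weights start at 1
        self.alpha = alpha                         # promotion/demotion factor (assumed)
        self.threshold = float(n_features) if threshold is None else threshold

    def predict(self, active):
        # 'active' is the set of indices of features that are present
        return sum(self.weights[i] for i in active) >= self.threshold

    def update(self, active, label):
        """Online step: weights change only when the prediction is wrong."""
        if self.predict(active) == label:
            return False
        factor = self.alpha if label else 1.0 / self.alpha
        for i in active:                           # only active features are touched
            self.weights[i] *= factor
        return True


def train(model, examples, max_epochs=100):
    for _ in range(max_epochs):
        mistakes = sum(model.update(active, label) for active, label in examples)
        if mistakes == 0:                          # a full pass without errors
            break
    return model


# Toy Boolean target (hypothetical): positive if feature 0 OR feature 2 is present.
N_FEATURES = 6
RELEVANT = {0, 2}
examples = [
    (set(subset), bool(RELEVANT & set(subset)))
    for size in range(N_FEATURES + 1)
    for subset in combinations(range(N_FEATURES), size)
]
model = train(Winnow(N_FEATURES), examples)
```

Because irrelevant features are only ever demoted and relevant ones only promoted on this toy concept, the weights separate quickly, which is why the method copes with high-dimensional, sparse inputs.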

Feature Space - Chemical Space
Each molecule is a feature vector m = (f1, f2, …, fn).
[Figure: feature spaces of high dimensionality, with axes f1, f2, f3 and regions labelled COX2, CDK2, CDK1 and DHFR]

Combinations of Features Combinations of molecular features to account for synergies.
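One simple way to let a linear model see synergies is to add a new feature for every pair of features present in a molecule. The sketch below does exactly that; the feature names and the restriction to pairs are illustrative assumptions, not details taken from the slides.

```python
from itertools import combinations


def with_pair_features(features):
    """Return the single features plus one combined feature per pair."""
    base = sorted(set(features))
    pairs = [f"{a}&{b}" for a, b in combinations(base, 2)]
    return base + pairs


# Hypothetical feature names, for illustration only.
expanded = with_pair_features(["aromatic_ring", "carboxylic_acid", "amine"])
print(expanded)
```

Three single features give three pairs, so the expanded list has six entries; the growth is quadratic, which is one reason a learner suited to very high-dimensional input is attractive here.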

Features of Molecules Based on circular fingerprints
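The idea behind circular fingerprints can be sketched without a cheminformatics toolkit: each atom starts with its own label, and in every iteration the label is extended with the labels of its neighbours, so that features describe growing circular environments. This is a simplified illustration under stated assumptions (heavy atoms only, bond orders ignored, readable strings instead of hashes), not the exact fingerprint used in the talk.

```python
def circular_features(atoms, bonds, radius=2):
    """atoms: list of element symbols; bonds: list of (i, j) atom index pairs."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    identifiers = list(atoms)            # radius 0: the atom on its own
    features = set(identifiers)
    for _ in range(radius):
        updated = []
        for i in range(len(atoms)):
            # Sort neighbour labels so the result does not depend on atom order
            env = ",".join(sorted(identifiers[j] for j in neighbours[i]))
            updated.append(f"{identifiers[i]}({env})")
        identifiers = updated
        features.update(identifiers)
    return features


# Ethanol, heavy atoms only: C-C-O
ethanol = circular_features(["C", "C", "O"], [(0, 1), (1, 2)], radius=1)
print(sorted(ethanol))
```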

Training Example

Workflow For predicting protein targets

Protein Target Prediction
• Which protein does a given molecule bind to?
• Virtual Screening
• Multiple endpoint drugs - polypharmacology
• New targets for existing drugs
• Prediction of adverse drug reactions (ADR)
• Computational toxicology

Predicted Protein Targets
• Selection of 233 classes from the MDL Drug Data Report
• ~90,000 molecules
• 15 independent 50%/50% splits into training/test set
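The evaluation protocol of repeated 50%/50% splits could be set up roughly as follows. The seeding scheme and the absence of per-class stratification are assumptions; the slides only state the number of splits and their proportions.

```python
import random


def half_splits(items, n_splits=15, seed=0):
    """Return n_splits independent random 50%/50% (training, test) partitions."""
    splits = []
    for k in range(n_splits):
        rng = random.Random(seed + k)    # one reproducible generator per split
        shuffled = list(items)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        splits.append((shuffled[:half], shuffled[half:]))
    return splits


# Stand-in molecule identifiers (hypothetical).
molecules = list(range(100))
splits = half_splits(molecules)
```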

Predicted Protein Targets Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)
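The quoted figure is a top-3 measure: a prediction counts as correct if the true class is among the three highest-ranking classes. A minimal sketch of that calculation follows, with made-up scores; summarising the splits as mean and standard deviation is an assumption about how the ± value was obtained.

```python
from statistics import mean, stdev


def top_k_hit(scores, true_class, k=3):
    """True if true_class is among the k highest-scoring classes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_class in ranked[:k]


def top_k_accuracy(predictions, k=3):
    hits = [top_k_hit(scores, true_class, k) for scores, true_class in predictions]
    return sum(hits) / len(hits)


def summarise(per_split_accuracies):
    """Mean and sample standard deviation over the independent splits."""
    return mean(per_split_accuracies), stdev(per_split_accuracies)


# Toy scores for two test molecules (illustrative values only).
predictions = [
    ({"COX2": 0.9, "CDK2": 0.5, "DHFR": 0.1, "CDK1": 0.05}, "CDK2"),  # hit
    ({"COX2": 0.2, "CDK2": 0.1, "DHFR": 0.7, "CDK1": 0.6}, "CDK2"),   # miss
]
accuracy = top_k_accuracy(predictions, k=3)
```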

Computational Toxicology
• Model for target prediction
• Annotated library of toxic molecules: MDL Toxicity database, ~150,000 molecules
• Standardisation, MySQL database
• For each molecule we predict the likely target
• Correlations between predicted protein targets and known toxicity codes: Canonical (23), Full (490)

Toxicological Relationships Outline (1)
• Protein target prediction allows us to link (predictively) 150,000 toxic organic molecules to 233 specific protein targets
• Each target is treated as a single protein, although it may be a set of related proteins
• Toxicological databases link (experimentally) these 150,000 molecules to 23 toxicity classes
• Combining these two sources of data matches the 233 proteins with the 23 toxicity classes

Toxicological Relationships Outline (2)
• For each protein target, we have a profile of association with the 23 toxicity classes
• Proteins with similar profiles are clustered together
• We demonstrate that these clusters of proteins can be physiologically meaningful

Predictions Obtained
Target prediction: the highest-ranking class is the predicted protein target (protein code j).
Toxicity codes i annotated for the molecule, e.g.:
• L70 - Changes in liver weight < Liver
• Y07 - Hepatic microsomal oxidase < Enzyme inhibition
• M30 - Other changes < Kidney, Ureter, and Bladder
• L30 - Other changes < Liver
Result matrix R = (r_ij), indexed by toxicity codes i and protein targets j; r_ij is incremented for each prediction.
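Filling the result matrix amounts to one counting loop: predict a protein for each molecule, then increment the entry for that protein and each of the molecule's toxicity codes. In this sketch the molecule identifiers, protein names and the `predict_target` stub are hypothetical; only the toxicity codes come from the slide.

```python
from collections import defaultdict


def build_result_matrix(molecules, predict_target):
    """molecules: iterable of (molecule, toxicity_codes) pairs.

    Returns a sparse matrix as a dict: r[(toxicity_code, protein)] is r_ij.
    """
    r = defaultdict(int)
    for molecule, toxicity_codes in molecules:
        protein = predict_target(molecule)     # highest-ranking predicted class
        for code in toxicity_codes:
            r[(code, protein)] += 1            # one increment per prediction
    return r


# Stand-in for the trained target-prediction model (hypothetical).
stub_predictions = {"mol_a": "protein_1", "mol_b": "protein_1", "mol_c": "protein_2"}

annotated = [
    ("mol_a", ["L70", "Y07"]),
    ("mol_b", ["L70"]),
    ("mol_c", ["M30", "L30"]),
]
R = build_result_matrix(annotated, stub_predictions.get)
```

Each column of R (all entries for one protein) is then that protein's profile of association with the toxicity codes.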

Toxicity Annotations
• Full toxicity codes (490), e.g. Y41 : Glycolytic < Metabolism (intermediary) < Biochemical
• Canonical toxicity codes (23)
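The full codes are written as a hierarchy, from the most specific term to the broadest. A small parser can split such an annotation into its code and levels. Treating the last (broadest) level as the canonical class is an assumption made for this sketch; the slides do not spell out how the 490 full codes map onto the 23 canonical ones.

```python
import re


def parse_toxicity_annotation(annotation):
    """Split 'CODE : specific < ... < broad' into (code, [levels])."""
    code, description = re.split(r"\s*[-:]\s*", annotation, maxsplit=1)
    levels = [part.strip() for part in description.split("<")]
    return code, levels


def canonical_class(annotation):
    # Assumption: the broadest level names the canonical toxicity class
    _, levels = parse_toxicity_annotation(annotation)
    return levels[-1]


code, levels = parse_toxicity_annotation(
    "Y41 : Glycolytic < Metabolism (intermediary) < Biochemical"
)
print(code, levels, canonical_class("L70 - Changes in liver weight<Liver"))
```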

Proteins by Toxicity
Cardiac - G
• Kainic acid receptor
• Adrenergic alpha2
• Phosphodiesterase III
• cAMP Phosphodiesterase
• O6-Alkylguanine-DNA alkyltransferase
Vascular - H
• Angiotensin II AT2
• Dopamine (D2)
• Bombesin
• Adrenergic alpha2
• 5-HT antagonist

Top 5 Proteins by Toxicity
68 distinct proteins for 23 toxicity classes, i.e., 3.0 proteins per canonical toxicity code.
Proteins and their connectivities:
• Lanosterol 14alpha-Methyl Demethylase: 5
• Glucose-6-phosphate Translocase: 4
• IL-6: 4
• Benzodiazepine Antagonist: 3
• Kainic Acid Receptor: 3
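Reading "connectivity" as the number of toxicity classes in whose top-5 list a protein appears (an interpretation, not something the slides define), the counts can be reproduced with a simple tally. The example uses only the Cardiac and Vascular lists shown in the talk, so the numbers differ from the full 23-class figures.

```python
from collections import Counter

# Top-5 proteins for the two toxicity classes listed on the slides.
top_proteins = {
    "Cardiac - G": [
        "Kainic acid receptor",
        "Adrenergic alpha2",
        "Phosphodiesterase III",
        "cAMP Phosphodiesterase",
        "O6-Alkylguanine-DNA alkyltransferase",
    ],
    "Vascular - H": [
        "Angiotensin II AT2",
        "Dopamine (D2)",
        "Bombesin",
        "Adrenergic alpha2",
        "5-HT antagonist",
    ],
}

# Connectivity: in how many classes' top-5 lists does each protein occur?
connectivity = Counter(p for proteins in top_proteins.values() for p in proteins)
distinct_proteins = len(connectivity)
proteins_per_class = distinct_proteins / len(top_proteins)

print(connectivity.most_common(1), distinct_proteins, proteins_per_class)
```

With all 23 classes, the same ratio is 68 / 23 ≈ 3.0 distinct proteins per canonical toxicity code.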

Clustering of Toxicity Classes
Based on predicted protein associations from the result matrix

Correlation Between Toxicity Classes
23 by 23 correlation matrix

Correlation Between Proteins
233 by 233 correlation matrix
Cluster 1 (proteins 6-11)
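A protein-by-protein correlation matrix can be computed by correlating the toxicity profiles of every pair of proteins. The sketch below assumes Pearson correlation and uses invented profiles over four toxicity classes; the slides say only "correlation", so the choice of coefficient is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)      # undefined for constant profiles


def correlation_matrix(profiles):
    """profiles: dict protein -> list of counts per toxicity class."""
    names = list(profiles)
    return {(p, q): pearson(profiles[p], profiles[q]) for p in names for q in names}


# Hypothetical counts over four toxicity classes.
profiles = {
    "protein_1": [10, 2, 0, 1],
    "protein_2": [20, 4, 0, 2],     # same shape as protein_1, twice the counts
    "protein_3": [0, 1, 9, 3],
}
corr = correlation_matrix(profiles)
```

Correlation ignores overall scale, so a protein predicted for many molecules and one predicted for few can still have near-identical profiles.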

We will look at two specific clusters, which are called Cluster 1 and Cluster 4.

Cluster 1
• Carbonic Anhydrase Inhibitor
• Estrogen Receptor Modulator
• LHRH Agonist
• Aromatase Inhibitor
• Cysteine Protease Inhibitor
• DHFR Inhibitor
Cluster 1 (proteins 6-11): within-cluster correlation (without auto-correlation) r = 0.95
Proteins involved in breast cancer
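"Without auto-correlation" means the diagonal entries (each protein with itself, always 1) are left out before summarising the cluster. A minimal sketch, assuming the summary is the mean of the off-diagonal entries and using an invented 4 by 4 matrix:

```python
def within_cluster_correlation(corr, members):
    """Mean correlation over all ordered pairs of distinct cluster members."""
    values = [corr[i][j] for i in members for j in members if i != j]
    return sum(values) / len(values)


# Toy symmetric correlation matrix (illustrative values only).
toy_corr = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.0],
    [0.1, 0.2, 0.0, 1.0],
]
r_cluster = within_cluster_correlation(toy_corr, members=[0, 1, 2])
print(round(r_cluster, 3))
```

Including the diagonal would inflate the value, especially for small clusters, which is why it is excluded.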
