An Introduction to Open Small-molecule Resources of High Utility for Systems Biologists

Tutorial for the International Conference on Systems Biology, Göteborg, August 2008
Christopher Southan, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, UK

Context
• Medicinal chemistry has a long history of bridging biology and chemistry by identifying compounds that produce biological effects
• It is increasingly recognised that bioactive compounds are an essential part of the perturbation toolbox for systems biology
• Advancing biological knowledge via a broad spectrum of small-molecule investigations can improve understanding not only of systems biology but also of disease mechanisms, and can open new opportunities for therapeutic intervention

Systems Chemical Biology
Oprea et al. Nat Chem Biol. 2007;3(8):447-50. PMID: 17637771
“The increasing availability of data related to genes, proteins and their modulation by small molecules has provided a vast amount of biological information leading to the emergence of systems biology and the broad use of simulation tools for data analysis. However, there is a critical need to develop cheminformatics tools that can integrate chemical knowledge with these biological databases and simulation approaches, with the goal of creating systems chemical biology.”

Chemical Biology goes back a long way…

So does Bioactive Compound Structure Representation…

But… Times Have Changed for Chemical Information

Strophanthidin: from 1952 to 2008: Now just a click to Hinxton…

Or Bethesda….

The times have also changed for Chemical Biology

And the Union of Chemistry and Biology

November 2004: The Seeds of Revolution

PubChem and ChEBI: Revolutionary Consequences
• Arrival of the “missing entity” of formal and linked chemical structure representation within the global web of bioinformatic relationships
• Ability to search across links between biochemical data, biological effects and chemical structure information
• Deposition not just of HTS results but of a wide range of other types of screening data, directly linked to chemical structure information in public repositories
• Proliferation of cheminformatics tools, databases, nomenclatures, and ontologies in the public domain
• A quantum jump in the global enablement of chemical biology and medicinal chemistry

Post-Revolution: How Many Compounds are Out There?
• Chemical Structure Lookup Service: 36 million, 100 sources
• ChemSpider: 21.5 million, 150 sources
• PubChem: 19,296,269, 70 sources
• SureChem: 9 million from US, European and WO patents
But how many are verified as bioactive?

Relationships in Bioactive Chemical Space
[Diagram labels: metabolomes & natural products; drugs; chem genomics & sys biol probes; assay data; drug-like cpds from literature & patents; Protein Sequences]

Searchable Chemical Structure Designations and Representations in Databases
• SD/MOL files
• IUPAC standard name
• Sketched image
• SMILES
• InChI codes
• InChI strings
• Experimental 3D structure
• Code names (CID 121880)
• Generic, trade and MeSH names
• CAS numbers
• Database accession numbers, e.g. PubChem CID, SID, ChEBI ID, ChemSpider ID
All can be exact-match searched, some allow similarity searching, and some also inter-convert

SD/MOLfile
The basic MDL chemical table files of atoms, bonds, connectivity and 3D coordinates

benzene
  ACD/Labs0812062058

  6  6  0  0  0  0  0  0  0  0  1 V2000
    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0
  4  2  2  0  0  0  0
  5  3  1  0  0  0  0
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
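To make the layout of the connection table concrete, here is a minimal Python sketch that reads the atom and bond blocks of the benzene example. It assumes the standard fixed-column V2000 layout (three header lines, a counts line, then atom and bond lines); the function name `parse_v2000` is purely illustrative, and the closing `M  END` line is the usual V2000 terminator rather than something shown on the slide.

```python
from typing import List, Tuple

# The benzene record from the slide, written line by line so that the
# fixed-width columns are explicit. "M  END" is the usual terminator.
BENZENE_MOLBLOCK = "\n".join([
    "benzene",
    "  ACD/Labs0812062058",
    "",
    "  6  6  0  0  0  0  0  0  0  0  1 V2000",
    "    1.9050   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.9050   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7531   -0.1282    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.7531   -2.7882    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.3987   -0.7932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.3987   -2.1232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  2  1  1  0  0  0  0",
    "  3  1  2  0  0  0  0",
    "  4  2  2  0  0  0  0",
    "  5  3  1  0  0  0  0",
    "  6  4  1  0  0  0  0",
    "  6  5  2  0  0  0  0",
    "M  END",
])

Atom = Tuple[str, float, float, float]   # (element symbol, x, y, z)
Bond = Tuple[int, int, int]              # (first atom, second atom, bond order)


def parse_v2000(molblock: str) -> Tuple[List[Atom], List[Bond]]:
    """Read atoms and bonds from a V2000 molblock (illustrative helper)."""
    lines = molblock.split("\n")
    counts = lines[3]                    # line 4 holds the atom and bond counts
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])

    atoms: List[Atom] = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])            # coordinates occupy 10 columns each
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()     # element symbol follows one space
        atoms.append((symbol, x, y, z))

    bonds: List[Bond] = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return atoms, bonds
```

Calling `parse_v2000(BENZENE_MOLBLOCK)` returns six carbon atoms and six bonds whose orders alternate between 1 and 2, which is the Kekulé representation of the ring.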

Experimental 3D Structures
Cn3D view of PDB 1I7G on the left; PubChem tesaglitazar (CID 208901) on the right

SMILES: Simplified Molecular Input Line Entry System, a line notation for encoding molecular structures
• Human readable
• Interconverts with 2D sketchers
• Can then be searched
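Because SMILES is a plain line notation, even a few lines of code can extract useful information from it. The Python sketch below is a deliberately simplified tokenizer, not a full SMILES parser: it counts heavy atoms by recognising bracket atoms, the common organic-subset symbols and their lowercase aromatic forms, and it ignores bonds, rings, charges and hydrogens. The function name `heavy_atom_counts` is illustrative only.

```python
import re
from collections import Counter

# One token per atom: a bracket atom such as [Na+], an organic-subset symbol
# (two-letter halogens are listed first so "Cl" is not read as C), or a
# lowercase aromatic symbol.
_ATOM_TOKEN = re.compile(r"\[([^\]]+)\]|(Cl|Br|[BCNOPSFI])|([bcnops])")

# Inside brackets: optional isotope digits, then the element symbol.
_BRACKET_SYMBOL = re.compile(r"\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a SMILES string (simplified)."""
    counts: Counter = Counter()
    for bracket, aliphatic, aromatic in _ATOM_TOKEN.findall(smiles):
        if bracket:
            match = _BRACKET_SYMBOL.match(bracket)
            if match is None:          # e.g. a wildcard atom; skip it
                continue
            symbol = match.group(1).capitalize()
        else:
            symbol = aliphatic or aromatic.capitalize()
        if symbol != "H":
            counts[symbol] += 1
    return counts
```

Applied to the Nutlin-3 SMILES quoted on the patent-checking slide of this tutorial, the function counts 30 carbons, 4 nitrogens, 4 oxygens and 2 chlorines. A quick check of this kind can catch a truncated copy-and-paste before a string goes into a search box.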

Structure Sketchers/Converters

IUPAC Systematic Naming of Organic Chemical Compounds
• International Union of Pure and Applied Chemistry (IUPAC)
• Should be human readable and allow an unambiguous structural formula to be drawn
• Usable for automated text-to-structure conversion
• Taxol: (2aR,4S,4aS,6R,9S,11S,12S,12aR,12bS)-1,2a,3,4,4a,6,9,10,11,12,12a,12b-Dodecahydro-4,6,9,11,12,12b-hexahydroxy-4a,8,13,13-tetramethyl-7,11-methano-5H-cyclodeca(3,4)benz(1,2b)oxet-5-one 6,12b-diacetate, 12-benzoate, 9-ester with (2R,3S)-N-benzoyl-3-phenylisoserine

IUPAC International Chemical Identifier (InChI): Textual Identifier for Chemical Substances
• A formalized, IUPAC-defined string encoding of the chemical structure, but not readily human readable
• Expresses more information than the simpler SMILES notation and differs in that every structure has a unique InChI string
• The InChI algorithm converts structural information in a three-step process, but without explicit 3D information: normalization (to remove redundant information), canonicalization (to generate a unique number label for each atom), and serialization (to give a string of characters)
• The 25-character InChIKey is a hashed version of the full InChI, designed to allow easy web searches for chemical compounds (e.g. Google)

CAS Registry Number
• Unique numeric identifier
• Contains up to 10 digits, divided by hyphens into three parts, e.g. 58-08-2 for caffeine (Google it)
• Has no chemical significance
• Widely used but not open-access, because the source chemical information links to the CAS commercial databases, e.g. SciFinder
• Consequently the consistency of mappings to open identifiers cannot be verified
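Although the mapping from a CAS number to a structure cannot be checked openly, the format of the number itself can. The Python sketch below splits a number into its three hyphen-separated parts and verifies the final digit, which is a check digit: the sum of the preceding digits, each weighted by its position counted from the right, modulo 10. The check-digit rule is not stated on the slide and is added here as background; the function names are illustrative.

```python
import re
from typing import Optional, Tuple

# Three hyphen-separated parts: 2 to 7 digits, 2 digits, 1 check digit
# (so at most 10 digits in total).
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def parse_cas(cas: str) -> Optional[Tuple[str, str, int]]:
    """Split a CAS Registry Number into its three parts, or return None."""
    match = _CAS_PATTERN.match(cas.strip())
    if match is None:
        return None
    return match.group(1), match.group(2), int(match.group(3))


def cas_checksum_ok(cas: str) -> bool:
    """Return True if the string is well formed and its check digit matches."""
    parts = parse_cas(cas)
    if parts is None:
        return False
    first, second, check_digit = parts
    digits = first + second
    # Weight the rightmost digit by 1, the next by 2, and so on.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit
```

For caffeine, `cas_checksum_ok("58-08-2")` is true because 8×1 + 0×2 + 8×3 + 5×4 = 52, and 52 modulo 10 is 2. A passing check only shows that the number is internally consistent, not that it is assigned to the compound you expect.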

PubChem Identifiers: CIDs and SIDs
• PubChem is the NCBI informatics backbone for the NIH Molecular Libraries Initiative (MLI)
• A suite of three databases: PubChem Compound (unique structures with computed properties), PubChem BioAssay (results supplied by depositors) and PubChem Substance (deposited compound structures)
• The ten MLI-funded screening centers run cellular and target-based HTSs using a compound collection of ~250K and submit the results to PubChem

PubChem is now a Global Hub
Including bioinformatic dbs with in-links:
• MMDB, PDB ligands: 55K
• ChEBI, enzyme ligands: 8K
• ZINC, ready-to-dock: 3.8 million
• KEGG, drugs and metabolites: 14K
• ChemBank, chemical genomics: 0.4 million
• Human Metabolite db: 2K
• ChemIDplus, NIH tox data: 383K
• MEROPS protease inhibitors
• ChemSpider: 20 million
• DrugBank, drugs and targets: 4K
• Drugs of the Future: 3.4K
• GPCR-Ligand Database
• Nature Chemical Biology: 0.8K
• LIPID MAPS, metabolism: 8.8K

Searchable Measures of Chemical Similarity
• 1D: measured or computed molecular properties, e.g. molecular weight, number of rings, molecular surface area or volume, pKa, logP
• 3D: mapped molecular surfaces, chemical graphs, spectral descriptors, distribution of electrostatic charge around a molecule
• 2D: fingerprints are by far the most common, based on a bit-string encoding of substructural occurrences

Molecular Fingerprints for Similarity Searching
• Each bit in the fingerprint (or fragment bit-string) represents one molecular fragment. Typical length is ~1000 bits
• The bit string for a molecule records the presence (“1”) or absence (“0”) of each fragment in the molecule
• Compare the fingerprints of two molecules to identify common bits and hence common substructures (and hence overall structural resemblance)
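The bit-string idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the eight-entry fragment dictionary and the function names are hypothetical, and the fragments present in each molecule are supplied by hand, whereas real fingerprints use on the order of a thousand bits and detect fragments by substructure matching.

```python
from typing import Iterable, List

# Hypothetical fragment dictionary: each position in the list is one bit.
FRAGMENTS: List[str] = [
    "aromatic ring", "carbonyl", "hydroxyl", "amine",
    "amide", "ether", "chlorine", "carboxylic acid",
]


def fingerprint(fragments_present: Iterable[str]) -> List[int]:
    """Record presence (1) or absence (0) of each dictionary fragment.

    Names that are not in the dictionary are simply ignored.
    """
    present = set(fragments_present)
    return [1 if fragment in present else 0 for fragment in FRAGMENTS]


def as_bit_string(bits: List[int]) -> str:
    """Render a fingerprint as a string of 0s and 1s."""
    return "".join(str(bit) for bit in bits)


def common_fragments(fp_a: List[int], fp_b: List[int]) -> List[str]:
    """List the fragments whose bits are set in both fingerprints."""
    return [fragment
            for fragment, bit_a, bit_b in zip(FRAGMENTS, fp_a, fp_b)
            if bit_a and bit_b]
```

For two hand-annotated molecules, `fingerprint({"aromatic ring", "hydroxyl", "amide"})` and `fingerprint({"aromatic ring", "hydroxyl", "chlorine"})` share the bits for "aromatic ring" and "hydroxyl", which is the kind of overlap a similarity search scores.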

Tanimoto Chemical Similarity
[Diagram: fingerprints A and B, with regions a (A only), c (shared) and b (B only)]
• Tally features:
  • Unique (a, b)
  • Both on (c)
  • Both off (d)
• Similarity formula: Tanimoto = c/(a+b+c)
Beware: chemical similarity searches are not standardised between databases
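The tally and formula above translate directly into code. The Python sketch below computes a, b and c for two equal-length bit lists and returns c/(a+b+c); the function name is illustrative, and returning 0.0 when neither fingerprint has any bit set is an assumed convention, not something the slide specifies.

```python
from typing import Sequence


def tanimoto(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """Tanimoto similarity c / (a + b + c) for two equal-length bit lists."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")

    a = sum(1 for x, y in zip(fp_a, fp_b) if x and not y)   # on in A only
    b = sum(1 for x, y in zip(fp_a, fp_b) if y and not x)   # on in B only
    c = sum(1 for x, y in zip(fp_a, fp_b) if x and y)       # on in both
    # Bits that are off in both (d) do not enter the formula.

    if a + b + c == 0:
        return 0.0   # assumed convention for two empty fingerprints
    return c / (a + b + c)
```

For example, `tanimoto([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 1, 0])` has a = 1, b = 1 and c = 2, giving 2/4 = 0.5. Because "both off" bits are ignored, the score depends on which fragments a database chooses to encode, which is one reason results differ between databases.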

PubChem Chemical Searching

Bio-Chem Data Joins

A Pharmaceutical Portfolio from PubChem

Disambiguation
From: Wells et al., “Reaching for high-hanging fruit in drug discovery at protein–protein interfaces” (1R6N, 1Y2F)

OSRA: Optical Structure Recognition

Checking Chemical Patents
• Taking Nutlin-3 as an example, the SMILES entry from PubChem was pasted into the SureChem search box:
CC(C)OC1=C(C=CC(=C1)OC)C2=NC(C(N2C(=O)N3CCNC(=O)C3)C4=CC=C(C=C4)Cl)C5=CC=C(C=C5)Cl
• There are nine exact matches, including the granted patent application from Roche shown below

Exploring Relationships in Entrez
[Diagram: Entrez data types and the similarity relationships that link them]
• Protein Sequences: BLAST Sequence Similarity
• Protein 3D Structures: VAST Structure Similarity
• Small Molecule Structures: 2D Chemical Structure Similarity (3D soon)
• Bioactivity Assay Results: Activity Profile Similarity
• Literature (PubMed): MeSH indexed Biological Terms

Linkage between Swiss-Prot-DrugBank-PubChem-MMDB (411) (15728) = 181 (2501) see these marketed target links
