Bridging Bioinformatics and Chem(o)informatics

Gary Wiggins • School of Informatics, Indiana University • wiggins@indiana.edu • Yan He (SLIS MLS Student) • Meredith Saba (SLIS MLS Student)

Provocative Thought “While much bioscience is published with the knowledge that machines will be expected to understand at least part of it, almost all chemistry is published purely for humans to read.” • Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3201.

Overview of the Talk • Review of ACS CINF 2004 Papers • Review of Relevant Articles • Public Chemistry Databases and Data Repositories with Bioinformatics Info/Links • Overview of Web Services • NIH-funded Projects Underway or Planned at Indiana University

“The Bigger Picture — Linking Bioinformatics to Cheminformatics” • American Chemical Society Division of Chemical Information (CINF) Symposium, Anaheim, Spring 2004 • All-day session with 16 papers • http://www.acscinf.org/new/docs/meetings/227nm/227cinfabstracts.htm

Problems from ACS CINF 2004 • Both technical and people factors hinder knowledge exchange between biology and chemistry. (Lipinski) • People Problems per Chris Lipinski • Metadata capture is complicated by people issues, particularly those between chemists and biologists. • Discipline-based disconnects occur distressingly often and are frequently overlooked as a cause of lost productivity.

Interdisciplinary Collaborations: Biology and Chemistry • [What’s] “... important for these collaborations is, not only do you have to accept the other guy’s paradigm or at least live with it; you have to be willing to accept the other guy’s foibles or your perception of the other guy’s foibles (and recognize the opposite of this). We each have our own approaches to how we do science, and it’s just different cultures.” --Thom Kauffman interview in ACS LiveWire, March 2005, 7.3. http://pubs.acs.org/4librarians/livewire/2006/7.3/profile.html

Some Questions from the ACS CINF 2004 Symposium • “Find all proteins related to protein A (i.e., within a given path length of A) in a protein interaction graph, and retrieve related assay results and compound structures.” • “Find all pathways where compound X inhibits or slows a reaction, and retrieve Gene Ontology classifications for all proteins involved in the reaction.”
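
The first question above is, at its core, a bounded graph traversal followed by a lookup. A minimal sketch in Python, assuming the interaction graph is held as an in-memory adjacency map; the graph, the assay table, and the function name `proteins_within` are hypothetical and are not taken from any symposium paper:

```python
from collections import deque


def proteins_within(graph, start, max_hops):
    """Return {protein: hop_count} for every protein reachable from `start`
    in at most `max_hops` interaction edges (breadth-first search)."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the path-length limit
        for neighbor in graph.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    del distances[start]  # the query protein itself is not a "related" hit
    return distances


# Hypothetical toy interaction graph (undirected, stored as adjacency sets).
interactions = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
}

# Hypothetical assay table keyed by protein name.
assays = {"B": ["assay-1"], "D": ["assay-2", "assay-3"]}

related = proteins_within(interactions, "A", max_hops=2)
related_assays = {protein: assays.get(protein, []) for protein in sorted(related)}
print(related_assays)
```

In practice the graph and the assay results would live in separate databases, which is exactly the integration gap the question is meant to expose.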

Problems from ACS CINF 2004 • Commercial vs. public data • Batch mode data processing possible in biology, but primitive in chemistry • Primary HTS data has a very high noise factor • Data format standardization problem • Chemoinformatics and bioinformatics use completely different data formats and analysis tools • Chemical and protein sequence information has been largely analyzed separately

Solutions from ACS CINF 2004 • Linking biological and chemical information in computational approaches to predict biological activity, ADME profiles, and adverse drug reactions (ADR) • Energetics of binding for more accurate and sensitive chemical representation of DNA-protein interactions • A discovery informatics platform that facilitates archival, sharing, integration, and exploration of synthetic methods and biological activity data

Solutions from ACS CINF 2004 • Data pipelining approach makes it possible to apply bioinformatics and chemoinformatics data and analyses together. • Visualizations are the best way for people to understand data.

Solutions from ACS CINF 2004 • Cabinet (Chemical And Biological Information NETwork, formerly Fedora) servers include • Metabolic pathway network chart (Empath) • Protein-Ligand Association Network (Planet) • Enzyme Commission Codebook (EC Book) • Traditional Chinese Medicines (TCM) • World Drug Index (WDI), and others. • Built on the Daylight HTTP toolkit • http://www.metaphorics.com/products/cabinet.html

What is Chemoinformatics? (Brown) • “…the essence of chemoinformatics is integration and focus rather than its components, which are independent disciplines.” • Supporting disciplines: • Chemical information • Computational chemistry • Chemometrics

Chemoinformatics and Disease

Toolkits as Integrators (Brown) • Companies such as Daylight, Advanced Visual Systems, OpenEye, and SciTegic provide integration systems for: • Statistical methods • Text mining • Computational chemistry • Visualization

GeneGo’s MetaDrug Product • Toxicogenomics platform for the prediction of human drug metabolism and toxicity of novel compounds • Enables the visualization of pre-clinical and clinical high-throughput data in the context of the complete biological system • Integrates chemical, biological, and protein function data • http://www.genego.com/

BioWisdom • Examination of vast amounts of available information using its Sofia KnowledgeScan methodology • SRS data integration platform • http://www.biowisdom.com/

Lessons from Hip Hop (Salamone) • Mashup technique • Bring together disparate informatics, biological, chemical, and imaging information when conducting research • Example of an integration tool: iSpecies.org • A search for a species returns a page with NCBI genomics information, Yahoo images of the species, and articles culled from Google Scholar
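
The mashup idea can be sketched as one query fanned out to several independent sources, with the results gathered on a single page. This is only an illustration of the pattern, not how iSpecies.org is implemented; the three `fake_*` functions are hypothetical stand-ins for real services and return canned data:

```python
def mashup(query, sources):
    """Send the same query to every source and gather results by source name."""
    page = {"query": query}
    for name, fetch in sources.items():
        try:
            page[name] = fetch(query)
        except Exception as error:  # one failing source should not sink the page
            page[name] = {"error": str(error)}
    return page


# Hypothetical stand-ins for a genomics service, an image search, and a
# literature search. Real sources would be network calls.
def fake_genomics(query):
    return {"sequence_records": 3}


def fake_images(query):
    return [query.replace(" ", "_") + ".jpg"]


def fake_articles(query):
    raise TimeoutError("source unavailable")


page = mashup(
    "Mus musculus",
    {"genomics": fake_genomics, "images": fake_images, "articles": fake_articles},
)
print(page)
```

Catching failures per source is a design choice: a mashup page is still useful when one of its feeds is down.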

iSpecies.org Search • For Mus musculus

Chemogenomics and Chemoproteomics (Gagna) • Chemogenomics (def.)—The description of all potential drugs that can be used against all possible target sites, OR the actions of target-specific chemical ligands and how they are used to globally examine genes • Chemoproteomics (def.)—Uses chemistry to characterize protein structure and functions • They are “. . . a form of chemical biology brought up to date in the area of genome and proteome analysis.”

New Interdisciplinary Journals • ACS Chemical Biology (ACS) • ChemBioChem: A European Journal of Chemical Biology (Wiley/VCH) • Chemical Biology & Drug Design (Blackwell) • JBIC: Journal of Biological Inorganic Chemistry (Springer) • Journal of Biochemical and Molecular Toxicology (Wiley) • Molecular BioSystems (RSC) • Nature Chemical Biology (Nature Publishing) • Organic & Biomolecular Chemistry (RSC)

Open Source Software (Geldenhuys) • Log P calculator from Interactive Analysis • http://www.logp.com • University of Utah’s Computational Science and Engineering Online • Can submit jobs for molecular mechanics, quantum chemical calculations, and biomolecular interfaces for viewing PDB files • http://www.cse-online.net • Virtual Computational Chemistry Laboratory • http://www.vcclab.org

The Blue Obelisk (Guha) • Several open chemistry and chemoinformatics projects that have pooled forces to enhance interoperability • Maintain: • Chemoinformatics Algorithms Dictionary • Data Repository for standardized data for chemical properties and other facts (e.g., mass) • http://www.blueobelisk.org/

BlueObelisk.org • Working collaboratively on projects such as: • Chemistry Development Kit (CDK) • JChemPaint • Jmol • JUMBO • NMRShiftDB • Octet • Open Babel • QSAR • World Wide Molecular Matrix (WWMM)

Barriers to the Use of Open Source Software • Unix command line • Problem: Lack of known standards and datasets of compounds for validation, e.g., in docking programs

Lessons from the Human Genome Project (Austin) • Keys to success in the HGP were: • Comprehensiveness • Commitment to open access to the sequence as a research tool without encumbrance • Proposed tools for a “genome functionation toolbox”: • Whole-genome transcriptome and proteome characterization • Development of small inhibitory RNAs (siRNAs) and knockout mice for every gene • Small molecules and the druggable genome

ChemDB http://cdb.ics.uci.edu/CHEM/Web/

ChEBI, Chemical Entities of Biological Interest • Dictionary of molecular entities focused on small chemical compounds • Features an ontological classification, showing the relationships between molecular entities or classes of entities and their parents and/or children
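
An ontological classification of this kind can be pictured as a set of is-a links that are walked upward to find parents and downward to find children. A minimal sketch, assuming a plain dictionary of is-a links; the entries below are illustrative only and are not copied from ChEBI:

```python
def ancestors(is_a, term):
    """Collect every class reachable from `term` by following is-a links upward."""
    found = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in is_a.get(node, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def children(is_a, term):
    """Return the entities that name `term` as a direct parent."""
    return sorted(child for child, parents in is_a.items() if term in parents)


# Hypothetical is-a relationships (entity -> direct parent classes).
is_a = {
    "rofecoxib": ["butenolide", "sulfone"],
    "butenolide": ["lactone"],
    "lactone": ["heterocyclic compound"],
    "sulfone": ["organosulfur compound"],
}

print(sorted(ancestors(is_a, "rofecoxib")))
print(children(is_a, "lactone"))
```

Because an entity may have more than one parent, the structure is a directed graph rather than a simple tree, which is why the traversal tracks the classes it has already visited.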

Vioxx Entry in ChEBI

The IUPAC International Chemical Identifier (InChI) • Open source, non-proprietary, public-domain identifier for chemicals • String of characters that uniquely represents a molecular substance • Independent of the way the chemical structure is drawn • Enables reliable structure recognition and easy linking of diverse data compilations • Accepts as input MOLfiles (or SDfiles) and CML files • Download the program to your computer at: • http://www.iupac.org/inchi/license.html
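
An InChI string is built from slash-separated layers: a version, a molecular formula, and then layers that each begin with a one-letter prefix (for example `c` for connectivity and `h` for hydrogens). The sketch below splits a string into those layers using only string handling; it does not generate or validate identifiers, which is the job of the IUPAC program. The helper name `split_inchi` is hypothetical, and the example is ethanol rather than Vioxx:

```python
def split_inchi(inchi):
    """Split an InChI string into a dict of layers keyed by layer prefix.

    Simplification: if a prefix letter occurs more than once (possible in
    fixed-hydrogen or isotopic sublayers), the later layer overwrites the
    earlier one.
    """
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    parts = body.split("/")
    if len(parts) < 2:
        raise ValueError("InChI string has no formula layer")
    layers = {"version": parts[0], "formula": parts[1]}
    for layer in parts[2:]:
        if layer:
            layers[layer[0]] = layer[1:]
    return layers


ethanol = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"
print(split_inchi(ethanol))
```

Because the layers are derived from the structure itself rather than from a drawing, two databases that register the same compound independently arrive at the same string, which is what makes the identifier usable as a linking key.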

Generation of InChI for Vioxx with wInChI

Vioxx Entry in PubChem Compounds Found with InChI

Vioxx Bioassay Data in PubChem

Vioxx PubChem Link to External Sources of Information

The Elsevier MDL/NIH Link via PubChem and DiscoveryGate • Cross-indexes PubChem to the Compound Index hosted on Elsevier MDL’s DiscoveryGate platform • MDL added 5 million structures from PubChem to their index, resulting in over 14 million unique chemical structures • Links go both ways • Can move from biological data in PubChem to bioactivity, chemical sourcing, synthetic methodology, and EHS data in DiscoveryGate sources
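
Two-way linking of this kind can be pictured as a join on a shared structure key. The sketch below is an assumption-laden illustration, not a description of how Elsevier MDL or PubChem built their cross-index: the record IDs are invented, and an InChI string is assumed to serve as the common key:

```python
def build_cross_index(source_a, source_b):
    """Link records in two sources that share a structure identifier.

    Each argument maps record ID -> structure identifier.
    Returns (a_to_b, b_to_a) so links can be followed in both directions.
    """
    by_structure = {}
    for b_id, structure in source_b.items():
        by_structure.setdefault(structure, []).append(b_id)

    a_to_b, b_to_a = {}, {}
    for a_id, structure in source_a.items():
        for b_id in by_structure.get(structure, []):
            a_to_b.setdefault(a_id, []).append(b_id)
            b_to_a.setdefault(b_id, []).append(a_id)
    return a_to_b, b_to_a


# Hypothetical record IDs; the InChI strings are ethanol, methane, and water.
public_records = {
    "pub-1": "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3",
    "pub-2": "InChI=1/CH4/h1H4",
}
commercial_records = {
    "com-9": "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3",
    "com-10": "InChI=1/H2O/h1H2",
}

public_to_commercial, commercial_to_public = build_cross_index(
    public_records, commercial_records
)
print(public_to_commercial)
print(commercial_to_public)
```

Records with no counterpart in the other source simply have no entry in either map.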

Elsevier MDL’s xPharm • Comprehensive set of records linking: • Agents (compounds) (2300) • Targets (600) • Disorders (450) • Principles that govern their interactions (180) • Answers questions such as: • What targets are associated with control of blood pressure? • What adverse effects are associated with monoamine oxidase inhibitors?

Text Datamining (Banville) • “In the pharmaceutical field, it is ideally the marriage of biological and chemical information that needs to be the ultimate focus of text data mining applications.” • Problems: • Lack of universal publication standards for identifying each unique chemical entity • Selective indexing policies of A&I services • Need to understand how chemical structures link to biological processes

Chemical Datamining Software • SureChem • http://surechem.reeltwo.com/ • CLiDE • Recognizes structures, reactions, and text • http://www.simbiosys.ca/clide/ • OSCAR • “OSCAR1” to check experimental data • http://www.ch.cam.ac.uk/magnus/checker.html • http://www.rsc.org/Publishing/ReSourCe/AuthorGuidelines/AuthoringTools/ExperimentalDataChecker/ • CSR (Chemical Structure Reconstruction) • http://www.scai.fraunhofer.de/uploads/media/MZ-ERCIM05_04.pdf • MDL DocSearch—combines MDL’s Isentris platform and EMC’s Documentum

Themes from SwissProt’s 20th Anniversary Conference, “In silico Analysis of Proteins” • Knowledgebases, databases and other information resources for proteins • Sequence searches and alignments • Protein sequence analysis • Protein structure prediction, analysis and visualization • Proteomics data analysis

Chemoinformatics Databases (Jónsdóttir) • Lists databases relevant to drug discovery and development, including: • General databases • DBs for screening compounds • DBs for medicinal agents • DBs with ADMET properties • DBs with physico-chemical properties • Curiously does not mention Chemical Abstracts

Databases with Protein and Ligand Information (Jónsdóttir) • Protein Data Bank • Target Registration Database • Relibase—uses structural info to analyze protein-ligand interactions; Relibase+ for protein-protein interaction searching • Cambridge Structural Database • KEGG LIGAND DB for enzyme reactions • http://www.genome.ad.jp/ligand

Other Databases with Protein and Ligand Information • SitesBase: a database of known ligand binding sites within the PDB • http://www.bioinformatics.leeds.ac.uk/sb/main.html • Binding MOAD • http://www.bindingmoad.org/ • sc-PDB (Kellenberger) • http://bioinfo-pharma.u-strasbg.fr:8080/scPDB/index.jsp

sc-PDB http://bioinfo-pharma.u-strasbg.fr:8080/scPDB/index.jsp

Isatin Search on sc-PDB

Other Databases with Protein-Protein Interaction Data (Jónsdóttir) • YPD, Yeast Proteome Database (for proteins from S. cerevisiae) • http://www.biobase.de/pages/index.php?id=139 • Human Protein Reference Database • http://www.hprd.org/ • BIND, Biomolecular Interaction Network Database (ceased as of 11/16/2005?) • http://www.bind.ca/Action

International Molecular Exchange (IMEx) Consortium • http://imex.sourceforge.net/ • BIND (http://www.blueprint.org), The Blueprint Initiative Asia Pte. Ltd., Singapore, and The Blueprint Initiative North America, Toronto, Canada • DIP (http://dip.doe-mbi.ucla.edu), UCLA-DOE Institute for Genomics & Proteomics • IntAct (http://www.ebi.ac.uk/intact), EMBL–European Bioinformatics Institute, Hinxton, UK • MINT (http://mint.bio.uniroma2.it/mint/), University of Rome “Tor Vergata”, Rome, Italy • MPact (http://mips.gsf.de/genre/proj/mpact), MIPS / Institute for Bioinformatics, Munich, Germany

Protein Sites from IU I533 Students and Others • LigandDepot: integrated source for small molecules • http://ligand-depot.rutgers.edu/index.html • PSIPRED Protein Structure Prediction Server • http://bioinf.cs.ucl.ac.uk/psipred/ • DSSP: a database of secondary structure assignments (and much more) for all protein entries in the PDB • http://swift.cmbi.ru.nl/gv/dssp/ • Dr. Predrag Radivojac’s I690 class on Structural Bioinformatics • http://www.informatics.indiana.edu/predrag/2006springi690/2006springi690.htm

Protein Secondary Structure Prediction • Methods • Neural Network • Rule Based • Other Machine Learning • Homology Based

Protein Secondary Structure Prediction Software • PredictProtein • http://www.predictprotein.org/ • Chou-Fasman • http://fasta.bioch.virginia.edu/fasta_www/chofas.htm • NN Predict • http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
