1 / 15

Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics

Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics. CHI BioIT World, Boston, 2007 Christopher Southan, Global Compound Sciences, AstraZeneca R&D Mölndal, Sweden.

RexAlvis
Download Presentation

Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics CHI BioIT World, Boston, 2007 Christopher Southan, Global Compound Sciences, AstraZeneca R&D Mölndal, Sweden

  2. Revolutions in Public Cheminformatics Increasingly Compliment Commercial Databases • Formal representation of the ”missing entity” of chemical structure within the global Web of bioinformatic relationships • Ability to search links between biological effects, protein names, sequence data, and chemical information • Deposition of HTS results and other types of screening or bioactivity data, directly linked to chemical structure information • Proliferation of open cheminformatic tools and downloadable data sets • 53 Entrez-selectable compound data sources now in PubChem

  3. Explicit Compound-to-sequence Links Increasing commercial and public availability of annotated relationships …..document (or database entry) “W “ includes assay data “X” that defines compound “Y” as an activity modulator of protein “Z”……. provide crucial value in medicinal chemistry informatics We selected the following to include in our comparative analysis: ~ 130,000 cpds, ~1,300 sequences, ~7,000 papers ~ 1.5 million cpds ~ 2,000 sequences ~ 20,000 patents and papers ~ 4,000 cpds, 502 sequences 83 protein targets with bioassay data, and ~6,000 cpds in PDB structures

  4. A Linking Example

  5. Produce standardised comparisons between public and non-proprietary commercial sources Include databases or subsets with explicit chemistry-target or other types of bioactive links Review similarities and differences in content Project Objectives Methods • Normalise downloaded sources by removing fragments • Derive canonical tautomer • Generate, compare and retain unique molecular hashcodes • Prepare “all-against-all” content overlap matrix • Perform selected merges and Venn-type complete overlaps

  6. GVKBio 1,488,288 GVKBio Journals 542,858 GVKBio Patents 1,034,548 GVKBio Drug 1,933 WOMBAT 128,120 PubChem 7,268,193 PubChem Prous 3,318 PubChem PDB 5,626 PubChem actives 35,671 PubChem pharmacol 6,070 Bioprint 2,437 ZINC FDA 1,200 DrugBank 3,723 DrugBank small mol 1,018 DrugBank exp drugs 2,737 Dict. Nat.Prod. 132,831 MDDR 159,867 MDDR launched 1,118 CMC 8,189 Post-filtration Compound Counts

  7. All-vs-all Result Matrix

  8. Result Overview • Our post-filtration unique compound numbers were typically 5% to 20% lower than those given by the databases • This facilitated standardised comparisons • Self-comparisons and subset numbers were consistent • On a pair wise basis in the 19 X 19 matrix • no single set was entirely covered by any other • no cells were null i.e. all shared some content • Larger databases showed significant non-overlap suggestive of unique content • None of the “known drug” sets overlapped exactly • For small differences it was not possible to discriminate between technical errors in structure files or genuine unique content • Relative coverage result should not be taken as an implicit criticism or endorsement of any particular database

  9. GVKBIO • At just under 1.5 million GVKBIO is divided between journals and patents at approximately 1:2 ratio, with an overlap of 89,000 • GVKBIO covers 93% of WOMBAT and is ~10x larger • WOMBAT has captured over 7000 compounds not found in GVKBIO • 29% of GVKBIO is represented in PubChem, split evenly between journals and patents • Includes 25% of cpds reported as active in any of the screening data sets in PubChem and 70% with a pharmacology link in PubChem via MeSH.

  10. PubChem • 48% overlap with ChemNavigator (not in matrix) • Only 3% screened within the system so far (11% active) • Largest coverage of every other database, except WOMBAT, of which PubChem covers some 3,000 less compounds than GVKBIO • 46% of DNP, 42% of MDDR, 92% of DrugBank, 93% of CMC and 95% of BioPrint and MDDR launched • Covers 0.43 mill of GVKBIO • GVKB patent overlap shows that the number of PubChem compounds with potential claims is 238,000

  11. a Key Test Set • Prous “Drugs of the Future” is a review journal for new compounds in development • 3318 cpds in PubChem with document outlinks (but no inlinks) • 1374 in PubChem MeSH pharmacology • Selected overlaps • 2,628 in GVKBIO (with document-cpd-sequence links) • 733 in GVKBIO Drugs (“ “ “) • 994 in WOMBAT ( “ “ “) • 1,875 in MDDR, 734 in MDDR launched • 543 in DrugBank • Numbers allow inferences on triage through different sources

  12. PubChem GVKBIO 6,825,265 353,623 1,013,848 86,143 3,162 34,674 4,150 WOMBAT Venn-type Overlaps Highlight Unique Content 1.49 mill 7.27 mill 128K

  13. Merge of all Bioactives • Bioprint, CMC, DNP, DrugBank, GVKBIO, GVKBIO DD, MDDR, PubChem Prous, PubChem PDB, PubChem active, PubChem Pharmacol, ZINK FDA, WOMBAT (not entire PubChem) • Gives 1,976,273 • Filtered to unique content reduces to 1,741, 392 • Relatively small redundancy collapse 234, 881 (11%) • Indicates substantial unique content and possibilities for further analysis

  14. Conclusions • Our filtration and comparison methods clarify compound content • Public sources have essential value and complementarity to commercial sources • Bioactive coverage is expanding in PubChem sub-sets • DrugBank provides exemplary linkage between bioinformatics and cheminformatics • Public sources now offer data mining and linking functionality with no commercial eqivalent • Journal and patent “backfilling” of cpd<->sequence relationships by expert curation is largely only covered by commercial databases • GVKBIO has highest coverage of cpd <-> sequence links and patent content • Some commercial dbs could define their content and extraction methods more explicitly • On-line-only data access models are becoming less attractive

  15. Reference and Acknowledgments Reference: “Complementarity between public and commercial databases: new opportunities in medicinal chemistry informatics” Chris Southan, Péter Várkonyi and Sorel Muresan, Current Topics In Medicinal Chemistry, 2007, in press Many thanks to: Prof. Tudor Oprea, Sunset Molecular, for WOMBAT data

More Related