Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics

1. Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics CHI BioIT World, Boston, 2007 Christopher Southan, Global Compound Sciences, AstraZeneca R&D M�lndal, Sweden

2. Revolutions in Public Cheminformatics Increasingly Compliment Commercial Databases Formal representation of the �missing entity� of chemical structure within the global Web of bioinformatic relationships Ability to search links between biological effects, protein names, sequence data, and chemical information Deposition of HTS results and other types of screening or bioactivity data, directly linked to chemical structure information Proliferation of open cheminformatic tools and downloadable data sets 53 Entrez-selectable compound data sources now in PubChem PubChem output like small pharma so farPubChem output like small pharma so far

3. Explicit Compound-to-sequence Links Increasing commercial and public availability of annotated relationships �..document (or database entry) �W � includes assay data �X� that defines compound �Y� as an activity modulator of protein �Z��. provide crucial value in medicinal chemistry informatics We selected the following to include in our comparative analysis: ~ 130,000 cpds, ~1,300 sequences, ~7,000 papers ~ 1.5 million cpds ~ 2,000 sequences ~ 20,000 patents and papers ~ 4,000 cpds, 502 sequences 83 protein targets with bioassay data, and ~6,000 cpds in PDB structures PubChem output like small pharma so farPubChem output like small pharma so far

4. A Linking Example

5. Project Objectives Methods Produce standardised comparisons between public and non-proprietary commercial sources Include databases or subsets with explicit chemistry-target or other types of bioactive links Review similarities and differences in content PubChem output like small pharma so farPubChem output like small pharma so far

6. Post-filtration Compound Counts GVKBio 1,488,288 GVKBio Journals 542,858 GVKBio Patents 1,034,548 GVKBio Drug 1,933 WOMBAT 128,120 PubChem 7,268,193 PubChem Prous 3,318 PubChem PDB 5,626 PubChem actives 35,671 PubChem pharmacol 6,070 Bioprint 2,437 ZINC FDA 1,200 DrugBank 3,723 DrugBank small mol 1,018 DrugBank exp drugs 2,737 Dict. Nat.Prod. 132,831 MDDR 159,867 MDDR launched 1,118 CMC 8,189

7. All-vs-all Result Matrix

8. Result Overview Our post-filtration unique compound numbers were typically 5% to 20% lower than those given by the databases This facilitated standardised comparisons Self-comparisons and subset numbers were consistent On a pair wise basis in the 19 X 19 matrix no single set was entirely covered by any other no cells were null i.e. all shared some content Larger databases showed significant non-overlap suggestive of unique content None of the �known drug� sets overlapped exactly For small differences it was not possible to discriminate between technical errors in structure files or genuine unique content Relative coverage result should not be taken as an implicit criticism or endorsement of any particular database

9. GVKBIO At just under 1.5 million GVKBIO is divided between journals and patents at approximately 1:2 ratio, with an overlap of 89,000 GVKBIO covers 93% of WOMBAT and is ~10x larger WOMBAT has captured over 7000 compounds not found in GVKBIO 29% of GVKBIO is represented in PubChem, split evenly between journals and patents Includes 25% of cpds reported as active in any of the screening data sets in PubChem and 70% with a pharmacology link in PubChem via MeSH.

10. PubChem 48% overlap with ChemNavigator (not in matrix) Only 3% screened within the system so far (11% active) Largest coverage of every other database, except WOMBAT, of which PubChem covers some 3,000 less compounds than GVKBIO 46% of DNP, 42% of MDDR, 92% of DrugBank, 93% of CMC and 95% of BioPrint and MDDR launched Covers 0.43 mill of GVKBIO GVKB patent overlap shows that the number of PubChem compounds with potential claims is 238,000

11. a Key Test Set Prous �Drugs of the Future� is a review journal for new compounds in development 3318 cpds in PubChem with document outlinks (but no inlinks) 1374 in PubChem MeSH pharmacology Selected overlaps 2,628 in GVKBIO (with document-cpd-sequence links) 733 in GVKBIO Drugs (� � �) 994 in WOMBAT ( � � �) 1,875 in MDDR, 734 in MDDR launched 543 in DrugBank Numbers allow inferences on triage through different sources

12. Venn-type Overlaps Highlight Unique Content

13. Merge of all Bioactives Bioprint, CMC, DNP, DrugBank, GVKBIO, GVKBIO DD, MDDR, PubChem Prous, PubChem PDB, PubChem active, PubChem Pharmacol, ZINK FDA, WOMBAT (not entire PubChem) Gives 1,976,273 Filtered to unique content reduces to 1,741, 392 Relatively small redundancy collapse 234, 881 (11%) Indicates substantial unique content and possibilities for further analysis

14. Conclusions Our filtration and comparison methods clarify compound content Public sources have essential value and complementarity to commercial sources Bioactive coverage is expanding in PubChem sub-sets DrugBank provides exemplary linkage between bioinformatics and cheminformatics Public sources now offer data mining and linking functionality with no commercial eqivalent Journal and patent �backfilling� of cpd<->sequence relationships by expert curation is largely only covered by commercial databases GVKBIO has highest coverage of cpd <-> sequence links and patent content Some commercial dbs could define their content and extraction methods more explicitly On-line-only data access models are becoming less attractive

15. Reference and Acknowledgments Reference: �Complementarity between public and commercial databases: new opportunities in medicinal chemistry informatics� Chris Southan, P�ter V�rkonyi and Sorel Muresan, Current Topics In Medicinal Chemistry, 2007, in press Many thanks to: Prof. Tudor Oprea, Sunset Molecular, for WOMBAT data

Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics

Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics

Presentation Transcript

Informatics in Public Health

Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics

Medicinal Chemistry

Advanced Medicinal Chemistry

Advanced Medicinal Chemistry

Medicinal Chemistry

MEDICINAL CHEMISTRY

New Opportunities in Public Procurement

Medicinal Chemistry-2

Medicinal Chemistry

Medicinal chemistry

Advanced Medicinal Chemistry

Panel on Public Health Informatics New Opportunities for Public Health Practitioners

INTEGRATED MEDICINAL CHEMISTRY

Advanced Medicinal Chemistry

Substituents and Bio-isosteres in Medicinal Chemistry

Chemistry 910 Practical Medicinal Chemistry

Medicinal Chemistry

Medicinal chemistry

Medicinal Chemistry