150 likes | 354 Views
Revolutions in Public Cheminformatics Increasingly Compliment Commercial Databases . Formal representation of the "missing entity" of chemical structure within the global Web of bioinformatic relationshipsAbility to search links between biological effects, protein names, sequence data, and chemica
E N D
1. Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics CHI BioIT World, Boston, 2007
Christopher Southan, Global Compound Sciences, AstraZeneca R&D Mölndal, Sweden
2. Revolutions in Public Cheminformatics Increasingly Compliment Commercial Databases Formal representation of the ”missing entity” of chemical structure within the global Web of bioinformatic relationships
Ability to search links between biological effects, protein names, sequence data, and chemical information
Deposition of HTS results and other types of screening or bioactivity data, directly linked to chemical structure information
Proliferation of open cheminformatic tools and downloadable data sets
53 Entrez-selectable compound data sources now in PubChem PubChem output like small pharma so farPubChem output like small pharma so far
3. Explicit Compound-to-sequence Links Increasing commercial and public availability of annotated relationships
…..document (or database entry) “W “ includes assay data “X” that defines compound “Y” as an activity modulator of protein “Z”…….
provide crucial value in medicinal chemistry informatics
We selected the following to include in our comparative analysis:
~ 130,000 cpds, ~1,300 sequences, ~7,000 papers
~ 1.5 million cpds ~ 2,000 sequences ~ 20,000 patents and papers
~ 4,000 cpds, 502 sequences
83 protein targets with bioassay data, and ~6,000 cpds in PDB structures
PubChem output like small pharma so farPubChem output like small pharma so far
4. A Linking Example
5. Project Objectives Methods Produce standardised comparisons between public and non-proprietary commercial sources
Include databases or subsets with explicit chemistry-target or other types of bioactive links
Review similarities and differences in content
PubChem output like small pharma so farPubChem output like small pharma so far
6. Post-filtration Compound Counts GVKBio 1,488,288
GVKBio Journals 542,858
GVKBio Patents 1,034,548
GVKBio Drug 1,933
WOMBAT 128,120
PubChem 7,268,193
PubChem Prous 3,318
PubChem PDB 5,626
PubChem actives 35,671
PubChem pharmacol 6,070
Bioprint 2,437
ZINC FDA 1,200
DrugBank 3,723
DrugBank small mol 1,018
DrugBank exp drugs 2,737
Dict. Nat.Prod. 132,831
MDDR 159,867
MDDR launched 1,118
CMC 8,189
7. All-vs-all Result Matrix
8. Result Overview Our post-filtration unique compound numbers were typically 5% to 20% lower than those given by the databases
This facilitated standardised comparisons
Self-comparisons and subset numbers were consistent
On a pair wise basis in the 19 X 19 matrix
no single set was entirely covered by any other
no cells were null i.e. all shared some content
Larger databases showed significant non-overlap suggestive of unique content
None of the “known drug” sets overlapped exactly
For small differences it was not possible to discriminate between technical errors in structure files or genuine unique content
Relative coverage result should not be taken as an implicit criticism or endorsement of any particular database
9. GVKBIO At just under 1.5 million GVKBIO is divided between journals and patents at approximately 1:2 ratio, with an overlap of 89,000
GVKBIO covers 93% of WOMBAT and is ~10x larger
WOMBAT has captured over 7000 compounds not found in GVKBIO
29% of GVKBIO is represented in PubChem, split evenly between journals and patents
Includes 25% of cpds reported as active in any of the screening data sets in PubChem and 70% with a pharmacology link in PubChem via MeSH.
10. PubChem 48% overlap with ChemNavigator (not in matrix)
Only 3% screened within the system so far (11% active)
Largest coverage of every other database, except WOMBAT, of which PubChem covers some 3,000 less compounds than GVKBIO
46% of DNP, 42% of MDDR, 92% of DrugBank, 93% of CMC and 95% of BioPrint and MDDR launched
Covers 0.43 mill of GVKBIO
GVKB patent overlap shows that the number of PubChem compounds with potential claims is 238,000
11. a Key Test Set Prous “Drugs of the Future” is a review journal for new compounds in development
3318 cpds in PubChem with document outlinks (but no inlinks)
1374 in PubChem MeSH pharmacology
Selected overlaps
2,628 in GVKBIO (with document-cpd-sequence links)
733 in GVKBIO Drugs (“ “ “)
994 in WOMBAT ( “ “ “)
1,875 in MDDR, 734 in MDDR launched
543 in DrugBank
Numbers allow inferences on triage through different sources
12. Venn-type Overlaps Highlight Unique Content
13. Merge of all Bioactives Bioprint, CMC, DNP, DrugBank, GVKBIO, GVKBIO DD, MDDR, PubChem Prous, PubChem PDB, PubChem active, PubChem Pharmacol, ZINK FDA, WOMBAT (not entire PubChem)
Gives 1,976,273
Filtered to unique content reduces to 1,741, 392
Relatively small redundancy collapse 234, 881 (11%)
Indicates substantial unique content and possibilities for further analysis
14. Conclusions Our filtration and comparison methods clarify compound content
Public sources have essential value and complementarity to commercial sources
Bioactive coverage is expanding in PubChem sub-sets
DrugBank provides exemplary linkage between bioinformatics and cheminformatics
Public sources now offer data mining and linking functionality with no commercial eqivalent
Journal and patent “backfilling” of cpd<->sequence relationships by expert curation is largely only covered by commercial databases
GVKBIO has highest coverage of cpd <-> sequence links and patent content
Some commercial dbs could define their content and extraction methods more explicitly
On-line-only data access models are becoming less attractive
15. Reference and Acknowledgments
Reference:
“Complementarity between public and commercial databases: new opportunities in medicinal chemistry informatics” Chris Southan, Péter Várkonyi and Sorel Muresan, Current Topics In Medicinal Chemistry, 2007, in press
Many thanks to:
Prof. Tudor Oprea, Sunset Molecular, for WOMBAT data