
NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space

This resource provides a resolver for different chemical structure identifiers, allowing conversion between representations and structure identifiers.


Presentation Transcript


  1. NCI/CADD Chemical Identifier Resolver: Indexing and Analysis of Available Chemistry Space. Markus Sitzmann [1], Wolf-Dietrich Ihlenfeldt [2], and Marc C. Nicklaus [1]. [1] Computer-Aided Drug Design Group, Chemical Biology Laboratory, NCI-Frederick, NIH, DHHS. [2] Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany

  2. Small Molecule Databases • since the early 2000s, the number of databases “publishing” small molecules has grown enormously (e.g. PubChem, ChemSpider, ChEMBL, DrugBank): what is the overlap, and how many small molecules are there currently? • ambiguities in the representation of small molecules (e.g. tautomerism, salts, ionic resonance forms) • growing number of chemical structure identifiers (InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, …)

  3. Chemical Identifier Resolver [Diagram: identifiers and representations arranged around a central "chemical structure"] SYBYL Line Notation, SMILES, CAS Registry Number, chemical names, GIF image, SD File, ChemNavigator SID, CML, FDA UNII, NCI/CADD Identifiers, NSC number, MRV, InChI/InChIKey, PubChem SID/CID, ChemSpider ID, ChEBI ID, Chemical Formula, PDB Ligand ID

  4. NCI/CADD Web Resources: Chemical Identifier Resolver. Works as a resolver for different chemical structure identifiers. Allows one to convert a given structure identifier into another representation or structure identifier. First beta release: July 2009. Current release (beta4): April 2011. http://cactus.nci.nih.gov/chemical/structure

  5. NCI/CADD Web Resources: Chemical Identifier Resolver • usable through a simple URL API: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation" • XML format: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation"/xml • example: http://cactus.nci.nih.gov/chemical/structure/Tamiflu/cas returns 204255-11-8 (MIME type: text/plain) • if a request is not resolvable, the service answers with an HTTP 404 status
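The URL scheme on this slide can be exercised with a few lines of Python. This is a minimal sketch, not an official client: the function names `build_url` and `resolve` are made up for illustration, and percent-encoding the identifier is an assumption about what the service accepts. Only the base URL, the trailing `/xml` variant, and the 404 behaviour come from the slide.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

# Base URL as given on the slide.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation, xml=False):
    """Assemble <base>/<identifier>/<representation>[/xml].

    Percent-encoding the path segments is an assumption of this sketch.
    """
    parts = [BASE_URL, quote(identifier, safe=""), quote(representation, safe="")]
    if xml:
        parts.append("xml")
    return "/".join(parts)


def resolve(identifier, representation, timeout=10):
    """Return the plain-text answer, or None if the request is not resolvable (HTTP 404)."""
    try:
        with urlopen(build_url(identifier, representation), timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as err:
        if err.code == 404:
            return None
        raise
```

Calling `resolve("Tamiflu", "cas")` performs the request shown on the slide; it needs network access, and the answer depends on the live service.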

  6. NCI/CADD Public Web Resources: Chemical Identifier Resolver. URL pattern: http://cactus.nci.nih.gov/chemical/structure/"identifier"/"representation". Accepted identifiers: chemical names, IUPAC names (by OPSIN), CAS numbers, SMILES strings, IUPAC InChI/InChIKeys, NCI/CADD Identifiers, CACTVS HASHISY, NSC number, PubChem SID, ChemSpider ID, ChemNavigator SID, ZINC, FDA UNII. Available representations: /smiles; /names, /iupac_name; /cas; /inchi, /stdinchi; /inchikey, /stdinchikey; /ficts, /ficus, /uuuuu; /image; /file, /sdf; /mw, /monoisotopic_mass; /formula; /twirl, /3d; /urls; /chemspider_id; /pubchem_sid; /chemnavigator_sid

  7. NCI/CADD Web Resources: Chemical Identifier Resolver [Workflow diagram] http request (identifier, representation) → detection of the identifier type → if the identifier is a full structure representation (e.g. SMILES, InChI), the structure is available directly; if it is a hashed structure representation (e.g. InChIKey), trivial name, etc. (e.g. CAS number, chemical name), the structure is retrieved by a database lookup in the NCI/CADD Chemical Structure Database (CSDB) → calculation of the requested structure representation (e.g. InChI, GIF image) → http response (with MIME type). Components shown: CACTVS, CSDB

  8. NCI/CADD Web Resources: Chemical Identifier Resolver [Workflow diagram, as on slide 7] http request (identifier, representation) → detection of the identifier type → full structure representation (e.g. SMILES, InChI), or database lookup in the NCI/CADD Chemical Structure Database (CSDB) for hashed structure representations (e.g. InChIKey), trivial names, etc. (e.g. CAS number, chemical name) → calculation of the requested structure representation (e.g. InChI, GIF image) → http response (with MIME type). Components shown: CACTVS, CSDB
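The "detection of the identifier type" step can be illustrated with simple pattern checks. This is only a sketch under stated assumptions: the resolver's real detection logic is not described on the slides, and the function names, the route labels ("full_structure", "lookup", "undetermined"), and the choice of patterns are hypothetical. The InChIKey layout (14-10-1 uppercase letters) and the CAS check-digit rule are the standard public definitions.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text):
    """Check the CAS layout and its check digit (weighted digit sum modulo 10)."""
    match = CAS_RE.match(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify(identifier):
    """Return (route, identifier_type); the labels are illustrative only."""
    text = identifier.strip()
    if text.startswith("InChI="):
        return ("full_structure", "inchi")   # structure can be read directly
    if INCHIKEY_RE.match(text):
        return ("lookup", "inchikey")        # hashed: needs a database lookup
    if is_valid_cas(text):
        return ("lookup", "cas")
    # Names and SMILES cannot be told apart by a simple pattern in this sketch.
    return ("undetermined", "name_or_smiles")
```

For example, `classify("204255-11-8")` takes the lookup route as a CAS number, while a string starting with `InChI=` is treated as a full structure representation.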

  9. Chemical Identifier Resolver: Resolving Chemical Names. Request: http://cactus.nci.nih.gov/chemical/structure/L-alanin/smiles/xml?resolver=name_by_chemspider,name_by_opsin,name_by_cir Response:
     <request string="L-alanin" representation="smiles">
       <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
         <item id="1">C[C@H](N)C(O)=O</item>
       </data>
       <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
         <item id="1">C[C@H](N)C(O)=O</item>
       </data>
       <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
         <item id="1">C[C@H](N)C(O)=O</item>
       </data>
     </request>
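A response of this shape can be read with Python's standard XML parser. The sketch below embeds the response text from the slide as a string constant, so it runs without network access; the helper name `results_by_resolver` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Response text as shown on the slide (quotes normalized to plain ASCII).
SAMPLE_RESPONSE = """<request string="L-alanin" representation="smiles">
  <data id="1" resolver="name_by_chemspider" string_class="Chemical Name (ChemSpider)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="2" resolver="name_by_opsin" string_class="IUPAC Name (OPSIN)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
  <data id="3" resolver="name_by_cir" string_class="Chemical Name (CIR)">
    <item id="1">C[C@H](N)C(O)=O</item>
  </data>
</request>"""


def results_by_resolver(xml_text):
    """Map each resolver name to the list of item strings it returned."""
    root = ET.fromstring(xml_text)
    return {
        data.get("resolver"): [item.text for item in data.findall("item")]
        for data in root.findall("data")
    }


results = results_by_resolver(SAMPLE_RESPONSE)
# True when every resolver returned the same single SMILES string.
all_agree = len({tuple(items) for items in results.values()}) == 1
```

Grouping by resolver makes it easy to check whether the three name resolvers agree, as they do in the slide's example.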

  10. Chemical Identifier Resolver: Chemical Structure Database (CSDB). Sources: • ChemNavigator iResearch Library (~56%): compilation of commercially available screening compounds from ~330 international chemistry suppliers • PubChem database (~38%): including Open NCI database, EPA DSSTox databases, NIAID HIV databases, NIST Webbook, NLM ChemIDplus, ChemSpider, … • Commercial sources / others (~6%): Asinex, Comgenex, eMolecules, ChEMBL, … Currently: ~150 chemical structure databases; ~120 million structure records; ~81.6 million unique structures by NCI/CADD FICuS Identifier; ~84 million unique structures by Std. InChIKey

  11. NCI/CADD Structure Identifiers FICTS, FICuS, uuuuu

  12. Unique Representation of Chemical Structures: NCI/CADD Structure Identifiers [Figure: example structure with its CACTVS hashcode 9850FD9F9E2B4E25] • based on hashcodes calculated by the chemoinformatics toolkit CACTVS • CACTVS hashcodes: represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned); are highly sensitive to structural features of a compound; change if connectivity changes

  13. Unique Representation of Chemical Structures: NCI/CADD Structure Identifiers [Workflow diagram] original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB) → structure normalization → parent structure → hashcode calculation (E_HASHISY) → NCI/CADD Identifier. Further diagram labels: SDF, SMILES, database

  14. Unique Representation of Chemical Structures: NCI/CADD Structure Identifiers [Workflow diagram] original structure record (Molfile, SDF, SMILES, ChemDraw cdx, PDB) → structure normalization → parent structure → hashcode calculation (E_HASHISY) → NCI/CADD Identifier (FICTS, FICuS, uuuuu). Further diagram labels: SDF, SMILES, database • calculation of a set of parent structures with different sensitivity to chemical features • representation of chemical structures on different levels

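  15. Unique Representation of Chemical Structures: NCI/CADD Structure Identifiers [Figure: example structure drawn as a sodium salt (O-, Na+)] Sensitivity of each identifier to chemical features:
     Identifier | Fragments     | Isotopes      | Charges       | Tautomers     | Stereo
     FICTS      | sensitive     | sensitive     | sensitive     | sensitive     | sensitive
     FICuS      | sensitive     | sensitive     | sensitive     | not sensitive | sensitive
     uuuuu      | not sensitive | not sensitive | not sensitive | not sensitive | not sensitive
     Example identifiers: 4A122D094098B50D-FICTS-01-1D, 0E26B623DF7FAD30-FICuS-01-70, 9850FD9F9E2B4E25-uuuuu-01-27. Format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>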
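The identifier format on this slide can be split into its four fields with a regular expression. The sketch below makes several assumptions that go beyond the slide: that the version is two decimal digits and the checksum two hexadecimal digits (inferred only from the three examples), and that a "u" in the tag marks a feature the identifier ignores. The checksum algorithm is not given on the slides, so it is not verified here. All names are hypothetical.

```python
import re
from typing import NamedTuple


class NciCaddIdentifier(NamedTuple):
    hashcode: int   # 64-bit unsigned CACTVS hashcode (E_HASHISY)
    tag: str        # FICTS, FICuS or uuuuu
    version: str
    checksum: str   # kept as text; its algorithm is not described on the slides


# Field widths are inferred from the slide's three example identifiers.
ID_RE = re.compile(r"^([0-9A-F]{16})-(FICTS|FICuS|uuuuu)-(\d{2})-([0-9A-F]{2})$")

FEATURES = ("fragments", "isotopes", "charges", "tautomers", "stereo")


def parse_identifier(text):
    """Split '<hashcode>-<tag>-<version>-<checksum>' into its fields."""
    match = ID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an NCI/CADD identifier: {text!r}")
    return NciCaddIdentifier(int(match.group(1), 16), *match.groups()[1:])


def sensitivity(tag):
    """Assumed naming convention: a letter means sensitive, 'u' means not sensitive."""
    return {feature: letter != "u" for feature, letter in zip(FEATURES, tag)}


example = parse_identifier("9850FD9F9E2B4E25-uuuuu-01-27")
```

Storing the hashcode as an integer keeps the 64-bit value comparable across tags; `format(example.hashcode, "016X")` gives back the 16-digit hexadecimal form.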

  16. [Figure: histidine and variant drawings of it, labeled: stereoisomers, tautomer, salt, charged form, isotope, "errors"]

  17. [Figure: the histidine variants from slide 16 annotated with their FICTS identifiers] Identifiers in slide order: E92E4BA2869F3611-FICTS, 6C16DE2351F9FF50-FICTS, 8A7AD1EB498CC76A-FICTS, E5F83F10C5DB080A-FICTS, A3DAE0788050DDE4-FICTS, 9850FD9F9E2B4E25-FICTS, B2FDA68AEDA06DB9-FICTS, E5F83F10C5DB080A-FICTS, 9850FD9F9E2B4E25-FICTS

  18. [Figure: the histidine variants from slide 16 annotated with their FICuS identifiers] Identifiers in slide order: E92E4BA2869F3611-FICuS, 9850FD9F9E2B4E25-FICuS, 8A7AD1EB498CC76A-FICuS, E5F83F10C5DB080A-FICuS, A3DAE0788050DDE4-FICuS, 9850FD9F9E2B4E25-FICuS, B2FDA68AEDA06DB9-FICuS, E5F83F10C5DB080A-FICuS, 9850FD9F9E2B4E25-FICuS

  19. [Figure: the histidine variants from slide 16 annotated with their uuuuu identifiers] Identifiers in slide order: 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-FICuS

  20. [Figure: the histidine variants from slide 16 annotated with their Std. InChIKeys] Identifiers in slide order: HNDVDQJCIGZPNO-RXMQYKEDSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-YFKPBYRVSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M, HNDVDQJCIGZPNO-CDYZYAPPSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N
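One way to read slides 17 to 20 is to count how many distinct identifier values each scheme assigns to the nine histidine drawings. The sketch below does only that. The hashcodes and InChIKeys are copied from the slides; comparing hashcodes without their tag suffix is a simplification made here, and the variable names are hypothetical.

```python
# Hashcode part of the nine identifiers on each slide, in slide order.
FICTS = [
    "E92E4BA2869F3611", "6C16DE2351F9FF50", "8A7AD1EB498CC76A",
    "E5F83F10C5DB080A", "A3DAE0788050DDE4", "9850FD9F9E2B4E25",
    "B2FDA68AEDA06DB9", "E5F83F10C5DB080A", "9850FD9F9E2B4E25",
]
FICUS = [
    "E92E4BA2869F3611", "9850FD9F9E2B4E25", "8A7AD1EB498CC76A",
    "E5F83F10C5DB080A", "A3DAE0788050DDE4", "9850FD9F9E2B4E25",
    "B2FDA68AEDA06DB9", "E5F83F10C5DB080A", "9850FD9F9E2B4E25",
]
UUUUU = ["9850FD9F9E2B4E25"] * 9
STD_INCHIKEY = [
    "HNDVDQJCIGZPNO-RXMQYKEDSA-N", "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "HNDVDQJCIGZPNO-YFKPBYRVSA-N", "UHPNKBYGGMJTIM-UHFFFAOYSA-M",
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N", "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
    "UHPNKBYGGMJTIM-UHFFFAOYSA-M", "HNDVDQJCIGZPNO-CDYZYAPPSA-N",
    "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
]

schemes = {"FICTS": FICTS, "FICuS": FICUS, "uuuuu": UUUUU, "Std. InChIKey": STD_INCHIKEY}

# Number of distinct values each scheme assigns to the nine drawings.
distinct = {name: len(set(values)) for name, values in schemes.items()}
```

The less sensitive an identifier is, the fewer distinct values it produces for the same set of drawings, with uuuuu collapsing all nine to a single hashcode.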

  21. NCI/CADD Chemical Structure Database: Structure Normalization [Diagram: stack of original records] 119.8 million original structure records in CSDB

  22. NCI/CADD Chemical Structure Database: Structure Normalization [Diagram: original records mapped to FICTS parent structures] 119.8 million original structure records in CSDB → 83.1 million FICTS parent structures

  23. NCI/CADD Chemical Structure Database: Structure Normalization [Diagram: original records mapped to FICTS and FICuS parent structures] 119.8 million original structure records in CSDB → 83.1 million FICTS parent structures → 81.6 million FICuS parent structures

  24. NCI/CADD Chemical Structure Database: Structure Normalization [Diagram: original records mapped to FICTS, FICuS and uuuuu parent structures] 119.8 million original structure records in CSDB → 83.1 million FICTS parent structures → 81.6 million FICuS parent structures → 76.2 million uuuuu parent structures

  25. NCI/CADD Chemical Structure Database: Structure Normalization [Diagram: original records mapped to FICTS, FICuS and uuuuu parent structures] 119.8 million original structure records in CSDB → 83.1 million FICTS parent structures → 81.6 million FICuS parent structures (tautomer-invariant) → 76.2 million uuuuu parent structures
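The counts on slides 21 to 25 can be turned into shares of the original records with a short calculation. This sketch uses only the figures from the slides (in millions); the variable names are hypothetical.

```python
# Figures from the slides, in millions.
original_records = 119.8
parent_structures = {"FICTS": 83.1, "FICuS": 81.6, "uuuuu": 76.2}

# Unique parent structures as a percentage of all original records.
share_of_records = {
    tag: round(100 * count / original_records, 1)
    for tag, count in parent_structures.items()
}

# Structures merged when tautomers are no longer distinguished (FICTS -> FICuS).
merged_by_tautomer_invariance = round(
    parent_structures["FICTS"] - parent_structures["FICuS"], 1
)
```

Going from FICTS to the tautomer-invariant FICuS level merges about 1.5 million parent structures.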

  26. Tautomer Analysis How much “chemical space” is “just generated” by drawing tautomers?

  27. NCI/CADD Chemical Structure Database: Tautomer Analysis • CACTVS: generation of all formal tautomers for a given organic compound (prototropic tautomerism) • rule set of 21 transforms encoded as (CACTVS-extended) SMIRKS • the rule set is systematically applied to the original structure (and to all tautomers generated in previous steps) • tautomer generation is limited to 1000 SMIRKS transform operations per structure • all tautomers are ranked by a scoring function • the highest-ranked tautomer is defined as the canonical tautomer
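The enumeration loop described on this slide (apply every rule to every form found so far, stop after a fixed budget of transform operations, then rank) can be sketched generically. This is not CACTVS code and performs no chemistry: structures are toy strings in which "K" stands for a keto site and "E" for an enol site, the two transform functions stand in for SMIRKS rules, and the scoring function (prefer more keto sites) is an arbitrary assumption. Only the overall scheme and the default budget of 1000 operations come from the slide.

```python
from collections import deque


def enumerate_forms(start, transforms, max_ops=1000):
    """Breadth-first closure of `start` under `transforms`.

    Returns (forms, exhaustive); `exhaustive` is False when the budget of
    transform applications ran out before all forms were processed.
    """
    seen = {start}
    queue = deque([start])
    ops = 0
    while queue:
        current = queue.popleft()
        for transform in transforms:
            if ops >= max_ops:
                return seen, False
            ops += 1
            for product in transform(current):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
    return seen, True


# Toy stand-ins for SMIRKS rules: flip one site at a time.
def keto_to_enol(form):
    return [form[:i] + "E" + form[i + 1:] for i, ch in enumerate(form) if ch == "K"]


def enol_to_keto(form):
    return [form[:i] + "K" + form[i + 1:] for i, ch in enumerate(form) if ch == "E"]


def score(form):
    # Assumed toy ranking: more keto sites first, ties broken alphabetically.
    return (form.count("K"), form)


forms, exhaustive = enumerate_forms("EK", [keto_to_enol, enol_to_keto])
canonical = max(forms, key=score)
```

Because every form is ranked by the same scoring function, the canonical form does not depend on which tautomer the enumeration started from, provided the enumeration was exhaustive.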

  28. NCI/CADD Chemical Structure Database: Tautomer Analysis • 21 SMIRKS transform rules:
     rule 1: 1.3 (thio)keto/(thio)enol
     rule 2: 1.5 (thio)keto/(thio)enol
     rule 3: simple (aliphatic) imine
     rule 4: special imine
     rule 5: 1.3 aromatic heteroatom H shift
     rule 6: 1.3 heteroatom H shift
     rule 7: 1.5 (aromatic) heteroatom H shift (1)
     rule 8: 1.5 aromatic heteroatom H shift (2)
     rule 9: 1.7 (aromatic) heteroatom H shift
     rule 10: 1.9 (aromatic) heteroatom H shift
     rule 11: 1.11 (aromatic) heteroatom H shift
     rule 12: furanones
     rule 13: ketene/ynol exchange
     rule 14: ionic nitro/aci-nitro
     rule 15: pentavalent nitro/aci-nitro
     rule 16: oxime/nitroso
     rule 17: oxime/nitroso via phenol
     rule 18: cyanic/iso-cyanic acids
     rule 19: formamidinesulfinic acids
     rule 20: isocyanides
     rule 21: phosphonic acids

  29. NCI/CADD Chemical Structure Database: Tautomer Analysis. Starting from the set of FICuS parent structures, we systematically generated all tautomers based on the set of 21 SMIRKS rules available in CACTVS. • 70.6 million FICuS parent structures (2009 DB version) • 680 million tautomers generated • for 1.7% of the FICuS parent structures the enumeration was not exhaustive
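Two derived figures follow directly from the numbers on this slide: the average number of generated tautomers per parent structure and the approximate number of structures with a non-exhaustive enumeration. A minimal calculation, using only the slide's values (in millions) and hypothetical variable names:

```python
# Figures from the slide, in millions.
ficus_parents = 70.6
generated_tautomers = 680.0
non_exhaustive_fraction = 0.017  # 1.7%

# Average number of generated tautomers per FICuS parent structure.
tautomers_per_structure = round(generated_tautomers / ficus_parents, 1)

# Approximate number of parent structures (in millions) that hit the limit.
non_exhaustive_structures = round(non_exhaustive_fraction * ficus_parents, 1)
```

This gives roughly 9.6 tautomers per parent structure on average, and about 1.2 million structures for which the enumeration stopped early.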

  30. NCI/CADD Chemical Structure Database: Tautomer Analysis [Histogram: tautomeric overlap within each individual database release (%), x-axis from 0.0 to 2.0; frequency (number of database releases), y-axis from 0 to 90] average: ~0.3% of original structure records

  31. NCI/CADD Chemical Structure Database: Tautomer Analysis [Histogram: tautomeric overlap within each individual database release (%), x-axis from 0.0 to 2.0; frequency (number of database releases), y-axis from 0 to 90; bars annotated with database names] Database groups labeled on the histogram: • Ambinter, BIND, BindingDB, ChemNavigator, KEGG, NCI Open Database, NIST WebBook, NLM ChemIDplus, NMRShiftDB, Thomson Pharma, Wombat • Asinex, ChemBridge, ComGenex, ChemNavigator, Columbia University Molecular Screening Center, EPA DSSTox, Specs • NCI/DTP, PASS Training Set, SGC-Ox • ChemDB, ZINC • ChEBI, ChemSpider. average: ~0.3% of original structure records

  32. NCI/CADD Chemical Structure Database: Tautomer Analysis [Histogram: occurrence of “tautomerism-critical” molecules within each individual database release (%), x-axis from 0.5 to 24.5; frequency (number of database releases), y-axis from 0 to 30] Plotted is the percentage of FICuS parent structures in each database release that occur somewhere in CSDB with a conflict. average: ~9.5% of FICuS parent structures

  33. Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5) [Figure: structure of HPMBP] • HPMBP is used in liquid membranes (selective removal of metal ions) • selectivity and efficiency depend on the tautomeric form of HPMBP, which itself depends on the solvent and the concentration of HPMBP. Reference: He, D.; Li, Z.; Ma, M.; Huang, J.; Yang, Y. Study of extraction characteristics of HPMBP. 1. Tautomer and extraction characteristics. J. Chem. Eng. Data 2009, 54(10), 2944-2947

  34. Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5) [Figure: the tautomers of HPMBP, with potential stereo centers marked R/S and E/Z and one form marked as the canonical tautomer by CACTVS] • CACTVS generates 7 tautomers • 5 tautomers have potential stereo centers on atoms or bonds

  35. Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5) [Figure: the tautomers of HPMBP annotated with CAS Registry Numbers] 3 tautomers have CAS Registry Numbers assigned. Annotations in slide order: 33064-14-1; 4551-69-1; 859 references; 49 references (no stereo); 127117-31-1, 3 references (Z)

  36. Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5) [Figure: the tautomers of HPMBP annotated with their occurrences in databases indexed in CSDB] Counts in slide order: 12 databases; 16 databases (no stereo); 3 databases (R); 2 databases (S); 6 databases; 1 database (no stereo)

  37. Example for a Tautomer “Conflict”: HPMBP (1-Phenyl-3-methyl-4-benzoyl-pyrazolone-5) [Figure: the tautomers of HPMBP annotated with the databases in which each form occurs] Occurrences in databases: • 12 databases: ACD 3D, Ambinter, BindingDB, ChemBank, ChemDB, ChemSpider, ChemNavigator, MLSMR, NIAID, Scripps Screening Center, Thomson Pharma, ZINC • 16 databases (no stereo): ACD 3D, ACX, Ambinter, BioByte QSAR, ChemBank, ChemBridge, ChemDB, ChemSpider, DiscoveryGate, EPA GCES, MLSMR, NCI Open Database, NIST MS-Lib, NLM ChemIDplus, Sigma-Aldrich, Thomson Pharma • 3 databases (R): ChemSpider, ECOTOX, ZINC • 2 databases (S): ChemSpider, ZINC • 6 databases: Ambinter, ChemDB, ChemSpider, DiscoveryGate, ChemNavigator, Thomson Pharma • 1 database (no stereo): ChemDB

  38. Scaffold Analysis

  39. NCI/CADD Chemical Structure Database: Scaffold Analysis [Figure: an example molecule and the scaffolds derived from it] • molecular scaffold tree (level 1, level 2): Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47-58 • simple scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893 • archetype scaffold: Bemis et al. J. Med. Chem. 1996, 39, 2887-2893

  40. NCI/CADD Chemical Structure Database: Scaffold Analysis. CSDB uuuuu compound set: 76.2 million structures

  41. NCI/CADD Chemical Structure Database: Scaffold Analysis [Figure: example scaffolds] Scaffolds derived from the CSDB uuuuu compound set (76.2 million structures): • molecular scaffold tree (level 1, level 2): 8.1 million scaffolds • simple scaffold: 6.8 million scaffolds • archetype scaffold: 0.8 million scaffolds

  42. NCI/CADD Chemical Structure Database: Scaffold Analysis [Chart: number of unique scaffolds per hierarchy level of the molecular scaffold tree (8.1 million scaffolds) for the CSDB uuuuu compound set (76.2 million); axes: hierarchy level (0 to 10), number of unique scaffolds in millions (0 to 8.0), number of unique structures in millions (0 to 80.0)]

  43. Atom Neighborhoods

  44. NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA) [Figure: example molecule with its MNA descriptors] • MNA level 1: HC, HO, CHCC, CHCN, CCCC, CCOO, NCC, OHC, OC • MNA level 2: C(C(CC-H)C(CC-C)-H(C)), C(C(CC-H)C(CN-H)-H(C)), C(C(CC-H)C(CN-H)-C(C-O-O)), C(C(CC-H)N(CC)-H(C)), C(C(CC-C)N(CC)-H(C)), N(C(CN-H)C(CN-H)), -H(C(CC-H)), -H(C(CN-H)), -H(-O(-H-C)), -C(C(CC-C)-O(-H-C)-O(-C)), -O(-H(-O)-C(C-O-O)), -O(-C(C-O-O)). Reference: Filimonov, D.; Poroikov, V.; Borodina, Yu.; Gloriozova, T. J. Chem. Inf. Comput. Sci. 1999, 39(4), 666-670
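The level-1 strings on this slide (for example "OHC" for an oxygen bonded to a hydrogen and a carbon) can be imitated with a simplified sketch: write the atom symbol followed by the symbols of its direct neighbors. This is not the published MNA algorithm. The neighbor ordering (hydrogen first, then alphabetical) is an assumption chosen to match the strings on the slide, and the "-" markers that appear in the level-2 descriptors are ignored. The methanol example and all names are hypothetical.

```python
def mna_level1(symbols, bonds):
    """Simplified level-1 descriptors: atom symbol + symbols of bonded neighbors.

    `symbols` lists the element symbol of each atom; `bonds` lists pairs of
    atom indices. Neighbor order (H first, then alphabetical) is an assumption.
    """
    neighbors = {index: [] for index in range(len(symbols))}
    for a, b in bonds:
        neighbors[a].append(symbols[b])
        neighbors[b].append(symbols[a])

    def order(symbol):
        return (symbol != "H", symbol)

    return [
        symbols[index] + "".join(sorted(neighbors[index], key=order))
        for index in range(len(symbols))
    ]


# Methanol, CH3-OH, with explicit hydrogens (toy input, not from the slides).
methanol_symbols = ["C", "O", "H", "H", "H", "H"]
methanol_bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)]

descriptors = mna_level1(methanol_symbols, methanol_bonds)
unique_descriptors = set(descriptors)
```

Each atom yields one descriptor, so a database-wide count of unique descriptors, as on the following slides, amounts to taking the union of these sets over all structures.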

  45. NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA). CSDB uuuuu compound set: 76.2 million structures

  46. NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA). CSDB uuuuu compound set: 76.2 million structures. Unique MNAs: • level 1: 13,426 • level 2: 918,516

  47. NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA). CSDB uuuuu compound set: 76.2 million structures. Unique MNAs: • level 1: 13,426 unique MNAs, 1.3 billion relationships, ~17 MNAs per uuuuu parent structure • level 2: 918,516 unique MNAs, 2.3 billion relationships, ~30 MNAs per uuuuu parent structure

  48. NCI/CADD Chemical Structure Database: Multilevel Neighborhoods of Atoms (MNA). CSDB uuuuu compound set: 76.2 million structures. Unique MNAs: • level 1: 13,426 unique MNAs, 1.3 billion relationships, ~17 MNAs per uuuuu parent structure • level 2: 918,516 unique MNAs, 2.3 billion relationships, ~30 MNAs per uuuuu parent structure • surprising: 424,784 MNAs (level 2) are exclusive to a set of 1.3 million structures in ChemSpider
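The per-structure averages quoted on these slides follow from dividing the relationship counts by the size of the compound set, and the ChemSpider observation can be expressed as a share of all level-2 MNAs. A minimal check using only the slides' figures, with hypothetical variable names:

```python
# Figures from the slides.
structures = 76.2e6
relationships = {"level 1": 1.3e9, "level 2": 2.3e9}
unique_level2 = 918_516
chemspider_exclusive_level2 = 424_784

# Average number of MNA descriptors per uuuuu parent structure.
mnas_per_structure = {
    level: round(count / structures) for level, count in relationships.items()
}

# Share of all level-2 MNAs that occur only in the ChemSpider subset.
exclusive_share_percent = round(100 * chemspider_exclusive_level2 / unique_level2, 1)
```

The result reproduces the slide's averages of about 17 and 30, and shows that roughly 46% of all level-2 MNAs come exclusively from that ChemSpider subset.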

  49. NCI/CADD Web Resources: Chemical Structure Web Services [Architecture diagram] Components shown: Chemical Identifier Resolver; NCI/CADD web services (accessed over http); external (web) services; other software packages (e.g. OPSIN); CACTVS; NCI/CADD Chemical Structure Database (CSDB)

  50. NCI/CADD Web Resources: Chemical Identifier Resolver. Tools and sites shown: • Symyx Draw Resolver (http://www.symyx.com/) • webel.py, a Cinfony module (http://baoilleach.blogspot.com/2009/11/introducing-webel-cheminformatics.html) • avogadro.openmolecules.net/ • http://www.akosgmbh.eu/globalsearch/index.htm • gChem Virtual Molecular Model Kit (http://chemagic.com/web_molecules/script_page_large.aspx) • CACTVS (http://www.xemistry.com) • IUPHAR DATABASE (http://www.iuphar-db.org)
