
Part 3:


chesna

Presentation Transcript


  1. Part 3: Essentials

  2. Global Entrez Search Page: All[Filter]

  3. Overall Goal: An on-line resource providing comprehensive information on the biological activities of small molecules

  4. Why Are Small Molecules Important?
      • Constituents of all macromolecules (DNA, RNA, protein, carbohydrates, etc.)
      • Serve as cofactors and signaling molecules to thousands of proteins
      • The chemistry part of “biochemistry”
      • Most drug entities and drug types are small molecules
      • Most biomarkers used in clinical chemistry are small molecules

  5. PubChem Databases and Tools: http://pubchem.ncbi.nlm.nih.gov/

  6. The Molecular Libraries Roadmap: An Integrated Initiative (diagram labels: Technology Development; Screening; Informatics; Chem-informatics Research Centers; Molecular Libraries Screening Centers Network (MLSCN); Assay Development; Instrumentation; Compound Repository (MLSMR); Chemical Diversity; Predictive ADMET)

  7. PubChem =
      • Repository for small molecules and bioactivity assay data
      • Part of the Entrez search and linking system
      • Links to other NCBI databases, e.g., PubMed, MeSH, protein structures (MMDB), protein/nucleotide sequences (GenPept/GenBank)
      • Contains complete chemical structures
      • Standardized for uniformity
      • Small set of computed properties
      • Structure similarity searching

  8. Other Depositors to PubChem and more…

  9. PubChem: Bird’s Eye View (diagram labels: Depositors; PubChem Substance; PubChem BioAssay; PubChem Compound; Chemical Structure Similarity)

  10. How does data get into PubChem?

  11. PubChem integration in Entrez (diagram labels: VAST Structure Similarity; Term Frequency Statistics; Literature; 3D Structures; Bioactivity Assay Results; Small Molecule Structures; Chemical Structure Similarity; Protein Sequences; Activity Profile Similarity)

  12. Primary Database

  13. Depositor Data
      • No “global” rules or standards
      • Based on organizational needs
      • Lots of data overlap
      • Often based on individual scientist preferences
      • PubChem accepts data from many organizations
      • Previously unseen data representation
      • Combinatorial explosion of ways for drawing the same structure

  14. Redundancy, mixtures (figure label: Mixture)

  15. Derivative Database

  16. Chemical structures may be represented in many different ways

  17. Chemical structures may be represented in many different ways

  18. Substance Compound

  19. Substance Compound (figure labels: unknown E/Z isomers; unknown stereo; known stereochemistry)

  20. Substances (heterogens) from Protein 3D structures (PDB)
      • PDB lacks chemical detail
          • no bond order information
          • no hydrogens
      • Deposited structure receives
          • bond information
          • hydrogens
          • stereochemistry (where possible)
      • Most molecules come out right, even complex ones
      • Sometimes there is a need to fix problems, e.g. bond orders
      Figure labels: Vancomycin; Result; Need to fix heme bond orders; Dopamine

  21. PubChem Compound Processing
      • Chemical Data Verification
          • Atom description (label, element?)
          • Functional group clean-up
          • Atom valence verification to prevent nonsense
      • “Normalize” and “Standardize”
          • Valence-bond canonicalization (for tautomer invariance)
          • Aromaticity detection and self-consistency
          • Stereochemistry detection
          • Explicit hydrogen assignment
      • Calculation
          • 2-D coordinate generation
          • Image depictions
          • Fingerprints
          • IUPAC name
          • SMILES, InChI, hash codes
          • xLogP, TPSA, HBD, HBA, MW, MF

  22. Chemical Structure “Sanitization”
      • Chemical structures that fail sanitization
          • Are not part of the aggregated PubChem Compound database
          • Are still “searchable” via the PubChem Substance database
      • Keeps the PubChem Compound database “clean” for chemical informatics analysis
      • Collapses structures represented in various ways into a uniform, identical representation

  23. Compound for mixture; component compounds

  24. Components of a mixture
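The relationship between a mixture and its component compounds can be sketched with SMILES strings, where a "." separates disconnected parts. This is a minimal illustration, not PubChem's actual processing: the function names are hypothetical, components are compared as plain strings rather than as standardized structures, and the input is assumed to have no ring-closure bonds spanning the dots.

```python
from typing import List


def split_components(smiles: str) -> List[str]:
    """Split a SMILES string into its dot-separated parts."""
    # "." is the SMILES separator between disconnected parts.
    return [part for part in smiles.split(".") if part]


def unique_components(smiles: str) -> List[str]:
    """Return each distinct part once, in order of first appearance.

    Assumption: string equality stands in for structural identity,
    which only holds if the parts were already written canonically.
    """
    seen: List[str] = []
    for part in split_components(smiles):
        if part not in seen:
            seen.append(part)
    return seen


if __name__ == "__main__":
    # Sodium chloride written as two charged components.
    print(split_components("[Na+].[Cl-]"))
    # A hydrate-style string with a repeated water component.
    print(unique_components("O.O.[Na+].[Cl-]"))
```

A real pipeline would standardize each component first, so that two drawings of the same structure map to the same record.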

  25. Substance vs. Compound (Substance summary; Compound summary)

  26. Substance vs. Compound

  27. Examples of queries
      • 200[MW]
      • "dopamine"[CompleteSynonym]
      • 300:500[MW]
      • "pcsubstance structure"[Filter]
      • "ca"[Element] AND 300:500[MW] AND "chemidplus"[SourceName]
      • "InChI=1/Ca.3H2O/h;3*1H2/q+2;;;/p-3/fCa.3HO/h;3*1h/qm;3*-1"[InChI]
      • "lipinski"[Filter] AND "antineoplastic agents"[PharmAction]
      Lipinski rule of 5: a molecule is likely to be orally active if it has:
      • not more than 5 hydrogen bond donors (OH and NH groups)
      • not more than 10 hydrogen bond acceptors (N or O atoms)
      • a molecular weight under 500
      • a LogP under 5
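The Lipinski criteria listed on this slide can be checked mechanically once the four properties are known. The sketch below is only an illustration: the `Properties` class and `lipinski_violations` function are hypothetical names, the thresholds follow the slide wording ("under 500", "under 5"), and the example values are illustrative rather than taken from a PubChem record. Whether the "lipinski" filter applies exactly these cut-offs is not stated in the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Properties:
    hbd: int      # hydrogen bond donor count
    hba: int      # hydrogen bond acceptor count
    mw: float     # molecular weight
    logp: float   # computed LogP


def lipinski_violations(p: Properties) -> List[str]:
    """Return the names of the rule-of-5 criteria that p fails."""
    violations = []
    if p.hbd > 5:
        violations.append("hydrogen bond donors > 5")
    if p.hba > 10:
        violations.append("hydrogen bond acceptors > 10")
    if p.mw >= 500:          # slide wording: "under 500"
        violations.append("molecular weight not under 500")
    if p.logp >= 5:          # slide wording: "under 5"
        violations.append("LogP not under 5")
    return violations


if __name__ == "__main__":
    small = Properties(hbd=3, hba=3, mw=153.18, logp=-1.0)   # illustrative values
    large = Properties(hbd=6, hba=12, mw=620.0, logp=5.5)    # illustrative values
    print(lipinski_violations(small))
    print(lipinski_violations(large))
```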

  28. Examples of PubChem Index Fields …
      • All [ALL] -- All of the following fields are searched; default search field.
      • Uid [UID] -- The integer represents the SID for the PCSubstance database. By default, an integer without a field alias is recognized as a UID. Same as [SID].
      • Filter [Filter] -- Limits the records to various indexed filters.
      • ActiveAid [AA] -- Active BioAssay identifier, integer.
      • ActiveAidCount [AC, ACNT] -- Number of bioassays where tested active.
      • AtomChiralCount [ACC, ACCNT] -- Total count of chiral atoms in a given compound.
      • BioAssayID [BAID, AID] -- BioAssay identifier.
      • BondChiralCount [BCC, BCCNT] -- Number of chiral bonds.
      • Comment [CMT] -- Substance or bioassay comment.
      • CompleteSynonym [CSYN, CSYNO] -- Exactly matching name for a substance/compound.
      • CompoundID [CID] -- Compound identifier, integer.
      • DepositDate [DDAT, DEPDAT] -- Deposition timestamp for a substance.
      • Element [ELMT, EL] -- Chemical element in a substance/compound.
      • ExactMass [EMAS, EXMASS] -- Calculated mass of an ion or molecule with the most likely isotopic composition for a single random molecule, corresponding to the most intense ion/molecule peak in a mass spectrum. A real number.
      • HeavyAtomCount [HAC, HACNT] -- Atom count in a compound excluding hydrogen, integer.
      • HydrogenBondAcceptorCount [HBAC, HBACNT] -- Hydrogen bond acceptors for a compound, integer.
      • HydrogenBondDonorCount [HBDC, HBDCNT] -- Hydrogen bond donors for a compound, integer.
      • InChI [inchi] -- IUPAC International Chemical Identifier.

  29. Examples of PubChem Index Fields, contd.
      • IUPACName [UPAC, IUPAC] -- Standard IUPAC name for a compound.
      • MeSHDescription [MHD]
      • MeSHTerm [MSHT, MESHT] -- Medical Subject Heading term.
      • MeSHTreeNode [MSHN, MESHTN] -- Medical Subject Heading tree node (tree structures).
      • MolecularWeight [MW, MWT, MOLWT] -- Mass of a molecule calculated using the average mass of each element weighted for its natural isotopic abundance. E.g., carbon has two natural isotopes, 12 and 13, with relative abundances of 98.9% and 1.1%, which yield an average mass of 12.011 g/mol. A real number.
      • MonoisotopicMass [MMAS, MIMASS] -- Mass of a molecule calculated using the mass of the most abundant isotope of each element. E.g., carbon has a monoisotopic mass of 12.000 g/mol. A real number.
      • PharmAction [PHMA, PHARMA] -- MeSH pharmacological actions heading.
      • RotatableBondCount [RBC, RBCNT] -- Number of rotatable bonds.
      • SourceCategory [SRCC, SRCCAT, SRCCATG] -- Depositor categories.
      • SourceID [SRID, SRCID] -- Depositor's external ID.
      • SourceName [SRC, SRCNAM, SRCNAME] -- Official depositor name.
      • SubstanceID [SID] -- Substance ID. Same as [UID].
      • Synonym [SYNO] -- Synonyms for a substance.
      • TautomerCount [TC, TCNT, TTMC] -- Possible tautomer count for each given structure, ≤ 200.
      • TotalFormalCharge [TFC, CHG, CHRG] -- Total formal charge.
      • TPSA [TPSA] -- Topological Polar Surface Area.
      • XLogP [XLGP, LOGP]
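The difference between MolecularWeight and MonoisotopicMass can be worked through numerically. The sketch below reproduces the carbon example from the slide and then applies both definitions to dopamine (C8H11NO2). The element tables are a small hand-entered subset of standard atomic and isotopic masses, and the function name `formula_mass` is hypothetical; this is not how PubChem itself computes the fields.

```python
from typing import Dict

# Abundance-weighted average masses (g/mol), small subset.
AVERAGE_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

# Mass of the most abundant isotope of each element, small subset.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
}


def formula_mass(formula: Dict[str, int], table: Dict[str, float]) -> float:
    """Sum element masses from `table` over the atom counts in `formula`."""
    return sum(table[element] * count for element, count in formula.items())


# Carbon: 98.9% carbon-12 (12.000) and 1.1% carbon-13 (13.00335).
carbon_average = 0.989 * 12.0 + 0.011 * 13.00335

DOPAMINE = {"C": 8, "H": 11, "N": 1, "O": 2}

if __name__ == "__main__":
    print(round(carbon_average, 3))
    print(round(formula_mass(DOPAMINE, AVERAGE_MASS), 3))
    print(round(formula_mass(DOPAMINE, MONOISOTOPIC_MASS), 4))
```

The two dopamine values differ by roughly 0.1, which is why a mass range query needs the field that matches the kind of mass being searched.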

  30. Preview/Index Tab

  31. History Tab: substances of MW 300-500 Da having antineoplastic properties and obeying the Lipinski rule of 5
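A combined search like the one on this slide can also be expressed as a single Entrez query string and sent through the NCBI E-utilities ESearch endpoint. The sketch below only builds the URL and makes no network call. The helper names `and_terms` and `esearch_url` are hypothetical, and it is an assumption that the field names and filters shown in these slides are still accepted by the current service.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_terms(*terms: str) -> str:
    """Join individual Entrez terms with the AND operator."""
    return " AND ".join(terms)


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for database `db` and query `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# MW 300-500, Lipinski filter, antineoplastic pharmacological action.
term = and_terms(
    "300:500[MW]",
    '"lipinski"[Filter]',
    '"antineoplastic agents"[PharmAction]',
)
url = esearch_url("pcsubstance", term)

if __name__ == "__main__":
    print(term)
    print(url)
```

Swapping `"pcsubstance"` for `"pccompound"` would direct the same term at the Compound database instead.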

  32. Links: for the whole set or only selected records

  33. Property Report

  34. SDF format
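An SDF file holds one record per structure: a molfile block, then data items introduced by `> <TAG>` lines, with `$$$$` closing each record. The sketch below pulls out only the data items and ignores the structure block. The function name `parse_sdf_data` is hypothetical, the tag names are written as I recall them from PubChem downloads and should be treated as assumptions, and the sample record uses placeholder values.

```python
from typing import Dict, List


def parse_sdf_data(text: str) -> List[Dict[str, str]]:
    """Return the data items of each SDF record as a dict of tag -> value."""
    records = []
    for chunk in text.split("$$$$"):
        if not chunk.strip():
            continue  # skip the empty tail after the last delimiter
        fields: Dict[str, List[str]] = {}
        tag = None
        for line in chunk.splitlines():
            if line.startswith(">"):
                # Data header line, e.g. "> <SOME_TAG>".
                start, end = line.find("<"), line.rfind(">")
                tag = line[start + 1:end] if 0 <= start < end else None
                if tag is not None:
                    fields[tag] = []
            elif tag is not None:
                if line.strip():
                    fields[tag].append(line)
                else:
                    tag = None  # a blank line ends the current data item
        records.append({key: "\n".join(value) for key, value in fields.items()})
    return records


# Placeholder record: empty structure block, made-up values.
SAMPLE = """example


  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <PUBCHEM_COMPOUND_CID>
123

> <PUBCHEM_MOLECULAR_WEIGHT>
153.18

$$$$
"""

if __name__ == "__main__":
    for record in parse_sdf_data(SAMPLE):
        print(record)
```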

  35. Medical Subject Headings (MeSH)
      • MeSH is the National Library of Medicine's controlled vocabulary thesaurus.
      • Consists of sets of terms naming descriptors in a hierarchical and alphabetic structure, e.g.: “Mental Disorders”, “Pharmacological action”, “Catecholamine hormones”, etc.
      • Permits searching at various levels of specificity
      • MeSH thesaurus is used for indexing articles for the MEDLINE/PubMed database
      • MeSH is continually updated
      • PubChem assigns MeSH headings to Compound records

  36. Primary Database
      • Contains bioactivity screens of chemical substances described in PubChem Substance
      • Provides searchable descriptions of each bioassay, including descriptions of the conditions and readouts specific to a screening protocol
      • Depositor decides on data definitions and interpretation
      • Data can be plotted as graphs of statistical histograms
      • Cross-indexed to other Entrez databases

  37. Click to view structure

  38. NCBI FTP >> PubChem Folder

  39. Entrez PubChem: Help and Tabs
