Part 3: Essentials

Global Entrez Search Page All[Filter]

Overall Goal: An on-line resource providing comprehensive information on the biological activities of small molecules

Why Are Small Molecules Important?
• Constituents of all macromolecules (DNA, RNA, protein, carbohydrates, etc.)
• Serve as cofactors and signaling molecules to thousands of proteins
• The chemistry part of “biochemistry”
• Most drug entities and drug types are small molecules
• Most biomarkers used in clinical chemistry are small molecules

PubChem Databases and Tools: http://pubchem.ncbi.nlm.nih.gov/

The Molecular Libraries Roadmap: An Integrated Initiative
Diagram labels: Technology Development; Screening; Informatics; Chem-informatics Research Centers; Molecular Libraries Screening Centers Network (MLSCN); Assay Development; Instrumentation; Compound Repository (MLSMR); Chemical Diversity; Predictive ADMET

PubChem =
• Repository for small molecules and bioactivity assay data
• Part of the Entrez search and linking system
• Links to other NCBI databases, e.g., PubMed, MeSH, protein structures (MMDB), protein/nucleotide sequences (GenPept/GenBank)
• Contains complete chemical structures
• Standardized for uniformity
• Small set of computed properties
• Structure similarity searching

Other Depositors to PubChem and more…

PubChem: Bird’s Eye View
Diagram labels: Depositors; PubChem Substance; PubChem BioAssays; PubChem Compound; Chemical Structure Similarity

How does data get into PubChem?

PubChem integration in Entrez
Diagram labels: VAST Structure Similarity; Term Frequency Statistics; Literature; 3D Structures; Bioactivity Assay Results; Small Molecule Structures; Chemical Structure Similarity; Protein Sequences; Activity Profile Similarity

Primary Database

Depositor Data
• No “global” rules or standards
• Based on organizational needs
• Lots of data overlap
• Often based on individual scientist preferences
• PubChem accepts data from many organizations
• Previously unseen data representation
• Combinatorial explosion of ways for drawing the same structure

Redundancy, mixtures Mixture

Derivative Database

Chemical structures may be represented in many different ways

Substance Compound

Substance Compound: Unknown E/Z isomers; Unknown stereo; Known stereochemistry

Substances (heterogens) from Protein 3D structures (PDB)
• PDB lacks chemical detail
  • no bond order information
  • no hydrogens
• Deposited structure receives
  • bond information
  • hydrogens
  • stereochemistry (where possible)
• Most molecules come out right, even complex ones
• Sometimes there is a need to fix problems, e.g. bond orders
Figure labels: Vancomycin; Result; Need to fix heme bond orders; Dopamine

PubChem Compound Processing
• Chemical Data Verification
  • Atom description (label, element?)
  • Functional group clean-up
  • Atom valence verification to prevent nonsense
• “Normalize” and “Standardize”
  • Valence-bond canonicalize (for tautomer invariance)
  • Aromaticity detection and self-consistency
  • Stereochemistry detection
  • Explicit hydrogen assignment
• Calculation
  • 2-D coordinate generation
  • Image depictions
  • Fingerprints
  • IUPAC name
  • SMILES, InChI, hash codes
  • xLogP, TPSA, HBD, HBA, MW, MF

Chemical Structure “Sanitization”
• Chemical structures that fail sanitization
  • Are not part of the aggregated PubChem Compound database
  • Are still “searchable” via the PubChem Substance database
• Keeps the PubChem Compound database “clean” for chemical informatics analysis
• Collapses structures represented in various ways into a uniform, identical representation
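The idea of collapsing many deposited substances into one compound record can be sketched as a grouping step. This is only a toy illustration, not PubChem's actual pipeline: it assumes each deposited record already carries a standardized structure key (here an InChI string), and the SIDs and the `aggregate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical depositor records: (SID, standardized structure key).
# The SIDs are made up; the keys are the standard InChI strings for
# ethanol and water. None marks a record that failed sanitization.
substances = [
    (1001, "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),  # ethanol, depositor A
    (1002, "InChI=1S/H2O/h1H2"),                   # water
    (1003, "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),  # ethanol, depositor B
    (1004, None),                                  # failed sanitization
]


def aggregate(records):
    """Group substance IDs by structure key; collect failures separately."""
    compounds = defaultdict(list)
    unassigned = []
    for sid, key in records:
        if key is None:
            # Stays in the substance collection only, with no compound record.
            unassigned.append(sid)
        else:
            compounds[key].append(sid)
    return dict(compounds), unassigned


compounds, unassigned = aggregate(substances)
for key, sids in compounds.items():
    print(key, "<-", sids)
print("substance-only records:", unassigned)
```

Two ethanol depositions end up under a single key, while the record that failed sanitization remains reachable only as a substance.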

Compound for mixture Component compounds

Components of a mixture

Substance vs. Compound Substance summary Compound summary

Substance vs. Compound

Examples of queries
• 200[MW]
• "dopamine"[CompleteSynonym]
• 300:500[MW]
• "pcsubstance structure"[Filter]
• "ca"[Element] AND 300:500[MW] AND "chemidplus"[SourceName]
• "InChI=1/Ca.3H2O/h;3*1H2/q+2;;;/p-3/fCa.3HO/h;3*1h/qm;3*-1"[InChI]
• "lipinski"[Filter] AND "antineoplastic agents"[PharmAction]
Lipinski rule of 5: a molecule is more likely to be orally active if it has:
• not more than 5 hydrogen bond donors (OH and NH groups)
• not more than 10 hydrogen bond acceptors (N or O atoms)
• a molecular weight under 500
• a LogP under 5
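The Lipinski criteria can be written as a simple predicate. The sketch below uses the common formulation (at most 5 donors, at most 10 acceptors, molecular weight under 500, LogP under 5); the function name `passes_lipinski` and the property values passed to it are illustrative assumptions, not values read from PubChem.

```python
def passes_lipinski(hbd, hba, mw, logp):
    """Return True if all four rule-of-5 criteria are met.

    hbd  -- hydrogen bond donor count
    hba  -- hydrogen bond acceptor count
    mw   -- molecular weight
    logp -- octanol/water partition coefficient (e.g. XLogP)
    """
    return hbd <= 5 and hba <= 10 and mw < 500 and logp < 5


# Illustrative property sets (made-up numbers for demonstration).
small_polar = {"hbd": 3, "hba": 3, "mw": 153.18, "logp": -1.0}
large_greasy = {"hbd": 6, "hba": 12, "mw": 650.0, "logp": 6.2}

print(passes_lipinski(**small_polar))
print(passes_lipinski(**large_greasy))
```

In Entrez the same screen is applied with the "lipinski"[Filter] term, so a predicate like this is mainly useful for checking downloaded property tables.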

Examples of PubChem Index Fields …
All [ALL] -- All of the following fields are searched; default search field.
Uid [UID] -- The integer represents the SID for the PCSubstance database. By default, an integer without a field alias is recognized as a UID. Same as [SID].
Filter [Filter] -- Limits the records to various indexed filters.
ActiveAid [AA] -- Active BioAssay identifier, integer.
ActiveAidCount [AC, ACNT] -- Number of bioassays where tested active.
AtomChiralCount [ACC, ACCNT] -- Total count of chiral atoms in a given compound.
BioAssayID [BAID, AID] -- BioAssay identifier.
BondChiralCount [BCC, BCCNT] -- Number of chiral bonds.
Comment [CMT] -- Substance or bioassay comment.
CompleteSynonym [CSYN, CSYNO] -- Exactly matching name for a substance/compound.
CompoundID [CID] -- Compound identifier, integer.
DepositDate [DDAT, DEPDAT] -- Deposition timestamp for a substance.
Element [ELMT, EL] -- Chemical element in a substance/compound.
ExactMass [EMAS, EXMASS] -- The calculated mass of an ion or molecule with the most likely isotopic composition for a single random molecule, corresponding to the mass of the most intense ion/molecule peak in a mass spectrum. A real number.
HeavyAtomCount [HAC, HACNT] -- Count of atoms in a compound other than hydrogen, integer.
HydrogenBondAcceptorCount [HBAC, HBACNT] -- Hydrogen bond acceptors for a compound, integer.
HydrogenBondDonorCount [HBDC, HBDCNT] -- Hydrogen bond donors for a compound, integer.
InChI [inchi] -- IUPAC International Chemical Identifier.

Examples of PubChem Index Fields, contd.
IUPACName [UPAC, IUPAC] -- Standard IUPAC name for a compound.
MeSHDescription [MHD]
MeSHTerm [MSHT, MESHT] -- Medical Subject Heading term.
MeSHTreeNode [MSHN, MESHTN] -- Medical Subject Heading tree node (tree structures).
MolecularWeight [MW, MWT, MOLWT] -- Mass of a molecule calculated using the average mass of each element weighted for its natural isotopic abundance. E.g., carbon has two stable natural isotopes, 12 and 13, with relative abundances of 98.9% and 1.1%, which yield an average mass of 12.011 g/mol. A real number.
MonoisotopicMass [MMAS, MIMASS] -- Mass of a molecule calculated using the mass of the most abundant isotope of each element. E.g., carbon has a monoisotopic mass of 12.000 g/mol. A real number.
PharmAction [PHMA, PHARMA] -- MeSH pharmacological actions heading.
RotatableBondCount [RBC, RBCNT] -- Number of rotatable bonds.
SourceCategory [SRCC, SRCCAT, SRCCATG] -- Depositor categories.
SourceID [SRID, SRCID] -- Depositor's external ID.
SourceName [SRC, SRCNAM, SRCNAME] -- Official depositor name.
SubstanceID [SID] -- Substance ID. Same as [UID].
Synonym [SYNO] -- Synonyms for a substance.
TautomerCount [TC, TCNT, TTMC] -- Possible tautomer count for each given structure, ≤ 200.
TotalFormalCharge [TFC, CHG, CHRG] -- Total formal charge.
TPSA [TPSA] -- Topological Polar Surface Area.
XLogP [XLGP, LOGP]
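The difference between MolecularWeight and MonoisotopicMass can be checked with the carbon numbers quoted in the field descriptions. This is a small worked sketch: the abundances are the rounded values from the slide, and the isotope masses (12.000 and 13.00335) are standard values supplied here as an assumption.

```python
# (isotope mass in u, natural abundance as a fraction) for carbon.
# Abundances are the rounded 98.9% / 1.1% figures quoted above.
carbon_isotopes = [
    (12.000, 0.989),    # carbon-12
    (13.00335, 0.011),  # carbon-13
]

# Average atomic mass: abundance-weighted mean of the isotope masses.
average_mass = sum(mass * abundance for mass, abundance in carbon_isotopes)

# Monoisotopic mass: mass of the single most abundant isotope.
monoisotopic_mass = max(carbon_isotopes, key=lambda pair: pair[1])[0]

print(f"average mass:      {average_mass:.3f}")
print(f"monoisotopic mass: {monoisotopic_mass:.3f}")
```

Summing these per-element values over a molecular formula gives the molecule's MolecularWeight and MonoisotopicMass respectively.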

Preview/Index Tab

History Tab: substances of MW 300-500 Da having antineoplastic properties and obeying the Lipinski rule of 5
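A combined search like this one can also be assembled programmatically and sent to the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and does not contact the server. The helper `build_term` and the `retmax` value are illustrative choices, and whether every field is indexed for a given database is an assumption to check against the Entrez index list.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(*clauses):
    """Join individual Entrez clauses with AND (hypothetical helper)."""
    return " AND ".join(clauses)


term = build_term(
    "300:500[MW]",                           # molecular weight range
    '"lipinski"[Filter]',                    # rule-of-5 filter
    '"antineoplastic agents"[PharmAction]',  # MeSH pharmacological action
)

# db=pcsubstance targets PubChem Substance; pccompound and pcassay are the
# Entrez names of the other two PubChem databases.
url = ESEARCH + "?" + urlencode({"db": "pcsubstance", "term": term, "retmax": 20})

print(term)
print(url)
```

Building the term from separate clauses mirrors what the History tab does interactively: each clause can be run alone first, then combined.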

Links: for the whole set or only selected records

Property Report

SDF format
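Records saved in SDF format can be read with a few lines of code. The following is a minimal sketch that extracts only the data fields ("> <NAME>" followed by value lines and a blank line) from each record and ignores the structure block. The function `parse_sdf_fields` is hypothetical, and the sample record is a placeholder with made-up values, not real PubChem data.

```python
def parse_sdf_fields(text):
    """Return one dict of data fields per SDF record (records end with $$$$)."""
    records = []
    for block in text.split("$$$$"):
        if not block.strip():
            continue  # skip the empty tail after the last delimiter
        fields = {}
        lines = block.splitlines()
        i = 0
        while i < len(lines):
            line = lines[i]
            if line.startswith(">") and "<" in line:
                # Field header looks like "> <NAME>".
                name = line[line.index("<") + 1 : line.rindex(">")]
                i += 1
                values = []
                # Value lines run until the next blank line.
                while i < len(lines) and lines[i].strip():
                    values.append(lines[i])
                    i += 1
                fields[name] = "\n".join(values)
            i += 1
        records.append(fields)
    return records


# Placeholder record: empty structure block, made-up field values.
sample = """\
example-1
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <PUBCHEM_COMPOUND_CID>
1

> <PUBCHEM_MOLECULAR_WEIGHT>
100.0

$$$$
"""

records = parse_sdf_fields(sample)
print(records)
```

For real structure handling (atoms, bonds, stereochemistry) a cheminformatics toolkit is the better choice; this sketch covers only the tagged property section.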

Medical Subject Headings (MeSH)
• MeSH is the National Library of Medicine's controlled vocabulary thesaurus
• Consists of sets of terms naming descriptors in a hierarchical and alphabetic structure, e.g., “Mental Disorders”, “Pharmacological action”, “Catecholamine hormones”, etc.
• Permits searching at various levels of specificity
• MeSH thesaurus is used for indexing articles for the MEDLINE/PubMed database
• MeSH is continually updated
• PubChem assigns MeSH headings to Compound records

Primary Database
• Contains bioactivity screens of chemical substances described in PubChem Substance
• Provides searchable descriptions of each bioassay, including descriptions of the conditions and readouts specific to a screening protocol
• Depositor decides on data definitions and interpretation
• Data can be plotted as graphs of statistical histograms
• Cross-indexed to other Entrez databases

Click to view structure

NCBI FTP >> PubChem Folder

Entrez PubChem: Help and Tabs
