Part 3: Essentials
Global Entrez Search Page All[Filter]
Overall Goal: An on-line resource providing comprehensive information on the biological activities of small molecules
Why Are Small Molecules Important?
• Constituents of all macromolecules (DNA, RNA, protein, carbohydrates, etc.)
• Serve as cofactors and signaling molecules to thousands of proteins
• The chemistry part of “biochemistry”
• Most drug entities and drug types are small molecules
• Most biomarkers used in clinical chemistry are small molecules
PubChem Databases and Tools: http://pubchem.ncbi.nlm.nih.gov/
The Molecular Libraries Roadmap: An Integrated Initiative
• Technology Development
• Screening
• Informatics
• Chem-informatics Research Centers
• Molecular Libraries Screening Centers Network (MLSCN)
• Assay Development
• Instrumentation
• Compound Repository (MLSMR)
• Chemical Diversity
• Predictive ADMET
PubChem =
• Repository for small molecules and bioactivity assay data
• Part of Entrez search and linking system
• Links to other NCBI databases, e.g.,
  • PubMed, MeSH
  • Protein structures (MMDB)
  • Protein/Nucleotide sequences (GenPept/GenBank)
• Contains complete chemical structures
  • Standardized for uniformity
  • Small set of computed properties
  • Structure similarity searching
Other Depositors to PubChem and more…
PubChem: Bird’s Eye View
[Diagram labels: Depositors; PubChem Substance; PubChem BioAssays; PubChem Compound; Chemical Structure Similarity]
PubChem integration in Entrez
[Diagram labels: VAST Structure Similarity; Term Frequency Statistics; Literature; 3D Structures; Bioactivity Assay Results; Small Molecule Structures; Chemical Structure Similarity; Protein Sequences; Activity Profile Similarity]
Primary Database
Depositor Data
• No “Global” rules or standards
• Based on organizational needs
• Lots of data overlap
• Often based on individual scientist preferences
• PubChem accepts data from many organizations
• Previously unseen data representation
• Combinatorial explosion of ways for drawing the same structure
Redundancy, mixtures Mixture
Derivative Database
Chemical Structures may be represented in many different ways
Substance Compound
Substance Compound
[Figure labels: Unknown E/Z isomers; Unknown stereo; Known stereochemistry]
Substances (heterogens) from Protein 3D structures (PDB)
• Most molecules come out right, even complex ones
• Sometimes there is a need to fix problems, e.g., bond orders
• PDB lacks chemical detail
  • no bond order information
  • no hydrogens
• Deposited structure receives
  • bond information
  • hydrogens
  • stereochemistry (where possible)
[Figure labels: Vancomycin; Result; Need to fix heme bond orders; Dopamine]
PubChem Compound Processing
• Chemical Data Verification
  • Atom description (label, element?)
  • Functional group clean-up
  • Atom valence verification to prevent nonsense
• “Normalize” and “Standardize”
  • Valence-Bond canonicalize (for Tautomer invariance)
  • Aromaticity detection and self-consistency
  • Stereochemistry detection
  • Explicit hydrogen assignment
• Calculation
  • 2-D Coordinate generation
  • Image Depictions
  • Fingerprints
  • IUPAC Name
  • SMILES, InChI, Hash Codes
  • xLogP, TPSA, HBD, HBA, MW, MF
Chemical Structure “Sanitization”
• Chemical Structures that fail Sanitization
  • Are not part of the aggregated PubChem Compound Database
  • Still “searchable” via PubChem Substance Database
• Keeps the PubChem Compound Database “Clean” for Chemical Informatic Analysis
• Collapses structures represented in various ways into a uniform, identical representation
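The bookkeeping behind this slide can be pictured with a toy sketch. Everything in it is hypothetical: the record identifiers, the sample deposits, and the `standardize` stand-in, which only normalizes a text label and performs no chemistry at all. It is meant to show the shape of the aggregation, not PubChem's actual processing: records that standardize to the same key collapse into one compound entry, and records that fail stay reachable only as substances.

```python
from collections import defaultdict
from typing import Optional


def standardize(raw: str) -> Optional[str]:
    """Hypothetical stand-in for sanitization (real processing works on structures).

    Returns a uniform key, or None when the record fails verification.
    """
    key = "".join(raw.split()).lower()
    return key or None


def aggregate(substances: dict[str, str]) -> tuple[dict[str, list[str]], list[str]]:
    """Group substance IDs by standardized key; collect failures separately."""
    compounds: dict[str, list[str]] = defaultdict(list)
    substance_only: list[str] = []
    for sid, raw in substances.items():
        key = standardize(raw)
        if key is None:
            substance_only.append(sid)   # fails sanitization: substance-only
        else:
            compounds[key].append(sid)   # same key: same compound entry
    return dict(compounds), substance_only


# Hypothetical deposits: two spellings of one label, one empty record, one other label.
deposits = {"S1": "Dopamine", "S2": " dopamine ", "S3": "", "S4": "Vancomycin"}
compounds, substance_only = aggregate(deposits)
```

Here `S1` and `S2` end up under a single key, while `S3` is kept out of the aggregated set but is still present in the deposit table.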
Compound for mixture Component compounds
Substance vs. Compound Substance summary Compound summary
Examples of queries
• 200[MW]
• "dopamine"[CompleteSynonym]
• 300:500[MW]
• "pcsubstance structure"[Filter]
• "ca"[Element] AND 300:500[MW] AND "chemidplus"[SourceName]
• "InChI=1/Ca.3H2O/h;3*1H2/q+2;;;/p-3/fCa.3HO/h;3*1h/qm;3*-1"[InChI]
• "lipinski"[Filter] AND "antineoplastic agents"[PharmAction]
• Lipinski rule of 5 -- a molecule is likely to be orally active if it has:
  • not more than 5 hydrogen bond donors (OH and NH groups)
  • not more than 10 hydrogen bond acceptors (N or O)
  • a molecular weight under 500
  • a LogP under 5
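The rule of 5 listed above amounts to four threshold checks, which a short sketch makes concrete. The `Properties` class and `passes_lipinski` function are illustrative names, not part of PubChem or Entrez, and the acceptor limit is taken as at most 10, following the usual statement of the rule. The property values must come from elsewhere (for example the computed HBD, HBA, MW, and xLogP fields); the numbers in the example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Properties:
    """Computed properties for one molecule, supplied by the caller."""
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float
    log_p: float


def passes_lipinski(p: Properties) -> bool:
    """Return True when all four rule-of-5 thresholds are satisfied."""
    return (
        p.h_bond_donors <= 5
        and p.h_bond_acceptors <= 10
        and p.molecular_weight < 500
        and p.log_p < 5
    )


# Illustrative values only, not looked up from any database.
small = Properties(h_bond_donors=2, h_bond_acceptors=3, molecular_weight=250.0, log_p=1.5)
large = Properties(h_bond_donors=8, h_bond_acceptors=14, molecular_weight=900.0, log_p=2.0)
results = (passes_lipinski(small), passes_lipinski(large))
```

In an Entrez search the same test is applied through the "lipinski"[Filter] term, so a local check like this is mainly useful for data that is already downloaded.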
Examples of PubChem Index Fields …
• All [ALL] -- All of the following fields are searched; default search field.
• Uid [UID] -- The integer represents SID for PCSubstance database. By default, an integer without a field alias is recognized as a UID. Same as [SID].
• Filter [Filter] -- Limits the records to various indexed filters.
• ActiveAid [AA] -- Active BioAssay identifier, integer.
• ActiveAidCount [AC, ACNT] -- Number of bioassays where tested active.
• AtomChiralCount [ACC, ACCNT] -- Total count of chiral atoms in a given compound.
• BioAssayID [BAID, AID] -- BioAssay identifier.
• BondChiralCount [BCC, BCCNT] -- Number of chiral bonds.
• Comment [CMT] -- Substance or bioassay comment.
• CompleteSynonym [CSYN, CSYNO] -- Exactly matching name for substance/compound.
• CompoundID [CID] -- Compound identifier, integer.
• DepositDate [DDAT, DEPDAT] -- Deposition timestamp for a substance.
• Element [ELMT, EL] -- Chemical element in a substance/compound.
• ExactMass [EMAS, EXMASS] -- The calculated mass of an ion or a molecule containing the most likely isotopic composition for a single random molecule, corresponding to the mass of the most intense ion/molecule peak in a mass spectrum. A real number.
• HeavyAtomCount [HAC, HACNT] -- Atom count in a compound except hydrogen, integer.
• HydrogenBondAcceptorCount [HBAC, HBACNT] -- Hydrogen bond acceptors for a compound, integer.
• HydrogenBondDonorCount [HBDC, HBDCNT] -- Hydrogen bond donors for a compound, integer.
• InChI [inchi] -- IUPAC International Chemical Identifier.
Examples of PubChem Index Fields, contd.
• IUPACName [UPAC, IUPAC] -- Standard IUPAC name for compound.
• MeSHDescription [MHD]
• MeSHTerm [MSHT, MESHT] -- Medical Subject Heading term.
• MeSHTreeNode [MSHN, MESHTN] -- Medical Subject Heading tree node (tree structures).
• MolecularWeight [MW, MWT, MOLWT] -- Mass of a molecule calculated using the average mass of each element weighted for its natural isotopic abundance. E.g., carbon has two natural isotopes, 12 and 13, with relative abundances of 98.9% and 1.1%, which yield an average mass of 12.011 g/mol. A real number.
• MonoisotopicMass [MMAS, MIMASS] -- Mass of a molecule calculated using the mass of the most abundant isotope of each element. E.g., carbon has a monoisotopic mass of 12.000 g/mol. A real number.
• PharmAction [PHMA, PHARMA] -- MeSH pharmacological actions heading.
• RotatableBondCount [RBC, RBCNT] -- Number of rotatable bonds.
• SourceCategory [SRCC, SRCCAT, SRCCATG] -- Depositor categories.
• SourceID [SRID, SRCID] -- Depositor's external ID.
• SourceName [SRC, SRCNAM, SRCNAME] -- Official depositor name.
• SubstanceID [SID] -- Substance ID. Same as [UID].
• Synonym [SYNO] -- Synonyms for substance.
• TautomerCount [TC, TCNT, TTMC] -- Possible tautomer count for each given structure, ≤ 200.
• TotalFormalCharge [TFC, CHG, CHRG] -- Total formal charge.
• TPSA [TPSA] -- Topological Polar Surface Area.
• XLogP [XLGP, LOGP]
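The difference between the MolecularWeight and MonoisotopicMass fields can be checked with the carbon figures quoted above. The sketch below is a minimal illustration, not PubChem's calculation: the abundances (98.9% and 1.1%) come from the field description, while the carbon-13 mass of 13.00335 and the function names are assumptions added here.

```python
# Carbon isotopes as (isotopic mass, fractional natural abundance).
# Abundances are from the field description; the 13C mass is an added assumption.
CARBON_ISOTOPES = [(12.000, 0.989), (13.00335, 0.011)]


def average_mass(isotopes: list[tuple[float, float]]) -> float:
    """Abundance-weighted mean of the isotopic masses (used for MolecularWeight)."""
    return sum(mass * abundance for mass, abundance in isotopes)


def monoisotopic_mass(isotopes: list[tuple[float, float]]) -> float:
    """Mass of the single most abundant isotope (used for MonoisotopicMass)."""
    mass, _abundance = max(isotopes, key=lambda pair: pair[1])
    return mass


carbon_average = average_mass(CARBON_ISOTOPES)        # approximately 12.011
carbon_monoisotopic = monoisotopic_mass(CARBON_ISOTOPES)  # 12.0
```

For a whole molecule, each quantity would be summed over all atoms in the formula, which is why the two fields diverge more as molecules get larger.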
History Tab
Substances of MW 300-500 Da having antineoplastic properties and obeying the Lipinski rule of 5
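A combined search like this one is just the individual field terms joined with AND. The sketch below assembles it from the query examples given earlier in the deck and wraps it in an NCBI E-utilities ESearch URL. The helper names are illustrative, and the use of E-utilities is an assumption added here (the slides only show the web search page); the URL is built but not fetched.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (not mentioned in the slides; added as an assumption).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value: str, field: str) -> str:
    """Quoted term restricted to one index field, e.g. "lipinski"[Filter]."""
    return f'"{value}"[{field}]'


def range_term(low: int, high: int, field: str) -> str:
    """Numeric range restricted to one index field, e.g. 300:500[MW]."""
    return f"{low}:{high}[{field}]"


def and_query(*terms: str) -> str:
    """Join terms with the Entrez AND operator."""
    return " AND ".join(terms)


query = and_query(
    range_term(300, 500, "MW"),
    field_term("lipinski", "Filter"),
    field_term("antineoplastic agents", "PharmAction"),
)

# URL-encode the query for the PubChem Substance database.
url = f"{ESEARCH}?{urlencode({'db': 'pcsubstance', 'term': query})}"
```

The resulting `query` string can also be pasted directly into the Entrez search box; the History tab then records it as a numbered step that later searches can reuse.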
Links
For the whole set or only selected records
Medical Subject Headings (MeSH)
• MeSH is the National Library of Medicine's controlled vocabulary thesaurus.
• Consists of sets of terms naming descriptors in a hierarchical and alphabetic structure, e.g., “Mental Disorders”, “Pharmacological action”, “Catecholamine hormones”, etc.
• Permits searching at various levels of specificity
• MeSH thesaurus is used for indexing articles for the MEDLINE/PubMed database
• MeSH is continually updated
• PubChem assigns MeSH headings to Compound records
Primary Database
• Contains bioactivity screens of chemical substances described in PubChem Substance
• Provides searchable descriptions of each bioassay, including descriptions of the conditions and readouts specific to a screening protocol
• Depositor decides on data definitions and interpretation
• Data can be plotted as graphs of statistical histograms
• Cross-indexed to other Entrez databases