The Use of Graph Matching Algorithms to Identify Biochemical Substructures in Synthetic Chemical Compounds: Application to Metabolomics
Mai Hamdalla, David Grant, Ion Mandoiu, Dennis Hill, Sanguthevar Rajasekaran and Reda Ammar
University of Connecticut
Genome (DNA) → Transcriptome (RNA) → Proteome (Proteins) → Metabolome (Metabolites: Lipids, Sugars, Amino Acids, Nucleotides) → Phenotype/Function
Identification Process
• SMILES (simplified molecular-input line-entry system) examples:
• C8H7N: C1=CC=C2C(=C1)C=CN2
• C9H18O8: C(C1C(C(C(C(O1)OCC(CO)O)O)O)O)O
• C6H12O6: C(C1C(C(C(O1)(CO)O)O)O)O
List of Candidate Chemical Structures → Mammalian Metabolite Identifier → Ranked list of Candidate Structures with mammalian substructures
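As a rough illustration of how a SMILES string encodes a structure, the sketch below recomputes the molecular formula from each SMILES example on this slide. It is a deliberately tiny, hypothetical parser (not part of the presented system) that assumes only unbracketed C, N and O atoms, explicit double or triple bonds, branches, and single-digit ring closures. Real work would normally use a cheminformatics toolkit instead.

```python
from collections import Counter

# Assumption: only these elements, at their default valences, are supported.
VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def formula_from_smiles(smiles: str) -> str:
    """Return a Hill-order molecular formula for a simple SMILES string."""
    elements = []      # element symbol of each heavy atom
    bond_sum = []      # total bond order used by each heavy atom
    branch_stack = []  # atoms to return to when a ')' closes a branch
    open_rings = {}    # ring digit -> (atom index, bond order at opening)
    prev = None        # index of the atom the next atom bonds to
    pending = 1        # bond order for the next bond (single by default)

    def add_bond(a, b, order):
        bond_sum[a] += order
        bond_sum[b] += order

    for ch in smiles:
        if ch in VALENCE:
            elements.append(ch)
            bond_sum.append(0)
            current = len(elements) - 1
            if prev is not None:
                add_bond(prev, current, pending)
            prev, pending = current, 1
        elif ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in open_rings:  # second occurrence closes the ring
                other, order = open_rings.pop(ch)
                add_bond(other, prev, max(order, pending))
            else:                 # first occurrence opens the ring
                open_rings[ch] = (prev, pending)
            pending = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")

    counts = Counter(elements)
    # Implicit hydrogens fill whatever valence the heavy-atom bonds leave free.
    counts["H"] = sum(VALENCE[e] - b for e, b in zip(elements, bond_sum))
    order = ["C", "H"] + sorted(set(counts) - {"C", "H"})
    return "".join(
        f"{el}{counts[el] if counts[el] > 1 else ''}" for el in order if counts[el]
    )


for s in ("C1=CC=C2C(=C1)C=CN2", "C(C1C(C(C(O1)(CO)O)O)O)O"):
    print(s, formula_from_smiles(s))
```

The parser tracks only bond orders, which is enough to count implicit hydrogens. It does not build the graph that substructure matching would need.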
Identification Process
List of Candidate Compound Structures → Filtration → List of Filtered Candidate Compounds → Structure Matching → Ranked list of identified Compounds
Reference sets: Mammalian Scaffolds List (Sugars, Lipids, Amino Acids, Nucleotides); non-Biological Scaffolds
Collection and Curation of Scaffolds
• Retrieve all compounds that participate in a metabolic pathway in the KEGG database.
• Keep participants of mammalian metabolic pathway groups (91 KEGG pathways): Carbohydrate, Energy, Lipid, Nucleotide, Amino Acid, Glycan, and Cofactors and Vitamins Metabolism.
• Remove entries that are single elements, metals, or inorganic.
• Remove compounds that have no entry in the PubChem database.
Result: 1,987 compounds, 30–1,000 Da
Identification Process
List of Candidate Compound Structures → Filtration → List of Filtered Candidate Compounds → Structure Matching → List of Identified Compounds
Reference sets: Mammalian Scaffolds List (Sugars, Lipids, Amino Acids, Nucleotides); non-Biological Scaffolds
Structure Matching
• The SMSD (Small Molecule Sub-graph Detector) toolkit is used for molecule similarity searches.
• Similarity Score = NSBS / NSPR, where NSBS is the number of atoms in the substructure and NSPR is the number of atoms in the superstructure.
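The worked values on the following slides (for example 0.43 for 6/14) suggest that the similarity score is the ratio of substructure atoms to superstructure atoms. A minimal Python sketch under that assumption follows. The function name and the argument checks are illustrative and are not taken from SMSD.

```python
def similarity_score(n_substructure_atoms: int, n_superstructure_atoms: int) -> float:
    """Ratio NSBS / NSPR, assuming the substructure is no larger than the superstructure."""
    if not 0 < n_substructure_atoms <= n_superstructure_atoms:
        raise ValueError("expected 0 < NSBS <= NSPR")
    return n_substructure_atoms / n_superstructure_atoms


# Atom-count pairs quoted on the slides.
for nsbs, nspr in [(6, 14), (4, 14), (10, 14), (9, 10)]:
    print(f"{nsbs}/{nspr} = {similarity_score(nsbs, nspr):.2f}")
```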
Scaffolds-Structure Matching
Candidate Structure: C10H7NO3, C1=CC=C2C(=C1)C(=O)C=C(N2)C(=O)O
Mammalian Scaffolds matched against the candidate, with the similarity scores shown in the figure: 0.43 (6/14), 0.29 (4/14), and 0.36
Union Scaffold
Candidate Structure and Mammalian Scaffolds, with individual scaffold scores of 0.29, 0.43, and 0.36
Union Scaffold Structure: Similarity Score = 0.71 (10/14)
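One way to read the union scaffold is as the set of candidate atoms covered by at least one matched scaffold, scored against the candidate's atom count. The sketch below assumes each match is given as a set of candidate atom indices. The indices are made up for illustration and do not correspond to the molecule on the slide.

```python
def union_scaffold_score(matches, n_candidate_atoms: int) -> float:
    """Fraction of candidate atoms covered by the union of all scaffold matches."""
    covered = set().union(*matches)  # empty set when there are no matches
    return len(covered) / n_candidate_atoms


# Hypothetical matches on a 14-atom candidate: sizes 6, 4 and 4, overlapping.
matches = [{0, 1, 2, 3, 4, 5}, {6, 7, 8, 9}, {0, 1, 2, 3}]
print(round(union_scaffold_score(matches, 14), 2))
```

Overlapping matches are counted once, which is why the union score (10/14 here) can be lower than the sum of the individual scores.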
Superstructure Scaffolds Matching
• Union Scaffold Score = 0, yet the candidate was found to be a substructure of 38 scaffolds!
• About 30% of the mammalian structures were missed (FN).
• Scores shown in the figure: 0.45; 0.9 (9/10); 0.6 (9/15); 0.75 (9/12)
• Similarity Score = 0.9
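The figures on this slide (9/10, 9/15, 9/12, with 0.9 reported) are consistent with taking the best ratio of candidate atoms to scaffold atoms over all scaffolds that contain the candidate. A small sketch under that assumption, with a hypothetical function name:

```python
def superstructure_score(n_candidate_atoms: int, superstructure_sizes) -> float:
    """Best candidate/scaffold atom ratio over scaffolds that contain the candidate."""
    ratios = [n_candidate_atoms / size for size in superstructure_sizes]
    return max(ratios, default=0.0)  # 0.0 when no scaffold contains the candidate


# A 9-atom candidate contained in scaffolds of 10, 15 and 12 atoms.
print(superstructure_score(9, [10, 15, 12]))
```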
Scoring Methods
Candidate Structure scored against its Union Scaffold Structure (0.71) and its Superstructure Scaffold Structure (0.93)
• US: Union Scaffold Score = 0.71
• MS: Maximum Score (Union Scaffold Score, Superstructure Score) = 0.93
• SS: Sum of Scores (Union Scaffold Score, Superstructure Score) = 1.64
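The three scoring methods combine the same two inputs in different ways. The sketch below reproduces the slide's arithmetic. The function and key names are illustrative only.

```python
def combined_scores(union_score: float, superstructure_score: float) -> dict:
    """Return the US, MS and SS scores for one candidate."""
    return {
        "US": union_score,                             # union scaffold score alone
        "MS": max(union_score, superstructure_score),  # maximum of the two scores
        "SS": union_score + superstructure_score,      # sum of the two scores
    }


scores = combined_scores(0.71, 0.93)
print({name: round(value, 2) for name, value in scores.items()})
```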
Collection and Curation of Synthetic Compounds
• Retrieve synthetic compounds from the ChemBridge and ChemSynthesis databases.
• Restrict them to the 6 biological elements C, H, N, O, P, and S.
• Mass distribution:
• ChemBridge (150–700 Da)
• ChemSynthesis (50–300 Da)
• 1,400 compounds were randomly selected for training and 5,320 compounds were randomly chosen for testing.
• The mammalian scaffold list was reduced to 1,400 compounds (50–700 Da).
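The element restriction and the random selection can be sketched as follows. This assumes compounds are identified by molecular formula strings and that the training and test sets are drawn without overlap, which the slide does not state. The helper names and the seed are hypothetical.

```python
import random
import re

BIOLOGICAL_ELEMENTS = {"C", "H", "N", "O", "P", "S"}


def uses_only_biological_elements(formula: str) -> bool:
    """True if every element symbol in the formula is one of C, H, N, O, P, S."""
    symbols = re.findall(r"[A-Z][a-z]?", formula)
    return bool(symbols) and set(symbols) <= BIOLOGICAL_ELEMENTS


def split_train_test(compounds, n_train: int, n_test: int, seed: int = 0):
    """Randomly draw disjoint training and test sets (disjointness is an assumption)."""
    rng = random.Random(seed)
    chosen = rng.sample(list(compounds), n_train + n_test)
    return chosen[:n_train], chosen[n_train:]


formulas = ["C8H7N", "C6H12O6", "C6H5Cl", "C10H7NO3", "C2H6OS"]
kept = [f for f in formulas if uses_only_biological_elements(f)]
train, test = split_train_test(kept, n_train=2, n_test=2)
print(kept, train, test)
```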
Leave-One-Out Accuracy: Sensitivity = 96%
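Sensitivity is the fraction of true positives that are recovered, TP / (TP + FN). A leave-one-out run over the scaffold list can be sketched as below. Each scaffold is held out in turn and scored against the rest. The scoring function, the threshold, and the toy data are all hypothetical stand-ins, since the slides do not give them.

```python
def leave_one_out_sensitivity(scaffolds, score, threshold: float) -> float:
    """Fraction of scaffolds still recognised when scored against all the others."""
    hits = 0
    for i, held_out in enumerate(scaffolds):
        rest = scaffolds[:i] + scaffolds[i + 1:]
        if score(held_out, rest) >= threshold:
            hits += 1  # true positive: a known mammalian structure is recovered
    return hits / len(scaffolds)


def toy_score(candidate, references) -> float:
    """Toy stand-in for structure matching: share of candidate parts seen elsewhere."""
    seen = set().union(*references)
    return len(candidate & seen) / len(candidate)


toy_scaffolds = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"d"}]
print(leave_one_out_sensitivity(toy_scaffolds, toy_score, threshold=0.5))
```

In the toy data the last entry shares nothing with the others, so it counts as a false negative and the sensitivity is 3/4.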
Prospective Results of Synthetic Compounds: 54% eliminated as non-mammalian
Conclusions
• A novel way of utilizing known mammalian metabolites (a scaffolds database) to identify synthetic chemical compounds with mammalian substructures.
• The results show a sensitivity of 96% in the mammalian scaffolds leave-one-out experiments.
• The system was able to eliminate 54% of a random set of synthetic compounds.
Ongoing Work
• Exploring further improvements in accuracy by using known biological pathway information.
• Annotating PubChem
• Annotating existing and potential drugs
• Database-independent compound search
• Generate all possible structures of a given formula and rank them
Thank you!