470 likes | 564 Views
Semantic empowerment of Life Science Applications October 2006. Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia. Acknowledgement: NCRR funded Bioinformatics of Glycan Expression ,
E N D
Semantic empowermentof Life Science ApplicationsOctober 2006 Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Cartic Ramakrishnan, Christopher Thomas, Cory Henson.
Computation, data and semantics In life sciences • “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999 • “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000 • "Biological research is going to move from being hypothesis-driven to being data-driven." Robert Robbins • “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb We will show howsemanticsis a key enabler for achieving the above predictions and visions in which information and process play critical role.
Semantic Web and Life Science • Data captured per year = 1 exabyte (1018)(Eric Neumann, Science, 2005) • How much is that? • Compare it to the estimate of the total words ever spoken by humans = 12 exabyte • Death by data • The need for • Search • Integration • Analysis, decision support • Discovery Not data, but analysis and insight, leading to decisions and discovery
Semantic empowermentof Life Science Applications Life Science research today deals with highly heterogeneous as well as massive amounts of data distributed across the world. We need more automated ways for integration and analysis leading to insight and discovery - to understand cellular components, molecular functions and biological processes, and more importantly complex interactions and interdependencies between them.
Benefits of Semantics • Development of large domain-specific knowledge • for reference, common nomenclature, tagging • Integration of heterogeneous multi-source data: biomedical documents (text), scientific/experimental data and structured databases • Semantic search, browsing, integration analysis, and discovery Faster and more reliable discovery leading to quality of life improvements
What is semantics & Semantic Web • Meaning and use of data • From syntax and structure to semantics (beyond formatting, organization, query interfaces,….) • XML -> RDF -> OWL -> Rules -> Trust • Ontologies at the heart of Semantic Web, capturing agreement and domain knowledge • (Automatic) Semantic annotation, reasoning,… • Also, increasing use of Services oriented Architecture -> semantic Web services • W3C SW for Health Care and Life Sciences
Semantic empowermentof Life Science Applications This talk will demonstrate some of the efforts in: • Building large (populated) life science ontologies (GlycO, ProPreO) • Gathering/extracting knowledge and metadata: entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry) • Semantic web services and registries, leading to better discovery/reuse of scientific tools and their composition • Ontology-driven applications developed
Semantic Applications • Active Semantic Medical Records Demo: an operational health care application using multiple ontologies, semantic annotations and rule based decsion support • Semantic Browser Demo: contextual browsing of PubMed aided by ontology and schema (in future instance) level relationships • N-glycosylation process: an example of scientific workflow • Integrated Semantic Information & Knowledge System (ISIS):integrated access and analysis of structured databases, sc. literature and experimental data Others we will not discuss: SemBowser, SemDrug, …. Let us start with a couple of simple applications
Life Science Ontologies • Glyco • An ontology for structure and function of Glycopeptides • 573 classes, 113 relationships • Published through the National Center for Biomedical Ontology (NCBO) • ProPreO • An ontology for capturing process and lifecycle information related to proteomic experiments • 398 classes, 32 relationships • 3.1 million instances • Published through the National Center for Biomedical Ontology (NCBO) and Open Biomedical Ontologies (OBO)
N-glycan_beta_GlcNAc_9 N-glycan_alpha_man_4 GNT-Vattaches GlcNAc at position 6 N-acetyl-glucosaminyl_transferase_V UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-$beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-$R2 UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021 N-Glycosylation metabolic pathway GNT-Iattaches GlcNAc at position 2
GlycO ontology • Challenge – model hundreds of thousands of complex carbohydrate entities • But, the differences between the entities are small (E.g. just one component) • How to model all the concepts but preclude redundancy → ensure maintainability, scalability
b-D-GlcpNAc -(1-2)- b-D-GlcpNAc -(1-2)+ b-D-GlcpNAc -(1-4)- a-D-Manp -(1-6)+ b-D-Manp -(1-4)- b-D-GlcpNAc -(1-4)- b-D-GlcpNAc a-D-Manp -(1-3)+ GlycoTree N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251
EnzyO • The enzyme ontology EnzyO is highly intertwined with GlycO. While it’s structure is mostly that of a taxonomy, it is highly restricted at the class level and hence allows for comfortable classification of enzyme instances from multiple organisms • GlycO together with EnzyO contain all the information that is needed for the description of Metabolic pathways • e.g. N-Glycan Biosynthesis
Pathway representation in GlycO Pathways do not need to be explicitly defined in GlycO. The residue-, glycan-, enzyme- and reaction descriptions contain all the knowledge necessary to infer pathways.
Reaction R05987 catalyzed by enzyme 2.4.1.145 adds_glycosyl_residue N-glycan_b-D-GlcpNAc_13 Zooming in a little … The N-Glycan with KEGG ID 00015 is the substrate to the reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145. The product of this reaction is the Glycan with KEGG ID 00020.
GlycO population • Multiple data sources used in populating the ontology • KEGG - Kyoto Encyclopedia of Genes and Genomes • SWEETDB • CARBANK Database • Each data source has different schema for storing data • There is significant overlap of instances in the data sources • Hence, entity disambiguation and a common representational format are needed
Ontology population workflow [][Asn]{[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-Manp] {[(3+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc] {}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc]{}}}}}}
Ontology population workflow <Glycan> <aglycon name="Asn"/> <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"> <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"> <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man" > <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man" > <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc" > </residue> <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc" > </residue> </residue> <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man" > <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc"> </residue> </residue> </residue> </residue> </residue> </Glycan>
ProPreO ontology • Two aspects of glycoproteomics: • What is it?→ identification • How much of it is there? → quantification • Heterogeneity in data generation process, instrumental parameters, formats • Need data and process provenance→ ontology-mediated provenance • Hence, ProPreO models both the glycoproteomics experimental process and attendant data
ProPreO population: transformation to rdf Scientific Data Computational Methods Ontology instances
ProPreO population: transformation to rdf Scientific Data Computational Methods Key amino-acid sequence Protein Path Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence amino-acid sequence Protein Data Peptide Path Determine N-glycosylation Concensus Calculate Chemical Mass Calculate Monoisotopic Mass RDF Chemical Mass RDF Monoisotopic Mass RDF Amino-acid Sequence RDF n-glycosylation concensus chemical mass monoisotopic mass amino-acid sequence parent protein n-glycosylation concensus chemical mass monoisotopic mass amino-acid sequence “Protein RDF” “Peptide RDF”
Semantic empowermentof Life Science Applications This talk will demonstrate some of the efforts in: • building large life science ontologies (GlycO -an ontology for structure and function for Glycopeptides and ProPreO - an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications • entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data • semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive • semantic applications developed
Relationship extraction from unstructured data (other related research: biological entity extraction)
affects Lipid Overview UMLS Biologically active substance affects complicates causes causes Disease or Syndrome instance_of instance_of ??????? Fish Oils Raynaud’s Disease MeSH 9284 documents PubMed 4733 documents 5 documents
About the data used • UMLS – A high level schema of the biomedical domain • 136 classes and 49 relationships • Synonyms of all relationship – using variant lookup (tools from NLM) • MeSH • Terms already asserted as instance of one or more classes in UMLS • PubMed • Abstracts annotated with one or more MeSH terms T147—effect T147—induce T147—etiology T147—cause T147—effecting T147—induced
Abstract Classification/Annotation Example PubMed abstract (for the domain expert)
Method – Parse Sentences in PubMed SS-Tagger (University of Tokyo) SS-Parser (University of Tokyo) (TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) ) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
Modifiers Modified entities Composite Entities Method – Identify entities and Relationships in Parse Tree
ProPreO: Ontology-mediated provenance parent ion charge 830.9570 194.9604 2 580.2985 0.3592 688.3214 0.2526 779.4759 38.4939 784.3607 21.7736 1543.7476 1.3822 1544.7595 2.9977 1562.8113 37.4790 1660.7776 476.5043 parent ion m/z parent ionabundance fragment ion m/z fragment ionabundance ms/ms peaklist data Mass Spectrometry (MS) Data
ProPreO: Ontology-mediated provenance <ms-ms_peak_list> <parameter instrument=“micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer” mode=“ms-ms”/> <parent_ion m-z=“830.9570” abundance=“194.9604” z=“2”/> <fragment_ion m-z=“580.2985” abundance=“0.3592”/> <fragment_ion m-z=“688.3214” abundance=“0.2526”/> <fragment_ion m-z=“779.4759” abundance=“38.4939”/> <fragment_ion m-z=“784.3607” abundance=“21.7736”/> <fragment_ion m-z=“1543.7476” abundance=“1.3822”/> <fragment_ion m-z=“1544.7595” abundance=“2.9977”/> <fragment_ion m-z=“1562.8113” abundance=“37.4790”/> <fragment_ion m-z=“1660.7776” abundance=“476.5043”/> </ms-ms_peak_list> OntologicalConcepts Semantically Annotated MS Data
Semantic empowermentof Life Science Applications This talk will demonstrate some of the efforts in: • building large life science ontologies (GlycO -an ontology for structure and function for Glycopeptides and ProPreO - an ontology for capturing process and lifecycle information related to proteomic experiments) and their application in advanced ontology-driven semantic applications • entity and relationship extraction from unstructured data, automatic semantic annotation of scientific/experimental data (e.g., mass spectrometry), and resulting capability in integrated access and analysis of structured databases, scientific literature and experimental data • semantic web services and registries, leading to better discovery/reuse of scientific tools and composition of scientific workflows that process high-throughput data and can be adaptive • semantic applications developed
Cell Culture extract Glycoprotein Fraction proteolysis Glycopeptides Fraction 1 Separation technique I n Glycopeptides Fraction PNGase n Peptide Fraction Separation technique II n*m Peptide Fraction Mass spectrometry ms data ms/ms data Data reduction Data reduction ms peaklist ms/ms peaklist binning Peptide identification Glycopeptide identification and quantification N-dimensional array Peptide list Data correlation Signal integration N-GlycosylationProcess (NGP)
Biological Sample Analysis by MS/MS Agent Raw Data to Standard Format Agent Data Pre- process Agent DB Search (Mascot/Sequest) Agent Results Post-process (ProValt) O I O I O I O I O Storage Raw Data Standard Format Data Filtered Data Search Results Final Output Biological Information Semantic Web Process to incorporate provenance Semantic Annotation Applications
Converting biological information to the W3C Resource Description Framework (RDF): Experience with Entrez Gene Collaboration with Dr. Olivier Bodenreider (US National Library of Medicine, NIH, Bethesda, MD)
Biomedical Knowledge Repository …. Biomedical Knowledge Repository Entrez
Implementation Entrez Gene Entrez Gene XML XSLT Entrez Gene RDF graph Entrez Gene RDF
Web interface ENTREZ GENE ENTREZ GENE XML XSLT ENTREZ GENE RDF GRAPH ENTREZ GENE RDF ….
Implementation Entrez Gene Entrez Gene XML XSLT Entrez Gene RDF graph Entrez Gene RDF
Connecting different genes amyloid-beta protein protease nexin-II beta-amyloid peptide APP gene [Homo sapiens] A4 amyloid protein cerebral vascular amyloid peptide Human APP gene is implicated in Alzheimer's disease. Which genes are functionally homologous to this gene? amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease) amyloid beta A4 protein amyloid beta A4 protein amyloid beta A4 protein APP gene [Gallus gallus] amyloid protein APP gene [Canis familiaris ] eg:has_protein_reference_name_E
Integrated Semantic Information and knowledge System (Isis) Have I performed an error? Give me all result files from a similar organism, cell, preparation, mass spectrometric conditions and compare results. SPARQL query-based User Interface ProPreO ontology Is the result erroneous? Give me all result files from a similar organism, cell, preparation, mass spectrometric conditions and compare results. Experimental Data Semantic Annotation Metadata File Semantic Metadata Registry PROTEOMECOMMONS EXPERIMENTAL DATA ProVault result MACOT result mzXML Pkl pSplit Raw Raw2mzXML mzXML2Pkl Pkl2pSplit MASCOT Search ProVault PROTEOMICS WORKFLOW
Summary, Observations, Conclusions • We now have semantics and services enabled approaches that support semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …
http://lsdis.cs.uga.edu • http://knoesis.org http://lsdis.cs.uga.edu/projects/asdoc/ http://lsdis.cs.uga.edu/projects/glycomics/