
Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia

Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications. Part III: Biological Applications. Keynote - the First Online Metadata and Semantics Research Conference, http://www.metadata-semantics.org, November 23, 2005. Amit Sheth


Presentation Transcript


  1. Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications. Part III: Biological Applications. Keynote - the First Online Metadata and Semantics Research Conference, http://www.metadata-semantics.org, November 23, 2005. Amit Sheth, LSDIS Lab, Department of Computer Science, University of Georgia, http://lsdis.cs.uga.edu. Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Christopher Thomas, Cartic Ramakrishnan.

  2. Computation, data and semantics in life sciences • “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999 • “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000 • “Biological research is going to move from being hypothesis-driven to being data-driven.” Robert Robbins • “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb. We will show how semantics is a key enabler for achieving the above predictions and visions.

  3. Bioinformatics Apps & Ontologies • GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolisms of glycans) • Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy • URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco • ProPreO: a comprehensive process ontology modeling experimental proteomics • Contains 330 classes, 40,000+ instances • Models three phases of experimental proteomics*: Separation techniques, Mass Spectrometry, and Data analysis; URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo • Automatic semantic annotation of high throughput experimental data (in progress) • Semantic Web Process with WSDL-S for semantic annotations of Web Services • http://lsdis.cs.uga.edu -> Glycomics project (funded by NCRR)

  4. GlycO – A domain ontology for glycans

  5. GlycO

  6. Structural modeling and population challenges in GlycO • Extremely large number of glycans occurring in nature • But frequently there are only small differences in structural properties • Modeling all possible glycans would involve a significant number of redundant classes • Redundancy results in often fatal complexities in maintenance and upgrade • Population • Manual • Extraction and integration from external knowledge sources • GlycoTree – exploiting structural composition rules

  7. Ontology population workflow GlycoTree Takahashi, Kato 2003

  8. GlycoTree – A Canonical Representation of N-Glycans. [Figure: branched N-glycan tree; residues and linkages as extracted from the diagram: b-D-GlcpNAc -(1-6)+ b-D-GlcpNAc -(1-2)- b-D-GlcpNAc -(1-2)+ b-D-GlcpNAc -(1-4)- a-D-Manp -(1-6)+ b-D-Manp -(1-4)- b-D-GlcpNAc -(1-4)- b-D-GlcpNAc a-D-Manp -(1-3)+] N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251

  9. Beyond expressiveness afforded in OWL • Probabilistic • more

  10. Example: Mass spectrometry analysis Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated. Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

  11. Mass Spectrometry Experiment Each m/z value in mass spec diagrams can stand for many different structures (uncertainty with respect to the structure that corresponds to a peak) • Different linkage • Different bond • Different isobaric structures

  12. Very subtle differences • Peak at 1219.1 • Same molecular composition • One diverging link • Found in different organisms • Background knowledge (found in honeybee venom or bovine cells) can resolve the uncertainty • CBank: 16155 (honeybee venom); CBank: 16154 (bovine) • These are core-fucosylated high-mannose glycans
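
The disambiguation step on this slide can be sketched in Python. This is only an illustrative sketch: the `Candidate` class, the `candidates_for_peak` function, and the matching tolerance are assumptions introduced here, not part of GlycO or any tool named in the talk. The two CBank identifiers, their sources, and the peak value 1219.1 are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    cbank_id: int   # CBank identifier as shown on the slide
    source: str     # where the glycan was reported (background knowledge)
    mass: float     # peak value both structures share on the slide


# Both structures have the same composition and explain the same peak.
CANDIDATES = [
    Candidate(16155, "honeybee venom", 1219.1),
    Candidate(16154, "bovine", 1219.1),
]


def candidates_for_peak(mz, tolerance=0.5, source=None):
    """Return structures matching a peak, optionally narrowed by sample source.

    The tolerance value is a hypothetical default chosen for illustration.
    """
    hits = [c for c in CANDIDATES if abs(c.mass - mz) <= tolerance]
    if source is not None:
        hits = [c for c in hits if c.source == source]
    return hits


ambiguous = candidates_for_peak(1219.1)
resolved = candidates_for_peak(1219.1, source="bovine")
```

Without the source, the peak is ambiguous between two structures; knowing that the sample is bovine leaves a single candidate.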

  13. Even in the same organism • Both glycans found in bovine cells (CBank: 21821 and CBank: 21982) • Both have a mass of 3425.11 • Same composition • Different linkage • Different enzymes lead to these linkages • Since expression levels of different genes can be measured in the cell, we can get the probability of each structure in the sample

  14. Model 1: associate probability as part of Semantic Annotation • Annotate the mass spec diagram with all possibilities and assign probabilities according to the scientist’s or tool’s best knowledge

  15. P(S | M = 3461.57) = 0.6 P(T | M = 3461.57) = 0.4 Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
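
Model 1 can be illustrated with a small Python sketch that attaches all candidate structures and their probabilities to a peak. The function name `annotate_peak` and the dictionary layout are assumptions made for this example; only the mass 3461.57, the structure labels S and T, and the probabilities 0.6 and 0.4 come from the slide.

```python
import math


def annotate_peak(mz, candidates):
    """Attach candidate structures and their probabilities to one peak.

    `candidates` maps a structure label to P(structure | M = mz).
    The probabilities are expected to sum to 1 for a fully annotated peak.
    """
    total = sum(candidates.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {total}, expected 1.0")
    # Most likely structure first, so a reader sees the best guess on top.
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return {"mz": mz, "candidates": ranked}


annotation = annotate_peak(3461.57, {"S": 0.6, "T": 0.4})
```

Keeping every candidate in the annotation, rather than only the most likely one, preserves the uncertainty for later analysis steps.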

  16. Model 2: Probability in ontological representation of Glycan structure • Build a generalized probabilistic glycan structure that embodies several possible glycans

  17. N-Glycosylation Process (NGP). [Workflow diagram; labels as extracted: Cell Culture extract Glycoprotein Fraction proteolysis Glycopeptides Fraction 1 Separation technique I n Glycopeptides Fraction PNGase n Peptide Fraction Separation technique II n*m Peptide Fraction Mass spectrometry ms data ms/ms data Data reduction Data reduction ms peaklist ms/ms peaklist binning Peptide identification Glycopeptide identification and quantification N-dimensional array Peptide list Data correlation Signal integration]

  18. Phase II: Ontology Population • Populate ProPreO with all experimental datasets? • Two levels of ontology population for ProPreO: • Level 1: Populate the ontology with instances that are stable across experimental runs. Ex: Human Tryptic peptides – 40,000 instances in ProPreO • Level 2: Use of URIs to point to actual experimental datasets

  19. Ontology-mediated Proteomics Protocol RAW Files RAW Files DB Storing Output PKL Files (XML-based Format) ‘Clean’ PKL Files ‘Clean’ PKL Files RAW Results File Output (*.dat) Mass Spectrometer Conversion To PKL Preprocessing DB Search Post processing All values of the produces ms-ms peaklist property are micromass pkl ms-ms peaklist Masslynx_Micromass_application produces_ms-ms_peak_list Instrument mass_spec_raw_data Micromass_Q_TOF_ultima_quadrupole_time_of_flight_mass_spectrometer Data Processing Application Micromass_Q_TOF_micro_quadrupole_time_of_flight_ms_raw_data ProPreO

  20. Semantic Annotation of Scientific Data • ms/ms peaklist data: 830.9570 194.9604 2 580.2985 0.3592 688.3214 0.2526 779.4759 38.4939 784.3607 21.7736 1543.7476 1.3822 1544.7595 2.9977 1562.8113 37.4790 1660.7776 476.5043 • Annotated ms/ms peaklist data: <ms/ms_peak_list> <parameter instrument=micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer mode = “ms/ms”/> <parent_ion_mass>830.9570</parent_ion_mass> <total_abundance>194.9604</total_abundance> <z>2</z> <mass_spec_peak m/z = 580.2985 abundance = 0.3592/> <mass_spec_peak m/z = 688.3214 abundance = 0.2526/> <mass_spec_peak m/z = 779.4759 abundance = 38.4939/> <mass_spec_peak m/z = 784.3607 abundance = 21.7736/> <mass_spec_peak m/z = 1543.7476 abundance = 1.3822/> <mass_spec_peak m/z = 1544.7595 abundance = 2.9977/> <mass_spec_peak m/z = 1562.8113 abundance = 37.4790/> <mass_spec_peak m/z = 1660.7776 abundance = 476.5043/> <ms/ms_peak_list>
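
The step from a raw peaklist to its annotated form might look roughly like the following Python sketch. Several things here are assumptions and not the project's actual tooling: the function name `annotate_peaklist`, the reading of the raw layout (a first row holding parent ion mass, total abundance and charge, followed by one m/z and abundance pair per row), and the element and attribute names `ms_ms_peak_list` and `mz`. The slide writes `ms/ms_peak_list` and `m/z`, but a slash is not allowed in XML names, so the sketch substitutes underscore forms.

```python
import xml.etree.ElementTree as ET

# Header row plus the first three peaks from the slide.
RAW = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
"""


def annotate_peaklist(raw, instrument, mode="ms/ms"):
    """Wrap a whitespace-separated peaklist in descriptive XML elements."""
    rows = [line.split() for line in raw.strip().splitlines()]
    parent_mass, total_abundance, charge = rows[0]

    root = ET.Element("ms_ms_peak_list")  # assumed legal stand-in name
    ET.SubElement(root, "parameter", instrument=instrument, mode=mode)
    ET.SubElement(root, "parent_ion_mass").text = parent_mass
    ET.SubElement(root, "total_abundance").text = total_abundance
    ET.SubElement(root, "z").text = charge
    for mz, abundance in rows[1:]:
        # Values stay as strings so the original precision is preserved.
        ET.SubElement(root, "mass_spec_peak", mz=mz, abundance=abundance)
    return root


root = annotate_peaklist(
    RAW, "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
)
xml_text = ET.tostring(root, encoding="unicode")
```

The instrument string is copied from the slide as written; in an ontology-backed setting it would presumably be a reference to a ProPreO concept.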

  21. Semantic annotation of Scientific Data <ms/ms_peak_list> <parameter instrument=“micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer” mode = “ms/ms”/> <parent_ion_mass>830.9570</parent_ion_mass> <total_abundance>194.9604</total_abundance> <z>2</z> <mass_spec_peak m/z = 580.2985 abundance = 0.3592/> <mass_spec_peak m/z = 688.3214 abundance = 0.2526/> <mass_spec_peak m/z = 779.4759 abundance = 38.4939/> <mass_spec_peak m/z = 784.3607 abundance = 21.7736/> <mass_spec_peak m/z = 1543.7476 abundance = 1.3822/> <mass_spec_peak m/z = 1544.7595 abundance = 2.9977/> <mass_spec_peak m/z = 1562.8113 abundance = 37.4790/> <mass_spec_peak m/z = 1660.7776 abundance = 476.5043/> <ms/ms_peak_list> Annotated ms/ms peaklist data

  22. Service description using WSDL-S • Formalize description and classification of Web Services using ProPreO concepts • Description of a Web Service using: Web Service Description Language • Concepts defined in the ProPreO process ontology: data, sequence, peptide_sequence • WSDL (ModifyDB): <?xml version="1.0" encoding="UTF-8"?> <wsdl:definitions targetNamespace="urn:ngp" ….. xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <wsdl:types> <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema"> ….. </complexType> </schema> </wsdl:types> <wsdl:message name="replaceCharacterRequest"> <wsdl:part name="in0" type="soapenc:string"/> <wsdl:part name="in1" type="soapenc:string"/> <wsdl:part name="in2" type="soapenc:string"/> </wsdl:message> <wsdl:message name="replaceCharacterResponse"> <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/> </wsdl:message> • WSDL-S (ModifyDB): <?xml version="1.0" encoding="UTF-8"?> <wsdl:definitions targetNamespace="urn:ngp" …… xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics" xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl" > <wsdl:types> <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema"> …… </complexType> </schema> </wsdl:types> <wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence"> <wsdl:part name="in0" type="soapenc:string"/> <wsdl:part name="in1" type="soapenc:string"/> <wsdl:part name="in2" type="soapenc:string"/> </wsdl:message>
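
The difference between the two listings is the `wssem:modelReference` attribute on the message. A minimal Python sketch of adding that attribute is given below. The function name `add_model_reference` and the trimmed sample document are assumptions for illustration; the slide elides most namespace declarations, so the standard WSDL 1.1 namespace is assumed for the `wsdl` prefix. The `wssem` namespace URI and the concept `ProPreO#peptide_sequence` are taken from the slide.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # assumed: WSDL 1.1 namespace
WSSEM_NS = "http://www.ibm.com/xmlns/WebServices/WSSemantics"  # from slide

# Trimmed stand-in for the ModifyDB description shown on the slide.
WSDL = """<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    targetNamespace="urn:ngp">
  <wsdl:message name="replaceCharacterRequest">
    <wsdl:part name="in0"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn"/>
  </wsdl:message>
</wsdl:definitions>"""


def add_model_reference(root, message_name, concept):
    """Annotate the named wsdl:message with an ontology concept."""
    for message in root.findall(f"{{{WSDL_NS}}}message"):
        if message.get("name") == message_name:
            message.set(f"{{{WSSEM_NS}}}modelReference", concept)
    return root


ET.register_namespace("wsdl", WSDL_NS)
ET.register_namespace("wssem", WSSEM_NS)
root = add_model_reference(
    ET.fromstring(WSDL), "replaceCharacterRequest", "ProPreO#peptide_sequence"
)
annotated = ET.tostring(root, encoding="unicode")
```

One caveat on the design: `xml.etree.ElementTree` does not keep namespace declarations for prefixes that appear only inside attribute values, such as `type="soapenc:string"`, which is why the sample omits the `type` attributes. Round-tripping a full WSDL file would need a parser that preserves those declarations.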

  23. Summary, Observations, Conclusions • Ontology Schema: relatively simple in business/industry, highly complex in science • Ontology Population: could have millions of assertions, or unique features when modeling complex life science domains • Ontology population could be largely automated if access to high quality/curated data/knowledge is available; ontology population involves disambiguation and results in a richer representation than the extracted sources; rule-based population • Ontology freshness (and validation: not just schema correctness but knowledge, i.e. how it reflects the changing world)

  24. Summary, Observations, Conclusions • Some applications: semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …

  25. More information at • http://lsdis.cs.uga.edu/projects/glycomics
