Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the First Online Metadata and Semantics Research Conference http://www.metadata-semantics.org November 23, 2005. Amit Sheth
Semantics Enabled Industrial and Scientific Applications: Research, Technology and Deployed Applications Part III: Biological Applications Keynote - the First Online Metadata and Semantics Research Conference http://www.metadata-semantics.org November 23, 2005 Amit Sheth LSDIS Lab, Department of Computer Science, University of Georgia http://lsdis.cs.uga.edu Acknowledgement: NCRR funded Bioinformatics of Glycan Expression, collaborators, partners at CCRC (Dr. William S. York) and Satya S. Sahoo, Christopher Thomas, Cartic Ramakrishnan.
Computation, data and semantics in life sciences • “The development of a predictive biology will likely be one of the major creative enterprises of the 21st century.” Roger Brent, 1999 • “The future will be the study of the genes and proteins of organisms in the context of their informational pathways or networks.” L. Hood, 2000 • “Biological research is going to move from being hypothesis-driven to being data-driven.” Robert Robbins • “We’ll see over the next decade complete transformation (of life science industry) to very database-intensive as opposed to wet-lab intensive.” Debra Goldfarb We will show how semantics is a key enabler for achieving the above predictions and visions.
Bioinformatics Apps & Ontologies • GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolism of glycans) • Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy • URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco • ProPreO: a comprehensive process ontology modeling experimental proteomics • Contains 330 classes, 40,000+ instances • Models three phases of experimental proteomics: separation techniques, mass spectrometry, and data analysis; URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo • Automatic semantic annotation of high throughput experimental data (in progress) • Semantic Web Process with WSDL-S for semantic annotations of Web Services • http://lsdis.cs.uga.edu -> Glycomics project (funded by NCRR)
Structural modeling and population challenges in GlycO • Extremely large number of glycans occurring in nature • But frequently there are only small differences in structural properties • Modeling all possible glycans would involve a significant number of redundant classes • Redundancy often results in fatal complexities in maintenance and upgrade • Population • Manual • Extraction and integration from external knowledge sources • GlycoTree – exploiting structural composition rules
Ontology population workflow GlycoTree Takahashi, Kato 2003
GlycoTree – A Canonical Representation of N-Glycans [Figure: residue tree: b-D-GlcpNAc -(1-6)+ b-D-GlcpNAc -(1-2)- b-D-GlcpNAc -(1-2)+ b-D-GlcpNAc -(1-4)- a-D-Manp -(1-6)+ b-D-Manp -(1-4)- b-D-GlcpNAc -(1-4)- b-D-GlcpNAc a-D-Manp -(1-3)+] N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15: 235-251
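One way to picture how a canonical tree avoids redundant classes: every residue position is defined once, and each concrete glycan is just a subset of positions that is closed under "parent". The sketch below is a minimal illustration under that assumption, not the GlycO implementation. The position ids (`n1`, `n2`, ...) are hypothetical, and only the well-known N-glycan core plus two antenna residues is modeled.

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass(frozen=True)
class Position:
    residue: str             # residue name as written on the slide
    linkage: Optional[str]   # linkage to the parent residue, None for the root
    parent: Optional[str]    # id of the parent position, None for the root

# Hypothetical position ids; residues follow the N-glycan core.
CANONICAL_TREE = {
    "n1": Position("b-D-GlcpNAc", None, None),
    "n2": Position("b-D-GlcpNAc", "1-4", "n1"),
    "n3": Position("b-D-Manp", "1-4", "n2"),
    "n4a": Position("a-D-Manp", "1-3", "n3"),
    "n4b": Position("a-D-Manp", "1-6", "n3"),
    "n5a": Position("b-D-GlcpNAc", "1-2", "n4a"),
    "n5b": Position("b-D-GlcpNAc", "1-2", "n4b"),
}

def is_valid_glycan(positions: Set[str]) -> bool:
    """A glycan is valid if every position exists and its parent is present."""
    if not positions:
        return False
    for pos_id in positions:
        if pos_id not in CANONICAL_TREE:
            return False
        parent = CANONICAL_TREE[pos_id].parent
        if parent is not None and parent not in positions:
            return False
    return True

core = {"n1", "n2", "n3", "n4a", "n4b"}
```

Two glycans that differ by one residue then share all other position definitions instead of duplicating them.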
Beyond expressiveness afforded in OWL • Probabilistic • more
Example: Mass spectrometry analysis Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated. Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
Mass Spectrometry Experiment Each m/z value in a mass spec diagram can stand for many different structures (uncertainty with respect to the structure that corresponds to a peak) • Different linkage • Different bond • Different isobaric structures
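The ambiguity can be made concrete with a small sketch: mass depends only on composition, so two structures that differ in a single linkage land on the same peak. The residue masses are standard monoisotopic values for underivatized monosaccharides. The two candidate structures are hypothetical, and the computed mass is not meant to match any peak on these slides, which may refer to derivatized ions.

```python
from collections import Counter

# Monoisotopic residue masses in Da (underivatized), plus one water.
RESIDUE_MASS = {"Hex": 162.0528, "HexNAc": 203.0794, "dHex": 146.0579}
WATER = 18.0106

def glycan_mass(residues):
    """Mass from composition only; linkage information is ignored."""
    counts = Counter(kind for kind, _linkage in residues)
    unknown = set(counts) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unknown residue kinds: {sorted(unknown)}")
    total = WATER + sum(counts[k] * m for k, m in RESIDUE_MASS.items())
    return round(total, 4)

def candidates_by_mass(structures):
    """Group structure names by mass; groups with >1 entry are ambiguous."""
    groups = {}
    for name, residues in structures.items():
        groups.setdefault(glycan_mass(residues), []).append(name)
    return groups

# Hypothetical candidates: same residues, one diverging linkage.
structures = {
    "candidate_a": [("HexNAc", None), ("HexNAc", "1-4"), ("Hex", "1-4"), ("dHex", "1-6")],
    "candidate_b": [("HexNAc", None), ("HexNAc", "1-4"), ("Hex", "1-4"), ("dHex", "1-3")],
}
groups = candidates_by_mass(structures)
```

Any group holding more than one name is a peak that mass alone cannot resolve, which is where background knowledge or probabilities come in.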
Very subtle differences • Peak at 1219.1 • Same molecular composition • One diverging link • Found in different organisms • background knowledge (found in honeybee venom or bovine cells) can resolve the uncertainty CBank: 16155 Honeybee venom CBank: 16154 Bovine These are core-fucosylated high-mannose glycans
Even in the same organism • Both glycans found in bovine cells • Both have a mass of 3425.11 • Same composition • Different linkage • Since expression levels of different genes can be measured in the cell, we can get the probability of each structure in the sample CBank: 21821 Different enzymes lead to these linkages CBank: 21982
Model 1: associate probability as part of Semantic Annotation • Annotate the mass spec diagram with all possibilities and assign probabilities according to the scientist’s or tool’s best knowledge
P(S | M = 3461.57) = 0.6 P(T | M = 3461.57) = 0.4 Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
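A minimal sketch of Model 1, assuming the probabilities are obtained by simple proportional normalization of enzyme expression levels. That normalization rule is an assumption made for illustration, and the expression values 3.0 and 2.0 are invented so that the result reproduces the 0.6 and 0.4 shown above.

```python
def structure_probabilities(expression_levels):
    """Normalize non-negative expression levels into probabilities."""
    total = sum(expression_levels.values())
    if total <= 0:
        raise ValueError("expression levels must sum to a positive value")
    return {name: level / total for name, level in expression_levels.items()}

def annotate_peak(mz, expression_levels):
    """Attach every candidate structure and its probability to one peak."""
    return {"m/z": mz, "candidates": structure_probabilities(expression_levels)}

# Hypothetical expression levels for the enzymes leading to S and T.
annotation = annotate_peak(3461.57, {"S": 3.0, "T": 2.0})
```

The annotation keeps all candidates rather than choosing one, so downstream tools can still reason over the alternatives.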
Model 2: Probability in ontological representation of Glycan structure • Build a generalized probabilistic glycan structure that embodies several possible glycans
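A rough sketch of what such a generalized structure could look like: the fixed part of the glycan is left implicit, and each variable site carries alternative linkages with probabilities. The site names, linkages, and probability values are hypothetical, and treating the sites as independent is a simplifying assumption of this sketch.

```python
from itertools import product

# Hypothetical variable sites: alternative linkages with probabilities.
VARIABLE_SITES = {
    "core_fucose_linkage": {"1-3": 0.4, "1-6": 0.6},
    "antenna_GlcNAc_linkage": {"1-2": 0.7, "1-4": 0.3},
}

def enumerate_structures(sites):
    """Expand the generalized structure into concrete glycans.

    Returns (assignment, probability) pairs, where the probability is
    the product over sites (independence assumed).
    """
    names = sorted(sites)
    results = []
    for combo in product(*(sorted(sites[n].items()) for n in names)):
        assignment = {n: linkage for n, (linkage, _p) in zip(names, combo)}
        probability = 1.0
        for _linkage, p in combo:
            probability *= p
        results.append((assignment, probability))
    return results

concrete = enumerate_structures(VARIABLE_SITES)
```

One generalized individual in the ontology then stands for all four concrete glycans, each recoverable with its probability.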
Cell Culture extract Glycoprotein Fraction proteolysis Glycopeptides Fraction 1 Separation technique I n Glycopeptides Fraction PNGase n Peptide Fraction Separation technique II n*m Peptide Fraction Mass spectrometry ms data ms/ms data Data reduction Data reduction ms peaklist ms/ms peaklist binning Peptide identification Glycopeptide identification and quantification N-dimensional array Peptide list Data correlation Signal integration N-Glycosylation Process (NGP)
Phase II: Ontology Population • Populate ProPreO with all experimental datasets? • Two levels of ontology population for ProPreO: • Level 1: Populate the ontology with instances that are stable across experimental runs. Ex: Human tryptic peptides – 40,000 instances in ProPreO • Level 2: Use URIs to point to actual experimental datasets
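The two levels can be sketched as follows. Level 1 instances are generated here by an in-silico tryptic digest (cleave after K or R unless the next residue is P), which is a common rule but only an assumption about how such peptide instances might be produced. The protein sequence, run id, and dataset URI are hypothetical.

```python
import re
from dataclasses import dataclass

def tryptic_peptides(protein_sequence):
    """Split after K or R, except when the next residue is P."""
    parts = re.split(r"(?<=[KR])(?!P)", protein_sequence)
    return [p for p in parts if p]

@dataclass(frozen=True)
class DatasetReference:
    """Level 2: the ontology stores only a pointer to the experimental data."""
    run_id: str
    uri: str

# Level 1: instances that do not change between experimental runs.
level1_instances = set(tryptic_peptides("AKRPGKM"))

# Level 2: per-run data stays outside the ontology and is referenced by URI.
level2_references = [
    DatasetReference("run-001", "http://example.org/datasets/run-001.pkl"),
]
```

Keeping bulky per-run data behind URIs keeps the ontology small while still letting annotations refer to concrete datasets.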
Ontology-mediated Proteomics Protocol RAW Files RAW Files DB Storing Output PKL Files (XML-based Format) ‘Clean’ PKL Files ‘Clean’ PKL Files RAW Results File Output (*.dat) Mass Spectrometer Conversion To PKL Preprocessing DB Search Post processing All values of the produces ms-ms peaklist property are micromass pkl ms-ms peaklist Masslynx_Micromass_application produces_ms-ms_peak_list Instrument mass_spec_raw_data Micromass_Q_TOF_ultima_quadrupole_time_of_flight_mass_spectrometer Data Processing Application Micromass_Q_TOF_micro_quadrupole_time_of_flight_ms_raw_data ProPreO
Semantic Annotation of Scientific Data
ms/ms peaklist data:
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043
Annotated ms/ms peaklist data:
<ms/ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
  <parent_ion_mass>830.9570</parent_ion_mass>
  <total_abundance>194.9604</total_abundance>
  <z>2</z>
  <mass_spec_peak m/z="580.2985" abundance="0.3592"/>
  <mass_spec_peak m/z="688.3214" abundance="0.2526"/>
  <mass_spec_peak m/z="779.4759" abundance="38.4939"/>
  <mass_spec_peak m/z="784.3607" abundance="21.7736"/>
  <mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
  <mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
  <mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
  <mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
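The step from raw to annotated peaklist can be sketched as below, assuming the raw layout shown on the slide: a header line with parent ion mass, total abundance and charge, followed by m/z and abundance pairs. Because `/` is not allowed in XML names, the sketch uses `ms_ms_peak_list` and `mz` in place of the slide's `ms/ms_peak_list` and `m/z`. Those renamed identifiers and the function name are assumptions of this sketch, not ProPreO's actual vocabulary.

```python
import xml.etree.ElementTree as ET

RAW_PEAKLIST = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043"""

def annotate_peaklist(raw, instrument):
    """Wrap a raw ms/ms peaklist in named elements (values kept as text)."""
    rows = [line.split() for line in raw.strip().splitlines() if line.strip()]
    parent_mass, total_abundance, charge = rows[0]
    root = ET.Element("ms_ms_peak_list")
    ET.SubElement(root, "parameter", instrument=instrument, mode="ms/ms")
    ET.SubElement(root, "parent_ion_mass").text = parent_mass
    ET.SubElement(root, "total_abundance").text = total_abundance
    ET.SubElement(root, "z").text = charge
    for mz, abundance in rows[1:]:
        ET.SubElement(root, "mass_spec_peak", mz=mz, abundance=abundance)
    return root

annotated = annotate_peaklist(
    RAW_PEAKLIST,
    "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer",
)
xml_text = ET.tostring(annotated, encoding="unicode")
```

Numeric values are kept as the original strings so that the annotated form preserves the instrument's reported precision.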
Service description using WSDL-S
• Formalize description and classification of Web Services using ProPreO concepts
Description of a Web Service using the Web Services Description Language (WSDL, ModifyDB):
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp" ….. xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      …..
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
The same description annotated with concepts defined in the ProPreO process ontology (WSDL-S, ModifyDB):
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp" …… xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics" xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      ……
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
[Diagram labels: data, sequence, peptide_sequence (concepts defined in the ProPreO process ontology)]
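The difference between the two descriptions is a single `wssem:modelReference` attribute on the message. A minimal sketch of adding it programmatically with Python's standard `xml.etree.ElementTree` follows. The `wssem` namespace URI and the `ProPreO#peptide_sequence` reference come from the slide. The WSDL 1.1 namespace URI is supplied here as an assumption, since the slide elides it, and the tiny input document and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # WSDL 1.1 namespace (assumed)
WSSEM_NS = "http://www.ibm.com/xmlns/WebServices/WSSemantics"

# Hypothetical, stripped-down WSDL with only the message to annotate.
PLAIN_WSDL = f"""<wsdl:definitions xmlns:wsdl="{WSDL_NS}" targetNamespace="urn:ngp">
  <wsdl:message name="replaceCharacterRequest"/>
  <wsdl:message name="replaceCharacterResponse"/>
</wsdl:definitions>"""

def add_model_reference(wsdl_xml, message_name, concept):
    """Return the WSDL text with a modelReference on the named message."""
    ET.register_namespace("wsdl", WSDL_NS)
    ET.register_namespace("wssem", WSSEM_NS)
    root = ET.fromstring(wsdl_xml)
    for message in root.iter(f"{{{WSDL_NS}}}message"):
        if message.get("name") == message_name:
            message.set(f"{{{WSSEM_NS}}}modelReference", concept)
    return ET.tostring(root, encoding="unicode")

annotated_wsdl = add_model_reference(
    PLAIN_WSDL, "replaceCharacterRequest", "ProPreO#peptide_sequence"
)
```

Because the annotation is an extension attribute, tools that only understand plain WSDL can still consume the annotated file.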
Summary, Observations, Conclusions • Ontology Schema: relatively simple in business/industry, highly complex in science • Ontology Population: can involve millions of assertions, and needs unique features when modeling complex life science domains • Ontology population can be largely automated when high-quality, curated data and knowledge are available; it involves disambiguation, can be rule-based, and results in a richer representation than the sources it is extracted from • Ontology freshness and validation: not just schema correctness but knowledge correctness, i.e., how well the ontology reflects the changing world
Summary, Observations, Conclusions • Some applications: semantic search, semantic integration, semantic analytics, decision support and validation (e.g., error prevention in healthcare), knowledge discovery, process/pathway discovery, …
More information at • http://lsdis.cs.uga.edu/projects/glycomics