Semantics powered Bioinformatics
Amit Sheth, William S. York, et al.
Large Scale Distributed Information Systems Lab & Complex Carbohydrate Research Center, University of Georgia
http://lsdis.cs.uga.edu
Background: SW for Life Sciences
• Bioinformatics of Glycan Expression: a component of the NCRR "Integrated Technology Resource for Biomedical Glycomics"
• W3C Interest Group on Semantic Web for Health Care and Life Sciences
• Deployed an Active Semantic Electronic Medical Patient Record application at the Athens Heart Center
Agenda
• Review of Accomplishments/Ongoing Work:
  • GLYDE standard
  • GlycO Ontology
  • ProPreO Ontology
  • Semantic Analytical Glycomics Workflow
  • Visualization
  • Semantic Web Services: WSDL-S/METEOR-S
GLYDE standard
• An XML-based representation format for glycan structures
• Inter-convertible with existing data represented using IUPAC or LINUCS
• In progress: incorporation of a probability-based representation
• In progress: incorporation of aspects for visualizing structures from GLYDE (XML) files
Reference: GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.
Collaborative GlycoInformatics
• Enable querying and export of query results in GLYDE format
• Use the GLYDE representation for disambiguation, mapping and matching
[Figure labels: GLYDE; MonosaccharideDB; SweetDB; KEGG; QUERY and RESULT, each shown as a <glyde> <residue> . . </residue> </glyde> document]
Collaborative GlycoInformatics
• Development of a GLYDE semantic web portal
• Integration with www.glycosciences.de
• Visualization aspect integrated with LiGraph (Heidelberg) or OntoVista (UGA)
• Semantic annotation of publications in the GlycoProteomics domain
[Figure labels: MonosaccharideDB; www.glycosciences.de; KEGG; GLYDE Semantic Portal]
Collaborative GlycoInformatics
Evolving collaboration between:
• LSDIS/CCRC: Will York, Amit Sheth, Michael Pierce
• EUROCarbDB (German Cancer Research Center): Willi von der Lieth
• Consortium for Functional Glycomics (CFG): Rahul Raman, Ram Sasisekharan, Thomas Lütteke
• N.D. Zelinsky Institute of Organic Chemistry (Moscow): Yuriy Knirel
• Mitsui Knowledge Industry (Japan): Hisashi Narimatsu, Norihiro Kikuchi
• Kyoto Encyclopedia of Genes and Genomes (KEGG): Minoru Kanehisa, Kiyoko F. Aoki-Kinoshita
• Palo Alto Research Center (PARC): David Goldberg
Semantic GlycoInformatics - Ontologies
• GlycO: a domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolism of glycans)
  • Contains 600+ classes and 100+ properties that describe structural features of glycans; uses a unique population strategy
  • URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco
• ProPreO: a comprehensive process ontology modeling experimental proteomics
  • Contains 330 classes, 6 million+ instances
  • Models three phases of experimental proteomics
  • URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo
GlycO taxonomy
[Figure: The first levels of the GlycO taxonomy]
[Figure: Most relationships and attributes in GlycO]
GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.
Pathway representation in GlycO
Pathways do not need to be explicitly defined in GlycO. The residue, glycan, enzyme and reaction descriptions contain all the knowledge necessary to infer pathways.
Zooming in a little …
[Figure labels: Reaction R05987; catalyzed by enzyme 2.4.1.145; adds_glycosyl_residue; N-glycan_b-D-GlcpNAc_13]
The N-glycan with KEGG ID 00015 is the substrate of reaction R05987, which is catalyzed by an enzyme of the class EC 2.4.1.145. The product of this reaction is the glycan with KEGG ID 00020.
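The idea that pathways can be inferred from reaction descriptions alone can be sketched in a few lines of Python. This is only an illustration, not GlycO's reasoning mechanism: it chains reactions whenever the product of one is the substrate of the next. Reaction R05987 and its KEGG IDs come from the slide; the two `R_HYP_*` reactions, their glycans and their enzymes are hypothetical and exist only to give the search something to chain.

```python
from collections import defaultdict, namedtuple

Reaction = namedtuple("Reaction", "rid substrate product enzyme")

REACTIONS = [
    # From the slide: glycan 00015 -> glycan 00020, enzyme class EC 2.4.1.145.
    Reaction("R05987", "00015", "00020", "EC 2.4.1.145"),
    # Hypothetical reactions, invented for illustration only.
    Reaction("R_HYP_A", "00020", "HYP_GLYCAN_X", "EC_HYP_1"),
    Reaction("R_HYP_B", "00015", "HYP_GLYCAN_Y", "EC_HYP_2"),
]


def infer_pathways(reactions, start, goal):
    """Return every chain of reaction IDs that converts `start` into `goal`."""
    by_substrate = defaultdict(list)
    for r in reactions:
        by_substrate[r.substrate].append(r)

    pathways = []

    def walk(glycan, path, seen):
        if glycan == goal:
            pathways.append([r.rid for r in path])
            return
        for r in by_substrate[glycan]:
            if r.product not in seen:  # avoid cycling through the same glycan
                walk(r.product, path + [r], seen | {r.product})

    walk(start, [], {start})
    return pathways


if __name__ == "__main__":
    print(infer_pathways(REACTIONS, "00015", "HYP_GLYCAN_X"))
```

No pathway object is stored anywhere; the chain emerges from the substrate and product fields of the individual reactions, which is the point the slide makes.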
Ontology Population
• The next slides show the steps that were necessary to populate GlycO with glycan structures from multiple sources.
• GLYDE is used to disambiguate between representations from multiple sources.
[][Asn]{[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-GlcpNAc] {[(4+1)][b-D-Manp] {[(3+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc] {}[(4+1)][b-D-GlcpNAc]{}}[(6+1)][a-D-Manp] {[(2+1)][b-D-GlcpNAc]{}}}}}}
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
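The two representations above (the LINUCS string and the GLYDE XML) describe the same structure, so the conversion can be sketched mechanically. The Python below is a minimal sketch and not the project's converter. It assumes the LINUCS grammar is a nesting of `[link][residue]{children}` triples, that links look like `(4+1)`, and that residue names look like `b-D-GlcpNAc` with a three-letter sugar code followed by a ring-form letter that the XML drops. The function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

LINK = re.compile(r"\((\d+)\+(\d+)\)")
# anomer - chirality - three-letter sugar code, ring form (p/f), optional suffix
RESIDUE = re.compile(r"([ab])-([DL])-([A-Z][a-z]{2})([pf])(\w*)")

EXAMPLE = ("[][Asn]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
           "{[(3+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}[(4+1)][b-D-GlcpNAc]{}}"
           "[(6+1)][a-D-Manp]{[(2+1)][b-D-GlcpNAc]{}}}}}}")


def parse_nodes(s, pos):
    """Parse zero or more '[link][name]{children}' triples starting at pos."""
    nodes = []
    while pos < len(s) and s[pos] == "[":
        end = s.index("]", pos)
        link, pos = s[pos + 1:end], end + 1
        end = s.index("]", pos)
        name, pos = s[pos + 1:end], end + 1
        if s[pos] != "{":
            raise ValueError(f"expected '{{' at position {pos}")
        children, pos = parse_nodes(s, pos + 1)
        if s[pos] != "}":
            raise ValueError(f"expected '}}' at position {pos}")
        pos += 1
        nodes.append((link, name, children))
    return nodes, pos


def to_element(node):
    """Turn one parsed residue (and its subtree) into a <residue> element."""
    link, name, children = node
    m_link, m_res = LINK.fullmatch(link), RESIDUE.fullmatch(name)
    if not (m_link and m_res):
        raise ValueError(f"unsupported residue: [{link}][{name}]")
    elem = ET.Element("residue", {
        "link": m_link.group(1),
        "anomeric_carbon": m_link.group(2),
        "anomer": m_res.group(1),
        "chirality": m_res.group(2),
        "monosaccharide": m_res.group(3) + m_res.group(5),
    })
    for child in children:
        elem.append(to_element(child))
    return elem


def linucs_to_glyde(linucs):
    """Convert a LINUCS string into a GLYDE-like <Glycan> element tree."""
    s = "".join(linucs.split())  # the notation is whitespace-insensitive here
    roots, pos = parse_nodes(s, 0)
    if len(roots) != 1 or pos != len(s):
        raise ValueError("expected exactly one root entry")
    _, aglycon, children = roots[0]
    glycan = ET.Element("Glycan")
    ET.SubElement(glycan, "aglycon", name=aglycon)
    for child in children:
        glycan.append(to_element(child))
    return glycan


if __name__ == "__main__":
    print(ET.tostring(linucs_to_glyde(EXAMPLE), encoding="unicode"))
```

The recursive-descent parser mirrors the nesting of the braces, so each `<residue>` element ends up inside its parent residue exactly as in the XML listing.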
ProPreO
• ProPreO: a process ontology to capture the proteomics experimental lifecycle:
  • Separation
  • Mass spectrometry
  • Analysis
• 330 classes
• 110 properties
• 6 million+ instances
Usage: Mass spectrometry analysis
[Figure: Manual annotation of a mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.]
Goldberg, et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
Semantic Annotation of Experimental Data
• Enables ontology-mediated disambiguation
• Allows correlation between disparate entities using semantic relations
P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4
Goldberg, et al., Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875
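One way to read the two probabilities above: a single mass (M = 3461.57) is compatible with two candidates, S and T, and ontology knowledge can then rule candidates in or out. The Python below is a sketch under that reading. The `is_consistent` predicate stands in for an ontology lookup and is hypothetical, and the renormalization step is an assumption, not something the slide specifies.

```python
def disambiguate(posteriors, is_consistent):
    """Drop candidates the ontology rejects, then renormalize the rest.

    posteriors    -- dict mapping candidate -> P(candidate | mass)
    is_consistent -- hypothetical predicate standing in for an ontology check
    """
    kept = {c: p for c, p in posteriors.items() if is_consistent(c)}
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing survives the ontology check
    return {c: p / total for c, p in kept.items()}


# Values from the slide: P(S | M = 3461.57) = 0.6, P(T | M = 3461.57) = 0.4
candidates = {"S": 0.6, "T": 0.4}

if __name__ == "__main__":
    ranked = disambiguate(candidates, lambda c: True)
    print(max(ranked, key=ranked.get), ranked)
```

With no ontology constraint the higher-probability candidate wins. With a constraint that rejects one candidate, all remaining probability mass moves to the other.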
Semantic GlycoProteomics Workflow
[Figure: workflow diagram. Stages in the order they appear: Cell Culture → (extract) → Glycoprotein Fraction → (proteolysis) → 1 Glycopeptides Fraction → (Separation technique I) → n Glycopeptides Fractions → (PNGase) → n Peptide Fractions → (Separation technique II) → n*m Peptide Fractions → (Mass spectrometry) → ms data and ms/ms data → (Data reduction) → ms peaklist and ms/ms peaklist. Remaining labels: binning; Peptide identification; Glycopeptide identification and quantification; N-dimensional array; Peptide list; Data correlation; Signal integration.]
Web Services based Workflow = Web Process
[Figure: a workflow composed of Web Service 1 to Web Service 4 (WS1 to WS4), hosted on different platforms: Windows XP, Linux, Mac and Solaris.]
BOWSER
• Use semantics for describing Web Services
• WSDL-S (LSDIS/IBM)
• Use service-level annotation of Web Services
• Graphical traversal of a taxonomy of biological concepts to search for Web Services
• http://128.192.9.11:8080/stargate/bowser.jsp
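Searching for services by traversing a concept taxonomy can be sketched as follows. This is not BOWSER's implementation. The concept names `data`, `sequence` and `peptide_sequence` and the service name `ModifyDB` appear elsewhere in this deck, but the parent/child arrangement, the annotation of `ModifyDB`, and `HypotheticalSpectrumService` are assumptions made for illustration.

```python
# Assumed taxonomy: each concept maps to its direct sub-concepts.
TAXONOMY = {
    "data": ["sequence"],
    "sequence": ["peptide_sequence"],
    "peptide_sequence": [],
}

# Assumed service-level annotations: service name -> annotating concept.
SERVICES = {
    "ModifyDB": "peptide_sequence",
    "HypotheticalSpectrumService": "data",  # invented for the example
}


def descendants(concept, taxonomy=TAXONOMY):
    """Return the concept together with all of its transitive sub-concepts."""
    found, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(taxonomy.get(current, []))
    return found


def find_services(concept, services=SERVICES):
    """Services annotated with the concept or any concept beneath it."""
    wanted = descendants(concept)
    return sorted(name for name, c in services.items() if c in wanted)


if __name__ == "__main__":
    print(find_services("sequence"))
```

Selecting a broad concept returns services annotated with any narrower concept beneath it, which is what makes taxonomy traversal more useful than keyword matching on service names.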
Semantic Annotation of Scientific Data

ms/ms peaklist data:
830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043

Annotated ms/ms peaklist data:
<ms/ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms/ms"/>
  <parent_ion_mass>830.9570</parent_ion_mass>
  <total_abundance>194.9604</total_abundance>
  <z>2</z>
  <mass_spec_peak m/z="580.2985" abundance="0.3592"/>
  <mass_spec_peak m/z="688.3214" abundance="0.2526"/>
  <mass_spec_peak m/z="779.4759" abundance="38.4939"/>
  <mass_spec_peak m/z="784.3607" abundance="21.7736"/>
  <mass_spec_peak m/z="1543.7476" abundance="1.3822"/>
  <mass_spec_peak m/z="1544.7595" abundance="2.9977"/>
  <mass_spec_peak m/z="1562.8113" abundance="37.4790"/>
  <mass_spec_peak m/z="1660.7776" abundance="476.5043"/>
</ms/ms_peak_list>
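The step from the raw peaklist to the annotated form can be sketched in Python. This is a minimal illustration, not the project's annotation tool. It assumes the raw file holds the parent ion mass, total abundance and charge first, followed by m/z and abundance pairs. Because `/` is not legal in XML names, the sketch uses the substitute names `ms_ms_peak_list` and `mz`; those names are assumptions and not part of ProPreO.

```python
import xml.etree.ElementTree as ET

RAW = """830.9570 194.9604 2
580.2985 0.3592
688.3214 0.2526
779.4759 38.4939
784.3607 21.7736
1543.7476 1.3822
1544.7595 2.9977
1562.8113 37.4790
1660.7776 476.5043"""

INSTRUMENT = "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"


def annotate_peaklist(raw, instrument):
    """Wrap a whitespace-separated ms/ms peaklist in descriptive XML elements."""
    tokens = raw.split()
    if len(tokens) < 3 or (len(tokens) - 3) % 2:
        raise ValueError("expected 3 header values followed by m/z-abundance pairs")
    parent_mass, total_abundance, charge, *rest = tokens

    root = ET.Element("ms_ms_peak_list")  # assumed name; '/' is not XML-legal
    ET.SubElement(root, "parameter", instrument=instrument, mode="ms/ms")
    ET.SubElement(root, "parent_ion_mass").text = parent_mass
    ET.SubElement(root, "total_abundance").text = total_abundance
    ET.SubElement(root, "z").text = charge
    # Remaining tokens alternate between m/z and abundance.
    for mz, abundance in zip(rest[0::2], rest[1::2]):
        ET.SubElement(root, "mass_spec_peak", mz=mz, abundance=abundance)
    return root


if __name__ == "__main__":
    print(ET.tostring(annotate_peaklist(RAW, INSTRUMENT), encoding="unicode"))
```

Values are kept as the original strings so no digits are lost to float formatting.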
Discovery of relationship between biological entities
[Figure labels: ProPreO (process); GlycO; Gene Ontology (GO); Genomic database (Mascot/Sequest); Lectin; Collection of N-glycan ligands; Identified and quantified peptides; Fragment of Specific protein; Collection of Biosynthetic enzymes; Specific cellular process]
The inference: instances of the class "collection of biosynthetic enzymes" (GNT-V) are involved in the specific cellular process (metastasis).
Semantic Web Services using WSDL-S
• Formalize description and classification of Web Services using ProPreO concepts

WSDL ModifyDB (description of a Web Service using the Web Services Description Language):
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp"
    …..
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      …..
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>

WSDL-S ModifyDB (the same description annotated with concepts defined in the ProPreO process ontology: data, sequence, peptide_sequence):
<?xml version="1.0" encoding="UTF-8"?>
<wsdl:definitions targetNamespace="urn:ngp"
    ……
    xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics"
    xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <wsdl:types>
    <schema targetNamespace="urn:ngp" xmlns="http://www.w3.org/2001/XMLSchema">
      ……
      </complexType>
    </schema>
  </wsdl:types>
  <wsdl:message name="replaceCharacterRequest" wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
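A client can recover the semantic annotation from a WSDL-S description by reading the `wssem:modelReference` attribute on each message. The Python below is a sketch using only the standard library. The embedded document is a cut-down, well-formed stand-in for the ModifyDB fragment: the `wssem` and `ProPreO` namespace URIs are taken from the slide, while the `wsdl` and `soapenc` namespace declarations fill a part the slide elides and are assumed to be the standard WSDL 1.1 and SOAP encoding namespaces.

```python
import xml.etree.ElementTree as ET

# Assumed: the elided part of the slide declares the standard WSDL 1.1 namespace.
NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}
WSSEM = "http://www.ibm.com/xmlns/WebServices/WSSemantics"

# Cut-down, well-formed stand-in for the annotated ModifyDB description.
WSDL_S = """<wsdl:definitions targetNamespace="urn:ngp"
    xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
    xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:wssem="http://www.ibm.com/xmlns/WebServices/WSSemantics"
    xmlns:ProPreO="http://lsdis.cs.uga.edu/ontologies/ProPreO.owl">
  <wsdl:message name="replaceCharacterRequest"
      wssem:modelReference="ProPreO#peptide_sequence">
    <wsdl:part name="in0" type="soapenc:string"/>
    <wsdl:part name="in1" type="soapenc:string"/>
    <wsdl:part name="in2" type="soapenc:string"/>
  </wsdl:message>
  <wsdl:message name="replaceCharacterResponse">
    <wsdl:part name="replaceCharacterReturn" type="soapenc:string"/>
  </wsdl:message>
</wsdl:definitions>"""


def model_references(wsdl_text):
    """Map each annotated message name to its ontology concept reference."""
    root = ET.fromstring(wsdl_text)
    refs = {}
    for message in root.findall("wsdl:message", NS):
        # ElementTree exposes namespaced attributes as '{uri}localname'.
        ref = message.get(f"{{{WSSEM}}}modelReference")
        if ref is not None:
            refs[message.get("name")] = ref
    return refs


if __name__ == "__main__":
    print(model_references(WSDL_S))
```

Messages without an annotation, such as `replaceCharacterResponse` here, are skipped, so the result lists only the parts of the service that have been tied to ontology concepts.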
Semantic Visualization
• Ontologies are meant for machine consumption
• They are often too convoluted for the human eye
• The scientist needs to know the concepts she uses for annotation
• Build a visualization environment that translates the formal concepts into a representation the domain expert understands well
Customizable Layouts
• With customizable layouts, knowledge can be formalized in a machine-understandable way and then rendered visually to suit the user's needs.
• Cartoon representation for the glycobiologist
• Chemical reactions shown as left side → right side, instead of the convoluted representation in the ontology
Ongoing and Future Work
• SemURI: Semantic URI based provenance scheme using ProPreO
• RDF-based version of the GLYDE schema
• A framework for semantic annotation of experimental data
• Integration of large datasets (~500MB) into ProPreO for reasoning
Further details at: http://lsdis.cs.uga.edu/projects/glycomics/