Genome Data and Tool Interoperation over the “Semantic” Web. By Kei-Hoi Cheung, Ph.D., Assistant Professor, Yale Center for Medical Informatics. MB&B 452b/752b, April 20, 2005, Yale University.
Outline • Introduction • Semantic Web • Resource Description Framework (RDF) • Life Sciences Identifiers (LSID) • YeastHub: yeast genome data interoperation • Web Services for tool interoperation • Collaborative projects • Biosphere • Taverna • Semantic Web Services • Conclusion • Future directions
Eras of Computing • Mainframe computing (many people share one computer) • Personal computing (one person uses one computer) • Ubiquitous computing (one person is served by many computers over the network) • Client/server computing, grid computing, peer-to-peer computing, distributed/parallel computing, component-based computing, etc • World Wide Web (WWW) is one of the main driving forces • It provides a globally distributed communication framework that is essential for almost all scientific collaboration, including bioinformatics
The World Wide Web • On the order of 10^8 users • Used in every country on Earth • On the order of 10^10 indexed web resources (text) in Google, etc. • Essentially infinite if one includes “dynamic” web pages • Massively distributed and open
Data Heterogeneity • Data are exposed in different ways • Programmatic interfaces • Web forms or pages • FTP directory structures • Data are presented in different ways • Structured text • Tab delimited format, XML format, etc • Free text • Binary • Images • Naming conflicts (e.g., synonyms and homonyms)
Tool heterogeneity • Server applications • Web server applications • Application programming interfaces (API) • Client applications (downloadable software) • Different programming languages • Different operating systems
From Web to Semantic Web • Human processing → Machine processing • Free text description → Ontological description • HTML → XML → RDF or its extensions • Metadata!
HTML Example
Readme (Col# Description): 1 Pedigree id • 2 Person id • 3 Father id • 4 Mother id • 5 Sex • 6 Status
HTML source:
<html> <body> … <a href="http://ycmi.med.yale.edu/ped_readme.html">Readme</a> <table> <tr> <td>1</td> <td>1</td> <td>0</td> <td>0</td> … </tr> … </table> … </body> </html>
Rendered table (one row per person):
1 1 0 0 1 1
1 2 0 0 2 0
1 3 1 2 2 0
1 4 1 2 1 0
1 5 1 2 1 1
1 6 1 2 1 0
Other Advantages of Using XML • It is simple, hierarchical, self-describing, and computer-readable • It can be validated using a DTD or XML Schema • It is a W3C standard • It has a large base of software support (both commercial and public domain software tools) • Editing tools, DOM, SAX, XSL, etc.
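To make the “self-describing” point concrete, here is a minimal sketch of how two rows of the pedigree table from the HTML example might look once each value carries its own field name. The element and attribute names (`pedigree`, `person`, `father`, and so on) are hypothetical, chosen to mirror the Readme columns; the slides do not prescribe a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of two pedigree rows; tag and attribute names
# are illustrative assumptions, not a published format.
PEDIGREE_XML = """
<pedigree id="1">
  <person id="3" father="1" mother="2" sex="2" status="0"/>
  <person id="5" father="1" mother="2" sex="1" status="1"/>
</pedigree>
"""

def load_people(xml_text):
    """Return one dict per <person>, keyed by attribute name."""
    root = ET.fromstring(xml_text)
    people = []
    for person in root.iter("person"):
        record = dict(person.attrib)
        record["pedigree"] = root.get("id")  # inherit the enclosing pedigree id
        people.append(record)
    return people

people = load_people(PEDIGREE_XML)
for p in people:
    print(p["id"], "father:", p["father"], "status:", p["status"])
```

Unlike the HTML table, a program reading this does not need the separate Readme to know which column is the father id.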
Proliferation of Bio-XML Formats • Sequence: BSML, AGAVE • Microarray gene expression: MAML, GEML, MAGE-ML • Pathway: BIND, SBML, PSI-MI • RDF (e.g., BioPAX) • Semantically rich ontologies • Reasoning (machine intelligence)
Definition of an Ontology • Conceptualization of a domain of interest • Concepts, relations, attributes, constraints, objects, values • An ontology is a specification of a conceptualization • Formal notation • Documentation • A variety of forms, but includes: • A vocabulary of terms • Some specification of the meaning of the terms • Ontologies are defined for reuse
Roles of Ontologies in Bioinformatics • Success of many biological DBs depends on • High fidelity ontologies • Clearly communicating their ontologies • Prevent errors on data entry and interpretation • Common framework for multidatabase queries • Controlled vocabularies for genome annotation • GO • EC numbers • Information-extraction applications • Reuse is a core aspect of ontologies • Reuse of existing ontologies faster than designing new ones • Reuse decreases semantic heterogeneity of DBs • Schema-driven Software • Knowledge-acquisition tools • Query tools
Example Bio-ontologies • Gene Ontologies • http://www.geneontology.org/ • MGED Ontologies • http://mged.sourceforge.net/ • Open Biomedical Ontologies (OBO) • http://obo.sourceforge.net/
Ontology desiderata • Precision • Formal, unambiguous • High fidelity • Explicitness • Clarity • Commitment • Reuse • Systematic • Quality • Flexibility • Expressivity • Evolution • Machine computable
Semantic Web • It provides a common framework that allows semantic interoperability among multiple resources through the use of ontologies • It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners • It is based on the Resource Description Framework (RDF)
Resource Description Framework (RDF) • It is a standard data model (a directed, labeled graph) for representing information (metadata) about resources on the World Wide Web • In general, it can be used to represent information about “things” that can be identified (using URIs) on the Web • It is intended to provide a simple way to make statements (descriptions) about Web resources
RDF Statement • An RDF statement consists of: • Subject: a resource identified by a URI • Predicate: a property (defined in a namespace identified by a URI) • Object: a property value or another resource • For example, the “dbSNP Website” is a subject, “creator” is a predicate, and “NCBI” is an object. • A resource can be described by multiple statements.
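The subject/predicate/object structure can be sketched in a few lines of plain Python. This is an illustration only: the dbSNP URL is an assumed identifier (the slide names the website only in words), and the second statement is added just to show one resource carrying several statements.

```python
from collections import namedtuple

# A statement is a (subject, predicate, object) triple.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core namespace

# Assumed URI for the dbSNP website; not given on the slide.
DBSNP = "http://www.ncbi.nlm.nih.gov/SNP/"

statements = [
    Triple(DBSNP, DC + "creator", "NCBI"),
    Triple(DBSNP, DC + "language", "en"),  # second statement, illustrative
]

def describe(resource, triples):
    """Collect every predicate/object pair stated about one resource."""
    return {t.predicate: t.object for t in triples if t.subject == resource}

description = describe(DBSNP, statements)
print(description[DC + "creator"])
```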
RDF/XML Representation
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://www.example.org/terms/">
  <!-- rdf:about value is assumed, following the W3C RDF Primer example this snippet resembles -->
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <dc:creator rdf:resource="http://www.example.org/staffid/85740"/>
    <dc:language>en</dc:language>
    <ex:creation-date>August 16, 1999</ex:creation-date>
  </rdf:Description>
</rdf:RDF>
Data Integration Using RDF • GenBank: human hemoglobin --derives from--> atagccgtacctgcgagtctagaagct • Gene Ontology: human hemoglobin --is a--> oxygen transport protein • Protein Data Bank: human hemoglobin --has 3D structure--> [structure image] • Unified view: the three graphs merged on the shared node “human hemoglobin”, which then carries all three properties
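The merge pictured on this slide amounts to a set union of triples that share a subject. The following is a rough sketch under simplifying assumptions: resources are named by plain strings rather than URIs, and the Protein Data Bank object is a placeholder because the slide shows an image rather than an identifier.

```python
# Each source contributes triples about the same resource name.
genbank = {("human hemoglobin", "derives from", "atagccgtacctgcgagtctagaagct")}
gene_ontology = {("human hemoglobin", "is a", "oxygen transport protein")}
# Placeholder object: the slide shows a structure image, not an identifier.
pdb = {("human hemoglobin", "has 3D structure", "<structure record>")}

def merge(*graphs):
    """Union of triple sets; identical triples collapse automatically."""
    unified = set()
    for g in graphs:
        unified |= g
    return unified

def properties_of(subject, graph):
    """Map predicate -> object for one subject in the merged graph."""
    return {p: o for s, p, o in graph if s == subject}

unified = merge(genbank, gene_ontology, pdb)
view = properties_of("human hemoglobin", unified)
for predicate in sorted(view):
    print(predicate, "->", view[predicate])
```

The merge only works because all three sources use the same name for the subject, which is the problem that shared identifiers are meant to solve.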
Reification • Making statements about statements • For example, GenBank provides the following statement: “human hemoglobin derives from atagccgtacctgcgagtctagaagct”
Example:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=29436">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/Genbank">
    <s:derive_from rdf:ID="statement1"> atag… </s:derive_from>
  </rdf:Description>
  <rdf:Description rdf:about="#statement1">
    <s:providedBy>GenBank</s:providedBy>
  </rdf:Description>
</rdf:RDF>
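In triple terms, reification turns a statement into a resource of type `rdf:Statement` whose subject, predicate, and object are recorded as properties, so that further statements (such as provenance) can point at it. The sketch below uses plain strings for brevity and keeps the sequence truncated as on the slide; the helper name `reify` is an illustrative choice.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

def reify(statement_id, triple):
    """Describe a triple as a resource so other triples can refer to it."""
    return [
        Triple(statement_id, "rdf:type", "rdf:Statement"),
        Triple(statement_id, "rdf:subject", triple.subject),
        Triple(statement_id, "rdf:predicate", triple.predicate),
        Triple(statement_id, "rdf:object", triple.object),
    ]

claim = Triple("human hemoglobin", "derives_from", "atag...")
graph = reify("#statement1", claim)
# Statement about the statement: who provided it.
graph.append(Triple("#statement1", "providedBy", "GenBank"))

providers = [t.object for t in graph
             if t.subject == "#statement1" and t.predicate == "providedBy"]
print(providers)
```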
Other RDF-Based Ontology Languages • RDFS • DAML+OIL • OWL
Life Science Identifiers (“LSID”) Address Data Access Problems • LSID is a naming standard for distributed data, specifically: • Scientifically significant data • Geographically distributed • Files, database records, and data objects managed by N-tier applications • Public and/or private networks • Owned and managed by different organizations
LSID Syntax • 5 Part Format: URN:LSID:Authority:Namespace:Object[:Revision-ID] • URN:LSID: is a mandatory prefix • Authority is the Internet domain of the organization that assigns an LSID to a resource • Namespace constrains the scope of the object • Object is an alphanumeric string identifying the object • Revision-ID is an optional version of the object • Examples • URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072:1 • URN:LSID:ncbi.nlm.nih.gov:pubmed:12571434
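The format above is simple enough to split by hand. Here is a minimal parsing sketch, assuming colons never occur inside a component and treating the prefix case-insensitively; the function and field names are illustrative and are not taken from any LSID toolkit.

```python
from collections import namedtuple

LSID = namedtuple("LSID", ["authority", "namespace", "object_id", "revision"])

def parse_lsid(text):
    """Split an LSID into its parts; revision is None when absent."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected 5 or 6 colon-separated parts: %r" % text)
    # The mandatory prefix is compared case-insensitively (an assumption here).
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("missing URN:LSID: prefix: %r" % text)
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError("empty component in %r" % text)
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)

genbank = parse_lsid("URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072:1")
pubmed = parse_lsid("URN:LSID:ncbi.nlm.nih.gov:pubmed:12571434")
print(genbank)
print(pubmed.revision)
```

Parsing only recovers the name's components; resolving an LSID to data and metadata is a separate step handled by the authority's resolution service.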
LSID: a single naming scheme • One standard naming scheme • Named data is unique • Data integrity is maintained • Breaking down of “data silos” • Names are no longer useful only in a specific proprietary context • Integrate any data source using the standard naming scheme • A single LSID protocol replaces proprietary, source-specific programs • Access to more data • Integrate data across discovery and development cycles • Metadata features • Standard access to specific data allows them to be related semantically with ease. These semantic links can lead to new insights
LSID-Enabled Applications • LaunchPad • BioHaystack
LaunchPad • it takes an LSID; • resolves it; • attempts to match the local applications one uses to process/view this data.
YeastHub (a semantic web approach to yeast data integration) • Collaboration between YCMI and Gerstein Lab: Kevin Yip, Andrew Smith, Andy Masiar, Remko deKnikker • Accepted for publication and presentation at ISMB 2005
Yeast Genome Data • The budding yeast Saccharomyces cerevisiae was the first eukaryote to have its genome fully sequenced. • It is easy to manipulate genetically, and many of its genes are strikingly similar to human genes. • It has been studied extensively through a wide range of biological experiments (e.g., microarray experiments). • A large variety of yeast genome data (e.g., gene expression data) have been made available through many resources (e.g., SGD, MIPS, YPD, TRIPLES, Yeast World, etc.) • Integration of such a variety of yeast data can facilitate whole genome analysis
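Data Conversion and Integration • Sources: Resource1, Resource2, …, Resourcen (e.g., <xml> … </xml>) • Conversion tools: DOM/SAX, DB-specific tool, XSLT • Converted output: RDF1, RDF2, …, RDFn • Storage: RDF/DB (Sesame) • Query: RDQL • Consumers: Users/Agents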
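The conversion step for a tabular source can be sketched as follows. This is not YeastHub's actual converter: the `http://example.org/yeast/` namespace, the column names, and the two-row input are illustrative assumptions, and literal escaping is deliberately ignored.

```python
import csv
import io

BASE = "http://example.org/yeast/"  # hypothetical namespace, for illustration

# Small tab-delimited input standing in for one resource's export.
TSV = "orf\tgene\nYAL001C\tTFC3\nYAL002W\tVPS8\n"

def rows_to_triples(tsv_text, key_column):
    """Turn each row into triples: (row URI, column URI, cell value)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for row in reader:
        subject = BASE + row[key_column]
        for column, value in row.items():
            if column != key_column:
                triples.append((subject, BASE + column, value))
    return triples

def to_ntriples(triples):
    """Serialize as N-Triples-style lines with plain literal objects."""
    return ['<%s> <%s> "%s" .' % t for t in triples]

triples = rows_to_triples(TSV, "orf")
lines = to_ntriples(triples)
for line in lines:
    print(line)
```

Once every source is reduced to triples like these, loading them into a single RDF store is what makes cross-source queries possible.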
Two Levels of RDF Description • Resource description • Data description
Semantic Web Technologies Employed in YeastHub • RDF Site Summary (RSS) • D2RQ (mapping from relational databases to RDF) • Semantic Web Database (Sesame) • RDF Query Languages (e.g., RQL and SeRQL)
An Example Scenario • Comparative genomics
Web Services: “Creating a Bioinformatics Nation” (Lincoln Stein)
Web Services • UDDI • WSDL • SOAP