Genome Data and Tool Interoperation over the “Semantic” Web

By Kei-Hoi Cheung, Ph.D. • Assistant Professor, Yale Center for Medical Informatics • MB&B 452b/752b, April 20, 2005, Yale University

Outline • Introduction • Semantic Web • Resource Description Framework (RDF) • Life Sciences Identifiers (LSID) • YeastHub: yeast genome data interoperation • Web Services for tool interoperation • Collaborative projects • Biosphere • Taverna • Semantic Web Services • Conclusion • Future directions

Eras of Computing • Mainframe computing (many people share one computer) • Personal computing (one person uses one computer) • Ubiquitous computing (one person is served by many computers over the network) • Client/server computing, grid computing, peer-to-peer computing, distributed/parallel computing, component-based computing, etc. • The World Wide Web (WWW) is one of the main driving forces • It provides a globally distributed communication framework that is essential for almost all scientific collaboration, including bioinformatics

The World Wide Web • On the order of 10^8 users • Used in every country on Earth • On the order of 10^10 indexed web resources (text) in Google, etc. • Essentially infinite if one includes “dynamic” web pages • Massively distributed and open

It is difficult to keep track of these resources

Data Heterogeneity • Data are exposed in different ways • Programmatic interfaces • Web forms or pages • FTP directory structures • Data are presented in different ways • Structured text • Tab-delimited format, XML format, etc. • Free text • Binary • Images • Naming conflicts (e.g., synonyms and homonyms)

Tool heterogeneity • Server applications • Web server applications • Application programming interfaces (API) • Client applications (downloadable software) • Different programming languages • Different operating systems

From Web to Semantic Web • Human processing → machine processing • Free-text description → ontological description • HTML → XML → RDF or its extensions • Metadata!

HTML Example
<html>
<body>
…
<a href="http://ycmi.med.yale.edu/ped_readme.html">Readme</a>
<table>
<tr> <td>1</td> <td>1</td> <td>0</td> <td>0</td> … </tr>
…
</table>
…
</body>
</html>
Readme (column number and description) • 1: pedigree id • 2: person id • 3: father id • 4: mother id • 5: sex • 6: status
Rendered page:
Readme
1 1 0 0 1 1
1 2 0 0 2 0
1 3 1 2 2 0
1 4 1 2 1 0
1 5 1 2 1 1
1 6 1 2 1 0

XML Example

Other Advantages of Using XML • It is simple, hierarchical, self-describing, and computer-readable • It can be validated using a DTD or XML Schema • It is a W3C standard • It has a large base of software support (both commercial and public-domain tools) • Editing tools, DOM, SAX, XSL, etc.
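To make “self-describing” concrete, here is a minimal sketch that tags one row of the pedigree table from the HTML example (row 1 3 1 2 2 0) and reads it back with Python's standard ElementTree parser. The element and attribute names are hypothetical, chosen for illustration from the Readme column descriptions; they are not taken from any published pedigree format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of pedigree 1, person 3 (father 1, mother 2, sex 2, status 0).
# Tag names are assumptions; unlike the HTML table, each value carries its own label.
PEDIGREE_XML = """\
<pedigree id="1">
  <person id="3">
    <father>1</father>
    <mother>2</mother>
    <sex>2</sex>
    <status>0</status>
  </person>
</pedigree>
"""

root = ET.fromstring(PEDIGREE_XML)
person = root.find("person")

# Build a field-name -> value mapping straight from the tags; no separate Readme is needed.
record = {child.tag: child.text for child in person}
print(person.get("id"), record)
```

With the HTML version, a program has to know from the separate Readme that the third column is the father id; here the tag itself says so.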

Proliferation of Bio-XML Formats • Sequence: BSML, AGAVE • Microarray gene expression: MAML, GEML, MAGE-ML • Pathway: BIND, SBML, PSI-MI • RDF (e.g., BioPAX) • Semantically rich ontologies • Reasoning (machine intelligence)

Definition of an Ontology • Conceptualization of a domain of interest • Concepts, relations, attributes, constraints, objects, values • An ontology is a specification of a conceptualization • Formal notation • Documentation • A variety of forms, but includes: • A vocabulary of terms • Some specification of the meaning of the terms • Ontologies are defined for reuse

Roles of Ontologies in Bioinformatics • Success of many biological DBs depends on • High fidelity ontologies • Clearly communicating their ontologies • Prevent errors on data entry and interpretation • Common framework for multidatabase queries • Controlled vocabularies for genome annotation • GO • EC numbers • Information-extraction applications • Reuse is a core aspect of ontologies • Reuse of existing ontologies faster than designing new ones • Reuse decreases semantic heterogeneity of DBs • Schema-driven Software • Knowledge-acquisition tools • Query tools

Example Bio-ontologies • Gene Ontology (GO) • http://www.geneontology.org/ • MGED Ontology • http://mged.sourceforge.net/ • Open Biomedical Ontologies (OBO) • http://obo.sourceforge.net/

Are current bio-ontologies adequate?

Ontology Desiderata • Precision • Formal, unambiguous • High fidelity • Explicitness • Clarity • Commitment • Reuse • Systematic • Quality • Clarity • Flexibility • Expressivity • Evolution • Machine computable

Semantic Web • It provides a common framework that allows semantic interoperability among multiple resources through the use of ontologies • It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners • It is based on the Resource Description Framework (RDF)

Resource Description Framework (RDF) • It is a standard data model (a directed, labeled graph) for representing information (metadata) about resources on the World Wide Web • In general, it can be used to represent information about “things” that can be identified (using URIs) on the Web • It is intended to provide a simple way to make statements (descriptions) about Web resources

RDF Statement • An RDF statement consists of: • Subject: a resource identified by a URI • Predicate: a property (defined in a namespace identified by a URI) • Object: a property value or a resource • For example, the “dbSNP Website” is a subject, “creator” is a predicate, and “NCBI” is an object • A resource can be described by multiple statements
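The subject/predicate/object structure can be sketched in a few lines of Python. This is only an illustration of the data model, not an RDF library: the dbSNP URI and the second ("language") statement are assumptions added for the example, while the creator predicate uses the Dublin Core element namespace.

```python
from collections import namedtuple

# One RDF statement: subject and predicate are URIs; the object is a URI or a literal value.
Statement = namedtuple("Statement", ["subject", "predicate", "obj"])

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Assumed URI standing in for the "dbSNP Website" resource.
DBSNP = "http://www.ncbi.nlm.nih.gov/SNP/"

statements = [
    Statement(DBSNP, DC + "creator", "NCBI"),
    Statement(DBSNP, DC + "language", "en"),  # assumed second statement, for illustration
]

def describe(resource, stmts):
    """Collect every statement about `resource` as a predicate -> object mapping."""
    return {s.predicate: s.obj for s in stmts if s.subject == resource}

description = describe(DBSNP, statements)
print(description)
```

Because every statement has the same three-part shape, describing a resource more fully only means appending more statements with the same subject.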

Graphical Representation

RDF/XML Representation
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <dc:creator rdf:resource="http://www.example.org/staffid/85740"/>
    <dc:language>en</dc:language>
    <ex:creation-date>August 16, 1999</ex:creation-date>
  </rdf:Description>
</rdf:RDF>

Data Integration Using RDF • GenBank: humanhemoglobin “derives from” atagccgtacctgcgagtctagaagct • Gene Ontology: humanhemoglobin “is a” oxygentransportprotein • Protein Data Bank: humanhemoglobin “has 3D structure” • Unified view: the three graphs merge at the shared node humanhemoglobin, which then carries “derives from”, “is a”, and “has 3D structure”
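The merge step can be sketched as a set union over triples. This is a simplified illustration with plain strings standing in for URIs; the object of the “has 3D structure” statement is a placeholder, since the slide shows it only as a picture.

```python
# Each source contributes (subject, predicate, object) triples about the same resource.
genbank = {("humanhemoglobin", "derives from", "atagccgtacctgcgagtctagaagct")}
gene_ontology = {("humanhemoglobin", "is a", "oxygentransportprotein")}
# Placeholder object: the slide depicts the structure as an image, not as a value.
protein_data_bank = {("humanhemoglobin", "has 3D structure", "structure-placeholder")}

# Because all three graphs name the subject identically, merging is a plain set union.
unified = genbank | gene_ontology | protein_data_bank

def properties_of(subject, graph):
    """Return predicate -> object for every triple about `subject`."""
    return {p: o for s, p, o in graph if s == subject}

view = properties_of("humanhemoglobin", unified)
print(view)
```

The union only works because the sources agree on the identifier for human hemoglobin; when they do not, the naming conflict has to be resolved first.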

Reification • Making statements about statements • For example, GenBank provides the following statement: “human hemoglobin derives from atagccgtacctgcgagtctagaagct”
Example
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=29436">
  <rdf:Description rdf:about="http://www.ncbi.nlm.nih.gov/Genbank">
    <s:derive_from rdf:ID="statement1">atag…</s:derive_from>
  </rdf:Description>
  <rdf:Description rdf:about="#statement1">
    <s:providedBy>GenBank</s:providedBy>
  </rdf:Description>
</rdf:RDF>
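In triple form, reification turns the statement itself into a resource described by the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). The sketch below is illustrative only: the helper name `reify` and the short subject and predicate strings are assumptions, used in place of full URIs.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def reify(statement_id, subject, predicate, obj):
    """Return the four triples that describe one statement as a resource."""
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]

# The GenBank statement, with short names standing in for URIs (assumption).
triples = reify(
    "#statement1",
    "humanhemoglobin",
    "derive_from",
    "atagccgtacctgcgagtctagaagct",
)

# A statement about the statement: who provided it.
triples.append(("#statement1", "providedBy", "GenBank"))

for triple in triples:
    print(triple)
```

Recording provenance this way lets an integrated store keep track of which source asserted which fact.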

Other RDF-Based Ontology Languages • RDFS • DAML+OIL • OWL

Life Science Identifiers (“LSID”) Address Data Access Problems • LSID is a naming standard for distributed data, specifically: • Scientifically significant data • Geographically distributed • Files, database records, and data objects managed by N-tier applications • Public and/or private networks • Owned and managed by different organizations

LSID Syntax • Five-part format with an optional revision: URN:LSID:Authority:Namespace:Object[:Revision-ID] • URN:LSID: is a mandatory prefix • Authority is the Internet domain of the organization that assigns an LSID to a resource • Namespace constrains the scope of the object • Object is an alphanumeric identifier for the object • Revision-ID is an optional version of the object • Examples • URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072:1 • URN:LSID:ncbi.nlm.nih.gov:pubmed:12571434
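The syntax above is simple enough to split on colons. The following is a minimal sketch of a parser, assuming well-formed input with no colons inside any field; `LSID` and `parse_lsid` are hypothetical names, and this is not the reference LSID resolver.

```python
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None  # Revision-ID is optional

def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts; raise ValueError if it is malformed."""
    parts = text.split(":")
    # Mandatory prefix URN:LSID: plus three required fields and an optional revision.
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty field in LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)

genbank_id = parse_lsid("URN:LSID:ncbi.nlm.nih.gov:genbank:AF271072:1")
pubmed_id = parse_lsid("URN:LSID:ncbi.nlm.nih.gov:pubmed:12571434")
print(genbank_id)
print(pubmed_id)
```

The two calls use the examples from the slide; the second has no Revision-ID, so its `revision` field is `None`.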

LSID: a single naming scheme • One standard naming scheme • Named data is unique • Data integrity is maintained • Breaking down of “data silos” • Names are no longer useful only in a specific proprietary context • Integrate any data source using a standard naming scheme • A single LSID protocol replaces proprietary, source-specific programs • Access to more data • Integrate data across discovery and development cycles • Metadata features • Standard access to specific data allows them to be related semantically with ease; these semantic links can lead to new insights

LSID-Enabled Applications • LaunchPad • BioHaystack

LaunchPad • Takes an LSID • Resolves it • Attempts to match the resolved data with the local applications one uses to process or view it

YeastHub (a semantic web approach to yeast data integration) • Collaboration between YCMI and the Gerstein Lab: Kevin Yip, Andrew Smith, Andy Masiar, Remko deKnikker • Accepted for publication and presentation at ISMB 2005

Yeast Genome Data • The budding yeast Saccharomyces cerevisiae was the first eukaryote to have its genome fully sequenced • It is easy to manipulate genetically, and many of its genes are strikingly similar to human genes • It has been studied extensively through a wide range of biological experiments (e.g., microarray experiments) • A large variety of yeast genome data (e.g., gene expression data) have been made available through many resources (e.g., SGD, MIPS, YPD, TRIPLES, Yeast World, etc.) • Integration of such a variety of yeast data can facilitate whole-genome analysis

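Data Conversion and Integration • Sources: Resource 1, Resource 2, …, Resource n (e.g., <xml> … </xml>) • Conversion tools: DOM/SAX, DB-specific tool, XSLT • Converted output: RDF 1, RDF 2, …, RDF n • Storage: RDF/DB (Sesame) • Query: RDQL • Consumers: Users/Agents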
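As a rough sketch of the conversion step, the code below turns tab-delimited text into N-Triples lines, one triple per non-empty cell. It is not YeastHub's converter: the `http://example.org/yeast/` namespace, the function name, the column names, and the sample row are all assumptions for illustration, and column names are assumed to be safe to embed in a URI.

```python
import csv
import io

BASE = "http://example.org/yeast/"  # hypothetical namespace, not a YeastHub URI

def rows_to_ntriples(tsv_text, id_column):
    """Convert tab-delimited text to N-Triples: each row is a subject, each column a predicate."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        subject = f"<{BASE}gene/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # the id column names the subject; empty cells yield no triple
            # Escape backslashes and double quotes so the literal stays valid N-Triples.
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} <{BASE}terms/{column}> "{literal}" .')
    return lines

# Illustrative sample only: one ORF with its gene name and chromosome.
sample = "orf\tgene_name\tchromosome\nYAL001C\tTFC3\tI\n"
ntriples = rows_to_ntriples(sample, "orf")
print("\n".join(ntriples))
```

Output in this line-based form can be loaded by RDF stores that accept N-Triples, at which point data from different resources can be queried together.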

Two Levels of RDF Description • Resource description • Data description

Resource Description (Use of Dublin Core Metadata)

Metadata Example

RDF Modeling of Tabular Data

Data Conversion

RDF Example

Query Form

RQL Syntax and Query Results

Semantic Web Technologies Employed in YeastHub • RDF Site Summary (RSS) • D2RQ (mapping from relational databases to RDF) • Semantic Web Database (Sesame) • RDF Query Languages (e.g., RQL and SeRQL)

Tool Interoperation

An Example Scenario • Comparative genomics

Manual Interoperation

A Better Way of Interoperation

A Better Way of Interoperation (cont’d)

Web Services • “Creating a Bioinformatics Nation” (Lincoln Stein)

Web Services • UDDI • WSDL • SOAP
