RDF based on Integration of Pathway Database and Gene Ontology. SNU OOPSLA LAB. 2005 DongHyuk Im. Contents. Introduction Pathway Database Enzyme Database Gene Ontology Related Works Our Approach Supporting Function Data Transformation Integration of KEGG, Enzyme, Gene Ontology
RDF based on Integration of Pathway Database and Gene Ontology SNU OOPSLA LAB. 2005 DongHyuk Im
Contents • Introduction • Pathway Database • Enzyme Database • Gene Ontology • Related Works • Our Approach • Supporting Function • Data Transformation • Integration of KEGG, Enzyme, Gene Ontology • Querying using SeRQL
Pathway? • Most chemical reactions in a cell convert one compound (the substrate) into another compound (the product) through the action of an enzyme; a pathway is a chain of such reactions • Importance • Comparing and analyzing pathways helps us understand how compounds are created and how organisms are related evolutionarily • Drug discovery
Pathway Map : Glycolysis / Gluconeogenesis Map : Aquifex aeolicus
Enzyme Database • EC number • Recommended name • Alternative names (if any) • Catalytic activity • Cofactors (if any) • Pointers to the SWISS-PROT entries that correspond to the enzyme (if any) • Pointers to diseases associated with a deficiency of the enzyme (if any)
Enzyme Hierarchy • An EC number has four levels • Ex) 1.1.1.1 is a member of the top-level group [1] • The leftmost number identifies the highest level • Siblings such as [2.4.2.3] and [2.4.2.4] catalyze similar reactions in a pathway
Tree (root [*]):
[*] : [1] [2] [3]
[2] : [2.1] [2.2] [2.3]
[2.2] : [2.2.1] [2.2.2] [2.2.3]
[2.2.2] : [2.2.2.1] [2.2.2.2] [2.2.2.3]
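The four-level hierarchy can be made concrete with a small Python sketch. It is only an illustration, not the system's code: the function names are hypothetical, and it assumes fully numeric four-part EC numbers (real EC entries may also contain provisional or incomplete components, which this sketch rejects).

```python
def ec_levels(ec: str) -> list[str]:
    """Split an EC number such as '2.4.2.3' into its four level components."""
    parts = ec.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a complete four-level EC number: {ec!r}")
    return parts


def ec_ancestors(ec: str) -> list[str]:
    """Return the enclosing groups from the top level down, e.g. ['2', '2.4', '2.4.2']."""
    parts = ec_levels(ec)
    return [".".join(parts[:i]) for i in range(1, 4)]


def are_siblings(a: str, b: str) -> bool:
    """Two distinct EC numbers are siblings when they share the same third-level parent."""
    return a != b and ec_ancestors(a)[-1] == ec_ancestors(b)[-1]


if __name__ == "__main__":
    print(ec_ancestors("1.1.1.1"))
    print(are_siblings("2.4.2.3", "2.4.2.4"))
```

Storing the ancestors as dotted prefixes keeps the hierarchy implicit in the identifier itself, so no separate parent table is needed for this kind of lookup.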
KEGG • To computerize all aspects of cellular functions in terms of the pathway of interacting molecules or genes • To maintain gene catalogs for all organisms and link each gene product to a pathway component • To organize a database of all chemical compounds in the cell and link each compound to a pathway component • To develop computational technologies for pathway comparison, reconstruction, and analysis
Why RDF Integration? • Pathway data model : directed graph • RDF data model : directed labeled graph • RDF is therefore a good model for representing pathways • Integrating the multiple knowledge sources available on the Internet is one of the major problems for biologists • RDF provides a common standard model for all of them • Enzyme, GO : hierarchical structure • RDF is a good model for representing hierarchies • GO annotation is important • Enzymes (proteins) in a given pathway need GO annotation
Related Works • KEGG: Kyoto Encyclopedia of Genes and Genomes, 1999, Nucleic Acids Res. • YeastHub: a semantic web use case for integrating data in the life sciences domain, 2005, Bioinformatics • LIGAND: database of chemical compounds and reactions in biological pathways, 2002, Nucleic Acids Res. • Gene Ontology: tool for the unification of biology, the Gene Ontology Consortium, 2000, Nature Genetics.
Supported Functions • Already provided by KEGG • Search compound • Path prediction • Search enzyme • Functions our system adds • Integration query (pathway + enzyme + GO) • Relaxation query using the GO hierarchy • Searching pathways using enzyme information
Search Compounds target Compound : C00668
Pathway Prediction Tool compound Relaxation query using enzyme hierarchy
Search Enzyme Enzyme : 5.3.1.9
From Pathway to Gene Ontology Select enzyme
Data Translation for Integration
Figure labels: KGML Data, XSLT, KEGG RDF Data, Adding GO ID, Enzyme RDF Data, GO RDF Data, GENOS Storage
XSLT : http://www.w3.org/2005/02/13-KEGG/
KEGG RDF Data (1/2)

Gene entry
<k:entry>
  <Gene rdf:nodeID="_1">
    <k:name rdf:resource="http://www.w3.org/2005/02/13-KEGG/aae#aq_186"/>
    <k:reaction rdf:resource="http://www.w3.org/2005/02/13-KEGG/rn#R00710"/>
    <k:link rdf:resource="http://www.genome.jp/dbget-bin/www_bget?aae+aq_186"/>
    <k:graphics>
      <Rectangle k:name="aldH1" k:fgcolor="#000000" k:bgcolor="#BFFFBF" k:x="170" k:y="1018" k:width="45" k:height="17"/>
    </k:graphics>
  </Gene>
</k:entry>

Enzyme entry (No information)
<k:entry>
  <Enzyme rdf:nodeID="_3">
    <k:name rdf:resource="http://www.w3.org/2005/02/13-KEGG/ec#1.2.1.5"/>
    <k:graphics>
      <Rectangle k:name="1.2.1.5" k:fgcolor="#000000" k:bgcolor="#FFFFFF" k:x="170" k:y="1039" k:width="45" k:height="17"/>
    </k:graphics>
  </Enzyme>
</k:entry>

Compound entry
<k:entry>
  <Compound rdf:nodeID="_4">
    <k:name rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00033"/>
    <k:link rdf:resource="http://www.genome.jp/dbget-bin/www_bget?compound+C00033"/>
    <k:graphics>
      <Circle k:name="C00033" k:fgcolor="#000000" k:bgcolor="#FFFFFF" k:x="102" k:y="971" k:width="8" k:height="8"/>
    </k:graphics>
  </Compound>
</k:entry>
KEGG RDF Data (2/2)

Relation
<k:relation>
  <ECrel>
    <k:entry1 rdf:resource="_42"/>
    <k:entry2 rdf:resource="_48"/>
    <compound rdf:resource="_88"/>
  </ECrel>
</k:relation>

Reaction
<k:reaction reversible="" rdf:about="http://www.w3.org/2005/02/13-KEGG/rn#R00710">
  <k:substrate rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00084"/>
  <k:product rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00033"/>
</k:reaction>
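To show how a reaction element of this shape turns into substrate-to-product edges of a pathway graph, here is a minimal Python sketch using only the standard library. The `k:` namespace URI is not given on the slides, so `K_NS` below is a hypothetical placeholder, and the wrapping `rdf:RDF` element is added only to make the sample a complete document.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
K_NS = "http://example.org/kegg-ns#"  # hypothetical: the slides do not state the k: namespace URI

# The reaction element from the slide, wrapped in rdf:RDF so that it parses on its own.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:k="{K_NS}">
  <k:reaction reversible="" rdf:about="http://www.w3.org/2005/02/13-KEGG/rn#R00710">
    <k:substrate rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00084"/>
    <k:product rdf:resource="http://www.w3.org/2005/02/13-KEGG/cpd#C00033"/>
  </k:reaction>
</rdf:RDF>"""


def local_name(uri: str) -> str:
    """Return the part of a URI after the last '#', e.g. 'C00033'."""
    return uri.rsplit("#", 1)[-1]


def reaction_edges(rdf_xml: str) -> list[tuple[str, str, str]]:
    """Return (substrate, reaction, product) triples for every described reaction."""
    root = ET.fromstring(rdf_xml)
    edges = []
    for reaction in root.iter(f"{{{K_NS}}}reaction"):
        about = reaction.get(f"{{{RDF_NS}}}about")
        if about is None:
            # A k:reaction that only points at a reaction (rdf:resource) carries no edges.
            continue
        substrates = [s.get(f"{{{RDF_NS}}}resource") for s in reaction.findall(f"{{{K_NS}}}substrate")]
        products = [p.get(f"{{{RDF_NS}}}resource") for p in reaction.findall(f"{{{K_NS}}}product")]
        for s in substrates:
            for p in products:
                edges.append((local_name(s), local_name(about), local_name(p)))
    return edges


if __name__ == "__main__":
    print(reaction_edges(SAMPLE))
```

A full RDF parser would be the more robust choice for real data; plain XML parsing is used here only to keep the sketch dependency-free.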
How to Process KEGG Pathway • Problem • GENOS (Sesame) does not support multiple graphs • KEGG data consists of multiple documents • Ex) map00010.rdf, aae00010.rdf … • Solution • Using namespaces, we can distinguish maps • When storing pathway data, the pathway's map name is added as a namespace in the resources table of GENOS
Processing Pathway Data
<k:Pathway k:org="aae" k:number="00010" k:title="Glycolysis / Gluconeogenesis">
  …. ….
  <k:entry>
    <Gene rdf:nodeID="_1">
      <k:name rdf:resource="http://www.w3.org/2005/02/13-KEGG/aae#aq_186"/>
      <k:reaction rdf:resource="http://www.w3.org/2005/02/13-KEGG/rn#R00710"/>
      <k:link rdf:resource="http://www.genome.jp/dbget-bin/www_bget?aae+aq_186"/>
      <k:graphics>
        <Rectangle k:name="aldH1" k:fgcolor="#000000" k:bgcolor="#BFFFBF" k:x="170" k:y="1018" k:width="45" k:height="17"/>
      </k:graphics>
    </Gene>
  </k:entry>
Figure labels: conflict, triples table of GENOS, resources table of GENOS
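The node-ID conflict between documents can be illustrated with a small in-memory Python sketch. It does not reproduce the GENOS tables: the `#` separator, the rule that local IDs start with `_`, and the content of the `map00010` entry are all assumptions made for the example.

```python
Triple = tuple[str, str, str]


def qualify_node_id(map_name: str, node_id: str) -> str:
    """Prefix a document-local node ID with its pathway map name, e.g. 'aae00010#_1'."""
    return f"{map_name}#{node_id}"


def merge_documents(documents: dict[str, list[Triple]]) -> set[Triple]:
    """Merge per-map triples into one store, qualifying local node IDs by map name."""
    merged: set[Triple] = set()
    for map_name, triples in documents.items():
        for s, p, o in triples:
            # Assumption: document-local node IDs start with '_'; full URIs are already global.
            if s.startswith("_"):
                s = qualify_node_id(map_name, s)
            if o.startswith("_"):
                o = qualify_node_id(map_name, o)
            merged.add((s, p, o))
    return merged


DOCS = {
    # Taken from the slide: node _1 in aae00010 is the gene aq_186.
    "aae00010": [("_1", "k:name", "http://www.w3.org/2005/02/13-KEGG/aae#aq_186")],
    # Hypothetical: a second document that reuses the same local node ID.
    "map00010": [("_1", "k:name", "http://www.w3.org/2005/02/13-KEGG/ec#1.2.1.5")],
}

if __name__ == "__main__":
    for triple in sorted(merge_documents(DOCS)):
        print(triple)
```

Without the prefix, both `_1` nodes would collapse into a single resource carrying two unrelated names, which is the conflict the slide points at.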
Integrating Databases Enzyme number GO ID
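The join between enzyme numbers and GO IDs can be sketched in Python as a simple lookup. The function name is hypothetical and the GO identifier in the example is a placeholder, not a real annotation; only the entry `_3` with EC number 1.2.1.5 comes from the slides.

```python
def annotate_with_go(pathway_enzymes, ec_to_go):
    """Attach GO IDs to each (entry_id, ec_number) pair.

    Enzymes without a mapping get an empty list rather than being dropped,
    so the pathway stays complete after integration.
    """
    return {
        entry_id: {"ec": ec, "go": list(ec_to_go.get(ec, []))}
        for entry_id, ec in pathway_enzymes
    }


# Entry _3 / EC 1.2.1.5 appears on the KEGG RDF slide; entry _9 is hypothetical.
PATHWAY_ENZYMES = [("_3", "1.2.1.5"), ("_9", "5.3.1.9")]
# Placeholder mapping: "GO:PLACEHOLDER" stands in for whatever the real EC-to-GO table holds.
EC_TO_GO = {"1.2.1.5": ["GO:PLACEHOLDER"]}

if __name__ == "__main__":
    print(annotate_with_go(PATHWAY_ENZYMES, EC_TO_GO))
```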
Relaxation Querying using SeRQL
Figure labels: C1, C2, E1, E1.*, subclassOf
SeRQL: SELECT C1, C2 FROM Path_EXP WHERE E1 LIKE "1.*"
• Dewey order • Ex) 1.1 and 1.2 are children of 1 • Relaxation uses the prefix of the enzyme number
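The prefix-based relaxation can be sketched in Python, independent of SeRQL. The function names are hypothetical; the point of the sketch is that a Dewey prefix must be matched on whole components, so that relaxing to `1` does not also pick up enzymes under `11`.

```python
def relax(ec: str, levels: int = 1) -> str:
    """Drop the last `levels` components of an EC number and return the remaining prefix."""
    parts = ec.split(".")
    if not 0 < levels < len(parts):
        raise ValueError(f"cannot relax {ec!r} by {levels} level(s)")
    return ".".join(parts[:-levels])


def matches(ec: str, prefix: str) -> bool:
    """True when `ec` lies in the subtree rooted at `prefix` in Dewey order."""
    # The trailing '.' keeps '1' from matching '11.1.1.1'.
    return ec == prefix or ec.startswith(prefix + ".")


def relaxed_candidates(target: str, enzymes: list[str], levels: int = 1) -> list[str]:
    """Return the enzymes that share the relaxed prefix of `target`, excluding `target` itself."""
    prefix = relax(target, levels)
    return [e for e in enzymes if e != target and matches(e, prefix)]


if __name__ == "__main__":
    known = ["5.3.1.9", "5.3.1.8", "5.3.2.1", "11.1.1.1"]
    print(relaxed_candidates("5.3.1.9", known))            # siblings under 5.3.1
    print(relaxed_candidates("5.3.1.9", known, levels=2))  # everything under 5.3
```

This mirrors the `LIKE "1.*"` pattern on the slide, where the dot before the wildcard plays the same role as the `prefix + "."` check here.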
Considering Performance
KEGG : Pathway List (Genes / Map), using genes_index
aae:aq_018   path:aae03010
aae:aq_020   path:aae03010
aae:aq_021   path:aae00400
….
eco:b1236   path:eco00052
eco:b1236   path:eco00500
eco:b1236   path:eco00520
….
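A gene-to-pathway index of this kind can be sketched in Python as a dictionary built once from the pathway list, so that later lookups avoid scanning every pathway document. The function name and the in-memory representation are assumptions; the sample rows are the ones shown on the slide.

```python
from collections import defaultdict

# Rows copied from the slide: gene identifier, then pathway map identifier.
PATHWAY_LIST = """\
aae:aq_018 path:aae03010
aae:aq_020 path:aae03010
aae:aq_021 path:aae00400
eco:b1236 path:eco00052
eco:b1236 path:eco00500
eco:b1236 path:eco00520
"""


def build_genes_index(text: str) -> dict[str, list[str]]:
    """Map each gene to the list of pathway maps it appears in, preserving input order."""
    index: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed rows
        gene, pathway = fields
        index[gene].append(pathway)
    return dict(index)


if __name__ == "__main__":
    genes_index = build_genes_index(PATHWAY_LIST)
    print(genes_index["eco:b1236"])
```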
Schedule • Implementation (~11/30) • Integrated Databases • Query Processor for pathway • Simple UI (Web :JSP) • Complete Paper (~12/10)