Comparative Data Analysis Ontology (CDAO) Francisco Prosdocimi, Brandon Chisham, Julie Thompson, Enrico Pontelli, Arlin Stoltzfus
Objectives
• Develop a framework to formalize knowledge in the evolutionary biology domain
• Formalize an ontology for comparative data analysis
    • Comparative Data Analysis Ontology (CDAO)
• Implement and evaluate the ontology

Motivation
• Interoperation
    • Ontologies formalize knowledge
    • Overcome ambiguities in data formats (e.g., the multiple interpretations of NEXUS)
    • Facilitate provably correct format conversions
• Reasoning
    • Beyond relational queries
    • Automated generation of format converters
    • Advanced reasoning required for workflow construction and validation
• Miscellaneous
    • Guide development of new data formats
    • Lingua franca for knowledge exchange
    • …
Structure of CDAO
• Current Focus:
    • Taxonomic units
    • Tree-like networks of relationships
    • Models of evolutionary changes

Structure of CDAO
• Core Components
    • Representation of Networks and Trees (e.g., NEXUS TREE Block)
    • Representation of Character Data (e.g., NEXUS CHARACTERS Block)
• Imported Components
    • Amino Acid Ontology
        • http://www.co-ode.org/ontologies/amino-acid
        • U. Manchester, 2006
    • Nucleotide Ontology
        • http://www.co-ode.org/ontologies/basic-bio/
CDAO: Core Components
• Network/Tree representation
    • Rooted and Unrooted Trees
    • Nodes
    • Edges
    • Sets of Nodes
[Figure: network/tree class diagram. Labels: topology, network, rooted tree, unrooted tree, node, edge, directed edge, Parent Node, Child Node, set of nodes, lineage, TU. Relations: is_a, part_of, has_ancestor, has_descendant, has_element, mrca_of, represents, has_annotation. Annotations: "Tree Procedure, Model…" and "Transformation, Length…".]
CDAO: Core Components
• Representation of a Directed Tree
[Figure: (a) example directed tree with nodes labeled A to E. Labels: Rooted_tree, has_root_node, Node, Node (Ancestral), Edge, Directed edge or branch, has_parent_Node, has_child_Node, has_descendant (min 2 Nodes), MRCA_Node, Lineage, Subtree, Edge Transformation (Character; Ancestor state, Derived state…).]
CDAO: Core Components
• Annotations
    • Edge Annotations
        • Length
        • Transformation
    • Model Description
        • Gap Cost
        • Substitution Model
    • TU Annotation
        • Taxonomic Link
    • Tree Annotation
        • Tree Procedure
[Figure: edge-annotation diagram. Labels: EdgeAnnotation, transform_character, character state transformation, character state, has_left_node, has_left_state, has_right_node, has_right_state.]
CDAO: Core Components
• Character State Data Matrix
    • Character
    • Taxonomic Units
    • Datum
    • State
[Figure: character state data matrix class diagram. Classes: character state data matrix, taxonomic unit, character, character state datum, character state, node, coordinate system, compound, amino acid, nucleotide, discrete, continuous. Relations: part_of, belongs_to, has datum, has annotation, has, represented by, has coordinate, is_a, is_transformation_of. Annotations: "Alignment procedures…" and "TAXID, DB-XREF…".]
Implementation Details
• Formalization
    • OWL 1.1
• Tools
    • Protégé 4 [edit]
    • Swoop 2.3 [validation]
    • C++ and Perl+Prolog translators
    • Swoop 2.3 [reasoning]
    • Pellet [reasoning]
    • Fact++ [reasoning]
Preliminary Evaluation
• We are reaching the stage where concrete evaluation is possible
    • NEXUS converters
• We ran into several stumbling blocks
    • A good formalization of CDAO requires sophisticated features (OWL 1.1)
    • Most reasoning engines have not yet reached OWL 1.1 (even if they claim to…)
Some Examples
• Simple NEXUS file

#NEXUS
BEGIN TAXA;
    DIMENSIONS ntax=10;
    TAXLABELS
        Arabidopsis_thaliana_AAD31363.1
        Arabidopsis_thaliana_CAB79970.1
        Oryza_sativa_BAB21282.1
        Dictyostelium_discoideum_AAO51107.1
        Caenorhabditis_elegans_CAA92686.1
        Drosophila_melanogaster_AAF55117.1
        Drosophila_melanogaster_AAF55115.1
        Mus_musculus_BAB61955.1
        Saccharomyces_cerevisiae_AAB68881.1
        Schizosaccharomyces_pombe_CAB16373.1;
END;
BEGIN CHARACTERS;
    TITLE dna;
    LINK taxa=PF00137_47;
    DIMENSIONS nchar=10;
    FORMAT datatype=dna gap=- missing=?;
    MATRIX
        Arabidopsis_thaliana_CAB79970.1       gtgtggttgc
        Schizosaccharomyces_pombe_CAB16373.1  tgtatatgct
        Drosophila_melanogaster_AAF55117.1    tgtacttcgt
        Arabidopsis_thaliana_AAD31363.1       gt---gtggc
        Oryza_sativa_BAB21282.1               ct--------
        Saccharomyces_cerevisiae_AAB68881.1   tgtacaagct
        Mus_musculus_BAB61955.1               tctgctacac
        Dictyostelium_discoideum_AAO51107.1   cacttactcc
        Caenorhabditis_elegans_CAA92686.1     tgttttacat
        Drosophila_melanogaster_AAF55115.1    ac------g-
    ;
END;
BEGIN TREES;
    TREE con_50_majrule = (((Arabidopsis_thaliana_AAD31363.1:0.004496,Arabidopsis_thaliana_CAB79970.1:0.009539)inode15:0.090479,Oryza_sativa_BAB21282.1:0.043596)inode14:0.219708,(Dictyostelium_discoideum_AAO51107.1:0.341768,(((Caenorhabditis_elegans_CAA92686.1:0.308884,(Drosophila_melanogaster_AAF55117.1:0.128132,Drosophila_melanogaster_AAF55115.1:0.384443)inode20:0.236060)inode19:0.093887,Mus_musculus_BAB61955.1:0.243982)inode18:0.150844,(Saccharomyces_cerevisiae_AAB68881.1:0.235101,Schizosaccharomyces_pombe_CAB16373.1:0.261646)inode21:0.225955)inode17:0.189073)inode16:0.127974)root;
END;
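To make the link between the TREES block and the node/edge view of CDAO concrete, the sketch below reads a Newick string and lists its parent-to-child edges with branch lengths. It is an illustration only, not the C++ or Perl+Prolog translators mentioned in the slides; the function name `parse_newick` and the `anon<N>` labels given to unnamed internal nodes are assumptions, and quoted labels and comments are not handled.

```python
def parse_newick(text):
    """Parse a Newick string into (root_label, edges).

    edges is a list of (parent_label, child_label, branch_length) tuples;
    branch_length is None when the string gives no length.
    Hypothetical helper: handles plain labels only (no quotes, no comments).
    """
    text = text.strip().rstrip(";")
    pos = 0
    edges = []
    anonymous = 0  # counter for internal nodes without a label

    def parse_node():
        nonlocal pos, anonymous
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                    continue
                pos += 1  # consume ")"
                break
        # Read "label:length" up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        label, _, length = text[start:pos].partition(":")
        label = label.strip()
        if not label:
            anonymous += 1
            label = f"anon{anonymous}"
        for child_label, child_length in children:
            edges.append((label, child_label, child_length))
        return label, (float(length) if length.strip() else None)

    root, _ = parse_node()
    return root, edges


# Excerpt of the consensus tree above (the subtree rooted at inode14).
subtree = ("((Arabidopsis_thaliana_AAD31363.1:0.004496,"
           "Arabidopsis_thaliana_CAB79970.1:0.009539)inode15:0.090479,"
           "Oryza_sativa_BAB21282.1:0.043596)inode14;")
root, edges = parse_newick(subtree)
for parent, child, length in edges:
    print(parent, "->", child, length)
```

Each tuple corresponds to one directed edge with a parent node, a child node, and a length annotation.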
Some Examples
• Node:

<cdao:Node rdf:ID="node_inode15">
    <cdao:part_of rdf:resource="#Tree"/>
    <cdao:belongs_to_Edge rdf:resource="#edge_inode15_inode14" />
    <cdao:belongs_to_Edge rdf:resource="#edge_Arabidopsis_thaliana_CAB79970_1_inode15" />
    <cdao:belongs_to_Edge rdf:resource="#edge_Arabidopsis_thaliana_AAD31363_1_inode15" />
    <cdao:belongs_to_Edge_as_Child rdf:resource="#edge_inode15_inode14" />
    <cdao:belongs_to_Edge_as_Parent rdf:resource="#edge_Arabidopsis_thaliana_CAB79970_1_inode15" />
    <cdao:belongs_to_Edge_as_Parent rdf:resource="#edge_Arabidopsis_thaliana_AAD31363_1_inode15" />
    <cdao:nca_node_of rdf:resource="#set_nca_44"/>
</cdao:Node>

• Directed_Edge:

<cdao:Directed_Edge rdf:ID="edge_Arabidopsis_thaliana_CAB79970_1_inode15">
    <cdao:part_of rdf:resource="#Tree"/>
    <cdao:has_Parent_Node rdf:resource="#node_inode15"/>
    <cdao:has_Child_Node rdf:resource="#node_Arabidopsis_thaliana_CAB79970_1"/>
    <cdao:has_Annotation rdf:resource="#edge_Arabidopsis_thaliana_CAB79970_1_inode15_length"/>
</cdao:Directed_Edge>
<cdao:Edge_Length rdf:ID="edge_Arabidopsis_thaliana_CAB79970_1_inode15_length">
    <cdao:has_Value rdf:datatype="&xsd;float"> 0.009539 </cdao:has_Value>
</cdao:Edge_Length>
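The Directed_Edge and Edge_Length instances above follow a regular naming pattern (`edge_<child>_<parent>`, with dots in labels replaced by underscores), so they can be generated mechanically from a (parent, child, length) triple. The sketch below shows one way to do this with Python's standard `xml.etree.ElementTree`. The `cdao` namespace URI is taken from the query example later in the slides and is an assumption for this fragment, and the helper names `safe_id` and `directed_edge` are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: namespace URI borrowed from the query example in these slides.
CDAO = "http://www.cs.nmsu.edu/~epontell/CURRENT_matrix.owl#"
XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"

ET.register_namespace("rdf", RDF)
ET.register_namespace("cdao", CDAO)


def safe_id(label):
    """Turn a NEXUS label into the identifier style used above (dots -> underscores)."""
    return label.replace(".", "_")


def directed_edge(parent, child, length):
    """Build Directed_Edge and Edge_Length elements for one tree branch."""
    edge_id = f"edge_{safe_id(child)}_{safe_id(parent)}"
    length_id = f"{edge_id}_length"

    edge = ET.Element(f"{{{CDAO}}}Directed_Edge", {f"{{{RDF}}}ID": edge_id})
    ET.SubElement(edge, f"{{{CDAO}}}part_of", {f"{{{RDF}}}resource": "#Tree"})
    ET.SubElement(edge, f"{{{CDAO}}}has_Parent_Node",
                  {f"{{{RDF}}}resource": f"#node_{safe_id(parent)}"})
    ET.SubElement(edge, f"{{{CDAO}}}has_Child_Node",
                  {f"{{{RDF}}}resource": f"#node_{safe_id(child)}"})
    ET.SubElement(edge, f"{{{CDAO}}}has_Annotation",
                  {f"{{{RDF}}}resource": f"#{length_id}"})

    edge_length = ET.Element(f"{{{CDAO}}}Edge_Length", {f"{{{RDF}}}ID": length_id})
    value = ET.SubElement(edge_length, f"{{{CDAO}}}has_Value",
                          {f"{{{RDF}}}datatype": XSD_FLOAT})
    value.text = str(length)
    return edge, edge_length


edge, edge_length = directed_edge("inode15", "Arabidopsis_thaliana_CAB79970.1", 0.009539)
print(ET.tostring(edge, encoding="unicode"))
print(ET.tostring(edge_length, encoding="unicode"))
```

The serialized output is a standalone fragment; a full document would wrap such elements in an `rdf:RDF` root.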
Some Examples
• TU

<cdao:TU rdf:ID="Caenorhabditis_elegans_CAA92686_1">
    <cdao:belongs_to_Character_State_Data_Matrix rdf:resource="#Matrix"/>
    <cdao:represented_by_Node rdf:resource="#node_Caenorhabditis_elegans_CAA92686_1"/>
    <cdao:has_Nucleotide_Datum rdf:resource="#datum_Caenorhabditis_elegans_CAA92686_1_char_0"/>
    <cdao:has_Nucleotide_Datum rdf:resource="#datum_Caenorhabditis_elegans_CAA92686_1_char_1"/>
    <cdao:has_Nucleotide_Datum rdf:resource="#datum_Caenorhabditis_elegans_CAA92686_1_char_2"/>
    …
</cdao:TU>

• Character

<cdao:Nucleotide_Character rdf:ID="char_2">
    <cdao:belongs_to_Character_State_Data_Matrix rdf:resource="#Matrix"/>
    <cdao:has_Nucleotide_Datum rdf:resource="#datum_Oryza_sativa_BAB21282_1_char_2"/>
    <cdao:has_Nucleotide_Datum rdf:resource="#datum_Arabidopsis_thaliana_CAB79970_1_char_2"/>
    <cdao:has_Nucleotide_Datum rdf:resource="#datum_Mus_musculus_BAB61955_1_char_2"/>
    …
</cdao:Nucleotide_Character>

• Datum

<cdao:Nucleotide_State_Datum rdf:ID="datum_Caenorhabditis_elegans_CAA92686_1_char_6">
    <cdao:belongs_to_Character rdf:resource="#char_6"/>
    <cdao:belongs_to_TU rdf:resource="#Caenorhabditis_elegans_CAA92686_1"/>
    <cdao:has_Nucleotide_State rdf:resource="#value_a"/>
</cdao:Nucleotide_State_Datum>

• State

<cdao:Nucleotide rdf:ID="value_a">
    <owl:sameAs rdf:resource="#dA"/>
</cdao:Nucleotide>
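The TU, Character, and Datum instances above amount to one datum per matrix cell, keyed by TU and by a zero-based character index. A minimal sketch of that mapping follows. The identifier patterns `datum_<TU>_char_<i>` and `value_<symbol>` are read off the examples above; the state names `gap` and `missing` for `-` and `?` are assumptions, as are the helper names.

```python
def safe_id(label):
    """Turn a NEXUS label into the identifier style used above (dots -> underscores)."""
    return label.replace(".", "_")


def state_id(symbol):
    """Map a matrix symbol to a state identifier.

    "value_<symbol>" follows the example above; "gap" and "missing"
    are assumed names for the gap and missing-data symbols.
    """
    if symbol == "-":
        return "gap"
    if symbol == "?":
        return "missing"
    return f"value_{symbol.lower()}"


def matrix_to_data(matrix):
    """Expand {TU label: sequence} into one record per matrix cell."""
    data = []
    for tu_label, sequence in matrix.items():
        tu = safe_id(tu_label)
        for index, symbol in enumerate(sequence):  # characters are numbered from 0
            data.append({
                "id": f"datum_{tu}_char_{index}",
                "tu": tu,
                "character": f"char_{index}",
                "state": state_id(symbol),
            })
    return data


# One row of the CHARACTERS block from the NEXUS example.
data = matrix_to_data({"Caenorhabditis_elegans_CAA92686.1": "tgttttacat"})
print(data[6])
```

With this row, the record at index 6 matches the Datum instance shown above: character `char_6` with state `value_a`.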
Simple Reasoning Tasks
• Determine which TUs contain a gap in their tables [Fact++]:
    (has_Datum some (has_State value gap)) and TU
• Determine the ancestors of a TU in the tree:
    has_Descendant value node_Drosophila_melanogaster_AAF55115_1
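For comparison, the two class expressions above can be checked by hand against the NEXUS example. The sketch below is plain Python over dictionaries, not a description-logic reasoner: it assumes `has_Descendant` is the transitive closure of the parent-to-child relation, and the `PARENT` map is transcribed manually from the consensus tree.

```python
# Rows of the CHARACTERS block from the NEXUS example.
MATRIX = {
    "Arabidopsis_thaliana_CAB79970.1": "gtgtggttgc",
    "Schizosaccharomyces_pombe_CAB16373.1": "tgtatatgct",
    "Drosophila_melanogaster_AAF55117.1": "tgtacttcgt",
    "Arabidopsis_thaliana_AAD31363.1": "gt---gtggc",
    "Oryza_sativa_BAB21282.1": "ct--------",
    "Saccharomyces_cerevisiae_AAB68881.1": "tgtacaagct",
    "Mus_musculus_BAB61955.1": "tctgctacac",
    "Dictyostelium_discoideum_AAO51107.1": "cacttactcc",
    "Caenorhabditis_elegans_CAA92686.1": "tgttttacat",
    "Drosophila_melanogaster_AAF55115.1": "ac------g-",
}

# child -> parent, transcribed from the con_50_majrule tree.
PARENT = {
    "Arabidopsis_thaliana_AAD31363.1": "inode15",
    "Arabidopsis_thaliana_CAB79970.1": "inode15",
    "inode15": "inode14",
    "Oryza_sativa_BAB21282.1": "inode14",
    "inode14": "root",
    "Dictyostelium_discoideum_AAO51107.1": "inode16",
    "Caenorhabditis_elegans_CAA92686.1": "inode19",
    "Drosophila_melanogaster_AAF55117.1": "inode20",
    "Drosophila_melanogaster_AAF55115.1": "inode20",
    "inode20": "inode19",
    "inode19": "inode18",
    "Mus_musculus_BAB61955.1": "inode18",
    "inode18": "inode17",
    "Saccharomyces_cerevisiae_AAB68881.1": "inode21",
    "Schizosaccharomyces_pombe_CAB16373.1": "inode21",
    "inode21": "inode17",
    "inode17": "inode16",
    "inode16": "root",
}


def tus_with_gap(matrix, gap="-"):
    """TUs that have at least one datum whose state is the gap symbol."""
    return sorted(tu for tu, sequence in matrix.items() if gap in sequence)


def ancestors(node, parent):
    """All ancestors of a node, nearest first, by following parent links to the root."""
    path = []
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


print(tus_with_gap(MATRIX))
print(ancestors("Drosophila_melanogaster_AAF55115.1", PARENT))
```

A reasoner derives the same answers from the ontology axioms; the point of the sketch is only to show what the two expressions select in this data set.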
Simple Reasoning Tasks
• Extract the row of a specific TU:
    SELECT ?z, ?y
    WHERE (base:Arabidopsis_thaliana_AAD31363_1, cdao:has_Datum, ?x)
          (?x, cdao:has_State, ?y)
          (?x, cdao:belongs_to_Character, ?z)
    USING base FOR <file:/C:/Users/epontell/Documents/Research/Proposals/NEXUS/Research/Perl/inst_matrix.owl#>,
          cdao FOR <http://www.cs.nmsu.edu/~epontell/CURRENT_matrix.owl#>
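The query above joins three triple patterns on the shared variable `?x`. The sketch below reproduces that join over an in-memory list of (subject, predicate, object) tuples so the result shape is easy to see. It does not use a query engine; the triples are built by hand from the first three columns of two matrix rows, and the state name `gap` for the `-` symbol is an assumption.

```python
def build_triples(tu, sequence):
    """Hand-built triples for one matrix row (illustrative data only)."""
    triples = []
    for index, symbol in enumerate(sequence):
        datum = f"datum_{tu}_char_{index}"
        state = "gap" if symbol == "-" else f"value_{symbol}"  # "gap" is an assumed name
        triples.append((tu, "has_Datum", datum))
        triples.append((datum, "has_State", state))
        triples.append((datum, "belongs_to_Character", f"char_{index}"))
    return triples


def row_of(tu, triples):
    """Return sorted (character, state) pairs for one TU, mirroring SELECT ?z, ?y."""
    # Pattern 1: (tu, has_Datum, ?x)
    data = {o for s, p, o in triples if s == tu and p == "has_Datum"}
    # Pattern 2: (?x, has_State, ?y)
    state = {s: o for s, p, o in triples if p == "has_State" and s in data}
    # Pattern 3: (?x, belongs_to_Character, ?z)
    character = {s: o for s, p, o in triples
                 if p == "belongs_to_Character" and s in data}
    return sorted((character[d], state[d]) for d in data
                  if d in state and d in character)


# First three columns of two rows from the NEXUS example.
TRIPLES = (build_triples("Arabidopsis_thaliana_AAD31363_1", "gt-")
           + build_triples("Oryza_sativa_BAB21282_1", "ct-"))

print(row_of("Arabidopsis_thaliana_AAD31363_1", TRIPLES))
```

Only data attached to the requested TU survive the first pattern, which is why the second row does not appear in the result.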
Future Work
• To facilitate evaluation
    • Create an OWL 1.0 edition of the ontology (and corresponding NEXUS translator)
    • Java-level reasoning
        • Aggregation
        • Etc.
• Large-scale NEXUS validation
• NeXML Interface
• OBO distribution