Comparative Data Analysis Ontology (CDAO)
Arlin Stoltzfus
Center for Advanced Research in Biotechnology, 9600 Gudelsky Drive, Rockville, MD
Biochemical Science Division, National Institute of Standards and Technology
University of Maryland Biotechnology Institute
New Genome Sequence → Computational genome analysis → Useful information?
• Human genes
  • Does it vary in humans?
  • Is it implicated in disease?
• Potential pathogens
  • Does it make a toxin?
  • Will UV sterilization work?
• Any organism
  • Does it synthesize ascorbic acid?
  • Will it grow at high temperatures?
Genome analysis is comparative analysis . . . and comparative analysis is evolutionary biology
New Genome Sequence → Comparative Analysis (using a database with annotated genomes of other species) → Useful information?
The problem, restated
Genome sequences (99.99 % accurate) → Comparative analysis → Useful inferences (far less accurate)
Power comes from comparative analysis
Comparative analysis is an evolutionary problem
• Depends on a tree describing relationships
• Depends on representing dynamics of evolution
• Requires attention to uncertainty
Genome analysis can be improved by
• Facilitating tree-based analysis with better informatics
• Improving models of evolutionary change
• Incorporating prior knowledge
A bold generalization "It matters not at all whether you work with genetic elements, with viruses, bacteria, fungi, animals, or plants. The same principles apply if your subject is molecular evolution, the diversity of genetic systems, comparative morphology, physiology, ecology, or behaviour." (p. 7) Harvey, P. H., and M. D. Pagel. 1991. The Comparative Method in Evolutionary Biology. Oxford University Press, Oxford. What are these principles?
Principle 1: hierarchically structured data demand appropriate statistics
Example: Residue “conservation”
The “entropy” of each of the two alignment columns below is S = 1 bit:
Seq_1 D D
Seq_2 D E
Seq_3 D D
Seq_4 D E
Seq_5 E D
Seq_6 E E
Seq_7 E D
Seq_8 E E
Valdar, W. S. 2002. Scoring residue conservation. Proteins 48:227-241. Figure 1. Some example columns from different multiple alignments. Each labeled column represents a residue position in a multiple-sequence alignment . . .
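To make the entropy example concrete, here is a minimal sketch in Python that computes the Shannon entropy of the two example columns (D D D D E E E E and D E D E D E D E, read down Seq_1 to Seq_8). The function name `shannon_entropy` is chosen for this illustration and is not taken from Valdar's paper.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residue frequencies in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# The two columns from the slide, read from Seq_1 down to Seq_8.
column_1 = "DDDDEEEE"  # D in Seq_1..Seq_4, E in Seq_5..Seq_8
column_2 = "DEDEDEDE"  # D and E alternate from one sequence to the next

entropy_1 = shannon_entropy(column_1)
entropy_2 = shannon_entropy(column_2)
print(entropy_1, entropy_2)
```

Both columns contain four D and four E, so a frequency-based score cannot tell them apart, no matter how the residues are distributed across related sequences.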
Statistics for tree-related data
“How can one characterize a set of data collected from different biological species, or indeed any set of data related by an evolutionary tree? The structure imposed by the tree implies that the data are not independent, and for most applications this should be taken into account. We describe strategies for weighting the data that circumvent some of the problems of dependency.”
Altschul, S. F., R. J. Carroll, and D. J. Lipman. 1989. Weights for data related by a tree. J Mol Biol 207:647-653
Principle 2: evolution is the generating process
Because the non-independence arises via descent with modification, the proper framework for addressing hierarchy is to interpret it as an evolved pattern.
Let $r = \alpha + \beta$, where $\alpha$ is the rate of D → E change and $\beta$ the rate of E → D change; then over a branch of length $t$, $P(D \to E, t) = (\alpha/r)(1 - e^{-rt})$
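The transition probability on this slide can be sketched numerically. The example below assumes that the two symbols dropped from the formula are a forward rate (D → E, called `alpha` here) and a reverse rate (E → D, called `beta` here) of a two-state continuous-time Markov model; those names, and the function name `p_change`, are assumptions of this sketch.

```python
import math


def p_change(alpha, beta, t):
    """P(D -> E, t) = (alpha / r) * (1 - exp(-r * t)) with r = alpha + beta.

    alpha: assumed rate of D -> E change
    beta:  assumed rate of E -> D change
    t:     elapsed time (branch length)
    """
    r = alpha + beta
    return (alpha / r) * (1.0 - math.exp(-r * t))


# No time elapsed: no change is possible.
p_zero = p_change(1.0, 3.0, 0.0)

# Very long branch: the probability approaches the equilibrium value alpha / r.
p_long = p_change(1.0, 3.0, 50.0)

print(p_zero, p_long)
```

With `alpha = 1` and `beta = 3`, the long-branch limit is 1/4, which is the equilibrium frequency of state E under this model.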
Example: intron “loss vs. gain” problem
Observed intron presence (1) or absence (0): A 1, B 1, C 0, D 0, E 0, F 0
[Figure: “Possibilities” panels map alternative gain and loss events onto the tree of A–F; “Probabilities” panels plot the probability of presence (0 to 1) against distance from root (0 to max).]
Principle 3: the result is an inference with uncertainty that should be treated explicitly • assign uncertainties to inferences • provide explicit probability distribution Example from Huelsenbeck “The phylogeny is usually treated as known without error; this assumption is problematic because inferred phylogenies are subject to both stochastic and systematic errors.” Huelsenbeck, J. P., B. Rannala, and J. P. Masly. 2000. Science 288:2349-2350.
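One way to treat tree uncertainty explicitly is to average an inference over a set of candidate trees, weighting each tree by its probability. The sketch below illustrates only that arithmetic. The tree names and all numbers are hypothetical, and the function `marginal_probability` is a name invented for this example, not a method taken from Huelsenbeck et al.

```python
def marginal_probability(conditional_by_tree, posterior_by_tree):
    """Average P(inference | tree) over trees, weighted by P(tree | data)."""
    total = sum(posterior_by_tree.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("tree probabilities must sum to 1")
    return sum(
        posterior_by_tree[tree] * conditional_by_tree[tree]
        for tree in posterior_by_tree
    )


# Hypothetical posterior probabilities for three candidate trees.
posterior = {"tree_1": 0.6, "tree_2": 0.3, "tree_3": 0.1}

# Hypothetical probability of some inference (for example, an ancestral state)
# computed separately on each tree.
conditional = {"tree_1": 0.9, "tree_2": 0.5, "tree_3": 0.2}

averaged = marginal_probability(conditional, posterior)
print(averaged)
```

Using only the most probable tree would report 0.9 for this inference; averaging over the three trees gives a lower value that reflects the uncertainty in the phylogeny.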
Principle 3: Explicit treatment of uncertainty This principle is not followed in non-evolutionary approaches: “tree-based weighting schemes require more assumptions than those based on only the alignment. After all, many plausible trees can describe a single alignment. Choosing one, even if it is the most probable, introduces additional uncertainty and thus hidden complexity”. Valdar, W. S. 2002. Scoring residue conservation. Proteins 48:227-241.
Example: functional inference
presence: A 1, B 1, C 0, D 0, E 0, F 0
functional attribute: A 1, B 1, C ?, D ?, E 0, F 0
Let $r = \alpha + \beta$, then over a branch of length $t$, $P(0 \to 1, t) = (\alpha/r)(1 - e^{-rt})$
Evolutionary analysis in practice • Homologize characters • Discriminate character states • Assume or infer a phylogeny • Carry out tree-based analysis • Parameter estimation • State reconstruction • Model comparison • Correlation analysis
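Character-state data model
Components labeled in the figure: OTU (Operational Taxonomic Unit), Character, Data, Tree
Example: the “state” is Q (Glutamine) for “character” 13 (column 13) of “OTU” H_sapiens_4826964
[Figure: alignment with column 13 highlighted, showing the states Q Q Q Q E.]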
NEXUS
#NEXUS
BEGIN TAXA;
  DIMENSIONS ntax=26;
  TAXLABELS
    O_volvulus_AAB64227.1 O_volvulus_AAB64226.1 C_elegans_AAF39759.1 C_elegans_AAA83577.1
    S_cerevisiae_CAA89634.1 C_albicans_AAC12872.1 S_pombe_CAB57444.1 N_crassa_AAA63780.1
    M_musculus_AAA40121.1 C_capitata_AAA57249.1 D_virilis_CAA32060.1 D_erecta_AAF23595.1
    D_orena_AAF23594.1 D_teissieri_AAF23599.1 D_yakuba_AAF23598.1 D_melanogaster_AAF50095.1
    D_mauritiana_AAF23597.1 D_sechellia_AAF23596.1 D_simulans_CAA33720.1 Z_mays_AAB49913.1
    O_sativa_AAC14464.1 O_sativa_AAC14465.1 A_thaliana_AAF99769.1 P_tremuloides_AAD01605.1
    A_thaliana_BAB09468.1 A_thaliana_AAD29823.2;
END;
BEGIN CHARACTERS;
  DIMENSIONS nchar=30;
  FORMAT datatype=protein gap=- missing=?;
  CHARLABELS 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120;
  MATRIX
    M_musculus_AAA40121.1 QGTIHFEQKASGE--PVVLSGQITGLTE-G
    C_capitata_AAA57249.1 KGTVHFEQQDAKS--PVLVTGEVNGLAK-G
    N_crassa_AAA63780.1 KGTVIFEQESESA--PTTITYDISGNDPNA
    --stuff deleted here--
    D_simulans_CAA33720.1 KGTVFFEQESSGT--PVKVSGEVCGLAK-G
    S_cerevisiae_CAA89634.1 SGVVKFEQASESE--PTTVSYEIAGNSPNA
    S_pombe_CAB57444.1 SGVVTFEQVDQNS--QVSVIVDLVGNDANA;
END;
BEGIN ASSUMPTIONS;
  WTSET MySoapWeights (VECTOR) = 1 1 1 1 1 1 1 1 0.83 0.8 0.8 0.8 0.8 0.8 0.71 0.71 1 1 1 1 1 1 1 1 1 1 1 1 1 1;
END;
BEGIN TREES;
  TREE "Cu-Zn Superoxide Dismutase" =
    (((((O_volvulus_AAB64227.1:0.31741,O_volvulus_AAB64226.1:0.13498):
    0.20268[1],(C_elegans_AAF39759.1:0.14579,C_elegans_AAA83577.1:0.27311):0.2533[1]):0.12655[0.98],
    ((S_cerevisiae_CAA89634.1:0.28255,C_albicans_AAC12872.1:0.25631):0.08358[0.91],(S_pombe_CAB57444.1:
    0.3159,N_crassa_AAA63780.1:0.1635):0.11954[0.97]):0.17514[1]):0.08988[0.77],(M_musculus_AAA40121.1:
    0.49149,(C_capitata_AAA57249.1:0.18945,(D_virilis_CAA32060.1:0.11453,(((D_erecta_AAF23595.1:0.00661,
    D_orena_AAF23594.1:0.00769):0.00497[0.92],(D_teissieri_AAF23599.1:0.004,D_yakuba_AAF23598.1:0.01012):
    0.0073[0.87]):0.01271[0.88],(((D_melanogaster_AAF50095.1:0.00836,D_mauritiana_AAF23597.1:0.00552):
    0.00203[0.28],D_sechellia_AAF23596.1:0.01103):0.00398[0.7],D_simulans_CAA33720.1:0.00595):0.00739[0.75]):
    0.11795[1]):0.11754[1]):0.12932[1]):0.10326[1]):0.0712[0.9],(((((Z_mays_AAB49913.1:0.05142,
    O_sativa_AAC14464.1:0.09031):0.02799[0.98],O_sativa_AAC14465.1:0.06915):0.05245[0.99],
    (A_thaliana_AAF99769.1:0.17064,P_tremuloides_AAD01605.1:0.1075):0.08023[1]):0.08596[1],
    A_thaliana_BAB09468.1:0.46052):0.06401[0.75],A_thaliana_AAD29823.2:0.42442):0.14252[0.94]);
END;
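As a rough illustration of how the character-state model maps onto a NEXUS CHARACTERS block, the sketch below reads the CHARLABELS and MATRIX commands from a three-row excerpt of the file above and looks up a state by OTU and character label. It is a deliberately naive reader, not a NEXUS parser: it assumes one token per label, non-interleaved rows, and no comments or quoted names. The helpers `command_body` and `read_matrix` are hypothetical names for this example; real work would use a library such as NCL or Bio::NEXUS.

```python
import re

# Excerpt of the CHARACTERS block: three rows copied from the slide.
NEXUS_EXCERPT = """
BEGIN CHARACTERS;
DIMENSIONS nchar=30;
FORMAT datatype=protein gap=- missing=?;
CHARLABELS 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120;
MATRIX
M_musculus_AAA40121.1 QGTIHFEQKASGE--PVVLSGQITGLTE-G
C_capitata_AAA57249.1 KGTVHFEQQDAKS--PVLVTGEVNGLAK-G
N_crassa_AAA63780.1 KGTVIFEQESESA--PTTITYDISGNDPNA;
END;
"""


def command_body(block, keyword):
    """Return the text between `keyword` and the next semicolon."""
    match = re.search(rf"\b{keyword}\b(.*?);", block, flags=re.S | re.I)
    if match is None:
        raise ValueError(f"{keyword} command not found")
    return match.group(1)


def read_matrix(block):
    """Map each OTU to a {character label: state} dictionary."""
    labels = command_body(block, "CHARLABELS").split()
    tokens = command_body(block, "MATRIX").split()
    matrix = {}
    # Tokens alternate: OTU name, then its row of one-letter states.
    for otu, states in zip(tokens[0::2], tokens[1::2]):
        if len(states) != len(labels):
            raise ValueError(f"row length mismatch for {otu}")
        matrix[otu] = dict(zip(labels, states))
    return matrix


matrix = read_matrix(NEXUS_EXCERPT)
# One datum = one state for one character of one OTU.
print(matrix["M_musculus_AAA40121.1"]["91"])
```

Keying the lookup on the character label rather than on the column position keeps the original alignment coordinates (91 to 120) attached to the data.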
Visualization & editing: Nexplorer (www.molevol.org/nexplorer)
• Nice tutorial
• Thousands of pre-compiled data sets
• Standard format allows user-supplied data
• Publication-quality graphics
• Intuitive user interface
Character State Data (example from MacClade documentation)
#NEXUS
[!Data and tree from: Schluter, D. 1989. Pp. 79-95 in D.B. Wake and G. Roth, eds., Complex organismal functions: Integration and evolution in vertebrates. Wiley, N.Y. ]
BEGIN DATA;
  DIMENSIONS NTAX=14 NCHAR=5;
  FORMAT MISSING=? GAP=- ;
  CHARLABELS
    [1] Maxillary_tomia [2] lateral_groove [3] posterolateral_teeth [4] intercalary_ridge [5] maxillary_tomia;
  STATELABELS
    1 thick thin,
    2 deep shallow,
    3 sharp reduced,
    4 absent present,
    5 'round-edged' 'sharp-edged';
  MATRIX
    presumed_ancestor 00000
    Geospiza_difficilis 00000
    Geospiza_scandens 00000
    Geospiza_conirostris 00000
    Geospiza_magnirostris 00000
    Geospiza_fortis 00000
    Geospiza_fuliginosa 00000
    Camarhynchus_pallidus 11101
    Camarhynchus_heliobates 11101
    Camarhynchus_psittacula 11101
    Camarhynchus_pauper 11101
    Camarhynchus_parvulus 11101
    Platyspiza_crassirostris 11010
    Certhidea_olivacea 11101;
END;
BEGIN ASSUMPTIONS;
  OPTIONS DEFTYPE=unord PolyTcount=MINSTEPS ;
END;
BEGIN TREES;
  TRANSLATE
    1 presumed_ancestor,
    2 Geospiza_difficilis,
    3 Geospiza_scandens,
    4 Geospiza_conirostris,
    5 Geospiza_magnirostris,
    6 Geospiza_fortis,
    7 Geospiza_fuliginosa,
    8 Camarhynchus_pallidus,
    9 Camarhynchus_heliobates,
    10 Camarhynchus_psittacula,
    11 Camarhynchus_pauper,
    12 Camarhynchus_parvulus,
    13 Platyspiza_crassirostris,
    14 Certhidea_olivacea;
  TREE * UNTITLED = [&R] (1,(((2,(3,4),((5,6),7)),(((8,9),((10,11),12)),13)),14));
END;
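The coded rows in this example become readable once each symbol is looked up in STATELABELS. The sketch below does that for a single row, assuming the usual convention that the first state label of a character corresponds to symbol 0 and the second to symbol 1. The table `CHARACTERS` and the function `decode_row` are names made up for this illustration.

```python
# Labels copied from the CHARLABELS and STATELABELS commands of the example.
# Characters are keyed by number because labels 1 and 5 differ only in case.
CHARACTERS = {
    1: ("Maxillary_tomia", ("thick", "thin")),
    2: ("lateral_groove", ("deep", "shallow")),
    3: ("posterolateral_teeth", ("sharp", "reduced")),
    4: ("intercalary_ridge", ("absent", "present")),
    5: ("maxillary_tomia", ("round-edged", "sharp-edged")),
}

# Symbols declared in the FORMAT command (MISSING=? GAP=-).
SPECIAL = {"?": "missing", "-": "gap"}


def decode_row(symbols):
    """Translate one matrix row such as '11010' into labeled states."""
    if len(symbols) != len(CHARACTERS):
        raise ValueError("row length does not match NCHAR")
    decoded = {}
    for number, symbol in enumerate(symbols, start=1):
        label, states = CHARACTERS[number]
        # Assumption: symbol 0 is the first state label, symbol 1 the second.
        state = SPECIAL[symbol] if symbol in SPECIAL else states[int(symbol)]
        decoded[number] = (label, state)
    return decoded


finch = decode_row("11010")  # the row for Platyspiza_crassirostris
for number, (label, state) in finch.items():
    print(number, label, state)
```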
CDAO Project
• Wiki: http://www.evolutionaryontology.org/CDAO
• Artifacts: http://sourceforge.net/projects/cdao/
• Development team:
  • Enrico Pontelli (NMSU)
  • Brandon Chisham (NMSU)
  • Julie Thompson (U. Strasbourg, France)
  • Francisco Prosdocimi (U. Strasbourg, France)
  • Arlin Stoltzfus (CARB, NIST)
CDAO: development strategy
• Specification: Study use-cases to clarify scope
• Choice of representation: Choose language and development tools
• Ontology refinement
  • Conceptualization:
    • Identify terms from use cases, artefacts
    • Build concept glossary
    • Classify key concepts and relations
  • Implementation: Formalize the concepts and relations using the chosen language and tools
  • Evaluation: Test the ontology for its ability to represent data called for in the use cases, and to support reasoning
CDAO: development & evaluation tools
• OWL (Web Ontology Language)
  • Widely supported, emerging as a standard
  • Includes Description Logics concepts (OWL 1.1)
  • Has convenient RDF-XML file syntax
• Protégé 4 alpha
  • Supports OWL-DL
  • Nice graphical interface
  • Improving rapidly due to active user-developer community
  • Integrates Reasoners (Pellet, FaCT++)
• Racer (external reasoner)
• Ad hoc translators from NEXUS or NeXML to CDAO
  • NCL (C++) or Bio::NEXUS (Perl) libraries
CDAO: key concepts & relations
[Figure: concept map. Concepts: character state data matrix, TU, character, character state datum, state, transformation, tree, rooted tree, unrooted tree, topology, node, edge, directed edge. Relations: has, part_of, belongs_to, is_a, represents_TU, is_transformation_of, has left_state, has right_state, has child_node, has parent_node, has ancestor, has descendant, has left_node, has right_node, connects_to. Annotations: Alignment procedure…, taxonomic_link …, Tree Procedure Model…, Length…]
CDAO: tree concepts
a) [Figure: rooted tree with tips A–E illustrating Rooted_tree (has_root), Node, Ancestral Node, MRCA_Node (has_descendant, min 2 Nodes), Subtree, Lineage, Directed_Edge (has_Parent_Node, has_Child_Node), and EdgeTransformation.]
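The rooted-tree concepts named on this slide (parent links, lineages, and the most recent common ancestor) can be sketched with a plain dictionary of child-to-parent links. This is not CDAO itself and not OWL: the topology below is hypothetical (the slide's own tree shape is not recoverable here), and `lineage` and `mrca` are names chosen for this example.

```python
# Hypothetical rooted tree with tips A-E, stored as child -> parent links
# (the direction of a has_Parent_Node-style relation).
PARENT = {
    "A": "n1", "B": "n1",
    "n1": "n2", "C": "n2",
    "D": "n3", "E": "n3",
    "n2": "root", "n3": "root",
}


def lineage(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def mrca(*nodes):
    """Most recent common ancestor of two or more nodes."""
    if len(nodes) < 2:
        raise ValueError("an MRCA is defined here for at least two nodes")
    shared = set(lineage(nodes[0]))
    for node in nodes[1:]:
        shared &= set(lineage(node))
    # Walking from the first node toward the root, the first shared node
    # is the most recent one.
    return next(n for n in lineage(nodes[0]) if n in shared)


print(mrca("A", "B"))
print(mrca("A", "C"))
print(mrca("B", "E"))
```

An ontology adds to a sketch like this the ability to state such relations formally, so that a reasoner can check or infer them rather than each program re-implementing them.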
CDAO: tree concepts, continued
b) [Figure: unrooted tree with nodes labeled A–M illustrating Unrooted tree, Node, Edge (has Left_Node, has Right_Node, connects_to; Annotation: Length…), EdgeTransformation, RootedSubtree (has_Root), and TU (represents_TU; Annotation: taxonomic_link …).]
CDAO: how to find out more • talk to developers • view with Protégé • browse OWLdocs on term request server
CDAO: plans
CDAO is intended to be useful in solving problems (it's not intended as an educational tool)
Ontologies are useful for creating semantically rich computable representations, and for semantic transformation (translation) of other representations
• Two projects beginning in 2009
  • Support for MIAPA (Minimum Information About a Phylogenetic Analysis) standard
    • to cover various types of data (not just sequences)
    • to include meta-data on sources and methods
    • workflow description capacity leads on to bigger and better things . . .
  • Interoperability of Data resources
Acknowledgements
• Former Stoltzfus group members
  • Lev Yampolsky
  • Weigang Qiu
  • Vivek Gopalan
  • Tom Hladish
  • Chengzhi Liang
  • Peter Yang
• Support
  • CARB
  • NIH
  • NIST
  • National Evolutionary Synthesis Center
• Collaborators on CDAO project
  • Enrico Pontelli (NMSU)
  • Brandon Chisham (NMSU)
  • Julie Thompson (U. Strasbourg, France)
  • Francisco Prosdocimi (U. Strasbourg, France)
Outline • Introduction: comparative analysis and evolutionary analysis • Principles underlying evolutionary analysis • Use a tree • Use a model of change • Treat uncertainty explicitly • Generalized aspects of methodology used in evolutionary analysis • The character-state data model • NEXUS files • Examples • Infrastructure needs and the Evolutionary Informatics Working Group • Ongoing projects and plans • Bio::NEXUS, Nexplorer • Comparative Data Analysis Ontology