Evolutionary Informatics: Supporting Interoperability in Evolutionary Analysis
Fourth meeting
DB Interop Hackathon: Jim Balhoff, Lucie Chan, Dave Clements, Karen Cranston, Sam Donnelly, Vladimir Gapeyev, Karla Gendler, Vivek Gopalan, Roger Hyam, Mark Jensen, Greg Jordan, Matt Kosnik, Sheldon McKay, Ryan Scherle, Katja Schulz, Katja Seltmann, Jeet Sukumaran, Matt Yoder
NESCent staff: Hilmar Lapp, Todd Vision
Working Group Members: Jon Eisen (“phylogenomics”), Joe Felsenstein (PHYLIP), Mark Holder (GARLI), Sergei Kosakovsky Pond (HyPhy), Sudhir Kumar (MEGA), Paul Lewis (NCL), Aaron Mackey (BioPerl, GMOD), David Maddison (Mesquite), Wayne Maddison (Mesquite), Enrico Pontelli (CDAO), Andrew Rambaut (BEAST), Arlin Stoltzfus (Bio::NEXUS), David Swofford (PAUP*), Rutger Vos (Bio::Phylo), Xuhua Xia (DAMBE), Christian Zmasek (ATV, RIO)
WG colleagues: Brandon Chisham, Brian Devries, Gopal Gupta, Peter E. Midford, William Piel, Francisco Prosdocimi, Julie Thompson, Derrick Zwickl
Computational genome analysis
[Figure: New Genome Sequence → Useful information?]
• Human genes
  • Does it vary in humans?
  • Is it implicated in disease?
• Potential pathogens
  • Does it make a toxin?
  • Will UV sterilization work?
• Any organism
  • Does it synthesize ascorbic acid?
  • Will it grow at high temperatures?
Annotations (example GenBank record; the slide highlights the CDS feature)

    LOCUS       AB060655     4091 bp    DNA     linear   ROD 14-SEP-2001
    DEFINITION  Mus musculus Atp6f gene for 23-kDa subunit of V-ATPase,
                complete cds.
    ACCESSION   AB060655
    VERSION     AB060655.1  GI:14646762
    KEYWORDS    .
    SOURCE      Mus musculus (house mouse)
      ORGANISM  Mus musculus
                Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
                Sciurognathi; Muroidea; Muridae; Murinae; Mus.
    REFERENCE   1
      AUTHORS   Sun-Wada,G.H., Murakami,H., Nakai,H., Wada,Y. and Futai,M.
      TITLE     Mouse Atp6f, the gene encoding the 23-kDa proteolipid of vacuolar
                proton translocating ATPase
      JOURNAL   Gene 274 (1-2), 93-99 (2001)
       PUBMED   11675001
    REFERENCE   2  (bases 1 to 4091)
      AUTHORS   Wada,Y., Sun-Wada,G., Hideaki,M. and Masamitsu,F.
      TITLE     Direct Submission
      JOURNAL   Submitted (23-APR-2001) Yoh Wada, ISIR, Osaka University, Division
                of Biological Science; Mihogaoka 8-1, Ibaraki, Osaka 5670047, Japan
                (E-mail:yohwada@sanken.osaka-u.ac.jp, Tel:81-6-6879-8482,
                Fax:81-6-6875-5724)
    FEATURES             Location/Qualifiers
         source          1..4091
                         /organism="Mus musculus"
                         /mol_type="genomic DNA"
                         /strain="129Sv"
                         /db_xref="taxon:10090"
                         /chromosome="4"
                         /clone="225b09"
                         /clone_lib="Genome Systems"
         gene            483..3435
                         /gene="Atp6f"
         exon            483..595
                         /gene="Atp6f"
         CDS             join(529..595,1128..1176,1407..1490,1621..1698,1893..1962,
                         2086..2137,2472..2662,3125..3151)
                         /gene="Atp6f"
                         /codon_start=1
                         /product="23-kDa subunit of V-ATPase"
                         /protein_id="BAB61955.1"
                         /db_xref="GI:14646763"
                         /translation="MTGLELLYLGIFVAFWACMVVVGICYTIFDLGFRFDVAWFLTET
                         SPFMWSNLGIGLAISLSVVGAAWGIYITGSSIIGGGVKAPRIKTKNLVSIIFCEAVAI
                         YGIIMAIVISNMAEPFSATEPKAIGHRNYHAGYSMFGAGLTVGLSNLFCGVCVGIVGS
                         GAALADAQNPSLFVKILIVEIFGSAIGLFGVIVAILQTSRVKMGD"
         exon            1128..1176
                         /gene="Atp6f"
         exon            1407..1490
                         /gene="Atp6f"
         exon            1621..1698
                         /gene="Atp6f"
         exon            1893..1962
                         /gene="Atp6f"
         exon            2086..2137
                         /gene="Atp6f"
         exon            2472..2662
                         /gene="Atp6f"
         exon            3125..3435
                         /gene="Atp6f"
    ORIGIN
            1 gatcctatag ggcgaattgg agctccccgc ggtggcggcc gctctagaac tagtggatca
           61 cctggacatc gtgggcgttc gcgtctggca ttccacccta cctctgggtt ggaaaagaca
          121 acctagaatg acctccgatg aacagcaggc attagctagg caccgcgaaa tcctgcttca
          181 agcagaagga actaggcagg actagaacag accggaagga tctgcagtga ttggtgagta
          241 aactgggagt ccggtgggaa gttagggaac cagcagcgca ggtggagagc cagtacctgt
          301 cacggagaac gtccgacgaa actacaacca ccacagtgct ccgcggcatg acgtctacca
          . . .
         3901 ttacctaata agtccttttc agtcaacacc tttaggggtc ttacccagca ggcagccctg
         3961 gttggctgac cttgactcat gctcccagga aagagttggc aaggccctaa ccctctgaat
         4021 tgcccactat ccagaccccg tcccaaatac ctgaagggcc ttagccatcc ggctcctggt
         4081 ctcttcccat t
    //
Genome analysis is comparative analysis . . . and comparative analysis is evolutionary biology
[Figure: New Genome Sequence → Comparative Analysis → Useful information? Comparative analysis draws on a database with annotated genomes of other species.]
A bold generalization
"It matters not at all whether you work with genetic elements, with viruses, bacteria, fungi, animals, or plants. The same principles apply if your subject is molecular evolution, the diversity of genetic systems, comparative morphology, physiology, ecology, or behaviour." (p. 7)
Harvey, P. H., and M. D. Pagel. 1991. The Comparative Method in Evolutionary Biology. Oxford University Press, Oxford.
What are these principles?
Principle 1: hierarchically structured data demand appropriate statistics
Example: Residue “conservation”
The “entropy” of two alignment columns:

    Seq_1   D   D
    Seq_2   D   E
    Seq_3   D   D
    Seq_4   D   E
    Seq_5   E   D
    Seq_6   E   E
    Seq_7   E   D
    Seq_8   E   E
    S = 1 bit

Valdar, W. S. 2002. Scoring residue conservation. Proteins 48:227-241. Figure 1. . . Each labeled column represents a residue position in a multiple-sequence alignment . . .
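The entropy score on this slide can be checked numerically. The sketch below assumes the two alignment columns read DDDDEEEE and DEDEDEDE from Seq_1 to Seq_8, and the function name `column_entropy` is illustrative rather than taken from Valdar (2002).

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Assumed reading of the slide: residues listed from Seq_1 to Seq_8.
column_1 = list("DDDDEEEE")   # D in Seq_1..4, E in Seq_5..8
column_2 = list("DEDEDEDE")   # D and E alternate down the alignment

entropies = [column_entropy(column_1), column_entropy(column_2)]
print(entropies)
```

Both columns receive the same score because the entropy depends only on residue frequencies, not on which sequences carry which residue. That is why a statistic of this kind cannot see hierarchical structure among the sequences.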
Principle 2: evolution is the generating process
Because the non-independence arises via descent with modification, the proper framework for addressing hierarchy is to interpret it as an evolved pattern.

    Seq_1   D   D
    Seq_2   D   E
    Seq_3   D   D
    Seq_4   D   E
    Seq_5   E   D
    Seq_6   E   E
    Seq_7   E   D
    Seq_8   E   E

[Figure: tree with a branch of length t]
Let $r = \alpha + \beta$; then $P(D \to E, t) = (\alpha / r)(1 - e^{-rt})$
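A short derivation shows where a transition probability of this form comes from. It assumes a two-state continuous-time Markov model in which $\alpha$ is the rate of change D→E and $\beta$ the rate of change E→D. That assignment of symbols is an assumption made here for illustration.

```latex
\begin{align}
\frac{dP_{DE}(t)}{dt}
  &= \alpha\,P_{DD}(t) - \beta\,P_{DE}(t)
   = \alpha\,[1 - P_{DE}(t)] - \beta\,P_{DE}(t)
   = \alpha - r\,P_{DE}(t), \qquad r = \alpha + \beta, \\
P_{DE}(0) = 0
  &\;\Longrightarrow\;
  P_{DE}(t) = \frac{\alpha}{r}\left(1 - e^{-rt}\right), \\
\lim_{t \to \infty} P_{DE}(t) &= \frac{\alpha}{r}.
\end{align}
```

For short branches the probability of change grows roughly as $\alpha t$. For long branches it saturates at the stationary frequency $\alpha / r$, so distant relatives carry little information about one another.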
Example: intron “loss vs. gain” problem
Possibilities: [Figure: trees of taxa A to F with intron states A 1, B 1, C 0, D 0, E 0, F 0, marked with alternative gain and loss events]
Probabilities: [Figure: panels for A, B, C, D, E, F plotting probability of presence (0 to 1) against distance from root (0 to max)]
Example: functional inference

    taxon   presence   functional attribute
    A       1          1
    B       1          1
    C       0          ?
    D       0          ?
    E       0          0
    F       0          0

[Figure: tree with a branch of length t]
Let $r = \alpha + \beta$; then $P(0 \to 1, t) = (\alpha / r)(1 - e^{-rt})$
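One way to turn this into an explicit probability is Felsenstein's pruning algorithm on a two-state model. The sketch below is only an illustration under stated assumptions. The three-taxon tree, its branch lengths, the rates `alpha` (0→1) and `beta` (1→0), and all function names are hypothetical and are not taken from the slides.

```python
import math

def transition_matrix(alpha, beta, t):
    """P[i][j]: probability of state j after time t, starting in state i."""
    r = alpha + beta
    change = 1.0 - math.exp(-r * t)
    p01 = (alpha / r) * change      # 0 -> 1
    p10 = (beta / r) * change       # 1 -> 0
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

def conditional_likelihoods(node, alpha, beta):
    """Pruning step: likelihood of the tips below `node`, for each node state."""
    if not isinstance(node, list):  # a tip holding an observed state (0 or 1)
        return [1.0 if node == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, branch_length in node:
        p = transition_matrix(alpha, beta, branch_length)
        below = conditional_likelihoods(child, alpha, beta)
        for s in (0, 1):
            result[s] *= p[s][0] * below[0] + p[s][1] * below[1]
    return result

def tree_likelihood(tree, alpha, beta):
    """Likelihood of the tip states, with the stationary distribution at the root."""
    r = alpha + beta
    prior = [beta / r, alpha / r]
    at_root = conditional_likelihoods(tree, alpha, beta)
    return prior[0] * at_root[0] + prior[1] * at_root[1]

def posterior_present(make_tree, alpha, beta):
    """Posterior probability that the unknown tip has state 1."""
    like1 = tree_likelihood(make_tree(1), alpha, beta)
    like0 = tree_likelihood(make_tree(0), alpha, beta)
    return like1 / (like0 + like1)

def example_tree(state_c, sister_length=0.05):
    """Hypothetical tree ((A, C), E) with A = 1, E = 0 and C to be inferred."""
    clade = [(1, sister_length), (state_c, sister_length)]
    return [(clade, 0.5), (0, 0.5)]

posterior = posterior_present(example_tree, alpha=1.0, beta=1.0)
print(f"P(C = 1 | A = 1, E = 0) = {posterior:.3f}")
```

With a close sister taxon that has the attribute, the posterior for C is high. Lengthening the sister branches pulls it back toward the stationary frequency. The output is a probability rather than a yes/no annotation, which is the point of Principle 3.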
Principle 3: the result is an inference with uncertainty that should be treated explicitly
• assign uncertainties to inferences
• provide explicit probability distribution
Example from Huelsenbeck: “The phylogeny is usually treated as known without error; this assumption is problematic because inferred phylogenies are subject to both stochastic and systematic errors.”
Huelsenbeck, J. P., B. Rannala, and J. P. Masly. 2000. Science 288:2349-2350.
Character-state data model
OTU: Operational Taxonomic Unit
[Figure: alignment with column 13 highlighted (states Q, Q, Q, Q, E); labels: OTU, Character, Data, Tree]
The “state” is Q (Glutamine) for “character” 13 (column 13) of “OTU” H_sapiens_4826964
Character State Data (example from MacClade documentation)

    #NEXUS
    [!Data and tree from: Schluter, D. 1989. Pp. 79-95 in D.B. Wake and G. Roth, eds., Complex organismal functions: Integration and evolution in vertebrates. Wiley, N.Y. ]
    BEGIN DATA;
        DIMENSIONS NTAX=14 NCHAR=5;
        FORMAT MISSING=? GAP=- ;
        CHARLABELS
            [1] Maxillary_tomia
            [2] lateral_groove
            [3] posterolateral_teeth
            [4] intercalary_ridge
            [5] maxillary_tomia;
        STATELABELS
            1 thick thin,
            2 deep shallow,
            3 sharp reduced,
            4 absent present,
            5 'round-edged' 'sharp-edged';
        MATRIX
            presumed_ancestor         00000
            Geospiza_difficilis       00000
            Geospiza_scandens         00000
            Geospiza_conirostris      00000
            Geospiza_magnirostris     00000
            Geospiza_fortis           00000
            Geospiza_fuliginosa       00000
            Camarhynchus_pallidus     11101
            Camarhynchus_heliobates   11101
            Camarhynchus_psittacula   11101
            Camarhynchus_pauper       11101
            Camarhynchus_parvulus     11101
            Platyspiza_crassirostris  11010
            Certhidea_olivacea        11101;
    END;
    BEGIN ASSUMPTIONS;
        OPTIONS DEFTYPE=unord PolyTcount=MINSTEPS ;
    END;
    BEGIN TREES;
        TRANSLATE
            1 presumed_ancestor,
            2 Geospiza_difficilis,
            3 Geospiza_scandens,
            4 Geospiza_conirostris,
            5 Geospiza_magnirostris,
            6 Geospiza_fortis,
            7 Geospiza_fuliginosa,
            8 Camarhynchus_pallidus,
            9 Camarhynchus_heliobates,
            10 Camarhynchus_psittacula,
            11 Camarhynchus_pauper,
            12 Camarhynchus_parvulus,
            13 Platyspiza_crassirostris,
            14 Certhidea_olivacea;
        TREE * UNTITLED = [&R] (1,(((2,(3,4),((5,6),7)),(((8,9),((10,11),12)),13)),14));
    END;
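To make the character-state model concrete, the sketch below reads the MATRIX, TRANSLATE and TREE commands of this example into plain Python structures. It is a minimal illustration, not a NEXUS parser. It assumes a trimmed copy of the file (CHARLABELS, STATELABELS and ASSUMPTIONS omitted), unquoted names, and a tree without branch lengths, and all function names are illustrative.

```python
import re

NEXUS = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=14 NCHAR=5;
  MATRIX
    presumed_ancestor         00000
    Geospiza_difficilis       00000
    Geospiza_scandens         00000
    Geospiza_conirostris      00000
    Geospiza_magnirostris     00000
    Geospiza_fortis           00000
    Geospiza_fuliginosa       00000
    Camarhynchus_pallidus     11101
    Camarhynchus_heliobates   11101
    Camarhynchus_psittacula   11101
    Camarhynchus_pauper       11101
    Camarhynchus_parvulus     11101
    Platyspiza_crassirostris  11010
    Certhidea_olivacea        11101;
END;
BEGIN TREES;
  TRANSLATE 1 presumed_ancestor, 2 Geospiza_difficilis, 3 Geospiza_scandens,
    4 Geospiza_conirostris, 5 Geospiza_magnirostris, 6 Geospiza_fortis,
    7 Geospiza_fuliginosa, 8 Camarhynchus_pallidus, 9 Camarhynchus_heliobates,
    10 Camarhynchus_psittacula, 11 Camarhynchus_pauper, 12 Camarhynchus_parvulus,
    13 Platyspiza_crassirostris, 14 Certhidea_olivacea;
  TREE * UNTITLED = [&R] (1,(((2,(3,4),((5,6),7)),(((8,9),((10,11),12)),13)),14));
END;
"""

def read_matrix(text):
    """Map each OTU name to its string of character states."""
    block = re.search(r"MATRIX\s+(.*?);", text, re.S).group(1)
    return dict(line.split() for line in block.strip().splitlines())

def read_translate(text):
    """Map the tree's numeric tokens to OTU names."""
    block = re.search(r"TRANSLATE\s+(.*?);", text, re.S).group(1)
    return dict(entry.split() for entry in block.split(","))

def read_newick(text):
    """Return the tree description (the text between '=' and ';')."""
    pattern = r"TREE\s+\*?\s*\w+\s*=\s*(?:\[&\w\]\s*)?([^;]+);"
    return re.search(pattern, text).group(1).strip()

def parse_newick(newick):
    """Parse a Newick string without branch lengths into nested lists."""
    pos = 0
    def node():
        nonlocal pos
        if newick[pos] == "(":
            pos += 1                       # consume "("
            children = [node()]
            while newick[pos] == ",":
                pos += 1                   # consume ","
                children.append(node())
            pos += 1                       # consume ")"
            return children
        start = pos
        while pos < len(newick) and newick[pos] not in ",()":
            pos += 1
        return newick[start:pos]           # a tip label
    return node()

def leaves(node):
    """List tip labels in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [tip for child in node for tip in leaves(child)]

matrix = read_matrix(NEXUS)
translate = read_translate(NEXUS)
tree = parse_newick(read_newick(NEXUS))
tip_names = [translate[token] for token in leaves(tree)]
print(len(tip_names), "tips;", matrix["Platyspiza_crassirostris"])
```

The result is the triad from the data-model slide: OTUs (the keys of `matrix`), characters (the positions in each state string) and a tree whose tips refer back to the same OTUs.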
The problem, restated
[Figure: Genome sequences (99.99 % accurate) → Comparative analysis → Useful inferences (far less accurate)]
Power comes from comparative analysis
Comparative analysis is an evolutionary problem
• Depends on a tree describing relationships
• Depends on representing dynamics of evolution
• Requires attention to uncertainty
How to improve evolutionary analysis?
• Facilitating tree-based analysis with better informatics
• Improving models of evolutionary change
• Incorporating prior knowledge
How to advance evolutionary analysis?
• Automate
• Improve current models
• Add more parameters
• Expand universe of problem
• Include more prior knowledge
• Improve methods of numerical analysis
• Demonstrate benefits of evolutionary analysis convincingly
• Improve informatics support
  • standards (e.g., NEXUS, NeXML)
  • libraries (Bio::Phylo, Bio::NEXUS)
  • applications (Mesquite, Nexplorer)
  • ontologies (CDAO)
  • intelligent user-oriented systems (e.g., Galaxy)
Working group report outline
• Development and evolution of goals
• Activities
• Products and other outcomes
  • NeXML standard and implementations
  • CDAO standard, publication
  • PhyloWS standard
  • other
• Impacts
• Lessons learned
• Follow-ups
Activities
[Timeline figure, 2007 to 2012]
• Informal meeting, Philly, June 2006
• Phylohackathon
• PhyloWS (Tokyo)
• Evolutionary Informatics Working Group: WG1, WG2, WG3
• CDAO
• Ontology session at Evolution 2008
• NESCent Phyloinformatics course
• Google Summer-of-Code projects
• DBH1
• Evol. Ontology RCN meeting at NESCent
Timespan of NIH project if funded
• Comparative Data Analysis Ontology
• Domain-specific language
• Workflow construction using reasoning
• Services infrastructure
Prioritization exercise
In spring of 2007, participants ranked 11 proposed items; the leaders then devised a coherent plan with suggested tactics.
First meeting
• May 21-23, 2007, NESCent
• Priorities and activities
  • Supporting current file formats
  • Substitution model language
  • Central unifying artefact
  • New data exchange format
  • Outreach (funding, community needs)
Tangible Outcomes, period 1
• New data exchange format (wiki)
  • Detailed proposal
• Current formats (wiki)
  • Use assessment (incomplete)
  • Examples (incomplete)
• Transition Model Language (wiki)
  • Assessment
  • Initial results on related technologies
• Central Unifying Artefact (wiki, docs, online demos)
  • NeXML draft
  • Ontology development strategy (CDAO)
  • Concept glossary
  • Ontology-based semantic transformation demos
• Project proposal (4-year, ~$1.2M NIH R01)
  • International team of collaborators
• Outreach: not much (broader awareness)
Tangible Outcomes, period 2
• New data exchange format (wiki, nexml.org)
  • NeXML (Rutger’s talk)
• Current formats (wiki)
  • NA
• Transition Model Language (wiki)
  • NA
• Comparative data analysis ontology (wiki, docs, demos)
  • Analyzed related artefacts
  • Expanded concept glossary
  • Developed first draft of CDAO
  • Started evaluation
• Outreach: not much (broader awareness)
Tangible Outcomes, period 3
• New data exchange format (NeXML)
  • More support from apps developers
  • Broader support in libraries
• Current formats (wiki)
  • NA
• Transition Model Language (wiki)
  • NA
• Comparative data analysis ontology (wiki, docs, demos)
  • Completed first evaluation cycle
  • Manuscript written, submitted, accepted
  • Started v. 2 with more terms; support for protocols
• Outreach:
  • Meetings and workshops
CDAO: key concepts & relations
[Figure: concept diagram]
Concepts: character state data matrix, TU, character, character state datum, state, transformation, topology, tree, rooted tree, unrooted tree, node, edge, directed edge
Relations: has, part_of, belongs_to, is_a, is_transformation_of, represents_TU, has child, has child_node, has parent, has parent_node, has ancestor, has descendant, has left_state, has right_state, has left_node, has right_node, connects_to
Annotations: Alignment procedure…, taxonomic_link…, Tree Procedure Model…, Length…
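Relations of this kind are naturally stored as subject-predicate-object triples and queried by pattern matching. The sketch below uses a plain Python list as a stand-in for an RDF store. The predicate names are loosely modelled on the labels in the diagram and are not actual CDAO term identifiers, and the node names, `hypothetical_TU_2`, and all function names are assumptions made for illustration.

```python
TRIPLES = [
    ("n0", "has_child", "n1"),
    ("n0", "has_child", "n2"),
    ("n1", "represents_TU", "H_sapiens_4826964"),
    ("n2", "represents_TU", "hypothetical_TU_2"),
    ("datum1", "belongs_to_TU", "H_sapiens_4826964"),
    ("datum1", "belongs_to_character", "character_13"),
    ("datum1", "has_state", "Q"),
    ("datum2", "belongs_to_TU", "hypothetical_TU_2"),
    ("datum2", "belongs_to_character", "character_13"),
    ("datum2", "has_state", "E"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    wanted = (subject, predicate, obj)
    return [t for t in triples
            if all(w is None or w == got for w, got in zip(wanted, t))]

def descendants(triples, node):
    """All nodes reachable from `node` by following has_child."""
    found = []
    for _, _, child in match(triples, subject=node, predicate="has_child"):
        found.append(child)
        found.extend(descendants(triples, child))
    return found

def states_below(triples, node, character):
    """State of `character` for every TU represented below `node`."""
    states = {}
    for n in descendants(triples, node):
        for _, _, tu in match(triples, subject=n, predicate="represents_TU"):
            for datum, _, _ in match(triples, predicate="belongs_to_TU", obj=tu):
                if match(triples, datum, "belongs_to_character", character):
                    for _, _, state in match(triples, datum, "has_state"):
                        states[tu] = state
    return states

print(states_below(TRIPLES, "n0", "character_13"))
```

The query joins the tree (nodes and children) to the data matrix (datum, character, state) through the shared TU. A shared ontology is meant to support exactly this kind of cross-linking between tools.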
Summary of outputs
• NeXML
  • Standard
  • APIs
  • Supporting apps and resources
• CDAO
  • Standard, evaluation
  • publication
• PhyloWS
  • Standard
  • Demo implementations
Impacts
• Cohesion & awareness
• Interactions, spin-offs, related projects
• Penetration of standards
• Use of implementations
This week’s hackathon
• Takes the place of the 4th WG meeting
• Aims to rely on WG technologies
• Opportunity to
  • assess usability and scope of WG artefacts
  • assess prospects for technology “push”
  • assess potential gains from interoperability
  • expose weaknesses in WG artefacts
  • expose further interop needs
Tangible Outcomes, this week
• Semantics: accessing content via an RDF triple store
  • Process, translate, load from nexml into triple store
  • Java API and SPARQL query interface
• Phylr: UI for phyloWS access to combined data
  • DendroPy nexml; bioSQL
  • http://dbhack1.nescent.org:8080/SRW/search/treebase?
• Java API for nexml: lightweight IO in Java
  • Extensively implemented DOM approach (31 classes, 28 interfaces, 7 test classes)
  • live demo of test classes on last day!
Tangible Outcomes, this week
• Visualization: UI to overlay on a tree data repository
  • http://iptol.iplantcollaborative.org/hackathon/ (live demo)
  • Integration with Morphbank image collection
• Taxonomic intelligence: access to data via taxo. info
  • TreeBase REST API (live demo of local implementation)
• Rogue projects and other outcomes
  • NeXML test files (metadata representation and visualization)
  • mx improvements (tree viewing, nexml IO)
  • Tentative metadata standard for nexml (meta, RDFa)
Current status (2/4): Parsers and writers
Nexml parsers and writers:
• mesquite, java, using xmlbeans
• Bio::Phylo, perl
• pyNexml, python
• DAMBE, Visual Basic
• stubs for c++ xmlbeans
• plans for ruby?
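To show what a parser and writer pair does, the sketch below writes and then reads a small XML document with Python's standard `xml.etree.ElementTree`. It is not an implementation of NeXML. The element and attribute names (`otus`, `otu`, `trees`, `tree`, `node`, `edge`, `source`, `target`) are assumptions modelled on NeXML, and the namespace declarations and schema validation are omitted.

```python
import xml.etree.ElementTree as ET

def write_star_tree(labels):
    """Serialise a star tree (one root, one tip per label) to an XML string."""
    root = ET.Element("nexml")
    otus = ET.SubElement(root, "otus", {"id": "otus1"})
    trees = ET.SubElement(root, "trees", {"id": "trees1", "otus": "otus1"})
    tree = ET.SubElement(trees, "tree", {"id": "tree1"})
    ET.SubElement(tree, "node", {"id": "n0", "root": "true"})
    for i, label in enumerate(labels, start=1):
        # Each tip node points at an OTU by id instead of repeating its name.
        ET.SubElement(otus, "otu", {"id": f"otu{i}", "label": label})
        ET.SubElement(tree, "node", {"id": f"n{i}", "otu": f"otu{i}"})
    for i in range(1, len(labels) + 1):
        ET.SubElement(tree, "edge",
                      {"id": f"e{i}", "source": "n0", "target": f"n{i}"})
    return ET.tostring(root, encoding="unicode")

def read_tips(xml_text):
    """Return the labels of the OTUs attached directly to the root node."""
    root = ET.fromstring(xml_text)
    label_of = {otu.get("id"): otu.get("label") for otu in root.iter("otu")}
    tree = root.find("trees/tree")
    nodes = tree.findall("node")
    otu_of = {node.get("id"): node.get("otu") for node in nodes}
    root_id = next(n.get("id") for n in nodes if n.get("root") == "true")
    return [label_of[otu_of[edge.get("target")]]
            for edge in tree.findall("edge")
            if edge.get("source") == root_id]

document = write_star_tree(["Geospiza_fortis", "Geospiza_scandens"])
tips = read_tips(document)
print(tips)
```

Because the tree is a list of nodes and edges that refer to OTUs by id, any generic XML library can read it without a custom grammar. Presumably this is what the sidebar's "XML goodies" refers to.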
Lessons learned: successes
• Choosing participants
• Preparation for meetings
• Use of collaboration tools
• Identifying targets
Lessons learned: challenges
• Preparation for meetings
• Dissemination
• Evaluation
• Identifying targets
Possible follow-ups
• Another hackathon this summer
• Google SoC projects
• MIAPA project
• Renew evoinfo working group