Explore phylogenomics, the fusion of genomics and phylogenetics, and its emerging data standards for diverse analyses, storage methods, and publication challenges. Learn about phyloXML and its vital components for efficient research practices.
Emerging Data Standards for Phylogenomics Research
Christian M Zmasek, PhD
Burnham Institute for Medical Research
Bioinformatics and Systems Biology
www.phylosoft.org
www.phyloxml.org

Phylogenomics
• Original definition
  • the application of phylogenetic information for gene function analysis (Eisen, 1998)
• Recent usage
  • species evolution based on whole genome analyses (for example, Dunn et al., 2008)
  • various types of studies at the intersection of genomics and phylogenetics

The application of phylogenetic information for gene function analysis
[Figure: gene tree diagrams with leaves labeled RAT, MOUSE, HUMAN, and CIONA and labels X, Y, and Z. Legend: query sequence; orthologous to query; most similar to query; gene duplication]

What information do we need for a phylogenomic analysis (sequence function analysis type)?
• In phylogenomic analyses, tree nodes might be annotated with:
  • Sequence name
  • Species name
  • Duplication: true/false
• Branches might be annotated with:
  • Branch lengths
  • Support values (bootstrap, probability, …)

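The node and branch annotations listed above could be held in a small data structure. The following is a minimal sketch only: the class name `GeneTreeNode`, its field names, and the helper `count_duplications` are hypothetical and are not taken from any existing tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GeneTreeNode:
    """One gene tree node with the annotations from the slide (hypothetical class)."""
    sequence_name: Optional[str] = None
    species_name: Optional[str] = None
    duplication: Optional[bool] = None     # True/False as on the slide, None if unknown
    branch_length: Optional[float] = None  # length of the branch leading to this node
    supports: Dict[str, float] = field(default_factory=dict)  # e.g. {"bootstrap": 90.0}
    children: List["GeneTreeNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def count_duplications(node: GeneTreeNode) -> int:
    """Count the nodes flagged as duplications in the subtree rooted at node."""
    own = 1 if node.duplication else 0
    return own + sum(count_duplications(child) for child in node.children)


# Illustrative values only; they are not data from the presentation.
rat = GeneTreeNode("ADH2", "rat", branch_length=0.1)
mouse = GeneTreeNode("ADH2", "mouse", branch_length=0.12)
root = GeneTreeNode(duplication=False, supports={"bootstrap": 90.0}, children=[rat, mouse])
print(count_duplications(root))
```

Keeping support values in a dictionary keyed by type leaves room for more than one support value per branch.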
What information might we need for other types of phylogenomic analyses?
• Support values (possibly multiple)
• Taxonomy information (possibly detailed)
• Geographic information
• Host/parasite data (relation between tree nodes)
• Gene expression values
• Genomic location
• Mutations, variation, disease
• …

How is this information processed and stored?
• Tree topologies are described by nested parentheses: ((A,B),C)
• Unique tree node labels mapped to text files, spreadsheets, databases
• Manual processing of text files with text editors
• Macros, shell scripts, Perl scripts
• New Hampshire eXtended (NHX) format
  • Adds tags for different fields:
    • Species: S=
    • Bootstrap support: B=
  • Example: ADH2:0.1[&&NHX:S=human:B=90]
  • http://www.phylosoft.org/forester/NHX.html

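To make the NHX tagging concrete, here is a minimal sketch that splits a single node label such as the example above into name, branch length, and tags. The function name `parse_nhx_node` and the regular expression are assumptions made for illustration; the sketch handles only one label at a time, not a full tree, and does not cover quoted names or other NHX dialects.

```python
import re
from typing import Dict, Optional, Tuple

# "name:length[&&NHX:tag=value:tag=value]"; name, length, and the NHX block are each optional.
_NODE_RE = re.compile(
    r"^(?P<name>[^:\[\]]*)"
    r"(?::(?P<length>[^\[\]]+))?"
    r"(?:\[&&NHX:(?P<tags>[^\]]*)\])?$"
)


def parse_nhx_node(label: str) -> Tuple[str, Optional[float], Dict[str, str]]:
    """Split one NHX node label into (name, branch length, tag dictionary)."""
    match = _NODE_RE.match(label.strip())
    if match is None:
        raise ValueError(f"not a recognizable NHX node label: {label!r}")
    length = float(match.group("length")) if match.group("length") else None
    tags: Dict[str, str] = {}
    for item in filter(None, (match.group("tags") or "").split(":")):
        key, _, value = item.partition("=")
        tags[key] = value
    return match.group("name"), length, tags


name, length, tags = parse_nhx_node("ADH2:0.1[&&NHX:S=human:B=90]")
print(name, length, tags)  # ADH2 0.1 {'S': 'human', 'B': '90'}
```

Tag values are kept as strings because their meaning depends on the tag: `S` is a species name, while `B` is a numeric support value.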
How is this information published?
• Mostly as images of phylogenetic trees in journals
  • not suitable as input for further studies!
• Submission to (publicly accessible) databases rare

Problems with this approach
• Tedious
• Error prone
• Published images are difficult to use as input for further studies
• Meta-analyses are hard
• Different, and incompatible, “dialects” of NHX appeared
• Limited expressiveness

phyloXML by example
<phylogeny rooted="true">
  <name>example from Prof. Joe Felsenstein's book "Inferring Phylogenies"</name>
  <clade>
    <clade>
      <branch_length>0.06</branch_length>
      <clade>
        <name>A</name>
        <branch_length>0.102</branch_length>
      </clade>
      <clade>
        <name>B</name>
        <branch_length>0.23</branch_length>
      </clade>
    </clade>
    <clade>
      <name>C</name>
      <branch_length>0.4</branch_length>
    </clade>
  </clade>
</phylogeny>

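Because phyloXML is plain XML, a generic XML parser is enough to read the example above. The sketch below uses Python's standard `xml.etree.ElementTree` to walk the nested `<clade>` elements and print the tree in parenthesis notation. The helper name `to_newick` is hypothetical, and the phylogeny name is shortened. It also assumes the snippet exactly as shown on the slide, without an XML namespace; complete phyloXML files wrap phylogenies in a `<phyloxml>` root element that declares a namespace, in which case the tag lookups would need to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

PHYLOXML_SNIPPET = """
<phylogeny rooted="true">
  <name>example</name>
  <clade>
    <clade>
      <branch_length>0.06</branch_length>
      <clade><name>A</name><branch_length>0.102</branch_length></clade>
      <clade><name>B</name><branch_length>0.23</branch_length></clade>
    </clade>
    <clade><name>C</name><branch_length>0.4</branch_length></clade>
  </clade>
</phylogeny>
"""


def to_newick(clade: ET.Element) -> str:
    """Recursively render a <clade> element in parenthesis notation."""
    children = clade.findall("clade")  # direct child clades only
    text = "(" + ",".join(to_newick(c) for c in children) + ")" if children else ""
    text += clade.findtext("name", default="")
    length = clade.findtext("branch_length")
    if length is not None:
        text += ":" + length.strip()
    return text


phylogeny = ET.fromstring(PHYLOXML_SNIPPET)
newick = to_newick(phylogeny.find("clade")) + ";"
print(newick)  # ((A:0.102,B:0.23):0.06,C:0.4);
```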
phyloXML
• Important elements:
  • Taxonomy
  • Sequence
  • Confidence
  • Events (duplication, speciation)
  • Property (“custom data”)
  • Typed relations (between clades, sequences)
• XSD schema, examples, description, applications: http://www.phyloxml.org/
• Current version: 1.0

Important clade level elements
• <taxonomy>
  • <id source="">
  • <scientific_name>
  • <common_name>
  • <rank>
  • <uri>
• <sequence>
  • <symbol>
  • <accession source="">
  • <name>
  • <uri>
• <confidence type="">
• <distribution>
  • <desc>
  • <point geodetic_datum="">
    • <lat>
    • <long>
    • <alt>
• <property ref="" unit="" datatype="">

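As an illustration of how these clade-level elements nest, the following sketch builds one annotated `<clade>` with Python's standard `xml.etree.ElementTree`. The function name `make_clade` and every value passed to it (species, symbol, support, coordinates, datum) are made up for the example. The element order is intended to follow the phyloXML schema, but the output is not validated against the XSD here.

```python
import xml.etree.ElementTree as ET


def make_clade(scientific_name, symbol, bootstrap, lat, long):
    """Build a <clade> carrying confidence, taxonomy, sequence, and distribution data."""
    clade = ET.Element("clade")
    confidence = ET.SubElement(clade, "confidence", type="bootstrap")
    confidence.text = str(bootstrap)
    taxonomy = ET.SubElement(clade, "taxonomy")
    ET.SubElement(taxonomy, "scientific_name").text = scientific_name
    sequence = ET.SubElement(clade, "sequence")
    ET.SubElement(sequence, "symbol").text = symbol
    distribution = ET.SubElement(clade, "distribution")
    point = ET.SubElement(distribution, "point", geodetic_datum="WGS84")
    ET.SubElement(point, "lat").text = str(lat)
    ET.SubElement(point, "long").text = str(long)
    return clade


# Hypothetical example values.
clade = make_clade("Homo sapiens", "ADH2", 90, 32.9, -117.2)
print(ET.tostring(clade, encoding="unicode"))
```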
phyloXML applications/implementations (examples)
• BioPerl:
  • Parser, writer
• ATV (A Tree Viewer)
  • Java-based tree display tool suitable for large (>10 000) and highly decorated phylogenetic/taxonomic trees
  • http://www.phylosoft.org/atv
• phyloxml_converter
  • Command-line tool to convert Newick (NH), NHX, and Nexus formatted trees to phyloXML
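
To give a feel for what a Newick-to-phyloXML conversion involves, here is a minimal sketch in Python. It is not the implementation of phyloxml_converter and does not reproduce its options. The function name `newick_to_clade` is hypothetical, and the parser handles only simple Newick strings: no quoted labels, no comments, and no NHX or Nexus input.

```python
import xml.etree.ElementTree as ET


def newick_to_clade(newick: str) -> ET.Element:
    """Convert a simple Newick string into nested <clade> elements."""
    text = newick.strip().rstrip(";")
    pos = 0

    def parse_clade() -> ET.Element:
        nonlocal pos
        clade = ET.Element("clade")
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_clade())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_clade())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # skip ")"
        # Read the optional "name:length" label up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        name, _, length = text[start:pos].partition(":")
        if name.strip():
            ET.SubElement(clade, "name").text = name.strip()
        if length.strip():
            ET.SubElement(clade, "branch_length").text = length.strip()
        clade.extend(children)  # child clades follow the name and branch length
        return clade

    root = parse_clade()
    if pos != len(text):
        raise ValueError("unexpected trailing characters in Newick string")
    return root


phylogeny = ET.Element("phylogeny", rooted="true")
phylogeny.append(newick_to_clade("((A,B),C);"))
print(ET.tostring(phylogeny, encoding="unicode"))
```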