Emerging Data Standards for Phylogenomics Research
Christian M Zmasek, PhD
Burnham Institute for Medical Research, Bioinformatics and Systems Biology
www.phylosoft.org
www.phyloxml.org
Phylogenomics
• Original definition
  • the application of phylogenetic information for gene function analysis (Eisen, 1998)
• Recent usage
  • species evolution based on whole genome analyses (for example, Dunn et al., 2008)
  • various types of studies at the intersection of genomics and phylogenetics
The application of phylogenetic information for gene function analysis
[Figure: gene tree diagrams with leaves labeled RAT, MOUSE, HUMAN, and CIONA, and with labels X, Y, and Z. Legend: query sequence; orthologous to query; most similar to query; gene duplication.]
What information do we need for a phylogenomic analysis (sequence function analysis type)?
• In phylogenomic analyses, tree nodes might be annotated with:
  • Sequence name
  • Species name
  • Duplication: true/false
• Branches might be annotated with:
  • Branch lengths
  • Support values (bootstrap, probability, …)
What information might we need for other types of phylogenomic analyses?
• Support values (possibly multiple)
• Taxonomy information (possibly detailed)
• Geographic information
• Host/parasite data (relation between tree nodes)
• Gene expression values
• Genomic location
• Mutations, variation, disease
• …
How is this information processed and stored?
• Tree topologies are described by nested parentheses: ((A,B),C)
• Unique tree node labels mapped to text files, spreadsheets, databases
• Manual processing of text files with text editors
• Macros, shell scripts, Perl scripts
• New Hampshire eXtended (NHX) format
  • Adds tags for different fields:
    • Species: S=
    • Bootstrap support: B=
  • Example: ADH2:0.1[&&NHX:S=human:B=90]
  • http://www.phylosoft.org/forester/NHX.html
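To make the NHX tagging concrete, here is a minimal Python sketch that splits the example label from this slide into a node name, a branch length, and a dictionary of tags. The function name `parse_nhx_label` and the returned dictionary layout are illustrative assumptions, not part of any NHX tool, and the sketch handles a single node label only, not a complete tree.

```python
import re

# Matches an NHX comment such as [&&NHX:S=human:B=90]
NHX_COMMENT = re.compile(r"\[&&NHX:([^\]]*)\]")


def parse_nhx_label(label):
    """Split one NHX node label into name, branch length, and tags.

    Hypothetical helper for illustration; it does not parse whole trees.
    """
    tags = {}
    match = NHX_COMMENT.search(label)
    if match:
        # Fields inside the comment are separated by ':' and written KEY=VALUE.
        for field in match.group(1).split(":"):
            if "=" in field:
                key, value = field.split("=", 1)
                tags[key] = value
        # Remove the NHX comment so only "name:length" remains.
        label = label[:match.start()] + label[match.end():]
    name, _, length = label.partition(":")
    return {
        "name": name or None,
        "branch_length": float(length) if length else None,
        "tags": tags,
    }


node = parse_nhx_label("ADH2:0.1[&&NHX:S=human:B=90]")
print(node["name"], node["branch_length"], node["tags"])
```

Note that the tag values come back as plain strings: the format itself carries no type information, so every consumer has to decide how to interpret each tag.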
How is this information published?
• Mostly as images of phylogenetic trees in journals
  • not suitable as input for further studies!
• Submission to (publicly accessible) databases is rare
Problems with this approach
• Tedious
• Error prone
• Published images are difficult to use as input for further studies
• Meta-analyses are hard
• Different, and incompatible, "dialects" of NHX appeared
• Limited expressiveness
phyloXML by example
<phylogeny rooted="true">
  <name>example from Prof. Joe Felsenstein's book "Inferring Phylogenies"</name>
  <clade>
    <clade>
      <branch_length>0.06</branch_length>
      <clade>
        <name>A</name>
        <branch_length>0.102</branch_length>
      </clade>
      <clade>
        <name>B</name>
        <branch_length>0.23</branch_length>
      </clade>
    </clade>
    <clade>
      <name>C</name>
      <branch_length>0.4</branch_length>
    </clade>
  </clade>
</phylogeny>
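Because the tree is plain XML, a general-purpose XML library is enough to read it. As a rough sketch, the Python code below walks the nested clade elements of this slide's example with the standard library and writes them back out as a parenthesized Newick-style string. The function name `clade_to_newick` is an assumption for illustration. The snippet is parsed exactly as shown on the slide, without an XML namespace or an enclosing root element; files following the published schema may need namespace-aware lookups.

```python
import xml.etree.ElementTree as ET

# The slide's example, minus the free-text <name> of the phylogeny.
PHYLOXML_SNIPPET = """
<phylogeny rooted="true">
  <clade>
    <clade>
      <branch_length>0.06</branch_length>
      <clade><name>A</name><branch_length>0.102</branch_length></clade>
      <clade><name>B</name><branch_length>0.23</branch_length></clade>
    </clade>
    <clade><name>C</name><branch_length>0.4</branch_length></clade>
  </clade>
</phylogeny>
"""


def clade_to_newick(clade):
    """Recursively render a <clade> element as a Newick-style substring."""
    children = clade.findall("clade")  # direct child clades only
    name = clade.findtext("name", default="").strip()
    length = clade.findtext("branch_length")
    text = ""
    if children:
        text = "(" + ",".join(clade_to_newick(c) for c in children) + ")"
    text += name
    if length is not None:
        text += ":" + length.strip()
    return text


root = ET.fromstring(PHYLOXML_SNIPPET.strip())
newick = clade_to_newick(root.find("clade")) + ";"
print(newick)  # ((A:0.102,B:0.23):0.06,C:0.4);
```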
phyloXML
• Important elements:
  • Taxonomy
  • Sequence
  • Confidence
  • Events (duplication, speciation)
  • Property ("custom data")
  • Typed relations (between clades, sequences)
• XSD schema, examples, description, applications: http://www.phyloxml.org/
• Current version: 1.0
Important clade level elements
• <taxonomy>
  • <id source="">
  • <scientific_name>
  • <common_name>
  • <rank>
  • <uri>
• <sequence>
  • <symbol>
  • <accession source="">
  • <name>
  • <uri>
• <confidence type="">
• <distribution>
  • <desc>
  • <point geodetic_datum="">
    • <lat>
    • <long>
    • <alt>
• <property ref="" unit="" datatype="">
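As an illustration of how a few of these elements combine, the Python sketch below builds one annotated clade with the standard library. The helper name `annotated_clade` is hypothetical, and the values reuse the ADH2 example (species human, bootstrap support 90) purely for illustration. The element order (name, confidence, taxonomy, sequence) is assumed to match what the XSD schema expects; check the schema at www.phyloxml.org before relying on it.

```python
import xml.etree.ElementTree as ET


def annotated_clade(name, scientific_name, symbol, support,
                    support_type="bootstrap"):
    """Build a <clade> carrying confidence, taxonomy, and sequence data.

    Hypothetical helper for illustration; only a small subset of the
    clade-level elements is covered.
    """
    clade = ET.Element("clade")
    ET.SubElement(clade, "name").text = name
    # The type attribute says what kind of support value this is.
    confidence = ET.SubElement(clade, "confidence", type=support_type)
    confidence.text = str(support)
    taxonomy = ET.SubElement(clade, "taxonomy")
    ET.SubElement(taxonomy, "scientific_name").text = scientific_name
    sequence = ET.SubElement(clade, "sequence")
    ET.SubElement(sequence, "symbol").text = symbol
    return clade


clade = annotated_clade("ADH2", "Homo sapiens", "ADH2", 90)
print(ET.tostring(clade, encoding="unicode"))
```

Compared with the NHX tags S= and B=, each value here sits in a named element with an explicit type, which is what makes the data checkable against a schema.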
phyloXML applications/implementations (examples)
• BioPerl:
  • Parser, writer
• ATV (A Tree Viewer)
  • Java based tree display tool suitable for large (>10 000) and highly decorated phylogenetic/taxonomic trees
  • http://www.phylosoft.org/atv
• phyloxml_converter
  • Command line tool to convert Newick (NH), NHX, and Nexus formatted trees to phyloXML