
Emerging Data Standards for Phylogenomics Research

Emerging Data Standards for Phylogenomics Research. Christian M Zmasek, PhD, Burnham Institute for Medical Research, Bioinformatics and Systems Biology. www.phylosoft.org, www.phyloxml.org


Presentation Transcript


  1. Emerging Data Standards for Phylogenomics Research
     Christian M Zmasek, PhD
     Burnham Institute for Medical Research, Bioinformatics and Systems Biology
     www.phylosoft.org, www.phyloxml.org

  2. Phylogenomics
     • Original definition
       • the application of phylogenetic information for gene function analysis (Eisen, 1998)
     • Recent usage
       • species evolution based on whole genome analyses (for example, Dunn et al., 2008)
       • various types of studies at the intersection of genomics and phylogenetics

  3. The application of phylogenetic information for gene function analysis
     [Figure: gene trees with nodes labeled X, Y, and Z and leaves from RAT, MOUSE, HUMAN, and CIONA. Legend: query sequence; orthologous to query; most similar to query; gene duplication.]

  4. What information do we need for a phylogenomic analysis (sequence function analysis type)?
     • In phylogenomic analyses, tree nodes might be annotated with:
       • Sequence name
       • Species name
       • Duplication: true/false
     • Branches might be annotated with:
       • Branch lengths
       • Support values (bootstrap, probability, …)
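
The annotations listed on this slide can be pictured as fields on a tree node. The following is a minimal sketch, not code from the presentation: the `Node` class, its field names, the `orthologs` helper, and the branch lengths in the example are all illustrative assumptions. It applies the usual rule that two sequences are orthologs when their last common ancestor is a speciation (not a duplication) node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Tree node carrying the annotations named on the slide (hypothetical names)."""
    sequence_name: Optional[str] = None
    species_name: Optional[str] = None
    duplication: Optional[bool] = None     # True: duplication, False: speciation, None: unknown
    branch_length: Optional[float] = None  # length of the branch leading to this node
    support: Optional[float] = None        # e.g. bootstrap value of that branch
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List["Node"]:
        """Return all external nodes below (or equal to) this node."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def path_to(node, sequence_name):
    """Return the list of nodes from `node` down to the named leaf, or None."""
    if not node.children:
        return [node] if node.sequence_name == sequence_name else None
    for child in node.children:
        tail = path_to(child, sequence_name)
        if tail is not None:
            return [node] + tail
    return None


def orthologs(root, query):
    """Names of leaves whose last common ancestor with `query` is a speciation node."""
    path = path_to(root, query)
    if path is None:
        raise KeyError(query)
    found = []
    for parent, on_path in zip(path, path[1:]):
        if parent.duplication is not False:
            continue  # duplication or unknown event: do not call orthologs here
        for child in parent.children:
            if child is not on_path:
                found.extend(leaf.sequence_name for leaf in child.leaves())
    return found


# Made-up example: a duplication at the root, followed by two speciations.
tree = Node(duplication=True, children=[
    Node(duplication=False, children=[
        Node("X_RAT", "rat", branch_length=0.1),
        Node("X_HUMAN", "human", branch_length=0.2),
    ]),
    Node(duplication=False, children=[
        Node("Y_RAT", "rat", branch_length=0.3),
        Node("Y_MOUSE", "mouse", branch_length=0.1),
    ]),
])
print(orthologs(tree, "X_RAT"))  # prints ['X_HUMAN']
```

Nodes with an unknown event type are skipped on purpose, so the sketch only reports orthologs where a speciation is explicitly annotated.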

  5. What information might we need for other types of phylogenomic analyses?
     • Support values (possibly multiple)
     • Taxonomy information (possibly detailed)
     • Geographic information
     • Host/parasite data (relation between tree nodes)
     • Gene expression values
     • Genomic location
     • Mutations, variation, disease
     • …

  6. How is this information processed and stored?
     • Tree topologies are described by nested parentheses: ((A,B),C)
     • Unique tree node labels are mapped to text files, spreadsheets, databases
     • Manual processing of text files with text editors
     • Macros, shell scripts, Perl scripts
     • New Hampshire eXtended (NHX) format
       • Adds tags for different fields:
         • Species: S=
         • Bootstrap support: B=
       • Example: ADH2:0.1[&&NHX:S=human:B=90]
       • http://www.phylosoft.org/forester/NHX.html
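
To make the NHX example on this slide concrete, here is a minimal sketch that splits a single NHX node label into its name, branch length, and tags. The function name `parse_nhx_label` and the returned dictionary layout are assumptions for illustration. Only the `S=` and `B=` tags named on the slide are interpreted; any other tags are kept as raw strings.

```python
import re

# name, optional ":branch_length", optional "[&&NHX:KEY=VALUE:KEY=VALUE]"
NHX_LABEL = re.compile(
    r"^(?P<name>[^:\[\]]*)"
    r"(?::(?P<length>[^\[\]]+))?"
    r"(?:\[&&NHX:(?P<tags>[^\]]*)\])?$"
)


def parse_nhx_label(label):
    """Parse one node label such as 'ADH2:0.1[&&NHX:S=human:B=90]'."""
    match = NHX_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a recognizable NHX node label: {label!r}")

    # Tags are separated by ':' and each has the form KEY=VALUE.
    tags = {}
    if match.group("tags"):
        for item in match.group("tags").split(":"):
            key, _, value = item.partition("=")
            tags[key] = value

    length = match.group("length")
    return {
        "name": match.group("name") or None,
        "branch_length": float(length) if length else None,
        "species": tags.get("S"),
        "bootstrap": float(tags["B"]) if "B" in tags else None,
        "tags": tags,
    }


parsed = parse_nhx_label("ADH2:0.1[&&NHX:S=human:B=90]")
print(parsed["name"], parsed["branch_length"], parsed["species"], parsed["bootstrap"])
# prints: ADH2 0.1 human 90.0
```

This handles one label at a time; a full reader would also have to walk the surrounding parentheses of the tree.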

  7. How is this information published?
     • Mostly as images of phylogenetic trees in journals
       • not suitable as input for further studies!
     • Submission to (publicly accessible) databases is rare

  8. Problems with this approach
     • Tedious
     • Error prone
     • Published images are difficult to use as input for further studies
     • Meta-analyses are hard
     • Different, and incompatible, “dialects” of NHX appeared
     • Limited expressiveness

  9. phyloXML by example
     <phylogeny rooted="true">
       <name>example from Prof. Joe Felsenstein's book "Inferring Phylogenies"</name>
       <clade>
         <clade>
           <branch_length>0.06</branch_length>
           <clade>
             <name>A</name>
             <branch_length>0.102</branch_length>
           </clade>
           <clade>
             <name>B</name>
             <branch_length>0.23</branch_length>
           </clade>
         </clade>
         <clade>
           <name>C</name>
           <branch_length>0.4</branch_length>
         </clade>
       </clade>
     </phylogeny>
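
Because phyloXML is plain XML, the example on this slide can be read with a generic XML parser. The sketch below uses only Python's standard library and turns the nested `<clade>` elements back into a parenthesized tree string. The helper name `to_newick` is an assumption, and the snippet is parsed exactly as shown on the slide, without an XML namespace; files that declare a namespace would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET

PHYLOXML = """<phylogeny rooted="true">
  <name>example from Prof. Joe Felsenstein's book "Inferring Phylogenies"</name>
  <clade>
    <clade>
      <branch_length>0.06</branch_length>
      <clade><name>A</name><branch_length>0.102</branch_length></clade>
      <clade><name>B</name><branch_length>0.23</branch_length></clade>
    </clade>
    <clade><name>C</name><branch_length>0.4</branch_length></clade>
  </clade>
</phylogeny>"""


def to_newick(clade):
    """Recursively render one <clade> element as a Newick subtree."""
    children = clade.findall("clade")  # direct child clades only
    text = ""
    if children:
        text = "(" + ",".join(to_newick(child) for child in children) + ")"
    name = clade.findtext("name")
    if name:
        text += name.strip()
    length = clade.findtext("branch_length")
    if length:
        text += ":" + length.strip()
    return text


root = ET.fromstring(PHYLOXML)
newick = to_newick(root.find("clade")) + ";"
print(newick)  # prints ((A:0.102,B:0.23):0.06,C:0.4);
```

The sketch does not quote names containing Newick special characters, which a real converter would have to handle.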

  10. phyloXML
      • Important elements:
        • Taxonomy
        • Sequence
        • Confidence
        • Events (duplication, speciation)
        • Property (“custom data”)
        • Typed relations (between clades, sequences)
      • XSD schema, examples, description, applications: http://www.phyloxml.org/
      • Current version: 1.0

  11. Important clade level elements
      • <taxonomy>
        • <id source="">
        • <scientific_name>
        • <common_name>
        • <rank>
        • <uri>
      • <sequence>
        • <symbol>
        • <accession source="">
        • <name>
        • <uri>
      • <confidence type="">
      • <distribution>
        • <desc>
        • <point geodetic_datum="">
          • <lat>
          • <long>
          • <alt>
      • <property ref="" unit="" datatype="">
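
As an illustration of how some of these elements nest inside a clade, the following sketch builds one annotated `<clade>` with Python's standard library. The helper `sub`, the clade name, and the accession value and its source are hypothetical placeholders, and the order of child elements is an assumption that should be checked against the XSD schema before writing real files.

```python
import xml.etree.ElementTree as ET


def sub(parent, tag, text=None, **attrib):
    """Append a child element with optional text and attributes (hypothetical helper)."""
    element = ET.SubElement(parent, tag, attrib)
    if text is not None:
        element.text = str(text)
    return element


clade = ET.Element("clade")
sub(clade, "name", "ADH2_HUMAN")                # illustrative node name
sub(clade, "confidence", 90, type="bootstrap")  # support value with its type

taxonomy = sub(clade, "taxonomy")
sub(taxonomy, "id", "9606", source="ncbi")      # NCBI taxonomy ID of Homo sapiens
sub(taxonomy, "scientific_name", "Homo sapiens")
sub(taxonomy, "common_name", "human")

sequence = sub(clade, "sequence")
sub(sequence, "symbol", "ADH2")
sub(sequence, "accession", "EXAMPLE_0001", source="example_db")  # placeholder, not a real accession
sub(sequence, "name", "alcohol dehydrogenase 2")

xml_text = ET.tostring(clade, encoding="unicode")
print(xml_text)
```

Keeping taxonomy and sequence data in separate typed elements, with the confidence type stated explicitly, is what lets tools read these annotations without guessing at label conventions.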

  12. phyloXML applications/implementations (examples)
      • BioPerl:
        • Parser, writer
      • ATV: A Tree Viewer
        • Java based tree display tool suitable for large (>10 000) and highly decorated phylogenetic/taxonomic trees
        • http://www.phylosoft.org/atv
      • phyloxml_converter
        • Command line tool to convert Newick (NH), NHX, and Nexus formatted trees to phyloXML
