
Introduction: from primary to secondary data

Phylogenetic Signal with Induction and non-Contradiction: the PhySIC method for building supertrees (Systematic Biology 2007). http://atgc.lirmm.fr/SuperTree/PhySIC





Presentation Transcript


  1. Phylogenetic Signal with Induction and non-Contradiction: the PhySIC method for building supertrees (Systematic Biology 2007)
     http://atgc.lirmm.fr/SuperTree/PhySIC
     Vincent Berry (1), V. Ranwez (2), A. Criscuolo (1,2), P.-H. Fabre (2), S. Guillemot (1), C. Scornavacca (1,2), E.J.P. Douzery (2)
     Funded by ACI IMPBIO & BIOSTIC LR
     CNRS - Université Montpellier 2, France

  2. Introduction: from primary to secondary data
     • Increasing number of phylogenetic studies
     • Annotating and storing phylogenies has become crucial
       • TreeBase (http://www.treebase.org/)
       • Tree Of Life (http://www.tolweb.org/tree/)
     • Using these secondary data requires dedicated tools
       • Enabling complex requests
       • Analyses/syntheses of obtained phylogenies
     • State-of-the-art phylogeny for a taxonomic group
       • Filtering: date > 2002 & method = ML & outgroup belongs to X
       • Phylogeny summarizing the 25 responses?
     PhySIC: Phylogenetic Signal with Induction and non-Contradiction

  3. Introduction: what are supertrees?
     • Supertree methods combine phylogenies having overlapping taxa sets into a larger phylogeny (Semple & Steel 2003)

  4. Introduction: use of supertrees
     • The supertree approach is an alternative to the total evidence approach:
       • Avoids dealing with too much missing data
       • Allows using separate evolutionary models for each gene
       • Enables combining data of different kinds: morphological, molecular, SINEs
     • Supertrees are useful for:
       • Producing well-resolved large phylogenies that provide a framework for broad comparative studies (Gittleman et al 2004)
       • Quantitative studies of input-tree congruence, identifying outlier taxa by tree-supertree distance measures (Wilkinson et al 2004)
       • Exploring and identifying agreement and disagreement among sets of input trees. The aim is then to reveal conflicts rather than to resolve them; conflicts are ultimately resolved with additional data or analyses (Wilkinson et al 2001)
       • Identifying where limited overlap between the leaf sets of the input trees is an obstacle to their amalgamation, thereby guiding further research (Sanderson et al 1996, Arné et al 2007)

  5. Introduction: dealing with conflicts
     • Dealing with contradictions among input trees
       • Voting methods: resolve conflicts based on a voting procedure
       • Veto methods: do not favor any resolution in case of conflict
     [Figure: drawing of two topologies in conflict over the placement of one taxon]

  6. Voting methods
     • The vast majority of supertree methods
     • These methods, also called "liberal", provide the most resolved supertrees
     • Goal: obtain a rough but large picture of how the source trees can be assembled
     • Examples
       • MRP (Baum 1992 & Ragan 1992): encoding trees + parsimony criterion. Admits several variants:
         • MRF (Eulenstein et al 2002)
         • MRC (Ross & Rodrigo 2004)
       • Quartet-suite (Piaggio et al 2004): quartet-based approach
       • SuperTriplet (Ranwez et al)

  7. Veto methods
     • Proceed from an axiomatic approach: proposed supertrees satisfy specified theoretical properties
     • Goal: obtain a reliable, if incomplete, picture of how the source trees fit together
     • Full congruence with the source trees can be necessary for further applications such as phylogeography, divergence time estimations, etc.
     • Avoid as much as possible the inference of non-supported novel clades (a by-product of some voting methods: Bininda-Emonds et al, Goloboff et al)

  8. Existing veto methods
     • BUILD (Aho 1981)
       • Only returns a tree when the source trees are compatible
     • Strict consensus (Gordon 1986)
       • Avoids conflicts by removing clades from the source trees
       • Limited to two trees (could be extended, but requires a large common overlap between all source trees to obtain a meaningful result)
     • Maximum Agreement SuperTree - SMAST (Berry & Nicolas 2004, Berry & Guillemot 2007)
       • Avoids conflicts by removing taxa from the source trees
       • Requires exponential computing time in the general case
       • No inherent mechanism to avoid arbitrary branches
     • PhySIC: an axiomatic approach

  9. Overview
     • Some relevant properties for reliable inference
       • Decomposition of a tree into triples
       • Identifying a tree
       • Property of Induction (PI)
       • Property of non-Contradiction (PC)
     • Algorithms (sketch)
       • BUILD (Aho)
       • PhySIC_PC
       • PhySIC_PI
     • Biological case study: Primate supertree
     • Conclusion & prospects

  10. Axiomatic approach: important properties
      Reliable facts are those that can be induced from testimonies and that are not incompatible with any other.

  11. Decomposition of trees into building blocks
      • Triplets (rooted triples): subtrees on 3 taxa, e.g. ac|d
      [Figure: two source trees T1 and T2 with their triplet sets tr(T1) and tr(T2); triplets shown: ab|c, ab|d, ac|d, bc|d, bd|c, eb|c, eb|d, ed|c]
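The decomposition of a tree into triplets can be sketched in a few lines of Python. This is only an illustration, not the PhySIC implementation: the nested-tuple tree encoding and the helper names `leaves`, `triplets` and `fmt` are assumptions made for this example.

```python
from itertools import combinations

def leaves(tree):
    """Return the leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def triplets(tree):
    """Return tr(tree) as a set of (frozenset({x, y}), z) pairs, each meaning xy|z."""
    result = set()
    if isinstance(tree, str):
        return result
    child_leaves = [leaves(child) for child in tree]
    for i, child in enumerate(tree):
        result |= triplets(child)  # triplets resolved deeper in the tree
        # Leaves hanging from the other children of this node.
        outside = set().union(*(cl for j, cl in enumerate(child_leaves) if j != i))
        # Two leaves under the same child are closer to each other than to any outside leaf.
        for x, y in combinations(sorted(child_leaves[i]), 2):
            for z in outside:
                result.add((frozenset((x, y)), z))
    return result

def fmt(triplet):
    """Format a triplet in the xy|z notation used on the slides."""
    pair, z = triplet
    return "".join(sorted(pair)) + "|" + z

# Hypothetical example tree (((a,b),c),d).
T = ((("a", "b"), "c"), "d")
print(sorted(fmt(t) for t in triplets(T)))  # ['ab|c', 'ab|d', 'ac|d', 'bc|d']
```

A set R of triplets is then displayed by T exactly when R is a subset of `triplets(T)`.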

  12. Properties of interest: identification
      • A tree T displays a set R of triples iff R ⊆ tr(T)
      • In such a case R is said to be compatible: all triples of R can be summarized in a tree T
      • R identifies T iff
        • T displays R
        • every tree T' displaying R contains all the clades of T
      [Figure: R = {bc|d, ab|c} identifies T; R' = {ab|c, ab|d} does not identify T]

  13. Properties of interest: identification
      • R identifies T, yet R does not contain all triples of tr(T): additional triples are induced by those present in R
      [Figure: R = {bc|d, ab|c} and the tree T on a, b, c, d that it identifies; ab|d and ac|d are induced]
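Induction can be tested with a small compatibility check. The sketch below is an illustration under stated assumptions, not the authors' code: triplets are encoded as tuples `(x, y, z)` meaning xy|z, the function names are hypothetical, and the test relies on the observation that, for a compatible R, xy|z is induced exactly when neither rival resolution (xz|y or yz|x) can be added to R without making it incompatible.

```python
def compatible(taxa, triplets):
    """BUILD-style test: True iff some tree on `taxa` displays every triplet."""
    taxa = set(taxa)
    if len(taxa) <= 2:
        return True
    relevant = [t for t in triplets if set(t) <= taxa]
    parent = {v: v for v in taxa}  # union-find over the taxa

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y, _ in relevant:       # xy|z links x and y
        parent[find(x)] = find(y)
    groups = {}
    for v in taxa:
        groups.setdefault(find(v), set()).add(v)
    if len(groups) == 1:           # everything linked: no root split is possible
        return False
    return all(compatible(group, relevant) for group in groups.values())

def induces(taxa, triplets, candidate):
    """True iff the compatible set `triplets` induces `candidate` = (x, y, z)."""
    x, y, z = candidate
    rivals = [(x, z, y), (y, z, x)]
    return compatible(taxa, triplets) and not any(
        compatible(taxa, list(triplets) + [rival]) for rival in rivals
    )

taxa = {"a", "b", "c", "d"}
R = [("b", "c", "d"), ("a", "b", "c")]    # bc|d and ab|c
print(induces(taxa, R, ("a", "b", "d")))  # True: ab|d is induced
print(induces(taxa, R, ("a", "c", "d")))  # True: ac|d is induced
print(induces(taxa, [("a", "b", "c")], ("a", "b", "d")))  # False
```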

  14. Relevant properties: induction (PI)
      • We want to infer reliable supertrees: we only accept supertrees T such that every triplet of tr(T) is present in the data R or induced by hypotheses in R
      [Figure: R = {ab|c, ab|d}; candidate supertrees and the unsupported triplets they would add (ac|d? cd|b? bc|d?)]

  15. Focusing on a coherent subset of hypotheses
      • There is virtually no chance that practical data exactly identify a (super)tree:
        • Lack of overlap between the source trees: missing data
        • Errors due to gene-specific evolution, systematic errors in the source tree inference (long branch attraction, estimated model of evolution)
      • However, there is a chance that part of the underlying tree appears uncorrupted in the data: find a subset R' of R identifying a tree (i.e., a subtree of the underlying tree)
      [Figure: R = {ab|c, bc|d, ab|d, ac|d, ad|c, bd|c} given to a supertree method; does R identify T?]

  16. Relevant properties: non-contradiction (PC)
      • We search for a subset of R identifying a tree T
      • But we want to be reliable: we do not accept hypotheses that are in direct contradiction with discarded hypotheses, so we reject subsets R' obtained by keeping xy|z and removing xz|y
      • We focus on the triples of R resolved by T
      [Figure: R = {ab|c, ab|d, bc|d, ac|d, bd|c, ad|c}; a subset R' of R identifies T]

  17. Link between the properties
      • "R(T) identifies T", where R(T) denotes the triples of R resolved by T, is equivalent to:
        • T satisfies PC (property of non-contradiction): for any triplet ab|c displayed by T, R(T) induces neither bc|a nor ac|b, and
        • T satisfies PI (property of induction): every triplet ab|c displayed by T is induced by R(T)
      • PhySIC: Phylogenetic Signal with Induction and non-Contradiction
        • PhySIC returns a supertree T that satisfies both PC and PI
        • PhySIC has a polynomial time complexity

  18. Overview
      • Relevant properties for a veto method (reliable facts)
        • Decomposition of a tree into triplets
        • Tree identification
        • Property of Induction (PI)
        • Property of non-Contradiction (PC)
      • Algorithms (sketch)
        • BUILD (Aho)
        • PhySIC_PC
        • PhySIC_PI
      • Biological case study: Primate supertree
      • Conclusion & prospects

  19. Algorithmic ideas: BUILD (Aho et al 81)
      • Returns a tree only when R is compatible
      [Figure: for R = {bc|d, ab|c}, the graph on a, b, c, d splits into {a,b,c} and d, then {a,b} and c, then a and b]
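The recursion suggested by the BUILD slide can be sketched as follows. This is a minimal illustration under assumptions, not the code behind the PhySIC server: triplets are encoded as tuples `(x, y, z)` meaning xy|z, trees are returned as nested tuples, and the function name `build` is chosen for this example.

```python
def build(taxa, triplets):
    """Return a nested-tuple tree displaying all triplets, or None if R is incompatible."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))
    # Graph on the current taxa: xy|z adds an edge between x and y
    # whenever x, y and z all belong to the current taxa.
    adjacency = {t: set() for t in taxa}
    for x, y, z in triplets:
        if {x, y, z} <= taxa:
            adjacency[x].add(y)
            adjacency[y].add(x)
    # Connected components by depth-first search, in a deterministic order.
    components, seen = [], set()
    for start in sorted(taxa):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adjacency[v] - comp)
        seen |= comp
        components.append(comp)
    if len(components) == 1:
        return None  # the taxa cannot be split: R is not compatible
    subtrees = [build(comp, triplets) for comp in components]
    if any(sub is None for sub in subtrees):
        return None
    return tuple(subtrees)

print(build("abcd", [("b", "c", "d"), ("a", "b", "c")]))
# ((('a', 'b'), 'c'), 'd')
print(build("abc", [("a", "b", "c"), ("a", "c", "b")]))
# None
```

The second call shows the limitation that motivates PhySIC: a single direct contradiction is enough for BUILD to return no tree at all.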

  20. Algorithmic ideas: limits of BUILD
      [Figure: two triplet sets containing contradictions, R1 = {ab|c, ac|b, bc|d, ab|d, ac|d} and R2 = {bc|d, bd|c, ac|d, ad|c, ab|c, ab|d}, with their source trees and graphs]

  21. Algorithmic ideas: PhySIC_PC
      • Idea: temporarily forget the direct contradictions
      • At each iteration, if there is a single connected component:
        • Check if using R' leads to several connected components
        • If so, check that the tree will satisfy PC w.r.t. R
        • Or else, propose a multifurcation on those taxa
      • We thus obtain a more resolved tree that satisfies PC (contradictions affecting basal clades do not prevent deeper clades from being obtained)
      [Figure: R = {bc|d, bd|c, ac|d, ad|c, ab|c, ab|d} and the set R' obtained from it by setting aside directly contradicting triplets, with the corresponding graphs and trees]
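The step "forget temporarily the direct contradictions" can be illustrated in isolation. The sketch below only builds R' from R by dropping every triplet whose three taxa receive more than one resolution in R; it does not implement the rest of PhySIC_PC (in particular the PC check against R), and the function names are hypothetical.

```python
from collections import defaultdict

def normalise(triplet):
    """Order the two grouped taxa so that xy|z and yx|z compare equal."""
    x, y, z = triplet
    return (min(x, y), max(x, y), z)

def drop_direct_contradictions(triplets):
    """Return R': the triplets of R whose taxon set has a single resolution in R."""
    by_taxa = defaultdict(set)
    for t in triplets:
        t = normalise(t)
        by_taxa[frozenset(t)].add(t)
    return {t for group in by_taxa.values() if len(group) == 1 for t in group}

# R from the slide: bc|d, bd|c, ac|d, ad|c, ab|c, ab|d
R = [("b", "c", "d"), ("b", "d", "c"), ("a", "c", "d"),
     ("a", "d", "c"), ("a", "b", "c"), ("a", "b", "d")]
print(sorted(drop_direct_contradictions(R)))
# [('a', 'b', 'c'), ('a', 'b', 'd')]
```

With R' = {ab|c, ab|d}, the graph on a, b, c, d is no longer connected, so a first split can be proposed where BUILD on R would have stopped.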

  22. Algorithmic ideas: limits of BUILD (2)
      • When the graph contains several connected components, it is necessary to check that the triplets we are about to create are really induced by R
      • Branches that create triples not induced by R are collapsed
      [Figure: R = {ab|c, ef|c}; the graph has components {a,b}, {c} and {e,f}; the resulting tree would create triplets such as ef|a, which R does not support]

  23. Algorithmic ideas: PhySIC_PI
      • We keep the clade related to a connected component only if it is connected w.r.t. each and every other connected component
      • For the {a,b} clade to be kept, {a,b} must be connected when we restrict R to the triplets of:
        • {a,b} ∪ {c} (connected)
        • {a,b} ∪ {e,f} (not connected: isolating {a,b} will produce arbitrary triples)
      [Figure: R = {ab|c, ef|c} with components {a,b}, {c} and {e,f}]
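The pairwise connectivity test described on this slide can be sketched as follows. This is one possible reading of the slide, not the published algorithm: triplets are tuples `(x, y, z)` meaning xy|z, and the names `components`, `stays_connected` and `keep_clade` are assumptions made for the example.

```python
def components(vertices, edges):
    """Connected components of an undirected graph, as a list of sets."""
    adjacency = {v: set() for v in vertices}
    for x, y in edges:
        adjacency[x].add(y)
        adjacency[y].add(x)
    seen, result = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adjacency[v] - comp)
        seen |= comp
        result.append(comp)
    return result

def stays_connected(clade, other, triplets):
    """True if `clade` is connected once R is restricted to the taxa of clade and other."""
    taxa = set(clade) | set(other)
    edges = [(x, y) for x, y, z in triplets if {x, y, z} <= taxa]
    return any(set(clade) <= comp for comp in components(taxa, edges))

def keep_clade(clade, others, triplets):
    """Keep the clade only if it stays connected w.r.t. every other component."""
    return all(stays_connected(clade, other, triplets) for other in others)

R = [("a", "b", "c"), ("e", "f", "c")]              # ab|c and ef|c
print(stays_connected({"a", "b"}, {"c"}, R))        # True
print(stays_connected({"a", "b"}, {"e", "f"}, R))   # False
print(keep_clade({"a", "b"}, [{"c"}, {"e", "f"}], R))  # False: the branch is collapsed
```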

  24. Algorithmic ideas: summary
      • A supertree draft is proposed by PhySIC_PC, ensuring PC
      • If a clade is not "strong enough", the corresponding branch is collapsed by PhySIC_PI, which also ensures PI
      • PhySIC is a polynomial-time supertree method:
        • Decomposition of the input forest into triplets: O(kn^3)
        • Creation of a tree satisfying PC: O(n^4)
        • Collapsing edges displaying triplets not induced by the source trees: O(n^4)
      • Overall, the algorithm requires O(kn^3 + n^4) computing time
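The overall bound is simply the sum of the three steps. The slide does not define its symbols; the line below assumes that k is the number of source trees and n the total number of taxa, and uses the fact that a tree on at most n leaves contains at most \binom{n}{3} triplets.

```latex
\underbrace{k\binom{n}{3}}_{\text{triplets to extract}} = O(kn^3),
\qquad
\underbrace{O(kn^3)}_{\text{decomposition}}
+ \underbrace{O(n^4)}_{\text{PC}}
+ \underbrace{O(n^4)}_{\text{PI}}
= O(kn^3 + n^4)
```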

  25. Overview
      • Relevant properties for a veto method
        • Decomposition of a tree into triplets
        • Tree identification
        • Property of Induction (PI)
        • Property of non-Contradiction (PC)
      • Algorithms (intuitive presentation)
        • BUILD (Aho)
        • PhySIC_PC
        • PhySIC_PI
      • Biological case study: Primate supertree
      • Conclusion & prospects

  26. Primate case study: source trees
      • ADRA2B and IRBP study (Poux et al. 04, 06)
      • SINEs (Roos et al. 04)
      • Branches with bootstrap support < 50% are collapsed
      [Figure: source trees, with the Anthropoids clade labelled]

  27. Primate case study: PC & PI in action
      • Platyrrhines are unresolved due to a conflict (PC)
      • Arbitrary resolution among Anthropoids is removed (PI)
      [Figure: source trees (ADRA2B, IRBP), the PhySIC_PC tree and the final PhySIC tree]

  28. Labels indicating the source of problems
      • PhySIC can report the reason for each multifurcation it proposes:
        • Lack of overlap or information in the source trees (i)
        • Local contradictions between the source trees (c)
      • This guides the correction/completion of source trees and primary data

  29. Pointing out "problems" in other supertrees
      • MRP is known to have some undesirable features:
        • Inferring "novel clades" not supported by any input tree (Bininda-Emonds & Bryant 98, Goloboff & Pol 01, Goloboff 05)
        • Being affected by a size bias, i.e. when two trees conflict on the resolution of a clade, the tree with the smallest local sampling is ignored (Purvis 95, Bininda-Emonds & Bryant 98, Goloboff 05)
        • Favoring source trees that are more unbalanced (Wilkinson et al 01)
      • A supertree already built from a collection of source trees by a usual supertree method can be reanalyzed in the light of PI & PC to identify problems on some dubious nodes

  30. Primate case study: MRP analyzed
      [Figure: source trees (ADRA2B, IRBP), the MRP supertree and the filtered supertree, with node labels PC, 1 and 2]

  31. Online server: http://atgc.lirmm.fr/SuperTree/PhySIC
      Contact: Vincent.Lefort@lirmm.fr

  32. Conclusion & prospects
      • Study of desirable properties
        • PC and PI: intuitively attractive properties
        • These properties can be checked in polynomial time
      • PhySIC algorithm (http://atgc.lirmm.fr/SuperTree/PhySIC)
        • Supertrees satisfying PI and PC (exact) and as resolved as possible (heuristics)
        • Proposes very reliable supertrees: identified by the data
        • Polynomial-time method
        • Visualization of conflicts and areas with insufficient overlap
        • Enables checking/correcting supertrees built by other methods (MRP, …)
      • Further developments:
        • Producing more resolved trees satisfying PC and PI
        • Filtering triplets based on their frequencies
        • Coupling with a database (TreeBase, …)

  33. Thank you all!
      Emmanuel Douzery, Vincent Ranwez, Alexis Criscuolo, Sylvain Guillemot, Pierre-Henri Fabre, Celine Scornavacca, Vincent Lefort
      Equipe Méth. et Algor. pour la bioinf., LIRMM
      Equipe Phylogénie Moléculaire, ISEM
      Thanks to Allen and Stéphane

  34. Algorithmic ideas: PhySIC_PC
      • At each iteration, if there is a single connected component:
        • Check if using R' leads to several connected components
        • If so, check that the tree will satisfy PC w.r.t. R
        • Or else, propose a multifurcation on those taxa
      • We thus obtain a tree satisfying PC even if the source trees contain local contradictions
      [Figure: R contains bc|d and ab|c together with the directly contradicting pair ab|d and bd|a; R' keeps bc|d and ab|c; the tree T separates d from {a,b,c}]
