Effects of Rooting on Phylogenic Algorithms Margareta Ackerman Joint work with David Loker and Dan Brown
Phylogeny meets Hierarchical Clustering Phylogeny is an application of Hierarchical Clustering. They are closely related! Unfortunately, there is a disconnect between these fields.
Bridging the Gap A step towards bridging the gap: we bring techniques from cluster analysis to the study of phylogenetic algorithms. We apply a recent framework for clustering algorithm selection to phylogeny [(Ackerman, Ben-David, and Loker, ‘10), (Ackerman, Ben-David, and Loker, ‘10), (Ackerman & Ben-David, IJCAI ‘11), (Zadeh and Ben-David, ‘09)]
Selecting Phylogenetic Algorithms Given the same input, different Phylogenetic algorithms can produce radically different results. How should a user decide which algorithm to use?
Framework for Selecting Phylogenetic Algorithms This framework lets a user utilize prior knowledge to select an algorithm • Identify properties that distinguish between different input-output behaviour of clustering paradigms • The properties should be: 1) Intuitive and “user-friendly” 2) Useful for distinguishing clustering algorithms
Outline • Rooting Phylogenetic Trees • Formal Framework • Properties of Hierarchical Algorithms • Analysis of Linkage-Based Algorithms • Analysis of Neighbor Joining • Conclusions and Future Direction
How to Root Phylogenetic Trees? A common solution: introduce distant taxa (or elements), called an outgroup, and root the tree where the outgroup connects with the ingroup.
When Rooting Changes the Ingroup The addition of an outgroup can change the topology of the ingroup. [Figure: the ingroup tree before and after adding outgroup E]
This Happens in Practice! Empirical studies demonstrate that, with some algorithms, the ingroup topology can be disrupted when an outgroup is added [(Holland et al., ‘03), (Shavit et al., ‘07), (Lin et al., ‘02), (Slack et al., ‘03)]. We perform a theoretical analysis of this phenomenon, proving that some algorithms are immune to this problem, while others are highly volatile.
Previous Work Independently of our work, it was shown that under Balanced Minimum Evolution (BME), the ingroup topology can change arbitrarily when an outlier is added (Cueto and Matsen, 2010).
Our Contributions • Linkage-based algorithms (including UPGMA) do not change the ingroup topology when the outgroup is sufficiently far away • Using Neighbor Joining, the ingroup topology is affected by outgroups even if the outgroup is arbitrarily far away
Formal Setup C_i is a cluster in a dendrogram D if there exists a node in the dendrogram so that C_i is the set of its leaf descendants.
Formal Setup C = {C_1, … , C_k} is a clustering in a dendrogram D if • C_i is a cluster in D for all 1 ≤ i ≤ k, and • The clusters are pairwise disjoint
Formal Setup A Hierarchical Clustering Algorithm A maps Input: a data set X with a distance function d, denoted (X, d), to Output: a dendrogram of X. The distance between Y ⊆ X and Z ⊆ X is the length of the minimum edge between them: d(Y, Z) = min_{y ∈ Y, z ∈ Z} d(y, z)
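The set distance d(Y, Z) defined above can be sketched in a few lines of Python. This is only an illustration: the function name `set_distance` and the example points on a line are assumptions, not part of the talk.

```python
from itertools import product


def set_distance(d, Y, Z):
    """Length of the minimum edge between point sets Y and Z under distance d."""
    return min(d(y, z) for y, z in product(Y, Z))


# Hypothetical example: points on a line, with d(a, b) = |a - b|.
def line_distance(a, b):
    return abs(a - b)


ingroup = {0, 1, 3}
outgroup = {10, 12}
gap = set_distance(line_distance, ingroup, outgroup)
```

Here `gap` is the quantity that later slides require to be "sufficiently large" when an outgroup is attached.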
Unaffected by an Outgroup Given a data set (X ∪ O, d) and an algorithm A, X is unaffected by O if A(X, d) is a sub-dendrogram of A(X ∪ O, d). Otherwise, X is affected by O. [Figure: dendrograms A(X, d), A(O, d), and A(X ∪ O, d)]
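One way to make the "unaffected" test concrete is sketched below. It assumes a representation that the talk does not specify: a dendrogram is a nested tuple whose leaves are non-tuple labels, and a sub-dendrogram is taken to be the subtree rooted at some node of the larger dendrogram. All function names are hypothetical.

```python
def leaves(tree):
    """Set of leaf labels below a node (a leaf is any non-tuple label)."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """All clusters of a dendrogram: one leaf set per node."""
    result = {leaves(tree)}
    if isinstance(tree, tuple):
        for child in tree:
            result |= clusters(child)
    return result


def subtrees(tree):
    """Yield the subtree rooted at every node of the dendrogram."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree:
            yield from subtrees(child)


def is_sub_dendrogram(small, big):
    """True if some node of `big` roots a subtree with the same clusters as `small`."""
    target = clusters(small)
    return any(clusters(node) == target for node in subtrees(big))


ingroup_tree = (("a", "b"), "c")
rooted_ok = ((("a", "b"), "c"), "e")      # ingroup topology preserved
rooted_changed = (("a", ("b", "c")), "e")  # ingroup topology disrupted
```

With this check, X is unaffected by the outgroup {e} in `rooted_ok` and affected in `rooted_changed`.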
Outgroup Independence Algorithm A is outgroup-independent if for any data sets (X, d) and (O, d’), if (X, d) and (O, d’) are sufficiently far apart, then X is unaffected by O. [Figure: an ingroup and a distant outgroup]
Outgroup Independence (continued) Concretely, X is unaffected by O when A(X, d) is a sub-dendrogram of A(X ∪ O, d*), where d* is a distance function over X ∪ O that extends d and d’ and places (X, d) and (O, d’) sufficiently far apart. [Figure: dendrograms A(X, d), A(O, d’), and A(X ∪ O, d*)]
Outgroup Volatility An algorithm A is outgroup volatile if for any data set (X, d) and any constant c, there exists a data set (O, d’) at distance at least c from X such that X is affected by O. If O is a singleton, then A is outlier volatile.
We use the following general result to show that Linkage-Based algorithms are outgroup-independent. • Theorem: Any hierarchical algorithm A that is 2-rich, outer-consistent, and local, is outgroup independent.
Locality If we select a cluster X’ from the dendrogram D = A(X, d) and run the algorithm on the data underlying this cluster, the result D’ = A(X’, d) is consistent with the original dendrogram. [Figure: example with X’ = {x1, …, x4}]
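Outer Consistency Let C be a clustering in A(X, d), and let d’ be an outer-consistent change of d with respect to C: distances within clusters of C are unchanged, and distances between clusters of C do not decrease. If A is outer-consistent, then A(X, d’) will also include the clustering C. [Figure: C on data set (X, d) and C on data set (X, d’)]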
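A small check for an outer-consistent change can be sketched as follows. It assumes the usual definition from the clustering-axioms literature (within-cluster distances stay the same, between-cluster distances may only grow); the function names and the points-on-a-line example are hypothetical.

```python
from itertools import combinations


def is_outer_consistent_change(d, d_new, clustering):
    """True if d_new keeps within-cluster distances and never shrinks between-cluster ones."""
    label = {x: i for i, cluster in enumerate(clustering) for x in cluster}
    for x, y in combinations(label, 2):
        if label[x] == label[y]:
            if d_new(x, y) != d(x, y):   # within a cluster: must be unchanged
                return False
        elif d_new(x, y) < d(x, y):      # between clusters: must not decrease
            return False
    return True


def line_metric(pos):
    """Distance function for labelled points placed on a line (illustrative only)."""
    return lambda x, y: abs(pos[x] - pos[y])


clustering = [{"a", "b"}, {"c"}]
d = line_metric({"a": 0, "b": 1, "c": 5})
d_stretched = line_metric({"a": 0, "b": 1, "c": 9})  # cluster {c} moved further away
d_squeezed = line_metric({"a": 0, "b": 1, "c": 3})   # cluster {c} moved closer
```

Moving the outgroup further from the ingroup, as in `d_stretched`, is exactly the kind of change the outgroup-independence proof relies on.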
2-Richness Given any pair of data sets (X, d) and (X’, d’), there exists a distance function d* over X ∪ X’, extending d and d’, so that X and X’ are the children of the root in A(X ∪ X’, d*).
Theorem: Any hierarchical algorithm A that is 2-rich, outer-consistent, and local is outgroup-independent. Proof: We want to show that, given any data sets (X, d) and (O, d’), if the two are placed sufficiently far apart by a distance function d* extending d and d’, then A(X, d) is a sub-dendrogram of A(X ∪ O, d*).
Proof (continued): First, apply 2-richness. Given (X, d) and (O, d’), there exists d’’ over X ∪ O, extending d and d’, so that X and O are the children of the root of A(X ∪ O, d’’). Let c be the largest distance under d’’ between a point of X and a point of O.
Proof (continued): Let d* be any distance function extending d and d’ in which the minimum distance between X and O is at least c (the largest X-to-O distance under d’’). Then d* is an outer-consistent change of d’’ with respect to the clustering {X, O}, so by outer-consistency X and O are the children of the root of A(X ∪ O, d*).
Proof (continued): Finally, since X is a cluster in A(X ∪ O, d*) and d* agrees with d on X, locality implies that A(X, d) is a sub-dendrogram of A(X ∪ O, d*). Therefore, whenever (X, d) and (O, d’) are sufficiently far apart, X is unaffected by O.
Linkage Based Algorithm • Create a leaf node for every element of X • Repeat the following until a single tree remains: • Consider the clusters represented by the remaining root nodes. • Merge the closest pair of clusters, as measured by the linkage function, by assigning them a common parent node.
Examples of Linkage Based Algorithms • The choice of linkage function distinguishes between different linkage-based algorithms. • Examples of common linkage functions: • UPGMA: average between-cluster distance • Single-linkage: shortest between-cluster distance • Complete-linkage: maximum between-cluster distance
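The generic linkage-based procedure and the three linkage functions named above can be sketched as follows. This is a minimal illustration rather than an optimized implementation: dendrograms are returned as nested tuples, ties are broken by list order, and all function names plus the points-on-a-line data are assumptions.

```python
from itertools import combinations


def single(d, A, B):
    """Single-linkage: shortest between-cluster distance."""
    return min(d(a, b) for a in A for b in B)


def complete(d, A, B):
    """Complete-linkage: maximum between-cluster distance."""
    return max(d(a, b) for a in A for b in B)


def average(d, A, B):
    """UPGMA: average between-cluster distance."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))


def linkage_clustering(points, d, linkage):
    """Repeatedly merge the closest pair of root clusters; return nested tuples."""
    # Each root is a pair (set of leaves, subtree built so far).
    roots = [(frozenset([p]), p) for p in points]
    while len(roots) > 1:
        i, j = min(combinations(range(len(roots)), 2),
                   key=lambda ij: linkage(d, roots[ij[0]][0], roots[ij[1]][0]))
        (A, tree_a), (B, tree_b) = roots[i], roots[j]
        roots = [r for k, r in enumerate(roots) if k not in (i, j)]
        roots.append((A | B, (tree_a, tree_b)))  # common parent node
    return roots[0][1]


def line_distance(a, b):
    return abs(a - b)


ingroup_tree = linkage_clustering([0, 1, 3], line_distance, single)
rooted_tree = linkage_clustering([0, 1, 3, 10], line_distance, single)
```

In this small example the distant point 10 joins last, and the ingroup tree `(3, (0, 1))` reappears unchanged as a subtree of `rooted_tree`, which is the behaviour that outgroup independence describes.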
Theorem: • All Linkage-Based algorithms are outgroup-independent. Proof: We can show that all linkage-based algorithms are 2-rich, outer-consistent, and local. The result follows from the previous theorem.
Neighbor Joining • Most widely used distance-based method for phylogenetic reconstruction • Works well in practice • If the distance matrix is additive, that is, some tree fits it exactly, Neighbor Joining finds that tree
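For readers unfamiliar with Neighbor Joining, the standard joining rule can be sketched as below: at each step, join the pair (i, j) minimizing Q(i, j) = (n − 2)·d(i, j) − Σ_k d(i, k) − Σ_k d(j, k). The sketch tracks only the topology, not branch lengths, stops once three nodes remain (the unrooted tree is then resolved), and uses a hypothetical additive four-taxon matrix; the function name and data layout are assumptions.

```python
from itertools import combinations


def neighbor_joining(taxa, dist):
    """dist maps frozenset({a, b}) to a distance. Returns (joins in order, remaining nodes)."""
    d = dict(dist)
    nodes = list(taxa)
    joins = []
    while len(nodes) > 3:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Pick the pair that minimizes the Q criterion (first pair wins ties).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        u = (i, j)  # new internal node, named by the pair it joins
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))]
                                        - d[frozenset((i, j))]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
        joins.append(u)
    return joins, nodes


# Additive distances from the tree ((a, b), (c, d)) with leaf branch
# lengths a=1, b=2, c=3, d=4 and an internal edge of length 1.
pairs = {("a", "b"): 3, ("a", "c"): 5, ("a", "d"): 6,
         ("b", "c"): 6, ("b", "d"): 7, ("c", "d"): 7}
dist = {frozenset(p): v for p, v in pairs.items()}
joins, remaining = neighbor_joining(["a", "b", "c", "d"], dist)
```

On this additive input the first join is the cherry (a, b), consistent with the tree that generated the distances.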
Theorem: • Neighbor joining is outlier volatile. This remains the case when distances of the ingroup are additive.
Outgroups can lead to arbitrary dendrograms • Theorem: • Given any data set (X, d), there exists a set of outliers O and a distance function d* over X ∪ O extending d, where d*(X, O) can be arbitrarily large, such that • NJ(X ∪ O, d*)|X, the restriction of the output to X, is an arbitrary dendrogram. [Figure: A(X, d) compared with A(X ∪ O, d*)|X]
Conclusions • We present a formal framework for analyzing the effects of outgroups on the ingroup topology for computationally efficient hierarchical algorithms • We prove that all Linkage-Based algorithms, which include UPGMA, are outgroup-independent • We prove that NJ is outgroup volatile • This only addresses rooting: we do not claim that UPGMA is in general better than NJ.
Future Work • How to choose outgroups for rooting NJ? • Perform a similar analysis of Likelihood methods