A phylogenetic application of the combinatorial graph Laplacian. Eric A. Stone, Department of Statistics, Bioinformatics Research Center, North Carolina State University.
A phylogenetic application of the combinatorial graph Laplacian Eric A. Stone Department of Statistics Bioinformatics Research Center North Carolina State University
My motivation for this project • Trees in statistics or biology • Often a latent branching structure relating some observed data • Trees in mathematics • Always a connected graph with no cycles
My motivation for this project • Trees in statistics or biology • PROBLEM: Recover properties of latent branching structure • Trees in mathematics • Always a connected graph with no cycles
My motivation for this project • Trees in statistics or biology • PROBLEM: Recover properties of latent branching structure • Trees in mathematics • Characterization of observed structure by spectral graph theory
Bridging the gap • Rectifying trees and trees • Can we use some powerful tools of spectral graph theory to recover latent structure? • Natural relationship between trees and complete graphs?!?
Tree and distance matrices • The tree with vertex set {1,…,8} has distance matrix D • The “phylogenetic tree” can only be observed at {1,…,5} • We can only observe (estimate) the phylogenetic portion D* The phylogenetic portion D*
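A minimal sketch of how D and D* relate, assuming a tree topology inferred from the splits and cutpoints quoted elsewhere in these slides (leaves 1 to 5, internal nodes 6 to 8). The branch lengths are hypothetical and chosen only for illustration; `tree_distance_matrix` is an illustrative helper, not part of the talk.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are hypothetical.
EDGES = [(1, 6, 1.0), (2, 6, 2.0), (6, 8, 2.0), (5, 8, 1.0),
         (7, 8, 1.0), (3, 7, 1.0), (4, 7, 3.0)]

def tree_distance_matrix(edges, n):
    """All-pairs path lengths via Floyd-Warshall (vertices labeled 1..n)."""
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for u, v, length in edges:
        D[u - 1, v - 1] = D[v - 1, u - 1] = length
    for k in range(n):
        # Relax every pair of vertices through intermediate vertex k.
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

D = tree_distance_matrix(EDGES, 8)   # full 8x8 matrix (not observable)
D_star = D[:5, :5]                   # phylogenetic portion: leaves 1..5
print(D_star)
```

Only the leading 5x5 block is available to the analyst; the rows and columns for vertices 6 to 8 are the latent part.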
More motivation for this project • Trees in statistics or biology • PROBLEM: Recover properties of latent branching structure • Given D* only, recover latent branching structure • This is the problem of phylogenetic reconstruction (w/o error!) The phylogenetic portion D*
NJ finds (2,n-2) splits from D* • A split is a bipartition of the leaf set (e.g. {1,2,3,4,5}) that can be induced by cutting a branch on the tree • e.g. {{1,2},{3,4,5}} or {{1,2,5},{3,4}} • Neighbor-joining criterion identifies (2,n-2) splits through its Q matrix
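The neighbor-joining criterion can be sketched as follows, using the standard form Q(i,j) = (n-2)·d(i,j) - Σk d(i,k) - Σk d(j,k). The slide's own formula did not survive extraction, so this standard form is an assumption. The matrix `D_star` is a hypothetical additive example consistent with the splits {{1,2},{3,4,5}} and {{1,2,5},{3,4}}.

```python
import numpy as np

# Hypothetical leaf-to-leaf tree distances (leaves 1..5).
D_star = np.array([[0, 3, 5, 7, 4],
                   [3, 0, 6, 8, 5],
                   [5, 6, 0, 4, 3],
                   [7, 8, 4, 0, 5],
                   [4, 5, 3, 5, 0]], dtype=float)

def nj_q_matrix(D):
    """Standard neighbor-joining Q matrix for a distance matrix D."""
    n = D.shape[0]
    r = D.sum(axis=1)                       # row sums
    Q = (n - 2) * D - r[:, None] - r[None, :]
    np.fill_diagonal(Q, np.inf)             # exclude i == j from the search
    return Q

Q = nj_q_matrix(D_star)
i, j = np.unravel_index(np.argmin(Q), Q.shape)
cherry = (int(i) + 1, int(j) + 1)           # back to 1-based leaf labels
print(cherry)
```

The minimizing pair is a cherry, which corresponds to a (2,n-2) split of the leaf set.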
A recipe for tree reconstruction from D* • Find a split • NJ relies on theorem that guarantees (2,n-2) split from Q matrix • Use knowledge of split to reduce dimension • NJ prunes the cherry (neighboring taxa) to reduce leaves by one • Iterate until tree has been fully reconstructed • Tree topology specified by its split set
Our narrow goal • Find a split • NJ relies on theorem that guarantees (2,n-2) split from Q matrix • Hypothesize criterion that identifies deeper splits … and prove that it actually works
Our solution The phylogenetic portion D*
Our solution The phylogenetic portion D* • Let H be the centering matrix: H = I − (1/n)11ᵀ, with n the number of leaves • Find eigenvector Y of HD*H with the smallest eigenvalue • The signs of the entries of Y identify a split of the tree
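The three steps on this slide can be sketched in a few lines. This assumes the usual centering matrix H = I − (1/n)11ᵀ and uses a hypothetical additive matrix `D_star` for a five-leaf tree whose nontrivial splits are {{1,2},{3,4,5}} and {{1,2,5},{3,4}}. `spectral_split` is an illustrative name.

```python
import numpy as np

# Hypothetical leaf-to-leaf tree distances (leaves 1..5).
D_star = np.array([[0, 3, 5, 7, 4],
                   [3, 0, 6, 8, 5],
                   [5, 6, 0, 4, 3],
                   [7, 8, 4, 0, 5],
                   [4, 5, 3, 5, 0]], dtype=float)

def spectral_split(D):
    """Split the leaves by the signs of the bottom eigenvector of H D H."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = H @ D @ H
    B = (B + B.T) / 2                        # remove floating-point asymmetry
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    y = eigvecs[:, 0]                        # eigenvector, smallest eigenvalue
    pos = {i + 1 for i in range(n) if y[i] > 0}
    neg = set(range(1, n + 1)) - pos
    return y, (pos, neg)

y, sides = spectral_split(D_star)
print(sides)
```

The overall sign of an eigenvector is arbitrary, so only the bipartition is meaningful, not which side is labeled positive.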
About the matrix HD*H • Entries of HD*H are Dij − Di. − D.j + D.. (a dot denotes averaging over that index) • HD*H is negative semidefinite • Zero is a simple eigenvalue, with the constant (all-ones) vector as eigenvector • The remaining eigenvectors have both + and − entries • HD*H appears prominently in: • Multidimensional scaling • Principal coordinate analysis
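These properties can be checked numerically on a small case. The sketch below assumes H = I − (1/n)11ᵀ and a hypothetical additive matrix `D_star`. It is a sanity check on one example, not a proof.

```python
import numpy as np

# Hypothetical leaf-to-leaf tree distances (leaves 1..5).
D_star = np.array([[0, 3, 5, 7, 4],
                   [3, 0, 6, 8, 5],
                   [5, 6, 0, 4, 3],
                   [7, 8, 4, 0, 5],
                   [4, 5, 3, 5, 0]], dtype=float)
n = D_star.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
B = H @ D_star @ H

# Entrywise form: D_ij - (row mean) - (column mean) + (grand mean).
B_entrywise = (D_star
               - D_star.mean(axis=1, keepdims=True)
               - D_star.mean(axis=0, keepdims=True)
               + D_star.mean())

eigvals, eigvecs = np.linalg.eigh((B + B.T) / 2)   # ascending order
print(np.allclose(B, B_entrywise))                 # same matrix
print(eigvals)                                     # all <= 0, a single zero
print(B @ np.ones(n))                              # ones vector maps to zero
```

Every eigenvector other than the constant one is orthogonal to the all-ones vector, which is why its entries must include both signs.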
Example of our solution • Find eigenvector Y of HD*H with the smallest eigenvalue: • Signs of Y identify the split {{1,2},{3,4,5}} (entries of Y as they appear on the figure, not necessarily in leaf order: -0.0564 +0.5793 -0.5011 -0.4636 +0.4418)
A real example (data from ToL) • Two iterations
Our solution • Find a split • NJ relies on theorem that guarantees (2,n-2) split from Q matrix • Hypothesize criterion that identifies deep splits … and prove that it actually works
Affinity and distance • In phylogenetics, common to consider pairwise distances • In graph theory, common to consider pairwise affinities Affinity-based Distance-based
The genius of Miroslav Fiedler • G connected ⇒ smallest eigenvalue of L, zero, is simple • Smallest positive eigenvalue, λ, called algebraic connectivity of G • Fiedler vectors Y satisfy LY = λY • Fiedler cut is the sign-induced bipartition (Fiedler vector entries as they appear on the figure: -0.4277 -0.0223 +0.4840 -0.0158 -0.3653 +0.3449 +0.4038 -0.4047)
The genius of Miroslav Fiedler • G connected ⇒ smallest eigenvalue of L, zero, is simple • Smallest positive eigenvalue, λ, called algebraic connectivity of G • Fiedler vectors Y satisfy LY = λY • Fiedler cut is the sign-induced bipartition • Fiedler cut here is {{1,2,6},{3,4,5,7,8}} • Note that the cut implies a leaf split: {{1,2},{3,4,5}} (Fiedler vector entries as they appear on the figure: -0.4277 -0.0223 +0.4840 -0.0158 -0.3653 +0.3449 +0.4038 -0.4047)
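A minimal sketch of computing a Fiedler vector and Fiedler cut for a weighted tree. The topology is inferred from the slides and the branch lengths are hypothetical. Taking each edge affinity as the reciprocal of its branch length is an assumption, so the numbers will not match those on the figure.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are hypothetical.
EDGES = [(1, 6, 1.0), (2, 6, 2.0), (6, 8, 2.0), (5, 8, 1.0),
         (7, 8, 1.0), (3, 7, 1.0), (4, 7, 3.0)]

def tree_laplacian(edges, n):
    """Combinatorial Laplacian with affinity = 1 / branch length (assumed)."""
    L = np.zeros((n, n))
    for u, v, length in edges:
        i, j, w = u - 1, v - 1, 1.0 / length
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return L

L = tree_laplacian(EDGES, 8)
eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
lam = eigvals[1]                          # algebraic connectivity
Y = eigvecs[:, 1]                         # a Fiedler vector: L Y = lam Y
positive = {v for v in range(1, 9) if Y[v - 1] > 0}
fiedler_cut = (positive, set(range(1, 9)) - positive)
print(lam, fiedler_cut)
```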
Is this relevant here? • We do not observe an 8x8 Laplacian matrix L • All we get is a 5x5 matrix of between-leaf pairwise distances D* • Where is the connection to graph theory? The phylogenetic portion D*
Recall: Our solution • Let H be the centering matrix: H = I − (1/n)11ᵀ • Find eigenvector Y of HD*H with the smallest eigenvalue • The signs of the entries of Y identify a split of the tree The phylogenetic portion D*
An extremely useful relationship • Recall the centering matrix H • The (Moore-Penrose) pseudoinverse of HDH is in fact −½L, i.e. HDH = −2L⁺ • In the context of this formula, we have shown: • Principal submatrices of D relate to Schur complements of L • In particular, (HD*H)⁺ = −½L* = −½(L/Z) = −½(W − XZ⁻¹Y), where L = [W X; Y Z] is partitioned so that W is the leaf block and Z the interior block
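The relationship can be checked numerically on a small tree. This sketch assumes a topology inferred from the slides, hypothetical branch lengths, and edge affinities equal to reciprocal branch lengths. Under that convention the proportionality constant comes out as −1/2, that is, HDH = −2L⁺, for both the full tree and the leaf-only Schur complement.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are hypothetical.
EDGES = [(1, 6, 1.0), (2, 6, 2.0), (6, 8, 2.0), (5, 8, 1.0),
         (7, 8, 1.0), (3, 7, 1.0), (4, 7, 3.0)]
n, n_leaves = 8, 5

L = np.zeros((n, n))
D = np.full((n, n), np.inf)
np.fill_diagonal(D, 0.0)
for u, v, length in EDGES:
    i, j, w = u - 1, v - 1, 1.0 / length   # assumed affinity: 1 / length
    L[i, i] += w
    L[j, j] += w
    L[i, j] -= w
    L[j, i] -= w
    D[i, j] = D[j, i] = length
for k in range(n):                         # Floyd-Warshall path lengths
    D = np.minimum(D, D[:, [k]] + D[[k], :])

def centered(M):
    """Return H M H for the centering matrix H of matching size."""
    m = M.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ M @ H

# Full tree: pseudoinverse of HDH compared with L.
P_full = np.linalg.pinv(centered(D), 1e-10)

# Leaves only: Schur complement L/Z of the interior block Z.
W, X = L[:n_leaves, :n_leaves], L[:n_leaves, n_leaves:]
Y, Z = L[n_leaves:, :n_leaves], L[n_leaves:, n_leaves:]
L_star = W - X @ np.linalg.inv(Z) @ Y
P_leaf = np.linalg.pinv(centered(D[:n_leaves, :n_leaves]), 1e-10)

print(np.allclose(P_full, -0.5 * L), np.allclose(P_leaf, -0.5 * L_star))
```

The explicit cutoff passed to `pinv` keeps the numerically tiny eigenvalue along the all-ones direction from being inverted.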
Recall: Our solution • Find eigenvector Y of HD*H with the smallest eigenvalue • The signs of the entries of Y identify a split of the tree • The eigenvector of HD*H (negative semidefinite) with the smallest eigenvalue is the eigenvector of L* with the smallest positive eigenvalue • In fact, L* can be seen as a graph Laplacian • And our solution, Y, is the Fiedler vector of that graph! • But what does this graph look like?
Schur complementation of a vertex • The vertices adjacent to 8 become adjacent to each other
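A small sketch of Schur complementation at a single vertex, again on a topology inferred from the slides with hypothetical branch lengths and affinities taken as reciprocal lengths. `eliminate_vertex` is an illustrative helper. Eliminating vertex 8 creates edges among its former neighbors 5, 6 and 7.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are hypothetical.
EDGES = [(1, 6, 1.0), (2, 6, 2.0), (6, 8, 2.0), (5, 8, 1.0),
         (7, 8, 1.0), (3, 7, 1.0), (4, 7, 3.0)]

def tree_laplacian(edges, n):
    L = np.zeros((n, n))
    for u, v, length in edges:
        i, j, w = u - 1, v - 1, 1.0 / length   # assumed affinity: 1 / length
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return L

def eliminate_vertex(L, idx):
    """Schur complement of L with respect to the single vertex at index idx."""
    keep = [i for i in range(L.shape[0]) if i != idx]
    W = L[np.ix_(keep, keep)]
    x = L[keep, idx]                           # couplings to the removed vertex
    return W - np.outer(x, x) / L[idx, idx]

L = tree_laplacian(EDGES, 8)
L_8 = eliminate_vertex(L, 7)                   # remove vertex 8; rows = 1..7
for a, b in [(5, 6), (5, 7), (6, 7)]:
    # Off-diagonal entry before and after: zero before, negative after.
    print(a, b, L[a - 1, b - 1], L_8[a - 1, b - 1])
```

A negative off-diagonal entry in a Laplacian corresponds to an edge, so the three printed pairs become mutually adjacent after the elimination.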
Schur complementation of the interior • The graph described by L* is fully connected • All cuts yield connected subgraphs No help from Fiedler
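To see that eliminating the whole interior produces a fully connected graph, one can form the Schur complement with respect to the interior block and inspect its off-diagonal entries. This is a sketch on a topology inferred from the slides, with hypothetical branch lengths and affinities assumed to be reciprocal lengths.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are hypothetical.
EDGES = [(1, 6, 1.0), (2, 6, 2.0), (6, 8, 2.0), (5, 8, 1.0),
         (7, 8, 1.0), (3, 7, 1.0), (4, 7, 3.0)]
n, n_leaves = 8, 5

L = np.zeros((n, n))
for u, v, length in EDGES:
    i, j, w = u - 1, v - 1, 1.0 / length   # assumed affinity: 1 / length
    L[i, i] += w
    L[j, j] += w
    L[i, j] -= w
    L[j, i] -= w

# Partition L into leaf (W) and interior (Z) blocks; take L/Z.
W, X = L[:n_leaves, :n_leaves], L[:n_leaves, n_leaves:]
Y, Z = L[n_leaves:, :n_leaves], L[n_leaves:, n_leaves:]
L_star = W - X @ np.linalg.inv(Z) @ Y

off_diagonal = L_star[~np.eye(n_leaves, dtype=bool)]
n_edges = int((off_diagonal < 0).sum()) // 2   # each edge appears twice
print(n_edges)                                 # edges among the 5 leaves
```

Every pair of leaves ends up joined by an edge, so the sparsity pattern of L* alone carries no information about the tree.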
Recap thus far • Given matrix D* of pairwise distances between leaves • Find eigenvector Y of HD*H with the smallest eigenvalue • Claim: The signs of the entries of Y identify a split of the tree • Y shown to be a Fiedler vector of the Laplacian L* • But graph of L* is fully connected, has no apparent structure • Thus Fiedler says nothing about signs of entries of Y • But claim requires signs to be consistent with structure of the tree
Recap thus far • Thus Fiedler says nothing about signs of entries of Y • But claim requires signs to be consistent with structure of the tree • How does L* inherit the structure of the tree? (figure panels labeled NO, NO, YES)
The quotient rule inspires a “Schur tower” • How does this help?
Cutpoints and connected components • A point of articulation (or cutpoint) is a point r ∈ G whose deletion yields a subgraph with ≥ 2 connected components • Cutpoints: 6,7,8 • Shown: {1}, {2}, {3,4,5,7,8} are connected components at 6 • The cutpoints of a tree are its internal nodes
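The definition can be applied directly by deleting each vertex in turn and counting the components that remain. The edge list below is the topology inferred from the slides; `components_without` is an illustrative helper.

```python
from collections import defaultdict

# Topology inferred from the slides (leaves 1-5, internal nodes 6-8).
EDGES = [(1, 6), (2, 6), (6, 8), (5, 8), (7, 8), (3, 7), (4, 7)]

def components_without(edges, removed):
    """Connected components of the graph after deleting vertex `removed`."""
    adj = defaultdict(set)
    vertices = set()
    for u, v in edges:
        vertices.update((u, v))
        if removed not in (u, v):
            adj[u].add(v)
            adj[v].add(u)
    vertices.discard(removed)
    seen, comps = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                      # depth-first search from `start`
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

cutpoints = [v for v in range(1, 9) if len(components_without(EDGES, v)) >= 2]
print(cutpoints)
print(components_without(EDGES, 6))
```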
The key observation (i.e. theorem) • Let L be the Laplacian of a graph G with some cutpoint v • Let L{v} be the Laplacian of G{v} obtained by Schur complement at v • Then the Fiedler cut of G{v} identifies a split of G • Here the Fiedler cut of G{6} is {{1,2,5,8},{3,4,7}} • Including 6 in {1,2,5,8} defines two connected components in G (Figure: G and G{6}, with Fiedler vector entries +0.0570, -0.4129, +0.5828, +0.0380, -0.3439, +0.4660, -0.3870 and their signs marked at the vertices)
The quotient rule inspires a “Schur tower” (from L to L*) • How does this help? • Look at Schur paths to graph with Laplacian L*
The punch line • The graph with Laplacian L* can be obtained in three ways • The Fiedler cut of G{6,7,8} must split G{6,7} and G{6,8} and G{7,8}
Recall: Example • Find eigenvector Y of HD*H with the smallest eigenvalue: • Signs of Y identify the split {{1,2},{3,4,5}} (entries of Y as they appear on the figure, not necessarily in leaf order: -0.0564 +0.5793 -0.5011 -0.4636 +0.4418)
The punch line • The graph with Laplacian L* can be obtained in three ways • The Fiedler cut of G{6,7,8} must split G{6,7} and G{6,8} and G{7,8} • This implies that the cut splits the progenitor graph G! {{1,2,6},{3,4,5,7,8}}
Our solution actually works • Let H be the centering matrix: H = I − (1/n)11ᵀ • Find eigenvector Y of HD*H with the smallest eigenvalue • The signs of the entries of Y identify a split of the tree The phylogenetic portion D*
A recipe for tree reconstruction • Find a split • NJ relies on theorem that guarantees (2,n-2) split from Q matrix • We have a theorem that guarantees splits from HD*H matrix • Use knowledge of split to reduce dimension • NJ prunes the cherry (neighboring taxa) to reduce leaves by one • We use a divisive method that reduces to pairs of subtrees • Iterate until tree has been fully reconstructed • Tree topology specified by its split set
Connections with Classical MDS and PCoA • Classical solution to multidimensional scaling • a.k.a. Principal coordinate analysis • Recipe for dimension reduction given distance matrix D: • Construct matrix A from D entrywise: x ↦ −x²/2 • Double centering: B = HAH • Find k largest eigenvalues λi of B with corresponding eigenvectors Xi • Coordinates of point Pr given by row r of eigenvector entries • k = 1 with sqrt of tree distance is equivalent to our approach
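The recipe can be sketched as below, together with the k = 1 special case on the square root of a tree distance. `D_star` is a hypothetical additive matrix for a five-leaf tree. Scaling each eigenvector by the square root of its eigenvalue is the usual classical MDS convention and is an addition to the slide's recipe; it does not change any signs.

```python
import numpy as np

# Hypothetical leaf-to-leaf tree distances (leaves 1..5).
D_star = np.array([[0, 3, 5, 7, 4],
                   [3, 0, 6, 8, 5],
                   [5, 6, 0, 4, 3],
                   [7, 8, 4, 0, 5],
                   [4, 5, 3, 5, 0]], dtype=float)

def classical_mds(D, k):
    """Classical MDS / PCoA coordinates from a distance matrix D."""
    n = D.shape[0]
    A = -0.5 * D ** 2                          # entrywise x -> -x^2 / 2
    H = np.eye(n) - np.ones((n, n)) / n
    B = H @ A @ H                              # double centering
    B = (B + B.T) / 2                          # remove floating-point asymmetry
    eigvals, eigvecs = np.linalg.eigh(B)       # ascending order
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    lam = eigvals[top]
    coords = eigvecs[:, top] * np.sqrt(np.maximum(lam, 0.0))
    return coords, lam

# k = 1 on the square root of the tree distance: B equals -(1/2) H D* H.
coords, lam = classical_mds(np.sqrt(D_star), k=1)
axis = coords[:, 0]
same_side_as_leaf_1 = {i + 1 for i in range(5)
                       if np.sign(axis[i]) == np.sign(axis[0])}
print(same_side_as_leaf_1)
```

With this input, the largest eigenvalue of B belongs to the same eigenvector as the smallest eigenvalue of HD*H, so splitting the first coordinate at 0 reproduces the sign-based split.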
Phylogenetic ordination • PCoA on sequence data with k = 3: • For appropriate distance, C1 (x-axis) guaranteed to split taxa at 0 • Our results support popular use of PCoA • Provided that the right distance is considered…
Conclusion I • Natural connection between matrix of pairwise distances and the Laplacian of a complete graph
Conclusion II • Structure of tree embedded in complete graph and recoverable via spectral theory • Notion of “Fiedler cut” extends concept to “Fiedler split” • Inheritance propagated through Schur tower (figure panels labeled NO, NO, YES)
Conclusion III • Results inspire fast divisive tree reconstruction method
Conclusion IV • Provides guidance and justification for ordination approach
Acknowledgements • Alex Griffing (NCSU Bioinformatics) • Carl Meyer (NCSU Math) • Amy Langville (CoC Math)