On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities Ilan Gronau Shlomo Moran Technion – Israel Institute of Technology Haifa, Israel
Pairwise-Distance Based Reconstruction
[Figure: a dissimilarity matrix D over the taxa B, E, G, H, L, M; an edge-weighted tree T is reconstructed from D, and the tree metric DT is calculated from T.]
Optimization Criteria
We want the tree metric DT to approximate all pairwise distances in D simultaneously, i.e. D should be “close” to DT.
Two “closeness” measures are studied here:
• Maximal Difference (l∞)
• Maximal Distortion
Maximal Difference (l∞) vs. Maximal Distortion
[Figure: the matrices D and DT over the taxa B, E, G, H, L, M.]
Goal: find an optimal tree T, i.e. one which minimizes the maximal difference/distortion between D and DT.
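The two closeness measures can be sketched in a few lines of Python. The slides do not spell out a formula for maximal distortion, so the definition used here (largest expansion ratio times largest contraction ratio) is an assumption taken from the usual notion of metric distortion. The function names and the dictionary representation of D and DT are illustrative only.

```python
def max_difference(D, DT):
    """l-infinity distance: the largest absolute entry-wise difference."""
    return max(abs(D[pair] - DT[pair]) for pair in D)


def max_distortion(D, DT):
    """Assumed definition: (largest expansion) * (largest contraction).

    Requires all distances to be strictly positive.
    """
    expansion = max(DT[pair] / D[pair] for pair in D)
    contraction = max(D[pair] / DT[pair] for pair in D)
    return expansion * contraction


# Toy input: two pairwise distances and a candidate tree metric.
D = {("a", "b"): 2.0, ("a", "c"): 4.0}
DT = {("a", "b"): 3.0, ("a", "c"): 4.0}
print(max_difference(D, DT), max_distortion(D, DT))
```

Under this definition a tree metric that is an exact rescaling of D has distortion 1 but a nonzero maximal difference, which is one way to see that the two criteria differ.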
Previous Works on Approximating Dissimilarities by Tree Distances
• Negative results (NP-hardness):
  • Closest tree metric (even ultrametric) to a dissimilarity matrix under l1 and l2 [Day ‘87]
  • Closest tree metric to a dissimilarity matrix under l∞ [ABFPT99]
    • Hard to approximate better than 1.125
    • Implicit: hard to approximate the closest MaxDist tree within any constant factor
• Positive results:
  • Closest ultrametric to a dissimilarity matrix under l∞ [Krivanek ‘88]
  • 3-approximation of the closest additive metric to a given metric [ABFPT99]
    • (implicit 6-approximation for general dissimilarity matrices)
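This Work: Triplet-Distances, i.e. Distances to Triplet Midpoints
τT(i ; jk) is the distance in T from taxon i to C(i,j,k), the midpoint of the triplet {i, j, k}.
• τT(i ; jk) = τT(i ; kj)
• τT(i ; ij) = 0
• τT(i ; jj) = DT(i, j)
[Figure: taxa i, j, k joined at the midpoint C(i,j,k).]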
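Triplet-Distances Defined by 2-Distances
• Each distance matrix D defines 3-trees
• τ(i ; jk) = ½[D(i,j)+D(i,k)-D(j,k)].
Any metric on 3 taxa is realizable by a 3-tree.
[Figure: a metric on the taxa i, j, k with pairwise distances 8, 9, 7, realized by a 3-tree with center C(i,j,k) and edge weights 5, 3, 4.]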
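The formula for τ can be checked with a short sketch. The nested-dictionary layout of D and the function names are illustrative choices. Assigning the figure's distances 8, 9, 7 to the pairs (i,j), (i,k), (j,k) is an assumption about how the figure is labeled; with that labeling the edge weights 5, 3, 4 come out.

```python
def tau(D, i, j, k):
    """Triplet distance tau(i ; jk) induced by a pairwise matrix D."""
    return 0.5 * (D[i][j] + D[i][k] - D[j][k])


def three_tree_edges(D, i, j, k):
    """Edge weights of the 3-tree on {i, j, k}: distance of each leaf to the center."""
    return {i: tau(D, i, j, k), j: tau(D, j, i, k), k: tau(D, k, i, j)}


# Assumed labeling of the figure: D(i,j)=8, D(i,k)=9, D(j,k)=7.
D = {
    "i": {"i": 0, "j": 8, "k": 9},
    "j": {"i": 8, "j": 0, "k": 7},
    "k": {"i": 9, "j": 7, "k": 0},
}
print(three_tree_edges(D, "i", "j", "k"))
```

The same function reproduces the boundary identities for triplet distances: tau(D, "i", "i", "j") evaluates to 0 and tau(D, "i", "j", "j") evaluates to D(i,j).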
Triplet-Distance Based Reconstruction
τ(i ; jk) = ½[D(i,j)+D(i,k)-D(j,k)].
[Figure: a table of triplet distances with rows B, E, G, H, L, M and columns BB, BE, BG, …, LL, LM, MM, from which the tree T is reconstructed.]
Why use Triplet-Distances? 1. They enable more accurate estimations of 2-distances. 2. They are used (de facto) by known reconstruction algorithms
Improved Estimations of Pairwise Distances: “Information Loss”
[Figure: the matrix D over the taxa B, E, G, H, L, M; the entry D(H,E) is calculated by maximum likelihood from H and E.]
In calculating D(H,E), all other taxa are ignored.
Improved Estimations (cont):
[Figure: sequences B=(..AAGT..), L=(..AATA..), G=(..CCGT..), M=(..CGCG..), H=(..AACG..), E=(..CAGA..); four 3-trees, each on H, E and one further taxon.]
• Estimate D(H,E) by calculating all the 3-trees on {H, E, X : X ≠ H, E}
• (Or: calculate just one 3-tree, for a “trusted” 3rd taxon X:
  V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11) 1952–1963 (2002))
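One way to read the first bullet: each 3-tree on {H, E, X} yields an estimate of D(H,E) as the sum of the two edges leading from H and E to the center, τ(H ; EX) + τ(E ; HX). The slides do not say how the estimates from different X are combined, so the plain average below is an assumption, as are the function name and the dictionary layout of the triplet distances.

```python
def estimate_pair_distance(a, b, taxa, tau):
    """Estimate D(a, b) from the 3-trees on {a, b, x} for every other taxon x.

    tau maps (i, frozenset({j, k})) to the triplet distance tau(i ; jk).
    Averaging over x is an assumption, not something stated in the slides.
    """
    estimates = [
        tau[(a, frozenset((b, x)))] + tau[(b, frozenset((a, x)))]
        for x in taxa
        if x not in (a, b)
    ]
    return sum(estimates) / len(estimates)


# Hypothetical triplet distances for the 3-trees {H, E, B} and {H, E, G}.
tau = {
    ("H", frozenset(("E", "B"))): 3.0,
    ("E", frozenset(("H", "B"))): 5.0,
    ("H", frozenset(("E", "G"))): 2.0,
    ("E", frozenset(("H", "G"))): 4.0,
}
print(estimate_pair_distance("H", "E", ["B", "E", "G", "H"], tau))
```

Restricting `taxa` to a single trusted third taxon gives the variant in the second bullet.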
(Implicit) Use of Triplet-Distances in 2-Distance Reconstruction Algorithms
τ(i ; jk) = ½[D(i,j)+D(i,k)-D(j,k)].
[Figure: the matrix D over the taxa B, E, G, H, L, M is converted into a table of triplet distances (columns BB, BE, BG, …, LL, LM, MM), from which the tree T is reconstructed.]
1st use: “Triplet Distances from a Single Source”
• Fix a taxon r, and construct a tree T which minimizes: [formula not preserved]
• An optimal solution can be found in O(n²) time, and is used e.g. in:
  • (FKW95): Optimal approximation of distances by ultrametric trees.
  • (ABFPT99): The best known approximation of distances by general trees.
  • (BB99): Fast construction of Buneman trees.
[Figure: taxa i and j and the fixed source taxon r.]
2nd use: Saitou & Nei Neighbour Joining
The neighbor-selection criterion of NJ selects a taxon pair i, j which maximizes the sum: [formula not preserved]
[Figure: a candidate pair i, j and the remaining taxa r.]
Previous Works on Triplet-Dissimilarities/Distances
• I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1) pp. 1-15 (2007).
• Works which use the total weights of 3-trees:
  • S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12 pp. 191-205 (1995).
  • L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17 pp. 615-621 (2004).
  • D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3) 491–498 (2006).
Summary of Results
• Results for Maximal Difference (l∞):
  • The decision problem is NP-hard:
    • Is there a tree T s.t. ||τ,τT||∞ ≤ Δ?
  • Hardness of approximation of the optimization problem:
    • Finding a tree T s.t. ||τ,τT||∞ ≤ 1.4||τ,τOPT||∞
  • A 15-approximation algorithm
    • Using the 6-approximation algorithm for 2-dissimilarities from [ABFPT99]
• Result for Maximal Distortion:
  • Hardness of approximation within any constant factor
NP-Hardness of the Decision Problem
We use a reduction from 3SAT (the problem of determining whether a 3CNF formula is satisfiable).
[Figure: an example 3CNF formula with its literals and clauses marked, and a satisfying assignment.]
We show: if one can determine for (τ,Δ) whether there exists a tree T s.t. ||τ,τT||∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.
The Reduction
Given a 3CNF formula φ, we define triplet distances τ and an error bound Δ which force the output tree to imply a satisfying assignment to φ.
• The set of taxa:
  • Taxa T, F.
  • A taxon for every literal (xi, ¬xi).
  • 3 taxa for every clause Cj (yj1, yj2, yj3).
Properties Enforced by the Input (τ,Δ)
• One of the following can be enforced on each taxa triplet (u,v,w):
  • taxon u is close to Path(v,w), or
  • taxon u is far from Path(v,w)
[Figure: taxon u and the path between v and w.]
Enforcing Truth Assignment
• A truth assignment to φ is implied by the following:
  • T is far from F
  • For each i, xi is far from ¬xi, and both xi and ¬xi are close to Path(T, F)
Thus we set xi = T iff xi is close to T.
Enforcing Clauses-Satisfaction
A clause C = (l1 ∨ l2 ∨ l3) is satisfied iff at least one literal li is true, i.e. is close to T.
(l1 ∨ l2 ∨ l3) is satisfied iff it is not like this:
[Figure: all three literals l1, l2, l3 located close to F.]
We need to guarantee, by the close/far relations, that all clauses avoid the above configuration.
Clauses-Satisfaction (cont)
(l1 ∨ l2 ∨ l3) is satisfied iff, out of the three paths Path(l1, l2), Path(l1, l3), Path(l2, l3), at least two are close to T.
But we don’t know which two paths.
[Figure: T, F and the literals l1, l2, l3.]
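The "at least two paths" condition can be checked by brute force over the eight truth assignments to a clause. The sketch rests on one simplifying assumption, stated in the code: Path(a, b) counts as close to T exactly when at least one of its endpoints a, b is a true literal (i.e. is itself close to T). The function names are illustrative.

```python
from itertools import combinations, product


def clause_satisfied(literals):
    """A clause is satisfied iff at least one of its literals is true."""
    return any(literals)


def paths_close_to_T(literals):
    """Number of literal pairs whose path is close to T.

    Assumption: Path(a, b) is close to T iff a or b is true.
    """
    return sum(1 for a, b in combinations(literals, 2) if a or b)


def equivalence_holds():
    """Check 'satisfied iff at least two paths are close to T' on all assignments."""
    return all(
        clause_satisfied(v) == (paths_close_to_T(v) >= 2)
        for v in product([False, True], repeat=3)
    )


print(equivalence_holds())
```

With exactly one true literal, the two paths through it are close to T; with no true literal, none is. So the count is never exactly one, which is why "at least two" is the right threshold.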
Clauses-Satisfaction (cont)
We attach a taxon to each such path:
• y1 is close to Path(l2, l3)
• y2 is close to Path(l1, l3)
• y3 is close to Path(l1, l2)
(l1 ∨ l2 ∨ l3) is satisfied iff at least two yi’s can be located close to T …
[Figure: T, F, the literals l1, l2, l3 and the attached taxa y1, y2, y3.]
Clauses-Satisfaction (end)
… and at least two of the yi’s can be located close to T iff Path(y2, y3), Path(y1, y3), Path(y1, y2) are all close to T.
So (l1 ∨ l2 ∨ l3) is satisfied iff all the above paths are close to T.
[Figure: T, F, the literals l1, l2, l3 and the attached taxa y1, y2, y3.]
Construction Example
φ is satisfiable ⟺ there is a tree T which satisfies all bounds:
A1: τT(T, F) ≥ 2α+2β
A2: for i=1..n: τT(T ; xi ¬xi) ≤ α ; τT(F ; xi ¬xi) ≤ α
B1: for j=1..m: τT(yj1 ; lj2 lj3) ≤ α ; τT(yj2 ; lj1 lj3) ≤ α ; τT(yj3 ; lj1 lj2) ≤ α
B2: for j=1..m: τT(yj1 ; T F) ≥ α ; τT(yj2 ; T F) ≥ α ; τT(yj3 ; T F) ≥ α
B3: for j=1..m: τT(T ; yj2 yj3) ≤ α ; τT(T ; yj1 yj3) ≤ α ; τT(T ; yj1 yj2) ≤ α
[Figure: an example tree containing T, F, the internal vertices vT and vF, and the clause taxa y11, y12, y13, y21, y22, y23, with edge weights α and 2β.]
Hardness of Approximation Results
By “stretching” the close/far restrictions, the following problems are also shown to be NP-hard:
• Approximating Maximal Difference:
  • Finding a tree T s.t. ||τ,τT||∞ ≤ 1.4||τ,τOPT||∞
• Approximating Maximal Distortion:
  • Finding a tree T s.t. MaxDist(τ,τT) ≤ C·MaxDist(τ,τOPT) for any constant C
Details in: I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007, pp. 44-55.
Open Problems/Further Research
• Extending the hardness results to 3-diss tables induced by 2-diss matrices
  • (τ(i ; jk) = ½[D(i,j)+D(i,k)-D(j,k)])
• Extending the hardness results to “naturally looking” trees
  • (binary trees with constant-bounded edge weights)
• Check the performance of NJ when the neighbor-selection formula is computed from “real” 3-distances.
• Devise algorithms which use 3-distances as input.
• Does optimization of 3-diss lead to good topological accuracy (under accepted models of sequence evolution)?
  • (it is known that optimization of 2-diss doesn’t lead to good topological accuracy)
Distance-Based Phylogenetic Reconstruction
• Compute distances between all taxon pairs
• Find an edge-weighted tree which best describes the distances
[Figure: an example distance matrix and an edge-weighted tree.]
The Reduction: τ(φ)
[Figure: the example tree containing T, F, the internal vertices vT and vF, and the clause taxa y11, y12, y13, y21, y22, y23, with edge weights α and 2β.]
Bounds to be satisfied by the tree T:
A1: τT(T, F) ≥ 2α+2β
A2: for i=1..n: τT(T ; xi ¬xi) ≤ α ; τT(F ; xi ¬xi) ≤ α
B1: for j=1..m: τT(yj1 ; lj2 lj3) ≤ α ; τT(yj2 ; lj1 lj3) ≤ α ; τT(yj3 ; lj1 lj2) ≤ α
B2: for j=1..m: τT(yj1 ; T F) ≥ α ; τT(yj2 ; T F) ≥ α ; τT(yj3 ; T F) ≥ α
B3: for j=1..m: τT(T ; yj2 yj3) ≤ α ; τT(T ; yj1 yj3) ≤ α ; τT(T ; yj1 yj2) ≤ α
• In our constructed tree:
  • All 2-distances are in [2α, 2α+2β].
  • All 3-distances are in [α, α+2β].
• Δ = β.
Input values τ(φ) which enforce these bounds:
A1: τ(T, F) = 2α+3β
A2: for i=1..n: τ(T ; xi ¬xi) = α-β ; τ(F ; xi ¬xi) = α-β
B1: for j=1..m: τ(yj1 ; lj2 lj3) = α-β ; τ(yj2 ; lj1 lj3) = α-β ; τ(yj3 ; lj1 lj2) = α-β
B2: for j=1..m: τ(yj1 ; T F) = α+β ; τ(yj2 ; T F) = α+β ; τ(yj3 ; T F) = α+β
B3: for j=1..m: τ(T ; yj2 yj3) = α-β ; τ(T ; yj1 yj3) = α-β ; τ(T ; yj1 yj2) = α-β
Other 2-distances: τ(s, t) = 2α+2β
Other 3-distances: τ(s ; t u) = α+2β
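The input values listed above can be assembled mechanically from a 3CNF formula. The following is a minimal sketch, not the authors' code. The encoding of literals as signed integers, the taxon names ("x1", "~x1", "y1_2", ...), the function names and the dictionary layout are all illustrative assumptions, and each clause is assumed to contain three distinct literals.

```python
from itertools import combinations


def lit_name(lit):
    """Taxon name for a signed literal: 2 -> 'x2', -2 -> '~x2' (hypothetical naming)."""
    return f"x{lit}" if lit > 0 else f"~x{-lit}"


def build_instance(n_vars, clauses, alpha, beta):
    """Return (taxa, 2-distances, 3-distances, Delta) for a 3CNF formula."""
    taxa = ["T", "F"]
    for i in range(1, n_vars + 1):
        taxa += [lit_name(i), lit_name(-i)]
    for j in range(1, len(clauses) + 1):
        taxa += [f"y{j}_{k}" for k in (1, 2, 3)]

    # Default 2-distances, then A1 for the pair (T, F).
    pair = {frozenset(p): 2 * alpha + 2 * beta for p in combinations(taxa, 2)}
    pair[frozenset(("T", "F"))] = 2 * alpha + 3 * beta

    # Default 3-distances tau(i ; jk), keyed by (i, {j, k}).
    trip = {}
    for i in taxa:
        others = [t for t in taxa if t != i]
        for j, k in combinations(others, 2):
            trip[(i, frozenset((j, k)))] = alpha + 2 * beta

    for i in range(1, n_vars + 1):  # A2
        xs = frozenset((lit_name(i), lit_name(-i)))
        trip[("T", xs)] = alpha - beta
        trip[("F", xs)] = alpha - beta

    for j, clause in enumerate(clauses, start=1):
        lits = [lit_name(x) for x in clause]
        ys = [f"y{j}_{k}" for k in (1, 2, 3)]
        for a in range(3):
            b, c = [t for t in range(3) if t != a]
            trip[(ys[a], frozenset((lits[b], lits[c])))] = alpha - beta  # B1
            trip[(ys[a], frozenset(("T", "F")))] = alpha + beta          # B2
            trip[("T", frozenset((ys[b], ys[c])))] = alpha - beta        # B3

    return taxa, pair, trip, beta  # Delta = beta


# Example: the single clause (x1 OR NOT x2 OR x3) with alpha=10, beta=1.
taxa, pair, trip, delta = build_instance(3, [(1, -2, 3)], 10, 1)
print(len(taxa), pair[frozenset(("T", "F"))], delta)
```

With Δ = β, an input value of α-β allows tree values up to α, and an input value of α+β allows tree values down to α, which is how the "close" and "far" bounds are obtained from the table.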