This paper discusses the development and performance of distance-based reconstruction algorithms for phylogenetic reconstruction in stochastic substitution models. The goal is to reconstruct the "true" tree as accurately as possible by minimizing the effect of noise introduced by sampling. The paper explores the Kimura 2-Parameter (K2P) model, substitution rate functions, and methods for optimizing distances in the K2P model. Simulation results and performance evaluations are presented.
Towards optimal distance functions for stochastic substitution models. Ilan Gronau, Shlomo Moran, Irad Yavneh (Technion, Israel)
Evolution is modeled by a tree. (All our sequences are DNA sequences, consisting of {A,G,C,T}.) [Figure: an evolutionary tree with a DNA sequence at every node, e.g. ACGGTCA, AAAGTCA, ACGGATA.]
Phylogenetic Reconstruction. Given sequences: GGGGATT, GAACGTA, AATCCTG, AATGGGC, AAACCGA, TCTGGGA, ATAGCTG, ACCGTTG, TCCGGAA, AGCCGTG.
Phylogenetic Reconstruction. Input sequences: A: AATGGGC, B: AATCCTG, C: ATAGCTG, D: GAACGTA, E: AAACCGA, F: GGGGATT, G: TCTGGGA, H: TCCGGAA, I: AGCCGTG, J: ACCGTTG. Goal: reconstruct the ‘true’ tree as accurately as possible. [Figure: the rooted ‘true’ tree and a reconstructed tree over the leaves A-J.]
Road Map • Distance based reconstruction algorithms • The Kimura 2 Parameter (K2P) Model • Performance of distance methods in the K2P model • Substitution models and substitution rate (SR) functions • Properties of SR functions • Unified Substitution Models • Optimizing Distances in the K2P model • Simulation results
Distance Based Phylogenetic Reconstruction: Exact vs. Noisy distances. Exact (additive) distances between species are not available; distances are estimated using finite sampling, and the tree is reconstructed from the estimated distances. Challenge: minimize the effect of the noise introduced by the sampling. [Figure: an edge-weighted ‘true’ tree and a reconstructed tree over the species A-G.]
The Kimura 2 Parameter (K2P) model [Kimura80]: each edge corresponds to a “Rate Matrix”. The K2P generic rate matrix has two parameters: the rate α of transitions and the rate β of transversions. [Figure: an edge (u,v) and the K2P rate diagram.]
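For reference, here is one common way to write the K2P generic rate matrix. The nucleotide order (A, G, C, T) and the layout are assumptions, not taken from the slide. α is the rate of the single transition and β the rate of each of the two transversions.

```latex
R \;=\;
\begin{pmatrix}
-(\alpha+2\beta) & \alpha & \beta & \beta \\
\alpha & -(\alpha+2\beta) & \beta & \beta \\
\beta & \beta & -(\alpha+2\beta) & \alpha \\
\beta & \beta & \alpha & -(\alpha+2\beta)
\end{pmatrix}
```

Each row sums to zero, and the off-diagonal entries of a row sum to α + 2β.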
K2P standard distance: Δtotal = total substitution rate. The total substitution rate of a K2P rate matrix R is Δtotal(R) = α + 2β. This is the expected number of mutations per site. It is an additive distance: on a path u, v, w whose edges have total rates α + 2β and α’ + 2β’, the distance from u to w is (α+α’) + 2(β+β’).
Estimation of Δtotal(Ruv) = dK2P(u,v) is a noisy stochastic process. [Figure: the K2P total rate “distance correction” procedure.]
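A minimal sketch of such a distance correction procedure, assuming two aligned sequences over {A,G,C,T} and the standard K2P correction formula. The function name `k2p_distance` and the choice to return infinity on saturation are illustrative assumptions, not part of the talk.

```python
import math

PURINES = {"A", "G"}  # A<->G and C<->T are transitions; all other changes are transversions

def k2p_distance(seq_u, seq_v):
    """Estimate the K2P total-rate distance between two aligned DNA sequences."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for a, b in zip(seq_u, seq_v):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq_u)
    p = transitions / n        # observed fraction of sites showing a transition
    q = transversions / n      # observed fraction of sites showing a transversion
    lam = 1.0 - 2.0 * q        # first quantity whose logarithm is taken
    mu = 1.0 - 2.0 * p - q     # second quantity whose logarithm is taken
    if lam <= 0.0 or mu <= 0.0:
        return math.inf        # saturated: the correction is undefined (assumed convention)
    return -(math.log(lam) + 2.0 * math.log(mu)) / 4.0
```

Because p and q are sample frequencies from a finite alignment, the returned value fluctuates around the true total rate. That fluctuation is the noise discussed in the rest of the talk.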
Check the performance of K2P “standard” distances in resolving quartet-splits. There are 3 possible quartet topologies: AB|CD, AC|BD, AD|BC. • Distance methods reconstruct the true split by the 4-point condition: for the true split AB|CD and exact distances, d(A,B) + d(C,D) = d(A,C) + d(B,D) - 2wsep = d(A,D) + d(B,C) - 2wsep, where wsep is the weight of the internal edge separating the two pairs. The 4-point condition for noisy distances is: d(A,B) + d(C,D) < min{ d(A,C) + d(B,D), d(A,D) + d(B,C) }.
We evaluate the accuracy of the K2P distance estimation by the Split Resolution Test. [Figure: a rooted template quartet with leaves A, B, C, D; t is “evolutionary time”.] The diameter of the quartet is 22t.
Phase A: simulate evolution along the quartet to obtain sequences at the leaves A, B, C, D.
Phase B: reconstruct the split by the 4p condition: estimate the distances between the sequences, then apply the 4p condition. Was the correct split found? Repeat this process 10,000 times and count the number of failures.
The split resolution test was applied to the model quartet with various diameters … … • For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide)
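The 4-point rule for noisy distances can be sketched as follows. The function name `resolve_split` and the dictionary layout are illustrative assumptions. The rule picks the pairing whose sum of within-pair distances is smallest.

```python
def resolve_split(d):
    """Return the quartet split chosen by the 4-point condition.

    d maps frozenset({x, y}) to the estimated distance between taxa x and y,
    for the four taxa "A", "B", "C", "D".
    """
    def pair_sum(x, y, z, w):
        return d[frozenset((x, y))] + d[frozenset((z, w))]

    sums = {
        "AB|CD": pair_sum("A", "B", "C", "D"),
        "AC|BD": pair_sum("A", "C", "B", "D"),
        "AD|BC": pair_sum("A", "D", "B", "C"),
    }
    # The split with the smallest sum wins; ties go to the first one listed.
    return min(sums, key=sums.get)

# Hypothetical exact distances: pendant edges of weight 1, internal edge of weight 0.5.
distances = {
    frozenset(("A", "B")): 2.0, frozenset(("C", "D")): 2.0,
    frozenset(("A", "C")): 2.5, frozenset(("A", "D")): 2.5,
    frozenset(("B", "C")): 2.5, frozenset(("B", "D")): 2.5,
}
print(resolve_split(distances))  # AB|CD
```

With exact distances the two larger sums exceed the smallest by twice the internal edge weight. That gap is the margin that noise has to overcome.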
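The two phases can be sketched end to end as below. This is only an illustration under stated assumptions: every edge of the quartet uses the same hypothetical rates `alpha` and `beta` (the talk's template quartet has its own edge lengths, which are not reproduced here), the true split is AB|CD, and all function names are invented for the example.

```python
import math
import random

BASES = "AGCT"
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}

def evolve(seq, alpha, beta, rng):
    """Phase A helper: evolve seq along one edge with K2P rates alpha, beta."""
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * alpha - 2.0 * beta)
    p_ts = (1.0 + lam - 2.0 * mu) / 4.0   # probability of the transition
    p_tv = (1.0 - lam) / 2.0              # probability of either transversion
    out = []
    for base in seq:
        x = rng.random()
        if x < p_ts:
            out.append(TRANSITION[base])
        elif x < p_ts + p_tv:
            out.append(rng.choice(TRANSVERSIONS[base]))
        else:
            out.append(base)
    return "".join(out)

def k2p_distance(s, t):
    """Phase B helper: K2P total-rate estimate from two aligned sequences."""
    n = len(s)
    ts = sum(1 for a, b in zip(s, t) if a != b and TRANSITION[a] == b)
    tv = sum(1 for a, b in zip(s, t) if a != b and TRANSITION[a] != b)
    lam, mu = 1.0 - 2.0 * tv / n, 1.0 - 2.0 * ts / n - tv / n
    if lam <= 0.0 or mu <= 0.0:
        return math.inf                   # saturated estimate
    return -(math.log(lam) + 2.0 * math.log(mu)) / 4.0

def split_failure_rate(alpha, beta, seq_len, trials, seed=0):
    """Fraction of trials in which the 4p condition misses the true split AB|CD."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        root = "".join(rng.choice(BASES) for _ in range(seq_len))
        left = evolve(root, alpha, beta, rng)     # internal node above A and B
        right = evolve(root, alpha, beta, rng)    # internal node above C and D
        a, b = evolve(left, alpha, beta, rng), evolve(left, alpha, beta, rng)
        c, d = evolve(right, alpha, beta, rng), evolve(right, alpha, beta, rng)
        ab_cd = k2p_distance(a, b) + k2p_distance(c, d)
        ac_bd = k2p_distance(a, c) + k2p_distance(b, d)
        ad_bc = k2p_distance(a, d) + k2p_distance(b, c)
        if not ab_cd < min(ac_bd, ad_bc):
            failures += 1
    return failures / trials
```

Calling `split_failure_rate` over a range of rates and plotting the result against the quartet diameter gives a curve of the kind the following slides describe.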
Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2. [Figure: failure rates for the template quartet.]
Performance for larger diameters: the “site saturation” effect.
When β < α, we can postpone the “site saturation” effect. For this, use another distance function for the same model, Δtv, which counts only transversions. This is actually the CFN model [Cavender78, Farris73, Neyman71]. [Figure: the K2P rate diagram with transition rate α and transversion rate β.]
Apply the same split resolution test to the transversions only distance. [Figure: the transversions only “distance correction” procedure.]
The transversions only distance performs better than the total K2P rate on large rates, and worse on small rates. [Figure: failure rates of the transversions only distance and of the total K2P rate.]
Conclusion: distance based reconstruction methods should be adaptive: find a distance function d which is good for the input. We do a small step in this direction. Input: an alignment of the sequences at u, v. Output: a (near)-optimal distance function, which minimizes the expected noise in the estimation procedure.
Example: An adaptive distance method (max-optimal) based on this talk:
Steps in finding optimal distance functions: • Define the substitution model. • Characterize the available distance functions. • Select a function which is optimal for the input sequences, i.e. least sensitive to stochastic noise.
From Rate matrices to Substitution matrices. Rate matrices imply stochastic substitution matrices: the evolution of a finite sequence from u to v, by unknown model parameters α, β, is described by a stochastic substitution matrix Puv.
A substitution model M: a set of stochastic substitution matrices, closed under matrix product: P,Q∈M ⇒ PQ∈M. Also required: P>0 and 0<det(P)<1 for all P∈M. Motivation for the definition: on a path u, v, w, the substitution matrix from u to w is the product of the matrices of the edges (u,v) and (v,w).
Model tree over M = <Tree Topology> + <DNA distribution at the root> + <M-substitution matrices at the edges>. [Figure: a tree whose root r carries the uniform distribution and whose edges each carry a substitution matrix, e.g. Prv.]
Distances for a given model are defined by Substitution Rate (SR) functions: • Δ: M → ℝ is an SR function for M iff for all P,Q in M: • Δ(PQ) = Δ(P) + Δ(Q) (additivity) • Δ(P) > 0 (positivity)
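For the K2P model the substitution matrix implied by a rate matrix has a simple closed form. The sketch below assumes the nucleotide order (A, G, C, T) and rates α, β already scaled by the edge's time. The function name is illustrative. The quantities `lam` and `mu` are the λ and μ that the talk introduces later for K2P.

```python
import math

def k2p_substitution_matrix(alpha, beta):
    """Return the 4x4 K2P substitution matrix for rates alpha, beta (order A, G, C, T)."""
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * alpha - 2.0 * beta)
    same = (1.0 + lam + 2.0 * mu) / 4.0   # probability of no observed change
    ts = (1.0 + lam - 2.0 * mu) / 4.0     # probability of the transition
    tv = (1.0 - lam) / 4.0                # probability of each transversion
    return [
        [same, ts, tv, tv],
        [ts, same, tv, tv],
        [tv, tv, same, ts],
        [tv, tv, ts, same],
    ]
```

Every row sums to 1, so the matrix is stochastic. The estimation procedure only ever sees a finite sample drawn from it, never α and β themselves.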
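As a short worked step (not on the slide): additivity of Δ on matrix products is what makes the induced distance additive along a path u, v, w, assuming the substitution matrix of the path is the product of its edge matrices.

```latex
\Delta(P_{uw}) \;=\; \Delta(P_{uv}\,P_{vw}) \;=\; \Delta(P_{uv}) + \Delta(P_{vw})
```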
1st question: Given a model M, what are its SR functions? SR functions are additive functions which are strictly positive.
Example 1: The logdet function [Lake94, Steel93] is an SR function for the most general model, Muniv: Muniv = {P: P is a stochastic 4×4 matrix, 0<det(P)<1}.
Both “logdet” and the “log eigenvalue” functions are special cases of a general technique, Generalized logdet, which is given below:
Linearity of additive functions: • If Δ1 and Δ2 are additive functions for M, so is c1Δ1 + c2Δ2. • The set of additive functions for M forms a vector space, to be denoted ADM. Dimension(ADM) is the dimension of this vector space. • A large dimension implies more “independent” distance functions. • If dimension(ADM) = 1, then M admits a single distance function (up to multiplication by a scalar), and selecting the best SR function in such a model is trivial. Thus, the adaptive approach is useful only when dimension(ADM) > 1.
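A small numeric check of additivity for logdet, assuming the sign convention Δlogdet(P) = -ln det(P) so that positivity holds when 0 < det(P) < 1. The matrices `P` and `Q` are hypothetical stochastic matrices made up for the example, and the helper names are illustrative.

```python
import math

def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def matmul(p, q):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def logdet(p):
    """Assumed convention: logdet(P) = -ln det(P)."""
    return -math.log(det(p))

# Hypothetical stochastic matrices (each row sums to 1); not taken from the talk.
P = [[0.85, 0.05, 0.06, 0.04],
     [0.03, 0.90, 0.03, 0.04],
     [0.05, 0.05, 0.80, 0.10],
     [0.02, 0.08, 0.05, 0.85]]
Q = [[0.70, 0.10, 0.10, 0.10],
     [0.05, 0.80, 0.05, 0.10],
     [0.10, 0.05, 0.75, 0.10],
     [0.05, 0.05, 0.10, 0.80]]

# Additivity follows from det(PQ) = det(P) det(Q).
difference = logdet(matmul(P, Q)) - (logdet(P) + logdet(Q))
```

The value of `difference` is zero up to floating-point rounding, and both `logdet(P)` and `logdet(Q)` are positive.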
Unified Substitution Models: models for which the adaptive approach is potentially useful.
Unified Substitution Models. Def: A model M is unified if there is a matrix U s.t. for each P∈M it holds that: U⁻¹PU = Using Lemma GLD, we have:
Strongly Unified Substitution Models. Def: A model M is strongly unified if there is a matrix U s.t. for each P∈M it holds that: U⁻¹PU =
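A simple strongly unified model: the Jukes Cantor model [1969]. MJC consists of the matrices in which every off-diagonal entry equals p and every diagonal entry equals 1 - 3p, for 0 < p < 0.25. MJC is strongly unified by a matrix U: for all P∈MJC, U⁻¹PU is diagonal, with eigenvalues 1 and 1 - 4p (three times). Claim: dimension(ADMJC) = 1. Hence the adaptive approach is irrelevant to this model.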
Another model M for which dimension(ADM) = 1. Recall: Muniv consists of all DNA transition matrices. Claim 2: dimension(ADMuniv) = 1. This means that all the additive functions of Muniv are proportional to logdet. Hence the adaptive approach is irrelevant to this model as well. Luckily, the spaces of additive functions of “intermediate” unified models have dimension > 1, hence the adaptive approach is useful for them. Next we return to the Kimura 2 parameter model.
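Back to K2P: every K2P substitution matrix P is diagonalized by the U of the JC model: U⁻¹PU is diagonal with eigenvalues 1, λP, μP, μP, where λP = 1 - 4Pβ = e^(-4β) and μP = 1 - 2Pβ - 2Pα = e^(-2α-2β). Here Pα is the probability of a transition and Pβ the probability of each transversion in P. 0 < λP < 1, 0 < μP < 1. Conclusion: dimension(ADMK2P) = 2.
The functions Δλ(P) = -ln(λP) and Δμ(P) = -ln(μP) form a basis of ADK2P. The standard “total rate” distance is: ΔK2P(P) = -(ln(λP) + 2ln(μP))/4 = -ln(det(P))/4, i.e. it is proportional to logdet. The “transversion only” distance is: Δtr(P) = -ln(λP)/4.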
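The two standard distances can be written as points (c1, c2) in the plane spanned by Δλ and Δμ. The sketch below assumes the inputs are the transition probability Pα and the per-type transversion probability Pβ of a K2P matrix. The function names are illustrative.

```python
import math

def k2p_basis_rates(p_ts, p_tv):
    """Return (Δλ, Δμ) for a K2P matrix with transition probability p_ts (Pα)
    and per-type transversion probability p_tv (Pβ)."""
    lam = 1.0 - 4.0 * p_tv
    mu = 1.0 - 2.0 * p_tv - 2.0 * p_ts
    return -math.log(lam), -math.log(mu)

def additive_function(c1, c2):
    """Build the additive function c1*Δλ + c2*Δμ."""
    def delta(p_ts, p_tv):
        d_lam, d_mu = k2p_basis_rates(p_ts, p_tv)
        return c1 * d_lam + c2 * d_mu
    return delta

total_rate = additive_function(0.25, 0.5)          # ΔK2P = -(ln λ + 2 ln μ)/4
transversions_only = additive_function(0.25, 0.0)  # Δtr = -ln(λ)/4
```

On exact probabilities generated with rates α and β, `total_rate` returns α + 2β and `transversions_only` returns β. An adaptive method is free to pick other coefficients c1, c2.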
K2P distance estimation: where the noise comes from: • inherent noise • implied noise propagation • “user controlled” noise propagation
Selection of c1, c2: for an edge (u,v), Estimated distance = True distance + Expected error.
Expected Relative Error = Expected error / True distance.
A basic property of Normalized Mean Square Error: This means that equivalent SR functions have the same NMSE
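One way to see this, assuming NMSE is the mean square error of the estimate divided by the squared true value (the slide's exact definition is not reproduced here), and taking "equivalent" to mean equal up to a positive scalar c:

```latex
\mathrm{NMSE}(c\,\Delta)
  \;=\; \frac{E\big[(c\,\hat{\Delta} - c\,\Delta)^2\big]}{(c\,\Delta)^2}
  \;=\; \frac{c^2\,E\big[(\hat{\Delta} - \Delta)^2\big]}{c^2\,\Delta^2}
  \;=\; \mathrm{NMSE}(\Delta)
```

Under this assumption only the ratio c1 : c2 matters when selecting a distance function, not the overall scale.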