Phylogenetics. What is a tree & how many are there? Principles of phylogenetic reconstruction. Special Issues: Rooting a tree, The Molecular Clock, Almost Clocks.
Trees – graphical & biological.
A graph is a set of vertices (nodes) {v1,..,vk} and a set of edges {e1=(vi1,vj1),..,en=(vin,vjn)}. Edges can be directed, in which case (vi,vj) is viewed as different from (opposite in direction to) (vj,vi), or undirected.
[Figure: example graph on nodes v1, v2, v3, v4, with edge labels (v1,v2) and (v2,v4) or (v4,v2).]
Nodes can be labelled or unlabelled. In phylogenies the leaves are labelled and the rest unlabelled. The degree of a node is the number of edges it is part of. A leaf has degree 1. A graph is connected if any two nodes have a path connecting them. A tree is a connected graph without any cycles, i.e. there is exactly one path between any two nodes.
Trees & phylogenies.
A tree with k nodes has k-1 edges (easy to show by induction). A root is a special node with degree 2 that is interpreted as the point furthest back in time. The leaves are interpreted as being contemporary. A root introduces a time direction in a tree. A rooted tree is said to be bifurcating if all nodes other than the leaves and the root have degree 3, corresponding to 1 ancestor and 2 children. An unrooted tree in which all internal nodes have degree 3 is said to have valency 3. Edges can be labelled with a positive real number interpreted as time duration or amount of evolution. If the length of the path from the root to any leaf is the same, the tree obeys a molecular clock.
Tree Topology: Discrete structure, i.e. a phylogeny without branch lengths.
[Figure: rooted tree with the root, internal nodes and leaves marked.]
Enumerating Trees: Unrooted & valency 3
[Figure: unrooted trees on up to 5 labelled leaves.]
Recursion: T_n = (2n-5) T_{n-1}, because leaf n can be attached to any of the 2n-5 edges of a tree with n-1 leaves.
Initialisation: T_1 = T_2 = T_3 = 1
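The recursion can be evaluated directly. The following is a minimal sketch; the function names are hypothetical, and the rooted count is not on the slide but follows from placing a root on any edge of an unrooted tree, which is the same as adding one extra leaf.

```python
def unrooted_tree_count(n: int) -> int:
    """Number of unrooted, valency-3 topologies on n labelled leaves."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1  # initialisation: T_1 = T_2 = T_3 = 1
    for m in range(4, n + 1):
        count *= 2 * m - 5  # recursion: T_m = (2m - 5) * T_{m-1}
    return count


def rooted_tree_count(n: int) -> int:
    """Rooted bifurcating topologies on n leaves (assumption: root = extra leaf)."""
    return unrooted_tree_count(n + 1)


for leaves in (4, 5, 10, 20):
    print(leaves, unrooted_tree_count(leaves), rooted_tree_count(leaves))
```

The numbers grow so quickly that exhaustive search over topologies is only feasible for a handful of leaves.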
Local operations on trees.
Nearest Neighbor Interchange: [Figure: two trees on the subtrees A, B, C, D, before and after the interchange.]
Subtree cut and regrafting (subtree root kept)
Subtree cut and regrafting (subtree root possibly new)
Central Principles of Phylogeny Reconstruction
[Figure: the tree on s1, s2, s3, s4 drawn three times, once each for parsimony, distance and likelihood.]
Leaf sequences: TTCAGT, TCCAGT, GCCAAT, GCCAAT
Parsimony: total weight 4.
Distance: branch lengths fitted to pairwise distances.
Likelihood: L = 3.1*10^-7, with parameter estimates.
Distance Concepts on Trees I
A: Metric, d( , ):
i: d(a,b) = 0 <=> a = b
ii: d(a,b) = d(b,a)
iii: d(a,b) <= d(a,c) + d(c,b)
[Figure: three points a, b, c.]
Distance Concepts on Trees II
[Figure: tree on s1, s2, s3, s4, and the three-leaf tree on s1, s2, s3 with branch point i.]
Tree Metric (distance function originates from a tree):
d(x,y) + d(z,w) = d(x,z) + d(y,w) > d(x,w) + d(y,z), where x,y,z,w is a permutation of a,b,c,d.
(> implies that no branch has length 0)
Reconstruction Principle: d(s1,i) = (d(s1,s2) + d(s1,s3) - d(s2,s3))/2
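A small numerical check of the tree-metric condition and the reconstruction principle may help. This is a sketch under assumptions: the example tree ((s1,s2),(s3,s4)) with pendant branches 1, 2, 3, 4 and internal branch 1 is made up for illustration, and the function names are hypothetical.

```python
def four_point_sums(d, x, y, z, w):
    """The three pairings of four leaves, sorted from smallest to largest sum."""
    return sorted([
        d[frozenset((x, y))] + d[frozenset((z, w))],
        d[frozenset((x, z))] + d[frozenset((y, w))],
        d[frozenset((x, w))] + d[frozenset((y, z))],
    ])


def is_tree_metric_quartet(d, x, y, z, w, tol=1e-9):
    """Tree metric on four leaves: the two largest sums are equal."""
    sums = four_point_sums(d, x, y, z, w)
    return abs(sums[2] - sums[1]) <= tol


def distance_to_branch_point(d12, d13, d23):
    """d(s1, i) for the branch point i of the three-leaf tree on s1, s2, s3."""
    return (d12 + d13 - d23) / 2


# Path lengths in the made-up example tree ((s1,s2),(s3,s4)).
pairs = {("s1", "s2"): 3, ("s1", "s3"): 5, ("s1", "s4"): 6,
         ("s2", "s3"): 6, ("s2", "s4"): 7, ("s3", "s4"): 7}
d = {frozenset(p): v for p, v in pairs.items()}

print(four_point_sums(d, "s1", "s2", "s3", "s4"))
print(distance_to_branch_point(3, 5, 6))  # pendant branch of s1
```

The smallest of the three sums identifies the pairing (s1,s2) | (s3,s4), i.e. the topology.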
Distance Concepts on Trees III
[Figure: three-leaf tree on s1, s2, s3 with branch point i.]
Ultra Metric (distance function originates from a tree):
d(x,y) = d(x,z) > d(y,z), where x,y,z is a permutation of a,b,c.
(> implies that no branch has length 0)
Reconstruction Principle: d(s1,i) = d(s1,s2)/2
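The ultrametric condition says that among the three distances of any triple, the two largest are equal. A minimal sketch of that check, with a hypothetical function name and made-up example distances:

```python
from itertools import combinations


def is_ultrametric(d, leaves, tol=1e-9):
    """True if, for every triple, the two largest pairwise distances agree."""
    for x, y, z in combinations(leaves, 3):
        smallest, middle, largest = sorted(
            [d[frozenset((x, y))], d[frozenset((x, z))], d[frozenset((y, z))]]
        )
        if abs(largest - middle) > tol:
            return False
    return True


# Made-up clock-like distances: a and b are closest, c joins further back.
clock_like = {frozenset(p): v for p, v in
              {("a", "b"): 2, ("a", "c"): 6, ("b", "c"): 6}.items()}
# The same leaves with b evolving faster than a: no longer ultrametric.
not_clock_like = {frozenset(p): v for p, v in
                  {("a", "b"): 3, ("a", "c"): 6, ("b", "c"): 7}.items()}

print(is_ultrametric(clock_like, "abc"), is_ultrametric(not_clock_like, "abc"))
```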
UPGMA
Sokal and Michener, 1958
Unweighted Pair-Group Method with Arithmetic Mean
Input: Matrix with pairwise distances between sequences, D.
1: Find the smallest distance, d_{i,j}.
2: i and j are now siblings with a distance, d_{i,j}/2, to their MRCA (i,j).
3: Form a new distance matrix of dimension (n-1)*(n-1) in which i and j have been substituted by (i,j). All distances to (i,j) are d_{k,(i,j)} = (d_{k,i} + d_{k,j})/2.
4: This is done n-1 times and the tree has been reconstructed.
Output: An ultrametric.
Comment: i. If UPGMA is given an ultrametric, it will reconstruct the same ultrametric.
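The four steps can be sketched as below. Two things here are assumptions rather than slide content: the data layout (nested tuples for clusters, a dict of pairwise distances) and the update in step 3, which averages over all leaves in the two clusters by weighting with cluster sizes. For two single leaves this is exactly the formula above, and on ultrametric input both versions give the same tree.

```python
def upgma(names, dist):
    """names: leaf labels; dist: {(a, b): distance} for every pair of leaves.

    Returns (tree, height): tree as nested tuples, height of every cluster's MRCA.
    """
    size = {name: 1 for name in names}
    height = {name: 0.0 for name in names}
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    active = list(names)
    while len(active) > 1:
        # Step 1: the smallest remaining distance.
        i, j = min(
            ((a, b) for idx, a in enumerate(active) for b in active[idx + 1:]),
            key=lambda p: d[frozenset(p)],
        )
        merged = (i, j)
        # Step 2: i and j become siblings, their MRCA at half their distance.
        height[merged] = d[frozenset((i, j))] / 2
        size[merged] = size[i] + size[j]
        # Step 3: distances from every other cluster to the merged cluster
        # (size-weighted average, see the assumption in the lead-in).
        for k in active:
            if k != i and k != j:
                d[frozenset((k, merged))] = (
                    size[i] * d[frozenset((k, i))] + size[j] * d[frozenset((k, j))]
                ) / size[merged]
        # Step 4: repeat on the reduced matrix.
        active = [k for k in active if k != i and k != j] + [merged]
    return active[0], height


# Made-up ultrametric input: (a,b) at height 1, c joins at 3, d joins at 5.
example = {("a", "b"): 2, ("a", "c"): 6, ("b", "c"): 6,
           ("a", "d"): 10, ("b", "d"): 10, ("c", "d"): 10}
tree, height = upgma(["a", "b", "c", "d"], example)
print(tree, height[tree])
```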
Assignment to internal nodes: The simple way.
[Figure: tree with nucleotides at the leaves and ? at the internal nodes.]
What is the cheapest assignment of nucleotides to internal nodes, given some (symmetric) distance function d(N1,N2)?
If there are k leaves, there are k-2 internal nodes and 4^(k-2) possible assignments of nucleotides. For k=22, this is more than 10^12.
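Cost of a history - minimizing over internal states
[Figure: a node with candidate states A, C, G, T above two subtrees, each with candidate states A, C, G, T.]
One term of the minimisation: d(C,G) + w_C(left subtree), where w_C is the cost of the cheapest left subtree with C at its root.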
Cost of a history – leaves (initialisation).
Initialisation at the leaves: Cost(N) = 0 if N is the nucleotide at the leaf, otherwise infinity.
[Figure: leaves G and A; an empty subtree has cost 0.]
Fitch-Hartigan-Sankoff Algorithm
Costs: transition 2, transversion 5.
Each node carries a vector (A,C,G,T). The entry for a nucleotide, e.g. C, is the cost of the cheapest tree hanging from this node given that there is a C at this node (* = infinity).
Leaves: (*,0,*,*), (*,*,*,0), (*,*,0,*)
Internal node above the first two leaves: (10,2,10,2)
Root, above the internal node and the third leaf: (9,7,7,7)
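The cost vectors in this example can be reproduced with a short recursive sketch. The tree encoding (a leaf is a nucleotide string, an internal node is a tuple of subtrees) and the function names are assumptions made for illustration; the costs are those on the slide.

```python
NUCS = "ACGT"
PURINES = {"A", "G"}
INF = float("inf")


def subst_cost(x, y, transition=2, transversion=5):
    """Symmetric cost d(x, y): 0 if equal, 2 within purines/pyrimidines, else 5."""
    if x == y:
        return 0
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def sankoff(tree):
    """Return {N: cost of the cheapest subtree given nucleotide N at its root}."""
    if isinstance(tree, str):
        # Initialisation: cost 0 for the observed nucleotide, infinity otherwise.
        return {n: (0 if n == tree else INF) for n in NUCS}
    child_costs = [sankoff(child) for child in tree]
    # For each state n, every child independently picks its cheapest state m.
    return {
        n: sum(min(subst_cost(n, m) + w[m] for m in NUCS) for w in child_costs)
        for n in NUCS
    }


internal = sankoff(("C", "T"))
root = sankoff((("C", "T"), "G"))
print(internal)
print(root, "parsimony weight:", min(root.values()))
```

The weight of the most parsimonious history is the minimum entry at the root; a traceback from that entry gives the internal assignments.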
5S RNA Alignment & Phylogeny
Hein, 1990
Transitions 2, transversions 5. Total weight 843.
[Figure: phylogeny of the numbered sequences, with the groups Mitochondria, Plants, Prokaryotes, Fungi and Animals marked.]
10 tatt-ctggtgtcccaggcgtagaggaaccacaccgatccatctcgaacttggtggtgaaactctgccgcggt--aaccaatact-cg-gg-gggggccct-gcggaaaaatagctcgatgccagga--ta
17 t--t-ctggtgtcccaggcgtagaggaaccacaccaatccatcccgaacttggtggtgaaactctgctgcggt--ga-cgatact-tg-gg-gggagcccg-atggaaaaatagctcgatgccagga--t-
9 t--t-ctggtgtctcaggcgtggaggaaccacaccaatccatcccgaacttggtggtgaaactctattgcggt--ga-cgatactgta-gg-ggaagcccg-atggaaaaatagctcgacgccagga--t-
14 t----ctggtggccatggcgtagaggaaacaccccatcccataccgaactcggcagttaagctctgctgcgcc--ga-tggtact-tg-gg-gggagcccg-ctgggaaaataggacgctgccag-a--t-
3 t----ctggtgatgatggcggaggggacacacccgttcccataccgaacacggccgttaagccctccagcgcc--aa-tggtact-tgctc-cgcagggag-ccgggagagtaggacgtcgccag-g--c-
11 t----ctggtggcgatggcgaagaggacacacccgttcccataccgaacacggcagttaagctctccagcgcc--ga-tggtact-tg-gg-ggcagtccg-ctgggagagtaggacgctgccag-g--c-
4 t----ctggtggcgatagcgagaaggtcacacccgttcccataccgaacacggaagttaagcttctcagcgcc--ga-tggtagt-ta-gg-ggctgtccc-ctgtgagagtaggacgctgccag-g--c-
15 g----cctgcggccatagcaccgtgaaagcaccccatcccat-ccgaactcggcagttaagcacggttgcgcccaga-tagtact-tg-ggtgggagaccgcctgggaaacctggatgctgcaag-c--t-
8 g----cctacggccatcccaccctggtaacgcccgatctcgt-ctgatctcggaagctaagcagggtcgggcctggt-tagtact-tg-gatgggagacctcctgggaataccgggtgctgtagg-ct-t-
12 g----cctacggccataccaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgagcccagt-tagtact-tg-gatgggagaccgcctgggaatcctgggtgctgtagg-c--t-
7 g----cttacgaccatatcacgttgaatgcacgccatcccgt-ccgatctggcaagttaagcaacgttgagtccagt-tagtact-tg-gatcggagacggcctgggaatcctggatgttgtaag-c--t-
16 g----cctacggccatagcaccctgaaagcaccccatcccgt-ccgatctgggaagttaagcagggttgcgcccagt-tagtact-tg-ggtgggagaccgcctgggaatcctgggtgctgtagg-c--t-
1 a----tccacggccataggactctgaaagcactgcatcccgt-ccgatctgcaaagttaaccagagtaccgcccagt-tagtacc-ac-ggtgggggaccacgcgggaatcctgggtgctgt-gg-t--t-
18 a----tccacggccataggactctgaaagcaccgcatcccgt-ccgatctgcgaagttaaacagagtaccgcccagt-tagtacc-ac-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
2 a----tccacggccataggactgtgaaagcaccgcatcccgt-ctgatctgcgcagttaaacacagtgccgcctagt-tagtacc-at-ggtgggggaccacatgggaatcctgggtgctgt-gg-t--t-
5 g---tggtgcggtcataccagcgctaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagaa-cagtact-gg-gatgggtgacctcccgggaagtcctggtgccgcacc-c--c-
13 g----ggtgcggtcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggccagcc-tagtact-ag-gatgggtgacctcctgggaagtcctgatgctgcacc-c--t-
6 g----ggtgcgatcataccagcgttaatgcaccggatcccat-cagaactccgcagttaagcgcgcttgggttggag-tagtact-ag-gatgggtgacctcctgggaagtcctaatattgcacc-c-tt-
The Felsenstein Zone
Felsenstein-Cavender (1979)
[Figure: the true tree and the reconstructed tree on s1, s2, s3, s4, together with the 0/1 site patterns at the leaves (16 in total, only 8 shown).]
Bootstrapping
Felsenstein (1985)
[Figure: an alignment of 11 columns (each row shown as ATCTGTAGTCT), the digit string 10230101201 with one count per column (summing to 11), and a tree on leaves 1, 2, 3, 4.]
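One bootstrap replicate draws as many alignment columns as the original has, with replacement, so each column ends up used 0, 1, 2, ... times. A minimal sketch, assuming a dict of equal-length sequences; the function name is hypothetical, the four toy sequences are the ones from the "Central Principles" slide, and their assignment to the names s1..s4 is an assumption.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement.

    Returns (resampled alignment, number of times each original column was drawn).
    """
    n_cols = len(next(iter(alignment.values())))
    drawn = [rng.randrange(n_cols) for _ in range(n_cols)]
    counts = [drawn.count(c) for c in range(n_cols)]
    resampled = {
        name: "".join(seq[c] for c in drawn) for name, seq in alignment.items()
    }
    return resampled, counts


alignment = {"s1": "TTCAGT", "s2": "TCCAGT", "s3": "GCCAAT", "s4": "GCCAAT"}
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
resampled, counts = bootstrap_alignment(alignment, rng)
print(counts)
print(resampled)
```

A tree is then reconstructed from each replicate, and the support of a grouping is the fraction of replicates in which it appears.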
Probability of a pattern - summing over internal states
[Figure: tree with observed nucleotides at the leaves and unknown internal states (?), each ranging over A, C, G, T.]
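Probability of leaf observations - summing over internal states
[Figure: a node with candidate states A, C, G, T above two subtrees, each with candidate states A, C, G, T.]
One term of the sum: P(C->G) * P_C(left subtree), where P_C is the probability of the leaves in the left subtree given C at its root.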
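Summing over internal states node by node, from the leaves upward, avoids enumerating all assignments. The sketch below makes several assumptions that are not on the slide: the Jukes-Cantor model for the substitution probabilities, a uniform distribution at the root, made-up branch lengths, and a tree encoded as nested tuples of (subtree, branch length) pairs.

```python
import math
from itertools import product

NUCS = "ACGT"


def jc69(x, y, t):
    """Jukes-Cantor probability of x -> y over a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def conditional(tree):
    """Return {N: P(leaves below this node | N at this node)}."""
    if isinstance(tree, str):
        return {n: 1.0 if n == tree else 0.0 for n in NUCS}
    result = {n: 1.0 for n in NUCS}
    for child, t in tree:
        below = conditional(child)
        for n in NUCS:
            # Sum over the child's state m: P(n -> m) * P_m(child subtree).
            result[n] *= sum(jc69(n, m, t) * below[m] for m in NUCS)
    return result


def site_likelihood(tree):
    """Probability of the leaf pattern, with a uniform root distribution."""
    return sum(0.25 * p for p in conditional(tree).values())


# Sanity check: the probabilities of all 4^3 leaf patterns add up to 1.
total = 0.0
for a, b, c in product(NUCS, repeat=3):
    inner = ((a, 0.1), (b, 0.2))
    total += site_likelihood(((inner, 0.3), (c, 0.4)))
print(total)
```

The structure mirrors the parsimony recursion, with min replaced by sum and added costs replaced by multiplied probabilities.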
Output from Likelihood Method
[Figure: two trees on s1, s2, s3, s4, s5 with estimated branch lengths and standard deviations, one with clock (rooted) and one without clock (unrooted).]
With clock: likelihood 7.9*10^-14 (parameter estimates 0.31.1, 0.18.1)
Without clock: likelihood 6.2*10^-12 (parameter estimates 0.34.1, 0.16.1)
2 * (ln(6.2*10^-12) - ln(7.9*10^-14)) is chi^2-distributed with (n-2) degrees of freedom.
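With the two likelihoods on this slide and n = 5 leaves, the clock test can be worked through numerically. The sketch assumes the usual likelihood-ratio form, twice the log-likelihood difference, and uses the tabulated 5% critical value of the chi-square distribution with 3 degrees of freedom (7.815); the function name is hypothetical.

```python
import math


def clock_lrt(lik_clock, lik_no_clock, n_leaves):
    """Likelihood-ratio statistic and degrees of freedom for the clock test."""
    statistic = 2.0 * (math.log(lik_no_clock) - math.log(lik_clock))
    degrees_of_freedom = n_leaves - 2
    return statistic, degrees_of_freedom


statistic, df = clock_lrt(7.9e-14, 6.2e-12, n_leaves=5)
CRITICAL_5_PERCENT_DF3 = 7.815  # tabulated chi-square quantile, 3 d.f.
print(round(statistic, 3), df, statistic > CRITICAL_5_PERCENT_DF3)
```

For these numbers the statistic is about 8.73, so the clock would be rejected at the 5% level.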
The Molecular Clock
First noted by Zuckerkandl & Pauling (1964) as an empirical fact. How can one detect it?
Known ancestor time: [Figure: s1 and s2 with a common ancestor a at time T.]
Unknown ancestor time: [Figure: tree on s1, s2, s3 with an ancestor at unknown time (?).]
Rooting the 3 kingdoms
3 billion years ago: no reliable clock, no outgroup.
Given 2 sets of homologous proteins, e.g. MDH & LDH, can the archaea, prokaryotes and eukaryotes be rooted?
[Figure: unrooted LDH and MDH trees on A, E, P, and a joint tree with an LDH subtree and an MDH subtree, each on P, A, E.]
Rootings
Purpose
1) To give a time direction in the phylogeny & identify the most ancient point.
2) To be able to define concepts such as a monophyletic group.
Methods:
1) Outgroup: Enhance the data set with a sequence from a species definitely distant to all of them. It will be joined at the root of the original data set.
2) Midpoint: Find the midpoint of the longest path in the tree.
3) Assume a Molecular Clock.
The generation/year-time clock (Illustration of Langley-Fitch)
[Figure: unrooted tree on s1, s2, s3 with branch lengths l1, l2, l3 meeting at *, and the rooted clock tree with {l1 = l2 < l3}.]
Given root: (2k-3)-(k-1) = (k-2) degrees of freedom lost in imposing a clock.
Assumptions
1. Ancestral sequences are observable.
2. The number of events on a branch is Poisson distributed with a mean proportional to the branch length. The same proportionality constant applies to all branches.
3. The observed number of differences between sequences at two neighboring nodes is the actual number of events.
[Figure: the same tree for two sets of sequences: branch lengths l1, l2, l3 for sequences 1 (s1, s2, s3) and c*l1, c*l2, c*l3 for sequences 2 (s1', s2', s3').]
k sequences, s species: s(2k-3)s s(k-1) (2k-3)+s s+(k-1)
Almost Clocks
M.J. Sanderson (1997): “A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy”. Mol.Biol.Evol. 14(12).1218-31.
J.L. Thorne et al. (1998): “Estimating the Rate of Evolution of the Rate of Evolution”. Mol.Biol.Evol. 15(12).1647-57.
J.P. Huelsenbeck et al. (2000): “A compound Poisson Process for Relaxing the Molecular Clock”. Genetics 154.1879-92.
I Smoothing a non-clock tree onto a clock tree (Sanderson).
II Rate of Evolution of the Rate of Evolution (Thorne et al.). The rate of evolution can change at each bifurcation.
III Relaxed Molecular Clock (Huelsenbeck et al.). At random points in time, the rate changes by multiplication with a (gamma distributed) random variable.
Non-contemporaneous leaves. (A.Rambaut (2000): Estimating the rate of molecular evolution: incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics 16.4.395-399)
Recombination and the Molecular Clock I
In the presence of recombination and gene conversion, the relationship among sequences might not be describable by a phylogeny!
Common Practice:
I Finding “the phylogeny” anyway.
II Testing for the molecular clock.
Recombination and the Molecular Clock II
Schierup & Hein (2000): Recombination and the Molecular Clock. Mol.Biol.Evol. 17(10).1578-79
Schierup & Hein (2000): Consequences of Recombination on Traditional Phylogenetic Analysis. Genetics 156.879-91.
What are the consequences of this practice?
I Simulate data with a model including recombination.
II Reconstruct phylogeny.
III Test for Clock.
History of Phylogenetic Methods
1958 Sokal and Michener publish the UPGMA method for making distance trees with a clock.
1964 Parsimony principle defined, but not advocated, by Edwards and Cavalli-Sforza.
1962-65 Zuckerkandl and Pauling introduce the notion of a Molecular Clock.
1967 First large molecular phylogenies by Fitch and Margoliash.
1969 Heuristic method used by Dayhoff to make trees and reconstruct ancestral sequences.
1970 Neyman analyzes a three-sequence stochastic model with Jukes-Cantor substitution.
1971-73 Fitch, Hartigan & Sankoff independently come up with the same algorithm for reconstructing parsimony ancestral sequences.
1973 Sankoff treats alignment and phylogenies as one general problem: phylogenetic alignment.
1979 Cavender and Felsenstein independently come up with the same evolutionary model in which parsimony is inconsistent. Later called the “Felsenstein Zone”.
1981 Felsenstein: Maximum Likelihood Model & Program DNAML (in the PHYLIP program package).
1981 The parsimony tree problem is shown to be NP-Complete.
1985 Felsenstein introduces bootstrapping as a confidence interval on phylogenies.
1986 Bandelt and Dress introduce split decomposition as a generalization of trees.
1985- Many authors (Sawyer, Hein, Stephens, M.Smith) try to address the problem of recombination in phylogenies.
1997-9 Thorne et al., Sanderson & Huelsenbeck introduce the Almost Clock.
2000 Rambaut (and others) make methods that can find trees with non-contemporaneous leaves.
2001- Major rise in the interest in phylogenetic statistical alignment.
Phylogeny: literature, www and packages.
Books:
Molecular Systematics (1996) (eds. Hillis and Craig)
New Uses for Phylogenies (1996) (eds. P.Harvey)
W.Maddison and D.Maddison: MacClade
Semple & Steel (2003): Phylogenetics, OUP
Journals:
Molecular Biology and Evolution
J. Molecular Evolution
Molecular Phylogenetics
Systematic Biology
J. of Classification
www-pages:
PAUP (David Swofford), probably the best package for phylogenetic analysis available. http://www.lms.si.edu/PAUP/about.html
MacClade (W. & D. Maddison) http://phylogeny.arizona.edu/macclade/macclade.html
PHYLIP (J. Felsenstein) http://depts.washington.edu/genetics/faculty/felsenstein.html
PAML (Z. Yang) http://abacus.gene.ucl.ac.uk/
Global Fit Methods
1: Error function: sum over pairs i<j of w_{i,j} * |d_{i,j} - p_{i,j}|^a
2: Minimisation has two parts: topology & branch lengths. Try all topologies and solve the branch length problem for each.
3: A_{(i,j),k} is an (n*(n-1)/2)*(2n-3) matrix with 1 if k is an edge on the path from i to j, 0 otherwise.
4: The path length between i & j, p_{i,j}, in the given topology is given by: p_{i,j} = sum_k A_{(i,j),k} * s_k, where s_k is the length of edge k.
5: If w_{i,j} = 1 and a = 2 this can be solved by linear algebra: minimise sum over pairs i<j of (d_{i,j} - sum_k A_{(i,j),k} * s_k)^2.
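For one fixed topology, the least-squares case (w = 1, a = 2) reduces to solving the normal equations A^T A s = A^T d. The sketch below uses plain Python and a small Gauss-Jordan solver; the four-leaf topology ((1,2),(3,4)), its edge numbering and the example distances are assumptions chosen for illustration.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares_branch_lengths(A, d):
    """Branch lengths s minimising sum (d - A s)^2 via the normal equations."""
    cols = len(A[0])
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(cols)]
           for i in range(cols)]
    Atd = [sum(row[i] * x for row, x in zip(A, d)) for i in range(cols)]
    return solve(AtA, Atd)


# Topology ((1,2),(3,4)): edges 0-3 are the pendant edges of leaves 1-4,
# edge 4 is the internal edge. Rows: pairs 12, 13, 14, 23, 24, 34.
A = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
d = [3.0, 5.0, 6.0, 6.0, 7.0, 7.0]  # made-up distances that fit a tree exactly
s = least_squares_branch_lengths(A, d)
print([round(x, 6) for x in s])
```

Because the example distances are additive on this topology, the fit is exact; on real data the residual sum of squares is what is compared across topologies.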
Neighbor Joining
Saitou and Nei, 1987
Input: Distance matrix D.
1: For each leaf the total distance to the others is calculated: r_i = d_{i,1} + d_{i,2} + ... + d_{i,n}.
2: A rate-corrected distance matrix, M, is constructed: m_{i,j} = d_{i,j} - (r_i + r_j)/(n-2). Only the minimal m_{i,j} is necessary.
3: Make an ancestral node, u, of the i & j giving minimal m_{i,j}. New branch lengths are defined by s_{i,u} = d_{i,j}/2 + (r_i - r_j)/[2*(n-2)] and s_{j,u} = d_{i,j} - s_{i,u}.
4: The distances from u to the others are set to d_{k,u} = (d_{i,k} + d_{j,k} - d_{i,j})/2.
Do this n-2 times.
Alternative characterisation of the method: Start with the best quadratic (least-squares) fit of a tree with k internal nodes (k<n), and add the internal branch that gives the largest improvement in the quadratic fit (now k+1 nodes). This continues until the whole tree is built (k-1 internal nodes have been added).
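The steps can be sketched as follows. Here r_i is taken to be the sum of the distances from i to all other nodes, an assumption that matches the division by (n-2) in steps 2 and 3. The data layout, the function name and the example distances are also made up for illustration.

```python
def neighbor_joining(names, dist):
    """names: leaf labels; dist: {(a, b): distance} for every pair of leaves.

    Returns a list of edges (node, neighbour, branch length).
    """
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    active = list(names)
    edges = []
    while len(active) > 2:
        n = len(active)
        # Step 1: total distance from each node to the others.
        r = {i: sum(d[frozenset((i, k))] for k in active if k != i)
             for i in active}
        # Step 2: pair with the minimal rate-corrected distance m_ij.
        i, j = min(
            ((a, b) for idx, a in enumerate(active) for b in active[idx + 1:]),
            key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]) / (n - 2),
        )
        # Step 3: new ancestral node u and the two branch lengths below it.
        u = (i, j)
        d_ij = d[frozenset((i, j))]
        s_iu = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, u, s_iu))
        edges.append((j, u, d_ij - s_iu))
        # Step 4: distances from u to every remaining node.
        for k in active:
            if k != i and k != j:
                d[frozenset((k, u))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
                ) / 2
        active = [k for k in active if k != i and k != j] + [u]
    a, b = active
    edges.append((a, b, d[frozenset((a, b))]))  # the last remaining branch
    return edges


# Distances from a made-up tree ((s1,s2),(s3,s4)) with pendant branches
# 1, 2, 3, 4 and internal branch 1.
example = {("s1", "s2"): 3, ("s1", "s3"): 5, ("s1", "s4"): 6,
           ("s2", "s3"): 6, ("s2", "s4"): 7, ("s3", "s4"): 7}
edges = neighbor_joining(["s1", "s2", "s3", "s4"], example)
for edge in edges:
    print(edge)
```

Unlike UPGMA, the sibling branches s1 and s2 get different lengths here, so no clock is assumed.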
Branch and Bound Algorithm
Ø = tight upper bound on the weight of the tree, possibly the weight of a well-guessed tree.
W(n) = the weight of the tree at node n.
R(n) = tight lower bound on the increase in weight from adding the rest of the sequences.
Condition for bounding: W(n) + R(n) >= Ø
[Figure: search tree with the values 97, 7 and 102.]
How is R(n) computed?
[Figure: example nucleotides A T C G A C G G T C G G, with one position marked *.]
Tree topology comparison.
I. Bootstrapping columns in the alignment.
Example: Human, Chimp, Gorilla & Orangutan with root.
position  1 2 3 4 5 6 7 8 9 12.586
H         T C T G A C G T T T G A ... C
C         T C T G A C G G T T G A ... C
G         T C T G A C G G T T G A ... C
O         T C A G A C G G T C G A ... C
root      T C A G A C G T A A G A ... C
15 possible trees, only 3 of relevance:
                                 (((H,C),G),O)   (((H,G),C),O)          (((C,G),H),O)
I. Bootstrap probabilities:      0.80            0.09                   0.11
II. Differences in likelihood:   0.0             -16.63 (s.d. 14.22)    -15.12 (s.d. 13.95)