Week 36: Trees, Alignment & Database Search

Week 36: Trees, Alignment & Database Search Trees Alignment Database Search

Trees Tree Concepts Reconstruction Methods Parsimony Likelihood Distance The Molecular Clock

Topologies • 4 sequences with root ignored: • s1 s3 s1 s2 s1 s2 • \ / \ / \ / • \ / \ / \ / • -------- -------- -------- • / \ / \ / \ • / \ / \ / \ • s2 s4 s3 s4 s4 s3 • For unrooted trees with k leaves and internal nodes with 3 incoming edges, the following holds: • Edges = 2k-3; Internal nodes = k-2; T(k) = (2k-5)T(k-1); T(3)=1; • k 3 4 5 6 7 20 • ----------------------------------------------- • T(k) 1 3 15 105 945 1023 • ----------------------------------------------- • Edges 3 5 7 9 11 37 • ----------------------------------------------- • Internal Nodes 1 2 3 4 5 18

Central Principles of Phylogeny Reconstruction s1 s1 s1 s3 s3 s3 s2 s2 s2 s4 s4 s4 TTCAGT TCCAGT GCCAAT GCCAAT 1 0 Parsimony Distance Likelihood 2 Total Weight: 4 0 1 0.6 1 1 2 3 2 1 0.7 1.5 0.4 0.3 L=3.1*10-7 Parameter estimates

The Small Parsimony Problem (Fitch-Hartigan-Sankoff) ? / \ / \ ? ? / \ / \ C G T C L’N / /\ \ / / \ \ / / \ \ / LA LC LG LT / \ RA RG .. \ Recursion:L’N = min{LNL + d(N,NL)} + min{LNR + d(N,NR)} Initialisation: Lleaf(N) = 0, if N at leaf - else infinity

Fitch-Hartigan-Sankoff Algorithm (A,C,G,T) (9,7,7,7) / \ / \ Costs: Transition 2, / \ (A ,C,G, T) \ Transversion 5, indel 10. (10,2,10,2) \ / \ \ / \ \ / \ \ / \ \ / \ \ (A,C,G,T) (A,C,G,T) (A,C,G,T) * 0 * * * * * 0 * * 0 *

Distance Concepts on Trees I A: Metric, d( , ) : i: d(a,b)=0 <=> a=b ii: d(a,b)=d(b,a) iii: d(a,b) <= d(a,c) + d(c,b) a b c

s1 s3 s2 s4 s3 s1 s2 i Distance Concepts on Trees II Tree Metric: (distance function originates from tree) d(x,y) + d(z,æ) = d(x,z) + d(y,æ) > d(x,æ) + d(y,z), where z,y,z,æ is a permutation of a,b,c,d. (> implies that no branch has length 0) Reconstruction Principle: d(s1,i) = (d(s1,s2) + d(s1,s3) - d(s2,s3))/2

i s1 s2 s3 Distance Concepts on Trees III Ultra Metric (distance function originates from tree) d(x,y) = d(x,z) > d(x,y), where z,y,z is a permutation of a,b,c. (> implies that no branch has length 0) Reconstruction Principle: d(s1,i) = d(s1,s2)/2

Evolutionary Substitution Process t1 A t2 C C Pi,j(t) = probability of going from i to j in time t.

Probability of a pattern - summing over internal states A C G ? ? ? ? A A T A C G T A C G T A C G T

Felsenstein's recursion Conditional probability, CL(v,N) is the probability for observing the nucleotides at the leaves at a subtree hanging from v, if nucleotide N is found at v. P(v1,v2,N1,N2) = probability for N1 at v1, N2 at v2. I) Initial contition: CL(leaf,N)= 1{N is at leaf} II) Recursion P(v,vl,N,Nl)*CL(vl,Nl) * CL(v,N) = P(v,vr,N,Nr)*CL(vr,Nr) where r refers to right son, l to left son. III) L(p) = πN*BL(Rod,N,p). IV) Total likelihood:Product over all positions:L = Li(p).

Output from Likelihood Method With Clock:Without Clock: s5 s4 23 5.2 \ / /\ 40.9 20.4 / \ \ / / \ ! / \ 1.6 5.6 23 sd4.6 124.4 / \ s1---6-------22---------------11---3 /\ \ ! ! 44.9 /\ \ /\ 7 3.4 4 sd.1.4 / \ \ / \ ! s1 s2 s3 s4 s5 s2 Likelihood: 7.9*10-14   = 0.31.1,0.18.1 6.2*10-12   = 0.34.1 0.16.1 ln(7.9*10-14) –ln(6.2*10-12) is 2 – distributed with (n-2) degrees of freedom.

The Felsenstein Zone Felsenstein-Cavendar (1979) s1 s4 s2 s3 True Tree Reconstructed Tree s1 s2 s3 s4 Patterns:(16 only 8 shown) 0 1 0 0 00 0 0 0 0 1 0 01 0 1 0 0 0 1 01 1 0 0 0 0 0 10 1 1

The Molecular Clock First noted by Zuckerkandl & Pauling (1964) as an empirical fact. How can one detect it? Known Ancestor Time Unknown AncestorTime /\ a at time T. / \ / \ ? \ / \ /\ \ / \ / \ \ / \ / \ \ s1 s2 s1 s2 s3

History of Phylogenetic Methods 1958 Sokal and Michener publishes UGPMA method for making distrance trees with a clock. 1964 Parsimony principle defined, but not advocated by Edwards and Cavalli-Sforza. 1962-65 Zuckerkandl and Pauling introduces the notion of a Molecular Clock. 1967 First large molecular phylogenies by Fitch and Margoliash. 1969 Heuristic method used by Dayhoff to make trees and reconstruct ancetral sequences. 1970: Neyman analyzes three sequence stochastic model with Jukes-Cantor substitution. 1971-73 Fitch, Hartigan & Sankoff independently comes up with same algorithm reconstructing parsimony ancetral sequences. 1973 Sankoff treats alignment and phylogenies as on general problem – phylogenetic alignment.

1979 Cavender and Felsenstein independently comes up with same evolutionary model where parsimony is inconsistent. Later called the “Felsenstein Zone”. 1981: Felsenstein Maximum Likelihood Model & Program DNAML (i programpakken PHYLIP). 1981 Parsimony tree problem is shown to be NP-Complete. 1985: Felsenstein introduces bootstrapping as confidence interval on phylogenies. 1986 Bandelt and Dress introduces split decompostion as a generalization of trees. 1985-: Many authors (Sawyer, Hein, Stephens, M.Smith) tries to address the problem of recombinations in phylogenies. 1997-9 Thorne et al., Sanderson & Huelsenbeck introduces the Almost Clock. 2000 Rambaut (and others) makes methods that can find trees with non-contemporaneous leaves.

Alignment Pairwise Alignment Again Triple – Quadruple - Many Similarity-Distance Conversion Local Alignment Statistical alignment Conclusion

Parsimony Alignment of two strings. Sequences: s1=CTAGG s2=TTGT. Basic operations: transitions 2 (C-T & A-G), transversions 5, indels (g) 10. {CTA,TT} + GG {CTAG,TTG} = {CTA,TTG} + G- {CTAG,TT} + -G Initial condition: D0,0=0. (Di,j = D(s1[1:i], s2[1:j])) Di,j=min { Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 + g, Di-1,j +g} DCTA,TT + w(GG) =12 + 0 = 12 D4,3=DCTAG,TTG=min{DCTA,TTG + w(G-) = 4 + 10 = 14} DCTAG,TT + w(-G) = 22 + 10 = 32

40 32 22 14 9 17 T / 30 22 12 4 12 22 G / 20 12 2 - 12 22 32 T / 10 2 10 20 30 40 T / 0 10 20 30 40 50 C T A G G CTAGG Alignment: i v Cost 17 TT-GT

Alignment of three sequences. s1=ATCG s2=ATGCC s3=CTCC Alignment: AT-CG ATGCC CT-CC Consensus sequence: ATCC Configurations in an alignment column: - - n n n - n - - n - n - n n - n - - - n n n - Recursion:Di,j,k = min{Di-i',j-j',k-k' + d(i,i',j,j',k,k')} Initial condition: D0,0,0 = 0. Running time: l1*l2*l3*(23-1) Memory requirement: l1*l2*l3 New phenomena: ancestral sequence. A C ? A

C G G C Parsimony Alignment of four sequences s1=ATCG s2=ATGCC s3=CTCC s4=ACGCG Alignment: AT-CG ATGCC CT-CC ACGCG Configurations in alignment columns - - - n - - - n n n - n n n n - - - n - n n - n - - n - n n n - - n - - n - n - n - n n - n n - n - - - - n n - - n n n n - n - Recursion: Di= min{Di-∆ + d(i,∆)} ∆ [{0,1}4\{0}4] Initial condition: D0 = 0. Computation time: l1*l2*l3*l4*15 Memory : l1*l2*l3*l4

Alignment of many sequences. s1=ATCG, s2=ATGCC, ......., sn=ACGCG Alignment: AT-CG s1 s3 s4 ATGCC \ ! / ..... ---------- ..... / \ ACGCG s2 s5 Configurations in an alignment column: 2n-1 Recursion: Di=min{Di-∆ + d(i,∆)} ∆ [{0,1}n\{0}n] Initial condition: D0,0,..0 = 0. Computation time: ln*(2n-1)*n (l:sequence length, n:number of sequences) Memory requirement: ln

Close-to-Optimal Alignments (Waterman & Byers, 1983) A. Alignments within  of optimal. Ex.  = 2 40 32 22 14 9 * 17 T * / 30 22 12 4 12 22 G * / 20 12 2 - 12 22 32 T / 10 2 10 20 30 40 T / 0 10 20 30 40 50 C T A G G CTAGG Alignment: i iv Cost 19 TTGT- Caveat: There are enormous numbers of suboptimal alignments. B. Sets of postions that are on some suboptimal alignment.

Longer Indels TCATGGTACCGTTAGCGT GCA-----------GCAT gk : cost of indel of length k. Initial condition: D0,0=0 Di,j = min { Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 + g1,Di,j-2 + g2,Di,j-3 + g3,, Di-1,j + g1,Di-2,j + g2,Di-3,j + g3,, } Cubic running time. Quadratic memory.

If gk = a + b*k, then quadratic running time. Gotoh (1982)Di,j is split into 3 types: 1. D0i,j as Di,j, except s1[i] must mactch s2[j]. 2. D1i,j as Di,j, except s1[i] is matched with "-". 3. D2i,j as Di,j, except s2[i] is matched with "-". Then: D0i,j = min(D0i-1,j-1, D1i-1,j-1, D2i-1,j-1) + d(s1[i],s2[j]) D1i,j = min(D1i,j-1 + b, D0i,j-1 + a + b) D2i,j = min(D2i-1,j + b, D0i-1,j + a + b) Comment: 1. Evolutionary Consistency Condition: gi + gj > gi+j

Gotoh Alignment,1981 Let all substitutions cost 2 og let gk= 3 + k, then align ACGT with AT. The alignment must be: ACGT with a cost 5. A--T - n 5 6 2 6 5 5 4 10 11 12 T T 4 0 4 5 6 4 8 9 10 11 A A 0 4 5 6 7 - - - - - A C G T A C G T n n n - - 6 2 6 5 - 8 9 6 7 T T - 0 6 7 8 - 8 45 6 A A 0 - - - - - 4 5 6 7 A C G T A C G T

Distance-Similarity. (Smith-Waterman-Fitch,1982) Di,j=min{Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 +g, Di-1,j +g} Si,j=max{Di-1,j-1 + s(s1[i],s2[j]), Si,j-1 -w, Si-1,j-w} Distance: Transitions:2 Transversions 5 Indels:10 M largest distance between two nucleotides (5). Similarity s(n1,n2) M - d(n1,n2) wk k/(2*M) + gk w 1/(2*M) + g Similarity Parameters: Transversions:0 Transitions:3 Identity:5 Indels: 10 + 1/10

40/-40.4 32/-27.3 22/-12.2 14/0.9 9/11.0 17/2.9 T 30/-30.3 22/-17.2 12/-2.1 4/11.0 12/2.9 22/-7.2 G 20/-20.2 12/-7.1 2/8.012/-2.1 22/-12.2 32/-22.3 T 10/-10.1 2/3.0 10/-7.1 20/-17.2 30/-27.3 40/-37.4 T 0/0 10/-10.1 20/-20.2 30/-30.3 40/-40.4 50/-50.5 C T A G G Comments 1. The Switch from Dist to Sim is highly analogous to Maximizing {-f(x)} instead of Minimizing {f(x)}. 2. Dist will based on a metric: i. d(x,x) =0, ii. d(x,y) >=0, iii. d(x,y) = d(y,x) & iv. d(x,z) + d(z,y) >= d(x,y). There are no analogous restrictions on Sim, giving it a larger parameter space.

Local alignment Global Alignment:Si,j=max{Di-1,j-1 + s(s1[i],s2[j]), Si,j-1 -w, Si-1,j-w} Local: Si,j=max{Di-1,j-1 + s(s1[i],s2[j]), Si,j-1 -w, Si-1,j-w,0} 0 1 0 .6 1 2 .6 1.6 1.6 3 2.6 Score Parameters: C 0 0 1 0 1 .3 .6 0.6 2 3 1.6 Match: 1 A 0 0 0 1.3 0 1 1 2 3.3 2 1.6 Mismatch -1/3 G / 0 0 .3 .3 1.3 1 2.3 2.3 2 .6 1.6 Gap 1 + k/3 C / 0 0 .6 1.6 .3 1.3 2.6 2.3 1 .6 1.6 GCC-UCG U / GCCAUUG 0 0 2 .6 .3 1.6 2.6 1.3 1 .6 1 A ! 0 1 .6 0 1 3 1.6 1.3 1 1.3 1.6 C / 0 1 0 0 2 1.3 .3 1 .3 2 .6 C / 0 0 0 1 .3 0 0 .6 1 0 0 G / 0 0 0 .6 1 0 0 0 1 1 2 U 0 0 1 .6 0 0 0 0 0 0 0 A 0 0 1 0 0 0 0 0 0 0 0 A 0 0 0 0 0 0 0 0 0 0 0 C A G C C U C G C U U

Progressive Alignment (Feng-Doolittle 1987 J.Mol.Evol.) Can align alignments and given a tree make a multiple alignment. * * alkmny-trwq acdeqrt akkmdyftrwq acdehrt kkkmemftrwq [ P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)]/6 * * *** * * * * * * Sodh atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg---ndtagct sagphfnp lsrk Sodb atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp lsrk Sodl atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg---ndtagct sagphfnp lsrk Sddm atkavcvlkgdgpqvq—-infeak-gdtvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp lsrk Sdmz atkavcvlkgdgpqvq—infeqkesdgpvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp Lsrk Sods vatkavcvlkgdgpqvq—infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk S dpb datkavcvlkgdgpqvq—infeqkesdgpv---wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk sddm Sodb Sodl Sodh Sdmz sods Sdpb

Thorne-Kishino-Felsenstein (1991) Process *A # C G l < m T= 0 # T = t # # # # i. P(s) = (1-l/m)(l/m)l pA#A* .. *pT #T l =length(s) ii. Time reversible

l & m into Alignment Blocks A. Amino Acids Ignored: # - - - # - - - - * - - - - ## # # - # # # # * # # # # k k k e-mt[1-lb(t)](lb(t))k-1 [1-e-mt-mb(t)][1-lb(t)](lb(t))k-1 [1-lb(t)](lb(t))k pk(t) p’k(t) p’’k(t) p’0(t)= mb(t) where b(t)=[1-e(l-m)t]/[m-l] B. AA Considered:T - - - RQ S W Pt(T-->R)*pQ*..*pW*p4(t) 4 T - - - - - R Q S WpR *pQ*..*pW*p’4(t) 4

Basic Pairwise Recursion (O(length3)) i j Survives Dies i-1 i i-1 i j-1 j j i i-1 i-1 i j-2 j j j-1

a-globin (141) and b-globin (146) 430.108 : -log(a-globin) 327.320 : -log(a-globin ‡b-globin) 730.428 : -log(l(sumalign)) l*t: 0.0371805 +/- 0.0135899 m*t: 0.0374396 +/- 0.0136846 s*t: 0.91701 +/- 0.119556 E(Length) E(Insertions,Deletions) E(Substitutions) 143.499 5.37255 131.59 Maximum contributing alignment: V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH Ratio l(maxalign)/l(sumalign) = 0.00565064

1966: Levenstein formulates distance measure between sequences and instroduces dynamica programming algorithm finding the distance. 1970: Needleman and Wunch compares proteins maximising a similarity score. 1972: Sankoff & Sellers reinvents the basic algorithm. Sankoff can also align subject to the constraint that there must be exactly k indels. 1973: Sankoff makes multiple alignment and phylogeny - both exact & heuristic. 1975: Hirschberg gives linear memory algorithm. 1976: Waterman gives cubic algorithm allowing for indels of arbitrary length. 1976: Waterman introduces alignment without reference to phylogeny. 1981: Waterman, Smith and Fitch shows duality of simiarity and distance. 1982: Gotoh gives quadratic algorithm if gap penalty functionen is gk = a + b*k (for indel of length k). Uses 3 matrices in stead of 1. 1983: Waterman and Byers introduces close-to-optimal alignments. 1984-5: Ukkonen, Myers, Fickett accelerates algorithmen considerably. 1984: Hogeveg and Hespers introduces heuristic multiple phylogenetic alignment. 1984: Fredman introduces triple alignment generalisation of Needleman-Wunch. 1985: Lipman & Wilbur uses hashing.

1989: Myers introduces alignment with concave gap penalty function. 1991: Thorne, Kishino & Felsenstein makes good model for statistical alignment, partially introduced in 1986 by Thomson & Bishop. 1991: States & Botstein compares a DNA string with a protein in search of frameshift mutations. 1993-4: Gusfield, Lander, Waterman and others introduces parametric alignment. 1987: Feng-Doolittle introducesphylogenetisk alignment: "Once a gap always a gap". 1989: Kececioglou makes strong acceleration of Sankoff's exact algorithm. 1994: Krogh et al & Baldi et al. introduces Hidden Markov Models for multiple alignment. 1995: Mitcheson & Durbin introduces Tree-HMMs allowing sequences generated by an HMM to have a "tree correlation" structure, but not based on an explicit evolutionary process. 1999 - resurgence of interest in statistical alignment

Database Search What is the probability model for database search? What is PAM matrix Statistical Alignment & Homology Testing.

Illustration of database search. Query sequence: q=YQPVNPAL Database: s1=CVDAEGKYL and s2=TTEQRPKNPATYCG i. No mismatches - no gaps: longest common segment q = YQPVNPAL s2 = TTEQRPKNPATYCG length = 3 *** ii. Mismatches allowed - no gaps q = YQPVNPAL s2 = TTEQRPKNPATYCG * *** Similarity function s( , ). Total score, S = s(P,P) + s(V,K) + ..+ s(A,A)

iii.Both mismatches & gaps:Local alignment (LA). q = YQ-PVNPAL s2= TTEQRPKNPATYCG * * *** Here I1 = QPVNPA and J1 = QRPKNPA. S = s(Q,Q) - g + s(P,P) + s(V,K) + ..+ s(A,A) i. If g & mismatch cost: = infinity, LA reduces to longest common segment. ii. If g: = infinity, LA reduces to best segment.

Distributions of Scores. Model: q and database is a series of iid (independent, identically distributed) random variables, Xi's, where P(Xi=j) = pj (j any of the 20 amino acids). Scoring scheme: i. E(s(Xi,Xj)) = pi*pj*s(i,j) < 0 ii. max s(i,j) > 0 and g < 0. iii. s(i,j) = s(j,i)

i. Longest Common Segments. The mean grows proportionally to log(nm), where n and m are the length of the 2 sequences. • ii. Best Segment with score S will follow an extreme value distribution. • P(S>x) = exp(exp(-l*x/u)), • u is a positioning parameter, l a parameter that determines how fast the distribution tails off. • P(S>x) ) = exp(K*m*n*exp(- lx)) • is the x that solves pi*pj*exp(s(i,j)*x) = 1 • K is also a known function of the pi's and the sij's. • iii. Local Alignment: • Distribution unknowm. • Looks like Extreme Value Distribution.

From http://www.vuse.vanderbilt.edu/~mahas1/ce207/type1extremevalue.html

The PAM matrix - Point Accepted Mutations Wi,j= -ln(pi*P2.5i,j/(pi*pj)) s1 = ATWYFCAK-AC Random s1 = ATWYFC-AKAC s2 = ETWYKCALLAD s2* = LTAYKADCWLE s2* is a random permutation of s2 Z = [score(s1,s2)-E{score(s1,s2*)}]/ s.d.{score(s1,s2*)}]

From W.Pearson

The tactics of BLAST I The most widely used program for database searches in biological sequence databases is BLAST (Altschul et al., 1990) and variants of BLAST. i. It defines a neighbourhood of segments to the segments composing the query sequence. I q=YQPVNPAL {YQPVN, QPVNP, PVNPA, VNPAL} II YQPVN {AQPVN, BQPVN,.., YAPVN,..} ii. It finds segments in the database that matches these neighborhood segments very quickly.

The tactics of BLAST II iii. It heuristically finds large segments giving a good score. iv. If the score of this good segment is statistically significant, then this is extended 5'-ward and 3'-ward by a local alignment algorithm, giving a proposed local alignment.

Homology Test Wi,j= -ln(pi*P2.5i,j/(pi*pj)) D(s1,s2) is evaluated in D(s1,s2*) Real s1 = ATWYFCAK-AC Random s1 = ATWYFC-AKAC s2 = ETWYKCALLAD s2* = LTAYKADCWLE *** ** * * * This test: 1.Test the competing hypothesis that 2 sequences are 2.5 events apart versus infinitely far apart. 2.It only handles substitutions “correctly”. The rationale for indel costs are more arbitrary. 3.It samples in (pi*pj) by permuting the order of amino acids in the second. I.e. uses drawing without replacement – a hypergeometric distribution.

Week 36: Trees, Alignment & Database Search