1 / 52

Week 36: Trees, Alignment & Database Search

Week 36: Trees, Alignment & Database Search. Trees Alignment Database Search. Trees. Tree Concepts Reconstruction Methods Parsimony Likelihood Distance The Molecular Clock. Topologies 4 sequences with root ignored: s1 s3 s1 s2 s1 s2

akasma
Download Presentation

Week 36: Trees, Alignment & Database Search

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Week 36: Trees, Alignment & Database Search Trees Alignment Database Search

  2. Trees Tree Concepts Reconstruction Methods Parsimony Likelihood Distance The Molecular Clock

  3. Topologies • 4 sequences with root ignored: • s1 s3 s1 s2 s1 s2 • \ / \ / \ / • \ / \ / \ / • -------- -------- -------- • / \ / \ / \ • / \ / \ / \ • s2 s4 s3 s4 s4 s3 • For unrooted trees with k leaves and internal nodes with 3 incoming edges, the following holds: • Edges = 2k-3; Internal nodes = k-2; T(k) = (2k-5)T(k-1); T(3)=1; • k 3 4 5 6 7 20 • ----------------------------------------------- • T(k) 1 3 15 105 945 1023 • ----------------------------------------------- • Edges 3 5 7 9 11 37 • ----------------------------------------------- • Internal Nodes 1 2 3 4 5 18

  4. Central Principles of Phylogeny Reconstruction s1 s1 s1 s3 s3 s3 s2 s2 s2 s4 s4 s4 TTCAGT TCCAGT GCCAAT GCCAAT 1 0 Parsimony Distance Likelihood 2 Total Weight: 4 0 1 0.6 1 1 2 3 2 1 0.7 1.5 0.4 0.3 L=3.1*10-7 Parameter estimates

  5. The Small Parsimony Problem (Fitch-Hartigan-Sankoff) ? / \ / \ ? ? / \ / \ C G T C L’N / /\ \ / / \ \ / / \ \ / LA LC LG LT / \ RA RG .. \ Recursion:L’N = min{LNL + d(N,NL)} + min{LNR + d(N,NR)} Initialisation: Lleaf(N) = 0, if N at leaf - else infinity

  6. Fitch-Hartigan-Sankoff Algorithm (A,C,G,T) (9,7,7,7) / \ / \ Costs: Transition 2, / \ (A ,C,G, T) \ Transversion 5, indel 10. (10,2,10,2) \ / \ \ / \ \ / \ \ / \ \ / \ \ (A,C,G,T) (A,C,G,T) (A,C,G,T) * 0 * * * * * 0 * * 0 *

  7. Distance Concepts on Trees I A: Metric, d( , ) : i: d(a,b)=0 <=> a=b ii: d(a,b)=d(b,a) iii: d(a,b) <= d(a,c) + d(c,b) a b c

  8. s1 s3 s2 s4 s3 s1 s2 i Distance Concepts on Trees II Tree Metric: (distance function originates from tree) d(x,y) + d(z,æ) = d(x,z) + d(y,æ) > d(x,æ) + d(y,z), where z,y,z,æ is a permutation of a,b,c,d. (> implies that no branch has length 0) Reconstruction Principle: d(s1,i) = (d(s1,s2) + d(s1,s3) - d(s2,s3))/2

  9. i s1 s2 s3 Distance Concepts on Trees III Ultra Metric (distance function originates from tree) d(x,y) = d(x,z) > d(x,y), where z,y,z is a permutation of a,b,c. (> implies that no branch has length 0) Reconstruction Principle: d(s1,i) = d(s1,s2)/2

  10. Evolutionary Substitution Process t1 A t2 C C Pi,j(t) = probability of going from i to j in time t.

  11. Probability of a pattern - summing over internal states A C G ? ? ? ? A A T A C G T A C G T A C G T

  12. Felsenstein's recursion Conditional probability, CL(v,N) is the probability for observing the nucleotides at the leaves at a subtree hanging from v, if nucleotide N is found at v. P(v1,v2,N1,N2) = probability for N1 at v1, N2 at v2. I) Initial contition: CL(leaf,N)= 1{N is at leaf} II) Recursion P(v,vl,N,Nl)*CL(vl,Nl) * CL(v,N) = P(v,vr,N,Nr)*CL(vr,Nr) where r refers to right son, l to left son. III) L(p) = πN*BL(Rod,N,p). IV) Total likelihood:Product over all positions:L = Li(p).

  13. Output from Likelihood Method With Clock:Without Clock: s5 s4 23 5.2 \ / /\ 40.9 20.4 / \ \ / / \ ! / \ 1.6 5.6 23 sd4.6 124.4 / \ s1---6-------22---------------11---3 /\ \ ! ! 44.9 /\ \ /\ 7 3.4 4 sd.1.4 / \ \ / \ ! s1 s2 s3 s4 s5 s2 Likelihood: 7.9*10-14   = 0.31.1,0.18.1 6.2*10-12   = 0.34.1 0.16.1 ln(7.9*10-14) –ln(6.2*10-12) is 2 – distributed with (n-2) degrees of freedom.

  14. The Felsenstein Zone Felsenstein-Cavendar (1979) s1 s4 s2 s3 True Tree Reconstructed Tree s1 s2 s3 s4 Patterns:(16 only 8 shown) 0 1 0 0 00 0 0 0 0 1 0 01 0 1 0 0 0 1 01 1 0 0 0 0 0 10 1 1

  15. The Molecular Clock First noted by Zuckerkandl & Pauling (1964) as an empirical fact. How can one detect it? Known Ancestor Time Unknown AncestorTime /\ a at time T. / \ / \ ? \ / \ /\ \ / \ / \ \ / \ / \ \ s1 s2 s1 s2 s3

  16. History of Phylogenetic Methods 1958 Sokal and Michener publishes UGPMA method for making distrance trees with a clock. 1964 Parsimony principle defined, but not advocated by Edwards and Cavalli-Sforza. 1962-65 Zuckerkandl and Pauling introduces the notion of a Molecular Clock. 1967 First large molecular phylogenies by Fitch and Margoliash. 1969 Heuristic method used by Dayhoff to make trees and reconstruct ancetral sequences. 1970: Neyman analyzes three sequence stochastic model with Jukes-Cantor substitution. 1971-73 Fitch, Hartigan & Sankoff independently comes up with same algorithm reconstructing parsimony ancetral sequences. 1973 Sankoff treats alignment and phylogenies as on general problem – phylogenetic alignment.

  17. 1979 Cavender and Felsenstein independently comes up with same evolutionary model where parsimony is inconsistent. Later called the “Felsenstein Zone”. 1981: Felsenstein Maximum Likelihood Model & Program DNAML (i programpakken PHYLIP). 1981 Parsimony tree problem is shown to be NP-Complete. 1985: Felsenstein introduces bootstrapping as confidence interval on phylogenies. 1986 Bandelt and Dress introduces split decompostion as a generalization of trees. 1985-: Many authors (Sawyer, Hein, Stephens, M.Smith) tries to address the problem of recombinations in phylogenies. 1997-9 Thorne et al., Sanderson & Huelsenbeck introduces the Almost Clock. 2000 Rambaut (and others) makes methods that can find trees with non-contemporaneous leaves.

  18. Alignment Pairwise Alignment Again Triple – Quadruple - Many Similarity-Distance Conversion Local Alignment Statistical alignment Conclusion

  19. Parsimony Alignment of two strings. Sequences: s1=CTAGG s2=TTGT. Basic operations: transitions 2 (C-T & A-G), transversions 5, indels (g) 10. {CTA,TT} + GG {CTAG,TTG} = {CTA,TTG} + G- {CTAG,TT} + -G Initial condition: D0,0=0. (Di,j = D(s1[1:i], s2[1:j])) Di,j=min { Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 + g, Di-1,j +g} DCTA,TT + w(GG) =12 + 0 = 12 D4,3=DCTAG,TTG=min{DCTA,TTG + w(G-) = 4 + 10 = 14} DCTAG,TT + w(-G) = 22 + 10 = 32

  20. 40 32 22 14 9 17 T / 30 22 12 4 12 22 G / 20 12 2 - 12 22 32 T / 10 2 10 20 30 40 T / 0 10 20 30 40 50 C T A G G CTAGG Alignment: i v Cost 17 TT-GT

  21. Alignment of three sequences. s1=ATCG s2=ATGCC s3=CTCC Alignment: AT-CG ATGCC CT-CC Consensus sequence: ATCC Configurations in an alignment column: - - n n n - n - - n - n - n n - n - - - n n n - Recursion:Di,j,k = min{Di-i',j-j',k-k' + d(i,i',j,j',k,k')} Initial condition: D0,0,0 = 0. Running time: l1*l2*l3*(23-1) Memory requirement: l1*l2*l3 New phenomena: ancestral sequence. A C ? A

  22. C G G C Parsimony Alignment of four sequences s1=ATCG s2=ATGCC s3=CTCC s4=ACGCG Alignment: AT-CG ATGCC CT-CC ACGCG Configurations in alignment columns - - - n - - - n n n - n n n n - - - n - n n - n - - n - n n n - - n - - n - n - n - n n - n n - n - - - - n n - - n n n n - n - Recursion: Di= min{Di-∆ + d(i,∆)} ∆ [{0,1}4\{0}4] Initial condition: D0 = 0. Computation time: l1*l2*l3*l4*15 Memory : l1*l2*l3*l4

  23. Alignment of many sequences. s1=ATCG, s2=ATGCC, ......., sn=ACGCG Alignment: AT-CG s1 s3 s4 ATGCC \ ! / ..... ---------- ..... / \ ACGCG s2 s5 Configurations in an alignment column: 2n-1 Recursion: Di=min{Di-∆ + d(i,∆)} ∆ [{0,1}n\{0}n] Initial condition: D0,0,..0 = 0. Computation time: ln*(2n-1)*n (l:sequence length, n:number of sequences) Memory requirement: ln

  24. Close-to-Optimal Alignments (Waterman & Byers, 1983) A. Alignments within  of optimal. Ex.  = 2 40 32 22 14 9 * 17 T * / 30 22 12 4 12 22 G * / 20 12 2 - 12 22 32 T / 10 2 10 20 30 40 T / 0 10 20 30 40 50 C T A G G CTAGG Alignment: i iv Cost 19 TTGT- Caveat: There are enormous numbers of suboptimal alignments. B. Sets of postions that are on some suboptimal alignment.

  25. Longer Indels TCATGGTACCGTTAGCGT GCA-----------GCAT gk : cost of indel of length k. Initial condition: D0,0=0 Di,j = min { Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 + g1,Di,j-2 + g2,Di,j-3 + g3,, Di-1,j + g1,Di-2,j + g2,Di-3,j + g3,, } Cubic running time. Quadratic memory.

  26. If gk = a + b*k, then quadratic running time. Gotoh (1982)Di,j is split into 3 types: 1. D0i,j as Di,j, except s1[i] must mactch s2[j]. 2. D1i,j as Di,j, except s1[i] is matched with "-". 3. D2i,j as Di,j, except s2[i] is matched with "-". Then: D0i,j = min(D0i-1,j-1, D1i-1,j-1, D2i-1,j-1) + d(s1[i],s2[j]) D1i,j = min(D1i,j-1 + b, D0i,j-1 + a + b) D2i,j = min(D2i-1,j + b, D0i-1,j + a + b) Comment: 1. Evolutionary Consistency Condition: gi + gj > gi+j

  27. Gotoh Alignment,1981 Let all substitutions cost 2 og let gk= 3 + k, then align ACGT with AT. The alignment must be: ACGT with a cost 5. A--T - n 5 6 2 6 5 5 4 10 11 12 T T 4 0 4 5 6 4 8 9 10 11 A A 0 4 5 6 7 - - - - - A C G T A C G T n n n - - 6 2 6 5 - 8 9 6 7 T T - 0 6 7 8 - 8 45 6 A A 0 - - - - - 4 5 6 7 A C G T A C G T

  28. Distance-Similarity. (Smith-Waterman-Fitch,1982) Di,j=min{Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 +g, Di-1,j +g} Si,j=max{Di-1,j-1 + s(s1[i],s2[j]), Si,j-1 -w, Si-1,j-w} Distance: Transitions:2 Transversions 5 Indels:10 M largest distance between two nucleotides (5). Similarity s(n1,n2) M - d(n1,n2) wk k/(2*M) + gk w 1/(2*M) + g Similarity Parameters: Transversions:0 Transitions:3 Identity:5 Indels: 10 + 1/10

  29. 40/-40.4 32/-27.3 22/-12.2 14/0.9 9/11.0 17/2.9 T 30/-30.3 22/-17.2 12/-2.1 4/11.0 12/2.9 22/-7.2 G 20/-20.2 12/-7.1 2/8.012/-2.1 22/-12.2 32/-22.3 T 10/-10.1 2/3.0 10/-7.1 20/-17.2 30/-27.3 40/-37.4 T 0/0 10/-10.1 20/-20.2 30/-30.3 40/-40.4 50/-50.5 C T A G G Comments 1. The Switch from Dist to Sim is highly analogous to Maximizing {-f(x)} instead of Minimizing {f(x)}. 2. Dist will based on a metric: i. d(x,x) =0, ii. d(x,y) >=0, iii. d(x,y) = d(y,x) & iv. d(x,z) + d(z,y) >= d(x,y). There are no analogous restrictions on Sim, giving it a larger parameter space.

  30. Local alignment Global Alignment:Si,j=max{Di-1,j-1 + s(s1[i],s2[j]), Si,j-1 -w, Si-1,j-w} Local: Si,j=max{Di-1,j-1 + s(s1[i],s2[j]), Si,j-1 -w, Si-1,j-w,0} 0 1 0 .6 1 2 .6 1.6 1.6 3 2.6 Score Parameters: C 0 0 1 0 1 .3 .6 0.6 2 3 1.6 Match: 1 A 0 0 0 1.3 0 1 1 2 3.3 2 1.6 Mismatch -1/3 G / 0 0 .3 .3 1.3 1 2.3 2.3 2 .6 1.6 Gap 1 + k/3 C / 0 0 .6 1.6 .3 1.3 2.6 2.3 1 .6 1.6 GCC-UCG U / GCCAUUG 0 0 2 .6 .3 1.6 2.6 1.3 1 .6 1 A ! 0 1 .6 0 1 3 1.6 1.3 1 1.3 1.6 C / 0 1 0 0 2 1.3 .3 1 .3 2 .6 C / 0 0 0 1 .3 0 0 .6 1 0 0 G / 0 0 0 .6 1 0 0 0 1 1 2 U 0 0 1 .6 0 0 0 0 0 0 0 A 0 0 1 0 0 0 0 0 0 0 0 A 0 0 0 0 0 0 0 0 0 0 0 C A G C C U C G C U U

  31. Progressive Alignment (Feng-Doolittle 1987 J.Mol.Evol.) Can align alignments and given a tree make a multiple alignment. * * alkmny-trwq acdeqrt akkmdyftrwq acdehrt kkkmemftrwq [ P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)]/6 * * *** * * * * * * Sodh atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg---ndtagct sagphfnp lsrk Sodb atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp lsrk Sodl atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg---ndtagct sagphfnp lsrk Sddm atkavcvlkgdgpqvq—-infeak-gdtvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp lsrk Sdmz atkavcvlkgdgpqvq—infeqkesdgpvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp Lsrk Sods vatkavcvlkgdgpqvq—infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk S dpb datkavcvlkgdgpqvq—infeqkesdgpv---wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk sddm Sodb Sodl Sodh Sdmz sods Sdpb

  32. Thorne-Kishino-Felsenstein (1991) Process *A # C G l < m T= 0 # T = t # # # # i. P(s) = (1-l/m)(l/m)l pA#A* .. *pT #T l =length(s) ii. Time reversible

  33. l & m into Alignment Blocks A. Amino Acids Ignored: # - - - # - - - - * - - - - ## # # - # # # # * # # # # k k k e-mt[1-lb(t)](lb(t))k-1 [1-e-mt-mb(t)][1-lb(t)](lb(t))k-1 [1-lb(t)](lb(t))k pk(t) p’k(t) p’’k(t) p’0(t)= mb(t) where b(t)=[1-e(l-m)t]/[m-l] B. AA Considered:T - - - RQ S W Pt(T-->R)*pQ*..*pW*p4(t) 4 T - - - - - R Q S WpR *pQ*..*pW*p’4(t) 4

  34. Basic Pairwise Recursion (O(length3)) i j Survives Dies i-1 i i-1 i j-1 j j i i-1 i-1 i j-2 j j j-1

  35. a-globin (141) and b-globin (146) 430.108 : -log(a-globin) 327.320 : -log(a-globin ‡b-globin) 730.428 : -log(l(sumalign)) l*t: 0.0371805 +/- 0.0135899 m*t: 0.0374396 +/- 0.0136846 s*t: 0.91701 +/- 0.119556 E(Length) E(Insertions,Deletions) E(Substitutions) 143.499 5.37255 131.59 Maximum contributing alignment: V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH Ratio l(maxalign)/l(sumalign) = 0.00565064

  36. 1966: Levenstein formulates distance measure between sequences and instroduces dynamica programming algorithm finding the distance. 1970: Needleman and Wunch compares proteins maximising a similarity score. 1972: Sankoff & Sellers reinvents the basic algorithm. Sankoff can also align subject to the constraint that there must be exactly k indels. 1973: Sankoff makes multiple alignment and phylogeny - both exact & heuristic. 1975: Hirschberg gives linear memory algorithm. 1976: Waterman gives cubic algorithm allowing for indels of arbitrary length. 1976: Waterman introduces alignment without reference to phylogeny. 1981: Waterman, Smith and Fitch shows duality of simiarity and distance. 1982: Gotoh gives quadratic algorithm if gap penalty functionen is gk = a + b*k (for indel of length k). Uses 3 matrices in stead of 1. 1983: Waterman and Byers introduces close-to-optimal alignments. 1984-5: Ukkonen, Myers, Fickett accelerates algorithmen considerably. 1984: Hogeveg and Hespers introduces heuristic multiple phylogenetic alignment. 1984: Fredman introduces triple alignment generalisation of Needleman-Wunch. 1985: Lipman & Wilbur uses hashing.

  37. 1989: Myers introduces alignment with concave gap penalty function. 1991: Thorne, Kishino & Felsenstein makes good model for statistical alignment, partially introduced in 1986 by Thomson & Bishop. 1991: States & Botstein compares a DNA string with a protein in search of frameshift mutations. 1993-4: Gusfield, Lander, Waterman and others introduces parametric alignment. 1987: Feng-Doolittle introducesphylogenetisk alignment: "Once a gap always a gap". 1989: Kececioglou makes strong acceleration of Sankoff's exact algorithm. 1994: Krogh et al & Baldi et al. introduces Hidden Markov Models for multiple alignment. 1995: Mitcheson & Durbin introduces Tree-HMMs allowing sequences generated by an HMM to have a "tree correlation" structure, but not based on an explicit evolutionary process. 1999 - resurgence of interest in statistical alignment

  38. Database Search What is the probability model for database search? What is PAM matrix Statistical Alignment & Homology Testing.

  39. Illustration of database search. Query sequence: q=YQPVNPAL Database: s1=CVDAEGKYL and s2=TTEQRPKNPATYCG i. No mismatches - no gaps: longest common segment q = YQPVNPAL s2 = TTEQRPKNPATYCG length = 3 *** ii. Mismatches allowed - no gaps q = YQPVNPAL s2 = TTEQRPKNPATYCG * *** Similarity function s( , ). Total score, S = s(P,P) + s(V,K) + ..+ s(A,A)

  40. iii.Both mismatches & gaps:Local alignment (LA). q = YQ-PVNPAL s2= TTEQRPKNPATYCG * * *** Here I1 = QPVNPA and J1 = QRPKNPA. S = s(Q,Q) - g + s(P,P) + s(V,K) + ..+ s(A,A) i. If g & mismatch cost: = infinity, LA reduces to longest common segment. ii. If g: = infinity, LA reduces to best segment.

  41. Distributions of Scores. Model: q and database is a series of iid (independent, identically distributed) random variables, Xi's, where P(Xi=j) = pj (j any of the 20 amino acids). Scoring scheme: i. E(s(Xi,Xj)) = pi*pj*s(i,j) < 0 ii. max s(i,j) > 0 and g < 0. iii. s(i,j) = s(j,i)

  42. i. Longest Common Segments. The mean grows proportionally to log(nm), where n and m are the length of the 2 sequences. • ii. Best Segment with score S will follow an extreme value distribution. • P(S>x) = exp(exp(-l*x/u)), • u is a positioning parameter, l a parameter that determines how fast the distribution tails off. • P(S>x) ) = exp(K*m*n*exp(- lx)) • is the x that solves pi*pj*exp(s(i,j)*x) = 1 • K is also a known function of the pi's and the sij's. • iii. Local Alignment: • Distribution unknowm. • Looks like Extreme Value Distribution.

  43. From http://www.vuse.vanderbilt.edu/~mahas1/ce207/type1extremevalue.html

  44. The PAM matrix - Point Accepted Mutations Wi,j= -ln(pi*P2.5i,j/(pi*pj)) s1 = ATWYFCAK-AC Random s1 = ATWYFC-AKAC s2 = ETWYKCALLAD s2* = LTAYKADCWLE s2* is a random permutation of s2 Z = [score(s1,s2)-E{score(s1,s2*)}]/ s.d.{score(s1,s2*)}]

  45. From W.Pearson

  46. From W.Pearson

  47. From W.Pearson

  48. The tactics of BLAST I The most widely used program for database searches in biological sequence databases is BLAST (Altschul et al., 1990) and variants of BLAST. i. It defines a neighbourhood of segments to the segments composing the query sequence. I q=YQPVNPAL {YQPVN, QPVNP, PVNPA, VNPAL} II YQPVN {AQPVN, BQPVN,.., YAPVN,..} ii. It finds segments in the database that matches these neighborhood segments very quickly.

  49. The tactics of BLAST II iii. It heuristically finds large segments giving a good score. iv. If the score of this good segment is statistically significant, then this is extended 5'-ward and 3'-ward by a local alignment algorithm, giving a proposed local alignment.

  50. Homology Test Wi,j= -ln(pi*P2.5i,j/(pi*pj)) D(s1,s2) is evaluated in D(s1,s2*) Real s1 = ATWYFCAK-AC Random s1 = ATWYFC-AKAC s2 = ETWYKCALLAD s2* = LTAYKADCWLE *** ** * * * This test: 1.Test the competing hypothesis that 2 sequences are 2.5 events apart versus infinitely far apart. 2.It only handles substitutions “correctly”. The rationale for indel costs are more arbitrary. 3.It samples in (pi*pj) by permuting the order of amino acids in the second. I.e. uses drawing without replacement – a hypergeometric distribution.

More Related