260 likes | 360 Views
Optimisation Alignment. http://www.stats.ox.ac.uk/~hein/lectures.htm. a -globin ( 141) and b -globin (146) V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF
E N D
Optimisation Alignment. http://www.stats.ox.ac.uk/~hein/lectures.htm a-globin (141) and b-globin (146) V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH It often matches functional region with functional region Determines homology at residue/nucleotide level. Similarity/Distance between molecules can be evaluated Molecular Evolution studies. Homology/Non-homology depends on it.
Number of alignments, T(n,m) 1 9 41 129 321 681 T 1 7 25 63 129 231 G 1 5 13 25 41 61 T 1 3 5 7 9 11 T 1 1 1 1 1 1 C T A G G Alignments columns are equivalent to step (0,1), (1,0) and (1,1) in a [0,n][0,m] matrix. T(n,m) is the number of alignments of s1[1,n] and s2[1,m] then T(n,m)=T(n-1,m)+T(n,m-1)+T(n-1,m-1) T(0,0)=1 T(n,m) > 3 min(n,m) Thus alignment by alignment search for best alignment is not realistic. -n n- n- -n is equivalent to If then alignments are equivalent to choosing two subsets of s1 and and s2 that has to be matched, thus
Parsimony Alignment of two strings. Sequences: s1=CTAGG s2=TTGT. 5, indels (g) 10. Basic operations: transitions 2 (C-T & A-G), transversions 5, indels (g) 10. CTAG CTA G = + TT-G TT- G Cost Additivity {CTA,TT}AL + GG (A) 0 12 Min [ ] {CTA,TTG}AL + G- (B) {CTAG,TTG}AL = 10 4 12 {CTAG,TT}AL + -G (C) 10 32 Di,j=min{Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 + g, Di-1,j +g} Initial condition: D0,0=0. (Di,j := D(s1[1:i], s2[1:j]))
T G T T C T A G G 40 32 22 14 9 17 30 22 12 4 22 20 12 212 22 32 10 2 10 20 30 40 10 20 30 40 50 12 0 CTAGG Alignment: i v Cost 17 TT-GT
Complexity of Accelerations of pairwise algorithm. e { Dynamical Programming: (n+1)(m+1)3=O(nm) Backtracking: O(n+m) Recursion without memory: T(n,m) > 3 min(n,m) Exact acceleration (Ukkonen,Myers). Assume all events cost 1. If de(s1,s2) <2e+|l1-l2|, then d(s1,s2)= de(s1,s2 Heuristic acceleration: Smaller band & larger acceleration, but no guarantee of optimum.
Alignment of three sequences. A C ? A s1=ATCG s2=ATGCC s3=CTCC A A C Alignment: AT-CG ATGCC CT-CC Consensus sequence: ATCC Configurations in an alignment column: - - n n n - n - - n - n - n n - n - - - n n n - Recursion:Di,j,k = min{Di-i',j-j',k-k' + d(i,i',j,j',k,k')} Initial condition: D0,0,0 = 0. Running time: l1*l2*l3*(23-1) Memory requirement: l1*l2*l3 New phenomena: ancestral/consensus sequence.
Parsimony Alignment of four sequences C G G C s1=ATCG s2=ATGCC s3=CTCC s4=ACGCG Alignment: AT-CG ATGCC CT-CC ACGCG G C C G Configurations in alignment columns: - - - n - - - n n n - n n n n - - - n - n n - n - - n - n n n - - n - - n - n - n - n n - n n - n - - - - n n - - n n n n - n - Recursion:Di= min{Di-∆ + d(i,∆)} ∆ [{0,1}4\{0}4] Initial condition: D0 = 0.Memory : l1*l2*l3*l4 Computation time: l1*l2*l3*l4*24Memory : l1*l2*l3*l4 New Phenomena: Cost and alignment is phylogeny dependent
Alignment of many sequences. s1=ATCG, s2=ATGCC, ......., sn=ACGCG Alignment: AT-CG ATGCC ..... ..... ACGCG s1 s3 s4 \ ! / ---------- / \ s2 s5 Recursion: Di=min{Di-∆ + d(i,∆)} ∆ [{0,1}n\{0}n] Configurations in an alignment column: 2n-1 Initial condition: D0,0,..0 = 0. Computation time: ln*(2n-1)*n Memory requirement: ln (l:sequence length, n:number of sequences)
Progressive Alignment (Feng-Doolittle 1987 J.Mol.Evol.) Can align alignments and given a tree make a multiple alignment. * * alkmny-trwq acdeqrt akkmdyftrwq acdehrt kkkmemftrwq [ P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)]/6 * * *** * * * * * * Sodh atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk Sodb atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte—-glhgfhvhqfg----ndtagct sagphfnp lsrk Sodl atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk Sddm atkavcvlkgdgpqvq -infeak-gdtvkvwgsikglte—-glhgfhvhqfg----ndtagct sagphfnp lsrk Sdmz atkavcvlkgdgpqvq— infeqkesdgpvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp Lsrk Sods vatkavcvlkgdgpqvq— infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk Sdpb datkavcvlkgdgpqvq—-infeqkesdgpv----wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk sddm Sodb Sodl Sodh Sdmz sods Sdpb
Approaches to Sequence Analysis Data {GTCAT,GTTGGT,GTCA,CTCA} Parsimony, similarity, optimisation. GT-CAT GTTGGT GT-CA- CT-CA- Ideal Practice: 1 phase analysis. Actual Practice: 2 phase analysis. statistics s1 s2 s3 s4 TKF91 - The combined substitution/indel process. Acceleration of Basic Algorithm Many Sequence Algorithm MCMC Approaches
T= 0 # - - - ## # # # T = t # # # # s1 r s2 s1 s2 s1 s2 Thorne-Kishino-Felsenstein (1991) Process A # C G * • (birth rate) < m(death rate) 1. P(s) = (1-l/m)(l/m)l pA#A* .. *pT #T l =length(s) 2. Time reversible:
# - - - - - # # # # k * - - - - * # # # # k l & m into Alignment Blocks A. Amino Acids Ignored: # - - - ## # # k e-mt[1-lb](lb)k-1 [1-lb-mb](lb)k [1-lb](lb)k p’k(t) pk(t) p’’k(t) b=[1-e(l-m)t]/[m-le(l-m)t] p’0(t)= mb(t) B. Amino Acids Considered: T - - - RQ S W Pt(T-->R)*pQ*..*pW*p4(t) 4 • T - - - - • R Q S WpR *pQ*..*pW*p’4(t) • 4
Dpk = Dt*[l*(k-1) pk-1 + m*k*pk+1 - (l+m)*k*pk] Dp’k=Dt*[l*(k-1) p’k-1+m*(k+1)*p’k+1-(l+m)*k*p’k+m*pk+1] Dp’’k=Dt*[l*k*p’’k-1+m*(k+1)*p’’k+1- [(k+1)l+km]*p’’k] Differential Equations for p-functions # - - ... - # # # ... # # - - - ... - - # # # ... # * - - - ... - * # # # ... # Initial Conditions: pk(0)= pk’’(0)= p’k (0)= 0 k>1 p1(0)= p0’’(0)= 1. p’0 (0)= 0
Basic Pairwise Recursion (O(length3)) i j Survives: Dies: i-1 i i-1 i j-1 j j i-1 i i j-2 j i-1 j j-1 …………………… …………………… …………………… e-mt[1-lb](lb)k-1, where …………………… …………………… b=[1-e(l-m)t]/[m-le(l-m)t] 0… j (j+1) cases 1… j (j) cases
Basic Pairwise Recursion (O(length3)) survive death j (i-1,j) j-1 (i-1,j-1) Initial condition: p’’=s2[1:j] ………….. (i-1,j-k) ………….. ………….. i-1 i (i,j)
Corner Cutting ~100-1000 Better Numerical Search ~10-100 Ex.: good start guess, 28 evaluations, 3 iterations Accelleration of Pairwise Algorithm (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000) Simpler Recursion ~3-10 Faster Computers ~250 1991-->2000 ~106
a-globin (141) and b-globin (146) (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000) 430.108 : -log(a-globin) 327.320 : -log(a-globin -->b-globin) 747.428 : -log(a-globin, b-globin) = -log(l(sumalign)) l*t: 0.0371805 +/- 0.0135899 m*t: 0.0374396 +/- 0.0136846 s*t: 0.91701 +/- 0.119556 E(Length) E(Insertions,Deletions) E(Substitutions) 143.499 5.37255 131.59 Maximum contributing alignment: V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH Ratio l(maxalign)/l(sumalign) = 0.00565064
VLSPADNAL.....DLHAHKR 141 AA long *########### …. ### 141 AA long 2 108 years 2 107 years 2 109 years *########### …. ### *########### …. ### ???????????????????? k AA long 109 years The invasion of the immortal link
Statistical Alignment via Hidden Markov Models Steel and Hein,2001 + Holmes and Bruno,2001 - # # E # # - E * * lb l/m (1- lb)e-m l/m (1- lb)(1- e-m) (1- l/m) (1- lb) - # lb l/m (1- lb)e-m l/m (1- lb)(1- e-m) (1- l/m) (1- lb) # #lb l/m (1- lb)e-m l/m (1- lb)(1- e-m) (1- l/m) (1- lb) # - lb HMM formulation allows: p(C)f(CT) T Finding most probable alignment p(A) Probability of sequence pair l/m (1- lb)(1-e-m) p(C)f(CC) C l/m (1- lb)e-m Probability of specific edge C A C
Why multiple statistical alignment is non-trivial. Steel & Hein, 2001, Hein, 2001, Holmes and Bruno, 2001 *ACG C *TT GT s2 s1 a s3 *ACG G Statistical Alignment Optimisation Alignment AT-C G ATGC C CT-C C *###### * (l/m) • An HMM generating alignment according to TKF91: 1 0 #- 1 -- 2 #- 3 ## 4 -# 5 ## 4 0 # - 3 5 2
Maximum likelihood phylogeny and alignment Human alpha hemoglobin;Human beta hemoglobin; Human myoglobin Bean leghemoglobin Probability of data e-1560.138 Probability of data and alignment e-1593.223 Probability of alignment given data 4.279 * 10-15 = e-33.085 Ratio of insertion-deletions to substitutions: 0.0334 Gerton Lunter, Istvan Miklos, Alexei Drummond, Yun Song
Metropolis-Hastings Statistical Alignment Lunter, Drummond, Miklos, Jensen & Hein, 2005
Unobservable Knudsen.., 99 Eddy & co. C C A A Meyer and Durbin 02 Pedersen …, 03 Siepel & Haussler 03 G C A U U Footprinting -Signals (Churchill and Felsenstein, 96) Unobservable The Basics of Evolutionary Annotation Many aligned sequences related by a known phylogeny positions 1 n 1 sequences k slow - rs fast - rf HMM
Statistical Alignment andFootprinting. acgtttgaaccgag---- 1 acgtttgaaccgag---- sequences sequences 1 k k acgtttgaaccgag---- 1 sequences (A,S) Alignment HMM k Alignment HMM Structure HMM Alignment HMM Signal HMM
SAPF - Statistical Alignment and Phylogenetic Footprinting Alignment HMM Signal HMM Sum out Annotate 1 2 Target
BigFoot http://www.stats.ox.ac.uk/research/genome/software • Dynamical programming is too slow for more than 4-6 sequences • MCMC integration is used instead – works until 10-15 sequences • For more sequences other methods are needed.