1 / 26

a -globin ( 141) and b -globin (146)

Optimisation Alignment. http://www.stats.ox.ac.uk/~hein/lectures.htm. a -globin ( 141) and b -globin (146) V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF

prem
Download Presentation

a -globin ( 141) and b -globin (146)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Optimisation Alignment. http://www.stats.ox.ac.uk/~hein/lectures.htm a-globin (141) and b-globin (146) V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH It often matches functional region with functional region Determines homology at residue/nucleotide level. Similarity/Distance between molecules can be evaluated Molecular Evolution studies. Homology/Non-homology depends on it.

  2. Number of alignments, T(n,m) 1 9 41 129 321 681 T 1 7 25 63 129 231 G 1 5 13 25 41 61 T 1 3 5 7 9 11 T 1 1 1 1 1 1 C T A G G Alignments columns are equivalent to step (0,1), (1,0) and (1,1) in a [0,n][0,m] matrix. T(n,m) is the number of alignments of s1[1,n] and s2[1,m] then T(n,m)=T(n-1,m)+T(n,m-1)+T(n-1,m-1) T(0,0)=1 T(n,m) > 3 min(n,m) Thus alignment by alignment search for best alignment is not realistic. -n n- n- -n is equivalent to If then alignments are equivalent to choosing two subsets of s1 and and s2 that has to be matched, thus

  3. Parsimony Alignment of two strings. Sequences: s1=CTAGG s2=TTGT. 5, indels (g) 10. Basic operations: transitions 2 (C-T & A-G), transversions 5, indels (g) 10. CTAG CTA G = + TT-G TT- G Cost Additivity {CTA,TT}AL + GG (A) 0 12 Min [ ] {CTA,TTG}AL + G- (B) {CTAG,TTG}AL = 10 4 12 {CTAG,TT}AL + -G (C) 10 32 Di,j=min{Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 + g, Di-1,j +g} Initial condition: D0,0=0. (Di,j := D(s1[1:i], s2[1:j]))

  4. T G T T C T A G G 40 32 22 14 9 17 30 22 12 4 22 20 12 212 22 32 10 2 10 20 30 40 10 20 30 40 50 12 0 CTAGG Alignment: i v Cost 17 TT-GT

  5. Complexity of Accelerations of pairwise algorithm. e { Dynamical Programming: (n+1)(m+1)3=O(nm) Backtracking: O(n+m) Recursion without memory: T(n,m) > 3 min(n,m) Exact acceleration (Ukkonen,Myers). Assume all events cost 1. If de(s1,s2) <2e+|l1-l2|, then d(s1,s2)= de(s1,s2 Heuristic acceleration: Smaller band & larger acceleration, but no guarantee of optimum.

  6. Alignment of three sequences. A C ? A s1=ATCG s2=ATGCC s3=CTCC A A C Alignment: AT-CG ATGCC CT-CC Consensus sequence: ATCC Configurations in an alignment column: - - n n n - n - - n - n - n n - n - - - n n n - Recursion:Di,j,k = min{Di-i',j-j',k-k' + d(i,i',j,j',k,k')} Initial condition: D0,0,0 = 0. Running time: l1*l2*l3*(23-1) Memory requirement: l1*l2*l3 New phenomena: ancestral/consensus sequence.

  7. Parsimony Alignment of four sequences C G G C s1=ATCG s2=ATGCC s3=CTCC s4=ACGCG Alignment: AT-CG ATGCC CT-CC ACGCG G C C G Configurations in alignment columns: - - - n - - - n n n - n n n n - - - n - n n - n - - n - n n n - - n - - n - n - n - n n - n n - n - - - - n n - - n n n n - n - Recursion:Di= min{Di-∆ + d(i,∆)} ∆ [{0,1}4\{0}4] Initial condition: D0 = 0.Memory : l1*l2*l3*l4 Computation time: l1*l2*l3*l4*24Memory : l1*l2*l3*l4 New Phenomena: Cost and alignment is phylogeny dependent

  8. Alignment of many sequences. s1=ATCG, s2=ATGCC, ......., sn=ACGCG Alignment: AT-CG ATGCC ..... ..... ACGCG s1 s3 s4 \ ! / ---------- / \ s2 s5 Recursion: Di=min{Di-∆ + d(i,∆)} ∆ [{0,1}n\{0}n] Configurations in an alignment column: 2n-1 Initial condition: D0,0,..0 = 0. Computation time: ln*(2n-1)*n Memory requirement: ln (l:sequence length, n:number of sequences)

  9. Progressive Alignment (Feng-Doolittle 1987 J.Mol.Evol.) Can align alignments and given a tree make a multiple alignment. * * alkmny-trwq acdeqrt akkmdyftrwq acdehrt kkkmemftrwq [ P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)]/6 * * *** * * * * * * Sodh atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk Sodb atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte—-glhgfhvhqfg----ndtagct sagphfnp lsrk Sodl atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk Sddm atkavcvlkgdgpqvq -infeak-gdtvkvwgsikglte—-glhgfhvhqfg----ndtagct sagphfnp lsrk Sdmz atkavcvlkgdgpqvq— infeqkesdgpvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp Lsrk Sods vatkavcvlkgdgpqvq— infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk Sdpb datkavcvlkgdgpqvq—-infeqkesdgpv----wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk sddm Sodb Sodl Sodh Sdmz sods Sdpb

  10. Approaches to Sequence Analysis Data {GTCAT,GTTGGT,GTCA,CTCA} Parsimony, similarity, optimisation. GT-CAT GTTGGT GT-CA- CT-CA- Ideal Practice: 1 phase analysis. Actual Practice: 2 phase analysis. statistics s1 s2 s3 s4 TKF91 - The combined substitution/indel process. Acceleration of Basic Algorithm Many Sequence Algorithm MCMC Approaches

  11. T= 0 # - - - ## # # # T = t # # # # s1 r s2 s1 s2 s1 s2 Thorne-Kishino-Felsenstein (1991) Process A # C G * • (birth rate) < m(death rate) 1. P(s) = (1-l/m)(l/m)l pA#A* .. *pT #T l =length(s) 2. Time reversible:

  12. # - - - - - # # # # k * - - - - * # # # # k l & m into Alignment Blocks A. Amino Acids Ignored: # - - - ## # # k e-mt[1-lb](lb)k-1 [1-lb-mb](lb)k [1-lb](lb)k p’k(t) pk(t) p’’k(t) b=[1-e(l-m)t]/[m-le(l-m)t] p’0(t)= mb(t) B. Amino Acids Considered: T - - - RQ S W Pt(T-->R)*pQ*..*pW*p4(t) 4 • T - - - - • R Q S WpR *pQ*..*pW*p’4(t) • 4

  13. Dpk = Dt*[l*(k-1) pk-1 + m*k*pk+1 - (l+m)*k*pk] Dp’k=Dt*[l*(k-1) p’k-1+m*(k+1)*p’k+1-(l+m)*k*p’k+m*pk+1] Dp’’k=Dt*[l*k*p’’k-1+m*(k+1)*p’’k+1- [(k+1)l+km]*p’’k] Differential Equations for p-functions # - - ... - # # # ... # # - - - ... - - # # # ... # * - - - ... - * # # # ... # Initial Conditions: pk(0)= pk’’(0)= p’k (0)= 0 k>1 p1(0)= p0’’(0)= 1. p’0 (0)= 0

  14. Basic Pairwise Recursion (O(length3)) i j Survives: Dies: i-1 i i-1 i j-1 j j i-1 i i j-2 j i-1 j j-1 …………………… …………………… …………………… e-mt[1-lb](lb)k-1, where …………………… …………………… b=[1-e(l-m)t]/[m-le(l-m)t] 0… j (j+1) cases 1… j (j) cases

  15. Basic Pairwise Recursion (O(length3)) survive death j (i-1,j) j-1 (i-1,j-1) Initial condition: p’’=s2[1:j] ………….. (i-1,j-k) ………….. ………….. i-1 i (i,j)

  16. Corner Cutting ~100-1000 Better Numerical Search ~10-100 Ex.: good start guess, 28 evaluations, 3 iterations Accelleration of Pairwise Algorithm (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000) Simpler Recursion ~3-10 Faster Computers ~250 1991-->2000 ~106

  17. a-globin (141) and b-globin (146) (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000) 430.108 : -log(a-globin) 327.320 : -log(a-globin -->b-globin) 747.428 : -log(a-globin, b-globin) = -log(l(sumalign)) l*t: 0.0371805 +/- 0.0135899 m*t: 0.0374396 +/- 0.0136846 s*t: 0.91701 +/- 0.119556 E(Length) E(Insertions,Deletions) E(Substitutions) 143.499 5.37255 131.59 Maximum contributing alignment: V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH Ratio l(maxalign)/l(sumalign) = 0.00565064

  18. VLSPADNAL.....DLHAHKR 141 AA long *########### …. ### 141 AA long 2 108 years 2 107 years 2 109 years *########### …. ### *########### …. ### ???????????????????? k AA long 109 years The invasion of the immortal link

  19. Statistical Alignment via Hidden Markov Models Steel and Hein,2001 + Holmes and Bruno,2001 - # # E # # - E * * lb l/m (1- lb)e-m l/m (1- lb)(1- e-m) (1- l/m) (1- lb) - # lb l/m (1- lb)e-m l/m (1- lb)(1- e-m) (1- l/m) (1- lb) # #lb l/m (1- lb)e-m l/m (1- lb)(1- e-m) (1- l/m) (1- lb) # - lb HMM formulation allows: p(C)f(CT) T Finding most probable alignment p(A) Probability of sequence pair l/m (1- lb)(1-e-m) p(C)f(CC) C l/m (1- lb)e-m Probability of specific edge C A C

  20. Why multiple statistical alignment is non-trivial. Steel & Hein, 2001, Hein, 2001, Holmes and Bruno, 2001 *ACG C *TT GT s2 s1 a s3 *ACG G Statistical Alignment Optimisation Alignment AT-C G ATGC C CT-C C *###### * (l/m) • An HMM generating alignment according to TKF91: 1 0 #- 1 -- 2 #- 3 ## 4 -# 5 ## 4 0 # - 3 5 2

  21. Maximum likelihood phylogeny and alignment Human alpha hemoglobin;Human beta hemoglobin; Human myoglobin Bean leghemoglobin Probability of data e-1560.138 Probability of data and alignment e-1593.223 Probability of alignment given data 4.279 * 10-15 = e-33.085 Ratio of insertion-deletions to substitutions: 0.0334 Gerton Lunter, Istvan Miklos, Alexei Drummond, Yun Song

  22. Metropolis-Hastings Statistical Alignment Lunter, Drummond, Miklos, Jensen & Hein, 2005

  23. Unobservable Knudsen.., 99 Eddy & co. C C A A Meyer and Durbin 02 Pedersen …, 03 Siepel & Haussler 03 G C A U U Footprinting -Signals (Churchill and Felsenstein, 96) Unobservable The Basics of Evolutionary Annotation Many aligned sequences related by a known phylogeny positions 1 n 1 sequences k slow - rs fast - rf HMM

  24. Statistical Alignment andFootprinting. acgtttgaaccgag---- 1 acgtttgaaccgag---- sequences sequences 1 k k acgtttgaaccgag---- 1 sequences (A,S) Alignment HMM k Alignment HMM Structure HMM Alignment HMM Signal HMM

  25. SAPF - Statistical Alignment and Phylogenetic Footprinting Alignment HMM Signal HMM Sum out Annotate 1 2 Target

  26. BigFoot http://www.stats.ox.ac.uk/research/genome/software • Dynamical programming is too slow for more than 4-6 sequences • MCMC integration is used instead – works until 10-15 sequences • For more sequences other methods are needed.

More Related