370 likes | 497 Views
Approaches to Data Analysis. s 1. s 2. s 3. s 4. Data {GTCAT,GTTGGT,GTCA,CTCA}. Parsimony, similarity, optimisation. GT-CAT GTTGGT GT-CA- CT-CA-. statistics. statistics. Ideal Practice: 1 phase analysis. Actual Practice: 2 phase analysis. Origins of Statistical Alignment.
E N D
Approaches to Data Analysis s1 s2 s3 s4 Data {GTCAT,GTTGGT,GTCA,CTCA} Parsimony, similarity, optimisation. GT-CAT GTTGGT GT-CA- CT-CA- statistics statistics Ideal Practice: 1 phase analysis. Actual Practice: 2 phase analysis.
Origins of Statistical Alignment Bishop & Thompson 1986 Thorne Kishino & Felsenstein 1991 Challenges to Statistical Alignment Understanding the Basic Model Speed of the Basic Algorithm Analyzing Many Sequences - Multiple Statistical Alignment Realistic Models The Biological Problems Phylogeny & Molecular Evolution Alignment Homology Testing + More
Thorne-Kishino-Felsenstein (1991) Process * A # C G T= 0 # - - - ## # # # T = t # # # # l < m P(s) = (1-l/m)(l/m)l pA#A* .. *pT #T l =length(s) Time reversible
The invasion of the immortal link (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000)
Time reversibility Pi,j(t) = probability that i has evolved into j after time t. p(i) = probability of i after infinitely long time - equilibrium distribution p(i) Pi,j(t) = p(j) Pj,i(t) a t1 t2 s2 s1 s1 s2 t1 +t2
Two kinds of alignment Optimisation (here Parsimony): Shortest Path C T G A G G G T - - G C CTGAGG GTGC Statistical: Probability and Sum over all Paths C T G A G G G T - - G C CTGAGG GTGC
l & m into Alignment Blocks A. Amino Acids Ignored: # - - - # - - - - * - - - - ## # # - # # # # * # # # # k k k e-mt[1-lb(t)](lb(t))k-1 [1-e-mt-mb(t)][1-lb(t)](lb(t))k-1 [1-lb(t)](lb(t))k pk(t) p’k(t) p’’k(t) p’0(t)= mb(t) b(t)=[1-e(l-m)t]/[m-l] B. Amino Acids Considered: T - - - RQ S W Pt(T-->R)*pQ*..*pW*p4(t) 4 T - - - - - R Q S WpR *pQ*..*pW*p’4(t)
Illustration of single equation. # - - ... - # # # ... # pk+1 m # - - ... - - # # ... # p’k m*k l*k l*(k-1) m*(k+1) # - ... - - # ... # # - - - ... - - # # # ... # p’k+1 p’k-1 Dp’k=Dt*[l*(k-1) p’k-1+m*(k+1)*p’k+1 -(l+m)*k*p’k+m*pk+1]
Diff. Equations for p-functions # - - ... - # # # ... # Dpk = Dt*[l*(k-1) pk-1 + m*k*pk+1 - (l+m)*k*pk] # - - - ... - - # # # ... # Dp’k=Dt*[l*(k-1) p’k-1+m*(k+1)*p’k+1-(l+m)*k*p’k+m*pk+1] * - - - ... - * # # # ... # Dp’’k=Dt*[l*k*p’’k-1+m*(k-1)*p’’k+1-((k+1)l+mk)*p’’k] Initial Conditions: pk(0)= pk’’(0)= p’k (0)= 0 k>1 p0(0)= p0’’(0)= 1. p’0 (0)= 0
Basic Pairwise Recursion (O(length3)) i i-1 j j-1 i j Survives: Dies: i-1 i i-1 i j-1 j j i-1 i j-2 j …………………… …………………… …………………… …………………… …………………… …………………… 1… j (j) cases 0… j (j+1) cases
survive death Basic Pairwise Recursion (O(length3)) j (i,j) (i-1,j) j-1 (i-1,j-1) Initial condition: p’’=s2[1:j] ………….. (i-1,j-k) ………….. ………….. i-1 i
Fundamental Pairwise Recursion. P(s1i->s2j) = p’0P(s1i-1->s2j) + Initial Condition P(s10 ->s2j) = pj’’ps2[1:j] Probability of observationP(s1,s2) = P(s1) P(s1 ->s2) Simplification: Ri,j=(p1f(s1[i],s2[j])+p’1ps2j[j])P(s1i-1->s2j-1) + lb ps2[j]Ri,j-1 P(s1i->s2j) = Ri,j + p’0 P(s1i->s2j-1) P(s1i->s2j) = p’0P(s1i-1->s2j)+ lbP(s1i->s2j-1) + (p1f(s1[i],s2[j]+p’1ps2j[j]- lb ps2j[j] ))P(s1i-1->s2j-1)
Geometric Like Offspring Number # - - - # - - - - ## # # - # # # # k k e-mt[1-lb(t)](lb(t))k-1 [1-e-mt-mb(t)][1-lb(t)](lb(t))k-1 pk(t) p’k(t) p’0(t)= mb(t) Alternative traversal: Die forward in time Give birth backwards Trace leftmost unfinished branch. After one survivor, branch lengths With birth possibility always t.
Quadratic Recursion (i,j) (i-1,j) (i-1,j-1) (i,j-1) Two state recursion: Ri,j=(p1f(s1[i],s2[j])+p’1ps2j[j])P(s1i-1->s2j-1)+ lb ps2[j]Ri,j-1 P(s1i->s2j) = Ri,j + p’0 P(s1i->s2j-1) One state recursion: P(s1i->s2j) = p’0P(s1i-1->s2j)+ lbP(s1i->s2j-1) + (p1f(s1[i],s2[j]+p’1ps2j[j]- lb ps2j[j] ))P(s1i-1->s2j-1) 1. Summation, Maximization and Sampling of Alignments. 2. For more sequences: Ancestral Sequences & Alignments.
Likelihood Surface (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000)
a-globin (141) and b-globin (146) (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000) 430.108 : -log(a-globin) 327.320 : -log(a-globin -->b-globin) 730.428 : -log(a-globin, b-globin) = -log(l(sumalign)) l*t: 0.0371805 +/- 0.0135899 m*t: 0.0374396 +/- 0.0136846 s*t: 0.91701 +/- 0.119556 E(Length) E(Insertions,Deletions) E(Substitutions) 143.499 5.37255 131.59 Maximum contributing alignment: V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH Ratio l(maxalign)/l(sumalign) = 0.00565064
Likelihood Surface (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000)
Homology Test Wi,j= -ln(pi*P2.5i,j/(pi*pj)) D(s1,s2) is evaluated in D(s1,s2*) Real s1 = ATWYFCAK-AC Random s1 = ATWYFC-AKAC s2 = ETWYKCALLAD s2* = LTAYKADCWLE *** ** * * * This test: 1. Test the competing hypothesis that 2 sequences are 2.5 events apart versus infinitely far apart. 2. It only handles substitutions “correctly”. The rationale for indel costs are more arbitrary. 3. It samples in (pi*pj) by permuting the order of amino acids in the second. I.e. uses drawing without replacement – a hypergeometric distribution.
a-, myoglobin homology test (From Hein,Wiuf,Knudsen,Moeller & Wiebling 2000)
Algorithm for alignment on star tree (O(length6))(Steel & Hein, 2001) *ACGC *TT GT s2 s1 a *###### * (l/m) s3 *ACG GT
Binary Tree Problem TGA ACCT s1 s3 a1 a2 s2 s4 GTT ACG
Binary Tree Problem TGA ACCT a1a2 * * # # # - - # # # - # s1 s3 a1 a2 s2 s4 GTT ACG • The problem would be simpler if: • The ancestral sequences & their alignment was known. • ii. The alignment of ancestral alignment columns to leaf sequences was known. A markov chain generating ancestral alignments can solve the problem!!
# E * l/m 1- l/m #l/m 1- l/m - # E lb 1- lb lb 1- lb * * - # Markov Chains Generating the p-functions Ancestral Sequence Generator * # # # # p’’ function generator * - - - - * # # # # p’/p function generator # - - - - # # # # # - # E lb 1- lb 1-mb mb # # # - # - - - - - # # # # lb 1- lb - #
Generating Ancestral Alignments. - # # E # # - E * * lb l/m (1- lb)e-m l/m (1- lb)(1- e-m) (1- l/m) (1- lb) - # lb l/m (1- lb)e-m l/m (1- lb)(1- e-m) (1- l/m) (1- lb) _ #lb l/m (1- lb)e-m l/m (1- lb)(1- e-m) (1- l/m) (1- lb) # - lb a1 * - # E a2 * # # E lb l/m (1- lb)e-m (1- l/m) (1- lb)
The Basic Recursion ”Remove 1st step” - recursion: S E ”Remove last step” - recursion:
4-Sequence Recursion II: First Step Removal Pa(Sk): Epifixes (S[k+1:l]) starting in given MC starts in a. Pa(Sk) = Where P’(kS i,H) = F(kSi,H)
Example: 4 globins logLikelikelihood = -1593.223
O(lk)algorithm for k sequences s1 s3 a1 a2 s2 s4 Two Approaches: Use geometric tails of p-functions & suitable rearrangements. Make ”ancestral” Markov Chain for the leaves as well:
Contrasting Probability & Distance Recursions # # # # - # = = + Probability: O(l2k) – O(lk) possible Distance (Sankoff, 1973) - O(lk): A C - A 15 cases
k ancestral sequence Markov Chain State Space: * E # * E # All connected . , . , # & . . # #-tuples * E # # a4 - a4 - # / # # / a1 ---a2----a3 a1 ---a2----a3 # \ - \ - a5 - a5
k ancestral sequences: 2 Problems 1. Ambigous Indel/Alignment relationship. a #- / \ / \ s1 -# -# s2 s1 - # - - - # a # - - # - - s2 - - # - # - 2. Grand children before younger siblings. a # - - - - - - - - a1 # # - - - - # # # a2 - # # # # # - - -
Transition Probabilities between two k-ancestral states 0 #- 1 -- 2 #- 3 ## 4 -# 5 ## 6 #- 7 # - 1 4 0 # - 5 2 3 6 7
Gibbs Samplers for Statistical Alignment Holmes & Bruno (2001): Sampling Ancestors to pairs. Jensen & Hein (subm.): Sampling nodes adjacent to triples Slower basic operation, faster mixing
Work in Progress & Plans State Reduction (Lunter, Song, Hein & Miklos) Longer Insertion-Deletions (Miklos, Lunter, Holmes) * A TC CG * A TC CG Heterogeneity along Sequence(Skou, Hein,..) HMM/SCFG – like? TT Acceleration & Implementation (Lunter & Song) MCMC Methods (Ledet Jensen, Holmes,...........)
Statistical Alignment Summary Motivation for statistical alignment: i. Data is sequences - not alignment! ii. The focus on alignments is exagerated!! Progress Major Accelerations for pairwise/multiple statistical alignment Longer Insertion-Deletions models Challenges ahead Position Heterogeneity – hmm & scfg analogues. Algorithms for large data sets (>5 sequences) MCMC. Local alignment version Software ???
Acknowledgements (www.stats.ox.ac.uk/hein) Pairwise (with Knudsen, Wiuf, Møller, Wibling) Simpler recursion. Computational acceleration. Multiple Star Tree (with M.Steel) Binary Tree (with C.Storm, Jens Ledet, Lunter, Miklos,Song,Holmes,..) Gibbs Multiple Alignment (withJens Ledet) Articles & Manuscripts: 1. Hein,J.J., C.Wiuf, B.Knudsen, Møller, M., and G.Wibling (2000): Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit. (J. Molecular Biology 302.265-279) 2. J.J.Hein (2001): A generalisation of the Thorne-Kishino-Felsenstein model of Statistical Alignment to k sequences related by a binary tree. (Pac.Symp.Biocompu. 2001 p179-190 (eds RB Altman et al.) 3. Steel, M. & J.J.Hein (2001): A generalisation of the Thorne-Kishino-Felsenstein model of Statistical Alignment to k sequences related by a star tree. ( Letters in Applied Mathematics) 4. JJ Hein, J.L.Jensen, C.Pedersen (2002) Algorithms for Multiple Statistical Alignment. (submitted to PNAS) 5. J.L.Jensen & JJ Hein (2002) A Gibbs Sampler for Multiple Statistical Alignment. (submitted Statistical Journal…) 6. Lunter, Song, Miklos & Hein (2002) (In Press J.Com.Biol.) 7. Lunter, Song, & Hein (2003) (in prep.) 8. Miklos, Lunter & Holmes (2002) (in press MBE) 9. Miklos, I & Toroczkai Z. (2001) An improved model for statistical alignment, in WABI2001, Lecture Notes in Computer Science, (O. Gascuel & BME Moret, eds) 2149:1-10. Springer, Berlin 10 Miklos, I (2002) An improved algorithm for statistical alignment of sequences related by a star tree. Bul. Math. Biol. 64:771-779. 11 Miklos, I: (2002) “Algorithm for statistical alignment of sequences derived from a Poisson sequence length distribution” Disc. Appl. Math. accepted. 12 Holmes, I & W.Bruno (2001) “Evolutionary HMMs: A Bayesian Approach to Multiple Alignment ” Bioinformatics 17.9.803-20.