
Probability Theory and Basic Alignment of String Sequences



Presentation Transcript


  1. Probability Theory and Basic Alignment of String Sequences Chapter 1.1-2.3 S. Maarschalkerweerd & A. Tjhang

  2. Overview • Probability Theory - Maximum Likelihood - Bayes' Theorem • Pairwise Alignment - The Scoring Model - Alignment Algorithms

  3. Probability Theory

  4. Probability Theory • What is a probabilistic model? • Simple example: what is the probability of a base sequence $x_1 x_2 \ldots x_n$? If $p(x_1), p(x_2), \ldots, p(x_n)$ are independent of each other, then $P(x_1 x_2 \ldots x_n) = \prod_{i=1}^{n} p(x_i)$ • If $p_C = 0.3$, $p_T = 0.2$ and the sequence is CTC: $P(\mathrm{CTC}) = 0.3 \times 0.2 \times 0.3 = 0.018$
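
The product rule on this slide can be checked with a minimal Python sketch. The function name `sequence_probability` and the dictionary layout are illustrative assumptions; only the values for C and T come from the slide.

```python
from math import prod

def sequence_probability(seq, base_probs):
    """Probability of seq when every position is drawn independently."""
    return prod(base_probs[base] for base in seq)

# Values from the slide; probabilities for the other bases are not given there.
base_probs = {"C": 0.3, "T": 0.2}

p_ctc = sequence_probability("CTC", base_probs)
print(p_ctc)  # approximately 0.018, up to floating-point rounding
```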

  5. Maximum Likelihood Estimation • Estimate the parameters of the model from large sets of examples (the training set) • For example: P(T) and P(C) are estimated from their frequency in a database of residues • Avoid overfitting: if the database is too small, the model also fits the noise in the training set
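
As a rough illustration of frequency-based estimation, the sketch below counts residues in a toy training set. The training sequences are made up, and the `pseudocount` parameter is an assumption that is not on the slide: it is one common way to soften the overfitting problem when the database is small.

```python
from collections import Counter

def estimate_frequencies(training_seqs, alphabet="ACGT", pseudocount=0.0):
    """Estimate residue probabilities from observed counts.

    With pseudocount=0 this is the plain maximum likelihood estimate
    (relative frequency). A positive pseudocount is an assumed smoothing
    step, not something stated in the slides.
    """
    counts = Counter()
    for seq in training_seqs:
        counts.update(seq)
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

training = ["CTCA", "TTGC"]  # hypothetical toy data
freqs = estimate_frequencies(training)
smoothed = estimate_frequencies(training, pseudocount=1.0)
print(freqs["C"], freqs["T"])  # 0.375 0.375
```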

  6. Probability Theory • Conditional probability - $P(X,Y) = P(X|Y)\,P(Y)$ (joint probability) - $P(X) = \sum_Y P(X,Y) = \sum_Y P(X|Y)\,P(Y)$ (marginal probability)

  7. Bayes' Theorem • $P(X|Y) = \frac{P(Y|X)\,P(X)}{P(Y)}$ - posterior probability • Example: P(X) = probability that a tumor is visible on the x-ray; P(C) = probability of breast cancer = 0.01; P(X|C) = 0.9; P(X|¬C) = 0.05 - On the x-ray a tumor is seen. What is the probability that the woman has breast cancer?
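
The question on this slide can be worked through numerically. The sketch below applies Bayes' theorem with the marginal P(X) expanded as P(X|C) P(C) + P(X|¬C) P(¬C); the function and parameter names are illustrative assumptions.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(C|X) from P(C), P(X|C) and P(X|not C) via Bayes' theorem."""
    # Marginal probability of the evidence: P(X) = P(X|C)P(C) + P(X|~C)P(~C)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p_cancer_given_xray = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p_cancer_given_xray, 3))  # 0.154
```

Under these numbers the posterior is only about 15%, because the low prior of 0.01 outweighs the high detection rate.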

  8. Pairwise Alignment

  9. Pairwise Alignment • Goal: determine whether two sequences are related (homologous). • Issues regarding pairwise alignment: • What sorts of alignment should be considered? • The scoring system used to rank alignments. • The algorithm used to find optimal (or good) scoring alignments. • The statistical methods used to evaluate the significance of an alignment score.

  10. Example • You need a ‘smart’ scoring model to distinguish b from c.

  11. The Scoring Model

  12. The Scoring Model • If two sequences are related, both descend from a common ancestor. • Sequences change over time through mutation: • Substitutions • Gaps (insertions or deletions) • Natural selection ensures that some mutations are seen more often than others (survival of the fittest).

  13. The Scoring Model • Total score of an alignment is the sum of: • A term for each aligned pair of residues • A term for each gap

  14. Substitution Matrices • We need a matrix with the scores for every possible pair of residues (e.g. bases or amino acids) • We can compute these scores by: $s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$ • $p_{ab}$ = probability that residues a and b have been derived independently from some unknown original residue c • $q_a$ = frequency of a
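
The log-odds score can be sketched in a few lines. The slide does not state the base of the logarithm, so base 2 (scores in bits) is an assumption here, and the probabilities in the example are hypothetical rather than taken from any real matrix.

```python
from math import log2

def log_odds_score(p_ab, q_a, q_b):
    """s(a,b) = log(p_ab / (q_a * q_b)); base 2 is an assumed choice."""
    return log2(p_ab / (q_a * q_b))

# Hypothetical numbers: the pair is seen twice as often as chance predicts.
s_related = log_odds_score(p_ab=0.02, q_a=0.1, q_b=0.1)
# Hypothetical numbers: the pair is seen half as often as chance predicts.
s_unrelated = log_odds_score(p_ab=0.005, q_a=0.1, q_b=0.1)
print(s_related > 0, s_unrelated < 0)  # True True
```

A positive score means the pair occurs more often in related sequences than expected by chance, and a negative score means it occurs less often.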

  15. BLOSUM50

  16. Gap Penalties • $\gamma(g) = -gd$ (linear score) • $\gamma(g) = -d - (g-1)e$ (affine score) • d = gap-open penalty • e = gap-extension penalty • g = gap length • $P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$
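
The two penalty functions differ only in how they charge for gap length, which a small sketch makes concrete. The function names and the numeric values of d and e below are illustrative assumptions.

```python
def linear_gap(g, d):
    """Linear score: every gap position costs d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine score: opening costs d, each further position costs e."""
    return -d - (g - 1) * e

# Illustrative values: a gap of length 3.
print(linear_gap(3, d=8))        # -24
print(affine_gap(3, d=12, e=2))  # -16
```

With e smaller than d, the affine score penalizes one long gap less than several short ones.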

  17. Alignment Algorithms

  18. Alignment Algorithms • Needleman-Wunsch (global alignment) • Smith-Waterman (local alignment) • Repeated matches • Overlap matches • Hybrid match conditions

  19. Dynamic Programming • The number of possible alignments is enormous • Algorithm for finding the optimal alignment: use dynamic programming • Save sub-results for later reuse, so the same subproblem is never computed twice

  20. Needleman-Wunsch Algorithm • Global alignment • For sequences of size n and m, make an (n+1)x(m+1) matrix • Fill in from top left to bottom right: $F(i,j) = \max\{F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d\}$ • Keep a pointer to the cell that is used to derive F(i,j) • Takes O(nm) time and memory
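
The recurrence above can be sketched in Python as follows. Several details are assumptions not stated on the slide: the boundary values F(i,0) = -id and F(0,j) = -jd, a simple match/mismatch scoring function in place of a substitution matrix, and a traceback that re-derives each step from the matrix instead of storing pointers.

```python
def needleman_wunsch(x, y, score, d):
    """Global alignment with a linear gap penalty d. Returns (score, ax, ay)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary condition: leading gaps cost d each.
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Traceback from the bottom-right cell, re-deriving which case was used.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or F[i][j] == F[i - 1][j] - d):
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

def simple_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(needleman_wunsch("ACGT", "AGT", simple_score, d=2))  # (1, 'ACGT', 'A-GT')
```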

  21. Matrix

  22. Matrix Traceback

  23. Smith-Waterman Algorithm • Local alignment • Two differences with Needleman-Wunsch: 1. $F(i,j) = \max\{0,\; F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d\}$ 2. A local alignment can end anywhere, so the traceback starts from the highest value in the matrix (not necessarily the bottom-right cell)
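
Both differences show up directly in code. The sketch below is a minimal version under assumptions: zero-valued borders, a hypothetical match/mismatch scoring function, a traceback that stops at the first cell with value 0, and the first maximal cell chosen when there are ties.

```python
def smith_waterman(x, y, score, d):
    """Local alignment with a linear gap penalty d. Returns (score, ax, ay)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Difference 1: the extra option 0 lets an alignment start afresh.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Difference 2: traceback starts at the highest cell and ends at a 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

def local_score(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

print(smith_waterman("TTACGTAA", "GGACGTCC", local_score, d=2))  # (8, 'ACGT', 'ACGT')
```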

  24. Matrix

  25. Smith-Waterman Algorithm • The expected score s(a,b) for a random match must be negative • Some s(a,b) must be greater than 0, otherwise no alignment is found

  26. Repeated Matches • Many local alignments are possible if one or both sequences are long; Smith-Waterman finds only one of them • Find parts of one sequence in the other sequence • Not every alignment is useful: only matches scoring above a threshold T are kept

  27. Repeated Matches • $F(i,0) = \max\{F(i-1,0),\; F(i-1,j) - T,\; j = 1,\ldots,m\}$ • $F(i,j) = \max\{F(i,0),\; F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d\}$
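
The two recurrences can be sketched as a matrix fill that returns only the total score. This is a simplified sketch under assumptions not given on the slide: row 0 is initialized to 0, the total is read from an extra cell F(n+1,0) computed with the first recurrence, y is non-empty, and no traceback is performed. The scoring function and the values of d and T are hypothetical.

```python
def repeated_matches_score(x, y, score, d, T):
    """Total score of all matches of y inside x that beat the threshold T."""
    n, m = len(x), len(y)
    # Rows 0..n hold the matrix; row n+1 only provides the final F(n+1, 0).
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # Column 0: either stay unmatched, or end a match from the previous
        # row and pay the threshold T.
        F[i][0] = max(F[i - 1][0],
                      max(F[i - 1][j] - T for j in range(1, m + 1)))
        if i == n + 1:
            break
        for j in range(1, m + 1):
            F[i][j] = max(F[i][0],  # start a new match here
                          F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    return F[n + 1][0]

def motif_score(a, b):
    """Hypothetical scoring: +2 for a match, -2 for a mismatch."""
    return 2 if a == b else -2

# Two copies of ACGT in x, each scoring 8 and contributing 8 - T = 3.
print(repeated_matches_score("ACGTTTACGT", "ACGT", motif_score, d=3, T=5))  # 6
```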

  28. Matrix • Threshold T = 20

  29. Overlap Matches • Find a match between the start of one sequence and the end of another (the two can be the same sequence) • The alignment begins on the left-hand or top border of the matrix and ends on the right-hand or bottom border

  30. Overlap Matches • F(0,j) = 0, for j = 1,…,m • F(i,0) = 0, for i = 1,…,n • $F(i,j) = \max\{F(i-1,j-1) + s(x_i,y_j),\; F(i-1,j) - d,\; F(i,j-1) - d\}$
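
With zero borders, overhanging ends cost nothing, and the best overlap is read from the bottom or right-hand border. The sketch below computes only the score; the scoring function and gap penalty are hypothetical, and taking the maximum over cells (n,j), j = 1,…,m and (i,m), i = 1,…,n is an assumption consistent with the previous slide's description of where the alignment ends.

```python
def overlap_score(x, y, score, d):
    """Best overlap-match score between non-empty x and y (linear gap penalty d)."""
    n, m = len(x), len(y)
    # F(0,j) = 0 and F(i,0) = 0: starting anywhere on the top or left border is free.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # The alignment must end on the bottom border or the right-hand border.
    bottom = F[n][1:]
    right = [F[i][m] for i in range(1, n + 1)]
    return max(bottom + right)

def unit_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

# The suffix CGT of x overlaps the prefix CGT of y.
print(overlap_score("AAACGT", "CGTTTT", unit_score, d=2))  # 3
```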

  31. Matrix

  32. Hybrid Match Conditions • Different types of alignment can be created by: • adjusting the right-hand side of the formula F(i,j) = max {…. • adjusting the traceback • Example: we want to align two sequences from the beginning of both sequences until a local alignment has been found.

  33. Summary • Probability theory is important for sequence analysis • Goal: determine whether two sequences are related • For that, we need to find an optimal alignment between the sequences using algorithms • A scoring model is required to rank different alignments • There are different algorithms for different types of alignment • All of them use dynamic programming
