1 / 50

Why do we want to align protein sequences?

Protein Sequence Comparison Patrice Koehl http://koehllab.genomecenter.ucdavis.edu/teaching/ecs129/06/lecture-notes. Why do we want to align protein sequences?. If two sequences align well, the corresponding proteins are homologous; they probably share the same structure and/or function

isolde
Download Presentation

Why do we want to align protein sequences?

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Protein Sequence ComparisonPatrice Koehlhttp://koehllab.genomecenter.ucdavis.edu/teaching/ecs129/06/lecture-notes

  2. Why do we want to align protein sequences? • If two sequences align well, the corresponding proteins are homologous; they probably share the same structure and/or function • Sequence Alignment is a Tool for organizing the protein sequence space detection of homologous proteins build evolutionary history

  3. Alignment Methods • Rigorous algorithms - Needleman-Wunsch (global) - Smith-Waterman (local) • Rapid heuristics - FASTA - BLAST

  4. What is sequence alignment? • Given two sequences of letters and a scoring scheme for evaluating letter matching, find the optimal pairing of letters from one sequence to the other. • Different alignments: Favors identity Favors similarity ACCTAGGC ACCTAGGC AC-T-GG ACT-GG Gaps

  5. Aligning Biological Sequences • Aligning DNA: 4 letter alphabet ACGTTGGC AC-T-GG • Aligning protein: 20 letter alphabet MCYTSWGC MC-T-WG

  6. Computing Cost The computational complexity of aligning two sequences when gaps are allowed anywhere is exponential in the length of the sequences being aligned. Computer science offers a solution for reducing the running time: Dynamic Programming

  7. Dynamic Programming (DP) Concept A problem with overlapping sub-problems and optimal sub-structures can be solved using the following algorithm: (1) break the problem into smaller sub-problems (2) solve these problems optimally using this 3-step procedure recursively (3) use these optimal solutions to construct an optimal solution to the original problem

  8. DP and Sequence Alignment Key idea: The score of the optimal alignment that ends at a given pair of positions in the sequences is the score of the best alignment previous to these positions plus the score of aligning these two positions.

  9. i ? i j ? DP and Sequence Alignment Test all alignments that can lead to i aligned with j j

  10. i ? j DP and Sequence Alignment Find alignment with best [previous score + score(i,j)] Best alignment that ends at (i,j) i j

  11. Implementing the DP algorithm for sequences Aligning 2 sequence S1 and S2 of lengths N and M: • Build a NxM alignment matrix A such that • A(i,j) is the optimal score for alignments • up to the pair (i,j) • 2) Find the best score in A • 3) Track back through the matrix to get • the optimal alignment of S1 and S2.

  12. Example 1 Sequence 1: ATGCTGC Sequence 2: AGCC Score(i,j) = 10 if i=j, 0 otherwise no gap penalty

  13. Example 1 1) Initialize

  14. Example 1 2) Propagate

  15. Example 1 2) Propagate

  16. Example 1 2) Propagate

  17. Example 1 2) Propagate

  18. Example 1 3) Trace Back ATGCTGC AXGCXXC Score: 40 Alignment:

  19. Mathematical Formulation Global alignment (Needleman-Wunsch): Wk: penalty for a gap of size k

  20. Complexity 1) The computing time required to fill in the alignment matrix is O(NM(N+M)), where N and M are the lengths of the 2 sequences 2) This can be reduced to O(NM) by storing the best score for each row and column. True if gap penalty is linear!

  21. Example 2 High Score: 30 Alignments: AATGC AG GC AATGC A GGC AATGC AGGC AATG C A GGC AATG C A GGC

  22. Example 2 with Gap Gap cost: -2 High Score: 28 Alignments: AATGC AG GC AATGC A GGC AATGC AGGC

  23. Complexity (2) 1) The traceback routine can be quite costly in computing time if all possible optimal paths are required, since there may be many branches. 2) Usually, an arbitrary choice is made about which branch to follow. Then computing time is O(max(N,M)) By simply following pointers.

  24. The Scoring Scheme • Scores are usually stored in a “weight” matrix or “substitution” matrix or “matching” matrix. • Defining the “proper” matrix is still an active area of research • Usually, start from known, reliable alignment. Compute fi, the frequency of occurrence of residue type i, and qij, the probability that residue types i and j are aligned; score is computed as:

  25. Example of a Scoring matrix

  26. Gap penalty • Most common model: WN = G0 + N * G1 WN : gap penalty for a gap of size N G0 : cost of opening a gap G1 : cost of extending the gap by one N : size of the gap

  27. Global versus Local Alignment • Global alignment finds best arrangement that maximizes total score • Local alignment identifies highest scoring subsequences, sometimes at the expense of the overall score Local alignment algorithm is just a variation of the global alignment algorithm!

  28. Modifications for local alignment • The scoring matrix has negative values for mismatches • The minimum score for any (i,j) in the alignment matrix is 0. • The best score is found anywhere in the filled alignment matrix These 3 modifications cause the algorithm to search for matching sub-sequences which are not penalized by other regions (modif. 2), with minimal poor matches (modif 1), which can occur anywhere (modif 3).

  29. Mathematical Formulation Local alignment (Smith Waterman): Wk: penalty for a gap of size k

  30. Global versus Local Alignment Match: +1; Mismatch: -2; Gap: -1 ACCTGS ACC-NS ACC ACC Global: Local:

  31. Heuristic methods • O(NM) is too slow for database search • Heuristic methods based on frequency of shared subsequences • Usually look for ungapped small sequences FASTA, BLAST

  32. FASTA • Create hash table of short words of the query sequence (from 2 to 6 characters) • Scan database and look for matches in the query hash table • Extend good matches empirically

  33. BLAST • Break query sequence and database sequences into words • Search for matches (even not perfect) that scores at least T • Extend matches, and look for alignment that scores at least S Tutorial: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

  34. Summary • Dynamic programming finds the optimal alignment between two sequences in a computing time proportional to NxM, where N and M are the sequence lengths • Critical user choices are the scoring matrix, the gap penalties, and the algorithm (local or global)

  35. Statistics of Sequence Alignment

  36. Significance • We have found that the score of the alignment between two sequences is S. Question: What is the “significance” of this score? • Otherwise stated, what is the probability P that the alignment of two random sequences has a score at least equal to S ? • P is the P-value, and is considered a measure of statistical significance. If P is small, the initial alignment is significant.

  37. Basic Statistics: Binomial Distribution A given experiment may yield the event A or the event not(A) with probabilities p, and q=1-p, respectively. If the experiment is repeated N times and X is the number of time A is observed, then the probability that X takes the value k is given by: With the binomial coefficient: N=40; p=0.2 Properties: E(X)=Np Var(X)=Np(1-p)

  38. Basic Statistics: Poisson Distribution The Poisson distribution is a discrete distribution usually defined over a volume or time interval. Given a process with expected number of success l in the given interval, the probability of observing exactly X success is given by: l=8 Properties: E(X)=l Var(X)=l

  39. Basic Statistics: Normal Distribution A normal distribution in a variable X with mean m and variance s2 is a statistical distribution with probability function: N=40; p=0.2 The normal distribution is the limiting case of a binomial distribution P(n,N,p) with:

  40. Basic Statistics: Extreme Value Distribution A extreme value distribution in a variable X is a statistical distribution with probability function: a=0; b=1 Note the presence of a long tail

  41. BLAST Input

  42. BLAST results

  43. BLAST results (2)

  44. Statistics of Protein Sequence Alignment • Statistics of global alignment: Unfortunately, not much is known! Statistics based on Monte Carlo simulations (shuffle one sequence and recompute alignment to get a distribution of scores) • Statistics of local alignment Well understood for ungapped alignment. Same theory probably apply to gapped-alignment

  45. Statistics of Protein Sequence Alignment What is a local alignment ? “Pair of equal length segments, one from each sequence, whose scores can not be improved by extension or trimming. These are called high-scoring pairs, or HSP” http://www.people.virginia.edu/~wrp/cshl98/Altschul/Altschul-1.html

  46. The E-value for a sequence alignment HSP scores follow an extreme value distribution, characterized by two parameters, K and l. The expected number of HSP with score at least S is given by: S m, n : sequence lengths E : E-value

  47. The Bit Score of a sequence alignment Raw scores have little meaning without knowledge of the scoring scheme used for the alignment, or equivalently of the parameters K and l. Scores can be normalized according to: S’ is the bit score of the alignment. The E-value can be expressed as:

  48. The P-value of a sequence alignment The number of random HSP with score greater of equal to S follows a Poisson distribution: (E: E-value) Then: Note: when E <<1, P ≈E

  49. The database E-value for a sequence alignment • Database search, where database contains NS sequences • corresponding to NR residues: • All sequences are a priori equally likely to be related to the query: • Longer sequences are more likely to be related to the query: • BLAST reports EDB2

  50. Summary • Statistics on local sequence alignment are defined by: • Raw score • Bit score (normalized score) • E-value • P-value • Statistics on database search: -database E-value

More Related