310 likes | 435 Views
An Introduction to Sequence Alignment. July 16, 2003 Sining Chen Expressionist Meeting. What?. DNA/RNA sequences: strings composed of an alphabet of 4 letters Protein sequences: alphabet of 20 letters. Why?. Identify a gene Find clues to gene function (ortholog?)
E N D
An Introduction to Sequence Alignment July 16, 2003 Sining Chen Expressionist Meeting
What? • DNA/RNA sequences: strings composed of an alphabet of 4 letters • Protein sequences: alphabet of 20 letters
Why? • Identify a gene • Find clues to gene function (ortholog?) • Find other organisms with this gene (homology) • Gather info for an evolutionary model • …
An Example • GCGCATGGATTGAGCGA • TGCGCCATTGATGACCA A possible alignment: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A
Alignments • -GCGC-ATGGATTGAGCGA • TGCGCCATTGAT-GACC-A • Three elements: • Perfect matches • Mismatches • Insertions & deletions (indel)
Significant Similarity in an Alignment • Ho: the current alignment is a result of random line-up (the 2 sequences are unrelated) • Ha: the sequences diverge from a common ancestor (related) • Test statistic: Ymax = length of the longest running perfect match subsequence
Exact Matching Subsequences • In DNA alignment, the matching probability • Under Ho lengths of exact match subseq Y should follow a geometric distribution
Well-matching Subsequences • Evolution may cause small differences to even sequences with a reasonably recent common ancestor. • We consider Ymax to be the longest subseq with up to k mismatches. • Y follow hyper-geometric distribution • P-value: exact/simulated/approximate (independence among Y does not hold any more)
Choosing Alignments There are many possible alignments For example, compare: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A to ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Which one is better?
Scoring Rule • Example Score = (# matches) – (# mismatches) – (# indels) x 2
Example Example: -GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Score: (+1x13) + (-1x2) + (-2x4) = 3 ------GCGCATGGATTGAGCGA TGCGCC----ATTGATGACCA-- Score: (+1x5) + (-1x6) + (-2x11) = -23
Edit Distance • The edit distance between two sequences is the “cost” of the “cheapest” set of edit operations needed to transform one sequence into the other • Computing edit distance between two sequences almost equivalent to finding the alignment that minimizes the distance
Computing Edit Distance • How can we compute the edit distance?? • If |s| = n and |t| = m, there are more than alignments • 2 sequences each of length 1000: > 10^600 • The additive form of the score allows to perform dynamic programming to compute edit distance efficiently
Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )
Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )
Recursive Argument • Suppose we have two sequences:s[1..n+1] and t[1..m+1] The best alignment must be in one of three cases: 1. Last position is (s[n+1],t[m +1] ) 2. Last position is (s[n +1],-) 3. Last position is (-, t[m +1] )
Recursive Argument Define the notation: • Using the recursive argument, we get the following recurrence for V:
Recursive Argument • Of course, we also need to handle the base cases in the recursion:
Dynamic ProgrammingAlgorithm We fill the matrix using the recurrence rule
Conclusion: d(AAAC,AGC) = -1 DynamicProgramming Algorithm
Reconstructing the Best Alignment • We now trace back the path the corresponds to the best alignment AAAC AG-C
Reconstructing the Best Alignment • Sometimes, more than one alignment has the best score AAAC A-GC
Complexity Space: O(mn) Time: O(mn) • Filling the matrix O(mn) • Backtrace O(m+n)
Other scoring schemes • Needleman and Wunsch: 1 for identical amino acid, 0 otherwise • Dayhoff PAM scoring matrix: variations include BLOSUM matrices(Henikoff and Henikoff 1992, Proc. Nat. Acad. Sci. 89, 10915-10919). • … • Different Gap Cost Function
Discussed above are global alignment algorithms • For local alignment algorithm: use score 0 as threshold as in: Smith, Waterman (1981), J. Mol. Biol., 147, 195-197
There exist other algorithms faster than dynamic programming under specific assumptions, e.g. BLAST, MegaBLAST • Zhang Z, Schwartz S, Wagner L, Miller W. 2000. A greedy algorithm for aligning DNA sequences. J Comp Biol. 7:203-214. Assume only sequencing error to search for very similar sequences. MegaBLAST behind NCBI HomoloGene.
Multiple Global Alignment • CLUSTAL W (Thompson, Higgins, Gibson 1994, Nucleic Acids Res, 22, 4673-4680) • Ungapped local alignments: Lawrence et.al. 1993, Science 262, 208-214 • Gapped local alignments: Zhu et.al., 1998, Bayesian adaptive sequence alignment algorithms. Bioinformatics 14, 25-39
BLAST • Search a sequence database for fragments similar to the query sequence • Compile a list of high-scoring short words shared by the query sequence and the database; • Scan the database for “hits” • Expand the “hits” to MSP (maximum segment pair = a pair of equal-length/no-gap segments with the highest alignment score)
BLAST • Altschul, et. al. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410. • Variations of BLAST designed for specific purposes • http://www.ncbi.nlm.nih.gov/BLAST/