Explore the sequence evolution, relationships, and functional connections of ORZ, OTZ, Orz, Crz, on_, or2, oΩ, STO, and Oroz. Learn about pairwise alignment, optimal alignment scoring, LCS, edit distance, global vs. local alignment, and maximizing sum intervals.
Sequence Alignment Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW: http://www.csie.ntu.edu.tw/~kmchao
orz’s sequence evolution • the origin? • their evolutionary relationships? • their putative functional relationships? • orz (kid) • OTZ (adult) • Orz (big head) • Crz (motorcycle driver) • on_ (soldier) • or2 (bottom up) • oΩ (back high) • STO (the other way around) • Oroz (me)
What? THETR UTHIS MOREI MPORT ANTTH ANTHE FACTS The truth is more important than the facts.
On July 20, 1969, Armstrong and Apollo 11 Lunar Module (LM) pilot Buzz Aldrin became the first people to land on the Moon.
Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B:
C---TTAACT (Sequence A)
CGGATCA--T (Sequence B)
Column types labeled on the slide: match, mismatch, insertion gap, deletion gap.
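Alignment Graph Sequence A: CTTAACT Sequence B: CGGATCAT [Figure: alignment graph with Sequence B (C G G A T C A T) along the top and Sequence A (CTTAACT) down the side; a path through the graph corresponds to the alignment below]
C---TTAACT
CGGATCA--T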
A simple scoring scheme • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
C - - - T T A A C T
C G G A T C A - - T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
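The column-by-column sum above can be checked with a few lines of Python. This is only a sketch: the names `w` and `alignment_score` are chosen here for illustration, and the defaults are the +8 / -5 / -3 scheme from the slide.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Score one alignment column; '-' denotes a gap symbol."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b):
    """Sum the column scores of two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```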
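An optimal alignment: the alignment of maximum score • Let A = a1a2…am and B = b1b2…bn. • Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj • With proper initializations, Si,j can be computed as follows:
$$S_{i,j} = \max \begin{cases} S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases}$$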
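Computing Si,j [Figure: the dynamic-programming matrix with rows indexed by i and columns by j; entry Si,j is reached by a diagonal step scoring w(ai, bj), a vertical step scoring w(ai, -), or a horizontal step scoring w(-, bj); the optimal score is Sm,n]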
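A minimal sketch of this table computation follows, assuming the +8 / -5 / -3 scheme and the usual initializations S(i,0) = i·gap and S(0,j) = j·gap. The function name `global_alignment` and its tie-breaking order in the traceback (diagonal first) are choices made here, not something the slides specify, so when several optimal alignments exist it may return a different one.

```python
def global_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Return (optimal score, gapped row for a, gapped row for b)."""
    m, n = len(a), len(b)
    # S[i][j]: score of an optimal alignment of a[:i] with b[:j].
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,  # align a_i with b_j
                          S[i - 1][j] + gap,      # align a_i with a gap
                          S[i][j - 1] + gap)      # align b_j with a gap

    # Traceback from S[m][n] to S[0][0], preferring diagonal steps.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


score, row_a, row_b = global_alignment("CTTAACT", "CGGATCAT")
print(score)  # 14
print(row_a)
print(row_b)
```

The table takes O(mn) time and space; the traceback adds O(m + n).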
Match: 8 Mismatch: -5 Gap symbol: -3 [Figure: dynamic-programming table for A = CTTAACT (rows) and B = CGGATCAT (columns), filled in step by step: initializations, then the entry S3,5 = 5, then the completed table with the optimal score marked]
C T T A A C - T
C G G A T C A T
+8 -5 -5 +8 -5 +8 -3 +8 = 14
[Figure: the table for A = CTTAACT (rows) and B = CGGATCAT (columns)]
Now try this example in class Sequence A: CAATTGA Sequence B: GAATCTGC Their optimal alignment?
Match: 8 Mismatch: -5 Gap symbol: -3 [Figure: dynamic-programming table for A = CAATTGA (rows) and B = GAATCTGC (columns), filled in step by step: initializations, then S4,2 = ?, then S5,5 = 14, then the completed table with the optimal score marked]
C A A T - T G A
G A A T C T G C
-5 +8 +8 +8 -3 +8 +8 -5 = 27
[Figure: the table for A = CAATTGA (rows) and B = GAATCTGC (columns)]
Longest Common Subsequence (LCS) A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent. The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.
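The definition can be made concrete with a short sketch. The helper names `is_subsequence` and `lcs` are introduced here for illustration; `lcs` uses the standard O(mn) table and returns one longest common subsequence (others of the same length may exist).

```python
def is_subsequence(t, s):
    """True if t can be obtained from s by deleting zero or more symbols."""
    it = iter(s)
    # 'ch in it' advances the iterator, so symbols must appear in order.
    return all(ch in it for ch in t)


def lcs(a, b):
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # L[i][j]: LCS length of a[:i] and b[:j].
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Walk back through the table to recover the symbols.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(all(is_subsequence(t, "president") for t in ("pred", "sdn", "predent")))  # True
print(len(lcs("CAATTGA", "GAATCTGC")))  # 5
```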
Alignment vs. LCS Sequence A: CAATTGA Sequence B: GAATCTGC Compute their optimal alignment under the following scoring scheme: Match: 1 Mismatch: 0 Gap symbol: 0
Alignment score = LCS length Match: 1 Mismatch: 0 Gap symbol: 0 [Figure: dynamic-programming table for A = CAATTGA (rows) and B = GAATCTGC (columns) with the optimal score marked]
C A A T - T G A
G A A T C T G C
0 +1 +1 +1 +0 +1 +1 +0 = 5
LCS: AATTG
Edit distance • The edit distance (Levenshtein distance) between Sequence A and Sequence B is the minimum number of operations (deletion, insertion, or substitution) required to transform Sequence A into Sequence B.
C A A T - T G A
G A A T C T G C
edit distance = 3
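As a sketch of how this minimum can be computed directly, the function below uses the standard Levenshtein recurrence with unit costs and keeps only two rows of the table. The name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions, and substitutions turning a into b."""
    # prev[j]: distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute, or free match
        prev = curr
    return prev[-1]


print(edit_distance("CAATTGA", "GAATCTGC"))  # 3
```

Keeping two rows reduces memory to O(n), at the cost of not being able to trace back the operations.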
Alignment vs. Edit distance Sequence A: CAATTGA Sequence B: GAATCTGC Alignment score: maximized Edit distance: minimized Compute their optimal alignment under the following scoring scheme: Match: 0 Mismatch: -1 Gap symbol: -1
|Optimal alignment score| = Edit distance Match: 0 Mismatch: -1 Gap symbol: -1 [Figure: dynamic-programming table for A = CAATTGA (rows) and B = GAATCTGC (columns) with the optimal score marked]
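Both reductions from the slides (scheme 1 / 0 / 0 for LCS length, scheme 0 / -1 / -1 for edit distance) can be checked with one parameterized scorer. This is a sketch under the linear gap model used throughout; `global_score` is a name introduced here, and it returns only the score, not the alignment.

```python
def global_score(a, b, match, mismatch, gap):
    """Optimal global alignment score of a and b under the given scheme."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            sub = match if x == y else mismatch
            curr.append(max(prev[j - 1] + sub,   # a_i aligned with b_j
                            prev[j] + gap,       # a_i aligned with a gap
                            curr[j - 1] + gap))  # b_j aligned with a gap
        prev = curr
    return prev[-1]


A, B = "CAATTGA", "GAATCTGC"
print(global_score(A, B, 1, 0, 0))        # 5, the LCS length
print(abs(global_score(A, B, 0, -1, -1)))  # 3, the edit distance
```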
Global Alignment vs. Local Alignment • global alignment: • local alignment:
Maximum-sum interval • Given a sequence of real numbers a1a2…an, find a consecutive subsequence with the maximum sum. 9 -3 1 7 -15 2 3 -4 2 -7 6 -2 8 4 -9 For each position, we can compute the maximum-sum interval ending at that position in O(n) time. Therefore, a naive algorithm runs in O(n^2) time.
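The naive quadratic approach can be sketched as follows: for each end position, extend the interval leftwards and keep a running sum. The function name and the 0-based index convention are assumptions made here, and the input is assumed to be non-empty.

```python
def max_sum_interval_naive(a):
    """Return (best sum, start, end) with 0-based inclusive indices; a must be non-empty."""
    best_sum, best_i, best_j = a[0], 0, 0
    for j in range(len(a)):
        running = 0
        for i in range(j, -1, -1):  # intervals a[i..j] ending at position j
            running += a[i]
            if running > best_sum:
                best_sum, best_i, best_j = running, i, j
    return best_sum, best_i, best_j


seq = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
print(max_sum_interval_naive(seq))  # (16, 10, 13)
```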
Computing a segment sum in O(1) time? • Input: a sequence of real numbers a1a2…an • Query: the sum of ai ai+1…aj
Computing a segment sum in O(1) time • prefix-sum(i) = a1+a2+…+ai • All n prefix sums are computable in O(n) time. • sum(i, j) = prefix-sum(j) – prefix-sum(i-1) [Figure: the segment from i to j shown as the difference between prefix-sum(j) and prefix-sum(i-1)]
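A small sketch of the prefix-sum idea, using positions 1..n as on the slide and the convention prefix-sum(0) = 0. The helper names are hypothetical; `itertools.accumulate` is from the Python standard library.

```python
from itertools import accumulate


def build_prefix_sums(a):
    """prefix[i] = a_1 + ... + a_i, with prefix[0] = 0; O(n) preprocessing."""
    return [0] + list(accumulate(a))


def segment_sum(prefix, i, j):
    """Sum of a_i .. a_j (1-based, inclusive) in O(1) time."""
    return prefix[j] - prefix[i - 1]


seq = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
prefix = build_prefix_sums(seq)
print(segment_sum(prefix, 11, 14))  # 16
```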
Maximizing sum(i, j) O(n)-time Method 1 • sum(i, j) = prefix-sum(j) – prefix-sum(i-1) • For each location j, prefix-sum(j) is fixed. The maximum-sum interval ending at position j can therefore be found by taking the minimum prefix sum before position j. [Figure: the segment from i to j shown as the difference between prefix-sum(j) and prefix-sum(i-1)]
Maximum-sum interval
j               0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Sequence            9  -3   1   7 -15   2   3  -4   2  -7   6  -2   8   4  -9
prefix-min(j)   0   0   0   0   0   0  -1  -1  -1  -1  -1  -5  -5  -5  -5  -5
prefix-sum(j)   0   9   6   7  14  -1   1   4   0   2  -5   1  -1   7  11   2
max_sum(j)      0   9   6   7  14  -1   2   5   1   3  -4   6   4  12  16   7
prefix-sum(j) = a1+a2+…+aj
prefix-min(j): the minimum prefix-sum before position j
max_sum(j) = prefix-sum(j) - prefix-min(j)
The maximum sum: 16 (at j = 14)
The maximum-sum interval: 6 -2 8 4
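Method 1 can be sketched in a single pass that tracks the running prefix sum and the smallest prefix sum seen before the current position. The function name and the returned 1-based (i, j) pair are conventions chosen here to match the slide's indexing.

```python
def max_sum_interval_prefix_min(a):
    """Return (best sum, i, j), 1-based inclusive, or None for an empty input."""
    prefix = 0                   # prefix-sum(j)
    min_prefix, min_pos = 0, 0   # min prefix-sum(k) over k < j, and that k
    best = None
    for j, x in enumerate(a, start=1):
        prefix += x
        candidate = prefix - min_prefix  # max_sum(j)
        if best is None or candidate > best[0]:
            best = (candidate, min_pos + 1, j)
        if prefix < min_prefix:          # becomes available to positions after j
            min_prefix, min_pos = prefix, j
    return best


seq = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
print(max_sum_interval_prefix_min(seq))  # (16, 11, 14)
```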
Maximum-sum interval (The recurrence relation) O(n)-time Method 2 • Define S(i) to be the maximum sum of the intervals ending at position i.
$$S(i) = a_i + \max\{S(i-1),\ 0\}$$
If S(i-1) < 0, concatenating ai with its previous interval gives a smaller sum than ai itself.
Maximum-sum interval (Tabular computation)
i       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
ai      9  -3   1   7 -15   2   3  -4   2  -7   6  -2   8   4  -9
S(i)    9   6   7  14  -1   2   5   1   3  -4   6   4  12  16   7
The maximum sum: 16 (at i = 14)
Maximum-sum interval (Traceback)
i       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
ai      9  -3   1   7 -15   2   3  -4   2  -7   6  -2   8   4  -9
S(i)    9   6   7  14  -1   2   5   1   3  -4   6   4  12  16   7
The maximum-sum interval: 6 -2 8 4
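Method 2 can be sketched as below: S(i) extends the previous interval when S(i-1) is non-negative and restarts at ai otherwise. Instead of a separate traceback pass, this version records each interval's start while filling the table, which is a design choice made here. The function name is hypothetical and the input is assumed to be non-empty.

```python
def max_sum_interval_dp(a):
    """Return (S table, best sum, i, j) with 1-based inclusive i..j; a must be non-empty."""
    S = []       # S[i-1] holds S(i), the best sum of an interval ending at i
    starts = []  # starts[i-1] holds where that interval begins
    for i, x in enumerate(a, start=1):
        if S and S[-1] >= 0:
            S.append(S[-1] + x)        # extend the previous interval
            starts.append(starts[-1])
        else:
            S.append(x)                # previous sum is negative: restart at i
            starts.append(i)
    end = max(range(len(S)), key=S.__getitem__)  # first position of the maximum
    return S, S[end], starts[end], end + 1


seq = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
table, best, i, j = max_sum_interval_dp(seq)
print(best, i, j)  # 16 11 14
```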
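An optimal local alignment • Si,j: the score of an optimal local alignment ending at (i, j) between a1a2…ai and b1b2…bj. • With proper initializations, Si,j can be computed as follows:
$$S_{i,j} = \max \begin{cases} 0 \\ S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases}$$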
Match: 8 Mismatch: -5 Gap symbol: -3 local alignment [Figure: local-alignment table for A = CTTAACT (rows) and B = CGGATCAT (columns), filled in step by step, with the best score marked]
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18 (the best score)
[Figure: the local-alignment table for A = CTTAACT (rows) and B = CGGATCAT (columns)]
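The local-alignment computation can be sketched in the same style as the global one: every entry is floored at 0, the best score is the largest entry anywhere in the table, and the traceback stops at the first 0. The function name and the tie-breaking order (diagonal, then vertical, then horizontal) are assumptions made here.

```python
def local_alignment(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best local score, gapped row for a, gapped row for b)."""
    m, n = len(a), len(b)
    # S[i][j]: best score of a local alignment ending at (i, j); row/column 0 stay 0.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j

    # Trace back from the best entry until an entry of 0 is reached.
    row_a, row_b = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(local_alignment("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```

The floor at 0 plays the same role as restarting the interval in the maximum-sum problem: a prefix with negative score is dropped rather than extended.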