Sequence Alignment Bioinformatics
Sequence Comparison • Problem: Given two sequences S & T, are S and T similar? • Need to establish some notion of similarity • Edit distance (transforming S to T) • Scoring mechanism • Related Problem: Given a target sequence, obtain sequences in a database that are similar to the target
Edit Distance • Sequences S and T are strings over an alphabet (e.g.,{a,c,t,g}) • Edit operations (indels) • Insertion of a character • Deletion of a character • Example: need 3 indels to transform attc to tttac
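As a small illustration of the edit operations above, the sketch below applies a list of insertions and deletions to a string. The helper name `apply_indels` and the tuple encoding of operations are assumptions made for this example, not part of the original slides.

```python
def apply_indels(s, ops):
    """Apply insert/delete operations to s, one after another.

    Each operation is a tuple (hypothetical encoding for this sketch):
      ("ins", position, character)  inserts character before position
      ("del", position)             deletes the character at position
    Positions refer to the string as it looks after the earlier operations.
    """
    chars = list(s)
    for op in ops:
        if op[0] == "ins":
            _, pos, ch = op
            chars.insert(pos, ch)
        elif op[0] == "del":
            _, pos = op
            del chars[pos]
        else:
            raise ValueError(f"unknown operation: {op[0]}")
    return "".join(chars)


# Three indels that turn attc into tttac:
# attc -> tattc -> tttc -> tttac
ops = [("ins", 0, "t"), ("del", 1), ("ins", 3, "a")]
print(apply_indels("attc", ops))  # tttac
```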
Alignment • We can model edit distance by aligning the two strings:
    S’ = -att-c
    T’ = t-ttac
• An alignment of strings S and T is described by two strings S’ and T’ of the same length such that • S’ (T’) contains the characters of S (T) in order, interspersed with spaces (-) • No position contains a space in both S’ and T’
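The definition of an alignment can be checked mechanically. The following is a minimal sketch; the function name `is_alignment` and the use of `-` as the space character are choices made for this example.

```python
GAP = "-"


def is_alignment(s, t, s_aligned, t_aligned):
    """Return True if (s_aligned, t_aligned) is an alignment of s and t."""
    # Both aligned strings must have the same length.
    if len(s_aligned) != len(t_aligned):
        return False
    # Removing the spaces must give back the original strings, in order.
    if s_aligned.replace(GAP, "") != s or t_aligned.replace(GAP, "") != t:
        return False
    # No position may hold a space in both strings.
    return all(not (a == GAP and b == GAP)
               for a, b in zip(s_aligned, t_aligned))


print(is_alignment("attc", "tttac", "-att-c", "t-ttac"))  # True
print(is_alignment("attc", "tttac", "-att-c", "-tttac"))  # False
```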
Gaps, Matches, and Mismatches • When comparing characters that occur in the same positions in S’ and T’, four possibilities arise • - in S’ -> insertion (gap) • - in T’ -> deletion (gap) • Characters match -> match • Characters don’t match -> mismatch • Can assign weights to each possibility (usually a positive number for matches, a negative number for gaps and mismatches)
Scoring and Optimal Alignments • Given strings S and T, and an alignment (S’,T’), a score can be computed based on pre-established weights for gaps, matches, and mismatches • Add all the weights for each position in S’ and T’ • Note that there are many possible alignments for S and T • An optimal alignment for S and T is the alignment that yields the maximum score
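Scoring a given alignment is a sum over positions. The sketch below assumes illustrative weights of 5 for a match and -2 for a mismatch or a gap; the function name and default values are choices for this example.

```python
def score_alignment(s_aligned, t_aligned, match=5, mismatch=-2, gap=-2):
    """Add up the weight of every position in an alignment (S', T')."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" or b == "-":
            total += gap          # insertion or deletion
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Three matches and three gaps: 3 * 5 + 3 * (-2) = 9
print(score_alignment("-att-c", "t-ttac"))  # 9
```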
Problem Formulations for Sequence Comparison • Original Formulation: Given two sequences S & T, are S and T similar? • Revised Formulation: Given two sequences S & T, and weights for matches, gaps, and mismatches, determine the score of an optimal alignment of S & T
Brute-force Algorithm
Compare(S, T)
    generate all possible alignments for S and T
    for each alignment
        determine score
    return maximum score
Note: This is an exponential algorithm due to the number of possible alignments for S and T
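A possible Python rendering of the brute-force idea is sketched below. It enumerates every alignment recursively and keeps the best score, so it is only usable on very short strings. The weights (5 for a match, -2 for a mismatch or gap) and all function names are assumptions for this example.

```python
def all_alignments(s, t):
    """Yield every alignment (S', T') of s and t as a pair of strings."""
    if not s and not t:
        yield "", ""
        return
    if s and t:  # first characters share a position (match or mismatch)
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:        # first character of s faces a space
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, "-" + b
    if t:        # first character of t faces a space
        for a, b in all_alignments(s, t[1:]):
            yield "-" + a, t[0] + b


def score_alignment(s_aligned, t_aligned, match=5, mismatch=-2, gap=-2):
    """Sum the weights of all positions of an alignment."""
    total = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total


def compare(s, t):
    """Return the maximum score over all alignments (exponential time)."""
    return max(score_alignment(a, b) for a, b in all_alignments(s, t))


print(len(list(all_alignments("a", "t"))))  # 3
print(compare("attc", "tttac"))             # 11
```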
Edit Graphs are Alignments • Path from upper left corner to lower right corner represents an alignment • Vertical arrow: gap (deletion) • Horizontal arrow: gap (insertion) • Diagonal: match or mismatch • Alignment:
    AT-C-TGAT
    -TGCAT-A-
• Score (assume 5 for a match, -2 for a mismatch or gap): -2 + 5 - 2 + 5 - 2 + 5 - 2 + 5 - 2 = 10
Entries in an Edit Graph • Strategy: Fill up the intersections (green circles) with (running) scores based on the path traversed so far • Each circle can be computed from at most three other values. For an entry x whose diagonal neighbor is a and whose other two neighbors (above and to the left) are b and c, x is one of:
    a + match/mismatch weight
    b + gap weight
    c + gap weight
Dynamic Programming Algorithm • Start with upper left corner (score 0) • Fill up top row and leftmost column • Fill up succeeding rows using the formula x = max(a + match/mismatch weight, b + gap weight, c + gap weight) • Resulting value on the lower right corner is the optimal score
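The table-filling steps above can be sketched as follows. This is a minimal version under assumed weights (5 for a match, -2 for a mismatch or gap); the function name `optimal_score` is a choice for this example.

```python
def optimal_score(s, t, match=5, mismatch=-2, gap=-2):
    """Score of an optimal alignment of s and t via dynamic programming."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]  # upper left corner is 0

    # Leftmost column and top row: only gaps are possible.
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + gap
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + gap

    # Every other entry is the maximum over its three neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,  # diagonal
                              table[i - 1][j] + gap,       # from above
                              table[i][j - 1] + gap)       # from the left

    return table[-1][-1]  # lower right corner


print(optimal_score("attc", "tttac"))  # 11
```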
Algorithm Analysis • Let N be the length of S and of T • Need to compute (N+1)(N+1) entries • O(N^2) algorithm
Determining the Actual Alignment • Need to remember which neighbor contributed to the computation of each entry (which candidate value was the maximum) • Perform a back-trace from lower right corner back to the upper left corner • Multiple optimal alignments possible because of ties
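One way to recover an alignment is to store, next to each score, the move that produced it, and then walk back from the lower right corner. The sketch below assumes weights of 5 for a match and -2 for a mismatch or gap, and breaks ties in favor of the diagonal move; both are choices for this example, and a different tie rule could return a different optimal alignment.

```python
def align(s, t, match=5, mismatch=-2, gap=-2):
    """Return (score, S', T') for one optimal alignment of s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]  # remembered contributor

    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
        move[i][0] = "up"
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
        move[0][j] = "left"

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [(score[i - 1][j - 1] + pair, "diag"),
                          (score[i - 1][j] + gap, "up"),
                          (score[i][j - 1] + gap, "left")]
            # max keeps the first maximal candidate, so ties prefer "diag".
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Back-trace from the lower right corner to the upper left corner.
    i, j = len(s), len(t)
    s_rev, t_rev = [], []
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            s_rev.append(s[i - 1])
            t_rev.append(t[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            s_rev.append(s[i - 1])
            t_rev.append("-")
            i -= 1
        else:
            s_rev.append("-")
            t_rev.append(t[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(s_rev)), "".join(reversed(t_rev))


print(align("attc", "tttac"))  # (11, 'att-c', 'tttac')
```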
Other Complexity Issues • When searching a database, time complexity depends on the size D of the database, since the algorithm is run on each sequence in the database: O(DN^2) • Space requirement: an (N+1)(N+1) table • Can improve to 4N if we fill up the table by “inverted Ls”: topmost row and leftmost column first, then the next inner row and column, one stage at a time
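The sketch below illustrates both points under stated assumptions. It does not implement the inverted-L order described above; it uses a simpler, commonly used row-by-row variant that keeps only two rows of the table, which is enough when only the score is needed. The database is modeled as a plain list of strings, and the weights (5 for a match, -2 for a mismatch or gap) and function names are assumptions for this example.

```python
def optimal_score_two_rows(s, t, match=5, mismatch=-2, gap=-2):
    """Optimal alignment score using two rows instead of the full table."""
    prev = [j * gap for j in range(len(t) + 1)]  # top row
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)          # leftmost column entry
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair,    # diagonal
                          prev[j] + gap,         # from above
                          curr[j - 1] + gap)     # from the left
        prev = curr
    return prev[-1]


def rank_database(target, database):
    """Score the target against every sequence: D runs of the algorithm."""
    scored = [(optimal_score_two_rows(target, seq), seq) for seq in database]
    return sorted(scored, reverse=True)  # best score first


print(optimal_score_two_rows("attc", "tttac"))  # 11
print(rank_database("acgt", ["acgt", "tttt"])[0])  # (20, 'acgt')
```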
Variations • Scoring mechanism is driven by the weights for gaps, matches and mismatches • Can have different weights for starting a gap versus extending a gap (e.g., blastp and blastn) • Can have a table that allows different match/mismatch scores (e.g., BLOSUM)
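To make the two variations concrete, the sketch below scores a given alignment with separate weights for starting and extending a gap, and looks up match/mismatch scores in a table. The table here is a made-up toy for DNA, not BLOSUM, and the gap weights and the convention that the first gap position costs `gap_open` while each further position costs `gap_extend` are assumptions for this example; real tools differ in their conventions.

```python
PURINES = {"A", "G"}


def toy_substitution_table():
    """Hypothetical DNA table: 5 for identity, -1 within a class, -4 across."""
    table = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                table[(a, b)] = 5
            elif (a in PURINES) == (b in PURINES):
                table[(a, b)] = -1
            else:
                table[(a, b)] = -4
    return table


def score_alignment_affine(s_aligned, t_aligned, table,
                           gap_open=-5, gap_extend=-1):
    """Score an alignment with a substitution table and affine gap weights."""
    total = 0
    gap_side = None  # which string held the gap in the previous position
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" or b == "-":
            side = "s" if a == "-" else "t"
            # Continuing a gap on the same side is cheaper than starting one.
            total += gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            total += table[(a, b)]
            gap_side = None
    return total


table = toy_substitution_table()
# 5 + 5 + (-5) + (-1) + 5 + (-4) = 5
print(score_alignment_affine("AC--GT", "ACTTGA", table))  # 5
```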