Learn about variations in alignment algorithms, dynamic programming equations, heuristic methods, and the significance of alignment scores in bioinformatics.
Pairwise Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 4, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign
Outline • Variations of the basic global/local alignment algorithms • Basic information theory concepts • Significance of alignment scores (to be continued in the next class)
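Dynamic Programming Equations
• Global alignment: F(i,j) = max{ F(i-1,j-1) + s(x_i,y_j), F(i-1,j) - d, F(i,j-1) - d }; the alignment path runs from F(0,0) to F(n,m)
• Local alignment: F(i,j) = max{ 0, F(i-1,j-1) + s(x_i,y_j), F(i-1,j) - d, F(i,j-1) - d }; the alignment path runs from a cell with score 0 to the best-scoring cell F(i,j)
• We can vary both the model and the alignment strategy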
In general, we can vary
• Initial values
• Recursive functions
• Start and end of paths
• The model: substitution scores s and gap penalty d
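As a rough illustration of how these choices interact, the sketch below computes a global or a local alignment score from the same recursion, changing only the initial values, the extra 0 option, and where the path ends. The function name and the default scores (match, mismatch, d) are assumptions made for this example, not values from the lecture.

```python
def align_score(x, y, match=1, mismatch=-1, d=2, local=False):
    """Best alignment score of x and y with a linear gap penalty d.

    local=False: global alignment, path from F(0,0) to F(n,m).
    local=True:  local alignment, path from a 0 cell to the best cell.
    The scoring values are illustrative assumptions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Initial values: leading gaps are penalised in a global alignment.
        for i in range(1, n + 1):
            F[i][0] = -i * d
        for j in range(1, m + 1):
            F[0][j] = -j * d
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if local:
                # Recursive function: the extra 0 lets a path restart anywhere.
                F[i][j] = max(0, F[i][j])
                best = max(best, F[i][j])
    # End of path: the corner cell (global) or the best cell anywhere (local).
    return best if local else F[n][m]
```

For instance, `align_score("ACGT", "AGT")` aligns the full sequences with one gap, while `align_score("TTACGTTT", "GGACGGG", local=True)` scores only the shared stretch ACG.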
Variation I: Repeated Matches
• Find non-overlapping copies of sections of Y in X
• Example (a "." marks an unmatched region; the other columns are matched regions):
  X = HEAGAWGHEE
  Y = HEA.AW-HE.
• Alignment: (0,0)-(n+1,0)
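One common way to realise repeated matching, sketched here under assumptions, is to use column 0 as an "unmatched" track: a match region may end and return to column 0 only if its score exceeds a threshold T, and the final score is read from the extra cell (n+1,0). The threshold T, the function name, and the scoring defaults are assumptions for this example; the slide itself gives only the start and end of the path.

```python
def repeated_match_score(x, y, T, match=1, mismatch=-1, d=2):
    """Total score of non-overlapping matches of sections of y inside x.

    Each match region contributes (its score - T). T and the scoring
    values are illustrative assumptions.
    """
    n, m = len(x), len(y)
    # One extra row so that the path can end at cell (n+1, 0).
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # Column 0: stay unmatched, or close a match from the previous row
        # at a cost of T.
        F[i][0] = max([F[i - 1][0]] + [F[i - 1][j] - T for j in range(1, m + 1)])
        if i == n + 1:
            break
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i][0],               # start a new match region here
                F[i - 1][j - 1] + s,   # extend with an aligned pair
                F[i - 1][j] - d,       # gap in y
                F[i][j - 1] - d,       # gap in x
            )
    return F[n + 1][0]
```

With `x = "ACGTTACG"`, `y = "ACG"` and `T = 2`, both copies of ACG are found and each contributes 3 - 2 = 1.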
Variation II: Overlap Matches
• X is contained in Y; “overhanging ends” are not penalized
• Ignore the overhanging prefix, align the matched region, ignore the overhanging suffix
• Alignment: (0,0)-maximum{(i,m), (n,j)}
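A minimal sketch of the overlap variant, assuming the usual formulation: the top row and left column are initialised to 0 so an overhanging prefix is free, the global recursion is used unchanged, and the score is the maximum over the last row and last column so an overhanging suffix is free as well. The function name and scoring defaults are assumptions for this example.

```python
def overlap_score(x, y, match=1, mismatch=-1, d=2):
    """Best overlap-match score: overhanging ends are not penalised."""
    n, m = len(x), len(y)
    # F(i,0) = F(0,j) = 0: the path may start anywhere on the top or left edge.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    # The path may end anywhere on the bottom or right edge.
    last_col = max(F[i][m] for i in range(n + 1))
    last_row = max(F[n][j] for j in range(m + 1))
    return max(last_col, last_row)
```

The same function covers containment (`overlap_score("ACGT", "TTACGTTT")`) and a suffix-prefix overlap (`overlap_score("CCCACGT", "ACGTGGG")`).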
Variation III: A general gap model
• Gap-open penalty and gap-extension penalty
• Alignment: F(0,0)-F(n,m)
• This can be more easily described as a Finite State Automaton (FSA) with 3 states: Match, Insertion in X, Insertion in Y
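The three-state view can be sketched as three score tables, one per state, with a gap of length g costing d + (g-1)e. This is a minimal sketch of the standard affine-gap recursion for global alignment; the names M, Ix, Iy and the default values of d (gap-open) and e (gap-extension) are assumptions for this example, and direct transitions between the two insertion states are not modelled.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Global alignment score with gap-open penalty d and extension penalty e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned to y_j
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Match state: reachable from any of the three states.
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Insertion states: open a gap from M, or extend an existing gap.
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])
```

For example, `affine_global_score("AAACCC", "AAA")` pays d + 2e for a single gap of length 3 rather than three separate gap-open penalties.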
Heuristic alignment algorithms
• Motivation: the complexity of the alignment algorithms is O(nm)
  • Current protein DB: on the order of 100 million residues
  • Imagine matching each sequence against a query of 1,000 residues
  • That is about 10^11 matrix cells; assuming roughly 10 million cells per second, it takes about 3 hours!
• Heuristic algorithms aim to speed up the search at the price of possibly missing the best-scoring alignment
• Two well-known programs
  • BLAST: Basic Local Alignment Search Tool
  • FASTA
• Both find high-scoring local alignments between a query sequence and a target database
• Basic idea: first locate high-scoring short stretches and then extend them
BLAST (Basic Local Alignment Search Tool)
• Three steps
  • Compile a list of high-scoring “words” of fixed length
  • Scan the database to find occurrences of these words
  • Extend each word occurrence
• Basic BLAST finds only ungapped alignments; newer versions (gapped BLAST, PSI-BLAST) can find gapped alignments
• Visit BLAST (need some help!)
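The three steps can be sketched in a toy form. This is not the BLAST implementation: it assumes exact word matches only (BLAST also uses similar words scoring above a threshold under a substitution matrix), a simple match/mismatch score, and a hypothetical `x_drop` cutoff that stops an extension once the score falls that far below the best seen. All function names are made up for the example.

```python
from collections import defaultdict

def word_index(query, w):
    """Step 1 (simplified): map each length-w word of the query to its positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index

def extend_hit(query, target, qi, ti, w, match=1, mismatch=-1, x_drop=2):
    """Step 3: extend a word hit in both directions without gaps."""
    def walk(q, t, step):
        score = best = 0
        length = best_len = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            score += match if query[q] == target[t] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break  # score dropped too far below the best so far
            q += step
            t += step
        return best, best_len

    right, r_len = walk(qi + w, ti + w, +1)
    left, l_len = walk(qi - 1, ti - 1, -1)
    # (score, query start, target start, length of the ungapped segment)
    return (w * match + left + right, qi - l_len, ti - l_len, w + l_len + r_len)

def blast_like(query, target, w=3):
    """Step 2: scan the target for indexed words, then extend each hit."""
    index = word_index(query, w)
    segments = set()
    for ti in range(len(target) - w + 1):
        for qi in index.get(target[ti:ti + w], []):
            segments.add(extend_hit(query, target, qi, ti, w))
    return sorted(segments, reverse=True)
```

Calling `blast_like("GATTACA", "CCGATTACATT")` returns the extended segments sorted by score, with hits on the same diagonal collapsing into one segment.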
FASTA (FAST-All)
• Quite similar to BLAST
• Multi-step procedure
  • Locate all identically matching words of a fixed length (1-2 for proteins, 4-6 for DNA)
  • Look for diagonals with many mutually supporting word matches
  • Select the best diagonals as “seeds” for extension
  • Extend a seed word to find maximal-scoring ungapped regions (possibly joining several seeds)
  • Check whether adjacent ungapped matches can be joined by a gapped region, allowing for gap costs
  • Finally, run the full dynamic programming algorithm on the regions of the best matching alignments
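The first two steps, locating identical words and finding diagonals with many supporting matches, can be sketched as follows. A word match at query position i and target position j lies on diagonal i - j, so counting matches per diagonal highlights candidate seeds. The function name is hypothetical, and the later rescoring, joining, and dynamic programming steps are not shown.

```python
from collections import Counter, defaultdict

def diagonal_counts(query, target, k):
    """Count identical k-word matches on each diagonal (i - j)."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    counts = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], ()):
            counts[i - j] += 1
    return counts
```

For example, `diagonal_counts("GATTACA", "CCGATTACATT", 3).most_common(1)` picks out the diagonal on which the query is embedded in the target.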
Significance of Scores
• How do we assess the significance of an alignment score?
• Two basic approaches
  • The classical approach: extreme value distribution
    • Assume a null (random) model M0 for the scores
    • P(Score > s | M0, x, y) = ?
  • The Bayesian approach: model comparison
    • Assume two models for (x,y): random, M0; aligned, M1
    • P(M1|x,y)/P(M0|x,y) = ?
    • log [P(M1|x,y)/P(M0|x,y)] = log [P(x,y|M1)/P(x,y|M0)] + log [P(M1)/P(M0)], i.e., the log-odds score of the alignment plus the prior log-odds
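For the Bayesian approach, the posterior probability of the aligned model follows from Bayes' rule once a log-odds score and a prior are given. The sketch below assumes the score is a natural-log likelihood ratio log[P(x,y|M1)/P(x,y|M0)] and that the prior P(M1) is supplied by the user; the function name is hypothetical.

```python
import math

def posterior_aligned(log_odds, prior_m1):
    """P(M1 | x, y) from a natural-log odds score and the prior P(M1)."""
    prior_log_odds = math.log(prior_m1 / (1.0 - prior_m1))
    # Posterior log-odds = alignment log-odds + prior log-odds; the logistic
    # function converts log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(-(log_odds + prior_log_odds)))
```

With equal priors the posterior depends on the score alone; with a small prior, as when only a few database sequences are expected to be related to the query, the score has to be correspondingly larger before M1 becomes the more probable model.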