Pairwise Sequence Alignment Methods & Heuristic Algorithms Overview

Pairwise Sequence Alignment (cont.)
Lecture for CS397-CXZ Algorithms in Bioinformatics, Feb. 4, 2004
ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign

Outline
• Variations of the basic global/local alignment algorithms
• Basic information theory concepts
• Significance of alignment scores (to be continued in the next class)

Dynamic Programming Equations
• Global alignment: path from F(0,0) to F(n,m)
• Local alignment: path from a cell with value 0 to the best-scoring cell F(i,j)
• We can vary both the model and the alignment strategies
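As a minimal sketch of the two strategies, assuming a simple match/mismatch score for s and a linear gap penalty d (the function name align_score and all parameter values are illustrative, not taken from the lecture):

```python
def align_score(x, y, match=1, mismatch=-1, d=2, local=False):
    """Best global (or local) alignment score under a linear gap penalty d."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: leading gaps are charged in the initial values.
        for i in range(1, n + 1):
            F[i][0] = -i * d
        for j in range(1, m + 1):
            F[0][j] = -j * d
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cell = max(F[i - 1][j - 1] + s,  # align x_i with y_j
                       F[i - 1][j] - d,      # x_i against a gap
                       F[i][j - 1] - d)      # y_j against a gap
            if local:
                # Local alignment: a path may restart from 0 at any cell.
                cell = max(cell, 0)
                best = max(best, cell)
            F[i][j] = cell
    # Global: read the score at F(n,m). Local: take the best cell anywhere.
    return best if local else F[n][m]
```

The only differences between the two modes are the initial values, the extra 0 option in the recursion, and where the path is allowed to end.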

In general, we can vary
• Initial values
• Recursive functions
• Start and end of paths
• The model: substitution scores s and gap penalty d

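Variation I: Repeated Matches
• Find non-overlapping copies of sections of Y in X
• Example (in the Y row, "." marks an unmatched position of X and "–" marks a gap):
  X = HEAGAWGHEE
  Y = HEA.AW–HE.
• The path alternates between unmatched regions and matched regions
• Alignment: (0,0)-(n+1,0)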
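One common way to realize repeated matches uses a score threshold T that every matched region must exceed; column 0 of the matrix then tracks the total score of the regions closed so far. The sketch below follows that textbook formulation. The threshold T, the scoring values, the choice of 0 for row 0, and the function name are all assumptions, not taken from the slides.

```python
def repeated_match_score(x, y, T, match=1, mismatch=-1, d=2):
    """Total score (each region's score minus T) of repeated matches of y in x."""
    n, m = len(x), len(y)
    # Rows 0..n+1; the extra row n+1 only uses column 0 to close the last region.
    # Assumption: row 0 is all zeros so a match may start at x_1 anywhere in y.
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # Column 0: stay unmatched, or close a region that ended in row i-1
        # (paying the threshold T).
        F[i][0] = max([F[i - 1][0]] + [F[i - 1][j] - T for j in range(1, m + 1)])
        if i == n + 1:
            break
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i][0],               # start a new matched region
                          F[i - 1][j - 1] + s,   # extend with x_i aligned to y_j
                          F[i - 1][j] - d,       # x_i against a gap
                          F[i][j - 1] - d)       # y_j against a gap
    return F[n + 1][0]
```

For example, with T=2, the sequence AAATTTAAA contains two copies of AAA, each contributing 3 - 2 = 1, so the total read from (n+1,0) is 2.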

Variation II: Overlap Matches
• X is contained in Y or overlaps it; “overhanging ends” are not penalized
• Ignore the overhanging prefix, align the matched region, ignore the overhanging suffix
• Alignment: (0,0)-maximum{(i,m), (n,j)}
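A minimal sketch of overlap matching, assuming a match/mismatch score and linear gap penalty d (values and the function name are illustrative). Setting the first row and column to 0 makes the overhanging prefix free, and taking the maximum over the last row and last column makes the overhanging suffix free:

```python
def overlap_score(x, y, match=1, mismatch=-1, d=2):
    """Best overlap-match score; overhanging ends are not penalized."""
    n, m = len(x), len(y)
    # First row and column stay 0: skipping a prefix of either sequence is free.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    # The path may end anywhere on the last column (i, m) or last row (n, j).
    last_col = max(F[i][m] for i in range(n + 1))
    last_row = max(F[n][j] for j in range(m + 1))
    return max(last_col, last_row)
```

The recursion is the same as for global alignment; only the initial values and the end of the path change.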

Variation III: A general gap model
• Gap-open penalty
• Gap-extension penalty
• Alignment: F(0,0)-F(n,m)
• This can be more easily described as a Finite State Automaton (FSA) with 3 states: Match, Insertion in X, Insertion in Y
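The three-state view can be sketched with one matrix per state. This is a hedged illustration of a standard affine-gap recursion, assuming a gap of length g costs gap_open + (g-1)*gap_extend and that the two insertion states do not follow each other directly; the names M, Ix, Iy and all parameter values are illustrative.

```python
NEG = float("-inf")

def affine_global_score(x, y, match=1, mismatch=-1, gap_open=3, gap_extend=1):
    """Global alignment score with separate gap-open and gap-extension penalties."""
    n, m = len(x), len(y)
    # One matrix per state: M (match), Ix (x_i against a gap), Iy (y_j against a gap).
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Iy[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Enter the match state from any of the three states.
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Open a gap from M, or extend a gap already open in the same state.
            Ix[i][j] = max(M[i - 1][j] - gap_open, Ix[i - 1][j] - gap_extend)
            Iy[i][j] = max(M[i][j - 1] - gap_open, Iy[i][j - 1] - gap_extend)
    return max(M[n][m], Ix[n][m], Iy[n][m])
```

With these values, aligning AAAGGGAAA to AAAAAA gives six matches and one gap of length 3, for a score of 6 - (3 + 2) = 1.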

Heuristic alignment algorithms
• Motivation: the complexity of alignment algorithms is O(nm)
  • Current protein DB: 100 million residues
  • Imagine matching each sequence with a 1,000-residue query
  • Takes about 3 hours!
• Heuristic algorithms aim at speeding up the search at the price of possibly missing the best-scoring alignment
• Two well-known programs
  • BLAST: Basic Local Alignment Search Tool
  • FASTA
  • Both find high-scoring local alignments between a query sequence and a target database
  • Basic idea: first locate high-scoring short stretches and then extend them
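The 3-hour figure can be checked with a back-of-the-envelope calculation. The processing rate of ten million matrix cells per second is an assumption introduced here to make the arithmetic concrete; the slide does not state it.

```python
db_length = 100_000_000         # total database length, from the slide
query_length = 1_000            # query length, from the slide
cells_per_second = 10_000_000   # ASSUMED throughput, not stated on the slide

# O(nm): one dynamic programming cell per (database position, query position).
cells = db_length * query_length
seconds = cells / cells_per_second
hours = seconds / 3600

print(f"{cells:.0e} cells, {seconds:.0f} s, about {hours:.1f} hours")
```

Under that assumption the search evaluates 10^11 cells in 10^4 seconds, a little under 3 hours per query.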

BLAST (Basic Local Alignment Search Tool)
• Three steps
  • Compile a list of high-scoring “words” of fixed length
  • Scan the database to find occurrences of these words
  • Extend each word occurrence
• Basic BLAST only finds ungapped alignments; newer versions can find gapped alignments (e.g., PSI-BLAST)
• Visit BLAST (need some help!)
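The three steps can be sketched in a heavily simplified form. This is not the BLAST implementation: it seeds only on exact query words (no scored neighborhood words), uses a match/mismatch score, and stops an ungapped extension once the score falls more than x_drop below the best seen so far. The function names and parameter values are hypothetical.

```python
from collections import defaultdict

def _extend(q, t, i, j, step, match, mismatch, x_drop):
    """Ungapped extension from (i, j) in direction step; returns (gain, length)."""
    score = best = best_len = k = 0
    while 0 <= i < len(q) and 0 <= j < len(t):
        score += match if q[i] == t[j] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break  # score dropped too far below the best; give up
        i += step
        j += step
    return best, best_len

def seed_and_extend(query, target, w=3, match=1, mismatch=-1, x_drop=2):
    """Best ungapped hit as (score, query_start, target_start, length), or None."""
    # Step 1: list the words of the query (exact words only in this sketch).
    words = defaultdict(list)
    for i in range(len(query) - w + 1):
        words[query[i:i + w]].append(i)
    best = None
    # Step 2: scan the target for occurrences of those words.
    for j in range(len(target) - w + 1):
        for i in words.get(target[j:j + w], []):
            # Step 3: extend each word occurrence in both directions.
            right, rlen = _extend(query, target, i + w, j + w, 1, match, mismatch, x_drop)
            left, llen = _extend(query, target, i - 1, j - 1, -1, match, mismatch, x_drop)
            hit = (w * match + left + right, i - llen, j - llen, w + llen + rlen)
            if best is None or hit[0] > best[0]:
                best = hit
    return best
```

For instance, seed_and_extend("TTACGTACGG", "CCCACGTACCC") grows the seed ACG into the shared stretch ACGTAC and returns (6, 2, 3, 6).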

FASTA (Fast Alignment)
• Quite similar to BLAST
• Multi-step procedure
  • Locate all identically matching words of a fixed length (1-2 for proteins, 4-6 for DNA)
  • Look for diagonals with many mutually supporting word matches
  • The best diagonals are selected as “seeds” for extension
  • Extend a seed word to find maximal-scoring ungapped regions (possibly joining several seeds)
  • Check whether adjacent ungapped matches can be joined by a gapped region, allowing for gap costs
  • Finally, the full dynamic programming algorithm is run on the regions of best-matching alignments
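The first two steps (word lookup and diagonal selection) can be illustrated as follows. This is a simplified sketch, not the FASTA program: it only counts exact word matches per diagonal i - j, and the function name and defaults are hypothetical.

```python
from collections import Counter, defaultdict

def best_diagonals(query, target, k=2, top=3):
    """Return up to `top` (diagonal, word_match_count) pairs, best first."""
    # Index every length-k word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # A word match at query position i and target position j lies on
    # diagonal i - j; matches on the same diagonal support each other.
    counts = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            counts[i - j] += 1
    return counts.most_common(top)
```

With query GGACGTTT and target ACGTCC, the words AC, CG and GT all fall on diagonal 2, so that diagonal would be chosen as the seed for extension.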

Significance of Scores
• How do we assess the significance of an alignment score?
• Two basic approaches
  • The classical approach: extreme value distribution
    • Assume a null (random) model M0 for scores
    • P(Score > s|M0, x, y) = ?
  • The Bayesian approach: model comparison
    • Assume two models for (x,y): random M0; aligned M1
    • P(M1|x,y)/P(M0|x,y) = ?
    • The posterior odds factor into a likelihood ratio and a prior term: P(M1|x,y)/P(M0|x,y) = [P(x,y|M1)/P(x,y|M0)] × [P(M1)/P(M0)]; the log of the first factor is the log-odds score of the alignment
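For the Bayesian approach, Bayes' rule gives posterior odds = likelihood ratio × prior odds, so the posterior P(M1|x,y) is a logistic function of the log-odds score plus the log prior odds. A minimal sketch, assuming the log-odds score is in natural-log units (the function name and the example numbers are illustrative):

```python
import math

def posterior_match(log_odds, prior_match):
    """P(M1 | x, y) from a natural-log log-odds score and the prior P(M1)."""
    # log posterior odds = log-odds score + log prior odds
    logit = log_odds + math.log(prior_match / (1.0 - prior_match))
    # Convert odds back to a probability with the logistic function.
    return 1.0 / (1.0 + math.exp(-logit))
```

A score of log 9 yields a posterior of 0.9 under an even prior, but only 0.5 when the prior P(M1) is 0.1. This is why a fixed score becomes less convincing as the database grows and the prior chance that any given sequence is related shrinks.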
