FASTA and LFASTA Algorithms for Sequence Comparison

Lecture #7: FASTA & LFASTA BIOINF 2051 Fall 2002

Dot Plot Alpha chain vs. Beta chain of Human Hemoglobin
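A dot plot of this kind can be sketched in a few lines. The version below is a minimal illustration, assuming exact residue identity and no window filtering; the two short strings are toy inputs chosen for the example, not the full hemoglobin chains, and the function names `dot_plot` and `render` are hypothetical.

```python
def dot_plot(a, b):
    """Return a matrix whose cell [i][j] is True when a[i] == b[j]."""
    return [[x == y for y in b] for x in a]


def render(matrix):
    """Draw the matrix as text: '*' for an identity, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Toy fragments used only for illustration (assumption, not from the slides).
plot = dot_plot("VLSPADK", "VHLTPEEK")
print(render(plot))
```

Runs of identities show up as diagonal streaks of `*`, which is the signal FASTA searches for without building the full matrix.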

FASTA and LFASTA
• Pearson and Lipman (1988)
• FASTA – program that calculates the initial and optimal similarity scores between two sequences
• LFASTA – program for detecting local similarities; it finds multiple alignments between smaller portions of two sequences

The FASTA algorithm
• Four steps:
• Step 1: Identify regions of similarity
  • The ktup parameter specifies the number of consecutive identities required in a match
  • The 10 best diagonal regions are found based on the number of matches and the distance between matches
• Step 2: Rescore regions and identify the best initial regions
  • PAM250 or another scoring matrix is used to rescore the 10 diagonal regions identified in step 1, allowing for conservative replacements and runs of identities shorter than ktup
  • For each of the best diagonal regions, identify the “initial region”, which is its best-scoring subregion
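The first two steps can be sketched roughly as follows. This is a simplified illustration, not the published implementation: diagonals are ranked by hit count only (the distance between matches is ignored), and a plain match/mismatch score stands in for PAM250. The names `ktup_hits`, `best_diagonals`, and `best_subregion` are hypothetical.

```python
from collections import defaultdict


def ktup_hits(query, target, ktup=2):
    """Map each diagonal (j - i) to the query positions of exact ktup-word hits."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    diagonals = defaultdict(list)
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[j - i].append(i)
    return diagonals


def best_diagonals(diagonals, keep=10):
    """Simplification: rank diagonals by number of hits only."""
    return sorted(diagonals, key=lambda d: len(diagonals[d]), reverse=True)[:keep]


def best_subregion(query, target, diag, match=2, mismatch=-1):
    """Rescore one diagonal and return (score, q_start, q_end) of its
    best-scoring run. match/mismatch is an assumed stand-in for PAM250."""
    i_start = max(0, -diag)
    i_end = min(len(query), len(target) - diag)
    best = (0, i_start, i_start)
    run, run_start = 0, i_start
    for i in range(i_start, i_end):
        s = match if query[i] == target[i + diag] else mismatch
        if run <= 0:  # restart the run once its score drops to zero or below
            run, run_start = 0, i
        run += s
        if run > best[0]:
            best = (run, run_start, i + 1)
    return best


query, target = "ACGTACGT", "TTACGTACGTTT"
hits = ktup_hits(query, target, ktup=2)
top = best_diagonals(hits)
print(top[0], best_subregion(query, target, top[0]))
```

Because rescoring walks every position on the diagonal, it can extend through mismatches and pick up identities shorter than ktup, which the word lookup alone would miss.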

The FASTA algorithm
• Step 3: Optimally join initial regions with scores > T
  • Given: the locations of the initial regions, their scores, and a gap penalty
  • Calculate an optimal alignment of initial regions as a combination of compatible regions with maximal score
  • Use the resulting score to rank the library sequences
  • Selectivity degradation is limited by using only initial regions that score greater than some threshold T
• Step 4: Align the highest-scoring library sequences using a modification of global and local alignment algorithms
  • Considers all possible alignments of the query and library sequence that fall within a band centered around the highest-scoring initial region
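The joining step can be illustrated with a small dynamic program over the initial regions. The sketch below makes several assumptions that the slides do not specify: a region is a tuple `(q_start, q_end, t_start, t_end, score)` with half-open ends, two regions are compatible when the second starts at or after the end of the first in both sequences, and every join costs one fixed `join_penalty`. The function name `join_regions` is hypothetical.

```python
def join_regions(regions, threshold, join_penalty):
    """Return (best_score, chain) for the best chain of compatible regions.

    Only regions scoring above `threshold` (T) are considered.
    """
    kept = sorted(
        (r for r in regions if r[4] > threshold),
        key=lambda r: (r[0], r[2]),
    )
    # best[i] = (score of best chain ending at kept[i], predecessor index)
    best = []
    for i, r in enumerate(kept):
        score, prev = r[4], None
        for j in range(i):
            p = kept[j]
            if p[1] <= r[0] and p[3] <= r[2]:  # p ends before r in both sequences
                candidate = best[j][0] + r[4] - join_penalty
                if candidate > score:
                    score, prev = candidate, j
        best.append((score, prev))
    if not best:
        return 0, []
    end = max(range(len(best)), key=lambda i: best[i][0])
    total = best[end][0]
    chain = []
    while end is not None:  # walk predecessors back to the chain start
        chain.append(kept[end])
        end = best[end][1]
    return total, chain[::-1]


regions = [
    (0, 10, 0, 10, 30),
    (12, 20, 14, 22, 25),
    (5, 15, 30, 40, 20),
    (25, 30, 2, 7, 40),
    (21, 24, 23, 26, 5),   # below threshold, ignored
]
total, chain = join_regions(regions, threshold=10, join_penalty=8)
print(total, chain)
```

In this toy input the single best region scores 40, but joining two compatible regions (30 + 25 - 8 = 47) ranks higher, which is the effect the joined score is meant to capture.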

LFASTA
• FASTA reports only the single highest-scoring alignment between two sequences
• LFASTA is a local sequence comparison tool that can identify multiple local alignments between two sequences
• Optimal algorithms for sensitive local sequence comparison are computationally intensive in terms of time and memory

LFASTA vs. FASTA
• LFASTA uses the same first two steps for finding initial regions as FASTA, except:
  • Instead of saving 10 initial regions, LFASTA saves all diagonal regions with similarity scores above some threshold
• Construction of optimized alignments
  • Instead of focusing on a single region, LFASTA computes a local alignment for each initial region
  • Apart from the band around the initial region, LFASTA also considers potential sequence alignments for some distance before and after the initial region
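A band-limited local alignment around one initial region might look like the sketch below. It is a generic banded Smith-Waterman score with a linear gap penalty, offered as an assumption about the general shape of the computation rather than the LFASTA code itself; it does not model the extension before and after the region. The name `banded_local_score` and the scoring values are hypothetical.

```python
def banded_local_score(query, target, diag, width,
                       match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(j - i) - diag| <= width.

    Cells outside the band are treated as score 0, i.e. as possible
    starting points of a local alignment.
    """
    prev = {}  # column -> score in the previous row
    best = 0
    for i in range(1, len(query) + 1):
        cur = {}
        lo = max(1, i + diag - width)
        hi = min(len(target), i + diag + width)
        for j in range(lo, hi + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            score = max(
                0,
                prev.get(j - 1, 0) + s,   # extend along the diagonal
                prev.get(j, 0) + gap,     # gap in target
                cur.get(j - 1, 0) + gap,  # gap in query
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


# One inserted residue in the target: a band of width 1 can absorb the gap,
# a band of width 0 (the diagonal only) cannot.
print(banded_local_score("AAAACCCC", "AAAATCCCC", diag=0, width=1))
print(banded_local_score("AAAACCCC", "AAAATCCCC", diag=0, width=0))
```

Running this once per saved initial region, each with its own `diag`, yields one local score per region instead of a single best alignment.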

Self-comparison of myosin heavy chain from C. elegans
• See plot from a local similarity self-comparison of the myosin heavy chain (NBRF code MWKW) using the PAM250 matrix
• The amino-terminal half of the molecule forms a large globular head without any periodic structure
• The symmetrical parallel lines along the C-terminal half correspond to the 28-residue repeat responsible for the α-helical coiled-coil structure of the rod segment
