90 likes | 107 Views
Understand FASTA and LFASTA algorithms used for sequence comparison, their steps, optimization, and key differences for efficient local alignments. Self-comparison of C. elegans myosin heavy chain demonstrated.
E N D
Lecture #7: FASTA & LFASTA BIOINF 2051 Fall 2002
Dot Plot Alpha chain vs. Beta chain of Human Hemoglobin
FASTA and LFASTA • Pearson and Lipman (1988) • FASTA – program that calculates the initial and optimal similarity scores between two sequences • LFASTA – program for detecting local similarities – finds multiple alignments between smaller portions of two sequences
The FASTA algorithm • Four steps: • Identify regions of similarity: • Using the ktup parameter which specifies # consecutive identities required in a match • 10 best diagonal regions found based on #matches and distance between matches • Rescore regions and identify best initial regions • PAM250 or other scoring matrix used for rescoring the 10 diagonal regions identified in step 1 to allow for conservative replacements and runs of identities shorter than ktup • For each the best diagonal regions, identify “initial region” that is best scoring subregion
The FASTA algorithm • Optimally join initial regions with scores > T • Given: location of initial regions, scores, gap penalty • Calculate an optimal alignment of initial regions as a combination of compatible regions with maximal score • Use resulting score to rank the library sequences • Selectivity degradation limited by using initial regions that score greater than some threshold T • Align the highest scoring library sequences using modification of global and local alignment algorithms • Considers all possible alignments of the query and library sequence that falls within a band centered around the highest scoring initial region
LFASTA • FASTA – reports only one highest scoring alignment between two sequences • LFASTA – local sequence comparison tool that can identify multiple local alignments between 2 sequences • Optimal algorithms for sensitive local sequence comparison are computationally intensive in terms of time and memory
LFASTA vs. FASTA • LFASTA uses same first 2 steps for finding initial regions as FASTA, except: • Instead of saving 10 initial regions, LFASTA saves all diagonal regions with similarity scores > some threshold • Construction of optimized alignments • Instead of focusing on a single region, LFASTA computes a local alignment for each initial region • Also, apart from band around initial region, LFASTA considers potential sequence alignments for some distance before and after the initial region.
Self-comparison of myosin heavy chain from C. elegans • See plot from a local similarity self-comparison of the myosin heavy chain (NBRF code MWKW) using the PAM 250 matrix • The amino-terminal half of the molecule forms a large globular head without any periodic structure • The symmetrical parallel lines along the C-terminal half correspond to the 28-residue repeat responsible for the a-helical coiled-coil structure of the rod segment