190 likes | 527 Views
FASTA Database Sequence Searches. Quick Overview/Review Software at version 3: better statistical estimates, error handling, and larger variety of scoring matrices (PAM120, BLOSUM80, PAM250, BLOSUM50, BLOSUM62) fasta3 : input protein sequence with protein seq DB
E N D
FASTA Database Sequence Searches • Quick Overview/Review • Software at version 3: • better statistical estimates, error handling, and larger variety • of scoring matrices (PAM120, BLOSUM80, PAM250, • BLOSUM50, BLOSUM62) • fasta3 : input protein sequence with protein seq DB • fastx3/fasty3 : input DNA sequence with protein seq DB • fastx (faster but frame shifts between codons) • fasty (frame shifts within codons) • tfasta3/tfastx3/tfasty3 : protein comparisons against a • DNA sequence DB.
FASTA Internals • Four step procedure in performing the search • Lookup table method finds exact residue matches to be used as candidates to build the final alignments • Regions are scored using a local alignment scoring method (PAM, BLOSUM) to find 10 best regions • Take best scored sequence and attempt to join the initial segments together • In a similar method to Smith-Waterman, all possible alignments considered within a band around the highest scoring segments
FASTA Lookup Method • Lookup table is merely a list of all amino acids from each sequence and the relative position differences • (Mount, p294)
FASTA Lookup Method • Several residues at same offset indicate an area of identical alignment, called a k-tuple. • The k-tuple size is an important parameter in • performing FASTA searches. • k-tuple searches of size = 1 are more sensitive but slower. • k-tuple = 2 is 4x faster than size = 1.
FASTA k-tuples ktup = 3 ktup = 2 ktup = 1
FASTA Diagonal Method • Find segments of the compared sequences bounded by regions with k-tuples of size “ktup” (search parameter) • Segments from input and database sequences are of the same length! • This allows similarity based on identical matches and allows mismatches but doesn’t allow insertion or deletion of residues or gaps. • accdff and abcdef are okay • acdf and abcdef are not
FASTA Illustrated Step 1 Lookup Table Diagonal Method Step 2 Score with Favorite Method Step 3 Select Best Segments for Joining Step 4 Smith-Waterman Like Method
FASTA Results and Significance • Search scores based on extreme value distribution • Yev(x) = exp[ -x – exp[-x] ] • Pev(S > x) = 1 – exp[ - exp[ -L(x-u) ] ] • Values determined by fitting actual score distribution to EVD • L = 1.2825 / s • u = xavg – 0.4500 s • z = (x – m) / s • P(S > z) = 1 – exp[ - exp[ -1.2825z – 0.5772 ] ] • Chance a random sequence has score S that is greater than z and thereby a more significant match • Remember: “z” in FASTA is really z’, z’ = 50 + 10z
FASTA Results and Significance • Expectation (Z) of finding score S in D “databases” • D is number of “databases” searched (in FASTA output) • Book uses term “databases” but in reality it is the number of • whole sequences searched against. • E(S > z) = D * P(S > z)
BLAST Overview • Similarly set of search tools as FASTA • BLASTP (prot-prot), BLASTN (nucl-nucl) • BLASTX (transl.nucl-prot), TBLASTN (prot-transl.nucl) • TBLASTX (transl.nucl-transl.nucl) • Version now at 2.2 • Better statistical significance calculations • Gapped alignment for increased sensitivity
BLAST Algorithm • The set up… • Chop up input query sequence into smaller words • default size is 3 for protein searches, 11 for nucleic acids • Smaller words are generated in each “frame” • 123, 234, 345, 456, 567, 678, … • Each word is scored (BLOSUM62) against sequences in the search database(s). • Threshold value “T” is used to discard comparisons that do not score above the value “T”. Matches above T are “hits”.
BLAST, 1 hits or 2? • Original BLAST (v1) • Each hit was extended in both directions to determine if the word was a subset of a larger matching region • accounted for ~ 90% of the BLAST execution time • The extensions continue until score drops below best score achieved in short sequence matches. • New BLAST (v2) • Hits that were “HSPs” (high-scoring segment pairs) typically had multiple hits along the same diagonal within a “short distance” of each other. • Find another non-overlapping hit in a window of length “A”. • If there is one, then attempt the same hit extension as in v1.
BLAST, 1 hits or 2… • New 2 hit criteria reduces number of candidates to extend but reduces sensitivity of the search • Hits that might be significant (and “remote”) are not extended. • “T” value must be decreased in order to achieve sensitivity in original BLAST. • More “weaker” hits are included but not enough to offset gains. • Two-hit method generates 3.2 times as many hits but only 0.14 times as many hit extensions were done.
BLAST, Gapped Alignments • Weakness of BLAST v1 • A single significant alignment constructed from two(or more) less significant extended (ungapped) hits could be discarded if one (or more) of the extended hits weren’t found. • For example, if you changed thresholds slightly. • The hit threshold “T” had to be set low to make sure the less significant smaller hits could be found to generate the single, more significant alignment. • Reminder: ungapped extensions based on BLOSUM62 scoring and allows residue substitutions
BLAST, Gapped Alignments • BLAST v2 Modification: Gapped Alignments • Rather than construct an alignment between HSPs, • attempt a gapped extension on any HSP that exceeds the • threshold “Sg”. • Similar to 2-hit process except “off-diagonal” using • previously discovered HSP candidates. • Look within a given distance “G” for another HSP. • Using gaps and scoring matrix, determine score of the new, larger gapped sequence. • Keep gapped sequence if score remains above threshold “Xg”.
BLAST, Are we done yet? • Final clean up • After all extensions (both gapped and ungapped) are done, • finishing touches are applied to the fragments through a • modified dynamic programming method. • All sequence matches above a specific expect score (user • specified) are reported to the user.