1 / 16

FASTA Database Sequence Searches

FASTA Database Sequence Searches. Quick Overview/Review Software at version 3: better statistical estimates, error handling, and larger variety of scoring matrices (PAM120, BLOSUM80, PAM250, BLOSUM50, BLOSUM62) fasta3 : input protein sequence with protein seq DB

avak
Download Presentation

FASTA Database Sequence Searches

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. FASTA Database Sequence Searches • Quick Overview/Review • Software at version 3: • better statistical estimates, error handling, and larger variety • of scoring matrices (PAM120, BLOSUM80, PAM250, • BLOSUM50, BLOSUM62) • fasta3 : input protein sequence with protein seq DB • fastx3/fasty3 : input DNA sequence with protein seq DB • fastx (faster but frame shifts between codons) • fasty (frame shifts within codons) • tfasta3/tfastx3/tfasty3 : protein comparisons against a • DNA sequence DB.

  2. FASTA Internals • Four step procedure in performing the search • Lookup table method finds exact residue matches to be used as candidates to build the final alignments • Regions are scored using a local alignment scoring method (PAM, BLOSUM) to find 10 best regions • Take best scored sequence and attempt to join the initial segments together • In a similar method to Smith-Waterman, all possible alignments considered within a band around the highest scoring segments

  3. FASTA Lookup Method • Lookup table is merely a list of all amino acids from each sequence and the relative position differences • (Mount, p294)

  4. FASTA Lookup Method • Several residues at same offset indicate an area of identical alignment, called a k-tuple. • The k-tuple size is an important parameter in • performing FASTA searches. • k-tuple searches of size = 1 are more sensitive but slower. • k-tuple = 2 is 4x faster than size = 1.

  5. FASTA k-tuples ktup = 3 ktup = 2 ktup = 1

  6. FASTA Diagonal Method • Find segments of the compared sequences bounded by regions with k-tuples of size “ktup” (search parameter) • Segments from input and database sequences are of the same length! • This allows similarity based on identical matches and allows mismatches but doesn’t allow insertion or deletion of residues or gaps. • accdff and abcdef are okay • acdf and abcdef are not

  7. FASTA Illustrated Step 1 Lookup Table Diagonal Method Step 2 Score with Favorite Method Step 3 Select Best Segments for Joining Step 4 Smith-Waterman Like Method

  8. FASTA Results and Significance • Search scores based on extreme value distribution • Yev(x) = exp[ -x – exp[-x] ] • Pev(S > x) = 1 – exp[ - exp[ -L(x-u) ] ] • Values determined by fitting actual score distribution to EVD • L = 1.2825 / s • u = xavg – 0.4500 s • z = (x – m) / s • P(S > z) = 1 – exp[ - exp[ -1.2825z – 0.5772 ] ] • Chance a random sequence has score S that is greater than z and thereby a more significant match • Remember: “z” in FASTA is really z’, z’ = 50 + 10z

  9. FASTA Results and Significance • Expectation (Z) of finding score S in D “databases” • D is number of “databases” searched (in FASTA output) • Book uses term “databases” but in reality it is the number of • whole sequences searched against. • E(S > z) = D * P(S > z)

  10. BLAST Overview • Similarly set of search tools as FASTA • BLASTP (prot-prot), BLASTN (nucl-nucl) • BLASTX (transl.nucl-prot), TBLASTN (prot-transl.nucl) • TBLASTX (transl.nucl-transl.nucl) • Version now at 2.2 • Better statistical significance calculations • Gapped alignment for increased sensitivity

  11. BLAST Algorithm • The set up… • Chop up input query sequence into smaller words • default size is 3 for protein searches, 11 for nucleic acids • Smaller words are generated in each “frame” • 123, 234, 345, 456, 567, 678, … • Each word is scored (BLOSUM62) against sequences in the search database(s). • Threshold value “T” is used to discard comparisons that do not score above the value “T”. Matches above T are “hits”.

  12. BLAST, 1 hits or 2? • Original BLAST (v1) • Each hit was extended in both directions to determine if the word was a subset of a larger matching region • accounted for ~ 90% of the BLAST execution time • The extensions continue until score drops below best score achieved in short sequence matches. • New BLAST (v2) • Hits that were “HSPs” (high-scoring segment pairs) typically had multiple hits along the same diagonal within a “short distance” of each other. • Find another non-overlapping hit in a window of length “A”. • If there is one, then attempt the same hit extension as in v1.

  13. BLAST, 1 hits or 2… • New 2 hit criteria reduces number of candidates to extend but reduces sensitivity of the search • Hits that might be significant (and “remote”) are not extended. • “T” value must be decreased in order to achieve sensitivity in original BLAST. • More “weaker” hits are included but not enough to offset gains. • Two-hit method generates 3.2 times as many hits but only 0.14 times as many hit extensions were done.

  14. BLAST, Gapped Alignments • Weakness of BLAST v1 • A single significant alignment constructed from two(or more) less significant extended (ungapped) hits could be discarded if one (or more) of the extended hits weren’t found. • For example, if you changed thresholds slightly. • The hit threshold “T” had to be set low to make sure the less significant smaller hits could be found to generate the single, more significant alignment. • Reminder: ungapped extensions based on BLOSUM62 scoring and allows residue substitutions

  15. BLAST, Gapped Alignments • BLAST v2 Modification: Gapped Alignments • Rather than construct an alignment between HSPs, • attempt a gapped extension on any HSP that exceeds the • threshold “Sg”. • Similar to 2-hit process except “off-diagonal” using • previously discovered HSP candidates. • Look within a given distance “G” for another HSP. • Using gaps and scoring matrix, determine score of the new, larger gapped sequence. • Keep gapped sequence if score remains above threshold “Xg”.

  16. BLAST, Are we done yet? • Final clean up • After all extensions (both gapped and ungapped) are done, • finishing touches are applied to the fragments through a • modified dynamic programming method. • All sequence matches above a specific expect score (user • specified) are reported to the user.

More Related