
BIC I, Week 6 lectures

BIC I, Week 6 lectures. Rhys Price Jones and Anne Haake Rochester Institute of Technology rpjavp@rit.edu , arh@it.rit.edu. FASTA & BLAST. Faster than dynamic programming But may miss some alignments Use heuristics to speed up search. Heuristic.



  1. BIC I, Week 6 lectures Rhys Price Jones and Anne Haake Rochester Institute of Technology rpjavp@rit.edu, arh@it.rit.edu

  2. FASTA & BLAST • Faster than dynamic programming • But may miss some alignments • Use heuristics to speed up search

  3. Heuristic • In computing, the term heuristic is used to refer to an “ad hoc” technique that seems to work well • The FASTA & BLAST heuristic is that sequences that align well will tend to have short regions of exact (or almost exact) matching • Begins by searching for sequences in the database that have short regions of exact match with our target sequence

  4. FASTA • FAST = Family of algorithms • FASTA most often used • 3 basic steps • Find very short exact matches between query sequence and database sequence(s) • Extend the best short exact matches to look for longer stretches of similarity • Optimize the best hits with dynamic programming

  5. FASTA • Like dot plots, FASTA compares 2 sequences at a time • But it compares a “word” of 4-6 nucleotides (or 2 amino acids) at a time; this word length is “ktup” • Matches all identical words from the 2 sequences and creates diagonals by connecting non-overlapping matches • Rescores using a substitution matrix such as PAM250 → returns the best score = “init1”

  6. FASTA • High-scoring diagonals are joined together, allowing for gaps; some are eliminated • Diagonal with best score → “initn” • An optimal local alignment is computed between the query sequence and all database sequences with high “initn” values • Smith-Waterman is used • Returns → “opt” score
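
The final optimization step can be illustrated with a minimal Smith-Waterman sketch in Python. It is not FASTA's own code: the function name, the match/mismatch values and the linear gap penalty are assumptions chosen for illustration, and a real search would use a substitution matrix such as PAM250 and typically restrict the computation to a band around the best diagonal.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of s and t (linear gap penalty)."""
    prev = [0] * (len(t) + 1)          # previous row of the DP matrix
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Only two rows are kept in memory because the sketch returns the score alone; recovering the alignment itself would need the full matrix and a traceback.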

  7. FASTA • FASTA calculates an expectation of significance value, known as the E value. • E value: (def. from NCBI website) • The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. • E < 0.02: the similarity between query sequence and database sequence(s) is considered statistically significant.

  8. FASTA • Data structures used • Table: if searching for ktup = 4 nucleotides • Table will have 4^4 = 256 entries • Table index is the 4-mer; entries are the sequences and the locations in each sequence where the 4-mer occurs • Offset vector: for sequences s, t • Indices range from (1-|t|) to (|s|-1) • Initialize all to zero • From the table, for each pair of matching positions, increment the element of the offset vector corresponding to the difference in positions (s-t)
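
A minimal Python sketch of the two data structures, under simplifying assumptions: the table indexes a single sequence s rather than a whole database, positions are 0-based, and a dictionary stands in for the fixed-size table and the offset vector. The names `build_ktup_table` and `diagonal_counts` are hypothetical.

```python
from collections import defaultdict


def build_ktup_table(seq, ktup):
    """Map each k-mer of seq to the list of 0-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i)
    return table


def diagonal_counts(s, t, ktup=4):
    """Count shared k-mers per diagonal.

    The key is the offset i - j: start position in s minus start position in t.
    """
    table = build_ktup_table(s, ktup)
    offsets = defaultdict(int)          # plays the role of the offset vector
    for j in range(len(t) - ktup + 1):
        for i in table.get(t[j:j + ktup], []):
            offsets[i - j] += 1
    return dict(offsets)


print(diagonal_counts("TTACGTAC", "ACGTAC"))
```

An offset with a high count marks a diagonal carrying many word matches, which is where the later rescoring steps would concentrate.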

  9. FASTA program

  10. BLAST • rhys

  11. Applying the Heuristic • Makes a list of all the w-tuples in the target sequence • For nucleotide sequences, the w-tuples are just exactly what’s there • For amino acid sequences, the w-tuples are all those that are within a certain “Blosum” threshold of actual w-tuples • Searches the database for sequences that have a high hit rate of exact matches with the w-tuples in the list
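
The word list can be sketched as follows. This is an illustration only: `toy_score` is a made-up scoring function over a four-letter alphabet, not a real BLOSUM matrix, and the brute-force enumeration of all candidate words is practical only for small w.

```python
from itertools import product


def w_tuples(seq, w):
    """All overlapping words of length w in seq, in order of occurrence."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def neighborhood(word, alphabet, score, threshold):
    """Words over alphabet whose summed pairwise score against word is >= threshold."""
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            hits.append("".join(cand))
    return hits


# Hypothetical scores: identity 4, "similar" residues 2, anything else -1.
SIMILAR = {frozenset("IL"), frozenset("KR")}


def toy_score(a, b):
    if a == b:
        return 4
    return 2 if frozenset((a, b)) in SIMILAR else -1


print(w_tuples("ACGTA", 3))
print(neighborhood("IK", "IKLR", toy_score, threshold=6))
```

For nucleotides only `w_tuples` is needed; for amino acids each word from `w_tuples` would be expanded with `neighborhood`, so lowering the threshold lengthens the list.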

  12. Sensitivity, Selectivity and Stringency • The larger w is, the more selective the process • Adjusting the threshold value down increases the chances of hitting evolutionarily distant sequences (In BLAST this is done by adjusting the “Expect” value up) • BLASTp allows you to select your PAM or BLOSUM matrix, which also affects sensitivity, selectivity and stringency
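
A rough calculation shows why a larger w is more selective. The sketch assumes independent, uniformly distributed letters, which real sequences do not satisfy, so the numbers are only indicative; the function name is hypothetical.

```python
def expected_chance_word_hits(m, n, w, alphabet_size=4):
    """Expected number of exact w-mer matches by chance between sequences
    of lengths m and n, assuming uniform independent letters."""
    word_pairs = (m - w + 1) * (n - w + 1)
    return word_pairs / alphabet_size ** w


for w in (4, 7, 11):
    print(w, expected_chance_word_hits(100, 1_000_000, w))
```

Each extra nucleotide in the word cuts the expected number of chance hits by roughly a factor of 4, at the cost of missing matches that have no exact run of length w.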

  13. Extend the hits • When a sequence has been identified that has good w-tuple hits • Use dynamic programming to extend the hits • Extension stops when there is no increase in score • There may be more than one area of sequence similarity – try to merge these
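
The extension step can be illustrated with a simplified sketch. It swaps in an ungapped extension with a drop-off rule rather than the dynamic programming named on the slide: extension in each direction stops once the running score falls `x_drop` below the best score seen. The function name, the scores and the `x_drop` default are assumptions.

```python
def extend_hit(s, t, i, j, w, match=1, mismatch=-1, x_drop=2):
    """Extend the exact word hit s[i:i+w] == t[j:j+w] in both directions, no gaps.

    Returns (score, start_in_s, end_in_s) with end_in_s exclusive.
    """
    def extend(step, si, tj):
        best = score = 0
        best_len = length = 0
        while 0 <= si < len(s) and 0 <= tj < len(t):
            score += match if s[si] == t[tj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break                    # score has dropped too far: stop
            si += step
            tj += step
        return best, best_len

    right, right_len = extend(+1, i + w, j + w)
    left, left_len = extend(-1, i - 1, j - 1)
    return w * match + left + right, i - left_len, i + w + right_len


print(extend_hit("GGACGTACTT", "CCACGTACAA", i=3, j=3, w=4))
```

In this example the word hit CGTA grows by one matching letter on each side, giving the segment ACGTAC.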

  14. Statistical Analysis • can be used on BLAST results • resulting in “E” values that give some indication of the quality of the match • BLAST results are usually sorted according to these “E” values
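
One common way to turn a raw alignment score S into an E value is the Karlin-Altschul formula E = K·m·n·e^(−λS), where m is the query length and n the database length. The sketch below applies it; the default values of `lam` and `k` are placeholders for illustration, since the real parameters depend on the scoring matrix and gap penalties in use.

```python
import math


def e_value(raw_score, query_len, db_len, lam=0.3, k=0.1):
    """Expected number of chance alignments scoring at least raw_score.

    lam and k are illustrative placeholders, not values for any real matrix.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


for score in (30, 50, 70):
    print(score, e_value(score, query_len=250, db_len=10_000_000))
```

The E value falls exponentially with the score and grows linearly with database size, which is why the same alignment looks less significant against a larger database.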

  15. Sample BLAST pages • (without graphics) • Main BLAST search page

  16. Nucleotide BLAST Search • BLASTn • Note the ability to modify the window size and the Expect value

  17. Protein BLAST Search • BLASTp • Note the choice of BLOSUM or PAM matrices • Note the options for penalizing gaps

  18. BLAST Results Page • Results • Note the E-values

  19. Conclusions • Exact pattern matching is easy and fast • Pairwise alignment with indels and similarity can be done, but not so quickly • Current computer technology is making the dynamic programming Smith-Waterman algorithm increasingly attractive • Pairwise alignment can be done quickly using BLAST or FASTA but may miss some alignments
