BIC I, Week 6 lectures Rhys Price Jones and Anne Haake Rochester Institute of Technology rpjavp@rit.edu, arh@it.rit.edu
FASTA & BLAST • Faster than dynamic programming • But may miss some alignments • Use heuristics to speed up search
Heuristic • In computing, the term heuristic is used to refer to an “ad hoc” technique that seems to work well • The FASTA & BLAST heuristic is that sequences that align well will tend to have short regions of exact (or almost exact) matching • Begins by searching for sequences in the database that have short regions of exact match with our target sequence
FASTA • FAST = Family of algorithms • FASTA most often used • 3 basic steps: (1) Find very short exact matches between query sequence and database sequence(s); (2) Extend the best short exact matches to look for longer stretches of similarity; (3) Optimize the best hits with dynamic programming
FASTA • Like dot plots, FASTA compares 2 sequences at a time • But it compares a “word” of 4-6 nucleotides (or 2 amino acids) at a time; the word length is “ktup” • Matches all identical words from the 2 sequences and creates diagonals by connecting non-overlapping matches • Rescores the diagonals using a substitution matrix such as PAM250; the best-scoring region is returned as “init1”
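The rescoring step can be sketched as a scan along one diagonal that keeps the best-scoring contiguous run, which plays the role of “init1”. This is a minimal illustration only: a simple match/mismatch score stands in for PAM250, positions are 0-based, and the function name and the `match`/`mismatch` values are assumptions made for the example.

```python
def best_ungapped_segment(s, t, offset, match=5, mismatch=-4):
    """Best-scoring contiguous run along the diagonal i - j == offset (0-based).

    A match/mismatch score stands in for a real substitution matrix here.
    """
    # Walk the diagonal: s[i] is paired with t[i - offset] while both are valid.
    start = max(0, offset)
    end = min(len(s), len(t) + offset)
    best = running = 0
    for i in range(start, end):
        pair_score = match if s[i] == t[i - offset] else mismatch
        # Restart the run whenever the running total would go negative.
        running = max(0, running + pair_score)
        best = max(best, running)
    return best


if __name__ == "__main__":
    # One mismatch in the middle of an otherwise identical diagonal.
    print(best_ungapped_segment("ACGTACGT", "ACGTTCGT", offset=0))
```

Running the scan on every diagonal that collected many word matches, and keeping the top score, gives a value comparable in spirit to “init1”.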
FASTA • High-scoring diagonals are joined together, allowing for gaps; some are eliminated • The best score after joining is “initn” • An optimal local alignment is computed between the query sequence and each database sequence with a high “initn” value • Smith-Waterman is used • Returns the “opt” score
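The final optimization step uses Smith-Waterman local alignment. The sketch below computes only the score (the analogue of “opt”), with a linear gap penalty and match/mismatch scoring; these parameter values are assumptions for illustration, and a real FASTA run would use a substitution matrix and restrict the computation to a band around the best diagonals.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-2):
    """Score of the best local alignment of s and t (linear gap penalty)."""
    prev = [0] * (len(t) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may start anywhere, so cells never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGTT", "ACTT"))
```

Only two rows are kept in memory because the score, not the traceback, is needed here.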
FASTA • FASTA calculates an expectation value, known as the E value, to assess significance • E value (definition from the NCBI website): the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. • E < 0.02: the similarity between query sequence and database sequence(s) is considered statistically significant
FASTA • Data structures used • Table: if searching for ktup = 4 nucleotides • Table will have 4^4 = 256 entries • Table index is the 4-mer; entries are the sequences and the locations in each sequence where the 4-mer occurs • Offset vector: for sequences s, t • Indices range from 1-|t| to |s|-1 • Initialize all to zero • From the table, for each pair of matching positions, increment the offset-vector element corresponding to the difference in positions (s-t)
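The two data structures can be sketched as follows. This is a simplified illustration for a single pair of sequences: the table is a dictionary keyed by k-mer rather than a fixed 256-entry array, positions are 0-based (so offsets run from 1-|t| to |s|-1), and the function names are made up for the example.

```python
from collections import defaultdict


def build_ktup_table(t, ktup):
    """Map each ktup-length word of t to the positions where it starts."""
    table = defaultdict(list)
    for j in range(len(t) - ktup + 1):
        table[t[j:j + ktup]].append(j)
    return table


def offset_vector(s, t, ktup=4):
    """Count word matches per diagonal; returns (counts, shift).

    The count for offset d = i - j is stored at counts[d + shift].
    """
    table = build_ktup_table(t, ktup)
    shift = len(t) - 1  # moves the smallest offset, 1 - len(t), to index 0
    counts = [0] * (len(s) + len(t) - 1)
    for i in range(len(s) - ktup + 1):
        for j in table.get(s[i:i + ktup], ()):
            counts[i - j + shift] += 1
    return counts, shift


if __name__ == "__main__":
    counts, shift = offset_vector("GGACGTAC", "ACGTAC", ktup=4)
    best_index = max(range(len(counts)), key=counts.__getitem__)
    print("best offset:", best_index - shift, "hits:", counts[best_index])
```

Offsets with high counts mark the diagonals worth rescoring, which is why a single pass over the query is enough to find candidate regions.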
BLAST • rhys
Applying the Heuristic • Makes a list of all the w-tuples in the target sequence • For nucleotide sequences, the w-tuples are exactly the words that occur in the sequence • For amino acid sequences, the w-tuples are all those that score within a certain BLOSUM threshold of the actual w-tuples • Searches the database for sequences that have a high hit rate of exact matches with the w-tuples in the list
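For amino acid sequences, building the word list means enumerating the “neighborhood” of each w-tuple. The sketch below shows the idea on a toy four-letter alphabet with made-up similarity scores; these are not real BLOSUM values, and the names `toy_score`, `neighborhood` and `query_word_list` are hypothetical.

```python
from itertools import product

# Toy alphabet and similarity groups, for illustration only (NOT real BLOSUM values).
ALPHABET = "ILKR"
SIMILAR = {frozenset("IL"), frozenset("KR")}


def toy_score(a, b):
    """Identical letters score 4, similar letters 2, everything else -2."""
    if a == b:
        return 4
    return 2 if frozenset((a, b)) in SIMILAR else -2


def neighborhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    return sorted(
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(word))
        if sum(toy_score(a, b) for a, b in zip(word, cand)) >= threshold
    )


def query_word_list(query, w, threshold):
    """Union of the neighborhoods of every w-tuple in the query."""
    words = set()
    for i in range(len(query) - w + 1):
        words.update(neighborhood(query[i:i + w], threshold))
    return words


if __name__ == "__main__":
    print(neighborhood("IK", threshold=6))
    print(neighborhood("IK", threshold=4))
```

Lowering the threshold enlarges the word list, so more database words count as hits.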
Sensitivity, Selectivity and Stringency • The larger w is, the more selective the process • Adjusting the threshold value down increases the chances of hitting evolutionarily distant sequences (In BLAST this is done by adjusting the “Expect” value up) • BLASTp allows you to select your PAM or BLOSUM matrix, which also affects sensitivity, selectivity and stringency
Extend the hits • When a sequence has been identified that has good w-tuple hits • Use dynamic programming to extend the hits • Extension stops when there is no increase in score • There may be more than one area of sequence similarity – try to merge these
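As a rough illustration of hit extension, the sketch below extends a seed in both directions without gaps, which is simpler than the dynamic programming mentioned above. The `x_drop` parameter is an assumption of this example: with `x_drop=0`, extension stops as soon as the score falls below the best seen so far, and larger values tolerate a few mismatches before giving up.

```python
def extend_hit(s, t, i, j, w, match=1, mismatch=-1, x_drop=0):
    """Extend a w-tuple hit (s[i:i+w] == t[j:j+w]) in both directions, ungapped.

    Returns (score, start_in_s, end_in_s), with end_in_s exclusive.
    """
    def extend(si, tj, step):
        best = running = 0
        best_len = length = 0
        while 0 <= si < len(s) and 0 <= tj < len(t):
            running += match if s[si] == t[tj] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score fell too far below the best seen so far
            si += step
            tj += step
        return best, best_len

    right_score, right_len = extend(i + w, j + w, +1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    score = w * match + left_score + right_score
    return score, i - left_len, i + w + right_len


if __name__ == "__main__":
    # Seed "ACGT" at position 2 in both sequences.
    print(extend_hit("TTACGTACGG", "CCACGTACAA", i=2, j=2, w=4))
```

Merging nearby extended regions into one gapped alignment would be a separate step on top of this.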
Statistical Analysis • can be used on BLAST results • resulting in “E” values that give some indication of the quality of the match • BLAST results are usually sorted according to these “E” values
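One common statistical model for such scores is the Karlin-Altschul formula E = K · m · n · e^(-λS), where m and n are the query and database lengths and S is the raw alignment score. The sketch below applies it and sorts hits by E value; the `lam` and `k` defaults are placeholder values chosen for illustration, since the real parameters depend on the scoring matrix and gap penalties, and the hit names are invented.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.3, k=0.1):
    """Expected number of chance alignments scoring at least raw_score.

    lam and k are placeholders; real values depend on the scoring system.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def rank_hits(hits, query_len, db_len):
    """Sort (name, raw_score) pairs by increasing E value (best first)."""
    scored = [(name, expect_value(score, query_len, db_len)) for name, score in hits]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    hits = [("seqA", 40), ("seqB", 75), ("seqC", 55)]  # hypothetical hits
    for name, e in rank_hits(hits, query_len=250, db_len=1_000_000):
        print(f"{name}\tE = {e:.3g}")
```

Because E grows with the size of the search space, the same raw score is less significant in a larger database.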
Sample BLAST pages • (without graphics) • Main BLAST search page
Nucleotide BLAST Search • BLASTn • Note the ability to modify the word size and the Expect value
Protein BLAST Search • BLASTp • Note the choice of BLOSUM or PAM matrices • Note the options for penalizing gaps
BLAST Results Page • Results • Note the E-values
Conclusions • Exact pattern matching is easy and fast • Pairwise alignment with indels and similarity can be done, but not so quickly • Current computer technology is making the dynamic programming Smith-Waterman algorithm increasingly attractive • Pairwise alignment can be done quickly using BLAST or FASTA but may miss some alignments