CS 6990 Bioinformatics
BLAST
Fall 2003
Dr. Susan Bridges
Department of Computer Science and Engineering Bioinformatics
Overview
• BLAST: Basic Local Alignment Search Tool
• BLAST is a collection of programs
• Developed by Altschul et al.
• Simplification of the Smith-Waterman dynamic programming algorithm
• Like FASTA, it looks for exact matches of short words
• Unlike FASTA, it scores using all values in a similarity matrix
BLAST Terminology
• Segment: a substring of a sequence
• Segment pair of two sequences: a pair of segments of the same length (no gaps), one from each sequence
• w-mer: a substring (or word) of w characters
Goal
• Form a gapless alignment for a segment pair and score the alignment using an amino acid substitution matrix.
• Example (using PAM120):
     K   A   L   M   R
     V   A   K   N   S
    -4   3  -4  -3  -1
  Total score of alignment = -9
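
The scoring in the example above can be sketched in a few lines of Python. This is only an illustration: `PAM120_SUBSET` holds just the five residue-pair scores shown on the slide (a real search would load the full matrix), and the function names are made up for this sketch.

```python
# Only the five PAM120 pair scores that appear in the slide example.
# A real search would use the complete 20x20 matrix.
PAM120_SUBSET = {
    ("K", "V"): -4,
    ("A", "A"): 3,
    ("L", "K"): -4,
    ("M", "N"): -3,
    ("R", "S"): -1,
}


def pair_score(a, b, matrix):
    """Look up the score for residues a and b, treating the matrix as symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def segment_pair_score(seg1, seg2, matrix):
    """Score a gapless alignment of two equal-length segments."""
    if len(seg1) != len(seg2):
        raise ValueError("segments in a segment pair must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seg1, seg2))


example_score = segment_pair_score("KALMR", "VAKNS", PAM120_SUBSET)
print(example_score)  # -9, matching the total on the slide
```
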
Steps in the Algorithm
• Compile a list of high-scoring words in the query sequence
• Find matches in the db for each high-scoring word and its synonyms
• For each match in the db, extend the alignment in both directions
Step 0 (optional)
• Filter low-complexity regions and repeats.
• Filtering is applied to the query sequence, not the db sequences. Filtered regions are marked with an X in protein sequences and an N in nucleotide sequences and are then ignored by BLAST.
• This makes the search focus on the more important parts of the sequence.
Step 1
• Compile a list of high-scoring words in the query sequence
• Defaults are w=3 for proteins and w=11 for nucleic acid sequences
• For a query of length n, the total number of words is n-w+1
• Each candidate word has a score t against a query word, computed using the scoring matrix
• Threshold T: a word whose score t against a query word is above T is treated as a synonym of that query word (T is called the neighborhood word score threshold)
Step 1 Example (w=2)
Adipokinetic hormone II of migratory locust
Sequence: q l n f s a g w
Words: ql ln nf fs sa ag gw
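
A minimal sketch of the word list for this example, assuming the query is the eight-residue sequence shown above. The function name `query_words` is made up for illustration.

```python
def query_words(sequence, w):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


words = query_words("qlnfsagw", 2)
print(words)       # ['ql', 'ln', 'nf', 'fs', 'sa', 'ag', 'gw']
print(len(words))  # 7, which is n - w + 1 = 8 - 2 + 1
```
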
Step 1 continued
• Find all words in the db that are synonyms of the high-scoring query words
Example continued (T=8, PAM120 Scoring Matrix)
Another Example
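
The synonym (neighborhood word) search in Step 1 can be sketched as a brute-force scan over all possible words. Note the assumptions: `toy_score` is a made-up scoring function, NOT PAM120, and the "similar" residue pairs are hypothetical; only the threshold T=8 and the rule "score above T" come from the slides.

```python
from itertools import product

AMINO_ACIDS = "acdefghiklmnpqrstvwy"

# Hypothetical "similar" residue pairs for the toy scoring below (not PAM120).
SIMILAR = {frozenset("qe"), frozenset("li"), frozenset("lv")}


def toy_score(a, b):
    """Toy substitution score: identical 5, 'similar' 4, anything else -2."""
    if a == b:
        return 5
    if frozenset((a, b)) in SIMILAR:
        return 4
    return -2


def neighborhood(word, threshold):
    """All words of the same length whose score t against `word` is above T."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        t = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if t > threshold:  # the slides say "above T"
            hits.append("".join(candidate))
    return hits


synonyms = neighborhood("ql", 8)
print(synonyms)  # ['el', 'qi', 'ql', 'qv'] under the toy scores
```

Enumerating all 20^w candidates is fine for w=2 or w=3; this sketch makes no attempt to be efficient.
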
Step 2 (original BLAST)
• For each word or synonym from the query sequence, search each db sequence for hits
• Each hit is considered a seed alignment and is extended in both directions as long as the cumulative score can be increased. Extension is halted when one of the following occurs:
  • The cumulative alignment score falls off by the quantity X from its maximum achieved value
  • The cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments
  • The end of either sequence is reached
• High-scoring segment pairs are called HSPs
• The highest-scoring segment pair in the pairwise comparison of the query sequence and a db sequence is referred to as the maximal-scoring segment pair (MSP)
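
The three stopping rules above can be sketched as an ungapped extension routine. This is an illustrative sketch, not the BLAST source: the function names are made up, the example sequences are invented, and the toy scoring (+5 match, -4 mismatch) is used only to keep the arithmetic easy to follow.

```python
def extend_one_direction(query, subject, q_start, s_start, step,
                         score_fn, x_drop, start_score):
    """Extend without gaps in one direction (step = +1 or -1).

    Returns (best cumulative score, number of residues kept).
    """
    best = running = start_score
    best_len = length = 0
    q, s = q_start, s_start
    # Stop rule 3: the end of either sequence is reached.
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score_fn(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        # Stop rule 2: cumulative score is zero or below.
        # Stop rule 1: score has fallen X below its maximum.
        if running <= 0 or best - running >= x_drop:
            break
        q += step
        s += step
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, w, score_fn, x_drop):
    """Extend a word hit of length w; returns (score, q_start, q_end, s_start)."""
    seed = sum(score_fn(a, b) for a, b in
               zip(query[q_pos:q_pos + w], subject[s_pos:s_pos + w]))
    score, right = extend_one_direction(
        query, subject, q_pos + w, s_pos + w, +1, score_fn, x_drop, seed)
    score, left = extend_one_direction(
        query, subject, q_pos - 1, s_pos - 1, -1, score_fn, x_drop, score)
    return score, q_pos - left, q_pos + w + right, s_pos - left


def dna_score(a, b):
    """Toy scoring: +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


query = "GGACGTACGTCC"
subject = "TTACGTACGTAA"
hsp = extend_hit(query, subject, q_pos=4, s_pos=4, w=4,
                 score_fn=dna_score, x_drop=8)
print(hsp)                   # (40, 2, 10, 2)
print(query[hsp[1]:hsp[2]])  # ACGTACGT
```
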
Step 2 (BLAST2 or Gapped BLAST)
• Uses a lower value for T in the previous step to give a longer word list
• Uses short matched sequences on the same diagonal within distance A of each other as starting points for a longer ungapped alignment. Joined regions are extended as before, allowing small gaps.
[Figure: dot plot of word hits (x) along a diagonal; axes labeled "Database sequence" and "Query sequence"]
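
A rough sketch of the "same diagonal within distance A" test. The hit coordinates and the function name `two_hit_seeds` are invented for illustration, and the sketch ignores details such as requiring the two hits not to overlap.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance):
    """Find consecutive word hits on the same diagonal within max_distance.

    hits: iterable of (query_pos, db_pos) word-hit start positions.
    Returns a list of (diagonal, first_query_pos, second_query_pos).
    """
    by_diagonal = defaultdict(list)
    for q_pos, d_pos in hits:
        # Hits on the same diagonal share the same db_pos - query_pos offset.
        by_diagonal[d_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, q_positions in sorted(by_diagonal.items()):
        q_positions.sort()
        for first, second in zip(q_positions, q_positions[1:]):
            if 0 < second - first <= max_distance:
                seeds.append((diagonal, first, second))
    return seeds


# Invented hits: four on diagonal 3, one on diagonal 16.
example_hits = [(2, 5), (10, 13), (4, 20), (30, 33), (80, 83)]
seeds = two_hit_seeds(example_hits, max_distance=40)
print(seeds)  # [(3, 2, 10), (3, 10, 30)]
```
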
Step 3
• The HSPs from the entire database search are compared to a cutoff score S; those scoring greater than S are listed.
Step 4
• Statistical significance calculations are done for each HSP score.
Step 5
• Alignment of the segments is done using an efficient version of dynamic programming that divides the task into subalignments based on the HSPs in the sequences
• The alignment score is obtained
• The E() value for this score is calculated
• If the calculated E() for the database sequence meets the user-given E() for the program, this score is reported
BLAST output
• The list of hits
• Database accession codes, name, description, and general information about each hit
• Score in bits: the alignment score expressed in units of information
• Expectation value E()
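
The slides do not give formulas for the bit score or E(). As a sketch, the standard Karlin-Altschul relations are E = K·m·n·e^(-λS) and S' = (λS - ln K) / ln 2, where S is the raw score, m and n are the query and database lengths, and λ and K depend on the scoring system. The λ and K values below are assumed purely for illustration, the lengths are invented, and edge-effect corrections used by real BLAST are ignored.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance HSPs scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def expect_from_bits(bits, query_len, db_len):
    """Same E-value computed from the bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Assumed illustrative parameters (not taken from the slides).
LAMBDA, K = 0.318, 0.134
m, n = 250, 50_000_000  # invented query and database lengths

bits = bit_score(100, LAMBDA, K)
e_raw = expect_value(100, m, n, LAMBDA, K)
e_bits = expect_from_bits(bits, m, n)
print(bits, e_raw, e_bits)  # the two E-values agree up to rounding
```

Bit scores are convenient because they fold λ and K into the score, so scores obtained with different matrices can be compared directly.
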
BLAST programs
• BLASTN: nucleotide query searching a nucleotide db
• BLASTP: protein query searching a protein db
• BLASTX: translated nucleotide query (6 frames) searching a protein db
• TBLASTN: protein query searching a translated nucleotide (6 frames) db
• TBLASTX: translated nucleotide query (6 frames) searching a translated nucleotide (6 frames) db
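
"6 frames" means three reading frames on each strand. A minimal sketch of six-frame translation with the standard genetic code follows; the function names and the example sequence are made up, and incomplete or ambiguous codons are simply rendered as X.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frames("ATGGCCAAATGA")
print(frames[1])   # MAK*
print(frames[-1])  # SFGH
```
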
Scoring Schemes
• The default scoring matrix for blastp, blastx, tblastn, and tblastx is BLOSUM62
• BLOSUM62 is considered a good general-purpose scoring matrix
• PAM matrices are also provided in the BLAST distribution. When the distance is unknown, Altschul (1991, 1992) recommends trying searches with at least PAM40, PAM120, and PAM250.
Scoring Schemes for BLASTN
• The value used for w is 11, which only allows detection of moderately diverged homologs
• Two parameters, M and N, are used for scoring nucleotide sequence matches
• The reward for a match is M (must be positive)
• The penalty for a mismatch is N (must be negative)
• Default values are M = 5 and N = -4, a ratio M/|N| of 1.25
• Values of M and N with a ratio of 3.0 or greater are not allowed
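
A small sketch of match/mismatch scoring with the default M and N from the slide. The function name and the example segments are invented, and the ratio restriction mentioned above is not enforced here.

```python
def nucleotide_score(seg1, seg2, match=5, mismatch=-4):
    """Score a gapless nucleotide alignment with reward M and penalty N."""
    if match <= 0:
        raise ValueError("M (match reward) must be positive")
    if mismatch >= 0:
        raise ValueError("N (mismatch penalty) must be negative")
    if len(seg1) != len(seg2):
        raise ValueError("segments must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(seg1, seg2))


score = nucleotide_score("ACGTAC", "ACGAAC")
print(score)        # 21: five matches (5 * 5) and one mismatch (-4)
print(5 / abs(-4))  # 1.25, the default ratio of M to |N|
```
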
Genetic Codes
• Default is the Standard or Universal Code (1)
• Other alternatives are available with the dbgcode and C parameters
  • Vertebrate mitochondrial (2)
  • Yeast mitochondrial (3)
  • Mold, protozoan, coelenterate mitochondrial and mycoplasma/spiroplasma (4)
  • Invertebrate mitochondrial (5)
  • Ciliate macronuclear (6)
  • Echinoderm mitochondrial (9)
  • Alternative ciliate macronuclear (10)
  • Eubacterial (11)
  • Alternative yeast (12)
  • Ascidian mitochondrial (13)
  • Flatworm mitochondrial (14)
References
• Setubal and Meidanis, Introduction to Computational Molecular Biology
• NCBI Education Pages, http://www.ncbi.nih.gov/Education/BLASTinfo/BLAST_algorithm.html
• Weizmann Institute of Science, http://bioportal.weizmann.ac.il/course/introbioinfo/
• Computers and the Human Genome Project, http://www-cse.stanford.edu/classes/sophomore-college/projects-00/computers-and-the-hgp/BLAST.html
• The BLAST Help Manual, http://www.ncbi.nlm.nih.gov/BLAST/blast_help.shtml