300 likes | 740 Views
MSCS 230 Bioinformatics - BLAST and FASTA. 2. Overview. BLASTMathematical basis for central ideaModeling sequence alignmentsSteps of BLASTFASTAStepsFinding k-tupsBioPerl. MSCS 230 Bioinformatics - BLAST and FASTA. 3. BLAST (Mount, Chapter 7). Basic Local Alignment Search ToolAltschul et al.
E N D
1. BLAST and FASTA Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University
2. MSCS 230 Bioinformatics - BLAST and FASTA 2 Overview BLAST
Mathematical basis for central idea
Modeling sequence alignments
Steps of BLAST
FASTA
Steps
Finding k-tups
BioPerl
3. MSCS 230 Bioinformatics - BLAST and FASTA 3 BLAST (Mount, Chapter 7) Basic Local Alignment Search Tool
Altschul et al. 1990,1994,1997
Heuristic method for local alignment
Designed specifically for database searches
Idea: Good alignments contain short lengths of exact matches
4. MSCS 230 Bioinformatics - BLAST and FASTA 4 Mathematical Basis of “The Idea”(Mount, Chapter 3) Model matches as a sequence of coin tosses
Let p be the probability of a “head”
For a “fair” coin, p = 0.5
(Erdös-Rényi) If there are n throws, then the expected length R of the longest run of heads is
R = log1/p (n).
5. MSCS 230 Bioinformatics - BLAST and FASTA 5 Mathematical Basis of “The Idea”(Mount, Chapter 3) Example: Suppose n = 20 for a “fair” coin
R=log2(20)=4.32
Trick is how to model DNA (or amino acid) sequence alignments as coin tosses.
6. MSCS 230 Bioinformatics - BLAST and FASTA 6 Modeling Sequence Alignments To model random sequence alignments, replace a match with a “head” and mismatch with a “tail”.
For DNA, the probability of a “head” is 1/4
What is it for amino acid sequences?
7. MSCS 230 Bioinformatics - BLAST and FASTA 7 Modeling Sequence Alignments So, for one particular alignment, the Erdös-Rényi property can be applied
What about for all possible alignments?
Consider that sequences are being shifted back and forth, dot matrix plot
The expected length of the longest match is
R=log1/p(mn)
where m and n are the lengths of the two sequences.
8. MSCS 230 Bioinformatics - BLAST and FASTA 8 Modeling Sequence Alignments So suppose m = n = 10 and we’re looking at DNA sequences
R=log4(100)=3.32
This analysis makes assumptions about the base composition (uniform) and no gaps, but it’s a good estimate.
9. MSCS 230 Bioinformatics - BLAST and FASTA 9 Back to BLAST Steps of BLAST
Filter out low-complexity regions
where L is length, N is alphabet size, ni is the number of letter i appearing in sequence. Example: AAAT
K=1/4 log4(24/(3!*1!*0!*0!))=0.25
10. MSCS 230 Bioinformatics - BLAST and FASTA 10 Back to BLAST Steps of BLAST
Query words of length 3 (for proteins) or 11 (for DNA) are created from query sequence using a sliding window
Expected run length in sequences of ~90 for proteins and 64 for DNA.
The values 90 and 64 can easily be obtained from the expected run length formula shown earlier.The values 90 and 64 can easily be obtained from the expected run length formula shown earlier.
11. MSCS 230 Bioinformatics - BLAST and FASTA 11 BLAST Steps of BLAST
Using BLOSUM62 (for proteins) or scores of +5/-4 (DNA, PAM40), score all possible words of length 3 or 11 respectively against a query word.
Select a neighborhood word score threshold (T) so that only most significant sequences are kept. Approximately 50 hits per query word.
Repeat 3 and 4 for each query word in step 2. Total number of high scoring words is approximately 50 * sequence length.
12. MSCS 230 Bioinformatics - BLAST and FASTA 12 BLAST Steps of BLAST
Organize the high-scoring words into a search tree
13. MSCS 230 Bioinformatics - BLAST and FASTA 13 BLAST Steps of BLAST
Scan each database sequence for match to high-scoring words. Each match is a seed for an ungapped alignment.
14. MSCS 230 Bioinformatics - BLAST and FASTA 14 BLAST Steps of BLAST
(Original BLAST) extend matching words to the left and right using ungapped alignments. Extension continues as long as score increases or stays same. This is a HSP (high scoring pair).
(BLAST2) Matches along the same diagonal (think dot plot) within a distance A of each other are joined and then the longer sequence extended as before. (Requires lower T)
15. MSCS 230 Bioinformatics - BLAST and FASTA 15 BLAST Steps of BLAST
Using a cutoff score S, keep only the extended matches that have a score at least S.
Determine statistical significance of each remaining match (from last time).
Try to extend the HSPs if possible.
Show Smith-Waterman local alignments.
16. MSCS 230 Bioinformatics - BLAST and FASTA 16 FASTA Find k-tups in the two sequences (k=1 for proteins, 4-6 for DNA sequences)
Score and select top 10 scoring “local diagonals”
For proteins, each k-tup found is scored using the PAM250 matrix
For DNA, the number of k-tups found
Penalize intervening gaps
17. MSCS 230 Bioinformatics - BLAST and FASTA 17 Finding k-tups
18. MSCS 230 Bioinformatics - BLAST and FASTA 18 FASTA
19. MSCS 230 Bioinformatics - BLAST and FASTA 19 FASTA Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores.
Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score
Score of 28 for a query sequence of length 200 and k=2.
20. MSCS 230 Bioinformatics - BLAST and FASTA 20 FASTA After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.
21. MSCS 230 Bioinformatics - BLAST and FASTA 21 BioPerl Using the Bio::Tools::Run::RemoteBLAST module
Execute a remote BLAST search
Iterate over each HSP match
Using the Bio::Tools::pSW module
Perform a local sequence alignment of two sequences
22. MSCS 230 Bioinformatics - BLAST and FASTA 22 Bio::Tools::Run::RemoteBLAST Results accessed by either Bio::Tools::BPlite or Bio::SearchIO, depends on readmethod setting
In this example, Bio::SearchIO is used (which is the default)
23. MSCS 230 Bioinformatics - BLAST and FASTA 23 Example Script
24. MSCS 230 Bioinformatics - BLAST and FASTA 24 Example Script
25. MSCS 230 Bioinformatics - BLAST and FASTA 25 Bio::Tools::pSW
26. MSCS 230 Bioinformatics - BLAST and FASTA 26 Bio::Tools::pSW