
BLAST (I)

BLAST (I). Basic Local Alignment Search Tool. 范振業, ITRI biomedical institute (工研院生醫所). Email: jimfann@itri.org.tw. 02/26/2008. Reference & Sources. Jian Ye, Scott McGinnis, and Thomas L. Madden (2006) "BLAST: improvements for better sequence analysis" Nucleic Acids Res. July 1; 34 (Web Server issue): W6-W9


Presentation Transcript


  1. BLAST (I): Basic Local Alignment Search Tool. 范振業, ITRI biomedical institute (工研院生醫所). Email: jimfann@itri.org.tw. 02/26/2008

  2. Reference & Sources
     • Jian Ye, Scott McGinnis, and Thomas L. Madden (2006) "BLAST: improvements for better sequence analysis." Nucleic Acids Res. Jul 1; 34 (Web Server issue): W6-W9.
     • McGinnis S, Madden TL. (2004) "BLAST: at the core of a powerful and diverse set of sequence analysis tools." Nucleic Acids Res. Jul 1; 32 (Web Server issue): W20-5.
     • Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
     • Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.
     • http://www.ncbi.nlm.nih.gov/BLAST/
     • ftp://ftp.ncbi.nih.gov/blast/
     • Joseph Bedell, Ian Korf, Mark Yandell (2003) BLAST. O'Reilly [http://www.oreilly.com/catalog/blast/]
     • Jonathan Pevsner (2003) Bioinformatics and Functional Genomics. John Wiley & Sons, Inc. [http://www.bioinfbook.org]

  3. Why use BLAST? • To discover functional, structural and evolutionary similarities • Because “similarity” may be an indicator of “homology” and thus provide some insight into function or gene identification. • Applications include • identifying orthologs and paralogs • discovering new genes or proteins • discovering variants of genes or proteins • investigating expressed sequence tags (ESTs) • exploring protein structure and function

  4. http://www.ncbi.nlm.nih.gov/BLAST/

  5. BLAST

  6. Format for query sequence
     • FASTA (Pearson and Lipman, 1988): a ">" line holding the key and the description, followed by the sequence
         >gi|129295|sp|P01013|OVAX_CHICK Gene X protein (Ovalbumin-related)
         QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE ...
     • Bare sequence
         QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE …
     • Identifiers:
       • accession number: (P01013)
       • accession number + version code: (AAA68881.1)
       • gi: (129295, gi|129295)
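A minimal sketch of reading the FASTA layout shown on this slide. The function names (`parse_fasta`, `_record`) and the `EXAMPLE` constant are hypothetical helpers written for illustration; the example record reuses the slide's definition line and the visible part of its sequence only.

```python
def _record(header, chunks):
    # The key is everything up to the first space; the rest is the description.
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record in text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


EXAMPLE = """>gi|129295|sp|P01013|OVAX_CHICK Gene X protein (Ovalbumin-related)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE
"""

records = list(parse_fasta(EXAMPLE))
ident, desc, seq = records[0]
fields = ident.split("|")   # ['gi', '129295', 'sp', 'P01013', 'OVAX_CHICK']
print(fields[1], fields[3], len(seq))
```

Splitting the key on "|" recovers the gi number and the accession that the slide lists as alternative identifiers.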

  7. IUPAC Nucleic acid codes

  8. IUPAC Amino acid codes

  9. Peptide Sequence Databases (FASTA format)

  10. Nucleotide Sequence Databases

  11. BLAST - Algorithm parameters

  12. SEG* filtering of low-complexity segments *Wootton, J.C. & Federhen, S. (1993) "Statistics of local complexity in amino acid sequences and sequence databases." Comput. Chem. 17:149-163. [Ftp site: ftp://ftp.ncbi.nih.gov/pub/seg/seg/] From: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Seg.html

  13. BLAST: background on sequence alignment There are two main approaches to sequence alignment: [1] Global alignment (Needleman & Wunsch 1970) using dynamic programming to find optimal alignments between two sequences. (Although the alignments are optimal, the search is not exhaustive.) Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence “global”). From: www.bioinfbook.org/

  14. BLAST: background on sequence alignment [2] The second approach is local sequence alignment (Smith & Waterman, 1981). The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences. S-W is guaranteed to find optimal alignments, but it is computationally expensive (requires O(n²) time). BLAST and FASTA are heuristic approximations to local alignment. Each requires only O(n²/k) time; they examine only part of the search space. From: www.bioinfbook.org/
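To make the quadratic cost of Smith-Waterman concrete, here is a minimal sketch that fills the dynamic-programming table and returns only the best local score. The scoring values (match = 2, mismatch = -1, linear gap = -1) and the function name are assumptions chosen for illustration, not parameters from the slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACACACTA", "AGCACACA"))  # 12
```

Every cell of the len(a) × len(b) table is visited once, which is where the O(n²) time comes from; BLAST avoids most of these cells by extending only from word hits.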

  15. How a BLAST search works “The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990) From: www.bioinfbook.org/

  16. Pairwise alignment scores are determined using a scoring matrix such as Blosum62 From: www.bioinfbook.org/

  17. How the original BLAST algorithm works: three phases
     Phase 1 (Seeding) <nucleotide: perfect word match>: compile a list of word pairs (w=3) above threshold T
     Example: for a human RBP (retinol binding protein) query …FSGTWYA… (the query word is shown in red on the slide)
     A list of words (w=3) is:
         FSG SGT GTW TWY WYA
         YSG TGT ATW SWY WFA
         FTG SVT GSW TWF WYS
     From: www.bioinfbook.org/

  18. Phase 1: Seeding. Compile a list of words (w=3); threshold T=11
     Word   Per-position scores   Total
     GTW    6, 5, 11              22      neighborhood word hits above threshold
     GSW    6, 1, 11              18
     ATW    0, 5, 11              16
     NTW    0, 5, 11              16
     GTY    6, 5, 2               13
     GNW                          10      neighborhood word hits below threshold
     GAW                          9
     From: www.bioinfbook.org/
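The thresholding on this slide can be sketched as follows. The score table contains only the substitution scores that can be read off the slide's per-position values for the query word GTW; the function names are hypothetical, and "above threshold" is taken to mean a score of at least T, following the Altschul et al. (1990) wording quoted earlier.

```python
# Substitution scores read off the slide's per-position values (query word GTW).
SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def pair_score(x, y):
    """Look up a symmetric substitution score."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def word_score(query_word, candidate):
    """Sum the per-position scores of two equal-length words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep candidate words whose score against the query word is at least T."""
    scored = [(w, word_score(query_word, w)) for w in candidates]
    return [(w, s) for w, s in scored if s >= threshold]


hits = neighborhood("GTW", ["GTW", "GSW", "ATW", "NTW", "GTY"], threshold=11)
print(hits)
```

Raising the threshold shrinks the list (at T=17 only GTW and GSW survive), which is the speed-versus-sensitivity trade-off controlled by T.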

  19. How a BLAST search works: 3 phases Phase 1: Seeding You can modify the threshold parameter. The default value for blastp is 11. To change it, enter “-f 16” or “-f 5” in the advanced options. Scan the database for entries that match the compiled list. This is fast and relatively easy. From: www.bioinfbook.org/

  20. How a BLAST search works: 3 phases
     • Phase 2: Extension
     • When you find a hit (i.e. a match between a "word" and a database entry), extend the hit in either direction.
     • Keep track of the score (use a scoring matrix).
     • Stop when the score drops below some cutoff.
         KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
         MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
     The hit (GTW) is extended to the left and to the right.
     From: www.bioinfbook.org/

  21. How a BLAST search works: 3 phases. Phase 2: Extension
         The quick brown fox jumps over the lazy dog.
         The quiet brown cat purrs when she sees him.
     Scoring matrix: match = 1; mismatch = -1; seed = "T"
         The quick brown fox jump
         The quiet brown cat purr
         123 45654 56789 876 5654  <- score
         000 00012 10000 123 4345  <- drop-off score
     The drop-off score first reaches 2 at the end of "quick"/"quiet" and first reaches 5 at the end of "jump"/"purr"; these are the points where extension stops for drop-off thresholds of 2 and 5.
     From: Korf et al. 2003
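The extension rule in this example can be sketched as a short loop. Two assumptions are made here: spaces are stripped before scoring (the slide leaves them unscored), and extension stops as soon as the drop from the best score is at least X. The function name `extend_right` is hypothetical.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Ungapped extension to the right; returns (best score, best-scoring prefix)."""
    score = best = best_len = 0
    for i, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop:
            break  # the score has dropped X below the running maximum
    # The reported segment is trimmed back to where the maximum was reached.
    return best, query[:best_len]


q = "The quick brown fox jumps over the lazy dog.".replace(" ", "")
s = "The quiet brown cat purrs when she sees him.".replace(" ", "")
print(extend_right(q, s, x_drop=5))  # (9, 'Thequickbrown')
print(extend_right(q, s, x_drop=2))  # (6, 'Thequi')
```

With the larger drop-off the extension survives the "quick"/"quiet" mismatches and reaches "brown"; with the smaller one it gives up early, which is faster but less sensitive.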

  22. How a BLAST search works: 3 phases Phase 2: Extension In the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly speeding the time required for a search. -y Dropoff (X) for blast extensions in bits (default if zero) default = 20 for blastn, 7 for other programs From: www.bioinfbook.org/

  23. How a BLAST search works: 3 phases • Phase 3: Evaluation of HSP • High-scoring segment pair (HSP) • Alignment threshold (set by software) • in Footer: • S1: ungapped • S2: gapped • Final alignment threshold (set as -e)

  24. Nucleotide BLAST • Megablast is intended for comparing a query to closely related sequences; it is very fast and works best when the target percent identity is 95% or more. • Discontiguous megablast uses an initial seed that ignores some bases (allowing mismatches) and is intended for cross-species comparisons. • BlastN is slower, but allows a word size down to seven bases.

  25. Nucleotide BLAST - seeding • Exact word match for: • Blastn => word length = 11 • Megablast => word length = 28 • Discontiguous megablast • Template length: 16, 18, 21. • Word size (i.e. number of 1s in the template): 11, 12 • Template type: coding, non-coding. • Require two words for extension: yes/no.
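Exact-word seeding for nucleotide searches can be sketched with a simple lookup table of subject words. This is an illustration of the idea only, not NCBI's implementation; the function names and the toy sequences are made up for the example.

```python
from collections import defaultdict


def index_words(subject, w):
    """Map every word of length w in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def exact_word_hits(query, subject, w=11):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    index = index_words(subject, w)
    hits = []
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            hits.append((q, s))
    return hits


subject = "TTTTACGTACGTAGGCC"
query = "GGACGTACGTAGGA"
print(exact_word_hits(query, subject, w=11))  # [(2, 4)]
print(exact_word_hits(query, subject, w=28))  # []
```

A longer word size (28 for Megablast) produces far fewer seeds, which is why it is fast but only suitable for closely related sequences.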

  26. Scoring Matrix* *Ftp site: ftp://ftp.ncbi.nih.gov/blast/matrices/ • Simple scoring system ftp://ftp.ncbi.nih.gov/blast/matrices/IDENTITY ftp://ftp.ncbi.nih.gov/blast/matrices/MATCH

  27. Scoring Matrix • Conservative amino acid substitutions are those between residues with similar physicochemical properties • Isoleucine for Valine (both aliphatic, hydrophobic) • Serine for Threonine (both polar) • ... [Figure: Venn diagram grouping the amino acids by property: tiny, small, aliphatic, aromatic, hydrophobic, polar, charged, positive] From: www.sanbi.ac.za/

  28. Scoring Matrix • => Substitution matrix • to increase sensitivity of the alignment algorithm • flexible lookup scheme for any pair of amino acids • PAM, BLOSUM • Calculating Similarity Scores (log-odds scores)

  29. Scoring - (BLOSUM 62) From: http://blast.wustl.edu/doc/infotheory.html

  30. PAM (Percent Accepted Mutations) matrices • Derived from global alignments of protein families. Family members share at least 85% identity (Dayhoff et al., 1978). • Construction of phylogenetic tree and ancestral sequences of each protein family • Computation of number of replacements for each pair of amino acids From: www.sanbi.ac.za/; http://www.sdsc.edu/~babu/UCSD/week02/dbSearch_tut.html

  31. PAM (Percent Accepted Mutations) matrices • The numbers of replacements were used to compute a so-called PAM-1 matrix. • The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix [matrix multiplication using PAM-1]. • PAM250 = 250 mutations per 100 residues. • Family of matrices: PAM10 … PAM200 [greater numbers mean bigger evolutionary distance]. From: www.sanbi.ac.za/
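The extrapolation by matrix multiplication can be sketched with a toy example. The 3-letter `TOY_PAM1` matrix below is invented purely for illustration (each row keeps its residue with probability 0.99); the real PAM-1 matrix is 20×20 and its values are not reproduced here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in matrix]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical mutation probabilities: row i gives P(residue i -> residue j).
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

Each row still sums to 1 after 250 steps, but the diagonal is much smaller than 0.99: at large evolutionary distances a residue is far less likely to be unchanged.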

  32. PAM Matrices • If changes were purely random • Frequency of each possible substitution is proportional to background frequencies • In related proteins: • Observed substitution frequencies, called the target (replacement) frequencies, are biased toward those that do not seriously disrupt the protein's function • These point mutations are "accepted" during evolution • Log-odds approach: • Scores proportional to the natural log of the ratio of target frequencies to background frequencies From: http://omega.cbmi.upmc.edu/~vanathi/
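The log-odds idea can be written in a few lines. The frequencies below are made-up numbers for illustration, and the scale factor `lam` is an assumed parameter standing in for the proportionality constant; only the form ln(target / (background_i × background_j)) comes from the slide.

```python
import math


def log_odds_score(q_ij, p_i, p_j, lam=1.0):
    """Log-odds score: ln(target frequency / chance frequency), scaled by 1/lam."""
    return math.log(q_ij / (p_i * p_j)) / lam


# Hypothetical background frequencies of 0.1 each, so the chance frequency is 0.01.
print(log_odds_score(0.04, 0.1, 0.1))    # pair seen 4x more often than chance: positive
print(log_odds_score(0.0025, 0.1, 0.1))  # pair seen 4x less often than chance: negative
```

Substitutions observed more often than chance get positive scores and rare ones get negative scores, so summing scores along an alignment adds up the evidence for relatedness.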

  33. PAM Matrices: salient points • Derived from global alignments of closely related sequences. • Matrices for greater evolutionary distances are extrapolated from those for lesser ones. • The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances. • Does not take into account different evolutionary rates between conserved and non-conserved regions. From: http://omega.cbmi.upmc.edu/~vanathi/

  34. BLOSUM Matrices • Henikoff, S. & Henikoff J.G. (1992) • Use blocks of protein sequence fragments from different families (the BLOCKS database) • Amino acid pair frequencies calculated by summing over all possible pairs in block • Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over particular threshold = same cluster) From: http://omega.cbmi.upmc.edu/~vanathi/
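Counting amino acid pairs "over all possible pairs in block" can be sketched as below. The three-sequence `TOY_BLOCK` is invented for illustration, and the clustering step that down-weights similar sequences is deliberately left out, so this is only the raw counting stage.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical block: three aligned fragments of equal length, no gaps.
TOY_BLOCK = ["AVL", "AIL", "SVL"]

pair_counts = count_pairs(TOY_BLOCK)
print(pair_counts)
```

Dividing each count by the total number of pairs gives the observed (target) frequencies that feed into the log-odds calculation.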

  35. BLOSUM Matrices • Probabilities estimated from blocks of sequence fragments • Blocks represent structurally conserved regions • Target frequencies are identified directly • Sequences more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence • BLOSUM 50: >= 50% identity • BLOSUM 62: >= 62% identity From: http://omega.cbmi.upmc.edu/~vanathi/

  36. BLOSUM Matrices From: http://www.sdsc.edu/~babu/UCSD/week02/dbSearch_tut.html

  37. BLOSUM Matrices - Summary • Derived from local, ungapped alignments of distantly related sequences • All matrices are directly calculated • The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances. • The BLOSUM series of matrices generally perform better than PAM matrices for local similarity searches From: http://omega.cbmi.upmc.edu/~vanathi/lec5fall02.ppt

  38. Comparable BLOSUM and PAM Matrices • Relative entropy: the average information available per alignment position to distinguish relevant alignments from alignments expected by chance. From: http://www.sdsc.edu/~babu/UCSD/week02/dbSearch_tut.html

  39. Gap Penalties
     Linear gap penalty score: W(k) = -bk
     Affine gap penalty score: W(k) = -(a + bk)
     W(k) = gap penalty score of a gap of length k; a = gap opening penalty; b = gap extension penalty; k = gap length
         Query: 85  ADDGCPKPPEIAHGYVEHSVRYQCKNYYKLRTEGDG------VYTLNNEKQWINKAVGDK 138
                    ADDGCPKPP+IAHGYVEHSVRYQCKNYYKLRTEGDG      VYTLNNEKQWINKAVGDK
         Sbjct: 62  ADDGCPKPPQIAHGYVEHSVRYQCKNYYKLRTEGDGKMWTTRVYTLNNEKQWINKAVGDK 121
     From: http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html
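The two penalty schemes can be compared on the six-residue gap in the alignment above. The values a = 11 and b = 1 used here are assumptions for illustration (they are commonly quoted as protein BLAST gap costs but are not stated on this slide), and the function names are hypothetical.

```python
import re


def linear_gap(k, b):
    """Linear penalty: every gap position costs b."""
    return -b * k


def affine_gap(k, a, b):
    """Affine penalty: opening cost a plus b per gap position."""
    return -(a + b * k)


def gap_lengths(aligned_row):
    """Lengths of each run of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_row)]


query_row = "ADDGCPKPPEIAHGYVEHSVRYQCKNYYKLRTEGDG------VYTLNNEKQWINKAVGDK"
for k in gap_lengths(query_row):
    print(k, linear_gap(k, b=1), affine_gap(k, a=11, b=1))  # 6 -6 -17
```

Under the affine scheme one gap of length 6 costs 17, whereas six separate single-residue gaps would cost 72, so alignments with a few long gaps are preferred over many scattered ones.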

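  40. Statistics of BLAST searches
     Karlin-Altschul equation: $E = K m n e^{-\lambda S}$, where $S$ is the raw score, $m$ and $n$ are the query and database lengths, and $K$ and $\lambda$ are statistical parameters of the scoring system.
     Normalized score to bit score: $S' = (\lambda S - \ln K) / \ln 2$, so that $E = m n 2^{-S'}$.
     E-value & p-value: $P = 1 - e^{-E}$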
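A small sketch of the standard Karlin-Altschul relations, E = K·m·n·e^(−λS) and bit score S' = (λS − ln K) / ln 2, which give E = m·n·2^(−S'). The values of `lam`, `k`, the raw score, and the sequence lengths below are illustrative assumptions only, not figures from the slides.

```python
import math


def bit_score(raw, lam, k):
    """Convert a raw alignment score to a bit score."""
    return (lam * raw - math.log(k)) / math.log(2)


def evalue_from_raw(raw, m, n, lam, k):
    """Expected number of chance hits with score >= raw in an m x n search space."""
    return k * m * n * math.exp(-lam * raw)


def evalue_from_bits(bits, m, n):
    """Same E-value, computed from the bit score alone."""
    return m * n * 2.0 ** (-bits)


# Illustrative parameters (assumed): query of 250 residues, database of 1e6 residues.
lam, k = 0.267, 0.041
bits = bit_score(100, lam, k)
print(round(bits, 1))
print(evalue_from_raw(100, 250, 1e6, lam, k))
print(evalue_from_bits(bits, 250, 1e6))
```

The two E-value routes agree, which is the point of bit scores: they fold λ and K into the score so results from different scoring systems can be compared directly.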

  41. BLAST: E values and p values. Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values.
     E        p
     10       0.99995460
     5        0.99326205
     2        0.86466472
     1        0.63212056
     0.1      0.09516258 (about 0.1)
     0.05     0.04877058 (about 0.05)
     0.001    0.00099950 (about 0.001)
     0.0001   0.0001000
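The table follows from P = 1 − e^(−E) and can be reproduced with a few lines. The function name `p_value` is a hypothetical helper; `math.expm1` is used so that very small E values do not lose precision.

```python
import math


def p_value(e_value):
    """Probability of at least one chance hit, given the expected number E."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8} {p_value(e):.8f}")
```

For small E the two numbers are nearly identical, while for E above 1 the p value saturates near 1 and stops being informative, which is why BLAST reports E.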

  42. BLAST - Report Format

  43. BLAST Report: Header, Body, Footer (Bedell et al. 2003)

  44. Header

  45. Body: Graphical Overview

  46. Body: One-line summaries [# set by -v]

  47. Body: Alignments [# set by -b] [view set by -m ?] • ALIGNMENT_VIEW - Choose how to view alignments. • Pairwise • Pairwise with dots for identities • Query-anchored with dots for identities • Query-anchored with letters for identities • Flat query-anchored with dots for identities • Flat query-anchored with letters for identities • Hit Table • The default "pairwise" view shows how each subject sequence aligns individually to the query sequence. • The "query-anchored" view shows how all subject sequences align to the query sequence. • For each view type, you can choose to show "identities" (matching residues) as letters or dots.

  48. Alignments Views - pairwise [set by -m 0]

  49. Alignments Views - Query-anchored with dots for identities [set by -m 1]

  50. Alignments Views – Query-anchored with letters for identities [set by -m 2]
