880 likes | 1.27k Views
LocusLink. Gene database. UniGene. Homologene. Basic Local Alignment Search Tool. Genome Resources and Sequence Similarity. MapViewer. Topics. Why use sequence similarity? BLAST algorithm blastn, blastp, megablast BLAST statistics BLAST output Examples. : NCBI’s tool.
E N D
LocusLink Gene database UniGene Homologene Basic Local Alignment Search Tool Genome Resources and Sequence Similarity MapViewer
Topics • Why use sequence similarity? • BLAST algorithm • blastn, blastp, megablast • BLAST statistics • BLAST output • Examples
: NCBI’s tool Why Do We Need Sequence Similarity Searching? • To identify and annotate sequences • To evaluate evolutionary relationships • Other: • model genomic structure (e.g., Spidey) • check primer specificity in silico
Seq 1 Seq 1 Seq 2 Seq 2 Global alignment Local alignment Global vs Local Alignment
Global Seq1: 1 W--HEREISWALTERNOW 16 W HERE Seq2: 1 HEWASHEREBUTNOWISHERE 21 Local Seq1: 1 W--HERE 5 Seq1: 1 W--HERE 5 W HERE W HERE Seq2: 3 WASHERE 9 Seq2: 15 WISHERE 21 Global vs Local Alignment Seq1: WHEREISWALTERNOW (16aa) Seq2: HEWASHEREBUTNOWISHERE (21aa)
Global Dynamic Programming • Full sequence must be aligned • Gaps at ends are penalized as much as internal ones • F(n,m) is the best score for alignment • Traceback can give >1 correct alignment • Used to examine closely related sequences • http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html
Local alignments - How • Notice the top row and left column are now filled with 0 (if the best alignment has a negative score, it’s better to start a new one) • The alignment can end anywhere in the matrix • Instead of starting at F (n, m), start traceback at highest value of F (i, j); the traceback ends when you hit a 0
Heuristic alignment algorithms • Shortcuts are important • Searching a sequence length of 1000 against a database with 108 residues requires approximately 1011 matrix cells. At ten million matrix cells a second, it would take about 3 hours. • BLAST – the heuristic is based on that true match alignments are very likely to contain somewhere within them a sort stretch of identities. Look for short stretches to serve as seeds to extend.
Seeding • BLAST takes your query and breaks it down into words of fixed length (3 for protein, 11 for nucleotide) • It scans through a database looking for a word from the query set with some minimum score T, when it finds it, it begins a “hit” extension to extend the possible match in both directions, stopping at the maximum scoring extension.
Extension • The seeds are extended to locally optimal pairs, whose scores cannot be improved by extension or trimming. • These locally optimal alignments are called high scoring segment pairs or HSP’s • Sometimes you return only a portion of a sequence – this is the reason you need to look carefully at your BLAST alignments
Alignment example • The quick brown fox jumps over the lazy dog. • The quiet brown cat purrs when she sees him. • Matches = +1; Mismatches = -1; ignore spaces and do not allow gaps. • Assume the seed is the capital T, extend the alignment • You’ll hit a mismatch c/e should you continue and how far? • Generate a variable X to measure how far the score drops off. • Set X = 5 and try the alignment… • Set X = 2 and try again … • A large X value will increase the speed, however, speed is often modulated by word size and other parameters…
Gapped BLAST – a time saver • Extension is costly, now have a two hit (gapped) BLAST where you require two hits within a distance (A) • A gapped extension takes much longer to execute than ungapped, but overall run fewer extensions – time saver • Gapped BLAST requires two non-overlapping hits of at least score (T) within distance A of one another before ungapped extension of second hit • T is adjustable, higher the T then the smaller the search space
Evaluation • Once seeds are extended to generate alignments, these alignments are tested for statistical significance. • Can establish thresholds for reporting
The Flavors of BLAST • Standard BLAST • traditional “contiguous” word hit • position independent scoring • nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx) • Megablast • optimized for large batch searches • can use discontiguous words • PSI-BLAST • constructs PSSMs automatically; uses as query • very sensitive protein search • RPS BLAST • searches a database of PSSMs • tool for conserved domain searches
BLASTN variations • BLASTN seeds are always identical words; T is never used • To make BLASTN faster, increase word size, to make it more sensitive decrease word size • MegaBLAST increases word size to 28 • The minimum word size is 7 • http://monod.uwaterloo.ca/papers/02ph.pdf
BLASTP implementation • To make searches faster, set word size to 3 and T to a large value (999), which removes all potential neighborhood words (two-hit distance is 40 amino acids by default) • Affine gaps • Decreased penalty for gap extension relative to gap introduction
Also, FASTA • Similar to Gapped BLAST – except bigger neighborhood • Generates a lookup table to locate all identically matching words of length ktup protein 1-2, DNA 4-6 • Once identified, looks for diagonals with many mutually supporting word matches • Extensions similar to BLAST
Scoring Matrices • Scoring matrix specifies a score, sij, for aligning sequence Iwith sequence II. • Choice of matrix depends on the divergence level of desired/expected hits. • Examples: PAM, BLOSUM • Both can be modified for different divergence levels (eg, BLOSUM40, BLOSUM62) • Advice: try several matrices when possible.
Dayhoff Family of Matrices • Dayhoff model measures sequence evolution in units of “PAMs” • One PAM unit represents the evolutionary distance in which 1% of the amino acids have changed. • Mutability of an aa is its relative rate of change (amino acids with high mutabilities are more likely to change) • Mutability of alanine was defined to be 100.
Dayhoff Family of Matrices • Problems with the original Dayhoff scheme • It does not consider the genetic code. • Not all amino acid substitutions can occur by a single nucleotide substitution event. • Parameters were estimated from a small sample of closely related proteins. • Evolution at the “average site” of the “average protein” is being modeled.
BLOSUM Scoring Matrices Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies ofsubstitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff)
Basic Local Alignment Search Tool • Widely used similarity search tool • Heuristic approach based on Smith Waterman algorithm • Finds best local alignments • Provides statistical significance • All combinations (DNA/Protein) query and database. • DNA vs DNA • DNA translation vs Protein • Protein vs Protein • Protein vs DNA translation • DNA translation vs DNA translation • www, standalone, and network clients
X dropoff (X1) X dropoff (X2) X dropoff (X3) How BLAST Works • Make lookup table of “words” for query • Scan database for hits • Ungapped extensions of hits (initial HSPs) • Gapped extensions (no traceback) • Gapped extensions (traceback; alignment details)
GTACTGGACATGGACCCTACAGGAA Query: 11-mer WORD SIZE default minimum blastn 11 7 megablast 28 8 Nucleotide Words GTACTGGACAT TACTGGACATG ACTGGACATGG CTGGACATGGA TGGACATGGAC GGACATGGACC GACATGGACCC ACATGGACCCT Make a lookup table of words . . .
GTQITVEDLFYNIATRRKALKN Query: Word size = 3 (default) Word size can only be 2 or 3 Neighborhood Words LTV, MTV, ISV, LSV, etc. Protein Words GTQ TQI QIT ITV TVE VED EDL DLF ... Make a lookup table of words [ -f 11 = blastp default ]
Minimum Requirements for a Hit ATCGCCATGCTTAATTGGGCTT CATGCTTAATT one exact match • Nucleotide BLAST requires one exact match • Protein BLAST requires two neighboring matches within 40 aa GTQITVEDLFYNI SEI YYN neighborhood words [ -A 40 = blastp default ] two matches
example query words Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV… HFL 18 HFV 15 HFS 14 HWL 13 NFL 13 DFL 12 HWV 10 etc … YLS 15 YLT 12 YVS 12 YIT 10 etc … Neighborhood words Neighborhood score threshold T (-f) =11 Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47 YLSHFL Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333 +E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333 Gapped extension with trace back Query 1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50 +E YA YL K F+YLSL +SP+ +DVNVHP+K VHFL+++ I + + Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337 Final HSP BLASTP Summary High-scoring pair (HSP)
Scoring Systems - Nucleotides Identity matrix A G C T A +1 –3 –3 -3 G –3 +1 –3 -3 C –3 –3 +1 -3 T –3 –3 –3 +1 [ -r 1 -q -3 ] CAGGTAGCAAGCTTGCATGTCA || |||||||||||| ||||| raw score = 19-9 = 10 CACGTAGCAAGCTTG-GTGTCA
Scoring Systems - Proteins • Position Independent Matrices • PAM Matrices (Percent Accepted Mutation) • Derived from observation; small dataset of alignments • Implicit model of evolution • All calculated from PAM1 • PAM250 widely used • BLOSUM Matrices (BLOck SUbstitution Matrices) • Derived from observation; large dataset of highly conserved blocks • Each matrix derived separately from blocks with a defined percent identity cutoff • BLOSUM62 - default matrix for BLAST • Position Specific Score Matrices (PSSMs) • PSI- and RPS-BLAST
Common amino acids have low weights Rare amino acids have high weights F Negative for less likely substitutions Y Positive for more likely substitutions D D F BLOSUM62 A 4 R -1 5 N -2 0 6 D -2 -2 1 6 C 0 -3 -3 -3 9 Q -1 1 0 0 -3 5 E -1 0 0 2 -4 2 5 G 0 -2 0 -1 -3 -2 -2 6 H -2 0 1 -1 -3 0 0 -2 8 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 A R N D C Q E G H I L K M F P S T W Y V X
PSSM scores 1 5 7 4 4 Position-Specific Score Matrix Serine/Threonine protein kinases catalytic loop DAF-1
Position-Specific Score Matrix A R N D C Q E G H I L K M F P S T W Y V 435 K -1 0 0 -1 -2 3 0 3 0 -2 -2 1 -1 -1 -1 -1 -1 -1 -1 -2 436 E 0 1 0 2 -1 0 2 -1 0 -1 -1 0 0 0 -1 0 0 -1 -1 -1 437 S 0 0 -1 0 1 1 0 1 1 0 -1 0 0 0 2 0 -1 -1 0 -1 438 N -1 0 -1 -1 1 0 -1 3 3 -1 -1 1 -1 0 0 -1 -1 1 1 -1 439 K -2 1 1 -1 -2 0 -1 -2 -2 -1 -2 5 1 -2 -2 -1 -1 -2 -2 -1 440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1 0 -3 7 -1 -2 -3 -1 -1 441 A 3 -2 1 -2 0 -1 0 1 -2 -2 -2 0 -1 -2 3 1 0 -3 -3 0 442 M -3 -4 -4 -4 -3 -4 -4 -5 -4 7 0 -4 1 0 -4 -4 -2 -4 -1 2 443 A 4 -4 -4 -4 0 -4 -4 -3 -4 4 -1 -4 -2 -3 -4 -1 -2 -4 -3 4 444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5 0 -5 445 R -4 8 -3 -4 0 -1 -2 -3 -2 -5 -4 0 -3 -2 -4 -3 -3 0 -4 -5 446 D -4 -4 -1 8 -6 -2 0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5 447 I -4 -5 -6 -6 -3 -4 -5 -6 -5 3 5 -5 1 1 -5 -5 -3 -4 -3 1 448 K 0 0 1 -3 -5 -1 -1 -3 -3 -5 -5 7 -4 -5 -3 -1 -2 -5 -4 -4 449 S 0 -3 -2 -3 0 -2 -2 -3 -3 -4 -4 -2 -4 -5 2 6 2 -5 -4 -4 450 K 0 3 0 1 -5 0 0 -4 -1 -4 -3 4 -3 -2 2 1 -1 -5 -4 -4 451 N -4 -3 8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5 452 I -3 -5 -5 -6 0 -5 -5 -6 -5 6 2 -5 2 -2 -5 -4 -3 -5 -3 3 453 M -4 -4 -6 -6 -3 -4 -5 -6 -5 0 6 -5 1 0 -5 -4 -3 -4 -3 0 454 V -3 -3 -5 -6 -3 -4 -5 -6 -5 3 3 -4 2 -2 -5 -4 -3 -5 -3 5 455 K -2 1 1 4 -5 0 -1 -2 1 -4 -2 4 -3 -2 -3 0 -1 -5 -2 -3 456 N 1 1 3 0 -4 -1 1 0 -3 -4 -4 3 -2 -5 -2 2 -2 -5 -4 -4 457 D -3 -2 5 5 -1 -1 1 -1 0 -5 -4 0 -2 -5 -1 0 -2 -6 -4 -5 458 L -3 -1 0 -3 0 -3 -2 3 -4 -2 3 0 1 1 -2 -2 -3 5 -1 -3 catalytic loop [ >./blastpgp -i NP_499868.2 -d nr -j 3 -Q NP_499868.pssm ]
E = Kmne-S or E = mn2-S’ K = scale for search space = scale for scoring system S’ = bitscore = (S - lnK)/ln2 (applies to ungapped alignments) Local Alignment Statistics High scores of local alignments between two random sequences follow the Extreme Value Distribution Expect Value E = number of database hits you expect to find by chance, ≥ S your score Alignments expected number of random hits Score (S) More info:www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
Gapped Alignments • Gapping provides more biologically realistic alignments • Gapped BLAST parameters are simulated for each scoring matrix • Affine gap costs = -(a+bk) • a = gap open penalty b = gap extend penalty • A gap of length 1 receives the score -(a+b)
An Alignment BLAST Cannot Make 1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT 121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC Reason: no contiguous exact match of 7 bp.
Score = 290 bits (741), Expect = 7e-77Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)Frame = +3 BLAST 2 Sequences (blastx) output: An Alignment BLAST Can Make Solution: compare protein sequences; BLASTX
Other BLAST Algorithms • Megablast • Discontiguous Megablast • PSI-BLAST
Megablast: NCBI’s Genome Annotator • Long alignments of similar DNA sequences • Greedy algorithm • Concatenation of query sequences • Faster than blastn; less sensitive
Discontiguous Megablast • Uses discontiguous word matches • Better for cross-species comparisons
Templates for Discontiguous Words W = 11, t = 16, coding: 1101101101101101 W = 11, t = 16, non-coding: 1110010110110111 W = 12, t = 16, coding: 1111101101101101 W = 12, t = 16, non-coding: 1110110110110111 W = 11, t = 18, coding: 101101100101101101 W = 11, t = 18, non-coding: 111010010110010111 W = 12, t = 18, coding: 101101101101101101 W = 12, t = 18, non-coding: 111010110010110111 W = 11, t = 21, coding: 100101100101100101101 W = 11, t = 21, non-coding: 111010010100010010111 W = 12, t = 21, coding: 100101101101100101101 W = 12, t = 21, non-coding: 111010010110010010111 W = word size; # matches in template t = template length Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics March, 2002; 18(3):440-5
BLAST Databases: Nucleic Acid • nr (nt) • traditional GenBank divisions • NM_ and XM_ RefSeqs • dbest • EST division • htgs • HTG division • gss • GSS division • chromosome • NC_ RefSeqs • env_nr • environmental sample[filter] • e.g., 16S rRNA
BLAST Databases: Protein nr(non-redundant protein sequences) • GenBank CDS translations • NP_ RefSeqs • Outside databases PIR, Swiss-Prot, PRF • PDB (sequences from structures) env_nr (environmental sample[filter])