Biology 4900 Biocomputing
Chapter 4 BLAST
BLAST
• BLAST allows a user to search a sequence (the query) against millions of sequences in the NCBI database (the target).
• Global alignments (e.g., Needleman-Wunsch) would be time-consuming and computationally intensive for this amount of data.
• BLAST is designed for local alignment, not global alignment.
• Local alignment allows for faster searches and can match subsets of proteins (e.g., domains).
Figure: C-terminal domain of CaM (from 3cln.pdb)
Other BLAST Programs
• Blastx: Compares a nucleotide query sequence translated in all reading frames (3 possible proteins for each DNA strand) against a protein sequence DB.
• Tblastn: Compares a protein query sequence against a nucleotide sequence DB.
• Tblastx: Compares the 6-frame translations of a nucleotide query sequence against the 6-frame translations of a nucleotide sequence database.
Figure: the six reading frames of a double-stranded DNA sequence
  5’ CAT CAA ...
  5’ ATC AAC ...
  5’ TCA ACT ...
  5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
  3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
  5’ GTG GGT ...
  5’ TGG GTA ...
  5’ GGG TAG ...
Pevsner, Bioinformatics and Functional Genomics, 2009
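The six reading frames in the figure can be reproduced with a short script. This is only a sketch of the idea, not how BLAST itself is implemented; the function names are illustrative, and the sequence is the top strand shown on the slide.

```python
def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna))


def codons(dna, offset):
    """Split dna into complete codons, starting at offset 0, 1 or 2."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Codon lists for the 3 forward frames, then the 3 reverse-strand frames."""
    rc = reverse_complement(dna)
    return [codons(dna, k) for k in range(3)] + [codons(rc, k) for k in range(3)]


top_strand = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for frame in six_frames(top_strand):
    # Print only the first two codons of each frame, as on the slide.
    print("5'", " ".join(frame[:2]), "...")
```

The first two codons of each frame match the slide: CAT CAA, ATC AAC, TCA ACT on the top strand and GTG GGT, TGG GTA, GGG TAG on the bottom strand. Translating each codon list to protein would give the six conceptual translations that blastx and tblastx compare.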
Choose the BLAST program
Program   Input     Database   Comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
BLAST (Altschul 1990)
• BLAST uses a pre-indexed database of ‘words’ for all proteins in the database (similar to FASTA).
• A word is defined as a short sequence of letters.
• For blastp, the default word size (W) is 3 letters.
• For blastn, the default word size (W) is 11 letters.
• For MegaBLAST (nucleotide), the default word size (W) is 28 letters.
• When you run a query, BLAST breaks your query sequence into a series of words and generates neighborhood words, as in the following example.
For the sequence …FSGTWYA…, the list of words (W = 3) is:
Words:               FSG  SGT  GTW  TWY  WYA
Neighborhood words:  YSG  TGT  ATW  SWY  WFA
                     FTG  SVT  GSW  TWF  WYS
http://www.incogen.com/bioinfo_tutorials/Bioinfo-Lecture_2-pairwise-align.html
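Breaking a query into overlapping words is a sliding window of width W. A minimal sketch (the function name is illustrative; it covers only the exact query words, not the neighborhood words):

```python
def query_words(sequence, w=3):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


print(query_words("FSGTWYA"))
# ['FSG', 'SGT', 'GTW', 'TWY', 'WYA']
```

A sequence of length L yields L - W + 1 words, so the 7-letter fragment above gives 5 words of size 3.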
Why use BLAST?
• BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences.
Applications include:
• identifying orthologs and paralogs
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function
Four steps to becoming a Master BLASTer
(1) Choose the sequence (query)
(2) Select the BLAST program
(3) Choose the database to search
(4) Choose optional parameters (you may leave these at their defaults the first time)
Then click “BLAST”
http://mestadelsbilder.wordpress.com/2011/10/23/master-blaster/
Step 1: Choose your sequence. The sequence can be input in FASTA format as text or by file upload, or as an accession number.
Example of the FASTA format for a BLAST query
Step 2: Choose the BLAST program Blastn and blastp are the main programs you will want to use
Step 3: Choose the database to search
• nr = non-redundant (most general database)
• dbest = database of expressed sequence tags
• dbsts = database of sequence tag sites
• gss = genomic survey sequences
Both protein databases and nucleotide databases are available.
Step 4a: Select optional search parameters: organism, Entrez query, algorithm
Step 4a: Optional blastp search parameters: Expect, Word size, Scoring matrix, Filter, Mask. Right. So, what are these?
Step 4a: Optional blastn search parameters: Expect, Word size, Match/mismatch scores, Filter, Mask
Algorithm Parameters: Expect
• This setting specifies the statistical significance threshold for reporting matches against database sequences.
• The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990).
• If the E value ascribed to a match is greater than the EXPECT threshold, the match will not be reported.
• Lower EXPECT thresholds (e.g., setting Expect to 6) are more stringent, leading to fewer chance matches being reported.
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Algorithm Parameters: Word Size
• BLAST is a heuristic algorithm (it makes approximations) that works by finding word matches between the query and database sequences. This process finds "hot-spots" that BLAST can then potentially extend into full-blown alignments.
• For nucleotide-nucleotide searches (i.e., "blastn"), an exact match of the entire word is required before an extension is initiated, so one normally regulates the sensitivity and speed of the search by increasing or decreasing the word size.
• For other BLAST searches, non-exact word matches are taken into account based upon the similarity between words. The required similarity can be varied, so one normally uses only word sizes 2 and 3 for these searches.
Example of a word hit:
  KENFDKARFSGTWYAMAKKDPEG  50  RBP (query)
  MKGLDIQKVAGTWYSLAMAASD.  44  lactoglobulin (hit)
Hit! The matching word is then extended in both directions.
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Algorithm Parameters: Filters
• The Low-complexity filter option masks parts of the query sequence that may represent very common, non-complex subsets of sequence.
• May not be very useful.
• The Species-specific repeats filter option is designed to ignore species-specific genomic repeats in very long sequences.
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Algorithm Parameters: Masks
• The Mask for lookup table only option masks only for purposes of constructing the lookup table used by BLAST, so that no hits are found based upon low-complexity sequence or repeats (if the repeat filter is checked).
• The BLAST extensions are performed without masking, so they can be extended through low-complexity sequence.
• The Mask lower case letters option lets you cut and paste a FASTA sequence in upper-case characters and denote the areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
Ex. agvgpADEEWGYilmaagDDEEE (the parts of the sequence in lower-case letters are masked, or ignored)
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
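To see which residues the lower-case convention marks, the example sequence can be processed as follows. This is an illustration only: BLAST applies the mask internally, and the function names and the choice of "X" as a masking character are assumptions made for this sketch.

```python
import re


def masked_intervals(sequence):
    """Return half-open (start, end) positions of each lower-case run."""
    return [match.span() for match in re.finditer(r"[a-z]+", sequence)]


def mask_lower_case(sequence, mask_char="X"):
    """Replace every lower-case residue with mask_char."""
    return "".join(mask_char if residue.islower() else residue
                   for residue in sequence)


example = "agvgpADEEWGYilmaagDDEEE"
print(masked_intervals(example))
print(mask_lower_case(example))
```

For the slide's example, two runs are masked (positions 0-4 and 12-17, counting from 0), leaving ADEEWGY and DDEEE available for comparison.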
Algorithm Parameters: Match/Mismatch Scores
• Many nucleotide searches use a simple scoring system that consists of a "reward" for a match and a "penalty" for a mismatch.
• The (absolute) reward/penalty ratio should be increased as one looks at more divergent sequences.
• A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved.
• A ratio of 0.5 (1/-2) is best for sequences that are 95% conserved.
• A ratio of about one (1/-1) is best for sequences that are 75% conserved.
States DJ, Gish W, and Altschul SF (1991)
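A rough way to see why stringent penalties suit only well-conserved sequences is to score a hypothetical 100-column ungapped alignment under each scheme. The alignment lengths and the function name are assumptions for illustration; raw scores from different schemes are not directly comparable, so the point is only where the score stops being positive.

```python
def ungapped_raw_score(matches, mismatches, reward, penalty):
    """Raw score of an ungapped nucleotide alignment under reward/penalty scoring."""
    return matches * reward + mismatches * penalty


SCHEMES = [(1, -3), (1, -2), (1, -1)]
# Hypothetical 100-column alignments: (matches, mismatches).
ALIGNMENTS = {"99% conserved": (99, 1),
              "95% conserved": (95, 5),
              "75% conserved": (75, 25)}

for label, (matches, mismatches) in ALIGNMENTS.items():
    scores = [ungapped_raw_score(matches, mismatches, r, p) for r, p in SCHEMES]
    print(label, dict(zip(SCHEMES, scores)))
```

At 75% conservation the 1/-3 scheme gives a raw score of 0, so such an alignment would not stand out from noise, whereas 1/-1 still gives 50.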
Algorithm Parameters: Matrices • A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. • Some matrices are good for comparing sequences that diverge very little, while other matrices are good for comparing sequences that diverge a lot. • The BLOSUM-62 matrix is among the best for detecting most weak protein similarities. • The BLOSUM-45 matrix may be better for particularly long and weak alignments. • The older PAM matrices may be better for short alignments, as these need to have a higher percentage of matching residues to exceed background noise (be detectable beyond random chance). http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
Matrices and Gap Costs
The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps.
Gapped BLAST and PSI-BLAST use "affine gap costs", which charge the score -a for the existence of a gap and the score -b for each residue in the gap.
Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
The total raw score for the alignment is reduced whenever a gap is introduced into either sequence.
Exercise: Calculate the score in BLOSUM-62 for a gap with 7 residues…
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Matrix/
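The affine gap formula -(a+bk) can be evaluated directly. The sketch below assumes the default blastp gap costs used with BLOSUM62 (existence a = 11, extension b = 1); these values are not stated on the slide, so substitute whatever costs your search actually uses.

```python
def affine_gap_score(k, existence=11, extension=1):
    """Score contribution -(a + b*k) of a single gap spanning k residues.

    The defaults (a=11, b=1) are an assumption: the usual blastp gap
    costs for BLOSUM62.
    """
    if k < 1:
        raise ValueError("a gap must span at least one residue")
    return -(existence + extension * k)


print(affine_gap_score(1))  # -(11 + 1)  = -12
print(affine_gap_score(7))  # -(11 + 7)  = -18
```

Under these assumed costs a 7-residue gap costs 18, much less than seven separate 1-residue gaps (7 x 12 = 84), which is the point of affine costs: opening a gap is expensive, lengthening it is cheap.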
BLAST (Altschul 1990)
• Neighborhood words are similar to the words constructed from the query, with one or more mismatched symbols.
• These are given scores based on the matrix that you are using (for blastp, the default matrix is BLOSUM62).
• Neighborhood words that score above a user-defined threshold (T) are also searched.
Word   Letter scores   Total score
GTW    6, 5, 11        22
GSW    6, 1, 11        18
ATW    0, 5, 11        16
NTW    0, 5, 11        16
GTY    6, 5, 2         13
ANT    1, 0, -5        -4
With T = 11, the words from GTW through GTY are neighborhood word hits above the threshold; ANT is below the threshold.
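The totals in the table come from summing one substitution score per position against the query word GTW. The sketch below hard-codes only the handful of BLOSUM62 entries needed for the first five rows (a full matrix would normally be loaded from a file or library); the dictionary and function names are illustrative.

```python
# Subset of BLOSUM62, keyed by alphabetically sorted residue pairs.
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("S", "T"): 1, ("A", "G"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def pair_score(a, b):
    """Substitution score for residues a and b (symmetric lookup)."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def word_score(candidate, query_word):
    """Sum of per-position substitution scores between two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(candidate, query_word))


T = 11  # neighborhood threshold from the slide
for candidate in ["GTW", "GSW", "ATW", "NTW", "GTY"]:
    score = word_score(candidate, "GTW")
    print(candidate, score, "hit" if score > T else "below threshold")
```

All five candidates score above T = 11, so each would be added to the list of words searched in the database.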
BLAST (Altschul 1990)
• BLAST then searches the entire database for the search words and neighborhood words.
• Once a match is found, BLAST extends the alignment in both directions along the sequence, scoring each subsequent position, until the score drops below some cutoff value.
Example of a word hit:
  KENFDKARFSGTWYAMAKKDPEG  50  RBP (query)
  MKGLDIQKVAGTWYSLAMAASD.  44  lactoglobulin (hit)
Hit! The matching word is then extended in both directions.
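The extension step can be sketched as an ungapped walk outward from the word hit that stops once the running score falls too far below the best score seen. This is a simplified illustration, not BLAST's implementation: it assumes +1/-1 identity scoring instead of a substitution matrix, and the function name, the `x_drop` parameter name, and its value of 2 are choices made for this sketch.

```python
def extend_hit(query, subject, q_start, s_start, word_len,
               x_drop=2, match=1, mismatch=-1):
    """Ungapped two-directional extension of a word hit.

    Returns (q_lo, q_hi, best_score); query[q_lo:q_hi] is the extended segment.
    Extension in a direction stops when the running score drops more than
    x_drop below the best score seen so far.
    """
    def pair(i, j):
        return match if query[i] == subject[j] else mismatch

    # Score of the seed word itself.
    best = sum(pair(q_start + k, s_start + k) for k in range(word_len))

    # Extend to the right of the seed.
    q_hi, running = q_start + word_len, best
    i, j = q_start + word_len, s_start + word_len
    while i < len(query) and j < len(subject):
        running += pair(i, j)
        if running > best:
            best, q_hi = running, i + 1
        elif best - running > x_drop:
            break
        i, j = i + 1, j + 1

    # Extend to the left of the seed.
    q_lo, running = q_start, best
    i, j = q_start - 1, s_start - 1
    while i >= 0 and j >= 0:
        running += pair(i, j)
        if running > best:
            best, q_lo = running, i
        elif best - running > x_drop:
            break
        i, j = i - 1, j - 1

    return q_lo, q_hi, best


query = "KENFDKARFSGTWYAMAKKDPEG"    # RBP fragment from the slide
subject = "MKGLDIQKVAGTWYSLAMAASD"   # lactoglobulin fragment from the slide
q_lo, q_hi, score = extend_hit(query, subject,
                               query.find("GTW"), subject.find("GTW"), 3)
print(query[q_lo:q_hi], score)
```

With identity scoring the hit GTW grows only to GTWY (score 4). A substitution matrix such as BLOSUM62 gives positive scores to conservative replacements, so a real BLAST extension of this pair would typically reach further.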
BLAST (1997)
• In a 1997 refinement of BLAST, two independent hits are required.
• The hits must occur in close proximity to each other.
• With this modification, only 1/7 as many extensions occur, greatly reducing the time required for a search.
Changing BLAST Input Parameters
• Increasing W or T will increase speed, but will result in loss of sensitivity (i.e., you will miss some matches).
• The expect value (E-value) threshold can be lowered in order to limit the hits to the most significant ones.
• Lower E-value = better hit.
• The E-value depends on the length of the query sequence and the size of the database.
• Example: an E-value of 0.05 means that 0.05 alignments scoring at least this well are expected by chance alone, i.e., roughly a 5 in 100 chance that such a hit arises by chance.
BLAST Output from DB Search • Graphic Summary includes conserved domains, when applicable.
BLAST Output from DB Search
• Graphic Summary includes the distribution of BLAST hits.
• Hits are color-coded by bit score.
• A higher score is generally associated with higher sequence identity.
BLAST search output: tabular output. High scores correspond to low E values.
BLAST output includes an evolutionary tree view. Run 3cln to observe the tree view options.
Pairwise Alignment with Dot Plots: 3CLN vs. 1EXR
>lcl|24241 3CLN:A|PDBID|CHAIN|SEQUENCE
Length=148
Score = 268 bits (684), Expect = 3e-97, Method: Compositional matrix adjust.
Identities = 130/148 (88%), Positives = 143/148 (97%), Gaps = 0/148 (0%)

Query  1    AEQLTEEQIAEFKEAFALFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60
            A+QLTEEQIAEFKEAF+LFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN
Sbjct  1    ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGN  60

Query  61   GTIDFPEFLSLMARKMKEQDSEEELIEAFKVFDRDGNGLISAAELRHVMTNLGEKLTDDE  120
            GTIDFPEFL++MARKMK+ DSEEE+ EAF+VFD+DGNG ISAAELRHVMTNLGEKLTD+E
Sbjct  61   GTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEE  120

Query  121  VDEMIREADIDGDGHINYEEFVRMMVSK  148
            VDEMIREA+IDGDG +NYEEFV+MM +K
Sbjct  121  VDEMIREANIDGDGQVNYEEFVQMMTAK  148
Pairwise Alignment with Dot Plots: 1RTP vs. 3CLN
Score = 30.0 bits (66), Expect = 1e-06, Method: Compositional matrix adjust.
Identities = 14/51 (27%), Positives = 26/51 (51%), Gaps = 3/51 (6%)

Query  62  TIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNL  112
           + D  +F  M+  K K  D   ++++ F + DKD +G+I   EL  ++
Sbjct  23  SFDHKKFFQMVGLKKKSAD---DVKKVFHILDKDKSGFIEEDELGSILKGF  70

Score = 25.8 bits (55), Expect = 3e-05, Method: Compositional matrix adjust.
Identities = 11/40 (28%), Positives = 21/40 (53%), Gaps = 0/40 (0%)

Query  4   LTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNP  43
           L ++   + K+ F + DKD  G I   ELG++++    +
Sbjct  35  LKKKSADDVKKVFHILDKDKSGFIEEDELGSILKGFSSDA  74
Statistics of Local Alignments
• For local pairwise alignments, the best approach to determining statistical significance is to estimate an expect value (E value).
• The expect value E is the number of alignments with scores greater than or equal to score S (your score) that are expected to occur by chance in a database search.
• An E value of $10^{-3}$ means that about 0.001 alignments scoring at least this well are expected by chance in the search, i.e., roughly 1 chance in 1000.
• An E value is related to a probability value p.
• The key equation describing an E value is:
  $E = K m n \, e^{-\lambda S}$
Pevsner, Bioinformatics and Functional Genomics, 2009
$E = K m n \, e^{-\lambda S}$
• This equation is derived from a description of the extreme value distribution
• S = the score
• E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S
• m, n = the lengths of the two sequences
• $\lambda$, K = Karlin-Altschul statistical parameters
Some properties of the equation $E = K m n \, e^{-\lambda S}$
• The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values.
• The expected score for aligning a pair of random residues must be negative! Otherwise, long random alignments would acquire great scores.
• Parameter K describes the search space (database).
• For E = 1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly.
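These properties are easy to check numerically. The sketch below evaluates $E = K m n \, e^{-\lambda S}$ with $\lambda$ = 0.267 and K = 0.041, values commonly quoted for gapped BLOSUM62 searches with gap costs 11/1; they are assumptions here, as are the query length (148) and the database size (one billion residues).

```python
import math

LAMBDA = 0.267  # assumed Karlin-Altschul lambda (gapped BLOSUM62, gap costs 11/1)
K = 0.041       # assumed Karlin-Altschul K for the same scoring system


def expect_value(raw_score, m, n, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S) for raw score S and lengths m, n."""
    return k * m * n * math.exp(-lam * raw_score)


m, n = 148, 1e9  # hypothetical query length and database size
for s in (40, 50, 60, 70):
    print(f"S = {s}: E = {expect_value(s, m, n):.3g}")

# Doubling the database size doubles E for the same score.
print(expect_value(50, m, 2 * n) / expect_value(50, m, n))
```

Each 10-point increase in raw score multiplies E by the same factor, $e^{-10\lambda}$ (about 0.07 with the assumed $\lambda$), while E grows linearly with the size of the search space.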
From raw scores to bit scores
• There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores).
• Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes.
• The bit score is: $S' = (\lambda S - \ln K) / \ln 2$
• The E value corresponding to a given bit score is: $E = m n \, 2^{-S'}$
• Bit scores allow you to compare results between different database searches, even using different scoring matrices.
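The conversion can be checked against the sample output in these slides, which pairs "268 bits (684)", "30.0 bits (66)" and "25.8 bits (55)". The sketch assumes $\lambda$ = 0.267 and K = 0.041 (values commonly quoted for gapped BLOSUM62 with gap costs 11/1, not given on the slides); the lengths m and n are hypothetical.

```python
import math

LAMBDA = 0.267  # assumed Karlin-Altschul lambda
K = 0.041       # assumed Karlin-Altschul K


def bit_score(raw_score, lam=LAMBDA, k=K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def e_from_raw(raw_score, m, n, lam=LAMBDA, k=K):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


for raw in (684, 66, 55):
    print(raw, round(bit_score(raw), 1))

# Both routes to E agree (hypothetical m and n).
m, n = 148, 1e6
print(e_from_raw(66, m, n), e_from_bits(bit_score(66), m, n))
```

With the assumed parameters the three raw scores convert to roughly 268, 30.0 and 25.8 bits, matching the reported values, and the two expressions for E give the same number because $2^{-S'} = K e^{-\lambda S}$.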
How to interpret BLAST: E values and p values
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
A p value is a different way of representing the significance of an alignment.
$p = 1 - e^{-E}$
How to interpret BLAST: E values and p values
Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values.
E        p
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.0001000
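The table follows directly from $p = 1 - e^{-E}$ and can be regenerated with a few lines. The function name is illustrative; `math.expm1` is used only to keep precision for very small E.

```python
import math


def p_from_e(e_value):
    """p = 1 - exp(-E), computed via expm1 for accuracy at small E."""
    return -math.expm1(-e_value)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8g}{p_from_e(e):.8f}")
```

For E below about 0.1 the two numbers are nearly identical, because $1 - e^{-E} \approx E$ when E is small; for E of 1 or more, p saturates toward 1 and stops being informative.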