Genomics and Personalized Care in Health Systems. Lecture 4: BLAST Search. Leming Zhou, PhD, DSc, School of Health and Rehabilitation Sciences, Department of Health Information Management.
Outline • BLAST algorithm • BLAST search • Other programs
Pairwise Local Alignment • Pairwise local sequence alignment: identify similar segments in two sequences • The Smith-Waterman algorithm (a dynamic programming algorithm) is guaranteed to find optimal alignments, but it is computationally expensive [O(nm) for sequences of lengths n and m]. • BLAST and FASTA are heuristic approximations to local alignment. They run much faster than the Smith-Waterman algorithm while retaining most of the sensitivity of the search
BLAST • BLAST [Basic Local Alignment Search Tool] is a sequence comparison algorithm, optimized for speed, that searches sequence databases for high-scoring local alignments to a query • It is one of the most widely used and referenced computational biology resources • The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length W with a score of at least T when compared to the query using a substitution matrix • Word hits are then extended in both directions to generate an alignment with a score exceeding a given threshold S
Applications of BLAST • BLAST searching is fundamental to understanding the relatedness of any query sequence to other known proteins or DNA sequences. • Applications include • Identifying homologs • Discovering new genes • Discovering variants of genes • Exploring protein structure and function • …
BLAST Algorithm • Filter out low complexity regions • Locate words of a fixed size in the query sequence • Scan the sequence database for entries that match the words in the query sequence • If there is a hit (i.e. a match between a word in the query and a word in the database entry), extend the hit in both directions. Keep track of the score and stop the extension when the score drops a set amount below the best score reached so far
Word Size • The initial search is done for a word of length W • Default values: • Protein sequence search: W = 3 • Nucleotide sequence search: W = 11 • Highly similar nucleotide sequences (megablast): W = 28 • Each word in the query sequence index is compared to the database index, and the residue pairs are scored
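The word lookup described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the data structure BLAST actually uses; the function names `words` and `build_index` and the short example query are made up for the sketch.

```python
from collections import defaultdict


def words(seq, w):
    """Yield (offset, word) for every overlapping word of length w in seq."""
    for i in range(len(seq) - w + 1):
        yield i, seq[i:i + w]


def build_index(seq, w):
    """Map each word of length w to the list of offsets where it occurs."""
    index = defaultdict(list)
    for offset, word in words(seq, w):
        index[word].append(offset)
    return index


# Hypothetical 12-base query; with W = 11 it contains only two words.
query = "ATGGAATTAGTT"
query_index = build_index(query, w=11)
print(dict(query_index))
```

A sequence of length L contains L - W + 1 overlapping words, so a larger W gives fewer, more specific words and a faster but less sensitive search.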
Scoring Matrices • Nucleotide: match/mismatch matrix (shown below) • Protein: • PAM matrices • BLOSUM matrices • BLOSUM62 is the default for major database searches
      A    G    C    T
 A   +1   -2   -2   -2
 G   -2   +1   -2   -2
 C   -2   -2   +1   -2
 T   -2   -2   -2   +1
Initial Search • The initial search is done for a word of length W that scores at least T when compared to the query using a scoring matrix
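To make the threshold T concrete, the sketch below enumerates every word of length W that scores at least T against a query word. It is an illustration only: threshold-based neighborhood words are used in protein searches with matrices such as BLOSUM62, while nucleotide searches look for exact word matches. The toy uses the +1/-2 nucleotide scores from the Scoring Matrices slide purely to stay short, and the names `word_score` and `neighborhood` are invented here.

```python
from itertools import product

MATCH, MISMATCH = 1, -2  # nucleotide scores from the Scoring Matrices slide


def word_score(a, b):
    """Sum the position-by-position scores of two equal-length words."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def neighborhood(query_word, threshold, alphabet="ACGT"):
    """Return all words that score at least `threshold` against query_word."""
    w = len(query_word)
    return [
        "".join(candidate)
        for candidate in product(alphabet, repeat=w)
        if word_score(query_word, candidate) >= threshold
    ]


# With W = 3 an exact match scores 3 and a single mismatch scores 0,
# so T = 0 admits the word itself plus its 9 one-mismatch variants.
hits = neighborhood("ATG", threshold=0)
print(len(hits), hits)
```

Raising T shrinks the neighborhood (T = 3 leaves only the exact word), which speeds up the scan at the cost of sensitivity.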
Hit Extension • Word hits are then extended in both directions in an attempt to generate an alignment whose score exceeds a given threshold; the resulting alignments are the High-scoring Segment Pairs (HSPs) • Extension in each direction stops when the running score falls more than a set drop-off below the best score seen so far
…..SLAALLNKCKTPQGQRLVNQWIKQPLMDKNR IEERLNLVEA…
+LA++L+ TP G R++ +W+ P+ D + ER +A
…..TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA….
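The extension step can be sketched as follows. This is a simplified, ungapped illustration under stated assumptions: it uses +1/-2 nucleotide scores, a drop-off parameter named `x_drop` here, and hypothetical example sequences. It is not NCBI's implementation.

```python
def extend(query, subject, q_pos, s_pos, w, x_drop, match=1, mismatch=-2):
    """Extend a word hit of length w without gaps in both directions.

    Returns (score, q_start, q_end) for the best segment found in the
    query, with q_end exclusive.
    """
    def pair(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    seed = sum(pair(query[q_pos + k], subject[s_pos + k]) for k in range(w))

    def walk(qi, si, step):
        best = run = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            run += pair(query[qi], subject[si])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > x_drop:
                break  # fell too far below the best score: stop extending
            qi += step
            si += step
        return best, best_len

    right, right_len = walk(q_pos + w, s_pos + w, +1)
    left, left_len = walk(q_pos - 1, s_pos - 1, -1)
    return seed + left + right, q_pos - left_len, q_pos + w + right_len


# Hypothetical sequences sharing the core ATGGAATT; the seed is ATGG at offset 2.
query = "CCATGGAATTCC"
subject = "TTATGGAATTGG"
score, start, end = extend(query, subject, q_pos=2, s_pos=2, w=4, x_drop=3)
print(score, query[start:end])
```

Only the best-scoring prefix of each walk is kept, so the mismatches that triggered the stop are not part of the reported segment.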
Gap Penalty • Gap penalties can be linear (the same penalty for every gapped position) or affine: • G = a + bx • a: gap open penalty • b: gap extension penalty • x: gap length (number of gapped positions) • The choice of gap open and gap extension penalties is empirical. Usually a high gap open penalty is combined with a low gap extension penalty.
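A short worked example of G = a + bx, as a sketch. The values a = 11 and b = 1 are illustrative choices, not taken from the lecture, and the function names are invented here.

```python
def linear_gap_penalty(length, per_position):
    """Linear model: every gapped position costs the same."""
    return per_position * length


def affine_gap_penalty(length, gap_open, gap_extend):
    """Affine model: G = a + b*x for a gap spanning x positions."""
    return gap_open + gap_extend * length


# One gap of length 5 versus five separate gaps of length 1 (a = 11, b = 1).
one_long_gap = affine_gap_penalty(5, gap_open=11, gap_extend=1)
five_short_gaps = 5 * affine_gap_penalty(1, gap_open=11, gap_extend=1)
print(one_long_gap, five_short_gaps)
```

With a high opening cost and a low extension cost, one long gap is much cheaper than several scattered gaps, so the alignment tends to group inserted or deleted residues into a single gap.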
Four Steps of a BLAST search • Enter query sequence • Select one BLAST program • Choose the database to search • Set optional parameters
Enter Query Sequence • A sequence can be pasted into a text field in FASTA format or entered as an accession number • A sequence or a list of sequences can also be uploaded as a file • Users may specify a range of the query sequence instead of using the whole query sequence • You may enter a descriptive title for your BLAST search
Align Two or More Sequences • You may provide two or more sequences and perform a pairwise BLAST search
Select a BLAST Program • BLAST Programs: • BLASTN: DNA query sequence against a DNA database • BLASTP: protein query sequence against a protein database • BLASTX: DNA query sequence, translated into all six reading frames, against a protein database • TBLASTN: protein query sequence against a DNA database, translated into all six reading frames • TBLASTX: DNA query sequence, translated into all six reading frames, against a DNA database, translated into all six reading frames • Choose the program according to the type of sequence you have and the purpose of your search
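The program choice on this slide amounts to a small decision table, sketched below. The function `choose_program` and its argument names are hypothetical helpers for illustration, not part of any BLAST package.

```python
def choose_program(query_type, database_type, compare_as_protein=False):
    """Pick a BLAST program from the query and database sequence types.

    query_type and database_type are "dna" or "protein". For DNA against
    DNA, compare_as_protein selects TBLASTX (both sides translated)
    instead of BLASTN.
    """
    key = (query_type, database_type)
    if key == ("protein", "protein"):
        return "blastp"
    if key == ("dna", "protein"):
        return "blastx"
    if key == ("protein", "dna"):
        return "tblastn"
    if key == ("dna", "dna"):
        return "tblastx" if compare_as_protein else "blastn"
    raise ValueError(f"unknown sequence types: {key}")


print(choose_program("dna", "protein"))
```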
DNA vs. Protein Searches • Consider the two sequences:
ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA
• Ungapped DNA alignment:
ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA
|||||  | || ||    ||  | || || || | |
ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA
• 21 identical residues (out of 36): 58% identity • Translate each to protein first: MELVISALIVE and MELVISALIVE (each followed by a stop codon) • 100% identical at the amino acid level
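The comparison on this slide can be reproduced with a short script. It is a sketch that assumes the standard genetic code and the two sequences written with T throughout (any U is converted to T); the helper names are invented here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 1 of a DNA sequence; '*' marks a stop codon."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(a, b):
    """Count positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


seq1 = "ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA"
seq2 = "ATGGAGCTGGTGATCTCAGCGCTGATCGTCGAGTGA"

dna_matches = identical_positions(seq1, seq2)
print(f"DNA identity: {dna_matches}/{len(seq1)}")
print(translate(seq1), translate(seq2))
```

The two DNA sequences agree at 21 of 36 positions, yet both translate to MELVISALIVE followed by a stop, because most of the differences fall at synonymous codon positions.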
DNA vs. Protein Searches • If a nucleotide region contains a protein-coding gene, it is beneficial to translate the sequence to protein first and compare at the amino acid level • Target and query are translated into all six reading frames • 3 in the forward direction, 3 in the reverse • The number of comparisons needed grows • More sensitive, but slower
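A six-frame translation can be sketched as below, assuming the standard genetic code and a sequence containing only A, C, G and T. The frame labels (+1 to +3, -1 to -3) and function names are conventions chosen for this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA sequence from its first base; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Translate three forward and three reverse reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGAATTAGTTATTAGTGCTTTAATTGTTGAATAA").items():
    print(name, protein)
```

Each DNA sequence yields six candidate protein sequences, which is why translated searches need several times more comparisons than a plain nucleotide or protein search.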
Choose the Database to Search • BLASTN
Choose the Database to Search • BLASTP
Optional Parameters • Specify the organism to search or exclude • Common name, taxonomy id, … • Exclude certain sequences • Exclude predicted sequences or sequences from metagenomics • Use an Entrez query to select a subset of the BLAST database
Algorithm Parameters Optional Parameters
Algorithm Parameters • Expect value • Word size • Filtering/masking • Substitution matrix
Scoring Parameters • Match/mismatch scores • Gap penalty
Expect Value • It is important to assess the statistical significance of search results. • For local alignments, the scores follow an extreme value distribution • The expect value (E value) is the number of matches with a score at least as high as the one observed that are expected to occur by chance • The lower the E value, the more significant the match. • $E = K m n \, e^{-\lambda S}$ • K: a parameter whose value depends on the substitution matrix used and is adjusted for the size of the search space • m, n: lengths of the query and of the database sequences • λ: a statistical parameter used as a natural scale for the scoring system • S: alignment score
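The formula can be evaluated directly, as in this minimal sketch. The values K = 0.134 and λ = 0.3176, the query length and the database length are illustrative placeholders chosen for the example, not figures from the lecture.

```python
import math


def expect_value(score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S) for raw alignment score S."""
    return K * m * n * math.exp(-lam * score)


# Placeholder parameters: a 250-residue query against 1,000,000 residues.
K, LAM = 0.134, 0.3176
for s in (40, 60, 80):
    print(s, expect_value(s, m=250, n=1_000_000, K=K, lam=LAM))
```

Each additional score unit multiplies E by the same factor, e^{-λ}, which is the exponential decrease described on the next slide, and doubling the database length n doubles E.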
More about E Value • The value of E decreases exponentially with increasing alignment score S (higher S values correspond to better alignments). Very high scores correspond to very low E values. • For E = 1, one match with a similar score is expected to occur by chance. • Because E is proportional to the size of the search space, the same score gives a proportionally larger E value in a larger database and a smaller E value in a smaller one
Why Set Expect Threshold to 1000 • When you perform a search with a short query (e.g. 9 amino acids), there are not enough residues to accumulate a big score (or a small E value). • A match of 9 out of 9 residues could yield a small score with an E value of 100 or 200. And yet, this result could be real and of interest to you. • By setting the E value cutoff to 1000 or a bigger value you do not change the way the search is done, but you do change which results are reported to you. • All hits with an E value less than 1000 are reported
E Values • Orthologs from closely related species will have the highest scores and lowest E values • Often E = $10^{-30}$ to $10^{-100}$ • Closely related homologs with highly conserved function and structure will have high scores • Often E = $10^{-15}$ to $10^{-50}$ • Distantly related homologs may be hard to identify • Less than E = $10^{-4}$ • These values may serve as a general guideline, not as strict ranges for these situations
Set the Expect Threshold • The Expect Threshold can be any positive real number. • The lower the number the more stringent the matches displayed. • The default value of 10 signifies that 10 matches can be expected by chance in a search of the database using a random query with similar length. • No match with an E-value higher than the Expect Threshold selected will be displayed • Increase the Expect Threshold to 1000 or more when searching with a short query
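The effect of the Expect Threshold is purely a reporting filter, which the sketch below illustrates. The hit names and E values are invented toy data, and `reported_hits` is a hypothetical helper, not a BLAST function.

```python
def reported_hits(hits, expect_threshold=10.0):
    """Keep hits whose E value does not exceed the threshold, best first."""
    kept = [hit for hit in hits if hit[1] <= expect_threshold]
    return sorted(kept, key=lambda hit: hit[1])


# Toy (name, E value) pairs, made up for illustration.
hits = [("hit_c", 150.0), ("hit_a", 2e-40), ("hit_d", 4000.0), ("hit_b", 0.5)]

default_report = reported_hits(hits)                          # threshold 10
relaxed_report = reported_hits(hits, expect_threshold=1000.0)
print([name for name, _ in default_report])
print([name for name, _ in relaxed_report])
```

The same set of hits is found in both cases. Raising the threshold only lets weaker matches, such as those from a short query, appear in the report.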
Raw Scores and Bit Scores • There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores) • Bit scores are normalized using the statistical parameters of the scoring system, so they are comparable between searches that use different scoring matrices: $S' = (\lambda S - \ln K) / \ln 2$ • The E value corresponding to a given bit score S' is: $E = m n \, 2^{-S'}$ • Bit scores allow you to compare results between different database searches, even when different scoring matrices were used.
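The two formulas are consistent with the raw-score formula E = Kmn e^{-λS}, since 2^{-S'} = K e^{-λS}. The sketch below checks this numerically. The values of K, λ, m and n are illustrative placeholders, not values from the lecture.

```python
import math


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_bit_score(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def e_from_raw_score(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


# Placeholder parameters for illustration only.
K, LAM, M, N = 0.134, 0.3176, 250, 1_000_000
bits = bit_score(80, K, LAM)
print(bits, e_from_bit_score(bits, M, N), e_from_raw_score(80, M, N, K, LAM))
```

Both routes give the same E value. The bit score simply folds K and λ into the score, so only the search-space size m·n is needed to interpret it.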
Low Complexity Regions • Low complexity regions: amino acid or DNA sequence regions that offer very low information due to their highly biased content • poly-A tails in DNA sequences • runs of purines or pyrimidines • Tandem repeats, such as ACACACACACACAC… • runs of a single amino acid, etc.
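One simple way to see what "very low information" means is to compute the Shannon entropy of a sliding window. This is a stand-in illustration only: BLAST's own filters (SEG for protein, DUST for nucleotide) use different complexity measures, and the window size and cutoff below are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits per residue of the residue composition of window."""
    total = len(window)
    counts = Counter(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, size=12, cutoff=1.2):
    """Return start offsets of windows whose entropy is below cutoff."""
    return [
        i
        for i in range(len(seq) - size + 1)
        if shannon_entropy(seq[i:i + size]) < cutoff
    ]


print(shannon_entropy("AAAAAAAA"))   # single-base run
print(shannon_entropy("ACACACAC"))   # tandem repeat
print(shannon_entropy("ACGTACGT"))   # all four bases equally frequent
```

A run of one base has zero entropy, a two-base tandem repeat has 1 bit per residue, and a balanced DNA window reaches the maximum of 2 bits.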
Short Repeats • DNA or amino acid motifs shorter than 10 residues that repeat themselves • Short, tandem repeats • Such regions can be the cause of disease, but they are also common in genomes
Interspersed Repeats • Larger repeats are found interspersed throughout genomes • Humans: > 40% interspersed repeats • Plants have large numbers of these as well • Short interspersed repeats (SINEs, about 300 bp) • Long interspersed repeats (LINEs, 1 kb or longer)
RepeatMasker • RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences using cross_match, an implementation of the Smith-Waterman algorithm. • The output of the program is a detailed annotation of the repeats that are present in the query sequence, as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). • http://www.repeatmasker.org/
Soft vs. Hard Masking • In BLAST, there are two options to mask repetitive elements and low complexity regions: • Hard masking: replace regions with X's (protein) or N's (nucleotide) • Soft masking: repetitive regions and low complexity regions are shown in lower case
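The difference between the two styles can be shown on a short string. This is a sketch with a made-up sequence and made-up region coordinates; `soft_mask` and `hard_mask` are hypothetical helpers, not BLAST or RepeatMasker functions.

```python
import re


def soft_mask(seq, regions):
    """Lower-case the half-open (start, end) regions of a sequence."""
    chars = list(seq.upper())
    for start, end in regions:
        chars[start:end] = seq[start:end].lower()
    return "".join(chars)


def hard_mask(soft_masked, mask_char="N"):
    """Replace every lower-case (soft-masked) residue with mask_char."""
    return re.sub(r"[a-z]", mask_char, soft_masked)


# Hypothetical sequence with an AC tandem repeat at offsets 3 to 10.
soft = soft_mask("ATGACACACACGT", [(3, 11)])
hard = hard_mask(soft)
print(soft)
print(hard)
```

Soft masking keeps the original residues, so an extension can still run through the region, while hard masking discards them.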
Filters and Masking • Filter low-complexity regions • This function masks off segments of the query sequence that have low compositional complexity. • Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output. • Filtering is only applied to the query sequence (or its translation products), not to database sequences. • Mask for lookup table only • This option masks only for purposes of constructing the lookup table used by BLAST, so that no hits are found based upon low-complexity sequence or repeats. • The BLAST extensions are performed without masking, so they can be extended through low-complexity sequence. • Mask lower case letters • With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. • This allows you to customize what is filtered from the sequence.
Filters and Masking • If the number of hits returned is small when searching with a short query, it may help to re-search with filtering turned off.