BLAST. Chris Cobb Bioinformatics April 2008. Overview. Sequence comparison Compare regions of similarity Global alignment not necessarily 'valuable' Indicative of evolutionary relationships. Not always... Horizontal gene transfer Convergent evolution
BLAST • Chris Cobb • Bioinformatics • April 2008
Overview • Sequence comparison • Compare regions of similarity • Global alignment not necessarily 'valuable' • Indicative of evolutionary relationships. • Not always... • Horizontal gene transfer • Convergent evolution • Similar function different sequence • Databases • Growing exponentially
Sequence similarity • Percent identity • Number of matches in an alignment / alignment length • The best alignment gives us the maximum number of matches • Example • Cytochrome c • Small electron carrier protein in humans • Remember 3D structure! • 104 amino acids
Cytochrome c • Comparison between mouse and human • (95/104) = 91% match. • Evolutionary distance measured by similarity value.
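The percent identity calculation can be sketched as follows. This assumes two already-aligned strings of equal length with '-' marking gaps; the function name `percent_identity` is illustrative and not from the slides.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Matches divided by alignment length, expressed as a percentage."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both residues agree and neither is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# The slide's numbers: 95 matching positions out of 104.
print(round(100 * 95 / 104))  # 91
```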
Sequence Alignment • How do we perform this alignment? • What do we need to take into account? • The intuitive solution is to lay the two sequences along the x and y axes of a grid and compare. • In CS this is a dynamic programming problem. • Small example
Dynamic Programming • Set weights on different situations • +1 match • -1 mismatch • -2 insertion/deletion • Note: this gets expensive • Gaps resolved by insertion/deletion • If each gap position is treated as an individual event, a long gap carries a lot of weight • Instead use affine penalty function • penalty = G + nL • G : gap open penalty • L : gap extension penalty • n : size of gap
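A small sketch comparing the two gap models. The per-position cost of 2 comes from the weights above; the affine values G = 5 and L = 1 are illustrative assumptions, not values given in the slides.

```python
def linear_gap_penalty(n: int, per_position: int = 2) -> int:
    """Each gap position is scored as its own insertion/deletion event."""
    return n * per_position


def affine_gap_penalty(n: int, G: int = 5, L: int = 1) -> int:
    """penalty = G + n*L: pay once to open the gap, then a smaller cost per position."""
    if n <= 0:
        return 0
    return G + n * L


for n in (1, 5, 10):
    print(n, linear_gap_penalty(n), affine_gap_penalty(n))
```

With these assumed values a single-position gap costs more under the affine model, while a long gap costs less, which is the intended effect: one long insertion or deletion is treated as a single evolutionary event.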
Smith-Waterman • Dynamic programming algorithm with affine gap penalty developed by Smith and Waterman in 1981 • Guarantees optimal local alignment • Local alignment avoids noise • Reliable statistical model for optimal local alignments. • Expectation that the optimal alignment would occur by chance (Karlin-Altschul). • Low expectation = high chance they are homologous • Open source implementations available
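A minimal sketch of the local alignment recurrence, for illustration only. To stay short it uses the simple +1 / -1 / -2 weights from the dynamic programming slide with a linear gap cost rather than the affine penalty, and it returns only the best score (no traceback). The function name is an assumption.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap cost, score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 option lets an alignment restart anywhere: this is what makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTTACGTTTT", "GGACGGG"))  # 3, from the shared "ACG"
```

The table has |a| x |b| cells, so running this against every sequence in a large database is slow, which is the motivation for a heuristic such as BLAST.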
Substitution Matrix • +1 for match not good metric. • Substitution matrix • Probability one amino acid mutates into another amino acid. • Built empirically with large dataset to reflect true probabilities of mutation during evolutionary process • BLOSUM, PAM common examples. • BLOSUM for local alignment • PAM for global
Karlin-Altschul scoring • Expected frequency of a High Scoring Pair (HSP) versus random occurrence: • E = K M N e^(-λS) • S: alignment score • λ: unique positive value used to normalize the score • MN: search space (|query| * |target|) • K: constant, approx. 0.1 • E vs. MN: linear change • E vs. S: exponential change (a small change in score has a big effect on the value). • Assumptions: • i.i.d. (independent and identically distributed) • Roughly equal in length
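The formula can be evaluated directly. In this sketch K = 0.1 follows the approximate value on the slide, while λ = 0.3 is only a placeholder assumption; in practice both depend on the scoring system.

```python
import math


def expected_hsps(S: float, M: int, N: int, K: float = 0.1, lam: float = 0.3) -> float:
    """E = K * M * N * exp(-lambda * S)."""
    return K * M * N * math.exp(-lam * S)


base = expected_hsps(S=50, M=300, N=1_000_000)
# Doubling the search space doubles E (linear in MN).
print(expected_hsps(S=50, M=600, N=1_000_000) / base)
# Adding 10 to the score shrinks E by a factor of exp(-lam * 10) (exponential in S).
print(expected_hsps(S=60, M=300, N=1_000_000) / base)
```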
BLAST!!! • Basic Local Alignment Search Tool
BLAST • Pre-indexed database • Position of every 'word' is remembered • High Scoring segment Pair (HSP) • Local alignment with no gaps that scores high. • Query starts with best HSP and expands from there.
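The idea of remembering the position of every word can be sketched with a dictionary from word to start positions. This illustrates the concept only; it is not how any particular BLAST release stores its database, and `build_word_index` is a hypothetical name.

```python
from collections import defaultdict


def build_word_index(sequence: str, k: int) -> dict:
    """Map every length-k word in the sequence to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return dict(index)


index = build_word_index("ACGACG", 3)
print(index)  # {'ACG': [0, 3], 'CGA': [1], 'GAC': [2]}
```

Looking up a query word is then a single dictionary access instead of a scan over the whole sequence.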
BLAST search • Query • Break query into 'words' ('k' character strings) • n - k + 1 words • n = query length • k = word length (default in BLAST is 11 for nucleotides and 3 for proteins) • Scan database for words • Don't just use the exact word; also use similar words that score at least the threshold (T) against it • When one of these words matches the target, treat the match as a segment pair. • Extend the HSP until the score drops by 'X' below its max value • Report statistically significant scores.
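Two of the steps above, sketched under simplifying assumptions: seeds are exact word matches (no neighborhood words), scoring is +1 / -1 rather than a substitution matrix, and extension runs only to the right, whereas BLAST extends in both directions. The function names are illustrative.

```python
def query_words(query: str, k: int) -> list:
    """All n - k + 1 overlapping words of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def extend_right(query: str, target: str, q_start: int, t_start: int,
                 k: int, X: int, match: int = 1, mismatch: int = -1) -> tuple:
    """Ungapped extension to the right of an exact seed of length k.

    Stops once the running score has dropped X below the best score seen.
    Returns (best_score, length_of_best_segment).
    """
    score = best = k * match  # assumes the seed itself matches exactly
    best_len = k
    i = k
    while q_start + i < len(query) and t_start + i < len(target):
        score += match if query[q_start + i] == target[t_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= X:
            break
    return best, best_len


print(len(query_words("ACGTAC", 3)))                      # 4 = 6 - 3 + 1
print(extend_right("ACGTTTGG", "ACGTTTCC", 0, 0, 3, 2))   # (6, 6)
```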
BLAST statistics • HSP has a P value based on Poisson distribution • Small P value means significant score • Applies to ungapped segments • For gapped alignment, calculate E value • Karlin-Altschul method • E value decreases exponentially as score increases! • Boils down to: • Results that aren't likely to happen by chance are best. • When we find these, report them as significant • Low E values mean a more significant result.
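If the number of chance HSPs scoring at least S is Poisson distributed with mean E, the probability of seeing at least one is P = 1 - e^(-E). A small sketch of that conversion (the function name is illustrative):

```python
import math


def p_value_from_e(E: float) -> float:
    """P(at least one chance HSP with score >= S) for a Poisson count with mean E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-E)


for E in (10.0, 1.0, 0.01, 1e-6):
    print(E, p_value_from_e(E))
```

For small values P and E are nearly equal, so either one can be read as a measure of significance; for large E, P saturates near 1.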
Using BLAST • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html
BLAST customization • Lowering the Neighborhood Word Threshold (T) finds more distantly related sequences • Raising the Segment Extension Cutoff (X) extends the region that might be considered an HSP • Changing the Expectation (E) just changes the score threshold (how good a hit has to be to show up).
BLAST scores • Raw score • Sum of substitution scores and gap penalties • Substitution score given by substitution matrix (BLOSUM, PAM) • Gap penalty calculated using affine penalty function • G: gap opening penalty • L: gap extension penalty • Bit score • Normalized raw score so that scores from different substitution matrices can be compared. • E-value • Expected number of alignments with the same or better score that would occur by chance • Based on the Karlin-Altschul method
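The normalization behind the bit score is the standard Karlin-Altschul conversion S' = (λS - ln K) / ln 2, after which E = M N 2^(-S'). The sketch below checks that this agrees with E = K M N e^(-λS); the values λ = 0.3 and K = 0.1 are placeholder assumptions, not parameters of any specific matrix.

```python
import math


def bit_score(S: float, K: float = 0.1, lam: float = 0.3) -> float:
    """Normalized score S' = (lam * S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


def e_value_from_bits(bits: float, M: int, N: int) -> float:
    """E = M * N * 2^(-S'); K and lambda are already absorbed into the bit score."""
    return M * N * 2.0 ** (-bits)


S, M, N = 50, 300, 1_000_000
from_raw = 0.1 * M * N * math.exp(-0.3 * S)
from_bits = e_value_from_bits(bit_score(S), M, N)
print(from_raw, from_bits)
```

Because K and λ are folded into S', two bit scores can be compared without knowing which substitution matrix produced them.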
Sources • http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=bioinfo.chapter.A05 • http://www.ludwig.edu.au/course/course2002/talks/flegg02search/sld018.htm • http://en.wikipedia.org/wiki/Smith-Waterman_algorithm • http://math.la.asu.edu/~cbs/pdfs/projects/Fall_2005/Karlin-AltschulStatistics.pdf • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html