600 likes | 611 Views
Explore dynamic programming techniques for gene transcription and sequence alignment, including global and local alignment, statistical significance evaluation, and extreme value distributions. Learn algorithms, examples, and best practices.
E N D
Alignment Class II We continue where we stopped last week: Dynamic programing
a gene transcription pre-mRNA splicing mature mRNA translation protein Reminder -Structure of a genome
Pairwise Sequence Alignment • Example • Which one is better? HEAGAWGHEE PAWHEAE HEAGAWGHE-E HEAGAWGHE-E P-A--W-HEAE --P-AW-HEAE
Example • Gap penalty: -8 • Gap extension: -3 HEAGAWGHE-E --P-AW-HEAE (-8) + (-8) + (-1) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 9 HEAGAWGHE-E Exercise: Calculate for P-A--W-HEAE
Global Alignment • Notation • xi – ith letter of string x • yj – jth letter of string y • x1..i – Prefix of x from letters 1 through I • F – matrix of optimal scores • F(i,j) represents optimal score lining up x1..i with y1..j • d – gap penalty • s – scoring matrix
Global Alignment • The work is to build up F • Initialize: F(0,0) = 0, F(i,0) = id, F(0,j)=jd • Fill from top left to bottom right using the recursive relation
Example _ _ X _ _ X X X X_ _X _X X_
Example X X X_ _X _X X_ X _ XY __ XY X_ X_Y _X_ _XY X__ XY _X XY_ __X
Global Alignment yj aligned to gap Move ahead in both s(xi,yj) d d xi aligned to gap While building the table, keep track of where optimal score came from, reverse arrows
Completed Table Score Gap –8 Error –2 Fit +6
Traceback • Trace arrows back from the lower right to top left • Diagonal – both • Up – upper gap • Left – lower gap HEAGAWGHE-E --P-AW-HEAE
Summary • Uses recursion to fill in intermediate results table • Uses O(nm) space and time • O(n2) algorithm • Feasible for moderate sized sequences, but not for aligning whole genomes.
Local Alignment • Smith-Waterman (1981) • Another dynamic programming solution
Traceback Start at highest score and traceback to first 0 AWGHE AW-HE
Summary • Similar to global alignment algorithm • For this to work, expected match with random sequence must have negative score. • Behavior is like global alignment otherwise • Similar extensions for repeated and overlap matching • Care must be given to gap penalties to maintain O(nm) time complexity
Statistical Significance of Sequence Alignments • STATISTICAL SIGNIFICANCE = probability that our score would be found between random (or unrelated) sequences • Examine alignment for long runs of matches and placement of gaps • Try alternative alignments and compare scores • Calculate statistical significance of alignment score using extreme value distribution formula • Scramble one of sequences 1000's of times and realign to obtain idea of distribution of scores with unrelated sequences of same size
Alternative alignments Programs like LALIGN (stands for local alignment) produce as many different alignments as you like. Each subsequent alignment does not align the same sequence positions.
Extreme Value Distributions • The average of n samples taken from any distribution with finite mean and variance will have a normal distribution for large n. This is the CLT. • The largest member of a sample of size n has a LEV, Type Ilargest extreme value, also called Gumbel, distribution, regardless of the parent population, • IF the parent has an unbounded tail that decreases at least as fast as an exponential function, and has finite moments (as does the normal, for example). The LEV, has pdf given by • f (x | q1 , q2 ) = 1/ q2 exp( -z -exp( -z ) ) ) • where z = (x - q1) / q2 and q1, q2 are location and scale* parameters, respectively, and q1 > 0.
Normal Distribution versus Extreme Value Distribution Normal distribution: y = exp(-x2/2) / sqrt(2π) Normal Extreme value distribution: y = exp(-x – exp(-x)) Extreme Value
Extreme Value Distribution • Probability density function: • f(x) = exp(-x-exp(-x)) • Cumulative distribution function: • Prob(S<x) = exp(-exp(-x)) • Prob(S≥x) = 1 - Prob(S<x) = 1 - exp(-exp(-x)) • Sample mean m, sample variance σ2 • λ = 1.2825 / σ • μ = m – 0.45σ • Prob(x) = exp(-x-exp(-λ(x-μ)))
0.4 A. 0.2 Yev -2 -1 0 1 2 3 4 5 X Calculating the statistical significance • How is this done? • Make many random protein sequences of varying lengths • Locally align two sequences of a given length with the same scoring matrix and gap penalty as used for our alignment • Repeat for many pairs of sequences of approximately the same length to see how high a score we can get • Look at the distribution of scores for a given range of lengths
The formula for extreme values The probability that a score S between two unrelated sequences is equal to or greater than a value x, P (S > x) is given by: P = 1 – exp ( K m n e-lx) wherem and n are the sequence lengths, lis a “scaling factor” for the scoring matrix used, and K is a constant that depends on the scoring matrix and gap penalty combination that is used. What we want is a score that gives a very low value of P, say less than 0.01- 0.05. However, there is a trick here. We usually calculate another value E, the expect value for the alignment score that depends on how we found P. This method is used by BLAST. For blosum62, gap –11/-1, l = 0.3 and K=0.1 (roughly).
Values Describing Scores • Both the Gumbel Extreme Value Distribution and Karlin-Ashtul Distribution use E values and P values • E-value (Expect value): the average number of times such a match would be found • P-value (probability): probability of finding an alignment under assumptions • Important note: Alignments that are statistically important may not be biologically important
The Dot Matrix Method. • Provides a ‘Gestalt’ of all possible alignments between two sequences. • To begin — Let us use a very simple 0, 1 (match, no-match) identity scoring function without any windowing. The sequences to be compared are written out along the x and y axes of a matrix. • Put a dot wherever symbols match; identities are highlighted.
Since this is a comparison between two of the same sequences, an intra-sequence comparison, the most obvious feature is the main identity diagonal. Two short perfect palindromes can also be seen as crosses directly off the main diagonal; they are “ANA” and “SIS.”
Dot Matrices • The biggest asset of dot matrix analysis is it allows you to visualize the entire comparison at once, not concentrating on any one ‘optimal’ region, but rather giving you the ‘Gestalt’ of the whole thing. • Since your own mind and eyes are still better than computers at discerning complex visual patterns, especially when more than one pattern is being considered, you can see all these ‘less than best’ comparisons as well as the main one and then you can ‘zoom-in’ on those regions of interest using more detailed procedures.
It is impossible to tell whether the evolutionary event that caused the discrepancy between the two sequences was an insertion or a deletion and hence this phenomena is called an ‘indel.’
The ‘duplication’ here is seen as a distinct column of diagonals; whenever you see either a row or column of diagonals in a dotplot, you are looking at direct repeats.
Again, notice the diagonals. However, they have now been displaced off of the center diagonal of the plot and, in fact, in this example, show the occurrence of a ‘transposition.’ Dot matrix analysis is one of the only sensible ways to locate such transpositions in sequences. Inverted repeats still show up as perpendicular lines to the diagonals, they are just now not on the center of the plot. The ‘deletion’ of ‘PRIMER’ is shown by the lack of a corresponding diagonal.
Filtered Windowing • Reconsider the same plot. Notice the extraneous dots that neither indicate runs of identity between the two sequences nor inverted repeats. These merely contribute ‘noise’ to the plot and are due to the ‘random’ occurrence of the letters in the sequences, the composition of the sequences themselves. • How can we ‘clean up’ the plots so that this noise does not detract from our interpretations? Consider the implementation of a filtered windowing approach; a dot will only be placed if some ‘stringency’ is met. • What is meant by this is that if within some defined window size, and when some defined criteria is met, then and only then, will a dot be placed at the middle of that window. Then the window is shifted one position and the entire process is repeated. This very successfully rids the plot of unwanted noise.
In this plot a window of size three and a stringency of two is used to considerably improve the signal to noise ratio Windowing
Alignment to Databsases Fasta Blast
Query: DNA Protein Database: DNA Protein FASTA FastA is a family of programs: FastA, TFastA, FastX, FastY
Gap opening penalty -12, -16 by default for fasta with proteins and DNA, respectively Gap extension penalty -2, -4 by default for fasta with proteins and DNA, respectively Blosum50 default. Lower PAM higher blosum to detect close sequences Higher PAM and lower blosum to detect distant sequences The larger the word-length the less sensitive, but faster the search will be Max number of scores and alignments is 100 FastA
Initn, init1, opt, z-score calculated during run Database code hyperlinked to the SRS database at EBI E score - expectation value, how many hits are expected to be found by chance with such a score while comparing this query to this database. E() does not represent the % similarity Accession number Description Length FastA Output
FASTA-Stages • Find k-tups in the two sequences (k=1,2 for proteins, 4-6 for DNA sequences) • Score and select top 10 scoring “local diagonals” • For proteins, each k-tup found is scored using the PAM250 matrix • For DNA, the number of k-tups found • Penalize intervening gaps
FASTA-Stages • Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores. • Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score • After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.
Finding k-tups position 1 2 3 4 5 6 7 8 9 10 11 protein 1 n c s p t a . . . . . protein 2 . . . . . a c s p r k position in offset amino acid protein A protein B pos A - posB ----------------------------------------------------- a 6 6 0 c 2 7 -5 k - 11 n 1 - p 4 9 -5 r - 10 s 3 8 -5 t 5 - ----------------------------------------------------- Note the common offset for the 3 amino acids c,s and p A possible alignment is thus quickly found - protein 1 n c s p t a | | | protein 2 a c s p r k
BLAST • Basic Local Alignment Search Tool • Altschul et al. 1990,1994,1997 • Heuristic method for local alignment • Designed specifically for database searches • Idea: Good alignments contain short lengths of exact matches
Query: DNA Protein Database: DNA Protein Blast Application • Blast is a family of programs: BlastN, BlastP, BlastX, tBlastN, tBlastX • BlastN - nt versus nt database • BlastP - protein versus protein database • BlastX - translated nt versus protein database • tBlastN - protein versus translated nt database • tBlastX - translated nt versus translated nt database
Blast – Basic Local Alignment Search Tool • Blast uses a heuristic search algorithm and uses statistical methods of Karlin and Altshul (1990) • Blast programs were designed for fast database searching with minimal sacrifice of sensitivity for distantly related sequences
Mathematical Basis of BLAST • Model matches as a sequence of coin tosses • Let p be the probability of a “head” • For a “fair” coin, p = 0.5 • (Erdös-Rényi) If there are n throws, then the expected length R of the longest run of heads is R = log1/p (n). • Example: Suppose n = 20 for a “fair” coin R=log2(20)=4.32 • Trick is how to model DNA (or amino acid) sequence alignments as coin tosses.
Mathematical Basis of BLAST • To model random sequence alignments, replace a match with a “head” and mismatch with a “tail”. • For DNA, the probability of a “head” is 1/4 • Same logic applies to amino acids AATCAT ATTCAG HTHHHT
Mathematical Basis of BLAST • So, for one particular alignment, the Erdös-Rényi property can be applied • What about for all possible alignments? • Consider that sequences are being shifted back and forth, dot matrix plot • The expected length of the longest match is R=log1/p(mn) where m and n are the lengths of the two sequences.