300 likes | 444 Views
Statistical significance for trace recalling alignments. Brief TR review. Generate ambiguity sequence from trace (up to 2 bases per position) Align ambiguity sequence to genomic sequence (primary alignment) Use aligned genome sequence to “subtract out” one template sequence
E N D
Brief TR review • Generate ambiguity sequence from trace (up to 2 bases per position) • Align ambiguity sequence to genomic sequence (primary alignment) • Use aligned genome sequence to “subtract out” one template sequence • Remaining sequence comes from other template
Brief TR review • Can align this sequence to genome to differentiate real secondary trace from noise (secondary alignment)
What constitutes a good alignment? • Expect primary alignment to have a very high percent identity • Represents stronger of 2 templates present • Ambiguity sequence so generally expect more matches than DNA vs. DNA alignment • Expect few (non-intron) gaps
What constitutes a good alignment? • Expect secondary alignment to have lower percent identity • Weaker of 2 signals in trace • DNA + N only • Still don’t expect many gaps (N’s are used as placeholders)
What constitutes a good alignment? • EST_GENOME doesn't report statistical significance of alignments • Have been using mainly percent identity and length of alignment to determine if procedure “worked” • Decent heuristic but not really the right way to determine significance (and can be misleading)
What is the right way to determine significance? • Would like something like a BLAST E-value • Mathematically sound • Single number • No arbitrary cutoffs
What is the BLAST E-value? • For a given search: • Scoring system (matrix and gap penalties) • Lengths of query and database • For a given MSP: • Score = S • E = expected frequency of chance occurrences of an MSP with score S or greater
What is the BLAST E-value? • So if E = 5 you expect to see 5 MSPs with the same or better score and your MSP is not significant • If E = 10-50 you really don’t expect to see an alignment this good by chance
How is E defined? • E = kmne-λS where: • m = length of query • n = length of database • S = score of MSP • k & λ are constants
How is E defined? • For ungapped alignments there are closed form solutions for k & λ given the scoring matrix • For gapped alignments they must be estimated empirically
What’s going on behind the scenes here? • The BLAST statistics paper (Karlin & Altschul, 1990) states the formula for E without proof • References another paper (Karlin, Dembo & Kawabata, 1990) in Annals of Statistics which contains the proof
What’s going on behind the scenes here? • The BLAST statistics paper (Karlin & Altschul, 1990) states the formula for E without proof • References another paper (Karlin, Dembo & Kawabata, 1990) in Annals of Statistics which contains the proof
What’s going on behind the scenes here (take 2)? • MSP scores are distributed by an extreme value or Gumbel distribution • Looks like a normal distribution with probability density skewed to the right
Why an extreme value distribution? • You get an EVD when the random variable you’re looking at is the maximum of another set of random variables • Might help to see this in another domain • Classic example: Floods • Record height of river every day (Xi) over many years • Define annual flood to be maximum river level for each calendar year (Yj)
Why an extreme value distribution? • The Yj values will be distributed by an EVD • 100 year levees • Distribution of Yj values sensitive to distribution of Xi but not as much as you might suspect • Similar EVD if Xi is exponentially or normally distributed!
Why an extreme value distribution? • MSP scores in an alignment are analogous to annual flood levels • Highest scoring pair between 2 sequences is maximum of several high scoring pairs
What do k (or u) & λ do? • u = ln(kmn)/λ for alignment • EVD cdf F(x) = e-e-λ(x-u) • EVF pdf f(x) = λe-λ(x-u) e-e-λ(x-u) • λ controls the spread and u (or k) controls the location
Back to trace recalling • Said earlier that for gapped alignment k & λ must be estimated • One method to get parameters is to generate many random sequences, align them to get optimal score and fit an EVD to them • u & λ can be obtained from mean and standard deviation of this set of scores
Parameters • By method of moments • λ = π/sqrt(6*V) • u = μ – γ/ λ where γ = Euler’s constant = 0.577
Back to trace recalling • Secondary alignment is straightforward (DNA vs. DNA alignments) • Primary alignment is a bit trickier • As originally implemented λ and u depended on percentage of ambiguous positions in ambiguity sequence
Why was this happening? • Gave same score to match between ambiguous positions and non-ambiguous positions • Random alignments to 100% ambiguous sequence were ~2x as long and high scoring as 0% ambiguous sequence
How was it corrected? • Setting match score for ambiguous base at half score of non-ambiguous base fixed problem • This actually makes more sense anyway • No matter what percentage ambiguity positions get λ = 0.66 and k = 0.477
In conclusion • Have added this to trace recalling pipeline • Now in addition to checking for alternate splices by comparing alignments can screen out bad alignments by E-value