Statistical significance for trace recalling alignments

Statistical significance for trace recalling alignments

Brief TR review • Generate ambiguity sequence from trace (up to 2 bases per position) • Align ambiguity sequence to genomic sequence (primary alignment) • Use aligned genome sequence to “subtract out” one template sequence • Remaining sequence comes from other template

Brief TR review • Can align this sequence to genome to differentiate real secondary trace from noise (secondary alignment)

What constitutes a good alignment? • Expect primary alignment to have a very high percent identity • Represents stronger of 2 templates present • Ambiguity sequence so generally expect more matches than DNA vs. DNA alignment • Expect few (non-intron) gaps

What constitutes a good alignment? • Expect secondary alignment to have lower percent identity • Weaker of 2 signals in trace • DNA + N only • Still don’t expect many gaps (N’s are used as placeholders)

What constitutes a good alignment? • EST_GENOME doesn't report statistical significance of alignments • Have been using mainly percent identity and length of alignment to determine if procedure “worked” • Decent heuristic but not really the right way to determine significance (and can be misleading)

What is the right way to determine significance? • Would like something like a BLAST E-value • Mathematically sound • Single number • No arbitrary cutoffs

What is the BLAST E-value? • For a given search: • Scoring system (matrix and gap penalties) • Lengths of query and database • For a given MSP: • Score = S • E = expected frequency of chance occurrences of an MSP with score S or greater

What is the BLAST E-value? • So if E = 5 you expect to see 5 MSPs with the same or better score and your MSP is not significant • If E = 10-50 you really don’t expect to see an alignment this good by chance

How is E defined? • E = kmne-λS where: • m = length of query • n = length of database • S = score of MSP • k & λ are constants

How is E defined? • For ungapped alignments there are closed form solutions for k & λ given the scoring matrix • For gapped alignments they must be estimated empirically

What’s going on behind the scenes here? • The BLAST statistics paper (Karlin & Altschul, 1990) states the formula for E without proof • References another paper (Karlin, Dembo & Kawabata, 1990) in Annals of Statistics which contains the proof

What’s going on behind the scenes here? • The BLAST statistics paper (Karlin & Altschul, 1990) states the formula for E without proof • References another paper (Karlin, Dembo & Kawabata, 1990) in Annals of Statistics which contains the proof 

Yikes

What’s going on behind the scenes here (take 2)? • MSP scores are distributed by an extreme value or Gumbel distribution • Looks like a normal distribution with probability density skewed to the right

Extreme value distribution

Why an extreme value distribution? • You get an EVD when the random variable you’re looking at is the maximum of another set of random variables • Might help to see this in another domain • Classic example: Floods • Record height of river every day (Xi) over many years • Define annual flood to be maximum river level for each calendar year (Yj)

Why an extreme value distribution? • The Yj values will be distributed by an EVD • 100 year levees • Distribution of Yj values sensitive to distribution of Xi but not as much as you might suspect • Similar EVD if Xi is exponentially or normally distributed!

Why an extreme value distribution? • MSP scores in an alignment are analogous to annual flood levels • Highest scoring pair between 2 sequences is maximum of several high scoring pairs

What do k (or u) & λ do? • u = ln(kmn)/λ for alignment • EVD cdf  F(x) = e-e-λ(x-u) • EVF pdf  f(x) = λe-λ(x-u) e-e-λ(x-u) • λ controls the spread and u (or k) controls the location

u controls position

λ controls spread

Back to trace recalling • Said earlier that for gapped alignment k & λ must be estimated • One method to get parameters is to generate many random sequences, align them to get optimal score and fit an EVD to them • u & λ can be obtained from mean and standard deviation of this set of scores

Parameters • By method of moments • λ = π/sqrt(6*V) • u = μ – γ/ λ where γ = Euler’s constant = 0.577

Histogram of optimal scores and fit EVD

Back to trace recalling • Secondary alignment is straightforward (DNA vs. DNA alignments) • Primary alignment is a bit trickier • As originally implemented λ and u depended on percentage of ambiguous positions in ambiguity sequence

Primary alignment parameters

Why was this happening? • Gave same score to match between ambiguous positions and non-ambiguous positions • Random alignments to 100% ambiguous sequence were ~2x as long and high scoring as 0% ambiguous sequence

How was it corrected? • Setting match score for ambiguous base at half score of non-ambiguous base fixed problem • This actually makes more sense anyway • No matter what percentage ambiguity positions get λ = 0.66 and k = 0.477

In conclusion • Have added this to trace recalling pipeline • Now in addition to checking for alternate splices by comparing alignments can screen out bad alignments by E-value

Statistical significance for trace recalling alignments

Statistical significance for trace recalling alignments

Presentation Transcript

Statistical vs. Practical Significance

Clinical, Practical or Mechanistic Significance vs Statistical Significance for POPULATION Effects

Statistical Significance Testing

Statistical vs. Practical Significance

Statistical Significance: Tests for Spatial Randomness

Statistical Significance for a two-way table

Statistical Significance

Determining Statistical Significance

Statistical Significance

Statistical vs Clinical Significance

Unit 4: Statistical Significance

Trace recalling on MGC traces

Statistical Significance

Statistical Significance

Statistical significance for genomewide studies

Statistical significance

TESTS OF STATISTICAL SIGNIFICANCE

Statistical Significance Testing

Statistical Significance

Statistical Significance Testing

Estimating Statistical Significance

Statistical Significance: Tests for Spatial Randomness