Pairwise Sequence Analysis-V • Relatedness • “Not just important, but everything” • Modeling Alignment Scores • Coin Tosses • Unit Distributions • Extreme Value Distribution • Lambda and K revealed • Loose Ends Lecture 6 CS566
Modeling Expectation • Reduced model: coin tosses • Given: • N coin tosses • Probability of heads p • Problem: • What is the expected length of the longest run of heads? • Solution: • Experimental: perform many repetitions and count • Theoretical: E(Run_max) = log_{1/p}(N) • For example, for a fair coin and 64 tosses, E(Run_max) = log_2(64) = 6
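The theoretical expectation above is easy to check by simulation. A minimal sketch (function names are illustrative, not from the lecture) that estimates E(Run_max) empirically and compares it against log_{1/p}(N):

```python
import math
import random

def longest_run(tosses):
    """Length of the longest run of heads (True) in a toss sequence."""
    best = cur = 0
    for t in tosses:
        cur = cur + 1 if t else 0
        best = max(best, cur)
    return best

def expected_longest_run(n, p=0.5, trials=5000):
    """Monte Carlo estimate of E(Run_max) for n tosses with P(heads) = p."""
    total = 0
    for _ in range(trials):
        total += longest_run(random.random() < p for _ in range(n))
    return total / trials

n, p = 64, 0.5
theory = math.log(n, 1 / p)          # log_{1/p}(N) = log_2(64) = 6 for a fair coin
print(theory, expected_longest_run(n, p))
```

The simulated mean lands near the log_{1/p}(N) prediction (slightly below it for a fair coin, since the formula is an asymptotic estimate).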
Random alignment as Coin Tosses • Head = Match • Assume • Score = run of matches • Maximum score = longest run of matches • Therefore • The same model of expectation applies • For example, for DNA sequences of length N: • E(matchLength_max) = expected longest run of matches = log_{1/p}(N)
Local alignment as Coin Tosses • Assume • Score in local alignment = run of matches • Maximum score = longest run of matches • Therefore • A similar model of expectation applies • For DNA sequences of lengths n and m: • E(MatchLength_max) ≈ log_{1/p}(nm) (Why not just n or m? Because a local match can begin at roughly any of the nm pairs of start positions) ≈ log_{1/p}(K′nm) • Var(MatchLength_max) = C, a constant (i.e., independent of sample-space size)
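As a quick numeric check of the nm scaling (the helper name and default K′ = 1 are illustrative assumptions):

```python
import math

def expected_max_match(n, m, p=0.25, k=1.0):
    """E(MatchLength_max) ~ log_{1/p}(K' * n * m) for DNA,
    where p = 1/4 is the per-site match probability for uniform bases."""
    # Comparing both sequences gives ~ n*m possible local-alignment
    # start points, hence the nm inside the log rather than n or m alone.
    return math.log(k * n * m, 1 / p)

print(expected_max_match(1000, 1000))   # log_4(10^6) ≈ 9.97
```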
Refining the Model • S = AS-matrix-based score between unrelated sequences • E(S) ≈ log_{1/p}(K′nm) = [ln(Knm)]/λ (where λ = log_e(1/p)) • Holy grail: need P(S > x), the probability that a score between unrelated sequences exceeds x
Poisson distribution estimate of P(S > x) • Consider the coin-toss example, given x >> E(Run_max) • Define Success = (Run_max ≥ x) • Define P_n = probability of n successes • Define y = E[Successes], i.e., the average number of successes • Then the number of successes follows a Poisson distribution: P_n = (e^{-y} y^n)/n! • The probability of 0 successes (no score exceeding x) is P_0 = e^{-y} • Then the probability of at least one score exceeding x is P(S > x) = Σ_{i≥1} P_i = 1 − P_0 = 1 − e^{-y} • Here the Poisson mean is y = Kmn e^{−λx}. Therefore, P(S > x) = 1 − exp(−Kmn e^{−λx})
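The final formula above is directly computable. A sketch, using illustrative (not database-calibrated) values for λ and K:

```python
import math

def karlin_altschul_pvalue(x, m, n, lam, K):
    """P(S > x) = 1 - exp(-y), where y = K*m*n*exp(-lambda*x)
    is the Poisson mean: the expected number of unrelated-sequence
    scores exceeding x."""
    y = K * m * n * math.exp(-lam * x)
    return 1.0 - math.exp(-y)

# Hypothetical parameters for a DNA scoring scheme:
m = n = 1000
lam, K = 1.10, 0.33
for x in (10, 15, 20):
    print(x, karlin_altschul_pvalue(x, m, n, lam, K))
```

Note that when the Poisson mean y is small, 1 − e^{−y} ≈ y, so the P-value and the expected count nearly coincide for high cut-offs.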
Unit Distributions • Normalize the Gaussian and the EVD so that • Area under the curve = 1 • Curve maximum is at 0 • Then • For the Gaussian: Mean = 0; SD = 1 • For the EVD: Mean = 0.5772 (Euler's constant); Variance = π²/6 ≈ 1.645 • Unit form: P(S > x) = 1 − exp(−e^{−x}) • General form: P(S > x) = 1 − exp(−e^{−λ(x−u)}) • Z-score representation in terms of SDs: P(Z > z) = 1 − exp(−e^{−1.2825z − 0.5772})
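The Z-score form is handy because it needs no λ or K, only how many SDs a score sits above the mean. A minimal sketch (function name is an assumption):

```python
import math

def evd_pvalue_from_z(z):
    """P(Z > z) for the unit-variance EVD: 1 - exp(-e^{-1.2825*z - 0.5772}).
    The constants are pi/sqrt(6) ~ 1.2825 (scale) and Euler's
    constant ~ 0.5772 (location), from the unit normalization."""
    return 1.0 - math.exp(-math.exp(-1.2825 * z - 0.5772))

for z in (0, 2, 5):
    print(z, evd_pvalue_from_z(z))
```

Unlike the Gaussian, the EVD's right tail decays only singly exponentially, so a given Z-score is far less significant under the EVD than a Gaussian intuition would suggest.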
Lambda and K • λ = scale factor for the scoring system • Effectively converts AS-matrix values into actual natural-log likelihoods • K = scale factor that reduces the search space to compensate for the non-independence of local alignments • Estimated by fitting to the Poisson approximation or to the equation for E(S)
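For ungapped scoring, λ can also be computed directly: it is the unique positive root of Σ_ij p_i p_j e^{λ·s_ij} = 1 (the Karlin–Altschul condition, which has a positive solution when the expected score is negative). A sketch with a simple bisection solver (solver name and bracket values are assumptions):

```python
import math

def solve_lambda(pairs, lo=1e-6, hi=10.0, tol=1e-9):
    """Solve sum_ij p_i p_j * e^{lambda * s_ij} = 1 for lambda > 0
    by bisection. `pairs` lists (probability, score) per residue pair.
    Requires a negative expected score, so f is negative near 0 and
    crosses zero once at the positive root."""
    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in pairs) - 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# DNA with uniform base frequencies and +1/-1 scoring:
# P(match) = 0.25 with score +1, P(mismatch) = 0.75 with score -1.
lam = solve_lambda([(0.25, 1), (0.75, -1)])
print(lam)   # ln(3) ~ 1.0986
```

For this +1/−1 scheme the root can be checked by hand: 0.25·e^λ + 0.75·e^{−λ} = 1 gives e^λ = 3, i.e., λ = ln 3.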
Treasure Trove of Probabilities • The probability distribution of scores between unrelated sequences, P(S_unrel) • The probability distribution of the number of scores from P(S_unrel) exceeding some cut-off; its mean is the number of scores exceeding the cut-off observed on average • The probability of observing a score ≥ x between unrelated sequences, P(S ≥ x)
Loose Ends • What about gap parameters? • Short answer: no formal theory • Long answer: • Found empirically • The choice of parameters can effectively convert a local alignment algorithm into a global one • What about gapped alignment? • Not formally proven, but simulations show statistical behavior similar to ungapped alignment • Edge effect: effective sequence length n′ = n − E(matchLength_max)
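The edge correction follows directly from the earlier expectation formula. A sketch combining the two (helper name and default K = 1 are assumptions):

```python
import math

def effective_length(n, m, p=0.25, K=1.0):
    """Edge correction: n' = n - E(matchLength_max),
    with E(matchLength_max) ~ log_{1/p}(K * n * m).
    Alignments starting near a sequence end cannot reach full
    length, so the usable length is slightly shorter than n."""
    e_len = math.log(K * n * m, 1 / p)
    return n - e_len

print(effective_length(1000, 1000))   # ~ 1000 - 9.97
```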