Motifs for Unknown Sites Vasileios Hatzivassiloglou University of Texas at Dallas
Decomposing KL • Back to our profiles • If we consider the independent marginal distributions Pj and Qj for each of the n positions j, it can be shown that D(P||Q) = Σj=1..n D(Pj||Qj) • so relative entropy can be measured position by position
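To make the decomposition concrete, here is a minimal sketch in Python; the dict-of-probabilities representation of a profile and the function names are illustrative choices, not notation from the lecture.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) in bits for a single position."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

def profile_kl(P, Q):
    """D(P||Q) for independent positions: the sum of per-position terms."""
    return sum(kl(Pj, Qj) for Pj, Qj in zip(P, Q))

# Example: a 2-position profile against a uniform background.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
P = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},   # perfectly conserved: 2 bits
     {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}]   # two equally likely: 1 bit
print(profile_kl(P, [uniform, uniform]))          # 3.0
```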
What is an acceptable KL value? • Assume perfect patterns and a uniform background distribution • p=1 for exactly one nucleotide at each position • q=0.25 for each nucleotide everywhere • Then the relative entropy will be • log2 4 = 2 bits per position • 2n bits overall
Comparing with the background • Under the uniform assumption, the motif will appear by chance once every 4^n letters • A real motif should not match in too many random places, so for about one random match we want 4^n ≈ Γ (the length of the genome) • Therefore, Γ = 4^n = 2^(2n) = 2^D(P||Q) • So a good value for D(P||Q) is log2 Γ
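A quick numerical check of the threshold; the 3 Gbp genome length is an illustrative assumption (roughly a human genome), not a number from the slides.

```python
import math

genome_length = 3e9                      # assumed Γ
threshold = math.log2(genome_length)     # target value for D(P||Q)
print(threshold)                         # ≈ 31.5 bits
print(threshold / 2)                     # ≈ 15.7 perfect 2-bit positions
```

So under these assumptions, a perfectly conserved motif needs about 16 positions before it stops matching the genome by chance.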
Site selection by relative entropy • Consider the more general problem with unknown motif instance positions • Given k sequences and an integer n, find one length n substring of each of the k sequences such that for the induced profile A, the relative entropy D(A||B) is maximum. • This problem is provably NP-complete.
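To see why an exact search is hopeless, note that there are (m−n+1)^k ways to pick one length-n instance from each of k sequences of length m. A tiny sketch of the growth (the m, n, k values are illustrative):

```python
m, n = 1000, 10                 # assumed: 1 kb sequences, motif length 10
for k in (5, 10, 20):
    print(k, (m - n + 1) ** k)  # k=20 already gives ~10^60 candidate sets
```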
Assumptions • Same length n for all motif instances (no indels) • One instance per sequence (in reality, there may be zero or more than one) • Example: Given all genes involved in digestion in yeast, use as input sequences the 1 kb upstream region of each such gene • The goal is to find transcription factor binding sites
Relative entropy implications • The number of sequences k • does not affect relative entropy; the proportion of matching nucleotides does • k indirectly affects our confidence in the estimate of the profile A • The length n • does affect relative entropy • Because of the additive decomposition by position, and the fact that each position contributes a non-negative term, increasing n can only increase relative entropy • To compare motifs of different lengths, normalize by dividing by n
Addressing NP-completeness • As we mentioned earlier, we have to look at approximate solutions • Several broad strategies are available • greedy algorithms (local optimization) • statistical simulation (Gibbs sampling)
Greedy approach • Choose locally optimal candidate motif instance sets • Augment a starting point in the direction that locally seems the most promising • Keep a limited number of candidate solutions at each iteration
Hertz-Stormo (1999) algorithm • Maintain a set S of sets; each member Sij is a candidate solution • Start with singleton sets S1j, one for each length-n substring of the first input sequence • At step i, add to each S(i-1)j each of the length-n substrings from input sequence i (alternatively, from all sequences ≥ i)
Hertz-Stormo algorithm • After a new string has been added to Sij, recalculate the profile A and D(A||B) • Prune S by keeping only the d best scoring Sij’s (this is the heuristic step) • Repeat until i=k
Hertz-Stormo example • Input sequences ACTGA, TAGCG, CTTGC and n=4 • Start with S={{ACTG}, {CTGA}} • Calculate profile A and D(A||B) for each S1j and keep the d best sets • Expanding S11={ACTG} produces S21={ACTG, TAGC} and S22={ACTG, AGCG} • Under the "all sequences ≥ i" variant, also consider S23={ACTG, CTTG} and S24={ACTG, TTGC}
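A minimal sketch of the greedy beam search on the example above, assuming a uniform background and raw frequencies without pseudocounts; all names are my own, and real implementations add pseudocounts and randomized restarts.

```python
import math

ALPHABET = "ACGT"

def profile_kl(strings, background):
    """D(A||B) for the profile A induced by a set of aligned strings."""
    n, k = len(strings[0]), len(strings)
    total = 0.0
    for j in range(n):
        for c in ALPHABET:
            p = sum(s[j] == c for s in strings) / k
            if p > 0:
                total += p * math.log2(p / background[c])
    return total

def hertz_stormo(sequences, n, d, background=None):
    """Extend each kept candidate set with every length-n substring of the
    next sequence; prune to the d best sets by D(A||B)."""
    if background is None:
        background = {c: 0.25 for c in ALPHABET}
    score = lambda s: profile_kl(s, background)
    # S1j: one singleton set per length-n substring of the first sequence
    beam = [[sequences[0][i:i + n]]
            for i in range(len(sequences[0]) - n + 1)]
    beam = sorted(beam, key=score, reverse=True)[:d]
    for seq in sequences[1:]:
        candidates = [s + [seq[i:i + n]]
                      for s in beam
                      for i in range(len(seq) - n + 1)]
        beam = sorted(candidates, key=score, reverse=True)[:d]  # pruning
    return beam[0]

print(hertz_stormo(["ACTGA", "TAGCG", "CTTGC"], n=4, d=2))
```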
Complexity of Hertz-Stormo • There are k steps (assuming one instance per sequence) • At each step, at most d profiles are extended • Each extension involves m-n+1=O(m) new strings, where m is the length of the input sequences • Profiles and relative entropy can be updated in O(n) time • Total time is O(knmd)
Issues with the heuristic approach • Pruning is crucial to keep the number of candidate sets manageable • The order in which sequences are processed influences the incrementally constructed profiles and what is kept for later stages (randomizing the order is an option) • How good is it? Hertz and Stormo tested it on 18 genes with 24 known sites; it found 19 of the sites plus 3 overlapping matches
Statistical sampling • A very general method for solving difficult problems with many variables that cannot be solved directly, but where partial solutions can be “guessed” and improved • Commonly known as “Monte Carlo” methods (from the Monaco casino) because one of the pioneers of the technique liked gambling
Random walks • A random walk is a special kind of stochastic process in which the system moves from state to state according to a probability distribution • In optimization problems, we construct random walks where the system moves a marker (representing the current state) randomly, performing some calculations at each step (including deciding where to go next)
Uniform random walks • Assume that the state corresponds to a position in physical space in d dimensions • Each step is of the same length (1), following one of the axes • The drunkard problem: If a drunk performs a random walk in a city, will he get back home? • Answer: Yes with probability 1
Gambler’s ruin • If a gambler wins or loses an individual round with probabilities p and q, each time gaining or losing the same amount, what is the probability of ruin (reaching 0 money)? • We assume an opponent with infinite money • This is a one-dimensional random walk • Answer: If p≤q, the probability of ruin is 1. If p>q and the gambler starts with one unit, the probability of ruin is q/p. • So a 10% edge (p = 1.1q) gives a probability of ruin of about 91%, and 2:1 odds (p = 2q) a probability of 50%.
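A small simulation to sanity-check the q/p formula, assuming the gambler starts with one unit; the step cap is an artifact of the sketch (uncapped, a winning walk could run forever), and with a positive drift the truncation bias is tiny.

```python
import random

def ruin_frequency(p, trials=10_000, max_steps=1_000):
    """Fraction of capped walks that hit 0 starting from one unit."""
    ruined = 0
    for _ in range(trials):
        money = 1
        for _ in range(max_steps):
            money += 1 if random.random() < p else -1
            if money == 0:
                ruined += 1
                break
    return ruined / trials

print(ruin_frequency(0.55))    # q/p = 0.45/0.55 ≈ 0.818
print(ruin_frequency(2 / 3))   # 2:1 odds: q/p = 0.5
```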
What about a drunk bird? • Answer: No • The difference is that the uniform random walk is recurrent in 1 or 2 dimensions but transient in 3 or more dimensions • Recurrent: the walk returns to its starting state with probability 1 (and hence infinitely often); transient: there is a positive probability of never returning • Note that the uniform random walk in one dimension is the discrete analogue of Brownian motion in physics
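An empirical look at the dimension dependence, assuming a simple symmetric walk and a finite step budget; the walk counts and budget are arbitrary choices, and "returned" here means "returned within max_steps".

```python
import random

def return_frequency(dim, walks=1000, max_steps=5000):
    """Fraction of walks from the origin that revisit it within max_steps."""
    returned = 0
    for _ in range(walks):
        pos = [0] * dim
        for _ in range(max_steps):
            pos[random.randrange(dim)] += random.choice((-1, 1))
            if not any(pos):            # back at the origin
                returned += 1
                break
    return returned / walks

for dim in (1, 2, 3):
    print(dim, return_frequency(dim))
# 1D and 2D creep toward 1 as max_steps grows; 3D stays near 0.34
```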