
Motifs for Unknown Sites



  1. Motifs for Unknown Sites Vasileios Hatzivassiloglou University of Texas at Dallas

  2. Decomposing KL • Back to our profiles • If we consider the independent marginal distributions Pj and Qj for each of the n positions j, it can be shown that D(P||Q) = ∑j D(Pj||Qj) • so relative entropy can be measured by position
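For concreteness, here is a minimal Python check of this decomposition; the 2-position profile P is made up for illustration and the background Q is uniform. The relative entropy of the full product distribution over length-2 strings equals the sum of the per-position relative entropies.

```python
import math
from itertools import product

def kl(p, q):
    """Relative entropy D(p||q) in bits for two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 2-position profile P (rows = positions, columns = A, C, G, T)
# and a uniform background Q; the numbers are made up for illustration.
P = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.8, 0.05, 0.05]]
Q = [[0.25, 0.25, 0.25, 0.25]] * 2

# Sum of per-position relative entropies
by_position = sum(kl(Pj, Qj) for Pj, Qj in zip(P, Q))

# Joint relative entropy over all length-2 strings (positions independent)
joint = 0.0
for idx in product(range(4), repeat=2):
    p = math.prod(P[j][c] for j, c in enumerate(idx))
    q = math.prod(Q[j][c] for j, c in enumerate(idx))
    if p > 0:
        joint += p * math.log2(p / q)

print(round(by_position, 6), round(joint, 6))  # the two values agree
```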

  3. What is an acceptable KL value? • Assume perfect patterns and uniform background distribution • p=1 for exactly one nucleotide at each position • q=0.25 for each nucleotide everywhere • Then the relative entropy will be • log2(4) = 2 bits per position • 2n bits overall

  4. Comparing with the background • Under the uniform assumption, the motif will appear by chance about once every 4^n letters • Since a real motif should not match at too many random places, we want 4^n ≈ Γ (the length of the genome) for roughly one random match • Therefore Γ = 4^n = 2^(2n) = 2^D(P||Q) • So a good value for D(P||Q) is log2 Γ
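A rough worked example of this rule, assuming (purely for illustration) a yeast-sized genome of about 12 million letters:

```python
import math

# Under the uniform-background assumption, a motif with relative entropy
# D(P||Q) = 2n bits matches by chance about once every 2^(2n) = 4^n letters.
# Setting 4^n ≈ Γ gives a target of D(P||Q) ≈ log2 Γ.
genome_length = 12_000_000   # hypothetical Γ, roughly a yeast-sized genome

target_bits = math.log2(genome_length)
print(f"target relative entropy ≈ {target_bits:.1f} bits")
print(f"equivalent perfect-pattern length n ≈ {target_bits / 2:.1f} positions")
```

With these numbers the target is roughly 23.5 bits, i.e. about as much information as a perfect pattern of 12 positions.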

  5. Site selection by relative entropy • Consider the more general problem with unknown motif instance positions • Given k sequences and an integer n, find one length n substring of each of the k sequences such that for the induced profile A, the relative entropy D(A||B) is maximum. • This problem is provably NP-complete.

  6. Assumptions • Same length n for all motif instances (no indels) • One instance per sequence (in reality, there may be zero or more than one) • Example: Given all genes involved in digestion in yeast, use as input sequences the 1 kb upstream regions from each such gene • the goal is to find transcription factor binding sites

  7. Relative entropy implications • The number of sequences k • does not affect relative entropy • the proportion of matching nucleotides does • k indirectly affects our confidence in the estimate of the profile A • The length n • affects relative entropy • Because of the additive decomposition by position, and the fact that D(A||B)≥0, increasing n always increases relative entropy • Normalize by dividing by n

  8. Addressing NP-completeness • As we mentioned earlier, we have to look at approximate solutions • Several broad strategies are available • greedy algorithms (local optimization) • statistical simulation (Gibbs sampling)

  9. Greedy approach • Choose locally optimal candidate motif instance sets • Augment a starting point in the direction that locally seems the most promising • Keep a limited number of candidate solutions at each iteration

  10. Hertz-Stormo (1999) algorithm • Carry around a set S of sets; each member Sij is a candidate solution • Start with each S1j being a single length-n substring of one of the k input sequences (a singleton set) • At step i, add to each S(i-1)j each of the substrings of length n from input sequence i (alternatively, from all sequences ≥i)

  11. Hertz-Stormo algorithm • After a new string has been added to Sij, recalculate the profile A and D(A||B) • Prune S by keeping only the d best scoring Sij’s (this is the heuristic step) • Repeat until i=k

  12. Hertz-Stormo example • Input sequences ACTGA, TAGCG, CTTGC and n=4 • Start with S={{ACTG}, {CTGA}} • Calculate profile A and D(A||B) for each S1j and keep the d best sets • Expanding S11={ACTG} produces S21={ACTG, TAGC} and S22={ACTG, AGCG} • Possibly also consider S23={ACTG, CTTG} and S24={ACTG, TTGC}
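The following is a minimal Python sketch of the greedy beam search described on slides 10-12, not Hertz and Stormo's actual implementation: it assumes a uniform background B, processes the sequences in their given order, and rescores each candidate set from scratch rather than using the O(n) incremental update mentioned on the next slide. All function and variable names here are made up.

```python
import math

BASES = "ACGT"
BACKGROUND = {b: 0.25 for b in BASES}   # uniform background B (an assumption)

def substrings(seq, n):
    """All length-n substrings of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def relative_entropy(instances, n):
    """D(A||B) in bits for the profile A induced by the chosen instances."""
    k = len(instances)
    total = 0.0
    for j in range(n):
        for b in BASES:
            a = sum(s[j] == b for s in instances) / k
            if a > 0:
                total += a * math.log2(a / BACKGROUND[b])
    return total

def hertz_stormo(seqs, n, d):
    """Greedy beam search: keep only the d best candidate instance sets per step."""
    beam = [[]]
    for seq in seqs:
        # Extend every surviving candidate with every substring of the next sequence
        extended = [cand + [s] for cand in beam for s in substrings(seq, n)]
        # Prune to the d best-scoring sets (the heuristic step)
        extended.sort(key=lambda cand: relative_entropy(cand, n), reverse=True)
        beam = extended[:d]
    return beam[0], relative_entropy(beam[0], n)

best, score = hertz_stormo(["ACTGA", "TAGCG", "CTTGC"], n=4, d=4)
print(best, round(score, 2))
```

Running it on the example above (n=4, d=4) keeps at most d candidate sets after each sequence and returns the best-scoring set of three substrings.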

  13. Complexity of Hertz-Stormo • There are k steps (assuming one instance per sequence) • At each step, at most d profiles are extended • Each extension involves m-n+1=O(m) new strings, where m is the length of the input sequences • Profiles and relative entropy can be updated in O(n) time • Total time is O(knmd)

  14. Issues with the heuristic approach • Pruning is crucial to keep the number of candidate sets manageable • The order of the sets influences the incrementally constructed profiles and what is kept for later stages (randomization is an option) • How good is it? Hertz and Stormo tested it on 18 genes with 24 known sites; it found 19 of them, plus 3 overlaps

  15. Statistical sampling • A very general method for solving difficult problems with many variables that cannot be solved directly, but where partial solutions can be “guessed” and improved • Commonly known as “Monte Carlo” methods (after the Monte Carlo casino in Monaco) because one of the pioneers of the technique liked gambling

  16. Random walks • A random walk is a special kind of stochastic process where the system moves from state to state according to a probability distribution • In optimization problems, we construct random walks where the system moves a marker (representing the current state) randomly, performing some calculations at each step (including where to go next)

  17. Uniform random walks • Assume that the state corresponds to a position in physical space in d dimensions • Each step is of the same length (1), following one of the axes • The drunkard problem: If a drunk performs a random walk in a city, will he get back home? • Answer: Yes with probability 1

  18. Gambler’s ruin • If a gambler wins or loses an individual round with probabilities p and q, each time gaining or losing the same amount, what is the probability of ruin (reaching 0 money)? • We assume an opponent with infinite money and a gambler who starts with one betting unit • This is a one-dimensional random walk • Answer: If p≤q, the probability of ruin is 1. If p>q, the probability of ruin is q/p. • So +10% odds (p/q = 1.1) give a probability of ruin of about 91%, and 2:1 odds a probability of 50%.
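These figures can be checked with a small Monte Carlo simulation in Python, fittingly given the theme of slide 15. The trial count and round cap below are arbitrary; walks that survive the cap are treated as never ruined, which slightly underestimates the ruin probability when p > q.

```python
import random

def ruin_probability(p, trials=20_000, max_rounds=2_000):
    """Estimate the ruin probability of a gambler starting with one unit
    against an infinitely rich opponent, winning each round with probability p."""
    ruined = 0
    for _ in range(trials):
        money = 1
        for _ in range(max_rounds):
            money += 1 if random.random() < p else -1
            if money == 0:
                ruined += 1
                break
    return ruined / trials

# 2:1 odds (p = 2/3): theory predicts q/p = 0.5
print(ruin_probability(2 / 3))
# +10% odds (p/q = 1.1, so p = 1.1/2.1): theory predicts q/p ≈ 0.91
print(ruin_probability(1.1 / 2.1))
```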

  19. What about a drunk bird? • Answer: No • The difference is that the uniform random walk is recurrent in 1 or 2 dimensions but transient in 3 or more dimensions • Recurrent: the walk returns to its starting state with probability 1 (and therefore revisits it infinitely often) • Note that the uniform random walk in one dimension is the discrete analogue of Brownian motion in physics
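A quick simulation sketch of the recurrent-versus-transient contrast; the step budget and number of walks below are arbitrary. Within a fixed budget the 1D and 2D walks revisit the origin far more reliably than the 3D walk, whose long-run return probability is known to be about 0.34 (the 2D rate climbs toward 1 only slowly as the budget grows).

```python
import random

def returns_to_origin(dim, steps):
    """Run one uniform random walk on the integer lattice; report whether it
    ever revisits the origin within the given step budget."""
    pos = [0] * dim
    for _ in range(steps):
        axis = random.randrange(dim)
        pos[axis] += random.choice((-1, 1))
        if all(c == 0 for c in pos):
            return True
    return False

def return_rate(dim, steps=5_000, walks=1_000):
    return sum(returns_to_origin(dim, steps) for _ in range(walks)) / walks

# The 1D and 2D (drunkard) rates keep climbing toward 1 as the step budget
# grows, while the 3D (bird) rate levels off around 0.34 no matter how long
# we wait.
for dim in (1, 2, 3):
    print(f"{dim}D return rate ≈ {return_rate(dim):.2f}")
```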
