370 likes | 528 Views
Gibbs sampling for motif finding in biological sequences. Christopher Sheldahl. Biological Sequences. Proteins : “strings” of amino acids. 20 common types |alphabet| = 20 Nucleic acids : “strings” of nucleotides 4 common types for RNA; 4 for DNA |alphabet| = 4. Motif Finding .
E N D
Gibbs sampling for motif finding in biological sequences Christopher Sheldahl
Biological Sequences • Proteins : “strings” of amino acids. 20 common types |alphabet| = 20 • Nucleic acids : “strings” of nucleotides 4 common types for RNA; 4 for DNA |alphabet| = 4
Motif Finding • Multiple sequences : How to detect related segments (or “motifs”)? 1: …ATACGTTGAGA… 2: …CGACGTTGCAA… 3: …CCACGCTGGAC…
Sequence Databases are Large • Genbank : NA : 108 million sequences • SwissProt : proteins : ~500,000 sequences
Growth of SwissProt http://www.expasy.org/sprot/relnotes/relstat.html
The basic problem • We have N biological sequences. • Assume that there is one evolutionarily related segment (substring) of known width W in each sequence. • How do we find this segment in each sequence? Remember that there might be mutations.
Protein case : We’d like to know the following: • Motif description (matrix): Probability of each of the 20 aa at each position in the motif of width w: qi,R : Probability ofR type aa at position i of motif. Matrix is W x 20. • Background probability (vector) of each of the 20 aa in sites not part of motif: p1….p20 • Ak : alignment vector of starting indices for the motif in each of the N sequences.
Scoring a given string • If we knew the motif description matrix and the background vector we could score any string of width W : L : score of segment wrt the motif. Ri : the aa type at position i of the segment.
Scoring an alignment of N sequences We can score an entire alignment :
What if we just had one new sequence? • What if we knew the motif description matrix and the background vector for a large number of sequences, and we want to quickly align a new (presumably related) sequence?
What if we just had one new sequence? • Score possible locations of the motif in the new sequence and pick one with a high (or max) value of L: • Note that the better our motif description and background model are, the better the new alignment will be.
Motif Finding Dilemma • If we knew the alignment, we could calculate the motif description matrix and the background vector. • If we knew the motif description and the background, we could calculate the alignment (by choosing high L positions). • We don’t know either of these - we’ve seen one way to approach this problem before...
Expectation Maximization or The Road Not Taken in this Lecture Start with Random motif description and background. Repeat until motif description converges: 1) E - step : Calculate alignment points (Ak) from Motif Description and Background. 2) M - step : Calculate Motif Description and background from alignments. Lawrence et. al. 1990. “An Expectation Maximization Algorithm for the Identification and Characterization of Common Sites in Unaligned Biolpolymer Sequences”. Proteins.
Local minima • After initialization, simple EM algorithm is a deterministic process. • If the initialization is bad, can get trapped in local minima. • Heuristic : Try many different starting points.
Random Sampling • Note the heuristic of trying many random initial positions. Maybe we can overcome local minima if we employ randomness more thoroughly. • Can we take random sample alignments from the distribution : p(Align | S, W, Motif Matrix, Background) where Align means A1, A2…AN.
Markov Chain Monte Carlo • Markov Chain Monte Carlo methods generate a Markov chain of points that converges to a distribution of interest. • “Monte Carlo” : The methods employ randomness.
Metropolis-Hastings • Metropolis-Hastings is an MCMC model that can sample from any distribution P, using a proposal distribution Q(x’; x). • Initialize with random x. • Generate new x’ = • Proposal position according to • Q(x’; x) • Compute α = min( (P(x’) / P(x) ), 1) and accept change with probability α. Figure : Wikipedia
Gibbs Sampling • Gibbs sampling is a variety of Metropolis-Hastings sampling where the sampling step is always accepted. • For multivariate distributions, in Gibbs sampling only one parameter is changed at a time. • This makes Gibbs sampling particularly useful for multivariate distributions.
Motif Finding with Gibbs :Site Sampler • Site sampler : Sample starting points for motif in each sequence. • Start with random alignments, then use random sampling for one sequence at a time to gradually improve the alignments. Lawrence et al., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment”, Science. 1993. 262(5131), 208-214
Gibbs sampling: Changing One Thing at a Time • N = 3 • We want to sample alignment points A1 for sequence 1, given the pattern description M, background model B, the sequences S and width W, and alignments A2 (for sequence 2) and A3 (for sequence 3). p( A1 | S, M, B,W, A2, A3)
Site Sampler Algorithm Initialize with random alignment points. While not converged: Do steps 1 and 2 for each of N sequences : 1) Predictive update step : Calculate motif matrix and background using all sequences but the currently selected one. 2) Sampling step. Calculate Lx = Qx / Px for all starting points x in this sequence. Choose one with a probability proportional to Lx. • The inner loop is an iteration of sequences. • In step 2 you sample a new value for Ak i.e. the alignment position for sequence k • Convergence : In theory you sample until alignment no longer changes
Example: Random Alignments Rouchka, 1997, “A Brief Overview of Gibbs Sampling” DNA: Alphabet = {A,C,G,T}
Example: Initial Counts New Counts Remove 1st sequence (ATTTAT) TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT Rouchka, 1997.
Sequence 1: TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT Rouchka, 1997.
Assumptions for Site Sampler • Assumption of one motif. • Assumption of one copy of motif in each sequence. • Assumption that we know the motif width.
Motif Sampler • Sample different motifs for each part of the sequence. Store motif descriptions and background vector for each motif. • Allows multiple motifs, and varying number of each motif to be present. • Still subject to assumption of known motif width. Neuwald et al., “Gibbs Motif Sampling: Detection of bacterial membrane protein repeats”. 1995. Protein Science. 4:1618-1632.
Null model • Null model has pattern description probabilities identical to the background probabilities for its constituent amino acids.
Widths and Copy Numbers • The motif sampler requires widths and expected numbers of each motif per sequence as input parameters. The widths are fixed during the algorithm, but the number of copies can change. • Construction of subfamilies (more specific models) favored by larger widths and smaller expected number of copies. • Construction of superfamilies (less specific models) favored by smaller widths and larger expected number of copies.
Motif Sampler Algorithm Initialize with Ei non-overlapping random segments for each motif i in each of the sequences. While not converged: Do steps 1 and 2 for successive segments in the biological sequences : 1) Predictive update step : If this segment is in an alignment remove it and recalculate motif matrixand background data excluding this segment. 2) Sampling step : Sample one of the motifs with a probability proportional to the score that the motif description assigns this segment. • In the inner loop the motif sampler iterates through segments (substrings) • Motif sampler samples among different motif models for each segment.
Modifications to the Samplers • There are methods to allow for gapped motifs. • High scoring alignments can be stored for later investigation.
Neuwald et al., 1995 Porin structures
Porin alignment Neuwald et al., 1995
Motivation for Gibbs sampling • We have a joint probability density f(x, y1, y2,…yp) • We want a marginal density, f(x). • The integrations are difficult. Casella and George, “Explaining the Gibbs Sampler”
Near-optimum Sampler • A particular motif may not be present in the best alignment found, even though its found in many other near-optimal alignments. • Sample among near-optimal alignments to capture all sites present.
Column sampler • The column sampler allows for a motif to shift: GCACCTG --> GCACCTG • It also allows for the development of motifs with gaps: GCACCTG --> GCACCTG