320 likes | 919 Views
Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment. Lawrence et al. 1993. Presented By: Manish Agrawal Slides adapted from Prof Sinha’s notes. To define a motif, lets say we know where the motif starts in the sequence
E N D
Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment Lawrence et al. 1993 Presented By: Manish Agrawal Slides adapted from Prof Sinha’s notes.
To define a motif, lets say we know where the motif starts in the sequence The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st) A motif model Genes regulated by same transcription factor
a G g t a c T t C c A t a c g t Alignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A3 0 1 0 31 1 0 Matrix C24 0 0 14 0 0 G 0 1 4 0 0 0 31 T 0 0 0 5 1 0 14 _________________ Consensus A C G T A C G T Line up the patterns by their start indexes s = (s1, s2, …, st) Construct “position weight matrix” with frequencies of each nucleotide in columns Consensus nucleotide in each position has the highest frequency in column Motifs: Matrices and Consensus
Motif Finding Problem(Simplified) • Given a set of sequences, find the motif shared by all or most sequences, while its starting position in each sequence is unknown • Assumption: • Each motif appears exactly once in one sequence. • The motif has fixed length.
Generative Model • Suppose the sequences are aligned, the aligned regions are generated from a motif model. • Motif model is a PWM. A PWM is a position-specific multinomial distribution. • For each position i (from 1 to W), a multinomial distribution on amino acids, consisting of variables qi1, qi2,…..,qi20 • The unaligned regions are generated from a background model: p1,p2, ……, p20
Notations • Set of symbols: • Sequences: S = {S1, S2, …, SN} • Starting positions of motifs: A = {a1, a2, …, aN} • Motif model ( ) : qij = P(symbol at the i-th position = j) • Background model: pj = P(symbol = j) • Count of symbols in each column: cij= count of symbol, j, in the i-th column in the aligned region
Scoring Function • Maximize the log-odds ratio: • Is greater than zero if the data is a better match to the motif model than to the background model
Scoring function • A particular alignment “A” gives us the • counts cij. • In the scoring function “F”, use:
Scoring function • Thus, given an alignment A, we can calculate the scoring function F • We need to find A that maximizes this scoring function, which is a log-odds score
Optimization and Sampling • To maximize a function, f(x): • Brute force method: try all possible x • Sample method: sample x from probability distribution: p(x) ~ f(x) • Idea: suppose xmax is argmax of f(x), then it is also argmax of p(x), thus we have a high probability of selecting xmax
Markov Chain Sampling • To sample from a probability distribution p(x), we set up a Markov chain s.t. each state represents a value of x and for any two states, x and y, the transitional probabilities satisfy: • This would then imply:
Gibbs sampling to maximize F • Gibbs sampling is a special type of Markov chain sampling algorithm • Our goal is to find the optimal A = (a1,…aN) • The Markov chain we construct will only have transitions from A to alignments A’ that differ from A in only one of the ai • In round-robin order, pick one of the ai to replace • Consider all A’ formed by replacing ai with some other starting position ai’ in sequence Si • Move to one of these A’ probabilistically • Iterate the last three steps
Algorithm Randomly initialize A0; Repeat: (1) randomly choose a sequence z from S; A* = At \ az; compute θt from A*; (2) sample az according to P(az = x), which is proportional to Qx/Px; update At+1 = A* x; Select At that maximizes F; Qx: the probability of generating x according to θt; Px: the probability of generating x according to the background model
Algorithm Current solution At
Algorithm Choose one az to replace
Algorithm x For each candidate site xin sequence z, calculate Qx and Px: Probabilities of sampling x from motif model and background model resp.
Algorithm x Among all possible candidates, choose one (say x) with probability proportional to Qx/Px
Algorithm x Set At+1 = A* x
Algorithm x Repeat
Local optima • The algorithm may not find the “global” or true maximum of the scoring function • Once “At” contains many similar substrings, others matching these will be chosen with higher probability • Algorithm will “get locked” into a “local optimum” • all neighbors have poorer scores, hence low chance of moving out of this solution
Phase shifts • After every M iterations, compare the current At with alignments obtained by shifting every aligned substring ai by some amount, either to left or right
Pattern Width • The algorithm described so far requires pattern width(W) to be input. • We can modify the algorithm so that it executes for a range of plausible widths. • The function F is not immediately useful for this purpose as its optimal value always increases with increasing W.
Pattern Width • Another function based on the incomplete-data log-probability ratio G can be used. • Dividing G by the number of free parameters needed to specify the pattern (19W in the case of proteins) produced a statistic useful for choosing pattern width. This quantity can be called information per parameter.
Examples • The algorithm was applied to locate helix-turn-helix (HTH) motif, which represent a large class of sequence-specific DNA binding structures involved in numerous cases of gene regulation. • Detection and alignment of HTH motifs is a well recognized problem because of the great sequence variation.
HTH Motif Complete Sequences Non-site seq Random seq
Time complexity analysis • For a typical protein sequence, it was found that, for a single pattern width, each input sequence needs to be sampled fewer than T = 100 times before convergence. • L*W multiplications are performed in Step2 of the algorithm. • Total multiplications to execute the algorithm = TNLavgW • Linear Time complexity has been observed in applications
Motif finding • The Gibbs sampling algorithm was originally applied to find motifs in amino acid sequences • Protein motifs represent common sequence patterns in proteins, that are related to certain structure and function of the protein • Gibbs sampling is extensively used to find motifs in DNA sequence, i.e., transcription factor binding sites