Sampling Approaches to Pattern Extraction (Lecture for CS397-CXZ Algorithms in Bioinformatics) April 16, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign
Pattern Extraction: Probabilistic vs. Combinatorial

Problem: find common patterns (motifs) in a set of sequences S = {s1, …, sN}

Probabilistic motif:
• M = probabilistic model p(S|M)
• M matches every sequence, but with different probabilities
• E.g., M = {p(x|i)}, i = 1, …, w (width), where p(x|i) = probability that symbol x occurs in position i
• Task: find the best M and the matching position in each si; "best" = p(S|M) is highest

Combinatorial motif:
• M = deterministic pattern
• M either matches a sequence or it does not
• E.g., M = AT..G
• Task: find the best M's; "best" = highly frequent
Probabilistic Pattern Extraction
• Motif M = probabilistic model of sequences, p(Seq|M)
• M matches every sequence, but with different probabilities
• E.g., M = {p(x|i)}, i = 1, …, w (width), where p(x|i) = probability that symbol x occurs in position i
• Task: find the best M and the matching position in each si; "best" = p(S|M) is highest
• Method: search for the best model
  • Sampling is an efficient way of searching
Position Weight Matrix (PWM)
[Diagram: a linear chain of positions 1, 2, 3, …, w-1, w, with transition probability 1.0 between consecutive positions]
• Essentially a simple linear HMM
• Parameters: qij = p(symbol j | position i)
• E.g., q1A = q1G = 0.5; q1C = q1T = 0; q9C = 1.0; q9A = q9G = q9T = 0
• Covers a deterministic pattern such as AT.G as a special case, with the following Q matrix:

        1     2     3     4
   A   1.0    0    0.25    0
   T    0    1.0   0.25    0
   C    0     0    0.25    0
   G    0     0    0.25   1.0
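As a small sketch, a PWM can be stored as per-position symbol probabilities and used to score a candidate segment; the matrix below encodes the AT.G special case from the slide (the dict-of-probabilities representation is an illustrative assumption, not the lecture's code):

```python
# Q[i][x] = p(symbol x | position i); width w = 4, encoding the pattern AT.G
Q = [
    {"A": 1.0,  "T": 0.0,  "C": 0.0,  "G": 0.0},   # position 1: A
    {"A": 0.0,  "T": 1.0,  "C": 0.0,  "G": 0.0},   # position 2: T
    {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25},  # position 3: any symbol
    {"A": 0.0,  "T": 0.0,  "C": 0.0,  "G": 1.0},   # position 4: G
]

def pwm_prob(segment, Q):
    """p(segment | M): product of the per-position symbol probabilities."""
    p = 1.0
    for i, x in enumerate(segment):
        p *= Q[i][x]
    return p

print(pwm_prob("ATCG", Q))  # 0.25 -- matches AT.G
print(pwm_prob("TTCG", Q))  # 0.0  -- mismatch at position 1
```

Note that, unlike a deterministic pattern, a general PWM assigns every segment a nonzero probability whenever all entries of Q are nonzero.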
Discovering a PWM from Sequences
• Given:
  • a set of sequences S = {s1, …, sN}
  • a pattern width w (e.g., 10)
• Discover the most discriminative PWM M, i.e., the M that maximizes p(S|M)/p(S|Background)
  • p(S|M) = p(s1|M)…p(sN|M) (roughly!)
  • A prior could be incorporated by maximizing the posterior probability of M
• How to discover it?
  • Using an HMM training algorithm? (problematic: not all observations are relevant)
  • Gibbs Sampler
Gibbs Sampler: Basic Idea
• Introduce an auxiliary variable ak to record the position of the pattern in sequence sk
• Randomly choose initial positions ak
• Iterate over the following two steps:
  • Predictive update: use the current positions to estimate the model qij
  • Sampling: use the current model to improve the position in one sequence (e.g., ak)
    • Take one sequence and, at each position, compute the probability ratio p(x|i)/p(x|Background)
    • Sample a position with probability proportional to its ratio
    • In general, this picks a high-ratio position, but not always the highest
• Observations:
  • If a position is improved, the model will be improved
  • If the model is improved, all the positions will also be improved
Gibbs Sampler: Details of One Iteration
• At every step, take one sequence out (e.g., sequence z) for position improvement
• Use the rest to estimate two models, qij and pj (background):
  • qij is estimated from the matching segments at the current positions
  • pj is estimated from all other regions of these sequences (negative model)
• For each candidate position in the sequence taken out, compute the probability ratio p(segment|M)/p(segment|Background)
• Normalize the ratios to get probabilities and choose a position stochastically according to them
• Change the current position for sequence z to the newly sampled position
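The two slides above can be sketched as a short Python loop. Variable names (a, q, bg) follow the slides; the pseudocount value, iteration count, and all other details are illustrative assumptions, not the original implementation:

```python
import random

def gibbs_motif(seqs, w, iters=200, pseudo=0.5, alphabet="ATCG"):
    """A minimal sketch of the Gibbs sampling motif finder described above."""
    # Randomly choose initial positions a_k.
    a = [random.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        z = random.randrange(len(seqs))  # sequence taken out this iteration
        others = [k for k in range(len(seqs)) if k != z]
        # Predictive update: estimate q_ij from the held-in segments and
        # p_j (background) from the non-matching regions, with pseudocounts.
        q = [{x: pseudo for x in alphabet} for _ in range(w)]
        bg = {x: pseudo for x in alphabet}
        for k in others:
            for i, x in enumerate(seqs[k][a[k]:a[k] + w]):
                q[i][x] += 1
            for j, x in enumerate(seqs[k]):
                if not (a[k] <= j < a[k] + w):
                    bg[x] += 1
        for row in q + [bg]:
            tot = sum(row.values())
            for x in alphabet:
                row[x] /= tot
        # Sampling step: weight each position of z by prod_i q_i(x) / bg(x),
        # then sample a new position proportionally to the ratios.
        weights = []
        for pos in range(len(seqs[z]) - w + 1):
            r = 1.0
            for i, x in enumerate(seqs[z][pos:pos + w]):
                r *= q[i][x] / bg[x]
            weights.append(r)
        a[z] = random.choices(range(len(weights)), weights=weights)[0]
    return a
```

For example, `gibbs_motif(["AAATTTGGGCAT", "CCATTTGGGAAA", "GGATTTGGGCCC"], w=6)` returns one candidate position per sequence, which tends to converge on the shared ATTTGG segment.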
Estimation of qij and pj
• qij is estimated from the sequence segments at the current "matching positions" a1, …, aN
• pj is estimated from the "non-matching regions" of all the sequences (relative frequency)
• In general, smoothing is necessary:
  qij = (cij + bj) / (N - 1 + B)
  where cij = total count of symbol j in relative position i over the held-in segments, bj = pseudocount for symbol j, and B = Σj bj
Example of Estimating qij
• N = 6, W = 10; one sequence is held out, so counts come from the remaining N - 1 = 5 segments
• Without smoothing: q1A = 3/5, q2G = 2/5, …, q1G = 0
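The unsmoothed and smoothed estimates can be checked with a few lines of code; the five segments below are illustrative stand-ins, chosen so that position 1 has three A's as in the example (they are not the slide's actual data):

```python
# Five held-in segments (N - 1 = 5), each of width W = 10.
segments = ["A" * 10, "A" * 10, "A" * 10, "T" * 10, "G" * 10]

def q_hat(i, j, segments, b=0.0, B=0.0):
    """Smoothed estimate q_ij = (c_ij + b_j) / (N - 1 + B)."""
    c_ij = sum(1 for seg in segments if seg[i] == j)
    return (c_ij + b) / (len(segments) + B)

print(q_hat(0, "A", segments))              # 3/5 = 0.6, unsmoothed
print(q_hat(0, "C", segments, b=0.5, B=2))  # smoothing avoids a zero estimate
```

With b = 0 the estimate for an unseen symbol is exactly 0, which would zero out every ratio involving it; the pseudocounts keep it strictly positive.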
Example of Computing the Ratios
• For a candidate segment x1…xW, Ratio = Π1≤i≤W qi,xi / pxi
• Select the set of ak's that maximizes the product of these ratios, or equivalently the score F:
  F = Σ1≤i≤W Σj∈{A,T,G,C} cij log(qij / pj)
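The score F can be computed directly from the counts and the two models; the counts and probabilities below are illustrative numbers (a width-1 motif over five segments), not the slide's data:

```python
import math

def score_F(counts, q, p):
    """F = sum_i sum_j c_ij * log(q_ij / p_j).

    counts[i][j]: count of symbol j at relative position i
    q[i][j]: motif model, p[j]: background model.
    """
    return sum(counts[i][j] * math.log(q[i][j] / p[j])
               for i in range(len(counts)) for j in counts[i] if counts[i][j] > 0)

counts = [{"A": 3, "T": 1, "G": 1, "C": 0}]      # width 1, five segments
q = [{"A": 0.6, "T": 0.2, "G": 0.2, "C": 0.0}]   # would be smoothed in practice
p = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
print(score_F(counts, q, p))  # 3*log(2.4) + 2*log(0.8) ≈ 2.18
```

A larger F means the segments at the chosen positions look more like the motif model than like the background, which is why the sampler's stationary distribution favors position sets with high F.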