EM Algorithm – Motif Finding
Tutorial #09
The class has been edited from Bud Mishra’s lecture, which is available at www.cs.nyu.edu/mishra/COURSES/02.COBIO . Changes made by Ydo Wexler.
Expectation Maximization (EM)
• A general-purpose method for learning from incomplete data
Intuition:
• If we had access to counts, we could estimate the parameters
• However, missing values prevent us from counting directly
• “Complete” the counts using the current parameter assignment
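Expectation Maximization (EM): Expected Counts
• Data (X, Y, Z), with “?” marking a missing value: (H, ?, T), (T, ?, ?), (H, H, ?), (H, T, T), (T, T, H)
• Current model: P(Y=H | X=H, Z=T, θ) = 0.3 and P(Y=H | X=T, θ) = 0.4
• Expected counts N(X,Y) = “real” counts + “missing” counts:
  • N(X=H, Y=H) = 1 + 0.3 = 1.3
  • N(X=T, Y=H) = 0 + 0.4 = 0.4
  • N(X=H, Y=T) = 1 + 0.7 = 1.7
  • N(X=T, Y=T) = 1 + 0.6 = 1.6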
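The expected-count bookkeeping on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `p_y_heads` and `expected_counts` are hypothetical, and the two posterior values are simply the ones quoted on the slide rather than computed from a real model.

```python
from collections import defaultdict

# Data from the slide: one (X, Y, Z) tuple per sample; None marks a missing value.
DATA = [
    ("H", None, "T"),
    ("T", None, None),
    ("H", "H", None),
    ("H", "T", "T"),
    ("T", "T", "H"),
]

def p_y_heads(x, z):
    """P(Y=H | observed values, theta), hard-coded to the values on the slide."""
    if x == "H" and z == "T":
        return 0.3
    if x == "T":
        return 0.4
    raise ValueError("the slide gives no posterior for this case")

def expected_counts(samples):
    """Accumulate N(X, Y): observed pairs count 1, missing Y is split by its posterior."""
    counts = defaultdict(float)
    for x, y, z in samples:
        if y is not None:
            counts[(x, y)] += 1.0          # "real" count
        else:
            p = p_y_heads(x, z)
            counts[(x, "H")] += p          # "missing" count for Y=H
            counts[(x, "T")] += 1.0 - p    # "missing" count for Y=T
    return dict(counts)

counts = expected_counts(DATA)
```

Each sample with a missing Y contributes a fractional count to both possible completions, which is why the totals are no longer integers.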
EM (cont.)
• Start from the training data and an initial network (G, Θ0)
• (E-Step) Computation: expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)
• (M-Step) Reparameterize: updated network (G, Θ1)
• Reiterate
MLE from Incomplete Data
• Finding MLE parameters is a nonlinear optimization problem: maximize $\log P(x \mid \theta)$
Expectation Maximization (EM):
• Use the “current point” $\theta'$ to construct an alternative function $E_{\theta'}[\log P(x, y \mid \theta)]$ (which is “nice”)
• Guarantee: the maximum of the new function has a likelihood at least as high as that of the current point
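The guarantee can be written out as a short inequality. The symbol $Q$ is introduced here only for illustration and does not appear on the slide; $x$ is the observed data, $y$ the hidden data, and $\theta'$ the current point.

```latex
% Q is the expected complete-data log-likelihood under the current posterior.
Q(\theta \mid \theta') = E_{y \sim P(y \mid x, \theta')}\bigl[\log P(x, y \mid \theta)\bigr]

% Standard EM inequality (a consequence of Gibbs' inequality):
\log P(x \mid \theta) - \log P(x \mid \theta') \;\ge\; Q(\theta \mid \theta') - Q(\theta' \mid \theta')

% Hence choosing \theta^{\mathrm{new}} = \arg\max_{\theta} Q(\theta \mid \theta') gives
\log P(x \mid \theta^{\mathrm{new}}) \;\ge\; \log P(x \mid \theta')
```

The right-hand side of the inequality is non-negative at the maximizer of $Q$, so an EM step never lowers the likelihood.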
Sequence Motifs
• Sequence patterns of biological significance
• Examples:
  • DNA: protein binding sites (e.g. promoters, regulatory sequences)
  • Protein: sequences corresponding to conserved pieces of structure (e.g. local features at various scales: blocks, domains and families)
Sequence Motifs - EM Algorithm
• Use the EM (Expectation Maximization) algorithm to find multiple motifs in a set of sequences.
• Description of a motif:
  • W = (fixed) width of a motif
  • P = $[p_{lc}]$ = matrix of probabilities that letter $l$ occurs at position $c$ = $|S| \times W$ matrix
Example
• DNA motif of width W = 3
• S = {A, T, C, G}
• ρ = motif
• $P_\rho$ = 4 × 3 stochastic matrix
Computational Problem
• Given:
  • A set of sequences, G
  • A width parameter W
• Find:
  • Motifs of width W common to the sequences in G, together with their probabilistic descriptions
• Assume:
  • One motif in each sequence
  • The motif is equally likely to start at any location
• Note that the motif start sites in each sequence are unknown (hidden).
Basic EM Approach
• m = total number of sequences
• l = minimum length of a sequence
• Z = matrix of probabilities
• $z_{ij}$ = probability that the motif starts at position j in sequence i
EM Algorithm
• Set initial values for P
• do
  • Re-estimate Z from P
  • Re-estimate P from Z
• until change in P < ε
• return P
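The loop above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: the alphabet is fixed to DNA, each sequence is scored only on the motif window (no background model), P is stored as a dict from letter to a list of W probabilities, and all function names, the random initialisation, and the `max_iter` safeguard are hypothetical choices.

```python
import math
import random

ALPHABET = "ACGT"

def window_likelihoods(seq, P, W):
    """Score every start j by the product of P[letter][k] over the W-wide window."""
    return [
        math.prod(P[seq[j + k]][k] for k in range(W))
        for j in range(len(seq) - W + 1)
    ]

def e_step(seqs, P, W):
    """Re-estimate Z from P: normalise the window scores within each sequence."""
    Z = []
    for s in seqs:
        lik = window_likelihoods(s, P, W)
        total = sum(lik)
        Z.append([v / total for v in lik])
    return Z

def m_step(seqs, Z, W):
    """Re-estimate P from Z: expected letter counts plus a pseudocount of 1."""
    n = {c: [1.0] * W for c in ALPHABET}
    for s, z in zip(seqs, Z):
        for j, zij in enumerate(z):
            for k in range(W):
                n[s[j + k]][k] += zij
    col_sums = [sum(n[c][k] for c in ALPHABET) for k in range(W)]
    return {c: [n[c][k] / col_sums[k] for k in range(W)] for c in ALPHABET}

def random_matrix(W, rng):
    """Random strictly positive matrix whose columns sum to 1 (assumed initialisation)."""
    raw = {c: [rng.random() + 0.1 for _ in range(W)] for c in ALPHABET}
    col_sums = [sum(raw[c][k] for c in ALPHABET) for k in range(W)]
    return {c: [raw[c][k] / col_sums[k] for k in range(W)] for c in ALPHABET}

def em_motif(seqs, W, eps=1e-6, max_iter=1000, seed=0):
    """Alternate E and M steps until the largest change in P drops below eps."""
    rng = random.Random(seed)
    P = random_matrix(W, rng)
    for _ in range(max_iter):
        Z = e_step(seqs, P, W)
        P_new = m_step(seqs, Z, W)
        change = max(abs(P_new[c][k] - P[c][k]) for c in ALPHABET for k in range(W))
        P = P_new
        if change < eps:
            break
    return P, e_step(seqs, P, W)

P, Z = em_motif(["ACAGCA", "AGGCAG", "TCAGTC"], W=3)
```

EM only finds a local optimum, so in practice one would likely rerun it from several initial matrices and keep the best result.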
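EM Algorithm
• Some definitions:
  • $S_i$ = the $i$th sequence
  • $I_{ij} = 1$ if the motif starts at position $j$ in sequence $i$, and $0$ otherwise
  • $l_{kj}$ = the character at position $(j-1)+k$ in sequence $S_i$
• How well $S_i$ fits the motif at position $j$: $\Pr(S_i \mid I_{ij} = 1, \rho) = \prod_{k=1}^{W} \rho_{l_{kj},\,k}$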
Example
• $S_i$ = AGGCTGTAGACAC
• $\Pr(\mathrm{TGT} \mid I_{i5} = 1, \rho) = \rho_{T,1} \times \rho_{G,2} \times \rho_{T,3} = 0.2 \times 0.1 \times 0.1 = 2 \times 10^{-3}$
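The arithmetic in this example can be checked with a short Python sketch. The lookup table `P_RHO` and the helper `window_likelihood` are hypothetical names; only the three matrix entries that the slide quotes are included.

```python
import math

# Only the three entries used on the slide, keyed by (letter, motif position).
P_RHO = {("T", 1): 0.2, ("G", 2): 0.1, ("T", 3): 0.1}

def window_likelihood(seq, j, W, p):
    """Product of p[(letter, k)] over the W letters starting at 1-based position j."""
    window = seq[j - 1 : j - 1 + W]
    return math.prod(p[(ch, k)] for k, ch in enumerate(window, start=1))

s_i = "AGGCTGTAGACAC"
value = window_likelihood(s_i, 5, 3, P_RHO)
```

Position 5 of the sequence is the start of the window TGT, so `value` is the product of the three quoted entries.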
Estimating $Z = \Pr(I_{ij} = 1 \mid \rho, S_i)$
• $z_{ij}$ estimates the starting position in each $S_i$:
$$z_{ij} = \Pr(I_{ij} = 1 \mid \rho, S_i) = \frac{\Pr(S_i, I_{ij} = 1 \mid \rho)}{\Pr(S_i \mid \rho)} = \frac{\Pr(S_i \mid I_{ij} = 1, \rho)\,\Pr(I_{ij} = 1)}{\sum_k \Pr(S_i \mid I_{ik} = 1, \rho)\,\Pr(I_{ik} = 1)} = \frac{\Pr(S_i \mid I_{ij} = 1, \rho)}{\sum_k \Pr(S_i \mid I_{ik} = 1, \rho)}$$
• This follows from Bayes’ rule and the assumption that “it is equally likely that the motif will start in any position”: $\Pr(I_{ij} = 1) = \Pr(I_{ik} = 1)$
Estimating $P_\rho$
• Given $Z$, estimate the probability that the character $c$ occurs at the $k$th position of a motif:
$$p_{ck} = \frac{n_{ck} + 1}{\sum_d (n_{dk} + 1)}$$
• $n_{ck} = \sum_i \sum_{j \,:\, l_{kj} = c} z_{ij}$ (with $l_{kj}$ read from $S_i$) is the expected number of occurrences of the character $c$ at the $k$th position of the motif $\rho$, with each candidate start position $j$ weighted by $z_{ij}$
• The 1’s are added to avoid division by 0
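Example
• $S_i$ = AGGCTGTAGACAC
• $z_{i1} = (0.1 \times 0.1 \times 0.6)/\text{sum} = 6 \times 10^{-3}/\text{sum}$
• $z_{i2} = (0.3 \times 0.1 \times 0.1)/\text{sum} = 3 \times 10^{-3}/\text{sum}$
• $z_{i3} = (0.3 \times 0.2 \times 0.1)/\text{sum} = 6 \times 10^{-3}/\text{sum}$
• Here sum is the total of $\Pr(S_i \mid I_{ik} = 1, \rho)$ over all start positions $k$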
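A small Python sketch of this normalisation follows. The matrix `P_RHO` is partly an assumption: the entries used in the slide's products are taken from the example, the first and third columns are completed so that they sum to 1, and the A and T entries of the second column (0.5 and 0.2) are invented for illustration because the original matrix figure is not available. The function name `start_posteriors` is likewise hypothetical.

```python
import math

# Rows are letters, columns are motif positions 1..3.
# ASSUMPTION: the A and T entries in column 2 (0.5, 0.2) are illustrative only.
P_RHO = {
    "A": [0.1, 0.5, 0.2],
    "C": [0.4, 0.2, 0.1],
    "G": [0.3, 0.1, 0.6],
    "T": [0.2, 0.2, 0.1],
}
W = 3

def start_posteriors(seq, p, w):
    """Return (unnormalised window scores, z values) for every start position."""
    lik = [
        math.prod(p[seq[j + k]][k] for k in range(w))
        for j in range(len(seq) - w + 1)
    ]
    total = sum(lik)  # the "sum" in the slide's denominators
    return lik, [v / total for v in lik]

lik, z = start_posteriors("AGGCTGTAGACAC", P_RHO, W)
```

The first three unnormalised scores reproduce the slide's products; the normalised `z` values depend on the assumed entries and should not be read as the lecture's numbers.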
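Example
• $s_1$ = ACAGCA
• $s_2$ = AGGCAG
• $s_3$ = TCAGTC
• A occupies the first motif position for the starts weighted by $z_{11}$, $z_{13}$, $z_{21}$ and $z_{33}$, so
$$p_{A,1} = \frac{z_{11} + z_{13} + z_{21} + z_{33} + 1}{z_{11} + z_{12} + \cdots + z_{33} + z_{34} + 4}$$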
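To make the update concrete, the sketch below evaluates it in Python for the three sequences. The Z matrix is an assumption (uniform, 0.25 for each of the four start positions), chosen only so the result is easy to verify by hand; `expected_count` and `p_ck` are hypothetical helper names.

```python
SEQS = ["ACAGCA", "AGGCAG", "TCAGTC"]
ALPHABET = "ACGT"
W = 3

def expected_count(seqs, Z, c, k):
    """n_ck: sum of z_ij over starts j (0-based) where letter c sits at motif position k (1-based)."""
    return sum(
        z[j]
        for s, z in zip(seqs, Z)
        for j in range(len(z))
        if s[j + k - 1] == c
    )

def p_ck(seqs, Z, c, k):
    """Pseudocount-smoothed estimate (n_ck + 1) / sum_d (n_dk + 1)."""
    numerator = expected_count(seqs, Z, c, k) + 1
    denominator = sum(expected_count(seqs, Z, d, k) + 1 for d in ALPHABET)
    return numerator / denominator

# ASSUMPTION: uniform Z, one row per sequence, four candidate starts each.
Z = [[0.25] * 4 for _ in SEQS]
p_a1 = p_ck(SEQS, Z, "A", 1)
```

With this uniform Z the numerator is 4 × 0.25 + 1 = 2 and the denominator is 3 + 4 = 7, so `p_a1` equals 2/7.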