Directed Graphical Probabilistic Models: the sequel
William W. Cohen, Machine Learning 10-601, Feb 22 2008
Directed Graphical Probabilistic Models: the son of the child of the bride of the sequel
William W. Cohen, Machine Learning 10-601, Feb 27 2008
Outline • Quick recap • An example of learning • Given structure, find CPTs from “fully observed” data • Some interesting special cases of this • Learning with hidden variables • Expectation-maximization • Handwave argument for why EM works
The story so far: Bayes nets
• Many problems can be solved using the joint probability P(X1,…,Xn).
• Bayes nets describe a way to compactly write the joint.
• For a Bayes net: P(X1,…,Xn) = Πi P(Xi | parents(Xi))
• Conditional independence: each Xi is conditionally independent of its non-descendants given its parents.
(Slide figures: a small example network over A, B, C, D, E, and the Monty Hall network: first guess, stick or swap?, second guess, the goat, the money.)
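To make the factorization concrete, here is a minimal sketch in Python. The network shape (A, B → C → D, E) is one plausible reading of the slide's figure, and every CPT number is invented for illustration:

```python
# P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|C) P(E|C) for a hypothetical
# network in which A and B are parents of C, and C is the parent of D and E.

def bern(p, x):
    """P(X=x) for a binary variable with P(X=1) = p."""
    return p if x == 1 else 1.0 - p

# Hypothetical CPTs (all numbers invented for illustration).
P_A, P_B = 0.3, 0.6                                         # P(A=1), P(B=1)
P_C = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(C=1 | A, B)
P_D = {0: 0.2, 1: 0.7}                                      # P(D=1 | C)
P_E = {0: 0.3, 1: 0.8}                                      # P(E=1 | C)

def joint(a, b, c, d, e):
    """The joint, written compactly as a product of per-node CPT lookups."""
    return (bern(P_A, a) * bern(P_B, b)
            * bern(P_C[(a, b)], c)
            * bern(P_D[c], d)
            * bern(P_E[c], e))
```

Because each factor is a proper conditional distribution, the 32 joint entries automatically sum to one; no global normalization is needed.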
The story so far: d-separation
• There are three ways a path from X to Y given evidence E can be blocked (at a chain node, a common-cause node, or a collider).
• X is d-separated from Y given E iff all paths from X to Y given E are blocked.
• If X is d-separated from Y given E, then I<X,E,Y>: X is conditionally independent of Y given E.
(Slide figure: the three blocking configurations, each through an intermediate node Z.)
Recap: Inference in linear chain networks
(Slide figure: a chain X1 → … → Xj → … → Xn, with evidence E on both sides of Xj.)
Instead of recursion you can use "message passing": a "forward" pass from X1 toward Xj combined with a "backward" pass from Xn toward Xj (as in forward-backward / Baum-Welch)….
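The forward and backward passes can be sketched in a few lines. All tables here (initial, transition, emission) are invented for illustration:

```python
# Forward-backward on a chain with two hidden states and two symbols.
# alpha[j][z] ∝ P(x_1..x_j, Z_j=z); beta[j][z] = P(x_{j+1}..x_n | Z_j=z).

INIT = [0.5, 0.5]                      # P(Z_1 = z), invented
TRANS = [[0.7, 0.3], [0.4, 0.6]]       # TRANS[z][z'] = P(Z_{j+1}=z' | Z_j=z)
EMIT = [[0.9, 0.1], [0.2, 0.8]]        # EMIT[z][x] = P(X_j=x | Z_j=z)

def forward_backward(obs):
    """Posterior marginals P(Z_j = z | all observations) via message passing."""
    n, k = len(obs), len(INIT)
    alpha = [[0.0] * k for _ in range(n)]
    beta = [[1.0] * k for _ in range(n)]
    for z in range(k):                 # forward initialization
        alpha[0][z] = INIT[z] * EMIT[z][obs[0]]
    for j in range(1, n):              # forward messages
        for z in range(k):
            alpha[j][z] = sum(alpha[j - 1][zp] * TRANS[zp][z]
                              for zp in range(k)) * EMIT[z][obs[j]]
    for j in range(n - 2, -1, -1):     # backward messages
        for z in range(k):
            beta[j][z] = sum(TRANS[z][zp] * EMIT[zp][obs[j + 1]] * beta[j + 1][zp]
                             for zp in range(k))
    post = []
    for j in range(n):                 # combine the two messages and normalize
        unnorm = [alpha[j][z] * beta[j][z] for z in range(k)]
        s = sum(unnorm)
        post.append([u / s for u in unnorm])
    return post
```

Each position's posterior is the (normalized) product of its forward and backward messages, which is exactly the two-part decomposition the recursion computes.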
Recap: Inference in polytrees
• Reduce P(X|E) to the product of two recursively calculated parts:
• P(X=x|E+), i.e., the CPT for X and the product of "forward" messages from X's parents
• P(E-|X=x), i.e., a combination of "backward" messages from X's children, CPTs, and P(Z|EZ\Yk), a simpler instance of P(X|E)
• This can also be implemented by message passing (belief propagation)
Recap: Learning for Bayes nets
• Input:
• A sample of the joint distribution
• The graph structure over the variables: for i=1,…,N, you know Xi and parents(Xi)
• Output:
• Estimated CPTs
• Method (discrete variables):
• Estimate each CPT independently
• Use an MLE or MAP estimate
(Slide figure: the example network over A, B, C, D, E.)
Recap: Learning for Bayes nets
• Method (discrete variables):
• Estimate each CPT independently
• Use an MLE or MAP estimate
• MLE: P̂(Xi=x | parents(Xi)=u) = #D(Xi=x, parents=u) / #D(parents=u)
• MAP: add pseudocounts (a Dirichlet prior): P̂(Xi=x | parents(Xi)=u) = (#D(Xi=x, parents=u) + α) / (#D(parents=u) + α·|values(Xi)|)
(Slide figure: the example network over A, B, C, D, E.)
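The per-CPT counting step can be sketched as follows. The function name and the Dirichlet-pseudocount smoothing are illustrative assumptions; alpha = 0 recovers the plain MLE:

```python
from collections import Counter
from itertools import product

def estimate_cpt(data, child, parents, values, alpha=1.0):
    """Estimate the CPT P(child | parents) from fully observed rows (dicts).
    alpha is a Dirichlet-style pseudocount: alpha=0 gives the plain MLE,
    alpha>0 a smoothed MAP-style estimate.
    Returned keys are tuples (parent values..., child value)."""
    counts = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    cpt = {}
    for cfg in product(values, repeat=len(parents)):
        denom = sum(counts[(cfg, v)] for v in values) + alpha * len(values)
        if denom == 0:
            continue  # unseen parent config and no prior: leave undefined
        for v in values:
            cpt[cfg + (v,)] = (counts[(cfg, v)] + alpha) / denom
    return cpt
```

Each CPT is estimated independently of the others, exactly as the slide's method prescribes, so a full net is learned by calling this once per node.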
Recap: A detailed example
• The network Z → X, Z → Y, with CPTs estimated by counting over a dataset D.
(Three slides step through the counts and the resulting CPT estimates; the tables themselves were lost in extraction.)
A detailed example
• Now we're done learning: what can we do with this?
• Guess what your favorite professor is doing now:
• given a new x, y, compute P(prof|x,y), P(grad|x,y), P(ugrad|x,y) … using Bayes net inference
• given a new x, y, predict the most likely "label" z
Of course, we need to implement our Bayes net inference method first…
(Slide figure: the network Z → X, Z → Y.)
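Computing P(prof|x,y), P(grad|x,y), P(ugrad|x,y) needs only Bayes rule plus the network's factorization, since X and Y are conditionally independent given Z. A sketch with invented CPT numbers:

```python
# P(Z=z | x, y) ∝ P(z) P(x|z) P(y|z) for the network Z -> X, Z -> Y.
# All CPT numbers below are hypothetical.

PRIOR = {"prof": 0.2, "grad": 0.3, "ugrad": 0.5}   # P(Z=z)
P_X1 = {"prof": 0.8, "grad": 0.5, "ugrad": 0.1}    # P(X=1 | Z=z)
P_Y1 = {"prof": 0.7, "grad": 0.4, "ugrad": 0.3}    # P(Y=1 | Z=z)

def posterior(x, y):
    """P(Z=z | x, y) by Bayes rule, for binary x and y."""
    lik = lambda p, v: p if v == 1 else 1.0 - p
    unnorm = {z: PRIOR[z] * lik(P_X1[z], x) * lik(P_Y1[z], y) for z in PRIOR}
    total = sum(unnorm.values())
    return {z: u / total for z, u in unnorm.items()}

def predict(x, y):
    """The most likely 'label' z for a new x, y."""
    post = posterior(x, y)
    return max(post, key=post.get)
```

The normalizing constant is just the sum of the three unnormalized terms, so no other inference machinery is needed in this tiny network.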
A more interesting example
• The network C → W1, W2, …, WN, where the CPT parameters of the Wi are "shared" or "tied": P(Wi|C) is the same table for every i.
• Equivalently, in "plate" notation: C → Wi, with Wi drawn inside a plate that is replicated N times.
Some special cases of Bayes net learning • Naïve Bayes • HMMs for biology and information extraction • Tree-augmented Naïve Bayes
Another interesting example • A phylogenomic analysis of the Actinomycetales mce operons
Another interesting example
(Slide figure: an HMM-style network with position states p1, p2, p3, p4; hidden variables Z1, Z2, Z3, Z4, …, each emitting an observation Xi.)
Another interesting example
• Parameters are tied across sequence positions: P(X2|Z2=pos4) = P(X4|Z4=pos4), and in general P(Xi|Zi=pos4) = P(Xj|Zj=pos4).
• Three tables define the model:
• P(posj|posi) for all i, j … aka the transition probabilities
• P(x|posi) for all x, i … aka the emission probabilities
• P(Z1=posi) … the initial-state probabilities
(Slide figure: the position-transition diagram over p1, p2, p3, p4 with 0.5/0.5 branches and an "optional" G(T|A|G) state, plus per-position emission tables P(X|P1), P(X|P2).)
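Given those three tables, the probability of any joint state/observation sequence is a product of one initial term, transition terms, and emission terms. A tiny sketch with invented tables (two positions p1, p2 and symbols A, G):

```python
import math

# All three tables are invented for illustration.
INIT = {"p1": 0.5, "p2": 0.5}                  # P(Z1 = pos_i)
TRANS = {"p1": {"p1": 0.9, "p2": 0.1},         # P(pos_j | pos_i)
         "p2": {"p1": 0.2, "p2": 0.8}}
EMIT = {"p1": {"A": 0.7, "G": 0.3},            # P(x | pos_i)
        "p2": {"A": 0.1, "G": 0.9}}

def log_prob(states, symbols):
    """log P(z_1..z_n, x_1..x_n)
    = log P(z1) + sum_j log P(z_j | z_{j-1}) + sum_j log P(x_j | z_j)."""
    lp = math.log(INIT[states[0]]) + math.log(EMIT[states[0]][symbols[0]])
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        lp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][sym])
    return lp
```

Working in log space avoids underflow on long sequences, which is why HMM implementations usually score paths this way.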
IE by text segmentation
• Example: addresses and bibliography records.
• Address fields: House number, Building, Road, City, State, Zip. E.g.: 4089 Whispering Pines Nobel Drive San Diego CA 92122
• Bibliography fields: Author, Title, Year, Journal, Volume, Page. E.g.: P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
• Author, title, year, … are like "positions" in the previous example.
IE with Hidden Markov Models
• HMMs for IE: one state per field (Author, Title, Journal, Year, …), each state emitting tokens.
• Note: we know how to train this model from segmented citations.
(Slide figure: a toy HMM with states X, Y, Z over symbols A, B, C; emission probabilities, e.g. rows 0.1 0.1 0.8 / 0.4 0.2 0.4 / 0.6 0.3 0.1; transition probabilities 0.5, 0.9, 0.5, 0.1, 0.8, 0.2; and a "dddd dd" digit-pattern annotation.)
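Once trained, segmenting a new token sequence means finding the most likely state path, commonly done with the Viterbi algorithm (not shown on the slide). A generic sketch with a toy two-state Author/Title model; all numbers are invented:

```python
# Toy model: two states, tokens abstracted to "name"/"word" (all invented).
START = {"Author": 0.9, "Title": 0.1}                 # P(first state)
TRANS = {"Author": {"Author": 0.6, "Title": 0.4},     # P(next | current)
         "Title":  {"Author": 0.1, "Title": 0.9}}
EMIT = {"Author": {"name": 0.8, "word": 0.2},         # P(token | state)
        "Title":  {"name": 0.2, "word": 0.8}}

def viterbi(obs, states, start, trans, emit):
    """Most likely state path for obs (max-product dynamic programming)."""
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]  # best path scores
    back = [{}]                                            # backpointers
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] * trans[p][s])
            back[t][s] = prev
            V[t][s] = V[t - 1][prev] * trans[prev][s] * emit[s][obs[t]]
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

The recurrence is the same shape as forward-backward with the sum replaced by a max, plus backpointers to recover the segmentation.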
Results: Comparative Evaluation. The Nested model does best in all three cases (from Borkar et al., 2001).
Learning with hidden variables
• Hidden variables: what if some of your data is not completely observed?
• Method:
• 1. Estimate parameters somehow or other.
• 2. Predict unknown values from your estimate.
• 3. Add pseudo-data corresponding to these predictions, weighting each example by confidence in its correctness.
• 4. Re-estimate parameters using the extended dataset (real + pseudo-data), via MLE/MAP.
• 5. Repeat, starting at step 2….
This is expectation-maximization, aka EM.
(Slide figure: the network Z → X, Z → Y.)
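The loop above can be sketched for the Z → X, Z → Y network with Z hidden. Everything concrete here (binary Z, the random initialization, the iteration count) is an assumption for illustration, not the lecture's own code:

```python
import random

def em(data, n_iter=50, seed=0):
    """EM sketch for the network Z -> X, Z -> Y with Z hidden and binary.
    data: list of (x, y) pairs with x, y in {0, 1}."""
    rng = random.Random(seed)
    bern = lambda p, v: p if v else 1.0 - p
    # Step 1: estimate parameters somehow or other (random initialization).
    pz = rng.uniform(0.3, 0.7)                        # P(Z=1)
    px = [rng.uniform(0.3, 0.7) for _ in range(2)]    # px[z] = P(X=1 | Z=z)
    py = [rng.uniform(0.3, 0.7) for _ in range(2)]    # py[z] = P(Y=1 | Z=z)
    for _ in range(n_iter):
        # Steps 2-3 (E-step): predict the hidden Z for each example; the
        # pseudo-data weight is the confidence w = P(Z=1 | x, y).
        w = []
        for x, y in data:
            a = (1 - pz) * bern(px[0], x) * bern(py[0], y)   # Z=0 branch
            b = pz * bern(px[1], x) * bern(py[1], y)         # Z=1 branch
            w.append(b / (a + b))
        # Step 4 (M-step): weighted MLE on the real + pseudo data.
        n1 = sum(w)
        n0 = len(data) - n1
        pz = n1 / len(data)
        px[1] = sum(wi for (x, _), wi in zip(data, w) if x) / n1
        px[0] = sum(1 - wi for (x, _), wi in zip(data, w) if x) / n0
        py[1] = sum(wi for (_, y), wi in zip(data, w) if y) / n1
        py[0] = sum(1 - wi for (_, y), wi in zip(data, w) if y) / n0
    return pz, px, py
```

On clearly bimodal data the random initialization breaks the symmetry and the two values of Z converge to the two clusters.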
Learning with Hidden Variables: Example
(Two slides step through the hidden-variable method on the network Z → X, Z → Y; the first shows estimates .38, .35, .27 and the second .24, .32, .54. The accompanying tables were lost in extraction.)
Why does this work?
Ignore the prior (work with the MLE). Let Q(z) be any pdf with Q(z) > 0. Then:
log P(X|θ) = log Σz P(X, z|θ) = log Σz Q(z) · [P(X, z|θ) / Q(z)] = log EQ[P(X, Z|θ)/Q(Z)]
Here Q comes from an initial estimate of θ.
Jensen's inequality
Claim: log(q1·x1 + q2·x2) ≥ q1·log(x1) + q2·log(x2), where q1 + q2 = 1 and q1, q2 ≥ 0.
This holds for any concave function, not just log(x).
Further: log(EQ[X]) ≥ EQ[log(X)].
(Slide figure: the graph of log(x); the chord from (x1, log x1) to (x2, log x2) lies below the curve, so q1·log(x1) + q2·log(x2) ≤ log(q1·x1 + q2·x2).)
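The two-point claim is easy to spot-check numerically; the loop bounds and tolerances below are arbitrary choices for the check:

```python
import math
import random

# Randomized spot-check: log(q1*x1 + q2*x2) >= q1*log(x1) + q2*log(x2)
# whenever q1 + q2 = 1 and q1, q2 >= 0 (log is concave).
rng = random.Random(1)
for _ in range(1000):
    q1 = rng.random()
    q2 = 1.0 - q1
    x1 = rng.uniform(0.01, 10.0)
    x2 = rng.uniform(0.01, 10.0)
    lhs = math.log(q1 * x1 + q2 * x2)            # log of the mixture
    rhs = q1 * math.log(x1) + q2 * math.log(x2)  # mixture of the logs
    assert lhs >= rhs - 1e-12, (q1, x1, x2)
```

A check like this cannot prove the inequality, but it makes the direction of the bound easy to remember before it is used in the EM derivation.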
Why does this work?
Ignore the prior; Q(z) is a pdf with Q(z) > 0. Since log(EQ[X]) ≥ EQ[log(X)]:
log P(X|θ') = log EQ[P(X, Z|θ')/Q(Z)] ≥ EQ[log(P(X, Z|θ')/Q(Z))] = Σz Q(z) log P(X, z|θ') - Σz Q(z) log Q(z)
Take Q from the current estimate of θ, say θ0. θ' depends on X, Z but not directly on Q, so P(X, Z, θ'|Q) = P(θ'|X, Z, Q) · P(X, Z|Q).
So, plugging in pseudo-data weighted by Q and finding the MLE optimizes a lower bound on the log-likelihood.