
Learning Bayesian networks

This slide presentation by Nir Friedman discusses learning Bayesian networks using incomplete data and prior information. Topics covered include parameter estimation, learning from incomplete data, gradient ascent, and the Expectation Maximization (EM) algorithm.


Presentation Transcript


  1. Learning Bayesian networks. Slides by Nir Friedman.

  2. Learning Bayesian networks [Figure: Data + prior information feed an Inducer, which outputs a Bayesian network over E, B, R, A, C together with its conditional probability tables, e.g. P(A | E, B).]

  3. Known Structure -- Incomplete Data • Network structure is specified • Data contains missing values • We consider assignments to missing values. [Figure: the Inducer receives the fixed structure over E, B, A and records with missing values, e.g. E,B,A = <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>, and outputs the conditional probability table P(A | E, B).]

  4. Known Structure / Complete Data • Given a network structure G • and a choice of parametric family for P(Xi | Pai) • learn the parameters for the network from complete data. Goal: construct a network that is "closest" to the probability distribution that generated the data.

  5. L(:D) 0 0.2 0.4 0.6 0.8 1 Maximum Likelihood Estimation in Binomial Data • Applying the MLE principle we get (Which coincides with what one would expect) Example: (NH,NT ) = (3,2) MLE estimate is 3/5 = 0.6

  6. Learning Parameters for a Bayesian Network [Figure: example network over B, E, A, C.] • Training data has the form D = { <E[1], B[1], A[1], C[1]>, ..., <E[M], B[M], A[M], C[M]> }.

  7. Learning Parameters for a Bayesian Network • Since we assume i.i.d. samples, the likelihood function is L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ).

  8. Learning Parameters for a Bayesian Network • By the definition of the network, we get L(Θ : D) = ∏_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | E[m], B[m] : Θ) P(C[m] | A[m] : Θ).

  9. Learning Parameters for a Bayesian Network • Rewriting terms, we get L(Θ : D) = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | E[m], B[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)], i.e. a separate likelihood term for each family.

  10. General Bayesian Networks Generalizing for any Bayesian network: L(Θ : D) = ∏_m P(x_1[m], ..., x_n[m] : Θ) (i.i.d. samples) = ∏_i ∏_m P(x_i[m] | pa_i[m] : Θ_i) (network factorization) = ∏_i L_i(Θ_i : D). • The likelihood decomposes according to the structure of the network.
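The decomposition can be checked numerically. Below is a small sketch; the toy network X → Y, its parameter values, and the data are made up for the check and are not from the slides:

```python
import math

# Toy network X -> Y with arbitrary (made-up) parameters.
p_x = {'H': 0.6, 'T': 0.4}
p_y_given_x = {('H', 'H'): 0.7, ('H', 'T'): 0.3,   # keys are (x, y)
               ('T', 'H'): 0.2, ('T', 'T'): 0.8}

data = [('H', 'H'), ('H', 'T'), ('T', 'H'), ('H', 'H')]  # complete i.i.d. samples (x, y)

# Joint log-likelihood: sum of log P(x[m], y[m]) over the samples.
joint_ll = sum(math.log(p_x[x] * p_y_given_x[(x, y)]) for x, y in data)

# Per-family log-likelihoods: one term for P(X), one for P(Y | X).
ll_x = sum(math.log(p_x[x]) for x, _ in data)
ll_y = sum(math.log(p_y_given_x[(x, y)]) for x, y in data)

print(math.isclose(joint_ll, ll_x + ll_y))  # True: the likelihood decomposes by family
```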

  11. General Bayesian Networks (Cont.) Complete Data ⇒ Decomposition ⇒ Independent Estimation Problems. If the parameters for each family are not related, then they can be estimated independently of each other. (Not true in genetic linkage analysis.)

  12. Learning Parameters: Summary • For multinomials we collect sufficient statistics, which are simply the counts N(xi, pai) • Parameter estimation: MLE θ̂_{xi|pai} = N(xi, pai) / N(pai); Bayesian (Dirichlet prior) θ̂_{xi|pai} = (N(xi, pai) + α(xi, pai)) / (N(pai) + α(pai)) • Bayesian methods also require a choice of priors • Both MLE and Bayesian estimates are asymptotically equivalent and consistent.
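A minimal sketch of both estimators from the counts N(xi, pai); the sample data, the uniform Dirichlet hyperparameter, and all names below are illustrative assumptions rather than the slides' code:

```python
from collections import Counter

# Complete-data samples (E, B, A); we estimate P(A | E, B) from counts.
# The values below are made up purely for illustration.
samples = [('Y', 'N', 'N'), ('Y', 'N', 'Y'), ('N', 'N', 'Y'),
           ('N', 'Y', 'N'), ('N', 'Y', 'Y')]

n_family = Counter(samples)                          # N(a, e, b): child value with its parent configuration
n_parents = Counter((e, b) for e, b, _ in samples)   # N(e, b): parent configuration alone

def mle(a, e, b):
    """MLE: N(a, e, b) / N(e, b)."""
    return n_family[(e, b, a)] / n_parents[(e, b)]

def dirichlet_estimate(a, e, b, alpha=1.0, n_values=2):
    """Bayesian estimate with a uniform Dirichlet prior (pseudo-count alpha per value of A)."""
    return (n_family[(e, b, a)] + alpha) / (n_parents[(e, b)] + alpha * n_values)

print(mle('Y', 'Y', 'N'), dirichlet_estimate('Y', 'Y', 'N'))  # 0.5 0.5
```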

  13. Known Structure -- Incomplete Data • Network structure is specified • Data contains missing values • We consider assignments to missing values. [Figure: the Inducer receives the fixed structure over E, B, A and records with missing values, e.g. E,B,A = <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>, and outputs the conditional probability table P(A | E, B).]

  14. X m Y|X=H X[m] Y|X=T Y[m] Learning Parameters from Incomplete Data Incomplete data: • Posterior distributions can become interdependent • Consequence: • ML parameters can not be computed separately for each multinomial • Posterior is not a product of independent posteriors

  15. Learning Parameters from Incomplete Data (cont.) [Figure: a network with hidden variable H and observed Y.] • In the presence of incomplete data, the likelihood can have multiple global maxima • Example: we can rename the values of the hidden variable H • If H has two values, the likelihood has two global maxima • Similarly, local maxima are also replicated • Many hidden variables ⇒ a serious problem.

  16. MLE from Incomplete Data • Finding MLE parameters is a nonlinear optimization problem. [Plot: likelihood L(Θ | D) as a function of Θ.] • Gradient Ascent: follow the gradient of the likelihood w.r.t. the parameters.

  17. MLE from Incomplete Data • Finding MLE parameters is a nonlinear optimization problem. [Plot: likelihood L(Θ | D) as a function of Θ.] • Expectation Maximization (EM): use the "current point" to construct an alternative function (which is "nice"). Guarantee: the maximum of the new function scores better than the current point.

  18. MLE from Incomplete Data Both ideas: find local maxima only, and require multiple restarts to find an approximation to the global maximum.

  19. Gradient Ascent • Main result (Theorem GA): ∂ log P(D | Θ) / ∂θ_{x_i,pa_i} = (1 / θ_{x_i,pa_i}) Σ_m P(x_i, pa_i | o[m], Θ) • Requires computation of P(x_i, pa_i | o[m], Θ) for all i, m • Inference replaces taking derivatives.

  20. Gradient Ascent (cont.) Proof: ∂ log P(D | Θ) / ∂θ_{x_i,pa_i} = Σ_m ∂ log P(o[m] | Θ) / ∂θ_{x_i,pa_i} = Σ_m [1 / P(o[m] | Θ)] ∂P(o[m] | Θ) / ∂θ_{x_i,pa_i}. How do we compute ∂P(o[m] | Θ) / ∂θ_{x_i,pa_i}?

  21. Gradient Ascent (cont.) ∂P(o | Θ) / ∂θ_{x_i,pa_i} = Σ_{x'_i,pa'_i} ∂P(x'_i, pa'_i, o | Θ) / ∂θ_{x_i,pa_i} = Σ_{x'_i,pa'_i} ∂[ P(o_d | x'_i, pa'_i, o_nd, Θ) P(x'_i | pa'_i, Θ) P(pa'_i, o_nd | Θ) ] / ∂θ_{x_i,pa_i} = P(o_d | x_i, pa_i, o_nd, Θ) P(pa_i, o_nd | Θ). Since: ∂P(x'_i | pa'_i, Θ) / ∂θ_{x_i,pa_i} = 1 only when (x'_i, pa'_i) = (x_i, pa_i), and 0 otherwise; and P(o_d | x'_i, pa'_i, o_nd, Θ) P(pa'_i, o_nd | Θ) = P(x'_i, pa'_i, o | Θ) / θ_{x'_i,pa'_i}.

  22. Gradient Ascent (cont.) • Putting it all together we get ∂ log P(D | Θ) / ∂θ_{x_i,pa_i} = Σ_m P(x_i, pa_i, o[m] | Θ) / [θ_{x_i,pa_i} P(o[m] | Θ)] = (1 / θ_{x_i,pa_i}) Σ_m P(x_i, pa_i | o[m], Θ).
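A minimal sketch of a gradient-ascent step based on this result. The `posterior` inference routine and all other names are illustrative placeholders assumed for the sketch, not code from the slides:

```python
def log_likelihood_gradient(theta, data, posterior):
    """Gradient of log P(D | theta) w.r.t. each table entry theta[(x_i, pa_i)].

    theta:     dict mapping (x_i, pa_i) keys to current parameter values
    data:      list of (possibly partial) observations o[m]
    posterior: function (x_i, pa_i, o, theta) -> P(x_i, pa_i | o, theta),
               i.e. one inference call per entry and sample (Theorem GA)
    """
    grad = {}
    for key, value in theta.items():
        x_i, pa_i = key
        # d log P(D | theta) / d theta = (1 / theta) * sum_m P(x_i, pa_i | o[m], theta)
        grad[key] = sum(posterior(x_i, pa_i, o, theta) for o in data) / value
    return grad

def gradient_ascent_step(theta, data, posterior, learning_rate=0.01):
    """One unconstrained ascent step; a real implementation would also
    re-normalize or reparameterize so each conditional distribution sums to 1."""
    grad = log_likelihood_gradient(theta, data, posterior)
    return {key: theta[key] + learning_rate * grad[key] for key in theta}
```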

  23. Expectation Maximization (EM) • A general-purpose method for learning from incomplete data. Intuition: • If we had access to counts, then we could estimate the parameters • However, missing values do not allow us to perform these counts • "Complete" the counts using the current parameter assignment.

  24. Expectation Maximization (EM) [Figure: a network over X, Y, Z; the current model gives P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Z=T, Θ) = 0.4.] Data (five samples, '?' marks a missing value): X = H, T, H, H, T; Z = ?, ?, H, T, T; Y = T, T, ?, T, H. Expected counts N(X, Y): (H,H) = 1.3, (H,T) = 0.4, (T,H) = 1.7, (T,T) = 1.6. These numbers are placed for illustration; they have not been computed.
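The expected counts could be computed along these lines. This is an illustrative sketch for the case where only Y may be missing; the conditional probabilities for the (X, Z) pairs not given on the slide are assumed values:

```python
# Expected count N(X=x, Y=y): an observed case adds 1, a missing Y adds the
# posterior probability of y under the current model.
p_y_h = {('H', 'T'): 0.3, ('T', 'T'): 0.4,      # P(Y=H | X, Z, theta), as on the slide
         ('H', 'H'): 0.5, ('T', 'H'): 0.5}      # assumed values for the remaining (X, Z) pairs

samples = [('H', '?', 'T'), ('T', '?', 'T'), ('H', 'H', '?'),
           ('H', 'T', 'T'), ('T', 'T', 'H')]    # (X, Z, Y) with '?' marking a missing value

def expected_count(x, y):
    total = 0.0
    for sx, sz, sy in samples:
        if sx != x:
            continue
        if sy != '?':
            total += 1.0 if sy == y else 0.0
        else:
            # Y is missing: add P(Y=y | X, Z, theta). In this data Z is observed whenever Y is missing.
            p_h = p_y_h[(sx, sz)]
            total += p_h if y == 'H' else 1.0 - p_h
    return total

print(expected_count('H', 'H'))  # E[N(X=H, Y=H)] under the current model
```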

  25. EM (cont.) [Figure: starting from the initial network (G, Θ0) and the training data, the E-step computes the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H); the M-step reparameterizes to obtain the updated network (G, Θ1); then reiterate.]
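A schematic of this loop, assuming generic E-step and M-step helpers supplied by the caller; every name here is an illustrative placeholder rather than the slides' code:

```python
def em(theta0, data, compute_expected_counts, reestimate, n_iterations=20):
    """Generic EM loop for Bayesian-network parameters.

    theta0:                  initial parameter assignment (G, Theta_0)
    compute_expected_counts: E-step -- run inference with the current parameters
                             to fill in expected sufficient statistics N(x_i, pa_i)
    reestimate:              M-step -- reparameterize from the expected counts,
                             e.g. theta = N(x_i, pa_i) / N(pa_i)
    """
    theta = theta0
    for _ in range(n_iterations):
        expected_counts = compute_expected_counts(theta, data)   # E-step
        theta = reestimate(expected_counts)                      # M-step (reiterate)
    return theta
```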

  26. Expectation Maximization (EM) • In practice, EM converges rather quickly at the start but slowly near the (possibly local) maximum. • Hence, EM is often run for a few iterations and then Gradient Ascent steps are applied.

  27. Final Homework. Question 1: Develop an algorithm that, given a pedigree as input, provides the most probable haplotype of each individual in the pedigree. Use the Bayesian network model of Superlink to formulate the problem exactly as a query. Specify the algorithm at length, discussing as many details as you can. Analyze its efficiency. Devote time to illuminating notation and presentation. Question 2: Specialize the formula given in Theorem GA for θ in genetic linkage analysis. In particular, assume exactly 3 loci: Marker 1, Disease 2, Marker 3, with θ being the recombination fraction between loci 2 and 1 and 0.1 - θ being the recombination fraction between loci 3 and 2. Specify the formula for a pedigree with two parents and two children. Extend the formula to arbitrary pedigrees. Note that θ is the same in many local probability tables.
