Param. Learning (MLE), Structure Learning, The Good
Graphical Models – 10708
Carlos Guestrin, Carnegie Mellon University
October 1st, 2008
Readings: K&F: 16.1, 16.2, 17.1, 17.2, 17.3.1, 17.4.1
Learning the CPTs
• Data: x(1), …, x(m)
• For each discrete variable Xi: estimate P(Xi | PaXi) by counting in the data
• WHY does counting give the maximum likelihood estimate? (Derived on the next slides; a concrete sketch follows below.)
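As a concrete illustration, here is a minimal sketch of MLE-by-counting for one CPT. The data layout (a list of dicts mapping variable names to values) and the function name `estimate_cpt` are illustrative assumptions, not from the lecture.

```python
from collections import Counter

def estimate_cpt(data, child, parents):
    """MLE of P(child | parents) from fully observed data, by counting.

    data: list of dicts, each mapping variable name -> observed value.
    Returns {(parent_assignment, child_value): probability}.
    """
    joint = Counter()   # counts of (parent assignment, child value)
    parent = Counter()  # counts of the parent assignment alone
    for x in data:
        u = tuple(x[p] for p in parents)
        joint[(u, x[child])] += 1
        parent[u] += 1
    # MLE: theta_{x|u} = Count(x, u) / Count(u)
    return {(u, v): c / parent[u] for (u, v), c in joint.items()}

# Toy usage: P(Nose | Sinus) from five samples
data = [{"Sinus": 1, "Nose": 1}, {"Sinus": 1, "Nose": 1},
        {"Sinus": 1, "Nose": 0}, {"Sinus": 0, "Nose": 0},
        {"Sinus": 0, "Nose": 0}]
print(estimate_cpt(data, "Nose", ["Sinus"]))
# {((1,), 1): 0.667, ((1,), 0): 0.333, ((0,), 0): 1.0} (approximately)
```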
Maximum likelihood estimation (MLE) of BN parameters – example
[Figure: BN with Flu and Allergy as parents of Sinus, and Sinus the parent of Nose]
• Given structure, log likelihood of data:
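For the pictured network the log likelihood decomposes per CPT; a standard reconstruction (f, a, s, n are the values of Flu, Allergy, Sinus, Nose in sample j):

```latex
\log P(D \mid \theta, G) = \sum_{j=1}^{m} \Big[
  \log \theta_{f^{(j)}} + \log \theta_{a^{(j)}}
  + \log \theta_{s^{(j)} \mid f^{(j)}, a^{(j)}}
  + \log \theta_{n^{(j)} \mid s^{(j)}} \Big]
```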
Maximum likelihood estimation (MLE) of BN parameters – General case
• Data: x(1), …, x(m)
• Restriction: x(j)[PaXi] → the assignment to PaXi in x(j)
• Given structure, log likelihood of data:
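In general the log likelihood decomposes into one term per variable (a standard reconstruction of the slide's formula):

```latex
\log P(D \mid \theta, G)
  = \sum_{j=1}^{m} \sum_{i=1}^{n} \log P\big(x_i^{(j)} \mid x^{(j)}[\mathrm{Pa}_{X_i}]\big)
  = \sum_{i=1}^{n} \sum_{j=1}^{m} \log \theta_{x_i^{(j)} \mid x^{(j)}[\mathrm{Pa}_{X_i}]}
```

The outer sum over variables means each CPT's parameters can be optimized separately.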
Taking derivatives of MLE of BN parameters – General case
General MLE for a CPT
• Take a CPT: P(X | U)
• Log likelihood term for this CPT: Σx,u Count(x, u) log θx|u
• Parameter θX=x|U=u: maximize subject to Σx θx|u = 1
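Setting the derivative to zero under the normalization constraint (e.g., via a Lagrange multiplier) yields the familiar count ratio; a reconstruction of the derivation sketched in lecture:

```latex
\max_{\theta}\; \sum_{x,u} \mathrm{Count}(x, u) \log \theta_{x \mid u}
\quad \text{s.t.} \quad \sum_{x} \theta_{x \mid u} = 1 \;\; \forall u
\;\;\Longrightarrow\;\;
\hat{\theta}_{x \mid u}
= \frac{\mathrm{Count}(x, u)}{\sum_{x'} \mathrm{Count}(x', u)}
= \frac{\mathrm{Count}(x, u)}{\mathrm{Count}(u)}
```

This is exactly why "learning the CPTs" reduces to counting.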
Where are we with learning BNs?
• Given structure, estimate parameters
  • Maximum likelihood estimation
  • Later: Bayesian learning
• What about learning structure?
Learning the structure of a BN
[Figure: data <x1(1),…,xn(1)> … <x1(m),…,xn(m)> → possible structures over Flu, Allergy, Sinus, Nose, Headache]
• Learn structure and parameters
• Constraint-based approach
  • BN encodes conditional independencies
  • Test conditional independencies in data
  • Find an I-map
• Score-based approach
  • Finding a structure and parameters is a density estimation task
  • Evaluate model as we evaluated parameters
    • Maximum likelihood, Bayesian, etc.
Remember: Obtaining a P-map?
• Given the independence assertions that are true for P:
  • Obtain skeleton
  • Obtain immoralities
  • From skeleton and immoralities, obtain every (and any) BN structure from the equivalence class
• Constraint-based approach:
  • Use the Learn-PDAG algorithm
  • Key question: the independence test
Score-based approach
[Figure: data <x1(1),…,xn(1)> … <x1(m),…,xn(m)> → possible structures over Flu, Allergy, Sinus, Nose, Headache → learn parameters → score each structure]
Information-theoretic interpretation of maximum likelihood
[Figure: BN over Flu, Allergy, Sinus, Nose, Headache]
• Given structure, log likelihood of data:
Information-theoretic interpretation of maximum likelihood 2
• Given structure, log likelihood of data:
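The punchline of these two slides is the standard identity relating the maximized log likelihood to empirical mutual information and entropy (a reconstruction; hats denote empirical quantities):

```latex
\frac{1}{m} \log P(D \mid \hat{\theta}, G)
  = \sum_{i=1}^{n} \hat{I}\big(X_i ;\, \mathrm{Pa}_{X_i}\big)
  - \sum_{i=1}^{n} \hat{H}(X_i)
```

Since the entropy term does not depend on the graph, maximizing likelihood over structures amounts to choosing parent sets with high mutual information with each variable.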
Decomposable score
• Log data likelihood
• Decomposable score:
  • Decomposes over families in the BN (a node and its parents)
  • Will lead to significant computational efficiency!!!
  • Score(G : D) = Σi FamScore(Xi | PaXi : D) – see the sketch below
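A minimal sketch of why decomposability helps: each family can be scored once and reused across candidate graphs that share it. The counts-based likelihood family score and the data layout below are illustrative assumptions.

```python
import math
from collections import Counter

def fam_score(data, child, parents):
    """Likelihood family score: sum_{x,u} Count(x,u) * log(Count(x,u)/Count(u))."""
    joint, parent = Counter(), Counter()
    for x in data:
        u = tuple(x[p] for p in parents)
        joint[(u, x[child])] += 1
        parent[u] += 1
    return sum(c * math.log(c / parent[u]) for (u, _), c in joint.items())

def score(data, graph):
    """Decomposable score: graph maps each node -> tuple of its parents."""
    return sum(fam_score(data, x, ps) for x, ps in graph.items())

data = [{"S": 1, "N": 1}, {"S": 0, "N": 0}, {"S": 1, "N": 0}]
print(score(data, {"S": (), "N": ("S",)}))
```

Two candidate structures that differ in one family differ in exactly one term of the sum, so a structure search only recomputes the families it changes.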
Announcements
• Recitation tomorrow – don't miss it!
• HW2: out today, due in 2 weeks
• Projects!!!
  • Proposals due Oct. 8th in class
  • Individually or in groups of two
  • Details on course website
  • Project suggestions will be up soon!!!
BN code release!!!!
• Pre-release of a C++ library for probabilistic inference and learning
• Features:
  • basic data structures (random variables, processes, linear algebra)
  • distributions (Gaussian, multinomial, ...)
  • basic graph structures (directed, undirected)
  • graphical models (Bayesian networks, MRFs, junction trees)
  • inference algorithms (variable elimination, loopy belief propagation, filtering)
  • limited amount of learning (IPF, Chow-Liu, order-based search)
• Supported platforms:
  • Linux (tested on Ubuntu 8.04)
  • MacOS X (tested on 10.4/10.5)
  • limited Windows support
• Will be made available to the class early next week.
How many trees are there?
• There are n^(n−2) distinct labeled trees on n nodes (Cayley's formula) – superexponentially many
• Nonetheless: an efficient algorithm finds the optimal tree
Scoring a tree 1: I-equivalent trees
Scoring a tree 2: similar trees
Chow-Liu tree learning algorithm 1
• For each pair of variables Xi, Xj:
  • Compute the empirical distribution: P̂(xi, xj) = Count(xi, xj) / m
  • Compute mutual information: Î(Xi; Xj) = Σxi,xj P̂(xi, xj) log [ P̂(xi, xj) / (P̂(xi) P̂(xj)) ]
• Define a graph:
  • Nodes X1, …, Xn
  • Edge (i, j) gets weight Î(Xi; Xj)
Chow-Liu tree learning algorithm 2
• Optimal tree BN:
  • Compute the maximum weight spanning tree
  • Directions in the BN: pick any node as root; breadth-first search defines the directions
• A sketch of the full pipeline follows.
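A compact sketch of the whole algorithm, assuming networkx is available (its maximum_spanning_tree and bfs_edges helpers do the spanning-tree and orientation steps); the data layout (tuples of discrete values) and function names are illustrative.

```python
import math
from collections import Counter
import networkx as nx

def mutual_information(data, i, j):
    """Empirical mutual information I(Xi; Xj) from samples (tuples of values)."""
    m = len(data)
    pij = Counter((x[i], x[j]) for x in data)
    pi = Counter(x[i] for x in data)
    pj = Counter(x[j] for x in data)
    return sum((c / m) * math.log((c / m) / ((pi[a] / m) * (pj[b] / m)))
               for (a, b), c in pij.items())

def chow_liu(data, n):
    # Complete graph weighted by pairwise empirical mutual information
    G = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(i, j, weight=mutual_information(data, i, j))
    tree = nx.maximum_spanning_tree(G)
    # Orient edges away from an arbitrary root (node 0) via BFS
    return list(nx.bfs_edges(tree, 0))  # list of (parent, child) pairs

data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1)]
print(chow_liu(data, 3))
```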
Can we extend Chow-Liu? 1
• Tree-augmented naïve Bayes (TAN) [Friedman et al. '97]
• The naïve Bayes model overcounts, because correlations between features are not considered
• Same as Chow-Liu, but score edges with the conditional mutual information given the class: Î(Xi; Xj | C)
Can we extend Chow-Liu? 2
• (Approximately) learning models with treewidth up to k [Chechetka & Guestrin '07]
• But: O(n^(2k+6)) running time
What you need to know about learning BN structures so far
• Decomposable scores
  • Maximum likelihood
  • Information-theoretic interpretation
• Best tree (Chow-Liu)
• Best TAN
• Nearly best k-treewidth (in O(n^(2k+6)))
Can we really trust MLE?
• What is better?
  • 3 heads, 2 tails
  • 30 heads, 20 tails
  • 3×10^23 heads, 2×10^23 tails
• Many possible answers; we need distributions over possible parameters
Bayesian Learning
• Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D)
• Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)
Bayesian Learning for Thumbtack
• Likelihood function is simply binomial: P(D | θ) = θ^mH (1 − θ)^mT
• What about the prior?
  • Represents expert knowledge
  • Simple posterior form
• Conjugate priors:
  • Closed-form representation of the posterior (more details soon)
  • For the binomial, the conjugate prior is the Beta distribution
Beta prior distribution – P(θ)
• Beta(αH, αT): P(θ) ∝ θ^(αH−1) (1 − θ)^(αT−1)
• Likelihood function: P(D | θ) = θ^mH (1 − θ)^mT
• Posterior: P(θ | D) ∝ P(D | θ) P(θ)
Posterior distribution
• Prior: Beta(αH, αT)
• Data: mH heads and mT tails
• Posterior distribution: Beta(αH + mH, αT + mT)
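Why the update is just "add the counts to the hyperparameters" (a standard derivation, not verbatim from the slides):

```latex
P(\theta \mid D) \;\propto\; P(D \mid \theta)\, P(\theta)
\;\propto\; \theta^{m_H} (1-\theta)^{m_T} \cdot \theta^{\alpha_H - 1} (1-\theta)^{\alpha_T - 1}
\;=\; \theta^{\alpha_H + m_H - 1} (1-\theta)^{\alpha_T + m_T - 1}
```

which is the kernel of Beta(αH + mH, αT + mT).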
Conjugate prior
• Given likelihood function P(D | θ)
• A (parametric) prior of the form P(θ | α) is conjugate to the likelihood function if the posterior is of the same parametric family, and can be written as P(θ | α'), for some new set of parameters α'
• Prior: Beta(αH, αT)
• Data: mH heads and mT tails (binomial likelihood)
• Posterior distribution: Beta(αH + mH, αT + mT)
Using the Bayesian posterior
• Posterior distribution: P(θ | D)
• Bayesian inference: predict by averaging over θ, e.g. P(x | D) = ∫ P(x | θ) P(θ | D) dθ
• No longer a single parameter
• The integral is often hard to compute
Bayesian prediction of a new coin flip
• Prior: Beta(αH, αT)
• Having observed mH heads and mT tails, what is the probability that flip m+1 is heads?
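For the Beta posterior this predictive integral has a closed form (a standard result):

```latex
P\big(x^{(m+1)} = H \mid D\big)
= \int_0^1 \theta \, P(\theta \mid D)\, d\theta
= \frac{\alpha_H + m_H}{\alpha_H + \alpha_T + m_H + m_T}
```

So prediction looks like the MLE computed with αH and αT "hallucinated" extra counts; this is the smoothing view developed on the next slides.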
Asymptotic behavior and equivalent sample size
• The Beta prior is equivalent to extra thumbtack flips
• As m → ∞, the prior is "forgotten"
• But for small sample sizes, the prior is important!
• Equivalent sample size:
  • Prior parameterized by αH, αT; or by m' = αH + αT (the equivalent sample size) and the prior mean
  • Fix m' and change the prior mean; or fix the prior mean and change m'
Bayesian learning corresponds to smoothing
• m = 0 ⇒ prediction given by the prior parameters
• m → ∞ ⇒ prediction converges to the MLE
Bayesian learning for the multinomial
• What if you have a k-sided coin???
• Likelihood function is multinomial: P(D | θ) = Πi θi^mi
• Conjugate prior for the multinomial is the Dirichlet: P(θ) ∝ Πi θi^(αi−1)
• Observe m data points, mi from assignment i; posterior: Dirichlet(α1 + m1, …, αk + mk)
• Prediction: P(next outcome = i | D) = (αi + mi) / Σj (αj + mj)
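A minimal numeric sketch of the Dirichlet-multinomial update and predictive (plain Python; the function names are illustrative):

```python
def dirichlet_posterior(alpha, counts):
    """Posterior hyperparameters: add observed counts to prior pseudo-counts."""
    return [a + c for a, c in zip(alpha, counts)]

def predictive(alpha):
    """P(next outcome = i | D) = alpha_i / sum(alpha) under a Dirichlet."""
    total = sum(alpha)
    return [a / total for a in alpha]

alpha = [1.0, 1.0, 1.0]          # uniform prior over a 3-sided "coin"
counts = [3, 0, 1]               # observed outcomes per side
post = dirichlet_posterior(alpha, counts)
print(predictive(post))          # [4/7, 1/7, 2/7]
```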
Bayesian learning for a two-node BN
• Parameters θX, θY|X
• Priors: P(θX) and P(θY|X)
Very important assumption on the prior: global parameter independence
• Global parameter independence: the prior over parameters is a product of priors over the individual CPTs, P(θ) = Πi P(θXi|PaXi)
Global parameter independence, d-separation and local prediction
[Figure: meta-BN over Flu, Allergy, Sinus, Nose, Headache with a node for each CPT's parameters]
• Independencies in the meta-BN
• Proposition: for fully observable data D, if the prior satisfies global parameter independence, then the posterior also decomposes: P(θ | D) = Πi P(θXi|PaXi | D)
Within a CPT
• Meta-BN including the CPT parameters
• Are θY|X=t and θY|X=f d-separated given D? No
• Are θY|X=t and θY|X=f independent given D? Yes
• Context-specific independence!!!
• The posterior decomposes: P(θY|X | D) = P(θY|X=t | D) P(θY|X=f | D)
Priors for BN CPTs (more when we talk about structure learning)
• Consider each CPT: P(X | U = u)
• Conjugate prior: Dirichlet(αX=1|U=u, …, αX=k|U=u)
• More intuitive:
  • a "prior data set" D' with equivalent sample size m'
  • "prior counts": Count'(x, u), as if D' had been observed before D
  • prediction: count smoothing, as in the formula below
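With prior counts, the Bayesian predictive estimate for a CPT entry is a smoothed count ratio (a standard reconstruction):

```latex
P\big(X = x \mid U = u,\, D\big)
= \frac{\mathrm{Count}(x, u) + \mathrm{Count}'(x, u)}
       {\mathrm{Count}(u) + \mathrm{Count}'(u)}
```

Setting all prior counts to 1 recovers the familiar add-one (Laplace) smoothing.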
An example
What you need to know about parameter learning
• MLE:
  • score decomposes according to CPTs
  • optimize each CPT separately
• Bayesian parameter learning:
  • motivation for the Bayesian approach
  • Bayesian prediction
  • conjugate priors, equivalent sample size
  • Bayesian learning ⇒ smoothing
• Bayesian learning for BN parameters:
  • global parameter independence
  • decomposition of prediction according to CPTs
  • decomposition within a CPT