Structure Learning in Bayesian Networks. Eran Segal, Weizmann Institute
Structure Learning Motivation • Network structure is often unknown • Purposes of structure learning • Discover the dependency structure of the domain • Goes beyond statistical correlations between individual variables and detects direct vs. indirect correlations • Set expectations: at best, we can recover the structure up to the I-equivalence class • Density estimation • Estimate a statistical model of the underlying distribution and use it to reason with and predict new instances
Advantages of Accurate Structure [Figure: a true network over X1, X2, Y, shown next to one variant with a spurious edge and one variant with a missing edge] • Spurious edge • Increases number of fitted parameters • Wrong causality and domain structure assumptions • Missing edge • Cannot be compensated by parameter estimation • Wrong causality and domain structure assumptions
Structure Learning Approaches • Constraint based methods • View the Bayesian network as representing dependencies • Find a network that best explains dependencies • Limitation: sensitive to errors in single dependencies • Score based approaches • View learning as a model selection problem • Define a scoring function specifying how well the model fits the data • Search for a high-scoring network structure • Limitation: super-exponential search space • Bayesian model averaging methods • Average predictions across all possible structures • Can be done exactly (some cases) or approximately
Constraint Based Approaches • Goal: Find the best minimal I-Map for the domain • G is an I-Map for P if $I(G) \subseteq I(P)$ • Minimal I-Map if deleting an edge from G renders it not an I-Map • G is a P-Map for P if $I(G) = I(P)$ • Strategy • Query the distribution for independence relationships that hold between sets of variables • Construct a network which is the best minimal I-Map for P
Constructing Minimal I-Maps • Reverse factorization theorem • If $P(X_1,\dots,X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))$ then G is an I-Map of P • Algorithm for constructing a minimal I-Map • Fix an ordering of nodes $X_1,\dots,X_n$ • Select parents of $X_i$ as a minimal subset of $X_1,\dots,X_{i-1}$ such that $Ind(X_i ; \{X_1,\dots,X_{i-1}\} - Pa(X_i) \mid Pa(X_i))$ • (Outline of) Proof of minimal I-map • I-map since the factorization above holds by construction • Minimal since, by construction, removing one edge destroys the factorization • Limitations • Independence queries involve a large number of variables • Construction involves a large number of queries ($2^{i-1}$ subsets for $X_i$) • We do not know the ordering, and the network is sensitive to it
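The construction above can be sketched in a few lines. This is only an illustration under stated assumptions: `is_independent` is a hypothetical oracle (not something the slides define), and `chain_oracle` hard-codes the single independence of a toy chain A→B→C.

```python
from itertools import combinations

def minimal_imap(order, is_independent):
    """Pick a minimal parent set for each node, following a fixed ordering.

    is_independent(x, others, given) is a hypothetical oracle answering
    whether x is independent of the set `others` given the set `given`.
    """
    parents = {}
    for i, x in enumerate(order):
        preds = order[:i]
        chosen = tuple(preds)  # all predecessors always satisfy the condition
        # Try candidate sets from smallest to largest: the first one that
        # passes has minimum size, so it is also a minimal subset.
        for size in range(len(preds) + 1):
            hits = [c for c in combinations(preds, size)
                    if is_independent(x, [p for p in preds if p not in c], c)]
            if hits:
                chosen = hits[0]
                break
        parents[x] = chosen
    return parents

def chain_oracle(x, others, given):
    """Toy oracle for the chain A -> B -> C (assumed example)."""
    if not others:
        return True  # independence from the empty set holds trivially
    return {x, *others} == {"A", "C"} and set(given) == {"B"}

print(minimal_imap(["A", "B", "C"], chain_oracle))
print(minimal_imap(["C", "A", "B"], chain_oracle))
```

The second call uses a different ordering and ends up with a fully connected network, which illustrates the sensitivity to the ordering noted in the limitations.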
Constructing P-Maps • Simplifying assumptions • Network has bounded in-degree d per node • Oracle can answer Ind. queries of up to 2d+2 variables • Distribution P has a P-Map • Algorithm • Step I: Find skeleton • Step II: Find immoral set of v-structures • Step III: Direct constrained edges
Step I: Identifying the Skeleton • For each pair X,Y query all candidate sets Z for Ind(X;Y | Z) • X–Y is in skeleton if no Z is found • If graph in-degree is bounded by d, running time is $O(n^{2d+2})$ • Since if no direct edge exists, Ind(X;Y | Pa(X), Pa(Y)) • Reminder • If there is no Z for which Ind(X;Y | Z) holds, then X→Y or Y→X in G* • Proof: Assume no Z exists, and G* does not have X→Y or Y→X • Then, can find a set Z such that the path from X to Y is blocked • Then, G* implies Ind(X;Y | Z), and since G* is a P-Map, Ind(X;Y | Z) holds in P • Contradiction
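A minimal sketch of the skeleton step, assuming a hypothetical oracle `is_independent(x, y, z)` and witness sets of at most 2d variables (so each query touches at most 2d+2 variables). The toy oracle encodes the v-structure X→Z←Y, whose only independence is Ind(X;Y | ∅).

```python
from itertools import combinations

def find_skeleton(variables, is_independent, d):
    """Return the undirected edges X-Y for which no separating set is found."""
    edges = set()
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        # Search witness sets Z with |Z| <= 2d, smallest first.
        separated = any(
            is_independent(x, y, z)
            for size in range(min(2 * d, len(others)) + 1)
            for z in combinations(others, size)
        )
        if not separated:
            edges.add(frozenset((x, y)))
    return edges

def v_structure_oracle(x, y, z):
    """Toy oracle for X -> Z <- Y (assumed example)."""
    return {x, y} == {"X", "Y"} and len(z) == 0

skeleton = find_skeleton(["X", "Y", "Z"], v_structure_oracle, d=2)
print(sorted(sorted(e) for e in skeleton))
```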
Step II: Identifying Immoralities • For each pair X,Y query candidate triplets X,Y,Z • X→Z←Y if no W is found that contains Z and Ind(X;Y | W) • If graph in-degree is bounded by d, running time is $O(n^{2d+3})$ • If W exists, Ind(X;Y | W), and X→Z←Y is not an immorality, then Z∈W • Reminder • If there is no W such that Z is in W and Ind(X;Y | W), then X→Z←Y is an immorality • Proof: Assume no W exists but X–Z–Y is not an immorality • Then, either X→Z→Y or X←Z←Y or X←Z→Y exists • But then, we can block X–Z–Y by Z • Then, since X and Y are not connected, can find W that includes Z such that Ind(X;Y | W) • Contradiction
Answering Independence Queries • Basic query • Determine whether two variables are independent • Well studied question in statistics • Common to frame query as hypothesis testing • Null hypothesis is $H_0$ • $H_0$: Data was sampled from $P^*(X,Y) = P^*(X)P^*(Y)$ • Need a procedure that will Accept or Reject the hypothesis • $\chi^2$ test to assess deviance of data from hypothesis • Alternatively, use mutual information between X and Y
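Both statistics mentioned above can be computed from the same contingency counts. The sketch below is illustrative only; the function name and the input format (a list of `(x, y)` samples) are assumptions.

```python
import math
from collections import Counter

def independence_stats(pairs):
    """Return (chi-square statistic, empirical mutual information in nats)."""
    m = len(pairs)
    joint = Counter(pairs)
    cx = Counter(x for x, _ in pairs)
    cy = Counter(y for _, y in pairs)
    chi2, mi = 0.0, 0.0
    for x in cx:
        for y in cy:
            expected = cx[x] * cy[y] / m   # expected count under H0
            observed = joint[(x, y)]
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:
                # p(x,y) / (p(x) p(y)) equals observed / expected
                mi += (observed / m) * math.log(observed / expected)
    return chi2, mi

independent = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")] * 25
dependent = [("H", "H")] * 50 + [("T", "T")] * 50
print(independence_stats(independent))
print(independence_stats(dependent))
```

For two binary variables the statistic has one degree of freedom, so a common decision rule rejects $H_0$ at the 0.05 level when the statistic exceeds 3.841.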
Score Based Approaches • Strategy • Define a scoring function for each candidate structure • Search for a high scoring structure • Key: choice of scoring function • Likelihood based scores • Bayesian based scores
Likelihood Scores • Goal: find $(G, \theta)$ that maximize the likelihood • $Score_L(G:D) = \log P(D \mid G, \hat\theta_G)$ where $\hat\theta_G$ is the MLE for G • Find G that maximizes $Score_L(G:D)$
Example [Figure: two candidate networks over X and Y]
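General Decomposition • The likelihood score decomposes as: $Score_L(G:D) = M\sum_{i=1}^{n} I_{\hat P}(X_i ; Pa_{X_i}^G) - M\sum_{i=1}^{n} H_{\hat P}(X_i)$, where $\hat P$ is the empirical distribution and M the number of instances • Proof: write the log-likelihood with sufficient statistics, $\sum_i \sum_{pa_i} \sum_{x_i} M[x_i, pa_i] \log \hat\theta_{x_i \mid pa_i}$; substitute $M[x_i, pa_i] = M\,\hat P(x_i, pa_i)$ and $\hat\theta_{x_i \mid pa_i} = \hat P(x_i \mid pa_i)$; then multiply and divide by $\hat P(x_i)$ inside the logarithm to separate the mutual information term from the entropy term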
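General Decomposition • The likelihood score decomposes as: $Score_L(G:D) = M\sum_{i=1}^{n} I_{\hat P}(X_i ; Pa_{X_i}^G) - M\sum_{i=1}^{n} H_{\hat P}(X_i)$ • Second term does not depend on network structure and thus is irrelevant for selecting between two structures • Score increases as mutual information, or strength of dependence between connected variables, increases • After some manipulation can show, for an ordering $X_1,\dots,X_n$ consistent with G: $I_{\hat P}(X_i ; Pa_{X_i}^G) = I_{\hat P}(X_i ; \{X_1,\dots,X_{i-1}\}) - I_{\hat P}(X_i ; \{X_1,\dots,X_{i-1}\} - Pa_{X_i}^G \mid Pa_{X_i}^G)$ • The score thus measures to what extent the implied Markov assumptions are valid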
Limitations of Likelihood Score [Figure: G0, with no edge between X and Y; G1, with the edge X→Y] • Since $I_{\hat P}(X;Y) \ge 0$, $Score_L(G_1:D) \ge Score_L(G_0:D)$ • Adding arcs always helps • Maximal scores attained for fully connected network • Such networks overfit the data (i.e., fit the noise in the data)
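The effect can be checked numerically. The sketch below computes the likelihood score from counts; the function name and data layout are assumptions, and the seven (X, Y) instances are the ones used in the marginal likelihood slides of this deck.

```python
import math
from collections import Counter

def likelihood_score(data, parents):
    """Log-likelihood of `data` under structure `parents` with MLE parameters.

    data: list of dicts mapping variable name to value.
    parents: dict mapping each variable to a tuple of its parents.
    """
    score = 0.0
    for var, pa in parents.items():
        family = Counter((tuple(r[p] for p in pa), r[var]) for r in data)
        pa_counts = Counter(tuple(r[p] for p in pa) for r in data)
        for (pa_val, _), n in family.items():
            # MLE of P(var | parents) is the count ratio n / M[pa_val]
            score += n * math.log(n / pa_counts[pa_val])
    return score

data = [{"X": x, "Y": y} for x, y in zip("HTTHTHH", "HTHHTTH")]
g0 = likelihood_score(data, {"X": (), "Y": ()})       # no edge
g1 = likelihood_score(data, {"X": (), "Y": ("X",)})   # X -> Y
print(g0, g1)
```

The gap `g1 - g0` equals M times the empirical mutual information between X and Y, so it is never negative, even when the dependence is only sampling noise.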
Avoiding Overfitting • Classical problem in machine learning • Solutions • Restricting the hypothesis space • Limits the overfitting capability of the learner • Example: restrict # of parents or # of parameters • Minimum description length • Description length measures complexity • Prefer models that compactly describe the training data • Bayesian methods • Average over all possible parameter values • Use prior knowledge
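Bayesian Score • $P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}$ • $P(G)$: prior over structures • $P(D \mid G)$: marginal likelihood • $P(D)$: marginal probability of the data; it does not depend on the network • Bayesian score: $Score_B(G:D) = \log P(D \mid G) + \log P(G)$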
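Bayesian Score: Marginal Likelihood of Data Given G • $P(D \mid G) = \int P(D \mid G, \theta_G)\,P(\theta_G \mid G)\,d\theta_G$ • $P(D \mid G, \theta_G)$: likelihood • $P(\theta_G \mid G)$: prior over parameters • Note the similarity to the maximum likelihood score; the key difference is that ML takes the maximum of the likelihood, whereas here we average the likelihood over the parameter space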
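Marginal Likelihood: Binomial Case • Assume a sequence of M coin tosses • By the chain rule for probabilities: $P(x[1],\dots,x[M]) = P(x[1])\,P(x[2] \mid x[1]) \cdots P(x[M] \mid x[1],\dots,x[M-1])$ • Recall that for Dirichlet priors: $P(x[m+1] = H \mid x[1],\dots,x[m]) = \frac{M_m^H + \alpha_H}{m + \alpha_H + \alpha_T}$ • Where $M_m^H$ is the number of heads in the first m examples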
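Marginal Likelihood: Binomial Case • With $\alpha = \alpha_H + \alpha_T$: $P(x[1],\dots,x[M]) = \frac{\alpha_H \cdots (\alpha_H + M^H - 1)\;\alpha_T \cdots (\alpha_T + M^T - 1)}{\alpha \cdots (\alpha + M - 1)}$ • Simplify using $\Gamma(x+1) = x\,\Gamma(x)$: $P(x[1],\dots,x[M]) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \cdot \frac{\Gamma(\alpha_H + M^H)}{\Gamma(\alpha_H)} \cdot \frac{\Gamma(\alpha_T + M^T)}{\Gamma(\alpha_T)}$ • For multinomials with Dirichlet prior: $P(x[1],\dots,x[M]) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + M)} \prod_{k} \frac{\Gamma(\alpha_k + M[k])}{\Gamma(\alpha_k)}$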
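As a quick check, the chain-rule product and the Gamma-function form should give the same number. The sketch below compares them; the function names and the toss sequence are assumptions chosen for illustration.

```python
import math

def sequential_log_ml(tosses, a_h, a_t):
    """Chain rule: multiply the Dirichlet predictive probabilities."""
    log_p, heads, tails = 0.0, 0, 0
    for t in tosses:
        total = heads + tails + a_h + a_t
        if t == "H":
            log_p += math.log((heads + a_h) / total)
            heads += 1
        else:
            log_p += math.log((tails + a_t) / total)
            tails += 1
    return log_p

def closed_form_log_ml(tosses, a_h, a_t):
    """Gamma-function form of the same marginal likelihood."""
    m_h = tosses.count("H")
    m_t = len(tosses) - m_h
    a = a_h + a_t
    return (math.lgamma(a) - math.lgamma(a + len(tosses))
            + math.lgamma(a_h + m_h) - math.lgamma(a_h)
            + math.lgamma(a_t + m_t) - math.lgamma(a_t))

tosses = list("HTTHTHH")  # 4 heads, 3 tails
print(sequential_log_ml(tosses, 1.0, 1.0))
print(closed_form_log_ml(tosses, 1.0, 1.0))
```

With a Dirichlet(1,1) prior and 4 heads in 7 tosses, both give $\log(4!\,3!/8!) = -\log 280$.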
Marginal Likelihood Example • Actual experiment with P(H) = 0.25 [Figure: (log P(D)) / M versus M, for M from 0 to 50, with one curve per prior: Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5)]
Marginal Likelihood: BayesNets • Network structure determines form of marginal likelihood
Data (7 instances):
m: 1 2 3 4 5 6 7
X: H T T H T H H
Y: H T H H T T H
Network 1 (X and Y with no edge): two Dirichlet marginal likelihoods • P(X[1],…,X[7]) • P(Y[1],…,Y[7])
Marginal Likelihood: BayesNets • Network structure determines form of marginal likelihood
Data (7 instances):
m: 1 2 3 4 5 6 7
X: H T T H T H H
Y: H T H H T T H
Network 2 (X→Y): three Dirichlet marginal likelihoods • P(X[1],…,X[7]) • P(Y[1],Y[4],Y[6],Y[7]) (instances with X = H) • P(Y[2],Y[3],Y[5]) (instances with X = T)
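The two decompositions can be evaluated directly on the seven instances. This is a sketch under an assumed prior: every Dirichlet is taken to be Dirichlet(1,1), which the slides do not specify.

```python
import math

def log_dirichlet_ml(counts, alphas):
    """Log marginal likelihood of one multinomial with a Dirichlet prior."""
    a, m = sum(alphas), sum(counts)
    out = math.lgamma(a) - math.lgamma(a + m)
    for c, al in zip(counts, alphas):
        out += math.lgamma(al + c) - math.lgamma(al)
    return out

def counts(values):
    return [values.count("H"), values.count("T")]

X = list("HTTHTHH")
Y = list("HTHHTTH")
prior = [1.0, 1.0]  # assumed hyperparameters, not given in the slides

# Network 1: X and Y unconnected, so two independent terms.
net1 = log_dirichlet_ml(counts(X), prior) + log_dirichlet_ml(counts(Y), prior)

# Network 2: X -> Y, so Y is split by the value of its parent X.
y_given_h = [y for x, y in zip(X, Y) if x == "H"]   # instances 1, 4, 6, 7
y_given_t = [y for x, y in zip(X, Y) if x == "T"]   # instances 2, 3, 5
net2 = (log_dirichlet_ml(counts(X), prior)
        + log_dirichlet_ml(counts(y_given_h), prior)
        + log_dirichlet_ml(counts(y_given_t), prior))
print(net1, net2)
```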
Idealized Experiment • P(X = H) = 0.5 • P(Y = H | X = H) = 0.5 + p • P(Y = H | X = T) = 0.5 - p [Figure: (log P(D)) / M versus M, for M from 1 to 1000 on a log scale, with curves for Independent and for p = 0.05, 0.10, 0.15, 0.20]
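Marginal Likelihood: BayesNets • The marginal likelihood has the form: $P(D \mid G) = \prod_i \prod_{pa_i^G} \frac{\Gamma(\alpha(pa_i^G))}{\Gamma(\alpha(pa_i^G) + M[pa_i^G])} \prod_{x_i} \frac{\Gamma(\alpha(x_i, pa_i^G) + M[x_i, pa_i^G])}{\Gamma(\alpha(x_i, pa_i^G))}$ • Each inner factor is a Dirichlet marginal likelihood for the sequence of values of $X_i$ when $X_i$'s parents have a particular value • where • M[..] are the counts from the data • $\alpha$(..) are hyperparameters for each family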
Bayesian Score: Priors • Structure prior P(G) • Uniform prior: $P(G) \propto$ constant • Prior penalizing number of edges: $P(G) \propto c^{|G|}$ ($0 < c < 1$) • Normalizing constant across networks is similar and can thus be ignored
Bayesian Score: Priors • Parameter prior $P(\theta \mid G)$ • BDe prior • $M_0$: equivalent sample size • $B_0$: network representing the prior probability of events • Set $\alpha(x_i, pa_i^G) = M_0\,P(x_i, pa_i^G \mid B_0)$ • Note: $pa_i^G$ are not the same as the parents of $X_i$ in $B_0$ • Compute $P(x_i, pa_i^G \mid B_0)$ using standard inference in $B_0$ • BDe has the desirable property that I-equivalent networks have the same Bayesian score when using the BDe prior for some M' and P'
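Bayesian Score: Asymptotic Behavior • For $M \to \infty$, a network G with Dirichlet priors satisfies $\log P(D \mid G) = \log P(D \mid G, \hat\theta_G) - \frac{\log M}{2} Dim(G) + O(1)$ • Dim(G): number of independent parameters in G • Approximation is called BIC score: $Score_{BIC}(G:D) = M\sum_i I_{\hat P}(X_i ; Pa_{X_i}^G) - M\sum_i H_{\hat P}(X_i) - \frac{\log M}{2} Dim(G)$ • Score exhibits tradeoff between fit to data and complexity • Mutual information term grows linearly with M while complexity term grows logarithmically with M • As M grows, more emphasis is given to the fit to the data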
Bayesian Score: Asymptotic Behavior • For $M \to \infty$, a network G with Dirichlet priors satisfies $\log P(D \mid G) = \log P(D \mid G, \hat\theta_G) - \frac{\log M}{2} Dim(G) + O(1)$ • Bayesian score is consistent • As $M \to \infty$, the true structure G* maximizes the score • Spurious edges will not contribute to likelihood and will be penalized • Required edges will be added due to linear growth of likelihood term relative to M compared to logarithmic growth of model complexity
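The tradeoff between fit and complexity can be made concrete with a small helper. This is a sketch with assumed names (`dim_of`, `bic_score`); the log-likelihood is passed in rather than computed, and the per-instance gain of 0.01 nats in the loop is a made-up number for illustration.

```python
import math

def dim_of(parents, cardinality):
    """Number of independent parameters of a table-CPD network."""
    total = 0
    for var, pa in parents.items():
        rows = 1
        for p in pa:
            rows *= cardinality[p]
        total += (cardinality[var] - 1) * rows
    return total

def bic_score(log_likelihood, dim, m):
    """Fit term minus (log M / 2) * Dim(G)."""
    return log_likelihood - 0.5 * math.log(m) * dim

card = {"X": 2, "Y": 2}
dim_g0 = dim_of({"X": (), "Y": ()}, card)      # no edge
dim_g1 = dim_of({"X": (), "Y": ("X",)}, card)  # X -> Y

# Hypothetical gain: the edge improves the log-likelihood by 0.01 nats
# per instance. The extra parameter costs 0.5 * log(M).
for m in (10, 100, 1000, 10000):
    gain = 0.01 * m - 0.5 * math.log(m) * (dim_g1 - dim_g0)
    print(m, round(gain, 3))
```

The printed gain is negative for small M and turns positive as M grows, matching the linear versus logarithmic growth described above.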
Summary: Network Scores • Likelihood, MDL, (log) BDe have the form $Score(G:D) = \sum_i Score(X_i \mid Pa_i^G : D)$ • BDe requires assessing prior network • Can naturally incorporate prior knowledge • BDe is consistent and asymptotically equivalent (up to a constant) to BIC/MDL • All are score-equivalent • G I-equivalent to G' ⇒ Score(G) = Score(G')
Optimization Problem Input: • Training data • Scoring function (including priors, if needed) • Set of possible structures • Including prior knowledge about structure Output: • A network (or networks) that maximize the score Key Property: • Decomposability: the score of a network is a sum of terms.
Learning Trees • Trees • At most one parent per variable • Why trees? • Elegant math • We can solve the optimization problem efficiently (with a greedy algorithm) • Sparse parameterization • Avoid overfitting while adapting to the data
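Learning Trees • Let p(i) denote parent of $X_i$, or 0 if $X_i$ has no parent • We can write the score as $Score(G:D) = \sum_i Score(X_i \mid Pa_i) = \sum_{i:\,p(i)>0} \left( Score(X_i \mid X_{p(i)}) - Score(X_i) \right) + \sum_i Score(X_i)$ • First sum: improvement over “empty” network • Second sum: score of “empty” network • Score = sum of edge scores + constant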
Learning Trees • Algorithm • Construct graph with vertices 1,…,n • Set $w(i \to j) = Score(X_j \mid X_i) - Score(X_j)$ • Find tree (or forest) with maximal weight • This can be done using standard algorithms in low-order polynomial time by building a tree in a greedy fashion (Kruskal’s maximum spanning tree algorithm) • Theorem: Procedure finds the tree with maximal score • When score is likelihood, then $w(i \to j)$ is proportional to $I(X_i ; X_j)$. This is known as the Chow & Liu method
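A compact sketch of the procedure for the likelihood score, where edge weights are empirical mutual informations and the tree is built with Kruskal's algorithm. The function names and the toy data set (a noisy chain over three binary variables) are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information (nats) between columns i and j."""
    m = len(data)
    joint = Counter((row[i], row[j]) for row in data)
    ci = Counter(row[i] for row in data)
    cj = Counter(row[j] for row in data)
    return sum((n / m) * math.log(n * m / (ci[a] * cj[b]))
               for (a, b), n in joint.items())

def chow_liu(data, n_vars):
    """Maximum-weight spanning tree over pairwise mutual information."""
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    root = list(range(n_vars))  # union-find forest for cycle detection

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]  # path compression
            v = root[v]
        return v

    tree = []
    for _, i, j in edges:           # heaviest edges first
        ri, rj = find(i), find(j)
        if ri != rj:                # skip edges that would close a cycle
            root[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: column 1 mostly copies column 0, column 2 mostly copies column 1.
data = ([(0, 0, 0)] * 9 + [(0, 0, 1)] * 3 + [(0, 1, 1)] * 3 + [(0, 1, 0)] * 1
        + [(1, 1, 1)] * 9 + [(1, 1, 0)] * 3 + [(1, 0, 0)] * 3 + [(1, 0, 1)] * 1)
print(chow_liu(data, 3))
```

The returned edges are undirected; any choice of root gives a tree with the same likelihood score.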
Learning Trees: Example [Figure: the Alarm network next to the tree learned from data sampled from it, with correct edges and spurious edges marked] • Tree learned from data of Alarm network • Not every edge in the tree is in the original network • Tree direction is arbitrary: we can’t learn about arc direction
Beyond Trees • Problem is not easy for more complex networks • Example: Allowing two parents, greedy algorithm is no longer guaranteed to find the optimal network • In fact, no efficient algorithm is known • Theorem: • Finding maximal scoring network structure with at most k parents for each variable is NP-hard for k>1
Fixed Ordering • For any decomposable scoring function Score(G:D) and ordering $\prec$, the maximal scoring network has: $Pa_{X_i}^G = \arg\max_{U \subseteq \{X_j : X_j \prec X_i\}} Score(X_i \mid U : D)$ • For a fixed ordering we have n independent problems, one per variable • If we bound the in-degree per variable by d, then complexity is exponential in d (since the choice at $X_i$ does not constrain other choices)
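With a fixed ordering, each variable's parent set can be chosen on its own. In the sketch below `family_score(x, u)` is a hypothetical callable standing for any decomposable family score, and `toy_score` is a made-up example that rewards the edges A→B and B→C and charges one unit per parent.

```python
from itertools import combinations

def best_parents_given_order(order, family_score, d):
    """For each variable, pick the best-scoring set of at most d predecessors."""
    result = {}
    for pos, x in enumerate(order):
        preds = order[:pos]
        candidates = [u for size in range(min(d, len(preds)) + 1)
                      for u in combinations(preds, size)]
        # Each choice is independent of the others: any subset of
        # predecessors keeps the graph acyclic.
        result[x] = max(candidates, key=lambda u: family_score(x, u))
    return result

def toy_score(x, u):
    """Hypothetical family score used only for this example."""
    bonus = 0
    if x == "B" and "A" in u:
        bonus += 3
    if x == "C" and "B" in u:
        bonus += 3
    return bonus - len(u)

print(best_parents_given_order(["A", "B", "C"], toy_score, d=2))
```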
Heuristic Search We address the problem by using heuristic search • Define a search space: • nodes are possible structures • edges denote adjacency of structures • Traverse this space looking for high-scoring structures • Search techniques: • Greedy hill-climbing • Best first search • Simulated Annealing • ...
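Heuristic Search • Typical operations: • Add an edge (in the figure, between B and D) • Reverse an edge (in the figure, between C and B) • Delete an edge (in the figure, between B and C) [Figure: a network over A, B, C, D and the three networks produced by these operations]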
Exploiting Decomposability • Caching: To update the score after a local change, we only need to rescore the families that were changed in the last move [Figure: networks over A, B, C, D before and after local changes]
Greedy Hill Climbing • Simplest heuristic local search • Start with a given network • empty network • best tree • a random network • At each iteration • Evaluate all possible changes • Apply change that leads to best improvement in score • Reiterate • Stop when no modification improves score • With cached family scores, each step requires evaluating only O(n) new changes, although the full neighborhood contains $O(n^2)$ candidate edge changes
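The loop above can be sketched as follows. This is a deliberately simple version under stated assumptions: it starts from the empty network, rescores whole candidates instead of caching family scores, and uses a hypothetical `family_score(x, parent_set)` callable. `toy_score` is a made-up score for the example.

```python
from itertools import permutations

def is_acyclic(parents):
    """Depth-first search along parent links; a revisit in progress is a cycle."""
    state = {}
    def visit(v):
        if state.get(v) == 1:
            return False
        if state.get(v) == 2:
            return True
        state[v] = 1
        ok = all(visit(p) for p in parents[v])
        state[v] = 2
        return ok
    return all(visit(v) for v in parents)

def neighbors(parents):
    """Yield every acyclic graph one edge addition, deletion or reversal away."""
    for x, y in permutations(parents, 2):
        new = {v: set(ps) for v, ps in parents.items()}
        if x in parents[y]:
            new[y].discard(x)                  # delete x -> y
            yield new
            rev = {v: set(ps) for v, ps in new.items()}
            rev[x].add(y)                      # reverse it to y -> x
            if is_acyclic(rev):
                yield rev
        else:
            new[y].add(x)                      # add x -> y
            if is_acyclic(new):
                yield new

def hill_climb(variables, family_score):
    parents = {v: set() for v in variables}
    def total(g):
        return sum(family_score(v, frozenset(ps)) for v, ps in g.items())
    current = total(parents)
    while True:
        best, best_score = None, current
        for cand in neighbors(parents):
            s = total(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:                       # no move improves the score
            return parents, current
        parents, current = best, best_score

def toy_score(x, u):
    """Hypothetical family score: rewards A -> B and B -> C, costs 1 per parent."""
    return 3 * (x == "B" and "A" in u) + 3 * (x == "C" and "B" in u) - len(u)

print(hill_climb(["A", "B", "C"], toy_score))
```

A practical implementation would cache the score change of each move and recompute only those touching the families modified in the last step.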
Greedy Hill Climbing Pitfalls • Greedy Hill-Climbing can get stuck in: • Local Maxima • All one-edge changes reduce the score • Plateaus • Some one-edge changes leave the score unchanged • Happens because equivalent networks receive the same score and are neighbors in the search space • Both occur during structure search • Standard heuristics can escape both • Random restarts • TABU search
Equivalence Class Search • Idea • Search the space of equivalence classes • Equivalence classes can be represented by PDAGs (partially directed acyclic graphs) • Advantages • PDAG space has fewer local maxima and plateaus • There are fewer PDAGs than DAGs
Equivalence Class Search • Evaluating changes is more expensive • In addition to search, need to score a consistent network • These algorithms are more complex to implement [Figure: an original PDAG over X, Y, Z; the new PDAG after the operation "Add Y-Z" (undirected edge); a consistent DAG, which is the graph that gets scored]
Learning Example: Alarm Network [Figure: KL divergence (0 to 2) versus M (0 to 5000), with one curve for True Structure/BDe M' = 10 and one for Unknown Structure/BDe M' = 10]
Model Selection • So far, we focused on a single model • Find best scoring model • Use it to predict next example • Implicit assumption: • Best scoring model dominates the weighted sum • Valid with many data instances • Pros: • We get a single structure • Allows for efficient use in our tasks • Cons: • We are committing to the independencies of a particular structure • Other structures might be as probable given the data
Model Selection • Density estimation • One structure may suffice, if its joint distribution is similar to other high scoring structures • Structure discovery • Define features f(G) (e.g., edge, sub-structure, d-sep query) • Compute $P(f \mid D) = \sum_G f(G)\,P(G \mid D)$ • Still requires summing over exponentially many structures
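Model Averaging Given an Order • Assumptions • Known total order of variables $\prec$ • Maximum in-degree for variables d • Marginal likelihood • $P(D \mid \prec) = \sum_{G \in \mathcal{G}_{d,\prec}} P(G \mid \prec)\,P(D \mid G)$ • Using decomposability assumption on prior $P(G \mid \prec)$: $= \sum_{G \in \mathcal{G}_{d,\prec}} \prod_i \exp\{FamScore_B(X_i \mid Pa_{X_i}^G : D)\}$, where $FamScore_B$ is the log of a family's prior term times its marginal likelihood • Since given ordering $\prec$, parent choices are independent: $= \prod_i \sum_{U \in \mathcal{U}_{i,\prec}} \exp\{FamScore_B(X_i \mid U : D)\}$, where $\mathcal{U}_{i,\prec}$ contains the sets of at most d variables that precede $X_i$ • Cost per family: $O(n^d)$ • Total cost: $O(n^{d+1})$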
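Model Averaging Given an Order • Posterior probability of a general feature: $P(f \mid \prec, D) = \frac{P(f, D \mid \prec)}{P(D \mid \prec)}$ • Posterior probability of particular choice of parents: $P(Pa_{X_i}^G = U \mid D, \prec) = \frac{\exp\{FamScore_B(X_i \mid U : D)\}}{\sum_{U' \in \mathcal{U}_{i,\prec}} \exp\{FamScore_B(X_i \mid U' : D)\}}$ • Posterior probability of particular edge choice: $P(X_j \in Pa_{X_i}^G \mid D, \prec) = \frac{\sum_{U \in \mathcal{U}_{i,\prec} : X_j \in U} \exp\{FamScore_B(X_i \mid U : D)\}}{\sum_{U' \in \mathcal{U}_{i,\prec}} \exp\{FamScore_B(X_i \mid U' : D)\}}$ • All terms belonging to the other families cancel out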
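Because only one family's terms survive, an edge posterior given an order is a ratio of two sums over candidate parent sets. The sketch below assumes a hypothetical `family_score(target, u)` that returns a log-scale family score; the table of toy weights is made up for the example.

```python
import math
from itertools import combinations

def edge_posterior(target, source, preds, family_score, d):
    """P(source is a parent of target | D, order), summing over parent sets.

    preds: variables preceding `target` in the order.
    family_score(target, u): hypothetical log-scale score of parent set u.
    """
    subsets = [u for size in range(min(d, len(preds)) + 1)
               for u in combinations(preds, size)]
    scores = [family_score(target, u) for u in subsets]
    top = max(scores)                       # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    with_edge = sum(w for u, w in zip(subsets, weights) if source in u)
    return with_edge / sum(weights)

# Made-up unnormalized weights for the parent sets of C.
toy_weights = {(): 1.0, ("A",): 1.0, ("B",): 6.0, ("A", "B"): 2.0}

def toy_family_score(target, u):
    return math.log(toy_weights[u])

p = edge_posterior("C", "B", ["A", "B"], toy_family_score, d=2)
print(p)
```

Here the sets containing B carry weight 6 + 2 out of a total of 10, so the posterior of the edge from B to C is 0.8.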