Learning Bayesian Network Structure from Massive Datasets: The ``Sparse Candidate'' Algorithm
Nir Friedman, Dana Pe'er, Iftach Nachman
Institute of Computer Science, Hebrew University, Jerusalem
Learning Bayesian Network Structure (Complete Data)
[Figure: data fed into an inducer, which outputs a network over variables B, E, R, A, C]
• Set a scoring function that evaluates networks
• Find the highest scoring network
• This optimization problem is NP-hard [Chickering] ⇒ use heuristic search
Our Contribution
• We suggest a new heuristic
• Builds on simple ideas
• Easy to implement
• Can be combined with existing heuristic search procedures
• Reduces learning time significantly
• Also gives some insight into the complexity of the learning problem
Learning Bayesian Network Structure: Score
• Various variants of scores
• We focus here on the Bayesian score [Cooper & Herskovits; Heckerman, Geiger & Chickering]
• Key property for search: the score decomposes:
  Score(G : D) = Σi Score(Xi, PaiG : N(Xi, PaiG))
where N(Xi, PaiG) is the vector of counts of joint values of Xi and its parents in G, collected from the data
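To make decomposability concrete, here is a minimal sketch of one family's Bayesian score computed from those counts; the uniform Dirichlet prior and the function names are illustrative assumptions, not necessarily the exact BDe prior used in the talk.

```python
# Sketch only: a decomposable Bayesian family score from counts.
# The uniform Dirichlet prior (alpha split evenly over child values)
# is an assumed, illustrative choice.
from collections import Counter
from math import lgamma

def family_score(data, child, parents, r_child, alpha=1.0):
    """Log marginal likelihood of one family (child given its parents).

    data    : list of dicts mapping variable name -> discrete value
    r_child : number of values the child variable can take
    """
    # N[pa][x] = count of rows with parent configuration pa and child value x
    N = {}
    for row in data:
        pa = tuple(row[p] for p in parents)
        N.setdefault(pa, Counter())[row[child]] += 1

    a_x = alpha / r_child          # per-value Dirichlet hyperparameter
    score = 0.0
    for child_counts in N.values():
        n_pa = sum(child_counts.values())
        score += lgamma(alpha) - lgamma(alpha + n_pa)
        for n_x in child_counts.values():
            score += lgamma(a_x + n_x) - lgamma(a_x)
    return score

# The network score is then just a sum over families:
#   total = sum(family_score(data, x, parents_of[x], r[x]) for x in variables)
```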
Heuristic Search in Learning Networks
[Figure: add, remove, and reverse operations applied to arcs among variables A, B, C]
• Search over network structures
• Standard operations: add, delete, reverse an arc
• Need to check acyclicity
• Use standard search methods in this space: greedy hill climbing, simulated annealing, etc.
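A compact sketch of such a greedy hill-climbing loop, assuming a decomposable `score_of(x, parent_set)` function like the family score above; the reverse operator and score caching are omitted for brevity.

```python
# Sketch only: greedy hill climbing with add/delete moves.
# `score_of(x, pa)` is an assumed decomposable family score,
# e.g. a wrapper around family_score above.
def creates_cycle(parents, src, dst):
    """True if adding the arc src -> dst would close a directed cycle,
    i.e. if dst already has a directed path to src."""
    stack, seen = [dst], {dst}
    while stack:
        v = stack.pop()
        if v == src:
            return True
        for child, pa in parents.items():        # children of v
            if v in pa and child not in seen:
                seen.add(child)
                stack.append(child)
    return False

def hill_climb(variables, score_of):
    parents = {x: set() for x in variables}      # start from the empty network
    while True:
        best_delta, best_move = 0.0, None
        for x in variables:
            base = score_of(x, parents[x])
            for y in variables:
                if y == x:
                    continue
                if y in parents[x]:                         # delete y -> x
                    delta = score_of(x, parents[x] - {y}) - base
                elif not creates_cycle(parents, y, x):      # add y -> x
                    delta = score_of(x, parents[x] | {y}) - base
                else:
                    continue
                if delta > best_delta:
                    best_delta, best_move = delta, (y, x)
        if best_move is None:                    # local maximum reached
            return parents
        y, x = best_move
        parents[x] ^= {y}                        # toggle the chosen arc
```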
Computational Problem
Cost of evaluating a single move
• Collecting the counts N(xi, pai) is O(M) (M = # of examples)
• Using caching we can save some of these computations
Number of possible moves
• Number of possible moves is O(N²) (N = number of variables)
• After performing a move, O(N) new moves to be evaluated
Total
• Each iteration of greedy HC costs O(MN)
⇒ Most of the time is spent evaluating irrelevant moves
Idea #1: Restrict to Few Candidates
[Figure: three variables with candidate sets C(A) = {B}, C(B) = {A}, C(C) = {A, B}; arcs B→A and A→C are allowed, while C→A and C→B are ruled out]
• For each X, select a small set of candidates C(X)
• Consider arcs Y→X only if Y is in C(X)
If we restrict to k candidates for each variable, then:
• only O(kN) possible moves for each network
• in greedy HC, only O(k) new moves to evaluate in each iteration
• cost of each iteration is O(Mk)
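The only change this requires in the hill-climbing sketch above is the move generation: scan the k candidates of X instead of all N-1 other variables. A minimal sketch, reusing `creates_cycle` from above:

```python
# Sketch only: legal moves under candidate sets C(X).
def restricted_moves(variables, candidates, parents):
    """Yield the add/delete moves that respect the candidate sets:
    an arc y -> x is considered only when y is in C(x)."""
    for x in variables:
        for y in candidates[x]:                  # k candidates instead of N-1
            if y in parents[x]:
                yield ('delete', y, x)
            elif not creates_cycle(parents, y, x):
                yield ('add', y, x)
```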
How to Select Candidates?
• Simple proposal:
• Rank candidates Y by their mutual information I(X;Y) with X
• This measures how many bits we can save in the encoding of X if we take Y into account
• Select the top k ranking variables for C(X)
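A minimal sketch of this initial selection: compute empirical pairwise mutual information and keep the top k variables for each C(X). The function names are illustrative.

```python
# Sketch only: initial candidate selection by pairwise mutual information.
from collections import Counter
from math import log2

def mutual_information(data, x, y):
    """Empirical I(X;Y) in bits over a list of dict-like records."""
    n = len(data)
    cxy = Counter((row[x], row[y]) for row in data)
    cx = Counter(row[x] for row in data)
    cy = Counter(row[y] for row in data)
    return sum((c / n) * log2(c * n / (cx[vx] * cy[vy]))
               for (vx, vy), c in cxy.items())

def initial_candidates(data, variables, k):
    """C(X) = the k variables with highest empirical I(X;Y)."""
    return {x: set(sorted((y for y in variables if y != x),
                          key=lambda y: mutual_information(data, x, y),
                          reverse=True)[:k])
            for x in variables}
```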
Effect of Candidate Number on Search
[Figure: Score (BDe/M) vs. time (sec), comparing greedy HC against HC with k = 5, 10, 15 candidates and Chow–Liu (C+L); annotations mark the empty network and the up-front computation of all pairwise statistics. Text domain with 100 vars, 10,000 instances.]
Problems with Candidate Selection
[Figure: fragment of the "Alarm" network, with nodes such as INTUBATION, VENTLUNG, MINVOL, SHUNT, ARTCO2, SAO2, CATECHOL, HR, BP]
Idea #2: Iteratively Improve Candidates
[Figure: "Alarm" fragment around INTUBATION, MINVOL, and SHUNT]
• Once we have a partial understanding of the domain, we can use it to select new candidates:
• "current" parents +
• most promising candidates given the current structure
• If INTUBATION is a parent of SHUNT, then MINVOL is less informative about SHUNT
(A sketch of this outer loop follows below.)
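Putting the two ideas together, here is a rough sketch of the outer iteration; `select_candidates` and `restricted_search` stand in for the candidate measures and restricted search described on the surrounding slides, and the fixed-point stopping rule is one simple convergence choice, not necessarily the talk's.

```python
# Sketch only: the outer "iteratively improve candidates" loop.
# `select_candidates(parents, k)` is assumed to return, for each X, its
# current parents plus the most promising other variables under one of
# the measures discussed next; `restricted_search` is e.g. hill climbing
# over the moves yielded by restricted_moves above.
def sparse_candidate(variables, k, select_candidates, restricted_search):
    parents = {x: set() for x in variables}
    while True:
        cand = select_candidates(parents, k)     # current parents + promising vars
        new_parents = restricted_search(variables, cand)
        if new_parents == parents:               # no structural change: stop
            return parents
        parents = new_parents
```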
Comparing Potential Candidates
Intuition:
• X should be Markov-shielded by its parents PaX
• Shielding: use conditional information
• Does adding Y to X's parents improve prediction?
• I(X;Y|PaX) = 0 iff X is independent of Y given PaX
• Score: use the difference in score
• Use Score(X|Y) as an estimate of -H(X|Y) in the generating distribution
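A sketch of the shielding measure: empirical conditional mutual information I(X;Y | PaX), which should approach zero once the current parents block Y.

```python
# Sketch only: empirical conditional mutual information I(X;Y | Pa).
from collections import Counter
from math import log2

def conditional_mi(data, x, y, pa):
    """I(X;Y | Pa) in bits; pa is the tuple of X's current parents.
    Near zero when the parents shield X from Y."""
    n = len(data)
    z = lambda row: tuple(row[v] for v in pa)    # parent configuration
    cxyz = Counter((row[x], row[y], z(row)) for row in data)
    cxz = Counter((row[x], z(row)) for row in data)
    cyz = Counter((row[y], z(row)) for row in data)
    cz = Counter(z(row) for row in data)
    return sum((c / n) * log2(c * cz[vz] / (cxz[(vx, vz)] * cyz[(vy, vz)]))
               for (vx, vy, vz), c in cxyz.items())
```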
"Alarm" example revisited
[Figure: the "Alarm" network fragment, shown over two consecutive slides]
Alternative Criterion: Discrepancy
• Idea:
• Measure how well the network models the joint P(X,Y)
• We can improve this prediction by making X a candidate parent of Y
• Natural definition: d(X,Y|B) = DKL( P̂(X,Y) || PB(X,Y) ), the KL divergence between the empirical joint and the joint induced by the current network B
• Note, if PB(X,Y) = P(X)P(Y), then d(X,Y|B) = I(X;Y)
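A sketch of this measure under the reconstruction above; `model_joint` is an assumed helper returning PB(X=vx, Y=vy), e.g. computed by inference in the current network B.

```python
# Sketch only: discrepancy d(X,Y|B) as the KL divergence between the
# empirical joint and the network's joint. `model_joint(vx, vy)` is an
# assumed helper returning P_B(X=vx, Y=vy) via inference in B.
from collections import Counter
from math import log2

def discrepancy(data, x, y, model_joint):
    n = len(data)
    emp = Counter((row[x], row[y]) for row in data)
    return sum((c / n) * log2((c / n) / model_joint(vx, vy))
               for (vx, vy), c in emp.items())
```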
Text with 100 words
[Figure: Score (BDe/M) vs. time (sec, up to ~2000), comparing greedy HC with the Disc, Score, and Shld candidate measures at k = 15]
Text with 200 words
[Figure: Score (BDe/L) vs. time (sec, up to ~9000), comparing greedy HC with the Disc, Score, and Shld candidate measures at k = 15]
Cell Cycle (800 vars)
[Figure: Score (BDe/L) vs. time (sec, up to ~20,000), comparing greedy HC with the Disc, Score, and Shld candidate measures at k = 20; an inset zooms in on scores between -414 and -418 over 4000–8000 sec]
Complexity of Structure Learning
Without restriction of the candidate sets:
• Restricting |Pai| ≤ 1 ⇒ problem is easy [Chow & Liu; Heckerman et al.]
• No restriction ⇒ problem is NP-hard [Chickering]
• Even when restricting |Pai| ≤ 2
• We do not know of interesting intermediate problems
• Such behavior is often called the "exponential cliff"
Complexity with Small Candidate Sets
In each iteration, we solve an optimization problem:
• Given candidate sets C(X1), …, C(XN), find the best-scoring network that respects these candidates
Is this problem easier than unconstrained structure learning?
Complexity with Small Candidate Sets
Theorem: If |C(Xi)| > 1, finding the best-scoring structure is NP-hard
But…
• The complexity function grows gradually
• There is a parameter c s.t. the time complexity is
• exponential in c
• linear in N
• Fix d. There is a polynomial procedure that can solve all instances with c < d
• Similar situation in inference: exponential in the size of the largest clique in the triangulated graph, linear in N
Complexity Proof Outline
[Figure: clique tree over the sets {A,B,C,D}, {A,B,E,F}, {B,E,G}, with separators {A,B} and {B,E}]
In fact, the algorithm is motivated by inference:
• Define the "candidate graph" where Y → X if Y ∈ C(X)
• Then create a clique tree (moralize & triangulate)
• We then define a dynamic-programming algorithm for constructing the best-scoring structure
• Messages assign values to different orderings of the variables in a separator
• The ordering ensures acyclicity of the network
Future Work
• Quadratic cost of candidate selection
• The initial step requires O(N²) pairwise statistics
• Can we select candidates by looking at a smaller number, e.g., O(N log N), of pairwise statistics?
• Choice of number of candidates
• We used a fixed number of candidates
• Can we decide on the candidate number more intelligently?
• Deal with variables that have large in+out degree
• Combine candidates with PDAG search
Summary
• Heuristic for structure search
• Incorporates understanding of BNs into blind search
• Drastically reduces the size of the search space ⇒ faster search that requires fewer statistics
• Empirical evaluation
• We present evaluation on several datasets
• Variants of the algorithm used in
• [Boyen, Friedman & Koller] for temporal models with SEM
• [Friedman, Getoor, Koller & Pfeffer] for relational models
• Complexity analysis
• Computational subproblem where structure search might be tractable even beyond trees