
Learning Bayesian Network Structure from Massive Datasets: The "Sparse Candidate" Algorithm


Presentation Transcript


  1. Learning Bayesian Network Structure from Massive Datasets: The "Sparse Candidate" Algorithm
     Nir Friedman, Dana Pe'er, Iftach Nachman
     Institute of Computer Science, Hebrew University, Jerusalem

  2. Learning Bayesian Network Structure (Complete Data)
     [Diagram: Data → Inducer → a Bayesian network over the nodes B, E, R, A, C]
     • Set a scoring function that evaluates networks
     • Find the highest-scoring network
     • This optimization problem is NP-hard [Chickering], so we use heuristic search

  3. Our Contribution
     • We suggest a new heuristic that:
       • builds on simple ideas
       • is easy to implement
       • can be combined with existing heuristic search procedures
       • reduces learning time significantly
     • We also gain some insight into the complexity of the learning problem

  4. Learning Bayesian Network Structure: Score
     • There are various variants of scores
     • We focus here on the Bayesian score [Cooper & Herskovits; Heckerman, Geiger & Chickering]
     • Key property for search: the score decomposes,
       Score(G : D) = Σi Score(Xi, PaiG : N(Xi, PaiG)),
       where N(Xi, PaiG) is a vector of counts of the joint values of Xi and its parents in G in the data
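
As a concrete illustration of decomposability, the following is a minimal sketch of one Bayesian family score (the BDeu variant with a uniform equivalent sample size alpha); the data layout (a list of rows mapping variable names to discrete values) and all function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a decomposable Bayesian (BDeu) score.
# Assumptions: data is a list of dicts {variable: discrete value},
# arity maps each variable to its number of states.
from collections import Counter, defaultdict
from math import lgamma

def bdeu_family_score(data, x, parents, arity, alpha=1.0):
    """Contribution of the family (x, parents) to the network score."""
    r = arity[x]                       # number of states of x
    q = 1
    for p in parents:                  # number of parent configurations
        q *= arity[p]
    counts = defaultdict(Counter)      # N_ijk: counts[parent config][x value]
    for row in data:
        counts[tuple(row[p] for p in parents)][row[x]] += 1
    a_j, a_jk = alpha / q, alpha / (q * r)
    score = 0.0
    for child_counts in counts.values():
        n_j = sum(child_counts.values())
        score += lgamma(a_j) - lgamma(a_j + n_j)
        for n_jk in child_counts.values():
            score += lgamma(a_jk + n_jk) - lgamma(a_jk)
    return score

def network_score(data, parents_of, arity, alpha=1.0):
    # Decomposability: the network score is a sum of independent family terms.
    return sum(bdeu_family_score(data, x, pa, arity, alpha)
               for x, pa in parents_of.items())
```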

  5. Heuristic Search in Learning Networks
     [Diagram: local moves on a small network over A, B, C: adding an arc, removing an arc, and reversing an arc]
     • Search over network structures
     • Standard operations: add, delete, reverse an arc
     • Need to check acyclicity
     • Use standard search methods in this space: greedy hill climbing, simulated annealing, etc.
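
To make the search loop concrete, here is a rough sketch of greedy hill climbing with add and delete arc moves and an explicit acyclicity check (arc reversal is left out for brevity); `score_family` stands for any decomposable family score, e.g. the BDeu sketch above, and all names are assumptions rather than the authors' code.

```python
# Sketch of greedy hill climbing over structures (add/delete moves only).
def creates_cycle(parents_of, frm, to):
    # Adding frm -> to creates a cycle iff "to" is already an ancestor of "frm".
    stack, seen = [frm], set()
    while stack:
        v = stack.pop()
        if v == to:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents_of[v])
    return False

def greedy_hill_climb(data, variables, arity, score_family):
    parents_of = {x: set() for x in variables}   # start from the empty network
    cache = {x: score_family(data, x, parents_of[x], arity) for x in variables}
    improved = True
    while improved:
        improved, best_delta, best_move = False, 0.0, None
        for y in variables:                      # child of the candidate arc
            for x in variables:                  # parent of the candidate arc
                if x == y:
                    continue
                if x in parents_of[y]:                     # delete move
                    new_parents = parents_of[y] - {x}
                elif not creates_cycle(parents_of, x, y):  # add move
                    new_parents = parents_of[y] | {x}
                else:
                    continue
                delta = score_family(data, y, new_parents, arity) - cache[y]
                if delta > best_delta:
                    best_delta, best_move = delta, (y, new_parents)
        if best_move is not None:
            y, new_parents = best_move
            parents_of[y] = new_parents
            cache[y] = score_family(data, y, new_parents, arity)
            improved = True
    return parents_of
```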

  6. Computational Problem
     Cost of evaluating a single move
     • Collecting the counts N(xi, pai) is O(M) (M = number of examples)
     • Using caching we can save some of these computations
     Number of possible moves
     • The number of possible moves is O(N²) (N = number of variables)
     • After performing a move, there are O(N) new moves to be evaluated
     Total
     • Each iteration of greedy hill climbing costs O(MN)
     Most of the time is spent evaluating irrelevant moves

  7. Idea #1: Restrict to Few Candidates
     [Example: for nodes A, B, C with C(A) = {B}, C(B) = {A}, C(C) = {A, B}, the arcs B→A and A→C are allowed, while C→A and C→B are not]
     • For each X, select a small set of candidate parents C(X)
     • Consider arcs Y → X only if Y is in C(X)
     • If we restrict to k candidates for each variable, then:
       • there are only O(kN) possible moves for each network
       • in greedy hill climbing, only O(k) new moves need to be evaluated in each iteration
       • the cost of each iteration is O(Mk)
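
A tiny sketch of how the candidate restriction shrinks the move space: only arcs Y → X with Y in C(X) are ever proposed, so there are at most kN additions to consider. This reuses the hypothetical `creates_cycle` helper from the hill-climbing sketch above; the names are assumptions.

```python
def legal_add_moves(parents_of, candidates):
    # Arcs y -> x are proposed only if y is in C(x): O(kN) moves instead of O(N^2).
    return [(y, x)
            for x, cand in candidates.items()
            for y in cand
            if y not in parents_of[x] and not creates_cycle(parents_of, y, x)]
```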

  8. How to Select Candidates?
     • Simple proposal: rank candidates by their mutual information with X
       • this measures how many bits we can save in the encoding of X if we take Y into account
     • Select the top k ranking variables for C(X)
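
A small sketch of this initial candidate-selection step: estimate the pairwise mutual information of every other variable with X from counts and keep the top k. The data layout and function names follow the earlier sketches and are assumptions.

```python
# Rank candidate parents of each variable by empirical mutual information.
from collections import Counter
from math import log

def mutual_information(data, x, y):
    n = len(data)
    pxy = Counter((row[x], row[y]) for row in data)
    px = Counter(row[x] for row in data)
    py = Counter(row[y] for row in data)
    # I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum((nxy / n) * log(nxy * n / (px[vx] * py[vy]))
               for (vx, vy), nxy in pxy.items())

def initial_candidates(data, variables, k):
    return {x: sorted((y for y in variables if y != x),
                      key=lambda y: mutual_information(data, x, y),
                      reverse=True)[:k]
            for x in variables}
```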

  9. Effect of Candidate Number on Search
     [Plot: Score (BDe/M) vs. Time (sec) for HC, HC k=5, HC k=10, HC k=15, and C+L, with markers for the empty network and for the point where all pairwise statistics have been computed]
     Text domain with 100 vars, 10,000 instances

  10. Problems with Candidate Selection
     • Fragment of the "alarm" network
     [Figure: a fragment of the "alarm" network, showing INTUBATION, VENTLUNG, MINVOL, SHUNT, and neighboring nodes]

  11. Idea #2: Iteratively Improve Candidates
     [Figure: the same "alarm" fragment, with INTUBATION, VENTLUNG, MINVOL, SHUNT, and related nodes]
     • Once we have a partial understanding of the domain, we might use it to select new candidates:
       • the "current" parents, plus
       • the most promising candidates given the current structure
     • If INTUBATION is a parent of SHUNT, then MINVOL is less informative about SHUNT
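
A hedged outline of the iterative scheme this slide suggests: alternate between selecting candidate parents given the current network and searching for a better structure restricted to those candidates, until the score stops improving. `select_candidates`, `restricted_search`, and `score_network` are placeholders for the measures and search procedures discussed in the surrounding slides, not the authors' code.

```python
def sparse_candidate(data, variables, arity, k, select_candidates,
                     restricted_search, score_network, tol=1e-6):
    network = {x: set() for x in variables}          # start from the empty graph
    best = score_network(data, network, arity)
    while True:
        # Choose candidate parents given the data and the current structure.
        cand = select_candidates(data, variables, network, k)
        # Search for a better structure restricted to those candidates.
        network = restricted_search(data, network, cand, arity)
        current = score_network(data, network, arity)
        if current <= best + tol:                    # no further improvement
            break
        best = current
    return network
```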

  12. Comparing Potential Candidates
     Intuition: X should be Markov-shielded by its parents PaX
     • Shielding: use conditional information. Does adding Y to X's parents improve prediction?
       • I(X;Y | PaX) = 0 iff X is independent of Y given PaX
     • Score: use the difference in score
       • Use Score(X|Y) as an estimate of -H(X|Y) in the generating distribution
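
A sketch of the shielding measure from this slide: the conditional mutual information I(X;Y | PaX), estimated from counts (values near zero suggest Y adds little beyond X's current parents). As before, the data layout is an assumption, and in practice this estimate needs smoothing when parent configurations are sparse.

```python
from collections import Counter
from math import log

def conditional_mutual_information(data, x, y, parents):
    """Empirical I(X;Y | parents) over the rows of `data`."""
    n = len(data)
    def key(r):                                      # parent configuration
        return tuple(r[p] for p in parents)
    nzxy = Counter((key(r), r[x], r[y]) for r in data)
    nzx = Counter((key(r), r[x]) for r in data)
    nzy = Counter((key(r), r[y]) for r in data)
    nz = Counter(key(r) for r in data)
    # I(X;Y|Z) = sum p(z,x,y) * log( p(z) p(z,x,y) / (p(z,x) p(z,y)) )
    return sum((c / n) * log(c * nz[z] / (nzx[(z, vx)] * nzy[(z, vy)]))
               for (z, vx, vy), c in nzxy.items())
```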

  13. "Alarm" example revisited
     [Figure: the "alarm" network fragment with candidate parents marked]

  14. "Alarm" example revisited
     [Figure: the "alarm" network fragment with candidate parents marked]

  15. Alternative Criterion: Discrepancy
     • Idea: measure how well the network models the joint P(X,Y); we can improve this prediction by making X a candidate parent of Y
     • Natural definition: d(X,Y | B) = Σx,y P(x,y) log [ P(x,y) / PB(x,y) ],
       where P is the empirical joint and PB is the joint represented by the network B
     • Note: if PB(X,Y) = P(X)P(Y), then d(X,Y | B) = I(X;Y)
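
A small sketch of the discrepancy measure as written above: the KL divergence between the empirical joint of (X, Y) and the joint that the current network B assigns to them. `model_joint` is a placeholder for whatever inference routine computes PB(X=vx, Y=vy); it is an assumption, and it must be strictly positive on the observed configurations.

```python
from collections import Counter
from math import log

def discrepancy(data, x, y, model_joint):
    """d(X,Y|B) = KL( empirical P(X,Y) || P_B(X,Y) )."""
    n = len(data)
    emp = Counter((row[x], row[y]) for row in data)
    return sum((c / n) * log((c / n) / model_joint(vx, vy))
               for (vx, vy), c in emp.items())
```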

  16. Text with 100 words
     [Plot: Score (BDe/M) vs. Time (sec) for Greedy HC, Disc k=15, Score k=15, and Shld k=15]

  17. Text with 200 words
     [Plot: Score (BDe/L) vs. Time (sec) for Greedy HC, Disc k=15, Score k=15, and Shld k=15]

  18. Cell Cycle (800 vars)
     [Plot: Score (BDe/L) vs. Time (sec) for Greedy HC, Disc k=20, Score k=20, and Shld k=20, with a zoomed-in inset]

  19. Complexity of Structure Learning
     Without restriction of the candidate sets:
     • Restricting |Pai| ≤ 1: the problem is easy [Chow+Liu; Heckerman+al]
     • No restriction: the problem is NP-hard [Chickering]
       • even when restricting |Pai| ≤ 2
     • We do not know of interesting intermediate problems
     • Such behavior is often called the "exponential cliff"

  20. Complexity with Small Candidate Sets
     In each iteration, we solve an optimization problem:
     • Given candidate sets C(X1), …, C(XN), find the best-scoring network that respects these candidates
     Is this problem easier than unconstrained structure learning?

  21. Complexity with Small Candidate Sets
     Theorem: if |C(Xi)| > 1, finding the best-scoring structure is NP-hard
     But…
     • The complexity grows gradually: there is a parameter c such that the time complexity is
       • exponential in c
       • linear in N
     • Fix d: there is a polynomial procedure that can solve all instances with c < d
     • The situation is similar to inference: exponential in the size of the largest clique of the triangulated graph, linear in N

  22. Complexity Proof Outline
     [Figure: a clique tree with cliques A,B,C,D / A,B,E,F / B,E,G and separators A,B and B,E]
     In fact, the algorithm is motivated by inference:
     • Define the "candidate graph", in which Y → X if Y ∈ C(X)
     • Then create a clique tree (moralize and triangulate)
     • We then define a dynamic programming algorithm for constructing the best-scoring structure
       • messages assign values to the different orderings of the variables in a separator
       • the ordering ensures acyclicity of the network

  23. Future Work
     • Quadratic cost of candidate selection
       • the initial step requires O(N²) pairwise statistics
       • can we select candidates by looking at a smaller number, e.g. O(N log N), of pairwise statistics?
     • Choice of the number of candidates
       • we used a fixed number of candidates
       • can we decide on the candidate number more intelligently?
     • Deal with variables that have a large in+out degree
     • Combine candidates with PDAG search

  24. Summary
     • A heuristic for structure search
       • incorporates understanding of BNs into blind search
       • drastically reduces the size of the search space, giving a faster search that requires fewer statistics
     • Empirical evaluation
       • we present an evaluation on several datasets
     • Variants of the algorithm are used in
       • [Boyen, Friedman & Koller] for temporal models with SEM
       • [Friedman, Getoor, Koller & Pfeffer] for relational models
     • Complexity analysis
       • a computational subproblem where structure search might be tractable even beyond trees
