This Master's thesis defense provides an overview of using genetic algorithms to learn Bayesian network adjacency matrices from data. Topics covered include Bayesian belief networks, graphical models of probability distributions, genetic algorithms, structure-learning background, and the shortcomings of existing algorithms. The presentation covers the survival-of-the-fittest process that genetic algorithms emulate, the K2 algorithm, and the Sparse Candidate approach to structure learning in Bayesian networks. A genetic algorithm over adjacency matrices (SLAM GA) is proposed to address the limitations of existing methods, with a focus on improving the efficiency and accuracy of network structure learning.
Ben Perry – M.S. Thesis Defense
A Genetic Algorithm for Learning Bayesian Network Adjacency Matrices from Data
Benjamin B. Perry
Laboratory for Knowledge Discovery in Databases, Kansas State University
http://www.kddresearch.org | http://www.cis.ksu.edu/~bbp9857
Overview • Bayesian Network • Definitions and examples • Inference and learning • Genetic Algorithms • Structure Learning Background • Problem • K2 algorithm • Sparse Candidate • Improving K2: Permutation Genetic Algorithm (GASLEAK) • Shortcoming: greedy, sensitive to ordering • Permutation GA • Master’s thesis: Adjacency Matrix GA (SLAM GA) • Rationale • Evaluation with Known Bayesian Networks • Summary
Bayesian Belief Networks (BBNs): Definition • Bayesian Network • Directed acyclic graph • Vertices (nodes): denote events, or states of affairs (each a random variable) • Edges (arcs, links): denote conditional dependencies, causalities • Model of conditional dependence assertions (or CI assumptions) • Example (“Ben’s Presentation” BBN) – [figure: five-node network X1–X5 with Sleep (Narcoleptic, Well, Bad, All-nighter), Appearance (Good, Bad), Memory (Elephant, Good, Bad, None), Ben is nervous (Extremely, Yes, No), and Ben’s presentation (Good, Not so good, Failed miserably)] • General product (chain) rule for BBNs: P(X1, …, Xn) = ∏i P(Xi | Parents(Xi)) • For the example: P(Well, Good, Good, No, Good) = P(Well) · P(Good | Well) · P(Good | Well) · P(No | Good, Good) · P(Good | No)
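The chain-rule factorization is easy to mechanize; the following is a minimal Python sketch (not taken from the thesis) that evaluates one joint assignment for the example network. The parent structure is read off the chain-rule expansion above, every probability value is invented purely for illustration, and only the CPT rows needed for this one assignment are included.

```python
# Minimal worked example (not the thesis code) of the BBN chain rule
#   P(x1, ..., xn) = prod_i P(xi | Parents(xi))
# applied to the "Ben's Presentation" network. All numbers are made up.

parents = {
    "Sleep": [],
    "Appearance": ["Sleep"],
    "Memory": ["Sleep"],
    "Nervous": ["Appearance", "Memory"],
    "Presentation": ["Nervous"],
}

# CPTs: node -> {tuple of parent values: {value: probability}}
cpt = {
    "Sleep":        {(): {"Well": 0.4, "Bad": 0.3, "All-nighter": 0.2, "Narcoleptic": 0.1}},
    "Appearance":   {("Well",): {"Good": 0.8, "Bad": 0.2}},
    "Memory":       {("Well",): {"Elephant": 0.2, "Good": 0.5, "Bad": 0.2, "None": 0.1}},
    "Nervous":      {("Good", "Good"): {"Extremely": 0.1, "Yes": 0.3, "No": 0.6}},
    "Presentation": {("No",): {"Good": 0.7, "Not so good": 0.2, "Failed miserably": 0.1}},
}

def joint_probability(assignment):
    """Multiply P(x_i | parents(x_i)) over every node in the network."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[node])
        p *= cpt[node][parent_values][value]
    return p

print(joint_probability({"Sleep": "Well", "Appearance": "Good", "Memory": "Good",
                         "Nervous": "No", "Presentation": "Good"}))
# 0.4 * 0.8 * 0.5 * 0.6 * 0.7 = 0.0672
```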
Graphical Modelsof Probability Distributions • Idea • Want: model that can be used to perform inference • Desired properties • Correlations among variables • Ability to represent functional, logical, stochastic relationships • Probability of certain events • Inference: Decision Support Problems • Diagnosis (medical, equipment) • Pattern recognition (image, speech) • Prediction • Want to Learn: Most Likely Model that Generates Observed Data • Under certain assumptions (Causal Markovity), it has been shown that we can do it • Given: data D (tuples or vectors containing observed values of variables) • Return: directed graph (V, E) expressing target CPTs • NEXT: Genetic algorithms
Genetic Algorithms • Idea • Emulate natural process of survival of the fittest (Example: Roaches adapt) • Each generation has many diverse individuals • Each individual competes for the chance to survive • Most common approach: best individuals live to the next generation and mate • Produce children with traits from both parents • If parents are strong, children might be stronger • Major components (operators) • Fitness function • Chromosome manipulation • Cross-over (Not the “John Edward” type!), mutation • From (Educated?) Guess to Gold • Initial population typically random or not much better than random – bad scores • Performs well with a non-deceptive search space and good genetic operators • Ability to escape local optima with mutations. • Not guaranteed to get the best answer, but usually gets close
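To make those components concrete, here is a generic, minimal GA skeleton in Python. It is purely illustrative and not the thesis implementation: `fitness`, `crossover`, and `mutate` are placeholders that, later in the talk, become a Bayesian network score and the structure-specific operators.

```python
# Generic elitist GA loop (illustrative sketch, not the thesis code).
# Assumes higher fitness is better and a population of at least four individuals.
import random

def genetic_algorithm(init_population, fitness, crossover, mutate,
                      generations=100, elite=2, mutation_rate=0.1):
    population = init_population()
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:elite]                       # elitism: best individuals survive as-is
        pool = ranked[:max(2, len(ranked) // 2)]        # mating pool drawn from the fitter half
        while len(next_gen) < len(population):
            p1, p2 = random.sample(pool, 2)             # select two distinct parents
            child = crossover(p1, p2)                   # child inherits traits from both parents
            if random.random() < mutation_rate:
                child = mutate(child)                   # occasional mutation helps escape local optima
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)
```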
Learning Structure: The K2 Algorithm
• Algorithm Learn-BBN-Structure-K2 (D, Max-Parents)
    FOR i ← 1 TO n DO                                            // arbitrary ordering of variables {x1, x2, …, xn}
        WHILE (Parents[xi].Size < Max-Parents) DO                // find best candidate parent
            Best ← argmax over candidate parents xj allowed by the ordering of Score(Parents[xi] ∪ {xj})   // max Dirichlet score
            IF ((Parents[xi] + Best).Score > Parents[xi].Score) THEN Parents[xi] += Best
            ELSE BREAK                                           // stop when no candidate improves the score
    RETURN ({Parents[xi] | i ∈ {1, 2, …, n}})
• ALARM (A Logical Alarm Reduction Mechanism) [Beinlich et al., 1989] – [figure: the 37-node ALARM network] • BBN model for patient monitoring in surgical anesthesia • Vertices (37): findings (e.g., esophageal intubation), intermediates, observables • K2 found a BBN differing in only 1 edge from the gold-standard network elicited from an expert
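The same greedy search can be written compactly in Python. This is an illustrative sketch, not the thesis code: `score(node, parent_set, data)` stands in for the Bayesian Dirichlet (K2) score of a node given a candidate parent set, which is not implemented here.

```python
# Sketch of K2's greedy parent selection under a fixed node ordering (illustrative only).
def k2(order, data, max_parents, score):
    parents = {x: set() for x in order}
    for i, x in enumerate(order):
        current = score(x, parents[x], data)
        while len(parents[x]) < max_parents:
            # Candidate parents are restricted by the ordering, which keeps the graph acyclic.
            candidates = [z for z in order[:i] if z not in parents[x]]
            if not candidates:
                break
            best = max(candidates, key=lambda z: score(x, parents[x] | {z}, data))
            improved = score(x, parents[x] | {best}, data)
            if improved > current:                 # keep the new parent only if the score improves
                parents[x].add(best)
                current = improved
            else:
                break                              # greedy: stop at the first non-improving step
    return parents
```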
Learning Structure: K2 Downfalls • Greedy (may fall into local maxima) • Highly dependent upon node ordering • The optimal node ordering must be given • If the optimal order is already known, an expert could probably construct the network directly • The number of candidate node orderings is n! (super-exponential in the number of variables)
Learning Structure: Sparse Candidate • General idea: inspect the k best parent candidates for each node at a time (K2 inspects only one) • k is typically very small, roughly 5 ≤ k ≤ 15 • Complexity is exponential in k • Algorithm – loop until no improvement or the iteration limit is exceeded: • [Restrict phase] For each node, select the top k parent candidates (by mutual information or m_disc) • [Maximize phase] Build a network by manipulating parents (add, remove, or reverse edges drawn from each node's candidate set), accepting only changes that improve the network score (Minimum Description Length) • Must handle cycles – expensive; K2 gives acyclicity for free • Next: improving K2
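As one illustration of the Restrict phase, the sketch below picks the k candidate parents with the highest empirical mutual information from a table of discrete data. It is my own sketch, not the thesis code; the function names and the use of a pandas DataFrame are assumptions, and the alternative m_disc measure is not implemented.

```python
# Sketch of the Sparse Candidate "Restrict" step via mutual information (illustrative only).
import numpy as np
import pandas as pd

def mutual_information(df, a, b):
    """Empirical MI between two discrete columns of a DataFrame."""
    joint = pd.crosstab(df[a], df[b], normalize=True)   # empirical joint distribution
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)        # marginals
    mi = 0.0
    for x in joint.index:
        for y in joint.columns:
            pxy = joint.loc[x, y]
            if pxy > 0:
                mi += pxy * np.log(pxy / (pa[x] * pb[y]))
    return mi

def restrict(df, k):
    """Return the k most informative candidate parents for every variable."""
    candidates = {}
    for node in df.columns:
        others = sorted((v for v in df.columns if v != node),
                        key=lambda v: mutual_information(df, node, v), reverse=True)
        candidates[node] = others[:k]
    return candidates
```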
GASLEAK: A Permutation GA for Variable Ordering • [Figure: system diagram for the genetic algorithm for structure learning from evidence – the training data D are split into Dtrain (structure learning) and Dval (inference, with an evidence specification); a permutation genetic algorithm [1] proposes a candidate ordering α; a representation evaluator for Bayesian network structure-learning problems [2], built on AIS and K2, scores the ordering and returns its fitness f(α); the GA outputs an optimized ordering]
Properties of the Genetic Algorithm • Elitist • Chromosome representation • Integer permutation ordering • A sample chromosome for a 5-node BBN might look like: 3 1 2 0 4 • Seeding • Random shuffle • Operators • Order crossover • Swap mutation • Fitness • RMSE (root mean squared error) • Job farm • Java-based; utilizes many machines regardless of OS
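For concreteness, the two operators named above might look like the following sketch (illustrative only, not the thesis code): an order-crossover variant and a swap mutation on an integer-permutation chromosome such as 3 1 2 0 4.

```python
# Order crossover and swap mutation on permutation chromosomes (illustrative sketch).
import random

def order_crossover(p1, p2):
    """Copy a random slice from p1, then fill the remaining positions in p2's relative order."""
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b] = p1[a:b]
    fill = [g for g in p2 if g not in child[a:b]]
    for i in range(n):
        if child[i] is None:
            child[i] = fill.pop(0)
    return child

def swap_mutation(perm):
    """Swap two randomly chosen positions of the ordering."""
    perm = perm[:]
    i, j = random.sample(range(len(perm)), 2)
    perm[i], perm[j] = perm[j], perm[i]
    return perm
```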
GASLEAK Results • Not encouraging • Bad fitness function or bad evidence b.v. • Many graph errors • [Figure: histogram of estimated fitness for all 8! = 40,320 permutations of the Asia variables]
Master’s Thesis: SLAM GA • SLAM GA – Structure Learning Adjacency Matrix Genetic Algorithm • Initial population – tried several approaches: • Completely random Bayesian networks (Box-Muller, max parents) • Many illegal structures; wrote a fixCycles algorithm (one possible repair policy is sketched after this slide) • Random networks generated from parents pre-selected by the Restrict phase of Sparse Candidate • Performed better than random • Aggregate of k networks learned by K2 from random orderings (cycles eliminated) – the best approach
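The fixCycles routine itself is not reproduced in these slides; the sketch below shows one simple repair policy under my own assumptions: repeatedly find an edge that closes a cycle with a depth-first search and delete it until the graph is acyclic.

```python
# One possible cycle-repair policy (illustrative sketch; not necessarily the thesis' fixCycles).
def find_back_edge(edges, nodes):
    """Return one edge that closes a cycle, or None if the directed graph is acyclic (DFS)."""
    adj = {u: [] for u in nodes}
    for p, c in edges:
        adj.setdefault(p, []).append(c)
        adj.setdefault(c, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = {u: WHITE for u in adj}

    def dfs(u):
        color[u] = GREY
        for v in adj[u]:
            if color[v] == GREY:            # back edge: v is an ancestor still on the DFS stack
                return (u, v)
            if color[v] == WHITE:
                hit = dfs(v)
                if hit:
                    return hit
        color[u] = BLACK
        return None

    for u in list(adj):
        if color[u] == WHITE:
            hit = dfs(u)
            if hit:
                return hit
    return None

def fix_cycles(edges, nodes):
    """Delete cycle-closing edges until the graph is a DAG."""
    edges = set(edges)
    while True:
        back = find_back_edge(edges, nodes)
        if back is None:
            return edges
        edges.discard(back)
```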
[Figure: aggregation pipeline – an instantiater feeds the training data D to a K2 manager, which runs K2 with k different random orderings to produce BBN 1, BBN 2, …, BBN k; an aggregator combines them into a single aggregate BBN] • For small networks, k = 1 is best; for larger networks, k = 2 is best.
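Put together, this seeding step can be sketched as below (illustrative only): run K2 with k random orderings, take the union of the learned edges, and repair any cycles. The sketch assumes the hypothetical `k2` and `fix_cycles` helpers sketched earlier in this write-up.

```python
# Sketch of the K2-aggregation seeding (relies on the k2 and fix_cycles sketches above).
import random

def aggregate_seed(variables, data, k, max_parents, score):
    """Union the edges of k K2 runs over random orderings, then break any cycles."""
    edges = set()
    for _ in range(k):
        order = random.sample(variables, len(variables))   # a fresh random node ordering
        learned = k2(order, data, max_parents, score)       # parents per node from one K2 run
        edges |= {(p, child) for child, ps in learned.items() for p in ps}
    return fix_cycles(edges, variables)
```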
SLAM GA • Chromosome representation • Edge (adjacency) matrix – n² bits • Each bit represents a parent edge into a node: 1 = parent, 0 = not a parent • Operators • Crossover: swap parents between the two networks, then fix cycles • Mutation: reverse, delete, or add a random number of edges, then fix cycles • Fitness • Total Bayesian Dirichlet equivalence score, summed over all nodes
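A sketch of this chromosome and its operators appears below. It is illustrative only: the matrix orientation (m[i][j] = 1 meaning node i is a parent of node j, so column j holds node j's parents), the repair step, and the reuse of the `fix_cycles` sketch from earlier are my assumptions, not the thesis implementation.

```python
# Adjacency-matrix chromosome with parent-swap crossover and edge mutation (illustrative sketch).
import random
import numpy as np

def parent_swap_crossover(m1, m2):
    """For every node, the child takes that node's whole parent set (one column) from one parent network."""
    child = np.empty_like(m1)
    for j in range(m1.shape[0]):
        child[:, j] = (m1 if random.random() < 0.5 else m2)[:, j]
    return repair(child)

def edge_mutation(m, n_changes=1):
    """Add, delete, or reverse a few random edges, then repair any cycles."""
    m = m.copy()
    n = m.shape[0]
    for _ in range(n_changes):
        i, j = random.sample(range(n), 2)
        op = random.choice(("add", "delete", "reverse"))
        if op == "add":
            m[i, j] = 1
        elif op == "delete":
            m[i, j] = 0
        else:
            m[i, j], m[j, i] = m[j, i], m[i, j]
    return repair(m)

def repair(m):
    """Convert the matrix to an edge set, break cycles (fix_cycles sketch above), convert back."""
    n = m.shape[0]
    edges = {(i, j) for i in range(n) for j in range(n) if m[i, j]}
    out = np.zeros_like(m)
    for i, j in fix_cycles(edges, list(range(n))):
        out[i, j] = 1
    return out
```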
Results – Asia • Best of first generation: 15 graph errors • Final learned network: 1 graph error (compared with the actual network)
Results – Poker • Best of first generation: 11 graph errors • Final learned network: 2 graph errors (compared with the actual network)
Results – Golf • Best of first generation: 11 graph errors • Final learned network: 4 graph errors (compared with the actual network)
Results – Boerlage92 • [Figure: initial, learned, and actual networks shown side by side]
K2 vs. SLAM GA • K2: • Very good if ordering is known • Ordering is often not known • Greedy, very dependent on ordering. • SLAM GA • Stochastic; falls out of local optima trap • Can improve on bad structures learned by K2 • Takes much longer than K2
GASLEAK vs. SLAM GA • GASLEAK: • Gold network never recovered • Much more computationally-expensive • K2 is run on each [new] individual each generation • Each chromosome must be scored • Final network has many graph errors • SLAM GA • For small networks, gold standard network often recovered. • Relatively few graph errors for final network. • Less computationally intensive • Initial population most expensive • Each chromosome must be scored
SLAM GA: Ramifications • Effective structure-learning algorithm • Ideal for small networks • Improvement over GASLEAK • SLAM GA is faster despite using the same GA parameters • SLAM GA is more accurate • Improvement over K2 • The aggregate algorithm produces a better initial population • The parent-swapping crossover technique is effective • Diversifies the search space while retaining past information
SLAM GA: Future Work • Parameter tweaking • Better fitness function • Several ‘bad’ structures score better than gold standard • GA works fine • ‘Intelligent’ mutation operator • Add edges from pre-qualified set of candidate parents • New instantiation methods • Use GASLEAK • Other structure-learning algorithms • Scalability • Job farm
Summary • Bayesian Network • Genetic Algorithms • Learning Structure: K2, Sparse Candidate • GASLEAK • SLAM GA