CS b553: Algorithms for Optimization and Learning

Structure Learning

Agenda • Learning probability distributions from example data • To what extent can Bayes net structure be learned? • Constraint methods (inferring conditional independence) • Scoring methods (learning => optimization)

Basic Question • Given examples drawn from a distribution P* whose independence relations are given by the Bayesian network structure G*, can we recover G*? • Refined question: can we construct a network that encodes the same independence relations as G*? [Figure: G* shown alongside networks G1 and G2]

Learning in the face of Noisy Data • Ex: flip two independent coins • Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT • Model 1 (no edge between X and Y), ML parameters: P(X=H) = 9/20, P(Y=H) = 8/20 • Model 2 (X → Y), ML parameters: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11 • Errors in the Model 2 estimates are likely to be larger, since each conditional parameter is fit from only part of the data!
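
The ML parameters for the two coin models can be recomputed directly from the counts. The sketch below is only an illustration; the variable names (`flips`, `model1`, `model2`) are made up for this example and do not come from the lecture.

```python
from collections import Counter

# Observed flips (X, Y) from the example: 3 HH, 6 HT, 5 TH, 6 TT.
flips = [("H", "H")] * 3 + [("H", "T")] * 6 + [("T", "H")] * 5 + [("T", "T")] * 6
counts = Counter(flips)
M = len(flips)

# Marginal head counts for each coin.
n_x_heads = sum(c for (x, _), c in counts.items() if x == "H")
n_y_heads = sum(c for (_, y), c in counts.items() if y == "H")

# Model 1 (X and Y independent): one parameter per coin, each fit from all M flips.
model1 = {"P(X=H)": n_x_heads / M, "P(Y=H)": n_y_heads / M}

# Model 2 (X -> Y): Y gets a separate parameter for each value of X,
# so each conditional estimate is fit from only a fraction of the flips.
model2 = {
    "P(X=H)": n_x_heads / M,
    "P(Y=H|X=H)": counts[("H", "H")] / n_x_heads,
    "P(Y=H|X=T)": counts[("T", "H")] / (M - n_x_heads),
}

print(model1)
print(model2)
```

The conditional estimates in Model 2 rest on 9 and 11 flips respectively rather than all 20, which is the data fragmentation the example is meant to show.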

Principle • Learning structure must trade off fit of data vs. complexity of network • Complex networks • More parameters to learn • More data fragmentation = greater sensitivity to noise

Approach #1: Constraint-based learning • First, identify an undirected skeleton of edges in G* • If an edge X-Y is in G*, then no subset of evidence variables can make X and Y independent • If X-Y is not in G*, then we can find evidence variables to make X and Y independent • Then, assign directionality to preserve independences

Build-Skeleton algorithm • Given X = {X1,…,Xn} and a query Independent?(X,Y,U) • H = complete graph over X • For all pairs Xi, Xj, test separation as follows: • Enumerate all possible separating sets U • If Independent?(Xi,Xj,U) then remove Xi-Xj from H • In practice: • Must restrict to bounded-size subsets $|U| \le d$ (i.e., assume G* has bounded degree), giving $O(n^2 (n-2)^d)$ tests • Independence can't be tested exactly
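
The skeleton procedure above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: `build_skeleton`, the `independent` callback, and the toy `chain_oracle` are hypothetical names, and the oracle stands in for a real statistical test.

```python
from itertools import combinations

def build_skeleton(variables, independent, d):
    """Return (edges, sepsets) for an undirected skeleton.

    `independent(x, y, u)` is a caller-supplied independence query (hypothetical);
    `d` bounds the size of the candidate separating sets U.
    """
    # Start from the complete graph H over the variables.
    edges = {frozenset(pair) for pair in combinations(variables, 2)}
    sepsets = {}
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        # Enumerate candidate separating sets, smallest first, up to size d.
        for size in range(d + 1):
            witness = next(
                (set(u) for u in combinations(others, size) if independent(x, y, set(u))),
                None,
            )
            if witness is not None:
                edges.discard(frozenset((x, y)))      # remove x-y from H
                sepsets[frozenset((x, y))] = witness  # remember why it was removed
                break
    return edges, sepsets

# Toy oracle for the chain A -> B -> C: A and C are independent only given B.
def chain_oracle(x, y, u):
    return {x, y} == {"A", "C"} and "B" in u

edges, sepsets = build_skeleton(["A", "B", "C"], chain_oracle, d=1)
print(sorted(sorted(e) for e in edges))
```

Recording the witness set for every removed edge costs little and is what the orientation step needs later.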

Assigning Directionality • Note that v-structures X → Y ← Z introduce a dependency between X and Z given Y • In structures X → Y → Z, X ← Y ← Z, and X ← Y → Z, X and Z are independent given Y • In fact Y must be given for X and Z to be independent • Idea: look at separating sets for all triples X-Y-Z in the skeleton without edge X-Z • Three cases: • Triangle (edge X-Z present): directionality is irrelevant • Y is in the set that separates X, Z: not a v-structure • A set U with Y ∉ U separates X, Z: a v-structure
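
The v-structure rule above can be sketched in a few lines, assuming the skeleton is a set of `frozenset` edges and `sepsets` maps each removed pair to the set that separated it. These data structures and the function name `find_v_structures` are assumptions made for this example.

```python
def find_v_structures(variables, edges, sepsets):
    """Return the set of triples (x, y, z) that should be oriented x -> y <- z."""
    def adjacent(a, b):
        return frozenset((a, b)) in edges

    found = set()
    for y in variables:
        neighbors = sorted(v for v in variables if v != y and adjacent(v, y))
        for i, x in enumerate(neighbors):
            for z in neighbors[i + 1:]:
                if adjacent(x, z):
                    continue  # triangle: this rule says nothing about direction
                # x and z were separated by some set; if y is not in it,
                # y cannot lie on a chain or fork between them.
                if y not in sepsets[frozenset((x, z))]:
                    found.add((x, y, z))
    return found

skeleton = {frozenset(("A", "B")), frozenset(("B", "C"))}

# Collider A -> B <- C: A and C are separated by the empty set.
collider = find_v_structures(["A", "B", "C"], skeleton, {frozenset(("A", "C")): set()})

# Chain A -> B -> C: A and C are separated only by {B}.
chain = find_v_structures(["A", "B", "C"], skeleton, {frozenset(("A", "C")): {"B"}})

print(collider, chain)
```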

Statistical Independence Testing • Question: are X and Y independent? • Null hypothesis H0: X and Y are independent • Alternative hypothesis HA: X and Y are not independent • χ² test: use the statistic $\chi^2 = \sum_{x,y} \frac{(M[x,y] - M \hat{P}(x) \hat{P}(y))^2}{M \hat{P}(x) \hat{P}(y)}$, with $M[x,y]$ the number of examples with X=x and Y=y, $M$ the total number of examples, and $\hat{P}(x)$ the empirical probability of X • Can compute (table lookup) the probability of getting a value at least this extreme if H0 is true (p-value) • If p < some threshold, e.g. 1 − 0.95 = 0.05, H0 is rejected
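
As an illustration, the test can be run on the two-coin dataset from the earlier example. The sketch below assumes a 2×2 table, so the statistic has one degree of freedom and the table lookup can be replaced by the closed form `erfc(sqrt(stat / 2))`. The function name is made up for this example.

```python
import math

def chi_square_independence(counts):
    """counts[(x, y)] -> number of samples. Returns (statistic, p_value) for a 2x2 table."""
    M = sum(counts.values())
    xs = sorted({x for x, _ in counts})
    ys = sorted({y for _, y in counts})
    if len(xs) != 2 or len(ys) != 2:
        raise ValueError("this sketch only handles 2x2 tables (1 degree of freedom)")
    # Empirical marginals of X and Y.
    px = {x: sum(counts.get((x, y), 0) for y in ys) / M for x in xs}
    py = {y: sum(counts.get((x, y), 0) for x in xs) / M for y in ys}
    # Squared deviation of observed counts from the counts expected under H0.
    stat = sum(
        (counts.get((x, y), 0) - M * px[x] * py[y]) ** 2 / (M * px[x] * py[y])
        for x in xs
        for y in ys
    )
    # Tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

coin_counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
stat, p_value = chi_square_independence(coin_counts)
print(stat, p_value)
```

For these counts the p-value is well above 0.05, so H0 is not rejected, which matches the fact that the coins were flipped independently.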

Approach #2: Score-based Methods • Learning => optimization • Define scoring function Score(G;D) that evaluates quality of structure G, and optimize it • Combinatorial optimization problem • Issues: • Choice of scoring function: maximum likelihood score, Bayesian score • Efficient optimization techniques

Maximum-Likelihood scores • $\mathrm{Score}_L(G;D)$ = likelihood of the BN with the most likely parameter settings under structure G • Let $L(\theta_G, G; D)$ be the likelihood of the data using parameters $\theta_G$ with structure G • Let $\theta_G^* = \arg\max_{\theta} L(\theta, G; D)$, as described in the last lecture • Then $\mathrm{Score}_L(G;D) = L(\theta_G^*, G; D)$

Issue with ML Score • Independent coin example: G1 has no edge between X and Y; G2 has X → Y • ML parameters for G1: P(X=H) = 9/20, P(Y=H) = 8/20 • ML parameters for G2: P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11 • Likelihood scores:
$$\log L(\theta_{G_1}^*, G_1; D) = 9 \log\tfrac{9}{20} + 11 \log\tfrac{11}{20} + 8 \log\tfrac{8}{20} + 12 \log\tfrac{12}{20}$$
$$\log L(\theta_{G_2}^*, G_2; D) = 9 \log\tfrac{9}{20} + 11 \log\tfrac{11}{20} + 3 \log\tfrac{3}{9} + 6 \log\tfrac{6}{9} + 5 \log\tfrac{5}{11} + 6 \log\tfrac{6}{11}$$
• Difference, with $P$ the empirical distribution of D and $M = 20$:
$$\begin{aligned}
\log L(\theta_{G_1}^*, G_1; D) - \log L(\theta_{G_2}^*, G_2; D)
&= 8 \log\tfrac{8}{20} + 12 \log\tfrac{12}{20} - \left[ 3 \log\tfrac{3}{9} + 6 \log\tfrac{6}{9} + 5 \log\tfrac{5}{11} + 6 \log\tfrac{6}{11} \right] \\
&= 8 \log\tfrac{8}{20} + 12 \log\tfrac{12}{20} - 8 \left[ \tfrac{3}{8} \log\tfrac{3}{9} + \tfrac{5}{8} \log\tfrac{5}{11} \right] - 12 \left[ \tfrac{6}{12} \log\tfrac{6}{9} + \tfrac{6}{12} \log\tfrac{6}{11} \right] \\
&= 8 \left[ \log\tfrac{8}{20} - \tfrac{3}{8} \log\tfrac{3}{9} - \tfrac{5}{8} \log\tfrac{5}{11} \right] + 12 \left[ \log\tfrac{12}{20} - \tfrac{6}{12} \log\tfrac{6}{9} - \tfrac{6}{12} \log\tfrac{6}{11} \right] \\
&= -M \sum_{x,y} P(x,y) \log \frac{P(y \mid x)}{P(y)} \\
&= -M \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x) P(y)}
\end{aligned}$$

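Mutual Information Properties • $I_P(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x) P(y)}$ (the mutual information between X and Y) • $I_P(X;Y) = D(P \,\|\, Q)$ with $Q(x,y) = P(x) P(y)$ • $I_P(X;Y) \ge 0$ by nonnegativity of KL divergence • Implication: ML scores do not decrease for more connected graphs => Overfitting to data!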
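
The relationship between the two likelihood scores and the mutual information can be checked numerically on the coin data. This is a small sanity check written for this example (natural logarithms, illustrative variable names), not part of the original slides.

```python
import math

counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
M = sum(counts.values())

# Marginal counts of X and Y.
n_x, n_y = {}, {}
for (x, y), c in counts.items():
    n_x[x] = n_x.get(x, 0) + c
    n_y[y] = n_y.get(y, 0) + c

# G1 (no edge): maximized log-likelihood uses the two marginals.
loglik_g1 = sum(n * math.log(n / M) for n in n_x.values()) + sum(
    n * math.log(n / M) for n in n_y.values()
)

# G2 (X -> Y): Y is scored with its conditional distribution given X.
loglik_g2 = sum(n * math.log(n / M) for n in n_x.values()) + sum(
    c * math.log(c / n_x[x]) for (x, _), c in counts.items()
)

# Empirical mutual information between X and Y.
mutual_info = sum(
    (c / M) * math.log((c / M) / ((n_x[x] / M) * (n_y[y] / M)))
    for (x, y), c in counts.items()
)

print(loglik_g1 - loglik_g2, -M * mutual_info)
```

The two printed values agree up to floating-point error, and the difference is never positive, so the denser graph never scores worse under maximum likelihood.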

Possible solutions • Fix complexity of graphs (e.g., bounded in-degree) • See HW7 • Penalize complex graphs • Bayesian scores

Idea of Bayesian Scoring • Note that parameters are uncertain • Bayesian approach: put a prior on parameter values and marginalize them out • $P(D \mid G) = \int P(D \mid \theta_G, G)\, P(\theta_G \mid G)\, d\theta_G$ • For example, use Beta/Dirichlet priors => marginal is manageable to compute • E.g., uniform hyperparameter over network • Set virtual counts proportional to $2^{-|\mathrm{Pa}_{X_i}|}$
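
With Beta/Dirichlet priors the marginal likelihood has a closed form in terms of Gamma functions. The sketch below applies it to the coin example, assuming a Beta(1,1) prior on every row of every CPD. That prior is a simplifying assumption for illustration, not the hyperparameter choice from the slide, and `log_marginal` is a made-up helper name.

```python
import math

def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of one multinomial under a symmetric Dirichlet(alpha) prior."""
    k, m = len(counts), sum(counts)
    return (
        math.lgamma(k * alpha)
        - math.lgamma(k * alpha + m)
        + sum(math.lgamma(alpha + c) - math.lgamma(alpha) for c in counts)
    )

# G1 (no edge): one Beta prior for X, one for Y.
score_g1 = log_marginal([9, 11]) + log_marginal([8, 12])

# G2 (X -> Y): one Beta prior for X, and one for Y per value of X.
score_g2 = log_marginal([9, 11]) + log_marginal([3, 6]) + log_marginal([5, 6])

print(score_g1, score_g2)
```

Under this prior the simpler graph G1 receives the higher score on the coin data, in contrast to the maximum-likelihood score.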

Large Sample Approximation • $\log P(D \mid G) = \log L(\theta_G^*; D) - \frac{\log M}{2} \mathrm{Dim}[G] + O(1)$ • With M the number of samples, Dim[G] the number of free parameters of G • Bayesian Information Criterion (BIC) score: • $\mathrm{Score}_{BIC}(G;D) = \log L(\theta_G^*; D) - \frac{\log M}{2} \mathrm{Dim}[G]$ • The first term rewards fitting the data set; the second term prefers simple models
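
For a concrete feel, the BIC score can be evaluated on the coin example, where G1 (no edge) has 2 free parameters and G2 (X → Y) has 3. The helper name `max_loglik` and the way counts are grouped are choices made for this sketch.

```python
import math

def max_loglik(groups):
    """Maximized log-likelihood; each group holds the counts for one CPD row."""
    return sum(c * math.log(c / sum(g)) for g in groups for c in g if c > 0)

M = 20

# G1: rows are P(X) and P(Y). Dim[G1] = 2.
loglik_g1 = max_loglik([[9, 11], [8, 12]])
bic_g1 = loglik_g1 - 0.5 * math.log(M) * 2

# G2: rows are P(X), P(Y|X=H), P(Y|X=T). Dim[G2] = 3.
loglik_g2 = max_loglik([[9, 11], [3, 6], [5, 6]])
bic_g2 = loglik_g2 - 0.5 * math.log(M) * 3

print(bic_g1, bic_g2)
```

G2 has the slightly higher likelihood, but the penalty for its extra parameter outweighs that gain, so BIC prefers G1 here.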

Structure Optimization, Given a Score… • The problem is well-defined, but combinatorially complex! • Superexponential in # of variables • Idea: search locally through the space of graphs using graph operators • Add edge • Delete edge • Reverse edge

Search Strategies • Greedy • Pick the operator that leads to the greatest Δ score • Local optima? Plateaux? • Overcoming plateaux • Search with basin flooding • Tabu search • Perturbation methods (similar to simulated annealing, except on data weighting) • Implementation details: • Evaluate Δ's between structures quickly (local decomposability)
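
A greedy search over the add, delete, and reverse operators might look like the sketch below. It assumes binary variables, uses the BIC score, and rescores the whole graph at every step for brevity. A practical implementation would exploit decomposability and rescore only the families an operator changes. All function names are illustrative.

```python
import math
from itertools import permutations

def family_bic(data, child, parents, M):
    """BIC contribution of one binary variable given its parent list."""
    counts = {}
    for row in data:
        key = tuple(row[p] for p in parents)
        counts.setdefault(key, [0, 0])[row[child]] += 1
    loglik = sum(c * math.log(c / sum(pair)) for pair in counts.values() for c in pair if c > 0)
    return loglik - 0.5 * math.log(M) * 2 ** len(parents)  # one free parameter per parent assignment

def bic(data, parents_of):
    return sum(family_bic(data, v, sorted(ps), len(data)) for v, ps in parents_of.items())

def is_acyclic(parents_of):
    in_progress, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in in_progress:
            return False  # reached a node that is still being explored: cycle
        in_progress.add(v)
        ok = all(visit(p) for p in parents_of[v])
        done.add(v)
        return ok
    return all(visit(v) for v in parents_of)

def neighbors(parents_of):
    """Yield every graph reachable by one edge addition, deletion, or reversal."""
    for a, b in permutations(parents_of, 2):
        g = {v: set(ps) for v, ps in parents_of.items()}
        if a in g[b]:
            g[b].discard(a)                      # delete a -> b
            yield g
            r = {v: set(ps) for v, ps in g.items()}
            r[a].add(b)                          # reverse to b -> a
            if is_acyclic(r):
                yield r
        elif b not in g[a]:
            g[b].add(a)                          # add a -> b
            if is_acyclic(g):
                yield g

def greedy_search(data, variables):
    current = {v: set() for v in variables}      # start from the empty graph
    current_score = bic(data, current)
    while True:
        best, best_score = None, current_score
        for g in neighbors(current):
            s = bic(data, g)
            if s > best_score + 1e-12:           # accept strict improvements only
                best, best_score = g, s
        if best is None:
            return current, current_score
        current, current_score = best, best_score

# Coin data with H = 1, T = 0: 3 HH, 6 HT, 5 TH, 6 TT.
coin_data = ([{"X": 1, "Y": 1}] * 3 + [{"X": 1, "Y": 0}] * 6
             + [{"X": 0, "Y": 1}] * 5 + [{"X": 0, "Y": 0}] * 6)
graph, score = greedy_search(coin_data, ["X", "Y"])
print(graph, score)
```

On the coin data the search stops at the empty graph, because neither edge direction improves BIC. Accepting only strict improvements also means the search cannot wander on a plateau, which is one of the limitations the slide raises.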

Recap • Bayes net structure learning: recover a network from the equivalence class of networks that encode the same conditional independences • Constraint-based methods • Statistical independence tests • Score-based methods • Learning => optimization
