Learning Bayesian Networks (From David Heckerman’s tutorial)
Learning Bayes Nets From Data
[Figure: a data table over X1, X2, X3 (e.g., X1 = true/false, X2 = 1, 5, 3, 2, X3 = 0.7, -1.6, 5.9, 6.3, ...) plus prior/expert information feed into a Bayes-net learner, which outputs Bayes net(s) over X1…X9.]
Overview
• Introduction to Bayesian statistics: learning a probability
• Learning probabilities in a Bayes net
• Learning Bayes-net structure
Learning Probabilities: Classical Approach
Simple case: flipping a thumbtack. The true probability θ is unknown. Given i.i.d. data, estimate θ using an estimator with good properties: low bias, low variance, consistency (e.g., the ML estimate).
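As a concrete illustration (not from the tutorial), a minimal sketch of the ML estimate, assuming the flips are recorded as a list of "heads"/"tails" strings:

```python
# Classical ML estimate of the heads probability from i.i.d. thumbtack flips.
# Illustrative sketch; the data format is an assumption, not from the tutorial.
flips = ["heads", "tails", "tails", "heads", "heads"]

n_heads = sum(1 for f in flips if f == "heads")
theta_ml = n_heads / len(flips)  # ML estimate: observed fraction of heads
print(f"ML estimate of theta: {theta_ml:.3f}")  # 0.600
```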
Learning Probabilities: Bayesian Approach
The true probability θ is unknown; uncertainty about it is expressed as a Bayesian probability density p(θ) over the interval [0, 1].
Bayesian approach: use Bayes' rule to compute a new density for θ given data:

p(θ | data) ∝ p(θ) · p(data | θ)
(posterior ∝ prior × likelihood)
The Likelihood ("binomial distribution")
For #heads heads and #tails tails observed in N tosses, the likelihood is p(data | θ) ∝ θ^#heads (1 − θ)^#tails.
Example: application of Bayes' rule to the observation of a single "heads". The likelihood is p(heads | θ) = θ, so the posterior is p(θ | heads) ∝ θ · p(θ).
[Figure: prior p(θ), likelihood θ, and posterior p(θ | heads), each plotted over θ from 0 to 1.]
A Bayes net for learning probabilities
[Figure: parameter node Θ with children X1, X2, …, XN, one node per toss (toss 1, toss 2, …, toss N).]
Sufficient statistics
The counts (#h, #t) are sufficient statistics for the data.
Prior Distributions for θ
• Direct assessment
• Parametric distributions
• Conjugate distributions (for convenience)
• Mixtures of conjugate distributions
Conjugate Family of Distributions
Beta distribution: p(θ) = Beta(θ | αh, αt) ∝ θ^(αh − 1) (1 − θ)^(αt − 1)
Properties: the posterior after observing #h heads and #t tails is again a Beta, Beta(θ | αh + #h, αt + #t), and the predictive probability of heads on the next toss is αh / (αh + αt).
Intuition
• The hyperparameters αh and αt can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
• Equivalent sample size = αh + αt
• The larger the equivalent sample size, the more confident we are about the true probability
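A minimal sketch of the conjugate Beta update described above; the function name and data format are illustrative assumptions, not the tutorial's notation:

```python
# Conjugate Beta-Bernoulli update: hyperparameters act as imaginary counts.
def beta_update(alpha_h, alpha_t, n_heads, n_tails):
    """Return the posterior hyperparameters and p(next toss = heads)."""
    post_h = alpha_h + n_heads            # posterior is Beta(alpha_h + #h, alpha_t + #t)
    post_t = alpha_t + n_tails
    p_heads = post_h / (post_h + post_t)  # posterior predictive probability of heads
    return post_h, post_t, p_heads

# Prior Beta(3, 2): equivalent sample size 5, prior mean 0.6.
print(beta_update(3, 2, n_heads=7, n_tails=3))  # -> (10, 5, 0.666...)
```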
Beta Distributions
[Figure: example Beta densities: Beta(0.5, 0.5), Beta(1, 1), Beta(3, 2), Beta(19, 39).]
Assessment of a Beta Distribution
Method 1: equivalent sample
 - assess αh and αt
 - assess αh + αt and αh / (αh + αt)
Method 2: imagined future samples
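For Method 1, a small sketch (an illustrative helper, not part of the tutorial) that converts an assessed prior mean and equivalent sample size into Beta hyperparameters:

```python
# Turn an assessed prior mean for heads and an equivalent sample size
# into Beta hyperparameters; illustrative helper, names are assumptions.
def beta_from_assessment(mean_heads, equivalent_sample_size):
    alpha_h = mean_heads * equivalent_sample_size
    alpha_t = (1.0 - mean_heads) * equivalent_sample_size
    return alpha_h, alpha_t

print(beta_from_assessment(0.6, 5))  # -> (3.0, 2.0), i.e. Beta(3, 2)
```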
Generalization to m discrete outcomes ("multinomial distribution")
Dirichlet distribution: p(θ1, …, θm) = Dir(θ | α1, …, αm) ∝ ∏k θk^(αk − 1)
Properties: the posterior after observing counts N1, …, Nm is Dir(α1 + N1, …, αm + Nm), and the predictive probability of outcome k on the next trial is αk / Σj αj.
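The conjugate update generalizes directly; a sketch for m outcomes (illustrative names, assuming the counts are already tallied):

```python
# Conjugate Dirichlet-multinomial update for m discrete outcomes.
def dirichlet_update(alphas, counts):
    """alphas, counts: length-m lists. Returns the posterior hyperparameters
    and the predictive distribution over the next outcome."""
    post = [a + n for a, n in zip(alphas, counts)]
    total = sum(post)
    return post, [p / total for p in post]

# Prior Dirichlet(1, 1, 1) over three outcomes, observed counts (4, 1, 0).
print(dirichlet_update([1, 1, 1], [4, 1, 0]))  # -> ([5, 2, 1], [0.625, 0.25, 0.125])
```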
More generalizations (see, e.g., Bernardo and Smith, 1994)
Likelihoods from the exponential family:
• Binomial
• Multinomial
• Poisson
• Gamma
• Normal
Overview
• Introduction to Bayesian statistics: learning a probability
• Learning probabilities in a Bayes net
• Learning Bayes-net structure
From thumbtacks to Bayes nets
The thumbtack problem can be viewed as learning the probability for a very simple BN: a single node X (heads/tails).
[Figure: parameter node Θ with children X1, X2, …, XN (toss 1 through toss N).]
The next simplest Bayes net
Two variables, X and Y, each heads/tails, with no arc between them.
The next simplest Bayes net (continued)
[Figure: parameter nodes ΘX and ΘY, with ΘX the parent of X1, …, XN and ΘY the parent of Y1, …, YN (case 1 through case N); a "?" between ΘX and ΘY asks whether the two parameters are dependent.]
Under "parameter independence" the priors on ΘX and ΘY are independent, so learning decomposes into two separate thumbtack-like learning problems.
A bit more difficult...
Now X is a parent of Y (both heads/tails). Three probabilities to learn:
• θX=heads
• θY=heads|X=heads
• θY=heads|X=tails
A bit more difficult... (continued)
[Figure: parameter nodes ΘX, ΘY|X=heads, and ΘY|X=tails, with the observed pairs (X1, Y1), (X2, Y2), … as case 1, case 2, …; the "?"s ask whether these parameters are dependent.]
With parameter independence and complete data, the three parameters decouple, giving 3 separate thumbtack-like problems.
In general…
Learning probabilities in a BN is straightforward if:
• Local distributions are from the exponential family (binomial, Poisson, gamma, ...)
• Parameter independence
• Conjugate priors
• Complete data
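Under those four conditions, Bayesian parameter learning reduces to counting. A minimal sketch for the X → Y example, assuming complete binary data and Beta(1, 1) priors on every local distribution (both assumptions are illustrative, not the tutorial's choices):

```python
# Parameter learning for the network X -> Y with complete data:
# by parameter independence, each local distribution gets its own
# Beta(1, 1) prior and is updated independently from counts.
data = [("heads", "tails"), ("heads", "heads"), ("tails", "tails"), ("heads", "heads")]

def posterior_mean(n_pos, n_neg, alpha_h=1, alpha_t=1):
    return (alpha_h + n_pos) / (alpha_h + alpha_t + n_pos + n_neg)

# theta_X = posterior mean of p(X = heads)
theta_x = posterior_mean(sum(x == "heads" for x, _ in data),
                         sum(x == "tails" for x, _ in data))

# theta_Y|X=heads and theta_Y|X=tails, one thumbtack-like problem each
theta_y_given = {}
for parent in ("heads", "tails"):
    ys = [y for x, y in data if x == parent]
    theta_y_given[parent] = posterior_mean(sum(y == "heads" for y in ys),
                                           sum(y == "tails" for y in ys))

print(theta_x, theta_y_given)
```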
Incomplete data makes parameters dependent
[Figure: the same X → Y network with parameter nodes ΘX, ΘY|X=heads, and ΘY|X=tails over case 1 and case 2; when some observations are missing, the posterior over these parameters no longer factorizes.]
Overview
• Introduction to Bayesian statistics: learning a probability
• Learning probabilities in a Bayes net
• Learning Bayes-net structure
Learning Bayes-net structure
Given data, which model is correct?
• Model 1: X and Y with no arc between them
• Model 2: X → Y
Bayesian approach
Given data d, ask which model is more likely rather than which is correct: compute the posterior p(m | d) for model 1 (X, Y independent) and model 2 (X → Y).
Bayesian approach: Model Averaging
Given data d, average predictions over the models, weighting each model by its posterior p(m | d).
Bayesian approach: Model Selection
Given data d, keep only the best model (the one with highest posterior), for the sake of:
• Explanation
• Understanding
• Tractability
To score a model, use Bayes' rule
Given data d, the model score is the posterior

p(m | d) ∝ p(m) p(d | m)

where p(d | m) = ∫ p(d | θm, m) p(θm | m) dθm is the "marginal likelihood": the likelihood p(d | θm, m) averaged over the parameter prior.
Thumbtack example
For a single heads/tails variable X with a conjugate Beta(αh, αt) prior, the marginal likelihood has a closed form:

p(d) = ∫ θ^#h (1 − θ)^#t Beta(θ | αh, αt) dθ = [Γ(α) / Γ(α + N)] · [Γ(αh + #h) / Γ(αh)] · [Γ(αt + #t) / Γ(αt)]

where α = αh + αt and N = #h + #t.
More complicated graphs
For the network X → Y (both heads/tails), the marginal likelihood factors into 3 separate thumbtack-like terms: one for X, one for Y | X=heads, and one for Y | X=tails.
Computation of Marginal Likelihood
Efficient closed form if:
• Local distributions are from the exponential family (binomial, Poisson, gamma, ...)
• Parameter independence
• Conjugate priors
• No missing data (including no hidden variables)
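With those conditions satisfied, the two-variable example has a closed-form score: each model's marginal likelihood is a product of Beta-Bernoulli terms. A sketch comparing model 1 (X and Y independent) with model 2 (X → Y), assuming Beta(1, 1) priors everywhere; the priors, data, and function names are illustrative choices, not the tutorial's:

```python
from math import lgamma

def log_beta_bernoulli_evidence(n_h, n_t, alpha_h=1.0, alpha_t=1.0):
    """Log marginal likelihood of n_h heads, n_t tails under a Beta(a_h, a_t) prior:
    Gamma(a)/Gamma(a+N) * Gamma(a_h+n_h)/Gamma(a_h) * Gamma(a_t+n_t)/Gamma(a_t)."""
    a, n = alpha_h + alpha_t, n_h + n_t
    return (lgamma(a) - lgamma(a + n)
            + lgamma(alpha_h + n_h) - lgamma(alpha_h)
            + lgamma(alpha_t + n_t) - lgamma(alpha_t))

data = [("heads", "tails"), ("heads", "heads"), ("tails", "tails"), ("heads", "heads")]
hx = sum(x == "heads" for x, _ in data); tx = len(data) - hx
hy = sum(y == "heads" for _, y in data); ty = len(data) - hy

# Model 1 (X and Y independent): two thumbtack-like terms.
log_m1 = log_beta_bernoulli_evidence(hx, tx) + log_beta_bernoulli_evidence(hy, ty)

# Model 2 (X -> Y): one term for X plus one term per parent configuration of Y.
log_m2 = log_beta_bernoulli_evidence(hx, tx)
for parent in ("heads", "tails"):
    ys = [y for x, y in data if x == parent]
    nh = sum(y == "heads" for y in ys)
    log_m2 += log_beta_bernoulli_evidence(nh, len(ys) - nh)

print("log p(d | m1) =", round(log_m1, 3), " log p(d | m2) =", round(log_m2, 3))
```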
Practical considerations
The number of possible BN structures for n variables is super-exponential in n.
• How do we find the best graph(s)?
• How do we assign structure and parameter priors to all possible graphs?
Model search
• Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995)
• Heuristic methods:
 - Greedy
 - Greedy with restarts
 - MCMC methods
[Flowchart: initialize the structure; score all possible single changes; if any change improves the score, perform the best change and repeat; otherwise return the saved structure.]
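A minimal sketch of the greedy loop from the flowchart above, restricted to single-edge additions and deletions (edge reversals are omitted for brevity). The score function is supplied by the caller, e.g. a sum of family-wise log marginal likelihoods as in the earlier sketch; all names here are illustrative assumptions:

```python
from itertools import permutations

def greedy_search(variables, data, score, max_iter=100):
    """Greedy hill-climbing over single-edge additions/deletions.
    `score(edges, data)` is assumed to return a network score to maximize
    (e.g., log marginal likelihood plus log structure prior)."""
    edges = set()                                  # initialize with the empty structure
    best = score(edges, data)
    for _ in range(max_iter):
        candidates = []
        for x, y in permutations(variables, 2):    # score all single-edge changes
            change = edges - {(x, y)} if (x, y) in edges else edges | {(x, y)}
            if not _has_cycle(change, variables):
                candidates.append((score(change, data), change))
        top_score, top_edges = max(candidates, key=lambda c: c[0])
        if top_score <= best:                      # no change improves the score
            return edges, best                     # return the saved structure
        edges, best = top_edges, top_score         # perform the best change
    return edges, best

def _has_cycle(edges, variables):
    # Kahn-style check: repeatedly remove nodes with no incoming edges.
    remaining, active = set(variables), set(edges)
    while remaining:
        roots = {v for v in remaining if not any(child == v for _, child in active)}
        if not roots:
            return True                            # every remaining node has a parent
        remaining -= roots
        active = {(p, c) for p, c in active if p not in roots}
    return False

# Toy usage with a dummy score that simply prefers fewer edges (illustrative only).
print(greedy_search(["X", "Y", "Z"], data=None, score=lambda e, d: -len(e)))
```

Greedy with restarts and MCMC methods explore the same single-change neighbourhood but differ in how they escape local maxima.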
Structure priors
1. All possible structures equally likely
2. Partial ordering, required / prohibited arcs
3. p(m) ∝ similarity(m, prior BN)
Parameter priors
• All uniform: Beta(1, 1)
• Use a prior BN
Parameter priors
Recall the intuition behind the Beta prior for the thumbtack:
• The hyperparameters αh and αt can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
• Equivalent sample size = αh + αt
• The larger the equivalent sample size, the more confident we are about the long-run fraction
Parameter priors (continued)
From a prior BN over X1…Xn and an equivalent sample size, derive an imaginary count for any variable configuration; together with parameter modularity, this yields parameter priors for any BN structure over X1…Xn.
[Figure: a prior network over x1…x9.]
Combine user knowledge and data
[Figure: a prior network plus an equivalent sample size, combined with data (a table of true/false values for x1, x2, x3, …), yield improved network(s) over x1…x9.]