Advances in Bayesian Learning: Learning and Inference in Bayesian Networks
Irina Rish
IBM T.J. Watson Research Center
rish@us.ibm.com
“Road map”
• Introduction and motivation: what are Bayesian networks and why use them?
• How to use them
  • Probabilistic inference
• How to learn them
  • Learning parameters
  • Learning graph structure
• Summary
Bayesian Networks
[Figure: example network over Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), Dyspnoea (D), with edges S→C, S→B, S→X, C→X, C→D, B→D]
Example query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
What are they good for?
• Diagnosis: P(cause|symptom) = ?
• Prediction: P(symptom|cause) = ?
• Classification: P(class|data)
• Decision-making (given a cost function)
Application areas: medicine, bio-informatics, speech recognition, text classification, computer troubleshooting, stock market prediction.
Bayesian Networks: Representation
The joint distribution factors according to the graph:
P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
Each node stores a conditional probability distribution (CPD) given its parents; for example, P(D|C,B):

C B | D=0 D=1
0 0 | 0.1  0.9
0 1 | 0.7  0.3
1 0 | 0.8  0.2
1 1 | 0.9  0.1

Conditional independencies make this an efficient representation.
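The factorization can be written out directly in code. A minimal sketch (Python, binary variables): only the P(D|C,B) table is taken from the slide, and every other CPT number is a hypothetical placeholder.

```python
# A minimal sketch of the factored joint (binary variables).
# Only the P(D|C,B) table is taken from the slide; every other CPT
# number below is a hypothetical placeholder.
P_S1 = 0.3                                    # hypothetical prior P(S=1)
P_C1_given_S = {0: 0.01, 1: 0.10}             # hypothetical P(C=1|S)
P_B1_given_S = {0: 0.05, 1: 0.30}             # hypothetical P(B=1|S)
P_X1_given_CS = {(0, 0): 0.02, (0, 1): 0.05,  # hypothetical P(X=1|C,S)
                 (1, 0): 0.90, (1, 1): 0.95}
P_D1_given_CB = {(0, 0): 0.9, (0, 1): 0.3,    # P(D=1|C,B) from the CPD table
                 (1, 0): 0.2, (1, 1): 0.1}

def bern(p1, v):
    """P(V=v) for a binary V with P(V=1) = p1."""
    return p1 if v == 1 else 1.0 - p1

def joint(s, c, b, x, d):
    """P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)."""
    return (bern(P_S1, s)
            * bern(P_C1_given_S[s], c)
            * bern(P_B1_given_S[s], b)
            * bern(P_X1_given_CS[(c, s)], x)
            * bern(P_D1_given_CB[(c, b)], d))
```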
Bayesian Networks: Inference — P(X|evidence) = ?
Example: P(s|d=1). Variable elimination pushes the summations into the factorized joint:
Σ_{c,b,x} P(s) P(c|s) P(b|s) P(x|c,s) P(d|c,b) = P(s) Σ_b P(b|s) Σ_c P(c|s) P(d|c,b) Σ_x P(x|c,s)
Complexity is exponential in w*, the "induced width" (max clique size of the triangulated "moral" graph); here w* = 4.
Efficient inference depends on good variable orderings, conditioning, and approximations.
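A sketch of this query in code, reusing joint() from the sketch above. Brute-force enumeration over the hidden variables stands in for a real elimination ordering, which is fine for a network this small:

```python
# A sketch of the query P(S | D=1): sum the joint over the hidden
# variables C, B, X, then normalize. Brute-force enumeration is fine
# for 5 binary nodes; variable elimination reorders these sums so the
# cost stays exponential only in the induced width w*.
def posterior_S_given_D1():
    unnorm = [sum(joint(s, c, b, x, 1)
                  for c in (0, 1) for b in (0, 1) for x in (0, 1))
              for s in (0, 1)]
    z = sum(unnorm)
    return [p / z for p in unnorm]   # [P(S=0|D=1), P(S=1|D=1)]

print(posterior_S_given_D1())
```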
“Road map”
• Introduction and motivation: what are Bayesian networks and why use them?
• How to use them
  • Probabilistic inference
• Why and how to learn them
  • Learning parameters
  • Learning graph structure
• Summary
Why learn Bayesian networks?
• Efficient representation and inference
• Combining domain expert knowledge with data: incremental learning starts from a prior P(H) and updates it as records arrive
• Handling missing data, e.g. records like <1.3 2.8 ?? 0 1> and <?? 5.6 0 10 ??>
• Learning causal relationships (e.g., S → C)
Learning Bayesian Networks
• Known graph, complete data: parameter estimation (ML, MAP)
• Known graph, incomplete data: non-linear parametric optimization (gradient descent, EM)
• Unknown graph, complete data: optimization (search in the space of graphs)
• Unknown graph, incomplete data: structural EM, mixture models
Learning Parameters: Complete Data
• ML-estimate: maximize log P(D|Θ) — decomposable into independent per-family terms; with multinomial counts N(x, pa), the estimate is θ_{x|pa} = N(x, pa) / N(pa)
• MAP-estimate (Bayesian statistics): conjugate Dirichlet priors add pseudo-counts α, giving θ_{x|pa} = (N(x, pa) + α(x, pa)) / (N(pa) + α(pa)); the prior's equivalent sample size encodes the strength of prior knowledge
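A tiny numeric sketch of both estimates for a single CPT entry; all counts and pseudo-counts below are hypothetical:

```python
# A tiny numeric sketch: estimating theta = P(C=1 | S=1) from complete
# data. All counts and Dirichlet pseudo-counts are hypothetical.
N_c1_s1, N_c0_s1 = 12, 88        # hypothetical counts N(C=1,S=1), N(C=0,S=1)
a_c1, a_c0 = 1.0, 1.0            # hypothetical Dirichlet pseudo-counts

N_s1 = N_c1_s1 + N_c0_s1
theta_ml = N_c1_s1 / N_s1                             # N(x,pa) / N(pa)
theta_map = (N_c1_s1 + a_c1) / (N_s1 + a_c1 + a_c0)   # Dirichlet-smoothed
print(theta_ml, theta_map)       # 0.12 vs. 0.1274... (prior pulls estimate)
```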
Learning Parameters: Incomplete Data
With hidden nodes or missing values, the marginal likelihood is non-decomposable. The EM algorithm iterates until convergence, starting from initial parameters:
• Expectation: run inference in the current model to compute expected counts, e.g. P(S | X=0, D=1, C=0, B=1) for an incomplete record <? 0 1 0 1>
• Maximization: update the parameters (ML, MAP) from the expected counts
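A toy EM sketch under deliberately simplified assumptions: a two-node model S → D with P(D|S) known and held fixed, estimating only θ = P(S=1) from records where S is sometimes missing; all numbers are hypothetical:

```python
# A toy EM sketch, deliberately simplified: two-node model S -> D with
# P(D=1|S) known and held fixed; we estimate only theta = P(S=1) from
# records (s, d) where s is sometimes missing (None). All numbers are
# hypothetical.
P_D1_GIVEN_S = {0: 0.3, 1: 0.8}
data = [(1, 1), (0, 0), (None, 1), (None, 0), (1, 1), (None, 1)]

theta = 0.5                                  # initial parameter guess
for _ in range(50):                          # iterate E/M to convergence
    exp_s1 = 0.0
    for s, d in data:
        if s is not None:
            exp_s1 += s                      # observed: count directly
        else:                                # E-step: P(S=1 | D=d)
            p1 = theta * (P_D1_GIVEN_S[1] if d else 1 - P_D1_GIVEN_S[1])
            p0 = (1 - theta) * (P_D1_GIVEN_S[0] if d else 1 - P_D1_GIVEN_S[0])
            exp_s1 += p1 / (p1 + p0)
    theta = exp_s1 / len(data)               # M-step: ML from expected counts
print(theta)
```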
Learning Graph Structure
Finding the best graph is an NP-hard optimization problem, so heuristic search is used:
• Greedy local search over single-edge moves (e.g., add S→B, delete S→B, reverse S→B); a skeleton appears below
• Best-first search
• Simulated annealing
With complete data the score decomposes, so each move requires only local computations; with incomplete data the score is non-decomposable, and Structural EM is used instead.
Constraint-based methods take a different route: the data impose independence relations (constraints) from which the graph is read off.
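A skeleton of greedy local search over add/delete/reverse moves; score(graph, data) and is_acyclic(graph) are caller-supplied placeholders, not any particular library's API:

```python
# A skeleton of greedy local search over graphs. One move = add, delete,
# or reverse a single edge. score(graph, data) and is_acyclic(graph) are
# caller-supplied placeholders (e.g., an MDL score as on the next slide).
from itertools import permutations

def neighbors(edges, nodes):
    """All graphs one edge-move away from `edges` (a frozenset of (u, v))."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                   # delete u -> v
            yield (edges - {(u, v)}) | {(v, u)}      # reverse u -> v
        elif (v, u) not in edges:
            yield edges | {(u, v)}                   # add u -> v

def greedy_search(nodes, data, score, is_acyclic):
    g = frozenset()                                  # start from empty graph
    while True:
        moves = [n for n in neighbors(g, nodes) if is_acyclic(n)]
        best = max(moves, key=lambda n: score(n, data), default=None)
        if best is None or score(best, data) <= score(g, data):
            return g                                 # local optimum reached
        g = best
```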
Scoring Functions: Minimum Description Length (MDL)
MDL casts learning as data compression: the score is DL(Model) + DL(Data|Model), trading the cost of encoding the model against how well it compresses the data.
Other scores: MDL = −BIC (Bayesian Information Criterion); the Bayesian score (BDe) is asymptotically equivalent to MDL.
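A sketch of the BIC-form MDL score for one family X | Pa(X) under complete data; the layout of the precomputed counts is my own illustrative choice:

```python
# A sketch of the BIC-form MDL score for one family X | Pa(X), given
# complete data. `counts` maps each parent configuration to the counts
# N(x, pa); its layout is my own illustrative choice. Lower is better.
import math

def family_mdl(counts, n_states_x, N):
    loglik = 0.0
    for x_counts in counts.values():
        n_pa = sum(x_counts.values())
        for n in x_counts.values():
            if n > 0:
                loglik += n * math.log(n / n_pa)     # ML plug-in likelihood
    k = len(counts) * (n_states_x - 1)               # free CPT parameters
    return 0.5 * k * math.log(N) - loglik            # DL(Model) + DL(Data|Model)

# Hypothetical counts for a binary X with one binary parent, N = 100 records:
counts = {(0,): {0: 30, 1: 10}, (1,): {0: 5, 1: 55}}
print(family_mdl(counts, n_states_x=2, N=100))
```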
Summary
• Bayesian networks: graphical probabilistic models with efficient representation and inference
• Combine expert knowledge with learning from data
• Learning: parameters (parameter estimation, EM) and structure (optimization with scoring functions, e.g., MDL)
• Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
• Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.