“Lost in the Middle of Nowhere” Graduate Student Presentation. M. J. Gravier
Learning Bayesian Network Structure from Distributed Data. R. Chen, K. Sivakumar, H. Kargupta. SIAM International Conference on Data Mining, 2003
Overview • What is a Bayesian network? • What problem is addressed? • What is the contribution?
Bayesian Networks • “...state-of-the-art representation of probabilistic knowledge.” • Graphical diagrams • Probabilistic degrees of dependency • Efficient representation of a joint probability distribution
Source: Sun-Me Lee and Patricia Abbott, “Bayesian networks for knowledge discovery in large datasets: basics for nurse researchers,” Journal of Biomedical Informatics, 36 (2003): 389-399.
Simple Bayesian Network • Day after rock concert (X1) • Poor exam grade (X2) • Mega headache (X3)
“Structure Learning”: discovering relationships by • a dependence analysis method (a constraint satisfaction problem, often based on hypothesis testing) • a search and score method (basically an optimization problem)
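As a rough illustration of how a Bayesian network represents a joint distribution compactly, the sketch below factorizes the three-node example. The slide's arrows did not survive, so the structure used here (X1 as the only parent of both X2 and X3) is an assumption, and every probability is invented for illustration.

```python
from itertools import product

# Assumed structure (not confirmed by the slide): X1 -> X2 and X1 -> X3.
# All probabilities below are made up for illustration only.
p_x1 = {True: 0.2, False: 0.8}             # P(X1): day after rock concert
p_x2_given_x1 = {True: 0.7, False: 0.1}    # P(X2 = True | X1): poor exam grade
p_x3_given_x1 = {True: 0.9, False: 0.05}   # P(X3 = True | X1): mega headache


def bernoulli(p_true, value):
    """Probability of a binary outcome given P(outcome = True)."""
    return p_true if value else 1.0 - p_true


def joint(x1, x2, x3):
    """P(X1, X2, X3) = P(X1) * P(X2 | X1) * P(X3 | X1)."""
    return (p_x1[x1]
            * bernoulli(p_x2_given_x1[x1], x2)
            * bernoulli(p_x3_given_x1[x1], x3))


# The factorized joint must still sum to 1 over all 8 assignments.
total = sum(joint(*values) for values in product([True, False], repeat=3))
```

Under this assumed structure the network needs 5 free parameters (1 + 2 + 2) rather than the 7 required by a full joint table over three binary variables, which is the sense in which the representation is efficient.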
Advantages of BN • Domain expert knowledge • Simple to understand • Captures interactions • Flexible re: missing information • Less influenced by sample size
Disadvantages of BN • Need conditional probabilities • Lack of software • Computational complexity
Typical Centralized Data (diagram: Sites 1 to 5 around a single central Database)
What if it's Decentralized? • Different data at each site • How do you create your Bayesian network model in this environment? (diagram: Sites 1 to 5)
Issues: • A variable's data can all be at one site • A variable's data may be spread across two or more sites • Bandwidth
Collective Learning • Local Learning • Sample selection • Cross learning • Combination of the results
1. Local Learning • Local variable: since all the information is available locally, the normal local scoring method works • But what about non-local variables?
Cross Variables • Some local and some non-local parents • Local links can be found • Problem with cross links: U_local → Y_local is learned instead of U_local → Z_non-local → Y_local (diagram: U, Z and Y split across Site 1 and Site 2)
2. Sample Selection • Rank samples using the local models • Low probabilities are evidence of cross relationships • Send “keys” for samples ranked below threshold ρ from each site to a central site
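The selection step at one site might be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes ρ is a threshold on a sample's likelihood under the local model, and it stands in for the local BN with a toy model of independent binary variables. The names `log_likelihood` and `select_keys` and all data are hypothetical.

```python
import math


def log_likelihood(sample, model):
    """Log-probability of one sample under a toy local model.

    `model` maps each local variable to P(variable = 1). Treating the
    variables as independent is a simplification; a real local BN would
    use its conditional probability tables instead.
    """
    return sum(math.log(p if sample[var] == 1 else 1.0 - p)
               for var, p in model.items())


def select_keys(samples, model, rho):
    """Return the keys of samples whose likelihood falls below rho."""
    log_rho = math.log(rho)
    return {key for key, sample in samples.items()
            if log_likelihood(sample, model) < log_rho}


# Hypothetical site data: key -> observed values of the local variables.
local_model = {"U": 0.9, "Y": 0.9}
site_samples = {
    1: {"U": 1, "Y": 1},   # likelihood 0.81: well explained locally
    2: {"U": 1, "Y": 0},   # likelihood 0.09: poorly explained
    3: {"U": 0, "Y": 0},   # likelihood 0.01: poorly explained
}
keys_to_send = select_keys(site_samples, local_model, rho=0.1)
```

Only the keys leave the site at this stage, which is what keeps the bandwidth requirement small.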
3. Cross Learning • Keys from step 2 used to create a BN of cross relationships • ρ selection is critical • Try two different levels and retain common cross links as a noise reduction method • Cross learning eliminates hidden variables
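The noise-reduction idea of retaining only the cross links found at both ρ levels amounts to a set intersection. The sketch below assumes the structure learner has already produced a set of directed links at each level; the learner itself is not shown, and the link sets and the variable W are invented for illustration.

```python
def common_cross_links(links_at_first_rho, links_at_second_rho):
    """Keep only the cross links that were found at both threshold levels."""
    return set(links_at_first_rho) & set(links_at_second_rho)


# Hypothetical results of cross learning at two different rho levels.
links_first = {("U", "Z"), ("Z", "Y"), ("W", "Y")}   # ("W", "Y") is spurious here
links_second = {("U", "Z"), ("Z", "Y")}

retained = common_cross_links(links_first, links_second)
```

A link that shows up at only one level, such as the hypothetical W → Y above, is treated as noise and dropped.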
4. Combination • Combine local & cross-learned BNs • All local BNs assembled, then cross links added from the cross-learned BN • Finds missing cross links for cross variables • Eliminates extra local links (hidden variable problem)
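A minimal sketch of the combination step, with BNs represented as sets of directed links. The slide does not say how an extra local link is recognized, so the rule used here is an assumption: a local link is dropped when both of its endpoints appear in the cross BN but the link itself does not. The function name `combine` and the example variables are hypothetical.

```python
def combine(local_bns, cross_bn, site_of):
    """Merge local BNs with the cross BN.

    local_bns: list of sets of (parent, child) links, one set per site
    cross_bn:  set of (parent, child) links learned at the central site
    site_of:   dict mapping each variable to the site that stores it
    """
    local_links = set().union(*local_bns)
    cross_nodes = {node for link in cross_bn for node in link}

    # Cross links connect variables stored at different sites.
    cross_links = {(a, b) for a, b in cross_bn if site_of[a] != site_of[b]}

    # Assumed rule: a local link between two nodes covered by the cross BN
    # is kept only if the cross BN also contains it.
    kept_local = {
        (a, b) for a, b in local_links
        if not (a in cross_nodes and b in cross_nodes) or (a, b) in cross_bn
    }
    return kept_local | cross_links


# Hypothetical example: Z is hidden from site 1, so site 1 learns U -> Y.
site_of = {"A": 1, "U": 1, "Y": 1, "Z": 2}
site1_bn = {("A", "U"), ("U", "Y")}
site2_bn = set()
cross_bn = {("U", "Z"), ("Z", "Y")}

final_bn = combine([site1_bn, site2_bn], cross_bn, site_of)
```

In this example the extra local link U → Y is replaced by the path U → Z → Y, while the purely local link A → U is left untouched.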
Experimental Validation • ALARM network: model for on-line monitoring of ICU patients; widely used BN benchmark • Characteristics: 37 nodes, 5 cross variables, 15,000 samples
Experimental Results • Learned correct structure • All cross links detected • ~10% of all samples transmitted
Conclusion • Collective learning method learned same BN as centralized method • Small data transmission requirement • First approach to learn BN structure from heterogeneous data