Data Mining Cardiovascular Bayesian Networks
Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡ (Danny Liew‡, Sophie Rogers‡, Lucas Hope†)
†School of Computer Science & Software Engineering
‡Dept. of Epidemiology & Preventive Medicine
Monash University
www.datamining.monash.edu.au/bnepi
Overview
Problem: assessment of risk for coronary heart disease (CHD)
1. Knowledge Engineering
2. Data Mining
3. Evaluation
Diagram labels: Busselton Study data; 2 epidemiological models; Bayesian network software (Netica); medical experts; causal discovery (CaMML) + other learners
Knowledge Engineering BNs from the medical literature
• The Australian Busselton Study
  • Every 3 years, 1966-1981, > 8,000 participants
  • Mortality follow-up via the WA death register and manually
  • Cox proportional-hazards model, 2,258 participants from the 1978 cohort
  • CHD event base rates: 23% for men, 14% for women
• The German PROCAM Study
  • 1979-1985, follow-up every 2 years, > 25,000 participants
  • Scoring model (based on Cox), ~5,000 men
  • CHD event base rates: ~6%
General question: are models transferable across populations?
The Busselton BN
• BNs summarize the joint distribution, e.g. P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)
• Arcs uninformative
• All nodes have an associated conditional probability distribution
Figure labels: predictor variables; 10-year risk of CHD event
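The factorization on this slide can be made concrete with a small sketch. It assumes all four variables are binary and uses made-up conditional probabilities; the numbers and the helper names (`bern`, `joint`, `posterior_s_given_b`) are illustrative assumptions, not values from the Busselton BN.

```python
from itertools import product

# Hypothetical CPTs for four binary variables (numbers are invented).
p_s = {True: 0.3, False: 0.7}             # P(S)
p_b_given_s = {True: 0.6, False: 0.2}     # P(B=True | S)
p_al_given_s = {True: 0.5, False: 0.1}    # P(Al=True | S)
p_at_given_s = {True: 0.4, False: 0.05}   # P(At=True | S)


def bern(p_true, value):
    """Probability of a binary value given P(value=True)."""
    return p_true if value else 1.0 - p_true


def joint(s, b, al, at):
    """P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)."""
    return (p_s[s]
            * bern(p_b_given_s[s], b)
            * bern(p_al_given_s[s], al)
            * bern(p_at_given_s[s], at))


def posterior_s_given_b(b):
    """P(S=True | B=b), obtained by summing the joint over Al and At."""
    num = sum(joint(True, b, al, at)
              for al, at in product([True, False], repeat=2))
    den = sum(joint(s, b, al, at)
              for s, al, at in product([True, False], repeat=3))
    return num / den


# The factorized joint still sums to 1 over all 16 configurations.
total = sum(joint(*values) for values in product([True, False], repeat=4))
```

With binary variables, this factorization needs 7 free parameters (1 for S, 2 for each child), whereas an unconstrained joint over four binary variables needs 15.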
The Busselton BN: discretization
Figure labels: binary nodes; discretization choices
The Busselton BN: reasoning
Figure labels: normal; bad cholesterol; heavy smoking
The Busselton BN: reasoning
Figure label: more risk factors!
A risk assessment tool for clinicians
• Previous tool: TAKEHEART
• Combine risk assessment (probability) with costs
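One way to combine a risk probability with costs is an expected-cost comparison between treating and not treating. The sketch below is only an illustration of that idea: the cost figures, the relative risk reduction, and the function names are hypothetical and do not describe TAKEHEART or the tool presented here.

```python
def expected_cost(risk, treat, cost_event, cost_treatment, risk_reduction):
    """Expected cost of one decision.

    risk            -- probability of a CHD event without treatment
    risk_reduction  -- assumed relative risk reduction under treatment
    """
    effective_risk = risk * (1.0 - risk_reduction) if treat else risk
    treatment_cost = cost_treatment if treat else 0.0
    return effective_risk * cost_event + treatment_cost


def should_treat(risk, cost_event=100_000.0, cost_treatment=5_000.0,
                 risk_reduction=0.25):
    """Treat only when treating has the lower expected cost."""
    with_treatment = expected_cost(risk, True, cost_event,
                                   cost_treatment, risk_reduction)
    without_treatment = expected_cost(risk, False, cost_event,
                                      cost_treatment, risk_reduction)
    return with_treatment < without_treatment
```

Under these invented defaults the break-even risk is 5,000 / (0.25 × 100,000) = 0.2, so `should_treat(0.05)` is false and `should_treat(0.30)` is true.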
Risk Assessment Tool: example
• Young, predictor not observed: don’t treat
• Young, predictor observed: don’t treat
• Old, predictor not observed: treat
• Not so old, predictor not observed: treat
CaMML: a causal learner
• Developed at Monash University
• Data mines BNs from epidemiological data
• Minimum message length (MML) metric: trades off complexity against goodness of fit
• MCMC search over model space
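The complexity-versus-fit trade-off can be illustrated with a simple two-part score: bits to state the model plus bits to state the data given the model. The sketch below uses a BIC/MDL-style approximation (half of log2 N bits per parameter), which is an assumption made for illustration and is not the MML metric that CaMML actually computes. It compares "X and Y independent" against "X → Y" for two binary variables.

```python
import math
from collections import Counter


def data_bits(counts):
    """Bits to encode outcomes with these counts at their ML frequencies."""
    n = sum(counts)
    return -sum(c * math.log2(c / n) for c in counts if c > 0)


def score_independent(pairs):
    """Two-part score for the model with no arc (2 parameters)."""
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    fit = data_bits(x_counts.values()) + data_bits(y_counts.values())
    return 0.5 * 2 * math.log2(n) + fit


def score_arc(pairs):
    """Two-part score for the model X -> Y (3 parameters, binary case)."""
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    fit = data_bits(x_counts.values())
    for x_value in x_counts:
        y_counts = Counter(y for x, y in pairs if x == x_value)
        fit += data_bits(y_counts.values())
    return 0.5 * 3 * math.log2(n) + fit


# Strongly dependent sample: the extra arc pays for itself.
dependent = [(0, 0)] * 45 + [(1, 1)] * 45 + [(0, 1)] * 5 + [(1, 0)] * 5
# Perfectly independent sample: the extra parameter is wasted bits.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
```

On the dependent sample the arc model gets the shorter total; on the independent sample the simpler model wins because the arc buys no improvement in fit.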
Evaluation
• Predicting 10-year risk of CHD using Busselton data
• Metrics:
  • ROC curves (area under curve)
  • Bayesian Information Reward (BIR)
• Experiment 1:
  • Compare Busselton, PROCAM and CaMML BNs
• Experiment 2:
  • Compare CaMML and other standard machine learners (from Weka)
  • 90-10 training/testing split, 10-fold cross-validation
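For the ROC metric, the area under the curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The following is a minimal sketch of that pairwise definition; the function name `auc` and the toy scores are illustrative and not taken from the experiments.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores -- predicted risk for each case
    labels -- 1 for a CHD event, 0 otherwise
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))
```

A predictor that gives every case the same score, such as one that declares everyone at risk, gets an AUC of 0.5 under this definition, which is why AUC measures ranking ("relative" risk) rather than calibration. The quadratic loop is fine for a sketch; a rank-based version would be preferable on large test sets.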
Experiment 1: ROC Results
Figure labels: area under curve (AUC); priors; extremes: "Everyone at risk!" and "No-one at risk!"
Summary of Results
Experiment I (Models of whole data)
• PROCAM model does at least as well as Busselton
  • On Busselton data
  • For both "relative" (ROC) and "absolute" (BIR) risk
• CaMML models do as well
  • But much simpler: only 4 nodes matter to CHD10!
Experiment II (Cross-validation of learners)
• Logistic regression does best on both metrics
  • Statistically powerful: only 1 parameter per arc
  • No search required: structure is given
  • No discretization necessary
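The "1 parameter per arc" point can be checked with simple counting. The sketch below compares the free parameters of a full conditional probability table with those of a logistic model for a binary outcome; the choice of 8 binary predictors is a hypothetical example, and counting an intercept alongside the per-arc weights is an assumption of this sketch.

```python
def cpt_parameters(parent_arities, child_arity=2):
    """Free parameters in a full CPT: one distribution per parent configuration."""
    configurations = 1
    for arity in parent_arities:
        configurations *= arity
    return configurations * (child_arity - 1)


def logistic_parameters(n_parents):
    """One weight per incoming arc, plus an intercept (assumed here)."""
    return n_parents + 1


# Hypothetical case: a binary outcome with 8 binary predictors.
full_table = cpt_parameters([2] * 8)
logistic = logistic_parameters(8)
```

For this hypothetical case the full table needs 256 parameters against 9 for the logistic model, so the same data are spread over far fewer estimates.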
Conclusions
• Busselton & PROCAM models appear to perform equally well on Busselton data, using an absolute risk measure (BIR) from the literature
• CaMML results suggest the data have high variance and are too weak to support inference to complex models. Combining data would help.
Future directions
• Improve data mining by
  • Adding prior knowledge to search
  • Assessing whether data sources can be combined; if so, do so
• Investigate combination of continuous and discrete variables in data mining and modeling
• Develop new TAKEHEART model using BNs (taking the best from experts, literature, data mining)
  • with intervention modeling (Causal Reckoner)
  • with decision support
  • with GUI, usable by clinicians
References
• G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Münster (PROCAM) study. Circulation, 105(3):310-315, 2002.
• M.W. Knuiman, H.T. Vu and H.C. Bartholomew. Multivariate risk estimation for coronary heart disease: the Busselton Health Study. Australian & New Zealand Journal of Public Health, 22:747-753, 1998.
• C.S. Wallace and K.B. Korb. Learning Linear Causal Models by MML Sampling. In A. Gammerman, editor, Causal Models and Intelligent Data Management, pages 89-111. Springer-Verlag, 1999. www.datamining.monash.edu.au/software/camml
• C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering cardiovascular Bayesian networks from the literature. Technical Report 2005/170, School of CSSE, Monash University, 2005.