Dental Data Mining: Practical Issues and Potential Pitfalls
Stuart A. Gansky
University of California, San Francisco
Center to Address Disparities in Children’s Oral Health
Support: US DHHS/NIH/NIDCR U54 DE14251
What is Knowledge Discovery and Data Mining (KDD)?
• “Semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data” – MIT Tech Review (2001)
• Interface of
  • Artificial Intelligence – Machine Learning
  • Computer Science – Engineering – Statistics
• Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining (ACM SIGKDD, sponsor of the KDD Cup)
Data Mining as Alchemy (Pb → Au)
Some Potential KDD Applications in Oral Health Research
• Large surveys (e.g. NHANES)
• Longitudinal studies (e.g. VA Aging Study)
• Disease registries (e.g. SEER)
• Digital diagnostics (radiographic & others)
• Molecular biology (e.g. PCR, microarrays)
• Health services research / claims data
• Provider and workforce databases
Supervised Learning
• Regression
• k-nearest neighbor
• Trees (CART, MART, boosting, bagging)
• Random Forests
• Multivariate Adaptive Regression Splines (MARS)
• Neural Networks
• Support Vector Machines
Unsupervised Learning
• Hierarchical clustering
• k-means
KDD Steps
• Collect & Store: sample, merge, warehouse
• Pre-Process: clean, impute, transform, standardize, register
• Analyze: supervised, unsupervised, visualize
• Validate: internal (split sample, cross-validate, bootstrap), external
• Act: intervene, set policy
Example – Caries
• Predicting disease with traditional logistic regression can run into modeling difficulties: nonlinearity (ANN better) and interactions (CART better) (Kattan et al, Comput Biomed Res, 1998)
• Goal: compare the performance of logistic regression with popular data mining techniques – tree and artificial neural network models – on dental caries data
• CART has previously been applied to caries (Stewart & Stamm, J Dent Res, 1991)
Example Study – Child Caries
• Background: ~20% of children have ~80% of caries (tooth decay)
• University of Rochester longitudinal study (Leverett et al, J Dent Res, 1993)
• 466 first and second graders, caries-free at baseline
• Saliva samples & exams every 6 months
• Goal: predict 24-month caries incidence (output)
18-month Predictors (Inputs)
• Salivary bacteria
  • Mutans Streptococci (log10 CFU/ml)
  • Lactobacilli (log10 CFU/ml)
• Salivary chemistry
  • Fluoride (ppm)
  • Calcium (mmol/l)
  • Phosphate (ppm)
Modeling Methods
• Logistic Regression
• Neural Networks
• Decision Trees
Logistic Regression Models
[Schematic surface: logit(primary dentition caries) vs log10 Mutans Streptococci and fluoride (F, ppm)]

Tree Models
[Schematic surface: logit(primary dentition caries) vs log10 Mutans Streptococci and fluoride (F, ppm)]

Artificial Neural Networks
[Schematic surface: logit(primary dentition caries) vs log10 Mutans Streptococci and fluoride (F, ppm)]
Artificial Neural Network (p-r-1)
[Diagram: p inputs x1 … xp feed r hidden-layer neurons h1 … hr through weights wij; the hidden neurons feed a single output y through weights wj]
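The forward pass of such a p-r-1 network can be sketched in a few lines of Python. This is a minimal numpy illustration of the architecture only (hypothetical random weights, not the software or fitted weights from the study); with p = 5 inputs and r = 3 hidden neurons it has the 22 free parameters quoted later for the 5-3-1 caries model.

```python
import numpy as np

def ann_forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a p-r-1 multilayer perceptron.

    x        : (p,) input vector
    W_hidden : (r, p) input-to-hidden weights wij
    b_hidden : (r,) hidden-layer biases
    w_out    : (r,) hidden-to-output weights wj
    b_out    : scalar output bias
    Returns the predicted probability y in (0, 1).
    """
    h = np.tanh(W_hidden @ x + b_hidden)   # r hidden neurons, tanh activation
    logit = w_out @ h + b_out              # single output node
    return 1.0 / (1.0 + np.exp(-logit))    # logistic output

# A 5-3-1 network: 5*3 weights + 3 biases + 3 weights + 1 bias = 22 parameters.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
w = rng.normal(size=3)
b0 = 0.0
p_hat = ann_forward(rng.normal(size=5), W, b, w, b0)
```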
Common Mistakes with ANN (Schwarzer et al, Stat Med, 2000)
• Too many parameters for the sample size
• No validation
• No model complexity penalty (e.g. Akaike Information Criterion (AIC))
• Incorrect misclassification estimation
• Implausible fitted function
• Incorrectly described network complexity
• Inadequate statistical competitors
• Insufficient comparison to statistical competitors
Validation
• Split sample (70% training / 30% validation) – the held-out validation set gives an unbiased misclassification estimate
• K-fold cross-validation – mean squared error (Brier score)
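K-fold cross-validation with the Brier score can be sketched as follows. This is a toy numpy illustration only (the `predict_fn` here is a hypothetical constant-prevalence predictor, not any model from the study):

```python
import numpy as np

def kfold_brier(y, predict_fn, k=5, seed=0):
    """K-fold cross-validated Brier score (mean squared error of
    predicted probabilities). predict_fn(train_idx, test_idx) must
    return predicted probabilities for the test fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        p = predict_fn(train, test)
        errs.append(np.mean((y[test] - p) ** 2))
    return float(np.mean(errs))

# Toy outcome vector at ~15% prevalence, echoing the caries example;
# the predictor simply assigns every child the training-fold prevalence.
y = np.array([0] * 85 + [1] * 15)
brier = kfold_brier(y, lambda tr, te: np.full(len(te), y[tr].mean()))
```

For a no-information predictor at prevalence p, the Brier score sits near p(1 - p), about .13 here; a useful model should do better.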
Why Validate? Example: Overfitting in 2 Dimensions
Caries Example Model Settings
• Logit
  • Stepwise selection
  • Alpha = .05 to enter, alpha = .20 to stay
  • AIC to judge additional predictors
• Tree
  • Splitting criterion: Gini index
  • Pruning: proportion correctly classified
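The Gini index named as the tree's splitting criterion is simple to compute: a node's impurity is 1 - Σ p_k², and a candidate split is scored by the size-weighted impurity of its children. A minimal sketch (illustrative helper functions, not the tree software used in the study):

```python
def gini_index(labels):
    """Gini impurity 1 - sum(p_k^2); 0 for a pure node,
    maximal when classes are evenly mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in probs)

def split_gini(left, right):
    """Size-weighted Gini impurity of a binary split; CART
    prefers the candidate split with the smallest value."""
    n = len(left) + len(right)
    return (len(left) * gini_index(left) + len(right) * gini_index(right)) / n

# A node at the study's overall 15% prevalence: 1 - (.85^2 + .15^2) = .255
g_root = gini_index([0] * 85 + [1] * 15)
```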
ANN Settings
• Artificial Neural Network (5-3-1 = 22 df)
• Multilayer perceptron
• 5 preliminary runs
• Levenberg-Marquardt optimization
• No weight decay parameter
• Average error selection
• 3 hidden nodes/neurons
• Activation function: hyperbolic tangent
ANN Sensitivity Analyses
• Random seeds: 5 values – no differences
• Weight decay parameters: 0, .001, .005, .01, .25 – only slight differences for .01 and .25
• Hidden nodes/neurons: 2, 3, 4 – 3 seems best
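The weight-decay parameter varied above adds a ridge-style penalty λ·Σw² to the network's loss, shrinking weights toward zero and smoothing the fitted surface. A minimal sketch of that penalized objective (illustrative only; the decay values are those from the sensitivity analysis, the data are made up):

```python
import numpy as np

def penalized_loss(y, p, weights, decay):
    """Cross-entropy loss plus a weight-decay (ridge) penalty
    decay * sum(w^2). decay = 0 recovers the unpenalized fit."""
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ce + decay * np.sum(weights ** 2)

# Toy values: two observations, two network weights.
y = np.array([1.0, 0.0])
p = np.array([0.8, 0.2])
w = np.array([1.0, 2.0])
loss0 = penalized_loss(y, p, w, 0.0)    # plain cross-entropy
loss_d = penalized_loss(y, p, w, 0.01)  # adds .01 * (1 + 4) = .05
```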
Tree Model
[Decision tree diagram: overall primary caries prevalence 15% (N=322 training, N=144 validation). Nodes are flagged as above or below the overall 15%. Splits use log10 MS (cutpoints 7.08 and 3.91), log10 LB (cutpoint 3.05), and F (cutpoints .110 and .056); terminal-node prevalences range from 0% to 100%]
Logistic Regression
Variable    Beta   Std Err   Odds Ratio   95% CI
log10 MS    .238   .072      1.27         1.10 – 1.46
log10 LB    .311   .070      1.36         1.19 – 1.57
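The odds ratios and confidence intervals follow directly from the coefficients: OR = exp(beta), with 95% CI exp(beta ± 1.96·SE). A small check using the values in the table:

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logistic-regression coefficient and its standard
    error into (odds ratio, CI lower bound, CI upper bound)."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

or_ms = odds_ratio_ci(0.238, 0.072)  # log10 MS
or_lb = odds_ratio_ci(0.311, 0.070)  # log10 LB
```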
Predicted Quintiles
[Plot: standardized log10 Mutans Streptococci (LOGMS4) by rank of ANN predicted probability (PR_ANN), quintiles 0–4]
[Plot: standardized log10 Lactobacilli (LOGLB4) by rank of ANN predicted probability (PR_ANN), quintiles 0–4]
5-fold CV Results
            Logit   Tree   ANN
RMS error   .365    .363   .362
AUC         .680    .553   .707
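Both validation metrics are easy to compute from predicted probabilities: RMS error is the root Brier score, and AUC can be computed by the rank (Mann–Whitney) formulation, the probability that a random case outscores a random control. A minimal numpy sketch on toy data (not the study's predictions):

```python
import numpy as np

def auc(y, p):
    """Area under the ROC curve via the Mann-Whitney formulation:
    fraction of case/control pairs ranked correctly (ties count half)."""
    pos = p[y == 1]
    neg = p[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def rms_error(y, p):
    """Root mean squared error of predicted probabilities (root Brier score)."""
    return float(np.sqrt(np.mean((y - p) ** 2)))

# Toy data: one of the four case/control pairs is discordant, so AUC = .75.
y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8])
```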
Summary
• Data quality and study design are paramount
• Utilize multiple methods
• Be sure to validate
• Graphical displays help interpretation
• KDD methods may provide advantages over traditional statistical models in dental data
A prediction is only as good as the data and the model behind it.