Computational Intelligence in Biomedical and Health Care Informatics, HCA 590 (Topics in Health Sciences) Rohit Kate Support Vector Machines: A Sample Medical Application
Reading • Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. Wei Yu, Tiebin Liu, Rodolfo Valdez, Marta Gwinn and Muin J. Khoury. BMC Medical Informatics and Decision Making 2010, 10:16. http://www.biomedcentral.com/content/pdf/1472-6947-10-16.pdf
Aim of the Authors • Predict (i) diabetes and (ii) pre-diabetes from easily available variables • Why? • In the U.S., 23.6 million people have diabetes and one third of them are unaware of it • Another 57 million people have pre-diabetes (at risk of developing diabetes, heart disease and stroke); with lifestyle changes, their onset of diabetes can be prevented • The authors build SVM models to predict diabetes and pre-diabetes
Training Data Source • Used the 1999-2004 data set from the National Health and Nutrition Examination Survey (NHANES) • NHANES (http://www.cdc.gov/nchs/nhanes.htm): • Collects demographic, health history and behavioral information through home surveys • An ongoing cross-sectional probability sample survey of the U.S. population • Some of the data is restricted and some is publicly available
Four Categories • Diagnosed diabetes (1266): Answered “Yes” to “Have you ever been told by a doctor or health professional that you had diabetes?” • Undiagnosed diabetes (195): Answered “No” and have fasting plasma glucose level >= 126 mg/dl • Pre-diabetes (1576): Fasting plasma glucose level is between 100 and 125 mg/dl • No diabetes (3277): Fasting plasma glucose level < 100 mg/dl
Two Classifiers • Scheme I: Classify between persons with diabetes (categories 1 & 2) and without diabetes (categories 3 & 4) • Scheme II: Classify between undiagnosed diabetes or pre-diabetes (categories 2 & 3) and no diabetes (category 4) • Two separate SVM models are built, one for each of the above schemes (a labeling sketch follows below)
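A minimal sketch (not the authors' code) of how the four categories and the two labeling schemes could be set up; the field names told_has_diabetes and fasting_glucose are hypothetical stand-ins for the NHANES variables.

```python
# Hypothetical sketch, not the authors' code: derive the four categories and
# the two binary labeling schemes. Field names are made up for illustration.

def assign_category(record):
    """record is a dict with hypothetical keys 'told_has_diabetes' (Yes/No)
    and 'fasting_glucose' (mg/dl)."""
    if record["told_has_diabetes"] == "Yes":
        return "diagnosed_diabetes"        # category 1
    if record["fasting_glucose"] >= 126:
        return "undiagnosed_diabetes"      # category 2
    if record["fasting_glucose"] >= 100:
        return "pre_diabetes"              # category 3
    return "no_diabetes"                   # category 4

def scheme_I_label(category):
    # Scheme I: diabetes (categories 1 & 2) vs. without diabetes (3 & 4)
    return 1 if category in ("diagnosed_diabetes", "undiagnosed_diabetes") else 0

def scheme_II_label(category):
    # Scheme II: undiagnosed diabetes or pre-diabetes (2 & 3) vs. no diabetes (4);
    # diagnosed diabetes (category 1) is left out of this scheme
    if category == "diagnosed_diabetes":
        return None
    return 1 if category in ("undiagnosed_diabetes", "pre_diabetes") else 0

print(scheme_I_label(assign_category({"told_has_diabetes": "No", "fasting_glucose": 130})))  # prints 1
```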
Features • The features or variables should not require lab testing so that the model can be easily used for prediction • Do not include plasma glucose level • Selected 14 simple variables: family history of diabetes, age, gender, race & ethnicity, weight, height, waist circumference, body mass index (BMI), hypertension, physical activity level, smoking, alcohol use, education and household income
SVM Models • They used the freely available LibSVM package for SVM • All feature values were normalized to the range -1 to +1 • Not essential, but it helps the SVM treat all features as equally important to begin with • Discrete features were assigned numerical values (-1, -0.5, 0, 0.5, 1, etc.) • SVM packages can do the above automatically • The best value for C (the noise parameter) and the best kernel were determined by cross-validation • Used a version of SVM that can output a “confidence” besides the class (based on distance from the separating hyperplane); see the sketch below
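As a rough illustration of these steps (scaling to [-1, +1], choosing C and the kernel by cross-validation, and obtaining confidences), here is a sketch using scikit-learn's SVC, which wraps LibSVM. This is not the authors' code, and the data below is synthetic stand-in data, not NHANES.

```python
# Illustrative sketch only: the paper used LibSVM; scikit-learn's SVC wraps
# LibSVM and exposes the same knobs (C, kernel, probability outputs).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 14-variable NHANES data
X, y = make_classification(n_samples=500, n_features=14, random_state=0)

pipeline = Pipeline([
    ("scale", MinMaxScaler(feature_range=(-1, 1))),  # normalize every feature to [-1, +1]
    ("svm", SVC(probability=True)),                  # probability=True yields a "confidence"
])

param_grid = {
    "svm__C": [0.1, 1, 10, 100],                     # candidate values of the noise parameter C
    "svm__kernel": ["linear", "rbf", "poly"],        # candidate kernels
}

# Pick the best C and kernel by cross-validation, as the authors did
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)
print(search.best_estimator_.predict_proba(X[:3]))   # per-class confidences for a few examples
```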
Evaluation • 10-fold cross-validation with the training data • Randomly divide the data into 10 equal parts • Use 9 parts for training and the remaining part for testing • Repeat the previous step 10 times, each time with a different part held out for testing • Combine the 10 testing results • In general this is called n-fold cross-validation • The extreme case, with each example held out in its own fold, is “leave-one-out” cross-validation
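The procedure above can be written out directly; the sketch below uses scikit-learn's KFold on synthetic stand-in data, purely for illustration.

```python
# A minimal sketch of the 10-fold cross-validation procedure described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=14, random_state=0)  # stand-in data

kf = KFold(n_splits=10, shuffle=True, random_state=0)    # randomly divide into 10 parts
all_true, all_pred = [], []
for train_idx, test_idx in kf.split(X):                  # each part is used once for testing
    model = SVC().fit(X[train_idx], y[train_idx])        # train on the other 9 parts
    all_true.extend(y[test_idx])
    all_pred.extend(model.predict(X[test_idx]))          # collect the 10 sets of test results

accuracy = np.mean(np.array(all_true) == np.array(all_pred))  # combined result
print(f"10-fold CV accuracy: {accuracy:.3f}")
```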
Evaluation Measures • True positive: Sick people correctly identified as sick • False positive: Healthy people incorrectly identified as sick • True negative: Healthy people correctly identified as healthy • False negative: Sick people incorrectly identified as healthy
Evaluation Measures • Sensitivity = True Positives/(True positives + False negatives) • Probability that a sick person will be correctly identified as sick • Specificity = True Negatives/(True negatives + False positives) • Probability that a healthy person will be correctly identified as healthy Note: There are typos in the paper in the above definitions. • Often there is a trade-off between the sensitivity and specificity of a test • The more aggressive a test is at finding the disease (high sensitivity), the more likely it is to flag healthy people as sick (low specificity) • Call everyone sick: 100% sensitivity but very low specificity • Call everyone healthy: 100% specificity but very low sensitivity
Evaluation Measures • Using the output confidences, one can get a range of sensitivity and specificity values (trade-off): • If output confidence > 0.9 then it is positive • Low sensitivity, high specificity (will misclassify many sick people as negatives) • If output confidence > 0.5 then it is positive • High sensitivity, low specificity (will misclassify many healthy people as positives) • An entire curve of sensitivity against (1 - specificity), traced out as the threshold varies, is called an ROC curve • The area under this curve (AUC) is considered a good overall evaluation measure
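A sketch of how the ROC curve and its area are obtained from the confidences (synthetic stand-in data, not the paper's):

```python
# Sketch: varying the confidence threshold traces out the ROC curve; the area
# under it (AUC) summarizes performance across all thresholds.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, n_features=14, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = SVC(probability=True).fit(X_tr, y_tr)
conf = model.predict_proba(X_te)[:, 1]            # "confidence" that the class is positive

fpr, tpr, thresholds = roc_curve(y_te, conf)      # tpr = sensitivity, fpr = 1 - specificity
print("AUC:", roc_auc_score(y_te, conf))
```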
Evaluation Measures • Positive predictive value (PPV) = True Positives/(True positives + False positives) • Probability that a person identified as sick is really sick • Negative predictive value (NPV) = True Negatives/(True Negatives + False Negatives) • Probability that a person identified as healthy is really healthy
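A small worked example of all four measures, with made-up counts purely for illustration:

```python
# Worked sketch of the four measures from a 2x2 confusion matrix.
# The counts below are invented for illustration only.
tp, fp, tn, fn = 80, 30, 150, 20

sensitivity = tp / (tp + fn)   # P(test positive | sick)
specificity = tn / (tn + fp)   # P(test negative | healthy)
ppv = tp / (tp + fp)           # P(sick | test positive)
npv = tn / (tn + fn)           # P(healthy | test negative)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f}")
```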
Results • Performance was measured by the area under the ROC curve (AUC values for the two schemes are shown in a figure in the paper) • For Scheme I, 8 features gave the best performance: family history, age, race & ethnicity, weight, height, waist circumference, BMI and hypertension • For Scheme II, 10 features gave the best performance: the eight above plus gender and physical activity level • The results were comparable to those obtained with an alternate machine learning technique, logistic regression (also called a maximum entropy classifier), in a previous study • SVM might have shown better performance than logistic regression if many more features were involved
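For illustration only (synthetic data, not the study's data or code), such a comparison could be set up by evaluating both models with the same cross-validated AUC:

```python
# Sketch of the comparison mentioned above: the same 10-fold cross-validation
# and AUC measure applied to an SVM and to logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=14, random_state=0)  # stand-in data

svm_auc = cross_val_score(SVC(), X, y, cv=10, scoring="roc_auc").mean()
lr_auc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="roc_auc").mean()
print(f"SVM AUC={svm_auc:.3f}  Logistic regression AUC={lr_auc:.3f}")
```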
Web-based Tool • The authors built a web-based tool based on their two learned models: http://www.hugenavigator.net/DiseaseClassificationPortal/startPageDiabetes.do • Users can input their feature values • The first model predicts diabetes vs. no diabetes (Scheme I) • The second model predicts undiagnosed diabetes or pre-diabetes vs. no diabetes (Scheme II) • One can set the “cutoff score” (different from a probabilistic confidence) to select a particular sensitivity and specificity level, as sketched below
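A sketch of what choosing a cutoff score amounts to: sliding a threshold on the SVM decision score until a desired operating point is reached. The 90% sensitivity target and the synthetic data are assumptions for illustration only.

```python
# Sketch: pick a cutoff on the SVM decision score (distance from the hyperplane)
# that achieves a target sensitivity, then read off the resulting specificity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=500, n_features=14, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = SVC().fit(X_tr, y_tr).decision_function(X_te)   # distance from the hyperplane
fpr, tpr, thresholds = roc_curve(y_te, scores)

target_sensitivity = 0.90
i = np.argmax(tpr >= target_sensitivity)                  # first threshold reaching the target
print(f"cutoff={thresholds[i]:.3f}  sensitivity={tpr[i]:.2f}  specificity={1 - fpr[i]:.2f}")
```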
Other Possibilities • They could have built a single three-way classifier over the three classes (undiagnosed diabetes, pre-diabetes, no diabetes) to help someone who has not already been diagnosed with diabetes (SVMs with multiple classes: internally there will be three “one vs. remaining” classifiers; see the sketch below) • Unlike rule-based machine learning models, SVM models are not very interpretable to humans; we do not know why the model makes its predictions • It is not strictly necessary to do feature selection with SVMs; their built-in mechanism lets them ignore irrelevant features even in the presence of thousands of features
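A sketch of the suggested three-way alternative, using scikit-learn's explicit one-vs-rest wrapper around SVC on synthetic stand-in data:

```python
# Sketch of the suggested three-way classifier. OneVsRestClassifier builds one
# "this class vs. the remaining classes" SVM per class, as described above.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three classes standing in for: undiagnosed diabetes, pre-diabetes, no diabetes
X, y = make_classification(n_samples=600, n_features=14, n_informative=6,
                           n_classes=3, random_state=0)

clf = OneVsRestClassifier(SVC()).fit(X, y)   # three "one vs. remaining" SVMs internally
print(clf.predict(X[:5]))                    # predicted class for a few examples
```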