Computational Intelligence in Biomedical and Health Care Informatics, HCA 590 (Topics in Health Sciences) Rohit Kate Support Vector Machines: A Sample Medical Application
Reading • Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. Wei Yu, Tiebin Liu, Rodolfo Valdez, Marta Gwinn and Muin J. Khoury. BMC Medical Informatics and Decision Making 2010, 10:16. http://www.biomedcentral.com/content/pdf/1472-6947-10-16.pdf
Aim of the Authors • Predict (i) diabetes and (ii) pre-diabetes from easily available variables • Why? • In the U.S., 23.6 million people have diabetes and one third of them are unaware of it • Another 57 million people have pre-diabetes (at risk of developing diabetes, heart disease and stroke); with lifestyle changes, their onset of diabetes can be prevented • The authors build SVM models to predict diabetes and pre-diabetes
Training Data Source • Used the 1999-2004 data set from the National Health and Nutrition Examination Survey (NHANES) • NHANES (http://www.cdc.gov/nchs/nhanes.htm): • Collects demographic, health history and behavioral information through home surveys • An ongoing cross-sectional probability sample survey of the U.S. population • Some of the data is restricted and some is publicly available
Four Categories • Diagnosed diabetes (1266): Answered “Yes” to “Have you ever been told by a doctor or health professional that you had diabetes?” • Undiagnosed diabetes (195): Answered “No” and have fasting plasma glucose level >= 126 mg/dl • Pre-diabetes (1576): Fasting plasma glucose level is between 100 and 125 mg/dl • No diabetes (3277): Fasting plasma glucose level < 100 mg/dl
Two Classifiers • Scheme I: Classify between persons with diabetes (categories 1 & 2) and without diabetes (categories 3 & 4) • Scheme II: Classify between undiagnosed diabetes or pre-diabetes (categories 2 & 3) and no diabetes (category 4) • Two separate SVM models are built, one for each of the above schemes (a labeling sketch follows below)
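A minimal sketch (not the authors' code) of how the four categories and the two labeling schemes could be set up; the field names told_has_diabetes and fasting_glucose are hypothetical stand-ins for the NHANES variables.

```python
# Hypothetical sketch, not the authors' code: derive the four categories and
# the two binary labeling schemes. Field names are made up for illustration.

def assign_category(record):
    """record is a dict with hypothetical keys 'told_has_diabetes' (Yes/No)
    and 'fasting_glucose' (mg/dl)."""
    if record["told_has_diabetes"] == "Yes":
        return "diagnosed_diabetes"        # category 1
    if record["fasting_glucose"] >= 126:
        return "undiagnosed_diabetes"      # category 2
    if record["fasting_glucose"] >= 100:
        return "pre_diabetes"              # category 3
    return "no_diabetes"                   # category 4

def scheme_I_label(category):
    # Scheme I: diabetes (categories 1 & 2) vs. without diabetes (3 & 4)
    return 1 if category in ("diagnosed_diabetes", "undiagnosed_diabetes") else 0

def scheme_II_label(category):
    # Scheme II: undiagnosed diabetes or pre-diabetes (2 & 3) vs. no diabetes (4);
    # diagnosed diabetes (category 1) is left out of this scheme
    if category == "diagnosed_diabetes":
        return None
    return 1 if category in ("undiagnosed_diabetes", "pre_diabetes") else 0

print(scheme_I_label(assign_category({"told_has_diabetes": "No", "fasting_glucose": 130})))  # prints 1
```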
Features • The features or variables should not require lab testing so that the model can be easily used for prediction • Do not include plasma glucose level • Selected 14 simple variables: family history of diabetes, age, gender, race & ethnicity, weight, height, waist circumference, body mass index (BMI), hypertension, physical activity level, smoking, alcohol use, education and household income
SVM Models • They used the freely available LibSVM package for SVM • All feature values were normalized to the range -1 to +1 • Not essential, but it helps the SVM treat all features as equally important to begin with • Discrete features were assigned numerical values (-1, -0.5, 0, 0.5, 1, etc.) • SVM packages can do the above automatically • The best value for C (the noise parameter) and the best kernel were determined by cross-validation • Used a version of SVM that can output a “confidence” besides the class (based on distance from the separating hyperplane); see the sketch below
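As a rough illustration of these steps (scaling to [-1, +1], choosing C and the kernel by cross-validation, and obtaining confidences), here is a sketch using scikit-learn's SVC, which wraps LibSVM. This is not the authors' code, and the data below is synthetic stand-in data, not NHANES.

```python
# Illustrative sketch only: the paper used LibSVM; scikit-learn's SVC wraps
# LibSVM and exposes the same knobs (C, kernel, probability outputs).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 14-variable NHANES data
X, y = make_classification(n_samples=500, n_features=14, random_state=0)

pipeline = Pipeline([
    ("scale", MinMaxScaler(feature_range=(-1, 1))),  # normalize every feature to [-1, +1]
    ("svm", SVC(probability=True)),                  # probability=True yields a "confidence"
])

param_grid = {
    "svm__C": [0.1, 1, 10, 100],                     # candidate values of the noise parameter C
    "svm__kernel": ["linear", "rbf", "poly"],        # candidate kernels
}

# Pick the best C and kernel by cross-validation, as the authors did
search = GridSearchCV(pipeline, param_grid, cv=10, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_)
print(search.best_estimator_.predict_proba(X[:3]))   # per-class confidences for a few examples
```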
Evaluation • 10-fold cross-validation with the training data • Randomly divide the data into 10 equal parts • Use 9 parts for training and the remaining part for testing • Repeat the previous step 10 times, each time with a different part held out for testing • Combine the 10 testing results • In general this is called n-fold cross-validation • The extreme case, with each example held out in its own fold, is “leave-one-out” cross-validation
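The procedure above can be written out directly; the sketch below uses scikit-learn's KFold on synthetic stand-in data, purely for illustration.

```python
# A minimal sketch of the 10-fold cross-validation procedure described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=14, random_state=0)  # stand-in data

kf = KFold(n_splits=10, shuffle=True, random_state=0)    # randomly divide into 10 parts
all_true, all_pred = [], []
for train_idx, test_idx in kf.split(X):                  # each part is used once for testing
    model = SVC().fit(X[train_idx], y[train_idx])        # train on the other 9 parts
    all_true.extend(y[test_idx])
    all_pred.extend(model.predict(X[test_idx]))          # collect the 10 sets of test results

accuracy = np.mean(np.array(all_true) == np.array(all_pred))  # combined result
print(f"10-fold CV accuracy: {accuracy:.3f}")
```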
Evaluation Measures • True positive: Sick people correctly identified as sick • False positive: Healthy people incorrectly identified as sick • True negative: Healthy people correctly identified as healthy • False negative: Sick people incorrectly identified as healthy
Evaluation Measures • Sensitivity = True Positives/(True positives + False negatives) • Probability that a sick person will be correctly identified as sick • Specificity = True Negatives/(True negatives + False positives) • Probability that a healthy person will be correctly identified as healthy Note: There are typos in the paper in the above definitions. • Often there is a trade-off between the sensitivity and specificity of a test • The more aggressive a test is at finding the disease (high sensitivity), the more likely it is to flag healthy people as sick (low specificity) • Call everyone sick: 100% sensitivity but very low specificity • Call everyone healthy: 100% specificity but very low sensitivity
Evaluation Measures • Using the output confidences, one can get a range of sensitivity and specificity values (trade-off): • If output confidence > 0.9 then it is positive • Low sensitivity, high specificity (will misclassify many sick people as negatives) • If output confidence > 0.5 then it is positive • High sensitivity, low specificity (will misclassify many healthy people as positives) • An entire curve of sensitivity against (1 - specificity), traced out as the threshold varies, is called an ROC curve • The area under this curve (AUC) is considered a good overall evaluation measure
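A sketch of how the ROC curve and its area are obtained from the confidences (synthetic stand-in data, not the paper's):

```python
# Sketch: varying the confidence threshold traces out the ROC curve; the area
# under it (AUC) summarizes performance across all thresholds.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, n_features=14, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = SVC(probability=True).fit(X_tr, y_tr)
conf = model.predict_proba(X_te)[:, 1]            # "confidence" that the class is positive

fpr, tpr, thresholds = roc_curve(y_te, conf)      # tpr = sensitivity, fpr = 1 - specificity
print("AUC:", roc_auc_score(y_te, conf))
```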
Evaluation Measures • Positive predictive value (PPV) = True Positives/(True positives + False positives) • Probability that a person identified as sick is really sick • Negative predictive value (NPV) = True Negatives/(True Negatives + False Negatives) • Probability that a person identified as healthy is really healthy
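A small worked example of all four measures, with made-up counts purely for illustration:

```python
# Worked sketch of the four measures from a 2x2 confusion matrix.
# The counts below are invented for illustration only.
tp, fp, tn, fn = 80, 30, 150, 20

sensitivity = tp / (tp + fn)   # P(test positive | sick)
specificity = tn / (tn + fp)   # P(test negative | healthy)
ppv = tp / (tp + fp)           # P(sick | test positive)
npv = tn / (tn + fn)           # P(healthy | test negative)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f}")
```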
Results • Performance was measured by the area under the ROC curve (AUC values for the two schemes are shown in a figure in the paper) • For Scheme I, 8 features gave the best performance: family history, age, race & ethnicity, weight, height, waist circumference, BMI and hypertension • For Scheme II, 10 features gave the best performance: the eight above plus gender and physical activity level • The results were comparable to those obtained with an alternate machine learning technique, logistic regression (also called a maximum entropy classifier), in a previous study • SVM might have shown better performance than logistic regression if many more features were involved
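For illustration only (synthetic data, not the study's data or code), such a comparison could be set up by evaluating both models with the same cross-validated AUC:

```python
# Sketch of the comparison mentioned above: the same 10-fold cross-validation
# and AUC measure applied to an SVM and to logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=14, random_state=0)  # stand-in data

svm_auc = cross_val_score(SVC(), X, y, cv=10, scoring="roc_auc").mean()
lr_auc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="roc_auc").mean()
print(f"SVM AUC={svm_auc:.3f}  Logistic regression AUC={lr_auc:.3f}")
```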
Web-based Tool • The authors built a web-based tool based on their two learned models: http://www.hugenavigator.net/DiseaseClassificationPortal/startPageDiabetes.do • Users can input their feature values • The first model predicts diabetes vs. no diabetes (Scheme I) • The second model predicts undiagnosed diabetes or pre-diabetes vs. no diabetes (Scheme II) • One can set the “cutoff score” (different from a probabilistic confidence) to select a particular sensitivity and specificity level, as sketched below
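A sketch of what choosing a cutoff score amounts to: sliding a threshold on the SVM decision score until a desired operating point is reached. The 90% sensitivity target and the synthetic data are assumptions for illustration only.

```python
# Sketch: pick a cutoff on the SVM decision score (distance from the hyperplane)
# that achieves a target sensitivity, then read off the resulting specificity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=500, n_features=14, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = SVC().fit(X_tr, y_tr).decision_function(X_te)   # distance from the hyperplane
fpr, tpr, thresholds = roc_curve(y_te, scores)

target_sensitivity = 0.90
i = np.argmax(tpr >= target_sensitivity)                  # first threshold reaching the target
print(f"cutoff={thresholds[i]:.3f}  sensitivity={tpr[i]:.2f}  specificity={1 - fpr[i]:.2f}")
```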
Other Possibilities • They could have built a single three-way classifier over the three classes (undiagnosed diabetes, pre-diabetes, no diabetes) to help someone who has not already been diagnosed with diabetes (SVMs with multiple classes: internally there will be three “one vs. remaining” classifiers; see the sketch below) • Unlike rule-based machine learning models, SVM models are not very interpretable to humans; we do not know why the model makes its predictions • It is not strictly necessary to do feature selection with SVMs; their built-in mechanism lets them ignore irrelevant features even in the presence of thousands of features
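A sketch of the suggested three-way alternative, using scikit-learn's explicit one-vs-rest wrapper around SVC on synthetic stand-in data:

```python
# Sketch of the suggested three-way classifier. OneVsRestClassifier builds one
# "this class vs. the remaining classes" SVM per class, as described above.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three classes standing in for: undiagnosed diabetes, pre-diabetes, no diabetes
X, y = make_classification(n_samples=600, n_features=14, n_informative=6,
                           n_classes=3, random_state=0)

clf = OneVsRestClassifier(SVC()).fit(X, y)   # three "one vs. remaining" SVMs internally
print(clf.predict(X[:5]))                    # predicted class for a few examples
```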