A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining. Amir R Razavi, Hans Gill, Hans Åhlfeldt, Nosrat Shahsavar Department of Biomedical Engineering, Division of Medical Informatics Linköpings universitet, Linköping, Sweden.
A Data Pre-processing Method in Data Mining • Outline • Introduction • Dataset and variables • Data pre-processing • Data mining Algorithm (DTI) • Result • Discussion
Introduction • Abundance of data in medicine and availability of comprehensive registers • Difficulty in analysing huge amounts of data with traditional methods • Efficient data mining methods
Introduction • Applying data mining methods to a breast cancer register • Pre-processing is an essential part of knowledge discovery in databases • Finding an efficient pre-processing approach is essential for successful data mining
Methods • Dataset • Data pre-processing • Data combination and selection • Cleaning data • Replacing missing values • Dimension reduction • Decision Tree Induction (DTI) • Performance comparison
Dataset • 3949 female patients, 1986 to 1995, follow up to 2003 • Data from three registers: regional, tumour marker and death registers, overall more than 150 variables
Data Pre-processing – Data Selection • After combining data from the different registers, important variables (predictors/outcomes) were selected in consultation with domain experts: • The number of predictors was reduced from more than 150 • Four important outcomes for breast cancer were chosen
Data Pre-processing – Cleaning Data • Cleaning the data from outliers and errors, for example: • Duration between diagnosis of the disease and the recurrence • Age
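The two checks named on this slide can be sketched as a simple record filter. This is only an illustration: the field names (`age`, `diagnosis_date`, `recurrence_date`), the age bounds and the three records are invented for the example and are not taken from the registers.

```python
from datetime import date

def is_plausible(record, min_age=0, max_age=120):
    """Return True when a record passes two basic sanity checks.

    The bounds min_age and max_age are assumed values, not the authors'.
    """
    age_ok = min_age <= record["age"] <= max_age
    recurrence = record.get("recurrence_date")
    # A recurrence cannot be dated before the diagnosis it follows.
    duration_ok = recurrence is None or recurrence >= record["diagnosis_date"]
    return age_ok and duration_ok

# Made-up records: one valid, one with an impossible age,
# one with a recurrence dated before the diagnosis.
records = [
    {"age": 54, "diagnosis_date": date(1990, 3, 1), "recurrence_date": date(1993, 6, 1)},
    {"age": 154, "diagnosis_date": date(1991, 5, 1), "recurrence_date": None},
    {"age": 61, "diagnosis_date": date(1992, 7, 1), "recurrence_date": date(1989, 1, 1)},
]
clean = [r for r in records if is_plausible(r)]
```

Whether a flagged record is dropped or sent back for correction is a separate decision that the slide does not specify.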
Data Pre-processing - Replacing Missing Values • EM (expectation maximization) algorithm • Dempster et al., 1977 • A two-step iterative approach that estimates the parameters of a model starting from an initial guess. Each iteration consists of two steps: • An expectation step that finds the distribution of the missing data based on the known values of the observed variables and the current estimate of the parameters. • A maximization step that re-estimates the parameters, with the missing data substituted by their expected values.
Data Pre-processing - Dimension Reduction • Canonical Correlation Analysis (CCA) • It investigates the relationship between two sets of variables. • A canonical correlation is the correlation of two canonical variates, one representing a set of independent variables, the other a set of dependent variables. • A canonical variate is a linear combination of a set of original variables.
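To make the two steps concrete, here is a minimal sketch of EM imputation for a toy case: two variables assumed to follow a bivariate normal distribution, with `x` fully observed and some values of `y` missing. The model, the function name `em_impute` and the data are assumptions made for illustration; the slides do not say which model or software the authors used.

```python
def em_impute(x, y, n_iter=100):
    """Fill missing entries of y (given as None) under a bivariate normal model."""
    n = len(x)
    observed = [i for i in range(n) if y[i] is not None]
    # x is complete, so its mean and variance are fixed throughout.
    mu_x = sum(x) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    # Initial guess: y parameters from the observed cases, no x-y dependence.
    mu_y = sum(y[i] for i in observed) / len(observed)
    var_y = sum((y[i] - mu_y) ** 2 for i in observed) / len(observed)
    cov_xy = 0.0

    for _ in range(n_iter):
        # E-step: expected y and y^2 for each case under the current parameters.
        slope = cov_xy / var_x
        resid_var = var_y - cov_xy ** 2 / var_x
        ey, ey2 = [], []
        for i in range(n):
            if y[i] is None:
                m = mu_y + slope * (x[i] - mu_x)
                ey.append(m)
                ey2.append(m * m + resid_var)
            else:
                ey.append(y[i])
                ey2.append(y[i] ** 2)
        # M-step: re-estimate the parameters from the expected statistics.
        mu_y = sum(ey) / n
        var_y = sum(ey2) / n - mu_y ** 2
        cov_xy = sum(x[i] * ey[i] for i in range(n)) / n - mu_x * mu_y
    return ey

# The observed points lie on y = 2x, so the imputed value converges toward 10.
completed = em_impute([1, 2, 3, 4, 5], [2, 4, 6, 8, None])
```

The expected value of `y` squared carries the residual variance in the E-step, which is what distinguishes EM from repeatedly plugging in regression predictions.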
Data Pre-processing - Dimension Reduction • The aim is to create a number of canonical solutions each consisting of a linear combination of one set of variables: Ui = a1 X1 + a2 X2 + … + am Xm and a linear combination of the other set of variables: Vi = b1 Y1 + b2 Y2 + … + bn Yn • The goal is to determine the coefficients (a’s and b’s) that maximize the correlation between canonical variates Ui and Vi.
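The maximisation described on this slide can be stated compactly. This is the standard textbook formulation rather than something shown in the slides; the covariance matrices Σ_XX, Σ_YY and Σ_XY of the two variable sets are notation introduced here as an assumption, with a and b the coefficient vectors (a1 … am) and (b1 … bn).

```latex
\rho_i = \max_{a,\,b}\ \operatorname{corr}(U_i, V_i)
       = \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
                            {\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}},
\qquad U_i = a^{\top}X,\quad V_i = b^{\top}Y
```

Each later pair of variates is found the same way, under the extra constraint that it is uncorrelated with all earlier pairs.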
Data Pre-processing - Dimension Reduction • To find the important variables in each set (predictors and outcomes), the magnitude of the loadings was used. • Variables whose loadings had an absolute value of 0.3 or more were considered important and entered into the next step for data mining. • A loading shows how each original variable contributes towards each canonical variate.
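The selection rule can be sketched as follows, taking a loading to be the correlation between an original variable and the canonical variate scores (the usual definition, assumed here). The function names, variable names and numbers are invented for illustration; only the 0.3 cut-off comes from the slide.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)

def select_by_loading(variables, variate, threshold=0.3):
    """Return (loadings, kept names) for variables with |loading| >= threshold.

    variables: dict mapping a variable name to its values.
    variate:   scores of one canonical variate for the same cases.
    """
    loadings = {name: pearson(values, variate) for name, values in variables.items()}
    kept = [name for name, r in loadings.items() if abs(r) >= threshold]
    return loadings, kept

# Made-up scores: "a" and "b" track the variate, "c" is unrelated to it.
variate = [1, 2, 3, 4, 5]
variables = {"a": [2, 4, 6, 8, 10], "b": [5, 4, 3, 2, 1], "c": [1, -1, 0, -1, 1]}
loadings, kept = select_by_loading(variables, variate)
```

Because the rule uses the absolute value, a strongly negative loading such as that of `b` is retained just like a strongly positive one.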
Data Pre-processing - Dimension Reduction • Variables with their loadings
Data Mining Algorithm • Decision Tree Induction (DTI) • A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision. • Each internal node denotes a test on variables, each branch stands for an outcome of the test, leaf nodes represent an outcome, and the uppermost node in a tree is the root node.
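The structure described on this slide can be sketched as a small data type plus a traversal. The sketch covers only how a finished tree classifies a case, not how the tree is induced from data; the variable names `x1` and `x2`, the thresholds and the class labels are placeholders, and the binary threshold test is one assumed form of node test.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str  # the classification returned at this leaf

@dataclass
class Node:
    variable: str                      # variable tested at this internal node
    threshold: float                   # cut-off used by the test
    below: Union["Node", Leaf]         # branch taken when value < threshold
    at_or_above: Union["Node", Leaf]   # branch taken otherwise

def classify(tree, case):
    """Walk from the root to a leaf, following the outcome of each test."""
    while isinstance(tree, Node):
        if case[tree.variable] < tree.threshold:
            tree = tree.below
        else:
            tree = tree.at_or_above
    return tree.label

# Placeholder tree: the root tests x1, its right branch tests x2.
root = Node("x1", 5.0, Leaf("A"), Node("x2", 2.0, Leaf("B"), Leaf("A")))
label = classify(root, {"x1": 7, "x2": 1})
```

Counting the `Leaf` and `Node` objects in such a structure gives the number of leaves and the tree size used later as interpretability measures.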
Performance comparison • Sensitivity = TP / (TP + FN) • Specificity = TN / (TN + FP) • Accuracy = (TP + TN) / (TP + TN + FP + FN) • Number of leaves and tree size TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively
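The three measures named on this slide follow the standard confusion-matrix definitions. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def performance(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                   # share of positives found
        "specificity": tn / (tn + fp),                   # share of negatives found
        "accuracy": (tp + tn) / (tp + tn + fp + fn),     # share of all cases correct
    }

# Illustrative counts only, not results from the study.
scores = performance(tp=30, tn=50, fp=10, fn=10)
```

Reporting all three together matters when the classes are unbalanced, since accuracy alone can stay high while sensitivity is poor.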
Performance Comparison • Comparing different approaches
Discussion • Effective data pre-processing is a very important step in knowledge discovery • Real-world data are usually • Incomplete • Noisy • Inconsistent • Not collected for data mining
Discussion • Replacing missing values before dimension reduction • Providing more information to CCA for dimension reduction • Running CCA prior to DTI • Reducing the number of variables while increasing accuracy of classification • Considerable increase in the interpretability of DTI
Discussion • In medical studies, pre-processing is often not done before DTI • Proper pre-processing, including consulting with domain experts, replacing missing values and dimension reduction, prepares the data for better data mining by DTI • Increased accuracy and interpretability of DTI are the result of our approach
Future Work • Increase the efficiency of knowledge discovery in medical registers. • Validate the results of our methodology (pre-processing prior to data mining) with domain experts for the prediction of cancer recurrence. • Determine how to use the discovered knowledge and integrate it into the clinical workflow. • Improve the quality of registers by adding and completing important predictors.
Thanks for your attention amira@imt.liu.se