
A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining

A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining. Amir R Razavi, Hans Gill, Hans Åhlfeldt, Nosrat Shahsavar Department of Biomedical Engineering, Division of Medical Informatics Linköpings universitet, Linköping, Sweden.


Presentation Transcript


  1. A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining Amir R Razavi, Hans Gill, Hans Åhlfeldt, Nosrat Shahsavar Department of Biomedical Engineering, Division of Medical Informatics Linköpings universitet, Linköping, Sweden

  2. A Data Pre-processing Method in Data Mining • Outline • Introduction • Dataset and variables • Data pre-processing • Data mining Algorithm (DTI) • Result • Discussion

  3. Introduction • Abundance of data in medicine and availability of comprehensive registers • Difficulty in analysing huge amounts of data with traditional methods • Need for efficient data mining methods

  4. Introduction • Applying data mining methods to a breast cancer register • Pre-processing is an essential part of knowledge discovery in databases • Finding an efficient pre-processing approach is essential for successful data mining

  5. Methods • Dataset • Data pre-processing • Data combination and selection • Cleaning data • Replacing missing values • Dimension reduction • Decision Tree Induction (DTI) • Performance comparison

  6. Dataset • 3949 female patients, 1986 to 1995, follow-up until 2003 • Data from three registers: regional, tumour marker and death registers, more than 150 variables overall

  7. Variables

  8. Data Pre-processing – Data Selection • After combining data from the different registers, important variables (predictors/outcomes) were selected in consultation with domain experts: • The number of predictors was reduced from more than 150 • Four important outcomes for breast cancer were chosen

  9. Data Pre-processing – Cleaning Data • Cleaning the data from outliers and errors, for example: • Duration between diagnosis of the disease and the recurrence • Age
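
The cleaning step above can be illustrated with a minimal sketch. The field names (`age`, `months_to_recurrence`) and the plausibility limits are assumptions made for this example; the slides do not state which rules were applied to the register data.

```python
def clean_records(records, min_age=0, max_age=120):
    """Split records into plausible and implausible ones.

    Assumed rules (hypothetical, not taken from the slides):
    - age, when present, must lie within [min_age, max_age]
    - the time from diagnosis to recurrence, when present, must not be negative
    Missing values (None) are not treated as errors here.
    """
    kept, rejected = [], []
    for rec in records:
        age = rec.get("age")
        duration = rec.get("months_to_recurrence")
        age_ok = age is None or min_age <= age <= max_age
        duration_ok = duration is None or duration >= 0
        (kept if age_ok and duration_ok else rejected).append(rec)
    return kept, rejected


# Invented toy records, for illustration only.
records = [
    {"age": 54, "months_to_recurrence": 18},
    {"age": 154, "months_to_recurrence": 6},   # implausible age
    {"age": 61, "months_to_recurrence": -3},   # recurrence before diagnosis
    {"age": 47, "months_to_recurrence": None},  # missing, kept for later imputation
]
kept, rejected = clean_records(records)
```

Returning the rejected records instead of silently dropping them makes it possible to review them with domain experts before deciding whether to correct or exclude them.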

  10. Data Pre-processing - Replacing Missing Values • EM (expectation maximization) algorithm • Dempster et al., 1977 • An iterative approach that estimates the parameters of a model starting from an initial guess. Each iteration consists of two steps: • An expectation step that finds the expected values of the missing data, based on the known values of the observed variables and the current estimate of the parameters, and substitutes them for the missing data. • A maximization step that re-estimates the parameters by maximum likelihood as though the missing data had been filled in.
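
As a rough illustration of the EM idea, here is a minimal sketch for the simplest case: two variables assumed to be jointly normal, where `x` is fully observed and `y` has gaps marked as `None`. The function name and the bivariate simplification are assumptions for this example; the slides do not say which EM implementation was used on the register.

```python
def em_impute(x, y, n_iter=200, tol=1e-9):
    """Replace missing y values (None) by their expected value given x."""
    n = len(x)
    missing = [v is None for v in y]
    observed = [v for v in y if v is not None]
    # Initial guess: fill the gaps with the mean of the observed y values.
    fill = sum(observed) / len(observed)
    y_hat = [fill if m else v for v, m in zip(y, missing)]
    resid_var = 0.0  # conditional variance of a missing y given x
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x) / n
    for _ in range(n_iter):
        # M-step: re-estimate mean and covariance as if the data were complete.
        mean_y = sum(y_hat) / n
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y_hat)) / n
        syy = (sum((yi - mean_y) ** 2 for yi in y_hat)
               + sum(missing) * resid_var) / n
        slope = sxy / sxx
        resid_var = max(syy - slope * sxy, 0.0)
        # E-step: expected value of each missing y under the current parameters.
        new = [mean_y + slope * (xi - mean_x) if m else yi
               for xi, yi, m in zip(x, y_hat, missing)]
        change = max(abs(a - b) for a, b in zip(new, y_hat))
        y_hat = new
        if change < tol:
            break
    return y_hat


# Toy data: the observed pairs lie on y = 2x + 1, the last y is missing.
completed = em_impute([1, 2, 3, 4], [3, 5, 7, None])
```

With more than two variables the same two steps apply, but the mean and covariance become a vector and a matrix, and each record is imputed from whichever of its variables are observed.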

  11. Data Pre-processing - Dimension Reduction • Canonical Correlation Analysis (CCA) • It investigates the relationship between two sets of variables. • A canonical correlation is the correlation of two canonical variates, one representing a set of independent variables, the other a set of dependent variables. • A canonical variate is a linear combination of a set of original variables.

  12. Data Pre-processing - Dimension Reduction • The aim is to create a number of canonical solutions each consisting of a linear combination of one set of variables: Ui = a1 X1 + a2 X2 + … + am Xm and a linear combination of the other set of variables: Vi = b1 Y1 + b2 Y2 + … + bn Yn • The goal is to determine the coefficients (a’s and b’s) that maximize the correlation between canonical variates Ui and Vi.
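
One common way to obtain the coefficients and canonical correlations is through QR and singular value decompositions. The sketch below assumes NumPy and full-column-rank data matrices; the function name `cca` is hypothetical and this is not necessarily the procedure the authors ran.

```python
import numpy as np


def cca(X, Y):
    """Return canonical correlations and coefficient matrices (a, b).

    Column i of a and b holds the coefficients of the i-th canonical
    variates U_i = Xc @ a[:, i] and V_i = Yc @ b[:, i], where Xc and Yc
    are the column-centred data matrices.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two variable sets.
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    # Singular values of Qx' Qy are the canonical correlations.
    U, s, Vt = np.linalg.svd(Qx.T @ Qy)
    k = min(X.shape[1], Y.shape[1])
    a = np.linalg.solve(Rx, U[:, :k])
    b = np.linalg.solve(Ry, Vt.T[:, :k])
    return np.clip(s[:k], 0.0, 1.0), a, b


# Synthetic data: the first Y column is an exact combination of X columns,
# so the first canonical correlation should be very close to 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = np.column_stack([X[:, 0] + 2 * X[:, 1], rng.normal(size=50)])
rho, a, b = cca(X, Y)
```

The correlations come out sorted from largest to smallest, so the first pair of columns in `a` and `b` gives the canonical solution with the strongest relationship between the two sets.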

  13. Data Pre-processing - Dimension Reduction • To find the important variables in each set (predictors and outcomes), the magnitude of the loadings was used. • Variables with an absolute loading greater than or equal to 0.3 were considered important and entered into the next step for data mining. • A loading shows how much each original variable contributes to each canonical variate.
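
The selection rule can be sketched as follows, taking a loading to be the Pearson correlation between an original variable and a canonical variate (the usual definition of a structure loading). The variable names and values are invented for illustration; only the 0.3 cut-off comes from the slide.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def select_by_loading(columns, variate, threshold=0.3):
    """Keep the variables whose absolute loading on the variate is >= threshold."""
    loadings = {name: pearson(values, variate) for name, values in columns.items()}
    kept = [name for name, value in loadings.items() if abs(value) >= threshold]
    return loadings, kept


# Invented toy scores of one canonical variate and three original variables.
variate = [1, 2, 3, 4, 5]
columns = {
    "x1": [2, 4, 6, 8, 10],   # strong positive loading
    "x2": [5, 4, 3, 2, 1],    # strong negative loading, kept via absolute value
    "x3": [1, -1, 0, -1, 1],  # uncorrelated with the variate
}
loadings, kept = select_by_loading(columns, variate)
```

Using the absolute value matters: a variable with a strongly negative loading is as informative as one with a strongly positive loading.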

  14. Data Pre-processing - Dimension Reduction • Variables with their loadings

  15. Data Mining Algorithm • Decision Tree Induction (DTI) • A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision. • Each internal node denotes a test on variables, each branch stands for an outcome of the test, leaf nodes represent an outcome, and the uppermost node in a tree is the root node.
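
To make the tree structure concrete, here is a minimal sketch of decision tree induction on categorical variables, choosing each test by information gain (an ID3-style criterion). The slides do not name the induction algorithm or software that was used, so the criterion, the function names and the toy data are all assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Attribute whose test yields the largest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)


def build_tree(rows, labels, attributes):
    """Return a leaf label, or (attribute, {value: subtree}) for an internal node."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in set(row[attr] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        branches[value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return (attr, branches)


def classify(tree, row):
    """Follow the tests from the root node down to a leaf."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


# Invented toy data, not taken from the register.
rows = [
    {"lymph_nodes": "positive", "tumour_size": "large"},
    {"lymph_nodes": "positive", "tumour_size": "small"},
    {"lymph_nodes": "negative", "tumour_size": "large"},
    {"lymph_nodes": "negative", "tumour_size": "small"},
]
labels = ["recurrence", "recurrence", "no recurrence", "no recurrence"]
tree = build_tree(rows, labels, ["lymph_nodes", "tumour_size"])
```

In this toy case a single test at the root separates the classes, so the tree has one internal node and two leaves. Fewer input variables generally lead to smaller trees of this kind, which is where dimension reduction helps interpretability.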

  16. Resulting Decision Tree

  17. Performance Comparison • Sensitivity = TP / (TP + FN) • Specificity = TN / (TN + FP) • Accuracy = (TP + TN) / (TP + TN + FP + FN) • Number of leaves and tree size • TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively
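
The three measures can be computed directly from the confusion-matrix counts using their standard definitions: sensitivity is TP / (TP + FN), specificity is TN / (TN + FP), and accuracy is (TP + TN) / (TP + TN + FP + FN). The function name and the counts in the example are made up for illustration and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives detected
        "specificity": tn / (tn + fp),  # share of actual negatives detected
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Reporting sensitivity and specificity alongside accuracy is useful when one outcome class is much rarer than the other, since accuracy alone can then look high for a classifier that misses most positive cases.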

  18. Performance Comparison • Comparing different approaches

  19. Discussion • Effective data pre-processing is a very important step in knowledge discovery • Real-world data are usually • Incomplete • Noisy • Inconsistent • Not collected for data mining

  20. Discussion • Replacing missing values before dimension reduction • Providing more information to CCA for dimension reduction • Running CCA prior to DTI • Reducing the number of variables while increasing accuracy of classification • Considerable increase in the interpretability of DTI

  21. Discussion • In medical studies, often no pre-processing is done before DTI • Proper pre-processing, including consulting with domain experts, replacing missing values and dimension reduction, prepares the data for better data mining by DTI • Increased accuracy and interpretability of DTI are the result of our approach

  22. Future Work • Increase the efficiency of knowledge discovery in medical registers. • Validate the results of our methodology (pre-processing prior to data mining) with domain experts for the prediction of cancer recurrence. • Determine how to use the discovered knowledge and integrate it into the clinical workflow. • Improve the quality of registers by adding and completing important predictors.

  23. Thanks for your attention amira@imt.liu.se
