1 / 25

Exploration Mining in Diabetic Patients Databases: Findings and Conclusions

This seminar explores the findings and conclusions from the exploration and mining of diabetic patients' databases. It discusses the motivation, objectives, related works, semi-automatic data cleaning, exploratory mining of diabetic data, conclusions, and personal opinions.

joecook
Download Presentation

Exploration Mining in Diabetic Patients Databases: Findings and Conclusions

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Exploration Mining in Diabetic Patients Databases:Findings and Conclusions Advisor: Dr. Hsu Graduate:Min-Hong Lin IDSL seminar

  2. Outline • Motivation • Objective • Related Works • Semi-Automatic Data Cleaning • Exploratory Mining of Diabetic Data • Conclusions • Personal Opinion IDSL

  3. Motivation • In Singapore, about 10% of the population is diabetic. • A regular screening program for the diabetic patients has gathered a wealth of information. • The application of data mining techniques to health sector was few. IDSL

  4. Objective • Find rules that can be used by the medical doctors to improve their daily tasks. • To overcome the problem of noisy data, we have developed a semi-automatic cleaning system. • Apply a user-oriented approach in order to better understand the discovered patterns. IDSL

  5. Related Works • Past research in dealing with the problem of association rules can be describe with: • Discover all rules • Use constraints • Find unexpected rules • All the previous work in medical domain focus on the rule generation phase. • Lack of the pre-processing stage and the post-processing stage. IDSL

  6. The Diabetic Patient Database • It contains about 200,000 screening records captured from 1992-1996. • Each record has 60 fields. • ID, race, sex, birthday, … , and eye examination. • A few observations were made in the cleaning process: • 4 different kinds of database file formats • Inconsistencies: “dd/mm/yy” vs “mm/dd/yy” • Abbreviations: “ONE” vs “1” • A large number of missing values in the database IDSL

  7. Semi-Automatic Data Cleaning • This system allows a user to define • mappings between attributes in different formats • the encoding schemes used • whether the attributes are to be kept • Each of the database files are transformed accordingly into the standardized format schema. IDSL

  8. Sorted Neighborhood Method(SNM) • SNM is proposed to remove duplicate records • Step 1. Choose one or more fields that uniquely identify each record in the database. • Step 2. Sort the database according to the chosen fields. • Step 3. Compare the chosen fields of the records within a sliding window. Possible duplicates are output to a file. • Step 4. User verifies the duplicates detected and true duplicates are removed from the database. IDSL

  9. Comparison With a Sliding Window • The key we choose consists of the patient’s identification number(NRIC) and the date of screening(DOS). • Pair wise comparison of nearby records is carried out. • There are three types of duplicate records: • true duplicates • false duplicates • duplicates that require further investigation IDSL

  10. same True Duplicates IDSL

  11. different False Duplicates • An exceptional case IDSL

  12. same Duplicates that require investigation • Clarification from the expert is needed IDSL

  13. Exploratory Mining of Diabetic Data • A state-of-art data mining tool that integrates classification with association rule mining(CBA) is used. • Approximately 700 rules are generated(support of 1%, confidence of 50%) • Some kind of post-processing is needed to help the doctors understand these rules. IDSL

  14. The First Stage • Basic demographic information about the patients will be presented to the doctors. • The goal is to allow the doctors to have some basic idea about the impact of various attributes. • The information is presented in histogram graphs for easy digestion. • The absolute proportion • The relative proportion IDSL

  15. The First Stage(cont’d) IDSL

  16. The Second Stage • The doctors are interested in knowing which set of symptoms often occur together (ex:eye disease). • Such correlations can be found using association rule mining. • An association rule is an implication of the form X  Y, where X  I, Y  I, and X  Y = . • The rule X  Y holds in D with confidence c% and support s% in D, if • c% of data cases in D that support X also support Y • s% of the data case in D contains X  Y IDSL

  17. The Second Stage(cont’d) • These rules only indicate that there exist statistical relationships between the attributes • the existence of A and B implies the existence of C. • they do not specify whether the presence of A and B causes C • Further analysis is needed to determine the potential factors of diabetic eye disease. IDSL

  18. The Third Stage • How causal structures can be determined from association rules? • A LCD algorithm is proposed for this purpose • Which is a polynomial time, constraint-based algorithm • It uses tests of variable dependence, independence, and conditional independence to restrict the possible causal relationships between variables. • Underlying this technique is the Markov condition. IDSL

  19. Storm BusTourGroup Lightning Campfire Thunder ForestFire Definition 1: Markov Condition • Let A be a node in a causal Bayesian network, and let B be any node that is not a descendant of A in the causal network. Then the Markov condition holds if A and B are independent, conditioned on the parents of A. S,B S,-B -S,B -S,-B C 0.4 0.1 0.8 0.2 -C 0.6 0.9 0.2 0.8 Campfire IDSL

  20. The Modified LCD Algorithm IDSL

  21. The Definitions of Conditional Independence • Dependence test • 1. the value of support(S) exceeds s; • 2. the X2 value for the set of items S exceeds the X2 value at confidence level c • Independence test • 1. the value of support(S) exceeds s; • 2. the X2 value for the set of items S does not exceed the X2 value at confidence level c • Conditional independence test • Variable A and B are independent conditioned on C if p(AB|C) = p(A|C)p(B|C) • If X2(AB|C=0) + X2(AB|C=1) is greater than the threshold value X2, then A and B are dependent given condition C. IDSL

  22. Results of Multiple Factors Analysis IDSL

  23. The Exception Rule Mining Program • The exception rule mining program takes as input the pruned tree generated by C4.5 and performs the following check: • Chi-square test for each node of the tree IDSL

  24. Conclusions • Preprocessing and postprocessing are critical to the success of any real-life application. • Our doctors confirm that many of rules discovered conform the the trends. • They are surprised by the exception rules and express interest in investigating them further. IDSL

  25. Personal Opinion • Step-by-step exploration of the data. • Exception rules are helpful in some cases. IDSL

More Related