
Temporal Pattern Discovery in Course-of-Disease Data

This paper presents a system, TEMPADIS, that utilizes event-set sequencing for knowledge discovery within a database of HIV patients. The objective is to understand the overall process of KDD in a medical database environment. Results are presented for a database of HIV patients.

Presentation Transcript


  1. Temporal Pattern Discovery in Course-of-Disease Data: A System that Utilizes Event-Set Sequencing for Knowledge Discovery within a Database of HIV Patients • Author: Jorge C.G. Ramirez et al. • Advisor: Dr. Hsu • Graduate: Min-Hong Lin • IDSL seminar

  2. Outline • Motivation • Objective • The KDD Process • The GSP Algorithm • TEMPADIS • Conclusions • Personal Opinion

  3. Motivation • Research in the area of knowledge discovery in databases (KDD) has recently exploded • Data mining is the application of specific algorithms for extracting patterns from data.

  4. Objective • Use TEMPADIS (temporal pattern discovery system) to help understand the overall process of KDD in a medical database environment. • Results are presented for a database of human immunodeficiency virus (HIV) patients.

  5. The KDD Process (Fayyad et al.) • 1. Understanding the application domain and identifying the goals of the KDD process • 2. Creating a target data set • 3. Data cleaning and preprocessing • 4. Data reduction and projection • 5. Matching goals to a particular data mining method • 6. Exploratory analysis and model/hypothesis selection • 7. Data mining • 8. Interpreting mined patterns • 9. Acting on the discovered knowledge

  6. The Steps in the Knowledge Discovery Process Using TEMPADIS

  7. Identifying the Goal • We are interested in discovering patterns in the data that show that groups of patients had a similar experience during the course of the disease.

  8. Understanding the Domain • Our domain is HIV disease • The HIV Clinical Research Database contains data for over 8,500 patients of the AIDS clinic • Data are collected from: • The hospital charge system • The pharmacy system • The laboratory information system • We are studying methods of discovering useful patterns in medical data that are temporal, in nonstandard form, and variable in their data fields.

  9. Creating a Target Data Set • The objective of this phase is to: • Select a subset of the patients • Approximately 1,100 of the patients have been monitored for at least four years, with a minimum of 30 distinct dates on which at least one type of event occurred • Select a subset of the available variables • All encounters with patients • A subset of the laboratory results • A subset of the pharmacy data

  10. Data Cleaning and Preprocessing • The purpose of this step is • To remove noise from the data • To handle missing data • To make necessary changes • Data could be cleaned up with • SQL statements => easy • Manual processing => time-consuming • It took approximately three man-months to clean up 400 patients’ data

  11. Data Reduction and Projection • The purpose of this step is to find useful features to represent the data, depending on the goal. • Dimensionality reduction or transformation methods are used to reduce the effective number of variables • After re-examining the lab test results, six variables were chosen: WBC, HCT, PLT, CD4A, CD4P, LMPH • All data were normalized to a range of integers from –4 to +4, with 0 normal, and both –4 and +4 indicative of severe illness. • The drugs were grouped into 10 categories according to the reason they were being prescribed.
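
The normalization to integers from –4 to +4 could be sketched as follows. The slides do not say how values were mapped, so the reference range (`normal_low`, `normal_high`) and the `step` width below are illustrative assumptions, not the authors' method.

```python
import math

def normalize_lab(value, normal_low, normal_high, step):
    """Map a raw lab value to an integer in [-4, 4].

    0 means the value lies inside the (assumed) normal range; each
    additional `step` outside the range moves the score one unit
    further from 0, capped at -4 or +4.
    """
    if normal_low <= value <= normal_high:
        return 0
    if value < normal_low:
        # Number of step-widths below the normal range, rounded up.
        steps = math.ceil((normal_low - value) / step)
        return -min(steps, 4)
    # Number of step-widths above the normal range, rounded up.
    steps = math.ceil((value - normal_high) / step)
    return min(steps, 4)
```

With a hypothetical normal range of 4 to 10 and a step of 1, a value of 12 maps to +2 and a value of 0 maps to –4.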

  12. Data Reduction and Projection

  13. Data Reduction and Projection: Health Status • The diagnosis data were incomplete • Use the pharmacy data to learn about the current state of a patient’s health • Use decision-tree induction (C4.5) to develop rules for determining the health status (HS) value for any given patient on any given day.

  14. Data Reduction and Projection: Recovery Time • We needed a measure that gave us a feel for how long the patient might remain in that state • We chose a neural network to learn the recovery time function

  15. Data Reduction and Projection: Recovery Time • Use the NevProp3 neural net software • Randomly selected six days from each of 50 patients • We used a scale of 0 to 5 representing estimated weeks to recovery

  16. Matching Goals to a Particular Data Mining Method • The purpose of this step is to select methods to be used for searching for patterns in the data • Goal: we are trying to discover patterns in sequences of events across patients in a database • There were only a few data mining methods relevant for our goal • We chose Srikant and Agrawal’s generalized sequential patterns (GSP) algorithm as our basis

  17. An Example of the Type of Patterns

  18. The GSP Algorithm • The GSP algorithm uses atomic events as the basis for building up sequences • Only those events that meet the support threshold are considered “supported” by the database • The atomic events that survive are then combined as pairs, both as sequences and as concurrent occurrences • These sequences are checked for support in the database • Only those with enough support contribute to what are called candidate sequences for the next iteration • The GSP algorithm also provides the windowing concept
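
A minimal sketch of the level-wise idea described on this slide, under simplifying assumptions: each patient record is a plain ordered tuple of atomic events, concurrent occurrences and time windows are left out, and the function names (`is_subsequence`, `support`, `gsp_like`) are made up for illustration. It is not Srikant and Agrawal's full algorithm.

```python
from itertools import product

def is_subsequence(candidate, sequence):
    """True if the candidate's events occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `event in it` advances the iterator, so order is enforced.
    return all(event in it for event in candidate)

def support(candidate, database):
    """Fraction of sequences in the database that contain the candidate."""
    return sum(is_subsequence(candidate, seq) for seq in database) / len(database)

def gsp_like(database, min_support):
    """Return all supported sequences, shortest first."""
    events = sorted({e for seq in database for e in seq})
    # Level 1: atomic events that meet the support threshold.
    current = [(e,) for e in events if support((e,), database) >= min_support]
    frequent = []
    while current:
        frequent.extend(current)
        # Join step: a + last(b) whenever a without its first event
        # equals b without its last event.
        candidates = {a + (b[-1],)
                      for a, b in product(current, repeat=2)
                      if a[1:] == b[:-1]}
        # Keep only candidates with enough support for the next iteration.
        current = sorted(c for c in candidates
                         if support(c, database) >= min_support)
    return frequent
```

For example, `gsp_like([("A", "B", "C"), ("A", "C"), ("A", "B", "C"), ("B", "C")], 0.5)` keeps `("A", "B", "C")` because two of the four sequences contain it, and drops `("B", "A")` because none do.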

  19. Exploratory Analysis/Model and Hypothesis Selection • The purpose of this step is to evaluate the model and data mining method selections • This can result in modifications and refinements to the original selections • The original GSP algorithm implementation was insufficient • We propose an event-set sequence approach and a further modification to the GSP algorithm. • TEMPADIS (temporal pattern discovery system)

  20. TEMPADIS • The concept of an event-set is based on the idea that: • Some type of visit to a medical facility was made • Laboratory tests were performed • Prescriptions were dispensed • Generally, those events that happen on the same day or on days very close together are all related • Therefore, we have incorporated the time-windowing technique from the GSP algorithm
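
One way to picture event-sets with time windowing is the sketch below. It assumes each patient's data arrive as `(date, event)` pairs and that a window is anchored at the first date of each event-set; the `window_days` value and the function name are hypothetical, since the slides give no window size.

```python
from datetime import date

def build_event_sets(events, window_days=3):
    """Group (date, event) pairs into event-sets.

    Events whose dates fall within `window_days` of the first date in
    the current event-set are treated as related and merged into it.
    """
    event_sets = []
    window_start = None
    for day, event in sorted(events):
        if window_start is None or (day - window_start).days > window_days:
            # Too far from the current window: start a new event-set.
            event_sets.append(set())
            window_start = day
        event_sets[-1].add(event)
    return event_sets
```

A visit on one day followed by a lab test the next day would land in the same event-set, while a prescription dispensed a week later would start a new one.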

  21. TEMPADIS Algorithm • 1. Read database • 2. Get unique event-sets from database • 3. curSeqs = GenNewSeqs from unique event-sets • 4. While curSeqs • 4a. CalcSupportInDatabase for curSeqs • 4b. supportedSeqs = ExtractBestSuprtedSeqs from curSeqs • 4c. curSeqs = GenNewSeqs from supportedSeqs • endwhile
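
The loop above might be rendered in Python roughly as follows. This is a toy sketch, not the TEMPADIS implementation: event-sets are reduced to hashable labels, support is an exact in-order match, and `min_support` and `max_keep` are assumed parameters standing in for whatever selection rule step 4b really applies.

```python
def calc_support(seq, patients):
    """Fraction of patients whose event-set sequence contains `seq` in order."""
    def contains(patient):
        it = iter(patient)
        return all(es in it for es in seq)
    return sum(contains(p) for p in patients) / len(patients)

def tempadis_loop(patients, min_support, max_keep):
    """Level-wise search following steps 2 to 4c of the slide."""
    if min_support <= 0:
        # A zero threshold would let sequences grow without bound.
        raise ValueError("min_support must be positive")
    # Step 2: unique event-sets in the database.
    unique_sets = sorted({es for p in patients for es in p})
    # Step 3: initial candidate sequences of length one.
    cur_seqs = [(es,) for es in unique_sets]
    discovered = []
    while cur_seqs:                                    # Step 4
        # Step 4a: support of each candidate.
        scored = [(calc_support(s, patients), s) for s in cur_seqs]
        # Step 4b: keep only the best-supported sequences.
        supported = [p for p in scored if p[0] >= min_support]
        supported.sort(key=lambda p: p[0], reverse=True)
        supported = supported[:max_keep]
        discovered.extend(supported)
        # Step 4c: extend each survivor by one more event-set.
        cur_seqs = [s + (es,) for _, s in supported for es in unique_sets]
    return discovered
```

Capping the survivors with `max_keep` keeps the number of candidates per iteration bounded, at the cost of possibly missing patterns that grow out of lower-ranked sequences.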

  22. TEMPADIS Algorithm • Step 4a determines the support for each sequence under consideration in the database • For the data that are present, we use a partial match system • TEMPADIS uses a weakest-link/average-link method for determining whether or not a sequence under consideration is supported by a given patient’s data • Step 4b limits the number of sequences that can be carried over to the next iteration • Step 4c generates the new set of sequences for consideration on the next iteration
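
The slides do not define the weakest-link/average-link test, so the following is only one plausible reading, offered as an assumption: each event-set in a candidate sequence gets a partial-match score between 0 and 1 against the patient's data, computed only over variables that are present, and the sequence counts as supported when both the lowest score and the mean score clear their thresholds. All names and thresholds here are hypothetical.

```python
def element_score(pattern_set, patient_set):
    """Partial-match score for one event-set.

    Both arguments map variable names to normalized values. Variables
    missing from the patient's record are skipped, so only data that
    are present contribute to the score.
    """
    compared = [var for var in pattern_set if var in patient_set]
    if not compared:
        return 0.0  # assumption: nothing to compare counts as no match
    matched = sum(pattern_set[var] == patient_set[var] for var in compared)
    return matched / len(compared)

def is_supported(scores, weakest_min, average_min):
    """Weakest-link: the worst score must reach weakest_min.
    Average-link: the mean score must reach average_min."""
    if not scores:
        return False
    return min(scores) >= weakest_min and sum(scores) / len(scores) >= average_min
```

Under this reading, a sequence with element scores of 1.0 and 0.5 passes thresholds of 0.5 (weakest) and 0.7 (average), while scores of 1.0 and 0.4 fail on the weakest link.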

  23. Data Mining • We begin using TEMPADIS in the data mining step. • We use multiple methods of search control to reduce the computational complexity

  24. Interpreting Mined Patterns • The purpose of this step is to look at what was found and make some sense of it • The clinicians can examine the patterns for significance and meaning • The director of the HIV Clinical Research Group observed that the patients have “poor or no anti-retroviral suppression of their viral loads” • We can look at the specific patients who supported that pattern and then carry out the various analyses

  25. Pattern Discovered by TEMPADIS

  26. Flatness of the Variables

  27. Conclusions • Once we have carried out the KDD process, we can evaluate the results • The concept of interestingness is an overall measure of pattern value • It combines validity, novelty, usefulness and simplicity • TEMPADIS cannot be used to discover whatever patterns might exist • However, TEMPADIS can be used to discover meaningful patterns in areas of specific research interest

  28. Personal Opinion • Step-by-step exploration of the data. • It has not yet addressed the issue of missing data