Temporal Pattern Discovery in Course-of-Disease Data

Temporal Pattern Discovery in Course-of-Disease Data: A System that Utilizes Event-Set Sequencing for Knowledge Discovery within a Database of HIV Patients • Author: Jorge C.G. Ramirez et al. • Advisor: Dr. Hsu • Graduate: Min-Hong Lin • IDSL seminar

Outline • Motivation • Objective • The KDD Process • The GSP Algorithm • TEMPADIS • Conclusions • Personal Opinion

Motivation • There has been a recent explosion of research in the area of knowledge discovery in databases (KDD) • Data mining is the application of specific algorithms for extracting patterns from data.

Objective • Use TEMPADIS (temporal pattern discovery system) to help understand the overall process of KDD in a medical database environment. • Results are presented for a database of human immunodeficiency virus (HIV) patients.

The KDD Process (Fayyad U., et al.) • 1. Understanding the application domain and identifying the goals of the KDD process • 2. Creating a target data set • 3. Data cleaning and preprocessing • 4. Data reduction and projection • 5. Matching goals to a particular data mining method • 6. Exploratory analysis/model and hypothesis selection • 7. Data mining • 8. Interpreting mined patterns • 9. Acting on the discovered knowledge

The Steps in the Knowledge Discovery Process Using TEMPADIS

Identifying the Goal • We are interested in discovering patterns in the data that show that groups of patients had similar experiences during the course of the disease.

Understanding the Domain • Our domain is HIV disease • The HIV Clinical Research Database contains data for over 8,500 patients of the AIDS clinic • Data are collected from: • The hospital charge system • The pharmacy system • The laboratory information system • We are studying methods of discovering useful patterns in medical data that are temporal, in nonstandard form, and have variable data fields.

Creating a Target Data Set • The objectives of this phase are to: • Select a subset of the patients • Approximately 1,100 of the patients have been monitored for at least four years, with a minimum of 30 distinct dates on which at least one type of event occurred • Select a subset of the available variables • All encounters with patients • A subset of the laboratory results • A subset of the pharmacy data

Data Cleaning and Preprocessing • The purpose of this step is: • To remove noise from the data • To handle missing data • To make necessary changes • The data could be cleaned up with: • SQL statements => easy • Manual processing => time-consuming • It took approximately three man-months to clean up the data for 400 patients

Data Reduction and Projection • The purpose of this step is to find useful features to represent the data, depending on the goal. • Dimensionality reduction or transformation methods are used to reduce the effective number of variables • After the lab test results were re-examined, six variables were chosen: WBC, HCT, PLT, CD4A, CD4P, LMPH • All data were normalized to a range of integers from –4 to +4, with 0 normal, and both –4 and +4 indicative of severe illness. • The drugs were grouped into 10 categories according to the reason they were being prescribed.
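The slide does not say how raw lab values were mapped onto the –4 to +4 scale, so the following is only a minimal sketch under stated assumptions: a hypothetical normal range maps to 0, and each additional `step` units outside that range adds one severity level, capped at 4. The function name, the parameters, and the numbers in the usage note are illustrative and are not clinical reference values.

```python
def normalize_lab(value, normal_low, normal_high, step):
    """Map a raw lab value to an integer in [-4, 4]; 0 means within the normal range.

    Assumption (not from the source): severity grows by one level for every
    `step` units the value lies outside the normal range, capped at 4.
    """
    if normal_low <= value <= normal_high:
        return 0
    if value < normal_low:
        levels = int((normal_low - value) // step) + 1
        return -min(levels, 4)
    levels = int((value - normal_high) // step) + 1
    return min(levels, 4)
```

For example, with an assumed normal range of 36 to 50 and a step of 4, a value of 35 maps to -1 and a value of 10 is clamped to -4.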

Data Reduction and Projection

Data Reduction and Projection: Health Status • The diagnosis data were incomplete • Use the pharmacy data to learn about the current state of a patient’s health • Use decision-tree induction (C4.5) to develop rules for determining the health status (HS) value for any given patient on any given day.

Data Reduction and Projection: Recovery Time • We need a measure that gives us a feel for how long the patient might remain in a given health state • We chose a neural network to learn the recovery time function

Data Reduction and Projection: Recovery Time • Use the NevProp3 neural net software • Randomly selected six days from each of 50 patients • We used a scale of 0 to 5 representing estimated weeks to recovery

Matching Goals to a Particular Data Mining Method • The purpose of this step is to select the methods to be used for searching for patterns in the data • Goal: we are trying to discover patterns in sequences of events across patients in a database • Only a few data mining methods were relevant to our goal • We chose Srikant and Agrawal’s generalized sequential patterns (GSP) algorithm as our basis

An Example of the Type of Patterns

The GSP Algorithm • The GSP algorithm uses atomic events as the basis for building up sequences • Only those events that meet the support threshold are “supported” by the database • The atomic events that survive are then combined as pairs, both as sequences and as concurrent occurrences • These pairs are checked for support in the database • Only those with enough support contribute to the candidate sequences for the next iteration • The GSP algorithm also provides the windowing concept
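The first two passes described above can be sketched as follows. This is a simplified illustration, not Srikant and Agrawal's full algorithm: it omits the windowing and the candidate join for longer sequences, and it assumes each patient is a list of sets of event labels in time order. The function names (`contains`, `support`, `frequent_pairs`) and the toy data in the usage note are hypothetical.

```python
from itertools import combinations

def contains(patient_seq, candidate):
    """True if the candidate's elements occur, in order, within patient_seq."""
    pos = 0
    for element in candidate:
        # Advance to the first remaining patient element that includes this element.
        while pos < len(patient_seq) and not element <= patient_seq[pos]:
            pos += 1
        if pos == len(patient_seq):
            return False
        pos += 1  # the next element must match a strictly later position
    return True

def support(database, candidate):
    """Fraction of patients whose data contain the candidate sequence."""
    return sum(contains(seq, candidate) for seq in database) / len(database)

def frequent_pairs(database, min_support):
    """Pass 1 keeps supported atomic events; pass 2 pairs them and re-checks support."""
    events = sorted({e for seq in database for element in seq for e in element})
    atomic = [e for e in events
              if support(database, [frozenset([e])]) >= min_support]
    # Pairs as sequences (a followed later by b) ...
    candidates = [[frozenset([a]), frozenset([b])] for a in atomic for b in atomic]
    # ... and as concurrent occurrences (a and b in the same element).
    candidates += [[frozenset(pair)] for pair in combinations(atomic, 2)]
    return [c for c in candidates if support(database, c) >= min_support]
```

With three toy patients, `[{"visit", "lab"}, {"rx"}]`, `[{"visit"}, {"lab", "rx"}]` and `[{"visit", "lab"}, {"lab"}]`, and a support threshold of 0.6, the surviving pairs are visit followed by lab, visit followed by rx, and lab concurrent with visit.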

Exploratory Analysis/Model and Hypothesis Selection • The purpose of this step is to evaluate the model and data mining method selections • This can result in modifications and refinements to the original selections • The original GSP algorithm implementation was insufficient • We propose an event-set sequence approach and a further modification to the GSP algorithm • TEMPADIS (temporal pattern discovery system)

TEMPADIS • The concept of an event-set is based on the following idea: • Some type of visit to a medical facility was made • Laboratory tests were performed • Prescriptions were dispensed • Generally, events that happen on the same day or on days very close together are all related • Therefore, we have incorporated the time-windowing technique from the GSP algorithm
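One way to picture the event-set idea is to merge a patient's dated events into a set whenever they fall within a few days of the first event in that set. This is a minimal sketch under assumptions: the slides do not give the window size or the grouping rule, so `window_days=3`, the function name, and the event labels in the usage note are hypothetical.

```python
from datetime import date

def build_event_sets(events, window_days=3):
    """Group (date, label) pairs into event-sets.

    Assumption: an event joins the current event-set if it occurs within
    `window_days` of that set's first date; otherwise it starts a new set.
    Returns a list of (start_date, frozenset_of_labels) in time order.
    """
    event_sets = []
    for day, label in sorted(events):
        if event_sets and (day - event_sets[-1][0]).days <= window_days:
            event_sets[-1][1].add(label)
        else:
            event_sets.append((day, {label}))
    return [(start, frozenset(labels)) for start, labels in event_sets]
```

For instance, a visit and a WBC test on 1 March 1997 and a CD4A test on 2 March 1997 would form one event-set, while a prescription on 20 March 1997 would start a second one.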

TEMPADIS Algorithm • 1. Read database • 2. Get unique event-sets from database • 3. curSeqs = GenNewSeqs from unique event-sets • 4. While curSeqs • 4a. CalcSupportInDatabase for curSeqs • 4b. supportedSeqs = ExtractBestSuprtedSeqs from curSeqs • 4c. curSeqs = GenNewSeqs from supportedSeqs • endwhile

TEMPADIS Algorithm • Step 4a determines the support in the database for each sequence under consideration • For the data that are present, we use a partial match system • TEMPADIS uses a weakest-link/average-link method for determining whether or not a sequence under consideration is supported by a given patient’s data • Step 4b limits the number of sequences that can be carried over to the next iteration • Step 4c generates the new set of sequences for consideration on the next iteration
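The slides do not define the partial-match score or the weakest-link/average-link thresholds, so the sketch below is one plausible reading rather than the TEMPADIS implementation. It assumes a link score equal to the fraction of a pattern event-set found in a patient event-set, a greedy in-order alignment, and hypothetical thresholds `weakest_min` and `average_min`; `extract_best_supported` illustrates step 4b with an assumed `beam_width` cap.

```python
def link_score(pattern_set, patient_set):
    """Assumed partial-match measure: share of the pattern's events that are present."""
    return len(pattern_set & patient_set) / len(pattern_set)

def patient_supports(pattern, patient_seq, weakest_min=0.5, average_min=0.75):
    """Step 4a for one patient, under the assumptions stated above.

    Each pattern event-set is aligned greedily to the best-scoring patient
    event-set that comes after the previous match (earliest one on ties).
    """
    scores, start = [], 0
    for pattern_set in pattern:
        remaining = patient_seq[start:]
        if not remaining:
            return False
        best = max(range(len(remaining)),
                   key=lambda i: link_score(pattern_set, remaining[i]))
        scores.append(link_score(pattern_set, remaining[best]))
        start += best + 1
    # Weakest link: no single match may be too poor.
    # Average link: the matches must be good overall.
    return min(scores) >= weakest_min and sum(scores) / len(scores) >= average_min

def extract_best_supported(support_by_seq, min_support, beam_width):
    """Step 4b: keep at most `beam_width` sequences that meet `min_support`."""
    kept = [(s, seq) for seq, s in support_by_seq.items() if s >= min_support]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [seq for _, seq in kept[:beam_width]]
```

The greedy alignment keeps the sketch short but can miss a better overall alignment; a dynamic-programming alignment would be the more careful choice if exact scores mattered.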

Data Mining • We begin using TEMPADIS in the data mining step. • We use multiple methods of search control to reduce the computational complexity

Interpreting Mined Patterns • The purpose of this step is to look at what was found and make sense of it • The clinicians can examine the patterns for significance and meaning • The director of the HIV Clinical Research Group observed that the patients have “poor or no anti-retroviral suppression of their viral loads” • We can look at the specific patients who supported that pattern and then carry out further analyses

Pattern Discovered by TEMPADIS

Flatness of the Variables

Conclusions • Once we have carried out the KDD process, we can evaluate the results • The concept of interestingness is an overall measure of pattern value • It covers validity, novelty, usefulness, and simplicity • TEMPADIS cannot be used to discover whatever patterns might exist • However, TEMPADIS can be used to discover meaningful patterns in areas of specific research interest

Personal Opinion • The paper offers a step-by-step exploration of the data. • It has not yet addressed the issue of missing data
