290 likes | 437 Views
Systematic Data Selection to Mine Concept-Drifting Data Streams Wei Fan Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery. 2005/01/21 董原賓. Outline. Issues with data stream Will old data really help? Optimal models
E N D
Systematic Data Selection to Mine Concept-Drifting Data StreamsWei FanProceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery 2005/01/21 董原賓
Outline • Issues with data stream • Will old data really help? • Optimal models • Computing optimal models • Cross Validation Decision Tree Ensemble • Experiment • Conclusion
Issues with data stream • Concept drift • FOi(x) : the optimal model at time stamp i • There is concept drift from time stamp i-1 to time stamp i, if there are inconsistencies between FOi-1(x) and FOi(x) • Data sufficient • A data set is considered sufficient if adding more data will not increase the generalization accuracy • Determining the sufficient amount can be formidably expensive • Even if the dataset is insufficient, still need to train a model that can best fit the changing data.
Will old data really help? Definition: • Si: most recent data chunk • SP: previous data chunks, SP= S1U…U Si-1 • y = f(x): the underlying true model that we aim to model
Will old data really help? (Cont.) Three types of SP: • 1. FOi (x) ≠ FOi-1(x), obviously it will only cancel out the changing concept • 2. FOi (x) = FOi-1(x) ≠ y, both models agree that their predictions are wrong, can’t determine if they will help or not • 3. FOi (x) = FOi-1(x) = y, both models make the correct prediction, it is the only portion that may help
Optimal Models The two main themes of our comparison is on possible data insufficiency and concept drift Four situations: • 1. New data is sufficient by itself and there is no concept drift, the optimal model should be the one trained from the new data itself.
Optimal Models (cont.) • 2. New data is sufficient by itself and there is concept drift, the optimal model should be the one trained from the new data itself. • 3. New data is insufficient by itself and there is no concept drift. If the previous data is sufficient, the optimal model should be the existing model. Otherwise, train a new model from new data plus existing data
Optimal Models (cont.) • 4. New data is insufficient by itself and there is concept drift. Choose previous data chunks that have consistent concept with the new data chunk and combine them with the new data • We will usually never know if the data is indeed sufficient. Ideally, we should compare a few sensible choices if the training cost is affordable
Computing optimal models Definition: • Di-1:the dataset that trained the most recent optimal model FOi-1(x) and is collectively iteratively throughout the streaming mining process • si-1:examples selected from Di-1
Computing optimal models (cont.) Steps: • 1. Train a model FNi(x) from the new data chunk Si only. • 2. Select examples from Di-1 that both the trained new model FNi(x) and the recent optimal model FOi-1(x) make the correct prediction. si-1= { ∀(x,y) ∈ Di-1, such that, (FNi(x) = y) ∧ (FOi-1(x) = y) }
Computing optimal models (cont.) • 3. Train a model FNi+(x) from the data data plus the selected data in the last step (Si U si-1) • 4. Update the most recent model FOi-1(x)with Si and call this model FOi-1+(x). When update the model, keep the structure of the model and update its internal statistics
Computing optimal models (cont.) • 5. Using “cross-validation“ to compare the accuracy of FNi(x), FOi-1(x), FNi+(x) and FOi-1+(x). Choose the one that is the most accurate and name it FOi(x) • 6. Di is the training set that computes FOi(x) and is one of Si, Di-1, Si U si-1, and Si U Di-1
Computing optimal models (cont.) Address how the above framework finds the optimal model under all four previously discussed situations: • 1. New data is sufficient by itself and there is no concept drift. Conceptually FNi(x) should be the optimal model. FOi-1(x), FNi+(x) and FOi-1+(x) could be its close match
Computing optimal models (cont.) • 2. New data is sufficient by itself and there is concept drift. FNi(x) should be the optimal model. FNi+(x) could be very similar in performance to FNi(x) • 3. New data is insufficient by itself and there is no concept drift. The optimal model should be either FOi-1(x) or FOi-1+(x) • 4. New data is insufficient by itself and there is concept drift. The optimal model should be either FNi(x) or FNi+(x)
Cross Validation Decision Tree Ensemble Decision tree ensemble • Step1: sequentially scans the complete dataset once and finds out all features with information gain • Step2: When building trees, at each step choose a remaining feature randomly • Step3: The tree stop growing a branch if there are no more examples passing through that branch
Cross Validation Decision Tree Ensemble (cont.) • Rule1: Features without information gain will never be used • Rule2: Each discrete feature can be used at most once in a particular decision path • Rule3: Each continuous feature can be chosen multiple times on the same decision path
Cross Validation Decision Tree Ensemble (cont.) • Rule4: The splitting threshold is a random value within the max and min of that feature • Rule5: In the training data, each example x is assign an initial weight 1.0 When missing feature value is encountered, the current weight of x is distributed across its children nodes. If the prior distribution of known values are given, the weight is distributed in proportion to this distribution. Otherwise, it is equally divided among the children nodes
Cross Validation Decision Tree Ensemble (cont.) Cross validation: • n: the size of the training set • n-fold cross validation: leaves one example x out and use the remaining n – 1 examples to train a model and classify on the left-out example x
Cross Validation Decision Tree Ensemble (cont.) • Compute the probability of x: Assume that we have two class labels, either fraud or non fraud, the probability if x being fraudulent is
Experiment (streaming data) • Synthetic data:Data with drifting concepts based on a moving hyperplane Roughly half of the examples are positive, and the other half negative • Credit card fraud data:Real life credit card transaction for cost sensitive mining. One year period and 5million transactions. Investigating a fraud transaction cost = $90 • Donation dataset:The famous donation dataset that first appeared in KDDCUP’98 competition. 95412 known records. 96367 unknown records for test
Conclusion • Using old data unselectively is like gambling • Proposed a cross-validation-based framework to choose data and compare sensible choices • Proposed an implementation of this framework using cross-validation decision tree ensemble