On Reducing Classifier Granularity in Mining Concept-Drifting Data Streams. Peng Wang, H. Wang, X. Wu, W. Wang, and B. Shi. Proc. of the Fifth IEEE International Conference on Data Mining (ICDM’05). Speaker: Yu Jiun Liu. Date : 2006/9/26.
Introduction • State of the art • The incrementally updated classifiers. • The ensemble classifiers. • Model Granularity • Traditional : monolithic • This paper : semantic decomposition
Motivation • The model is decomposable into smaller components. • The decomposition is semantic-aware.
Monolithic Models • Stream : • Attributes : • Class Label : • Window : • Model (Classifier) :Ci
Rule-based Models • A rule form : • minsup = 0.3 and minconf = 0.8 • Valid rules of W1 are: • Valid rules of W3 are:
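The rule form and the example rules on this slide were lost with the images, so the following is only a minimal sketch of how valid rules could be selected with minsup and minconf. It assumes the usual association-rule definitions (support = fraction of window records that match the pattern and carry the class label; confidence = that count divided by the number of records matching the pattern). The function name `valid_rules` and the record layout are hypothetical, and the brute-force enumeration is only practical for a handful of attributes.

```python
from collections import Counter
from itertools import combinations


def valid_rules(window, minsup=0.3, minconf=0.8):
    """Return {(pattern, label): (sup, conf)} for rules valid in the window.

    window: list of (attrs, label), where attrs is a dict attribute -> value.
    pattern: tuple of (attribute, value) pairs in sorted attribute order.
    Assumed definitions (not taken from the slide):
      sup  = count(pattern and label) / len(window)
      conf = count(pattern and label) / count(pattern)
    """
    n = len(window)
    pattern_count = Counter()
    rule_count = Counter()
    for attrs, label in window:
        items = sorted(attrs.items())
        # Enumerate every non-empty subset of the record's attribute values.
        for k in range(1, len(items) + 1):
            for pattern in combinations(items, k):
                pattern_count[pattern] += 1
                rule_count[(pattern, label)] += 1

    rules = {}
    for (pattern, label), count in rule_count.items():
        sup = count / n
        conf = count / pattern_count[pattern]
        if sup >= minsup and conf >= minconf:
            rules[(pattern, label)] = (sup, conf)
    return rules
```

Calling `valid_rules(window)` on the first w records corresponds to training the rule set for window W1; the thresholds default to the slide's values of 0.3 and 0.8.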
Algorithm • Phase 1 : Initialization • Use the first w records to train all valid rules for window W1. • Construct the RS-tree and REC-tree. • Phase 2 : Update • When a new record arrives, insert it into the REC-tree and update the support and confidence of the rules it matches. • Delete the oldest record and update the support and confidence of the rules it matched.
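The update phase can be illustrated with a simplified sketch that keeps per-rule counters over a sliding window of w records. It deliberately replaces the RS-tree and REC-tree with a plain dictionary scan, tracks only a fixed rule set, and does not discover new rules, so it shows the bookkeeping idea rather than the paper's data structures. The class name `SlidingRuleStats` and the record layout are assumptions.

```python
from collections import deque


class SlidingRuleStats:
    """Maintain support/confidence counts for a fixed rule set over a window."""

    def __init__(self, rules, w):
        self.w = w
        self.window = deque()
        # rule -> [records matching the pattern, records matching pattern and label]
        self.counts = {rule: [0, 0] for rule in rules}

    @staticmethod
    def _matches(pattern, attrs):
        return all(attrs.get(a) == v for a, v in pattern)

    def _apply(self, record, delta):
        # Add (delta=+1) or remove (delta=-1) one record's contribution.
        attrs, label = record
        for (pattern, rule_label), c in self.counts.items():
            if self._matches(pattern, attrs):
                c[0] += delta
                if label == rule_label:
                    c[1] += delta

    def add(self, record):
        # Evict the oldest record first once the window is full.
        if len(self.window) == self.w:
            self._apply(self.window.popleft(), -1)
        self.window.append(record)
        self._apply(record, +1)

    def sup_conf(self, rule):
        matched, correct = self.counts[rule]
        sup = correct / len(self.window) if self.window else 0.0
        conf = correct / matched if matched else 0.0
        return sup, conf
```

Each arrival costs one scan of the rule set here; the point of the tree structures on the following slides is to avoid that scan by reaching only the rules a record actually matches.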
RS-Tree • A prefix tree with attribute order • Each node N represents a unique rule R : P ⇒ Ci • N’ (P’ ⇒ Cj) is a child node of N, iff:
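The exact parent-child condition is missing from this slide, so the sketch below is not the RS-tree itself. It is a generic prefix tree (trie) over rule patterns sorted by a fixed attribute order, given only to show how such an order lets a record find all of its matching rules by walking the tree. Unlike the slide's description, a node here may be an intermediate prefix with no rule, or hold rules for several class labels. All names (`RuleNode`, `insert_rule`, `matching_rules`) are hypothetical.

```python
class RuleNode:
    def __init__(self, item=None):
        self.item = item      # (attribute, value) pair added at this node
        self.children = {}    # (attribute, value) -> RuleNode
        self.labels = {}      # class label -> (sup, conf) for rules ending here


def insert_rule(root, pattern, label, sup, conf, order):
    """Insert rule pattern => label, with pattern items sorted by `order`."""
    node = root
    for item in sorted(pattern, key=lambda it: order.index(it[0])):
        node = node.children.setdefault(item, RuleNode(item))
    node.labels[label] = (sup, conf)
    return node


def matching_rules(root, attrs, order):
    """Return (pattern, label, stats) for every stored rule the record matches."""
    items = [(a, attrs[a]) for a in order if a in attrs]
    found = []

    def walk(node, start, prefix):
        for label, stats in node.labels.items():
            found.append((tuple(prefix), label, stats))
        # Only later attributes may extend the current prefix.
        for i in range(start, len(items)):
            child = node.children.get(items[i])
            if child is not None:
                walk(child, i + 1, prefix + [items[i]])

    walk(root, 0, [])
    return found
```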
REC-Tree • Each record r is stored as a sequence • Node N points to a rule in the RS-tree if :
Detecting Concept Drifts • Percentage of misclassified records vs. distribution of the misclassified records. • The percentage approach cannot tell us which part of the classifier gives rise to the inaccuracy.
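The contrast between the two approaches can be sketched as follows: the overall error rate is a single number, while counting misclassifications per rule shows where the errors concentrate. The assumption that each prediction can be attributed to one rule, and the names `error_summary` and `rule_id`, are illustrative and not taken from the slides.

```python
from collections import Counter


def error_summary(predictions):
    """predictions: list of (rule_id, predicted_label, actual_label).

    Returns (overall error rate, Counter of errors per rule).
    """
    total = len(predictions)
    errors = Counter(
        rule_id for rule_id, predicted, actual in predictions
        if predicted != actual
    )
    rate = sum(errors.values()) / total if total else 0.0
    return rate, errors
```

Two windows with the same error rate can have very different per-rule counters; a counter dominated by a few rules points to the components of the model that need updating.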
Experiments • CPU : 1.7 GHz • Memory : 256MB • Datasets : synthetic and real-life datasets. • Synthetic : • Real-life dataset : • 10,344 records and 8 dimensions.
Effect of model updating • Synthetic data : 10 dimensions, 4 dimensions changing • Window size : 5,000
Accuracy and Time • Window size : 10,000 • EC : 10 classifiers, each trained on 1000 records. • Synthetic data.
Conclusion • Overcome the effects of concept drifts. • By reducing granularity, change detection and model update can be more efficient without compromising classification accuracy.