On Reducing Classifier Granularity in Mining Concept-Drifting Data Streams

Peng Wang, H. Wang, X. Wu, W. Wang, and B. Shi • Proc. of the Fifth IEEE International Conference on Data Mining (ICDM’05) • Speaker: Yu Jiun Liu • Date: 2006/9/26

Introduction • State of the art: incrementally updated classifiers and ensemble classifiers. • Model granularity: traditional models are monolithic; this paper uses semantic decomposition.

Motivation • The model is decomposable into smaller components. • The decomposition is semantic-aware.

Monolithic Models • Stream: • Attributes: • Class label: • Window: • Model (classifier): Ci

Rule-based Models • A rule form : • minsup = 0.3 and minconf = 0.8 • Valid rules of W1 are: • Valid rules of W3 are:
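The rule form and the example rules did not survive on this slide, so the following is only a minimal sketch of how valid rules for one window could be found. It assumes the standard association-rule definitions: the support of P → Ci is the fraction of window records that match P and carry label Ci, and the confidence is that count divided by the number of records that match P. The function name `valid_rules`, the record layout, and the sample window are hypothetical; only the thresholds come from the slide.

```python
from collections import Counter
from itertools import combinations

MINSUP = 0.3    # threshold quoted on the slide
MINCONF = 0.8   # threshold quoted on the slide


def valid_rules(window, minsup=MINSUP, minconf=MINCONF):
    """Return {(pattern, label): (support, confidence)} for one window.

    window  : list of (attrs, label), attrs being a dict {attribute: value}
    pattern : sorted tuple of (attribute, value) pairs (assumed encoding)
    """
    n = len(window)
    pattern_count = Counter()   # records matching pattern P
    rule_count = Counter()      # records matching P and labelled Ci
    for attrs, label in window:
        items = sorted(attrs.items())
        # Brute-force enumeration of every sub-pattern of the record.
        for k in range(1, len(items) + 1):
            for pattern in combinations(items, k):
                pattern_count[pattern] += 1
                rule_count[(pattern, label)] += 1

    rules = {}
    for (pattern, label), count in rule_count.items():
        sup = count / n
        conf = count / pattern_count[pattern]
        if sup >= minsup and conf >= minconf:
            rules[(pattern, label)] = (sup, conf)
    return rules


# Hypothetical window of five records, for illustration only.
window = [
    ({"a1": 1, "a2": 0}, "C1"),
    ({"a1": 1, "a2": 1}, "C1"),
    ({"a1": 1, "a2": 0}, "C1"),
    ({"a1": 0, "a2": 1}, "C2"),
    ({"a1": 0, "a2": 0}, "C2"),
]
rules = valid_rules(window)
```

Enumerating every sub-pattern is exponential in the number of attributes, so this is only suitable for illustrating the definitions on small records.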

Algorithm • Phase 1: Initialization • Use the first w records to train all valid rules for window W1. • Construct the RS-tree and the REC-tree. • Phase 2: Update • When a new record arrives, insert it into the REC-tree and update the support and confidence of the rules it matches. • Delete the oldest record and update the support and confidence of the rules it matched.
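The update phase can be sketched as incremental counting over a sliding window. This sketch deliberately swaps the RS-tree and REC-tree for plain counters and a deque, so it shows only the bookkeeping idea (add the new record's contribution, subtract the oldest record's contribution), not the paper's data structures. The class `WindowedRuleCounts` and its methods are hypothetical names, and the support and confidence definitions are the standard ones, assumed here.

```python
from collections import Counter, deque
from itertools import combinations


def patterns(attrs):
    """Yield every non-empty sub-pattern of a record as a sorted tuple."""
    items = sorted(attrs.items())
    for k in range(1, len(items) + 1):
        yield from combinations(items, k)


class WindowedRuleCounts:
    """Keeps rule counts for the most recent w records (hypothetical helper)."""

    def __init__(self, w):
        self.w = w
        self.window = deque()
        self.pattern_count = Counter()
        self.rule_count = Counter()

    def _apply(self, record, delta):
        attrs, label = record
        for p in patterns(attrs):
            self.pattern_count[p] += delta
            self.rule_count[(p, label)] += delta

    def add(self, record):
        # Phase 2: count the arriving record ...
        self.window.append(record)
        self._apply(record, +1)
        # ... and retire the oldest one once the window is full.
        if len(self.window) > self.w:
            self._apply(self.window.popleft(), -1)

    def support(self, pattern, label):
        n = len(self.window)
        return self.rule_count[(pattern, label)] / n if n else 0.0

    def confidence(self, pattern, label):
        seen = self.pattern_count[pattern]
        return self.rule_count[(pattern, label)] / seen if seen else 0.0


# Hypothetical stream with window size w = 3.
stats = WindowedRuleCounts(w=3)
for rec in [({"a": 1}, "C1"), ({"a": 1}, "C1"), ({"a": 0}, "C2")]:
    stats.add(rec)
before = (stats.support((("a", 1),), "C1"), stats.confidence((("a", 1),), "C1"))
stats.add(({"a": 1}, "C2"))   # evicts the first record
after = (stats.support((("a", 1),), "C1"), stats.confidence((("a", 1),), "C1"))
```

Each arrival touches only the rules matched by the incoming and outgoing records, which is the property the slide's update phase relies on.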

Data Structure

RS-Tree • A prefix tree with attribute order • Each node N represents a unique rule R : P → Ci • N’ (P’ → Cj) is a child node of N, iff:
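The child condition is missing from this slide, so the sketch below does not reproduce the paper's RS-tree. It assumes the simplest prefix-tree convention instead: attributes are sorted by a fixed order, and a child extends its parent's pattern by one (attribute, value) pair that comes later in that order. The names `RSNode`, `RSTree`, `insert`, and `match` are hypothetical.

```python
class RSNode:
    def __init__(self, item=None):
        self.item = item      # (attribute, value) pair added at this node
        self.label = None     # class label if a rule ends at this node
        self.children = {}


class RSTree:
    """Prefix tree over rule patterns (assumed convention, see lead-in)."""

    def __init__(self, attribute_order):
        self.rank = {a: i for i, a in enumerate(attribute_order)}
        self.root = RSNode()

    def _ordered(self, pattern):
        return sorted(pattern.items(), key=lambda kv: self.rank[kv[0]])

    def insert(self, pattern, label):
        """Store the rule pattern -> label; pattern is {attribute: value}."""
        node = self.root
        for item in self._ordered(pattern):
            node = node.children.setdefault(item, RSNode(item))
        node.label = label

    def match(self, record):
        """Return (pattern, label) for every stored rule the record satisfies."""
        found = []

        def walk(node, prefix):
            if node.label is not None:
                found.append((tuple(prefix), node.label))
            for (attr, value), child in node.children.items():
                if record.get(attr) == value:
                    walk(child, prefix + [(attr, value)])

        walk(self.root, [])
        return found


# Hypothetical rules and record, for illustration only.
tree = RSTree(["a1", "a2", "a3"])
tree.insert({"a1": 1}, "C1")
tree.insert({"a2": 0, "a1": 1}, "C1")
tree.insert({"a3": 2}, "C2")
matched = tree.match({"a1": 1, "a2": 0, "a3": 5})
```

Because rules sharing a prefix share a path, looking up the rules matched by a record prunes whole subtrees as soon as one attribute value disagrees.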

REC-Tree • Each record r is stored as a sequence • Node N points to a rule in the RS-tree if:

Detecting Concept Drifts • Percentage of misclassified records vs. distribution of misclassified records. • The percentage approach cannot tell us which part of the classifier gives rise to the inaccuracy.

Definition

Finding Rule Algorithm

Update Algorithm

Experiments • CPU: 1.7 GHz • Memory: 256 MB • Datasets: synthetic and real-life. • Synthetic: • Real-life dataset: • 10,344 records and 8 dimensions.

Effect of model updating • Synthetic data: 10 dimensions • Window size: 5,000 • 4 dimensions changing

The relation of concept drifts and

Effect of rule composition

Accuracy and Time • Window size: 10,000 • EC: 10 classifiers, each trained on 1,000 records. • Synthetic data.

Real-life data

Conclusion • Overcome the effects of concept drifts. • By reducing granularity, change detection and model update can be more efficient without compromising classification accuracy.
