This paper introduces a framework for mining concept-drifting data streams with skewed distributions using sampling and ensemble techniques, providing accurate and efficient classification models. It also discusses general issues in stream classification and showcases experiments on synthetic and real data.
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao† Wei Fan‡ Jiawei Han† Philip S. Yu‡ †University of Illinois at Urbana-Champaign ‡IBM T. J. Watson Research Center
Introduction (1) • Data Stream • Continuously arriving data flow • Applications: network traffic, credit card transactions, phone call records, etc.
Introduction (2) • Stream Classification • Construct a classification model based on past records • Use the model to predict labels for new data • Helps decision making (e.g., labeling an incoming transaction as fraud or not)
Framework • [Figure: data chunks arrive over time; a classification model is built from past chunks and used to predict labels for the incoming chunk]
Concept Drifts • Changes in P(x,y) • P(x,y) = P(y|x)P(x), where x is the feature vector and y is the class label • Four cases: No Change, Feature Change, Conditional Change, Dual Change • Expected error is not a good indicator of concept drifts • Training on the most recent data could help reduce expected error • [Figure: evolving decision boundaries at time stamps 1, 11, and 21]
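As a hedged illustration (the function and parameters are my own, not from the paper), the decomposition P(x,y) = P(y|x)P(x) can be simulated to show a feature change, where only P(x) shifts between chunks while P(y|x) stays fixed:

```python
import random

random.seed(0)

def sample_chunk(n, p_x1, p_y1_given_x):
    """Draw n (x, y) pairs from P(x, y) = P(y|x) * P(x) for binary x and y."""
    chunk = []
    for _ in range(n):
        x = 1 if random.random() < p_x1 else 0
        y = 1 if random.random() < p_y1_given_x[x] else 0
        chunk.append((x, y))
    return chunk

# Feature change: P(x=1) shifts from 0.2 to 0.7; P(y|x) is unchanged.
old_chunk = sample_chunk(1000, p_x1=0.2, p_y1_given_x={0: 0.1, 1: 0.8})
new_chunk = sample_chunk(1000, p_x1=0.7, p_y1_given_x={0: 0.1, 1: 0.8})

frac_x1_old = sum(x for x, _ in old_chunk) / len(old_chunk)
frac_x1_new = sum(x for x, _ in new_chunk) / len(new_chunk)
```

A conditional change would instead vary `p_y1_given_x` between chunks, and a dual change would vary both.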
Issues in Stream Classification (1) • Generative Model • Assumes P(y|x) follows some known distribution • Descriptive Model • Lets the data decide • Stream Data • Distribution is unknown and evolving, which favors descriptive models
Issues in Stream Classification (2) • Label Prediction • Classify x into one class • Probability Estimation • Assign x to every class with a probability • Stream Applications • Are stochastic, so prediction confidence information is needed
Mining Skewed Data Streams • Skewed Distribution • Positive examples are rare: credit card frauds, network intrusions • Existing Stream Classification Algorithms • Evaluated on balanced data • Problems • Minority examples are ignored, e.g., a model may classify every leaf node as negative • The cost of misclassifying minority examples is usually huge
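A toy calculation (the numbers are illustrative, not from the paper) shows why accuracy alone hides this cost: with 1% positives, classifying everything as negative still scores 99% accuracy while recalling none of the minority class:

```python
# A chunk of 1000 examples with 1% positive (fraud) labels.
labels = [1] * 10 + [0] * 990
preds = [0] * 1000          # "classify every leaf node as negative"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
pos_recall = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 1) / 10

print(accuracy)    # 0.99
print(pos_recall)  # 0.0
```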
Stream Ensemble Approach (1) • Step 1: Sampling • The current training chunk alone has insufficient positive examples, so positive examples from past chunks are also collected
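A minimal sketch of the sampling step (the function name and `neg_keep` parameter are my own): keep positive examples from current and past chunks, but draw negatives only from the most recent chunk so they reflect the current concept:

```python
import random

def build_training_set(chunks, neg_keep, seed=42):
    """Positives from all chunks; negatives sampled from the newest chunk only."""
    rng = random.Random(seed)
    positives = [(x, y) for chunk in chunks for (x, y) in chunk if y == 1]
    current_negatives = [(x, y) for (x, y) in chunks[-1] if y == 0]
    negatives = rng.sample(current_negatives, min(neg_keep, len(current_negatives)))
    return positives + negatives

chunks = [
    [(0.1, 1), (0.2, 0), (0.3, 0)],            # older chunk
    [(0.4, 1), (0.5, 0), (0.6, 0), (0.7, 0)],  # current chunk
]
train = build_training_set(chunks, neg_keep=2)
```

Here `train` contains both positives (from either chunk) and two negatives drawn only from the current chunk.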
Stream Ensemble Approach (2) • Step 2: Ensemble • Split the negative examples into k disjoint sets, pair each set with the positive examples, and train classifiers C1, C2, …, Ck
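One way to sketch the ensemble step (the one-feature threshold classifier is a stand-in of mine, not the paper's base learner): partition the sampled negatives into k disjoint sets, train one model per set together with all positives, and average the k outputs as the probability estimate:

```python
def partition(items, k):
    """Split items into k disjoint, near-equal parts."""
    return [items[i::k] for i in range(k)]

def train_threshold(data):
    """Toy one-feature base learner: threshold at the midpoint of class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 if x > threshold else 0.0

def ensemble_proba(models, x):
    """Average the models' outputs to estimate P(y=1 | x)."""
    return sum(m(x) for m in models) / len(models)

positives = [(0.9, 1), (0.8, 1)]
negatives = [(0.1, 0), (0.2, 0), (0.3, 0), (0.15, 0)]
models = [train_threshold(positives + part) for part in partition(negatives, 2)]
```

Because each model sees a disjoint negative set, their errors tend to be uncorrelated and averaging them reduces variance.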
Why Does This Approach Work? • Incorporation of old positive examples • Increases the training size and reduces variance • Negative examples reflect current concepts, so the increase in boundary bias is small • Ensemble • Reduces the variance caused by a single model • Disjoint sets of negative examples mean the classifiers make uncorrelated errors • Compared with Bagging & Boosting • Their running cost is much higher • They cannot generate reliable probability estimates for skewed distributions
Analysis • Error Reduction • Sampling • Ensemble • Efficiency Analysis • Single model • Ensemble • Ensemble is more efficient
Experiments • Measures • Mean Squared Error • ROC Curve • Recall-Precision Curve • Baseline Methods • NS: No Sampling + Single Model • SS: Sampling + Single Model • SE: Sampling + Ensemble
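For reference, the mean squared error measure compares the estimated P(y=1|x) against the true 0/1 labels; the probability values below are illustrative, not taken from the experiments:

```python
def mean_squared_error(probs, labels):
    """MSE between estimated probabilities and true 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

# Three examples: estimates 0.9, 0.2, 0.8 against labels 1, 0, 1.
mse = mean_squared_error([0.9, 0.2, 0.8], [1, 0, 1])
```

Lower MSE means the estimated probabilities track the true labels more closely, which matters when prediction confidence drives decisions.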
Experimental Results (1) • Mean Squared Error on Synthetic Data • Feature Change: only P(x) changes • Conditional Change: only P(y|x) changes • Dual Change: both P(x) and P(y|x) change
Experimental Results (2) Mean Squared Error on Real Data
Experimental Results (3) Plots on Synthetic Data ROC Curve Recall-Precision Plot
Experimental Results (4) Plots on Real Data ROC Curve Recall-Precision Plot
Experimental Results (5) Training Time
Conclusions • General issues in stream classification • concept drifts • descriptive model • probability estimation • Mining skewed data streams • sampling and ensemble techniques • accurate and efficient • Wide applications • graph data • air force data
Thanks! • Any questions?