Classification and Novel Class Detection in Data Streams
Mehedy Masud¹, Latifur Khan¹, Jing Gao², Jiawei Han², and Bhavani Thuraisingham¹
¹Department of Computer Science, University of Texas at Dallas
²Department of Computer Science, University of Illinois at Urbana-Champaign
This work was funded in part by
Presentation Overview
• Stream Mining Background
• Novel Class Detection – Concept Evolution
Data Streams
• Continuous flows of data
• Examples: network traffic, sensor data, call center records
• Data streams are:
Data Stream Classification
[Diagram: network traffic passes a firewall; a classification model separates attack traffic (blocked and quarantined) from benign traffic reaching the server, while expert analysis and labeling drive model updates]
• Uses past labeled data to build a classification model
• Predicts the labels of future instances using the model
• Helps decision making
Challenges (Introduction)
• Infinite length
• Concept-drift
• Concept-evolution (emergence of novel classes)
• Recurring (seasonal) classes
Infinite Length
[Illustration: an unbounded stream of incoming data]
• Impractical to store and use all historical data
• Would require unbounded storage and running time
Concept-Drift
[Figure: a data chunk in feature space; the decision hyperplane shifts from its previous position to the current one, so positive and negative instances near the boundary fall victim to concept-drift]
Concept-Evolution
[Figure: a two-dimensional feature space split by thresholds x1, y1, y2 into regions A–D; the existing + and - classes occupy their regions, while instances of a novel class (marked X) appear in a region not covered by them]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances.
Background: Ensemble of Classifiers
[Figure: an input instance (x, ?) is fed to classifiers C1, C2, C3; their individual outputs (+, +, -) are combined by voting into the ensemble output (+)]
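To make the voting step concrete, here is a minimal sketch of majority voting over an ensemble. It assumes scikit-learn-style classifiers with a predict method; the function and variable names are illustrative and not part of the original slides.

import numpy as np

def ensemble_predict(models, x):
    # Each classifier votes for a class label; the majority label wins.
    # models: list of fitted classifiers following the scikit-learn API.
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]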
Background: Ensemble Classification of Data Streams
[Figure: labeled chunks D1–D5 train classifiers C1–C5; the best L of them form the ensemble that predicts labels for the latest unlabeled chunk]
• Divide the data stream into equal-sized chunks
• Train a classifier from each data chunk
• Keep the best L such classifiers as the ensemble (example: L = 3) – see the sketch below
• Note: each chunk Di may contain data points from different classes
• Addresses infinite length and concept-drift
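A simplified sketch of this chunk-based training loop, assuming decision trees as the base learner and plain accuracy on the newest labeled chunk as the selection criterion (the cited papers use error-weighted selection; the names here are illustrative, not the authors' exact algorithm).

from sklearn.tree import DecisionTreeClassifier

def update_ensemble(ensemble, chunk_X, chunk_y, L=3):
    # Train a new classifier on the latest labeled chunk.
    new_model = DecisionTreeClassifier().fit(chunk_X, chunk_y)
    candidates = ensemble + [new_model]
    # Keep the L classifiers that do best on the newest chunk
    # (a stand-in for the weighted selection used in the papers).
    candidates.sort(key=lambda m: m.score(chunk_X, chunk_y), reverse=True)
    return candidates[:L]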
Examples of Recurring and Novel Classes (Introduction)
• Twitter stream – a stream of messages
• Each message may be assigned a category or "class" based on its topic
• Examples: "Election 2012", "London Olympics", "Halloween", "Christmas", "Hurricane Sandy", etc.
• Among these, "Election 2012" and "Hurricane Sandy" are novel classes because they are new events
• "Halloween" is a recurring class because it "recurs" every year
Concept-Evolution and Feature Space (Introduction)
[Figure and classification rules (R1, R2) repeated from the Concept-Evolution slide above]
Existing classification models misclassify novel class instances.
Novel Class Detection – Prior Work
Three steps:
• Training and building the decision boundary
• Outlier detection and filtering
• Computing cohesion and separation
Training: Creating the Decision Boundary (Prior Work)
• Training is done chunk by chunk (one classifier per chunk)
• An ensemble of classifiers is used for classification
[Figure: raw training data in feature space is clustered; each cluster is summarized as a pseudopoint, and the union of the pseudopoints forms the decision boundary]
• Addresses the infinite-length problem (a clustering sketch follows below)
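A minimal sketch of building such a decision boundary, assuming k-means clustering and hyperspherical pseudopoints (centroid plus radius). The cluster count and helper names are illustrative assumptions, not the authors' exact procedure.

import numpy as np
from sklearn.cluster import KMeans

def build_pseudopoints(chunk_X, k=50):
    # Cluster the raw training data of one chunk (or one class) and summarize
    # each cluster as a pseudopoint: (centroid, radius of the farthest member).
    k = min(k, len(chunk_X))
    km = KMeans(n_clusters=k, n_init=10).fit(chunk_X)
    pseudopoints = []
    for c in range(k):
        members = chunk_X[km.labels_ == c]
        centroid = km.cluster_centers_[c]
        radius = np.max(np.linalg.norm(members - centroid, axis=1))
        pseudopoints.append((centroid, radius))
    # The union of these hyperspheres acts as the decision boundary.
    return pseudopoints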
Outlier Detection and Filtering (Prior Work)
[Figure: a test instance x inside the decision boundary is not an outlier; one outside the boundary is a raw outlier (Routlier). x is checked against all L models M1 … ML of the ensemble: if every model flags it as a Routlier, x becomes a filtered outlier (Foutlier), a potential novel class instance; otherwise it is treated as an existing class instance]
Routliers may appear as a result of novel classes, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible. (A minimal sketch follows below.)
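A minimal sketch of this outlier test under the pseudopoint representation from the previous sketch (hypersphere membership); the function names are illustrative.

import numpy as np

def is_raw_outlier(pseudopoints, x):
    # Routlier w.r.t. one model: x lies outside every pseudopoint hypersphere.
    return all(np.linalg.norm(x - centroid) > radius
               for centroid, radius in pseudopoints)

def is_filtered_outlier(models, x):
    # Foutlier: x is a raw outlier w.r.t. all L models in the ensemble,
    # i.e. a potential novel class instance.
    return all(is_raw_outlier(m, x) for m in models)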
Computing Cohesion & Separation (Prior Work)
• a(x) = mean distance from an Foutlier x to the instances in λ_{o,q}(x), its q nearest Foutlier neighbors
• b_c(x) = mean distance from x to its q nearest existing instances of class c; b_min(x) = minimum among all b_c(x) (e.g., b_+(x) in the figure)
• q-Neighborhood Silhouette Coefficient: q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x))
[Figure: an Foutlier x with its 5-nearest Foutlier neighborhood λ_{o,5}(x) and its 5-nearest + and - neighborhoods, illustrating a(x), b_+(x), and b_-(x)]
• If q-NSC(x) is positive, x is closer to the Foutliers than to any existing class.
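A direct sketch of the cohesion/separation computation above, assuming Euclidean distance, that the Foutlier set passed in excludes x itself, and that class_instances is a dict mapping each existing class label to an array of that class's instances (all names are illustrative).

import numpy as np

def mean_qnn_distance(x, points, q):
    # Mean distance from x to its q nearest neighbors among `points`.
    d = np.sort(np.linalg.norm(points - x, axis=1))
    return d[:q].mean()

def q_nsc(x, foutliers, class_instances, q=50):
    # a(x): cohesion with the other Foutliers;
    # b_min(x): separation from the closest existing class.
    a = mean_qnn_distance(x, foutliers, q)
    b_min = min(mean_qnn_distance(x, pts, q) for pts in class_instances.values())
    return (b_min - a) / max(b_min, a)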
Limitation: Recurring Classes (Prior Work)
[Figure: chunk timeline from chunk 0 to chunk 150 with segments labeled Stream (chunks 0–50), Novel (chunks 51–100), and Recurrence (chunks 101–150): a class that appeared earlier in the stream disappears for many chunks and, when it reappears, is reported as novel rather than recurring]
Why Are Recurring Classes Forgotten? (Prior Work)
[Figure: the same chunk-based ensemble as before – labeled chunks D1–D5 train classifiers C1–C5, and only the best L are kept]
• Divide the data stream into equal-sized chunks
• Train a classifier from each whole data chunk
• Keep the best L such classifiers as the ensemble (example: L = 3)
• Addresses infinite length and concept-drift
• Therefore, old models are discarded, and old classes are "forgotten" after a while
CLAM: The Proposed Approach – CLAss-based Micro-classifier Ensemble
[Flow diagram: the latest labeled chunk is used to train a new model, which updates the ensemble M (which keeps all classes); the latest unlabeled instance goes through outlier detection against M – if it is not an outlier, it is classified using M as an existing class instance; otherwise it is buffered for novel class detection]
Training and Updating (Proposed Method)
• Each chunk is first separated into its different classes
• A micro-classifier is trained from each class's data
• Each new micro-classifier replaces one existing micro-classifier
• A total of L micro-classifiers make up a Micro-Classifier Ensemble (MCE)
• C such MCEs constitute the whole ensemble E (a per-class update sketch follows below)
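A rough sketch of the class-based update, reusing build_pseudopoints from the earlier sketch as the micro-classifier. Representing each MCE as a fixed-length deque is my simplification, not the authors' exact bookkeeping.

import numpy as np
from collections import deque

def update_clam(ensemble, chunk_X, chunk_y, L=3):
    # ensemble: dict mapping class label -> deque of at most L micro-classifiers (an MCE).
    for label in np.unique(chunk_y):
        class_X = chunk_X[chunk_y == label]          # separate the chunk by class
        micro = build_pseudopoints(class_X)          # one micro-classifier per class
        if label not in ensemble:
            ensemble[label] = deque(maxlen=L)
        ensemble[label].append(micro)                # the oldest one is replaced beyond L
    return ensemble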
Outlier Detection and Classification (Proposed Method)
• A test instance x is first classified with each micro-classifier ensemble
• Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)
• If all ensembles flag x as an outlier, it is buffered and sent to the novel class detector
• Otherwise, the partial outputs are combined and a class label is predicted (a rough sketch follows below)
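A loose sketch of this decision path, reusing is_raw_outlier from the earlier sketch. Counting in-boundary micro-classifiers as the partial output Yr is my simplification of the authors' scoring.

def classify_or_buffer(ensemble, x, buffer):
    # ensemble: dict mapping class label -> list/deque of micro-classifiers (MCE).
    outlier_flags, partial_outputs = {}, {}
    for label, mce in ensemble.items():
        hits = sum(1 for micro in mce if not is_raw_outlier(micro, x))
        outlier_flags[label] = (hits == 0)           # outlier w.r.t. this class's MCE
        partial_outputs[label] = hits                # crude stand-in for Yr
    if all(outlier_flags.values()):
        buffer.append(x)                             # potential novel class: defer to the q-NSC test
        return None
    return max(partial_outputs, key=partial_outputs.get)  # combine partial outputs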
Evaluation Evaluation • Competitors: • CLAM (CL) – proposed work • SCANR (SC) [1] – prior work • ECSMiner (EM) [2] – prior work • Olindda [3]-WCE [4] (OW) – another baseline • Datasets: Synthetic, KDD Cup 1999 & Forest covertype 1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham, Detecting recurring and novel classes in concept-drifting data streams,” in Proc. ICDM ’11, Dec. 2011, pp. 1176–181. 2. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham.Classification and novel class detection in concept-drifting data streams under time constraints. In Preprints, IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6): 859-874 (2011). 3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In Proc. 2008 ACM symposium on Applied computing, pages 976–980, 2008. 4. H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 226–235, Washington, DC, USA, Aug, 2003. ACM. ICDM 2012, Brussels, Belgium
Overall Error (Evaluation)
[Charts: error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD]
Number of Recurring Classes vs. Error (Evaluation)
[Chart]
Error vs. Drift and Chunk Size (Evaluation)
[Charts]
Summary Table (Evaluation)
[Table]
Conclusion
• Detects recurring classes
• Improved accuracy
• Running time
• Reduced human interaction
• Future work: use other base learners
Questions?
Thanks