
Data Stream Classification: Training with Limited Amount of Labeled Data



  1. Data Stream Classification: Training with Limited Amount of Labeled Data
     Mohammad Mehedy Masud, Latifur Khan, Bhavani Thuraisingham – University of Texas at Dallas
     Jing Gao, Jiawei Han – University of Illinois at Urbana-Champaign
     To appear in the IEEE International Conference on Data Mining (ICDM), Pisa, Italy, Dec 15-19, 2008
     Funded by: Air Force

  2. Data Stream Classification Techniques
     • Data stream classification is a challenging task for two reasons:
       • Infinite length – we cannot use all historical data for training
       • Concept drift – old models become outdated
     • Solutions:
       • Single-model classification with incremental learning
       • Ensemble classification
     • Ensemble techniques can be updated more efficiently and handle concept drift more effectively
     • Our solution: an ensemble approach

  3. Ensemble Classification
     • Ensemble techniques build an ensemble of M classifiers L1, L2, …, LM
     • The data stream is divided into equal-sized chunks
     • A new classifier is trained from each labeled chunk
     • The new classifier replaces one old classifier (if required)
     [Diagram: the last labeled chunk trains a new classifier; the last unlabeled chunk is classified by the ensemble L1…LM; the ensemble is then updated. A code sketch of this update loop follows.]
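
A minimal sketch of the chunk-based update, using decision trees as stand-in base learners; the keep-the-best-M replacement rule (scored on the newest chunk) is an illustrative assumption, not necessarily the paper's exact procedure:

    # Illustrative sketch: keep the M best chunk-trained classifiers.
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier  # stand-in base learner

    M = 8          # ensemble size (illustrative)
    ensemble = []  # current list of classifiers

    def update_ensemble(chunk_X, chunk_y):
        """Train on the newest labeled chunk; keep the M best on that chunk."""
        new_clf = DecisionTreeClassifier().fit(chunk_X, chunk_y)
        candidates = ensemble + [new_clf]
        # Accuracy on the newest chunk approximates fitness to the current concept.
        candidates.sort(key=lambda clf: clf.score(chunk_X, chunk_y), reverse=True)
        ensemble[:] = candidates[:M]

    def classify(x):
        """Majority vote of the ensemble members."""
        votes = Counter(clf.predict([x])[0] for clf in ensemble)
        return votes.most_common(1)[0][0]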

  4. Limited Labeled Data Problem
     • Ensemble techniques assume that the entire data chunk is labeled during training
     • This assumption is impractical because:
       • Data labeling is both time-consuming and costly
       • It may not be possible to label all the data, especially in a streaming environment where data is produced at high speed
     • Our solution:
       • Train classifiers with a limited amount of labeled data, assuming only a fraction of each data chunk is labeled
       • We obtain better results than other techniques that train classifiers on fully labeled data chunks

  5. Training with a Partially Labeled Chunk
     • Train a new classifier using semi-supervised clustering
     • If a new class has arrived in the stream, refine the existing classifiers
     • Update the ensemble
     [Diagram: (1) the ensemble L1…LM classifies the last unlabeled chunk; (2) a new classifier is trained from the last partially labeled chunk using semi-supervised clustering; (3) the ensemble is refined; (4) the ensemble is updated.]

  6. Semi-Supervised Clustering
     • Overview:
       • Only a few instances in the training data are labeled
       • We apply impurity-based clustering; the goal is to minimize cluster impurity
       • A cluster is completely pure if all the labeled data in that cluster comes from the same class
     • Objective function (minimize intra-cluster dispersion plus cluster impurity):
         O = Σ_{i=1}^{K} Σ_{x ∈ X_i} ||x − μ_i||² + Σ_{i=1}^{K} Imp_i
       where K = number of clusters, X_i = data points belonging to cluster i, μ_i = centroid of cluster i, L_i = labeled data points belonging to cluster i, and Imp_i = impurity of cluster i = entropy_i × dissimilarity count_i (a worked example of Imp_i follows)
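
To make the impurity term concrete, here is a small sketch that reads "dissimilarity count" as the number of labeled pairs in the cluster belonging to different classes; that reading is an assumption, since the slide does not define it:

    # Hypothetical sketch of cluster impurity = entropy * dissimilarity count.
    import math
    from collections import Counter
    from itertools import combinations

    def impurity(labels):
        """Impurity of one cluster from the class labels of its labeled points."""
        if not labels:
            return 0.0  # a cluster with no labeled points is treated as pure
        counts = Counter(labels)
        n = len(labels)
        # Entropy of the class distribution among the labeled points.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        # Dissimilarity count: labeled pairs whose classes differ (assumed reading).
        dissimilar = sum(1 for a, b in combinations(labels, 2) if a != b)
        return entropy * dissimilar

    print(impurity(["A", "A", "A"]))  # 0.0 -- completely pure cluster
    print(impurity(["A", "A", "B"]))  # entropy(2/3, 1/3) * 2 pairs ≈ 1.84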

  7. Semi-Supervised Clustering Using E-M
     • Constrained initialization: initialize K seeds
       • For each class Cj, select kj seeds from the labeled instances of Cj using the farthest-first traversal heuristic, where kj = (Nj / N) * K
       • Nj = number of labeled instances of class Cj, N = total number of labeled instances
       (a seeding sketch follows below)
     • Repeat the E-step and M-step until convergence:
       • E-step: assign each instance to a cluster so that the objective function is minimized; apply Iterated Conditional Modes (ICM) until convergence
       • M-step: recompute the cluster centroids
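
A minimal sketch of the constrained initialization, assuming Euclidean distance and NumPy arrays; the farthest-first traversal here is the standard greedy variant (repeatedly pick the point farthest from the seeds chosen so far):

    # Hypothetical sketch: class-proportional seeding via farthest-first traversal.
    import numpy as np

    def farthest_first(points, k):
        """Greedily pick k rows of `points`, each far from those already picked."""
        seeds = [points[0]]
        while len(seeds) < k:
            # Distance from every point to its nearest already-chosen seed.
            d = np.min([np.linalg.norm(points - s, axis=1) for s in seeds], axis=0)
            seeds.append(points[int(np.argmax(d))])
        return seeds

    def constrained_init(labeled_X, labeled_y, K):
        """Allocate K seeds across classes in proportion: k_j = (N_j / N) * K."""
        classes, counts = np.unique(labeled_y, return_counts=True)
        N = len(labeled_y)
        seeds = []
        for cj, nj in zip(classes, counts):
            kj = min(int(nj), max(1, round(nj / N * K)))  # at least 1, at most N_j
            seeds.extend(farthest_first(labeled_X[labeled_y == cj], kj))
        return np.array(seeds)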

  8. Saving the Cluster Summary as a Micro-Cluster
     • For each of the K clusters created by the semi-supervised clustering, save the following (a data-structure sketch follows below):
       • Centroid
       • n: total number of points
       • L: total number of labeled points
       • Lt[ ]: array of the number of labeled points belonging to each class, e.g. Lt[j] = number of labeled points of class Cj
       • Sum[ ]: array of the per-dimension sums of the attribute values of all data points, e.g. Sum[r] = sum of the r-th dimension over all data points in the cluster
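
The summary maps naturally onto a small data structure; the sketch below assumes NumPy vectors and integer class indices. Storing Sum[] means the centroid can always be recomputed as Sum / n, so the raw points are not needed:

    # Hypothetical micro-cluster summary (field names mirror the slide).
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MicroCluster:
        dim: int          # number of attributes
        num_classes: int  # number of known classes
        n: int = 0        # total points absorbed
        L: int = 0        # labeled points absorbed

        def __post_init__(self):
            self.Lt = np.zeros(self.num_classes, dtype=int)  # per-class labeled counts
            self.Sum = np.zeros(self.dim)                    # per-dimension sums

        def add(self, x, label=None):
            """Absorb one data point (with its class index if it is labeled)."""
            self.n += 1
            self.Sum += np.asarray(x, dtype=float)
            if label is not None:
                self.L += 1
                self.Lt[label] += 1

        @property
        def centroid(self):
            return self.Sum / self.n  # raw points can be discarded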

  9. Using the Micro-Clusters as a Classification Model
     • We discard all raw data points after saving the micro-clusters
     • The set of K micro-clusters built from a data chunk serves as a classification model
     • To classify a test data point x with the model (a code sketch follows below):
       • Find the Q nearest micro-clusters by computing the distance between x and their centroids
       • For each class Cj, compute the cumulative normalized frequency CNFrq[j] = sum of Lt[j] / L over the Q nearest micro-clusters
       • Output the class Cj with the highest CNFrq[j]
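
A minimal sketch of this decision rule, reusing the MicroCluster sketch from the previous slide; Q = 3 and Euclidean distance are illustrative assumptions:

    # Hypothetical classification with the Q nearest micro-clusters.
    import numpy as np

    def classify_point(x, micro_clusters, Q=3):
        """Output the class C_j with the highest CNFrq[j]."""
        # Rank micro-clusters by centroid distance to x; keep the Q nearest.
        nearest = sorted(micro_clusters,
                         key=lambda mc: np.linalg.norm(np.asarray(x) - mc.centroid))[:Q]
        cnfrq = np.zeros(len(nearest[0].Lt))
        for mc in nearest:
            if mc.L > 0:               # skip clusters with no labeled points
                cnfrq += mc.Lt / mc.L  # normalized per-class frequency Lt[j] / L
        return int(np.argmax(cnfrq))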

  10. Experiments
     • Data sets:
       • Synthetic data – simulates evolving data streams
       • Real data – botnet data, collected from real botnet traffic generated in a controlled environment
     • Baseline: On-Demand Stream
       • C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.
     • For training, we use 5% labeled data in each chunk. So, if a chunk has 100 instances:
       • Our technique (SmSCluster) uses only 5 labeled and 95 unlabeled instances for training
       • On-Demand Stream uses all 100 labeled instances for training
     • Environment: Java on Windows XP; Pentium IV 3 GHz dual-core, 2 GB RAM

  11. Results
     • Each time unit = 1,000 data points (botnet) and 1,600 data points (synthetic)

  12. Conclusion
     • We address a more practical approach to classifying evolving data streams: training with a limited amount of labeled data
     • Our technique applies semi-supervised clustering to train classification models
     • Our technique outperforms other state-of-the-art stream classification techniques that use 20 times more labeled data for training
