Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data

Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data. Hung-Leng Chen, Ming-Syan Chen, and Su-Chen Lin. TKDE, Vol. 21, No. 5, 2009, pp. 652-665. Presenter: Wei-Shen Tai, 2009/7/1.


Presentation Transcript


  1. Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data. Hung-Leng Chen, Ming-Syan Chen, and Su-Chen Lin. TKDE, Vol. 21, No. 5, 2009, pp. 652-665. Presenter: Wei-Shen Tai, 2009/7/1

  2. Outline • Introduction • Preliminaries • Node Importance Representative • Drifting concept detection • Clustering relationship analysis • Experimental results • Conclusion • Comments

  3. Motivation • Find concepts that drift over time in the categorical domain. • For example, the buying preferences of customers may change with time.

  4. Objective • A framework for clustering categorical time-evolving data. • It detects concept drift and analyzes the relationship between drifting concepts.

  5. Node importance representative • NIR • Represents a cluster as the distribution of its attribute values, which are called “nodes” (e.g. [age = 50-59]). • Assigns an importance to each node Iir in cluster ci. • The importance measure is similar to TF-IDF and entropy.
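
The slide names the ingredients of NIR (node frequency inside a cluster, plus a TF-IDF-like and entropy-like weighting) but not the formula. The following is a minimal sketch under that reading, not the paper's exact definition: the function name `node_importance`, the record layout, and the specific weight `1 - entropy / log(k)` are assumptions made for illustration.

```python
import math
from collections import Counter


def node_importance(clusters):
    """clusters: dict cluster_id -> list of records (dict attribute -> value).

    Returns dict cluster_id -> {(attribute, value): importance}.
    Assumed weighting: in-cluster frequency times (1 - normalized entropy
    of the node's distribution over clusters).
    """
    # Count how often each node (attribute, value) occurs in each cluster.
    counts = {
        cid: Counter((a, v) for rec in recs for a, v in rec.items())
        for cid, recs in clusters.items()
    }
    k = len(clusters)
    nir = {}
    for cid, recs in clusters.items():
        nir[cid] = {}
        for node, n in counts[cid].items():
            # Distribution of this node over all clusters.
            total = sum(counts[c][node] for c in counts)
            probs = [counts[c][node] / total for c in counts if counts[c][node]]
            entropy = -sum(p * math.log(p) for p in probs)
            # Weight is 1 when the node occurs in a single cluster and 0
            # when it is spread evenly over all k clusters.
            weight = 1.0 - entropy / math.log(k) if k > 1 else 1.0
            nir[cid][node] = weight * n / len(recs)
    return nir


# Hypothetical toy data: age separates the clusters, sex does not.
clusters = {
    "c1": [{"age": "50-59", "sex": "M"}, {"age": "50-59", "sex": "F"}],
    "c2": [{"age": "20-29", "sex": "M"}, {"age": "20-29", "sex": "F"}],
}
nir = node_importance(clusters)
```

In this toy data the node [age = 50-59] occurs only in `c1`, so it receives full importance there, while the `sex` nodes are spread evenly over both clusters and receive an importance close to zero.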

  6. Drifting concept detection • DCD • Detects the difference in cluster distributions between the current subset St and the last clustering result C[te, t-1]. • Data labeling

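  7. Data labeling and outlier detection • The resemblance between an input data point and a cluster is obtained directly by summing the importance of the point's nodes in the cluster's NIR. • Example: for data point p7, resemblance to C1 = 0.029 and resemblance to C2 = 0.5 + 0.029 + 1 = 1.529; threshold = 0.5.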
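
The labeling step can be sketched as a lookup-and-sum over an NIR table. This is an illustrative sketch only: the node names, the NIR tables, and the rule "label to the best cluster if its resemblance reaches the threshold, otherwise outlier" are assumptions, and the tables are constructed so that the sums match the slide's numbers (0.029 for C1, 1.529 for C2).

```python
def resemblance(point, cluster_nir):
    """Sum the importance of every node of the point found in the cluster's NIR."""
    return sum(cluster_nir.get(node, 0.0) for node in point)


def label(point, nir, threshold):
    """Return (cluster id or None for an outlier, resemblance per cluster)."""
    scores = {cid: resemblance(point, table) for cid, table in nir.items()}
    best = max(scores, key=scores.get)
    # Assumed rule: a point whose best resemblance is below the threshold
    # is treated as an outlier.
    return (best if scores[best] >= threshold else None), scores


# Hypothetical node names and importance values, chosen only to
# reproduce the sums shown on the slide.
nir = {
    "C1": {("A2", "m"): 0.029},
    "C2": {("A1", "a"): 0.5, ("A2", "m"): 0.029, ("A3", "c"): 1.0},
}
p7 = [("A1", "a"), ("A2", "m"), ("A3", "c")]
assigned, scores = label(p7, nir, threshold=0.5)

# A point with no known node resembles no cluster and becomes an outlier.
unknown, _ = label([("A9", "z")], nir, threshold=0.5)
```

With these tables `p7` is labeled to C2, because its resemblance to C2 exceeds the threshold while its resemblance to C1 does not.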

  8. Cluster distributions comparison • Two clustering results are said to be different if either of the following criteria holds. 1. A large proportion of outliers is found. 2. A large proportion of clusters vary in their ratio of data points. • Example, with the values (0.4), (0.5), (0.3) as given on the slide: outlier ratio = 1/5 = 0.2; C1 ratio change = |2/5 - 4/5| = 0.4; C2 ratio change = |3/5 - 0/5| = 0.6; difference of results = 2/2 = 1.
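
The two criteria can be sketched as a small function over cluster sizes. This is a sketch under stated assumptions rather than the paper's procedure: the function and parameter names are hypothetical, the assignment of the slide's values 0.4, 0.3 and 0.5 to the three thresholds is a guess, and so is the reading that the last result holds 2 and 3 points while the current subset holds 4, 0 and 1 outlier.

```python
def concept_drifts(last_counts, current_counts, n_outliers,
                   outlier_threshold, variation_threshold,
                   difference_threshold):
    """last_counts / current_counts: dict cluster_id -> number of points."""
    n_last = sum(last_counts.values())
    n_current = sum(current_counts.values()) + n_outliers
    # Criterion 1: share of outliers in the current subset.
    outlier_ratio = n_outliers / n_current
    # Criterion 2: share of clusters whose data-point ratio changed a lot.
    changes = {
        cid: abs(last_counts[cid] / n_last
                 - current_counts.get(cid, 0) / n_current)
        for cid in last_counts
    }
    varied = sum(1 for c in changes.values() if c > variation_threshold)
    difference = varied / len(changes)
    drift = (outlier_ratio > outlier_threshold
             or difference > difference_threshold)
    return drift, outlier_ratio, changes, difference


# Counts read off the slide's fractions (assumed interpretation).
drift, outlier_ratio, changes, difference = concept_drifts(
    last_counts={"C1": 2, "C2": 3},
    current_counts={"C1": 4, "C2": 0},
    n_outliers=1,
    outlier_threshold=0.4,      # assumed mapping of the slide's values
    variation_threshold=0.3,    # assumed mapping of the slide's values
    difference_threshold=0.5,   # assumed mapping of the slide's values
)
```

Under these assumptions the outlier ratio (0.2) stays below its threshold, but both clusters change by more than the variation threshold, so the difference of results is 2/2 = 1 and the function reports a drift.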

  9. Clustering relationship analysis • CRA • Explains the drifting concepts based on the evolving clustering results. • Each cluster is represented by a node importance vector. • The distance between clusters is computed with the cosine measure.
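
The cosine measure between two node importance vectors can be sketched as follows. The sparse dictionary representation and the helper names are assumptions for illustration, and whether the paper reports the similarity or one minus the similarity is not stated on the slide, so both are shown.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """vec_a, vec_b: dict node -> importance; missing nodes count as 0."""
    dot = sum(w * vec_b.get(node, 0.0) for node, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(vec_a, vec_b):
    """One minus the cosine similarity (assumed definition of distance)."""
    return 1.0 - cosine_similarity(vec_a, vec_b)


# Hypothetical node importance vectors for clusters from two windows.
old_cluster = {("A1", "a"): 3.0, ("A2", "m"): 4.0}
same_concept = {("A1", "a"): 3.0, ("A2", "m"): 4.0}
new_concept = {("A3", "c"): 1.0}

sim_same = cosine_similarity(old_cluster, same_concept)
sim_new = cosine_similarity(old_cluster, new_concept)
```

Clusters that share no node have similarity 0, which is one way to see why sparse, near-binary vectors tend to give very small cosine values.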

  10. Experimental results • Scalability • Accuracy

  11. Discussion and conclusions • A framework to perform clustering on categorical time-evolving data. • Detects the drifting concepts at different sliding windows. • Generates the clustering results based on the current concept. • Analyzes and shows the relationship between clustering results by visualization.

  12. Comments • Advantage • The proposed framework provides a solution for clustering time-evolving data in the categorical domain. • It also provides an NIR-based alternative for measuring the similarity between a cluster and an input data point in a categorical data set. • Drawback • Only categorical data can be processed in this framework with NIR; numerical data must first be transformed into categorical labels. In other words, it seems unsuitable for clustering in a mixed data domain. • The vector dimension of each cluster is not reduced, so storing the complete vector information takes a lot of space. • The node importance vector is similar to binary coding, which makes the cosine measure very small. • Application • Concept-drift detection for time-evolving data sets in the categorical domain.
