120 likes | 287 Views
On Data Labeling for Clustering Categorical Data. Hung- Leng Chen, Kun-Ta Chuang, Member, and Ming- Syan Chen TKDE, Vol. 19, No. 11, 2008, pp. 1458-1471. Presenter : Wei-Shen Tai 200 8 / 11/4. Outline . Introduction Related work Model of MARDL ( MAximal Resemblance Data Labeling)
E N D
On Data Labeling for Clustering Categorical Data Hung-Leng Chen, Kun-Ta Chuang, Member, and Ming-Syan Chen TKDE, Vol. 19, No. 11, 2008, pp. 1458-1471. Presenter : Wei-Shen Tai 2008/11/4
Outline • Introduction • Related work • Model of MARDL (MAximal Resemblance Data Labeling) • Experimental results • Conclusions • Comments
Motivation • Sampling • Scales down the size of the database and speed up clustering algorithms. • Problem comes from how to allocate the unclustered data into appropriate clusters. Clustering Large Database Sampled data Sampling Unclustered data Labeling ?
Objective • Data Labeling • Gives each unclustered data point the most appropriate cluster label. • MARDL is independent of clustering algorithms, and any categorical clustering algorithm can be utilized in this framework.
Categorical cluster representative • Node • Attribute name + attribute value. E.g. [A1=a], [A2=m] is an node. • N-nodeset • A set of n nodes, in which every node is a member of the distinct attribute Aa. E.g. {[A1=a], [A2=m]} is a 2-nodeset. • Independent nodesets • Two nodesets do not contain nodes from the same attributes are said to be independent with each other in a represented cluster. • E.g. {[A1=a], [A2=m]} and {[A3=c]} • p({[A1=a], [A2=m],[A3=c]}) =p({[A1=a], [A2=m]})*p({[A3=c]})
Node and n-nodeset importance • Information theorem • Entropy
N-nodeset importance representative(NNIR) • NNIR tree constructing and pruning • An Apriori-like algorithm. • Initialization • Computing candidate nodeset importance and pruning • Generating candidate nodeset • Pruning • Threshold • Importance of t nodeset is less than a predefined θ. • Relative maximum • Importance of (t+1) nodeset is larger than importance of t nodeset. • Hybrid
Maximal resemblance data labeling • Goal of MARDL • Decide the most appropriate cluster label ci for the unlabeled data point. • A unclustered data point {[A1=a], [A2=m],[A3=c ]} to the combination{[A1=a], [A2=m]} and {[A3=c ]} in Cluster c1.
Approximate algorithm for MARDL • Only one combination is considered and utilized • Tree nodes are queued and sorted by importance value. • The nodeset with maximal importance is selected. • Those nodesets which are not independent with the selected nodeset are removed from the queue. • A unclustered data point {[A1=a], [A2=m],[A3=c ]}and a tree nodeset queue.
Conclusions • MARDL • Allocates unlabeled data point into appropriate clusters when the sampling technique is utilized to cluster a very large categorical database. • NIR • A categorical cluster representative technique. • NNIR • A more powerful representative than NIR while the combinations of attribute values are considered.
Comments • Advantage • A good method to assign unclustered data to appropriate trained clusters in categorical data sampling clustering methods. • The concept, derived from existed method (Apriori and information theorem) , is easy to understand and accept. • MARDL is independ of clustering methods and any categorical clustering algorithm can be utilized in this framework. • Drawback • It spends much time to construct the tree of each cluster and the tree is quite complex to represent cluster. • Because the importance of t+1 nodeset may be larger than the importance of t nodeset, it will take much time to process the hybrid pruning in computing all of candidate t+1 nodeset. • Application • Unclustered data classification while the sampling technique is utilized to cluster a very large categorical database.