A method is proposed to cluster categorical data by defining a measure of similarity between groups and using a gravitation model to optimize the clustering result.
A Hybrid Method to Categorical Clustering Student: 吳宗和 Advisor: Prof. J. Hsiang Date: 7/4/2006
Outlines • Introduction • Motivation • Related Work • Research Goal / Notation • Our Method • Experiments • Conclusion • Future Work
Introduction • Data clustering is to partition a set of data elements into groups such that elements in the same group are similar, while elements in different groups are dissimilar [1]. • The similarity measure must be well defined by the developer. • Data elements can be • recorded as numerical values (e.g., degree: 0°~360°), or • recorded as categorical values (e.g., sex: {Male, Female}). The goal is to find the latent structure in the dataset. • Clustering methods proposed for numerical data do not fit categorical data, because they lack a proper measure of similarity [21].
Example: • An instance of the movie database.
Outlines • Introduction • Motivation • Related Work • Research Goal / Notation • Our Method • Experiments • Conclusion • Future Work
Motivation • Needs: • For databases: much of the data stored in databases is categorical (described by people). • For web navigation: web documents become categorical data after feature selection. • For knowledge discovery: find the latent structure in data. • The difficulties of clustering categorical data: • The categorical values a data element can take are not ordered, so there is no intuitive measure of distance [21]. • Methods designed specifically for clustering categorical data are not easy to use.
Motivation (cont’) • Problems in related work: • Hard to understand or use. • Parameters are data sensitive. • Most of them CANNOT decide the number of groups while clustering [3,13,17,18]. • Our guess/assumption: • Do we need an entirely new method? • We can REUSE the framework proposed for clustering numerical data, even though its measure of distance does not fit categorical data; • all we NEED is a good measure of similarity/distance between two groups. • Reusing that framework gives an intuitive approach with fewer parameters that is less sensitive to the data, • and yields a better clustering result.
Related Work • Partition based • K-modes [Z. Huang (1998)] • Monte-Carlo algorithm [Tao Li, et al. 2003] • COOLCAT [Barbara, D., Couto, J., & Li, Y. (2002)] • Agglomerative based • ROCK [Guha, S., Rastogi, R., & Shim, K. (2000)] • CACTUS [Ganti, V., Gehrke, J., & Ramakrishnan, R. (1999)] • Best-K (ACE) [Keke Chen, Ling Liu (2005)]
Outlines • Introduction • Motivation • Related Work • Research Goal / Notation • Our Method • Experiments • Conclusion • Future Work
Research Goal • Propose a method that • defines a measure of similarity between two groups, • reuses the gravitation concept (for intuition and fewer parameters), • decides the proper number of groups automatically, and • strengthens the clustering result (for a better result).
Notation • D: data set of n elements p1, p2, p3, …, pn, |D| = n. • Element pi is a multidimensional vector of d categorical attributes, i.e. pi = <ai1, ai2, …, aid>, where aij ∈ Aj and Aj, the set of all possible values aij can take, is a finite set. • In related work, K is an integer given by the user, and the elements are divided into K groups G1, G2, …, GK such that ∪Gi = D and ∀i,j, i ≠ j, Gi ∩ Gj = ∅.
Outlines • Introduction • Motivation • Related Work • Research Goal • Our Method • A measure of similarity • Step 1: Gravitation model • Step 2: Strengthen the result • Experiments • Conclusion • Future Work
Our method • One similarity function, two steps. • Define a measure of similarity between two groups. • Use this similarity as the measure of distance. • Two steps: • Step 1: • Use the concept of the gravitation model. • Find the most suitable number of groups K. • Step 2: • Overcome local optima and refine the result.
Similarity(Gi,Gj): the intuition behind the similarity • Based on the structure of the group. • Each group has its own probability distribution over each attribute, which can be taken as its group structure. • Two groups are similar if they have very similar probability distributions over each attribute.
The Similarity function • A group Gi = <A1, A2, A3, …, Ad>, where Ar is a random variable for r = 1…d and p(Ar = v | Gi) is the probability that Ar = v within Gi. • Entropy of Ar in group Gi: entropy(Ar | Gi) = -Σ_{v} p(Ar = v | Gi) · log p(Ar = v | Gi). • Entropy of Gi: E(Gi) = Σ_{r=1…d} entropy(Ar | Gi). • Sim(Gi,Gj) = E(Gi ∪ Gj) - [ |Gi|·E(Gi) + |Gj|·E(Gj) ] / ( |Gi| + |Gj| ), the increase in expected entropy caused by merging the two groups.
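A minimal Python sketch of this similarity, assuming Sim(Gi,Gj) is the increase in expected entropy caused by merging the two groups (the reading most consistent with the properties on the next slide); `attribute_entropy`, `group_entropy`, and `sim` are illustrative names, not code from the thesis.

```python
import math
from collections import Counter

def attribute_entropy(group, attr_index):
    """Entropy of one categorical attribute within a group of elements."""
    counts = Counter(elem[attr_index] for elem in group)
    total = len(group)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def group_entropy(group):
    """E(G): sum of the per-attribute entropies over all d attributes."""
    return sum(attribute_entropy(group, r) for r in range(len(group[0])))

def sim(gi, gj):
    """Sim(Gi, Gj): expected-entropy increase caused by merging Gi and Gj.
    Non-negative, symmetric, and zero for identical attribute distributions."""
    ni, nj = len(gi), len(gj)
    merged = list(gi) + list(gj)
    return group_entropy(merged) - (ni * group_entropy(gi) + nj * group_entropy(gj)) / (ni + nj)

# Identical attribute distributions give similarity (distance) 0.
print(sim([("red", "A"), ("red", "B")], [("red", "A"), ("red", "B")]))  # 0.0
```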
A geometric analogy • We use Sim(Gi,Gj) as our geometric analogy of distance, because • Sim(Gi,Gj) ≧ 0, ∀i,j; • Sim(Gi,Gi) = 0, ∀i; • Sim(Gi,Gj) = Sim(Gj,Gi), ∀i,j.
Step 1. • Use the concept of the gravitation model: mass, radius, and time. • Mass of a group = the number of elements in it. • Radius between two groups Gi, Gj = Sim(Gi,Gj). • Time = ΔT, a constant. • Merging groups yields a clustering tree.
A Merging Pass • M(Gi) = number of elements in Gi. • R(Gi,Gj) = Sim(Gi,Gj). • G is 1.0. • fg(Gi,Gj) = G·M(Gi)·M(Gj) / R(Gi,Gj)² is the gravitation force, and accGi = fg(Gi,Gj) / M(Gi). • Δt is the time period, Δt = 1.0. • Rnew(Gi,Gj) = R(Gi,Gj) - 1/2·(accGi + accGj)·Δt². • If Rnew(Gi,Gj) < 0, merge Gi and Gj. • Groups that get close enough (radius ≦ 0) are merged, building a clustering tree.
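A hedged sketch of one merging pass under the gravitation model above, assuming the classical force form fg = G·M(Gi)·M(Gj)/R² and reusing `sim` from the earlier sketch; rebuilding all radii from Sim after a merge is a simplification of the model.

```python
# Reuses sim() from the earlier similarity sketch.
G_CONST = 1.0   # gravitational constant G
DT = 1.0        # time period Δt

def init_radii(groups):
    """Initial radii R(Gi, Gj) = Sim(Gi, Gj) for every pair of groups."""
    ids = sorted(groups)
    return {(i, j): sim(groups[i], groups[j])
            for a, i in enumerate(ids) for j in ids[a + 1:]}

def merging_pass(groups, radii):
    """One pass: every pair of groups attracts, so its radius shrinks by the
    distance travelled in one time step; pairs whose radius drops below zero
    collide and are merged.  Both dicts are updated in place."""
    for (i, j), r in list(radii.items()):
        f = G_CONST * len(groups[i]) * len(groups[j]) / max(r, 1e-9) ** 2
        acc_i, acc_j = f / len(groups[i]), f / len(groups[j])
        radii[(i, j)] = r - 0.5 * (acc_i + acc_j) * DT ** 2

    merged_any = False
    for (i, j), r in sorted(radii.items(), key=lambda kv: kv[1]):
        if r < 0 and i in groups and j in groups:
            groups[i].extend(groups.pop(j))     # merge Gj into Gi
            merged_any = True
    if merged_any:
        # radii involving merged groups are stale: rebuild them from Sim
        radii.clear()
        radii.update(init_radii(groups))
    return merged_any
```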
Suitable K. • Over the sequence of merging passes, find the most suitable number of groups in the partition. • Suppose Step 1 merges everything into one group after T iterations: S1, S2, S3, …, ST-1, ST, where Si is the partition at the i-th iteration. • Stable_i = ΔTime(Si, Si+1) × Radius(Si), where Radius(Si) = Min{ Sim(Gα,Gβ) : Gα,Gβ in Si and α ≠ β }. • In the following slides, K denotes the proper number of clusters chosen this way.
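A sketch of the stability-based choice of K built on the merging pass above; the scoring follows this slide's Stable_i = ΔTime × Radius reading, and all names are illustrative.

```python
# Reuses sim(), init_radii(), and merging_pass() from the earlier sketches.
def choose_partition(elements):
    """Start from singleton groups, run merging passes until one group remains,
    score each partition Si by (passes it survives) x Radius(Si), and return
    the most stable partition; its size is the suitable K."""
    groups = {i: [e] for i, e in enumerate(elements)}
    radii = init_radii(groups)
    history, lifetime = [], 0
    while len(groups) > 1:
        snapshot = {i: list(g) for i, g in groups.items()}
        lifetime += 1
        if merging_pass(groups, radii):          # a merge event ends partition Si
            ids = list(snapshot)
            radius = min(sim(snapshot[a], snapshot[b])
                         for x, a in enumerate(ids) for b in ids[x + 1:])
            history.append((lifetime * radius, snapshot))
            lifetime = 0
        # otherwise the same partition survives another pass
    best = max(history, key=lambda t: t[0])[1]
    return list(best.values())
```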
Step 2. • Does a local-minimum problem remain? • Yes; we overcome it with a randomized algorithm. • Randomized step: • exchanging two elements between two groups at a time [11] is too time consuming. • Goal: enlarge the distance/radius between every two groups in the result obtained from Step 1. • For efficiency, use a tree-structured data structure: the digital search tree.
Step 2 (cont’) • Why a digital search tree (D.S.T.)? • Locality. • Instead of storing elements directly, it stores binary strings encoded from the elements. • Deeper nodes look similar to their siblings and children.
Operations in a D.S.T. • A D.S.T. is tree structured, stores binary strings, and each node has degree ≦ 2. • Operations cost little, O(d): • Search • Insert • Delete • Search operation, O(d).
Operations in a D.S.T. (cont’) • Insertion, O(d). Ex: insert C010 into the tree. • Deletion, O(d). Ex: delete G110.
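A minimal sketch of the digital search tree operations described above, assuming elements are encoded as fixed-length bit strings; deletion (replacing the removed node with a node from its own subtree) is omitted for brevity, and the class and function names are illustrative.

```python
class DSTNode:
    """Node of a digital search tree over fixed-length bit strings."""
    def __init__(self, key):
        self.key = key              # e.g. "0101", the encoded element
        self.child = [None, None]   # child[0] for bit 0, child[1] for bit 1

def dst_search(root, key):
    """O(d): compare the key stored at each node, branch on the bit at that depth."""
    depth, node = 0, root
    while node is not None:
        if node.key == key:
            return node
        node = node.child[int(key[depth])]
        depth += 1
    return None

def dst_insert(root, key):
    """O(d): walk down by the key's bits until an empty slot, attach a new node."""
    if root is None:
        return DSTNode(key)
    depth, node = 0, root
    while node.key != key:                  # stop if the key is already stored
        bit = int(key[depth])
        if node.child[bit] is None:
            node.child[bit] = DSTNode(key)
            break
        node, depth = node.child[bit], depth + 1
    return root

root = None
for k in ("0101", "0010", "1100", "0110"):   # encoded elements
    root = dst_insert(root, k)
print(dst_search(root, "0010") is not None)  # True
```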
Randomizing • Transform the result of Step 1 into a forest of K digital search trees, where the nodes in tree i correspond to the elements of Gi, ∀i. • Randomly select two trees i, j and an internal node α from tree i. • Let gα = { node n | α is an ancestor of node n } ∪ {α}. • Then evaluate whether moving gα from Gi to Gj enlarges Sim(Gi,Gj), as sketched after the example below.
Randomizing (cont’) • For example, K = 2, α = ‘0101’: moving gα makes the two groups less similar (enlarges the radius), so it is a good exchange.
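A sketch of one randomized exchange, assuming the acceptance rule is "keep the move if it enlarges Sim(Gi,Gj)", as stated in the goal of Step 2; it reuses `sim` and the DST sketch above, and the `forest`/`decode` structures are illustrative (rebuilding the affected trees is omitted).

```python
import random
# Reuses sim() and the DSTNode class from the earlier sketches.

def subtree_keys(node):
    """g_alpha: the key at alpha together with the keys of all its descendants."""
    if node is None:
        return []
    return [node.key] + subtree_keys(node.child[0]) + subtree_keys(node.child[1])

def internal_nodes(node):
    """All nodes of a digital search tree that have at least one child."""
    if node is None or node.child == [None, None]:
        return []
    return [node] + internal_nodes(node.child[0]) + internal_nodes(node.child[1])

def try_exchange(forest, decode):
    """One randomized move: pick two trees i, j and an internal node alpha of
    tree i, and move the elements of g_alpha to group j if that enlarges
    Sim(Gi, Gj).  forest[i] holds {"root": DST root, "members": elements};
    decode maps a bit-string key back to its element."""
    i, j = random.sample(list(forest), 2)
    candidates = internal_nodes(forest[i]["root"])
    if not candidates:
        return False
    alpha = random.choice(candidates)
    moved = [decode[k] for k in subtree_keys(alpha)]
    gi, gj = forest[i]["members"], forest[j]["members"]
    gi_new = [e for e in gi if e not in moved]
    if not gi_new:
        return False
    if sim(gi_new, gj + moved) > sim(gi, gj):     # radius grows: a good exchange
        forest[i]["members"], forest[j]["members"] = gi_new, gj + moved
        # rebuilding the two digital search trees from the new member lists
        # is omitted in this sketch
        return True
    return False
```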
Outlines • Introduction • Motivation • Related Work • Research Goal / Notation • Our Method • Experiments • Datasets • Compared methods and Criterion • Conclusion • Future Work
Experiments • Datasets • Real datasets[7]. • Congressional Voting Record. • Mushroom. • Document datasets[8,9]. • Reuters 21578. • 20 Newsgroups.
Compared methods & Criteria • Compared methods • K-modes (randomly partitions the elements into K initial groups). • COOLCAT (selects some elements as seeds, places the seeds into K groups, then places the remaining elements). • Criteria (average over 10 runs) • Category Utility = (1/K) Σ_k P(Ck) Σ_i Σ_j [ P(Ai = Vij | Ck)² - P(Ai = Vij)² ]. • Expected Entropy = Σ_k (|Ck|/n) · E(Ck). • Purity = (1/n) Σ_k max_j |Ck ∩ Lj|, where Lj is the j-th true class.
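A sketch of two of the criteria, assuming each cluster is given as the list of the true class labels of its members; averaging over 10 runs and the category-utility computation (from the attribute distributions) are done analogously outside this snippet.

```python
import math
from collections import Counter

def expected_entropy(clusters):
    """Expected Entropy: sum over clusters Ck of (|Ck|/n) times the entropy of
    the class labels inside Ck (lower is better)."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(c)
        h = -sum((m / len(c)) * math.log2(m / len(c)) for m in counts.values())
        total += len(c) / n * h
    return total

def purity(clusters):
    """Purity: fraction of elements belonging to the majority class of their
    cluster (higher is better)."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

labels = [["edible"] * 9 + ["poisonous"], ["poisonous"] * 8 + ["edible"] * 2]
print(purity(labels), expected_entropy(labels))
```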
Real datasets • Goal: produce clusters that match the human-assigned classes. • Mushroom dataset • 2 classes (edible, poisonous). • 22 attributes. • 8124 records.
Real datasets • Voting Record dataset • 2 classes (democrat, republican), 435 records, 16 attributes (all boolean valued). • Some records (about 10%) have missing values.
Document Datasets • For application to documents. • 20 Newsgroups • 20 subjects, each containing 1000 articles. • 20000 documents in total (too many), • so documents from 3 subjects are selected as the dataset (3000 articles). • The BOW package [4] is used for feature selection; 100 features.
Document Datasets • Reuters 21578 • 135 subjects, 21578 articles; each subject contains a different number of articles. • The 10 subjects with the most articles are selected. • The BOW package [4] is used for feature selection; 100 features.
Conclusion • In this work, we proposed a measure of similarity that • lets us reuse the framework used for numerical data clustering, • with FEWER parameters and easy use, • tries to avoid being trapped in local minima, and • still obtains the same or better clustering results than methods proposed specifically for categorical data.
Future Work • For the method itself, • it still takes a long time to compute, because the similarity function must be adjusted to satisfy the triangle inequality. • For applications, • build a framework to cluster documents without help from other packages; • use our method to solve the "binding sets" problem in bioinformatics.
References • [1] A.K. Jain, M.N. Murty, and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, Vol. 31, 1999. • [2] Yen-Jen Oyang, Chien-Yu Chen, Shien-Ching Huang, and Cheng-Fang Lin. Characteristics of a Hierarchical Data Clustering Algorithm Based on Gravity Theory. Technical Report NTUCSIE 02-01. • [3] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. Proc. of IEEE Intl. Conf. on Data Engineering (ICDE), 1999. • [4] McCallum, A. K. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/mccallum/bow. • [5] DIGITAL LIBRARIES: Metadata Resources. http://www.ifla.org/II/metadata.htm • [6] W.E. Wright. A formalization of cluster analysis and gravitational clustering. Doctoral dissertation, Washington University, 1972. • [7] Newman, D.J., Hettich, S., Blake, C.L., & Merz, C.J. (1998). UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science. • [8] David D. Lewis. Reuters-21578 text categorization test collection. AT&T Labs Research, 1997. • [9] Ken Lang. 20 Newsgroups. http://people.csail.mit.edu/jrennie/20Newsgroups/ • [10] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
References • [11] T. Li, S. Ma, and M. Ogihara. Entropy-based criterion in categorical clustering. Proc. of Intl. Conf. on Machine Learning (ICML), 2004. • [12] Bock, H.-H. Probabilistic aspects in cluster analysis. In O. Opitz (Ed.), Conceptual and Numerical Analysis of Data, 12-44. Berlin: Springer-Verlag, 1989. • [13] Z. Huang. A fast clustering algorithm to cluster very large categorical data sets in data mining. Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997. • [14] MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297. • [15] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. A Wiley-Interscience Publication, New York, 1973. • [16] P. Andritsos, P. Tsaparas, R.J. Miller, and K.C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. Hellenic Database Symposium, 2003. • [17] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: clustering categorical data using summaries. Proc. of ACM SIGKDD Conference, 1999. • [18] D. Barbara, Y. Li, and J. Couto. COOLCAT: an entropy-based algorithm for categorical clustering. Proc. of ACM Conf. on Information and Knowledge Management (CIKM), 2002. • [19] R. C. Dubes. How many clusters are best? An experiment. Pattern Recognition, vol. 20, no. 6, pp. 645-663, 1987. • [20] Keke Chen and Ling Liu. "The 'Best K' for Entropy-based Categorical Clustering." Proc. of Scientific and Statistical Database Management (SSDBM 05), Santa Barbara, CA, June 2005. • [21] Cheng-Fang Lin and Yen-Jen Oyang. A Study on the Characteristics of a Gravity-Based Two-Phase Hierarchical Data Clustering Algorithm. Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University, 2002. • [24] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco, 2000.
Appendix 1 • K-modes • Extends the k-means algorithm to cluster large data sets with categorical values. • Uses the "mode" to represent a group.
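For contrast, a minimal sketch of the k-modes idea: each cluster is summarized by its per-attribute mode and distance is the number of mismatched attributes; the random initialization and fixed iteration count are simplifying assumptions, not the thesis's settings.

```python
import random
from collections import Counter

def mode_of(cluster):
    """The per-attribute mode vector that represents (summarizes) a cluster."""
    d = len(cluster[0])
    return tuple(Counter(e[r] for e in cluster).most_common(1)[0][0] for r in range(d))

def mismatch(a, b):
    """Simple-matching distance: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(elements, k, iters=10):
    """Alternate between assigning elements to the nearest mode and recomputing
    the modes, in the spirit of k-means."""
    modes = random.sample(elements, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for e in elements:
            clusters[min(range(k), key=lambda c: mismatch(e, modes[c]))].append(e)
        modes = [mode_of(c) if c else random.choice(elements) for c in clusters]
    return clusters
```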
Appendix 2 • ROCK • Uses the Jaccard coefficient to measure the similarity between points xi, xk, viewing each point as the set of its attribute values. • Defines Neighbor(xi, xk) = true if Jaccard(xi, xk) ≧ ψ; otherwise Neighbor(xi, xk) = false. • Defines Link(p, Ci) = the number of points q in cluster Ci with Neighbor(p, q) = true. • Best clusters maximize an objective function built from the links, with each cluster size scaled by n_i^(1+2f(ψ)). • For a well defined data set, ψ is 0.5 and f(ψ) = (1-ψ)/(1+ψ). • The threshold ψ determines whether the result is good or not and is hard to decide. • Relaxing the problem tends to produce many clusters.
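A sketch of ROCK's neighbor and link computation under threshold ψ, following this slide's formulation (each record viewed as a set of attribute values); the objective function and the agglomerative merging loop are omitted.

```python
def jaccard(x, y):
    """Jaccard coefficient of two records viewed as sets of (attribute, value) pairs."""
    sx, sy = set(enumerate(x)), set(enumerate(y))
    return len(sx & sy) / len(sx | sy)

def neighbor_pairs(data, psi):
    """Neighbor(xi, xk) is true iff their Jaccard coefficient is at least psi."""
    n = len(data)
    return {(i, k) for i in range(n) for k in range(n)
            if i != k and jaccard(data[i], data[k]) >= psi}

def link(p, cluster, neighbors):
    """link(p, Ci): the number of points q in Ci that are neighbors of p."""
    return sum((p, q) in neighbors for q in cluster if q != p)

data = [("yes", "no", "yes"), ("yes", "no", "no"), ("no", "yes", "no")]
nbrs = neighbor_pairs(data, psi=0.5)
print(link(0, [1, 2], nbrs))   # neighbors of point 0 inside cluster {1, 2}
```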
Appendix 3 • CACTUS is an agglomerative algorithm that uses attribute values to group data. • Definitions: • Support: the number of tuples in which a pair of attribute values co-occurs. • Strong connection: S_conn(sij, sjt) = true if Pair(sij, sjt) > threshold ψ. • It then extends pairs to sets of attribute values: a cluster is defined as a region of attribute values that are pairwise strongly connected. • It works from the attributes' view and finds k regions of attribute values. • The setting of the threshold ψ decides whether the result is good or not.
Appendix 4 • COOLCAT [2002] • Uses entropy to measure the uncertainty of each attribute: with Si = { Si1, Si2, Si3, …, Sit } the values attribute Si can take, Entropy(Si) = -Σ_v p(Si = v) · log p(Si = v). • When Si is nearly certain, Entropy(Si) → 0, i.e. the uncertainty is small. • For a cluster Ci, the entropy of Ci is Entropy(Ci) = Σ_i Entropy(Si), the sum of the entropies of all attributes Si within Ci. • For a partition P = {C1, C2, …, Ck}, the expected entropy of P is E(P) = Σ_k (|Ck| / |D|) · Entropy(Ck).
Appendix 4 (cont’) • For a given partition P, P gives the best clusters when every attribute takes a unique value within each cluster Ci (zero entropy). • COOLCAT's objective is to minimize the expected entropy of the partition. • At the beginning, COOLCAT picks k points from the data set S that are most unlike each other and assigns them to the k clusters as seeds. • For each remaining point p in S - {assigned points}, it assigns p to the cluster i whose entropy increases the least. • Very easy to implement, but performance depends strongly on the initial selection. • Sequence sensitive. • Easily trapped in local minima; the expected entropy obtained is not minimal.
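A sketch of COOLCAT's greedy assignment phase as described above; seed selection and the re-processing of misplaced points are omitted, and the helper names are illustrative.

```python
import math
from collections import Counter

def cluster_entropy(cluster):
    """Entropy of a cluster: the sum of the entropies of its attributes."""
    d = len(cluster[0])
    total = 0.0
    for r in range(d):
        counts = Counter(e[r] for e in cluster)
        total += -sum((c / len(cluster)) * math.log2(c / len(cluster))
                      for c in counts.values())
    return total

def expected_entropy(clusters):
    """Expected entropy of a partition, weighted by cluster sizes."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

def coolcat_assign(seeds, remaining):
    """Greedy phase: put each remaining point into the cluster whose addition
    increases the expected entropy of the partition the least."""
    clusters = [[s] for s in seeds]
    for p in remaining:
        best = min(range(len(clusters)),
                   key=lambda i: expected_entropy(
                       clusters[:i] + [clusters[i] + [p]] + clusters[i + 1:]))
        clusters[best].append(p)
    return clusters
```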