Incorporating User Provided Constraints into Document Clustering Yanhua Chen, Manjeet Rege, Ming Dong, Jing Hua, Farshad Fotouhi Department of Computer Science Wayne State University Detroit, MI 48202 {chenyanh, rege, mdong, jinghua, fotouhi}@wayne.edu
Outline • Introduction • Overview of related work • Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering • Theoretical result for SS-NMF • Experiments and results • Conclusion
What is clustering? • Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups • Intra-cluster distances are minimized; inter-cluster distances are maximized
Document Clustering • Grouping of text documents into meaningful clusters (e.g., Government, Science, Arts) in an unsupervised manner.
Unsupervised Clustering Example [Figure: unlabeled data points partitioned into clusters]
Semi-supervised clustering: problem definition • Input: • A set of unlabeled objects • A small amount of domain knowledge (labels or pairwise constraints) • Output: • A partitioning of the objects into k clusters • Objective: • Maximum intra-cluster similarity • Minimum inter-cluster similarity • High consistency between the partitioning and the domain knowledge
Semi-Supervised Clustering • Different forms of given domain knowledge: • Users provide class labels (seeded points) a priori for some of the documents • Users know which few documents are related (must-link) or unrelated (cannot-link)
Why semi-supervised clustering? • Large amounts of unlabeled data exist • More is being produced all the time • Generating labels for data is expensive • Usually requires human intervention • Use human input to provide labels for some of the data • Improve existing naive clustering methods • Use labeled data to guide clustering of unlabeled data • End result is a better clustering of the data • Potential applications • Document/word categorization • Image categorization • Bioinformatics (gene/protein clustering)
Outline • Introduction • Overview of related work • Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering • Theoretical result for SS-NMF • Experiments and results • Conclusion
Clustering Algorithms • Document hierarchical clustering • Bottom-up, agglomerative • Top-down, divisive • Document partitioning (flat clustering) • K-means • Probabilistic clustering using Naïve Bayes or Gaussian mixture models, etc. • Document clustering based on graph models
Semi-supervised Clustering Algorithms • Semi-supervised clustering with labels (partial label information is given): • SS-Seeded-Kmeans (Sugato Basu et al., ICML 2002) • SS-Constraint-Kmeans (Sugato Basu et al., ICML 2002) • Semi-supervised clustering with constraints (pairwise must-link and cannot-link constraints are given): • SS-COP-Kmeans (Wagstaff et al., ICML 2001) • SS-HMRF-Kmeans (Sugato Basu et al., ACM SIGKDD 2004) • SS-Kernel-Kmeans (Brian Kulis et al., ICML 2005) • SS-Spectral-Normalized-Cuts (X. Ji et al., ACM SIGIR 2006)
Overview of K-means Clustering • K-means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into k clusters. • Objective function: locally minimizes the sum of squared distances between the data points and their corresponding cluster centers: J = Σ_{h=1..k} Σ_{x_i ∈ f_h} || x_i − m_h ||², where m_h is the center of cluster f_h. Algorithm: Initialize k cluster centers randomly. Repeat until convergence: • Cluster Assignment Step: Assign each data point x_i to the cluster f_h whose center m_h is nearest • Center Re-estimation Step: Re-estimate each cluster center as the mean of the points currently in that cluster
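A minimal NumPy sketch of this loop (illustrative only, not code from the paper; X is an n-by-d data matrix and k the number of clusters):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Plain k-means by iterative relocation, minimizing the sum of
    # squared distances between points and their cluster centers.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment step: nearest center for each point.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Center re-estimation step: mean of the points in each cluster.
        new_centers = np.array([X[labels == h].mean(0) if (labels == h).any()
                                else centers[h] for h in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers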
Semi-supervised Kernel K-means (SS-KK) [Brian Kulis et al., ICML 2005] • Semi-supervised kernel k-means objective: J = Σ_{h=1..k} Σ_{x_i ∈ f_h} || φ(x_i) − m_h ||² − Σ_{(x_i, x_j) ∈ M, l_i = l_j} w_ij + Σ_{(x_i, x_j) ∈ C, l_i = l_j} w_ij where φ is the kernel function mapping points from input space to feature space, m_h is the centroid of cluster f_h, l_i is the cluster label of x_i, and w_ij is the cost of violating the constraint between two points • First term: kernel k-means objective function • Second term: reward function for satisfying must-link constraints • Third term: penalty function for violating cannot-link constraints
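A compact sketch of the idea behind this objective, with constraint costs folded into the kernel k-means assignment step (a simplification for illustration, not Kulis et al.'s exact HMRF algorithm; K is a precomputed kernel matrix, must/cannot are lists of index pairs, w is a uniform violation cost):

import numpy as np

def ss_kernel_kmeans(K, k, must, cannot, w=1.0, n_iter=50, seed=0):
    # Assign each point to the cluster minimizing its kernel-space distance
    # plus the cost of violated must-link / cannot-link constraints.
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        for i in range(n):
            cost = np.full(k, np.inf)
            for h in range(k):
                idx = np.where(labels == h)[0]
                if len(idx) == 0:
                    continue
                # ||phi(x_i) - m_h||^2 computed via the kernel trick.
                cost[h] = (K[i, i] - 2 * K[i, idx].mean()
                           + K[np.ix_(idx, idx)].mean())
                # Pay w for each must-link split or cannot-link join
                # (a + b - i picks the partner index of i in the pair).
                for a, b in must:
                    if i in (a, b) and labels[a + b - i] != h:
                        cost[h] += w
                for a, b in cannot:
                    if i in (a, b) and labels[a + b - i] == h:
                        cost[h] += w
            labels[i] = cost.argmin()
    return labels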
Overview of Spectral Clustering • Spectral clustering is a graph-theoretic clustering algorithm: represent the data as a weighted graph G = (V, E, A) and minimize the between-cluster similarities (edge weights A_ij)
Spectral Normalized Cuts • Min similarity between clusters π_1 and π_2: cut(π_1, π_2) = Σ_{i ∈ π_1, j ∈ π_2} A_ij • Balance weights: vertex degrees d_i = Σ_j A_ij, D = diag(d_1, …, d_n) • Cluster indicator: vector q with q_i = +√(d(π_2)/d(π_1)) if i ∈ π_1 and q_i = −√(d(π_1)/d(π_2)) if i ∈ π_2 • Graph partition becomes: min_q J_NC = q^T (D − A) q / (q^T D q) • Solution is the second smallest generalized eigenvector of: (D − A) q = λ D q
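A minimal NumPy sketch of the two-way case (illustrative; A is a symmetric, non-negative affinity matrix for a connected graph):

import numpy as np

def normalized_cut_2way(A):
    # Solve the relaxed normalized cut (D - A) q = lambda * D q through the
    # symmetric form D^{-1/2} (D - A) D^{-1/2}, then split by sign.
    d = A.sum(1)
    d_isqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(A)) - d_isqrt[:, None] * A * d_isqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    q = d_isqrt * vecs[:, 1]             # second-smallest eigenvector, rescaled
    return (q >= 0).astype(int)          # sign of q assigns the two clusters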
Semi-supervised Spectral Normalized Cuts (SS-SNC) [X. Ji et al., ACM SIGIR 2006] • Semi-supervised spectral learning objective: J = J_NC − Σ_{(x_i, x_j) ∈ M, l_i = l_j} w_ij + Σ_{(x_i, x_j) ∈ C, l_i = l_j} w_ij where J_NC is the normalized cut objective from the previous slide and w_ij is the cost of violating the constraint between two points • First term: spectral normalized cut objective function • Second term: reward function for satisfying must-link constraints • Third term: penalty function for violating cannot-link constraints
Outline • Introduction • Related work • Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering • NMF review • Model formulation and algorithm derivation • Theoretical result for SS-NMF • Experiments and results • Conclusion
Non-negative Matrix Factorization (NMF) • NMF decomposes a non-negative matrix into two non-negative factors (D. Lee et al., Nature 1999): X ≈ FG^T, computed as min || X − FG^T ||² • Symmetric NMF for clustering (C. Ding et al., SIAM ICDM 2005): A ≈ GSG^T, computed as min || A − GSG^T ||²
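A minimal NumPy sketch of the standard multiplicative updates for min || X − FG^T ||² (Lee-Seung style; illustrative, not the paper's implementation):

import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    # Multiplicative updates keep F and G non-negative while monotonically
    # decreasing the squared reconstruction error ||X - F G^T||^2.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k))
    G = rng.random((m, k))
    for _ in range(n_iter):
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
    return F, G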
SS-NMF • Incorporate prior knowledge into an NMF-based framework for document clustering. • Users provide pairwise constraints: • Must-link constraints C_ML: two documents d_i and d_j must belong to the same cluster. • Cannot-link constraints C_CL: two documents d_i and d_j must belong to different clusters. • Constraints are encoded in violation cost matrices W: • W_reward: cost of violating a must-link constraint between documents d_i and d_j, if one exists. • W_penalty: cost of violating a cannot-link constraint between documents d_i and d_j, if one exists.
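A small sketch of how such cost matrices could be assembled from user input (the uniform cost value is an assumption for illustration):

import numpy as np

def constraint_matrices(n_docs, must_link, cannot_link, cost=1.0):
    # Symmetric violation-cost matrices: W_reward for must-link pairs,
    # W_penalty for cannot-link pairs; zero where no constraint exists.
    W_reward = np.zeros((n_docs, n_docs))
    W_penalty = np.zeros((n_docs, n_docs))
    for i, j in must_link:
        W_reward[i, j] = W_reward[j, i] = cost
    for i, j in cannot_link:
        W_penalty[i, j] = W_penalty[j, i] = cost
    return W_reward, W_penalty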
SS-NMF Algorithm • Define the objective function of SS-NMF: J = min_{G, S ≥ 0} || A − GSG^T ||² + Σ_{(d_i, d_j) ∈ C_ML, l_i ≠ l_j} W_reward(i, j) + Σ_{(d_i, d_j) ∈ C_CL, l_i = l_j} W_penalty(i, j) where A is the document-document similarity matrix, G is the cluster indicator matrix, S is the cluster centroid matrix, and l_i is the cluster label of document d_i
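The paper derives multiplicative update rules for this objective (see [1], [2]); as a rough stand-in, here is a simple projected-gradient sketch for the factorization term min || A − GSG^T ||² alone (illustrative only, not the authors' algorithm):

import numpy as np

def symmetric_trifactor(A, k, lr=1e-3, n_iter=500, seed=0):
    # Projected gradient descent on min ||A - G S G^T||_F^2 with G, S >= 0;
    # A is a symmetric document-document similarity matrix.
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    G = rng.random((n, k))
    S = rng.random((k, k))
    for _ in range(n_iter):
        R = A - G @ S @ G.T                        # residual
        grad_G = -2 * (R @ G @ S.T + R.T @ G @ S)  # d/dG of ||R||^2
        grad_S = -2 * (G.T @ R @ G)                # d/dS of ||R||^2
        G = np.maximum(G - lr * grad_G, 0.0)       # project onto G >= 0
        S = np.maximum(S - lr * grad_S, 0.0)
    return G, S, G.argmax(1)                       # row-wise max gives labels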
Outline • Introduction • Overview of related work • Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering • Theoretical result for SS-NMF • Experiments and results • Conclusion
Algorithm Correctness and Convergence • Based on constrained optimization theory and an auxiliary function, we can prove for SS-NMF: • 1. Correctness: the solution converges to a local minimum • 2. Convergence: the iterative algorithm converges (details in papers [1], [2]) [1] Y. Chen, M. Rege, M. Dong and J. Hua, “Incorporating User Provided Constraints into Document Clustering”, Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular paper, acceptance rate 7.2%) [2] Y. Chen, M. Rege, M. Dong and J. Hua, “Non-negative Matrix Factorization for Semi-supervised Data Clustering”, Journal of Knowledge and Information Systems, to appear, 2008.
SS-NMF: General Framework for Semi-supervised Clustering • Orthogonal symmetric semi-supervised NMF is equivalent to Semi-supervised Kernel K-means (SS-KK) and Semi-supervised Spectral Normalized Cuts (SS-SNC)! (Proof: equations (1)-(3) in the paper)
Outline • Introduction • Overview of related work • Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering • Theoretical result for SS-NMF • Experiments and results • Artificial Toy Data • Real Data • Conclusion
Experiments on Toy Data • 1. Artificial toy data, consisting of two natural clusters
Results on Toy Data (SS-KK and SS-NMF) • Hard Clustering: Each object belongs to a single cluster • Soft Clustering: Each object is probabilistically assigned to clusters. Table: difference between the cluster indicator G of SS-KK (hard clustering) and SS-NMF (soft clustering) for the toy data
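For example (illustrative numbers, not from the paper): a hard-clustering row of G is (0, 1), placing the object entirely in cluster 2, while a soft-clustering row might be (0.14, 0.86), assigning it to cluster 2 with weight 0.86 while retaining a 0.14 association with cluster 1.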
Results on Toy Data (SS-SNC and SS-NMF) (a) Data distribution in the SS-SNC subspace of the first two singular vectors: there is no relationship between the axes and the clusters. (b) Data distribution in the SS-NMF subspace of the two column vectors of G: the data points from the two clusters are distributed along the two axes.
Time Complexity Analysis Figure: computational speed comparison of SS-KK, SS-SNC, and SS-NMF
Experiments on Text Data 2. Summary of data sets[1] used in the experiments. [1] http://www.cs.umn.edu/~han/data/tmdata.tar.gz • Evaluation metric (accuracy): AC = ( Σ_{i=1..n} δ(α_i, map(l_i)) ) / n where n is the total number of documents in the experiment, δ(x, y) is the delta function that equals one if x = y and zero otherwise, l_i is the estimated label, α_i is the ground truth label, and map(·) maps each cluster label to the best matching class label.
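A small sketch of this metric, with map(·) computed by the Hungarian algorithm (illustrative; assumes SciPy is available):

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(truth, pred):
    # AC = (1/n) * sum_i delta(truth_i, map(pred_i)), where map() is the
    # one-to-one cluster-to-class matching that maximizes agreement.
    truth, pred = np.asarray(truth), np.asarray(pred)
    k = max(truth.max(), pred.max()) + 1
    counts = np.zeros((k, k), dtype=int)
    for t, p in zip(truth, pred):
        counts[p, t] += 1
    rows, cols = linear_sum_assignment(-counts)  # maximize matched documents
    return counts[rows, cols].sum() / len(truth)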
Results on Text Data (Comparison with Unsupervised Clustering) • (1) Comparison with unsupervised clustering approaches. Note: SS-NMF uses only 3% pairwise constraints
Results on Text Data (Before and After Clustering) (a) Typical document-document similarity matrix before clustering (b) Document-document similarity matrix after clustering with SS-NMF (k=2) (c) Document-document similarity matrix after clustering with SS-NMF (k=5)
Results on Text Data (Clustering with Different Constraints) Table: comparison of the confusion matrix C and the normalized cluster centroid matrix S of SS-NMF for different percentages of document pairs constrained
Results on Text Data (Comparison with Semi-supervised Clustering) • (2) Comparison with SS-KK and SS-SNC (a) Graft-Phos (b) England-Heart (c) Interest-Trade
Results on Text Data (Comparison with Semi-supervised Clustering) • (2) Comparison with SS-KK and SS-SNC (Fbis2, Fbis3, Fbis4, Fbis5)
Experiments on Image Data 3. Image data sets[2] used in the experiments. Figure: sample images for image categorization (from top to bottom: O-Owls, R-Roses, L-Lions, E-Elephants, H-Horses) [2] http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.data.html
Results on Image Data (Comparison with Unsupervised Clustering) • (1) Comparison with unsupervised clustering approaches. Table: comparison of image clustering accuracy between KK, SNC, NMF, and SS-NMF with only 3% pairwise constraints on the images; SS-NMF consistently outperforms the well-established unsupervised image clustering methods.
Results on Image Data (Comparison with Semi-supervised Clustering) • (2) Comparison with SS-KK and SS-SNC. Figure: comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of image pairs constrained: (a) O-R, (b) L-H, (c) R-L, (d) O-R-L.
Results on Image Data (Comparison with Semi-supervised Clustering) • (2) Comparison with SS-KK and SS-SNC. Figure: comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of image pairs constrained: (e) L-E-H, (f) O-R-L-E, (g) O-L-E-H, (h) O-R-L-E-H.
Outline • Introduction • Related work • Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering • Theoretical result for SS-NMF • Experiments and results • Conclusion
Conclusion • Semi-supervised clustering has many real-world applications and outperforms traditional clustering algorithms • The semi-supervised NMF algorithm provides a unified mathematical framework for semi-supervised clustering • Many existing semi-supervised clustering algorithms can be extended to multi-type object co-clustering tasks
Reference [1] Y. Chen, M. Rege, M. Dong and F. Fotouhi, “Deriving Semantics for Image Clustering from Accumulated User Feedbacks”, Proc. of ACM Multimedia, Germany, 2007. [2] Y. Chen, M. Rege, M. Dong and J. Hua, “Incorporating User Provided Constraints into Document Clustering”, Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular paper, acceptance rate 7.2%) [3] Y. Chen, M. Rege, M. Dong and J. Hua, “Non-negative Matrix Factorization for Semi-supervised Data Clustering”, Journal of Knowledge and Information Systems, invited as a best paper of ICDM 07, to appear, 2008.