Constrained Spectral Clustering: Incorporating Prior Knowledge for Document Analysis

Document Clustering with Prior Knowledge • Xiang Ji et al. Document Clustering with Prior Knowledge. SIGIR 2006 • Presenter: Suhan Yu

Traditional Clustering Methods • Methods • K-Means • Ratio Cut • Average Association • Normalized Cut • Min-Max Cut • Important observation: • Any given data set can be partitioned in many possible ways.

Traditional Clustering Methods • Related Work on Semi-Supervised Learning: • Wagstaff et al. introduced two types of constraints: “must link” and “cannot link” • Basu et al. developed a semi-supervised K-means that makes use of labeled data to generate the initial seed clusters and to guide the clustering process.

Normalized Cut • J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000. • Model the given document set using an undirected graph G(V,E,W) • V: vertex set; each vertex represents a document vector • E: edge set; each edge is assigned a weight reflecting the similarity between the two documents it connects • W: graph affinity matrix
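As a minimal sketch of how the graph affinity matrix W might be built, the example below uses cosine similarity between document vectors. The similarity measure, the function name `cosine_affinity`, the toy term-count vectors, and the choice to zero the diagonal are all assumptions made for illustration; the slides do not say which similarity was used.

```python
import numpy as np

def cosine_affinity(docs):
    """Return W where W[i, j] is the cosine similarity of document vectors i and j."""
    X = np.asarray(docs, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0        # leave all-zero documents as zero vectors
    Xn = X / norms                 # unit-length rows
    W = Xn @ Xn.T                  # pairwise cosine similarities
    np.fill_diagonal(W, 0.0)       # no self-loops (an assumption of this sketch)
    return W

# Hypothetical term-count vectors for four documents.
docs = np.array([[2, 1, 0, 0],
                 [1, 2, 0, 0],
                 [0, 0, 1, 3],
                 [0, 0, 3, 1]])
W = cosine_affinity(docs)
```

Documents 0 and 1 share vocabulary and so receive a high edge weight, while documents 0 and 2 share none and receive a weight of zero.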

Normalized Cut • Measures how tightly the cluster S is connected with the rest of the data set. • Measures how compact the entire data set is.

Normalized Cut • Let each cluster have an indicator vector whose elements take a binary value in {1, 0}. • Then we get the matrix form of the cost, where D is a diagonal matrix.

Normalized Cut • Minimize the cost function
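The cost function itself did not survive in this transcript. As a hedged illustration, the sketch below evaluates the standard multi-way normalized cut in the form given by Shi and Malik, the sum over clusters of x^T (D - W) x / x^T D x, where x is the binary indicator vector of a cluster and D is the diagonal matrix of row sums of W. Whether the slides used exactly this form is an assumption; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def normalized_cut(W, labels):
    """Sum over clusters of cut(S, V - S) / assoc(S, V), using indicator vectors."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    D = np.diag(W.sum(axis=1))               # diagonal degree matrix (assumed definition)
    cost = 0.0
    for k in np.unique(labels):
        x = (labels == k).astype(float)      # binary indicator vector of cluster k
        cut = x @ (D - W) @ x                # weight of edges leaving the cluster
        assoc = x @ D @ x                    # total degree of the cluster
        cost += cut / assoc
    return cost

# Toy graph: documents {0, 1} and {2, 3} are strongly linked within each pair.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
good = normalized_cut(W, [0, 0, 1, 1])   # follows the strong edges
bad = normalized_cut(W, [0, 1, 0, 1])    # cuts the strong edges
```

The partition that keeps strongly connected documents together yields the lower cost, which is what minimizing the cost function is meant to reward.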

Incorporating Prior Knowledge • The prior knowledge is provided as pairs of documents that the user wishes to be grouped into the same cluster. • Constraint vector:
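The definition of the constraint vector is missing from this transcript. One common way to encode a same-cluster pair (i, j), assumed here purely for illustration, is a vector with +1 at position i, -1 at position j, and 0 elsewhere, so that its dot product with a cluster indicator vector is zero exactly when both documents are on the same side. The function name and the example pairs are hypothetical.

```python
import numpy as np

def build_constraint_matrix(pairs, n_docs):
    """Stack one constraint vector per same-cluster pair (assumed +1 / -1 encoding)."""
    U = np.zeros((len(pairs), n_docs))
    for row, (i, j) in enumerate(pairs):
        U[row, i] = 1.0
        U[row, j] = -1.0
    return U

# The user asks for documents (0, 2) and (1, 3) to be grouped together.
U = build_constraint_matrix([(0, 2), (1, 3)], n_docs=5)

x_ok = np.array([1, 0, 1, 0, 0])    # cluster containing both 0 and 2
x_bad = np.array([1, 0, 0, 0, 1])   # cluster containing 0 but not 2

violations_ok = U @ x_ok            # all zeros: no constraint is split
violations_bad = U @ x_bad          # non-zero entry for the pair (0, 2)
```

Under this encoding, a non-zero entry of `U @ x` flags a pair that the cluster indicated by `x` separates.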

The flow path of CNC • Create the graph affinity matrix W, in which each element represents the similarity between two documents. • Compute the diagonal matrix D • Form the constraint matrix U from the pairs provided by the user • Form the matrix and compute its K smallest eigenvalues and the corresponding eigenvectors. • Project each document into the eigen-space spanned by the K eigenvectors. • Apply the K-means algorithm to find the K document clusters within this eigen-space
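The steps above can be sketched end to end as follows. The exact matrix formed in the fourth step is missing from this transcript, so the sketch assumes a penalized normalized Laplacian, D^{-1/2} (D - W + beta U^T U) D^{-1/2}, with a hypothetical penalty weight `beta`. The +1 / -1 constraint encoding, the mapping of eigenvectors back through D^{-1/2}, and the small farthest-point K-means are likewise illustrative choices, not the paper's confirmed procedure.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def cnc_sketch(W, pairs, k, beta=10.0):
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    d = W.sum(axis=1)                         # diagonal of D (assumed: row sums of W)
    U = np.zeros((len(pairs), n))             # one constraint vector per pair
    for row, (i, j) in enumerate(pairs):
        U[row, i], U[row, j] = 1.0, -1.0
    d_inv_sqrt = 1.0 / np.sqrt(d)             # assumes every document has positive degree
    # Assumed matrix: D^{-1/2} (D - W + beta * U^T U) D^{-1/2}
    M = (np.diag(d) - W + beta * U.T @ U) * np.outer(d_inv_sqrt, d_inv_sqrt)
    eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    embedding = eigvecs[:, :k] * d_inv_sqrt[:, None]   # K smallest eigenvectors
    return kmeans(embedding, k)

# Toy data: two groups of three documents, weakly connected to each other.
W = np.full((6, 6), 0.05)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
labels = cnc_sketch(W, pairs=[(0, 1)], k=2)
```

In this sketch `beta` controls how strongly a split pair is penalized; a soft penalty was chosen over a hard constraint only to keep the example to a single symmetric eigenproblem.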

Data description • The paper evaluates its document clustering model on two data sets: the Reuters-21578 and 20 Newsgroups document corpora. • The Newsgroups data set contains 20000 documents collected from 20 newsgroups in the public domain.

Evaluation • Given two sets of document clusters C and C′, their mutual information metric is defined so that: • 0: the two sets are independent • 1: the two sets are identical
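The formula for the metric is missing from this transcript. A minimal sketch of one normalized mutual information that has the stated end points is given below: the mutual information of the two labelings divided by the larger of their two entropies. The choice of normalizer and the function name are assumptions, not necessarily the definition used in the paper.

```python
import numpy as np

def normalized_mutual_information(labels_a, labels_b):
    """MI(C, C') / max(H(C), H(C')), computed from the joint label distribution."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)    # co-occurrence counts
    joint /= len(a)                          # joint probabilities p(c, c')
    pa = joint.sum(axis=1)                   # marginal p(c)
    pb = joint.sum(axis=0)                   # marginal p(c')
    nz = joint > 0                           # skip empty cells to avoid log(0)
    mi = np.sum(joint[nz] * np.log2(joint[nz] / np.outer(pa, pb)[nz]))
    h_a = -np.sum(pa * np.log2(pa))
    h_b = -np.sum(pb * np.log2(pb))
    denom = max(h_a, h_b)
    return mi / denom if denom > 0 else 0.0

same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])         # relabelled copy
independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # no shared structure
```

Because the score depends only on which documents are grouped together, swapping cluster names leaves it unchanged, which makes it suitable for comparing a clustering against reference categories.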

Result

Conclusion • This paper proposed a constrained spectral clustering method (CNC) that incorporates the user’s prior knowledge into document cluster analysis. • The CNC model is an effective semi-supervised document clustering tool, especially when very few training samples are available. • The CNC model does not form constraints for cannot-link prior knowledge.
