A New Suffix Tree Similarity Measure for Document Clustering

A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007

INTRODUCTION • Goal: To develop a document clustering algorithm to categorize the Web documents in an online community • The Vector Space Document (VSD) model - represents any document as a feature vector of its words • Suffix tree document model - identifies phrases that are common to groups of documents

suffix sub-string

Suffix Tree Document Model • 1. cat ate cheese • 2. mouse ate cheese too • 3. cat ate mouse too
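
The idea of identifying phrases shared by the three example documents can be sketched as follows. This is a minimal illustration, not the paper's implementation: it enumerates every contiguous word phrase (every path from the root of an uncompressed word-level suffix trie) instead of building a compact suffix tree, so it also reports prefixes such as "cat" that a compact tree would fold into the node "cat ate". The names `suffix_phrases` and `build_phrase_index` are hypothetical.

```python
from collections import defaultdict

def suffix_phrases(words):
    """Yield every prefix of every suffix of a word list (every contiguous phrase)."""
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            yield tuple(words[start:end])

def build_phrase_index(docs):
    """Map each phrase to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in suffix_phrases(text.split()):
            index[phrase].add(doc_id)
    return index

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
index = build_phrase_index(docs)

# Phrases shared by at least two documents are the candidates for base clusters.
shared = {phrase: ids for phrase, ids in index.items() if len(ids) > 1}
for phrase, ids in sorted(shared.items()):
    print(" ".join(phrase), sorted(ids))
```

This brute-force enumeration is quadratic in document length, which is acceptable for a three-sentence example only.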

STC Algorithm (Suffix Tree Clustering) • 1. Generating the common suffix tree • 2. Selecting base clusters: each base cluster B is assigned a score s(B) • |B| = the number of documents in B • |P| = the number of words in phrase P • 3. Merging clusters • Jaccard coefficient
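
The cluster merging step can be sketched as below, assuming base clusters are joined whenever the Jaccard coefficient of their document sets exceeds a threshold, and that final clusters are the connected components of the resulting graph. The threshold of 0.5, the base cluster labels a to f, and the function names are assumptions made for this illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient of two non-empty document sets."""
    return len(a & b) / len(a | b)

def merge_base_clusters(base, threshold=0.5):
    """Union base clusters whose Jaccard coefficient exceeds the threshold."""
    parent = {name: name for name in base}

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in combinations(base, 2):
        if jaccard(base[x], base[y]) > threshold:
            parent[find(x)] = find(y)

    merged = {}
    for name in base:
        merged.setdefault(find(name), set()).update(base[name])
    return list(merged.values())

# Base clusters for the three example documents (phrase -> document ids).
base = {
    "a: cat ate": {1, 3},
    "b: ate": {1, 2, 3},
    "c: cheese": {1, 2},
    "d: mouse": {2, 3},
    "e: too": {2, 3},
    "f: ate cheese": {1, 2},
}
clusters = merge_base_clusters(base)
```

With this threshold every base cluster is linked through "ate", so the example collapses into a single cluster, which illustrates how single-link style merging can produce large clusters.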

The base cluster graph

Problems of STC • The STC algorithm sometimes generates large clusters of poor quality • No quality measure like tf-idf • No single-link, group-average or complete-link clustering • Solution • map each node of the suffix tree to a unique dimension of an M-dimensional space • M = total number of nodes in the suffix tree, excluding the root node

The New Suffix Tree Similarity Measure • Each document d can be represented as a feature vector of the weights of the M nodes • df(n) = the number of different documents that traverse node n • tf(n, d) = the total number of times document d traverses node n • e.g. df(b) = 3, tf(b, 1) = 1
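
The two counts can be computed as in the sketch below. It assumes each node is identified by the phrase on its path from the root and, for simplicity, counts over an uncompressed word-level suffix trie, so it contains more nodes than a compact suffix tree would. The function name `node_counts` is hypothetical.

```python
from collections import Counter, defaultdict

def node_counts(docs):
    """Return (tf, df) where tf[node][doc_id] counts traversals and df[node] counts documents."""
    tf = defaultdict(Counter)
    for doc_id, text in docs.items():
        words = text.split()
        # Inserting each suffix traverses one node per prefix of that suffix.
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                tf[tuple(words[start:end])][doc_id] += 1
    df = {node: len(per_doc) for node, per_doc in tf.items()}
    return tf, df

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
tf, df = node_counts(docs)

node_b = ("ate",)  # the node labelled b in the example
print(df[node_b], tf[node_b][1])
```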

The New Suffix Tree Similarity Measure • tf-idf formula • cosine similarity • GAHC algorithm (Group-Average Agglomerative Hierarchical Clustering)
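
A minimal sketch of weighting nodes and comparing documents follows. The slide names the tf-idf formula without stating it here, so the weighting used below, (1 + log tf) · log(1 + N/df), is an assumption chosen as one common tf-idf variant. The node set is limited to the six shared nodes of the three example documents, and the function names are hypothetical.

```python
import math

def tfidf_vector(doc_id, tf, df, n_docs):
    """Sparse vector of node weights for one document (assumed tf-idf variant)."""
    vec = {}
    for node, per_doc in tf.items():
        count = per_doc.get(doc_id, 0)
        if count > 0:
            vec[node] = (1 + math.log(count)) * math.log(1 + n_docs / df[node])
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[node] for node, weight in u.items() if node in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# tf(n, d) for the six shared nodes of the example documents.
tf = {
    "cat ate": {1: 1, 3: 1},
    "ate": {1: 1, 2: 1, 3: 1},
    "cheese": {1: 1, 2: 1},
    "mouse": {2: 1, 3: 1},
    "too": {2: 1, 3: 1},
    "ate cheese": {1: 1, 2: 1},
}
df = {node: len(per_doc) for node, per_doc in tf.items()}
vectors = {d: tfidf_vector(d, tf, df, n_docs=3) for d in (1, 2, 3)}

sim_12 = cosine(vectors[1], vectors[2])
sim_13 = cosine(vectors[1], vectors[3])
print(round(sim_12, 3), round(sim_13, 3))
```

Because the pairwise similarity is an ordinary cosine value, any standard agglomerative clustering method can consume it directly.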

A Closer Look at the Suffix Tree Document Model • Efficiency Analysis • constructing the suffix tree naively: O(m^2) • Ukkonen's paper provides an algorithm to build a suffix tree in O(m) • Stopword or Stopnode • Words in the stoplist - the score s(B) of a base cluster • stopnode - A node with a high df can be ignored
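
The stopnode idea can be sketched as a simple filter on document frequency. The cutoff, expressed here as a maximum fraction of documents (`max_ratio`), is a hypothetical parameter; the slide only says that a node with a high df can be ignored.

```python
def remove_stopnodes(df, n_docs, max_ratio=0.7):
    """Keep only nodes whose document frequency ratio does not exceed max_ratio."""
    return {node for node, count in df.items() if count / n_docs <= max_ratio}

# df(n) for the six shared nodes of the three example documents.
df = {"cat ate": 2, "ate": 3, "cheese": 2, "mouse": 2, "too": 2, "ate cheese": 2}
kept = remove_stopnodes(df, n_docs=3)
print(sorted(kept))
```

In this example "ate" appears in every document, so it carries no discriminating information and is dropped.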

Document Preparing • 1. combine all posts of the same thread into a single document • 2. all non-word tokens are stripped • 3. all stopwords are identified and removed • 4. Porter stemming algorithm is applied • 6. the posts containing at least 3 distinct words are selected
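
The preparation steps listed above can be sketched as one small pipeline. Several details are assumptions: the stoplist is a tiny illustrative one, the stemmer is passed in as a function and defaults to the identity (a real Porter stemmer would be supplied by the caller), and the minimum of 3 distinct words is applied to the combined thread document.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}  # illustrative stoplist only

def prepare(posts, stem=lambda word: word, min_distinct=3):
    """Turn (thread_id, text) posts into cleaned word lists, one per thread."""
    threads = defaultdict(list)
    for thread_id, text in posts:
        threads[thread_id].append(text)  # combine posts of the same thread

    docs = {}
    for thread_id, texts in threads.items():
        tokens = re.findall(r"[a-z]+", " ".join(texts).lower())  # strip non-word tokens
        words = [stem(t) for t in tokens if t not in STOPWORDS]  # drop stopwords, then stem
        if len(set(words)) >= min_distinct:  # keep documents with enough distinct words
            docs[thread_id] = words
    return docs

posts = [("t1", "The cat ate cheese!"), ("t1", "A mouse, too."), ("t2", "Is it?")]
docs = prepare(posts)
print(docs)
```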

Cluster Topic Summary Generating • generating a topic summary involves two important information retrieval tasks • 1. ranking the documents in a cluster by a quality score • 2. extracting common phrases as the topic summary

Cluster Topic Summary Generating • Document quality evaluation • Web documents provide some additional human assessments for the document quality evaluation • view clicks, reply posts and recommend clicks • the top 10% of documents are taken as the representatives of the cluster • the nodes traversed by the representative documents are selected and sorted by their idf in ascending order. Finally, the top 5 nodes are selected.
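
The selection procedure can be sketched as follows. The quality score is taken as given (how view clicks, reply posts and recommend clicks are combined is not specified on the slide), the idf form log(1 + N/df) is an assumption, and the function and parameter names are hypothetical.

```python
import math

def topic_summary(cluster_docs, quality, doc_nodes, df, n_docs, top_frac=0.10, top_nodes=5):
    """Pick representative documents, then the lowest-idf nodes they traverse."""
    ranked = sorted(cluster_docs, key=lambda d: quality[d], reverse=True)
    n_reps = max(1, math.ceil(len(ranked) * top_frac))  # top 10%, at least one document
    representatives = ranked[:n_reps]

    nodes = set().union(*(doc_nodes[d] for d in representatives))

    def idf(node):
        return math.log(1 + n_docs / df[node])

    # Ascending idf; ties are broken alphabetically to keep the result deterministic.
    summary = sorted(nodes, key=lambda node: (idf(node), node))[:top_nodes]
    return representatives, summary

cluster_docs = list(range(1, 11))
quality = {d: d for d in cluster_docs}  # toy scores: document 10 is the best
doc_nodes = {d: {"ate"} for d in cluster_docs}
doc_nodes[10] = {"ate", "cat ate"}
df = {"ate": 8, "cat ate": 2}

representatives, summary = topic_summary(cluster_docs, quality, doc_nodes, df, n_docs=10)
print(representatives, summary)
```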

EVALUATION • System-generated clusters C = {C1, C2, …, Ck} • Ground-truth (answer) clusters • Recall (i, j) = • Precision (i, j) =
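
For reference, the sketch below uses the standard clustering definitions of precision and recall between answer class i and generated cluster j, plus the usual class-weighted F-measure. These are the textbook forms and are assumed here, since the slide's own formulas are not shown; the function names are hypothetical.

```python
def precision_recall(classes, clusters, i, j):
    """Precision and recall of generated cluster j with respect to answer class i."""
    overlap = len(classes[i] & clusters[j])
    return overlap / len(clusters[j]), overlap / len(classes[i])

def f_score(classes, clusters, i, j):
    """Harmonic mean of precision and recall for the pair (i, j)."""
    precision, recall = precision_recall(classes, clusters, i, j)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def overall_f(classes, clusters):
    """Weight each class by its size and take its best-matching cluster."""
    total = sum(len(members) for members in classes.values())
    return sum(
        len(members) / total * max(f_score(classes, clusters, i, j) for j in clusters)
        for i, members in classes.items()
    )

classes = {"A": {1, 2, 3}, "B": {4, 5}}   # answer classes
clusters = {0: {1, 2}, 1: {3, 4, 5}}      # system-generated clusters
print(precision_recall(classes, clusters, "A", 0), overall_f(classes, clusters))
```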

Document Collections • OHSUMED Document Collection • 8 categories, 800 documents, containing 6,281 distinct words. The average document length is about 110 words • RCV1 Document Collection • 10 groups of documents, containing 19,229 distinct words. The average document length is about 150 words

Results and Discussion

Results and Discussion • STC algorithm - there is no effective measure to evaluate the quality of the clusters during cluster merging • Thus the STC algorithm seldom generated large clusters of high quality in the experiments

Results and Discussion • DS3 document

CONCLUSIONS AND FUTURE WORK • By mapping all nodes in the common suffix tree into an M-dimensional space of the VSD model, the advantages of both the VSD model and the suffix tree model are inherited • the suffix tree similarity measure is very simple, but its implementation is quite difficult • time efficiency and space efficiency • Applying the new similarity measure to Chinese documents
