
A New Suffix Tree Similarity Measure for Document Clustering

A New Suffix Tree Similarity Measure for Document Clustering. Hung Chim, Xiaotie Deng, City University of Hong Kong. WWW 2007. INTRODUCTION. Purpose: to develop a document clustering algorithm that categorizes the Web documents in an online community.


Presentation Transcript


  1. A New Suffix Tree Similarity Measure for Document Clustering • Hung Chim, Xiaotie Deng • City University of Hong Kong • WWW 2007

  2. INTRODUCTION • Purpose: to develop a document clustering algorithm that categorizes the Web documents in an online community • The Vector Space Document (VSD) model represents any document as a feature vector of the words it contains • The suffix tree document model identifies phrases that are common to groups of documents

  3. suffix sub-string

  4. Suffix Tree Document Model • 1. cat ate cheese • 2. mouse ate cheese too • 3. cat ate mouse too
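
To make the three-document example concrete, here is a minimal Python sketch that inserts every word-level suffix of each document into a shared trie and lists the phrases visited by two or more documents. It is an assumption-laden illustration: it uses an uncompressed trie (one node per word) rather than a compact suffix tree, and the names `Node`, `build_suffix_trie` and `shared_phrases` are hypothetical.

```python
from collections import defaultdict

class Node:
    """One trie node; each node stands for the phrase spelled from the root."""
    def __init__(self):
        self.children = {}               # word -> Node
        self.visits = defaultdict(int)   # doc id -> number of traversals

def build_suffix_trie(docs):
    """Insert every word-level suffix of every document into one shared trie."""
    root = Node()
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.visits[doc_id] += 1
    return root

def shared_phrases(node, prefix=()):
    """Yield (phrase, sorted doc ids) for each node visited by 2+ documents."""
    for word, child in node.children.items():
        phrase = prefix + (word,)
        if len(child.visits) >= 2:
            yield " ".join(phrase), sorted(child.visits)
        yield from shared_phrases(child, phrase)

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
trie = build_suffix_trie(docs)
shared = dict(shared_phrases(trie))
for phrase, doc_ids in sorted(shared.items()):
    print(phrase, doc_ids)
```

Each shared phrase together with its document set corresponds to what STC calls a base cluster.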

  5. STC Algorithm (Suffix Tree Clustering) • 1. Generate the common suffix tree • 2. Select base clusters: each base cluster B is assigned a score s(B) • |B| = the number of documents in B • |P| = the number of words in phrase P • 3. Merge clusters • Jaccard coefficient
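
The scoring and merging steps can be sketched as follows. The score formula is not preserved in the transcript, so this sketch assumes the form s(B) = |B| · f(|P|) with a hypothetical phrase-length factor f (penalising single words, capped at six words). The merge step follows the slide in using the Jaccard coefficient; the threshold of 0.5 and the function names are assumptions.

```python
def phrase_factor(num_words):
    # Assumed shape of f(|P|): penalise single words, grow linearly, cap at 6.
    if num_words <= 1:
        return 0.5
    return float(min(num_words, 6))

def score(doc_ids, phrase):
    """Assumed base cluster score s(B) = |B| * f(|P|)."""
    return len(doc_ids) * phrase_factor(len(phrase.split()))

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def merge_base_clusters(base, threshold=0.5):
    """base: phrase -> set of doc ids. Link base clusters whose Jaccard
    coefficient exceeds the threshold; return the connected components."""
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path compression
            p = parent[p]
        return p

    for i, p in enumerate(phrases):
        for q in phrases[i + 1:]:
            if jaccard(base[p], base[q]) > threshold:
                parent[find(p)] = find(q)

    merged = {}
    for p in phrases:
        merged.setdefault(find(p), set()).update(base[p])
    return list(merged.values())

base = {"cat ate": {1, 3}, "ate": {1, 2, 3}, "ate cheese": {1, 2},
        "mouse": {2, 3}, "too": {2, 3}}
merged = merge_base_clusters(base)
print(merged)
```

On this toy input every base cluster ends up linked through the "ate" cluster, so the merge collapses into a single cluster containing all three documents.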

  6. The base cluster graph

  7. Problem of STC • The STC algorithm sometimes generates large clusters of poor quality • No quality measure like tf-idf • No use of single-link, group-average or complete-link criteria • Solution • Map each node of the suffix tree to a unique dimension of an M-dimensional space • M = total number of nodes in the suffix tree, excluding the root node

  8. The New Suffix Tree Similarity Measure • Each document d can be represented as a feature vector of the weights of the M nodes • df(n) = the number of distinct documents that traverse node n • tf(n, d) = the total number of times document d traverses node n • Example: df(b) = 3, tf(b, 1) = 1
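
The definitions of df and tf can be checked on the three example documents with a short Python sketch. It assumes an uncompressed word-level trie, so every contiguous phrase counts as a node (a compact suffix tree would have fewer nodes), and `traversals` is a hypothetical helper. Which node the slide labels b is not preserved in the transcript; the node for "ate" is used here because it gives the same values as the slide's example.

```python
from collections import Counter

def traversals(text):
    """Count how often one document passes through each node (phrase)."""
    words = text.split()
    counts = Counter()
    for start in range(len(words)):            # one suffix per start position
        for end in range(start + 1, len(words) + 1):
            counts[tuple(words[start:end])] += 1
    return counts

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
tf = {d: traversals(t) for d, t in docs.items()}                  # tf[d][node]
df = Counter(node for counts in tf.values() for node in counts)   # df[node]
M = len(df)   # number of non-root nodes in this uncompressed trie

print(df[("ate",)], tf[1][("ate",)])
```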

  9. The New Suffix Tree Similarity Measure • tf-idf formula • Cosine similarity • GAHC algorithm (group-average agglomerative hierarchical clustering)
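
A minimal sketch of this pipeline, under stated assumptions: the transcript does not preserve the tf-idf formula, so the weight w(n, d) = (1 + log tf(n, d)) · log(1 + N / df(n)) used below is an assumption. The clustering step is a plain group-average agglomerative procedure that stops at a hypothetical target of k clusters, and nodes are taken from an uncompressed word-level trie.

```python
import math
from collections import Counter

def traversals(text):
    """tf for one document: phrase (node) -> number of traversals."""
    words = text.split()
    counts = Counter()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            counts[tuple(words[start:end])] += 1
    return counts

def tfidf_vectors(docs):
    tf = {d: traversals(t) for d, t in docs.items()}
    df = Counter(node for counts in tf.values() for node in counts)
    n_docs = len(docs)
    # Assumed weighting: (1 + log tf) * log(1 + N / df).
    return {
        d: {node: (1 + math.log(c)) * math.log(1 + n_docs / df[node])
            for node, c in counts.items()}
        for d, counts in tf.items()
    }

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[node] for node, w in u.items() if node in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def gahc(doc_ids, sim, k):
    """Repeatedly merge the two clusters with the highest average
    pairwise similarity until only k clusters remain."""
    clusters = [[d] for d in doc_ids]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                avg = sum(sim(i, j) for i, j in pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)   # b > a, so index a is stable
    return clusters

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
vectors = tfidf_vectors(docs)
sim = lambda i, j: cosine(vectors[i], vectors[j])
clusters = gahc(list(docs), sim, k=2)
print(clusters)
```

The quadratic search for the best pair keeps the sketch short; a practical implementation would cache pairwise similarities.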

  10. A Closer Look to Suffix Tree Document Model • Efficiency Analysis • Naive construction of the suffix tree takes O(m^2) time • Ukkonen's paper provides an algorithm that builds a suffix tree in O(m) time • Stopword or Stopnode • Stopword - words in the stoplist are relevant to the score s(B) of a base cluster • Stopnode - a node with a high df can be ignored
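
The stopnode idea can be illustrated with a short filter. The slide only says that a node with a high df can be ignored, so the cutoff expressed as a ratio of df to collection size, and the name `drop_stopnodes`, are assumptions.

```python
def drop_stopnodes(df, n_docs, max_df_ratio=0.5):
    """Keep only nodes whose document frequency ratio is at or below the cutoff."""
    return {node: count for node, count in df.items()
            if count / n_docs <= max_df_ratio}

# df values for a few nodes of the three-document example
df = {"ate": 3, "cat ate": 2, "ate cheese": 2, "cheese too": 1}
kept = drop_stopnodes(df, n_docs=3, max_df_ratio=0.7)
print(kept)
```

Here "ate" occurs in every document and is dropped, much as a stopword would be dropped from a word-based model.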

  11. Document Preparing • 1. All posts of the same thread are combined into a single document • 2. All non-word tokens are stripped • 3. All stopwords are identified and removed • 4. The Porter stemming algorithm is applied • 6. The posts containing at least 3 distinct words are selected
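
The preparation steps listed above can be sketched as one small function. Several details are assumptions: the stopword list is a tiny illustrative one, the `stem` parameter defaults to the identity function as a stand-in for a real Porter stemmer (for instance NLTK's `PorterStemmer().stem` could be passed in), and the three-distinct-word filter is applied to the combined thread document.

```python
import re

# Tiny illustrative stoplist; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}

def prepare(threads, stem=lambda w: w, min_distinct=3):
    """threads: thread id -> list of post strings.
    Returns thread id -> list of cleaned words for documents that pass the filter."""
    prepared = {}
    for thread_id, posts in threads.items():
        text = " ".join(posts).lower()            # 1. combine posts of a thread
        tokens = re.findall(r"[a-z]+", text)      # 2. strip non-word tokens
        words = [stem(t) for t in tokens          # 4. stemming (stand-in)
                 if t not in STOPWORDS]           # 3. remove stopwords
        if len(set(words)) >= min_distinct:       # keep documents with enough words
            prepared[thread_id] = words
    return prepared

threads = {
    "t1": ["The cat ate the cheese!", "And the mouse?"],
    "t2": ["lol :)", "ok"],
}
prepared = prepare(threads)
print(prepared)
```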

  12. Cluster Topic Summary Generating • Generating a topic summary involves two important information retrieval tasks • 1. Ranking the documents in a cluster by a quality score • 2. Extracting common phrases as the topic summary

  13. Cluster Topic Summary Generating • Document quality evaluation • Web documents provide additional human assessments for document quality evaluation • View clicks, reply posts and recommend clicks • The top 10% of documents are taken as the representatives of the cluster • The nodes traversed by the representative documents are selected and sorted by their idf in ascending order; finally, the top 5 nodes are selected
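
The selection procedure described above can be sketched as follows. The slides do not say how view clicks, reply posts and recommend clicks are combined, so the quality scores are passed in as plain numbers. The definition idf(n) = log(N / df(n)) and the name `topic_summary` are assumptions.

```python
import math

def topic_summary(cluster, quality, node_docs, n_docs,
                  top_ratio=0.10, top_nodes=5):
    """cluster: doc ids; quality: doc id -> score (how it is computed is assumed);
    node_docs: node -> set of doc ids traversing it in the whole collection."""
    ranked = sorted(cluster, key=lambda d: quality[d], reverse=True)
    n_reps = max(1, math.ceil(len(ranked) * top_ratio))   # top 10%, at least one
    reps = ranked[:n_reps]

    def idf(node):
        return math.log(n_docs / len(node_docs[node]))    # assumed idf definition

    # Nodes traversed by at least one representative, lowest idf first.
    candidates = [n for n, ds in node_docs.items() if ds & set(reps)]
    candidates.sort(key=lambda n: (idf(n), n))            # name breaks idf ties
    return reps, candidates[:top_nodes]

quality = {1: 10, 2: 50, 3: 20}
node_docs = {"ate": {1, 2, 3}, "ate cheese": {1, 2}, "mouse": {2, 3},
             "cat ate": {1, 3}, "too": {2, 3}, "cheese too": {2}}
reps, summary = topic_summary([1, 2, 3], quality, node_docs, n_docs=3, top_nodes=3)
print(reps, summary)
```

Sorting by ascending idf, as the slide states, puts the nodes shared by the most documents first.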

  14. EVALUATION • System-generated clusters C = {C1, C2, …, Ck} • Answer (ground-truth) clusters • Recall (i, j) = • Precision (i, j) =
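
The two formulas are not preserved in the transcript. As a hedged illustration, the sketch below uses the standard set-based definitions, assuming that i indexes an answer class and j a system-generated cluster: recall is the overlap divided by the size of the answer class, and precision is the overlap divided by the size of the system cluster.

```python
def precision_recall(answer_class, system_cluster):
    """Standard set-based precision and recall for one (class, cluster) pair."""
    answer_class, system_cluster = set(answer_class), set(system_cluster)
    overlap = len(answer_class & system_cluster)
    recall = overlap / len(answer_class) if answer_class else 0.0
    precision = overlap / len(system_cluster) if system_cluster else 0.0
    return precision, recall

# Answer class holds documents 1-4; the system cluster holds documents 3-5.
precision, recall = precision_recall({1, 2, 3, 4}, {3, 4, 5})
print(precision, recall)
```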

  15. Document Collections • OHSUMED Document Collection • 8 categories, 800 documents, containing 6,281 distinct words; the average document length is about 110 words • RCV1 Document Collection • 10 groups of documents, containing 19,229 distinct words; the average document length is about 150 words

  16. Results and Discussion

  17. Results and Discussion • STC algorithm - there is no effective measure for evaluating the quality of the clusters during cluster merging • Thus the STC algorithm seldom generated large clusters of high quality in the experiments

  18. Results and Discussion • DS3 document

  19. CONCLUSIONS AND FUTURE WORK • By mapping all nodes of the common suffix tree into an M-dimensional space of the VSD model, the advantages of both the VSD model and the suffix tree model are inherited • The suffix tree similarity measure is very simple, but its implementation is quite difficult • Time efficiency and space efficiency • Applying the new similarity measure to Chinese documents
