Web Document Clustering

Web Document Clustering. Department of Computer Science and Engineering, Southern Methodist University. Wenyi Ni. Why is web document clustering needed? There are 3.3 billion web pages on the internet, and every time you post a query, the search engine returns thousands of records.


Presentation Transcript


  1. Web Document Clustering • Department of Computer Science and Engineering, Southern Methodist University • Wenyi Ni

  2. Why is web document clustering needed? • There are 3.3 billion web pages on the internet. • Every time you post a query, the search engine returns thousands of records. • Did you efficiently find what you wanted? • Web document clustering is a good choice. • An example: www.metacrawler.com

  3. How to represent a web document in a general model? • TF-IDF • Each web document consists of words. • The more words two documents share, the more likely they are to be similar. • Each web document D can be represented in the following form: D = {d1, d2, …, dn}, where n is the total number of distinct words in the document collection. • di represents the presence of the ith word in the document (1 means present, 0 means absent). • The order of the di is determined by the weight.

  4. How to calculate the weight? • tfij is the number of occurrences of the word tj in the web document Di. • idfj is the inverse document frequency. • dfj is the number of web documents in the collection in which the word tj occurs. • n is the total number of web documents in the document collection.
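The weight formula itself did not survive in this transcript. The sketch below assumes the common form w_ij = tf_ij × log(n / df_j), which is consistent with the definitions on the slide but is not confirmed by it. The function name `tfidf_weights` and the whitespace tokenizer are illustrative choices.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {word: weight} dict per document.

    Assumes w_ij = tf_ij * log(n / df_j); the slide's own formula is missing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)                      # total number of documents
    df = Counter()                          # df_j: documents containing word j
    for words in tokenized:
        df.update(set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)                 # tf_ij: occurrences of word j in doc i
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
weights = tfidf_weights(docs)
```

Under this assumed form, a word that occurs in every document (here "ate") gets weight 0, so it contributes nothing to similarity.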

  5. How to calculate the similarity between two web documents? • Jaccard similarity measure • Other common measures: Cosine, Dice, Overlap
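The Jaccard formula on this slide was lost in extraction. As a minimal sketch, the four measures named above can be written for the binary (present/absent) representation from slide 3, treating each document as a set of words. The set-based forms below are the standard definitions and are an assumption about what the slide showed.

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)"""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)"""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def cosine(a, b):
    """|A ∩ B| / sqrt(|A| |B|), the cosine of two 0/1 vectors"""
    a, b = set(a), set(b)
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0

d1 = "cat ate cheese".split()
d2 = "mouse ate cheese too".split()
print(jaccard(d1, d2))  # 2 shared words out of 5 distinct words: 0.4
```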

  6. Agglomerative hierarchical clustering (AHC) 1. Start by regarding each document as an individual cluster. 2. Merge the most similar pair of documents or document clusters (using the similarity measure). 3. Repeat step 2 until all objects are contained within a single cluster, which becomes the root of the tree.
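The three steps can be sketched as a naive loop. The slide does not say how similarity between two multi-document clusters is computed, so this sketch assumes average-link Jaccard similarity; the function names are illustrative.

```python
from itertools import combinations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerative(docs):
    """Return the merge history as a list of (cluster_a, cluster_b) pairs.

    Clusters are frozensets of document indices; the last merge is the root.
    """
    words = [set(d.lower().split()) for d in docs]
    clusters = [frozenset([i]) for i in range(len(docs))]    # step 1

    def sim(ca, cb):
        # Average-link similarity: an assumption, the slide names no linkage.
        total = sum(jaccard(words[i], words[j]) for i in ca for j in cb)
        return total / (len(ca) * len(cb))

    merges = []
    while len(clusters) > 1:                                 # step 3
        a, b = max(combinations(clusters, 2), key=lambda p: sim(*p))  # step 2
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return merges

merges = agglomerative(["cat ate cheese", "cat ate cheese too", "dog chased ball"])
```

Comparing every pair of clusters at every merge is simple to read but slow for large collections.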

  7. K-means clustering 1. Arbitrarily select K documents as seeds; they are the initial centroids of the clusters. 2. Assign every other document to the closest centroid. 3. Recompute the centroid of each cluster. 4. Repeat steps 2 and 3 until the centroid of each cluster no longer changes.
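A minimal sketch of these steps on dense weight vectors follows. Two choices are assumptions made for illustration: the first K vectors serve as the "arbitrary" seeds so that the run is repeatable, and "closest" means smallest squared Euclidean distance.

```python
def kmeans(vectors, k, max_iter=100):
    """Return (assignment, centroids) for a list of equal-length vectors."""
    centroids = [list(v) for v in vectors[:k]]      # step 1: seeds (assumed: first k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each document to the closest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        # Step 4: unchanged assignments mean the centroids will not change either.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

assignment, centroids = kmeans([[0, 0], [10, 10], [0, 1], [10, 11]], k=2)
print(assignment)  # [0, 1, 0, 1]
```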

  8. Some other refinement algorithms using the TF-IDF model • Bisecting K-means • Scatter/Gather

  9. Bisecting K-means 1. Select a cluster to split. (There are several ways to select which cluster to split; no significant difference exists among them in terms of clustering accuracy.) We normally choose the largest cluster or the one with the least overall similarity. 2. Employ the basic K-means algorithm to subdivide the chosen cluster. 3. Repeat step 2 a constant number of times, then perform the split that produces the clusters with the highest overall similarity. 4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
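The loop below is a rough sketch of these four steps, not a tuned implementation. It assumes distinct input vectors and k no larger than their number, always splits the largest cluster, and uses the lowest sum of squared distances to the centroid as a stand-in for "highest overall similarity". The parameter names `trials` and `seed` are illustrative.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vs):
    return [sum(col) / len(vs) for col in zip(*vs)]

def sse(vs):
    """Sum of squared distances to the centroid (lower means more cohesive)."""
    c = mean(vs)
    return sum(sq_dist(v, c) for v in vs)

def two_means(vs, rng, iters=20):
    """Basic K-means with K=2; returns the last split with two non-empty parts."""
    centroids = rng.sample(vs, 2)
    parts = None
    for _ in range(iters):
        new = ([], [])
        for v in vs:
            idx = 1 if sq_dist(v, centroids[0]) > sq_dist(v, centroids[1]) else 0
            new[idx].append(v)
        if not new[0] or not new[1]:
            break
        parts = new
        centroids = [mean(new[0]), mean(new[1])]
    return parts

def bisecting_kmeans(vectors, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(vectors)]
    while len(clusters) < k:                          # step 4
        target = max(clusters, key=len)               # step 1: largest cluster
        clusters.remove(target)
        candidates = [two_means(target, rng) for _ in range(trials)]  # steps 2-3
        best = min(candidates, key=lambda p: sse(p[0]) + sse(p[1]))
        clusters.extend(best)
    return clusters

points = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]]
clusters = bisecting_kmeans(points, k=3)
```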

  10. How to represent a web document in the STC model? • What is STC? Suffix Tree Clustering. • The whole web document is treated as a string. • Base clusters are identified by creating an inverted index of strings for the web document collection.

  11. A suffix tree example (courtesy of Zamir): • Three strings. Each string is a document. • Cat ate cheese. • Mouse ate cheese too. • Cat ate mouse too.
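The tree diagram for this example is not part of the transcript. As a rough stand-in, the sketch below builds a plain inverted index from every word-level phrase to the documents that contain it, rather than a real suffix tree. It yields a superset of the phrases that label suffix tree nodes and is quadratic in document length, so it only suits small inputs. The `min_docs=2` cutoff is an assumption.

```python
from collections import defaultdict

def phrase_index(docs):
    """Map every contiguous word phrase to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs, start=1):
        words = doc.lower().split()
        for start in range(len(words)):                    # every suffix ...
            for end in range(start + 1, len(words) + 1):   # ... and each of its prefixes
                index[" ".join(words[start:end])].add(doc_id)
    return index

def base_clusters(docs, min_docs=2):
    """Keep phrases shared by at least min_docs documents (assumed cutoff)."""
    return {p: ids for p, ids in phrase_index(docs).items() if len(ids) >= min_docs}

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
for phrase, ids in sorted(clusters.items()):
    print(phrase, sorted(ids))
```

For the three example strings this keeps seven phrases, among them "ate" (documents 1, 2, 3), "ate cheese" (1, 2) and "cat ate" (1, 3).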

  12. STC algorithm 1. Document cleaning. Strip word prefixes and suffixes and reduce plurals to singular. Sentence boundaries are marked, and non-word tokens (such as numbers, HTML tags and most punctuation) are stripped. 2. Identify base clusters. Create an inverted index of strings from the web document collection using a suffix tree. Each node of the suffix tree represents a group of documents and a string that is common to all of them; the label of the node is the common string. Each node represents a base cluster.

  13. STC algorithm (cont.) 3. Score base clusters. Each base cluster is assigned a score. • The score formula: S(B) = |B| × f(|P|) • |B| is the number of documents in base cluster B. • |P| is the number of words in phrase P that have a non-zero score. • The function f penalizes single-word phrases, is linear for phrases that are two to six words long, and becomes constant for longer phrases.
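The slide gives only the shape of f, so the constants in this sketch (0.5 for a single word, a cap of 6) are placeholders chosen for illustration. Treating stop words as the words with zero score is also an assumption.

```python
def f(num_words):
    """Phrase-length factor; the constants are illustrative assumptions."""
    if num_words == 0:
        return 0.0
    if num_words == 1:
        return 0.5                      # penalize single-word phrases
    return float(min(num_words, 6))     # linear for 2..6 words, constant beyond

def base_cluster_score(doc_ids, phrase, stopwords=frozenset()):
    """S(B) = |B| * f(|P|), where |P| counts only words with a non-zero score."""
    effective = [w for w in phrase.split() if w not in stopwords]
    return len(doc_ids) * f(len(effective))

print(base_cluster_score({1, 2, 3}, "ate"))       # 3 * 0.5 = 1.5
print(base_cluster_score({1, 2}, "ate cheese"))   # 2 * 2.0 = 4.0
```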

  14. STC algorithm (cont.) 4. Combine base clusters. The similarity measure used to combine base clusters is based on the overlap of their document sets. Given two base clusters Bx and By with sizes |Bx| and |By|, |Bx ∩ By| is the number of documents common to both. The similarity of Bx and By is defined to be 1 if |Bx ∩ By| / |Bx| > 0.5 and |Bx ∩ By| / |By| > 0.5; otherwise it is 0. Two base clusters are connected if they have a similarity of 1. Using a single-link clustering algorithm, all connected base clusters are clustered together. All the documents in these base clusters constitute a web document cluster.
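Single-link clustering over this 0/1 similarity amounts to finding the connected components of the base cluster graph. A small union-find sketch, assuming every base cluster is a non-empty set of document ids:

```python
def similar(bx, by):
    """1 if the overlap exceeds half of both base clusters, as defined above."""
    common = len(bx & by)
    return common / len(bx) > 0.5 and common / len(by) > 0.5

def merge_base_clusters(base):
    """base: list of non-empty sets of document ids. Returns merged document sets."""
    parent = list(range(len(base)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(base)):
        for j in range(i + 1, len(base)):
            if similar(base[i], base[j]):
                parent[find(i)] = find(j)   # connect the two base clusters

    merged = {}
    for i, docs in enumerate(base):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())

result = merge_base_clusters([{1, 2}, {1, 2, 3}, {7, 8}, {8, 9}])
```

Here {7, 8} and {8, 9} stay separate because their overlap is exactly half of each, and the threshold is strict.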

  15. Link-based model • Idea: web pages that share common links with each other are very likely to be tightly related. • Each web document P is represented as two vectors: Pout (N dimensions) and Pin (M dimensions). • Pout,i indicates whether web document P has the ith item of vector Pout as an out-link. • Pin,j indicates whether web document P has the jth item of vector Pin as an in-link. • For example, Pout = (link1, link2, …, linkN) covers all the out-links in the web document collection; Pout,2 = 1 means this document has link2 as an out-link.
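One possible way to build the two vectors is shown below. The input layout (a dict of pages with "out" and "in" link sets) and the page and link names are hypothetical, chosen only to make the example self-contained.

```python
def link_vectors(pages):
    """pages: {page: {"out": set of links, "in": set of links}} (hypothetical layout).

    Returns (out_dims, in_dims, vectors), where vectors[page] = (Pout, Pin).
    """
    out_dims = sorted(set().union(*(p["out"] for p in pages.values())))  # N out-links
    in_dims = sorted(set().union(*(p["in"] for p in pages.values())))    # M in-links
    vectors = {
        name: ([int(link in p["out"]) for link in out_dims],
               [int(link in p["in"]) for link in in_dims])
        for name, p in pages.items()
    }
    return out_dims, in_dims, vectors

pages = {
    "p1": {"out": {"a", "b"}, "in": {"x"}},
    "p2": {"out": {"b", "c"}, "in": {"x", "y"}},
}
out_dims, in_dims, vectors = link_vectors(pages)
print(vectors["p1"])  # ([1, 1, 0], [1, 0])
```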

  16. Link-based algorithm 1. Filter irrelevant web documents. A document is regarded as irrelevant if the sum of its in-links and out-links is less than 2. 2. Use the near-common links of a cluster to guarantee intra-cluster cohesiveness. Every cluster should have at least one 30% near-common link. 3. Assign each web document to a cluster, generating base clusters. • The similarity between the document and the corresponding cluster is above the similarity threshold. • The document has a link in common with the near-common links of the corresponding cluster. 4. Generate the final clusters by merging base clusters.

  17. How to evaluate the quality of the resulting clusters • Entropy 1) For each cluster, the class distribution of the data (we usually use the TREC5 and TREC6 document collections) is calculated first. 2) Using this class distribution, the entropy of each cluster j is calculated: Ej = −Σi pij log(pij), where pij is the probability that a member of cluster j belongs to class i. 3) The best quality is reached when all the documents in a cluster fall into the same class, known before clustering.
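The entropy of a single cluster can be computed directly from the known class labels of its members. This sketch uses the natural logarithm, which is an assumption since the slide does not state the base.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """E_j = -sum_i p_ij * log(p_ij) for one cluster, given its members' classes."""
    total = len(labels)
    return -sum((count / total) * math.log(count / total)
                for count in Counter(labels).values())

pure = cluster_entropy(["sports", "sports", "sports"])   # every member in one class
mixed = cluster_entropy(["sports", "politics"])          # evenly split over two classes
```

A pure cluster has entropy 0, and an even split over two classes gives log 2, about 0.693.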

  18. How to evaluate the quality of the resulting clusters (cont.) • F-measure 1) Calculate the recall and precision of each cluster for each given class. 2) For cluster j and its corresponding class i: Recall(i, j) = nij / ni, Precision(i, j) = nij / nj, F(i, j) = (2 × Recall(i, j) × Precision(i, j)) / (Precision(i, j) + Recall(i, j)), where nij is the number of documents of class i in cluster j, ni is the size of class i and nj is the size of cluster j.
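A short sketch of the three formulas, with the cluster and the class each given as a set of document ids. Returning 0 when the two sets share no documents is a convention added here to avoid dividing by zero.

```python
def f_measure(cluster, cls):
    """F(i, j) for cluster j and class i, both given as sets of document ids."""
    n_ij = len(cluster & cls)            # documents of class i inside cluster j
    if n_ij == 0:
        return 0.0                       # assumed convention for no overlap
    recall = n_ij / len(cls)             # n_ij / n_i
    precision = n_ij / len(cluster)      # n_ij / n_j
    return 2 * recall * precision / (precision + recall)

score = f_measure({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
# recall = 2/6, precision = 2/4, so F is about 0.4
```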

  19. Algorithm evaluation and comparison • TF-IDF based AHC: good cluster quality; time complexity O(n²). • TF-IDF based K-means: linear time complexity O(Kmn); sensitive to outliers. • STC: best for incremental clustering; linear time complexity O(n); has a memory problem. • Link based: linear time complexity O(mn); low dimension; good cluster quality.

  20. Future work • Each algorithm has its advantages and disadvantages. We need to refine these algorithms, and sometimes we need to make trade-offs. • There is still room for improvement: 1. Improve the entropy or F-measure of the resulting clusters (the evaluation value is under 0.6 for almost all algorithms, while the best is 1). 2. Decrease the response time (we often need to process a large document collection, so we need a fast algorithm).

  21. End
