
Web Document Clustering: A Feasibility Demonstration

This document explores Web Document Clustering (WDC) as a response to the low precision of web search engines and demonstrates the feasibility of the Suffix Tree Clustering (STC) algorithm. STC efficiently identifies sets of documents that share common phrases and satisfies the key requirements: relevance, browsable summaries, overlap, snippet-tolerance, speed, and incrementality. The procedure consists of document cleaning, identifying base clusters with a suffix tree, and combining the base clusters. STC processes documents incrementally and allows clusters to overlap; overlap and multi-word phrases are both critical to its success. Experimental evaluations highlight the advantages of STC for clustering web search engine results.

jbirch




Presentation Transcript


  1. Web Document Clustering: A Feasibility Demonstration Hui Han CSE dept. PSU 10/15/01

  2. Motivation Low precision of Web search engines—hard for users to locate expected information quickly… Solutions: • Increase precision– by filtering methods? by advanced pruning options?… • Web Document Clustering  - Cluster documents returned by search engine in response to a query and re-present them

  3. Key Requirements for Web Document Clustering • Relevance • Browsable Summaries • Overlap • Snippet-tolerance • “snippet”: a small piece of information or a brief extract • Speed • Incrementality

  4. Suffix Tree Clustering (STC) • STC is a linear-time clustering algorithm based on a suffix tree, which efficiently identifies sets of documents that share common phrases. • STC satisfies the key requirements: • STC treats a document as a string, making use of proximity information between words. • STC is a novel, incremental, O(n)-time algorithm. • STC succinctly summarizes clusters’ contents for users. • Quick because it works on a smaller set of documents, and because of its incrementality • …

  5. Operating procedure of STC • Step 1: Document “cleaning” • HTML -> plain text • Word stemming • Mark sentence boundaries • Remove non-word tokens • Step 2: Identifying Base Clusters • Step 3: Combining Base Clusters
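The cleaning step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: `clean` and `naive_stem` are hypothetical names, the stemmer is a crude suffix stripper standing in for a real stemming algorithm, and sentence boundaries are assumed to fall at '.', '!' and '?'.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def naive_stem(word):
    # Placeholder stemmer (assumption): strips a few common English suffixes.
    # A real system would use a proper stemming algorithm instead.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(html):
    """Return a list of sentences, each a list of stemmed word tokens."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.parts)          # HTML -> plain text
    sentences = []
    for raw in re.split(r"[.!?]+", text):     # mark sentence boundaries
        # Keep alphabetic tokens only; numbers and punctuation are dropped.
        words = [naive_stem(w) for w in re.findall(r"[a-z]+", raw.lower())]
        if words:
            sentences.append(words)
    return sentences
```

Each sentence comes back as its own word list, so later steps can avoid building phrases that cross a sentence boundary.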

  6. Step 2: Identifying Base Clusters (Suffix Tree) * STC treats a document as a set of strings… • Suffix tree of string S: a compact tree containing all the suffixes of S • Suffix of a word: lovely • Suffix of a string: “Friends” is a lovely show. • Precise definition: • A suffix tree is a rooted, directed tree. • Each internal node has at least 2 children. • Each edge is labeled with a non-empty sub-string of S. The label of a node is defined to be the concatenation of the edge-labels on the path from the root to that node. • No two edges out of the same node can have edge-labels that begin with the same word (this is what makes the tree compact).

  7. Ex. A Suffix Tree of Strings • String1: “cat ate cheese”, • String2: “mouse ate cheese too” • String3: “cat ate mouse too”

  8. Base clusters Base clusters corresponding to the suffix tree nodes
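To make the link between suffix tree nodes and base clusters concrete, here is a brute-force sketch for the three example strings. It is an assumption-laden stand-in, not a suffix tree: instead of building the tree in linear time, it enumerates every word-level phrase (quadratic per string) and keeps the phrases that would label a branching node, i.e. phrases that occur in at least two documents and continue in at least two different ways. The name `base_clusters` and the `"$"` end marker are hypothetical.

```python
from collections import defaultdict


def base_clusters(documents):
    """Map each shared phrase to the set of ids of the documents containing it."""
    docs_with = defaultdict(set)      # phrase -> ids of documents containing it
    continuations = defaultdict(set)  # phrase -> distinct ways the phrase continues
    for doc_id, text in enumerate(documents, start=1):
        words = tuple(text.split())
        for start in range(len(words)):                   # every suffix ...
            for end in range(start + 1, len(words) + 1):  # ... and each of its prefixes
                phrase = words[start:end]
                docs_with[phrase].add(doc_id)
                # Next word, or a per-document end marker if the string stops here.
                nxt = words[end] if end < len(words) else ("$", doc_id)
                continuations[phrase].add(nxt)
    return {
        " ".join(phrase): docs
        for phrase, docs in docs_with.items()
        # Shared by 2+ documents, and a branching point rather than a link in a chain.
        if len(docs) >= 2 and len(continuations[phrase]) >= 2
    }


clusters = base_clusters(
    ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
)
for phrase, docs in sorted(clusters.items()):
    print(phrase, sorted(docs))
```

Note that "cat" alone is not reported: it is always followed by "ate", so in a compact tree it is absorbed into the node labeled "cat ate".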

  9. Cluster score • s(B) = |B| * f(|P|) • |B| is the number of documents in base cluster B • P is the phrase of base cluster B, and |P| is the number of words in P that have a non-zero score • Zero-score words: stopwords, and words that appear in too few (<3) or too many (>40%) of the documents
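The score can be sketched in a few lines. The slide does not define f, so `phrase_weight` below is purely an assumed shape (a penalty for single-word phrases, linear growth up to six words, constant afterwards); the function names, the `doc_freq` dictionary, and the constant 0.5 are likewise illustrative. Only the formula s(B) = |B| * f(|P|) and the zero-score rules come from the slide.

```python
def phrase_weight(effective_words):
    # Assumed shape of f (NOT given on the slide): single words are penalized,
    # growth is linear up to six words, and longer phrases get no extra credit.
    if effective_words == 0:
        return 0.0
    if effective_words == 1:
        return 0.5
    return float(min(effective_words, 6))


def effective_length(phrase, doc_freq, n_docs, stopwords,
                     min_docs=3, max_fraction=0.4):
    """Count the words of the phrase that have a non-zero score (this is |P|)."""
    count = 0
    for word in phrase:
        df = doc_freq.get(word, 0)  # number of documents containing the word
        if word in stopwords or df < min_docs or df > max_fraction * n_docs:
            continue                # zero-score word: stopword, too rare, or too common
        count += 1
    return count


def cluster_score(doc_ids, phrase, doc_freq, n_docs, stopwords):
    """s(B) = |B| * f(|P|)."""
    length = effective_length(phrase, doc_freq, n_docs, stopwords)
    return len(doc_ids) * phrase_weight(length)
```

For instance, with 100 documents, a base cluster of 5 documents whose phrase is "the suffix tree" has two effective words when "the" is a stopword, so under the assumed f it scores 5 * 2 = 10.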

  10. Step 3: Combining Base Clusters • Merge base clusters with a high overlap in their document sets • Documents may share multiple phrases. • Similarity of Bm and Bn (0.5 is a parameter): similarity(Bm, Bn) = 1 iff |Bm ∩ Bn| / |Bm| > 0.5 and |Bm ∩ Bn| / |Bn| > 0.5; 0 otherwise
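The binary similarity test translates directly into code. In this sketch the base clusters are represented as Python sets of document ids and the function name `similarity` is illustrative; the 0.5 threshold is the parameter mentioned on the slide.

```python
def similarity(bm, bn, threshold=0.5):
    """Return 1 if the overlap exceeds the threshold relative to BOTH sets, else 0."""
    if not bm or not bn:
        return 0  # an empty base cluster overlaps with nothing
    common = len(bm & bn)
    if common / len(bm) > threshold and common / len(bn) > threshold:
        return 1
    return 0


print(similarity({1, 3}, {1, 2, 3}))  # 2/2 and 2/3 both exceed 0.5
print(similarity({1, 2}, {2, 3}))     # 1/2 does not exceed 0.5
```

Because the comparison is strict, two clusters of two documents each that share exactly one document are not considered similar.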

  11. Base Cluster Graph • Node: base cluster • Edge: the similarity between the two base clusters is 1 • What if “ate” is in the stop word list?
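One way to read the base cluster graph is that each connected component becomes a final cluster whose documents are the union of its base clusters. The sketch below follows that reading; `merge_base_clusters` is a hypothetical name, and the hard-coded `example` dictionary lists the base clusters that the three strings on slide 7 give rise to.

```python
def merge_base_clusters(base, threshold=0.5):
    """base maps phrase -> set of doc ids; returns [(phrases, docs)] per component."""

    def similar(a, b):
        common = len(a & b)
        return common / len(a) > threshold and common / len(b) > threshold

    phrases = list(base)
    neighbours = {p: set() for p in phrases}
    for i, p in enumerate(phrases):
        for q in phrases[i + 1:]:
            if similar(base[p], base[q]):  # edge iff similarity is 1
                neighbours[p].add(q)
                neighbours[q].add(p)

    seen, clusters = set(), []
    for p in phrases:
        if p in seen:
            continue
        stack, component = [p], []
        seen.add(p)
        while stack:                       # depth-first walk of one component
            current = stack.pop()
            component.append(current)
            for nb in neighbours[current]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        docs = set().union(*(base[c] for c in component))
        clusters.append((sorted(component), docs))
    return clusters


example = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "ate cheese": {1, 2},
    "cheese": {1, 2}, "mouse": {2, 3}, "too": {2, 3},
}
print(len(merge_base_clusters(example)))
without_ate = {p: d for p, d in example.items() if p != "ate"}
print(len(merge_base_clusters(without_ate)))
```

In this sketch, "ate" is similar to every other base cluster, so all six collapse into one component. Dropping the "ate" node, as would happen if "ate" were a stopword, leaves three components: {"cat ate"}, {"ate cheese", "cheese"} and {"mouse", "too"}.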

  12. STC is Incremental • As each document arrives from the web, we • “clean” it (linear with collection size) • add it to the suffix tree; each node that is updated or created as a result is tagged (linear) • update the relevant base clusters and recalculate the similarity of these base clusters to the rest of the k highest-scoring base clusters (linear) • check for any changes to the final clusters (linear) • score and sort the final clusters, choose top 10... (linear)
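The "add a document, tag what changed" idea can be illustrated with a small class. This is a simplified sketch, assuming a plain dictionary of phrases in place of the suffix tree, so it does not achieve the linear bounds quoted above; `IncrementalPhraseIndex` and its methods are hypothetical names.

```python
from collections import defaultdict


class IncrementalPhraseIndex:
    """Dictionary stand-in for the suffix tree: phrase -> ids of documents containing it."""

    def __init__(self):
        self.docs_with = defaultdict(set)
        self.n_docs = 0

    def add(self, words):
        """Insert one cleaned document; return the phrases ("nodes") that were touched."""
        self.n_docs += 1
        doc_id = self.n_docs
        touched = set()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase = tuple(words[start:end])
                if doc_id not in self.docs_with[phrase]:
                    self.docs_with[phrase].add(doc_id)
                    touched.add(phrase)  # tag the updated or newly created entry
        return touched

    def shared_phrases(self):
        """Phrases currently shared by at least two documents."""
        return {p: d for p, d in self.docs_with.items() if len(d) >= 2}


index = IncrementalPhraseIndex()
index.add(["cat", "ate", "cheese"])
touched = index.add(["mouse", "ate", "cheese", "too"])
print(sorted(index.shared_phrases()))
```

Only the base clusters whose phrases appear in `touched` need their scores and similarities recomputed after each arrival, which is what makes it possible to do the work while later results are still in transit.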

  13. STC allows cluster overlap… • Why is overlap reasonable? A document often has more than one topic • STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents • But the clusters must not be so similar that they would be merged into one

  14. Experiments • Cluster the output of a meta search engine using the STC algorithm • Representative of Web search engines • Web clustering, instead of an “IR corpus”

  15. Evaluation: Precision • Precision of different clustering algorithms

  16. Cluster overlap & multi-word phrases are critical to STC’s success

  17. Cluster overlap & multi-word phrases are specifically effective to STC’s success

  18. Why? • Allowing a document to appear in multiple clusters is only advantageous if that document is relevant; placing an irrelevant document in multiple clusters can only hurt cluster quality

  19. Snippets versus Whole Document

  20. Execution time • Incremental – use “free” CPU time when the system is waiting for the search engine results to arrive over the web – speedy

  21. Conclusion • The identification of the unique requirements of clustering Web search engine results • The definition of STC – an incremental, O(n) time clustering algorithm that satisfies these requirements • The first experimental evaluation of clustering algorithms on Web search engine results
