Web Document Clustering: A Feasibility Demonstration Hui Han CSE dept. PSU 10/15/01
Motivation Low precision of Web search engines: hard for users to locate expected information quickly… Solutions: • Increase precision: by filtering methods? by advanced pruning options?… • Web Document Clustering: cluster the documents returned by a search engine in response to a query and re-present them
Key Requirements for Web Document Clustering • Relevance • Browsable summaries • Overlap • Snippet-tolerance • “snippet”: a small piece of information or a brief extract • Speed • Incrementality
Suffix Tree Clustering (STC) • STC is a linear-time clustering algorithm based on a suffix tree, which efficiently identifies sets of documents that share common phrases. • STC satisfies the key requirements: • STC treats a document as a string, making use of proximity information between words. • STC is a novel, incremental, O(n)-time algorithm. • STC succinctly summarizes clusters’ contents for users. • Quick, because it works on a smaller set of documents and is incremental • …
Operating procedure of STC • Step 1: Document “cleaning” • HTML -> plain text • Word stemming • Mark sentence boundaries • Remove non-word tokens • Step 2: Identifying Base Clusters • Step 3: Combining Base Clusters
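The cleaning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `clean` and `light_stem` are hypothetical, the regular expressions are simplistic, and `light_stem` is a stand-in that strips a plural "s" rather than a real stemming algorithm. The sketch also stems last, after tokenizing, which is a convenience and not the order listed on the slide.

```python
import re
from html import unescape


def light_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a plural "s" only.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean(html):
    """Return a list of sentences, each a list of stemmed lowercase words."""
    # HTML -> plain text: drop tags, then decode entities such as &amp;
    text = unescape(re.sub(r"<[^>]+>", " ", html))
    # Mark sentence boundaries by splitting on common punctuation
    sentences = re.split(r"[.!?;:]+", text)
    cleaned = []
    for sentence in sentences:
        # Remove non-word tokens: keep alphabetic runs only
        words = re.findall(r"[a-z]+", sentence.lower())
        if words:
            cleaned.append([light_stem(w) for w in words])
    return cleaned


sentences = clean("<p>Cats ate cheese. Mice ate 2 cheeses!</p>")
```

Keeping sentences separate matters for the next step, because phrases are not meant to span sentence boundaries.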
Step 2: Identifying Base Clusters - Suffix Tree * STC treats a document as a set of strings… • Suffix tree of string S: a compact tree containing all the suffixes of S • Suffix of a word: lovely • Suffix of a string: “Friends” is a lovely show. • Precise definition: • A suffix tree is a rooted, directed tree. • Each internal node has at least two children. • Each edge is labeled with a non-empty sub-string of S. The label of a node is defined to be the concatenation of the edge-labels on the path from the root to that node. • No two edges out of the same node can have edge-labels that begin with the same word (this is what makes the tree compact).
Ex. A Suffix Tree of Strings • String 1: “cat ate cheese” • String 2: “mouse ate cheese too” • String 3: “cat ate mouse too”
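To make the example concrete, the sketch below finds the phrases that would label internal suffix tree nodes shared by two or more of the three strings. It is a naive quadratic enumeration, not the linear-time suffix tree construction that STC relies on; the function name `base_clusters` and the per-document end marker are assumptions made for illustration.

```python
from collections import defaultdict


def base_clusters(docs):
    """Map each shared, branching phrase to the ids of documents containing it."""
    phrase_docs = defaultdict(set)   # phrase -> ids of documents containing it
    next_tokens = defaultdict(set)   # phrase -> distinct tokens that follow it
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for start in range(len(words)):                   # every suffix
            for end in range(start + 1, len(words) + 1):  # every prefix of it
                phrase = tuple(words[start:end])
                phrase_docs[phrase].add(doc_id)
                # A per-document end marker plays the role of a string terminator.
                follower = words[end] if end < len(words) else ("$", doc_id)
                next_tokens[phrase].add(follower)
    # An internal node needs at least two distinct continuations;
    # a base cluster needs at least two documents.
    return {
        " ".join(phrase): ids
        for phrase, ids in phrase_docs.items()
        if len(ids) >= 2 and len(next_tokens[phrase]) >= 2
    }


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
# clusters == {"cat ate": {1, 3}, "ate": {1, 2, 3}, "ate cheese": {1, 2},
#              "cheese": {1, 2}, "mouse": {2, 3}, "too": {2, 3}}
```

Note that "cat" is shared by strings 1 and 3 but is always followed by "ate", so it sits in the middle of an edge rather than at a node and is not reported on its own.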
Base clusters Base clusters corresponding to the suffix tree nodes
Cluster score • s(B) = |B| * f(|P|) • |B| is the number of documents in base cluster B • P is the phrase of base cluster B, and |P| is the number of words in P that have a non-zero score • Zero-score words: stopwords, and words that appear in too few (<3) or too many (>40%) documents
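The scoring rule can be sketched as follows. The slide does not define f, so the shape used here (a penalty for single-word phrases, linear growth up to six words, constant afterwards) and its constants are assumptions for illustration only; the names `effective_length`, `f`, and `score`, and the sample document frequencies, are likewise hypothetical.

```python
def effective_length(phrase, doc_freq, n_docs, stopwords,
                     min_docs=3, max_fraction=0.4):
    """Count the words of the phrase that have a non-zero score."""
    count = 0
    for word in phrase.split():
        df = doc_freq.get(word, 0)  # number of documents containing the word
        if word in stopwords or df < min_docs or df > max_fraction * n_docs:
            continue                # zero-score word
        count += 1
    return count


def f(length):
    # Assumed shape, not taken from the slides.
    if length == 0:
        return 0.0
    if length == 1:
        return 0.5
    return float(min(length, 6))


def score(cluster_size, phrase, doc_freq, n_docs, stopwords):
    """s(B) = |B| * f(|P|)"""
    return cluster_size * f(effective_length(phrase, doc_freq, n_docs, stopwords))


# Made-up document frequencies for a collection of 100 documents.
doc_freq = {"suffix": 5, "tree": 6, "the": 90, "rare": 1}
s = score(4, "the suffix tree", doc_freq, n_docs=100, stopwords={"the"})
```

Here "the" is a stopword, so the phrase has two scoring words and the base cluster of four documents scores 4 * f(2).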
Step 3: Combining Base Clusters • Merge base clusters with a high overlap in their document sets • Documents may share multiple phrases. • Similarity of $B_m$ and $B_n$ (0.5 is a parameter): $\mathrm{sim}(B_m, B_n) = \begin{cases} 1 & \text{iff } |B_m \cap B_n| / |B_m| > 0.5 \text{ and } |B_m \cap B_n| / |B_n| > 0.5 \\ 0 & \text{otherwise} \end{cases}$
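The similarity rule translates almost directly into code. This is a minimal sketch in which base clusters are represented as Python sets of document ids; the function name `similarity` is an assumption, and both sets are assumed to be non-empty.

```python
def similarity(docs_m, docs_n, threshold=0.5):
    """Return 1 if the overlap exceeds the threshold relative to both clusters, else 0."""
    overlap = len(docs_m & docs_n)
    both_high = (overlap / len(docs_m) > threshold
                 and overlap / len(docs_n) > threshold)
    return int(both_high)


sim_ab = similarity({1, 3}, {1, 2, 3})   # overlap 2: 2/2 and 2/3 both exceed 0.5
sim_cd = similarity({1, 2}, {2, 3})      # overlap 1: 1/2 does not exceed 0.5
```

Because the comparison is strict, two clusters of two documents that share exactly one document are not considered similar.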
Base Cluster Graph Node: base cluster Edge: connects two base clusters whose similarity is 1 What if “ate” is in the stop word list?
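Final clusters are the connected components of this graph. The sketch below computes them with a small union-find over the six base clusters that the three example strings produce; the function names and the dictionary representation are assumptions for illustration.

```python
def similar(a, b, threshold=0.5):
    overlap = len(a & b)
    return overlap / len(a) > threshold and overlap / len(b) > threshold


def merge_base_clusters(base, threshold=0.5):
    """Return connected components as (sorted phrases, union of documents) pairs."""
    names = list(base)
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Add an edge (union) for every pair of similar base clusters.
    for i, m in enumerate(names):
        for n in names[i + 1:]:
            if similar(base[m], base[n], threshold):
                parent[find(m)] = find(n)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return [
        (sorted(phrases), set().union(*(base[p] for p in phrases)))
        for phrases in groups.values()
    ]


base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
merged = merge_base_clusters(base)
without_ate = merge_base_clusters({p: d for p, d in base.items() if p != "ate"})
```

On this example every base cluster is linked through "ate", so `merged` holds a single cluster covering all three strings. Dropping the base cluster whose phrase is just "ate", as a stop word list would, leaves three components: "cat ate" alone, "cheese" with "ate cheese", and "mouse" with "too".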
STC is Incremental • As each document arrives from the Web, we • “clean” it (linear with collection size) • Add it to the suffix tree. Each node that is updated or created as a result is tagged (linear) • Update the relevant base clusters and recalculate the similarity of these base clusters to the rest of the k highest-scoring base clusters (linear) • Check for any changes to the final clusters (linear) • Score and sort the final clusters, choose the top 10... (linear)
STC allows cluster overlap… • Why is overlap reasonable? A document often has more than one topic • STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents • But clusters must not be so similar that they are merged into one cluster..
Experiments • Cluster output of meta search engine, using STC alg. • Representative of Web search engines • WEB clustering, instead of “IR corpus”
Evaluation: Precision • Precision of different clustering algorithms
Cluster overlap & multi-word phrases are critical to STC’s success
Cluster overlap & multi-word phrases are specifically effective for STC
Why? • Allowing a document to appear in multiple clusters is only advantageous if that document is relevant; placing an irrelevant document in multiple clusters can only hurt cluster quality
Execution time • Incremental – use “free” CPU time when the system is waiting for the search engine results to arrive over the web – speedy
Conclusion • The identification of the unique requirements of document clustering of Web search engine results • The definition of STC, an incremental, O(n)-time clustering algorithm that satisfies these requirements • The first experimental evaluation of clustering algorithms on Web search engine results