Web Document Clustering

Web Document Clustering By Sang-Cheol Seok

1. Introduction: Web document clustering? Why?
Two results for the same query ‘amazon’:
• Google: currently the most powerful search engine
• MetaCrawler: a search engine that clusters the retrieved web documents

2. Approaches
• Using contents of documents
• Using users’ usage logs
• Using current search engines
• Using hyperlinks
• Other classical methods

(1) Using Contents of Documents
• Clusters are created from the snippets returned by web search engines.
• Clusters based on snippets are almost as good as clusters created from the full text of the Web documents.
• Suffix Tree Clustering (STC): an incremental, O(n)-time algorithm
• Three logical steps: (1) document “cleaning”, (2) identifying base clusters using a suffix tree, and (3) combining these base clusters into clusters
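The three STC steps can be sketched in a few lines. This is a simplified illustration, not the STC implementation: a dictionary of word n-grams stands in for the suffix tree (so it is not O(n)), and the function names, the `max_len` phrase length, and the 0.5 overlap threshold for merging are assumptions made for the example.

```python
import re
from collections import defaultdict


def clean(snippet):
    """Step 1: lowercase the snippet and keep only letters, digits, spaces."""
    return re.sub(r"[^a-z0-9 ]+", " ", snippet.lower()).split()


def base_clusters(snippets, max_len=3, min_docs=2):
    """Step 2 (simplified): map each phrase to the snippets that contain it.

    A suffix tree produces this phrase -> documents mapping efficiently;
    the n-gram dictionary used here is only a stand-in for illustration.
    """
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    # A base cluster is a phrase shared by at least `min_docs` snippets.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}


def merge_clusters(base, threshold=0.5):
    """Step 3: merge base clusters whose document sets overlap heavily."""
    phrases = list(base)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(base[a] & base[b])
            # Assumed rule: merge only if the overlap exceeds the
            # threshold relative to BOTH base clusters.
            if (common / len(base[a]) > threshold
                    and common / len(base[b]) > threshold):
                parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= base[p]
    return sorted(sorted(docs) for docs in merged.values())
```

Calling `merge_clusters(base_clusters(snippets))` on a list of snippet strings returns lists of snippet indices. A snippet may appear in more than one list, which is how this sketch allows overlapping clusters.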

(2) Using users’ usage logs
• Advantage: relevancy information is objectively reflected by the usage logs
• An experimental result on www.nasa.gov/

(3) Using current web search engines: MetaCrawler
• Step 1: When MetaCrawler receives a query, it posts the query to multiple search engines in parallel.
• Step 2: It performs sophisticated pruning on the responses returned (it prunes 75% of the returned responses as irrelevant, outdated, or unavailable).
• MetaCrawler was developed at the University of Washington.
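The two steps can be illustrated with a minimal sketch. It is not MetaCrawler's actual code: the engines are hypothetical stub functions returning `(url, snippet)` pairs, the URLs are placeholders, and the pruning rule (`mentions_query`) is a deliberately crude assumption standing in for the "sophisticated pruning" described above.

```python
from concurrent.futures import ThreadPoolExecutor


def metasearch(query, engines, is_usable):
    """Post `query` to every engine in parallel, then prune the responses."""
    # Step 1: query all engines concurrently; map() keeps engine order.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Step 2: drop duplicates and responses the filter rejects.
    seen, kept = set(), []
    for results in result_lists:
        for url, snippet in results:
            if url in seen or not is_usable(query, url, snippet):
                continue
            seen.add(url)
            kept.append((url, snippet))
    return kept


# Hypothetical stand-ins for real search engines.
def engine_a(query):
    return [("http://example.com/a", "amazon river facts"),
            ("http://example.com/b", "unrelated page")]


def engine_b(query):
    return [("http://example.com/a", "amazon river facts"),
            ("http://example.com/c", "amazon rainforest travel")]


def mentions_query(query, url, snippet):
    """Assumed relevance test: the snippet must contain the query string."""
    return query.lower() in snippet.lower()
```

With these stubs, `metasearch("amazon", [engine_a, engine_b], mentions_query)` keeps one copy of the page both engines return and discards the snippet that never mentions the query.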

(4) Using hyperlinks
• Consider web documents as vertices and the hyperlinks as directed edges in a directed graph.
• A similarity-based clustering method was successfully used in image segmentation.
• Kleinberg’s HITS algorithm is based purely on hyperlink information.
• It finds authority and hub documents for a user query.
• It covers only the most popular topics and leaves out the less popular ones.
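The hub and authority scores in HITS come from a simple mutual-reinforcement iteration over the directed link graph. The sketch below assumes the graph is given as a list of `(source, target)` pairs; the function name and the fixed iteration count are choices made for this example.

```python
import math


def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a directed link graph."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]

        # Hub score: sum of the authority scores of the pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub

        # Normalise both score vectors to unit length to keep them bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm

    return hub, auth
```

For example, with `edges = [("a", "c"), ("b", "c"), ("a", "d")]`, page `c` receives the highest authority score because two pages link to it, and page `a` receives the highest hub score because it links to both authorities.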

(4) Using Hyperlinks: continued
• Cluster web documents based on both the textual content and the hyperlink structure.
• The hyperlink structure is used as the dominant factor in the similarity metric.
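One way to picture such a metric is a weighted sum in which the link term carries more than half of the weight. The slides do not give the actual formula, so everything below is an assumption made for illustration: cosine similarity over word counts for the text, Jaccard similarity over sets of linked pages for the hyperlinks, and a `link_weight` of 0.7.

```python
import math
from collections import Counter


def text_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def link_similarity(links_a, links_b):
    """Jaccard similarity between two sets of linked pages."""
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0


def combined_similarity(doc_a, doc_b, link_weight=0.7):
    """Weighted sum; link_weight > 0.5 makes the hyperlinks dominant."""
    return (link_weight * link_similarity(doc_a["links"], doc_b["links"])
            + (1 - link_weight) * text_similarity(doc_a["text"], doc_b["text"]))
```

Here each document is a dictionary with a `"text"` string and a `"links"` set. Two documents that share all their links but no words score 0.7, while two that share all their words but no links score about 0.3, so the hyperlink structure decides most comparisons.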

(5) Other classical clustering methods
• K-means method
• HAC (hierarchical agglomerative clustering)
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• Single-link, group-average, and complete-link methods, single-pass methods, and Buckshot and Fractionation have also been used.
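As a reminder of how the first of these methods works, here is a minimal K-means sketch. It assumes each document has already been turned into a numeric vector (for example, term weights), and it initialises the centroids with the first `k` points, which is a simplification chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=20):
    """Cluster `points` (equal-length numeric tuples) into k groups."""
    # Assumption: take the first k points as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = []

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]

    return assignment, centroids
```

Each point ends up in exactly one cluster, so unlike STC this method does not produce overlapping clusters.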

3. Key requirements and future challenges
(1) Key requirements for Web document clustering methods
• Relevance
• Browsable summaries
• Overlap
• Speed
• Incrementality (for some methods)

3. Key requirements and future challenges: continued
(2) Concerns on current methods
• Each method has pros and cons.
• Using hyperlinks: the best accuracy, with some room left to improve, but it does not produce overlapping clusters.
• STC: best for browsing and for incrementality.
• MetaCrawler: best at pruning.

3. Key requirements and future challenges: continued
Future challenges
• We cannot take advantage of all the pros of every method at once.
• Some pros work against other pros.
• So we have to make trade-offs.
• Moreover, we need to find improvements.
