A Personalized Search Engine Based on Web Snippet Hierarchical Clustering

Paolo Ferragina, Antonio Gulli
Presented by Bin Tan

Clustering Web Search Results
• Challenges:
  • On short snippets instead of whole documents
  • Clustering must be done on the fly
  • Clusters should be labeled with meaningful text (accurate and intelligible)
  • Clusters need to be distinctive
• Vivisimo
• SNAKET

Categorization of Works
• Flat clustering vs. hierarchical clustering
• Label representation: bag of words vs. contiguous phrase vs. non-contiguous phrase (“gapped sentence”)

Preprocessing
• Fetch snippets from 16 search engines
• Enrich snippets with anchor texts from a crawled database of 200M web pages

Identification of Candidate Phrases for Labels
• Enumerate all pairs of words within a certain proximity window (of size 4) in snippets
• Score them based on:
  • NLP features: part of speech (PoS), named entities (NE)
  • ODP (Open Directory Project) occurrences: term frequency (col freq * inv cat freq?), containing category
• Discard low-score pairs
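The pair enumeration step can be sketched as follows. This is a minimal illustration, not the SNAKET implementation: it assumes whitespace tokenization, reads "window of size 4" as "both words fall inside a span of four consecutive words", and uses raw pair frequency as a stand-in for the NLP and ODP scoring. The function names `candidate_pairs` and `keep_high_scoring` are hypothetical.

```python
from collections import Counter

def candidate_pairs(snippets, window=4):
    """Count ordered word pairs that fit inside `window` consecutive words."""
    counts = Counter()
    for snippet in snippets:
        words = snippet.lower().split()
        for i, first in enumerate(words):
            # Pair each word with the words that follow it inside the window.
            for second in words[i + 1 : i + window]:
                counts[(first, second)] += 1
    return counts

def keep_high_scoring(counts, min_score=2):
    """Discard low-score pairs; raw frequency stands in for the real score (assumption)."""
    return {pair: n for pair, n in counts.items() if n >= min_score}

snippets = [
    "jaguar car dealer in london",
    "jaguar car reviews and prices",
    "jaguar cat habitat in brazil",
]
pairs = keep_high_scoring(candidate_pairs(snippets))
print(pairs)
```

With frequency alone, an uninformative pair such as ("jaguar", "in") survives next to ("jaguar", "car"), which suggests why the scoring on the slide also draws on linguistic features and ODP statistics.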

Identification of Candidate Phrases for Labels (cont.)
• Word pairs are atomic phrases (how about single words?)
• Incrementally merge word pairs into longer phrases (preserve ordering and limit size)
• Score each phrase based on its constituents' scores
• Discard low-score phrases
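One way to picture the incremental merge is the sketch below. The slide does not say how phrases are combined or scored, so everything specific here is an assumption: a phrase is extended when a scored pair starts with the phrase's last word, the new score is the average of the two constituent scores, and `merge_phrases`, `max_len`, and `min_score` are hypothetical names.

```python
def merge_phrases(pair_scores, max_len=4, min_score=0.8):
    """Grow ordered phrases from scored word pairs, one word at a time."""
    phrases = dict(pair_scores)   # phrase (tuple of words) -> score
    frontier = dict(pair_scores)  # phrases created in the previous round
    while frontier:
        grown = {}
        for phrase, score in frontier.items():
            if len(phrase) >= max_len:
                continue  # limit phrase size
            for (first, second), pair_score in pair_scores.items():
                # Extend only when the pair continues the phrase, keeping word order.
                if first == phrase[-1] and second not in phrase:
                    candidate = phrase + (second,)
                    # Assumed scoring rule: average of the two constituent scores.
                    new_score = (score + pair_score) / 2
                    if new_score >= min_score:
                        grown[candidate] = max(new_score, grown.get(candidate, 0.0))
        phrases.update(grown)
        frontier = grown
    return phrases

# Pairs that already survived the pair-level filter (illustrative scores).
pair_scores = {
    ("web", "search"): 1.0,
    ("search", "engine"): 0.75,
    ("engine", "list"): 0.625,
}
phrases = merge_phrases(pair_scores)
print(phrases)
```

The loop terminates because every round lengthens phrases by one word and `max_len` caps the length.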

Hierarchical Clustering
• Group all snippets containing a candidate phrase into an atomic cluster (overlapping is allowed)
• Primary label: the aforementioned candidate phrase
• Secondary labels: other candidate phrases occurring in at least 80% of the snippets in the cluster
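The atomic clustering step can be illustrated with a short sketch. It assumes phrases are plain strings matched by substring search, which ignores gapped phrases, and treats the 80% figure as a minimum coverage; `atomic_clusters` and its parameters are hypothetical names.

```python
def atomic_clusters(snippets, phrases, coverage=0.8):
    """Map each candidate phrase to its snippets and its secondary labels."""
    def contains(snippet, phrase):
        # Simplification: substring match instead of gapped-phrase matching.
        return phrase in snippet.lower()

    clusters = {}
    for primary in phrases:
        # A snippet joins every cluster whose phrase it contains, so clusters may overlap.
        members = [i for i, s in enumerate(snippets) if contains(s, primary)]
        if not members:
            continue
        secondary = [
            other
            for other in phrases
            if other != primary
            and sum(contains(snippets[i], other) for i in members) >= coverage * len(members)
        ]
        clusters[primary] = {"snippets": members, "secondary": secondary}
    return clusters

snippets = [
    "apple iphone price and apple store deals",
    "apple iphone review",
    "apple pie recipe",
]
clusters = atomic_clusters(snippets, ["apple", "iphone", "recipe"])
print(clusters)
```

In this toy input, "apple" becomes a secondary label of both the "iphone" and the "recipe" cluster, while "iphone" covers only two of the three "apple" snippets and so is not a secondary label there.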

Hierarchical Clustering (cont.)
• Merge atomic clusters into candidate second-level clusters if they share primary/secondary labels
  • Primary label: the shared label
  • Secondary labels: other labels occurring in at least 80% of the snippets in the cluster
• Prune second-level clusters that have similar coverage or similar labels
• Recursively produce third-level clusters
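The merge into second-level clusters might look like the sketch below. It is a rough illustration under stated assumptions: a parent is formed for every label shared by at least two atomic clusters, "similar coverage" is approximated by Jaccard similarity of snippet sets with a hypothetical `max_overlap` threshold, and secondary labels for parents and label-similarity pruning are left out.

```python
from collections import defaultdict

def second_level(atomic, max_overlap=0.9):
    """Group atomic clusters under each label they share, then prune near-duplicates."""
    by_label = defaultdict(set)
    for primary, cluster in atomic.items():
        for label in [primary, *cluster["secondary"]]:
            by_label[label].add(primary)

    parents = {}
    for label, children in by_label.items():
        if len(children) < 2:
            continue  # a label carried by a single cluster forms no parent
        covered = set()
        for child in children:
            covered.update(atomic[child]["snippets"])
        parents[label] = {"children": sorted(children), "snippets": covered}

    # Prune: drop a parent whose snippet set nearly duplicates a larger kept one.
    kept = {}
    for label, parent in sorted(parents.items(), key=lambda kv: -len(kv[1]["snippets"])):
        duplicate = any(
            len(parent["snippets"] & other["snippets"])
            / len(parent["snippets"] | other["snippets"])
            >= max_overlap
            for other in kept.values()
        )
        if not duplicate:
            kept[label] = parent
    return kept

atomic = {
    "iphone": {"snippets": [0, 1], "secondary": ["apple"]},
    "ipad": {"snippets": [2], "secondary": ["apple"]},
    "recipe": {"snippets": [3], "secondary": ["pie"]},
    "pie": {"snippets": [3, 4], "secondary": []},
}
hierarchy = second_level(atomic)
print(hierarchy)
```

Applying the same function to the resulting parents, after giving them secondary labels, would be one way to obtain the third level.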

How SNAKET Can Be Used
• Hierarchical browsing for knowledge extraction
• Hierarchical browsing for result selection
• Query reformulation
• Personalized ranking(?)

Evaluation

Evaluation (cont.)

Clustering technology: PageRank of the future?
• Pros:
  • Ambiguous query: narrow down the result list
  • Less-ambiguous query: get a bird’s-eye view of different aspects
• Cons:
  • Clustering is slow and often unnecessary
  • It takes time to look at the clusters
  • Cluster and label quality still leave much to be desired
