A Personalized Search Engine Based on Web Snippet Hierarchical Clustering
Paolo Ferragina, Antonio Gulli
Presented by Bin Tan
Clustering Web Search Results
• Challenges:
  • Works on short snippets instead of whole documents
  • Clustering must be done on the fly
  • Clusters should be labeled with meaningful text (accurate and intelligible)
  • Clusters need to be distinctive
• Vivisimo
• SNAKET
Categorization of Works
• Flat clustering vs. hierarchical clustering
• Label representation: bag of words vs. contiguous phrase vs. non-contiguous phrase (“gapped sentence”)
Preprocessing
• Fetch snippets from 16 search engines
• Enrich snippets with anchor texts from a crawled database of 200M web pages
Identification of Candidate Phrases for Labels
• Enumerate all pairs of words within a certain proximity window (of size 4) in snippets
• Score them based on:
  • NLP features: part of speech (PoS), named entities (NE)
  • ODP (Open Directory Project) occurrences: term frequency (collection frequency × inverse category frequency?), containing category
• Discard low-score pairs
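The pair enumeration step can be sketched in a few lines. This is only an illustration, not SNAKET's implementation: the function name `candidate_pairs`, the whitespace tokenizer, and the reading of "window of size 4" as "both words fall within four consecutive tokens" are assumptions, and the NLP/ODP scoring is replaced by a plain snippet count as a stand-in.

```python
from collections import Counter


def candidate_pairs(snippets, window=4, min_score=2):
    """Enumerate ordered word pairs that co-occur within `window` tokens.

    Assumption: the score is simply the number of snippets containing the
    pair, a stand-in for the PoS/NE and ODP-based scoring on the slide.
    """
    counts = Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()  # naive tokenizer (assumption)
        seen = set()
        for i, left in enumerate(tokens):
            # Words up to window-1 positions to the right share the window.
            for right in tokens[i + 1 : i + window]:
                seen.add((left, right))
        counts.update(seen)  # count each pair at most once per snippet
    # Discard low-score pairs.
    return {pair: n for pair, n in counts.items() if n >= min_score}


pairs = candidate_pairs(["jaguar car dealer", "jaguar car price", "jaguar cat"])
```

With the three toy snippets above, only the pair `("jaguar", "car")` appears in at least two snippets, so it is the only survivor.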
Identification of Candidate Phrases for Labels (cont.)
• Word pairs are atomic phrases (how about single words?)
• Incrementally merge word pairs into longer phrases (preserving word order and limiting phrase length)
• Score each phrase based on its constituents' scores
• Discard low-score phrases
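One possible reading of the incremental merge is sketched below. The slides do not say how pairs are joined or how constituent scores are combined, so everything specific here is an assumption: a phrase is extended when its last word starts another pair, the merged score is the minimum of the constituent scores, and `merge_phrases`, `max_len`, and `min_score` are hypothetical names.

```python
def merge_phrases(pair_scores, max_len=4, min_score=0.5):
    """Grow scored word pairs into longer ordered phrases.

    pair_scores maps (word, word) tuples to scores; the input pairs are
    assumed to have been filtered already and are always kept.
    """
    phrases = dict(pair_scores)
    frontier = dict(pair_scores)
    while frontier:
        grown = {}
        for phrase, score in frontier.items():
            if len(phrase) >= max_len:
                continue  # limit phrase size
            for pair, pair_score in pair_scores.items():
                # Extend only to the right, so word order is preserved.
                if pair[0] == phrase[-1] and pair[1] not in phrase:
                    merged = phrase + (pair[1],)
                    merged_score = min(score, pair_score)  # assumed rule
                    if merged_score >= min_score and merged not in phrases:
                        grown[merged] = merged_score
        phrases.update(grown)
        frontier = grown  # next round extends only the new phrases
    return phrases


scores = {("new", "york"): 0.9, ("york", "times"): 0.8, ("times", "square"): 0.3}
merged = merge_phrases(scores)
```

Taking the minimum means a phrase is only as strong as its weakest pair; an average or product would be equally plausible choices.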
Hierarchical Clustering
• Group all snippets containing a candidate phrase into an atomic cluster (clusters may overlap)
• Primary label: the aforementioned candidate phrase
• Secondary labels: other candidate phrases occurring in 80% of the snippets in the cluster
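A minimal sketch of atomic cluster construction follows. It assumes that a snippet "contains" a phrase when the phrase's words appear in order, possibly with gaps, and that "80%" means "at least 80%"; `atomic_clusters` and `contains_in_order` are hypothetical names, not taken from SNAKET.

```python
def contains_in_order(tokens, phrase):
    """True if the words of `phrase` occur in `tokens` in order (gaps allowed)."""
    it = iter(tokens)
    return all(word in it for word in phrase)


def atomic_clusters(snippets, phrases, coverage=0.8):
    """Build one (possibly overlapping) cluster per candidate phrase."""
    tokenized = [s.lower().split() for s in snippets]
    members = {
        p: {i for i, toks in enumerate(tokenized) if contains_in_order(toks, p)}
        for p in phrases
    }
    clusters = {}
    for primary, ids in members.items():
        if not ids:
            continue  # phrase matches no snippet
        # Secondary labels: other phrases covering >= 80% of this cluster.
        secondary = [
            other
            for other, other_ids in members.items()
            if other != primary and len(ids & other_ids) >= coverage * len(ids)
        ]
        clusters[primary] = {"snippets": ids, "secondary": secondary}
    return clusters


docs = [
    "jaguar car dealer london",
    "jaguar car price",
    "jaguar cat habitat",
    "jaguar car dealer paris",
    "jaguar car review",
]
labels = [("jaguar", "car"), ("car", "dealer"), ("jaguar", "cat")]
atoms = atomic_clusters(docs, labels)
```

Here the "car dealer" cluster picks up "jaguar car" as a secondary label because every dealer snippet also mentions "jaguar car", while the reverse does not hold.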
Hierarchical Clustering (cont.)
• Merge atomic clusters into candidate second-level clusters if they share primary/secondary labels
• Primary label: the shared label
• Secondary labels: other labels occurring in 80% of the snippets in the cluster
• Prune second-level clusters that have similar coverage or similar labels
• Recursively produce third-level clusters
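The merge-and-prune step might look roughly like the sketch below. The slides do not define "similar coverage", so a Jaccard overlap threshold (`max_overlap`) is used here purely as an assumption; pruning by label similarity and the recursion to a third level are left out, and `second_level` is a hypothetical name. The input is a dict mapping each primary label to its snippet set and secondary labels.

```python
from collections import defaultdict


def second_level(clusters, max_overlap=0.9):
    """Group atomic clusters that share a primary or secondary label."""
    by_label = defaultdict(list)
    for primary, info in clusters.items():
        for label in [primary, *info["secondary"]]:
            by_label[label].append(primary)

    candidates = []
    for label, children in by_label.items():
        if len(children) < 2:
            continue  # a label must be shared to form a parent cluster
        covered = set().union(*(clusters[c]["snippets"] for c in children))
        candidates.append({"label": label, "children": children, "snippets": covered})

    # Prune: keep larger clusters first, drop near-duplicates by coverage.
    candidates.sort(key=lambda c: len(c["snippets"]), reverse=True)
    kept = []
    for cand in candidates:
        def jaccard(other):
            union = cand["snippets"] | other["snippets"]
            return len(cand["snippets"] & other["snippets"]) / len(union) if union else 0.0

        if all(jaccard(k) < max_overlap for k in kept):
            kept.append(cand)
    return kept


atomic = {
    ("car", "dealer"): {"snippets": {0, 3}, "secondary": [("jaguar", "car")]},
    ("car", "price"): {"snippets": {1}, "secondary": [("jaguar", "car")]},
    ("jaguar", "cat"): {"snippets": {2}, "secondary": []},
}
parents = second_level(atomic)
```

In this toy input, "car dealer" and "car price" share the secondary label "jaguar car", so they end up under a single second-level cluster with that label.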
How SNAKET Can Be Used
• Hierarchical browsing for knowledge extraction
• Hierarchical browsing for result selection
• Query reformulation
• Personalized ranking (?)
Clustering technology: PageRank of the future?
• Pros:
  • Ambiguous query: narrow down the result list
  • Less ambiguous query: get a bird’s-eye view of different aspects
• Cons:
  • Clustering is slow and often unnecessary
  • It takes time to look through the clusters
  • Cluster and label quality still leave much to be desired