Matching DOM Trees to Search Logs for Accurate Webpage Clustering

Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta

Data extraction Website-specific wrappers Structured DB (product_name, price, rating) Webpages from a site

Wrapper 1 Data Extraction • Building wrappers[Muslea+/98, Crescenzi+/01, Cohen+/02, Hogue+/05, Irmak+/06] • Cluster pages from the website based on similarity of DOM structure • Pick a few example pages per cluster • Manually annotate the DOM nodes which contain the data • Automatic wrapper induction using these annotations Wrapper 2

Data Extraction • Clustering affects quality • Too few clusters: • Heterogeneity of clusters • Imperfect wrappers, or even inability to build wrappers • Too many clusters: • Significant editorial effort required to build wrappers • We want to automatically get a good clustering, for any website

Main Idea Wrappers extract it Users search for it “Useful” info on a page search terms match page content html h1 search +click b “html h1” and “html h1 b” are key paths DOM paths repeatedly referenced by search terms are “key” paths

Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths

Mapping pages to clusters • Pages in a cluster should have similar tree structure • and hence, similar paths • Represent a page by a shingle of its paths [Buttler/04] • Using key paths: • Shingle preferentially picks key paths in the page • Requires a global ranking of key paths

Mapping pages to clusters • One cluster per shingle All pages in a cluster share the same k “key” paths

Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths

Identify key paths title html • For every (query, webpage) pair • match query terms to text of a DOM path • yields precision and recall for every path • Need to aggregate over all queries and webpages • Expected precision and recall of a path • High if path appears on many queried pages, • and has high precision/recall in most of them h1 price b

Identify key paths • How can we combine expected precision and recall into one ranking of key paths? • F-measure, but • Precision typically more important than recall • Precision and recall may be in completely different scales • This scaling factor varies among websites

Identify key paths • How can we combine expected precision and recall into one ranking of key paths? • Borda method [Borda/1781] • Create two rankings of paths, one by precision and one by recall • Combine rankings into one ranking, using relative importance of precision to recall • Immune to varying scales of precision/recall values among websites

Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths, but • Key paths can be dependent • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths

Handling dependent paths • Consider the following two paths: • html body div div table tr td h1 span (“product name”) • html body div div table tr td h1 • If one is a key path, probably the other is too • Shingle can get “swamped” • Shingle of a page becomes:(product_name, product_name_parent, product_name_ancestor) • instead of:(product_name, buy_button, rating)

Handling dependent paths • Several sources of dependence • Multiple paths may have similar content • “product name” header and its parent • product name mentioned in a header and in the text • Multiple paths may always co-occur • “product name” header and “price”

Handling dependent paths • Identify key independent paths • Build a graph of dependencies between paths • Pick an independent set of pathsi.e., a set of paths where no one is connected to another • Computation is weighted strongly towards the top-ranked paths • Under our weighting scheme, greedily picking an independent set is optimal

Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths • Several other optimizations (in paper)

Experiments • 10 major websites • Sampled ~20,000 pages each • Built ground truth • Ran an existing clustering algorithm • Manually checked results • Homogeneous clusters: merge when necessary • Heterogeneous clusters: change parameters, repeat • Small sample of search logs • ~5K unique queries per site • Far fewer than the number of pages per site

Experiments • Compared to clustering using well-known tree-similarity metrics • Path Shingles: Shingle of DOM paths without using key paths [Buttler/04] • pq-Grams: Shingle of sub-trees of DOM tree [Augsten+/05] • m/k Path Shingles: Like path shingles, except only m out of k shingle elements need to match

Experiments Our algorithm [Buttler/04] [Augsten+/05] • Compared clustering using Adjusted RAND index • higher is better, 1.0 is perfect Search logs give significant lift, with very low variance

Experiments Precision of IndepPaths Comparison against paths actually used by manually-designed wrappers Key Paths correspond to paths used in wrappers

Experiments Examples of top-ranked paths

Conclusions • Clusters affect both • wrapper quality, and • degree of editorial effort • We use search logs to automatically find good clusters • Current efforts: • Combining search features with content features to pick key paths

Mapping pages to clusters • Given an ranked list of key paths • Given a shingle-size k • For any page P • Find KP = all key paths in P • If |KP| < k • Shingle = KP plus randomly chosen paths from page • else • Shingle = top-ranked k paths in KP

Experiments

Experiments [Buttler/04] [Augsten+/05] • Compared clustering using Adjusted RAND index • higher is better, 1.0 is perfect Our algorithm Shingles w/o key paths Shingles of DOM subtrees Shingle of 8 paths; only 6 need to match Search logs give significant lift, with very low variance

Matching DOM Trees to Search Logs for Accurate Webpage Clustering

Matching DOM Trees to Search Logs for Accurate Webpage Clustering

Presentation Transcript

Trees, Binary Trees, and Binary Search Trees

DIGITAL SEARCH TREES

Binary Trees, Binary Search Trees

Game Trees-Clustering

Search Trees

Publishing Search Query logs

Binary Search Trees

Matching Similarity for Keyword - based Clustering

Search Trees

Binary Trees, Binary Search Trees

Binary Search Trees

Binary Trees, Binary Search Trees

Search Trees

Web search logs

Binary Trees, Binary Search Trees

Search Trees

Binary Trees, Binary Search Trees

Clustering Event Logs Using Iterative Partitioning

Trees and Binary Search Trees

Binary Trees, Binary Search Trees

Binary Trees, Binary Search Trees

Trees and Binary Search Trees