270 likes | 360 Views
Matching DOM Trees to Search Logs for Accurate Webpage Clustering. Deepayan Chakrabarti Rupesh Mehta. Data extraction. Website-specific wrappers. Structured DB. (product_name, price, rating). Webpages from a site. Wrapper 1. Data Extraction.
E N D
Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta
Data extraction Website-specific wrappers Structured DB (product_name, price, rating) Webpages from a site
Wrapper 1 Data Extraction • Building wrappers[Muslea+/98, Crescenzi+/01, Cohen+/02, Hogue+/05, Irmak+/06] • Cluster pages from the website based on similarity of DOM structure • Pick a few example pages per cluster • Manually annotate the DOM nodes which contain the data • Automatic wrapper induction using these annotations Wrapper 2
Data Extraction • Clustering affects quality • Too few clusters: • Heterogeneity of clusters • Imperfect wrappers, or even inability to build wrappers • Too many clusters: • Significant editorial effort required to build wrappers • We want to automatically get a good clustering, for any website
Main Idea Wrappers extract it Users search for it “Useful” info on a page search terms match page content html h1 search +click b “html h1” and “html h1 b” are key paths DOM paths repeatedly referenced by search terms are “key” paths
Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths
Mapping pages to clusters • Pages in a cluster should have similar tree structure • and hence, similar paths • Represent a page by a shingle of its paths [Buttler/04] • Using key paths: • Shingle preferentially picks key paths in the page • Requires a global ranking of key paths
Mapping pages to clusters • One cluster per shingle All pages in a cluster share the same k “key” paths
Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths
Identify key paths title html • For every (query, webpage) pair • match query terms to text of a DOM path • yields precision and recall for every path • Need to aggregate over all queries and webpages • Expected precision and recall of a path • High if path appears on many queried pages, • and has high precision/recall in most of them h1 price b
Identify key paths • How can we combine expected precision and recall into one ranking of key paths? • F-measure, but • Precision typically more important than recall • Precision and recall may be in completely different scales • This scaling factor varies among websites
Identify key paths • How can we combine expected precision and recall into one ranking of key paths? • Borda method [Borda/1781] • Create two rankings of paths, one by precision and one by recall • Combine rankings into one ranking, using relative importance of precision to recall • Immune to varying scales of precision/recall values among websites
Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths, but • Key paths can be dependent • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths
Handling dependent paths • Consider the following two paths: • html body div div table tr td h1 span (“product name”) • html body div div table tr td h1 • If one is a key path, probably the other is too • Shingle can get “swamped” • Shingle of a page becomes:(product_name, product_name_parent, product_name_ancestor) • instead of:(product_name, buy_button, rating)
Handling dependent paths • Several sources of dependence • Multiple paths may have similar content • “product name” header and its parent • product name mentioned in a header and in the text • Multiple paths may always co-occur • “product name” header and “price”
Handling dependent paths • Identify key independent paths • Build a graph of dependencies between paths • Pick an independent set of pathsi.e., a set of paths where no one is connected to another • Computation is weighted strongly towards the top-ranked paths • Under our weighting scheme, greedily picking an independent set is optimal
Main Idea • Clustering using key paths • Pre-processing step (for each site) • Given a large sample of pages and search logs • Identify key paths • Run-time (for that website) • Given a new webpage • Find which key paths exist on the page • Map page to cluster using its key paths • Several other optimizations (in paper)
Experiments • 10 major websites • Sampled ~20,000 pages each • Built ground truth • Ran an existing clustering algorithm • Manually checked results • Homogeneous clusters: merge when necessary • Heterogeneous clusters: change parameters, repeat • Small sample of search logs • ~5K unique queries per site • Far fewer than the number of pages per site
Experiments • Compared to clustering using well-known tree-similarity metrics • Path Shingles: Shingle of DOM paths without using key paths [Buttler/04] • pq-Grams: Shingle of sub-trees of DOM tree [Augsten+/05] • m/k Path Shingles: Like path shingles, except only m out of k shingle elements need to match
Experiments Our algorithm [Buttler/04] [Augsten+/05] • Compared clustering using Adjusted RAND index • higher is better, 1.0 is perfect Search logs give significant lift, with very low variance
Experiments Precision of IndepPaths Comparison against paths actually used by manually-designed wrappers Key Paths correspond to paths used in wrappers
Experiments Examples of top-ranked paths
Conclusions • Clusters affect both • wrapper quality, and • degree of editorial effort • We use search logs to automatically find good clusters • Current efforts: • Combining search features with content features to pick key paths
Mapping pages to clusters • Given an ranked list of key paths • Given a shingle-size k • For any page P • Find KP = all key paths in P • If |KP| < k • Shingle = KP plus randomly chosen paths from page • else • Shingle = top-ranked k paths in KP
Experiments [Buttler/04] [Augsten+/05] • Compared clustering using Adjusted RAND index • higher is better, 1.0 is perfect Our algorithm Shingles w/o key paths Shingles of DOM subtrees Shingle of 8 paths; only 6 need to match Search logs give significant lift, with very low variance