Web Page Clustering using Heuristic Search in the Web Graph (IJCAI 07)
Motivation - 1/2
• The reasons for clustering search results are two-fold:
  • cluster hypothesis: similar documents tend to be relevant to the same requests
  • the ranked list is usually too long and contains many irrelevant documents
• Clustering of search results has been successful both academically and commercially (vivisimo.com)
• Search results are organized into groups (clusters) by topical similarity
Motivation - 2/2
• The clustering problem:
  • there is often not enough contextual information on a page
  • for example: savethejaguar.com
• Web sites can be contextually different but still refer to the same meaning of the query
• Michel Décary:
  • a computer scientist (www.zoominfo.com/MichelDecary),
  • a lawyer (www.stikeman.com/cgi-bin/profile.cfm?P ID=366),
  • and a chansonnier (www.decary.com).
Introduction - 1/4
• Thematic locality of the Web graph:
  • the Web is a directed graph in which nodes are Web pages and edges are hyperlinks
  • if page A hyperlinks to page B, pages A and B are likely to be semantically close
• For example, Michel Décary:
  • a computer scientist (www.zoominfo.com/MichelDecary),
  • and a chansonnier (www.decary.com)
  • cogilex.com
Introduction - 2/4
• Heuristic search:
  • the goal is to collect as much useful information as possible while crawling the Web
  • heuristics estimate the amount of information available in a particular Web sub-graph
• This paper uses heuristics to estimate the utility of expanding the current node in terms of leading to the target node
• The heuristics are used not to reduce search time, but to improve search accuracy
• Heuristics act as filters that prune branches of the search tree which are likely to establish undesired connections between unrelated Web pages
Introduction - 3/4
• Multi-agent system:
  • given n Web pages in the ranked list, n collaborative Web agents are created
  • initial dataset: each agent is assigned one page
  • each agent performs heuristic search to traverse the Web graph in order to meet as many other agents as possible
• Two applications:
  • Web appearance disambiguation
  • search result clustering
Multi-agent heuristic search
• Two multi-agent heuristic search algorithms
• Sequential Heuristic Search (SHS):
  • Frontier: a list of nodes (URLs) to be expanded (initially, the URL of the agent's source page)
  • Filter: the pruning heuristics (described later)
  • Initialize: each agent's frontier starts with its source page (a minimal sketch of the search loop follows)
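For concreteness, here is a minimal sketch of what an SHS-style loop could look like. The helper names fetch_outlinks and passes_filter are hypothetical placeholders (not from the paper) standing in for the crawler and the pruning heuristics described later:

```python
from collections import deque

def sequential_heuristic_search(source_urls, fetch_outlinks, passes_filter, max_steps=1000):
    """Sketch of an SHS-style loop: one agent per source page; agents run one
    after another, each expanding a frontier of URLs and recording which other
    agents it 'meets' along the way."""
    owner = {url: i for i, url in enumerate(source_urls)}      # which agent first claimed a URL
    meetings = []                                               # pairs of agents whose searches touched

    for agent, source in enumerate(source_urls):
        frontier = deque([source])                              # initially, the agent's own source page
        steps = 0
        while frontier and steps < max_steps:
            url = frontier.popleft()
            steps += 1
            for out_url in fetch_outlinks(url):
                if not passes_filter(out_url):                  # heuristics prune unpromising branches
                    continue
                if owner.get(out_url, agent) != agent:
                    meetings.append((agent, owner[out_url]))    # two agents met
                elif out_url not in owner:
                    owner[out_url] = agent                      # claim the page and keep expanding
                    frontier.append(out_url)
    return meetings
```

Clusters would then come from merging the source pages of agents that met; chaining of such merges is exactly the drawback discussed on the next slide.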
Multi-agent heuristic search
• The SHS algorithm is simple and intuitive
• One crucial drawback: there is no way to control the topology of the constructed clusters
• In the worst case, if Page A --> Page B, Page B --> Page C, and Page C --> Page D, pages A and D will be placed in the same cluster even though the semantic relation between them is probably weak
Multi-agent heuristic search
• Incremental Heuristic Search (IHS)
Heuristics - 1/2
• Two heuristics:
  • Topology-driven: high-degree node elimination
    • remove pages with high out-degree and pages with high in-degree
  • Content-driven: person name heuristic
Heuristics - 2/2
• To detect high-degree URLs:
  • Google's link: operator is used
  • threshold on in/out hyperlinks: 1000
• Person names consist of two, three, or four words
• The person name heuristic excludes names that are too common (again, using Google's link: operator)
• In many cases, an entity tagged as a person name has millions of Google hits because of a tagger error
  • examples of such entities are Price Range and Mac OS
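A rough, combined sketch of the two filters under the thresholds above. estimate_degree and hit_count are placeholders for whatever service supplies link and hit counts (the slides mention Google's link: operator); the one-million cutoff for "too common" names is an illustrative assumption, not a number from the paper:

```python
DEGREE_THRESHOLD = 1000          # threshold on in/out hyperlinks from the slide
COMMON_NAME_HITS = 1_000_000     # assumed cutoff for "too common" person names (illustrative)

def passes_filter(url, person_names, estimate_degree, hit_count):
    """Return False for pages the heuristics would prune from the search tree."""
    # Topology-driven heuristic: eliminate hub-like pages with very many in/out links.
    if estimate_degree(url) > DEGREE_THRESHOLD:
        return False
    # Content-driven heuristic: keep only person names of 2-4 words that are not
    # overly common; an entity with millions of hits is usually a tagger error
    # (e.g. "Price Range" or "Mac OS" tagged as a person name).
    credible = [name for name in person_names
                if 2 <= len(name.split()) <= 4 and hit_count(name) < COMMON_NAME_HITS]
    return bool(credible)
```

How the name filter interacts with each agent's own target name is not spelled out on the slide, so this predicate only captures the pruning side.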
Datasets - disambiguation dataset
• Web appearance disambiguation dataset: www.cs.umass.edu/~ronb
• It consists of 1085 Web pages retrieved for 12 names of people from Melinda Gervasio's social network (mostly SRI engineers and university professors)
• The dataset is labeled according to each person's occupation
• The process crawled the Web starting with these 1085 pages (source pages):
  • 7009 pages at the first hop (one link away from a source page)
  • 69,454 pages at the second hop
  • 592,299 pages at the third hop
Datasets - Jaguar dataset - 1/2
• Problem of clustering Web search results
• The first 100 Google hits obtained for the query jaguar were retrieved and labeled
Datasets - Jaguar dataset - 2/2
• Jaguar dataset:
  • K = 3 (car, Mac OS, and cats)
  • 883 pages at the first hop
  • 8548 pages at the second hop
  • 56,287 pages at the third hop
Agglomerative/Conglomerative Distributional Clustering (A/CDC) (Bekkerman and McCallum, 2005)
Conclusion
• This paper is the first study of heuristic search in the Web graph
• Heuristic search is viable in the vast domain of the WWW, for:
  • clustering of Web search results
  • Web appearance disambiguation
Introduction - 4/4
• Topological clustering
  • Initially, each document from the original ranked list is placed into one cluster of C', a set of k' > k topical clusters
  • Only the k largest clusters are kept, giving a set C of k clusters
  • For each cluster c_i ∈ C, its closest cluster c'_j from C' is found: j = argmax_{j'} |c_i ∩ c'_{j'}|
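To make the argmax concrete, here is a small sketch that matches a cluster to its closest counterpart by overlap size; the page IDs are arbitrary illustrative values, not data from the paper:

```python
def closest_cluster(c_i, candidate_clusters):
    """Index j maximizing |c_i ∩ c'_j|, with clusters represented as sets of page IDs."""
    return max(range(len(candidate_clusters)),
               key=lambda j: len(c_i & candidate_clusters[j]))

# Tiny example: cluster {2, 3, 4} overlaps most with the first candidate.
C_prime = [{1, 2, 3}, {4, 5}, {6, 7, 8, 9}]
print(closest_cluster({2, 3, 4}, C_prime))   # -> 0
```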