Web Page Clustering using Heuristic Search in the Web Graph (IJCAI 07)
Motivation - 1/2
• The reasons for clustering search results are two-fold:
  • cluster hypothesis: similar documents tend to be relevant to the same requests
  • the ranked list is usually too long and contains many irrelevant documents
• Clustering of search results has been successful both academically and commercially (vivisimo.com)
• Search results are organized into groups (clusters) by topical similarity
Motivation - 2/2
• The clustering problem:
  • there is often not enough contextual information on a page
  • for example: savethejaguar.com
• Web sites can be contextually different but still refer to the same meaning of the query
• Michel Décary:
  • a computer scientist (www.zoominfo.com/MichelDecary),
  • a lawyer (www.stikeman.com/cgi-bin/profile.cfm?P ID=366),
  • and a chansonnier (www.decary.com).
Introduction - 1/4
• Thematic locality of the Web graph:
  • the Web is a directed graph in which nodes are Web pages and edges are hyperlinks
  • if page A hyperlinks to page B, pages A and B are likely to be semantically close
• For example, Michel Décary:
  • a computer scientist (www.zoominfo.com/MichelDecary),
  • and a chansonnier (www.decary.com)
  • cogilex.com
Introduction - 2/4
• Heuristic search:
  • the goal is to collect as much useful information as possible while crawling the Web
  • heuristics estimate the amount of information available in a particular Web sub-graph
• This paper uses heuristics to estimate the utility of expanding the current node in terms of leading to the target node
• The heuristics are used not to reduce search time, but to improve search accuracy
• Heuristics act as filters that prune branches of the search tree which are likely to establish undesired connections between unrelated Web pages
Introduction - 3/4
• Multi-agent system:
  • given n Web pages in the ranked list, n collaborative Web agents are created
  • initial dataset: each agent is assigned one page
  • each agent performs heuristic search to traverse the Web graph in order to meet as many other agents as possible
• Two applications:
  • Web appearance disambiguation
  • search result clustering
Multi-agent heuristic search
• Two multi-agent heuristic search algorithms
• Sequential Heuristic Search (SHS):
  • Frontier: a list of nodes (URLs) to be expanded (initially, the URL of the agent's source page)
  • Filter: the pruning heuristics (described later)
  • Initialize: each agent's frontier starts with its source page (a minimal sketch of the search loop follows)
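For concreteness, here is a minimal sketch of what an SHS-style loop could look like. The helper names fetch_outlinks and passes_filter are hypothetical placeholders (not from the paper) standing in for the crawler and the pruning heuristics described later:

```python
from collections import deque

def sequential_heuristic_search(source_urls, fetch_outlinks, passes_filter, max_steps=1000):
    """Sketch of an SHS-style loop: one agent per source page; agents run one
    after another, each expanding a frontier of URLs and recording which other
    agents it 'meets' along the way."""
    owner = {url: i for i, url in enumerate(source_urls)}      # which agent first claimed a URL
    meetings = []                                               # pairs of agents whose searches touched

    for agent, source in enumerate(source_urls):
        frontier = deque([source])                              # initially, the agent's own source page
        steps = 0
        while frontier and steps < max_steps:
            url = frontier.popleft()
            steps += 1
            for out_url in fetch_outlinks(url):
                if not passes_filter(out_url):                  # heuristics prune unpromising branches
                    continue
                if owner.get(out_url, agent) != agent:
                    meetings.append((agent, owner[out_url]))    # two agents met
                elif out_url not in owner:
                    owner[out_url] = agent                      # claim the page and keep expanding
                    frontier.append(out_url)
    return meetings
```

Clusters would then come from merging the source pages of agents that met; chaining of such merges is exactly the drawback discussed on the next slide.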
Multi-agent heuristic search
• The SHS algorithm is simple and intuitive
• One crucial drawback: there is no way to control the topology of the constructed clusters
• In the worst case, if Page A --> Page B, Page B --> Page C, and Page C --> Page D, pages A and D will be placed in the same cluster even though the semantic relation between them is probably weak
Multi-agent heuristic search
• Incremental Heuristic Search (IHS)
Heuristics - 1/2
• Two heuristics:
  • Topology-driven: high-degree node elimination
    • remove pages with high out-degree and pages with high in-degree
  • Content-driven: person name heuristic
Heuristics - 2/2
• To detect high-degree URLs:
  • Google's link: operator is used
  • threshold on in/out hyperlinks: 1000
• Person names consist of two, three, or four words
• The person name heuristic excludes names that are too common (again, using Google's link: operator)
• In many cases, an entity tagged as a person name has millions of Google hits because of a tagger error
  • examples of such entities are Price Range and Mac OS
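A rough, combined sketch of the two filters under the thresholds above. estimate_degree and hit_count are placeholders for whatever service supplies link and hit counts (the slides mention Google's link: operator); the one-million cutoff for "too common" names is an illustrative assumption, not a number from the paper:

```python
DEGREE_THRESHOLD = 1000          # threshold on in/out hyperlinks from the slide
COMMON_NAME_HITS = 1_000_000     # assumed cutoff for "too common" person names (illustrative)

def passes_filter(url, person_names, estimate_degree, hit_count):
    """Return False for pages the heuristics would prune from the search tree."""
    # Topology-driven heuristic: eliminate hub-like pages with very many in/out links.
    if estimate_degree(url) > DEGREE_THRESHOLD:
        return False
    # Content-driven heuristic: keep only person names of 2-4 words that are not
    # overly common; an entity with millions of hits is usually a tagger error
    # (e.g. "Price Range" or "Mac OS" tagged as a person name).
    credible = [name for name in person_names
                if 2 <= len(name.split()) <= 4 and hit_count(name) < COMMON_NAME_HITS]
    return bool(credible)
```

How the name filter interacts with each agent's own target name is not spelled out on the slide, so this predicate only captures the pruning side.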
Datasets - disambiguation dataset
• Web appearance disambiguation dataset: www.cs.umass.edu/~ronb
• It consists of 1085 Web pages retrieved for 12 names of people from Melinda Gervasio's social network (mostly SRI engineers and university professors)
• The dataset is labeled according to each person's occupation
• The process crawled the Web starting with these 1085 pages (source pages):
  • 7009 pages at the first hop (one link away from a source page)
  • 69,454 pages at the second hop
  • 592,299 pages at the third hop
Datasets - Jaguar dataset - 1/2
• Problem of clustering Web search results
• The first 100 Google hits obtained for the query jaguar were retrieved and labeled
Datasets - Jaguar dataset - 2/2
• Jaguar dataset:
  • K = 3 (car, Mac OS, and cats)
  • 883 pages at the first hop
  • 8548 pages at the second hop
  • 56,287 pages at the third hop
Agglomerative/Conglomerative Distributional Clustering (A/CDC) (Bekkerman and McCallum, 2005)
Conclusion
• This paper is the first study of heuristic search in the Web graph
• Heuristic search is viable in the vast domain of the WWW, for:
  • clustering of Web search results
  • Web appearance disambiguation
Introduction - 4/4
• Topological clustering
  • Initially, each document from the original ranked list is placed into one cluster of C', a set of k' > k topical clusters
  • Only the k largest clusters are kept, giving a set C of k clusters
  • For each cluster c_i ∈ C, its closest cluster c'_j from C' is found: j = argmax_{j'} |c_i ∩ c'_{j'}|
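To make the argmax concrete, here is a small sketch that matches a cluster to its closest counterpart by overlap size; the page IDs are arbitrary illustrative values, not data from the paper:

```python
def closest_cluster(c_i, candidate_clusters):
    """Index j maximizing |c_i ∩ c'_j|, with clusters represented as sets of page IDs."""
    return max(range(len(candidate_clusters)),
               key=lambda j: len(c_i & candidate_clusters[j]))

# Tiny example: cluster {2, 3, 4} overlaps most with the first candidate.
C_prime = [{1, 2, 3}, {4, 5}, {6, 7, 8, 9}]
print(closest_cluster({2, 3, 4}, C_prime))   # -> 0
```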