Improving Web Search Results Using Affinity Graph

Improving Web Search Results Using Affinity Graph Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma Microsoft Research Asia SIGIR 2005

INTRODUCTION • The top search results can hardly cover a sufficient variety of topics (redundant) • re-ranking method based on MMR • There is no indication about how informative a returned document is on the query topic (coverage) • subtopic retrieval method • two novel metrics, diversity and information richness

BACKGROUND • The most famous works on link analysis • PageRank and HITS algorithm • Explicit link analysis and implicit link analysis • two web pages are implicitly linked if they are visited sequentially by the same end-user. • DirectHit and Small Web Search

AFFINITY RANKING

AFFINITY RANKING • Diversity: Given a set of documents R, we use diversity Div(R) to denote the number of different topics contained in R. • Information Richness: Given a document collection D = {d1, …, dn}, we use information richness InfoRich(di) to denote the richness of the information contained in document di with respect to the entire collection D.

Affinity Graph Construction • According to the vector space model, the similarity between a document pair di and dj can be calculated as • To further measure the significance of the similarity between each document pair, we define the affinity of dj to di as
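The two formulas on this slide are not present in the transcript. The sketch below shows one plausible reading, assuming that the similarity is the usual cosine similarity of the vector space model and that the affinity is the same dot product normalized by the length of di only, which makes it asymmetric. The function names, the sparse dict representation, and the `threshold` parameter are illustrative assumptions, not taken from the slides.

```python
import math

def dot(u, v):
    """Dot product of two sparse term-weight vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def norm(u):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in u.values()))

def cosine_similarity(di, dj):
    """Symmetric similarity of the vector space model."""
    denom = norm(di) * norm(dj)
    return dot(di, dj) / denom if denom else 0.0

def affinity(di, dj):
    """Assumed form of the affinity of dj to di: normalize by |di| only,
    so affinity(di, dj) generally differs from affinity(dj, di)."""
    length = norm(di)
    return dot(di, dj) / length if length else 0.0

def build_affinity_graph(docs, threshold=0.0):
    """Adjacency matrix M with M[i][j] = affinity(docs[i], docs[j]).
    Entries below the (hypothetical) threshold are treated as no link."""
    n = len(docs)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                a = affinity(docs[i], docs[j])
                M[i][j] = a if a >= threshold else 0.0
    return M
```

Because the affinity is asymmetric under this assumption, the resulting graph is directed, which is what lets a PageRank-style analysis run on a collection that has no explicit links.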

Information Richness Computation • After obtaining the affinity graph, we apply a link analysis algorithm similar to PageRank • The adjacency matrix M is normalized so that each row sums to 1.

Information Richness Computation • The score of document di can be deduced from the scores of all other documents linked to it • With damping factor c (similar to the random jumping factor in PageRank):

Information Richness Computation • Information chooses where to flow according to the following two rules: • With probability c, the information flows into one of the document nodes that di links to • With probability 1 - c, the information flows randomly into any document in the collection.
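The two flow rules describe a damped random walk, so the scores can be approximated by power iteration. The following is a minimal sketch under stated assumptions: `M` is an affinity matrix given as a list of rows, the default `c=0.85` is the value customarily used for PageRank rather than one stated on the slides, and the names `normalize_rows` and `info_rich` are hypothetical.

```python
def normalize_rows(M):
    """Scale each row to sum to 1; rows with no links are left as zeros."""
    out = []
    for row in M:
        total = sum(row)
        out.append([x / total for x in row] if total else list(row))
    return out

def info_rich(M, c=0.85, tol=1e-10, max_iter=1000):
    """Iterate score_i = c * sum_j score_j * P[j][i] + (1 - c) / n
    until the total change between iterations falls below tol."""
    n = len(M)
    P = normalize_rows(M)
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new = [
            c * sum(scores[j] * P[j][i] for j in range(n)) + (1.0 - c) / n
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores
```

In this sketch a document without outgoing links simply loses its share of the flow, so the scores sum to 1 only when every row of M has at least one nonzero entry. Redistributing that mass uniformly is a common alternative.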

Diversity Penalty
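Only the slide title is available here, so the following is a guess at the general idea rather than the authors' procedure. It assumes a greedy scheme: repeatedly pick the remaining document with the highest score, then lower the scores of the other remaining documents in proportion to their affinity to the picked one. `P` is assumed to be a row-normalized affinity matrix and `scores` the information richness values; the function name is hypothetical.

```python
def diversity_penalty(P, scores):
    """Greedy re-ordering: after each pick, every remaining document j
    is penalized by P[j][picked] * scores[picked] (assumed penalty form).
    Returns the pick order and the penalized scores."""
    penalized = list(scores)
    remaining = set(range(len(scores)))
    order = []
    while remaining:
        # Highest penalized score wins; ties go to the lower index.
        best = max(remaining, key=lambda k: (penalized[k], -k))
        order.append(best)
        remaining.remove(best)
        for j in remaining:
            penalized[j] -= P[j][best] * scores[best]
    return order, penalized
```

For example, if documents 0 and 1 are near duplicates and document 2 covers a different topic, picking document 0 first pushes document 1 below document 2 even when document 1 started with the higher score.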

Re-ranking Method • The re-ranking mechanism is a combination of results from fulltext search and Affinity Ranking • score-combination

Re-ranking Method • rank-combination
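The combination formulas themselves are not in the transcript. As a rough sketch, both schemes can be read as weighted sums with weights α and β: score-combination adds normalized scores, and rank-combination adds rank positions. The max-normalization, the function names, and the dict inputs keyed by document id are assumptions made for illustration.

```python
def ranks(scores):
    """Map each document id to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return {d: r for r, d in enumerate(ordered, start=1)}

def score_combination(fulltext, affinity_rank, alpha, beta):
    """Order documents by alpha * full-text score + beta * Affinity Ranking
    score, each divided by its maximum (normalization is an assumption)."""
    mf = max(fulltext.values()) or 1.0
    ma = max(affinity_rank.values()) or 1.0
    combined = {
        d: alpha * fulltext[d] / mf + beta * affinity_rank[d] / ma
        for d in fulltext
    }
    return sorted(combined, key=lambda d: (-combined[d], d))

def rank_combination(fulltext, affinity_rank, alpha, beta):
    """Order documents by alpha * full-text rank + beta * Affinity Ranking
    rank; a smaller combined value means a better position."""
    rf, ra = ranks(fulltext), ranks(affinity_rank)
    combined = {d: alpha * rf[d] + beta * ra[d] for d in fulltext}
    return sorted(combined, key=lambda d: (combined[d], d))
```

With `alpha=0` and `beta=1` the rank-combination ordering is determined entirely by the Affinity Ranking scores, and with `alpha=1` and `beta=0` it reduces to the plain full-text ordering.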

EXPERIMENTS • Yahoo! Directory • contained a total of 292,216 categories (including leaf categories and non-leaf categories) • All categories are organized into a 16-level hierarchy. • we have downloaded 792,601 documents in total. • ODP (Open Directory Project) • We downloaded the directory in August, 2004. ODP includes a total of 172,565 categories • we have downloaded 1,547,000 documents in total.

EXPERIMENTS • Newsgroup dataset • The Newsgroup data is composed of 256,449 posts collected from 117 commercial application newsgroups, with a total size of about 400M • The title and content of each post are given a 3:1 weighting ratio in the indexing process • There are no explicit links among the posts • A large number of posts are very likely to be devoted to the same topic

Affinity Ranking vs. K-Means Clustering

Affinity Ranking vs. K-Means Clustering • The top 1000 search results of each query are passed to the AR or K-Means algorithm, which re-ranks them to produce the top 10 results • For the K-Means algorithm, we set K=10 and use the top 1 document of each cluster to construct the top 10 results
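The K-Means baseline can be sketched as follows. This is a minimal illustration, assuming that documents are dense vectors listed in search-rank order and that "top 1 document of each cluster" means the best-ranked member of each cluster. The plain Lloyd iteration, the seeded random initialization, and the function names are choices made for the example, not details from the slides.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two dense vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iters=20, seed=0):
    """Basic Lloyd's algorithm; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each non-empty cluster's centroid to its mean.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels

def top_per_cluster(vectors, k):
    """Indices of the best-ranked document in each cluster, in rank order.
    Assumes vectors[0] is the top search result."""
    labels = kmeans(vectors, k)
    seen, picks = set(), []
    for idx, label in enumerate(labels):
        if label not in seen:
            seen.add(label)
            picks.append(idx)
    return picks
```

With the setting on the slide, this would be called as `top_per_cluster(vectors, 10)` on the vectors of the top 1000 results. It can return fewer than 10 documents if a cluster ends up empty.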

Affinity Ranking vs. K-Means Clustering

Affinity Ranking in Newsgroup dataset • Query • We compare our approach with the Okapi system in three aspects: diversity, information richness and relevance

Affinity Ranking in Newsgroup dataset • Four researchers were hired to label the top 50 search results for each of the 20 queries based on the following steps:

Affinity Ranking in Newsgroup dataset • N is the number of users • X could be diversity, information richness, or relevance of the top search results • A and F represent results from our ranking scheme and full-text search

Improvement in Top 10 Search Results • The top 10 search results always receive the most attention from end-users • In this experiment, we use the rank-combination scheme with α = 0 and β = 1

Improvement within Top 50 Search Results

A Case Study • This example is extracted from our experiments on the Newsgroup search for the query “Outlook print error”

CONCLUSIONS • Proposed two new metrics, diversity and information richness • A novel ranking scheme, Affinity Ranking, is proposed to re-rank the search results • Our experiments showed that the proposed metrics and new ranking method can effectively improve the search performance • Future work includes scaling our Affinity Ranking computation, for example, to the Web scale
