Combining Link and Content Information in Web Search

Combining Link and Content Information in Web Search Fabiana F. Prabhakar Megan Smith

Motivation • Web search results can be improved considerably by taking the link structure of the documents into account. • Goal: an algorithm that ranks documents on their links and content combined and that still performs well at query time. • HITS: computing hubs and authorities at query time is not feasible. • Topic drift: neither HITS nor PageRank takes the topic into consideration when ranking pages.

PageRank • Web surfer who jumps from page to page, choosing with uniform probability which link to follow at each step; • From time to time, the surfer will jump to a random page with a small probability. This also happens whenever a page with no links is reached; • Represent the web as a graph: each page is a node and each outlink is an edge in the graph.
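The random-surfer description above can be sketched as a short power iteration. This is only an illustration under assumptions not in the slides: the function name `pagerank`, the `damping` value of 0.85 (the probability of following a link rather than jumping), the fixed iteration count, and the toy graph are all hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for the random-surfer model.

    links maps each page to the list of pages it links to.
    damping is the (assumed) probability of following a link.
    """
    pages = sorted(set(links) | {j for outs in links.values() for j in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random-jump term: every page receives an equal share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for i in pages:
            outs = links.get(i, [])
            if outs:
                # Uniform choice among the outlinks of page i.
                share = damping * rank[i] / len(outs)
                for j in outs:
                    new_rank[j] += share
            else:
                # Page with no links: the surfer jumps to a random page.
                share = damping * rank[i] / n
                for j in pages:
                    new_rank[j] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): each page is a node, each outlink an edge.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_links)
```

The scores form a probability distribution over pages, so they sum to 1; page `c`, which three pages link to, ends up ranked above page `d`, which nothing links to.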

Directed Surfer Model • The surfer hops probabilistically from page to page, depending on the content of the pages and the query terms the surfer is looking for. • A page rank is calculated for each document-term pair in the collection (this calculation is done offline, not at query time).

QD-PageRank_q(j) • For a single term q, the resulting probability distribution over pages is: $\text{QD-PageRank}_q(j) = P_q(j) = (1-\beta)\,P'_q(j) + \beta \sum_{i \in B_j} P_q(i)\,P_q(i \to j)$, where $\beta$ is the probability of following a link and $B_j$ is the set of pages that link to $j$. • $P_q(i \to j)$: probability that the surfer moves from page $i$ to page $j$ for the query $q$. • $P'_q(j)$: specifies where the surfer will choose to jump when not following links. Jumping outside the topic.

Some definitions • W = set of words in the collection; • S = number of unique document-term pairs; • N = total number of documents.

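$R_q(j)$: relevance of page $j$ to query $q$ • $P'_q(j) = R_q(j) / \sum_{k \in W} R_q(k)$ • $P_q(i \to j) = R_q(j) / \sum_{k \in F_i} R_q(k)$, where $F_i$ is the set of pages that page $i$ links to (in both sums $k$ ranges over pages). • When choosing among outlinks, the directed surfer tends to follow those which lead to pages with relevant content.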
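The single-term computation can be sketched by plugging the two relevance-based probabilities into a power iteration. This is a minimal sketch, not the authors' implementation: the function name `qd_pagerank`, the parameter `beta` (assumed to be the probability of following a link rather than jumping, set to 0.85 here), the iteration count, the toy relevance scores, and the handling of pages whose outlinks all have zero relevance (the surfer is assumed to jump instead) are assumptions for illustration.

```python
def qd_pagerank(links, relevance, beta=0.85, iterations=50):
    """Query-dependent PageRank for one term q (illustrative sketch).

    links:     page -> list of pages it links to (F_i)
    relevance: page -> R_q(page) >= 0; missing pages count as 0
    beta:      assumed probability of following a link
    """
    pages = sorted(
        set(links) | {j for outs in links.values() for j in outs} | set(relevance)
    )
    r = {p: float(relevance.get(p, 0.0)) for p in pages}
    total = sum(r.values())
    if total == 0:
        raise ValueError("no page is relevant to this term")
    # P'_q(j): jump target distribution, proportional to relevance.
    jump = {p: r[p] / total for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {p: (1.0 - beta) * jump[p] for p in pages}
        for i in pages:
            if rank[i] == 0.0:
                continue
            outs = set(links.get(i, []))
            out_total = sum(r[k] for k in outs)
            if out_total > 0:
                # P_q(i -> j) = R_q(j) / sum of R_q(k) over outlinks k of i.
                for j in outs:
                    new_rank[j] += beta * rank[i] * r[j] / out_total
            else:
                # Assumption: with no relevant outlink, jump according to P'_q.
                for j in pages:
                    new_rank[j] += beta * rank[i] * jump[j]
        rank = new_rank
    return rank


# Toy data (hypothetical): page "d" does not match the term at all.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_relevance = {"a": 1.0, "b": 3.0, "c": 1.0}
qd_ranks = qd_pagerank(toy_links, toy_relevance)
```

Because both the jump and the link-following steps are weighted by relevance, a page with zero relevance and no inlinks (here `d`) receives no probability mass at all.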

Multiple-term query (during retrieval) Q = {q1, q2, …, qn} LOOP { // select a term that was not selected before SELECT q from Q according to P(q); use QD-PageRank_q(j) to accumulate QD-PageRank_Q(j)*; } *$\text{QD-PageRank}_Q(j) = P_Q(j) = \sum_{q \in Q} P(q)\,P_q(j)$
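The query-time combination is a weighted sum of precomputed per-term scores, which might look like the sketch below. The function name `combine_term_ranks`, the uniform choice of P(q), and the toy scores are assumptions for illustration; the slides do not say how P(q) is chosen.

```python
def combine_term_ranks(term_ranks, term_probs):
    """P_Q(j) = sum over q in Q of P(q) * P_q(j).

    term_ranks: term -> {page: precomputed QD-PageRank_q(page)}
    term_probs: term -> P(q); should sum to 1 over the query terms
    """
    combined = {}
    for q, p_q in term_probs.items():
        # Pages that do not contain q contribute nothing for that term.
        for page, score in term_ranks.get(q, {}).items():
            combined[page] = combined.get(page, 0.0) + p_q * score
    return combined


# Hypothetical precomputed per-term scores for a two-term query.
term_ranks = {
    "web": {"a": 0.5, "b": 0.5},
    "search": {"b": 0.5, "c": 0.5},
}
term_probs = {"web": 0.5, "search": 0.5}  # assumed uniform P(q)

scores = combine_term_ranks(term_ranks, term_probs)
top = sorted(scores, key=scores.get, reverse=True)
```

Only a lookup and a weighted sum happen at query time; the expensive iteration for each term is done offline.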

Scalability • QD-PageRank_q(j) is calculated considering just the documents that contain q. The storage requirement is therefore proportional to S, which is at most N·|W| and typically far smaller. • QD-PageRank_Q(j) is calculated at query time.
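To make the storage argument concrete, the sketch below counts S from a tiny inverted index and compares it with the dense bound of one score per (document, word) combination. The function name `storage_counts`, the whitespace tokenizer, and the three toy documents are hypothetical.

```python
def storage_counts(docs):
    """Return (N, |W|, S) for a collection given as doc_id -> text."""
    index = {}
    for doc_id, text in docs.items():
        # Naive whitespace tokenization, an assumption for this sketch.
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    n_docs = len(docs)
    n_words = len(index)
    # One stored score per unique document-term pair.
    s_pairs = sum(len(doc_ids) for doc_ids in index.values())
    return n_docs, n_words, s_pairs


toy_docs = {
    "d1": "web search ranking",
    "d2": "link analysis web",
    "d3": "random surfer model",
}
n_docs, n_words, s_pairs = storage_counts(toy_docs)
dense_bound = n_docs * n_words
```

For this toy collection, 9 scores are stored instead of the 24 that a dense documents-by-words table would need, because each term's ranking covers only the documents that contain it.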

Time Requirements • Time to compute QD-PageRankq(j) for all q in W = O(S). Experiments have shown that the computation converges in fewer iterations for these smaller sub-graphs, reducing the computational requirements. • For most words, the sub-graph will fit in memory, reducing disk I/O during computation.

Results • Three volunteers were asked to provide one single-word query and two double-word queries each. • For each query, the top 10 results from standard PageRank and from QD-PageRank were randomly mixed and given to four volunteers, who were asked to rate each result. • None of them knew how the results were obtained.

Paper • Richardson and Domingos, "Combining Link and Content Information in Web Search", 2004: http://www.cs.washington.edu/pedrod/papers/webdyn.pdf • Original conference version: "The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank", 2002: http://citeseer.ist.psu.edu/460350.html
