Combining Link and Content Information in Web Search Fabiana F. Prabhakar Megan Smith
Motivation • Web search results can be improved considerably by taking the link structure of the documents into account. • Goal: an algorithm that ranks documents on their links and content combined and that still performs well at query time. • HITS: computing hubs and authorities at query time is not feasible. • Topic drift: neither HITS nor PageRank takes the query topic into consideration when ranking pages.
PageRank • Models a web surfer who jumps from page to page, choosing with uniform probability which link to follow at each step. • From time to time, with a small probability, the surfer jumps to a random page instead. This also happens whenever a page with no links is reached. • The web is represented as a graph: each page is a node and each outlink is a directed edge.
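The random-surfer description above can be sketched as a short power iteration. This is only an illustrative sketch, not the implementation used by the authors: the function name `pagerank`, the toy graph `toy_web`, and the link-following probability `beta = 0.85` are assumptions chosen for the example.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Power iteration for the random-surfer model.

    links: dict mapping each page to the list of pages it links to.
    beta:  probability of following a link (assumed value, not from the slides).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability 1 - beta the surfer jumps to a uniformly random page.
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for page in pages:
            outlinks = links.get(page, [])
            if outlinks:
                # Follow one of the outlinks with uniform probability.
                share = beta * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Page with no links: the surfer jumps to a random page.
                share = beta * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web used only for illustration.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph page `d` has no incoming links, so its score comes only from random jumps, while `c`, which three pages link to, ends up ranked above it.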
Directed Surfer Model • The surfer hops probabilistically from page to page, depending on the content of the pages and the query terms the surfer is looking for. • A PageRank score is calculated for each document-term pair in the collection (this calculation is done offline, not at query time).
QD-PageRankq(j) • For a single term q, the resulting probability distribution over pages is: $\text{QD-PageRank}_q(j) = P_q(j) = (1-\beta)\,P'_q(j) + \beta \sum_{i \in B_j} P_q(i)\,P_q(i \to j)$ • $P_q(i \to j)$: probability that the surfer will jump from i to j for the query q. • $P'_q(j)$: specifies where the surfer will choose to jump when not following links. Jumping outside the topic. • Here $\beta$ is the probability of following a link and $B_j$ is the set of pages that link to j.
Some definitions • W = set of words in the collection; • S = number of unique document-term pairs; • N = total number of documents.
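Rq(j) Relevance of page j to query q • $P'_q(j) = \dfrac{R_q(j)}{\sum_{k \in W} R_q(k)}$ • $P_q(i \to j) = \dfrac{R_q(j)}{\sum_{k \in F_i} R_q(k)}$ • In both sums k ranges over pages: over the whole collection in the first, and over $F_i$, the pages that page i links to, in the second. • When choosing among outlinks, the directed surfer tends to follow those that lead to pages with relevant content.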
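The relevance-weighted jump and link-following probabilities defined above can be plugged into the same kind of power iteration as plain PageRank. The following is a minimal sketch under stated assumptions: the function name `qd_pagerank`, the toy data, and `beta = 0.85` are hypothetical, and the rule for pages whose outlinks all have zero relevance (jump according to the relevance distribution) is an assumption made here so that the probabilities stay well defined.

```python
def qd_pagerank(links, relevance, beta=0.85, iterations=50):
    """Query-dependent PageRank for a single term q (illustrative sketch).

    links:     dict mapping each page to the list of pages it links to.
    relevance: dict mapping each page to R_q(page) >= 0.
    beta:      probability of following a link (assumed value).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    total = sum(relevance.get(p, 0.0) for p in pages)
    if total == 0:
        raise ValueError("no page is relevant to the term")
    # P'_q(j): where the surfer jumps when not following links.
    jump = {p: relevance.get(p, 0.0) / total for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {p: (1.0 - beta) * jump[p] for p in pages}
        for i in pages:
            outlinks = links.get(i, [])
            out_total = sum(relevance.get(k, 0.0) for k in outlinks)
            if out_total > 0:
                # P_q(i -> j): follow outlinks in proportion to their relevance.
                for j in outlinks:
                    new_rank[j] += beta * rank[i] * relevance.get(j, 0.0) / out_total
            else:
                # Assumption: with no relevant outlink, jump according to P'_q.
                for j in pages:
                    new_rank[j] += beta * rank[i] * jump[j]
        rank = new_rank
    return rank


# Hypothetical graph and relevance scores for one query term.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_relevance = {"a": 0.0, "b": 2.0, "c": 1.0, "d": 1.0}
qd_ranks = qd_pagerank(toy_web, toy_relevance)
```

Because page `a` has zero relevance in this toy data, the directed surfer never lands on it, either by jumping or by following the link from `c`, so it receives no score for the term even though it has an incoming link.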
Multiple-term query (during retrieval) $Q = \{q_1, q_2, \dots, q_n\}$ LOOP{ //select a term that was not selected before SELECT q from Q according to P(q); Use QD-PageRankq(j) to calculate QD-PageRankQ(j)*; } *$\text{QD-PageRank}_Q(j) = P_Q(j) = \sum_{q \in Q} P(q)\,P_q(j)$
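At query time the per-term scores only need to be combined as a weighted sum. The sketch below illustrates that combination under stated assumptions: the function name `combine_term_scores`, the example scores, and the uniform term weights P(q) = 1/2 are hypothetical and not taken from the slides.

```python
def combine_term_scores(term_scores, term_weights):
    """Combine precomputed per-term scores into one score per page.

    term_scores:  dict term -> {page: P_q(page)}, computed offline.
    term_weights: dict term -> P(q); normalised here so the weights sum to 1.
    """
    total_weight = sum(term_weights.values())
    combined = {}
    for term, weight in term_weights.items():
        for page, score in term_scores.get(term, {}).items():
            combined[page] = combined.get(page, 0.0) + (weight / total_weight) * score
    return combined


# Hypothetical precomputed scores for two terms.
term_scores = {
    "web": {"a": 0.5, "b": 0.5},
    "search": {"b": 0.25, "c": 0.75},
}
# Assumption: both query terms are equally likely to be selected.
term_weights = {"web": 0.5, "search": 0.5}

combined = combine_term_scores(term_scores, term_weights)
ranking = sorted(combined, key=lambda page: (-combined[page], page))
```

Only pages that appear in the stored scores of at least one query term are touched, which keeps the query-time work small.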
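Scalability • QD-PageRankq(j) is calculated considering just the documents that contain q, so the storage requirement is proportional to S, the number of document-term pairs (at most N·|W|, and typically far smaller, because each document contains only a small fraction of the words). • QD-PageRankQ(j) is calculated at query time.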
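To make the storage argument concrete, the quantities N, W and S can be counted on a toy collection. This is a rough sketch with made-up documents; the helper `build_term_index` and the whitespace tokenisation are assumptions for illustration only.

```python
def build_term_index(documents):
    """Map each term to the set of documents that contain it.

    documents: dict doc_id -> text. Tokenisation is a plain lowercase
    whitespace split, which is an assumption made for this example.
    """
    index = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical three-document collection.
docs = {
    "d1": "web search ranking",
    "d2": "web graph",
    "d3": "search engine ranking web",
}
index = build_term_index(docs)

N = len(docs)                                      # total number of documents
W = set(index)                                     # set of words in the collection
S = sum(len(doc_ids) for doc_ids in index.values())  # unique document-term pairs
```

One score is stored per entry of `index`, so the number of stored values is S, which lies between N and N·|W| whenever every document contains at least one term.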
Time Requirements • Time to compute QD-PageRankq(j) for all q in W = O(S). Experiments have shown that the computation converges in fewer iterations for these smaller sub-graphs, reducing the computational requirements. • For most words, the sub-graph will fit in memory, reducing disk I/O during computation.
Results • Three volunteers were asked to provide a single-word query and two double-word queries. • For each query, the top 10 results from standard PageRank and QD-PageRank were randomly mixed and given to four volunteers, who were asked to rate each result. • None of them knew how the results were obtained.
Paper Combining Link and Content Information in Web Search http://www.cs.washington.edu/pedrod/papers/webdyn.pdf Richardson and Domingos, 2004 (Original conference version: The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, 2002 - http://citeseer.ist.psu.edu/460350.html)