PageRank without hyperlinks: Structural re-ranking using links induced by language models • Oren Kurland and Lillian Lee, Cornell • SIGIR 2005
Objective • IR re-ranking on non-hypertext documents using PageRank • Use language-model-based weights in the PageRank matrix
Method Outline • Initial retrieval using KL-Divergence model (use Lemur) • Generate PageRank matrix from top k retrieved documents according to the paper’s model • Do the PageRank iterations • Re-rank the documents
Concept 1: Generation Probability • Under the maximum likelihood estimate (MLE), the probability of a word w occurring in x (a document or a document collection) is the term frequency of w in x divided by the total number of term occurrences in x • tf is the term frequency
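A minimal sketch of the maximum likelihood estimate, assuming x is given as a list of tokens. The function name `mle_prob` and the toy document are illustrative, not from the slides.

```python
from collections import Counter

def mle_prob(word, tokens):
    """MLE probability of `word` in x: tf(word in x) / total token count of x."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Toy document (hypothetical data)
doc = "the cat sat on the mat".split()
print(mle_prob("the", doc))  # "the" occurs 2 times out of 6 tokens
```

Note that any word absent from x gets probability 0 under this estimate, which is what motivates smoothing.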
Concept 1: Generation Probability (Cont.) • The Dirichlet-smoothed model adds μ·pcMLE(w) to the term frequency of w and μ to the total term count of x • pcMLE(w) is the MLE probability of w in the entire document collection c • μ controls the influence of pcMLE(w)
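A minimal sketch of Dirichlet smoothing in its standard form, (tf(w in x) + μ·pcMLE(w)) / (|x| + μ). The function name and toy data are assumptions made for illustration, and the collection is assumed to be non-empty.

```python
from collections import Counter

def dirichlet_prob(word, doc_tokens, collection_tokens, mu):
    """Dirichlet-smoothed probability of `word` in a document.

    mu is the smoothing parameter: larger values pull the estimate
    toward the collection-wide MLE probability.
    """
    tf = Counter(doc_tokens)[word]
    p_c = Counter(collection_tokens)[word] / len(collection_tokens)
    return (tf + mu * p_c) / (len(doc_tokens) + mu)

# Toy data (hypothetical): "c" never occurs in the document
doc = ["a", "b"]
collection = ["a", "b", "c", "c"]
print(dirichlet_prob("c", doc, collection, mu=2.0))  # non-zero despite tf = 0
```

With μ = 0 the estimate reduces to the plain MLE of the document.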
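Concept 1: Generation Probability (Cont.) • The probability of a document x generating a sequence of words w1w2…wn can be defined in two ways: as the product of the per-word probabilities under the MLE model, or under the Dirichlet-smoothed model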
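Written out, and assuming both definitions are unigram products over the per-word models from the previous two slides (the superscript notation here is an assumption), the two forms would be:

```latex
p_x^{MLE}(w_1 w_2 \cdots w_n) = \prod_{j=1}^{n} p_x^{MLE}(w_j),
\qquad
p_x^{[\mu]}(w_1 w_2 \cdots w_n) = \prod_{j=1}^{n} p_x^{[\mu]}(w_j)
```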
Concept 1: Generation Probability (Cont.) • KL-divergence combines the previous two functions into a single score, pd(s), by comparing the MLE model of s with the Dirichlet-smoothed model of d • This is the generation probability function used in the paper • It is the probability of document d generating word sequence s
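A sketch of one way to compute such a score, assuming the form pd(s) = exp(−KL(MLE model of s ‖ Dirichlet-smoothed model of d)). The exact formula on the slide is not shown here, so treat this form, the function name, and the toy data as assumptions. The code also assumes every word of s occurs somewhere in the collection.

```python
import math
from collections import Counter

def kl_generation_prob(s_tokens, d_tokens, collection_tokens, mu):
    """exp(-KL(p_s^MLE || p_d^[mu])): higher means d 'generates' s better."""
    s_counts = Counter(s_tokens)
    d_counts = Counter(d_tokens)
    c_counts = Counter(collection_tokens)
    kl = 0.0
    for w, tf in s_counts.items():
        p_s = tf / len(s_tokens)                      # MLE model of s
        p_c = c_counts[w] / len(collection_tokens)    # collection MLE
        p_d = (d_counts[w] + mu * p_c) / (len(d_tokens) + mu)  # smoothed model of d
        kl += p_s * math.log(p_s / p_d)
    return math.exp(-kl)

# Toy data (hypothetical)
d1 = "a a b".split()
d2 = "c c c".split()
s = "a b".split()
collection = d1 + d2 + s
print(kl_generation_prob(s, d1, collection, mu=1.0))
print(kl_generation_prob(s, d2, collection, mu=1.0))
```

In this toy example d1 shares vocabulary with s and d2 does not, so d1 receives the higher score.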
Concept 2: Top Generators • The top generators of a document s are the documents d with the highest generation probabilities
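A minimal sketch of selecting top generators, assuming a scoring function `gen_prob(d, s)` that returns the generation probability pd(s). The cutoff name `alpha` and the score table are hypothetical.

```python
def top_generators(s, docs, gen_prob, alpha):
    """Return the `alpha` documents other than s with the highest gen_prob(d, s)."""
    candidates = [d for d in docs if d != s]
    candidates.sort(key=lambda d: gen_prob(d, s), reverse=True)
    return candidates[:alpha]

# Hypothetical scores: P[(d, s)] is the probability of d generating s
P = {("d2", "d1"): 0.6, ("d3", "d1"): 0.2,
     ("d1", "d2"): 0.5, ("d3", "d2"): 0.1,
     ("d1", "d3"): 0.3, ("d2", "d3"): 0.4}
docs = ["d1", "d2", "d3"]
gen_prob = lambda d, s: P[(d, s)]
print(top_generators("d1", docs, gen_prob, alpha=1))
```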
Graph Generation • We can construct a graph from a collection of documents, with one node per document • The edges and edge weights can be defined in two ways
Graph Generation (Cont.) • o→g means an edge from document o to document g • The first definition assigns a uniform weight of 1 to every edge pointing from a document to one of its top generators • The second definition uses the generation probability as the edge weight
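The two edge definitions can be sketched as follows, assuming edges exist only from a document o to its top generators g and that all other pairs have weight 0. The function name, the `alpha` cutoff, and the score table are hypothetical.

```python
def build_edges(docs, gen_prob, alpha, weighted):
    """Map (o, g) -> weight for each document o and each of its top generators g.

    weighted=False gives the uniform definition (weight 1);
    weighted=True uses the generation probability gen_prob(g, o).
    """
    edges = {}
    for o in docs:
        ranked = sorted((g for g in docs if g != o),
                        key=lambda g: gen_prob(g, o), reverse=True)
        for g in ranked[:alpha]:
            edges[(o, g)] = gen_prob(g, o) if weighted else 1.0
    return edges

# Hypothetical scores: P[(g, o)] is the probability of g generating o
P = {("d2", "d1"): 0.6, ("d3", "d1"): 0.2,
     ("d1", "d2"): 0.5, ("d3", "d2"): 0.1,
     ("d1", "d3"): 0.3, ("d2", "d3"): 0.4}
docs = ["d1", "d2", "d3"]
gen_prob = lambda g, o: P[(g, o)]
print(build_edges(docs, gen_prob, alpha=1, weighted=False))
print(build_edges(docs, gen_prob, alpha=1, weighted=True))
```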
Weight-Smoothing • We can smooth the edge weights so that every edge has a non-zero weight • Dinit is the set of documents we wish to re-rank • λ controls the relative influence of the two components
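A sketch of PageRank-style smoothing, assuming the two components are a uniform term 1/|Dinit| and the edge weight normalized over the outgoing edges of the source node, mixed as (1 − λ)·uniform + λ·normalized. This form is an assumption, as are the function name and toy edges.

```python
def smooth_weights(docs, edges, lam):
    """Interpolate each node's normalized outgoing weights with a uniform distribution."""
    n = len(docs)
    smoothed = {}
    for o in docs:
        out_total = sum(edges.get((o, g), 0.0) for g in docs)
        for g in docs:
            # Normalized weight of o -> g; 0 if o has no outgoing edges
            rel = edges.get((o, g), 0.0) / out_total if out_total else 0.0
            smoothed[(o, g)] = (1 - lam) / n + lam * rel
    return smoothed

# Toy edges (hypothetical)
docs = ["d1", "d2", "d3"]
edges = {("d1", "d2"): 0.6, ("d2", "d1"): 0.5, ("d3", "d2"): 0.4}
smoothed = smooth_weights(docs, edges, lam=0.5)
print(smoothed[("d1", "d2")], smoothed[("d1", "d3")])
```

When every node has at least one outgoing edge, each node's smoothed outgoing weights sum to 1, which is the property the recursive centrality computation relies on.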
Concept 3: Graph Centrality • Now that we have a graph, how do we define the centrality (importance) of each node (document)? • Influx version: • The centrality of a node is simply the total weight of the edges pointing to it
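The influx version can be sketched as a sum over incoming edges. The edge dictionary format `(source, target) -> weight` and the toy graph are assumptions made for illustration.

```python
def influx_centrality(docs, edges):
    """Centrality of d = total weight of edges o -> d pointing into d."""
    return {d: sum(w for (o, g), w in edges.items() if g == d) for d in docs}

# Toy uniform-weight graph (hypothetical)
docs = ["d1", "d2", "d3"]
edges = {("d1", "d2"): 1.0, ("d2", "d1"): 1.0, ("d3", "d2"): 1.0}
print(influx_centrality(docs, edges))
```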
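Concept 3: Graph Centrality (Cont.) • Recursive Influx Version: • Centrality is recursively defined: a node's centrality is the sum, over its incoming edges, of the edge weight times the centrality of the source node • This is the PageRank version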
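A minimal power-iteration sketch of the recursive influx, assuming smoothed weights whose outgoing values sum to 1 for every node. The function name, iteration limits, and toy weights are hypothetical choices, not values from the slides.

```python
def recursive_influx(docs, weights, iterations=100, tol=1e-10):
    """Iterate Cen(d) = sum over o of weights[(o, d)] * Cen(o) until it stabilizes."""
    cen = {d: 1.0 / len(docs) for d in docs}
    for _ in range(iterations):
        new = {d: sum(weights.get((o, d), 0.0) * cen[o] for o in docs) for d in docs}
        total = sum(new.values())
        new = {d: v / total for d, v in new.items()}  # keep scores summing to 1
        delta = max(abs(new[d] - cen[d]) for d in docs)
        cen = new
        if delta < tol:
            break
    return cen

# Toy smoothed weights (hypothetical): each source's outgoing weights sum to 1
docs = ["d1", "d2", "d3"]
weights = {("d1", "d1"): 1/6, ("d1", "d2"): 4/6, ("d1", "d3"): 1/6,
           ("d2", "d1"): 4/6, ("d2", "d2"): 1/6, ("d2", "d3"): 1/6,
           ("d3", "d1"): 1/6, ("d3", "d2"): 4/6, ("d3", "d3"): 1/6}
cen = recursive_influx(docs, weights)
print(cen)
```

In this toy graph d2 ends up most central because both d1 and d3 point to it most strongly.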
Concept 3: Graph Centrality (Cont.) • We get a total of 4 models if we consider uniform/non-uniform weights and non-recursive/recursive influxes • Recall that uniform weights mean edge weights with values 0 or 1
Combining Centrality with Initial Relevance Score • Centrality scores are computed on the set of initially retrieved documents • The initially retrieved documents also have relevance scores assigned by the KL-divergence retrieval model • We can combine the two scores: • Cen(d;G) is the centrality score • pd(q) is the retrieval score • The combined score is simply the product of the two
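The product combination can be sketched as follows. The score values are made up for illustration.

```python
def rerank(centrality, retrieval_score):
    """Order documents by Cen(d;G) * pd(q), highest first."""
    combined = {d: centrality[d] * retrieval_score[d] for d in centrality}
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical scores
centrality = {"d1": 0.40, "d2": 0.45, "d3": 0.15}
retrieval = {"d1": 0.5, "d2": 0.2, "d3": 0.3}
print(rerank(centrality, retrieval))
```

Here d2 is the most central document, but d1 ranks first once the retrieval score is factored in.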
Final combinations of models • Now we have 8 models: • U-In • W-In • U-In+LM (centrality * retrieval score) • W-In+LM • R-U-In • R-W-In • R-U-In+LM • R-W-In+LM
Experiment 1: Model Comparison • 4 TREC corpora • Re-rank the top 50 retrieved documents • Upper-bound performance: move all relevant documents among the top 50 to the front • Initial ranking: parameter optimized for best precision at 1000 • Optimal baseline: performance with the best parameter setting
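The upper bound can be pictured as an oracle re-ranking. A minimal sketch, assuming relevance judgments are available as a set and that the original order is preserved within each group (the function name and data are hypothetical):

```python
def upper_bound_ranking(ranked, relevant, k=50):
    """Move the relevant documents among the top k to the front; leave the rest in order."""
    head, tail = ranked[:k], ranked[k:]
    hits = [d for d in head if d in relevant]
    misses = [d for d in head if d not in relevant]
    return hits + misses + tail

# Toy ranking (hypothetical), using k=3 to keep the example small
print(upper_bound_ranking(["a", "b", "c", "d"], {"c", "d"}, k=3))
```

Document "d" is relevant but lies outside the top k, so no re-ranking of the top k can promote it.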
Experiment 1 Results • Highlighted values indicate the best performances • The R-W-In+LM model has the best performance on average
Experiment 2: Cosine Similarity • So far, top generators and edge weights are computed using the language model score pd(s) • In this experiment, pd(s) is replaced by the tf*idf cosine similarity between the two documents
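A minimal sketch of tf*idf cosine similarity, assuming raw term frequency and idf = log(N / df). The slides do not say which tf*idf variant was used, so this weighting, the function names, and the toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document name to a sparse {word: tf * idf} vector."""
    n = len(docs)
    df = Counter(w for tokens in docs.values() for w in set(tokens))
    return {name: {w: tf * math.log(n / df[w]) for w, tf in Counter(tokens).items()}
            for name, tokens in docs.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents (hypothetical)
docs = {"d1": "a b".split(), "d2": "a b".split(), "d3": "c d".split()}
vecs = tfidf_vectors(docs)
print(cosine(vecs["d1"], vecs["d2"]), cosine(vecs["d1"], vecs["d3"]))
```

Unlike pd(s), cosine similarity is symmetric, so it scores a pair of documents the same in both directions.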
Experiment 2: Results • One marker in the results indicates that the language model is better than cosine similarity by at least 5% • The other marker indicates that cosine similarity is better than the language model by 5% • The language model is better overall
Experiment 3: Centrality Alternatives • The best re-ranking model so far is R-W-In+LM: • What if we replace Cen(d;G) by other scores
Experiment 3: Results • Again, R-W-In+LM wins
Conclusion • PageRank on documents without explicit links