PageRank without hyperlinks: Structural re-ranking using links induced by language models • Oren Kurland and Lillian Lee • Cornell • SIGIR 2005
Objective • IR re-ranking on non-hypertext documents using PageRank • Use language-model-based weights in the PageRank matrix
Method Outline • Initial retrieval using KL-Divergence model (use Lemur) • Generate PageRank matrix from top k retrieved documents according to the paper’s model • Do the PageRank iterations • Re-rank the documents
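A minimal Python sketch of this pipeline, composing helper functions (`build_graph`, `smooth_weights`, `recursive_influx`) that are sketched under the corresponding concepts below; the initial KL-divergence retrieval scores are taken as given input rather than computed via Lemur:

```python
def rerank(initial_scores, docs, collection, k=50):
    """End-to-end re-ranking (the R-W-In+LM variant from this deck).
    `initial_scores` maps doc ids to KL-divergence retrieval scores
    (assumed given, not computed via Lemur); `docs` maps doc ids to
    word lists; the helpers are sketched later in the deck."""
    # Step 1: keep the top-k documents from the initial retrieval.
    top_k = sorted(initial_scores, key=initial_scores.get, reverse=True)[:k]
    top_docs = {d: docs[d] for d in top_k}
    # Steps 2-3: generation-probability graph, smoothing, PageRank iterations.
    wt = smooth_weights(build_graph(top_docs, collection, uniform=False),
                        top_k)
    cen = recursive_influx(wt, top_k)
    # Step 4: re-rank by centrality * initial relevance score.
    return sorted(top_k, key=lambda d: cen[d] * initial_scores[d],
                  reverse=True)
```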
Concept 1: Generation Probability • The probability of a word w occurring in a document or document collection x according to the maximum-likelihood model is • $p_x^{MLE}(w) = \frac{tf(w \in x)}{\sum_{w'} tf(w' \in x)}$ • tf is the term frequency
Concept 1: Generation Probability (Cont.) • Using the Dirichlet-smoothed model, we get • $p_x^{[\mu]}(w) = \frac{tf(w \in x) + \mu \cdot p_c^{MLE}(w)}{\sum_{w'} tf(w' \in x) + \mu}$ • $p_c^{MLE}(w)$ is the MLE probability of w in the entire document collection c • μ controls the influence of $p_c^{MLE}(w)$
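A minimal Python sketch of these two estimates, assuming documents are plain word lists (the function names and the toy data are mine, not the paper's):

```python
from collections import Counter

def mle_prob(w, text):
    """Maximum-likelihood estimate: relative frequency of w in the
    word list `text` (a document or the pooled collection)."""
    return Counter(text)[w] / len(text)

def dirichlet_prob(w, doc, collection, mu=2000):
    """Dirichlet-smoothed probability of w in doc; the collection-wide
    MLE acts as the prior, and mu controls its influence."""
    return (Counter(doc)[w] + mu * mle_prob(w, collection)) / (len(doc) + mu)

# Toy usage: the collection is just all documents' tokens pooled together.
doc = "pagerank scores web pages".split()
collection = doc + "language models can score documents too".split()
print(dirichlet_prob("pagerank", doc, collection))  # smoothed toward the prior
print(dirichlet_prob("language", doc, collection))  # non-zero despite tf = 0
```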
Concept 1: Generation Probability (Cont.) • Two ways of defining the probability of a document x generating a sequence of words $w_1 w_2 \ldots w_n$ are • the term-by-term product $\prod_{i=1}^{n} p_x^{[\mu]}(w_i)$ • and its length-normalized (geometric-mean) variant $\left(\prod_{i=1}^{n} p_x^{[\mu]}(w_i)\right)^{1/n}$
Concept 1: Generation Probability (Cont.) • KL-divergence combines the previous two functions into • $p_d^{KL}(s) = \exp\left(-KL\left(p_s^{MLE}(\cdot) \,\|\, p_d^{[\mu]}(\cdot)\right)\right)$ • That is the generation probability function used in this paper: the probability of document d generating word sequence s
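A sketch of this KL-based generation probability, reusing `dirichlet_prob` from the previous snippet:

```python
import math
from collections import Counter

def kl_generation_prob(s, d, collection, mu=2000):
    """p_d(s) = exp(-KL(p_s^MLE || p_d^[mu])): how well document d's
    smoothed language model explains the word sequence s.
    Assumes every word of s occurs somewhere in `collection`, so the
    smoothed probability is never zero. `dirichlet_prob` is defined
    in the previous sketch."""
    n = len(s)
    kl = 0.0
    for w, count in Counter(s).items():
        p_s = count / n
        kl += p_s * math.log(p_s / dirichlet_prob(w, d, collection, mu))
    return math.exp(-kl)
```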
Concept 2: Top Generators • The top generators of a document s are the documents d (d ≠ s) with the highest generation probabilities $p_d(s)$
Graph Generation • We can construct a graph from a collection of documents • Two ways of defining the edges and edge weights are • uniform: $wt(o \to g) = 1$ if g is a top generator of o, else 0 • weighted: $wt(o \to g) = p_g(o)$ if g is a top generator of o, else 0
Graph Generation (Cont.) • $o \to g$ means an edge from document o to document g • The first definition assigns a uniform weight of 1 to every edge pointing from a document to each of its top generators • The second definition uses the generation probability itself as the weight
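A sketch of the graph construction, building on `kl_generation_prob` above; `alpha` (the number of top generators per document) is a parameter name I chose, and documents are a dict mapping ids to word lists:

```python
def top_generators(s_id, docs, collection, alpha=4):
    """The alpha documents d != s whose language models give s the
    highest generation probability p_d(s). `docs` maps ids to word lists."""
    scored = [(d_id, kl_generation_prob(docs[s_id], docs[d_id], collection))
              for d_id in docs if d_id != s_id]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:alpha]

def build_graph(docs, collection, alpha=4, uniform=True):
    """Edge weights wt[o][g] for edges o -> g, where g ranges over the
    top generators of o: weight 1 in the uniform version, or the
    generation probability p_g(o) in the weighted version."""
    return {o: {g: 1.0 if uniform else p
                for g, p in top_generators(o, docs, collection, alpha)}
            for o in docs}
```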
Weight-Smoothing • We can smooth the edge weights to give non-zero weights to all edges • $wt_{[\lambda]}(o \to g) = \lambda \cdot \frac{1}{|D_{init}|} + (1 - \lambda) \cdot wt(o \to g)$ • $D_{init}$ is the set of documents we wish to re-rank • λ controls the influence of the two components
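A sketch of this smoothing step; the interpolation with a uniform component is my reading of the slide, and the default value of `lam` is arbitrary:

```python
def smooth_weights(wt, doc_ids, lam=0.05):
    """Interpolate every edge weight with a uniform component so that all
    pairs of documents in D_init get a non-zero weight (this is what lets
    the PageRank iterations converge). `lam` controls the influence of
    the uniform component."""
    n = len(doc_ids)
    return {o: {g: lam / n + (1 - lam) * wt.get(o, {}).get(g, 0.0)
                for g in doc_ids}
            for o in doc_ids}
```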
Concept 3: Graph Centrality • Now that we have a graph, how do we define the centrality (importance) of each node (document)? • Influx version: $Cen(d; G) = \sum_{o: o \to d} wt(o \to d)$ • The centrality of a node is simply the total weight of the edges pointing to it
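In code, the non-recursive influx is just a sum of incoming edge weights over the weight dictionaries built above:

```python
def influx_centrality(wt, doc_ids):
    """Non-recursive influx: Cen(d) is the total weight of edges into d."""
    return {d: sum(wt[o][d] for o in doc_ids if o != d) for d in doc_ids}
```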
Concept 3: Graph Centrality (Cont.) • Recursive influx version: $Cen(d; G) = \sum_{o: o \to d} wt(o \to d) \cdot Cen(o; G)$ • Centrality is recursively defined • This is the PageRank version
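A power-iteration sketch of the recursive version, assuming the smoothed weights from above; I row-normalize each node's outgoing weights so the scores converge to a stationary distribution:

```python
def recursive_influx(wt, doc_ids, iterations=50):
    """PageRank-style fixed point of Cen(d) = sum_o wt(o -> d) * Cen(o).
    Each node's outgoing weights are normalized to sum to 1; with the
    smoothed (strictly positive) weights this converges to a unique
    stationary distribution."""
    norm = {o: sum(wt[o].values()) for o in doc_ids}
    cen = {d: 1.0 / len(doc_ids) for d in doc_ids}
    for _ in range(iterations):
        cen = {d: sum(cen[o] * wt[o][d] / norm[o] for o in doc_ids)
               for d in doc_ids}
    return cen
```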
Concept 3: Graph Centrality (Cont.) • We get a total of 4 models if we consider uniform/non-uniform weights and non-recursive/recursive influxes • Recall that uniform weights mean edge weights with values 0 or 1
Combining Centrality with Initial Relevance Score • Centrality scores are computed on the set of initially retrieved documents • The initially retrieved documents also have relevance scores assigned by the KL-divergence retrieval model • We can combine the two scores as $Cen(d; G) \cdot p_d(q)$ • $Cen(d; G)$ is the centrality score • $p_d(q)$ is the retrieval score • Just a simple product of the two scores
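The combination step in code (a sketch; both inputs are plain dicts from doc ids to scores):

```python
def combine_and_rerank(centrality, retrieval_score):
    """Order documents by Cen(d;G) * p_d(q); both arguments map doc ids
    to scores. Returns doc ids, best first."""
    return sorted(retrieval_score,
                  key=lambda d: centrality[d] * retrieval_score[d],
                  reverse=True)
```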
Final Combinations of Models • Now we have 8 models: • U-In • W-In • U-In+LM (centrality * retrieval score) • W-In+LM • R-U-In • R-W-In • R-U-In+LM • R-W-In+LM
Experiment 1: Model Comparison • 4 TREC corpora • Re-rank the top 50 retrieved documents • Upper-bound performance: move all relevant documents among the top 50 to the front • Initial ranking: optimize the smoothing parameter for the best precision at 1000 • Optimal baseline: performance with the best parameter setting
Experiment 1 Results • Highlighted values indicate the best performance in each setting • The R-W-In+LM model has the best performance on average
Experiment 2: Cosine Similarity • Top generators and edge weights are computed using the language model $p_d(s)$ • Replace $p_d(s)$ with the tf·idf cosine similarity between two documents
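A sketch of the tf·idf cosine baseline used in this comparison; this is the standard formulation, and the exact weighting scheme in the paper may differ:

```python
import math
from collections import Counter

def tfidf_vector(doc, docs):
    """tf * idf weights for one document; `docs` is the list of all
    documents (word lists), used for document frequencies. Assumes
    `doc` is one of `docs`, so every term has document frequency >= 1."""
    n = len(docs)
    tf = Counter(doc)
    return {w: tf[w] * math.log(n / sum(1 for d in docs if w in d))
            for w in tf}

def cosine_sim(doc1, doc2, docs):
    """Cosine of the angle between the two tf*idf vectors."""
    v1, v2 = tfidf_vector(doc1, docs), tfidf_vector(doc2, docs)
    dot = sum(v1[w] * v2.get(w, 0.0) for w in v1)
    norm = math.sqrt(sum(x * x for x in v1.values())) \
         * math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0
```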
Experiment 2: Results • One marker in the results table flags cases where the language model beats cosine similarity by at least 5%; another flags cases where cosine similarity beats the language model by at least 5% • The language model is better overall
Experiment 3: Centrality Alternatives • The best re-ranking model so far is R-W-In+LM: $Cen(d; G) \cdot p_d(q)$ • What if we replace $Cen(d; G)$ with other scores?
Experiment 3: Results • Again, R-W-In+LM wins
Conclusion • PageRank can be applied to re-rank documents without explicit hyperlinks, using links induced by language models