Broder’s breakdown • First-generation ranking: • Ranked Boolean with TF-IDF-like factors • Second-generation: • Off-page, Web-specific factors • Anchor text, click-through, link analysis • Plus, focus on corpus-improvement • Third-generation (yet to come): • Answering the need behind the query
Link-analysis • Made famous by Google, but used by everyone now • Basic idea: a “link” is an endorsement • Has roots in bibliographic citation analysis • Decades-old work for determining “important” papers
Naïve approach: in-degree • Idea: use in-degree to rank pages • Off-host links are better endorsements • Problems: • Too democratic: • an endorsement from Mossberg is worth more than one from Stata • a link from a 2-link page is a stronger endorsement than one from a 100-link page • Easy to spam (for the above reasons)
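The in-degree idea, and why it is spammable, fits in a few lines; the page names and links here are hypothetical:

```python
from collections import Counter

# Hypothetical tiny web: page -> pages it links to.
links = {
    "mossberg.com": ["a.com"],
    "stata.com":    ["a.com", "b.com"],
    "spam1.com":    ["b.com"],          # two throwaway spam pages are enough
    "spam2.com":    ["b.com"],          # to push b.com past a.com
}

# In-degree ranking: simply count incoming links per page.
in_degree = Counter(dst for dsts in links.values() for dst in dsts)
ranking = sorted(in_degree, key=in_degree.get, reverse=True)
print(ranking)  # b.com (3 in-links) outranks a.com (2); every link counts equally
```

Because every link counts the same, cheap spam pages inflate a rank as effectively as a genuine endorsement.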
PageRank: recursive extension of in-degree • Let R(P) be the “PageRank” of P • R(P) = j/N + (1-j) * sum_i R(b_i)/outdegree(b_i) • where j is a number in (0,1) • N is the number of pages • b_i are the pages pointing to P • (draw picture) • (point out connection to previous page)
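A minimal sketch of iterating this recurrence on a hypothetical 3-page graph (the page names and the value of j are invented for illustration):

```python
# PageRank recurrence: R(P) = j/N + (1-j) * sum of R(b)/outdegree(b)
# over the pages b that point to P.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # page -> outlinks
pages = list(links)
N = len(pages)
j = 0.15                                # jump probability, a number in (0, 1)

R = {p: 1.0 / N for p in pages}         # start from the uniform distribution
for _ in range(50):                     # repeat the update until it settles
    R = {p: j / N + (1 - j) * sum(R[b] / len(links[b])
                                  for b in pages if p in links[b])
         for p in pages}

print({p: round(v, 3) for p, v in R.items()})
```

C collects rank from both A and B, so it ends up ranked highest; the scores still sum to 1 because every page here has at least one outlink.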
Random-walk model • Imagine the following model of a surfer: • Prob. j: jump to a random page • Prob. 1-j: follow a random link from the current page • A random walk is a Markov process • An N-state system with an NxN “transition matrix” T of (independent) transition probabilities • Tik is the probability of moving from state i to k • PageRank: Tik = j/N + (1-j)/outdegree(i) if i links to k, else j/N • Markov processes have been studied extensively
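The transition matrix can be built directly from the link structure; the 3-page graph below is hypothetical, and every page is assumed to have at least one outlink (dangling pages need separate handling):

```python
# T[i][k] = j/N + (1-j)/outdegree(i) if page i links to page k, else j/N.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = sorted(links)                   # fixed state order: A, B, C
N, j = len(pages), 0.15

T = [[j / N + ((1 - j) / len(links[src]) if dst in links[src] else 0.0)
      for dst in pages]
     for src in pages]

for row in T:                           # each row is a probability distribution
    assert abs(sum(row) - 1.0) < 1e-9
```

Row i of T is exactly the surfer model: mass j spread uniformly over all N pages, plus mass 1-j split evenly over i's outlinks.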
Markov processes • A Markov process is ergodic if: • No zeros in the transition matrix • ==> Can be in any state at any time step with non-zero probability • For ergodic Markov processes, there are unique long-term visit rates for every state, known as stationary (steady-state) probabilities, that are independent of the process’ starting state
Ergodic Markov processes • Let r be the N-dimensional vector giving the stationary prob’s for an erg. Markov process • The following is true: r = rT • Thus, r is the principal (left) eigenvector of T, that is, the eigenvector with the largest eigenvalue
Computing stationary prob’s • Start with any r0, then: • r1 = r0 T, r2 = r1 T (= r0 T^2), etc. • Converges rapidly (<== the Web has a good “expansion factor”/is “rapidly mixing”: the outlinks from a small, random set of pages lead to a sufficiently larger set)
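The power iteration above can be sketched directly; the 3x3 matrix is the PageRank transition matrix for a hypothetical 3-page graph with j = 0.15:

```python
# Power iteration: r_{k+1} = r_k * T, repeated until r stops changing.
T = [[0.05, 0.475, 0.475],
     [0.05, 0.05,  0.90],
     [0.90, 0.05,  0.05]]
N = len(T)

r = [1.0 / N] * N                       # any starting distribution works
for step in range(200):
    r_next = [sum(r[i] * T[i][k] for i in range(N)) for k in range(N)]
    done = max(abs(a - b) for a, b in zip(r, r_next)) < 1e-10
    r = r_next
    if done:
        break

print([round(x, 3) for x in r], "after", step + 1, "multiplications")
```

Even at a 1e-10 tolerance this tiny chain converges in a few dozen matrix-vector products, illustrating the rapid mixing the slide refers to.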
Searching with PageRank • Naïve: Find all pages that contain the query terms, then rank them by PageRank • In reality, combine PageRank with an “IR” score: • Weight anchor, title, header, and body hits differently (e.g., anchor hits might be weighted heavily due to extra trust) • Non-linear “tapering” so no one factor can overpower the rest • At one point, AV had almost 100 factors! • Tuning is very important, and very expensive
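One way such a combination might look; the field weights, the tanh “tapering”, and the 50/50 blend below are invented for illustration, not any engine's actual formula:

```python
import math

# Hypothetical per-field weights: anchor hits trusted most, body hits least.
FIELD_WEIGHTS = {"anchor": 3.0, "title": 2.0, "header": 1.5, "body": 1.0}

def taper(x, scale):
    """Nonlinear squash: grows at first, then saturates near 1."""
    return math.tanh(x / scale)

def combined_score(pagerank, field_hits):
    # Per-field hit counts are tapered, weighted, then tapered again,
    # so no single factor can overpower the others.
    ir = sum(FIELD_WEIGHTS[f] * taper(hits, scale=5.0)
             for f, hits in field_hits.items())
    return 0.5 * taper(ir, scale=4.0) + 0.5 * taper(pagerank, scale=0.01)

# A page with a few trusted anchor/title hits vs. one stuffed with body hits:
honest  = combined_score(0.004,  {"anchor": 3, "title": 1})
stuffed = combined_score(0.0001, {"body": 1000})
```

The 1000 stuffed body hits saturate their taper and buy almost nothing extra, so the honest page still scores higher.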
Eigenvectors and ranking • Many link-based ranking schemes are based on eigenvector computations • Simple variations: • Bias jump probabilities by additional notion of “endorsement” (still query independent) • Bias link/jump probabilities by topic relevance (query dependence) or by “personal interests” (“personal PageRank”) • Next time: “Hubs and Authorities”
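The first variation, biasing the jump probabilities, amounts to replacing the uniform jump term j/N with j*v(P) for a “personal” jump distribution v; the graph, j, and v below are hypothetical:

```python
# Personalized PageRank sketch: jumps land on pages of interest, not uniformly.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = sorted(links)
j = 0.15
v = {"A": 0.8, "B": 0.1, "C": 0.1}      # jump distribution biased toward page A

R = {p: 1.0 / len(pages) for p in pages}
for _ in range(100):
    R = {p: j * v[p] + (1 - j) * sum(R[b] / len(links[b])
                                     for b in pages if p in links[b])
         for p in pages}

print({p: round(x, 3) for p, x in R.items()})
```

With a uniform v this reduces to ordinary PageRank (where C ranks highest on this graph); the bias toward A lifts A to the top instead, while the result stays query independent.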