Link Analysis Ranking Algorithms on the World Wide Web

Link Analysis Ranking Algorithms on the World Wide Web • Allan Borodin, Computer Science Department, University of Toronto, and Gammasite • Gareth O. Roberts, Department of Mathematics and Statistics, Lancaster University • Jeffrey S. Rosenthal, Department of Statistics, University of Toronto • Panayiotis Tsaparas, Computer Science Department, University of Toronto

Link Analysis Ranking on the Web • View the Web as a graph • Each web page is a node • Each hyperlink is a directed edge • Underlying Intuition: • A link from node i to node j denotes endorsement of node j as an authority on a topic • The Problem: Mine the Web Graph Secrets • Discover good authorities • Rank nodes according to their authority weight

Roadmap • Previous Work • Extensions of existing algorithms • A novel Bayesian Algorithm • Experimental Results • A Theoretical Framework • The Grand Finale

Previous Algorithms • PageRank [Brin and Page 1997] • Query independent • Random Surfer Model • Hubs and Authorities [Kleinberg 1997] • Query dependent • Kleinberg [Kleinberg 1997] • SALSA [Lempel and Moran 2000] • Other • [Henzinger and Bharat 1998] • [Rafiei and Mendelzon 2000] • PHITS [Cohn and Chang 2001]

Hubs and Authorities • Create Root Set from text-based search engine. • Expand to Base Set • Construct underlying Graph • Remove intra-domain links

Hubs and Authorities • Pages with double identity (hubs, authorities) • Good hubs point to many good authorities • Good authorities are pointed to by many good hubs • Assign each page i an authority weight $a_i$ and a hub weight $h_i$ • Target: Find good authorities

Kleinberg Algorithm • Initialize all weights to 1. • Repeat until convergence • I operation: authorities “collect” the hub weights • O operation : hubs “collect” the authority weights • Normalize weights under some norm

Kleinberg Algorithm (cont.) • Equivalent to the singular value decomposition (SVD) of the adjacency matrix A • Authority weights converge to the principal eigenvector of $A^{T}A$ • Hub weights converge to the principal eigenvector of $AA^{T}$
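The I and O operations of the Kleinberg algorithm can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the names `kleinberg` and `l2_normalize`, the fixed iteration count, and the choice of the L2 norm are assumptions made for the example (the slides only say "some norm").

```python
import math

def l2_normalize(weights):
    """Scale a weight vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)

def kleinberg(adj, iterations=50):
    """Iterate the I and O operations on adjacency matrix adj, where adj[i][j] == 1 if page i links to page j.

    A fixed number of iterations stands in for "repeat until convergence" (an assumption).
    """
    n = len(adj)
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # I operation: each authority collects the weights of the hubs that point to it.
        auth = [sum(hub[i] for i in range(n) if adj[i][j]) for j in range(n)]
        # O operation: each hub collects the weights of the authorities it points to.
        hub = [sum(auth[j] for j in range(n) if adj[i][j]) for i in range(n)]
        # Normalize both vectors (L2 norm chosen for this sketch).
        auth = l2_normalize(auth)
        hub = l2_normalize(hub)
    return auth, hub
```

On a small graph with links 0→2, 1→2 and 1→3, the authority vector returned by this sketch settles on the principal eigenvector of $A^{T}A$ restricted to pages 2 and 3, which is the behaviour the slide describes.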

SALSA • Replace the I and O operations with • I’ operation: authorities average the hub weights • O’ operation: hubs average the authority weights

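SALSA • Equivalent to a random walk on the bipartite graph • For a connected component the stationary distribution satisfies $a_i = |B(i)| / |B|$, where $B(i)$ is the set of links pointing to node i and $B$ is the set of all links in the component • For the whole graph, pick the starting point uniformly at random. The authority weight of node i in component j is $a_i = \frac{|A_j|}{|A|} \cdot \frac{|B(i)|}{|B_j|}$, where $A_j$ is the set of authorities in component j, $A$ the set of all authorities, and $B_j$ the set of links in component j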

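pSALSA • Pick the initial node with probability proportional to the popularity (in-degree) of the node • Perform the same random walk as in SALSA • Stationary distribution: authority weight proportional to in-degree, $a_i = |B(i)| / |B|$, where $B(i)$ is the set of links pointing to node i and $B$ the set of all links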
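The random walk behind SALSA and pSALSA can be illustrated with a short Python sketch. It assumes the usual reading of the walk on the authority side: from an authority, step backward along a uniformly chosen in-link to a hub, then forward along a uniformly chosen out-link of that hub. The function names are hypothetical and the code is an illustration under that assumption, not the authors' implementation.

```python
def psalsa_weights(adj):
    """Authority weights proportional to in-degree; adj[i][j] == 1 if page i links to page j."""
    n = len(adj)
    in_deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    total = sum(in_deg)
    return [d / total for d in in_deg] if total else [0.0] * n

def salsa_authority_step(adj, auth):
    """Push the distribution auth through one backward-then-forward step of the walk."""
    n = len(adj)
    in_deg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    out_deg = [sum(adj[i]) for i in range(n)]
    new_auth = [0.0] * n
    for j in range(n):
        if auth[j] == 0 or in_deg[j] == 0:
            continue  # pages without in-links carry no authority mass in this sketch
        share = auth[j] / in_deg[j]  # mass sent backward along each in-link of j
        for i in range(n):
            if adj[i][j]:
                for k in range(n):
                    if adj[i][k]:
                        # hub i forwards its share evenly over its out-links
                        new_auth[k] += share / out_deg[i]
    return new_auth
```

Starting the walk from `psalsa_weights(adj)` leaves the distribution unchanged after `salsa_authority_step`, which is one way to check that the in-degree distribution is stationary for this walk.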

Kleinberg Algorithm and Random Walks • pSALSA is equivalent to a single I operation. • The nth step of the Kleinberg algorithm gives weight • The stationary distribution of a random walk with transition probabilities

Hub-Averaging Algorithm • Asymmetric view of Hubs, Authorities • Good hubs point only to good authorities • Algorithm • Perform I operation of Kleinberg • Perform O’ operation of SALSA
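A minimal Python sketch of Hub-Averaging follows. It assumes that the averaging step sets each hub weight to the mean of the authority weights of the pages it links to, which matches "good hubs point only to good authorities"; the function names, the L2 normalization, and the fixed iteration count are choices made for this example only.

```python
import math

def l2_normalize(weights):
    """Scale a weight vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)

def hub_averaging(adj, iterations=50):
    """Alternate a summing authority update with an averaging hub update.

    adj[i][j] == 1 if page i links to page j.
    """
    n = len(adj)
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # I operation (as in Kleinberg): authorities sum the weights of hubs pointing to them.
        auth = [sum(hub[i] for i in range(n) if adj[i][j]) for j in range(n)]
        # Averaging step (assumed reading): a hub takes the mean weight of the authorities it points to.
        hub = []
        for i in range(n):
            targets = [auth[j] for j in range(n) if adj[i][j]]
            hub.append(sum(targets) / len(targets) if targets else 0.0)
        auth = l2_normalize(auth)
        hub = l2_normalize(hub)
    return auth, hub
```

With links 0→2, 1→2 and 1→3, hub 0 (which points only to the stronger authority) ends up above hub 1, whereas summing hub weights would favour hub 1 for pointing to more pages.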

Threshold Algorithms • Hub Threshold Algorithm • I operation: Keep only the hub weights above average • Authority Threshold Algorithm • O operation: Keep only the top K authority weights • Full Threshold Algorithm • Apply Thresholds to both I and O operations
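The two thresholded operations can be sketched as single update steps in Python. Several details are assumptions made for the example, since the slide does not fix them: the average is taken over pages with at least one outgoing link, a hub exactly at the average is kept, and the top K authorities are chosen per hub among the pages that hub links to. The function names are hypothetical.

```python
def hub_threshold_i_op(adj, hub):
    """I operation that counts only hubs whose weight is at least the average hub weight.

    adj[i][j] == 1 if page i links to page j.
    """
    n = len(adj)
    active = [i for i in range(n) if any(adj[i])]  # pages that act as hubs
    if not active:
        return [0.0] * n
    average = sum(hub[i] for i in active) / len(active)
    return [
        sum((hub[i] for i in active if adj[i][j] and hub[i] >= average), 0.0)
        for j in range(n)
    ]

def authority_threshold_o_op(adj, auth, k):
    """O operation that sums only the k largest authority weights each hub points to."""
    n = len(adj)
    hub = []
    for i in range(n):
        pointed = sorted((auth[j] for j in range(n) if adj[i][j]), reverse=True)
        hub.append(sum(pointed[:k], 0.0))
    return hub
```

A Full Threshold step, in this reading, would simply apply both functions in turn before normalizing.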

Breadth First Search Algorithm • pSALSA: weights according to 1-neighborhood popularity • Kleinberg: weights according to global structure • BFS: combines the two • Assign weights according to n-neighborhood popularity • Visit neighbors in a BFS fashion, alternating between backward (B) and forward (F) link steps • Apply exponentially decreasing weighting
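The BFS weighting can be sketched as follows in Python. The decay factor of 1/2 per level, the `depth` parameter, and the rule that a page is counted only at the first level where it is reached are assumptions for this illustration; the slide only states that the weighting decreases exponentially.

```python
def bfs_authority(adj, node, depth=3, decay=0.5):
    """Weight of node from its n-neighborhood, alternating backward and forward link steps.

    adj[i][j] == 1 if page i links to page j. Level 0 follows links backward,
    level 1 forward, and so on; pages newly reached at each level are counted
    with a weight that shrinks by decay per level.
    """
    n = len(adj)
    visited = {node}
    frontier = {node}
    weight = 0.0
    factor = 1.0
    for level in range(depth):
        backward = level % 2 == 0
        reached = set()
        for u in frontier:
            for v in range(n):
                linked = adj[v][u] if backward else adj[u][v]
                if linked and v not in visited:
                    reached.add(v)
        visited |= reached
        weight += factor * len(reached)
        factor *= decay
        frontier = reached
    return weight
```

With `depth=1` the weight is just the in-degree, the 1-neighborhood popularity attributed to pSALSA above; larger depths bring in progressively more of the surrounding structure.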

Bayesian Algorithm: The Model • Assign to each page i parameters: a hub weight $h_i$, an authority weight $a_i$, and $e_i$ • ($e_i$: “link tendency” parameter) • Probability of a link between i and j • Simplified Bayesian: • Assign prior distributions to the parameters

Bayesian Algorithm: The Algorithm • Condition on the observed adjacency matrix A • Obtain the posterior distribution using Bayes' rule • Compute the conditional means using the Metropolis algorithm • Output the conditional means of the authority parameters

Roadmap • Previous Work • Extensions of existing algorithms • A novel Bayesian Algorithm • Experimental Results • A Theoretical Framework • The Grand Finale • http://www.cs.toronto.edu/~tsap/experiments

Experimental Results • No undisputed “best” algorithm • No algorithm performs consistently well • There are queries where no algorithm performs well • There are queries where all algorithms perform well • Some algorithms are more “focused”, others more “spread” • Some algorithms are more prone to topic drift • The construction of the Base Set Graph is very important

Experimental Results • Kleinberg • Converges to the most Tightly Knit Community (TKC phenomenon) • Prone to topic drift • pSALSA • Spreads the authority weight over different communities • May introduce spurious authorities • The two ends of the spectrum (east vs. west) (genetic)

Comparative Evaluation of Algorithms • [Figure labels: HThresh, AThresh, Hub-Avg, BFS, SBayesian, FThresh, Bayesian] • Similarity: intersection over top ten • Simplified Bayesian very close to pSALSA • Threshold algorithms close to Kleinberg • Other algorithms range in the middle
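The "intersection over top ten" similarity can be computed with a few lines of Python. The function names are hypothetical, and breaking ties by page index is an assumption of this sketch.

```python
def top_k(weights, k=10):
    """Indices of the k pages with the largest weights (ties broken by lower index)."""
    ranked = sorted(range(len(weights)), key=lambda i: -weights[i])
    return set(ranked[:k])

def top_k_overlap(weights_a, weights_b, k=10):
    """Number of pages that appear in the top k of both weight vectors."""
    return len(top_k(weights_a, k) & top_k(weights_b, k))
```

Two algorithms that return the same ten highest-ranked pages for a query, in any order, score 10 under this measure; algorithms with disjoint top-ten lists score 0.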

Roadmap • Previous Work • Extensions of existing algorithms • A Bayesian Approach • Experimental Results • A Theoretical Framework • The Grand Finale

A Theoretical Framework • Link Analysis Ranking algorithm A • A(G)[j]: authority weight of the jth page • $L_p$-algorithm A: the vector A(G) is normalized under the $L_p$ norm • Unnormalized algorithm: no normalization at any step • e.g. unnormalized pSALSA

Monotonicity • Definition: An algorithm A is monotone if for any two nodes, j,k: • All algorithms we consider are monotone

Similarity: Distance Measures • $L_p$ norm: • Distance between weight vectors • Distance between algorithms

Similarity: Distance Measures • Rank Distance (counts the number of swapped pairs) • Distance between weight vectors • Distance between algorithms
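A rank distance that counts swapped pairs can be sketched as below. Dividing by the total number of pairs and ignoring tied pairs are assumptions made for this example; the exact definition used in the talk is not reproduced in the transcript.

```python
from itertools import combinations

def rank_distance(weights_a, weights_b):
    """Fraction of page pairs that the two weight vectors order in opposite ways."""
    n = len(weights_a)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 0.0
    swapped = sum(
        1
        for i, j in combinations(range(n), 2)
        # a negative product means the two vectors disagree on the order of i and j
        if (weights_a[i] - weights_a[j]) * (weights_b[i] - weights_b[j]) < 0
    )
    return swapped / pairs
```

Under this sketch, identical rankings have distance 0 and fully reversed rankings have distance 1, regardless of the actual weight values.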

Similarity • Definition: Two $L_p$-algorithms are similar if • Definition: Two algorithms are rank similar if • Definition: Two algorithms are rank matching if

Similarity: Results • Hub-Averaging and pSALSA are neither similar nor rank similar • Kleinberg and pSALSA are neither similar nor rank similar • Kleinberg and Hub-Averaging are neither similar nor rank similar

Stability • Intuition: An algorithm is stable if small changes to the graph have a small effect on the output • Let • Definition: An $L_p$-algorithm A is stable if • Definition: An algorithm A is rank stable if

Stability: Results • Kleinberg and Hub-Averaging are neither stable nor rank stable • pSALSA is stable and rank stable

Locality • Let • Local: • Pairwise local: • Rank Local:

Locality: Results • Unnormalized pSALSA is local • pSALSA is rank local and pairwise local • Theorem (Uniqueness of pSALSA): Any algorithm that is monotone, label independent, and local is rank matching with pSALSA

Future Work • Investigate the use of other statistical and machine learning techniques for link analysis • Expand and explore the Theoretical Framework • Investigate the similarity of Simplified Bayesian and pSALSA
