Link Analysis Ranking Algorithms on the World Wide Web.
Link Analysis Ranking Algorithms on the World Wide Web • Allan Borodin, Computer Science Department, University of Toronto, and Gammasite • Gareth O. Roberts, Department of Mathematics and Statistics, Lancaster University • Jeffrey S. Rosenthal, Department of Statistics, University of Toronto • Panayiotis Tsaparas, Computer Science Department, University of Toronto
Link Analysis Ranking on the Web • View the Web as a graph • Each web page is a node • Each hyperlink is a directed edge • Underlying Intuition: • A link from node i to node j denotes endorsement of node j as an authority on a topic • The Problem: Mine the Web Graph Secrets • Discover good authorities • Rank nodes according to their authority weight
Roadmap • Previous Work • Extensions of existing algorithms • A novel Bayesian Algorithm • Experimental Results • A Theoretical Framework • The Grand Finale
Previous Algorithms • Page Rank [Brin and Page 1997] • Query independent • Random Surfer Model • Hubs and Authorities [Kleinberg 1997] • Query dependent • Kleinberg [Kleinberg 1997] • SALSA [Lempel and Moran 2000] • Other • [Henzinger and Bharat 1998] • [Rafiei and Mendelzon 2000] • PHITS [Cohn and Chang 2001]
Hubs and Authorities • Create Root Set from text-based search engine. • Expand to Base Set • Construct underlying Graph • Remove intra-domain links
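The last step, removing intra-domain links, can be sketched as a filter over the edge list. This is only an illustration: it assumes that "domain" means the URL hostname, and the function name `remove_intra_domain` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

def remove_intra_domain(links):
    """Keep only links whose source and target hostnames differ.

    `links` is a list of (source_url, target_url) pairs.
    Assumption: two pages are in the same domain when their hostnames match.
    """
    return [
        (src, dst)
        for src, dst in links
        if urlparse(src).hostname != urlparse(dst).hostname
    ]

# Hypothetical example links.
links = [
    ("http://a.example/x", "http://a.example/y"),  # same host: dropped
    ("http://a.example/x", "http://b.example/"),   # different hosts: kept
]
filtered = remove_intra_domain(links)
```

A coarser notion of domain (for example, the registered domain rather than the full hostname) would drop more links; the slides do not say which notion is used.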
Hubs and Authorities • Pages with double identity (hubs, authorities) • Good hubs point to many good authorities • Good authorities are pointed to by many good hubs • Assign each page i an authority weight a_i and a hub weight h_i • Target: Find good authorities
Kleinberg Algorithm • Initialize all weights to 1 • Repeat until convergence • I operation: authorities “collect” the hub weights • O operation: hubs “collect” the authority weights • Normalize weights under some norm
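A minimal sketch of the iteration above, assuming nodes are numbered 0..n-1, edges are (source, target) pairs, the Euclidean norm is used for normalization, and a fixed number of rounds stands in for a convergence test. The names `hits` and `normalize` are illustrative.

```python
import math

def normalize(weights):
    """Scale a weight vector to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)

def hits(n, edges, iters=50):
    """Run the I and O operations for a fixed number of rounds.

    Assumption: `iters` rounds are enough; a real implementation would
    stop when the weights change by less than some tolerance.
    """
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iters):
        # I operation: each authority collects the hub weights pointing to it.
        new_auth = [0.0] * n
        for i, j in edges:
            new_auth[j] += hub[i]
        # O operation: each hub collects the authority weights it points to.
        new_hub = [0.0] * n
        for i, j in edges:
            new_hub[i] += new_auth[j]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Small example: page 0 links to 2; page 1 links to 2 and 3.
auth, hub = hits(4, [(0, 2), (1, 2), (1, 3)])
```

In this example page 2 ends up with a larger authority weight than page 3, and page 1 with a larger hub weight than page 0.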
Kleinberg Algorithm (cont.) • Equivalent to the singular value decomposition (SVD) of the adjacency matrix A • Authority weights converge to the principal eigenvector of A^T A • Hub weights converge to the principal eigenvector of A A^T
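The eigenvector statement follows from composing the two operations. As a sketch, assume A_{ij} = 1 exactly when page i links to page j, write a^{(n)} and h^{(n)} for the weight vectors after round n, and apply the I operation before the O operation in each round:

```latex
\[
a^{(n)} \propto A^{T} h^{(n-1)}, \qquad h^{(n)} \propto A\, a^{(n)}
\quad\Longrightarrow\quad
a^{(n)} \propto (A^{T} A)\, a^{(n-1)}, \qquad h^{(n)} \propto (A A^{T})\, h^{(n-1)}
\]
```

Each sequence is a power iteration, so it converges to the principal eigenvector of the corresponding matrix provided the largest eigenvalue is simple and the starting vector is not orthogonal to that eigenvector.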
SALSA • Replace the I and O operations with • I’ operation: authorities average the hub weights • O’ operation: hubs average the authority weights
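SALSA • Equivalent to a random walk on the bipartite graph • For a connected component, the stationary distribution gives each authority a weight proportional to its in-degree • For the whole graph, pick the starting point uniformly at random; the authority weight of node i in component j is then its stationary weight within component j, scaled by the probability of starting in component j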
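The random walk can be simulated by pushing probability mass backward and forward along the links. The sketch below is one reading of the walk, not necessarily the authors' formulation: it assumes no duplicate edges, starts uniformly over pages that have at least one in-link, and uses a fixed number of rounds. The name `salsa_authorities` is illustrative.

```python
def salsa_authorities(n, edges, iters=100):
    """Two-step random walk over the authority side of the bipartite graph."""
    in_deg = [0] * n
    out_deg = [0] * n
    for i, j in edges:
        out_deg[i] += 1
        in_deg[j] += 1
    # Assumption: start uniformly over the pages that act as authorities.
    num_auth = sum(1 for d in in_deg if d > 0)
    auth = [1.0 / num_auth if d > 0 else 0.0 for d in in_deg]
    for _ in range(iters):
        # Backward step: each authority splits its mass among the hubs pointing to it.
        hub = [0.0] * n
        for i, j in edges:
            hub[i] += auth[j] / in_deg[j]
        # Forward step: each hub splits its mass among the authorities it points to.
        auth = [0.0] * n
        for i, j in edges:
            auth[j] += hub[i] / out_deg[i]
    return auth

# Pages 2 and 3 share hub 1, so they form one connected component.
weights = salsa_authorities(4, [(0, 2), (1, 2), (1, 3)])
```

On this connected example the weights approach 2/3 for page 2 and 1/3 for page 3, that is, the in-degrees 2 and 1 divided by the 3 links.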
pSALSA • Pick the initial node with probability proportional to the popularity (in-degree) of the node • Perform the same random walk as in SALSA • Stationary distribution: the authority weight of each node is proportional to its in-degree
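Because the starting distribution is already proportional to in-degree, the resulting weights can be computed directly from the link counts. A minimal sketch, with an illustrative function name and the assumption that weights are normalized to sum to 1:

```python
def psalsa(n, edges):
    """Authority weight of each page = its in-degree divided by the number of links."""
    in_deg = [0] * n
    for _, target in edges:
        in_deg[target] += 1
    total = len(edges)
    return [d / total for d in in_deg] if total > 0 else [0.0] * n

weights = psalsa(4, [(0, 2), (1, 2), (1, 3)])
```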
Kleinberg Algorithm and Random Walks • pSALSA is equivalent to a single I operation. • The nth step of the Kleinberg algorithm gives weight • The stationary distribution of a random walk with transition probabilities
Roadmap • Previous Work • Extensions of existing algorithms • A novel Bayesian Algorithm • Experimental Results • A Theoretical Framework • The Grand Finale
Hub-Averaging Algorithm • Asymmetric view of Hubs, Authorities • Good hubs point only to good authorities • Algorithm • Perform I operation of Kleinberg • Perform O’ operation of SALSA
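A minimal sketch of the two steps, assuming the authority step sums hub weights, the hub step takes the mean authority weight over a hub's out-links, weights are normalized to sum to 1, and a fixed number of rounds replaces a convergence test. The function names are illustrative.

```python
def normalize(weights):
    """Scale weights to sum to 1 (left unchanged if all zero)."""
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else list(weights)

def hub_averaging(n, edges, iters=50):
    out_deg = [0] * n
    for i, _ in edges:
        out_deg[i] += 1
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iters):
        # I operation (Kleinberg): authorities sum the hub weights pointing to them.
        new_auth = [0.0] * n
        for i, j in edges:
            new_auth[j] += hub[i]
        new_auth = normalize(new_auth)
        # O' operation: each hub takes the average authority weight it points to.
        new_hub = [0.0] * n
        for i, j in edges:
            new_hub[i] += new_auth[j] / out_deg[i]
        auth, hub = new_auth, normalize(new_hub)
    return auth, hub

# Page 0 links only to page 2; page 1 links to pages 2 and 3.
auth, hub = hub_averaging(4, [(0, 2), (1, 2), (1, 3)])
```

Here page 0 gets the larger hub weight, because it points only to the stronger authority, whereas summing the authority weights would favor page 1.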
Threshold Algorithms • Hub Threshold Algorithm • I operation: Keep only the hub weights above average • Authority Threshold Algorithm • O operation: Keep only the top K authority weights • Full Threshold Algorithm • Apply Thresholds to both I and O operations
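The three variants can be sketched as one function with two switches. The details below are assumptions rather than the authors' exact definitions: the hub average is taken over the hubs pointing to the page being scored, weights equal to the average are kept, and weights are normalized to sum to 1. The names `threshold_hits`, `hub_threshold`, and `top_k` are illustrative.

```python
def normalize(weights):
    """Scale weights to sum to 1 (left unchanged if all zero)."""
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else list(weights)

def threshold_hits(n, edges, hub_threshold=True, top_k=None, iters=50):
    preds = [[] for _ in range(n)]  # pages pointing to each page
    succs = [[] for _ in range(n)]  # pages each page points to
    for i, j in edges:
        preds[j].append(i)
        succs[i].append(j)
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iters):
        new_auth = []
        for j in range(n):
            ws = [hub[i] for i in preds[j]]
            if hub_threshold and ws:
                # Hub threshold: drop hub weights below the average (assumed local).
                avg = sum(ws) / len(ws)
                ws = [w for w in ws if w >= avg]
            new_auth.append(sum(ws))
        new_auth = normalize(new_auth)
        new_hub = []
        for i in range(n):
            ws = sorted((new_auth[j] for j in succs[i]), reverse=True)
            if top_k is not None:
                # Authority threshold: keep only the top K authority weights.
                ws = ws[:top_k]
            new_hub.append(sum(ws))
        auth, hub = new_auth, normalize(new_hub)
    return auth, hub

edges = [(0, 2), (1, 2), (1, 3)]
full_auth, full_hub = threshold_hits(4, edges, hub_threshold=True, top_k=1)
```

Setting `hub_threshold=False` and `top_k=None` reduces this sketch to the plain Kleinberg iteration; enabling both gives the Full Threshold variant.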
Breadth First Search Algorithm • pSALSA: weights according to 1-neighborhood popularity • Kleinberg: weights according to global structure • BFS: Combine • Assign weights according to n-neighborhood popularity • Visit neighbors in a BFS fashion, alternating between backward (B) and forward (F) steps • Apply exponentially decreasing weighting
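A sketch of the idea, under assumptions the slide does not spell out: the first step is a backward step, only pages not yet visited are counted at each step, and each extra step halves the contribution. The name `bfs_authority` and the `depth` parameter are illustrative.

```python
def bfs_authority(n, edges, depth=3):
    """Score each page by how many new pages an alternating B/F search reaches."""
    preds = [[] for _ in range(n)]  # backward neighbors
    succs = [[] for _ in range(n)]  # forward neighbors
    for i, j in edges:
        preds[j].append(i)
        succs[i].append(j)
    weights = []
    for start in range(n):
        visited = {start}
        frontier = {start}
        score = 0.0
        for level in range(depth):
            # Even levels follow links backward (B), odd levels forward (F).
            neighbors = preds if level % 2 == 0 else succs
            reached = {v for u in frontier for v in neighbors[u]} - visited
            if not reached:
                break
            # Assumed decay: each further step counts half as much.
            score += len(reached) / 2 ** level
            visited |= reached
            frontier = reached
        weights.append(score)
    return weights

weights = bfs_authority(4, [(0, 2), (1, 2), (1, 3)], depth=2)
```

With `depth=1` the score is just the in-degree, which matches the 1-neighborhood view attributed to pSALSA above.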
Roadmap • Previous Work • Extensions of existing algorithms • A novel Bayesian Algorithm • Experimental Results • A Theoretical Framework • The Grand Finale
Bayesian Algorithm: The Model • Assign to each page i parameters • (one of them is a “link tendency” parameter) • Probability of a link between i and j • Simplified Bayesian: • Assign prior distributions to the parameters
Bayesian Algorithm: The Algorithm • Condition on the observed adjacency matrix A • Obtain the posterior distribution using Bayes' rule • Compute the conditional means using the Metropolis algorithm • Output the conditional means of the authority parameters
Roadmap • Previous Work • Extensions of existing algorithms • A novel Bayesian Algorithm • Experimental Results • A Theoretical Framework • The Grand Finale • http://www.cs.toronto.edu/~tsap/experiments
Experimental Results • No undisputed “best” algorithm • No algorithm performs consistently well • There are queries where no algorithm performs well • There are queries where all algorithms perform well • Some algorithms are more “focused”, others more “spread” • Some algorithms are more prone to topic drift • The construction of the Base Set Graph is very important
Experimental Results • Kleinberg • Converges to the most Tightly Knit Community (TKC phenomenon) • Prone to topic drift • pSALSA • Spreads the authority weight over different communities • May introduce spurious authorities • The two ends of the spectrum (east vs. west) (genetic)
Comparative Evaluation of Algorithms • [Figure: algorithms shown are HThresh, AThresh, Hub-Avg, BFS, SBayesian, FThresh, Bayesian] • Similarity: intersection over the top ten • Simplified Bayesian very close to pSALSA • Threshold algorithms close to Kleinberg • Other algorithms range in the middle
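The similarity measure can be sketched as the number of pages two algorithms share among their top-ranked results. The function name `top_k_overlap` is illustrative, and breaking ties by page index is an assumption.

```python
def top_k_overlap(w1, w2, k=10):
    """Count the pages that appear in the top k of both weight vectors."""
    def top(weights):
        # Sort page indices by decreasing weight; ties keep index order.
        ranked = sorted(range(len(weights)), key=lambda i: -weights[i])
        return set(ranked[:k])
    return len(top(w1) & top(w2))

# Two hypothetical weight vectors over four pages, compared on their top two.
overlap = top_k_overlap([0.5, 0.3, 0.2, 0.0], [0.1, 0.4, 0.5, 0.0], k=2)
```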
Roadmap • Previous Work • Extensions of existing algorithms • A novel Bayesian Algorithm • Experimental Results • A Theoretical Framework • The Grand Finale
A Theoretical Framework • Link Analysis Ranking algorithm A • A(G)[j]: authority weight of jth page. • L-algorithm A: vector A(G) is normalized under L norm • Unnormalized algorithm: no normalization at any step • e.g. unnormalized pSALSA
Monotonicity • Definition: An algorithm A is monotone if for any two nodes, j,k: • All algorithms we consider are monotone
Similarity: Distance Measures • L norm: • Distance between weight vectors • Distance between algorithms
Similarity: Distance Measures • Rank Distance (counts the number of swapped pairs) • Distance between weight vectors • Distance between algorithms
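Counting swapped pairs can be sketched as below. The slide's exact formula is not preserved here, so this version makes its own choices: a pair counts only when the two weight vectors order it strictly oppositely, ties are ignored, and the count is not normalized. The name `swapped_pairs` is illustrative.

```python
from itertools import combinations

def swapped_pairs(w1, w2):
    """Number of page pairs ranked in opposite order by the two weight vectors."""
    return sum(
        1
        for i, j in combinations(range(len(w1)), 2)
        # A negative product means the two vectors disagree on this pair.
        if (w1[i] - w1[j]) * (w2[i] - w2[j]) < 0
    )

reversed_count = swapped_pairs([3, 2, 1], [1, 2, 3])
```

This quadratic-time loop is fine for short result lists; a merge-sort based inversion count would scale better to large graphs.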
Similarity • Definition: Two L-algorithms are similar if • Definition: Two algorithms are rank similar if • Definition: Two algorithms are rank matching if
Similarity: Results • Hub-Averaging and pSALSA are neither similar nor rank similar • Kleinberg and pSALSA are neither similar nor rank similar • Kleinberg and Hub-Averaging are neither similar nor rank similar
Stability • Intuition: An algorithm is stable if small changes to the graph have a small effect on the output • Let • Definition: An L-algorithm A is stable if • Definition: An algorithm A is rank stable if
Stability: Results • Kleinberg and Hub-Averaging are neither stable nor rank stable • pSALSA is stable and rank stable
Locality • Let • Local: • Pairwise local: • Rank Local:
Locality: Results • Unnormalized pSALSA is local • pSALSA is rank local and pairwise local • Theorem (Uniqueness of pSALSA): Any algorithm that is monotone, label independent, and local is rank matching with pSALSA
Future Work • Investigate the use of other statistical and machine learning techniques for link analysis • Expand and explore the Theoretical Framework • Investigate the similarity of Simplified Bayesian and pSALSA