Scaling Personalized Web Search Glen Jeh, Jennifer Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion
Today’s topics • Overview • Motivation • Personalized PageRank Vector • Efficient calculation of PPV • Experimental results • Discussion
PageRank Overview • A method for ranking web pages based on the link structure of the web • Important pages are those linked to by many important pages • The original PageRank has no initial preference for any particular pages
PageRank Overview • The ranking is based on the probability that a random surfer will visit a certain page at a given time • E(p), the random-jump distribution, can be: • Uniform over all pages • Biased toward particular pages
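For reference, the standard PageRank equation this slide describes, written in the notation used later in the deck, is

v = (1 − c)·A·v + c·E

where A is the n×n matrix derived from the web's link structure, c is the probability of a random jump, and E is the random-jump distribution. Original PageRank takes E uniform; biasing E is exactly what personalization changes.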
Motivation • We would like to give higher importance to user-selected pages • A user may have a set P of preferred pages • Instead of jumping to any random page with probability c, the jump is restricted to P • This increases the probability that the random surfer stays in the neighborhood of the pages in P • Taking P into account creates a personalized view of the importance of pages on the web
Personalized PageRank Vector (PPV) • Restrict preference sets P to subsets of a set of hub pages H, a set of pages with high PageRank • A PPV is a vector of length n, where n is the number of pages on the web • PPV[p] = the importance of page p with respect to the preference set
PPV Equation • v = (1 − c)·A·v + c·u • u – preference vector • |u| = 1 • u(p) = the amount of preference for page p • A – n×n matrix derived from the web's link structure • c – the probability that the random surfer jumps to a page in P
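Rearranging this fixed-point equation gives the closed form (a standard derivation, not on the original slide)

v = c·(I − (1 − c)·A)⁻¹·u

which makes the difficulty concrete: every distinct preference vector u requires its own solve over a web-sized matrix, motivating the problem statement on the next slide.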
PPV – Problem • Not practical to compute PPVs at query time • Not practical to compute and store them offline either: there are 2^|H| possible preference sets • How can PPVs be calculated, and how can it be done efficiently?
Main steps to the solution • Break preference vectors down into common components • Divide the computation between offline (lots of time) and online (focused computation) • Eliminate redundant computation
Linearity Theorem • The solution to a linear combination of preference vectors is the same linear combination of the corresponding PPVs: if u = α1·x1 + … + αk·xk, then v = α1·r1 + … + αk·rk • Let xi be the unit vector with 1 in position i • Let ri be the PPV corresponding to xi, called a hub vector
Example (figure): the personal preferences of a user, David, are pages 1, 2, and 12, so his preference vector is a combination of x1, x2, x12, and by linearity his PPV is the same combination of the hub vectors r1, r2, and r12 – as sketched below.
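A minimal numpy sketch of this example; the three hub vectors below are made-up 4-page values, purely for illustration:

```python
import numpy as np

# Hypothetical precomputed hub vectors: row i is the PPV r_i for the
# unit preference vector x_i (values invented for illustration).
hub_vectors = np.array([
    [0.50, 0.25, 0.15, 0.10],   # r_1
    [0.20, 0.55, 0.15, 0.10],   # r_2
    [0.10, 0.20, 0.60, 0.10],   # r_12
])

# David prefers pages 1, 2 and 12 equally: u = (x_1 + x_2 + x_12) / 3.
weights = np.array([1/3, 1/3, 1/3])

# Linearity theorem: his PPV is the same combination of hub vectors.
ppv = weights @ hub_vectors
print(ppv)  # entries still sum to 1
```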
Good, but not enough… • If the hub vector ri for each page in H can be computed ahead of time and stored, then computing a PPV at query time is easy • The number of precomputed vectors decreases from 2^|H| to |H| • But… • Each hub vector computation requires multiple scans of the web graph • Time and space grow linearly with |H| • The solution so far is impractical
Decomposition of Hub Vectors • To compute and store the hub vectors efficiently, we can further break them down into… • Partial vector – the component unique to each hub • Hubs skeleton – encodes the interrelationships among hub vectors • The full hub vector is constructed from these at query time • Saves computation time and storage, since components are shared among hub vectors
Inverse P-distance • The hub vector rp can be represented as an inverse P-distance vector: rp(q) = Σ over paths t from p to q of P(t)·c·(1 − c)^l(t) • l(t) – the number of edges in path t • P(t) – the probability of traveling along path t
Partial Vectors • Breaking rp into two components: • Partial vector rpH – computed without using any intermediate nodes from H • The rest – paths that go through some page in H • For well-chosen sets H, rpH(q) = 0 for many pairs of pages p, q, so partial vectors are sparse and cheap to compute and store
Good, but not enough… • Precompute and store the partial vector rpH • Cheaper to compute and store than the full rp • Its size decreases as |H| increases • Add the missing component at query time to obtain the full hub vector • But… • Computing and storing the remainder rp − rpH could be as expensive as rp itself
Hubs Skeleton • Breaking down the remainder rp − rpH, the paths that go through some page in H (including the case where p or q is itself in H): • Hubs skeleton rp(H) – the set of distances among hub pages, giving the interrelationships among the partial vectors • rp = rpH + (1/c)·Σ over h in H of (rp(h) − c·xp(h))·(rhH − c·xh) • For each p, rp(H) has size at most |H|, much smaller than the full hub vector
Example (figure): a small web graph with pages a, b, c, d illustrating which paths pass through the hub set H.
Putting it all together Given a chosen preference set P: • Form a preference vector u • Compute the hub vector rh for each hub page h in P, adding the hubs-skeleton component to the precomputed partial vector (the skeleton step may be deferred to query time) • Combine the hub vectors into the PPV – a sketch follows below
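A rough Python sketch of this query-time assembly, following the hubs-skeleton formula as reconstructed above; the dict-based sparse vectors and the names partial and skeleton are assumptions of this sketch, not data structures from the paper:

```python
def assemble_hub_vector(h, partial, skeleton, c):
    # Start from the precomputed partial vector r_h^H
    # (sparse dict: page -> score).
    r = dict(partial[h])
    # Add (1/c) * (r_h(h2) - c*x_h(h2)) * (r_h2^H - c*x_h2)
    # for every hub h2 in the skeleton of h.
    for h2, r_h_h2 in skeleton[h].items():
        w = r_h_h2 - (c if h2 == h else 0.0)
        if w == 0.0:
            continue
        for q, v in partial[h2].items():
            r[q] = r.get(q, 0.0) + (w / c) * v
        r[h2] = r.get(h2, 0.0) - w  # the -c*x_h2 term: (w/c)*(-c) = -w
    return r

def ppv(preference, partial, skeleton, c=0.15):
    # By the linearity theorem, the PPV for a preference
    # {hub page: weight} is the weighted sum of its hub vectors.
    out = {}
    for h, alpha in preference.items():
        for q, v in assemble_hub_vector(h, partial, skeleton, c).items():
            out[q] = out.get(q, 0.0) + alpha * v
    return out
```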
Algorithms • Decomposition theorem • Basic dynamic programming algorithm • Partial vectors – selective expansion algorithm • Hubs skeleton – repeated squaring algorithm
Decomposition theorem • The basis vector rp is the average of the basis vectors of its out-neighbors, plus a compensation factor: rp = c·xp + ((1 − c)/|O(p)|)·Σ over out-neighbors q of rq • Defines relationships among the basis vectors • Having computed the basis vectors of p's out-neighbors to a certain precision, we can use the theorem to compute rp to greater precision
Basic dynamic programming algorithm • Using the decomposition theorem, we can build a dynamic programming algorithm that iteratively improves the precision of the calculation, as sketched below • On iteration k, only paths of length ≤ k − 1 are considered • The error shrinks by a factor of 1 − c on each iteration
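A minimal sketch of this iteration in Python on a toy graph; the three-page graph and the fixed iteration count are illustrative assumptions:

```python
# Toy web graph: page -> out-neighbors (names are illustrative).
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pages = list(out_links)
c = 0.15  # random-jump probability

# r[p] is the current approximation D_k[p] of the basis vector r_p;
# D_0 = 0 for every page.
r = {p: {q: 0.0 for q in pages} for p in pages}

for _ in range(100):  # error shrinks by a factor of (1 - c) per pass
    new = {}
    for p in pages:
        vec = {q: 0.0 for q in pages}
        vec[p] = c  # the compensation factor c * x_p
        for nb in out_links[p]:  # average of out-neighbor vectors
            for q, v in r[nb].items():
                vec[q] += (1 - c) * v / len(out_links[p])
        new[p] = vec
    r = new

print(r["a"])  # approximates r_a; entries sum to ~1
```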
Computing partial vectors • Selective expansion algorithm, sketched below • Tours passing through a hub page in H are never considered • The expansion from p stops whenever a page from H is reached
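A simplified push-style sketch of this idea in Python (the paper states the algorithm in terms of iterates Dk and Rk; the worklist loop, tolerance, and names below are assumptions of this sketch):

```python
def partial_vector(p, out_links, hubs, c=0.15, tol=1e-6):
    # D approximates the partial vector r_p^H; R holds residual path
    # mass that has not been pushed further yet.
    D, R = {}, {p: 1.0}
    frontier, expanded_start = [p], False
    while frontier:
        q = frontier.pop()
        # Hub pages are never expanded, so no tour uses a hub as an
        # intermediate node; p itself (the tour start) is expanded once.
        if q in hubs and (q != p or expanded_start):
            continue
        mass = R.get(q, 0.0)
        if mass < tol:
            continue
        del R[q]
        if q == p:
            expanded_start = True
        D[q] = D.get(q, 0.0) + c * mass
        out = out_links.get(q, [])
        for nxt in out:
            R[nxt] = R.get(nxt, 0.0) + (1 - c) * mass / len(out)
            frontier.append(nxt)
    # Residual left parked at hub pages is exactly the information
    # later fed into the hubs-skeleton computation.
    return D, R
```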
Computing the hubs skeleton • Repeated squaring algorithm, using the intermediate results from the computation of the partial vectors • The error is squared on each iteration, so it shrinks much faster • Running time and storage depend only on the size of rp(H) • This allows the computation to be deferred to query time
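Writing Dk for the current estimate and Rk for the residual (a reconstruction of the paper's formulation; these identities are not on the original slide), the recurrence rp = Dk[p] + Σ over q of Rk[p](q)·rq can be substituted into itself, giving

D2k[p] = Dk[p] + Σq Rk[p](q)·Dk[q]
R2k[p] = Σq Rk[p](q)·Rk[q]

so each pass doubles k, and the error, which otherwise decays like (1 − c)^k, is squared.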
Experimental results • Experiments were performed on real web data from Stanford's WebBase, containing 80 million pages after removing leaf pages • Experiments were run on a machine with a 1.4 GHz CPU and 3.5 GB of memory
Experimental results • The partial-vector approach is much more effective when H contains high-PageRank pages • H was varied from the top 1,000 to the top 100,000 pages with the highest PageRank
Experimental results • The hubs skeleton was computed for |H| = 10,000 • Its average size is 9,021 entries, far less than the dimensionality of a full hub vector
Experimental results • Instead of using the entire set rp(H), only the highest m entries are used • A hub vector containing 14 million nonzero entries can be constructed from partial vectors in 6 seconds
Discussion • Are personalized PageRanks even useful? • What if the personally chosen pages are not representative enough, or too focused? • Even if the overhead scales with the number of pages, will light web users accept that overhead? • Performance depends on the choice of personal pages
References • Glen Jeh and Jennifer Widom. Scaling Personalized Web Search. WWW 2003. • Personalized PageRank seminar: Link mining. http://www.informatik.uni-freiburg.de/~ml/teaching/ws04/lm/20041207_PageRank_Alcazar.ppt