
Scaling Personalized Web Search




Presentation Transcript


  1. Scaling Personalized Web Search Glen Jeh, Jennifer Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion

  2. Today’s topics • Overview • Motivation • Personal PageRank Vector • Efficient calculation of PPV • Experimental results • Discussion

  3. PageRank Overview • Ranking method of web pages based on the link structure of the web • Important pages are those linked-to by many important pages • Original PageRank has no initial preference for any particular pages

  4. PageRank Overview • The ranking is based on the probability that a random surfer will visit a certain page at a given time • The teleportation distribution E(p) can be: • Uniform • Biased toward particular pages
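The random-surfer model above can be sketched as a short power iteration. This is an illustrative toy example, not code from the paper; the graph, function name, and parameter values are assumptions, with c playing the role of the teleport probability.

```python
# Minimal power-iteration sketch of PageRank with a uniform teleport
# distribution E(p) = 1/n. Toy graph and parameter values are illustrative.
def pagerank(graph, c=0.15, iters=100):
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}          # start uniform
    for _ in range(iters):
        new = {v: c / n for v in graph}         # teleport mass, uniform E(p)
        for v, out in graph.items():
            for w in out:                       # follow a random out-link
                new[w] += (1 - c) * rank[v] / len(out)
        rank = new
    return rank

# Toy graph: every page has at least one out-link (no dangling pages).
graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}
ranks = pagerank(graph)
```

Replacing the uniform teleport term with a biased distribution E(p) yields the personalized variant discussed next.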

  5. Motivation • We would like to give higher importance to user-selected pages • A user may have a set P of preferred pages • Instead of jumping to any random page with probability c, the jump is restricted to P • That way, we increase the probability that the random surfer will stay in the neighborhood of the pages in P • Taking P into account creates a personalized view of the importance of pages on the web

  6. Personalized PageRank Vector (PPV) • Restrict preference sets P to subsets of a set of hub pages H – pages with high PageRank • PPV is a vector of length n, where n is the number of pages on the web • PPV[p] = the importance of page p

  7. PPV Equation • v = (1 − c)·A·v + c·u • u – preference vector • |u| = 1 • u(p) = the amount of preference for page p • A – n×n matrix derived from the web graph, with A[i][j] = 1/|O(j)| if page j links to page i, and 0 otherwise • c – the probability that the random surfer jumps to a page in P

  8. PPV – Problem • Not practical to compute PPVs during query time • Not practical to compute and store offline • There are 2^|H| possible preference sets • How to calculate PPV? How to do it efficiently?

  9. Main Steps to solution • Break down preference vectors into common components • Computation divided between offline (lots of time) and online (focused computation) • Eliminates redundant computation

  10. Linearity Theorem • The solution to a linear combination of preference vectors is the same linear combination of the corresponding PPVs: if u = Σ α_i·x_i, then the PPV for u is Σ α_i·r_i • Let x_i be a unit vector (all preference on page i) • Let r_i be the PPV corresponding to x_i, called a hub vector

  11. Example (figure) • David's personal preferences are pages 1, 2, and 12, so u combines the unit vectors x_1, x_2, x_12 • His PPV is the same combination of the hub vectors r_1, r_2, r_12

  12. Good, but not enough… • If the hub vector r_i for each page in H can be computed ahead of time and stored, then computing a PPV is easier • The number of pre-computed PPVs decreases from 2^|H| to |H| • But… • Each hub vector computation requires multiple scans of the web graph • Time and space grow linearly with |H| • The solution so far is impractical

  13. Decomposition of Hub Vectors • In order to compute and store the hub vectors efficiently, we can further break them down into… • Partial vector – a unique component per page • Hubs skeleton – encodes the interrelationships among hub vectors • The full hub vector is constructed from these at query time • Saves computation time and storage due to sharing of components among hub vectors

  14. Inverse P-distance • Hub vector r_p can be represented as an inverse P-distance vector: r_p(q) = Σ over paths t from p to q of P[t] · c · (1 − c)^l(t) • l(t) – the number of edges in path t • P[t] – the probability of traveling along path t
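The path sum can be evaluated directly on a small graph by tracking the walk's probability mass level by level, since all paths of length l share the factor (1 − c)^l. An illustrative sketch; the function name and toy graph are assumptions.

```python
# Computes r_p(q) = sum over paths t from p to q of P[t] * c * (1-c)^l(t),
# by accumulating the walk's mass at q over path lengths 0..max_len.
def inverse_p_distance(graph, p, q, c=0.15, max_len=200):
    total = 0.0
    mass = {p: 1.0}   # mass[v] = sum of P[t]*(1-c)^l over length-l paths p -> v
    for _ in range(max_len + 1):
        total += c * mass.get(q, 0.0)        # paths that terminate at q here
        nxt = {}
        for v, m in mass.items():
            for w in graph[v]:
                nxt[w] = nxt.get(w, 0.0) + (1 - c) * m / len(graph[v])
        mass = nxt
    return total

graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}
```

Truncating at max_len leaves only a geometrically small tail, since total mass at length l is (1 − c)^l.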

  15. Partial Vectors • Breaking r_p into two components: r_p = r_p^H + (r_p − r_p^H) • Partial vector r_p^H – computed from the paths that use no intermediate node from H • The rest (r_p − r_p^H) – the paths that go through some page of H • For well-chosen sets H, it will be true that for many pages p, q, all paths from p to q pass through H, so the partial vectors are small and sparse

  16. Good, but not enough… • Precompute and store the partial vector r_p^H • Cheaper to compute and store than the full r_p • The cost decreases as |H| increases • Add the remainder (r_p − r_p^H) at query time to compute the full hub vector • But… • Computing and storing r_p − r_p^H could be as expensive as r_p itself

  17. Hubs Skeleton • Breaking down the remainder (r_p − r_p^H), which covers the paths that go through some hub page (also handling the case where p or q is itself in H) • Hubs skeleton – the set of distances r_p(h) among hub pages, giving the interrelationships among partial vectors: r_p = r_p^H + (1/c) · Σ over h ∈ H of (r_p(h) − c·x_p(h)) · (r_h^H − c·x_h) • For each p, r_p(H) has size at most |H|, much smaller than the full hub vector

  18. Example (figure): a small web graph with hub set H and pages a, b, c, d

  19. Putting it all together • Given a chosen preference set P: • Form a preference vector u • Calculate the hub vector for each page i_k ∈ P, using the pre-computed partial vectors (the hubs-skeleton step may be deferred to query time) • Combine the hub vectors

  20. Algorithms • Decomposition theorem • Basic dynamic programming algorithm • Partial vectors - Selective expansion algorithm • Hubs skeleton - Repeated squaring algorithm

  21. Decomposition theorem • The basis vector r_p is the average of the basis vectors of its out-neighbors, plus a compensation factor: r_p = c·x_p + (1 − c)/|O(p)| · Σ over out-neighbors o of r_o • Defines relationships among basis vectors • Having computed the basis vectors of p's out-neighbors to a certain precision, we can use the theorem to compute r_p to greater precision

  22. Basic dynamic programming algorithm • Using the decomposition theorem, we can build a dynamic programming algorithm that iteratively improves the precision of the calculation • On iteration k, only paths of length ≤ k − 1 are considered • The error is reduced by a factor of 1 − c on each iteration
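The iteration can be sketched directly from the decomposition theorem, recomputing each basis vector from its out-neighbors' previous estimates. A hypothetical sketch, not the paper's implementation; the toy graph and names are assumptions.

```python
# Basic dynamic programming: D[p] approximates the basis vector r_p as a
# sparse dict q -> score, refined each iteration via the decomposition
#   r_p = c*x_p + (1-c)/|O(p)| * sum of r_o over out-neighbors o of p.
def basis_vectors(graph, c=0.15, iters=200):
    D = {p: {p: c} for p in graph}            # iteration 0: zero-length paths only
    for _ in range(iters):
        new = {}
        for p, out in graph.items():
            vec = {p: c}                      # compensation factor c*x_p
            for o in out:                     # average over out-neighbors
                for q, s in D[o].items():
                    vec[q] = vec.get(q, 0.0) + (1 - c) * s / len(out)
            new[p] = vec
        D = new
    return D

graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}
D = basis_vectors(graph)
```

Each pass extends the covered paths by one edge, so the remaining error shrinks by the promised factor of 1 − c per iteration.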

  23. Computing partial vectors • Selective expansion algorithm • Tours passing through a hub page h ∈ H are never considered • The expansion from p stops upon reaching a page in H
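Selective expansion can be sketched by pushing probability mass outward from p and refusing to expand hub pages, so only tours avoiding H as intermediate nodes contribute. A simplified illustration (for brevity, p itself is always expanded); names and the toy graph are assumptions.

```python
# Sketch of computing the partial vector r_p^H: mass reaching a hub page is
# recorded but never expanded further, so tours through H are excluded.
def partial_vector(graph, p, H, c=0.15, iters=100):
    partial = {}                                  # accumulates r_p^H
    residual = {p: 1.0}                           # mass not yet expanded
    for _ in range(iters):
        nxt = {}
        for q, m in residual.items():
            partial[q] = partial.get(q, 0.0) + c * m
            if q in H and q != p:
                continue                          # stop the expansion at hub pages
            for w in graph[q]:
                nxt[w] = nxt.get(w, 0.0) + (1 - c) * m / len(graph[q])
        residual = nxt
        if not residual:                          # all mass blocked by hubs
            break
    return partial

graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}
pv = partial_vector(graph, 'a', {'b'})            # hub set H = {'b'}
```

With H = {'b'}, page 'c' gets no entry at all: every path from 'a' to 'c' passes through the hub 'b', which is exactly why well-chosen hub sets make partial vectors sparse.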

  24. Computing hubs skeleton • Repeated squaring algorithm • Uses the intermediate results from the computation of partial vectors • The error is squared on each iteration, reducing it much faster • Running time and storage depend only on the size of r_p(H) • This allows the computation to be deferred to query time

  25. Experimental results • Experiments were performed using real web data from Stanford's WebBase, containing 80 million pages after removing leaf pages • Experiments were run on a machine with a 1.4 GHz CPU and 3.5 GB of memory

  26. Experimental results • The partial-vector approach is much more effective when H contains high-PageRank pages • H was varied from the top 1,000 to the top 100,000 pages with the highest PageRank

  27. Experimental results • The hubs skeleton was computed for |H| = 10,000 • The average size is 9,021 entries, much smaller than the dimension of the full hub vectors

  28. Experimental results • Instead of using the entire set r_p(H), only the highest m entries are used • A hub vector containing 14 million nonzero entries can be constructed from partial vectors in 6 seconds

  29. Discussion • Are personalized PageRanks even useful? • What if the personally chosen pages are not representative enough? Too focused? • Even if the overhead scales with the number of pages, will casual web users accept that overhead? • Performance depends on the choice of personal pages

  30. References • Scaling Personalized Web Search, Glen Jeh and Jennifer Widom, WWW 2003 • Personalized PageRank seminar: Link mining, http://www.informatik.uni-freiburg.de/~ml/teaching/ws04/lm/20041207_PageRank_Alcazar.ppt
