
On Link-based Similarity Join

VLDB 2011. Presenter: Reynold Cheng Department of Computer Science The University of Hong Kong ckcheng@cs.hku.hk. On Link-based Similarity Join. A joint work with: Liwen Sun, Xiang Li, David Cheung (University of Hong Kong) Jiawei Han (University of Illinois Urbana Champaign).


Presentation Transcript


  1. On Link-based Similarity Join • VLDB 2011 • Presenter: Reynold Cheng, Department of Computer Science, The University of Hong Kong (ckcheng@cs.hku.hk) • A joint work with: • Liwen Sun, Xiang Li, David Cheung (University of Hong Kong) • Jiawei Han (University of Illinois Urbana Champaign)

  2. Graph applications • Social networks • Bibliographic networks: coauthor/citation relationships • Biological databases: protein-protein interaction • Applications: link prediction, recommendation, spam detection, ... L. Sun, R. Cheng, X. Li, D. Cheung, J. Han

  3. Link-based Similarity (LS) • Similarity between a node pair based on links • Personalized PageRank • [Widom, WWW’03] [Fogaras, Inter. Math’05] • SimRank • [Lizorkin, VLDBJ’10] [Li, SDM’10] • Discounted Hitting Times • [Sarkar, KDD’10]

  4. Similarity Join • Similarity join: discovers relationships between two sets of objects based on some similarity function • Extensively studied in: • high-dimensional data [Boehm, SIGMOD’01] [Dittrich, KDD’01] • sets/strings [Arasu, VLDB’06] [Xiao, WWW’08] • Similarity join for graphs: uses shortest-path distance for road networks and graph pattern matching [Sankaranarayanan, GIS’06; Zou, VLDB’09]

  5. Link-based Similarity Join (LS-Join) • LS-Join: Given two subsets of nodes P and Q in a graph and an LS measure S, return the k node pairs (p, q), with p in P and q in Q, that have the highest values of S(p, q).

  6. LS-Join and Promotion Strategies • Find the top-k closest (Sales, Customer) pairs in a social network, using PageRank • In a citation network, find the top-k similar pairs of papers from the DB and AI communities • [Figure: top-1 LS-Join on Sales, Customer]

  7. More about LS measures • An LS measure often involves random walks • Let P_i(u,v) be a probabilistic measure between u and v • Personalized PageRank (PPR) • P_i(u,v): prob. a surfer from u visits v at the i-th step • SimRank (SR) • P_i(u,v): prob. 2 surfers from u and v first meet at the i-th step • Discounted Hitting Time (DHT) • P_i(u,v): prob. a surfer from u first visits v at the i-th step • P_i(u,v) can be expensive to compute

  8. Challenge of Evaluating LS-Join • Let S(u,v) be the similarity between u and v based on an LS measure • A simple algorithm: • For each node pair (p, q) with p in P and q in Q, compute S(p,q) • Return the k pairs with the highest S(p,q) • Drawback: • S(p,q) is expensive to compute • S(p,q) of a non-answer pair is also evaluated • Can we have a better solution?
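The simple algorithm on this slide can be sketched in a few lines of Python. This is only an illustration: `naive_ls_join`, `toy_similarity`, and the score table are hypothetical names and values, and the similarity function stands in for any LS measure. The sketch makes the drawback visible, since every one of the |P|·|Q| pairs is scored, including pairs that never reach the answer.

```python
import heapq

def naive_ls_join(P, Q, similarity, k):
    """Score every (p, q) pair and keep the k pairs with the highest score."""
    scored = ((similarity(p, q), p, q) for p in P for q in Q)
    return heapq.nlargest(k, scored)

# Hypothetical scores standing in for an expensive LS measure.
TOY_SCORES = {
    ("p1", "q1"): 0.30, ("p1", "q2"): 0.05,
    ("p2", "q1"): 0.10, ("p2", "q2"): 0.45,
}
evaluated = []  # records every pair the join had to score

def toy_similarity(p, q):
    evaluated.append((p, q))
    return TOY_SCORES[(p, q)]

top = naive_ls_join(["p1", "p2"], ["q1", "q2"], toy_similarity, k=1)
print(top, "pairs scored:", len(evaluated))
```

Even for k = 1 the join scores all four pairs, which is the cost that pruning tries to avoid.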

  9. LS-Join Algorithms • Iterative Deepening Join (IDJ) • An algorithm for computing any given LS measure • Customization of IDJ for: • Personalized PageRank (PPR) • SimRank (SR)

  10. e-function: A general form of S(u,v) • S(u,v) has a general form called the e-function: $S(u,v) = a \sum_{i=1}^{\infty} \lambda^{i} P_i(u,v) + b$, where • a, b: real-valued constants; a > 0 • λ: decay factor; 0 < λ < 1 • P_i(u,v): prob. measure • e.g., for PPR: • P_i(u,v): prob. a surfer from u visits v at the i-th step • a = 1 − λ; b = 0 • In practice, we approximate S(u,v) by S_d(u,v), which truncates the sum at some depth d
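As a rough illustration of the depth-d approximation for PPR, the sketch below propagates a surfer's probability distribution step by step and accumulates the decayed terms. Several things here are assumptions rather than details from the slides: the symbol λ (`lam`) for the decay factor, the sum starting at i = 1, the adjacency-list graph format, and the helper names `step` and `truncated_ppr`.

```python
def step(dist, graph):
    """Advance a probability distribution by one random-walk step.

    Probability mass sitting on a node without out-links is dropped.
    """
    nxt = {}
    for node, prob in dist.items():
        out = graph.get(node, [])
        for nb in out:
            nxt[nb] = nxt.get(nb, 0.0) + prob / len(out)
    return nxt

def truncated_ppr(graph, u, v, d, lam):
    """S_d(u, v) = (1 - lam) * sum_{i=1..d} lam**i * V_i(u, v)  (a = 1 - lam, b = 0)."""
    dist = {u: 1.0}
    total = 0.0
    for i in range(1, d + 1):
        dist = step(dist, graph)            # dist[w] is now V_i(u, w)
        total += lam ** i * dist.get(v, 0.0)
    return (1 - lam) * total

# Hypothetical toy graph: node -> list of out-neighbors.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(truncated_ppr(toy_graph, "a", "c", d=2, lam=0.5))
```

Only d steps of the walk are needed, which is why small depths are cheap to evaluate.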

  11. Properties of e-function • S(u,v) can be bounded using S_d(u,v), its approximation at depth d • Observations • This bound decreases exponentially with d • At small d, S_d(u,v) is cheap to compute; it only needs short random walks
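The slide's formula did not survive in this transcript, so the derivation below is one plausible reading rather than the authors' exact statement. It assumes an e-function of the form S(u,v) = a·Σ λ^i·P_i(u,v) + b with a > 0 and 0 < λ < 1, writes S_d for the sum truncated at depth d, and uses only the fact that each P_i(u,v) is a probability, so 0 ≤ P_i(u,v) ≤ 1.

```latex
\begin{aligned}
S(u,v) - S_d(u,v)
  &= a \sum_{i=d+1}^{\infty} \lambda^{i} P_i(u,v) \\
  &\le a \sum_{i=d+1}^{\infty} \lambda^{i}
   = \frac{a\,\lambda^{d+1}}{1-\lambda},
\qquad\text{hence}\qquad
S_d(u,v) \;\le\; S(u,v) \;\le\; S_d(u,v) + \frac{a\,\lambda^{d+1}}{1-\lambda}.
\end{aligned}
```

Under these assumptions the gap between the two bounds shrinks by a factor of λ with every extra step, which matches the observation that the bound decreases exponentially with d.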

  12. Iterative Deepening Join (IDJ) • At iteration i, compute the bound of S(u,v), where d = 2^i • As d increases, the bound shrinks and converges to S(u,v) • Compute the bound more frequently at small depths • Higher pruning power • The bound is cheaper to compute • Conversely, spend less effort at large d

  13. IDJ Example: find the top-1 pair • Iteration 1: d = 2 • Compute S_2: perform 2 steps of random walks • Prune nodes using bounds • [Figure: graph space]

  14. IDJ Example: find the top-1 pair • Iteration 2: d = 4 • Compute S_4: perform 4 steps of random walks • Prune nodes using bounds • [Figure: graph space]

  15. IDJ Example: find the top-1 pair • Iteration 3: d = 8 • Compute S_8: perform 8 steps of random walks • Compute actual score S; return the top-1 pair • [Figure: graph space]
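The iterations in this example can be sketched as follows, using PPR as the measure. This is a simplified reading of IDJ, not the authors' implementation: the function names, the decay symbol `lam`, the sum starting at i = 1, and the slack term lam^(d+1) (the tail of the geometric series for a = 1 − lam) are all assumptions. A pair is pruned once its upper bound falls below the k-th largest lower bound.

```python
import heapq

def step(dist, graph):
    """Advance a probability distribution by one random-walk step."""
    nxt = {}
    for node, prob in dist.items():
        out = graph.get(node, [])
        for nb in out:
            nxt[nb] = nxt.get(nb, 0.0) + prob / len(out)
    return nxt

def truncated_ppr_from(graph, u, d, lam):
    """Return {v: S_d(u, v)} for every v reached within d steps of u."""
    dist, score = {u: 1.0}, {}
    for i in range(1, d + 1):
        dist = step(dist, graph)
        for node, prob in dist.items():
            score[node] = score.get(node, 0.0) + (1 - lam) * lam ** i * prob
    return score

def idj_top_k(graph, P, Q, k, lam=0.5, max_iter=5):
    """Iterative-deepening top-k join: depth doubles each iteration (d = 2, 4, 8, ...)."""
    candidates = [(p, q) for p in P for q in Q]
    lower = {}
    for it in range(1, max_iter + 1):
        d = 2 ** it
        slack = lam ** (d + 1)   # upper bound = lower bound + slack
        scores = {p: truncated_ppr_from(graph, p, d, lam)
                  for p in {p for p, _ in candidates}}
        lower = {(p, q): scores[p].get(q, 0.0) for p, q in candidates}
        kth = heapq.nlargest(k, lower.values())[-1]
        # Keep only pairs whose upper bound can still reach the current top-k.
        candidates = [pair for pair in candidates if lower[pair] + slack >= kth]
        if len(candidates) <= k:
            break
    return sorted(candidates, key=lambda pair: lower[pair], reverse=True)[:k]

# Hypothetical toy graph: node -> list of out-neighbors.
toy_graph = {"p1": ["q1"], "q1": ["p1"], "p2": ["x", "q2"], "x": ["p2"], "q2": ["p2"]}
result = idj_top_k(toy_graph, ["p1", "p2"], ["q1", "q2"], k=1)
print(result)
```

On this toy graph the two cross pairs are pruned at d = 2 and the remaining non-answer pair at d = 4, so no deep walk is ever run for them. If `max_iter` is reached before the candidate set shrinks to k, the sketch ranks the survivors by their lower bounds, which is an approximation.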

  16. Remarks on IDJ • IDJ is inspired by Iterative Deepening Depth-First Search • Search a small scope at early iterations for efficient pruning • Exponentially expand the search scope • Space efficient • Only store the states of one random surfer at a time • Use a small heap to track the top-k candidate pairs • IDJ computes many S_d(u,v) values, which is expensive when d is large • Can we achieve better pruning for PPR and SR?

  17. Customization for PPR • Personalized PageRank • V_i(p,q): prob. a random surfer from p visits q at the i-th step.

  18. Customization for PPR • Upper bound for PPR • V_i(p,Q): prob. a random surfer from p visits any node in Q at the i-th step. • V_i(p,q) ≤ V_i(p,Q), since q is in Q. • Replace V_i(p,q) with V_i(p,Q) to obtain an upper bound of S_d(p,q). • How to obtain V_i(p,Q) efficiently? • Take nodes in Q as start points, and perform backward random walks

  19. Example: Compute V_2(p, Q) • Normal (forward) random walk from P to Q • [Figure: graph with edge probabilities between P and Q] • For one node p in P: V_2(p, Q) = 1/10 + 1/10 = 1/5

  20. Example: Compute V_2(p, Q) • Normal (forward) random walk from P to Q • [Figure: graph with edge probabilities between P and Q] • For one node p in P: V_2(p, Q) = 1/10 + 1/10 = 1/5 • For another node p in P: V_2(p, Q) = 1/5 + 1/5 = 2/5

  21. Example: Compute V_2(p, Q) • Normal (forward) random walk vs. backward random walk • [Figure: the same graph walked forward from P and backward from Q] • For one node p in P: V_2(p, Q) = 1/10 + 1/10 = 1/5 • For another node p in P: V_2(p, Q) = 1/5 + 1/5 = 2/5 • Benefit • Compute V_2(p, Q) for all p in P by ONE ROUND of random walks: an O(|P|) improvement!
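The benefit of the backward walk can be illustrated with a small sketch. The toy graph, its probabilities, and the function names below are hypothetical and do not reproduce the slide's figure. The forward version needs one walk per source node, whereas the backward version starts from Q and pushes values against the edge direction, yielding V_i(p, Q) for every node in a single sweep.

```python
def step(dist, graph):
    """Advance a probability distribution by one forward random-walk step."""
    nxt = {}
    for node, prob in dist.items():
        out = graph.get(node, [])
        for nb in out:
            nxt[nb] = nxt.get(nb, 0.0) + prob / len(out)
    return nxt

def forward_visit_prob(graph, p, targets, i):
    """V_i(p, Q): one forward walk of i steps from a single source p."""
    dist = {p: 1.0}
    for _ in range(i):
        dist = step(dist, graph)
    return sum(dist.get(q, 0.0) for q in targets)

def backward_visit_probs(graph, targets, i):
    """Return {node: V_i(node, Q)} for all nodes with one backward sweep from Q."""
    reach = {q: 1.0 for q in targets}   # 0 steps left: we must already be in Q
    for _ in range(i):
        # A node's value is the average of its out-neighbors' values.
        reach = {
            node: (sum(reach.get(nb, 0.0) for nb in out) / len(out)) if out else 0.0
            for node, out in graph.items()
        }
    return reach

# Hypothetical toy graph: node -> list of out-neighbors.
toy_graph = {
    "p1": ["a", "b"], "p2": ["b"],
    "a": ["q1", "c", "d", "e"], "b": ["q2", "c"],
    "c": [], "d": [], "e": [], "q1": [], "q2": [],
}
Q = {"q1", "q2"}
backward = backward_visit_probs(toy_graph, Q, 2)
print(backward["p1"], backward["p2"])
```

Both routines agree on every source node, but the backward sweep runs once regardless of how many nodes P contains.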

  22. Customization for SR (Sketch) • SR is more difficult to handle than PPR • SR involves computing the prob. that two random surfers first meet at the i-th iteration • Computing P_i(p,q) and S_d(u,v) can be very costly • Idea: prune node pairs without evaluating P_i • Pr(“first meet”) ≤ Pr(“meet”) • Pr(“meet”) is much cheaper to derive • Further speed up by backward random walk
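The inequality Pr("first meet") ≤ Pr("meet") can be checked numerically on a toy graph. The sketch below is an illustration under stated assumptions, not the authors' method: both surfers move independently along the adjacency lists given (for SimRank these would conventionally be in-link lists), and the function names and the graph are hypothetical. The meeting probability needs only two single-surfer walks, while the first-meeting probability requires tracking pairs of positions.

```python
def step(dist, graph):
    """Advance a probability distribution by one random-walk step."""
    nxt = {}
    for node, prob in dist.items():
        out = graph.get(node, [])
        for nb in out:
            nxt[nb] = nxt.get(nb, 0.0) + prob / len(out)
    return nxt

def meet_prob(graph, u, v, i):
    """Pr(two independent surfers from u and v are at the same node at step i)."""
    du, dv = {u: 1.0}, {v: 1.0}
    for _ in range(i):
        du, dv = step(du, graph), step(dv, graph)
    return sum(prob * dv.get(node, 0.0) for node, prob in du.items())

def first_meet_prob(graph, u, v, i):
    """Pr(the surfers meet at step i and at no earlier step); assumes u != v, i >= 1."""
    pairs = {(u, v): 1.0}   # joint positions of surfers that have not met yet
    for step_no in range(1, i + 1):
        nxt = {}
        for (x, y), prob in pairs.items():
            ox, oy = graph.get(x, []), graph.get(y, [])
            for nx in ox:
                for ny in oy:
                    key = (nx, ny)
                    nxt[key] = nxt.get(key, 0.0) + prob / (len(ox) * len(oy))
        if step_no == i:
            return sum(prob for (x, y), prob in nxt.items() if x == y)
        # Discard pairs that have already met so later meetings are not counted.
        pairs = {pair: prob for pair, prob in nxt.items() if pair[0] != pair[1]}
    return 0.0

# Hypothetical toy graph: node -> list of neighbors followed by the surfers.
toy_graph = {"u": ["a", "b"], "v": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["c"]}
for i in range(1, 4):
    print(i, first_meet_prob(toy_graph, "u", "v", i), meet_prob(toy_graph, "u", "v", i))
```

The pair-state walk is what makes first-meeting probabilities costly: its state space is quadratic in the number of nodes, while the meeting probability reuses ordinary single-surfer distributions.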

  23. Experiments • Data sets • Yeast: protein-protein interaction graph • Coauthor: graph extracted from DBLP • Cora: citation graph • Default value • k = 50

  24. PPR on Yeast

  25. PPR on Coauthor

  26. Performance Analysis • PPR on Coauthor

  27. SR on Cora

  28. Performance Analysis • SR on Cora

  29. Conclusions • The LS-Join is a similarity join for graph applications • The e-function captures random-walk LS measures • We develop two LS-Join algorithms • IDJ for any e-function • Customized and faster algorithms for PPR and SR

  30. Thank you! • Reynold Cheng • University of Hong Kong • ckcheng@cs.hku.hk • http://www.cs.hku.hk/~ckcheng

  31. Future Work • Examine other link-based similarity measures • Consider content-based and link-based similarity together • Develop indexes and distributed algorithms

  32. References • J. Sankaranarayanan et al. Distance join queries on spatial networks. In GIS, pages 211–218, 2006. • L. Zou et al. Distance-join: pattern match query in a large graph database. PVLDB, 2(1):886–897, 2009. • J. Dittrich et al. GESS: a scalable similarity-join algorithm for mining large data sets in high dimensional spaces. In KDD, pages 47–56, 2001. • A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918–929, 2006. • C. Boehm et al. Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data. In SIGMOD, pages 379–388, 2001. • C. Xiao et al. Efficient similarity joins for near duplicate detection. In WWW, pages 131–140, 2008. • G. Jeh and J. Widom. Scaling personalized web search. In WWW, pages 271–279, 2003. • D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for SimRank computation. VLDBJ, 19:45–66, 2010. • P. Li et al. Fast single-pair SimRank computation. In SDM, pages 571–582, 2010. • D. Fogaras and B. Rácz. Scaling link-based similarity search. In WWW, pages 641–650, 2005. • P. Sarkar and A. Moore. Fast nearest neighbor search in disk-resident graphs. In KDD, pages 513–522, 2010.
