760 likes | 931 Views
Large Graph Mining: Power Tools and a Practitioner’s guide. Task 3: Recommendations & proximity Faloutsos, Miller & Tsourakakis CMU. Outline. Introduction – Motivation Task 1: Node importance Task 2: Community detection Task 3: Recommendations Task 4: Connection sub-graphs
E N D
Large Graph Mining:Power Tools and a Practitioner’s guide Task 3: Recommendations & proximity Faloutsos, Miller & Tsourakakis CMU Faloutsos, Miller, Tsourakakis
Outline Introduction – Motivation Task 1: Node importance Task 2: Community detection Task 3: Recommendations Task 4: Connection sub-graphs Task 5: Mining graphs over time Task 6: Virus/influence propagation Task 7: Spectral graph theory Task 8: Tera/peta graph mining: hadoop Observations – patterns of real graphs Conclusions Faloutsos, Miller, Tsourakakis
Acknowledgement: Most of the foils in ‘Task 3’ are by Hanghang TONG www.cs.cmu.edu/~htong Faloutsos, Miller, Tsourakakis
Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis
Motivation: Link Prediction ? A B i i i i Should we introduce Mr. A to Mr. B? Faloutsos, Miller, Tsourakakis
Motivation - recommendations Products / movies customers ‘smith’ ?? Terminator 2 Faloutsos, Miller, Tsourakakis
Answer: proximity • ‘yes’, if ‘A’ and ‘B’ are ‘close’ • ‘yes’, if ‘smith’ and ‘terminator 2’ are ‘close’ QUESTIONS in this part: • How to measure ‘closeness’/proximity? • How to do it quickly? • What else can we do, given proximity scores? Faloutsos, Miller, Tsourakakis
How close is ‘A’ to ‘B’? a.k.a Relevance, Closeness, ‘Similarity’… Faloutsos, Miller, Tsourakakis
Why is it useful? Recommendation And many more Image captioning [Pan+] Conn. / CenterPiece subgraphs [Faloutsos+], [Tong+], [Koren+] and Link prediction [Liben-Nowell+], [Tong+] Ranking [Haveliwala], [Chakrabarti+] Email Management [Minkov+] Neighborhood Formulation [Sun+] Pattern matching [Tong+] Collaborative Filtering [Fouss+] … Faloutsos, Miller, Tsourakakis
Region Automatic Image Captioning Image Test Image Keyword Sea Sun Sky Wave Cat Forest Tiger Grass Q: How to assign keywords to the test image? A: Proximity! [Pan+ 2004] Faloutsos, Miller, Tsourakakis
Center-Piece Subgraph(CePS) Input Output CePS guy CePS Original Graph Q: How to find hub for the black nodes? A: Proximity! [Tong+ KDD 2006] Faloutsos, Miller, Tsourakakis
Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis
How close is ‘A’ to ‘B’? • Should be close, if they have • many, • short • ‘heavy’ paths Faloutsos, Miller, Tsourakakis
Why not shortest path? Some ``bad’’ proximities A: ‘pizza delivery guy’ problem Faloutsos, Miller, Tsourakakis
Why not max. netflow? Some ``bad’’ proximities A: No penalty for long paths Faloutsos, Miller, Tsourakakis
What is a ``good’’ Proximity? … • Multiple Connections • Quality of connection • Direct & In-directed Conns • Length, Degree, Weight… Faloutsos, Miller, Tsourakakis
Random walk with restart 10 9 12 2 8 1 11 3 4 6 5 7 [Haveliwala’02] Faloutsos, Miller, Tsourakakis
Random walk with restart 0.03 0.04 10 9 0.10 12 2 0.08 0.02 0.13 8 1 0.13 11 3 0.04 4 0.05 6 5 0.13 7 0.05 Nearby nodes, higher scores Ranking vector More red, more relevant Faloutsos, Miller, Tsourakakis
Why RWR is a good score? j i : adjacency matrix. c: damping factor all paths from i to j with length 1 all paths from i to j with length 2 all paths from i to j with length 3 Faloutsos, Miller, Tsourakakis
Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • variants • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis
Variant: escape probability Define Random Walk (RW) on the graph Esc_Prob(CMUParis) Prob (starting at CMU, reaches Paris before returning toCMU) the remaining graph CMU Paris Faloutsos, Miller, Tsourakakis Esc_Prob = Pr (smile before cry)
Other Variants Other measure by RWs Community Time/Hitting Time [Fouss+] SimRank [Jeh+] Equivalence of Random Walks Electric Networks: EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+] Spring Systems Katz [Katz], [Huang+], [Scholkopf+] Matrix-Forest-based Alg [Chobotarev+] details Faloutsos, Miller, Tsourakakis
Other Variants Other measure by RWs Community Time/Hitting Time [Fouss+] SimRank [Jeh+] Equivalence of Random Walks Electric Networks: EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+] Spring Systems Katz [Katz], [Huang+], [Scholkopf+] Matrix-Forest-based Alg [Chobotarev+] details All are “related to” or “similar to” random walk with restart! Faloutsos, Miller, Tsourakakis
Map of proximity measurements details Regularized Un-constrained Quad Opt. Norma lize RWR Katz 4 ssp decides 1 esc_prob Esc_Prob + Sink Hitting Time/ Commute Time relax X out-degree Harmonic Func. Constrained Quad Opt. Effective Conductance “voltage = position” String System Faloutsos, Miller, Tsourakakis Physical Models Mathematic Tools
Notice: Asymmetry (even in undirected graphs) C B C-> A : high A-> C: low A D E Faloutsos, Miller, Tsourakakis
Summary of Proximity Definitions Goal: Summarize multiple relationships Solutions Basic: Random Walk with Restarts [Haweliwala’02] [Pan+ 2004][Sun+ 2006][Tong+ 2006] Properties: Asymmetry [Koren+ 2006][Tong+ 2007] [Tong+ 2008] Variants: Esc_Prob and many others. [Faloutsos+ 2004] [Koren+ 2006][Tong+ 2007] Faloutsos, Miller, Tsourakakis
Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis
Preliminary: Sherman–Morrison Lemma details If: x = + = +3 + Faloutsos, Miller, Tsourakakis
Preliminary: Sherman–Morrison Lemma details If: x = + Then: Faloutsos, Miller, Tsourakakis
Sherman – Morrison Lemma – intuition: • Given a small perturbation on a matrix A -> A’ • We can quickly update its inverse Faloutsos, Miller, Tsourakakis
SM: The block-form details A B C D Or… And many other variants… Also known as Woodbury Identity Faloutsos, Miller, Tsourakakis
SM Lemma: Applications details • RLS (Recursive least squares) • and almost any algorithm in time series! • Kalman filtering • Incremental matrix decomposition • … • … and all the fast solutions we will introduce! Faloutsos, Miller, Tsourakakis
Reminder: PageRank • With probability 1-c, fly-out to a random node • Then, we have p = c Bp + (1-c)/n 1 => p = (1-c)/n [I - c B] -1 1 Faloutsos, Miller, Tsourakakis
p = c Bp + (1-c)/n 1 The only difference Restart p Starting vector Adjacency matrix Ranking vector Faloutsos, Miller, Tsourakakis
Computing RWR 10 9 12 2 8 1 11 3 4 6 5 7 p = c Bp + (1-c)/n 1 Restart p Starting vector Adjacency matrix Ranking vector 1 n x 1 n x n n x 1 Faloutsos, Miller, Tsourakakis
Q: Given query i, how to solve it? Query ? ? Starting vector Ranking vector Ranking vector Adjacency matrix Faloutsos, Miller, Tsourakakis
OntheFly: 10 9 12 2 8 1 11 3 0.04 0.03 10 9 0.10 12 4 0.13 0.08 2 0.02 8 1 11 0.13 3 6 0.04 5 4 0.05 6 5 0.13 7 7 0.05 No pre-computation/ light storage Slow on-line response O(mE) Faloutsos, Miller, Tsourakakis
PreCompute 0.04 0.03 10 9 0.10 12 0.13 0.08 2 0.02 8 1 11 0.13 3 0.04 4 0.05 6 5 0.13 7 0.05 10 9 12 2 8 1 11 R: 3 4 6 5 7 c x Q Q Faloutsos, Miller, Tsourakakis
PreCompute: 10 9 12 2 8 1 11 3 0.04 0.03 10 9 0.10 12 4 0.13 0.08 2 0.02 8 1 11 0.13 3 6 0.04 5 4 0.05 6 5 0.13 7 7 0.05 Fast on-line response Heavy pre-computation/storage cost O(n ) 3 Faloutsos, Miller, Tsourakakis O(n ) 2
Q: How to Balance? On-line Off-line Faloutsos, Miller, Tsourakakis
How to balance? Idea (‘B-Lin’) • Break into communities • Pre-compute all, within a community • Adjust (with S.M.) for ‘bridge edges’ H. Tong, C. Faloutsos, & J.Y. Pan. Fast Random Walk with Restart and Its Applications. ICDM, 613-622, 2006. Faloutsos, Miller, Tsourakakis
B_Lin: Basic Idea[Tong+ ICDM 2006] 10 10 9 9 12 12 2 2 8 8 1 1 11 11 3 3 4 4 10 10 9 9 6 6 5 5 12 12 2 2 8 8 1 1 11 11 7 7 3 3 4 4 6 6 5 5 7 7 0.04 10 10 0.03 9 9 10 9 12 12 0.10 2 2 12 0.13 0.08 2 8 8 1 1 0.02 11 11 8 3 3 1 11 0.13 3 0.04 4 4 4 6 6 5 5 0.05 6 5 0.13 7 7 7 0.05 Find Community Combine Fix the remaining Faloutsos, Miller, Tsourakakis
B_Lin: details details ~ + ~ ~ ~ W 1: within community W Cross-community Faloutsos, Miller, Tsourakakis
B_Lin: details details ~ ~ W W1 -1 Easy to be inverted LRA difference -1 ~ ~ I – c I – c –cUSV SM Lemma! Faloutsos, Miller, Tsourakakis
B_Lin: summary • Pre-Computational Stage • Q: • A: A few small, instead of ONE BIG, matrices inversions • On-Line Stage • Q: Efficiently recover one column of Q • A: A few, instead of MANY, matrix-vector multiplications Efficiently compute and store Q Faloutsos, Miller, Tsourakakis
Query Time vs. Pre-Compute Time Log Query Time • Quality: 90%+ • On-line: • Up to 150x speedup • Pre-computation: • Two orders saving Log Pre-compute Time Faloutsos, Miller, Tsourakakis
Detailed outline • Problem dfn and motivation • Solution: Random walk with restarts • Efficient computation • Case study: image auto-captioning • Extensions: bi-partite graphs; tracking • Conclusions Faloutsos, Miller, Tsourakakis
gCaP: Automatic Image Caption Q { } Cat Forest Grass Tiger {?, ?, ?,} … { } Sea Sun Sky Wave A: Proximity! [Pan+ KDD2004] Faloutsos, Miller, Tsourakakis
Sea Sun Sky Wave Cat Forest Tiger Grass Region Image Test Image Keyword Faloutsos, Miller, Tsourakakis
Region Image Test Image {Grass, Forest, Cat, Tiger} Sea Sun Sky Wave Cat Forest Tiger Grass Keyword Faloutsos, Miller, Tsourakakis