Fast Random Walk with Restart and Its Applications

Hanghang Tong, Christos Faloutsos and Jia-Yu (Tim) Pan. ICDM 2006, Dec. 18-22, Hong Kong

Motivating Questions • Q: How to measure the relevance? • A: Random walk with restart • Q: How to do it efficiently? • A: This talk tries to answer!

Random walk with restart [Figure: example graph with 12 nodes]

Random walk with restart [Figure: the example graph with a relevance score on each node, ranging from 0.02 to 0.13] • Ranking vector • Nearby nodes, higher scores • More red, more relevant

Automatic Image Caption • Q: Given images captioned {Sea, Sun, Sky, Wave} and {Cat, Forest, Grass, Tiger}, which keywords {?, ?, ?} caption a new test image? • A: RWR! [Pan KDD2004]

[Figure: graph with Region, Image and Keyword nodes, plus the Test Image. Keywords: Sea, Sun, Sky, Wave, Cat, Forest, Tiger, Grass]

[Figure: the same Region / Image / Keyword graph, with the Test Image captioned {Grass, Forest, Cat, Tiger}]

Neighborhood Formulation • Q: What is the most related conference to ICDM? • A: RWR! [Sun ICDM2005] [Figure: bipartite graph of Conference and Author nodes]

NF: example

Center-Piece Subgraph (CePS) • Q: ? • A: RWR! [Tong KDD 2006] [Figure: the original graph (black: query nodes) and the resulting CePS]

CePS: Example

Other Applications • Content-based Image Retrieval [He] • Personalized PageRank [Jeh], [Widom], [Haveliwala] • Anomaly Detection (for node; link) [Sun] • Link Prediction [Getoor], [Jensen] • Semi-supervised Learning [Zhu], [Zhou] • …

Roadmap • Background • RWR: Definitions • RWR: Algorithms • Basic Idea • FastRWR • Pre-Compute Stage • On-Line Stage • Experimental Results • Conclusion

Computing RWR [Figure: the example graph next to the annotated RWR equation] • Ranking vector (n x 1) • Adjacency matrix (n x n) • Restart p (scalar) • Starting vector (n x 1)
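A sketch of the equation these annotations describe, written in the notation of the accompanying paper. This is an assumption: the slide only labels a restart term "p", whereas the paper uses a constant c with restart probability 1 - c.

```latex
\[
\vec{r}_i = c\,\tilde{W}\,\vec{r}_i + (1 - c)\,\vec{e}_i
\]
% \vec{r}_i : ranking vector for query node i   (n x 1)
% \tilde{W} : normalized adjacency matrix       (n x n)
% \vec{e}_i : starting vector, 1 at position i  (n x 1)
% 1 - c     : restart probability               (scalar)
```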

Beyond RWR • The RWR equation: "Maxwell Equation" for the Web! [Chakrabarti] • P-PageRank [Haveliwala] • SM Learning [Zhou, Zhu] • RL in CBIR [He] • RWR [Pan, Sun] • PageRank [Haveliwala] • Fast RWR finds the root solution!

Q: Given query i, how to solve it?

OnTheFly [Figure: the example graph and the ranking vector computed for the query] • No pre-computation / light storage • Slow on-line response: O(mE)
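A minimal sketch of the OnTheFly approach, assuming it means iterating the RWR update until the ranking vector stops changing (which would explain the O(mE) cost as m iterations over E edges). The function name, the toy path graph and the value c = 0.9 are illustrative choices, not taken from the talk.

```python
def rwr_on_the_fly(neighbors, query, c=0.9, tol=1e-10, max_iter=1000):
    """Iterate r <- c * W r + (1 - c) * e_query on an unweighted, undirected graph.

    neighbors[j] lists the nodes adjacent to j; W is the column-normalized
    adjacency matrix, so node j passes r[j] / degree(j) to each neighbor.
    """
    n = len(neighbors)
    r = [0.0] * n
    r[query] = 1.0
    for _ in range(max_iter):
        new_r = [0.0] * n
        new_r[query] = 1.0 - c               # restart mass returns to the query node
        for j, nbrs in enumerate(neighbors):  # one pass over all edges
            if not nbrs:
                continue
            share = c * r[j] / len(nbrs)
            for i in nbrs:
                new_r[i] += share
        if sum(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r


# Toy graph (hypothetical): a path 0 - 1 - 2 - 3, queried at node 0.
neighbors = [[1], [0, 2], [1, 3], [2]]
scores = rwr_on_the_fly(neighbors, query=0)
```

Nothing is stored between queries, which is the "light storage" side of the trade-off; every new query repeats the whole iteration.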

PreCompute [Haveliwala] [Figure: the example graph, its ranking vector, and the pre-computed matrix R]

PreCompute • Fast on-line response • Heavy pre-computation / storage cost: O(n^3) / O(n^2)
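The two costs follow from the closed-form solution of the RWR equation. A sketch in the paper's notation (an assumption here, with restart probability 1 - c):

```latex
\[
\vec{r}_i = (1 - c)\,(I - c\,\tilde{W})^{-1}\,\vec{e}_i = (1 - c)\,Q^{-1}\,\vec{e}_i,
\qquad Q = I - c\,\tilde{W}
\]
% Inverting the dense n x n matrix Q costs O(n^3); storing Q^{-1} costs O(n^2).
% A query for node i then only reads column i of Q^{-1}.
```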

Q: How to Balance? On-line Off-line

Basic Idea [Figure: the example graph shown at each step] • Find Community • Combine • Fix the remaining

Pre-computational stage • Q: Efficiently compute and store Q^-1 • A: A few small, instead of ONE BIG, matrix inversions

On-Line Query Stage • Q: Efficiently recover one column of Q^-1 • A: A few, instead of MANY, matrix-vector multiplications

Pre-compute Stage • P1: B_Lin Decomposition • P1.1 partition • P1.2 low-rank approximation • P2: Q matrices • P2.1 computing (for each partition) • P2.2 computing (for concept space)

P1.1: partition [Figure: the example graph split into partitions] • Within-partition links • Cross-partition links

P1.1: block-diagonal [Figure: the partitioned example graph]

P1.2: LRA for W2 [Figure: low-rank approximation of the cross-partition links] • |S| << |W2|

p2.1 Computing

Comparing the per-partition inverses (Q1,1, Q1,2, ..., Q1,k) with the single full inverse • Computing Time • 100,000 nodes; 100 partitions • Computing is 10,000x faster! • Storage Cost • 100x saving!
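The two ratios can be checked with back-of-the-envelope arithmetic, assuming equal-sized partitions, cubic cost for a dense inversion, and quadratic storage for a dense matrix (these cost models are assumptions, not stated on the slide):

```python
n = 100_000          # nodes
k = 100              # partitions
block = n // k       # 1,000 nodes per partition, assuming equal sizes

# Time: one n x n inversion versus k inversions of (n/k) x (n/k) blocks.
full_inversion_ops = n ** 3
blockwise_inversion_ops = k * block ** 3
speedup = full_inversion_ops // blockwise_inversion_ops      # equals k**2

# Storage: one dense n x n matrix versus k dense (n/k) x (n/k) blocks.
full_storage = n ** 2
blockwise_storage = k * block ** 2
storage_saving = full_storage // blockwise_storage           # equals k

print(speedup, storage_saving)
```

Under these assumptions the speedup is k^2 = 10,000 and the storage saving is k = 100, matching the figures on the slide.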

Q: How to fix the green portions?

P2.2: Computing (for concept space) [Figure: equation built from the per-partition matrices Q1,1, Q1,2, ..., Q1,k together with U and V]

We have: Communities Bridges SM Lemma says:

On-Line Stage • Q: Query + Pre-Computation = Result? • A: SM lemma
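A sketch of the identity being appealed to, assuming "SM lemma" refers to the Sherman-Morrison lemma and that the normalized adjacency matrix is decomposed into within-partition links plus a low-rank term, as in the paper. The symbols below come from the paper, not from the slide text.

```latex
\[
\begin{aligned}
\tilde{W} &\approx \tilde{W}_1 + U S V, \qquad Q_1 = I - c\,\tilde{W}_1 \\
(I - c\,\tilde{W}_1 - c\,U S V)^{-1}
  &= Q_1^{-1} + c\,Q_1^{-1}\,U\,\Lambda\,V\,Q_1^{-1},
  \qquad \Lambda = \left(S^{-1} - c\,V\,Q_1^{-1}\,U\right)^{-1} \\
\vec{r}_i &\approx (1 - c)\left(Q_1^{-1}\vec{e}_i + c\,Q_1^{-1}\,U\,\Lambda\,V\,Q_1^{-1}\vec{e}_i\right)
\end{aligned}
\]
% Q_1^{-1} is block-diagonal (the communities); U, S, V capture the
% cross-partition links (the bridges); \Lambda is only |S| x |S|.
```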

On-Line Query Stage: six steps, q1 to q6
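A minimal end-to-end sketch of how the query stage could be carried out as a chain of matrix-vector products, assuming the decomposition W ≈ W1 + U S V and the Sherman-Morrison form from the paper. The split into six steps is a guess at what q1 to q6 show; the helper names, the 4-node toy graph and c = 0.9 are hypothetical, and the dense pure-Python linear algebra is for illustration only.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def invert(A):
    """Gauss-Jordan inversion with partial pivoting; adequate for tiny matrices."""
    n = len(A)
    M = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(M[k][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for k in range(n):
            if k != col and M[k][col] != 0.0:
                f = M[k][col]
                M[k] = [v - f * w for v, w in zip(M[k], M[col])]
    return [row[n:] for row in M]

def i_minus_c(W, c):
    return [[float(i == j) - c * w for j, w in enumerate(row)]
            for i, row in enumerate(W)]

def precompute(W1, partitions, U, S, V, c):
    """Off-line: one small inversion per partition, plus one |S| x |S| inversion."""
    n = len(W1)
    Q1_inv = [[0.0] * n for _ in range(n)]
    for nodes in partitions:
        block = [[W1[i][j] for j in nodes] for i in nodes]
        inv = invert(i_minus_c(block, c))
        for a, i in enumerate(nodes):
            for b, j in enumerate(nodes):
                Q1_inv[i][j] = inv[a][b]
    S_inv = invert(S)
    VQU = matmul(matmul(V, Q1_inv), U)
    Lam = invert([[s - c * v for s, v in zip(srow, vrow)]
                  for srow, vrow in zip(S_inv, VQU)])
    return Q1_inv, Lam

def online_query(i, Q1_inv, U, Lam, V, c):
    """On-line: a handful of matrix-vector products, no iteration."""
    e = [0.0] * len(Q1_inv)
    e[i] = 1.0
    r0 = matvec(Q1_inv, e)     # step 1: within-community scores
    t = matvec(V, r0)          # step 2: project onto the bridges
    t = matvec(Lam, t)         # step 3: apply the small pre-computed core
    t = matvec(U, t)           # step 4: map back to node space
    t = matvec(Q1_inv, t)      # step 5: spread within communities again
    return [(1 - c) * (a + c * b) for a, b in zip(r0, t)]  # step 6: combine

# Toy example: path 0 - 1 - 2 - 3, partitions {0, 1} and {2, 3}.
c = 0.9
W1 = [[0, 0.5, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0.5, 0]]  # within-partition
U = [[0, 0], [1, 0], [0, 1], [0, 0]]   # U S V reproduces the single bridge 1 - 2
S = [[0, 0.5], [0.5, 0]]
V = [[0, 1, 0, 0], [0, 0, 1, 0]]
Q1_inv, Lam = precompute(W1, [[0, 1], [2, 3]], U, S, V, c)
scores = online_query(0, Q1_inv, U, Lam, V, c)
```

In this toy case U S V reproduces the cross-partition links exactly, so the result matches a direct inversion of I - cW. With a genuinely low-rank approximation the scores would be approximate, which is where the quality figures in the experiments come in.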

Experimental Setup • Dataset • DBLP/authorship • Author-Paper • 315k nodes • 1,800k edges • Approx. Quality: Relative Accuracy • Application: Center-Piece Subgraph

Query Time vs. Pre-Compute Time [Plot: Log Query Time against Log Pre-compute Time] • Quality: 90%+ • On-line: up to 150x speedup • Pre-computation: two orders of magnitude saving

Query Time vs. Pre-Storage [Plot: Log Query Time against Log Storage] • Quality: 90%+ • On-line: up to 150x speedup • Pre-storage: three orders of magnitude saving

Conclusion • FastRWR • Reasonable quality preservation (90%+) • 150x speed-up in query time • Orders of magnitude saving in pre-computation and storage • More in the paper • A variant of FastRWR and theoretical justification • Implementation details • Normalization, low-rank approximation, sparsity • More experiments • Other datasets, other applications

Q&A Thank you! htong@cs.cmu.edu www.cs.cmu.edu/~htong
