Colibri: Fast Mining of Large Static and Dynamic Graphs

Colibri: Fast Mining of Large Static and Dynamic Graphs • Joint work by Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Philip S. Yu, Christos Faloutsos • Speaker: Hanghang Tong • KDD 2008, Aug. 24-27, 2008, Las Vegas

Graphs are everywhere! • Q: How to find patterns? • e.g., community, anomaly, etc.

Motivation • Q: How to find patterns? • e.g., community, anomaly, etc. • A: Low-Rank Approximation (LRA) for Adjacency Matrix of the Graph: $A \approx L \times M \times R$

LRA for Graph Mining: Example • Adj. matrix A (Author x Conf.): $A \approx L \times M \times R$ • Authors: John, Tom, Bob, Carl, Van, Roy • Conf.s: ICDM, KDD, ISMB, RECOMB • L: Au. clusters • M: Interaction • R: Conf. clusters • Recon. error is high => ‘Carl’ is abnormal

Challenges • How to get (L, M, R) + Efficiently (both time and space); + Intuitively (easy for interpretation); + Dynamically (track patterns over time)?

Roadmap • Motivation • Existing Methods: SVD, CUR/CX • Proposed Methods: Colibri • Experimental Results • Conclusion

Matrix & Column Space • Matrix: B = [b1, b2], a 3 x 2 matrix (entries as extracted: 3 1 1 1 0 0) • Column Space of a Matrix: b1, b2 are vectors in 3-d space!

Projection, Projection Matrix & Core Matrix • v: an arbitrary vector • Projection of v: $\tilde{v} = B (B^T B)^{+} B^T v$ • Projection matrix of B: $B (B^T B)^{+} B^T$ • Core Matrix: $(B^T B)^{+}$
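A minimal NumPy sketch of the projection on this slide. The function name `project_onto_column_space` and the concrete numbers in `B` and `v` are illustrative assumptions, not taken from the talk; the core matrix is computed as the pseudo-inverse of B^T B.

```python
import numpy as np

def project_onto_column_space(B, v):
    """Project v onto the space spanned by the columns of B."""
    core = np.linalg.pinv(B.T @ B)      # core matrix (B^T B)^+
    return B @ (core @ (B.T @ v))       # B (B^T B)^+ B^T v

# Illustrative 3 x 2 matrix (assumed values): both columns lie in the x-y plane.
B = np.array([[3.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
v = np.array([2.0, 5.0, 7.0])           # an arbitrary vector

v_proj = project_onto_column_space(B, v)
print(v_proj)                            # the z-component is removed
```

Projecting an already projected vector leaves it unchanged, which is a quick sanity check for any projection matrix.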

Singular-Value-Decomposition (SVD) • A: n x m, with columns a1, a2, a3, …, am • $A \approx U \times \Sigma \times V^T$ • U: left singular vectors u1, …, uk • V: right singular vectors v1, …, vk

SVD: How to • #1: Find the left matrix U, where • #2: Project A into the column space of U (via the projection matrix of the column space of U)
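The two steps can be sketched with NumPy as follows. The helper name `svd_low_rank` and the random test matrix are assumptions for illustration; because the columns of U are orthonormal, the projection matrix of its column space reduces to U U^T.

```python
import numpy as np

def svd_low_rank(A, k):
    """Rank-k approximation of A: find U, then project A onto its column space."""
    # Step 1: left singular vectors (keep the top k).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    # Step 2: project A with the projection matrix U_k U_k^T.
    A_hat = U_k @ (U_k.T @ A)
    return A_hat, U_k

rng = np.random.default_rng(0)
A = rng.random((6, 2)) @ rng.random((2, 5))   # a 6 x 5 matrix of rank 2
A_hat, U_k = svd_low_rank(A, k=2)
print(np.linalg.norm(A - A_hat))              # close to zero for a rank-2 input
```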

SVD: drawbacks • Efficiency • Time • Space: (U, V) are dense • Interpretation • Dynamic: not easy (figure: A = U x V, showing the 1st and 2nd singular vectors)

CUR (CX) decomposition • A: n x m • Sample Columns from A to form C • Project A onto the col. Space of C • $A \approx C \times U \times R$
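A hedged sketch of the sample-then-project idea. The slide does not say how columns are sampled; sampling with probability proportional to squared column norm is an assumption here, as are the name `cx_decompose`, the choice U = (C^T C)^+ and R = C^T A, and the `rcond` cutoff.

```python
import numpy as np

def cx_decompose(A, c, rng):
    """Sample c columns of A (with replacement) and project A onto their span."""
    col_norms_sq = (A * A).sum(axis=0)
    probs = col_norms_sq / col_norms_sq.sum()     # assumed sampling distribution
    idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    C = A[:, idx]                                 # may contain duplicate columns
    U = np.linalg.pinv(C.T @ C, rcond=1e-10)      # core matrix (C^T C)^+
    R = C.T @ A
    return C, U, R, idx

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0, 0.5, 1.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 1.0, 0.0]])
C, U, R, idx = cx_decompose(A, c=3, rng=rng)
A_hat = C @ U @ R                                 # projection of A onto span(C)
print(idx, np.linalg.norm(A - A_hat))
```

Because sampling is with replacement, C can hold repeated or linearly dependent columns, which is why a pseudo-inverse is used for the core matrix.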

CUR (CX): advantages • Efficiency (better than SVD) • Time • (c is # of sampled col.s) • Space (C, R) are sparse • Interpretation

CUR (CX): drawbacks • Redundancy in C, wasting both time and space • e.g., 3 copies of green, • 2 copies of red, • 2 copies of purple • purple=0.5*green + red… • Dynamic: not easy

Roadmap • Motivation • Existing Methods • Colibri: Colibri-S for static graphs, Colibri-D for dynamic graphs • Experimental Results • Conclusion

Colibri-S: Basic Idea • CUR (CX): the sampled columns contain redundancy • 3 copies of green, • 2 copies of red, • 2 copies of purple • purple=0.5*green + red… • Colibri-S: Original Matrix $\approx L \times M \times R$ • We want the columns of L to be linearly independent of each other!

Input Output • Input: initially sampled matrix C • Output: L = linearly independent columns of C • M = $(L^T L)^{-1}$ (Core Matrix) • R = $L^T \times A$ • Q: How to find L & M from C efficiently?

A: Find L & M iteratively! • Start from the initially sampled matrix C and the current L & M • For each col. v in C: • Project it on L • Redundant? => discard v • Otherwise => expand L & M
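The loop above can be sketched as follows. This is a reading of the slide, not the authors' code: the name `colibri_s`, the relative tolerance `tol`, and the use of the standard block-inverse identity to grow M = (L^T L)^{-1} by one column are assumptions. The demo matrix mirrors the green/red/purple example, with purple = 0.5*green + red.

```python
import numpy as np

def colibri_s(A, sampled_idx, tol=1e-10):
    """Keep only linearly independent sampled columns; return L, M, R."""
    n = A.shape[0]
    L = np.empty((n, 0))
    M = np.empty((0, 0))                       # M = (L^T L)^{-1}
    for j in sampled_idx:
        v = A[:, j]
        y = M @ (L.T @ v)                      # coefficients of v in span(L)
        res = v - L @ y                        # residual after projecting on L
        delta = res @ res
        if delta <= tol * (v @ v):             # redundant (or zero): discard v
            continue
        k = L.shape[1]                         # expand M via block inversion
        M_new = np.empty((k + 1, k + 1))
        M_new[:k, :k] = M + np.outer(y, y) / delta
        M_new[:k, k] = -y / delta
        M_new[k, :k] = -y / delta
        M_new[k, k] = 1.0 / delta
        M = M_new
        L = np.hstack([L, v[:, None]])         # expand L with the new column
    R = L.T @ A
    return L, M, R

green = np.array([1.0, 0.0, 1.0, 0.0])
red = np.array([0.0, 1.0, 0.0, 1.0])
purple = 0.5 * green + red
A = np.column_stack([green, red, purple, [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 1.0]])
sampled_idx = [0, 0, 0, 1, 1, 2, 2]            # 3 green, 2 red, 2 purple
L, M, R = colibri_s(A, sampled_idx)
print(L.shape)                                 # only green and red are kept
```

The product L @ M @ R is the projection of A onto the span of the sampled columns, the same approximation CUR (CX) would give from all seven samples, but with two stored columns instead of seven.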

Colibri-S vs. CUR(CX) • Quality: • Colibri-S = CUR(CX) • Time: • Colibri-S is at least as good as CUR(CX) • Space: • Colibri-S is at least as good as CUR(CX) • Illustrations: CUR (CX) vs. Colibri-S

Colibri-D for dynamic graphs • At time t: initially sampled matrix => Lt, Mt, Rt • At time t+1: initially sampled matrix => Lt+1, Mt+1, Rt+1 ? • Q: How to update L and M efficiently?

Colibri-D: How-To • At time t: the initially sampled matrix splits into Selected and Redundant cols, giving Lt, Mt, Rt • At time t+1: some cols are changed from t; Lt+1, Mt+1, Rt+1 = ?

Colibri-D: How-To • At time t: Selected and Redundant cols, with Lt, Mt • Unchanged Cols! The subspace spanned by the blue cols at t+1 gives $\tilde{L}$, $\tilde{M}$ • From these and the initially sampled matrix at t+1, get Lt+1, Mt+1
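A rough sketch of one such update step, under stated assumptions. The slides do not spell out the update rules, so everything here is illustrative: the names `shrink_core`, `try_expand`, and `colibri_d_step`, the use of a Schur-complement identity to derive the core matrix of the unchanged selected columns from Mt, and the choice to re-test every other sampled column afterwards.

```python
import numpy as np

def shrink_core(M, keep):
    """Core matrix of the kept columns, derived from M = (L^T L)^{-1}."""
    keep = np.asarray(keep, dtype=int)
    drop = np.setdiff1d(np.arange(M.shape[0]), keep)
    if keep.size == 0:
        return np.empty((0, 0))
    P = M[np.ix_(keep, keep)]
    if drop.size == 0:
        return P
    Q, S = M[np.ix_(keep, drop)], M[np.ix_(drop, drop)]
    return P - Q @ np.linalg.solve(S, Q.T)     # Schur-complement identity

def try_expand(L, M, v, tol=1e-10):
    """Append v to L (and grow M) unless v is redundant w.r.t. L."""
    y = M @ (L.T @ v)
    res = v - L @ y
    delta = res @ res
    if delta <= tol * (v @ v):
        return L, M, False
    k = L.shape[1]
    M_new = np.empty((k + 1, k + 1))
    M_new[:k, :k] = M + np.outer(y, y) / delta
    M_new[:k, k] = M_new[k, :k] = -y / delta
    M_new[k, k] = 1.0 / delta
    return np.hstack([L, v[:, None]]), M_new, True

def colibri_d_step(A_new, sampled_idx, selected_idx, M_old, changed):
    """Reuse unchanged selected columns from time t, then process the rest."""
    keep_pos = [i for i, j in enumerate(selected_idx) if j not in changed]
    new_selected = [selected_idx[i] for i in keep_pos]
    L = A_new[:, np.array(new_selected, dtype=int)]
    M = shrink_core(M_old, keep_pos)
    for j in sampled_idx:
        if j in new_selected:
            continue
        L, M, added = try_expand(L, M, A_new[:, j])
        if added:
            new_selected.append(j)
    return L, M, L.T @ A_new, new_selected

# Time t: column 2 = column 0 + column 1, so only columns 0, 1, 3 are selected.
A_t = np.array([[1.0, 1.0, 2.0, 1.0],
                [0.0, 1.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0],
                [0.0, 0.0, 0.0, 0.0]])
selected_t = [0, 1, 3]
L_t = A_t[:, selected_t]
M_t = np.linalg.inv(L_t.T @ L_t)

# Time t+1: only column 1 changes.
A_new = A_t.copy()
A_new[:, 1] = [0.0, 2.0, 0.0, 1.0]
L, M, R, selected = colibri_d_step(A_new, [0, 1, 2, 3], selected_t, M_t, changed={1})
print(selected)
```

When few columns change, most of L and M carry over from time t, so only the changed and previously redundant columns need the projection test.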

Roadmap • Motivation • Existing Methods • Colibri • Experimental Results • Conclusion

Experimental Setup • Datasets • Network traffic • 21,837 sources/destinations • 1,222 consecutive hours • 22,800 edges per hour • Accuracy: • Accu = • Space Cost:

Performance of Colibri-S • Accuracy • Same 91%+ • Time • 12x faster than CMD • 28x faster than CUR • Space • ~1/3 of CMD • ~10% of CUR (figures: Time and Space for CUR, CMD, and Ours)

More Evaluation on Colibri-S (figure: Log Time (Sec) vs. Approximation Accuracy for CUR, CMD, and Colibri-S)

Performance of Colibri-D (figure: Time vs. # of changed cols for CMD, Colibri-S, and Colibri-D) • Colibri-D achieves up to 112x speedups

A Family of Low-Rank Approximations for Fast Graph Mining • Colibri-S • For static graphs • Removes redundancy • Significant savings in time & space for “free” • Colibri-D • For dynamic graphs • Exploits “smoothness” • Up to 112x faster than the best known methods

Poster tonight! Thank you! www.cs.cmu.edu/~htong
