Colibri: Fast Mining of Large Static and Dynamic Graphs. Joint Work by Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Philip S. Yu, Christos Faloutsos Speaker: Hanghang Tong. Aug. 24-27, 2008, Las Vegas KDD 2008.
Graphs are everywhere! • Q: How to find patterns? • e.g., community, anomaly, etc.
Motivation • Q: How to find patterns? • e.g., community, anomaly, etc. • A: Low-Rank Approximation (LRA) for the adjacency matrix of the graph: $A \approx L \times M \times R$
LRA for Graph Mining: Example • Adj. matrix (author x conf.): $A \approx L \times M \times R$ • Authors: John, Tom, Bob, Carl, Van, Roy • Conf.s: ICDM, KDD, ISMB, RECOMB • L: au. clusters • M: interaction • R: conf. clusters • Recon. error is high, so ‘Carl’ is abnormal
Challenges • How to get (L, M, R) + Efficiently (both time and space); + Intuitively (easy to interpret); + Dynamically (track patterns over time)?
Roadmap • Motivation • Existing Methods • SVD • CUR/CX • Proposed Methods: Colibri • Experimental Results • Conclusion
Matrix & Column Space • Matrix: B = [b1, b2] (entries: 3 1 1 1 0 0), where b1, b2 are vectors in 3-d space! • Column space of a matrix: all linear combinations of b1 and b2
Projection, Projection Matrix & Core Matrix • Projection matrix of B: $B (B^T B)^{+} B^T$ • Core matrix: $(B^T B)^{+}$ • Projection of an arbitrary vector $v$: $\tilde{v} = B (B^T B)^{+} B^T v$
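The projection formula on this slide can be checked numerically. The sketch below is only an illustration in NumPy: the function name `project_onto_column_space` and the example matrix and vector are assumptions made for the demo, not values from the talk.

```python
import numpy as np

def project_onto_column_space(B, v):
    """Return (projection of v, core matrix) for the column space of B."""
    core = np.linalg.pinv(B.T @ B)   # core matrix (B^T B)^+
    v_proj = B @ core @ (B.T @ v)    # B (B^T B)^+ B^T v
    return v_proj, core

# Hypothetical example: two vectors in 3-d space that span the x-y plane.
B = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 0.0]])
v = np.array([2.0, 3.0, 4.0])

v_proj, core = project_onto_column_space(B, v)
print(v_proj)  # the component of v lying in the x-y plane
```

Only the small core matrix is inverted here, which is why the later slides focus on keeping it small.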
Singular-Value-Decomposition (SVD) • $A \approx U \Sigma V^T$ • A: n x m, with columns a1, a2, a3, …, am • U: left singular vectors u1, …, uk • V: right singular vectors v1, …, vk
SVD: How to • #1: Find the left matrix U • #2: Project A into the column space of U, using the projection matrix of the column space of U
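The two steps above can be sketched with NumPy's `np.linalg.svd`. The helper name `svd_low_rank` and the small example matrix are assumptions for illustration. Because the columns of U are orthonormal, the projection matrix of its column space reduces to $U U^T$.

```python
import numpy as np

def svd_low_rank(A, k):
    """Rank-k approximation of A via its top-k left singular vectors."""
    # Step 1: find the left matrix U (columns are orthonormal).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    # Step 2: project A into the column space of U_k.
    A_k = U_k @ (U_k.T @ A)
    return U_k, A_k

# Hypothetical 3 x 4 matrix of rank 2, so k = 2 reconstructs it exactly.
A = (np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0])
     + np.outer([0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]))
U_k, A_k = svd_low_rank(A, k=2)
print(np.abs(A - A_k).max())  # reconstruction error, close to zero here
```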
SVD: drawbacks • Efficiency • Time • Space: (U, V) are dense • Interpretation (figure: 1st singular vector, 2nd singular vector) • Dynamic: not easy
CUR (CX) decomposition • $A \approx C \times U \times R$ • A: n x m • Sample columns from A to form C • Project A onto the col. space of C
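A minimal sketch of the CX idea (sample columns, then project). The slide does not state the sampling rule, so the sampling used here, with probability proportional to squared column norm and with replacement, is an assumption, as are the function name `cx_approximation` and the pseudo-inverse cutoff.

```python
import numpy as np

def cx_approximation(A, c, rng):
    """Sample c columns of A to form C, then project A onto span(C)."""
    col_norms = (A ** 2).sum(axis=0)
    probs = col_norms / col_norms.sum()          # assumed sampling rule
    idx = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    C = A[:, idx]                                # actual columns of A
    # Projection onto the column space of C; rcond is an assumed cutoff
    # so that duplicated columns do not blow up the pseudo-inverse.
    A_approx = C @ (np.linalg.pinv(C, rcond=1e-10) @ A)
    return C, idx, A_approx

# Hypothetical rank-1 matrix: any sampled nonzero column spans it.
A = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 0.0, 4.0])
rng = np.random.default_rng(0)
C, idx, A_approx = cx_approximation(A, c=3, rng=rng)
print(idx, np.abs(A - A_approx).max())
```

Sampling with replacement can pick the same column more than once, which is the redundancy that the drawbacks slide points out.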
CUR (CX): advantages • Efficiency (better than SVD) • Time • (c is # of sampled col.s) • Space: (C, R) are sparse • Interpretation
CUR (CX): drawbacks • Redundancy in C, wasting both time and space • e.g., 3 copies of green, 2 copies of red, 2 copies of purple; purple = 0.5*green + red… • Dynamic: not easy
Roadmap • Motivation • Existing Methods • Colibri • Colibri-S for static graphs • Colibri-D for dynamic graphs • Experimental Results • Conclusion
Colibri-S: Basic Idea • CUR (CX): Original Matrix $\approx C \times U \times R$, where C is redundant: 3 copies of green, 2 copies of red, 2 copies of purple; purple = 0.5*green + red… • Colibri-S: Original Matrix $\approx L \times M \times R$ • We want the col.s in L to be linearly independent of each other!
Input: initially sampled matrix C • Output: $L$ = linearly ind. col.s of C; $M = (L^T L)^{-1}$ = core matrix; $R = L^T \times A$ • Q: How to find L & M from C efficiently?
A: Find L & M iteratively! • Start from the initially sampled matrix C and the current L & M • For each col. v in C: project it on L • Redundant? Discard v • Otherwise: expand L & M
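The iterative loop above can be sketched as follows. This is one reading of the slide, not the authors' code: the name `colibri_s`, the relative tolerance `tol`, and the use of the standard block-inverse identity to expand $M = (L^T L)^{-1}$ are assumptions.

```python
import numpy as np

def colibri_s(A, C, tol=1e-8):
    """Keep only linearly independent columns of C; return L, M, R."""
    n = C.shape[0]
    L = np.zeros((n, 0))
    M = np.zeros((0, 0))                 # M = (L^T L)^{-1}
    for j in range(C.shape[1]):
        v = C[:, j]
        y = M @ (L.T @ v)
        residual = v - L @ y             # v minus its projection on L
        delta = residual @ residual
        if delta <= tol * (v @ v):       # redundant: discard v
            continue
        # Expand M with the block-inverse formula (no full re-inversion).
        k = M.shape[0]
        M_new = np.empty((k + 1, k + 1))
        M_new[:k, :k] = M + np.outer(y, y) / delta
        M_new[:k, k] = -y / delta
        M_new[k, :k] = -y / delta
        M_new[k, k] = 1.0 / delta
        M = M_new
        L = np.column_stack([L, v])      # expand L
    R = L.T @ A
    return L, M, R

# Hypothetical sampled matrix with 3 green, 2 red, 2 purple columns.
green = np.array([1.0, 0.0, 2.0, 0.0])
red = np.array([0.0, 1.0, 0.0, 1.0])
purple = 0.5 * green + red
C = np.column_stack([green, green, red, purple, green, red, purple])
A = np.arange(20, dtype=float).reshape(4, 5)
L, M, R = colibri_s(A, C)
print(L.shape)  # only the independent columns survive
```

In this example the approximation `L @ M @ R` matches the projection of A onto the column space of the full C, while storing two columns instead of seven.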
Colibri-S vs. CUR(CX) • Quality: Colibri-S = CUR(CX) • Time: Colibri-S is at least as good as CUR(CX) • Space: Colibri-S is at least as good as CUR(CX) • Illustrations: CUR (CX) vs. Colibri-S
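Colibri-D for dynamic graphs • At time t: (Lt, Mt, Rt) from the initially sampled matrix • At time t+1: (Lt+1, Mt+1, Rt+1) = ? • Q: How to update L and M efficiently?
Colibri-D: How-To • At time t: the initially sampled matrix splits into selected and redundant col.s, giving (Lt, Mt, Rt) • At time t+1: some col.s have changed from t; (Lt+1, Mt+1, Rt+1) = ?
Colibri-D: How-To • At time t: the selected col.s give (Lt, Mt); the rest are redundant • Unchanged cols! The selected col.s that are unchanged at t+1 (blue) span a subspace, giving ($\tilde{L}$, $\tilde{M}$) • Starting from ($\tilde{L}$, $\tilde{M}$), process the initially sampled matrix at t+1 to get (Lt+1, Mt+1)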
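One way to read this slide is sketched below. It is not necessarily the authors' exact procedure. The core matrix for the unchanged selected columns is recovered from $M_t$ with a Schur complement, and changed columns are then added one at a time. The names `downdate_core` and `expand`, the tolerance, and the example matrices are all assumptions.

```python
import numpy as np

def downdate_core(M_t, keep):
    """Core matrix of the kept columns only, derived from M_t."""
    keep = np.asarray(keep, dtype=int)
    drop = np.setdiff1d(np.arange(M_t.shape[0]), keep)
    P = M_t[np.ix_(keep, keep)]
    if drop.size == 0:
        return P
    Q = M_t[np.ix_(keep, drop)]
    S = M_t[np.ix_(drop, drop)]
    # Inverse of the kept block of L^T L is the Schur complement of S.
    return P - Q @ np.linalg.solve(S, Q.T)

def expand(L, M, v, tol=1e-8):
    """Add column v to (L, M) unless it is redundant."""
    y = M @ (L.T @ v)
    residual = v - L @ y
    delta = residual @ residual
    if delta <= tol * (v @ v):
        return L, M                      # redundant: nothing changes
    k = M.shape[0]
    M_new = np.empty((k + 1, k + 1))
    M_new[:k, :k] = M + np.outer(y, y) / delta
    M_new[:k, k] = -y / delta
    M_new[k, :k] = -y / delta
    M_new[k, k] = 1.0 / delta
    return np.column_stack([L, v]), M_new

# Hypothetical state at time t: three selected columns.
L_t = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 0.0, 0.0]])
M_t = np.linalg.inv(L_t.T @ L_t)

keep = [0, 2]                            # columns unchanged at t+1
L_tilde = L_t[:, keep]
M_tilde = downdate_core(M_t, keep)

new_col = np.array([0.0, 2.0, 1.0, 0.0, 3.0])   # a changed column at t+1
L_next, M_next = expand(L_tilde, M_tilde, new_col)
```

The work at t+1 then scales with the number of changed columns rather than with the whole sampled matrix, which matches the x-axis of the Colibri-D performance plot later in the deck.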
Roadmap • Motivation • Existing Methods • Colibri • Experimental Results • Conclusion
Experimental Setup • Datasets • Network traffic • 21,837 sources/destinations • 1,222 consecutive hours • 22,800 edges per hour • Accuracy: • Accu = • Space Cost:
Performance of Colibri-S • Accuracy • Same 91%+ • Time • 12x faster than CMD • 28x faster than CUR • Space • ~1/3 of CMD • ~10% of CUR • (Bar charts: time and space for CUR, CMD, and ours)
More Evaluation on Colibri-S • (Plot: log time (sec) vs. approximation accuracy for CUR, CMD, and Colibri-S)
Performance of Colibri-D • (Plot: time vs. # of changed cols for CMD, Colibri-S, and Colibri-D) • Colibri-D achieves up to 112x speedups
A Family of Low-Rank Approximations for Fast Graph Mining • Colibri-S • For static graphs • Removes redundancy • Significant savings in time & space for “free” • Colibri-D • For dynamic graphs • Explores “smoothness” • Up to 112x faster than the best known methods
Poster tonight! Thank you! www.cs.cmu.edu/~htong