GraphScope: Parameter-Free Mining of Large Time-Evolving Graphs • Jimeng Sun (CMU) • Spiros Papadimitriou (IBM) • Philip S. Yu (IBM) • Christos Faloutsos (CMU)
Motivation of GraphScope • Time-evolving graphs • Network traffic graphs • Email networks • Customer product relationships • Call detail records in telecom networks • Financial transaction data • Key questions: • How to monitor community structures? • How to detect the change points?
1. Community discovery • Simultaneously group customers and products, source-destination traffic graphs, sender-recipient communication, etc. • [Figure: a customer-product graph and its adjacency matrix, reordered into customer groups (e.g., researchers, CEOs) and product groups (e.g., books, BMWs); blocks are annotated with densities (97%, 3%, 3%, 96%) and edge counts (289/300, 48/50, 5/200, 2/75).]
2. Change detection • Find change points in group structure • [Figure: customer-product groupings over time, with a change at the holiday season.]
Problem definition • Given graphs G1, G2, …, Gt, where each Gi is n-by-m, • partition them into time segments G(1), G(2), … • for each segment, identify the groups • Scalable, parameter-free, incremental • [Figure: segments G(1), G(2) along the time axis.]
Outline • Motivation • GraphScope • Community discovery • Change detection • Experiments
Community detection • Treat the clustering problem as a compression problem • [Figure: graph snapshots at t = 0, t = 1, t = 2.]
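Cost objective within a time segment • Setup: k = 3 row groups of sizes $n_1, n_2, n_3$; ℓ = 3 col. groups of sizes $m_1, m_2, m_3$; segment duration $d$; $p_{i,j}$ = density of ones (edges) in block $(i,j)$ • Code cost: $d\,n_1 m_2\,H(p_{1,2})$ bits for block (1,2); $\sum_{i,j} d\,n_i m_j\,H(p_{i,j})$ bits total • Description cost: $\log^* d$ (segment duration) $+ \sum_{i,j} \log(d\,n_i m_j)$ • Total cost = code cost + description cost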
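The code-cost term above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dense 0/1 list-of-lists graph representation, and the label lists are all assumptions made for the example, and `H` is taken to be the binary Shannon entropy.

```python
import math

def binary_entropy(p):
    """Shannon entropy H(p), in bits, of a 0/1 source with density p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def segment_code_cost(graphs, row_labels, col_labels, k, l):
    """Sum over blocks (i, j) of d * n_i * m_j * H(p_ij).

    graphs     : list of d adjacency matrices (lists of 0/1 rows), hypothetical format
    row_labels : row group index (0..k-1) for every row
    col_labels : column group index (0..l-1) for every column
    """
    d = len(graphs)
    n_sizes = [row_labels.count(i) for i in range(k)]
    m_sizes = [col_labels.count(j) for j in range(l)]
    # Count the ones that fall in each block, pooled over the segment.
    ones = [[0] * l for _ in range(k)]
    for g in graphs:
        for r, row in enumerate(g):
            for c, v in enumerate(row):
                ones[row_labels[r]][col_labels[c]] += v
    total = 0.0
    for i in range(k):
        for j in range(l):
            cells = d * n_sizes[i] * m_sizes[j]
            if cells:
                total += cells * binary_entropy(ones[i][j] / cells)
    return total
```

With a block-diagonal graph and matching groups every block is pure, so the code cost is zero; lumping everything into one group leaves a half-full block that costs one bit per cell.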
Cost objective within a time segment • Total cost = code cost (blocks) + description cost (blocks' model) • n row groups, m col groups: code cost low, description cost high • One row group, one col group: code cost high, description cost low
Cost objective within a time segment • Total cost = code cost (blocks) + description cost (blocks' model) • k = 3 row groups, ℓ = 3 col groups: code cost low, description cost low
Search for the optimum grouping • The problem is NP-hard even for a single timestamp with column permutation only • Reduction from the TSP [Johnson+ 03] • Heuristics • Search: Split, Merge, Shuffle • Initialization: Resume, Restart
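As an illustration of the Shuffle heuristic listed above, the sketch below performs one row-shuffle pass on a single graph: block densities are computed from the current grouping, and each row then moves to the row group under which it is cheapest to encode. This is an assumption-laden sketch in the spirit of the cross-association search of [Chakrabarti+ KDD04], not the slides' exact procedure; `shuffle_rows`, `_bits`, and the data layout are hypothetical names chosen for the example.

```python
import math

def _bits(count, prob):
    """Bits to encode `count` symbols that each occur with probability `prob`."""
    if count == 0:
        return 0.0
    if prob <= 0.0:
        return math.inf  # impossible under this block's density
    return -count * math.log2(prob)

def shuffle_rows(graph, row_labels, col_labels):
    """One shuffle pass: reassign every row to its cheapest row group."""
    k = max(row_labels) + 1
    l = max(col_labels) + 1
    n_sizes = [row_labels.count(i) for i in range(k)]
    m_sizes = [col_labels.count(j) for j in range(l)]
    # Number of ones each row has inside each column group.
    row_ones = []
    for row in graph:
        counts = [0] * l
        for c, v in enumerate(row):
            counts[col_labels[c]] += v
        row_ones.append(counts)
    # Block densities p_ij under the current grouping.
    block_ones = [[0] * l for _ in range(k)]
    for r, counts in enumerate(row_ones):
        for j in range(l):
            block_ones[row_labels[r]][j] += counts[j]
    dens = [[block_ones[i][j] / (n_sizes[i] * m_sizes[j])
             if n_sizes[i] * m_sizes[j] else 0.0
             for j in range(l)] for i in range(k)]
    new_labels = []
    for r, counts in enumerate(row_ones):
        def cost(i):
            # Cost of encoding this row with the densities of row group i.
            return sum(_bits(counts[j], dens[i][j]) +
                       _bits(m_sizes[j] - counts[j], 1.0 - dens[i][j])
                       for j in range(l))
        # Ties are broken in favour of the row's current group.
        best = min(range(k), key=lambda i: (cost(i), i != row_labels[r]))
        new_labels.append(best)
    return new_labels
```

Column shuffles would follow the same pattern on the transposed matrix; splits and merges then change k or ℓ between shuffle passes.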
Outline • Motivation • GraphScope • Community discovery • Change detection • Experiments
Change point detection Option 1: Append to current segment
Change point detection • Option 2: Start new segment • [Figure: the change point marked on the time axis.]
Change point detection • Option 1: append • Option 2: split (in time) • Choose the most parsimonious option • In both cases, we do row and col. shuffles, splits and/or merges
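The append-versus-split decision can be sketched as a comparison of two encoding costs. The sketch below is illustrative only and makes several assumptions: the description cost is a simplified stand-in (group labels plus one edge count per block), the groupings are passed in rather than re-optimized with shuffles, splits, and merges, and all function names and the data layout are hypothetical.

```python
import math

def _entropy(p):
    """Binary Shannon entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def segment_cost(graphs, row_labels, col_labels):
    """Code cost plus a simplified (assumed) description cost for one segment."""
    d = len(graphs)
    k = max(row_labels) + 1
    l = max(col_labels) + 1
    n_sizes = [row_labels.count(i) for i in range(k)]
    m_sizes = [col_labels.count(j) for j in range(l)]
    ones = [[0] * l for _ in range(k)]
    for g in graphs:
        for r, row in enumerate(g):
            for c, v in enumerate(row):
                ones[row_labels[r]][col_labels[c]] += v
    # Assumed description cost: one group label per row and per column.
    cost = len(row_labels) * math.log2(k) + len(col_labels) * math.log2(l)
    for i in range(k):
        for j in range(l):
            cells = d * n_sizes[i] * m_sizes[j]
            cost += math.log2(cells + 1)  # bits to send the block's edge count
            if cells:
                cost += cells * _entropy(ones[i][j] / cells)  # code cost
    return cost

def append_or_split(segment, new_graph, seg_labels, new_labels):
    """Return 'append' or 'split', whichever encoding is cheaper.

    seg_labels / new_labels: (row_labels, col_labels) for the current segment
    and for the new graph encoded on its own.
    """
    append_cost = segment_cost(segment + [new_graph], *seg_labels)
    split_cost = (segment_cost(segment, *seg_labels) +
                  segment_cost([new_graph], *new_labels))
    return "append" if append_cost <= split_cost else "split"
```

A snapshot that repeats the segment's block structure is cheaper to append, because the grouping is described only once; a snapshot with a different structure blurs the block densities and is cheaper to encode as a new segment.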
Outline • Motivation • GraphScope • Single timestamp • Multiple timestamp • Experiments
Objectives • Effectiveness on • Community discovery • Change detection • Compression benefit • Scalable, incremental computation
Evolving communities: NETWORK • 29K hosts (nodes) • 12K edges per hour (on avg) • 1,220 hours • ~14.6M edges total • [Figure: community structure over time.]
Community change points: ENRON • 34K email addresses • 12K emails per week (on avg) • 165 weeks • ~2M emails total • Key change points correspond to key events
Compression gain • GraphScope gives 10%-150% compression gain
Graph stream clustering: Scalability (NETWORK) • 29K hosts (nodes) • 12K edges per hour (on average) • 1,220 hours (timestamps) • ~14.6M edges total • < 2 sec / snapshot on avg
Related work • Co-clustering • [Dhillon+ KDD03] • [Chakrabarti+ KDD04] • Graph partitioning • [Karypis+ 99] • Time-evolving graphs • [Chakrabarti+ KDD06] • [Chi+ KDD07] • [Asur+ KDD07]
Summary • Organize into few, homogeneous communities • Find changes in community structure • Scalable • Parameter-free • Incremental
Graph stream clustering • [Figure: graph snapshots at t = 0, t = 1, t = 2.]
Graph clustering – [Chakrabarti+ KDD’04] • Why is this better? • [Figure: two arrangements of the same matrix into row groups and column groups, compared side by side.] • Good clustering: similar nodes are grouped together, with as few groups as necessary • This implies a few, homogeneous blocks • Which in turn implies good compression
Cost objective • Setup: n×m adj. matrix; k = 3 row groups of sizes $n_1, n_2, n_3$; ℓ = 3 col. groups of sizes $m_1, m_2, m_3$; $p_{i,j}$ = density of ones (edges) in block $(i,j)$ • Code cost (block size × entropy): $n_1 m_2\,H(p_{1,2})$ bits for block (1,2); $\sum_{i,j} n_i m_j\,H(p_{i,j})$ bits total; assumes group partitionings, sizes and densities are given • Description cost: $\sum_i$ (row-partition description) $+ \sum_j$ (col-partition description) $+ \sum_{i,j}$ (transmit #edges $e_{i,j}$, $\log n_i m_j$ bits) • Total cost = code cost + description cost
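A short worked instance of the per-block code cost may help. The block sizes and edge count below are hypothetical numbers chosen for illustration, and $H$ is taken to be the binary Shannon entropy.

```latex
% Hypothetical block (1,2): n_1 = 3 rows, m_2 = 4 columns, e_{1,2} = 3 edges
\begin{align*}
p_{1,2} &= \frac{e_{1,2}}{n_1 m_2} = \frac{3}{12} = 0.25 \\
H(p_{1,2}) &= -0.25 \log_2 0.25 - 0.75 \log_2 0.75 \approx 0.811 \\
n_1 m_2\, H(p_{1,2}) &\approx 12 \times 0.811 \approx 9.74 \text{ bits}
\end{align*}
```

A block that is entirely full or entirely empty has $H = 0$ and costs no code bits, which is why homogeneous blocks compress well.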
Graph clustering: Scalability • Time vs. size • [Plot: time (sec) vs. number of edges, with curves for splits and shuffles.] • Linear in the number of edges • Scalable
Cost objective • Total cost = code cost (blocks) + description cost (blocks' model) • n row groups, m col groups: code cost low, description cost high • One row group, one col group: code cost high, description cost low
Cost objective • Total cost = code cost (blocks) + description cost (blocks' model) • k = 3 row groups, ℓ = 3 col groups: code cost low, description cost low
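Search for optimum: Cost vs. number of groups • [Plot: bit cost as a function of the number of row groups k and col groups ℓ, ranging from one row group and one col group to n row groups and m col groups; the grouping with k = 3 row groups and ℓ = 3 col groups is shown.]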
Search for optimum: Summary • [Figure: search path from k = 1, ℓ = 1 to k = 5, ℓ = 5 through split and shuffle steps: k=1, ℓ=2 → k=2, ℓ=2 → k=2, ℓ=3 → k=3, ℓ=3 → k=3, ℓ=4 → k=4, ℓ=4 → k=4, ℓ=5 → k=5, ℓ=5.] • Split: Increase k or ℓ • Merge: Decrease k or ℓ • Shuffle: Rearrange rows and cols
Graph clustering – [Chakrabarti+ KDD’04] • Given a graph of interactions or associations • Customers to products • Documents to terms • People to people • Computer communications • Financial transactions • Find simultaneously • Communities (source and destination) • Their number