
GraphScope: Parameter-Free Mining of Large Time-Evolving Graphs

Jimeng Sun (CMU), Spiros Papadimitriou (IBM), Philip S. Yu (IBM), Christos Faloutsos (CMU). Motivation of GraphScope: time-evolving graphs such as network traffic graphs, email networks, and customer-product relationships.


Presentation Transcript


  1. GraphScope: Parameter-Free Mining of Large Time-Evolving Graphs Jimeng Sun CMU Spiros Papadimitriou IBM Philip S. Yu IBM Christos Faloutsos CMU

  2. Motivation of GraphScope • Time-evolving graphs • Network traffic graphs • Email networks • Customer product relationships • Call detail records in telecom networks • Financial transaction data • Key questions: • How to monitor community structures? • How to detect the change points?

  3. (1) Community discovery • Simultaneously group customers and products, or source-destination traffic graphs, or sender-recipient communication, etc. [Figure: a customer-product graph and its adjacency matrix, reordered into customer groups (e.g., Researchers, CEOs) and product groups (e.g., Books, BMWs). Density labels: 54%; 97%, 3%, 3%, 96%. Block count labels: 289/300, 48/50, 5/200, 2/75.]

  4. (2) Change detection • Find change points in group structure [Figure: customer-product matrices along a time axis, with the holiday season marked.]

  5. Problem definition • Given graphs G1, G2, …, Gt, where each Gi is n-by-m • partition them into time segments G(1), G(2), … • for each segment, identify the groups • Requirements: scalable, parameter-free, incremental [Figure: time axis split into segments G(1), G(2).]

  6. Outline • Motivation • GraphScope • Community discovery • Change detection • Experiments

  7. Community detection • Clustering problem → Compression problem [Figure: graph snapshots at t = 0, t = 1, t = 2.]

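  8. Cost objective within a time segment • Setting: k = 3 row groups (sizes n1, n2, n3), ℓ = 3 col. groups (sizes m1, m2, m3), segment duration d • $p_{i,j}$: density of ones (edges) in block (i,j) • Code cost: $d\,n_1 m_2\,H(p_{1,2})$ bits for block (1,2); $\sum_{i,j} d\,n_i m_j\,H(p_{i,j})$ bits total • Description cost (terms shown on the slide): $\log^* d$ for the segment duration, plus $\sum_{i,j} \log(d\,n_i m_j)$ • Total: code cost + description cost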
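The cost terms on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`entropy`, `log_star`, `segment_cost`), the list-of-lists graph representation, and the simplified `log_star` (sum of positive iterated logarithms, without the usual normalizing constant) are all assumptions. The sketch also leaves out the row- and column-partition description terms and skips empty blocks.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def log_star(x):
    """Simplified universal integer code length: log2 x + log2 log2 x + ...
    (positive terms only; the normalizing constant is omitted)."""
    total, term = 0.0, math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def segment_cost(graphs, row_of, col_of, k, ell):
    """Return (code_cost, description_cost) in bits for one time segment.

    graphs : list of d binary n-by-m matrices (lists of lists)
    row_of : row_of[r] is the row group (0..k-1) of row r
    col_of : col_of[c] is the column group (0..ell-1) of column c
    """
    d = len(graphs)
    n_sizes = [row_of.count(i) for i in range(k)]
    m_sizes = [col_of.count(j) for j in range(ell)]
    # Count the ones falling into each block, summed over the whole segment.
    ones = [[0] * ell for _ in range(k)]
    for g in graphs:
        for r, row in enumerate(g):
            for c, v in enumerate(row):
                if v:
                    ones[row_of[r]][col_of[c]] += 1
    code = 0.0
    desc = log_star(d)  # bits for the segment duration
    for i in range(k):
        for j in range(ell):
            cells = d * n_sizes[i] * m_sizes[j]
            if cells == 0:
                continue  # empty block: nothing to encode in this sketch
            code += cells * entropy(ones[i][j] / cells)
            desc += math.log2(cells)  # bits for the block's edge count
    return code, desc

# Hypothetical toy data: the same block-diagonal graph at two timestamps.
G = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
segment = [G, G]
two_groups = segment_cost(segment, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
one_group = segment_cost(segment, [0, 0, 0, 0], [0, 0, 0, 0], 1, 1)
print("two groups (code, description):", two_groups)
print("one group  (code, description):", one_group)
```

On this toy input the two-group model has zero code cost because every block is homogeneous, while the single-group model pays one bit per cell. The larger description cost of the two-group model is more than offset by that saving.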

  9. Cost objective within a time segment: code cost (blocks) + description cost (blocks’ model) • One row group, one col group: code cost high, description cost low • n row groups, m col groups: code cost low, description cost high

  10. Cost objective within a time segment: code cost (blocks) + description cost (blocks’ model) • k = 3 row groups, ℓ = 3 col groups: code cost low, description cost low

  11. Search for the optimum grouping • The problem is NP-hard even for a single timestamp with column permutation only • Reduction from TSP [Johnson+ 03] • Heuristics • Search: Split, Merge, Shuffle • Initialization: Resume, Restart
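As an illustration of the Shuffle heuristic, the sketch below reassigns each row of a single graph to the row group whose current block densities encode that row in the fewest bits. It is a simplified sketch under assumptions, not the slide's exact procedure: the names `block_densities` and `shuffle_rows` are hypothetical, only rows are shuffled (columns would be handled symmetrically), and densities are clamped away from 0 and 1 with an arbitrary `eps` so that the logarithms stay finite.

```python
import math

def block_densities(g, row_of, col_of, k, ell, eps=1e-6):
    """Density of ones in each (row group, col group) block, clamped to [eps, 1 - eps]."""
    ones = [[0] * ell for _ in range(k)]
    cells = [[0] * ell for _ in range(k)]
    for r, row in enumerate(g):
        for c, v in enumerate(row):
            ones[row_of[r]][col_of[c]] += v
            cells[row_of[r]][col_of[c]] += 1
    return [[min(max(ones[i][j] / cells[i][j], eps), 1.0 - eps) if cells[i][j] else 0.5
             for j in range(ell)] for i in range(k)]

def shuffle_rows(g, row_of, col_of, k, ell):
    """One shuffle pass: move every row to the row group that encodes it most cheaply."""
    p = block_densities(g, row_of, col_of, k, ell)
    new_row_of = []
    for row in g:
        # Ones and cells of this row, split by column group.
        ones = [0] * ell
        size = [0] * ell
        for c, v in enumerate(row):
            ones[col_of[c]] += v
            size[col_of[c]] += 1

        def bits(i):
            # Bits needed to encode this row using the densities of row group i.
            return sum(-ones[j] * math.log2(p[i][j])
                       - (size[j] - ones[j]) * math.log2(1.0 - p[i][j])
                       for j in range(ell))

        new_row_of.append(min(range(k), key=bits))
    return new_row_of

# Hypothetical toy data: row 2 starts in the wrong group.
G = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
shuffled = shuffle_rows(G, [0, 0, 0, 1], [0, 0, 1, 1], 2, 2)
print(shuffled)
```

In a full search, passes like this would alternate between rows and columns and be interleaved with split and merge steps, accepting a change only when the total cost decreases.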

  12. Outline • Motivation • GraphScope • Community discovery • Change detection • Experiments

  13. Change point detection • Option 1: Append to current segment

  14. Change point detection • Option 2: Start new segment (change point)

  15. Change point detection • Option 1: append • Option 2: split (time) • Choose the most parsimonious option • In both cases, we do row & col. shuffles, splits and/or merges
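The append-versus-split decision can be sketched as a comparison of two encoding costs. The code below is a hedged illustration: `choose_option` and `single_block_cost` are hypothetical names, and the toy cost function encodes a whole segment as one block (k = ℓ = 1) instead of re-running the shuffles, splits and merges that the slide calls for in both cases.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def log_star(x):
    """Simplified universal integer code length (positive iterated logs only)."""
    total, term = 0.0, math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def single_block_cost(graphs):
    """Toy segment cost: treat the whole segment as one block (an assumption)."""
    d = len(graphs)
    cells = d * len(graphs[0]) * len(graphs[0][0])
    ones = sum(v for g in graphs for row in g for v in row)
    return log_star(d) + math.log2(cells) + cells * entropy(ones / cells)

def choose_option(segment, new_graph, cost_fn):
    """Return 'append' or 'split', whichever total encoding is cheaper."""
    append_cost = cost_fn(segment + [new_graph])
    split_cost = cost_fn(segment) + cost_fn([new_graph])
    return "append" if append_cost <= split_cost else "split"

# Hypothetical toy data: three empty snapshots, then an empty or a full one.
empty = [[0] * 4 for _ in range(4)]
full = [[1] * 4 for _ in range(4)]
quiet_segment = [empty, empty, empty]
same_regime = choose_option(quiet_segment, empty, single_block_cost)
new_regime = choose_option(quiet_segment, full, single_block_cost)
print(same_regime, new_regime)
```

Because the decision only compares total bits, no threshold has to be tuned, which is in the spirit of the parameter-free claim. A realistic `cost_fn` would search for good groupings before returning a cost.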

  16. Outline • Motivation • GraphScope • Single timestamp • Multiple timestamp • Experiments

  17. Objectives • Effectiveness on • Community discovery • Change detection • Compression benefit • Scalable, incremental computation

  18. Evolving communities: NETWORK • 29K hosts (nodes) • 12K edges (on avg) • 1,220 hours • ~14.6M edges total [Figure: communities along a time axis.]

  19. Community change points: ENRON • 34K email addresses • 12K emails (on avg) • 165 weeks • ~2M emails total • Key change-points correspond to key events

  20. Compression gain • GraphScope gives 10%-150% compression gain

  21. Graph stream clustering: scalability on NETWORK • 29K hosts (nodes) • 12K edges per hour (on average) • 1,220 hours (timestamps) • ~14.6M edges total • < 2 sec / snapshot on avg

  22. Related work • Co-clustering • [Dhillon+ KDD03] • [Chakrabarti+ KDD04] • Graph partitioning • [Karypis+ 99] • Time-evolving graphs • [Chakrabarti+ KDD06] • [Chi+ KDD07] • [Asur+ KDD07]

  23. Summary • Organize into few, homogeneous communities • Find changes in community structure • Scalable • Parameter-free • Incremental

  24. GraphScope: Parameter-Free Mining of Large Time-Evolving Graphs Jimeng Sun Spiros Papadimitriou Philip S. Yu Christos Faloutsos

  25. Graph stream clustering [Figure: graph snapshots at t = 0, t = 1, t = 2.]

  26. Graph clustering – [Chakrabarti+ KDD’04] • Why is this better? [Figure: two groupings of the same matrix into row groups and column groups, shown side by side.] • Good Clustering: similar nodes are grouped together; as few groups as necessary • Good Clustering implies a few, homogeneous blocks, which implies Good Compression

  27. Graph clustering – [Chakrabarti+ KDD’04] • Why is this better? [Figure: two groupings of the same matrix into row groups and column groups, shown side by side.] • Good Clustering: similar nodes are grouped together; as few groups as necessary • Good Clustering implies a few, homogeneous blocks, which implies Good Compression • In short: Good Clustering implies Good Compression
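A small worked computation makes the link between homogeneous blocks and compression concrete. It reuses the four block counts shown on the community-discovery slide (289/300, 48/50, 5/200, 2/75) and compares the code cost of encoding each block with its own density against encoding all cells with one shared density. The helper `entropy` and the variable names are assumptions for this sketch, and description costs are ignored.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# (ones, cells) for each block, taken from the counts shown on the slide.
blocks = [(289, 300), (48, 50), (5, 200), (2, 75)]

# Code cost when every block is encoded with its own density.
grouped_bits = sum(cells * entropy(ones / cells) for ones, cells in blocks)

# Code cost when all cells are encoded with one shared density.
total_ones = sum(ones for ones, _ in blocks)
total_cells = sum(cells for _, cells in blocks)
single_bits = total_cells * entropy(total_ones / total_cells)

print(f"grouped: {grouped_bits:.1f} bits, ungrouped: {single_bits:.1f} bits")
```

Since binary entropy is concave, the grouped code cost can never exceed the ungrouped one. The saving is large when the blocks are nearly all ones or nearly all zeros, as they are here.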

  28. Cost objective • Setting: n×m adj. matrix, k = 3 row groups (sizes n1, n2, n3), ℓ = 3 col. groups (sizes m1, m2, m3) • $p_{i,j}$: density of ones (edges) in block (i,j) • Code cost (block size × entropy): $n_1 m_2\,H(p_{1,2})$ bits for block (1,2); $\sum_{i,j} n_i m_j\,H(p_{i,j})$ bits total; assumes group partitionings, sizes and densities are given • Description cost: row-partition description (summed over i) + col-partition description (summed over j) + $\sum_{i,j} \log(n_i m_j)$ to transmit the number of edges $e_{i,j}$ in each block • Total: code cost + description cost

  29. Graph clustering: Scalability • Linear on the number of edges → scalable [Figure: Time vs. Size; x-axis: Number of edges; y-axis: Time (sec); series: Splits, Shuffles.]

  30. Cost objective: code cost (blocks) + description cost (blocks’ model) • One row group, one col group: code cost high, description cost low • n row groups, m col groups: code cost low, description cost high

  31. Cost objective: code cost (blocks) + description cost (blocks’ model) • k = 3 row groups, ℓ = 3 col groups: code cost low, description cost low

  32. Search for optimum • Cost vs. number of groups [Figure: bit cost as a function of k and ℓ, ranging from one row group, one col group to n row groups, m col groups, with k = 3 row groups, ℓ = 3 col groups marked.]

  33. Search for optimum: Summary • Split: Increase k or ℓ • Merge: Decrease k or ℓ • Shuffle: Rearrange rows and cols [Figure: search path from k = 1, ℓ = 1 to k = 5, ℓ = 5 via alternating split and shuffle steps, passing through k=1, ℓ=2; k=2, ℓ=2; k=2, ℓ=3; k=3, ℓ=3; k=3, ℓ=4; k=4, ℓ=4; k=4, ℓ=5.]

  34. Graph clustering – [Chakrabarti+ KDD’04] • Given a graph of interactions or associations • Customers to products • Documents to terms • People to people • Computer communications • Financial transactions • Find simultaneously • Communities (source and destination) • Their number
