GraphScope: Parameter-Free Mining of Large Time-Evolving Graphs

Jimeng Sun (CMU), Spiros Papadimitriou (IBM), Philip S. Yu (IBM), Christos Faloutsos (CMU)

Motivation of GraphScope
• Time-evolving graphs
  • Network traffic graphs
  • Email networks
  • Customer-product relationships
  • Call detail records in telecom networks
  • Financial transaction data
• Key questions:
  • How to monitor community structures?
  • How to detect the change points?

1. Community discovery
Simultaneously group customers and products, or source-destination traffic graphs, or sender-recipient communication, etc.
[Figure: a customer-product graph and its adjacency matrix, reordered into customer groups (e.g., researchers, CEOs) and product groups (e.g., books, BMWs). Labels shown in the figure: 54%, 97%, 96%, 3%, 3%; block counts 289/300, 48/50, 5/200, 2/75.]

2. Change detection
Find change points in group structure
[Figure: customer-product matrices along a time axis, with the holiday season marked]

Problem definition
• Given graphs G1, G2, …, Gt, where each Gi is n-by-m,
• partition them into time segments G(1), G(2), …
• for each segment, identify the groups
Scalable, parameter-free, incremental
[Figure: time axis divided into segments G(1), G(2)]

Outline
• Motivation
• GraphScope
• Community discovery
• Change detection
• Experiments

Community detection
Clustering problem → Compression problem
[Figure: graph snapshots at t = 0, t = 1, t = 2]

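Cost objective within a time segment
• k = 3 row groups of sizes $n_1, n_2, n_3$; ℓ = 3 col. groups of sizes $m_1, m_2, m_3$
• $d$: segment duration
• $p_{i,j}$: density of ones (edges) in block $(i,j)$, from $p_{1,1}$ to $p_{3,3}$
• Code cost: $d\, n_1 m_2 H(p_{1,2})$ bits for block (1,2); $\sum_{i,j} d\, n_i m_j H(p_{i,j})$ bits total
• Description cost: $\log^* d + \sum_{i,j} \log(d\, n_i m_j)$
• Total cost = code cost + description cost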
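The code-cost term can be made concrete with a minimal Python sketch. It assumes a segment is stored as a 0/1 array of shape (d, n, m) and that groups are given as integer labels per row and per column; the names `binary_entropy` and `segment_code_cost` are hypothetical, and the description cost is left out.

```python
import math
import numpy as np

def binary_entropy(p):
    """Shannon entropy H(p) in bits of a 0/1 source with density p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # an all-zero or all-one block costs nothing to encode
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def segment_code_cost(graphs, row_groups, col_groups):
    """Code cost of one segment: sum over blocks of d * n_i * m_j * H(p_ij).

    graphs:     0/1 array of shape (d, n, m), one adjacency matrix per timestamp
    row_groups: length-n integer labels (assumed layout, not from the slides)
    col_groups: length-m integer labels
    """
    graphs = np.asarray(graphs)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    d = graphs.shape[0]
    summed = graphs.sum(axis=0)  # edge counts accumulated over the segment
    cost = 0.0
    for i in np.unique(row_groups):
        rows = row_groups == i
        for j in np.unique(col_groups):
            cols = col_groups == j
            cells = d * rows.sum() * cols.sum()      # d * n_i * m_j
            ones = summed[np.ix_(rows, cols)].sum()  # edges inside block (i, j)
            cost += cells * binary_entropy(ones / cells)
    return cost

# Two timestamps of a block-diagonal 4-by-4 graph.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
segment = np.stack([A, A])
good = segment_code_cost(segment, [0, 0, 1, 1], [0, 0, 1, 1])
single = segment_code_cost(segment, [0, 0, 0, 0], [0, 0, 0, 0])
```

With the matching grouping every block is homogeneous, so `good` is 0 bits; with a single row group and a single column group the density is 0.5 and `single` is 32 bits (one bit per cell).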

Cost objective within a time segment
Total cost = code cost (blocks) + description cost (blocks' model)
• One row group, one col group: code cost high, description cost low
• n row groups, m col groups: code cost low, description cost high

Cost objective within a time segment
Total cost = code cost (blocks) + description cost (blocks' model)
• k = 3 row groups, ℓ = 3 col groups: code cost low, description cost low

Search for the optimum grouping
• Problem is NP-hard even for one timestamp with column permutation only
  • Reduction from TSP [Johnson+ 03]
• Heuristics
  • Search: Split, Merge, Shuffle
  • Initialization: Resume, Restart

Outline
• Motivation
• GraphScope
• Community discovery
• Change detection
• Experiments

Change point detection
Option 1: Append to current segment

Change point detection
Option 2: Start new segment
[Figure: change point marked between the current segment and the new one]

Change point detection
• Option 1: append
• Option 2: split
Choose the most parsimonious option.
In both cases, we do row and col. shuffles, splits and/or merges.
[Figure: both options shown along the time axis]
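The append-versus-split decision can be sketched as a comparison of encoding costs. The Python sketch below is a simplification under stated assumptions: it compares code cost only (the description cost of opening a new segment is ignored), it takes the groupings as given instead of re-running shuffles, splits and merges, and the names `code_cost` and `append_or_split` are hypothetical.

```python
import numpy as np

def code_cost(graphs, row_groups, col_groups):
    """Bits to encode the blocks of a segment: sum of d * n_i * m_j * H(p_ij)."""
    g = np.asarray(graphs, dtype=float)                 # shape (d, n, m)
    R = np.eye(max(row_groups) + 1)[list(row_groups)]   # (n, k) one-hot rows
    C = np.eye(max(col_groups) + 1)[list(col_groups)]   # (m, l) one-hot cols
    ones = R.T @ g.sum(axis=0) @ C                      # edges per block
    cells = g.shape[0] * np.outer(R.sum(axis=0), C.sum(axis=0))
    p = ones / cells
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    # 0 * log2(0) evaluates to NaN; treat it as 0 bits.
    return float(np.sum(cells * np.nan_to_num(h)))

def append_or_split(segment, new_graph, seg_groups, new_groups):
    """Pick the cheaper option; ties go to 'append' (assumed tie-break)."""
    append_cost = code_cost(list(segment) + [new_graph], *seg_groups)
    split_cost = (code_cost(segment, *seg_groups)
                  + code_cost([new_graph], *new_groups))
    return "append" if append_cost <= split_cost else "split"

A = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
B = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]]
groups = ([0, 0, 1, 1], [0, 0, 1, 1])
same = append_or_split([A, A], A, groups, groups)
changed = append_or_split([A, A], B, groups, groups)
```

A snapshot that repeats the segment's structure is appended at no extra cost, while a snapshot whose edges moved to the opposite blocks makes every block of the extended segment mixed, so starting a new segment is cheaper.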

Outline
• Motivation
• GraphScope
• Single timestamp
• Multiple timestamp
• Experiments

Objectives
• Effectiveness on
• Community discovery
• Change detection
• Compression benefit
• Scalable, incremental computation

Evolving communities: NETWORK
• 29K hosts (nodes)
• 12K edges per hour (on avg)
• 1,220 hours
• ~14.6M edges total
[Figure: community structure along the time axis]

Community change points: ENRON
• 34K email addresses
• 12K emails per week (on avg)
• 165 weeks
• ~2M emails total
Key change points correspond to key events

Compression gain
GraphScope gives 10%-150% compression gain

Scalability: NETWORK
• 29K hosts (nodes)
• 12K edges per hour (on average)
• 1,220 hours (timestamps)
• ~14.6M edges total
< 2 sec / snapshot on avg

Related work
• Co-clustering
  • [Dhillon+ KDD03]
  • [Chakrabarti+ KDD04]
• Graph partitioning
  • [Karypis+ 99]
• Time-evolving graphs
  • [Chakrabarti+ KDD06]
  • [Chi+ KDD07]
  • [Asur+ KDD07]

Summary
• Organize into few, homogeneous communities
• Find changes in community structure
• Scalable
• Parameter-free
• Incremental

Graph stream clustering
[Figure: graph snapshots at t = 0, t = 1, t = 2]

Graph clustering – [Chakrabarti+ KDD’04]
Why is this better?
[Figure: two arrangements of the same adjacency matrix into row groups and column groups, shown side by side]
Good clustering
• Similar nodes are grouped together
• As few groups as necessary
Good clustering implies a few, homogeneous blocks, which implies good compression

Cost objective (n×m adj. matrix)
• k = 3 row groups of sizes $n_1, n_2, n_3$; ℓ = 3 col. groups of sizes $m_1, m_2, m_3$
• $p_{i,j}$: density of ones (edges) in block $(i,j)$, from $p_{1,1}$ to $p_{3,3}$
• Code cost (block size × entropy): $n_1 m_2 H(p_{1,2})$ bits for block (1,2); $\sum_{i,j} n_i m_j H(p_{i,j})$ bits total. Assumes group partitionings, sizes and densities are given.
• Description cost: row-partition description (over row groups $i$) + col-partition description (over col. groups $j$) + $\sum_{i,j} \log(n_i m_j)$ to transmit the number of edges $e_{i,j}$
• Total cost = code cost + description cost

Graph clustering: Scalability
[Figure: time vs. size; x-axis: number of edges, y-axis: time (sec); series: splits, shuffles]
Linear in the number of edges → scalable

Cost objective
Total cost = code cost (blocks) + description cost (blocks' model)
• One row group, one col group: code cost high, description cost low
• n row groups, m col groups: code cost low, description cost high

Cost objective
Total cost = code cost (blocks) + description cost (blocks' model)
• k = 3 row groups, ℓ = 3 col groups: code cost low, description cost low

Search for optimum
[Figure: cost vs. number of groups; bit cost plotted over k and ℓ, ranging from one row group and one col group to n row groups and m col groups, with k = 3 row groups, ℓ = 3 col groups marked]

Search for optimum: Summary
[Figure: search path from k = 1, ℓ = 1 through k=1, ℓ=2; k=2, ℓ=2; k=2, ℓ=3; k=3, ℓ=3; k=3, ℓ=4; k=4, ℓ=4; k=4, ℓ=5 to k = 5, ℓ = 5, with split and shuffle steps along the way]
• Split: Increase k or ℓ
• Merge: Decrease k or ℓ
• Shuffle: Rearrange rows and cols
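The slides do not spell out the shuffle update rule, so the following Python sketch is one plausible reading, in the spirit of co-clustering by compression: hold the current block densities fixed and move each row to the row group under which that row is cheapest to encode. The function name `shuffle_rows`, the `eps` clipping, and the assumption that labels are 0..k-1 with no empty group are all illustrative choices.

```python
import numpy as np

def shuffle_rows(adj, row_groups, col_groups, eps=1e-9):
    """One row-shuffle pass for a single 0/1 adjacency matrix.

    Returns new row-group labels; column groups and densities stay fixed.
    Labels are assumed to be 0..k-1 (rows) and 0..l-1 (cols).
    """
    adj = np.asarray(adj)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    k = row_groups.max() + 1
    l = col_groups.max() + 1
    R = np.eye(k)[row_groups]              # (n, k) one-hot row membership
    C = np.eye(l)[col_groups]              # (m, l) one-hot col membership
    row_sizes = R.sum(axis=0)
    col_sizes = C.sum(axis=0)
    row_ones = adj @ C                     # ones of each row per col group
    row_zeros = col_sizes - row_ones       # zeros of each row per col group
    block_ones = R.T @ row_ones            # (k, l) edges per block
    block_cells = np.outer(row_sizes, col_sizes)
    # Clip densities away from 0 and 1 so the logarithms stay finite.
    p = np.clip(block_ones / np.maximum(block_cells, 1), eps, 1 - eps)
    # cost[x, i]: bits to encode row x using the densities of row group i.
    cost = -(row_ones @ np.log2(p).T + row_zeros @ np.log2(1 - p).T)
    return cost.argmin(axis=1)

# Rows 0-2 link to cols 0-2, rows 3-5 link to cols 3-5; row 3 starts misplaced.
adj = np.kron(np.eye(2, dtype=int), np.ones((3, 3), dtype=int))
new_rows = shuffle_rows(adj, [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1])
```

In this small example the misplaced row 3 moves to the second group, giving labels [0, 0, 0, 1, 1, 1]. A column shuffle would be the same pass applied to the transposed matrix with the roles of the two groupings swapped.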

Graph clustering – [Chakrabarti+ KDD’04]
• Given a graph of interactions or associations
  • Customers to products
  • Documents to terms
  • People to people
  • Computer communications
  • Financial transactions
• Find simultaneously
  • Communities (source and destination)
  • Their number
