A Framework For Community Identification in Dynamic Social Networks

Chayant Tantipathananandh, Tanya Berger-Wolf, David Kempe • Presented by Victor Lee

Outline of Presentation • The Challenge: Dynamic Social Networks • Framework and Problem Formulation • Individual and Group Colorings • Group Coloring Heuristics • Experimental Results • Future Directions

The Problem • Many well-known approaches to identify communities in social networks • Graph Partitioning • Clustering • Various measures of closeness or density • But, these approaches generally assume static networks • Most social networks are dynamic

Dynamic Social Networks • Social Networks change over time • Membership changes • Interaction changes • Most community identification techniques: • Use a single snapshot • Or use time-averaged measurements • Lose important information

Importance of Dynamic Information • Networks 1 and 2: same average characteristics, but… • Network 1 shows an oscillation • Network 2 suggests that C joins the community • [Figure: Network 1 and Network 2, each showing the groupings of individuals A, B, and C over time steps T1 to T6]

Proposal • New framework for modeling social networks over time • Algorithms and Heuristics to identify dynamic communities • Experiments to verify the concept and the computational performance

Problem Formulation • Given: • A set of individuals • A sequence of snapshot observations • Find: • A best-fit set of time-varying communities C(t) • Best-fit time-varying community membership for each individual • Approach: • Combinatorial optimization • Graph coloring

Model: Individuals and Groups • Set of individuals $X = \{i_1, i_2, \dots, i_n\}$ • Sequence of observations $\langle P_1, P_2, \dots, P_T \rangle$ • Discrete time • Record interaction between individuals • A set of individuals interacting with one another at time t defines a group. • If A interacts with B, and B interacts with C, then {A, B, C} ⊆ a group

Group vs Community • Snapshot Graph • Individual is a vertex • Interaction is an edge • Group is a connected subgraph • Assumption: interaction is sufficiently limited so that the graph is not connected (we have disjoint groups) • Group ≠ Community • Groups capture observed interaction at a point in time • Communities extend over time
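
The definition of a group as a connected subgraph of the snapshot graph can be sketched in code. The following is a minimal illustration, not the authors' implementation; the function name `groups_from_interactions` and the sample snapshot are hypothetical, and individuals with no recorded interaction are assumed to be unobserved at that time step.

```python
from collections import defaultdict

def groups_from_interactions(interactions):
    """Return the groups (connected components) of one snapshot graph.

    `interactions` is a list of (individual, individual) pairs observed
    at a single time step. Assumption: individuals that appear in no
    pair are treated as unobserved and belong to no group.
    """
    adjacency = defaultdict(set)
    for u, v in interactions:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first search collects everyone reachable from `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        groups.append(frozenset(component))
    return groups

# A interacts with B and B with C, so {A, B, C} forms one group.
snapshot = [("A", "B"), ("B", "C"), ("D", "E")]
groups = groups_from_interactions(snapshot)
```

Running this once per time slice yields the disjoint groups that the later coloring steps operate on.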

Graphing the Observations • Each time slice is one observation • Edges within a time slice show observed interaction at time t • Add edges joining all observations of the same individual • No edges between groups from one time to another • Legend: ○ = individual, □ = group

Refine the Problem • A community appears as a sequence of groups, of at most one group per time slice. • Tasks: • Assign each group to a community (color the group vertices) • Assign each individual to a community, for each time step (color individual vertices) • More Assumptions: • Individuals belong to one community at a time • Individuals don’t change community frequently • Individuals frequently appear in their community

Cost Model • Quantify a “good” community identification • Assign costs to undesirable behavior: • I-cost: a when an individual changes color. • G-costs: • b1 when an individual is absent from its community. • b2 when an individual is present in a different community. • C-cost: g for each color that an individual uses • Find a coloring with minimum cost

Coloring Choices and Costs • At time T3, C temporarily changes its interaction. • Coloring 1: C changes community and then changes back. • Cost = 2*a (+ g if this color hasn’t been used before) • Coloring 2: C stays in its original community and just visits. • Cost = b1 + b2 • The optimal coloring depends on whether (b1 + b2) is less than (2*a + g), or less than (2*a) if C has already used the other color • [Figure: Coloring 1 and Coloring 2 of individuals A, B, C, and D over time steps T1 to T4]
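
The comparison between the two colorings can be checked numerically. The sketch below is one reading of the cost model, not the authors' code: it assumes b1 is charged whenever a group of the individual's color exists at that time step and the individual is not in it, and b2 whenever the individual is observed in a group of a different color. The function name, the color labels, and the four-step scenario for individual C are hypothetical.

```python
def individual_cost(colors, observed, present, a, b1, b2, g):
    """Total cost of one individual's coloring over all time steps.

    colors[t]   : color assigned to the individual at time t
    observed[t] : color of the group the individual was seen in (or None)
    present[t]  : set of group colors that exist at time t
    """
    cost = g * len(set(colors))              # C-cost: g per distinct color used
    for t, color in enumerate(colors):
        if t > 0 and color != colors[t - 1]:
            cost += a                        # I-cost: the individual changed color
        if color in present[t] and observed[t] != color:
            cost += b1                       # G-cost: absent from its own community
        if observed[t] is not None and observed[t] != color:
            cost += b2                       # G-cost: present in another community
    return cost

# Hypothetical scenario: C sits with the "red" group except at the third step.
observed = ["red", "red", "blue", "red"]
present = [{"red", "blue"}] * 4
costs = dict(a=1, b1=1, b2=1, g=1)

coloring_1 = individual_cost(["red", "red", "blue", "red"], observed, present, **costs)
coloring_2 = individual_cost(["red"] * 4, observed, present, **costs)
```

With all four cost factors set to 1, Coloring 1 totals 4 (g for each of two colors plus 2*a) and Coloring 2 totals 3 (g for one color plus b1 + b2), so staying and visiting is cheaper, which matches the comparison (b1 + b2) < (2*a + g).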

Finding Optimal Colorings • Finding the optimal solution is NP-hard • Partition the problem: • Phase 1: Find an optimal set of communities (group coloring) • Phase 2: Find an optimal assignment of individuals to communities (individual coloring) • If Phase 1 (Group Coloring) is completed first: • Phase 2 is reduced from $O(2^N)$ to $O(2^G)$, where N = # of individuals and G = # of groups • The cost incurred by one individual’s coloring is independent of the colors chosen by others.

Independence of Individual Color Choice • Proof: • Cost of an individual’s behavior = A * (I-cost) + B * (G-cost) + C * (C-cost) • Costs are assessed individually: • I-cost = a * (# of color changes) • G-cost = b1 * (# absences from its group) + b2 * (# visits to other groups) • C-cost = g * (# of colors that an individual uses) • So, we can solve for each individual one at a time. • Moreover, we can assess cost incrementally, from time t to time t+1…

Individual Coloring Algorithm • C = set of all colors observed to be used by an individual i • $F(t) = \{S \subseteq C : 1 \le |S| \le t\}$: all possible subsets of colors up to time t • G(t, x) = G-cost to use color x at time t • I(t, x, y) = I-cost to use color x at time t-1 and color y at time t • C(x, R) = C-cost to use color x when color set R has been used • Min. cost at time t, using color x, with color set S used: • At time 1: $G(1, \{x\}, x) = G(1, x)$ • At time t: $G(t, S, x) = G(t, x) + \min_{R, y} \left[ G(t-1, R, y) + I(t, y, x) + C(x, R) \right]$, over all R and y where $R \in F(t-1)$, $y \in R$, and $R \cup \{x\} = S$ • Legend: i-cost: changing color; g-cost: wrong group; c-cost: new color

Optimal Individual Coloring • Given a group coloring, the minimum cost of coloring the individual i is $\min_{S \in F(T),\, x \in S} G(T, S, x)$ • Time complexity is $O(n T |C|^2 2^{|C|})$ • Space requirement is $O(|C| 2^{|C|})$ • If the number of group colors |C| is not large, the computation is tractable.
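
The recurrence for one individual can be sketched as a dynamic program over states (set of colors used so far, current color). This is an illustrative reading of the slides rather than the authors' implementation: the function name and input encoding are hypothetical, the G-cost follows the assumption that b1 applies when a group of the individual's color exists but the individual is not in it, and, as in the base case above, the first color is not charged a C-cost.

```python
def optimal_individual_cost(colors, observed, present, a, b1, b2, g):
    """Minimum cost of coloring one individual, given a fixed group coloring.

    colors      : candidate colors for this individual
    observed[t] : color of the group the individual was seen in (or None)
    present[t]  : set of group colors that exist at time t
    """
    def g_cost(t, x):
        cost = 0
        if x in present[t] and observed[t] != x:
            cost += b1          # absent from its own community
        if observed[t] is not None and observed[t] != x:
            cost += b2          # present in a different community
        return cost

    # best[(S, x)] = min cost so far, having used color set S and ending in x.
    best = {(frozenset([x]), x): g_cost(0, x) for x in colors}
    for t in range(1, len(observed)):
        nxt = {}
        for (R, y), prev in best.items():
            for x in colors:
                S = R | {x}
                cost = (prev + g_cost(t, x)
                        + (a if x != y else 0)       # I-cost: color change
                        + (g if x not in R else 0))  # C-cost: new color
                if cost < nxt.get((S, x), float("inf")):
                    nxt[(S, x)] = cost
        best = nxt
    return min(best.values())

# Hypothetical individual seen with "red" except at the third time step.
observed = ["red", "red", "blue", "red"]
present = [{"red", "blue"}] * 4
stay = optimal_individual_cost(["red", "blue"], observed, present, a=1, b1=1, b2=1, g=1)
switch = optimal_individual_cost(["red", "blue"], observed, present, a=0.25, b1=1, b2=1, g=0.25)
```

With unit costs the optimum keeps the individual red throughout (cost b1 + b2 = 2); with cheap switching (a = g = 0.25) the optimum changes color and back (cost 2*a + g = 0.75). The state space is at most |C| * 2^|C| entries per time step, in line with the space bound stated above.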

Optimal Group Coloring • Determine the best mapping of groups at time t to groups at time t+1 • Groups that are mapped across time are part of the same community and have the same color • A coloring is good if most individuals can retain their color from step to step. • [Figure: a possible coloring]

Bipartite Matching Heuristic • Matching Graph • For each pair of groups g, g′ at times t, t′ = t+1, add a weighted edge from $v_{g,t}$ to $v_{g',t'}$ • Weight = |g ∩ g′| (similarity of g to g′) • Find the maximum weight bipartite matching • Evaluation • Weights i-cost more heavily than g-cost • Performs well if membership is fairly stable • No long range perspective • More efficient heuristics? • Legend: i-cost: changing color; g-cost: wrong group; c-cost: new color
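
The matching step between two consecutive time slices can be illustrated on a tiny instance. The sketch below finds the maximum weight matching by brute force over permutations, which is a stand-in chosen for self-containment and only practical for a handful of groups; a real implementation would presumably use a polynomial-time assignment algorithm. The function name and the sample groups are hypothetical.

```python
from itertools import permutations

def best_matching(groups_t, groups_next):
    """Brute-force maximum weight matching with weight = |g ∩ g'|.

    Returns (total_weight, pairs), where each pair (i, j) links
    groups_t[i] to groups_next[j]. Zero-weight pairs are left unmatched.
    """
    swapped = len(groups_t) > len(groups_next)
    small, large = (groups_next, groups_t) if swapped else (groups_t, groups_next)

    best_weight, best_pairs = -1, []
    # Try every injective assignment of the smaller side into the larger side.
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(i, j) for i, j in enumerate(perm) if small[i] & large[j]]
        weight = sum(len(small[i] & large[j]) for i, j in pairs)
        if weight > best_weight:
            best_weight, best_pairs = weight, pairs

    if swapped:
        best_pairs = [(j, i) for i, j in best_pairs]
    return best_weight, sorted(best_pairs)

groups_t = [frozenset("ABC"), frozenset("DE")]
groups_next = [frozenset("AB"), frozenset("CDE"), frozenset("F")]
weight, pairs = best_matching(groups_t, groups_next)
```

Here {A, B, C} is matched to {A, B} and {D, E} to {C, D, E}, for a total weight of 4; matched groups would receive the same color, and the unmatched group {F} would start a new one.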

Greedy Heuristics for Group Coloring • Approach: Maximize pairwise similarity between groups, for all pairs of groups over all timesteps • Jaccard’s index: $\mathrm{Jac}(g, g') = \frac{|g \cap g'|}{|g \cup g'|}$ (overlap between g and g′, scaled to the size of g and g′) • Weighted for temporal proximity: $\mathrm{JacD}(g, g') = \frac{\mathrm{Jac}(g, g')}{|t - t'|}$
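
Both similarity measures are short to compute. The following sketch assumes groups are represented as Python sets and that the two groups being compared with the time-weighted measure come from different time steps; the function names are hypothetical.

```python
def jaccard(g1, g2):
    """Jaccard index: size of the overlap divided by size of the union."""
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0

def jaccard_time_weighted(g1, t1, g2, t2):
    """Jaccard index divided by the time distance between the two groups."""
    if t1 == t2:
        raise ValueError("groups must come from different time steps")
    return jaccard(g1, g2) / abs(t1 - t2)

g = {"A", "B", "C"}
g_prime = {"B", "C", "D"}
jac = jaccard(g, g_prime)                         # 2 shared of 4 total = 0.5
jac_d = jaccard_time_weighted(g, 1, g_prime, 3)   # 0.5 / 2 = 0.25
```

Dividing by the time distance makes a group at an adjacent time step count more than an equally similar group observed long ago.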

Greedy Heuristics for Group Coloring • Greedy Heuristic 1 (time is not a factor) • Construct a square similarity matrix of size (# of groups) × (# of groups) • Cluster the groups using agglomerative clustering • Greedy Heuristic 2 (look backwards in time) • For t = 1 to T do: • Match the most similar pairs g, g′, where g′ is from any time t′ < t • If similarity = 0 or all colors have been used, add a new color • Greedy Heuristic 3 (look back the shortest interval) • Like Heuristic 2, but consider only the time t′ < t closest to t for which some g′ has similarity(g, g′) > 0
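
One possible reading of Greedy Heuristic 2 is sketched below. It is an assumption-laden illustration, not the authors' algorithm: similarity is taken to be the Jaccard index divided by the time distance, each color is allowed at most once per time step, ties are broken by order of appearance, and the function name and sample snapshots are hypothetical.

```python
def greedy_backward_coloring(snapshots):
    """Color groups step by step, reusing the color of the most similar earlier group.

    snapshots[t] is a list of non-empty frozensets (the groups at time t).
    Returns colors, where colors[t][k] is the color id of snapshots[t][k].
    """
    colors, next_color = [], 0
    for t, groups in enumerate(snapshots):
        # Collect (similarity, group index, earlier color) for all earlier groups.
        candidates = []
        for k, g in enumerate(groups):
            for t_prev in range(t):
                for j, g_prev in enumerate(snapshots[t_prev]):
                    sim = len(g & g_prev) / len(g | g_prev) / (t - t_prev)
                    if sim > 0:
                        candidates.append((sim, k, colors[t_prev][j]))
        candidates.sort(key=lambda c: -c[0])  # most similar pairs first

        assigned, used = [None] * len(groups), set()
        for _, k, color in candidates:
            if assigned[k] is None and color not in used:
                assigned[k] = color
                used.add(color)
        # Groups with no usable match (similarity 0 or color taken) get a new color.
        for k in range(len(groups)):
            if assigned[k] is None:
                assigned[k] = next_color
                next_color += 1
        colors.append(assigned)
    return colors

snapshots = [
    [frozenset("AB"), frozenset("CD")],
    [frozenset("ABC"), frozenset("D")],
    [frozenset("AB"), frozenset("CD")],
]
coloring = greedy_backward_coloring(snapshots)
```

In this example C visits the first group at the middle time step, yet both communities keep their colors across all three steps.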

Experiment 1: Verify the Framework • Does the framework capture the intuitive concept of dynamic community? • Procedure • Construct small, synthetic datasets • Use exhaustive search to get a truly optimal coloring

Experiment 1A: “Assembly Line” • At each time step, 1 member leaves and 1 enters a group, resulting in a complete membership change in 3 steps. • (A) (a,b1,b2,g) = (1,0,1,1) • (B) (a,b1,b2,g) = (1,0,3,1) • Results change as costs change. (A) favors stable membership. (B) allows for more fluid membership.

Experiment 1B: “Dutiful Children” • 2, 3, and 4 are Children. 0 and 1 are Parents that visit a different child each timestep. • (A) (a,b1,b2,g) = (1,0,1,1) • (B) (a,b1,b2,g) = (1,0,3,1) • Results: Framework succeeds at detecting the individual children as well as the visitation pattern.

Experiment 2: Quality of Heuristic Results • Do the heuristics obtain colorings similar to those of an exhaustive search? • Procedure • Re-test the synthetic datasets using the various heuristics • Results: At least one heuristic method obtains the same coloring and total cost as exhaustive search

Experiment 3: Real World Datasets • Do the framework and heuristics together obtain expected results using real-world datasets?

Experiment 3A: “Southern Women” • Eighteen women in 1933 in Natchez, Mississippi • Tracks their attendance at 14 social events

Experiment 3A: Prior Results • Twenty-one analyses (1941 to 2001) all show similar results • Two clear communities • The membership of individuals 8, 9, and 16 is less certain.

Experiment 3A: Results • (a,b1,b2,g) = (1,1,1,1) • Detects 4 communities, which are subsets of the traditional 2 communities • Individuals 6 and 10 change membership over time • By adjusting cost factors, the results of most of the 21 prior analyses can be duplicated

Experiment 3B: “Grevy’s Zebra” • 28-member zebra herd observed 44 times over 3 months in 2002 • The graph to the left shows the aggregate interaction. Temporal information is lost.

Experiment 3B: Results • Inferred communities agree with manual results obtained by biologists. • 4 stable communities • Some short-lived communities and some visiting

Conclusions • We present a framework for identifying communities in dynamic social networks • The framework produces meaningful results compared to traditional methods • Heuristic methods produce near-optimal solutions • Future Directions • Develop an approximation algorithm which guarantees the quality of the result • Investigate scalability over network size and time • Relax assumptions about interaction and dynamics
