Group Formation in Large Social Networks: Membership, Growth, and Evolution

Group Formation in Large Social Networks: Membership, Growth, and Evolution Lars Backstrom, Dan Huttenlocher, Joh Kleinberg, Xiangyang Lan Presented by Natalia Dragan Data Mining Techniques(CS6/73015-001) Fall06 Kent State University

Outline • Introduction • Motivation • Present work contributions • Related work • Membership, Growth, Evolution • Conclusions

Introduction • Structure of society: people tend to come together and form groups • Why it is important to study • To understand better decision-making behavior • Tracking early stages of an epidemic • Following popularity of new ideas and technologies • Significant growth in the scale and richness of on-line communities and social media (MySpace, Face book, LiveJournal, Flickr)

Problems with the studying social processes • Easy to build theoretical models (subgraph branching out rapidly over links in the network or collection of small disconnected components growing in a ‘speckled’ fashion) • Hard to make concrete empirical statements about these types of processes • Lack of reasonable vocabulary for studying group evolution

Present work • Principles by which groups develop and evolve in large-scale social networks • Crucial point: focusing on networks where the members have explicitly identified themselves as a group or community • (vs. an unsupervised graph clustering problem of inferring “community structures” in a network) • 3 main types of questions: Membership, Growth, Change

Membership, Growth, Change • Membership • Structural features that influence whether a given individual will join a particular group • Growth • Structural features that influence whether a given group will grow significantly over time • Change • How focus of interest changes over time • How these changes are correlated with changes in the set of group members

Related work • Identifying tightly-connected clusters within a given graph • Dill et al. consider implicitly identified communities • For a set of features (ZIP code, particular keyword, etc.) they consider a subgraph of the Web consisting of all pages containing this feature • Use of online social networks for data mining • The structure of the communities is not exploited • Diffusion on innovation study • Property which is “diffusing” in the present work – membership in a given group

Diffusion of Innovations • Diffusion of Innovations is a theory that analyzes, as well as helps explain, the adaptation of a new innovation. In other words it helps to explain the process of social change. • An innovation is an idea, practice, or object that is perceived as new by an individual or other unit of adoption. The perceived newness of the idea for the individual determines his/her reaction to it (Rogers, 1995). • In addition, diffusion is the process by which an innovation is communicated through certain channels over time among the members of a social system. Thus, the four main elements of the theory are the innovation, communication channels, time, and the social system. • http://hsc.usf.edu/~kmbrown/Diffusion_of_Innovations_Overview.htm

Related work on Diffusion of Innovation • How a social network evolves as it’s members attributes change (Sarkar and Moore, Holme and Newman) • Social network evolution in a university setting (Kossinets and Watts) • Evolution of topics over time (Wang and McCallum) • Property which is “diffusing” in the present work is a membership in a given group

Road map • Introduction • Motivation • Present work contributions • Related work • Membership, Growth, Evolution • Conclusions

Sources of data • LiveJournal • Free on-line community with ~ 10 mln members • 300,000 update the content in 24-hour period • Maintaining journals, individual and group blogs • Declaring who are their friends and to which communities they belong • DBLP • On-line database of computer science publications (about 400,000 papers) • Friendship network – co-authors in the paper • Conference - community

Community Membership • Study of processes by which individuals join communities in a social network • Fundamental question about the evolution of communities: who will join in the future? • Membership in a community – “behavior” that spreads through the network • Diffusion of innovation study perspective for this question

Dependence on number of friends: start towards membership prediction • Underlying premise in diffusion studies: an individual probability of adopting a new behavior increases with the number of friends (K) already engaging in the behavior • Theoretical modelsconcentrate on the effect of K, while the structural properties are more influential in determining membership

Approach hypothesis • For moderate values of K an individual with K friends in a group is significantly more likely to join if these K fiends are themselves mutual friends than if there are not

Dependence on number of friends . . 1st snapshot 2nd snapshot . . - user (u) , C - community, - friend . . . . . C . . C . . . . . . . . . . . . . . K = 3 . . . . . P(k) = 2/3 Probability P(k) of joining community = fraction of triples (u,C,k)

Law of Diminishing returns • In economics, diminishing returns is the short form of diminishing marginal returns. In a production system, having fixed and variable inputs, keeping the fixed inputs constant, as more of a variable input is applied, each additional unit of input yields less and less additional output. This concept is also known as the law of increasing opportunity cost or the law of diminishing returns. • http://en.wikipedia.org/wiki/Diminishing_returns

Dependence on number of friends: LiveJournal

Dependence of number of friends: DBLP

Discussion of results • The plots for LJ and DBLP have similar shapes, dominated by “diminishing returns” property (curve continues increasing, but more and more slowly even for large K) • P(2)>2P(1) – benefit of having a second friend is particularly strong (S-shaped behavior) • Curve for LJ is quitesmooth (1/2 billion triples vs. 7.8 million for DBLP)

A broader range of features • Features related to the community C (11) • Number of members (|C|) • Number of individuals with a friend in C (fringe of C) • Number of edges with both ends in the community (|Ec|) • etc. • Features related to an individual u and her set S of friends in community C (8) • Number of friend in community (|S|) • Number of adjacent pairs in S • Number of pairs in S connected via a path in Ec • etc.

Estimating probability on a broader range of features • Decision-tree techniques were applied to these features to make advances in estimating the probability of an individual joining a community • The technique incorporates • Individual’s position in the network (structural features) • Level of activities among members (group features)

Predictions for LJ and DBLP • 1st snapshot 2nd snapshot Data point (u,C) Probability UC . Fringe Fringe u . . . . . . C . C LJ: 14,448 joined community DBLP: 71,618 joined community LJ: 17,076,344 data points, 875 communities DBLP: 7,651,013 data points 20 decisions tree were built for estimation about joining

Splitting process for LJ • Each of 875 communities have half of their fringe members included in the training set (with the independent probability 0.5) • At each node in the decision tree • Every possible feature • Every binary split treshold for that feature were examined • Of all such pairs the split which produces the largest decrease in entropy was chosen

Splitting process for LJ • New splits were installed until there are fewer than 100 positive cases at the node • Leaf nodes predict the ratio of positive to total cases for that node • Averaging technique • For every case the set of decision trees, for which this case is not included in the training set, were built • The average of these predictions is a prediction for the case

Averaging model (Simple description) • Selecting a model that explains the data from all the possible models, the one which better fits the data is usually selected. • But sometimes there is some model that explains really well the data, creating a model selection uncertainty, which is usually ignored. • BMA (Bayesian Model Averaging) provides a coherent mechanism for accounting for this model uncertainty, combining predictions from the different models according to their probability. • J. A.and Madigan D. Hoeting and A.E.and Volinsky C.T. Raftery. Bayesian model averaging: A tutorial (With Discussion). Statistical Science, 44(4):382--417, 1999

Averaging model (Simple description) • Example: we have an evidence D and 3 possible hypothesis h1, h2 and h3. The posterior probabilities for those hypothesis are P( h1 | D ) = 0.4, P( h2 | D ) = 0.3 and P( h3 | D ) = 0.3 • Giving a new observation, h1 classifies it as true and h2 and h3 classify it as false, then the result of the global classifier (BMA) would be calculated as follows:

Top two level splits for predicting single individuals joining communities in LJ

Performance achieved with the decision trees Prediction performance for single individuals joining communities in LJ Prediction performance for single individuals joining communities in DBLP

Internal connectedness of friends Individuals whose friends in community are linked to one another are significantly more likely to join the community

Community Growth • Prediction task: identifying which communities will grow significantly over a given period of time • Binary classification problem • Training set Community growth >9 % < 18 % Class 1 (49.4%) Class 0 (50.6) Data set: 13 570 communities To make predictions: 100 decision trees on 100 independent samples using the community features were built Binary split is installed until a node has less than 50 data points

Top two levels of decision tree splits for predicting community growth in LJ • The features and splits varied depending on the sample, but the top 2 splits were stable

Solution to the problem • Averaging tree techniques was used • Three baselines with a single feature were considered • Size of the community • Number of people in the fringe of the community • Ratio of these two features and combination of all three features

Results Predicting community growth: baselines based on three different features, and performance using all features By including the full set of features predictions with reasonably good performance were received

Movement between communities • How people and topics move between communities • Fundamental question: given a set of overlapping communities • do topics tend to follow people • or do people tend to follow topics • Experiment set up: 87 conferences for which there is DBLP data over at least 15-year period • Cumulative set of words in titles is a proxy for top-level topics

Time Series and Detected Bursts: Term Bursts . . OOPSLA’03 . . . . Tw,C(y) = 2/6 . “Micro-Pattern Evolution” “Micro-Pattern in Java” . . . Tw,C “Micro-Pattern” is hot at OOPSLA in 2003 . . . . . . . . . . y 2000 2001 2002 2003 2004 2005 Term bursts

Time Series and Detected Bursts: Term Bursts • Tw,C(y) – fraction of paper titles at conference C in year y that contain the word w • Bursts in the usage of w • For each time series Tw,C isaninterval in which Tw,C(y) is twice the average rate (“burst rate”) • Burst detection technique exploiting stochastic model for term generation is used • Burst intervals serve to identify the “hot topics” (focus of interest at a conference) • Word wis hot at conference C in year yif the year y is contained in a burst interval of the time series Tw,C

Time Series and Detected Bursts: Movement Bursts • Author movement • Authors do not publish every year • Movement is asymmetric • Memberof a conference C in year y • Has published there in the 5 years leading up to y • Authora moves into C from B in year y (B -> C) • ahas a paper in conference C in year y and • a is a member of B in year y-1 • Property of two conferences and a year . . C B “Micro-Pattern Evolution” Smith 2002 2003

Time Series and Detected Bursts: Movement Bursts . . . . . . B . C . . . . . Brown MB,C(y) = 2/5 Dill 2002 2001 MB,C . . . . . . . . . . . . . y 2000 2001 2002 2003 2004 2005 Movement bursts

Time Series and Detected Bursts: Movement Bursts • MB,C(y) – fraction of authors at conference C in year y with the property that they are moving into C from B(B -> C) • MB,C – time seriesrepresentingauthor movement • B -> C movement bursts • an interval of y in which the value MB,C(y) exceeds the overall average by an absolute difference of .10 • Burst detection is used to find burst intervals

Goal of the Experiment 1 • Identify how word burst and movement burst intervals are aligned in time? • Word burst intervals identify hot terms • Movement burst intervals identify conference pairs B,C during which there was significant movement

Experiment 1: Papers contributing to Movement Bursts • Characteristics of papers associated with some movementburst into a conference C • They exhibit different properties from arbitrary papers at C • Using of terms currently hot at C • Using of terms that will be hot at C in the future • Paper at C in y contributesto some movement burst at C • If one of the authors is moving B -> C in y • y is a part of B -> C movement bursts . . “Micro-pattern Evolution” Smith OOPSLA’03 ICPC’02 2002 2003 2004 Movement burst

Papers contributing to Movement Bursts • Paper uses hot term • If one of the words in its title is hot for the conference and year in which it appears • Question: do papers contributing to movement bursts differ from arbitrary papers in the way they use hot terms? Papers contributing to a movement burst contain elevated frequencies of currently and expired hot terms, but lower frequencies of future hot terms A burst of authors moving into C from B are drawn to topics currently hot at C

Experiment 2: Alignment between different conferences • Conferences B and C are topically aligned in a yeary • If some word is hot at both B and C in year y • Property of two conference and a specific year • Hypothesis: two conferences are more likely to be topically aligned in a given year if there is also a movement burst going between them “Micro-pattern” OOPSLA’03 “Micro-pattern” ICSM’03

Results • 56.34% of all triples (B,C,y) such that there is B->C movement burst containing year y have the property that B and C are topically aligned in year y • 16.2 % of all triples (B,C,y) have the property that B and C are topically aligned in year y • The presence of a movement burst between 2 conferences enormously increases the chance they share a hot term

Movement bursts or term bursts come first? • There is a B -> C movement burst, andhot termswsuch that B and C are topically aligned via w in some year y inside the movement burst • 3 events of interest • The start of the burst for w at conference B • The start of the burst for w at conference C • The start of the B -> C movement burst

Four patterns of author movement and topical alignment Term burst intervals B -> C movement burst 32 194 35 61 Shared interest is 50 % more frequent than others Much more frequent for B and C to have a shared burst term that is already underway before the increase in author movement takes place

Conclusions • The ways in which communities in social network grow over time were considered • At the level of individuals and their decision to join communities • At a more global level, in which a community can evolve in membership and content

Thank you!

Group Formation in Large Social Networks: Membership, Growth, and Evolution