Walking in facebook : a case study of unbiased sampling of osnS

Walking in facebook:a case study of unbiased sampling of osnS 2010.6.9 junction

Outline • Motivation and Problem Statement • Sampling Methodology • Evaluation of Sampling Techniques • Facebook Data Analysis • Conclusion

Online Social Networks (OSNs) • Social Graph • undirected graph • G = (V, E) • V: nodes (users) • E: edges (relationships) • kv : node degree • A network of declared friendships between users, allowing users to maintain relationships • Many popular OSNs with different focus • Facebook, LinkedIn, Flickr, … • Facebook • More than 400 million active users • 50% of themlog on to Facebook in any given day • Average user has 130 friends • People spend over 500 billion minutes per month on Facebook • more than 100 million mobile users • Mobile user are twice more active than non-mobile users.

Why Sample OSNs? • Representative samples desirable • study properties • test algorithms • Obtaining complete dataset difficult • companies usually unwilling to share data • tremendous overhead to measure all (~100TB for Facebook)

Problem Statement • Obtain a representative sample of usersin a given OSN by exploration of the social graph. • Uniform sample of Facebook users • explore graph using various crawling techniques

Outline • Motivation and Problem Statement • Sampling Methodology • Crawling Methods • Convergence Evaluation • Data Collection • Evaluation of Sampling Techniques • Facebook Data Analysis • Conclusion

Crawling Methods • Crawling Methods • Breadth First Search (BFS) • Random Walk (RW) • Re-Weighted Random Walk (RWRW) • Metropolis-Hastings Random Walk (MHRW) • Uniform Sampling (UNI)

Breadth First Search (BFS) • Early measurement studies of OSNs use BFS as primary sampling technique • Starting from a seed, explores all neighbor nodes. • As this method discovers all nodes within some distance from the starting point, an incomplete BFS is likely to densely cover only some specific region of the graph. • BFS leads to bias towards high degree nodes

Random Walk (RW) • Explores graph one node at a time with replacement • In the stationary distribution • biased towards higher degree nodes (πv ~ kv) Degree of node υ Number of edges

Re-Weighted Random Walk (RWRW) • Corrects for degree bias at the end of collection • Without re-weighting, the probability distribution for node property A is: (e.g. the degree, network size ...) • Re-Weighted probability distribution : Degree of node u

Metropolis-Hastings Random Walk (MHRW) • Explore graph one node at a time with replacement • In the stationary distribution • Exactly the uniform distribution

Uniform Sampling (UNI) • As a basis for comparison (ground truth) • Rejection sampling • uniform sampling of on the 32-bit IDs • discarding the non-existing ones • yields a uniform sample of the existing user IDs in Facebook for any allocation policy (i.e. even if the userIDs are not evenly allocated in the 32-bit address space) • UNI not a general solution for sampling OSNs • userID space must not be sparse • names instead of numbers • must be supported by the systems

Convergence Detection • Number of samples (iterations) to loose dependence from starting points?

Convergence Evaluation • Using Multiple Parallel Walks to improve convergence • avoid getting trapped in certain region • starting from 28 different randomly chosen initial nodes • Detecting Convergence with Online Diagnostics • sampling longer and discard a number of initial “burn-in” iterations • Consumed BW (TB) and measurement time (days) • Crucial to decide appropriate ‘burn-in’ and total running time • Grweke Diagnostic • Gelman-Rubin Diagnostic

Geweke Diagnostic • Detect the convergence of a single Markov chain • With increasing number of iterations, Xa and Xb move further apart, which limits the correlation between them. • according to the law of large numbers, the z values become normally distributed ~ (0, 1) • Declare convergence when most values fall in the [-1,1] interval Xa Xb

Gelman-Rubin Diagnostic • Detects convergence for m>1 walks (m: # of chains) • Compare the empirical distributions of individual chains with the empirical distribution of all sequences together • if they are similar enough (R,1.02) , declare convergence Between walks variance Walk 1 Walk 2 Walk 3 Within walks variance

Data Collection • Information collected

Data Collection • 28 x 81K = 2.26 M • 28 initial starting nodes • crawl until exactly 81K samples are collected • Summary of data set • 18.53M nodes picked uniform from [1, 232] • only 1216 K users existed • 228 K users had zero friends • repeat the same node in a walk • # of rejected nodes without repetition : 645 K • RW: 97 % nodes are unique • BFS: 97 % nodes are unique • confirms that the random seeding chose different areas of FB

Outline • Motivation and Problem Statement • Sampling Methodology • Evaluation of Sampling Techniques • Convergence Analysis • Methods Comparison • Unbiased Estimation • Facebook Data Anaylsis • Conclusion

Convergence Analysis • What is a fair way to compare the results of MHRW with RW and BFS? • MHRW visits fewer unique nodes than RW and BFS • MHRW stays at some nodes for relatively long time/iterations • Happens usually at some low degree node • An appropriate practical comparison should be based on the number of visited unique nodes

Convergence Test • When does it reach equilibrium? • Burn-in determined to be 3K -> discard 6K • converge when all 28 values fall in the [-1, 1] interval • 500 iterations Node Degree • converge when all R scores drop below 1.02 • (0,1): not in / in • 3000 iterations

Methods Comparison • MHRW, RWRW produce good in estimating the probability of a node degree • The degree distribution will converge fast to a good uniform sample • Poor performance for BFS, RW 28 crawls

Unbiased Estimation (BFS, RW) • Node degree distribution • introduce a strong bias towards the high degree nodes • the low-degree nodes are under-represented

Unbiased Estimation (MHRW) • Degree distribution identical to UNI (MHRW, RWRW)

FB Social Graph– degree distribution • Degree distribution not a power law a1=1.32 a2=3.38

FB Social Graph - Assortativity • Assortativity • nodes tend to connect to similar or different nodes? • positive correlation: high degree nodes tend to connect to other high degree nodes

FB Social Graph – Privacy Awareness

Conclusion • Compared graph crawling methods • MHRW, RWRW performed remarkably well • BFS, RW lead to substantial bias • Practical recommendations • correct for bias • usage of online convergence diagnostics • proper use of multiple chains

Walking in facebook : a case study of unbiased sampling of osnS

Walking in facebook : a case study of unbiased sampling of osnS

Presentation Transcript

Promotional Sampling Case Study: Product Sampling

Facebook Case Study

A Study of social influence in diffusion of innovation over Facebook

Walking in Facebook: A Case Study of Unbiased Sampling of OSNs

Nationalisation: A case study of Zambia

Case Study of a Turnaround

Facebook strategy of a hotel Case study: Hotel Rotes Wildschwein

FROM POSTURAL HYPOTENSION TO THERAPEUTIC WALKING: A CASE STUDY

A Case Study of Successful Ministry

Dynamics of learning: A case study

Benefits of Benchmarking A Case Study

A Case Study of Kiva

Implementation of a case study

Case Study of a Turnaround

Definition of a Case Study

Case Study : A bit of Banner...

Case Study of a PROBE Building

A Case Study of Interaction Design

Case study: Evolution of a menace

Study of Sampling Fractions

A study of Online Market in Airline: A Case Study in Mongolia