Crowd Crawling: Towards Collaborative Data Collection for Large-scale Online Social Networks

Cong Ding, Yang Chen*, and Xiaoming Fu • University of Göttingen • *Duke University

Significance of social network data crawling • Understanding user behaviors • Improving SNS architectures • Handling privacy/security issues • and so on...

Current data collection methods (1) • ISP-based measurement [Schneider IMC’09]: only ISPs can do this

Current data collection methods (2) • Cooperation with SNS companies [Yang IMC’11]: most research groups do not get this opportunity

Current data collection methods (3) • Crawling by a single group, which then shares the data with others [Gjoka INFOCOM’10]: suffers from request rate limiting

Shortcomings of crawling by a single group • Wastes computing and network resources • Introduces overhead for service providers (and may lead to stricter rate limiting) • Lacks ground truth for the research community

A new thought Why not collect data collaboratively?

System overview • Coordinator • Crawlers

System design • Fetching UIDs (BFS, etc.) • Handling crawling failure (timeout) • Bypassing request rate limiting (massive IP addresses) • Data fidelity (redundant crawling)
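
The four design points above can be sketched as a small in-memory coordinator. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `Coordinator`, its methods, and the `timeout` and `redundancy` parameters are all hypothetical. It keeps a BFS frontier of UIDs, re-queues a UID when its crawler misses a deadline, and, when `redundancy` is greater than 1, accepts a result only after that many distinct crawlers report the same neighbour set. Rate-limit bypassing is not modelled beyond the fact that tasks are spread over many crawlers, each assumed to have its own IP address.

```python
import time
from collections import deque


class Coordinator:
    """Hands out UIDs to crawlers and collects results (hypothetical sketch)."""

    def __init__(self, seed_uids, timeout=60.0, redundancy=1, clock=time.monotonic):
        self.frontier = deque(seed_uids)  # BFS queue of UIDs waiting to be crawled
        self.seen = set(seed_uids)        # every UID ever queued, to avoid repeats
        self.pending = {}                 # uid -> (crawler_id, deadline)
        self.reports = {}                 # uid -> {crawler_id: reported neighbours}
        self.done = {}                    # uid -> accepted neighbour set
        self.timeout = timeout
        self.redundancy = redundancy
        self.clock = clock

    def _requeue_expired(self):
        # Crawling failure: a task whose deadline passed goes back to the queue.
        now = self.clock()
        for uid, (_, deadline) in list(self.pending.items()):
            if now >= deadline:
                del self.pending[uid]
                self.frontier.append(uid)

    def next_task(self, crawler_id):
        """Return the next UID for this crawler, or None if nothing is available."""
        self._requeue_expired()
        for _ in range(len(self.frontier)):
            uid = self.frontier.popleft()
            if crawler_id in self.reports.get(uid, {}):
                # Redundant copies must come from distinct crawlers.
                self.frontier.append(uid)
                continue
            self.pending[uid] = (crawler_id, self.clock() + self.timeout)
            return uid
        return None

    def submit(self, crawler_id, uid, neighbours):
        """Record one crawler's result for a UID it was assigned."""
        owner = self.pending.get(uid)
        if owner is None or owner[0] != crawler_id:
            return  # stale report from a crawler whose task was reassigned
        del self.pending[uid]
        votes = self.reports.setdefault(uid, {})
        votes[crawler_id] = frozenset(neighbours)
        if len(votes) < self.redundancy:
            self.frontier.append(uid)  # needs another independent crawl
            return
        if len(set(votes.values())) != 1:
            votes.clear()              # reports disagree: discard and start over
            self.frontier.append(uid)
            return
        self.done[uid] = votes[crawler_id]
        for n in sorted(neighbours):   # BFS expansion of newly discovered UIDs
            if n not in self.seen:
                self.seen.add(n)
                self.frontier.append(n)
```

A crawler would loop over `next_task`, fetch the user's data, and call `submit`. Requiring exact agreement between redundant reports is the simplest possible fidelity check; a real system would likely need something more tolerant, since a user's connections can change between two crawls.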

Implementation • A proof-of-concept prototype (without the data fidelity part) to crawl Weibo • 472 PlanetLab servers as crawlers

Evaluation • In 24 hours, we crawled the data of 2.22M Weibo users, including user profiles, all posts, and all social connections • Comparison: • Fu et al. (PLOS ONE 2013) collected 30K users’ data in 6 days • Guo et al. (PAM 2013) collected 1M users’ data in 1 month
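
To put the evaluation figures on a common scale, the short calculation below converts them to rates. It uses only the numbers quoted above plus the 472 crawlers from the implementation slide; treating "1 month" as 30 days is an assumption. The studies may have collected different amounts of data per user, so the ratios are a rough indication rather than a like-for-like benchmark.

```python
# Figures quoted in the slides.
USERS = 2_220_000   # users crawled in the evaluation
HOURS = 24          # duration of the crawl
CRAWLERS = 472      # PlanetLab servers used as crawlers

users_per_hour = USERS / HOURS                     # whole system
users_per_crawler_hour = users_per_hour / CRAWLERS

fu_per_day = 30_000 / 6        # Fu et al.: 30K users in 6 days
guo_per_day = 1_000_000 / 30   # Guo et al.: 1M users; assumes a 30-day month

ratio_vs_fu = USERS / fu_per_day
ratio_vs_guo = USERS / guo_per_day

print(f"system throughput: {users_per_hour:,.0f} users/hour")
print(f"per crawler: {users_per_crawler_hour:.0f} users/hour")
print(f"daily rate vs Fu et al.: {ratio_vs_fu:.0f}x")
print(f"daily rate vs Guo et al.: {ratio_vs_guo:.0f}x")
```

This works out to about 92,500 users per hour overall, or roughly 196 users per crawler per hour, and a daily rate about 444 times that of Fu et al. and about 67 times that of Guo et al.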

Conclusion and Discussion • Data sharing may violate some providers’ terms of service • Twitter does not allow data sharing (even for research) • Weibo allows data sharing among researchers • Unlimited data sharing might raise ethical issues • The data should be anonymized • We will publish the data crawled in the evaluation
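
The slides do not say how the data should be anonymized. One common first step, shown here purely as an assumption, is to replace each UID with a keyed hash before sharing, so that the social graph keeps its structure while raw identifiers are removed. The function names and the 16-character truncation are hypothetical choices. This is pseudonymization only: graph structure and post content can still re-identify users, so it would not be sufficient on its own.

```python
import hashlib
import hmac


def pseudonymize_uid(uid: str, secret_key: bytes) -> str:
    """Map a UID to a stable pseudonym using HMAC-SHA256 (hypothetical helper)."""
    digest = hmac.new(secret_key, uid.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; an illustrative choice


def pseudonymize_edges(edges, secret_key: bytes):
    """Rewrite (follower, followee) pairs so the graph keeps its shape."""
    return [
        (pseudonymize_uid(a, secret_key), pseudonymize_uid(b, secret_key))
        for a, b in edges
    ]


if __name__ == "__main__":
    key = b"keep-this-secret"  # placeholder; a real key must stay private
    print(pseudonymize_edges([("1001", "1002"), ("1002", "1003")], key))
```

Because the same key always yields the same pseudonym, datasets released by different collaborators can still be joined as long as they share the key. Whoever holds the key can recompute the pseudonym of any known UID, so the key itself must stay private.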
