Crowd Crawling: Towards Collaborative Data Collection for Large-scale Online Social Networks
Cong Ding, Yang Chen*, and Xiaoming Fu
University of Göttingen
*Duke University
Significance of social network data crawling
• Understanding user behaviors
• Improving SNS architectures
• Handling privacy/security issues
• and so on...
Current data collection methods (1)
• ISP-based measurement [Schneider IMC’09]
• Limitation: only ISPs can do this
Current data collection methods (2)
• Cooperating with SNS companies [Yang IMC’11]
• Limitation: most research groups do not have this opportunity
Current data collection methods (3)
• Crawling by a single group, which then shares the data with others [Gjoka INFOCOM’10]
• Limitation: suffers from request rate limiting
Shortcomings of crawling by a single group
• Wastes computing and network resources
• Introduces overhead for service providers (and may lead to stricter rate limiting)
• Leaves the research community without ground truth
A new thought Why not collect data collaboratively?
System overview
[Diagram labels: Coordinator, Crawlers]
System design
• Fetching UIDs (BFS, etc.)
• Handling crawling failure (timeout)
• Bypassing request rate limiting (massive IP addresses)
• Data fidelity (redundant crawling)
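The design points above can be sketched as a small coordinator. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `Coordinator`, the methods `next_task` and `report`, and the parameters `timeout` and `redundancy` are all hypothetical. It shows a BFS frontier of UIDs, reassignment of a task to a different crawler after a timeout, and majority voting over redundant crawls. Rate limiting is not modeled; the assumption is that each crawler has its own IP address, so spreading tasks over many crawlers keeps the per-IP request rate low.

```python
import collections
import time


class Coordinator:
    """Hypothetical coordinator sketch: BFS frontier, timeouts, redundant crawling."""

    def __init__(self, seeds, timeout=60.0, redundancy=1, clock=time.monotonic):
        self.timeout = timeout          # seconds before a task counts as failed
        self.redundancy = redundancy    # distinct crawlers required per UID
        self.clock = clock
        self.frontier = collections.deque()          # BFS queue of UIDs
        self.seen = set()                            # UIDs ever enqueued
        self.assigned = collections.defaultdict(set) # uid -> crawlers that got it
        self.in_flight = {}                          # (uid, crawler) -> deadline
        self.votes = collections.defaultdict(list)   # uid -> reported friend sets
        self.results = {}                            # uid -> accepted friend set
        for uid in seeds:
            self.seen.add(uid)
            self._enqueue(uid)

    def _enqueue(self, uid):
        # One queue entry per required redundant crawl.
        self.frontier.extend([uid] * self.redundancy)

    def _requeue_expired(self):
        now = self.clock()
        for (uid, crawler), deadline in list(self.in_flight.items()):
            if deadline <= now:
                del self.in_flight[(uid, crawler)]
                # The failed crawler stays in self.assigned[uid], so the
                # retry goes to a different crawler.
                self.frontier.appendleft(uid)

    def next_task(self, crawler):
        """Return the next UID for this crawler, or None if nothing is eligible."""
        self._requeue_expired()
        for i, uid in enumerate(self.frontier):
            if crawler not in self.assigned[uid]:
                del self.frontier[i]
                self.assigned[uid].add(crawler)
                self.in_flight[(uid, crawler)] = self.clock() + self.timeout
                return uid
        return None

    def report(self, crawler, uid, friends):
        """Record one crawl result; accept the majority answer once all votes are in."""
        if self.in_flight.pop((uid, crawler), None) is None:
            return  # late or unknown report: ignore it
        self.votes[uid].append(frozenset(friends))
        if len(self.votes[uid]) < self.redundancy:
            return
        # Majority vote; on a tie the earliest reported answer wins.
        answer, _ = collections.Counter(self.votes[uid]).most_common(1)[0]
        self.results[uid] = answer
        for friend in sorted(answer):
            if friend not in self.seen:   # BFS expansion to unseen UIDs
                self.seen.add(friend)
                self._enqueue(friend)
```

In this sketch a crawler would loop over `next_task`, fetch the user's data from the SNS, and call `report`. With `redundancy=2`, a UID's neighbors enter the frontier only after two distinct crawlers have reported on it.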
Implementation
• A proof-of-concept prototype (without the data fidelity part) that crawls Weibo
• 472 PlanetLab servers as crawlers
Evaluation
• In 24 hours, we crawled the data of 2.22M Weibo users, including user profiles, all posts, and all social connections
• Comparison:
  • Fu et al. (PLOS ONE 2013) collected 30K users’ data in 6 days
  • Guo et al. (PAM 2013) collected 1M users’ data in 1 month
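The comparison is easier to read as crawl rates. The short calculation below uses only the figures quoted on the slides (2.22M users in 24 hours on 472 crawlers, 30K users in 6 days, 1M users in 1 month). It assumes a 30-day month, and the helper name `users_per_day` is illustrative. The studies differ in what they collected per user, so the ratios are a rough indication only.

```python
CRAWLERS = 472  # PlanetLab servers used in the prototype


def users_per_day(users, days):
    """Average number of users crawled per day."""
    return users / days


crowd = users_per_day(2_220_000, 1)   # this work: 24 hours
fu = users_per_day(30_000, 6)         # Fu et al.: 6 days
guo = users_per_day(1_000_000, 30)    # Guo et al.: assumes 1 month = 30 days

print(f"crowd crawling: {crowd / 24:,.0f} users/hour")               # 92,500
print(f"per crawler:    {crowd / CRAWLERS:,.0f} users/day")          # 4,703
print(f"vs Fu et al.:   {crowd / fu:.0f}x the daily rate")           # 444x
print(f"vs Guo et al.:  {crowd / guo:.1f}x the daily rate")          # 66.6x
```

The per-crawler figure of roughly 4,700 users per day is close to the 5,000 users per day of Fu et al., which suggests that most of the gain comes from the number of crawlers and not from faster individual crawlers.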
Conclusion and Discussion
• Data sharing may violate some providers’ terms of service
  • Twitter does not allow data sharing (even for research)
  • Weibo allows data sharing among researchers
• Unlimited data sharing might cause ethical issues
  • The data should be anonymized
• We will publish the data crawled in the evaluation
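The slides do not say how the data would be anonymized. As one possible building block, user IDs can be replaced with keyed hashes before release, as sketched below. The function names `pseudonymize_uid` and `anonymize_edges` and the key handling are assumptions for illustration. This is pseudonymization only: the structure of the social graph and the text of posts can still identify users, so it would not be sufficient on its own.

```python
import hashlib
import hmac


def pseudonymize_uid(uid: str, secret_key: bytes) -> str:
    """Map a UID to a stable pseudonym using HMAC-SHA256 with a private key."""
    return hmac.new(secret_key, uid.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_edges(edges, secret_key: bytes):
    """Replace both endpoints of each (follower, followee) pair with pseudonyms."""
    return [
        (pseudonymize_uid(u, secret_key), pseudonymize_uid(v, secret_key))
        for u, v in edges
    ]


# Example with made-up UIDs; the key must stay private to the publishing group.
key = b"replace-with-a-random-secret"
edges = [("1001", "1002"), ("1002", "1003")]
anonymous = anonymize_edges(edges, key)
```

Because the same key yields the same pseudonym for a given UID, the social graph stays connected after the mapping, while the original UIDs cannot be recomputed without the key.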