
Crowd Crawling: Towards Collaborative Data Collection for Large-scale Online Social Networks


Presentation Transcript


  1. Crowd Crawling: Towards Collaborative Data Collection for Large-scale Online Social Networks Cong Ding, Yang Chen*, and Xiaoming Fu University of Göttingen *Duke University

  2. Significance of social network data crawling • Understanding user behaviors • Improving SNS architectures • Handling privacy/security issues • and so on...

  3. Current data collection methods (1) • ISP-based measurement [Schneider IMC’09] Only ISPs can do this

  4. Current data collection methods (2) • Cooperating with SNS companies [Yang IMC’11] Most research groups do not have this opportunity

  5. Current data collection methods (3) • Crawling by a single group (which then shares the data with others) [Gjoka INFOCOM’10] Suffers from request rate limiting

  6. Shortcomings of crawling by a single group • Wastes computing and network resources • Introduces overhead for service providers (and may lead to stricter rate limiting) • Lacks ground truth for the research community

  7. A new thought Why not collect data collaboratively?

  8. System overview Coordinator Crawlers

  9. System design • Fetching UIDs (BFS, etc.) • Handling crawling failure (timeout) • Bypassing request rate limiting (massive IP addresses) • Data fidelity (redundant crawling)
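The first two design points on the slide above (fetching UIDs in BFS order and handling crawl failures by timeout) can be sketched as follows. This is a minimal, hypothetical illustration and not the authors' implementation: the class name `Coordinator`, its methods, and the `timeout_s` parameter are all assumptions made for the example. Rate-limit bypassing and redundant crawling are not covered.

```python
import time
from collections import deque


class Coordinator:
    """Hypothetical coordinator: hands out UIDs in BFS order and
    re-queues any UID whose crawler did not report back in time."""

    def __init__(self, seed_uids, timeout_s=60.0, clock=time.monotonic):
        self.frontier = deque(seed_uids)  # FIFO queue, so expansion is BFS
        self.seen = set(seed_uids)        # every UID ever enqueued
        self.in_flight = {}               # uid -> deadline of the running task
        self.done = set()                 # UIDs crawled successfully
        self.timeout_s = timeout_s        # assumed per-task timeout
        self.clock = clock                # injectable clock, useful for tests

    def next_task(self):
        """Return the next UID for a crawler, or None if nothing is pending."""
        self._requeue_expired()
        if not self.frontier:
            return None
        uid = self.frontier.popleft()
        self.in_flight[uid] = self.clock() + self.timeout_s
        return uid

    def report(self, uid, neighbor_uids):
        """Record a finished crawl and enqueue newly discovered UIDs."""
        if uid not in self.in_flight:
            return  # late or duplicate report: ignore it
        del self.in_flight[uid]
        self.done.add(uid)
        for n in neighbor_uids:
            if n not in self.seen:
                self.seen.add(n)
                self.frontier.append(n)

    def _requeue_expired(self):
        """Treat tasks past their deadline as failed and queue them again."""
        now = self.clock()
        expired = [u for u, deadline in self.in_flight.items() if deadline <= now]
        for u in expired:
            del self.in_flight[u]
            self.frontier.append(u)
```

In this sketch a crawler would repeatedly call `next_task()`, fetch that user's data, and call `report(uid, neighbors)`; a crawler that goes silent simply lets its task expire, after which the UID is handed to another crawler.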

  10. Implementation • A proof-of-concept prototype (without the data fidelity part) to crawl Weibo • 472 PlanetLab servers as crawlers

  11. Evaluation • In 24 hours, we crawled 2.22M users’ data from Weibo, including user profiles, all posts, and all social connections • Comparison: • Fu et al. (PLOS ONE 2013) collected 30K users’ data in 6 days • Guo et al. (PAM 2013) collected 1M users’ data in 1 month
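To put the three figures above on a common scale, the short calculation below converts each to users per hour. It is only a rough sketch: "1 month" is assumed to be 30 days, the dictionary keys are labels chosen for this example, and the three studies did not necessarily collect the same kinds of data per user, so the ratios are indicative at best.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: the slide only says "1 month"

# (users crawled, elapsed hours), taken from the slide
crawls = {
    "crowd_crawling": (2_220_000, 24),
    "fu_et_al": (30_000, 6 * HOURS_PER_DAY),
    "guo_et_al": (1_000_000, DAYS_PER_MONTH * HOURS_PER_DAY),
}

# Users crawled per hour for each study
rates = {name: users / hours for name, (users, hours) in crawls.items()}

# How many times faster the crowd-crawling prototype is than each baseline
speedups = {
    name: rates["crowd_crawling"] / rate
    for name, rate in rates.items()
    if name != "crowd_crawling"
}

for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} users/hour")
for name, factor in speedups.items():
    print(f"speedup over {name}: {factor:.0f}x")
```

Under these assumptions the prototype's rate is 92,500 users per hour, against roughly 208 and 1,389 users per hour for the two earlier efforts.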

  12. Evaluation

  13. Evaluation

  14. Conclusion and Discussion • Data sharing may violate some providers’ terms of service • Twitter does not allow data sharing (even for research) • Weibo allows data sharing among researchers • Unlimited data sharing might cause ethical issues • The data should be anonymized • We will publish the data crawled in the evaluation
