Practical Recommendations on Crawling Online Social Networks Mehmet YILDIZ Dogus University
Outline • Social Network • Who is interested in Social Network Data • Social Network Analysis • Crawling Social Network
Social Network • Social networking has become a dominant way of using the web: bookmarking, sharing, and blogging sites multiply every day. • In November 2010, Facebook, the most popular online social network (OSN), counted more than 500 million members. • The total combined membership of the top five OSNs (Facebook, QQ, Myspace, Orkut, Twitter) exceeded 1 billion users. • To put this number into context, the population of OSN users is almost 20% of the world population and more than 50% of the world’s Internet users.
Social Network • Users worldwide currently spend over 110 billion minutes on social media sites per month. • Facebook is the second most visited website on the Internet (second only to Google), with each user spending an average of 30 minutes per day on the site. • Clearly, OSNs in general, and Facebook in particular, have become an important phenomenon on the Internet.
Social Network Analysis • OSNs are of interest to several different communities. • For example, sociologists employ them as a venue for collecting relational data and studying online human behavior. • Marketers, by contrast, seek to exploit information about OSNs in the design of viral marketing strategies. • From an engineering perspective, understanding OSNs can enable the design of better networked systems.
How about OSN Data? • The complete dataset is typically unavailable to researchers. • Most OSN operators are unwilling to share their data even in an anonymized form, primarily due to privacy concerns. • Furthermore, the large size and access limitations of most OSN services (e.g., login requirements, limited view, API query limits) make it difficult or nearly impossible to fully cover the social graph of an OSN. • Instead, it is desirable to obtain and use a small but representative sample.
Social Network • The goal of this paper is to provide a framework for obtaining an asymptotically uniform sample (or one that can be systematically reweighted to approach uniformity) of OSN users by crawling the social graph. • Such a sample makes it possible to estimate any user property, as well as some topological properties. • The paper also aims to provide practical recommendations for appropriately implementing the framework, including: • the choice of crawling technique; • the use of online convergence diagnostics; • the implementation of high-performance crawlers.
Crawling Social Network
Crawling Social Network • We then apply our framework to an important case study: Facebook. • More specifically, we make three contributions, starting with a comparison of graph-crawling techniques in terms of sampling bias and efficiency. • First, we consider Breadth-First-Search (BFS), the most widely used technique for measuring OSNs, including Facebook. • BFS sampling is known to introduce a bias towards high-degree nodes, which is highly non-trivial to characterize analytically or to correct.
Breadth-First-Search (BFS)
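A BFS crawl with a fixed budget can be sketched as follows. This is only an illustrative sketch, not the crawler used in the paper: `get_friends` is a hypothetical callback that returns the friend list of a user, and the toy graph stands in for a real OSN.

```python
from collections import deque

def bfs_crawl(get_friends, seed, budget):
    """Visit up to `budget` nodes in breadth-first order, starting at `seed`.

    `get_friends(v)` is a hypothetical callback returning the neighbors of v.
    """
    queue = deque([seed])      # discovered but not yet visited, oldest first
    discovered = {seed}
    sampled = []               # nodes whose friend list has been fetched
    while queue and len(sampled) < budget:
        v = queue.popleft()    # earliest discovered, not-yet-visited node
        sampled.append(v)
        for w in get_friends(v):
            if w not in discovered:
                discovered.add(w)
                queue.append(w)
    return sampled

# Toy undirected graph (assumption, for illustration only).
toy_graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
partial = bfs_crawl(toy_graph.__getitem__, seed=0, budget=3)
```

With a small budget the crawl stops after covering only the neighborhood of the seed, which is the "dense coverage of one region" effect of an incomplete BFS.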
Crawling Social Network • Second, we consider Random Walk (RW) sampling, which also leads to a bias towards high-degree nodes, but whose bias can be quantified by Markov chain analysis and corrected via appropriate re-weighting (RWRW). • Random walks model many processes: for example, the path traced by a molecule as it travels in a liquid or a gas, the search path of a foraging animal, the price of a fluctuating stock, and the financial status of a gambler. • Then, we consider the Metropolis-Hastings Random Walk (MHRW), which can directly yield a uniform stationary distribution of users. • This technique has been used in the past for P2P sampling and recently for a few OSNs, but not for Facebook.
Metropolis-Hastings Random Walk (MHRW) • Metropolis-Hastings is a method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. • This sequence can be used to approximate the distribution (i.e., to generate a histogram) or to compute an integral (such as an expected value). • Metropolis-Hastings and related algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. • For single-dimensional distributions, other methods are usually available.
Uniform Sample • Finally, we also collect a uniform sample of Facebook userIDs (UNI), selected by a rejection sampling procedure from Facebook’s 32-bit ID space, which serves as our “ground truth”. • We compare all sampling methods in terms of their bias and convergence speed. • We show that MHRW and RWRW (re-weighted random walk) are both able to collect asymptotically uniform samples, while BFS and RW result in a significant bias in practice. • We also compare the efficiency of MHRW to that of RWRW via analysis, simulation, and experimentation, and discuss their pros and cons.
Sampling Methodology • We consider OSNs whose social graph can be modeled as a graph G = (V, E), where V is a set of nodes (users) and E is a set of edges. • Assumptions: • A1: G is undirected. This is true in Facebook (its friendship relations are mutual), but in Twitter the edges are directed, which significantly changes the problem. • A2: We are interested only in the publicly available part of G. This is not a big limitation in Facebook, because all the information we collect is publicly available under default privacy settings.
Sampling Methodology (cont.) • Breadth-First Search (BFS): At each new iteration, the earliest explored but not-yet-visited node is selected next. • As this method discovers all nodes within some distance from the starting point, an incomplete BFS is likely to densely cover only some specific region of the graph. • Random Walk (RW): In the classic random walk, the next-hop node w is chosen uniformly at random among the neighbors of the current node v. That is, the probability of moving from v to w is 1/k_v if w is a neighbor of v and 0 otherwise, where k_v is the degree of v. • Re-Weighted Random Walk (RWRW): A natural next step is to crawl the network using RW, but to correct for the degree bias by an appropriate re-weighting of the measured values. This can be done using the Hansen-Hurwitz estimator, as shown in earlier work on random walks.
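The RW and RWRW steps can be sketched together. The sketch below assumes a hypothetical `get_friends` callback and a toy path graph. The re-weighting divides each measured value by the degree of the node it came from, which is the Hansen-Hurwitz idea applied to a walk that visits nodes in proportion to their degree.

```python
import random

def random_walk(get_friends, seed, steps, rng):
    """Classic random walk: next hop chosen uniformly among current neighbors."""
    v = seed
    walk = []
    for _ in range(steps):
        walk.append(v)
        v = rng.choice(get_friends(v))   # P(v -> w) = 1/k_v for each neighbor w
    return walk

def rwrw_mean(walk, value, degree):
    """Re-weighted estimate of the mean of value(v) over all users.

    Each visit is weighted by 1/degree(v) to undo the bias of a walk that
    visits node v with probability proportional to degree(v).
    """
    numerator = sum(value(v) / degree(v) for v in walk)
    denominator = sum(1.0 / degree(v) for v in walk)
    return numerator / denominator

# Toy path graph 0 - 1 - 2 (assumption, for illustration only).
toy_graph = {0: [1], 1: [0, 2], 2: [1]}
degree = lambda v: len(toy_graph[v])

walk = random_walk(toy_graph.__getitem__, seed=0, steps=4, rng=random.Random(0))
naive_mean_degree = sum(degree(v) for v in walk) / len(walk)   # biased upward
corrected_mean_degree = rwrw_mean(walk, value=degree, degree=degree)
```

On this toy graph the true mean degree is 4/3. The naive average over the walk overestimates it because the middle node is visited twice as often, while the re-weighted estimate recovers 4/3.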
Sampling Methodology (cont.) • Metropolis-Hastings Random Walk (MHRW): Instead of correcting the bias after the walk, one can appropriately modify the transition probabilities so that the walk converges to the desired uniform distribution.
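One standard way to modify the transition probabilities for a uniform target is to propose a uniformly random neighbor w of the current node v and accept the move with probability min(1, k_v / k_w), staying at v otherwise. The sketch below assumes this rule, a hypothetical `get_friends` callback, and a toy graph.

```python
import random

def mhrw(get_friends, seed, steps, rng):
    """Metropolis-Hastings random walk targeting the uniform distribution."""
    v = seed
    sample = []
    for _ in range(steps):
        sample.append(v)
        friends_v = get_friends(v)
        w = rng.choice(friends_v)            # propose a uniformly random neighbor
        k_v = len(friends_v)
        k_w = len(get_friends(w))
        if rng.random() < k_v / k_w:         # accept with probability min(1, k_v/k_w)
            v = w
        # On rejection the walk stays at v, and v is sampled again next step.
    return sample

# Toy graph: a triangle 0-1-2 with a pendant node 3 (assumption).
toy_graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
sample = mhrw(toy_graph.__getitem__, seed=0, steps=40000, rng=random.Random(1))
frequencies = {v: sample.count(v) / len(sample) for v in toy_graph}
```

Moves towards low-degree nodes are always accepted and moves towards high-degree nodes are often rejected, so the repeated samples at low-degree nodes offset the walk's natural pull towards hubs. On the toy graph each node should appear in roughly a quarter of the samples despite degrees ranging from 1 to 3.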
Ground Truth: Uniform Sample of UserIDs (UNI) • We capitalized on a unique opportunity to obtain a uniform sample of Facebook users by generating uniformly random 32-bit userIDs and polling Facebook about their existence. If the userID exists (i.e., belongs to a valid user), we keep it; otherwise we discard it. • This simple method is a textbook technique known as rejection sampling, and in general it makes it possible to sample from any distribution of interest, which in our case is the uniform distribution. • In particular, it guarantees that userIDs are selected uniformly at random from the allocated Facebook users regardless of their actual distribution in the userID space, even when the userIDs are not allocated sequentially or evenly across the userID space.
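The rejection sampling procedure can be sketched in a few lines. Here `user_exists` is a hypothetical predicate standing in for the existence check against Facebook (no real API is implied), and `id_bits` is a parameter added only so the sketch can be tried on a small ID space.

```python
import random

def uniform_user_sample(user_exists, n, rng, id_bits=32):
    """Draw n userIDs uniformly at random from the allocated IDs.

    Candidates are drawn uniformly from the whole ID space and kept only if
    `user_exists(candidate)` is true. IDs are drawn with replacement.
    """
    sample = []
    while len(sample) < n:
        candidate = rng.randrange(2 ** id_bits)   # uniform over the ID space
        if user_exists(candidate):                # keep valid IDs, discard the rest
            sample.append(candidate)
    return sample

# Toy ID space: only every 7th ID in a 16-bit space is allocated (assumption).
allocated = set(range(0, 2 ** 16, 7))
ids = uniform_user_sample(allocated.__contains__, n=200,
                          rng=random.Random(0), id_bits=16)
```

The cost of the method is driven by how sparsely the ID space is populated: the expected number of polls per accepted user is the inverse of the fraction of allocated IDs.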
Node Info
Privacy Settings
Privacy Concern
Convergence • Using Multiple Parallel Walks: • Multiple parallel walks are used to improve convergence. • If we only have one walk, the walk may get trapped in a cluster while exploring the graph, which may lead to an erroneous diagnosis of convergence. • Having multiple parallel walks reduces the probability of this happening and allows for more accurate convergence diagnostics. • An additional advantage of multiple parallel walks, from an implementation point of view, is that they can be run in parallel on different machines or in different threads on the same machine.
Convergence (cont.) • Detecting Convergence with Online Diagnostics: • Valid inferences are based on the assumption that the samples are derived from the equilibrium distribution, which holds only asymptotically. • In order to correctly diagnose when convergence to equilibrium occurs, we use standard diagnostic tests developed within the Markov Chain Monte Carlo (MCMC) literature.
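The slides do not name the diagnostic tests. As an illustration only, the sketch below assumes the Gelman-Rubin diagnostic, a standard MCMC test that compares the variance between several parallel walks with the variance within each walk. It takes one scalar per sample (for example, node degree) from each walk, and values close to 1 suggest the walks have mixed.

```python
def gelman_rubin(chains):
    """Potential scale reduction factor for m chains of equal length n.

    `chains` is a list of lists of scalars, one list per parallel walk.
    This is one common form of the statistic; other variants exist.
    """
    m = len(chains)
    n = len(chains[0])
    means = [sum(chain) / n for chain in chains]
    grand_mean = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    within = sum(
        sum((x - mu) ** 2 for x in chain) / (n - 1)
        for chain, mu in zip(chains, means)
    ) / m
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

# Two walks that agree, and two walks stuck in different regions (toy data).
mixed = gelman_rubin([[1, 2, 3, 4], [1, 2, 3, 4]])
stuck = gelman_rubin([[1, 2, 3, 4], [101, 102, 103, 104]])
```

In a crawler this check would be re-evaluated as samples arrive, and the walks stopped once the statistic falls below a chosen threshold. The threshold is a design choice that the slides do not specify.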
DATA COLLECTION • User properties of interest • Fig. 1 summarizes the information collected when visiting the “show friends” web page of a sampled user u, which we refer to as basic node information. • Name and userID • Friend List • Networks • Privacy Settings • Profiles
Crawling Process • In order to apply our methodology to real-life OSNs, we implemented high-performance distributed crawlers that explored the social graph in a systematic and efficient way.
Crawling Result • The random walks considered in this paper (RW, RWRW, and MHRW) are well known in the field of Markov Chain Monte Carlo (MCMC). • We apply and adapt these methods to Facebook for the first time, and we demonstrate that, when appropriately used, they perform remarkably well on real-world OSNs. • Our conclusion is that RWRW is more efficient than MHRW in most topologies that are likely to arise in practice.
Crawling Result • In this paper, we developed a framework for unbiased sampling of users in an OSN by crawling the social graph, and provided recommendations for its implementation in practice. • We compared several candidate techniques in terms of bias (BFS and RW were significantly biased, while MHRW and RWRW provided unbiased samples) and efficiency (we found RWRW to be the most efficient in practice, while MHRW has the advantage of providing a ready-to-use sample). • We also introduced the use of formal online convergence diagnostics. In addition, we performed an offline comparison of all crawling methods against the ground truth (obtained through uniform sampling of userIDs via rejection sampling). • We also provided guidelines for implementing high-performance crawlers for sampling OSNs. • Finally, we applied these methods to Facebook and obtained the first unbiased sample of Facebook users, which we used to characterize several key user and structural properties of Facebook.
References • Minas Gjoka, Maciej Kurant, Carter T. Butts, and Athina Markopoulou, “Practical Recommendations on Crawling Online Social Networks,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 9, October 2011.
Thank You...