Sampling Techniques for Large, Dynamic Graphs

Daniel Stutzbach (University of Oregon)
Reza Rejaie (University of Oregon)
Nick Duffield (AT&T Labs-Research)
Subhabrata Sen (AT&T Labs-Research)
Walter Willinger (AT&T Labs-Research)

Global Internet Symposium, Barcelona, Spain, April 28th, 2006

Motivation
• P2P systems are very popular in practice.
  • Collectively, they have several million simultaneous users.
  • They account for 60% of all Internet traffic [CacheLogic Research 2005].
• Measurement studies help us understand existing systems and user behavior.
• Capturing global state is often infeasible.
  • P2P systems are large and rapidly changing.
• Sampling is therefore a natural approach, and has been used in several earlier measurement studies.
• But how do we know the samples are representative?

The Problem
• We focus on sampling peer properties:
  • Peer degree
  • Link bandwidth
  • Number of shared files
  • Remaining uptime
• Sampling peer properties occurs in two steps:
  • Discover and select peers
  • Collect the measurements
• Selecting peers uniformly at random is hard.
  • Peer dynamics can introduce bias.
  • The graph topology can introduce bias.
• We examine these two problems separately.

Temporal Causes of Bias
• Define V_t as the set of peers present at time t.
• We gather samples over a measurement window of length Δ.
• The most common approach is to gather peers from the set of all peers present at any time during the window.
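
Written out, and assuming the window starts at some time t_0 (a symbol introduced here for illustration, not taken from the slides), the set described in the last bullet could be expressed as:

```latex
V_{t_0,\,t_0+\Delta} \;=\; \bigcup_{t = t_0}^{t_0+\Delta} V_t
```

Any peer that appears even briefly inside the window is a member of this union, regardless of how long it stayed.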

Example of Bias towards Short-Lived Peers
[Figure: timeline (axis: Time) showing one long-lived peer and a succession of short-lived peers]
• Consider a simple two-peer system, containing:
  • One long-lived peer
  • One rapidly changing short-lived peer
• The common approach over-selects short-lived peers.

Handling Temporal Causes of Bias
• The common approach is intuitive but incorrect.
• Sampling peers is the wrong goal.
• We want to sample peer properties.
• Therefore, v_{i,t} and v_{i,t'} are distinct, even though they come from the same peer.
• Allow sampling the same peer more than once, at different points in time.

Example of Avoiding Bias towards Short-Lived Peers
[Figure: timeline (axis: Time) showing one long-lived peer and a succession of short-lived peers]
• Allowing a peer to be re-selected solves the problem.
• The long-lived peer will be selected half the time, reflecting the actual state of the system.
• The remaining problem: how do we select a peer uniformly at random at a particular moment?
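
The two-peer example can be sketched as a small simulation. This is a toy model under stated assumptions, not the authors' simulator: time is discrete, the short-lived peer is replaced at every step, and the names `peers_present`, `sample_from_window`, and `sample_at_random_time` are hypothetical.

```python
import random

def peers_present(t):
    """Toy system (assumption): one long-lived peer plus one short-lived
    peer that is replaced by a fresh peer at every integer time step."""
    return ["long", f"short-{t}"]

def sample_from_window(window, rng):
    """Common approach: pool every peer seen during the window, then pick one."""
    pool = sorted({p for t in window for p in peers_present(t)})
    return rng.choice(pool)

def sample_at_random_time(window, rng):
    """Alternative: pick a moment in the window, then a peer present at that
    moment. The same peer may be selected again at a different time."""
    t = rng.choice(list(window))
    return rng.choice(peers_present(t))

def long_lived_fraction(sampler, window, trials, seed=0):
    """Fraction of trials in which the long-lived peer is selected."""
    rng = random.Random(seed)
    hits = sum(sampler(window, rng) == "long" for _ in range(trials))
    return hits / trials

window = range(10)  # measurement window of 10 time steps
pooled = long_lived_fraction(sample_from_window, window, 20000)
per_time = long_lived_fraction(sample_at_random_time, window, 20000)
print(f"pooled window: {pooled:.3f}, random moment: {per_time:.3f}")
```

In this toy model the pooled approach should select the long-lived peer roughly one time in eleven (it is one of eleven peers seen during the window), while sampling at a random moment should select it about half the time.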

Topological Causes of Bias
• Goal: select a peer uniformly at random at time t.
  • Begin with one peer.
  • Query peers to discover neighbors.
• Prior work uses classic graph-discovery algorithms:
  • Breadth-First Search (BFS)
  • Depth-First Search (DFS)
• Problems with these techniques:
  • Peers are correlated by their neighbor relationship.
  • Peers with higher degree are more likely to be discovered.
  • A peer can only be selected once.
• Random walks are a promising alternative.

Random Walks
• Basic idea of the random walk:
  • Select a neighbor randomly to explore.
  • Explore that neighbor and “forget” the previous peer.
• Only two pieces of state are maintained:
  • The current peer
  • The length of the walk
• A subset of the visited peers is selected for sampling.
• The basic random walk selects a peer every r steps.
  • Graph theory suggests r ≥ log(|V|).
  • Walking r steps between samples eliminates correlations.
  • Peers are selected with probability proportional to degree.
  • Peers can be selected more than once.
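
The basic random walk described above can be sketched as follows. This is a minimal illustration, not the authors' code: the five-peer graph `NEIGHBORS` and the function name `random_walk_samples` are hypothetical, and using the natural logarithm for r is an assumption, since the slide gives no base.

```python
import math
import random

# Hypothetical toy overlay: peer 0 has degree 4, every other peer has degree 2.
NEIGHBORS = {
    0: [1, 2, 3, 4],
    1: [0, 2],
    2: [0, 1],
    3: [0, 4],
    4: [0, 3],
}

def random_walk_samples(neighbors, start, num_samples, r, rng):
    """Walk the graph keeping only two pieces of state (the current peer and
    the walk length) and select the current peer every r steps."""
    current = start
    steps = 0
    samples = []
    while len(samples) < num_samples:
        # One neighbors-query: move to a random neighbor, forget the previous peer.
        current = rng.choice(neighbors[current])
        steps += 1
        if steps % r == 0:
            samples.append(current)  # the same peer may be selected repeatedly
    return samples

r = math.ceil(math.log(len(NEIGHBORS)))  # natural log is an assumption
rng = random.Random(1)
samples = random_walk_samples(NEIGHBORS, start=1, num_samples=30000, r=r, rng=rng)
hub_fraction = samples.count(0) / len(samples)
print(f"r = {r}; peer 0 selected {hub_fraction:.3f} of the time; "
      f"its share of total degree is {4 / 12:.3f}")
```

On this toy graph the high-degree peer should be selected about a third of the time rather than a fifth, which illustrates the bullet "peers are selected with probability proportional to degree".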

Variations on the Random Walk
• Fixing the degree bias (“Degree Correction”)
  • Select a candidate peer with probability [formula not preserved]
  • Pro: Should result in uniform selection of peers
  • Con: Decreases efficiency
• Improving efficiency (“Random Stroll”)
  • After the first r steps, select every peer instead of every r-th peer
  • Pro: Increases efficiency
  • Con: Introduces slight correlations
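
Both variations can be sketched as options on one walk. The acceptance probability is not preserved on the slide, so this sketch assumes a candidate is kept with probability 1/degree, which cancels a selection probability proportional to degree; that choice, the toy graph, and the name `walk` are assumptions made for illustration only.

```python
import random

# Hypothetical toy overlay: peer 0 has degree 4, every other peer has degree 2.
NEIGHBORS = {
    0: [1, 2, 3, 4],
    1: [0, 2],
    2: [0, 1],
    3: [0, 4],
    4: [0, 3],
}

def walk(neighbors, start, num_samples, r, rng,
         degree_correction=False, stroll=False):
    """Random walk with optional degree correction and random stroll.
    Returns the selected peers and the number of neighbors-queries used."""
    current = start
    steps = 0
    samples = []
    while len(samples) < num_samples:
        current = rng.choice(neighbors[current])  # one neighbors-query
        steps += 1
        if stroll:
            is_candidate = steps >= r        # every peer after the first r steps
        else:
            is_candidate = steps % r == 0    # every r-th peer
        if not is_candidate:
            continue
        if degree_correction:
            # ASSUMPTION: keep a candidate with probability 1/degree.
            if rng.random() >= 1.0 / len(neighbors[current]):
                continue
        samples.append(current)
    return samples, steps

rwdc, rwdc_queries = walk(NEIGHBORS, 1, 20000, r=2, rng=random.Random(2),
                          degree_correction=True)
rsdc, rsdc_queries = walk(NEIGHBORS, 1, 20000, r=2, rng=random.Random(3),
                          degree_correction=True, stroll=True)
rwdc_hub = rwdc.count(0) / len(rwdc)
rsdc_hub = rsdc.count(0) / len(rsdc)
print(f"RWDC: hub fraction {rwdc_hub:.3f}, {rwdc_queries} queries")
print(f"RSDC: hub fraction {rsdc_hub:.3f}, {rsdc_queries} queries")
```

Under this assumption the high-degree peer should be selected close to one time in five with either variant, and the stroll should need noticeably fewer queries for the same number of samples because it does not discard the intermediate steps.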

Evaluations
• We simulated different techniques over two types of graphs:
  • A snapshot of the Gnutella ultrapeer topology [Stutzbach 05 IMC]
  • Random graphs (with the same number of vertices and edges as the Gnutella topology)
• Metrics:
  • Bias: Is peer A more likely to be selected than peer B?
  • Correlation: If we select peer A, are we more likely to select peer B?
  • Efficiency: How easily can we collect a sample?
• Techniques:
  • Oracle (uniformly random)
  • Breadth-First Search (BFS)
  • Random Walk (RW)
  • Random Walk with Degree Correction (RWDC)
  • Random Stroll (RS)
  • Random Stroll with Degree Correction (RSDC)

Bias
• Collect k|V| samples and compare with Oracle.
• Most peers should be selected around k times.
• RSDC appears unbiased in both cases.
• RWDC performs well, but exhibits slight bias on Gnutella.
• BFS, RS, and RW are heavily biased.
[Figures 1(a) and 1(b) go here]
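
The bias check can be sketched as a per-peer selection count. This is an illustrative setup rather than the evaluation actually run: the degree table, the value of k, and the `degree_biased` sampler (which stands in for an uncorrected walk by drawing peers in proportion to degree) are all assumptions.

```python
import random
from collections import Counter

# Hypothetical degrees for a five-peer toy graph.
DEGREES = {0: 4, 1: 2, 2: 2, 3: 2, 4: 2}
PEERS = list(DEGREES)

def oracle(rng):
    """Uniformly random selection over all peers."""
    return rng.choice(PEERS)

def degree_biased(rng):
    """Stand-in for an uncorrected walk: selection proportional to degree."""
    return rng.choices(PEERS, weights=[DEGREES[p] for p in PEERS])[0]

def selection_counts(sampler, k, rng):
    """Draw k*|V| samples and count how often each peer is selected."""
    counts = Counter({p: 0 for p in PEERS})
    for _ in range(k * len(PEERS)):
        counts[sampler(rng)] += 1
    return counts

k = 2000
oracle_counts = selection_counts(oracle, k, random.Random(4))
biased_counts = selection_counts(degree_biased, k, random.Random(4))
for peer in PEERS:
    print(peer, oracle_counts[peer], biased_counts[peer])
```

With an unbiased technique every count should sit near k; with the degree-proportional sampler the degree-4 peer should be selected well above k times and the others below it.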

Correlation
• Even if unbiased, a technique may exhibit correlations.
• We define a sampling session as 1,000 consecutive samples.
• For pair (A, B), if A is selected, how often is B also selected?
• A long tail indicates correlation.
• RWDC and RSDC appear uncorrelated.
• RW and RS exhibit slight correlations.
• BFS exhibits strong correlation.
[Figures 2(a) and 2(b) go here]

Efficiency
• The basic operation is the neighbors-query.
• Efficiency is: [formula not preserved]
• BFS and RS are close to 100% efficient.
  • Unfortunately, they are also heavily biased.
• RW, RWDC, and RSDC are 2% to 8% efficient.
• RSDC is twice as efficient as RWDC (4% vs. 2%).
• However, even the inefficient techniques are O(log |V|).
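
As a rough illustration, suppose efficiency means samples collected per neighbors-query issued. That definition is an assumption, since the formula is not preserved here, and `walk_efficiency` and the value r = 25 are hypothetical. For walks without degree correction the query counts follow directly from the walk rules:

```python
def walk_efficiency(num_samples, r, stroll=False):
    """Samples per neighbors-query for a walk without degree correction,
    under the ASSUMED definition efficiency = samples / queries."""
    if stroll:
        # r steps to reach the first sample, then one query per extra sample.
        queries = r + (num_samples - 1)
    else:
        # r queries for every sample.
        queries = r * num_samples
    return num_samples / queries

rw_eff = walk_efficiency(1000, r=25)
rs_eff = walk_efficiency(1000, r=25, stroll=True)
print(f"RW: {rw_eff:.1%}, RS: {rs_eff:.1%}")
```

Under that assumption a basic walk yields one sample per r queries, while a stroll approaches one sample per query as the session grows, which is in line with the slide's observation that RS is close to 100% efficient.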

Summary of Results and Lessons Learned
• Addressing temporal causes of bias:
  • Avoid gathering a set of peers and collecting measurements in separate passes.
  • Select a peer, then collect the measurement.
  • Repeat, and allow re-selecting the same peer.
• Addressing topological causes of bias:
  • Be careful to avoid bias towards high-degree peers.
  • Consider using a random walk or random stroll with degree correction.

Ongoing Work
• This work is preliminary.
• Additional types of random walks:
  • Weighting the selection of the next hop
• Additional types of graphs:
  • Power-law
  • Small world
• We have examined temporal and topological causes of bias separately.
  • To examine them concurrently, we are creating a dynamic overlay simulator.
