Streaming and MapReduce for Graphs
Sample problems • How do we solve these problems? • finding connected components • estimating clustering coefficient • min. spanning tree (weighted) • min-cut, other partitioning • maximum matching (weighted) • random walks
Streaming Model • Stream = m elements from a universe of size n (possibly with some weights) …..(v1, 1), (v2, 2), (v2, 1), (v1, 300),…. • Vector interpretation • stream over universe [n] => vector of size n • Restrictions • Restricted memory, preferably logarithmic • Small number of passes over input, preferably constant • Fast update time • Different models • Simple: (e, w) – each element arrives only once • Cash register: multiple arrivals, i.e. updates (e, +w) arrive but all are increments • Turnstile: (e, ±w) – both positive and negative updates
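The vector interpretation above can be made concrete: a turnstile stream of (element, ±weight) updates implicitly defines a frequency vector over [n]. A minimal sketch (the stream values are illustrative, not from any real dataset):

```python
from collections import defaultdict

# Hypothetical turnstile stream over universe [n]: (element, +/-weight) updates.
stream = [(1, 1), (2, 2), (2, 1), (1, 300), (2, -3)]

freq = defaultdict(int)          # the implicit vector of size n, stored sparsely
for elem, delta in stream:
    freq[elem] += delta          # turnstile model: increments and decrements

print(dict(freq))                # element 1 ends at 301, element 2 at 0
```

In the cash-register model every delta is positive; in the simple model each element appears at most once.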
Estimating moments • stream over [n] => vector of frequencies f1,…,fn • Estimate moments Fp = Σi fi^p • To a factor (1±ε) w.p. 1−δ • (AMS) In order to (ε, δ)-estimate Fp • O(ε^−2 log(1/δ) polylog(n)) space is sufficient for 0 < p ≤ 2 • n^(1−2/p) space is necessary for p > 2
Estimating F2 • Pick a random hash function h: [n] → {+1, −1} • Maintain a counter Z; for each update (i, v) perform Z += v·h(i) • At the end estimate X = Z², since E[X] = F2 • Finally, use median of means to boost accuracy and confidence.
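The AMS F2 estimator above can be sketched as follows (a toy version: the random ±1 hash is simulated with a lazily filled table rather than a 4-wise independent family, and `num_means`/`num_medians` are illustrative parameters):

```python
import random
from statistics import median

def ams_f2(stream, num_means=10, num_medians=5, seed=0):
    """Estimate F2 = sum_i f_i^2 with the AMS sketch, median of means."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_medians):
        zs = []
        for _ in range(num_means):
            signs = {}                      # lazy random hash h: [n] -> {+1, -1}
            z = 0
            for i, v in stream:             # one pass over the updates
                if i not in signs:
                    signs[i] = rng.choice((1, -1))
                z += v * signs[i]           # Z += v * h(i)
            zs.append(z * z)                # E[Z^2] = F2
        estimates.append(sum(zs) / len(zs)) # averaging reduces variance
    return median(estimates)                # median boosts confidence

stream = [(1, 3), (2, 2), (3, 1)]           # frequencies 3, 2, 1 -> F2 = 14
print(ams_f2(stream))
```

In a real streaming implementation the hash table is replaced by a small-seed hash family, so space stays logarithmic.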
Estimating F0 • Define a hash function h: [n] → [M] • k = 1/ε² • On element (x, v) • compute h(x) and maintain v = k-th minimum hash value seen so far • Finally, output X = k·M/v
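A minimal sketch of this k-th-minimum (KMV) distinct-count estimator; the SHA-1-based hash and the choice k = 64 are illustrative assumptions, not part of the original algorithm statement:

```python
import bisect
import hashlib

def kmv_f0(stream, k=64):
    """Estimate F0 (number of distinct elements) via the k-th minimum hash."""
    M = 2 ** 32                                   # hash range [M]
    mins = []                                     # k smallest distinct hashes, sorted
    for x, _w in stream:
        h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16) % M
        pos = bisect.bisect_left(mins, h)
        if pos < len(mins) and mins[pos] == h:
            continue                              # element already seen
        bisect.insort(mins, h)
        del mins[k:]                              # keep only the k smallest
    if len(mins) < k:
        return len(mins)                          # < k distinct elements: exact
    return k * M // mins[-1]                      # X = k * M / (k-th minimum)

stream = [(i % 500, 1) for i in range(10_000)]    # 500 distinct elements
print(kmv_f0(stream))
```

The estimate concentrates around the true F0 with relative error about 1/√k, matching the k = 1/ε² choice on the slide.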
Graph Streams and Problems • Stream = edges • e1, e2, e3,…. • other variants too • Space used = O(n*polylog(n)) • Problems • Connectivity • Matching • Spanners • Clustering coefficient • Moments of degree distribution
Connectivity • Doable in O(n log n) space • keep a label L(u) with every node u • same labels indicate same component • update label information as new edge (u, v) arrives • set L(w) ← L(u) for all w with label L(v) • At the end, each connected component has the same label
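A sketch of this one-pass labeling scheme (the smaller-class-relabeling trick is an assumption added here to keep total relabeling work near O(n log n); the slide does not specify the merge order):

```python
def streaming_components(n, edge_stream):
    """One-pass connected components: keep a label L(u) per node; on edge
    (u, v), relabel every node carrying L(v) to L(u)."""
    label = list(range(n))
    members = {i: [i] for i in range(n)}     # nodes carrying each label
    for u, v in edge_stream:
        lu, lv = label[u], label[v]
        if lu == lv:
            continue                         # already in the same component
        if len(members[lu]) < len(members[lv]):
            lu, lv = lv, lu                  # relabel the smaller class
        for w in members[lv]:
            label[w] = lu                    # L(w) <- L(u) for all w with label L(v)
        members[lu].extend(members.pop(lv))
    return label

labels = streaming_components(6, [(0, 1), (2, 3), (1, 2), (4, 5)])
print(len(set(labels)))                      # number of connected components
```

On the example stream the components are {0, 1, 2, 3} and {4, 5}.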
Connectivity • Not doable in o(n) space • P is a “balanced” property if there exist a graph G and a node u such that • V1 = {v: G + (u, v) satisfies P} • V2 = {v: G + (u, v) does not satisfy P} • min(|V1|, |V2|) = Ω(n) • Any such P needs Ω(n) space
Spanners • dG(u, v) = shortest-path distance in G • Want a subgraph H = (V, E’) such that dG(u, v) ≤ dH(u, v) ≤ α·dG(u, v) • such an H is an α-spanner • Can construct a (2t−1)-spanner in O(n^(1+1/t)) space
Spanner Algorithm • Initialize H = empty • For each new edge (u, v) • if current dH(u, v) > 2t − 1, include (u, v) in H • Claim • H is a (2t − 1)-spanner • Number of edges is O(n^(1+1/t)), since H has girth > 2t • Takes time O(n) per edge, but faster algorithms exist
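A sketch of this greedy streaming spanner; the BFS distance check is the naive O(n)-per-edge version mentioned on the slide, and all function names here are illustrative:

```python
from collections import deque

def greedy_spanner(edge_stream, t):
    """Streaming (2t-1)-spanner: add edge (u, v) only if its endpoints are
    currently at distance > 2t-1 in H."""
    H = {}                                   # adjacency sets of the spanner

    def dist(u, v, limit):
        """BFS distance from u to v in H, exploring at most `limit` hops."""
        if u not in H or v not in H:
            return float("inf")
        seen, q = {u: 0}, deque([u])
        while q:
            x = q.popleft()
            if seen[x] >= limit:
                continue
            for y in H.get(x, ()):
                if y not in seen:
                    seen[y] = seen[x] + 1
                    if y == v:
                        return seen[y]
                    q.append(y)
        return seen.get(v, float("inf"))

    spanner = []
    for u, v in edge_stream:
        if dist(u, v, 2 * t - 1) > 2 * t - 1:   # no short detour exists yet
            H.setdefault(u, set()).add(v)
            H.setdefault(v, set()).add(u)
            spanner.append((u, v))
    return spanner

# On K4 with t = 2, the first three star edges already give a 3-spanner.
print(len(greedy_spanner([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)], 2)))
```

Every discarded edge has a detour of length at most 2t − 1, which is exactly the spanner guarantee.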
Counting triangles • Clustering coefficient = (#closed triplets)/(#connected triplets) • signature of community structure • Different types of signed triangles measure the “balance” of the network ( +++ or --- vs. ++- ) • Algorithms • sampling based: sparsify the graph such that it fits into memory • streaming: reduce to frequency moments • linear algebra based: reduce triangle counting to a trace estimation problem and use randomized approximations
Naïve triangle counting • For each edge (u, v), test every other vertex w for the edges (u, w) and (v, w) • Time O(mn)
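A direct sketch of the naive edge-times-vertex scan (each triangle is found once per edge, hence the final division by 3):

```python
def count_triangles_naive(vertices, edges):
    """Naive O(mn) count: for every edge (u, v), test each third vertex w."""
    E = set(frozenset(e) for e in edges)
    count = 0
    for u, v in edges:
        for w in vertices:
            if (w != u and w != v
                    and frozenset((u, w)) in E and frozenset((v, w)) in E):
                count += 1
    return count // 3                  # each triangle is counted once per edge

print(count_triangles_naive([0, 1, 2, 3], [(0, 1), (1, 2), (0, 2), (2, 3)]))
```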
Improving Exact Counting (Alon, Yuster, Zwick) • Algorithm: • Divide vertices into low and high degree according to a threshold Δ • For all low-degree vertices • check neighbor-pairs and whether they are connected • For the high-degree subgraph • use matrix multiplication to count triangles • Asymptotically the fastest algorithm, but not practical for large graphs.
AYZ triangle counting • Use threshold Δ • Time spent counting triangles with low-degree pivots = O(E·Δ) • Number of high-degree vertices ≤ 2E/Δ • Time spent in matrix multiplication = O((2E/Δ)^ω) • Total time = O(E·Δ + (2E/Δ)^ω) • By appropriate choice of Δ, minimized at O(E^(2ω/(ω+1)))
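The matrix-multiplication step rests on the identity #triangles = trace(A³)/6 for an adjacency matrix A (each triangle contributes 6 closed 3-walks). A pure-Python sketch, with cubic rather than fast matrix multiplication:

```python
def triangles_via_trace(adj):
    """Count triangles as trace(A^3) / 6 for a 0/1 adjacency matrix A."""
    n = len(adj)

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    A3 = matmul(matmul(adj, adj), adj)
    # Diagonal entry A3[i][i] counts closed 3-walks starting at i; each
    # triangle yields 6 of them (3 start points x 2 directions).
    return sum(A3[i][i] for i in range(n)) // 6

A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(triangles_via_trace(A))
```

AYZ applies this only to the high-degree subgraph, where fast matrix multiplication pays off.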
Naïve sampling • Draw r independent samples of three distinct vertices • Ti = number of triplets with i edges • Estimate T3 by the fraction of sampled triplets that are closed, scaled by the total number of triplets • For r = O(ε^−2 log(1/δ) · n³/T3), the following holds: the estimate is within (1±ε) with probability at least 1−δ • Works for dense graphs, e.g., T3 ≥ n² log n (WAW ’10)
Edge sampling • Triangle Sparsifiers • Keep each edge independently with probability p. Count the triangles in the sparsified graph and multiply by 1/p³. • If the graph has Ω(n·log^c(n)) triangles we get concentration • Proof of concentration is tricky • uses the Kim–Vu concentration result for multivariate polynomials, which have a bad Lipschitz constant but behave “well” on average • improved using a colorability result of Hajnal–Szemerédi • works for suitable p in terms of t = #triangles, Δ = max degree, d = avg degree
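The sparsifier itself is simple to state in code; a sketch (the triangle count on the sparsified graph uses a common-neighbor scan, an implementation choice not specified on the slide):

```python
import random

def sparsified_triangles(edges, p, seed=0):
    """Keep each edge independently w.p. p, count triangles in the sparsified
    graph, and scale by 1/p^3 (each triangle survives w.p. p^3)."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    nbrs = {}
    for u, v in kept:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    # Each surviving triangle is counted once per surviving edge, hence // 3.
    count = sum(len(nbrs[u] & nbrs[v]) for u, v in kept) // 3
    return count / p ** 3

# With p = 1 the estimate is exact: K4 has 4 triangles.
print(sparsified_triangles([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)], 1.0))
```

With p < 1 the estimate is unbiased, and the concentration results above control its variance when the graph has enough triangles.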
Streaming Triangle counting • Consider a pseudo-array where each element is a triplet: each edge (u, v) contributes the triplet {u, v, w} for every other vertex w • t1 = (a1, b1, c1) • Estimate F0, F1, F2 for this pseudo-array • using sketches • A triplet with i edges appears i times, so F0 = T1 + T2 + T3, F1 = T1 + 2T2 + 3T3, F2 = T1 + 4T2 + 9T3; use the relation T3 = F0 − 1.5·F1 + 0.5·F2 to estimate T3 • Number of samples grows with (T1 + T2 + T3)/T3 • Better in the incidence model • Ti = number of triplets with i edges
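The moment relation can be verified directly by materializing the pseudo-array (here with exact moments rather than sketches, purely to check the algebra):

```python
from collections import Counter

def t3_from_moments(n, edges):
    """Each edge (u, v) emits the triplet {u, v, w} for all w != u, v; a
    triplet with i edges appears i times, so T3 = F0 - 1.5*F1 + 0.5*F2."""
    counts = Counter()
    for u, v in edges:
        for w in range(n):
            if w != u and w != v:
                counts[frozenset((u, v, w))] += 1
    f0 = len(counts)                        # distinct triplets: T1 + T2 + T3
    f1 = sum(counts.values())               # total count:   T1 + 2T2 + 3T3
    f2 = sum(c * c for c in counts.values())  # second moment: T1 + 4T2 + 9T3
    return f0 - 1.5 * f1 + 0.5 * f2

print(t3_from_moments(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))
```

In the streaming algorithm F0, F1, F2 are of course not computed exactly but estimated with the sketches from the earlier slides.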
Random Walks on a stream • Naïve methods • For each step of the random walk, do a pass over the network: using space O(n), k steps need k passes • Or sample O(kn) edges, k from every node: in one pass, can do a walk of length k • Main result: • Using space Õ(n), can do k steps of the random walk using only O(k^1/2) passes • Used to approximate PageRank, conductance, etc.
Random walks • Multiple start points: sample each node w.p. p and create a w-length random walk from each in w passes • Then try to stitch these walks together • Can get stuck because • the endpoint was not in the original sample (i.e., no random walk starts there) • the endpoint was already used (i.e., cannot take independent steps) • Handling stuck nodes: • Maintain the stuck node(s) and the set of “used” start points • Take a new random sample of s edges from each of these (maybe multiple times) • Crucial step: • Whenever stuck, either the new random sample is enough to make progress, or we discover new nodes (and there are not many of them)
MapReduce • Dataflow: input key-value pairs → map → intermediate key-value pairs → group by key → reduce → output key-value pairs [slide from J. Ullman cs345A]
MapReduce formalization • Number of machines = N^(1−ε) • Memory per machine = N^(1−ε) • Total communication = N^(2−2ε) • over all rounds • MRC^k = problems that need ≤ k rounds • Each round has a 2-phase map-then-reduce structure • Ideally, want the same “total work” as the optimal sequential algorithm
MST • Suppose |E| = |V|^(1+c) • Assume #machines = |V|^(1−ε) • memory per machine = |V|^(1−ε) • Claim: • number of iterations = O(c/ε) • total work = O(m·α(m,n)·c/ε)
Back to triangle counting: curse of the last reducer • Naïve MapReduce algorithm • In the first pass, collect edge-pairs [(u, v), (v, x)] • In the second pass, count triangles • Problem • Reducers that handle high-degree vertices take a long time
Trick 1: pivoting on smallest degrees • Pivot on the smallest-degree node of the triangle • Reduces counting time to O(m^(3/2)) • Intuition: • Similar to the AYZ proof, divide the analysis by the pivot-degree threshold m^(1/2) • In the MapReduce setting, just use this trick to decide which vertices should be pivots
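The pivoting trick can be sketched as a two-phase, MapReduce-style computation (here simulated sequentially; the wedge-generation step plays the role of the map phase and the closure check the reduce phase):

```python
from collections import defaultdict
from itertools import combinations

def mr_triangles(edges):
    """Count triangles by emitting wedges only at each edge's lower-degree
    endpoint, so no reducer handles more than O(m^(1/2))-degree pivots."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    order = lambda x: (deg[x], x)            # break degree ties by vertex id
    nbrs = defaultdict(set)
    for u, v in edges:                       # orient each edge low -> high
        a, b = sorted((u, v), key=order)
        nbrs[a].add(b)
    E = set(frozenset(e) for e in edges)
    # "Map": each pivot u emits the wedges (v, w) over its higher neighbors.
    wedges = ((v, w) for u in nbrs
              for v, w in combinations(sorted(nbrs[u]), 2))
    # "Reduce": a wedge is a triangle iff its closing edge exists.
    return sum(1 for v, w in wedges if frozenset((v, w)) in E)

print(mr_triangles([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]))
```

Orienting every edge toward the higher-degree endpoint is what bounds each pivot's wedge list and avoids the last-reducer problem.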
Trick 2: Overlapping partitions • Divide vertices V = {V1, V2,…,Vt} • Vijk = Vi ∪ Vj ∪ Vk; Eijk = corresponding edges • In the first pass, partition the graph and weight each triangle so that it is counted exactly once • Run the previous algorithm on each partition in parallel • Total work done is still O(m^(3/2))
Runtimes • Note the reduction in the number of paths using Trick 1 • However, running it on MapReduce incurs overheads
Models + Bag of Algorithmic tricks • Models • streaming, semi-streaming, stream + sort, MapReduce • Algorithmic tricks • Moment estimation on a data stream • Edge sampling vs. triplet sampling • Reducing triangle counting to moment estimation • Piecing together random walks • Pivoting on the smallest degree to count triangles • Overlapping partitions to fit the graph into memory
Not covered • Streaming + dynamic: • Model in which graph edges can appear/disappear • How can we test connectivity? • Multigraph streams • How do we compute different functions of node degrees? • Streaming + sort • Can solve a number of the discussed problems in poly(log) space • Interesting only if there is an efficient way to do disk-based sort • Clustering • Are these the right computational models?