Streaming and MapReduce for Graphs
Sample problems • How do we solve these problems? • finding connected components • estimating clustering coefficient • min. spanning tree (weighted) • min-cut, other partitioning • maximum matching (weighted) • random walks
Streaming Model • Stream = m elements from a universe of size n (possibly with some weights) …..(v1, 1), (v2, 2), (v2, 1), (v1, 300),…. • Vector interpretation • stream over universe [n] => vector of size n • Restrictions • Restricted memory, preferably logarithmic • Small number of passes over input, preferably constant • Fast update time • Different models • Simple: (e, w) – each element arrives only once • Cash register: multiple arrivals, i.e. updates (e, +w) arrive but all are increments • Turnstile: (e, ±w) – both positive and negative updates
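The vector interpretation above can be made concrete: a turnstile stream of (element, ±weight) updates implicitly defines a frequency vector over [n]. A minimal sketch (the stream values are illustrative, not from any real dataset):

```python
from collections import defaultdict

# Hypothetical turnstile stream over universe [n]: (element, +/-weight) updates.
stream = [(1, 1), (2, 2), (2, 1), (1, 300), (2, -3)]

freq = defaultdict(int)          # the implicit vector of size n, stored sparsely
for elem, delta in stream:
    freq[elem] += delta          # turnstile model: increments and decrements

print(dict(freq))                # element 1 ends at 301, element 2 at 0
```

In the cash-register model every delta is positive; in the simple model each element appears at most once.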
Estimating moments • stream over [n] => vector of frequencies f1,…,fn • Estimate moments Fp = Σi fi^p • To a factor (1±ε) w.p. 1−δ • (AMS) In order to (ε, δ)-estimate Fp • O(ε^−2 log(1/δ) polylog(n)) space is sufficient for 0 < p ≤ 2 • n^(1−2/p) space is necessary for p > 2
Estimating F2 • Pick a random hash function h: [n] → {+1, −1} • Maintain a counter Z; for each update (i, v) perform Z += v·h(i) • At the end estimate X = Z², since E[X] = F2 • Finally, use median of means to boost accuracy and confidence.
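The AMS F2 estimator above can be sketched as follows (a toy version: the random ±1 hash is simulated with a lazily filled table rather than a 4-wise independent family, and `num_means`/`num_medians` are illustrative parameters):

```python
import random
from statistics import median

def ams_f2(stream, num_means=10, num_medians=5, seed=0):
    """Estimate F2 = sum_i f_i^2 with the AMS sketch, median of means."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_medians):
        zs = []
        for _ in range(num_means):
            signs = {}                      # lazy random hash h: [n] -> {+1, -1}
            z = 0
            for i, v in stream:             # one pass over the updates
                if i not in signs:
                    signs[i] = rng.choice((1, -1))
                z += v * signs[i]           # Z += v * h(i)
            zs.append(z * z)                # E[Z^2] = F2
        estimates.append(sum(zs) / len(zs)) # averaging reduces variance
    return median(estimates)                # median boosts confidence

stream = [(1, 3), (2, 2), (3, 1)]           # frequencies 3, 2, 1 -> F2 = 14
print(ams_f2(stream))
```

In a real streaming implementation the hash table is replaced by a small-seed hash family, so space stays logarithmic.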
Estimating F0 • Define a hash function h: [n] → [M] • k = 1/ε² • On element (x, v) • compute h(x) and maintain v = k-th minimum hash value seen so far • Finally, output X = k·M/v
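A minimal sketch of this k-th-minimum (KMV) distinct-count estimator; the SHA-1-based hash and the choice k = 64 are illustrative assumptions, not part of the original algorithm statement:

```python
import bisect
import hashlib

def kmv_f0(stream, k=64):
    """Estimate F0 (number of distinct elements) via the k-th minimum hash."""
    M = 2 ** 32                                   # hash range [M]
    mins = []                                     # k smallest distinct hashes, sorted
    for x, _w in stream:
        h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16) % M
        pos = bisect.bisect_left(mins, h)
        if pos < len(mins) and mins[pos] == h:
            continue                              # element already seen
        bisect.insort(mins, h)
        del mins[k:]                              # keep only the k smallest
    if len(mins) < k:
        return len(mins)                          # < k distinct elements: exact
    return k * M // mins[-1]                      # X = k * M / (k-th minimum)

stream = [(i % 500, 1) for i in range(10_000)]    # 500 distinct elements
print(kmv_f0(stream))
```

The estimate concentrates around the true F0 with relative error about 1/√k, matching the k = 1/ε² choice on the slide.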
Graph Streams and Problems • Stream = edges • e1, e2, e3,…. • other variants too • Space used = O(n*polylog(n)) • Problems • Connectivity • Matching • Spanners • Clustering coefficient • Moments of degree distribution
Connectivity • Doable in O(n log n) space • keep a label L(u) with every node u • same labels indicate same component • update label information as new edge (u, v) arrives • set L(w) ← L(u) for all w with label L(v) • At the end, each connected component has the same label
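A sketch of this one-pass labeling scheme (the smaller-class-relabeling trick is an assumption added here to keep total relabeling work near O(n log n); the slide does not specify the merge order):

```python
def streaming_components(n, edge_stream):
    """One-pass connected components: keep a label L(u) per node; on edge
    (u, v), relabel every node carrying L(v) to L(u)."""
    label = list(range(n))
    members = {i: [i] for i in range(n)}     # nodes carrying each label
    for u, v in edge_stream:
        lu, lv = label[u], label[v]
        if lu == lv:
            continue                         # already in the same component
        if len(members[lu]) < len(members[lv]):
            lu, lv = lv, lu                  # relabel the smaller class
        for w in members[lv]:
            label[w] = lu                    # L(w) <- L(u) for all w with label L(v)
        members[lu].extend(members.pop(lv))
    return label

labels = streaming_components(6, [(0, 1), (2, 3), (1, 2), (4, 5)])
print(len(set(labels)))                      # number of connected components
```

On the example stream the components are {0, 1, 2, 3} and {4, 5}.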
Connectivity • Not doable in o(n) space • P is a “balanced” property if there exist a graph G and a node u such that • V1 = {v: G + (u, v) satisfies P} • V2 = {v: G + (u, v) does not satisfy P} • min(|V1|, |V2|) = Ω(n) • Any such P needs Ω(n) space
Spanners • dG(u, v) = shortest-path distance in G • Want a subgraph H = (V, E’) such that dG(u, v) ≤ dH(u, v) ≤ α·dG(u, v) • such an H is an α-spanner • Can construct a (2t−1)-spanner in O(n^(1+1/t)) space
Spanner Algorithm • Initialize H = empty • For each new edge (u, v) • if current dH(u, v) > 2t − 1, include (u, v) in H • Claim • H is a (2t − 1)-spanner • Number of edges is O(n^(1+1/t)), since H has girth > 2t • Takes time O(n) per edge, but faster algorithms exist
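A sketch of this greedy streaming spanner; the BFS distance check is the naive O(n)-per-edge version mentioned on the slide, and all function names here are illustrative:

```python
from collections import deque

def greedy_spanner(edge_stream, t):
    """Streaming (2t-1)-spanner: add edge (u, v) only if its endpoints are
    currently at distance > 2t-1 in H."""
    H = {}                                   # adjacency sets of the spanner

    def dist(u, v, limit):
        """BFS distance from u to v in H, exploring at most `limit` hops."""
        if u not in H or v not in H:
            return float("inf")
        seen, q = {u: 0}, deque([u])
        while q:
            x = q.popleft()
            if seen[x] >= limit:
                continue
            for y in H.get(x, ()):
                if y not in seen:
                    seen[y] = seen[x] + 1
                    if y == v:
                        return seen[y]
                    q.append(y)
        return seen.get(v, float("inf"))

    spanner = []
    for u, v in edge_stream:
        if dist(u, v, 2 * t - 1) > 2 * t - 1:   # no short detour exists yet
            H.setdefault(u, set()).add(v)
            H.setdefault(v, set()).add(u)
            spanner.append((u, v))
    return spanner

# On K4 with t = 2, the first three star edges already give a 3-spanner.
print(len(greedy_spanner([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)], 2)))
```

Every discarded edge has a detour of length at most 2t − 1, which is exactly the spanner guarantee.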
Counting triangles • Clustering coefficient = (#closed triplets)/(#connected triplets) • signature of community structure • Different types of signed triangles measure the “balance” of the network ( +++ or --- vs. ++- ) • Algorithms • sampling based: sparsify the graph such that it fits into memory • streaming: reduce to frequency moments • linear algebra based: reduce triangle counting to a trace estimation problem and use randomized approximations
Naïve triangle counting • For each edge (u, v), test every other vertex w for the edges (u, w) and (v, w) • Time O(mn)
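A direct sketch of the naive edge-times-vertex scan (each triangle is found once per edge, hence the final division by 3):

```python
def count_triangles_naive(vertices, edges):
    """Naive O(mn) count: for every edge (u, v), test each third vertex w."""
    E = set(frozenset(e) for e in edges)
    count = 0
    for u, v in edges:
        for w in vertices:
            if (w != u and w != v
                    and frozenset((u, w)) in E and frozenset((v, w)) in E):
                count += 1
    return count // 3                  # each triangle is counted once per edge

print(count_triangles_naive([0, 1, 2, 3], [(0, 1), (1, 2), (0, 2), (2, 3)]))
```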
Improving Exact Counting (Alon, Yuster, Zwick) • Algorithm: • Divide vertices into low and high degree according to a threshold Δ • For all low-degree vertices • check neighbor-pairs and whether they are connected • For the high-degree subgraph • use matrix multiplication to count triangles • Asymptotically the fastest algorithm, but not practical for large graphs.
AYZ triangle counting • Use threshold Δ • Time spent counting triangles with low-degree pivots = O(E·Δ) • Number of high-degree vertices ≤ 2E/Δ • Time spent in matrix multiplication = O((2E/Δ)^ω) • Total time = O(E·Δ + (2E/Δ)^ω) • By appropriate choice of Δ, minimized at O(E^(2ω/(ω+1)))
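The matrix-multiplication step rests on the identity #triangles = trace(A³)/6 for an adjacency matrix A (each triangle contributes 6 closed 3-walks). A pure-Python sketch, with cubic rather than fast matrix multiplication:

```python
def triangles_via_trace(adj):
    """Count triangles as trace(A^3) / 6 for a 0/1 adjacency matrix A."""
    n = len(adj)

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    A3 = matmul(matmul(adj, adj), adj)
    # Diagonal entry A3[i][i] counts closed 3-walks starting at i; each
    # triangle yields 6 of them (3 start points x 2 directions).
    return sum(A3[i][i] for i in range(n)) // 6

A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(triangles_via_trace(A))
```

AYZ applies this only to the high-degree subgraph, where fast matrix multiplication pays off.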
Naïve sampling • Draw r independent samples of three distinct vertices • Ti = number of triplets with i edges • Estimate T3 by the fraction of sampled triplets that are closed, scaled by the total number of triplets • For r = O(ε^−2 log(1/δ) · n³/T3), the following holds: the estimate is within (1±ε) with probability at least 1−δ • Works for dense graphs, e.g., T3 ≥ n² log n (WAW ’10)
Edge sampling • Triangle Sparsifiers • Keep each edge independently with probability p. Count the triangles in the sparsified graph and multiply by 1/p³. • If the graph has Ω(n·log^c(n)) triangles we get concentration • Proof of concentration is tricky • uses the Kim–Vu concentration result for multivariate polynomials, which have a bad Lipschitz constant but behave “well” on average • improved using a colorability result of Hajnal–Szemerédi • works for suitable p in terms of t = #triangles, Δ = max degree, d = avg degree
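The sparsifier itself is simple to state in code; a sketch (the triangle count on the sparsified graph uses a common-neighbor scan, an implementation choice not specified on the slide):

```python
import random

def sparsified_triangles(edges, p, seed=0):
    """Keep each edge independently w.p. p, count triangles in the sparsified
    graph, and scale by 1/p^3 (each triangle survives w.p. p^3)."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    nbrs = {}
    for u, v in kept:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    # Each surviving triangle is counted once per surviving edge, hence // 3.
    count = sum(len(nbrs[u] & nbrs[v]) for u, v in kept) // 3
    return count / p ** 3

# With p = 1 the estimate is exact: K4 has 4 triangles.
print(sparsified_triangles([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)], 1.0))
```

With p < 1 the estimate is unbiased, and the concentration results above control its variance when the graph has enough triangles.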
Streaming Triangle counting • Consider a pseudo-array where each element is a triplet: each edge (u, v) contributes the triplet {u, v, w} for every other vertex w • t1 = (a1, b1, c1) • Estimate F0, F1, F2 for this pseudo-array • using sketches • A triplet with i edges appears i times, so F0 = T1 + T2 + T3, F1 = T1 + 2T2 + 3T3, F2 = T1 + 4T2 + 9T3; use the relation T3 = F0 − 1.5·F1 + 0.5·F2 to estimate T3 • Number of samples grows with (T1 + T2 + T3)/T3 • Better in the incidence model • Ti = number of triplets with i edges
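The moment relation can be verified directly by materializing the pseudo-array (here with exact moments rather than sketches, purely to check the algebra):

```python
from collections import Counter

def t3_from_moments(n, edges):
    """Each edge (u, v) emits the triplet {u, v, w} for all w != u, v; a
    triplet with i edges appears i times, so T3 = F0 - 1.5*F1 + 0.5*F2."""
    counts = Counter()
    for u, v in edges:
        for w in range(n):
            if w != u and w != v:
                counts[frozenset((u, v, w))] += 1
    f0 = len(counts)                        # distinct triplets: T1 + T2 + T3
    f1 = sum(counts.values())               # total count:   T1 + 2T2 + 3T3
    f2 = sum(c * c for c in counts.values())  # second moment: T1 + 4T2 + 9T3
    return f0 - 1.5 * f1 + 0.5 * f2

print(t3_from_moments(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))
```

In the streaming algorithm F0, F1, F2 are of course not computed exactly but estimated with the sketches from the earlier slides.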
Random Walks on a stream • Naïve methods • For each step of the random walk, do a pass over the network: using space O(n), k steps need k passes • Or sample O(kn) edges, k from every node: in one pass, can do a walk of length k • Main result: • Using space Õ(n), can do k steps of the random walk using only O(k^1/2) passes • Used to approximate PageRank, conductance, etc.
Random walks • Multiple start points: sample each node w.p. p and create a w-length random walk from each in w passes • Then try to stitch these walks together • Can get stuck because • the endpoint was not in the original sample (i.e., no random walk starts there) • the endpoint was already used (i.e., cannot take independent steps) • Handling stuck nodes: • Maintain the stuck node(s) and the set of “used” start points • Take a new random sample of s edges from each of these (maybe multiple times) • Crucial step: • Whenever stuck, either the new random sample is enough to make progress, or we discover new nodes (and there are not many of them)
MapReduce • Dataflow: input key-value pairs → map → intermediate key-value pairs → group by key → reduce → output key-value pairs [slide from J. Ullman cs345A]
MapReduce formalization • Number of machines = N^(1−ε) • Memory per machine = N^(1−ε) • Total communication = N^(2−2ε) • over all rounds • MRC^k = problems that need ≤ k rounds • Each round has a 2-phase map-then-reduce structure • Ideally, want the same “total work” as the optimal sequential algorithm
MST • Suppose |E| = |V|^(1+c) • Assume #machines = |V|^(1−ε) • memory per machine = |V|^(1−ε) • Claim: • number of iterations = O(c/ε) • total work = O(m·α(m,n)·c/ε)
Back to triangle counting: curse of the last reducer • Naïve MapReduce algorithm • In the first pass, collect edge-pairs [(u, v), (v, x)] • In the second pass, count triangles • Problem • Reducers that handle high-degree vertices take a long time
Trick 1: pivoting on smallest degrees • Pivot on the smallest-degree node of the triangle • Reduces counting time to O(m^(3/2)) • Intuition: • Similar to the AYZ proof, divide the analysis by the pivot-degree threshold m^(1/2) • In the MapReduce setting, just use this trick to decide which vertices should be pivots
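The pivoting trick can be sketched as a two-phase, MapReduce-style computation (here simulated sequentially; the wedge-generation step plays the role of the map phase and the closure check the reduce phase):

```python
from collections import defaultdict
from itertools import combinations

def mr_triangles(edges):
    """Count triangles by emitting wedges only at each edge's lower-degree
    endpoint, so no reducer handles more than O(m^(1/2))-degree pivots."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    order = lambda x: (deg[x], x)            # break degree ties by vertex id
    nbrs = defaultdict(set)
    for u, v in edges:                       # orient each edge low -> high
        a, b = sorted((u, v), key=order)
        nbrs[a].add(b)
    E = set(frozenset(e) for e in edges)
    # "Map": each pivot u emits the wedges (v, w) over its higher neighbors.
    wedges = ((v, w) for u in nbrs
              for v, w in combinations(sorted(nbrs[u]), 2))
    # "Reduce": a wedge is a triangle iff its closing edge exists.
    return sum(1 for v, w in wedges if frozenset((v, w)) in E)

print(mr_triangles([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]))
```

Orienting every edge toward the higher-degree endpoint is what bounds each pivot's wedge list and avoids the last-reducer problem.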
Trick 2: Overlapping partitions • Divide vertices V = {V1, V2,…,Vt} • Vijk = Vi ∪ Vj ∪ Vk; Eijk = corresponding edges • In the first pass, partition the graph and weight each triangle so that it is counted exactly once • Run the previous algorithm on each partition in parallel • Total work done is still O(m^(3/2))
Runtimes • Note the reduction in the number of paths using Trick 1 • However, running it on MapReduce incurs overheads
Models + Bag of Algorithmic tricks • Models • streaming, semi-streaming, stream + sort, MapReduce • Algorithmic tricks • Moment estimation on a data stream • Edge sampling vs. triplet sampling • Reducing triangle counting to moment estimation • Piecing together random walks • Pivoting on the smallest degree to count triangles • Overlapping partitions to fit the graph into memory
Not covered • Streaming + dynamic: • Model in which graph edges can appear/disappear • How can we test connectivity? • Multigraph streams • How do we compute different functions of node degrees? • Streaming + sort • Can solve a number of the discussed problems in poly(log) space • Interesting only if there is an efficient way to do disk-based sort • Clustering • Are these the right computational models?