Seminar in Communication Complexity By Michael Umansky Instructor: Ronitt Rubinfeld Lower bounds on data stream computations
Previously... • We proved 3 theorems concerning the space complexity of data stream algorithms. • Using the streaming model discussed earlier, we derived lower bounds for the MAX, MAXNEIGHBOR, MAXTOTAL and MAXPATH problems. • And now, for something completely different.
Today • In this lecture, I introduce lower bounds from communication complexity. • Trust me, they are correct. • Using these bounds and (mostly) reductions, our goal is to prove even more theorems. Theorems are good. • I'll prove 3 of them. • Starting with “Theorem 4”.
Theorem 4 • Setting: Sequence of m numbers in {1,...,n}. • Multiple occurrences are allowed. • Claim: Finding one of the k most frequent items in one pass requires Ω(n/k) space. • Moreover, random sampling yields an upper bound of O(n (log m + log n) / k). • We're going to use a blackbox to prove it.
Theorem 4 blackbox • Alon-Matias-Szegedy: Finding the most frequent number in a sequence of length m in range {1,...,n} takes Ω(n) space. • Proof outline: Reduction. Namely, we create a new stream that we can (ab)use this blackbox on. • The reduction will replace each number in the sequence with a sequence of numbers: • Each i in {1,...,n} is replaced with k·i+1,...,k·i+k. • In total, the new range contains nk numbers.
Reduction example • Our data stream is {4,5,3,2,7,3,4,5,1} in range {1,...,10} and we want to obtain the 2 most frequent numbers. • The reduction will create the numbers: • {9,10}, {11, 12}, {7, 8}, {5, 6}, {15,16}, {7, 8}, {9, 10}, {11,12}, {3, 4} • The most frequent numbers in the original sequence map to the most frequent numbers in the new sequence.
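The replacement step can be sketched in a few lines of Python. This is only an illustration of the mapping i → k·i+1,...,k·i+k on the example stream; the names `expand_stream`, `original` and `expanded` are made up for this sketch and do not come from the lecture.

```python
from collections import Counter

def expand_stream(stream, k):
    """Replace each item i with the k items k*i+1, ..., k*i+k (hypothetical helper)."""
    out = []
    for i in stream:
        out.extend(range(k * i + 1, k * i + k + 1))
    return out

original = [4, 5, 3, 2, 7, 3, 4, 5, 1]
expanded = expand_stream(original, k=2)

# Each replacement inherits the count of the item it came from,
# so the frequency ranking is preserved by the reduction.
orig_counts = Counter(original)
new_counts = Counter(expanded)
```

Comparing `orig_counts` with `new_counts` shows that the images of 3, 4 and 5 (the items that occur twice) are exactly the items that occur twice in the expanded stream.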
Proof outline • If x_i = x_j, then the sequences created by the reduction coincide. Otherwise, they are disjoint. • If x_i occurs l times in the stream, each of its k replacements occurs l times in the new stream (kl numbers in total). • It follows that finding one of the k most frequent items in one pass requires Ω(n/k) space. Running this 'algorithm' k times we get the AMS theorem. • Great success.
As for the upper bound • Reminder: a Monte-Carlo algorithm is a randomized algorithm that may err, but succeeds with high probability. • So we'll show a Monte-Carlo algorithm that meets the upper bound.
The Monte-Carlo algorithm • Before reading the stream: • Sample each number in {1,...,n} with probability 1/k. • Only keep a counter for the sampled numbers. • Read the stream normally. • Output the successfully sampled number with largest count. • With constant probability, one of the k most frequent numbers has been sampled successfully. • About n/k numbers are sampled, and each needs log n bits for its identity plus log m bits for its counter, so this requires O(n (log m + log n) / k) space. Epic win.
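A minimal Python sketch of the sampling steps listed above, under the assumption that the sample is drawn over the whole range {1,...,n} before the stream is read. The function name `sampled_most_frequent` and the `rng` parameter are invented for this sketch; it shows the idea, not a tuned implementation.

```python
import random
from collections import Counter

def sampled_most_frequent(stream, n, k, rng=None):
    """Monte-Carlo sketch: count only a random 1/k fraction of the range."""
    if rng is None:
        rng = random.Random()
    # Before reading the stream: keep each value of {1..n} with probability 1/k.
    sampled = {v for v in range(1, n + 1) if rng.random() < 1.0 / k}
    # One pass: maintain counters for the sampled values only.
    counts = Counter()
    for x in stream:
        if x in sampled:
            counts[x] += 1
    if not counts:
        return None  # no sampled value appeared; this run failed
    # Output the sampled value with the largest count.
    return max(counts, key=counts.get)
```

With k = 1 every value is sampled and the function degenerates to exact counting; for larger k it stores roughly n/k counters and can miss all of the frequent items, which is the Monte-Carlo error.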
And now for something completely different • Introducing the approximate median problem (AMP). • Reminder: The median is the value which separates the higher half of the set from the lower half. • We want to approximate that. Why? Because it's cool.
This slide isn't the median problem • First, a blackbox from communication complexity. • Consider the bit-vector probing problem: • Let A have a bit sequence x of length m and B an index i. B needs to know x_i, the i-th bit of A's sequence. • But the communication is one-way only: B cannot send anything to A. • Ideas?
Blackbox cont. • Turns out there isn't a better method for A to send the i-th bit than to send the entire string to B. • So it takes Ω(m) bits of communication. • But what about randomization? • Too bad: any algorithm that succeeds in guessing x_i with probability better than (1+ε)/2 requires at least εm bits of communication.
Approximate median problem • Goal: Find a number whose rank is in the interval [m/2 – εm, m/2 + εm]. • It can be solved by a one-pass Monte-Carlo algorithm with 1/10 error probability. • Takes O(log n (log 1/ε)^2 / ε) space. • I have a truly magnificent proof of this theorem. This slideshow is too small to contain it.
AMP cont. • Motivation: We want to prove a corresponding lower bound on this problem. • How: We show that any 1-pass Las Vegas algorithm that solves the ε-AMP requires Ω(1/ε) space. • We show a reduction from the bit-vector probing problem.
AMP lower bound proof • Let B = (b_1,...,b_n) be a bit vector, followed by a query index i. • This is translated to a sequence of numbers as follows: • First, output 2j+b_j, for each j in {1,...,n}. • Then, upon getting the query, output n-i+1 copies of 0 and i+1 copies of 2(n+1).
Reduction example • B = (0,1,0,1,1,0,1,1,0,1), i=5. • The reduction maps: • 2j+b_j: [2,5,6,9,11,12,15,17,18,21] • n-i+1=6 copies of 0: [0,0,0,0,0,0] • i+1=6 copies of 22=2(n+1): [22,22,22,22,22,22] • The median of this sequence (the element of rank 11 out of 22) is 11. Its LSB is 1, which is exactly the value of b_5.
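The example can be checked mechanically. The sketch below builds the reduced stream for a bit vector and a 1-based query index, then reads the queried bit off the median. The helper names are made up for illustration, and "median" is taken to be the element of rank m/2 in sorted order, which is an assumption about the convention the slides use.

```python
def probing_to_median_stream(bits, i):
    """Map a bit vector and a 1-based query index i to the reduced stream."""
    n = len(bits)
    stream = [2 * j + b for j, b in enumerate(bits, start=1)]  # 2j + b_j
    stream += [0] * (n - i + 1)                                # n-i+1 copies of 0
    stream += [2 * (n + 1)] * (i + 1)                          # i+1 copies of 2(n+1)
    return stream

def lower_median(values):
    """Element of rank m/2 (1-based) in sorted order; m is even here."""
    s = sorted(values)
    return s[len(s) // 2 - 1]

bits = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
stream = probing_to_median_stream(bits, i=5)
median = lower_median(stream)   # 11 for this example
recovered_bit = median & 1      # least significant bit, equals b_5 = 1
```

The padding works because the n-i+1 zeros push the i-th encoded value 2i+b_i to rank n+1, which is exactly half of the stream length 2n+2.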
AMP proof cont. • It is easily verified that the least significant bit of the median of this sequence (the element of rank n+1) is the value of b_i (that is, the bit we seek). • Choose ε ≈ 1/2n, small enough that εm < 1 for the m = 2n+2 numbers in the “reduced” stream. Then only rank m/2 is allowed, so the ε-approximate median is the exact median. • Therefore any one-pass algorithm that uses fewer than roughly 1/2ε ≈ n bits of memory can be used...
AMP proof cont. • … to derive a communication protocol that requires fewer than n bits to be communicated from A to B in solving bit vector probing. • But every protocol that solves bit vector probing must communicate at least n bits. • Contradiction. Quod erat demonstrandum.
Corollary • What's the point I've been trying to make? • Randomization can sometimes reduce space complexity significantly, at the cost of guaranteed output correctness. • Moving right along.
Some graph theory • A graph can be considered as a stream. • Example: Adjacency list. • This means some graph-theoretic problems can be approximated or solved using data stream and communication complexity techniques. • I'll address a small part of them.
Why is this good? • Suppose we can read the stream more than once (we don't have enough memory to store it but we do have access). • But the number of times we can read the stream is bounded. • What possible graph theoretic problems could we approximate with this method?
Theorem 6 • In P passes, the following problems on an n-node graph take Ω(n / P) space: • Computing connected components • Computing k-edge connected components. • Computing k-vertex connected components. • Testing graph planarity. • Finding the sinks of a directed graph. • I'll prove graph connectivity.
Connected components • Proof by reduction of DISJOINT to the graph connectivity problem. Reminder: DISJOINT(x,y) returns 1 iff there exists i such that x_i = y_i = 1. • Given bit vectors A and B, construct a graph with vertices {a,b,1,...,n}. • Insert an edge (a,i) iff i is in A's vector and an edge (i,b) iff it's in B's vector. • Vertices a and b lie in the same connected component iff there exists a bit that's set in both vectors.
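A small Python sketch of this construction, assuming the two bit vectors are given as 0/1 lists of equal length. The helpers `build_edges` and `same_component` are hypothetical names; the connectivity check uses a plain union-find, which is just one convenient way to test whether a and b end up together.

```python
def build_edges(x, y):
    """Edges (a, i) for bits set in x and (i, b) for bits set in y."""
    edges = []
    for i, (xi, yi) in enumerate(zip(x, y), start=1):
        if xi:
            edges.append(("a", i))
        if yi:
            edges.append((i, "b"))
    return edges

def same_component(edges, u, v):
    """Union-find check of whether u and v are connected by the edges."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for p, q in edges:
        parent[find(p)] = find(q)
    return find(u) == find(v)

intersecting = same_component(build_edges([1, 0, 1], [0, 0, 1]), "a", "b")
disjoint = same_component(build_edges([1, 0, 0], [0, 1, 1]), "a", "b")
```

Every path from a to b has the form a, i, b, because each numbered vertex is adjacent only to a and b. So `same_component` answers the intersection question exactly.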
Connectivity cont. • From communication complexity, we know that every DISJOINT-solving protocol sends Ω(n) bits. • A P-pass algorithm with s bits of memory gives a protocol in which A and B pass the memory contents back and forth, O(P·s) bits in total. • So P·s = Ω(n), i.e. the algorithm must use Ω(n / P) space. This is a total cheating hack by the way. Blame HRR. • QED anyway. • That's all folks!