Lower Bounds for Read/Write Streams Paul Beame Joint work with Trinh Huynh (Dang-Trinh Huynh-Ngoc) University of Washington
Data stream Algorithms • Many huge successes • No need to remind people at this workshop! • Some problems provably hard • E.g. frequency moments $F_k$, $k > 2$, require space $\Omega(n^{1-2/k})$ [Bar-Yossef-Jayram-Kumar-Sivakumar 02], [Chakrabarti-Khot-Sun 03]
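For readers who want the quantity pinned down: the k-th frequency moment of a stream is the sum, over distinct items, of the item's frequency raised to the power k. The sketch below computes it exactly with a counter. The function name `frequency_moment` is an illustrative choice, and the code uses memory proportional to the number of distinct items, which is exactly the cost that streaming algorithms try to avoid.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k: sum over distinct items of (frequency ** k)."""
    counts = Counter(stream)  # one counter per distinct item (not small-space)
    return sum(f ** k for f in counts.values())

if __name__ == "__main__":
    # 'a' occurs twice, 'b' and 'c' once each.
    print(frequency_moment("abca", 2))
```

With k = 0 this counts distinct items and with k = 1 it returns the stream length.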
Beyond Data Streams • Disk storage can be huge • Can stream data to/from disks in real time • Sequential access hides latency • Motivates multipass streams • Analyzed by similar methods to single pass • Why stop at a single copy? • Working with more than one copy at once may make computations easier • Why stream the data onto disks exactly as read? • Can make modifications to data while writing
Read/write streams model • Key parameters: space, #passes = reversals • Assume #streams is constant • Introduced by [Grohe-Schweikardt 05] [Figure: a memory unit attached to several read/write streams of bits stored on disks]
Read/write streams model • Much more powerful than data-stream model • Sort with $O(\log n)$ passes, $O(\log n)$ space, 3 streams • MergeSort • Exactly compute any frequency moment • Data-stream requires passes × space = $\Omega(n)$ • $\Theta(\log n)$ passes, $O(1)$ space gives all of LOGSPACE [Hernich-Schweikardt 08] What can be computed in $o(\log n)$ passes + small space?
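The MergeSort bullet can be illustrated with a toy simulation. This is a minimal sketch, assuming a balanced three-stream merge: Python lists stand in for the streams `a`, `b`, `c` and are only read or appended to sequentially. The name `stream_merge_sort` and the notion of a "round" (one distribution scan plus one merge scan) are illustrative choices, not taken from the slides.

```python
def stream_merge_sort(data):
    """Sort using three sequential streams; returns (sorted list, number of rounds)."""
    a = list(data)          # stream a holds the current sequence
    n = len(a)
    run = 1                 # current length of the sorted runs on stream a
    rounds = 0
    while run < n:
        # Distribution scan: read a left to right, write runs alternately to b and c.
        b, c = [], []
        for pos, x in enumerate(a):
            (b if (pos // run) % 2 == 0 else c).append(x)
        # Merge scan: read b and c left to right, write merged runs back to a.
        a = []
        i = j = 0
        while i < len(b) or j < len(c):
            i_end = min(i + run, len(b))
            j_end = min(j + run, len(c))
            while i < i_end and j < j_end:
                if b[i] <= c[j]:
                    a.append(b[i]); i += 1
                else:
                    a.append(c[j]); j += 1
            while i < i_end:
                a.append(b[i]); i += 1
            while j < j_end:
                a.append(c[j]); j += 1
        run *= 2
        rounds += 1
    return a, rounds
```

Each round doubles the run length, so the number of rounds is about log2(n); besides the streams, the simulation keeps only a few positions and the two elements being compared.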
Previous lower bounds for R/W streams • In $o(\log n)$ passes need $\Omega(n^{1-\varepsilon})$ space to • Sort $n$ numbers [Grohe-Schweikardt 05] • Test set-equality $A=B$, multiset equality, XQuery, XPath [Grohe-Hernich-Schweikardt 06] • Same lower bounds apply for randomized algorithms with one-sided error [Grohe-Hernich-Schweikardt 06]
Previous lower bounds for R/W streams • Lower bounds for general randomness and two-sided error: • In $o(\log n / \log\log n)$ passes, need $\Omega(n^{1-\varepsilon})$ space to: • Approximate $F_\infty^*$ within factor 2 • Find Empty-Join, XQuery/XPath-Filtering etc. [B-Jayram-Rudra 07] What about approximating frequency moments $F_k$ for $k > 2$?
Our Main Result Theorem: Any randomized R/W-stream algorithm using $o(\log n)$ passes needs $\Omega(n^{1-4/k-\varepsilon})$ space to 2-approximate $F_k$ • Implies polynomial space for $k>4$ • Compare with: $\Theta(n^{1-2/k})$ on data streams R/W streams with $o(\log n)$ passes don’t help much for approximating frequency moments.
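As a quick worked comparison of the two exponents (the value k = 8 is chosen here only for illustration):

```latex
1 - \tfrac{4}{k} - \varepsilon > 0 \iff k > \tfrac{4}{1-\varepsilon},
\qquad
k = 8:\quad
\Omega\!\left(n^{1/2-\varepsilon}\right)\ \text{(R/W streams, } o(\log n) \text{ passes)}
\quad\text{vs.}\quad
\Theta\!\left(n^{3/4}\right)\ \text{(data streams)}.
```

So for any fixed k > 4 the exponent is a positive constant once ε is taken small enough, which is the "polynomial space" consequence stated in the theorem.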
[Alon-Matias-Szegedy 96] approach to lower bounding $F_k$ in data streams • Reduce testing $t$-party set-disjointness to $F_k$ Easy! • Simulate any data-stream algorithm by a multi-party number-in-hand communication game Trivial! • Apply $\Omega(n/t)$ communication lower bound on $t$-party set-disjointness [AMS 96, Saks-Sun 02, Bar-Yossef-Jayram-Kumar-Sivakumar 02, Chakrabarti-Khot-Sun 03, Grönemeier 09] (tight!) Solved easily by R/W streams! Fails for R/W streams! Cannot be applied to R/W streams!
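Promise Set-Disjointness (DISJ)
$$\mathrm{DISJ}_{n,t}(x_1,\ldots,x_t) = \begin{cases} 0 & \text{if } x_1,\ldots,x_t \text{ are pairwise disjoint} \\ 1 & \text{if } \exists a \text{ s.t. } a \in x_i \text{ for every } i \\ \text{undefined} & \text{otherwise} \end{cases}$$
[Figure: five sets $x_1, x_2, x_3, x_4, x_5$] • $t$-party NIH communication: $\Omega(n/t)$ • Approximating $F_k$ $\Rightarrow$ testing $\mathrm{DISJ}_{n,t}$ for $t \approx n^{1/k}$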
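The link between approximating the frequency moment and testing disjointness can be sketched as follows. This is a hedged illustration of the standard reduction, not code from the talk: the sets are written one after another as a stream; if they are pairwise disjoint the moment is at most n, and if one element lies in all t sets the moment is at least t^k. The names `disj` and `disj_via_fk`, the `estimate` parameter (any factor-2 approximation; it defaults to the exact value), and the requirement t^k > 4n are assumptions of this sketch.

```python
from collections import Counter

def disj(sets):
    """Promise DISJ: 0 if pairwise disjoint, 1 if some element is in every set."""
    if set.intersection(*sets):
        return 1
    if len(set.union(*sets)) == sum(len(s) for s in sets):
        return 0
    raise ValueError("input violates the promise")

def exact_fk(stream, k):
    return sum(f ** k for f in Counter(stream).values())

def disj_via_fk(sets, n, k, estimate=exact_fk):
    """Decide promise DISJ over universe range(n) from a factor-2 estimate of F_k."""
    t = len(sets)
    if t ** k <= 4 * n:
        raise ValueError("sketch needs t**k > 4*n to separate the two cases")
    stream = [a for s in sets for a in s]
    # Disjoint case: F_k <= n, so a factor-2 estimate is at most 2n.
    # Common-element case: F_k >= t**k > 4n, so the estimate exceeds 2n.
    return 1 if estimate(stream, k) > 2 * n else 0
```

The threshold 2n sits between the two cases for any estimate within a factor 2 of the true value.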
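R/W streams easily solve $\mathrm{DISJ}_{n,t}$ • Testing $\mathrm{DISJ}_{n,t}$ with 2 streams, 3 passes, $O(\log n)$ space • Input: $x_1,x_2,\ldots,x_t \in \{0,1\}^n$ [Figure: stream 1 holds $x_1\ x_2 \cdots x_{t-1}\ x_t$; stream 2 holds the shifted copy $x_2 \cdots x_{t-1}\ x_t\ x_1$]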
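A minimal simulation of this idea, assuming the three passes are: copy all blocks except the first onto the second stream, append the first block to it, then scan both streams in lockstep. The exact pass structure on the slide is not recoverable from the figure, so this breakdown and the name `disj_two_streams` are illustrative assumptions.

```python
def disj_two_streams(blocks):
    """blocks: t bit-lists of equal length n, satisfying the DISJ promise."""
    n = len(blocks[0])
    stream1 = [bit for block in blocks for bit in block]   # x_1 x_2 ... x_t
    stream2 = []
    # Pass 1: skip x_1 and copy x_2 ... x_t onto stream 2.
    for pos, bit in enumerate(stream1):
        if pos >= n:
            stream2.append(bit)
    # Pass 2: re-read stream 1 and append x_1, giving x_2 ... x_t x_1.
    for pos, bit in enumerate(stream1):
        if pos < n:
            stream2.append(bit)
    # Pass 3: lockstep scan compares x_i with x_{i+1} coordinate by coordinate.
    for a, b in zip(stream1, stream2):
        if a == 1 and b == 1:
            return 1
    return 0
```

Under the promise, any two blocks sharing a coordinate means all of them do, so comparing neighbouring blocks suffices; the only working memory is a position counter.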
How to prove lower bounds in R/W streams? • Lower bounds [GS05], [GHS06], [BJR07] for R/W streams don’t use [AMS96] outline • Introduce permuted 2-party versions of problems • Employ ad-hoc combinatorial arguments We take a more general approach related to [AMS96] directly using NIH comm. complexity
Our approach to lower bound $F_k$: R/W streams algorithm for $t$-party permuted-DISJ on input size $n$ $\Rightarrow$ number-in-hand communication protocol for $t$-party DISJ on input size $n/t^2$
[Alon-Matias-Szegedy 96]’s approach to lower bound $F_k$ in data streams vs. our approach to lower bound $F_k$ in R/W streams • [AMS96]: • Reduce testing $t$-party set-disjointness to $F_k$ Easy! • Simulate data-stream algorithms by a multi-party number-in-hand communication game Apply our simulation • Apply communication lower bound on $t$-party set-disjointness [AMS96, SS02, B-YJKS02, CKS03, G09] (tight!) • Ours: • 1. Reduce testing permuted $t$-party DISJ to $F_k$ • 2. Simulate R/W streams for permuted DISJ by NIH comm. for DISJ on slightly smaller input size
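Segmenting $\mathrm{DISJ}_{n,t}$ • Input: $x_1,x_2,\ldots,x_t \in \{0,1\}^n$ • View $\mathrm{DISJ}_{n,t}$ as an OR of $m$ subproblems $\mathrm{DISJ}_{n/m,t}$ [Figure: each of $x_1, x_2, \ldots, x_{t-1}, x_t$ is cut into segments $1, 2, \ldots, m$, each of length $n/m$]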
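Permuted DISJ • Fix permutations $\pi_1,\pi_2,\ldots,\pi_t$ on $[m]$ • Permuted-DISJ$_{n,m,t}$: the input is $\pi_1(x_1), \pi_2(x_2), \ldots, \pi_t(x_t)$, where $\pi_i(x_i)$ lists the segments of $x_i$ in the order $\pi_i(1), \pi_i(2), \ldots, \pi_i(m)$ • View Permuted-DISJ$_{n,m,t}$ as an OR of $m$ subproblems $\mathrm{DISJ}_{n/m,t}$ [Figure: the $m$ segments of each $\pi_i(x_i)$, each of length $n/m$, with the segments of one subproblem scattered to different positions in different blocks]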
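The segmenting and permuting steps can be sketched in a few lines. This is an illustrative encoding only: bit-vectors are Python lists, each segment is stored together with its subproblem label so the OR over subproblems is easy to evaluate, and the helper names (`segments`, `permuted_stream`, `permuted_disj`) are hypothetical. In the setting of the slides the permutations are fixed in advance, so explicit labels would not need to appear in the input.

```python
def segments(x, m):
    """Cut bit-vector x (length divisible by m) into m segments of equal length."""
    size = len(x) // m
    return [x[j * size:(j + 1) * size] for j in range(m)]

def permuted_stream(xs, perms, m):
    """Block i lists the segments of xs[i] in the order given by perms[i]."""
    out = []
    for x, perm in zip(xs, perms):
        segs = segments(x, m)
        out.append([(j, segs[j]) for j in perm])
    return out

def disj_bits(vectors):
    """1 if some coordinate is 1 in every vector, else 0 (promise assumed)."""
    return int(any(all(bits) for bits in zip(*vectors)))

def permuted_disj(blocks, m):
    """OR over the m subproblems, each collecting the segments with the same label."""
    by_label = {j: [] for j in range(m)}
    for block in blocks:
        for j, seg in block:
            by_label[j].append(seg)
    return int(any(disj_bits(by_label[j]) for j in range(m)))
```

Whatever permutations are used, the answer agrees with plain DISJ on the unpermuted vectors; only the layout that a streaming algorithm has to cope with changes.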
Why is permuted-DISJ hard? • Intuitively, to solve a subproblem (e.g. blue), we need to compare at least two blue segments • Need to compare at least two segments of every color • If segments are shuffled, many passes are needed [Figure: one $\mathrm{DISJ}_{n/m,t}$ subproblem whose segments sit at different positions in $\pi_i(x_i)$, $\pi_j(x_j)$, $\pi_l(x_l)$]
Permuted DISJ • Good subproblem: computation always depends only on at most one of its $t$ segments (and the memory/state) • If segments are randomly shuffled: with $o(\log m)$ passes and $t=o(m^{1/2})$ parties, 99% of the $m$ subproblems are good • Reduction idea: try to embed an ordinary $\mathrm{DISJ}_{n/m,t}$ in one of the good subproblems • Catch: which subproblems are good depends on input
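Simulation: $s$-space R/W streams algo $A$ for permuted-DISJ$_{n,m,t}$ $\Rightarrow$ NIH comm. protocol for $\mathrm{DISJ}_{n/m,t}$ • $t$ players on input $y_1,y_2,\ldots,y_t$: • Generate $m-1$ $\mathrm{DISJ}_{n/m,t}$’s that look like* $y_1,y_2,\ldots,y_t$ • Shuffle with $\pi_1,\pi_2,\ldots,\pi_t$ • $(y_1,y_2,\ldots,y_t)$ is good w.h.p. • Run $A$ on $\pi_1(x_1),\ldots,\pi_t(x_t)$ *same sizes but don’t intersect [Figure: each $y_i$ embedded as one segment of $x_i$, which is then shuffled into $\pi_i(x_i)$]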
Generating the extended input • Given $y_1,y_2,\ldots,y_t$, players • Exchange the sizes of each of the sets • $O(t \log n)$ bits • Choose random consistent reordering of the indices of each $y_1,y_2,\ldots,y_t$ • Generate $m-1$ random inputs to $\mathrm{DISJ}_{n/m,t}$ with same set sizes as $y_1,y_2,\ldots,y_t$ but that are disjoint • Place $y_1,y_2,\ldots,y_t$ in random position and then shuffle Key observation: If $y_1,y_2,\ldots,y_t$ are disjoint then this resolves the catch • After shuffling, all the subproblems look the same so the probability that the subproblem where $y_1,y_2,\ldots,y_t$ lands is good does not depend on the input
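The generation steps above can be sketched as a single-process simulation. This is a toy illustration under several assumptions that are not in the slides: sets are Python sets over `range(seg_len)`, the shared randomness is one `random.Random` object, the permutations are drawn uniformly at random, the set sizes are assumed small enough for disjoint decoys to exist, and the function name `extend_input` is hypothetical.

```python
import random

def extend_input(ys, seg_len, m, rng):
    """Embed ys among m-1 random disjoint look-alikes and shuffle each player's segments."""
    t = len(ys)
    sizes = [len(y) for y in ys]               # the sizes the players exchange
    if sum(sizes) > seg_len:
        raise ValueError("sketch assumes disjoint sets of these sizes fit in one segment")
    # Random consistent reordering: the same relabeling of indices for every y_i.
    relabel = list(range(seg_len))
    rng.shuffle(relabel)
    planted = [{relabel[a] for a in y} for y in ys]

    def random_disjoint():
        pool = rng.sample(range(seg_len), sum(sizes))
        out, start = [], 0
        for s in sizes:
            out.append(set(pool[start:start + s]))
            start += s
        return out

    slot = rng.randrange(m)                     # random position for the real input
    subproblems = [planted if j == slot else random_disjoint() for j in range(m)]
    perms = [rng.sample(range(m), m) for _ in range(t)]
    # Player i's block lists its m segments in the order perms[i], tagged by label.
    blocks = [[(j, subproblems[j][i]) for j in perms[i]] for i in range(t)]
    return blocks, slot
```

When the original sets are disjoint, every subproblem produced this way has the same distribution, which is the point of the key observation: the planted subproblem cannot be told apart from the decoys.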
Simulating R/W stream algorithm $A$ using NIH communication • As $A$ executes on input $v=\pi_1(x_1),\ldots,\pi_t(x_t)$, players know all inputs except $y_1,\ldots,y_t$ • Each player builds up a copy of a dependency graph $\sigma(v)$ for the elements of each stream so far • Using $\sigma(v)$, at each step all players either • know the next move, or • know which one player knows the next block of moves • that player communicates • know that two players’ info is needed: simulation “fails” • If subproblem $y_1,\ldots,y_t$ is good for $v$ then simulation does not fail • If players detect failure they output “not disjoint” • If input was disjoint then only 1% chance of this
Dependency Graph • Vertices: elements of each stream in each pass • Edges: from an element to the elements in the previous pass that contained heads at the same time it did [Figure: stream contents at pass 0, pass 1, ..., pass $j-1$, pass $j$, pass $j+1$, with streams scanned L to R or R to L]
Why most subproblems are good • Simple case: algorithm just makes copies of the input stream and compares them • # of subproblems with > 1 segment read at same time on single pass through the streams (L-to-R or R-to-L on each stream) • ≤ # segments appearing in the same (or reversed) order • Almost surely, for random permutations $\pi_1,\pi_2,\ldots,\pi_t$, no pair has a common subsequence or inverted subsequence longer than $2e\,m^{1/2}$ • When $t$ is $o(m^{1/2})$ the total is $o(m)$.
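The bound on common subsequences of random permutations is easy to probe empirically. The sketch below computes the longest common subsequence of two permutations by reducing it to a longest increasing subsequence (a standard reduction, not something stated on the slide) and compares it with the 2e·m^{1/2} threshold. The function names and the seeded experiment are illustrative.

```python
import math
import random
from bisect import bisect_left

def lcs_permutations(p, q):
    """LCS length of two permutations of the same set, via LIS of positions."""
    pos = {v: i for i, v in enumerate(p)}
    tails = []                       # tails[k] = smallest end of an increasing run of length k+1
    for v in q:
        k = bisect_left(tails, pos[v])
        if k == len(tails):
            tails.append(pos[v])
        else:
            tails[k] = pos[v]
    return len(tails)

def random_pair_lcs(m, seed):
    """LCS and reversed-order LCS of two seeded random permutations of range(m)."""
    rng = random.Random(seed)
    p = rng.sample(range(m), m)
    q = rng.sample(range(m), m)
    return lcs_permutations(p, q), lcs_permutations(p, q[::-1])

if __name__ == "__main__":
    m = 2500
    same, inverted = random_pair_lcs(m, seed=1)
    print(same, inverted, 2 * math.e * math.sqrt(m))
```

The second value handles the "inverted subsequence" case by reversing one of the permutations before comparing.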
Why most subproblems are good • General case:May combine information about all streams onto a single stream in single pass • What is combined may depend on the input values • Each element depends on the segments that it can reach in the input stream via the dependency graph
Why most subproblems are good • For each fixed $v$, after $p=o(\log m)$ passes: • Each element can depend on only $2^{O(p)}$ different input segments • For any one stream, the sequence of its elements’ dependencies on input segments is the interleaving of $2^{O(p)}$ monotone subsequences from $\pi_1,\pi_2,\ldots,\pi_t$ • Only $2^{O(p)}\, t\, m^{1/2}$ bad subproblems on input $v$, where $2^{O(p)} = m^{o(1)}$
Communication Cost of Simulation • For each fixed $v$, after $p=o(\log m)$ passes: • Only $2^{O(p)}\, t$ elements depend on a segment and have a neighbor that does not depend on it • Players only need to communicate when segment dependencies change • only happens $2^{O(p)}\, t$ times at cost of $O(ps)$ bits per time
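One way to read how these costs combine into a space bound, written as a back-of-envelope calculation. It assumes the total communication is the number of dependency changes times the cost per change, that the number-in-hand lower bound is applied to the subproblem of size n/m, and that m is taken slightly above t² (so that t = o(m^{1/2})) with t ≈ n^{1/k}; the constant δ is introduced here for illustration only.

```latex
\underbrace{2^{O(p)}\, t}_{\text{changes}} \cdot \underbrace{O(ps)}_{\text{bits each}}
\;\ge\; \Omega\!\left(\frac{n/m}{t}\right)
\quad\Longrightarrow\quad
s \;\ge\; \frac{n}{m\, t^{2}\, 2^{O(p)}} \;=\; \frac{n}{m^{1+o(1)}\, t^{2}},
\qquad
m = t^{2+\delta},\ t \approx n^{1/k}
\;\Longrightarrow\;
s \;\ge\; n^{\,1 - 4/k - \varepsilon}.
```

Under these assumptions the exponent 1 − 4/k arises from dividing n by roughly t⁴, two factors of t from the size of m and two from the communication overhead and the disjointness lower bound.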
Limitation of using permuted-DISJ: R/W streams algo for permuted-DISJ$_{n,m,t}$ $\Rightarrow$ NIH CC protocol for $\mathrm{DISJ}_{n/m,t}$ • Gap from data stream due to loss in input size • Most of this loss is necessary • Need $n/m = \omega(t^2)$ to use the $\Omega(n/t)$ CC lower bound for $\mathrm{DISJ}_{n/m,t}$ • Efficient R/W algo for permuted-DISJ$_{n,m,t}$ unless $m \ge t^{3/2}$ • Implies that $n$ is $\Omega(mt^2)$, which is $\Omega(t^{3.5})$ • Since we need $t \approx n^{1/k}$, the lower bound $\Omega(n/t)$ is trivial for $k < 3.5$
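The arithmetic behind the last bullet, spelled out from the two constraints stated on the slide:

```latex
n = \Omega(m\,t^{2}),\quad m \ge t^{3/2}
\;\Longrightarrow\;
n = \Omega\!\left(t^{7/2}\right);
\qquad
t \approx n^{1/k}
\;\Longrightarrow\;
n = \Omega\!\left(n^{7/(2k)}\right),
\ \text{which is possible only if } \tfrac{7}{2k} \le 1 \iff k \ge 3.5 .
```

For smaller k the required input size exceeds n, so the permuted-DISJ route gives no nontrivial bound there.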
A longest-common-subsequence problem on permutations • Algorithm for permuted-DISJ$_{n,m,t}$ follows from the following theorem: • Theorem: In any 3 permutations on $[m]$ there is a pair with longest common subsequence length $\ge m^{1/3}$. • Proof: For each $i \in [m]$ define a triple $t_i$ of integers: for each of the 3 pairs of permutations, put the length of the longest common subsequence for that pair that ends with value $i$. Can show that all $m$ triples are different. So some triple must contain a coordinate $\ge m^{1/3}$. • Tight even for 4 permutations
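The counting argument in the proof can be checked by brute force for small m. The sketch below builds the triples described in the proof and verifies, over all triples of permutations with the first one fixed to the identity (which loses no generality, since relabeling values does not change LCS lengths), that the triples are pairwise distinct and that some coordinate is at least m^{1/3}. The function names are illustrative, and the quadratic dynamic program is chosen for clarity rather than speed.

```python
from itertools import combinations, permutations, product

def lcs_ending_at(p, q):
    """ends[v] = length of the longest common subsequence of p and q ending with value v."""
    pos_q = {v: i for i, v in enumerate(q)}
    ends = {}
    for i, v in enumerate(p):
        best = 1
        for u in p[:i]:                      # u precedes v in p
            if pos_q[u] < pos_q[v]:          # and also precedes v in q
                best = max(best, ends[u] + 1)
        ends[v] = best
    return ends

def triples(perms):
    """Map each value to its triple of LCS lengths, one per pair of the 3 permutations."""
    tables = [lcs_ending_at(a, b) for a, b in combinations(perms, 2)]
    return {v: tuple(tab[v] for tab in tables) for v in perms[0]}

def check_theorem(m):
    """Exhaustively check distinctness of triples and the m**(1/3) bound for this m."""
    identity = tuple(range(m))
    for q, r in product(permutations(range(m)), repeat=2):
        tr = triples([identity, q, r])
        if len(set(tr.values())) != m:
            return False
        if max(max(t) for t in tr.values()) ** 3 < m:
            return False
    return True
```

Distinctness holds because any two values appear in the same relative order in at least two of the three permutations, so the later one has a strictly larger entry in that pair's coordinate.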
R/W stream algorithm for permuted-DISJ$_{n,m,t}$ for large $t$ [Figure: two streams, each holding $\pi_1(x_1)\ \pi_2(x_2)\ \pi_3(x_3)\ \pi_4(x_4)\ \pi_5(x_5)\ \pi_6(x_6)$] • Compare $m^{1/3}$ blocks each time • In any three permutations on $[m]$ there is a pair with longest common subsequence length $\ge m^{1/3}$. • For $t > m^{2/3}$ and any $\pi$: testing permuted-DISJ$_{n,m,t}$ with 2 streams, 3 passes, $O(\log(nmt))$ space
Open problems • Is the $\Omega(n^{1-4/k-\varepsilon})$ lower bound for R/W streams tight? • Gap from $O(n^{1-2/k})$ upper bound in data stream • Can’t use permuted-DISJ$_{n,m,t}$ to close it • Polynomial space to compute $F_k$ for $2 < k \le 4$? • Other problems on R/W streams? • $L(m,k)$: maximum LCS length that can be guaranteed between some pair in any set of $k$ permutations on $[m]$. • We show $L(m,3) = L(m,4) = m^{1/3}$ • What is $L(m,k)$ for other values of $k$? • [B-Blais-Huynh 08] $L(m,k) = m^{1/3+o(1)}$ for $k \le m^{O(1)}$