1-Pass Relative Error Lp-Sampling with Applications

Morteza Monemizadeh (TU Dortmund), David Woodruff (IBM Almaden)

Given a stream of updates (i, a) to coordinates i of an n-dimensional vector x
• $|a| < \mathrm{poly}(n)$
• a is an integer
• stream length $< \mathrm{poly}(n)$
• Output i with probability $|x_i|^p / F_p$, where $F_p = \|x\|_p^p = \sum_{i=1}^{n} |x_i|^p$
• Easy cases:
  • p = 1 and updates all of the form (i, 1) for some i. Solution: choose a random update in the stream, output the coordinate it updates [Alon, Matias, Szegedy]. Generalizes to all positive updates
  • p = 0 and there are no deletions. Solution: min-wise hashing. Hash all distinct coordinates as you see them, maintain the minimum hash and item [Broder, Charikar, Frieze, Mitzenmacher] [Indyk] [Cormode, Muthukrishnan]
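The two easy cases can be sketched as follows. This is a minimal illustration, not code from the talk: the function names are hypothetical, the p = 1 routine assumes every update has a > 0, and the p = 0 routine takes an arbitrary caller-supplied hash function in place of a min-wise independent family.

```python
import random

def sample_l1_insert_only(stream, rng):
    """Return coordinate i with probability x_i / F_1.

    Assumes every update (i, a) has a > 0 (no deletions).
    """
    total = 0
    choice = None
    for i, a in stream:
        total += a
        # Keep this update's coordinate with probability a / (running total).
        if rng.random() * total < a:
            choice = i
    return choice

def sample_l0_insert_only(stream, hash_fn):
    """Return the distinct coordinate whose hash value is smallest."""
    best = None
    for i, _ in stream:
        h = hash_fn(i)
        if best is None or h < best[0]:
            best = (h, i)
    return None if best is None else best[1]
```

Both routines keep O(1) words of state. Neither survives deletions, which is the case the rest of the talk addresses.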

Our main result
• For every $0 \le p \le 2$, there is an algorithm that fails with probability $\le n^{-100}$, and otherwise outputs an $I \in [n]$ such that for all $j \in [n]$: $\Pr[I = j] = (1 \pm \varepsilon)\,|x_j|^p / F_p$
  • Can condition on every invocation succeeding in any poly(n)-time algorithm
  • Algorithm is 1-pass, uses $\mathrm{poly}(\varepsilon^{-1} \log n)$ space and update time, and also returns $w_i = (1 \pm \varepsilon)\,|x_i|^p / F_p$ for the sampled coordinate i
  • Generalizes to 1-pass, $n^{1-2/p}\,\mathrm{poly}(\varepsilon^{-1} \log n)$ space for $p > 2$
• “Additive-error” samplers with $\Pr[I = j] = |x_j|^p / F_p \pm \varepsilon F_p$ are given
  • explicitly in [Jayram, W]
  • implicitly in [Andoni, Do Ba, Indyk, W]

Lp-sampling solves and unifies many well-studied streaming problems:

Solves Sampling with Deletions:
• [Cormode, Muthukrishnan, Rozenbaum] want importance sampling with deletions: maintain a sample i with probability $|x_i| / \|x\|_1$
  • Set p = 1 in our theorem
• [Chaudhuri, Motwani, Narasayya] ask to sample from the result of a SQL operation, e.g., self-join
  • Set p = 2 in our theorem
• [Frahling, Indyk, Sohler] study maintaining approximate range spaces and costs of Euclidean spanning trees
  • They need and obtain a routine to sample a point from a set undergoing insertions and deletions
  • Alternatively, set p = 0 in our theorem

Alternative solution to Heavy Hitters Problem for any $F_p$:
• Output all i for which $|x_i|^p > \phi F_p$
• Do not output any i for which $|x_i|^p < (\phi/2) F_p$
• Studied by Charikar, Chen, Cormode, Farach-Colton, Ganguly, Muthukrishnan, and many others
• Invoke our algorithm $\tilde{O}(1/\phi)$ times, use approximations to values
• Optimal up to $\mathrm{poly}(\varepsilon^{-1} \log n)$ factors

Solves Block Heavy Hitters: given an n x d matrix, return indices i of rows $R_i$ with $\|R_i\|_p^p > \phi \cdot \sum_j \|R_j\|_p^p$
• [Andoni, Do Ba, Indyk] study the case p = 1
• Used by [Andoni, Indyk, Krauthgamer] for constructing a small-size sketch for the Ulam metric under the edit distance
• Treat R as a big (nd)-dimensional vector
• Sample an entry (i, j) using our theorem for general p
• The probability a row i is sampled is $\|R_i\|_p^p / \sum_j \|R_j\|_p^p$, so we can recover IDs of all the heavy rows
• We do not use Cauchy random variables or Nisan’s pseudorandom generator, so this could be more practical than [ADI]

Alternative Solution to $F_k$-Estimation for any $k \ge 2$:
• Optimal up to $\mathrm{poly}(\varepsilon^{-1} \log n)$ factors
• Reduction given by [Coppersmith, Kumar]:
  • Take $r = O(n^{1-2/k})$ $L_2$-samples $w_{i_1}, \ldots, w_{i_r}$
  • In parallel estimate $F_2$, call it $F_2'$
  • Output $(F_2'/r) \cdot \sum_j w_{i_j}^{k-2}$
• Proof: second moment method
• First algorithm not to use Nisan’s pseudorandom generator
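A small offline sketch of the estimator above, under stated assumptions: the vector x is held in memory, exact L2 sampling stands in for the streaming sampler, the exact F_2 stands in for the estimate F_2', and $w_{i_j}$ is read as $|x_{i_j}|$. With that reading the estimator is unbiased, since $\mathbb{E}[F_2 |x_I|^{k-2}] = \sum_i (x_i^2/F_2)\, F_2\, |x_i|^{k-2} = F_k$. The function name is hypothetical.

```python
import random

def estimate_fk(x, k, r, rng):
    """Estimate F_k = sum_i |x_i|^k from r exact L2 samples of x."""
    f2 = sum(v * v for v in x)            # stands in for the estimate F_2'
    weights = [v * v for v in x]          # L2 sampling: Pr[i] = x_i^2 / F_2
    samples = rng.choices(range(len(x)), weights=weights, k=r)
    # (F_2' / r) * sum_j |x_{i_j}|^(k-2)
    return (f2 / r) * sum(abs(x[i]) ** (k - 2) for i in samples)
```

For example, `estimate_fk([1, 2, 3, 4], 3, 20000, random.Random(1))` should land near the true value F_3 = 100, with the spread shrinking as r grows.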

Solves Cascaded Moment Estimation:
• Given an n x d matrix A, $F_k(F_p)(A) = \sum_j \|A_j\|_p^{kp}$
• Problem initiated by [Cormode, Muthukrishnan]
  • Show $F_2(F_0)(A)$ uses $O(n^{1/2})$ space if no deletions
  • Ask about complexity for other k and p
• For any p in [0, 2], gives $O(n^{1-1/k})$ space for $F_k(F_p)(A)$
  • We get entry (i, j) with probability $|A_{i,j}|^p / \sum_{i',j'} |A_{i',j'}|^p$
  • Probability row $A_i$ is returned is $F_p(A_i) / \sum_j F_p(A_j)$
  • If 2 passes allowed, take $O(n^{1-1/k})$ samples $A_i$ in 1st pass, compute $F_p(A_i)$ in 2nd pass, and feed into $F_k$ AMS estimator
  • To get 1 pass, feed row IDs into an $O(n^{1-1/k})$-space algorithm of [Jayram, W] for estimating $F_k$ based only on item IDs
  • Algorithm is space-optimal [Jayram, W]
• Our theorem with p = 0 gives $O(n^{1/2})$ space for $F_2(F_0)(A)$ with deletions

Ok, so how does it work?

General Framework [Indyk, W]
1. Form streams by subsampling
• $S_t = \{i : |x_i| \in [\eta^{t-1}, \eta^t)\}$ for $\eta = 1 + \Theta(\varepsilon)$
• $S_t$ contributes if $|S_t|\,\eta^{pt} \ge \zeta F_p(x)$, where $\zeta = \mathrm{poly}(\varepsilon/\log n)$
  • assume $p > 0$ in talk
• Let $h : [n] \to [n]$ be a hash function
• Create $\log n$ substreams $\mathrm{Stream}_1, \mathrm{Stream}_2, \ldots, \mathrm{Stream}_{\log n}$
  • $\mathrm{Stream}_j$ is the stream restricted to updates (i, c) with $h(i) \le n/2^j$
• Suppose $2^j \approx |S_t|$. Then
  • $\mathrm{Stream}_j$ contains about 1 item of $S_t$
  • $F_p(\mathrm{Stream}_j) \approx F_p(x)/2^j$
  • $|S_t|\,\eta^{pt} \ge \zeta F_p(x)$ means $\eta^{pt} \ge \zeta F_p(\mathrm{Stream}_j)$
  • Can find the item of $S_t$ in $\mathrm{Stream}_j$ with an $F_p$-heavy hitters algorithm
• Repeat the sampling $\mathrm{poly}(\varepsilon^{-1} \log n)$ times, count number of times there was an item in $\mathrm{Stream}_j$ from $S_t$
• Use this to estimate sizes of contributing $S_t$, and $F_p(x) \approx \sum_t |S_t|\,\eta^{pt}$
2. Run heavy hitters algorithm on streams
3. Use heavy hitters to estimate contributing $S_t$
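The class definition and the subsample-and-rescale idea can be illustrated offline. This is only a sketch under simplifying assumptions: x is held in memory, fresh random coin flips stand in for a fresh hash function h in each repetition, membership in S_t is read directly from x rather than found by a heavy hitters algorithm, and the size estimate rescales the average survivor count by 2^j rather than using the counting rule of [Indyk, W]. Function names are hypothetical.

```python
import math
import random

def level_sets(x, eta):
    """Group nonzero coordinates into classes S_t = {i : |x_i| in [eta^(t-1), eta^t)}."""
    classes = {}
    for i, v in enumerate(x):
        if v != 0:
            t = math.floor(math.log(abs(v), eta)) + 1
            classes.setdefault(t, []).append(i)
    return classes

def estimate_class_size(members, j, reps, rng):
    """Subsample at rate 2^-j, average the survivor count, rescale by 2^j."""
    rate = 2.0 ** -j
    survivors = 0
    for _ in range(reps):
        # Each repetition plays the role of a fresh hash function h.
        survivors += sum(1 for _ in members if rng.random() < rate)
    return survivors / reps / rate
```

Values sitting exactly on a class boundary may land on either side because of floating-point rounding in the logarithm.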

Additive Error Sampler [Jayram, W]
• For contributing $S_t$, we also get $\mathrm{poly}(\varepsilon^{-1} \log n)$ items from the heavy hitters routine
• If the sub-sampling is sufficiently random (Nisan’s generator, min-wise independent), these items are random in $S_t$
• Since we have $(1 \pm \varepsilon)$-approximations $s'_t$ to all contributing $|S_t|$, can:
  • Choose a contributing t with probability $s'_t\,\eta^{pt} / \sum_{t'} s'_{t'}\,\eta^{pt'}$
  • Output a random heavy hitter found in $S_t$
• For item i in contributing $S_t$,
  • $\Pr[i \text{ output}] = [s'_t\,\eta^{pt} / \sum_{t'} s'_{t'}\,\eta^{pt'}] \cdot 1/|S_t| = (1 \pm \varepsilon)\,|x_i|^p / F_p$
• For item i in non-contributing $S_t$,
  • $\Pr[i \text{ output}] = 0$
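The two-stage choice (pick a class, then pick an item inside it) can be sketched as below. The inputs are assumptions for illustration: `found[t]` is taken to hold uniformly random members of S_t as recovered by the heavy hitters routine, and `size_est[t]` is the estimate s'_t. The function name is hypothetical.

```python
import random

def sample_from_classes(found, size_est, eta, p, rng):
    """Pick class t with weight s'_t * eta^(p*t), then a uniform recovered item of S_t."""
    ts = sorted(found)
    weights = [size_est[t] * eta ** (p * t) for t in ts]
    t = rng.choices(ts, weights=weights, k=1)[0]
    return rng.choice(found[t])
```

Classes that never appear in `found` (the non-contributing ones) can never be output, which is exactly where the additive error comes from.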

Relative Error in Words
• Force all classes to contribute
  • Inject additional coordinates in each class whose purpose is to make every class contribute
  • Inject just enough so that overall, $F_p$ does not change by more than a $(1+\varepsilon)$-factor
• Run [Jayram, W]-sampling on resulting vector
  • If the item sampled is an injected coordinate, forget about it
  • Repeat many times in parallel and take the first repetition that is not an injected coordinate
  • Since injected coordinates only contribute $O(\varepsilon)$ to $F_p$ mass, a small number of repetitions suffices
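The inject-then-reject idea can be sketched as follows. Assumptions: exact $L_p$ sampling on the padded vector stands in for the [Jayram, W] sampler, sequential retries stand in for parallel repetitions, and the injected values are supplied by the caller. The function name and the `max_tries` parameter are hypothetical.

```python
import random

def sample_with_injection(x, injected, p, rng, max_tries=1000):
    """Sample from x padded with dummy coordinates; reject dummy outcomes."""
    n = len(x)
    padded = list(x) + list(injected)     # dummy coordinates get indices >= n
    weights = [abs(v) ** p for v in padded]
    for _ in range(max_tries):
        i = rng.choices(range(len(padded)), weights=weights, k=1)[0]
        if i < n:                         # a real coordinate: accept it
            return i
    return None                           # every try hit a dummy: report failure
```

Conditioned on acceptance, a real coordinate i is returned with probability $|x_i|^p / F_p(x)$, because rejecting dummies only rescales the probabilities of the real coordinates.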

Some Minor Points
• Before seeing the stream, we don’t know which classes contribute, so we inject coordinates into every class
• For $S_t = \{i : |x_i| \in [\eta^{t-1}, \eta^t)\}$, inject $\Theta(\varepsilon F_p / (\eta^{pt} \cdot \#\text{classes}))$ coordinates, where # classes = $O(\varepsilon^{-1} \log n)$
• Need to know $F_p$: just guess it, verify at end of stream
• For some classes, $\Theta(\varepsilon F_p / (\eta^{pt} \cdot \#\text{classes})) < 1$, e.g. if t is very large, so we can’t inject any new coordinates
  • Find all elements in these classes and $(1 \pm \varepsilon)$-approximations to their frequencies separately using a heavy hitters algorithm
  • When sampling, either choose a heavy hitter with the appropriate probability, or select from contributing sets using [Jayram, W]

There is a Problem
• The [Jayram, W]-sampler fails with probability $\ge \mathrm{poly}(\varepsilon/\log n)$, in which case it can output any item
  • This is due to some of the subroutines of [Indyk, W] that it relies on, which only succeed with this probability
  • So the large $\mathrm{poly}(\varepsilon/\log n)$ additive error is still there
• Cannot repeat [Jayram, W] multiple times for amplification, since we get a collection of samples, and no obvious way of detecting failure
  • On the other hand, could just repeat [Indyk, W] and take the median for the simpler $F_k$-estimation problem
• Our solution:
  • Dig into the guts of the [Indyk, W] algorithm
  • Amplify success probability of subroutines to $\ge 1 - n^{-100}$

A Technical Point About [Indyk, W]
• In [Indyk, W],
  • Create $\log n$ substreams $\mathrm{Stream}_j$, where $\mathrm{Stream}_j$ includes each coordinate independently with probability $2^{-j}$
  • Can find the items of contributing $S_t$ in $\mathrm{Stream}_j$ with $F_p$-heavy hitters
  • Repeat the sampling $\mathrm{poly}(\varepsilon^{-1} \log n)$ times, observe the fraction of times there is an item in $\mathrm{Stream}_j$ from $S_t$
• Can use [Indyk, W] to estimate every $|S_t|$ since every class contributes
• Issue of misclassification
  • $S_t = \{i : |x_i| \in [\eta^{t-1}, \eta^t)\}$, and $F_p$-heavy hitters algorithm only reports approximate frequencies of items i it finds
  • If $|x_i| = \eta^t$, it may be classified into $S_t$ or $S_{t+1}$; it doesn’t matter
• Simpler solution than in [Indyk, W]
  • If item misclassified, just classify it consistently if we see it again
  • Equivalent to sampling from $x'$ with $\|x'\|_p = (1 \pm \varepsilon)\|x\|_p$
• Can ensure that with probability $\ge 1 - n^{-100}$, we obtain $s'_t = (1 \pm \varepsilon)|S_t|$ for all t

A Technical Point About [Jayram, W]
• Since we have $s'_t = (1 \pm \varepsilon)|S_t|$ for all t
  • Choose a class t with probability $s'_t\,\eta^{pt} / \sum_{t'} s'_{t'}\,\eta^{pt'}$
  • Output a random heavy hitter found in $S_t$
• How do we output a random item in $S_t$?
  • Min-wise independent hash function h
    • For each i in $S_t$, $h(i) = \min_{j \in S_t} h(j)$ with probability $(1 \pm \varepsilon)/|S_t|$
    • h can be an $O(\log 1/\varepsilon)$-wise independent hash function
  • We recover $i^*$ in $S_t$ for which $h(i^*)$ is minimum
    • Compatible with sub-sampling, where $\mathrm{Stream}_j$ is the items i for which $h(i) \le n/2^j$
• Our goal is to recover $i^*$ with probability $\ge 1 - n^{-100}$
  • We have $s'_t$, and look at the level $j^*$ where $|S_t|/2^{j^*} = \Theta(\log n)$
  • If h is $O(\log n)$-wise independent, then with probability $\ge 1 - n^{-100}$, $i^*$ is in $\mathrm{Stream}_{j^*}$
  • A worry: maybe $F_p(\mathrm{Stream}_{j^*}) \gg F_p(x)/2^{j^*}$, so the heavy hitters algorithm doesn’t work
  • Can be resolved with enough independent repetitions
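Why the minimum-hash item survives at level $j^*$ can be seen in a small simulation. This is a sketch under assumptions that differ from the talk: h is a fully random hash into [0, 1) rather than an $O(\log n)$-wise independent function into [n], substream membership is $h(i) \le 2^{-j}$, and $j^*$ is chosen as the largest level with `size_est / 2^j >= log2(n)`. The function name is hypothetical.

```python
import math
import random

def recover_min(members, size_est, n, rng):
    """Return (i*, j*, survived): the min-hash item of S_t, the chosen level,
    and whether i* is present in the substream at that level."""
    h = {i: rng.random() for i in members}        # fully random hash (assumption)
    # Level at which about log2(n) members of S_t are expected to survive.
    j_star = max(0, math.floor(math.log2(size_est / math.log2(n))))
    stream = [i for i in members if h[i] <= 2.0 ** -j_star]
    i_star = min(members, key=h.get)
    return i_star, j_star, (i_star in stream)
```

Since the substream keeps exactly the items with the smallest hash values, $i^*$ is present whenever the substream contains any member of $S_t$, and with about $\log n$ expected survivors that fails only with small probability.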

Lower Bounds
• For every $0 \le p \le 2$, there is a randomized algorithm that outputs FAIL with probability $\le n^{-100}$, and otherwise outputs an $I \in [n]$ such that for all $j \in [n]$: $\Pr[I = j] = (1 \pm \varepsilon)\,|x_j|^p / F_p$
  • Algorithm is 1-pass, uses $\mathrm{poly}(\varepsilon^{-1} \log n)$ space and time, and returns $w_i = (1 \pm \varepsilon)\,|x_i|^p / F_p$ for the sampled coordinate i
  • For $p > 2$, gives $n^{1-2/p}\,\mathrm{poly}(\varepsilon^{-1} \log n)$ space
• Can we use less space for $p > 2$?
  • Requires $\Omega(n^{1-2/p})$ space for any ε. Reduction from $L_\infty$-estimation
  • Can improve to $\Omega(n^{1-2/p} \log n)$ using augmented $L_\infty$-estimation [Jayram, W]
• Can we output FAIL with probability 0?
  • Requires $\Omega(n)$ space for any ε. Reduction from 2-party equality testing with no error
• Given that we don’t output FAIL, can we get a sampler with ε = 0?
  • Yes for 2-pass algorithms, using rejection sampling
  • 1-pass requires $\Omega(n)$ space if the algorithm outputs the corresponding probability $w_i$ (needed in many applications). Reduction from the 2-party INDEX problem

Some Open Questions
• 1-pass algorithms for $L_p$-sampling
  • If we output FAIL with probability $\le n^{-100}$, and don’t require outputting the sampled item’s probability, can we get ε = 0 with low space?
  • ε and log n factors are large. What is the optimal dependence on them?
  • Useful for $F_k$-estimation for k > 2, and other applications
• Sampling from other distributions
  • Given a vector $(x_1, \ldots, x_n)$ in a data stream, for which functions g can we sample from the distribution $\mu(i) = |g(x_i)| / \sum_j |g(x_j)|$?
  • E.g., random walks

Thank you
