An Optimal Algorithm for the Distinct Elements Problem

Daniel Kane, Jelani Nelson, David Woodruff
PODS, 2010

Problem Description
• Given a long stream of values from a universe of size n
• Each value can occur any number of times
• Count the number F0 of distinct values
• See values one at a time
• One pass over the stream
• Too expensive to store the set of distinct values
• Algorithms should:
  • Use a small amount of memory
  • Have fast update time (per-value processing time)
  • Have fast recovery time (time to report the answer)

Randomized Approximation Algorithms
Example stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4, …
• Consider algorithms that store a subset S of distinct values
• E.g., S = {3, 9, 32, 265}
• Main drawback is that S needs to be large to know if the next value is a new distinct value
• Any algorithm (whether it stores a subset of values or not) that is either deterministic or computes F0 exactly must use ≈ F0 memory
• Hence, algorithms must be randomized and settle for an approximate solution: output F ∈ [(1-ε)F0, (1+ε)F0] with good probability

Problem History
• Long sequence of work on the problem
• Flajolet and Martin introduced the problem, FOCS 1983
• Alon, Bar-Yossef, Beyer, Brody, Chakrabarti, Durand, Estan, Flajolet, Fisk, Fusy, Gandouet, Gemulla, Gibbons, P. Haas, Indyk, Jayram, Kumar, Martin, Matias, Meunier, Reinwald, Sismanis, Sivakumar, Szegedy, Tirthapura, Trevisan, Varghese, W
• Previous best algorithm:
  • O(ε^-2 log log n + log n) bits of memory and O(ε^-2) update and reporting time
• Known lower bound on the memory:
  • Ω(ε^-2 + log n)
• Our result:
  • Optimal O(ε^-2 + log n) bits of memory and O(1) update and reporting time

Previous Approaches
• Suppose we randomly hash F0 values into a hash table of 1/ε^2 buckets and keep track of the number C of non-empty buckets
• If F0 < 1/ε^2, there is a way to estimate F0 up to (1 ± ε) from C
• Problem: if F0 ≫ 1/ε^2, with high probability every bucket contains a value, so there is no information
• Solution: randomly choose S_(log n) ⊆ S_(log n - 1) ⊆ S_(log n - 2) ⊆ … ⊆ S_1 ⊆ {1, 2, …, n}, where |S_i| ≈ n/2^i
  • stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4, …
  • S_i = {1, 3, 7, 9, 265}
  • i-th substream: 3, 265, 3, 9, 7, 9, 3, …
• Run the hashing procedure on each substream
• There is an i for which the number of distinct values in the i-th substream is ≈ 1/ε^2
• The hashing procedure on the i-th substream works
• Problem: it takes (1/ε^2) log n bits of memory to keep track of this information
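The substream-plus-hashing approach above can be sketched in Python. This is a minimal illustration under stated assumptions, not the construction from the talk: the names `_hash64`, `level_of` and `estimate_distinct` are hypothetical, BLAKE2b stands in for a random hash function, the "way to estimate F0 from C" is taken to be the standard balls-and-bins inversion ln(1 - C/K) / ln(1 - 1/K), and the 0.7 fill threshold used to pick a substream is an arbitrary choice.

```python
import hashlib
import math


def _hash64(value, salt):
    """Deterministic 64-bit hash of (salt, value); stands in for a random hash function."""
    digest = hashlib.blake2b(f"{salt}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def level_of(value, max_level):
    """Largest i (capped at max_level) with value in S_i.

    S_i is modelled as "the hash has at least i trailing zero bits",
    so a value lands in S_i with probability about 2^-i and the sets are nested.
    """
    h = _hash64(value, "level")
    level = 0
    while level < max_level and (h >> level) & 1 == 0:
        level += 1
    return level


def estimate_distinct(stream, eps, n):
    """Estimate F0 with one table of about 1/eps^2 buckets per substream."""
    k = max(2, math.ceil(1 / eps ** 2))          # number of buckets
    max_level = max(1, math.ceil(math.log2(n)))  # about log n substreams
    # occupied[i] = indices of non-empty buckets for the i-th substream
    # (conceptually a k-bit bitmap, hence about k * log n bits overall).
    occupied = [set() for _ in range(max_level + 1)]
    for value in stream:
        bucket = _hash64(value, "bucket") % k
        for i in range(level_of(value, max_level) + 1):
            occupied[i].add(bucket)
    # Use the first substream whose table is not saturated (assumed threshold: 70% full).
    for i, buckets in enumerate(occupied):
        c = len(buckets)
        if c <= 0.7 * k:
            in_substream = math.log(1 - c / k) / math.log(1 - 1 / k)
            return in_substream * 2 ** i  # undo the 2^-i subsampling
    raise ValueError("every table is saturated; n is too small for this stream")
```

For example, `estimate_distinct([3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4], eps=0.05, n=2**20)` should return a value close to 11, the number of distinct values in that prefix of the example stream.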

Our Techniques
Observation:
- Have 1/ε^2 global buckets
- In each bucket we keep track of the index i of the set S_i, for the largest i for which S_i contains a value hashed to the bucket
- This gives O((1/ε^2) log log n) bits of memory
New Ideas:
- Can show that, with high probability, at every point in the stream most buckets contain roughly the same index
- We can just keep track of the offsets from this common index
- We pack the offsets into machine words and use known fast read/write algorithms for variable-length arrays to efficiently update offsets
- Occasionally we need to decrement all offsets; we can spread the work across multiple updates
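The observation above (one index per bucket instead of one table per substream) can be sketched as follows. This is a rough illustration under stated assumptions, not the algorithm from the paper: the class `MaxLevelSketch` and its methods are hypothetical names, BLAKE2b stands in for a random hash function, the estimator and its 0.7 threshold are the same assumed balls-and-bins inversion as in the simpler substream scheme, and `as_offsets` picks the median index as the common base purely for illustration. Word packing and the amortized decrement of offsets are not modelled.

```python
import hashlib
import math
from statistics import median


def _hash64(value, salt):
    """Deterministic 64-bit hash of (salt, value); stands in for a random hash function."""
    digest = hashlib.blake2b(f"{salt}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class MaxLevelSketch:
    """About 1/eps^2 buckets, each storing the largest i such that S_i hit the bucket."""

    def __init__(self, eps, n):
        self.k = max(2, math.ceil(1 / eps ** 2))
        self.max_level = max(1, math.ceil(math.log2(n)))
        self.levels = [-1] * self.k  # -1 marks a bucket no value has reached

    def update(self, value):
        h = _hash64(value, "level")
        level = 0  # number of trailing zero bits, capped at max_level
        while level < self.max_level and (h >> level) & 1 == 0:
            level += 1
        bucket = _hash64(value, "bucket") % self.k
        if level > self.levels[bucket]:
            self.levels[bucket] = level

    def nonempty_at(self, i):
        """Buckets that would be non-empty in the table of the i-th substream."""
        return sum(1 for lvl in self.levels if lvl >= i)

    def estimate(self):
        for i in range(self.max_level + 1):
            c = self.nonempty_at(i)
            if c <= 0.7 * self.k:  # assumed threshold for "not saturated"
                return 2 ** i * math.log(1 - c / self.k) / math.log(1 - 1 / self.k)
        raise ValueError("every level is saturated; n is too small for this stream")

    def as_offsets(self):
        """Return (base, offsets): each stored index as a small offset from a common base."""
        base = int(median(self.levels))
        return base, [lvl - base for lvl in self.levels]
```

A bucket is non-empty in the i-th substream exactly when its stored index is at least i, so the per-bucket indices carry the same information as the separate tables while needing only about log log n bits each. After feeding a stream through `update`, `as_offsets()` shows how the stored indices cluster around a single base value.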
