An Optimal Algorithm for the Distinct Elements Problem
Daniel Kane, Jelani Nelson, David Woodruff
PODS, 2010
Problem Description
• Given a long stream of values from a universe of size $n$
• each value can occur any number of times
• count the number $F_0$ of distinct values
• See values one at a time
• One pass over the stream
• Too expensive to store the set of distinct values
• Algorithms should:
• Use a small amount of memory
• Have fast update time (per-value processing time)
• Have fast recovery time (time to report the answer)
Randomized Approximation Algorithms
Stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4, …
• Consider algorithms that store a subset $S$ of distinct values
• E.g., $S = \{3, 9, 32, 265\}$
• Main drawback is that $S$ needs to be large to know if the next value is a new distinct value
• Any algorithm (whether it stores a subset of values or not) that is either deterministic or computes $F_0$ exactly must use $\approx F_0$ memory
• Hence, algorithms must be randomized and settle for an approximate solution: output $F \in [(1-\varepsilon)F_0, (1+\varepsilon)F_0]$ with good probability
Problem History
• Long sequence of work on the problem
• Flajolet and Martin introduced problem, FOCS 1983
• Alon, Bar-Yossef, Beyer, Brody, Chakrabarti, Durand, Estan, Flajolet, Fisk, Fusy, Gandouet, Gemulla, Gibbons, P. Haas, Indyk, Jayram, Kumar, Martin, Matias, Meunier, Reinwald, Sismanis, Sivakumar, Szegedy, Tirthapura, Trevisan, Varghese, W
• Previous best algorithm:
• $O(\varepsilon^{-2} \log\log n + \log n)$ bits of memory and $O(\varepsilon^{-2})$ update and reporting time
• Known lower bound on the memory:
• $\Omega(\varepsilon^{-2} + \log n)$
• Our result:
• Optimal $O(\varepsilon^{-2} + \log n)$ bits of memory and $O(1)$ update and reporting time
Previous Approaches
• Suppose we randomly hash the $F_0$ values into a hash table of $1/\varepsilon^2$ buckets and keep track of the number $C$ of non-empty buckets
• If $F_0 < 1/\varepsilon^2$, there is a way to estimate $F_0$ up to $(1 \pm \varepsilon)$ from $C$
• Problem: if $F_0 \gg 1/\varepsilon^2$, with high probability every bucket contains a value, so there is no information
• Solution: randomly choose $S_{\log n} \subseteq S_{\log n - 1} \subseteq S_{\log n - 2} \subseteq \cdots \subseteq S_1 \subseteq \{1, 2, \ldots, n\}$, where $|S_i| \approx n/2^i$
• Stream: 3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4, …
• $i$-th substream, with $S_i = \{1, 3, 7, 9, 265\}$: 3, 265, 3, 9, 7, 9, 3, …
• Run hashing procedure on each substream
• There is an $i$ for which the number of distinct values in the $i$-th substream is $\approx 1/\varepsilon^2$
• Hashing procedure on $i$-th substream works
• Problem: it takes $(1/\varepsilon^2) \log n$ bits of memory to keep track of this information
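The subsampling-plus-hashing scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the construction from any of the cited papers: the names `SubsampledBitmaps`, `level_of` and `invert_occupancy` are hypothetical, BLAKE2 stands in for a random hash function, a value is placed in $S_i$ when its hash has at least $i$ trailing zero bits (so level 0 is the whole stream), and the rule "use the first level that is at most half full" is one simple way to pick a usable $i$. The estimate from $C$ inverts the balls-in-bins expectation $E[C] = k(1 - (1 - 1/k)^m)$ for $m$ distinct values in $k$ buckets.

```python
import hashlib
import math


def _hash64(value, salt):
    """Deterministic 64-bit hash; a stand-in for a random hash function."""
    data = f"{salt}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def level_of(value, max_level):
    """Largest i with value in S_i; a value lands in S_i with probability 2**-i."""
    h = _hash64(value, "level")
    if h == 0:
        return max_level
    return min(max_level, (h & -h).bit_length() - 1)  # number of trailing zero bits


def invert_occupancy(c, k):
    """Solve c = k * (1 - (1 - 1/k)**m) for m, the number of distinct values."""
    return math.log(1 - c / k) / math.log(1 - 1 / k)


class SubsampledBitmaps:
    """One bitmap of k buckets per level: about k * log n bits in total."""

    def __init__(self, num_buckets, log_n=32):
        self.k = num_buckets  # plays the role of 1/eps^2; must be at least 2
        self.log_n = log_n
        self.bitmaps = [[False] * num_buckets for _ in range(log_n + 1)]

    def update(self, value):
        bucket = _hash64(value, "bucket") % self.k
        # The sets are nested, so the value belongs to every level up to its own.
        for i in range(level_of(value, self.log_n) + 1):
            self.bitmaps[i][bucket] = True

    def estimate(self):
        for i, bitmap in enumerate(self.bitmaps):
            c = sum(bitmap)  # number of non-empty buckets at level i
            if c <= self.k // 2:  # level is sparse enough to carry information
                return invert_occupancy(c, self.k) * 2 ** i
        raise RuntimeError("every level is more than half full")


sketch = SubsampledBitmaps(num_buckets=400)  # roughly 1/eps^2 for eps = 0.05
for v in [3, 141, 59, 265, 3, 58, 9, 7, 9, 32, 3, 846264, 338, 32, 4]:
    sketch.update(v)
print(sketch.estimate())  # an approximation of the 11 distinct values
```

Repeated values hash to the same level and bucket, so they never change the state; only the number of distinct values matters.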
Our Techniques
Observation:
- Have $1/\varepsilon^2$ global buckets
- In each bucket we keep track of the largest index $i$ for which $S_i$ contains a value hashed to the bucket
- This gives $O((1/\varepsilon^2) \log\log n)$ bits of memory
New Ideas:
- Can show that, with high probability, at every point in the stream most buckets contain roughly the same index
- We can just keep track of the offsets from this common index
- We pack the offsets into machine words and use known fast read/write algorithms for variable-length arrays to efficiently update offsets
- Occasionally we need to decrement all offsets. Can spread the work across multiple updates
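The observation above (one small counter per bucket instead of one bitmap per level) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the names `MaxLevelBuckets` and `level_of` are hypothetical, BLAKE2 stands in for a random hash function, membership in $S_i$ is assumed to mean "the hash has at least $i$ trailing zero bits", and the half-full rule for choosing a level is an assumption. The offset packing and the amortized decrements from the "New Ideas" list are not implemented here.

```python
import hashlib
import math


def _hash64(value, salt):
    """Deterministic 64-bit hash; a stand-in for a random hash function."""
    data = f"{salt}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def level_of(value, max_level):
    """Largest i with value in S_i, taken as the trailing zero bits of a hash."""
    h = _hash64(value, "level")
    if h == 0:
        return max_level
    return min(max_level, (h & -h).bit_length() - 1)


class MaxLevelBuckets:
    """Each bucket stores only the deepest level seen: about log log n bits each."""

    def __init__(self, num_buckets, log_n=32):
        self.k = num_buckets  # plays the role of 1/eps^2; must be at least 2
        self.log_n = log_n
        self.levels = [-1] * num_buckets  # -1 marks an empty bucket

    def update(self, value):
        bucket = _hash64(value, "bucket") % self.k
        lvl = level_of(value, self.log_n)
        if lvl > self.levels[bucket]:
            self.levels[bucket] = lvl

    def nonempty_at(self, i):
        # Because the sets S_i are nested, a bucket is non-empty in the i-th
        # substream exactly when its stored level is at least i.
        return sum(1 for lvl in self.levels if lvl >= i)

    def estimate(self):
        for i in range(self.log_n + 1):
            c = self.nonempty_at(i)
            if c <= self.k // 2:  # sparse enough to invert the occupancy count
                m = math.log(1 - c / self.k) / math.log(1 - 1 / self.k)
                return m * 2 ** i
        raise RuntimeError("every level is more than half full")


sketch = MaxLevelBuckets(num_buckets=400)
for v in range(20000):
    sketch.update(v)
print(sketch.estimate())
print(min(sketch.levels), max(sketch.levels))  # spread of the stored indices
```

Since the per-level bucket counts can be recovered from the stored maxima, nothing is lost relative to keeping a separate bitmap for every level, while each bucket needs only enough bits to hold an index up to $\log n$.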