
Counting Distinct Objects over Sliding Windows

Counting Distinct Objects over Sliding Windows. Presented by: Muhammad Aamir Cheema. Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin. University of New South Wales, Australia. Introduction. Counting distinct objects: Given a dataset D, return the number of distinct objects in D.


Presentation Transcript


  1. Counting Distinct Objects over Sliding Windows. Presented by: Muhammad Aamir Cheema. Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin. University of New South Wales, Australia.

  2. Introduction
     Counting distinct objects:
     • Given a dataset D, return the number of distinct objects in D.
     Counting distinct objects against sliding windows:
     • Given a data stream, return the number of distinct objects that arrive at or after timestamp t.
     Applications:
     • Traffic management, call centers, wireless communication, stock market, etc.

  3. Introduction
     Approximate counting: Let n be the actual number of distinct objects and n’ be the reported answer. Build a sketch such that every query is answered with the following guarantee: |n - n’|/n ≤ ε with confidence (1 - δ).
     Contribution:
     • FM-based algorithms
       • SE-FM (accuracy guarantee + space usage guarantee)
       • PCSA-based algorithm (no accuracy guarantee, although practical + more efficient)
     • k-Skyband (accuracy guarantee + efficient + no space usage guarantee)

  4. FM Algorithm
     FM sketch:
     • Let h(x) be a uniform hash function
     • Let the "pivot" p(x) be the position of the leftmost 1-bit of h(x)
     • Let FM be an array of size k initialized to zero
     • For each record x in the dataset
       • FM[p(x)] = 1
     • Let B = FMmin be the position of the leftmost 0-bit of FM
     • Number of distinct elements = α · 2^B, where α = 1.2897385
     • Each bit i of h(x) has probability 1/2 of being one
     [Figure: example with k = 4 showing h(r1), h(r2), h(r3) and the resulting FM array; FMmin = 1]
     Reference: P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS, 1985.

  5. FM Algorithm
     • Each bit i of h(x) has probability 1/2 of being one
     • An h(x) whose first i bits are zero and whose (i+1)-th bit is one occurs with probability 1/2^{i+1}
     Let n be the number of distinct elements:
     • FM[0] is accessed approx. n/2 times
     • FM[1] is accessed approx. n/4 times
     • …
     • FM[i] is accessed approx. n/2^{i+1} times
     • If i >> log_2 n, FM[i] will almost certainly be zero
     • If i << log_2 n, FM[i] will almost certainly be one
     • If i ≈ log_2 n, FM[i] may be zero or one
     • Hence, the first i for which FM[i] is zero may be used to approximate the number of distinct elements n.
     [Figure: h(r1), h(r2), h(r3) and the resulting FM array; FMmin = 1]
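The single-sketch FM procedure described on the two slides above can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions: the hash `h` (top k bits of SHA-256), the 0-based bit positions counted from the left, and the clamping of an all-zero hash to the last cell are choices made here, not details given on the slides.

```python
import hashlib

ALPHA = 1.2897385  # correction constant quoted on the slides


def h(x, k=32):
    """Hypothetical uniform hash: the top k bits of SHA-256(x), as an integer."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - k)


def pivot(value, k):
    """0-based position, counted from the left, of the leftmost 1-bit of a k-bit value."""
    for i in range(k):
        if (value >> (k - 1 - i)) & 1:
            return i
    return k - 1  # all-zero hash: clamp to the last cell (an assumption)


def fm_estimate(records, k=32):
    """Build one FM sketch over `records` and return alpha * 2^B."""
    fm = [0] * k
    for x in records:
        fm[pivot(h(x, k), k)] = 1
    # B = position of the leftmost 0-bit of FM (k if every cell is set)
    b = next((i for i, bit in enumerate(fm) if bit == 0), k)
    return ALPHA * 2 ** b
```

Because a repeated record always sets the same cell, duplicates never change the sketch. A single sketch can only report values of the form α · 2^B, so it is a coarse estimate on its own.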

  6. FM Algorithm
     Use r hash functions to create r FM sketches:
     • Initialize each FM to zero
     • For each record x in the dataset
       • For each hash function h_i(x)
         • FM_i[pivot] = 1
     • Let B_i be the position of the leftmost 0-bit of FM_i
     • B = (B_1 + B_2 + … + B_r) / r
     • Number of distinct elements = α · 2^B, where α = 1.2897385
     Example: FM_1 gives B_1 = 1, FM_2 gives B_2 = 2, FM_3 gives B_3 = 2, so B = (1 + 2 + 2)/3 = 1.67
     Performance guarantee: Let n be the actual number of distinct objects, n’ be the reported answer and m be the domain of elements; then
     P(|n’ - n|/n ≤ ε) ≥ 1 - δ
     if n > 1/ε, k = O(log m + log 1/ε + log 1/δ) and r = O(1/ε^2 log 1/δ)
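The averaging step over r sketches can be sketched as follows. This is a rough illustration, not the authors' implementation: the r hash functions are simulated by prefixing a seed to the input of SHA-256, and the names `leftmost_zero`, `combine` and `fm_estimate` are hypothetical.

```python
import hashlib

ALPHA = 1.2897385  # correction constant quoted on the slides


def leftmost_zero(records, seed, k=32):
    """B_i for one sketch built with the hash function identified by `seed`."""
    fm = [0] * k
    for x in records:
        digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
        value = int.from_bytes(digest, "big") >> (256 - k)
        # Pivot = number of leading zero bits of the k-bit value, clamped to k - 1.
        p = min(k - value.bit_length(), k - 1)
        fm[p] = 1
    return next((i for i, bit in enumerate(fm) if bit == 0), k)


def combine(b_values):
    """Average the B_i values and return alpha * 2^B."""
    b = sum(b_values) / len(b_values)
    return ALPHA * 2 ** b


def fm_estimate(records, r=32, k=32):
    """Estimate the number of distinct records using r seeded sketches."""
    return combine([leftmost_zero(records, seed, k) for seed in range(r)])
```

With the slide's example values, `combine([1, 2, 2])` averages to B ≈ 1.67 and returns roughly 4.1. Averaging the exponents before exponentiating is what lets the estimate take values between powers of two.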

  7. FM-based Algorithm
     Maintaining one FM sketch:
     • For each record (x, t) in the dataset
       • FM[pivot] = t
     Answering a query:
     • For any t, let B = FMmin(t) be the position of the leftmost entry of FM with value less than t
     • Number of distinct elements that arrived at or after t = α · 2^B, where α = 1.2897385
     [Figure: h(r1), h(r2), h(r3) and the timestamped FM array; FMmin(4) = 0]

  8. FM-based Algorithm
     Maintain r FM sketches:
     • Initialize each FM to zero
     • For each record (x, t) in the dataset
       • For each hash function h_i(x)
         • FM_i[pivot] = t
     Answering a query:
     • For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM
     • Let B = (B_1(t) + B_2(t) + … + B_r(t)) / r
     • Number of distinct elements that arrived at or after t = α · 2^B, where α = 1.2897385
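A timestamped variant of the FM sketch, as described on the two slides above, might look like the following. It is a sketch under assumptions that the slides do not state: timestamps are positive and non-decreasing (so 0 can mean "never set"), the r hash functions are seeded SHA-256 calls, and the class name `SlidingFM` is hypothetical.

```python
import hashlib

ALPHA = 1.2897385  # correction constant quoted on the slides


class SlidingFM:
    """r timestamped FM sketches of k cells each."""

    def __init__(self, r=16, k=32):
        self.r, self.k = r, k
        self.fm = [[0] * k for _ in range(r)]  # 0 means "never set"

    def _pivot(self, x, seed):
        digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
        value = int.from_bytes(digest, "big") >> (256 - self.k)
        # Number of leading zero bits, clamped to the last cell.
        return min(self.k - value.bit_length(), self.k - 1)

    def insert(self, x, t):
        for i in range(self.r):
            self.fm[i][self._pivot(x, i)] = t  # keep the latest arrival time

    def query(self, t):
        """Estimate the number of distinct objects that arrived at or after t."""
        total = 0
        for sketch in self.fm:
            # B_i(t): leftmost cell whose timestamp is smaller than t
            total += next((j for j, ts in enumerate(sketch) if ts < t), self.k)
        return ALPHA * 2 ** (total / self.r)
```

One structure answers queries for every window start t, because each cell remembers when it was last set. Raising t can only make more cells count as unset, so the estimate never grows as the window shrinks.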

  9. Performance Analysis
     Let n be the actual number of distinct objects arriving not before time t, n’ be the reported answer and m be the domain of elements; then
     P(|n’ - n|/n ≤ ε) ≥ 1 - δ
     if n > 1/ε, k = O(log m + log 1/ε + log 1/δ) and r = O(1/ε^2 log 1/δ)
     • Total space: O(1/ε^2 log 1/δ log m)
     • Total maintenance cost for one record: O(1/ε^2 log 1/δ log log m)
     • Total query cost: O(1/ε^2 log 1/δ log log m)

  10. PCSA-based Algorithm
      Maintain r FM sketches, but update only j < r of them per record:
      • Generate j hash functions H(x) that map x to [1, r]
      • Initialize each FM to zero
      • For each record (x, t) in the dataset
        • For each of the j hash functions H()
          • i = H(x)
          • Update the i-th FM sketch
      Answering a query:
      • For any t, let B_i(t) be the position of the leftmost entry smaller than t in the i-th FM
      • Let B = (B_1(t) + B_2(t) + … + B_r(t)) / r
      • Number of distinct elements that arrived at or after t = (α · 2^B) / j, where α = 1.2897385
      • Inspired by the PCSA technique in P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS, 1985.
      NOTE: No accuracy guarantee, but performs well in practice.

  11. BJKST Algorithm
      Main idea:
      • Let h() be a hash function that hashes D to [1, m^3], where m = |D|
      • For each record x, we generate its hash value h(x)
      • Maintain the k-th smallest distinct hash value k_min
      • Estimated number of distinct elements n’ = k · m^3 / k_min
      Improved algorithm:
      • Use r hash functions
      • Compute n_i for each hash function h_i() as above
      • Report the final answer as the median of the n_i values
      Performance guarantee: P(|n’ - n|/n ≤ ε) ≥ 1 - δ if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log 1/δ)
      Reference: Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM'02.
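The k-th-minimum idea above can be sketched in Python as follows. This is a simplified illustration under assumptions: `M` stands for the hash range m^3, the hash functions are seeded SHA-256 calls reduced modulo `M`, and returning an exact count when fewer than k distinct hash values have been seen is a convenience added here, not part of the slide.

```python
import hashlib
import heapq
import statistics


def h(x, seed, M):
    """Hypothetical hash of x into [1, M]."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % M + 1


def kmv_estimate(records, k, M, seed=0):
    """Estimate distinct count as k * M / k_min for one hash function."""
    heap = []        # max-heap (negated values) of the k smallest distinct hashes
    members = set()  # hash values currently in the heap
    for x in records:
        v = h(x, seed, M)
        if v in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            members.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)
            members.discard(evicted)
            members.add(v)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct values: count them exactly
    return k * M / -heap[0]      # -heap[0] is k_min


def bjkst_estimate(records, k, r, M):
    """Median of r independent estimates."""
    return statistics.median(kmv_estimate(records, k, M, seed) for seed in range(r))
```

For example, `bjkst_estimate(stream, k=256, r=5, M=m**3)` keeps at most 256 hash values per hash function regardless of the stream length. Taking the median rather than the mean limits the influence of an occasional badly off estimate.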

  12. K-Skyband Technique
      Main idea:
      • Let h() be a hash function that hashes D to [1, m^3], where m = |D|
      • For each record (x, t’) we generate h(x) and store the record (x, h(x), t’)
      Answering a query q(t):
      • Retrieve all records (x, h(x), t’) whose timestamp t’ ≥ t
      • Get the k-th smallest distinct hash value and apply the BJKST algorithm
      Limitation: requires storing all records

  13. K-Skyband Technique
      • For any time t, we need to find the k-th smallest hash value among records arriving no earlier than t
      • A record x dominates another record y if x arrives after y and has a smaller hash value
      • The k-skyband keeps only the records that are dominated by at most (k-1) records
      Maintaining the k-skyband:
      • Keep a counter for each record
      • When a new element (x, t) arrives, increment the counter of all records dominated by it
      • Remove the records whose counter is at least k
      • To improve efficiency, we increment the counters of groups of records (domination aggregation search tree)
      [Figure: records a to e plotted by hash value h(x) and arrival time t; k = 2]
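The counter-based maintenance above can be sketched with a plain linear scan per arrival. This is a simplified illustration, not the domination aggregation search tree mentioned on the slide. It assumes strictly increasing timestamps and no hash collisions between distinct objects. The handling of an object that arrives again (its older copy is replaced, and records the older copy had already counted are skipped) is an assumption made here, since the slide does not spell it out. `KSkyband` and `hash_fn` are hypothetical names.

```python
class KSkyband:
    """Keeps only records dominated by fewer than k other records."""

    def __init__(self, k, hash_fn):
        self.k = k
        self.hash_fn = hash_fn  # hypothetical: maps an object to its hash value
        self.records = {}       # x -> [hash value, timestamp, domination counter]

    def insert(self, x, t):
        v = self.hash_fn(x)
        old = self.records.pop(x, None)  # drop the older copy of x, if stored
        t_old = old[1] if old else float("-inf")
        for y, rec in list(self.records.items()):
            # x dominates y when x arrives later and has the smaller hash value.
            # Records older than the previous copy of x were already counted by it.
            if rec[0] > v and rec[1] > t_old:
                rec[2] += 1
                if rec[2] >= self.k:
                    del self.records[y]
        self.records[x] = [v, t, 0]  # nothing has arrived after x yet
```

With k = 2 and hash values a=5, b=3, c=4, d=1, e=2 arriving in that order, only d and e remain stored: a, b and c are each dominated by at least two later records with smaller hash values.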

  14. K-Skyband Technique
      Answering a query:
      • Find k_min (the k-th smallest hash value among elements arriving no earlier than t)
      • Let z be the number of stored elements that arrived before t
      • k_min is the (z+k)-th overall smallest hash value, because every stored element that arrived before t has a hash value smaller than k_min (otherwise at least k later records would dominate it)
      Algorithm:
      • Maintain a binary search tree eT that stores elements according to t
      • Maintain a binary search tree eH that stores elements according to h(x)
      • When a query q(t) arrives
        • Compute z by using eT
        • Find the (z+k)-th overall smallest hash value from eH
      [Figure: records a to f plotted by h(x) and t; with k = 2 and z = 3, k_min is the 5th smallest h(x)]
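The (z+k)-th lookup can be illustrated with sorted lists and binary search standing in for the two search trees eT and eH. This is a sketch under assumptions: `skyband` is taken to be the already-maintained k-skyband given as (hash value, timestamp) pairs, the lists are rebuilt on every call for brevity, and the function name `k_min` is hypothetical.

```python
import bisect


def k_min(skyband, t, k):
    """k-th smallest hash value among records with timestamp >= t, or None.

    `skyband` is an iterable of (hash_value, timestamp) pairs that is assumed
    to be a valid k-skyband.
    """
    times = sorted(ts for _, ts in skyband)   # stands in for eT
    hashes = sorted(hv for hv, _ in skyband)  # stands in for eH
    z = bisect.bisect_left(times, t)          # stored records that arrived before t
    if z + k > len(hashes):
        return None                           # fewer than k records in the window
    return hashes[z + k - 1]                  # (z+k)-th smallest overall


# Mirrors the slide's figure: k = 2 and z = 3, so the 5th smallest hash is used.
example = [(1, 1), (2, 2), (3, 3), (4, 5), (6, 6), (9, 7)]
print(k_min(example, t=4, k=2))
```

In the example the records arriving at or after t = 4 have hash values 4, 6 and 9, so the 2nd smallest among them is 6, which is also the 5th smallest overall. The resulting k_min would then be plugged into the BJKST estimate.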

  15. Performance Analysis
      Let n be the actual number of distinct objects arriving not before time t, n’ be the reported answer and m be the domain of elements; then
      P(|n’ - n|/n ≤ ε) ≥ 1 - δ
      if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log 1/δ)
      • Expected total space: O(1/ε^2 log 1/δ log n)
      • Expected time complexity: O(log 1/δ (log 1/ε + log n))

  16. Experiments
      • Synthetic datasets following Uniform and Zipf distributions
      • Real dataset: WorldCup 98 HTTP requests (20 M records)

  17. Space Efficiency

  18. Space Efficiency

  19. Time Efficiency Maintenance cost

  20. Time Efficiency Query response time

  21. Accuracy

  22. Thanks

  23. • P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, 2001. Space usage: 1/ε^2 log 1/δ m^{1/2}
      • Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In ICDE, 2004. Space usage: O(N/ε^2 log 1/δ log m)

  24. Space Requirement (SE-FM)
      To guarantee the performance we require the following:
      • k = O(log m + log 1/ε + log 1/δ)
      • r = O(1/ε^2 log 1/δ)
      Let m > 1/ε and m > 1/δ; then k = O(log m).
      Size of one sketch is k = O(log m).
      Size of r sketches is O(r log m) = O(1/ε^2 log 1/δ log m).
      Total space: O(1/ε^2 log 1/δ log m)

  25. Time Complexity (SE-FM)
      To guarantee the performance we require the following:
      • k = O(log m + log 1/ε + log 1/δ)
      • r = O(1/ε^2 log 1/δ)
      The elements in a sketch are stored in a min-heap to support logarithmic search/update:
      • Hence, the cost of one search/update operation is O(log k) = O(log log m)
      • To maintain the sketches, we update r sketches for each record x
      • Total maintenance cost for one record: O(r log log m) = O(1/ε^2 log 1/δ log log m)
      • To answer a query, we search in r sketches
      • Total query cost: O(r log log m) = O(1/ε^2 log 1/δ log log m)

  26. Space Usage (K-Skyband)
      Performance guarantee: P(|n’ - n|/n ≤ ε) ≥ 1 - δ if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log 1/δ)
      Expected size of one k-skyband = O(k ln(n/k))
      Expected size of r k-skybands = O(rk log(n/k)) = O(1/ε^2 log 1/δ log n)

  27. Time Complexity (K-Skyband)
      Performance guarantee: P(|n’ - n|/n ≤ ε) ≥ 1 - δ if m > 1/δ, n > k, k = O(1/ε^2) and r = O(log 1/δ)
      Answering a query q(t):
      • Search eT to compute z: O(log(k log n)) = O(log k + log n)
      • Search eH to find the (z+k)-th element: O(log k + log n)
      • We require this for all r sketches: O(r (log k + log n)) = O(log 1/δ (log 1/ε + log n))
