Optimizing Data Popularity Conscious Bloom Filters

Optimizing Data Popularity Conscious Bloom Filters
Ming Zhong, Pin Lu, Kai Shen, Joel Seiferas
University of Rochester

Problem Overview
• Bloom filters:
  • compact set representation in which each object is hashed into several bits in the filter;
  • allow possible false positives in membership queries;
  • useful in distributed applications that communicate sets.
• Highly skewed data popularity distributions.
• Data popularity conscious Bloom filters:
  • use a large number of hashes for likely false-positive candidates: objects that are popular in queries and unpopular in sets.
• Goal: customize the hash number for each object to minimize the false positive probability.
PODC 2008
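To make the idea of a per-object hash number concrete, here is a minimal Python sketch of a Bloom filter whose hash count is chosen per object. It is an illustration only, not the authors' implementation: the class name `VariableHashBloomFilter`, the `hash_count_for` callback, and the use of salted SHA-256 digests as the hash family are all assumptions made for this example.

```python
import hashlib


class VariableHashBloomFilter:
    """Bloom filter in which each object may use its own number of hashes."""

    def __init__(self, num_bits, hash_count_for):
        self.num_bits = num_bits
        # Hypothetical callback: maps an object to its hash number k_i.
        self.hash_count_for = hash_count_for
        # One byte per filter bit, chosen for readability over compactness.
        self.bits = bytearray(num_bits)

    def _positions(self, obj):
        # Derive k_i bit positions by salting SHA-256 with the hash index.
        for j in range(self.hash_count_for(obj)):
            digest = hashlib.sha256(f"{j}:{obj}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, obj):
        for pos in self._positions(obj):
            self.bits[pos] = 1

    def __contains__(self, obj):
        # True for every inserted object; may also be True for a non-member.
        return all(self.bits[pos] for pos in self._positions(obj))
```

A caller might pass `lambda obj: 8 if obj.startswith("hot") else 2` as `hash_count_for`, so that objects likely to cause false positives are checked against more bits. Insertion and lookup must use the same hash count for a given object, which is why the count is a function of the object itself.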

Object Popularity Stability
• Stable object popularity is important both for learning the object popularity and for keeping the adjustment overhead low.
• Illustration of stability across month-long trace segments (figure not reproduced in this transcript).

Problem Formulation and Result
• Problem formulation:
  • in a universe of $N$ objects, an $n$-object set is represented by an $m$-bit filter;
  • object $i$'s membership popularity is $p_i$, and its non-member query popularity is $q'_i$;
  • find object hash numbers $k_1, k_2, \ldots, k_N$ that minimize the false positive probability $\sum_{1 \le i \le N} q'_i \cdot B^{k_i}$;
  • $B$ is the probability that an arbitrary filter bit is 1, therefore $\sum_{1 \le i \le N} p_i \cdot k_i = K = \ln(1-B) / (n \cdot \ln(1-1/m))$.
• Result (assuming the $k_i$'s are unrestricted real numbers):
  • Lagrangian function: $\sum_{1 \le i \le N} q'_i \cdot B^{k_i} + \lambda \cdot (\sum_{1 \le i \le N} p_i \cdot k_i - K)$;
  • the optimum is reached when the function's partial derivatives with respect to the $k_i$'s and $\lambda$ are all zero;
  • we find $k_i = C + \log_{1/B}(q'_i/p_i)$, where $C$ is a constant;
  • also $B = 0.5$.
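The closed-form result above can be evaluated numerically. The following Python sketch computes the unrestricted real-number hash numbers for a given popularity profile, choosing the constant C so that the constraint on the weighted sum of hash numbers holds. The function names are hypothetical, and the sketch assumes that all popularities are strictly positive and that B is fixed at 0.5 as stated on the slide.

```python
import math


def real_hash_numbers(p, q, n, m, B=0.5):
    """Return (k, K): real-valued k_i = C + log_{1/B}(q'_i / p_i) and the budget K.

    p[i] is the membership popularity, q[i] the non-member query popularity.
    C is solved from the constraint sum_i p_i * k_i = K.
    """
    K = math.log(1 - B) / (n * math.log(1 - 1 / m))
    log_ratio = [math.log(qi / pi, 1 / B) for pi, qi in zip(p, q)]
    C = (K - sum(pi * r for pi, r in zip(p, log_ratio))) / sum(p)
    return [C + r for r in log_ratio], K


def false_positive(q, k, B=0.5):
    """Objective from the slide: sum_i q'_i * B^{k_i}."""
    return sum(qi * B ** ki for qi, ki in zip(q, k))
```

For example, with `p = [0.25] * 4`, `q = [0.4, 0.3, 0.2, 0.1]`, `n = 100`, and `m = 1000`, the objects with higher query popularity receive more hashes, and the resulting objective is no larger than that of giving every object the same K hashes.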

Ranged Integer Problem
• Practical constraint:
  • object $i$'s hash number $k_i$ must be a positive integer, and is often upper-bounded by $k_{max}$.
• Rounding real-number solutions to integers:
  • may increase the false positive rate;
  • there is no understanding of how large the increase may be.
• Overview of our approach:
  • introduce an importance score for each object (intuitively, more important objects desire more hashes);
  • the importance ranking helps produce fast approximation solutions.

Object Importance Score
• Intuition:
  • revisit the optimal real-number solution: $k_i = C + \log_2(q'_i/p_i)$;
  • hint: $q'_i/p_i$ provides a ranking on object hash numbers in a "good" solution.
• Results:
  • for the ranged real-number problem, an optimal solution $k_1, k_2, \ldots, k_N$ must follow the importance ranking;
  • $\lfloor k_1 \rfloor, \lfloor k_2 \rfloor, \ldots, \lfloor k_N \rfloor$ is a 2-approximation solution to the ranged integer problem; it also follows the importance ranking.
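As a small illustration of the importance score and of flooring a real-number solution, consider the Python sketch below. The helper names are hypothetical, and the sketch does not compute the optimal ranged real-number solution itself; it only shows the score, the floor step, and a check that a candidate assignment follows the importance ranking. The clamp to the range 1 to `k_max` is a defensive assumption for inputs that are not already in range.

```python
import math


def importance_scores(p, q):
    """Importance score q'_i / p_i for each object."""
    return [qi / pi for pi, qi in zip(p, q)]


def floor_solution(k_real, k_max):
    """Floor each real hash number, clamped to the allowed range [1, k_max]."""
    return [min(k_max, max(1, math.floor(k))) for k in k_real]


def follows_importance_ranking(scores, k):
    """True if a more important object never gets fewer hashes than a less important one."""
    # Ties in score are ordered by k so that equal-score objects never count as a violation.
    order = sorted(range(len(k)), key=lambda i: (scores[i], k[i]))
    return all(k[a] <= k[b] for a, b in zip(order, order[1:]))
```

Because flooring is monotone, a real-number solution that follows the importance ranking still follows it after `floor_solution` is applied.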

Polynomial-Time 2-Approximation
• Our result indicates that at least one solution that follows the importance score ranking is provably a 2-approximation.
• ⇒ If we enumerate all importance-ranked solutions, the best one is a 2-approximation.
• $O(N^{k_{max}})$ time 2-approximation:
  • there are no more than $(N+1)^{k_{max}-1}$ importance-ranked solutions in total;
  • it takes $O(N)$ time to check the constraint and calculate the false positive rate for each solution.
• Practically expensive:
  • $N$ can be huge;
  • the constant $k_{max}$ may not be very small (e.g., 20).
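The enumeration described above can be sketched in Python as follows. This is an illustrative brute-force version under stated assumptions, not the authors' code: the function names are hypothetical, and the fill probability B of each candidate is derived from the candidate itself by inverting the relation between B and the weighted sum of hash numbers given on the Problem Formulation slide. An importance-ranked solution is described by `k_max - 1` cut points along the ranking, which matches the bound on the number of candidates.

```python
from itertools import combinations_with_replacement


def fill_probability(p, k, n, m):
    """B = 1 - (1 - 1/m)^(n * sum_i p_i * k_i)."""
    total = n * sum(pi * ki for pi, ki in zip(p, k))
    return 1 - (1 - 1 / m) ** total


def false_positive(p, q, k, n, m):
    """sum_i q'_i * B^{k_i}, with B implied by the hash numbers k."""
    B = fill_probability(p, k, n, m)
    return sum(qi * B ** ki for qi, ki in zip(q, k))


def best_importance_ranked(p, q, n, m, k_max):
    """Try every integer assignment that is monotone in the importance ranking."""
    N = len(p)
    order = sorted(range(N), key=lambda i: q[i] / p[i])  # least important first
    best_k, best_fp = None, float("inf")
    # Each cut is a rank position at which the hash number increases by one.
    for cuts in combinations_with_replacement(range(N + 1), k_max - 1):
        k = [0] * N
        for rank, i in enumerate(order):
            k[i] = 1 + sum(1 for c in cuts if c <= rank)
        fp = false_positive(p, q, k, n, m)  # O(N) per candidate
        if fp < best_fp:
            best_k, best_fp = k, fp
    return best_k, best_fp
```

The loop visits at most `(N + 1) ** (k_max - 1)` candidates, so this sketch is only usable for small `N` and `k_max`, which is the practical limitation the slide points out.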

Faster Solutions
• (2+ε)-approximation:
  • the problem of identifying the best importance-ranked solution can be transformed into a knapsack problem;
  • dynamic programming produces a (2+ε)-approximation solution in $O(N^2/\varepsilon)$ time.
• Coarse-grained optimization:
  • partition the large number of objects into a small number of groups (objects in each group have similar importance scores);
  • optimize at the group granularity (then assign an equal hash number to all objects within one group) ⇒ a much smaller $N$.
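The coarse-grained optimization can be illustrated with a short Python sketch. The grouping rule used here (equal-sized runs of consecutive objects in importance order) is an assumption made for the example; the slides only require that objects in a group have similar importance scores. The function names are hypothetical. Summing the popularities within a group preserves both the objective and the constraint when all objects in the group share one hash number.

```python
def group_by_importance(p, q, num_groups):
    """Partition objects into runs of consecutive importance ranks.

    Returns (groups, gp, gq): member indices per group and the summed
    membership and query popularities, usable as a smaller problem instance.
    """
    N = len(p)
    order = sorted(range(N), key=lambda i: q[i] / p[i])
    size = -(-N // num_groups)  # ceiling division
    groups = [order[s:s + size] for s in range(0, N, size)]
    gp = [sum(p[i] for i in g) for g in groups]
    gq = [sum(q[i] for i in g) for g in groups]
    return groups, gp, gq


def expand(groups, group_k, N):
    """Assign each group's hash number to all of its member objects."""
    k = [0] * N
    for g, kg in zip(groups, group_k):
        for i in g:
            k[i] = kg
    return k
```

Any solver for the per-object problem can then be run on `gp` and `gq`, and `expand` maps the group-level hash numbers back to individual objects.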

Evaluation on Synthetic Data
• Non-member query popularity $q'_i$ follows a Zipf-like distribution.
• Membership popularity $p_i$ follows a uniform distribution.
• Our integer approximation solution significantly outperforms the real-rounding solution, particularly at high popularity skewness.
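For readers who want to reproduce a workload of this shape, the Python sketch below generates a Zipf-like query popularity vector and a uniform membership popularity vector. The function names and the skew parameter `alpha` are assumptions; the slides do not state the exponents or universe sizes used in the evaluation.

```python
def zipf_like(N, alpha):
    """Normalized popularities proportional to 1 / rank^alpha, for rank = 1..N."""
    weights = [1 / (rank ** alpha) for rank in range(1, N + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def uniform(N):
    """Equal popularity 1/N for every object."""
    return [1 / N] * N
```

A larger `alpha` gives a more skewed distribution, and `alpha = 0` reduces to the uniform case.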

Trace-driven Evaluation on Distributed Caching
• Distributed caches exchange their content (the set of cached web objects) to cooperate.
• Evaluation driven by web access traces from IRCache.net.

Trace-driven Evaluation on Distributed Keyword Searching
• Distributed search engines pass keyword indexes to support distributed joins. False positives are resolved by additional communication.
• Evaluation driven by the web page listing at dmoz.com and keyword query traces at Ask.com.

Related Work
• Compressed Bloom filters [Mitzenmacher 2002].
• Bloom filters with additional functionalities:
  • deletion [Fan et al. 2000];
  • frequency queries [Cohen and Matias 2003];
  • associating objects with values [Chazelle et al. 2004].
• Alternative data structure [Pagh et al. 2005].
• Weighted Bloom filters [Bruck et al. 2006]:
  • optimal real-number solution with integer rounding;
  • analytically, the rounding-induced error increase is unbounded;
  • practically, the error increase can be substantial.

Conclusions
• Popularity conscious Bloom filters:
  • motivated by skewed, stable data popularity distributions;
  • customize each object's hash number according to its popularity in sets and queries.
• Unrestricted real-number problem:
  • the solution is optimal when each object's hash number is linear in log(query-pop'/set-pop).
• Ranged integer problem:
  • query-pop'/set-pop serves as an object importance indicator;
  • $O(N^{k_{max}})$ time 2-approximation;
  • $O(N^2/\varepsilon)$ time (2+ε)-approximation.
• Quantitative evaluations driven by real distributed application traces.
