An overview of data popularity conscious Bloom filters, which use per-object importance scores to decide how many times each object is hashed. Covers importance ranking, the ranged integer problem, and 2-approximation solutions.
Optimizing Data Popularity Conscious Bloom Filters
Ming Zhong, Pin Lu, Kai Shen, Joel Seiferas
University of Rochester
Problem Overview
• Bloom filters:
  • compact set representation in which each object is hashed into several bits in the filter;
  • allow possible false positives in membership queries;
  • useful in distributed applications that communicate sets.
• Data popularity distributions are often highly skewed.
• Data popularity conscious Bloom filters:
  • use a larger number of hashes for likely false positive candidates: objects that are popular in queries but unpopular in sets.
• Goal: customize the number of hashes for each object to minimize the false positive probability.
PODC 2008
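The idea of a per-object hash count can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name, the `hash_counts` map, the `default_hashes` fallback, and the use of salted SHA-256 as the hash family are all assumptions made for the example.

```python
import hashlib


class PopularityConsciousBloomFilter:
    """Bloom filter in which each object may use its own number of hashes."""

    def __init__(self, num_bits, hash_counts, default_hashes=4):
        self.num_bits = num_bits
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only
        self.hash_counts = hash_counts   # hypothetical map: object -> k_i
        self.default_hashes = default_hashes

    def _positions(self, obj):
        # The j-th hash of an object is SHA-256 over "j:obj" (an assumption).
        k = self.hash_counts.get(obj, self.default_hashes)
        for j in range(k):
            digest = hashlib.sha256(f"{j}:{obj}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, obj):
        for pos in self._positions(obj):
            self.bits[pos] = 1

    def __contains__(self, obj):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(obj))
```

For membership tests to be meaningful, the party that builds the filter and the party that queries it must agree on the same hash count for every object, which is one reason stable popularity matters.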
Object Popularity Stability
• Stable object popularity is important both for learning the popularity distribution and for keeping adjustment overhead low.
• Illustration of stability across month-long trace segments: [figure not reproduced in this transcript]
Problem Formulation and Result
• Problem formulation:
  • in a universe of $N$ objects, an $n$-object set is represented by an $m$-bit filter;
  • object $i$'s membership popularity is $p_i$, and its non-member query popularity is $q'_i$;
  • find object hash numbers $k_1, k_2, \ldots, k_N$ that minimize the false positive probability $\sum_{1 \le i \le N} q'_i \cdot B^{k_i}$;
  • $B$ is the probability that an arbitrary filter bit is 1, therefore $\sum_{1 \le i \le N} p_i \cdot k_i = K = \ln(1-B) / (n \cdot \ln(1-1/m))$.
• Result (assuming the $k_i$'s are unrestricted real numbers):
  • Lagrangian function: $\sum_{1 \le i \le N} q'_i \cdot B^{k_i} + \lambda \cdot \left(\sum_{1 \le i \le N} p_i \cdot k_i - K\right)$;
  • the optimum is reached when the function's partial derivatives with respect to the $k_i$'s and $\lambda$ are all zero;
  • we find $k_i = C + \log_{1/B}(q'_i/p_i)$, where $C$ is a constant;
  • also $B = 0.5$.
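The closed-form result can be evaluated numerically. The sketch below fixes $B = 0.5$, computes $K$ from $n$ and $m$, and picks the constant $C$ so that the constraint $\sum p_i k_i = K$ holds. The function name and argument layout are hypothetical, and all popularities are assumed to be strictly positive.

```python
import math


def optimal_real_hash_numbers(p, q, n, m):
    """Return (k, K) with k_i = C + log2(q_i / p_i) and sum(p_i * k_i) = K.

    p: membership popularities, q: non-member query popularities (all > 0),
    n: number of objects in the set, m: number of filter bits.
    """
    B = 0.5  # optimal bit-fill probability from the result above
    K = math.log(1 - B) / (n * math.log(1 - 1 / m))
    scores = [math.log2(qi / pi) for pi, qi in zip(p, q)]
    # Solve sum(p_i * (C + score_i)) = K for the constant C.
    C = (K - sum(pi * s for pi, s in zip(p, scores))) / sum(p)
    return [C + s for s in scores], K
```

The returned values are real numbers and are not restricted to any range, so some may be fractional, below 1, or very large when the popularity ratio is extreme.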
Ranged Integer Problem
• Practical constraint:
  • object $i$'s hash number $k_i$ must be a positive integer, and is often upper-bounded by $k_{max}$.
• Rounding real-number solutions to integers:
  • may increase the false positive rate;
  • gives no understanding of how large the increase may be.
• Overview of our approach:
  • introduce an importance score for each object (intuitively, more important objects desire more hashes);
  • the importance ranking helps produce fast approximation solutions.
Object Importance Score
• Intuition:
  • revisit the optimal real-number solution: $k_i = C + \log_2(q'_i/p_i)$;
  • hint: $q'_i/p_i$ provides a ranking on object hash numbers in a "good" solution.
• Results:
  • for the ranged real-number problem, an optimal solution $k_1, k_2, \ldots, k_N$ must follow the importance ranking;
  • $\lfloor k_1 \rfloor, \lfloor k_2 \rfloor, \ldots, \lfloor k_N \rfloor$ is a 2-approximation solution to the ranged integer problem; it also follows the importance ranking.
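What "following the importance ranking" means can be made concrete with a small checker. This is an illustrative sketch with hypothetical function names; it treats objects with equal scores leniently, which is an assumption rather than something the slides specify.

```python
def importance_scores(p, q):
    """Importance score q'_i / p_i for every object (p_i must be > 0)."""
    return [qi / pi for pi, qi in zip(p, q)]


def follows_importance_ranking(k, p, q):
    """True if no object gets more hashes than a more important object."""
    # Sort by (score, hash number); ties in score may hold any hash numbers.
    ranked = sorted(zip(importance_scores(p, q), k))
    in_rank_order = [ki for _, ki in ranked]
    return all(a <= b for a, b in zip(in_rank_order, in_rank_order[1:]))
```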
Polynomial-Time 2-Approximation
• Our result indicates that at least one solution that follows the importance score ranking is provably a 2-approximation.
• ⇒ If we enumerate all importance-ranked solutions, the best one is a 2-approximation.
• $O(N^{k_{max}})$ time 2-approximation:
  • there are no more than $(N+1)^{k_{max}-1}$ importance-ranked solutions in total;
  • checking the constraint and calculating the false positive rate takes $O(N)$ time per solution.
• Practically expensive:
  • $N$ can be huge;
  • the constant $k_{max}$ may not be very small (e.g., 20).
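For small inputs, the enumeration can be sketched directly. The code below ranks objects by $q'_i/p_i$, places $k_{max}-1$ cut points in the ranked list so that hash numbers never decrease with importance, and keeps the assignment with the lowest false positive rate. All function names are hypothetical. As a simplifying assumption, the sketch derives $B$ from the chosen hash numbers (with $p$ summing to 1) instead of checking a separate constraint, and its per-solution cost is $O(N \cdot k_{max})$ rather than $O(N)$.

```python
from itertools import combinations_with_replacement


def fill_probability(k, p, n, m):
    """Probability B that a filter bit is 1 after inserting n objects."""
    expected_hashes = n * sum(pi * ki for pi, ki in zip(p, k))
    return 1 - (1 - 1 / m) ** expected_hashes


def false_positive_rate(k, p, q, n, m):
    """Sum of q_i * B**k_i, with B derived from the hash numbers k."""
    B = fill_probability(k, p, n, m)
    return sum(qi * B ** ki for qi, ki in zip(q, k))


def best_importance_ranked(p, q, n, m, k_max):
    """Exhaustively search all importance-ranked integer assignments."""
    N = len(p)
    order = sorted(range(N), key=lambda i: q[i] / p[i])  # least important first
    best_k, best_fp = None, float("inf")
    # Each cut at position c adds one hash to every object of rank >= c,
    # so hash numbers lie in 1..k_max and never decrease with importance.
    for cuts in combinations_with_replacement(range(N + 1), k_max - 1):
        k = [0] * N
        for rank, i in enumerate(order):
            k[i] = 1 + sum(1 for c in cuts if c <= rank)
        fp = false_positive_rate(k, p, q, n, m)
        if fp < best_fp:
            best_k, best_fp = k, fp
    return best_k, best_fp
```

The number of candidate assignments grows polynomially in $N$ with exponent $k_{max}-1$, so this brute-force form is only practical for toy inputs.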
Faster Solutions
• (2+ε)-approximation:
  • the problem of identifying the best importance-ranked solution can be transformed into a knapsack problem;
  • dynamic programming produces a (2+ε)-approximation solution in $O(N^2/\varepsilon)$ time.
• Coarse-grained optimization:
  • partition the large number of objects into a small number of groups (objects in each group have similar importance scores);
  • optimize at the group granularity, then assign an equal hash number to all objects within a group ⇒ much smaller $N$.
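The coarse-grained step can be sketched as follows. Objects are sorted by importance score and cut into contiguous, roughly equal-sized groups; each group then acts as a single object whose popularities are the sums over its members, which leaves the objective and constraint unchanged when all members share one hash number. Equal-sized grouping and the function names are assumptions made for this example; the slides do not say how groups are formed.

```python
def group_by_importance(p, q, num_groups):
    """Split objects into at most num_groups groups of similar importance.

    Returns (groups, group_p, group_q), where groups holds object indices
    and group_p / group_q are the summed popularities of each group.
    """
    N = len(p)
    order = sorted(range(N), key=lambda i: q[i] / p[i])
    size = -(-N // num_groups)  # ceiling division
    groups = [order[s:s + size] for s in range(0, N, size)]
    group_p = [sum(p[i] for i in g) for g in groups]
    group_q = [sum(q[i] for i in g) for g in groups]
    return groups, group_p, group_q


def expand_group_solution(groups, group_k, N):
    """Give every object the hash number chosen for its group."""
    k = [0] * N
    for members, kg in zip(groups, group_k):
        for i in members:
            k[i] = kg
    return k
```

Any solver for the per-object problem can then be run on `group_p` and `group_q`, and its output expanded back to individual objects.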
Evaluation on Synthetic Data
• Non-member query popularity $q'_i$ follows a Zipf-like distribution.
• Membership popularity $p_i$ follows a uniform distribution.
• Our integer approximation solution significantly outperforms the real-rounding solution, particularly at high popularity skewness.
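Synthetic inputs of this shape are easy to generate. In the sketch below, the Zipf-like popularity of the object at rank $r$ is proportional to $1/r^{\alpha}$; the skewness parameter `alpha` and the function names are assumptions, since the slides do not give the exact parameterization used in the evaluation.

```python
def zipf_like(N, alpha):
    """Popularities proportional to 1 / rank**alpha, normalized to sum to 1."""
    weights = [1 / rank ** alpha for rank in range(1, N + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def uniform(N):
    """Equal popularity 1/N for every object."""
    return [1 / N] * N
```

A larger `alpha` gives a more skewed query distribution, and `alpha = 0` reduces to the uniform case.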
Trace-driven Evaluation on Distributed Caching
• Distributed caches exchange their content (sets of cached web objects) in order to cooperate.
• Evaluation driven by web access traces from IRCache.net.
Trace-driven Evaluation on Distributed Keyword Searching
• Distributed search engines pass keyword indexes to support distributed joins. False positives are resolved by additional communication.
• Evaluation driven by web page listings at dmoz.com and keyword query traces at Ask.com.
Related Work
• Compressed Bloom filters [Mitzenmacher 2002].
• Bloom filters with additional functionalities:
  • deletion [Fan et al. 2000];
  • frequency queries [Cohen and Matias 2003];
  • associating objects with values [Chazelle et al. 2004].
• Alternative data structure [Pagh et al. 2005].
• Weighted Bloom filters [Bruck et al. 2006]:
  • optimal real-number solution with integer rounding;
  • analytically, the rounding-induced error increase is unbounded;
  • practically, the error increase can be substantial.
Conclusions
• Popularity conscious Bloom filters:
  • motivated by skewed, stable data popularity distributions;
  • customize each object's hash number according to its popularity in sets and queries.
• Unrestricted real-number problem:
  • the solution is optimal when each object's hash number is linear in log(query-pop'/set-pop).
• Ranged integer problem:
  • query-pop'/set-pop serves as an object importance indicator;
  • $O(N^{k_{max}})$ time 2-approximation;
  • $O(N^2/\varepsilon)$ time (2+ε)-approximation.
• Quantitative evaluations driven by real distributed application traces.