380 likes | 536 Views
Bloom Based Filters for Hiera r chical Data. Georgia Koloniari and Evaggelia Pitoura University of Ioannina, Greece. Outline. Motivation Problem Description Related Work Our approach: Multi-Level Bloom Filters Performance Evaluation Hierarchical Distribution of Filters
E N D
Bloom Based Filters for Hierarchical Data Georgia Koloniari and Evaggelia PitouraUniversity of Ioannina, Greece
Outline • Motivation • Problem Description • Related Work • Our approach: Multi-Level Bloom Filters • Performance Evaluation • Hierarchical Distribution of Filters • Experimental Results • Conclusions • Future Work
Motivation • Evolution of peer-to-peer systems as an effective way of sharing data • Wide use of XML for data representation and exchange in the Internet • Service Descriptions in XML-based languages • Growing interest in content-based routing of data Challenge: How to efficiently discover the appropriate data based on their content?
The Problem • A peer-to-peer system where each node stores a set of XML documents • A query issued at a node may need results from multiple nodes in the system • Use data summaries at each node to assist query routing B SumB A C SumC
Summaries Requirements • Scalability: summaries should be able to scale to a large number of users and shared documents. • Distribution: should be distributed across the nodes of the peer-to-peer system without requiring any central point of control. • Dynamic:should support updates, since in a peer-to-peer system, users join and leave the system at will.
Related Work • XML Indices • The Index Fabric [Cooper & Shadmon, RightOrder Inc 2001] • XSKETCH Synopsis [Polyzotis & Garofalakis, VLDB 2002] • APEX [Chung, Min & Chim, ACM SIGMOD 2002 ] • Path Tree [Aboulnaga, Alameldeen & Naughton, VLDB 2001] • Signature-based Indices [Park & Kim, DASFAA 2001] • Routing in P2P • Secure Service Discovery [Hodes et al, Mobicom ’99] • Routing indices [Crespo & Garcia-Molina, ICDCS 2002]
device printer camera color postscript digital Data Model <xml> <device> <printer> <color></color> <postscript></postscript> </printer> <camera> <digital></digital> </camera> </device> </xml>
Querying • XML-based data or service descriptions • Find the documents that satisfy a given query • Queries that exploit content and structure of the data • Membership Queries:“Is element X in set Y?” • Path Queries:consisting ofregular path expressions, i.e. device/*/camera
Bloom Filters • Compact data structures for a probabilistic representation of a set • Appropriate to answer membership queries
Bloom Filters (cont’d) Query for b: check the bits at positions H1(b), H2(b), ..., H4(b).
Bloom Filters (cont’d) • Appearance of false positives. False positive:the probabilty that the filter recognizes an elemnt as belonging to the set although it does not. P = (1 - e-kn/m)k • Ease of updates with the use of an array of counters • Unable to represent relationships between elements
Our approach: • Bloom filters suitable for distributed environments • Main drawback: Unable to represent hierarchies • Extend to multi-level Bloom Filters in order to support path queries • Two approaches: • Breadth Bloom Filters • Depth Bloom Filters
Breadth Bloom Filters • One Bloom Filter BBFi for each level of the tree i • In each filter BBFi we insert the elements of all the nodes of level i. • An additional BBF0 with all the elements to improve performance • Different sizes of the filter for each filter Look-up: • check BBF0 for all elements of the path • check each element ai of the path to the corresponding level
device printer camera color postscript digital Breadth Bloom Filters BBF0 (deviceprintercamera colorpostscriptdigital) BBF1 device BBF2 printer camera BBF3 (colorpostscriptdigital) • Queries: $device/printer/color • /printer/postscript
Depth Bloom Filters • One Bloom Filter DBFi for each path of the tree with length i, i.e. each path with i+1 nodes • In each DBFi we insert all paths of the tree with length i. Look-up for path of length p: • Check all elements of the query in DBF • Check for every sub-path of length 2 to p • For * split the path at the positition of * and check each sub-path seperately
DBF Paths of length 0 0 device 1 0 1 1 0 0 1 0 1 1 0 0 DBF Paths of length 1 1 printer camera 0 1 1 1 0 0 1 0 0 1 0 1 DBF Paths of length 2 2 1 0 0 1 1 0 0 1 0 1 1 0 color postscript digital Depth Bloom Filters (deviceprintercamera colorpostscriptdigital) (device/printerdevice/camera camera/digitalprinter/color printer/postscript) (device/camera/digital device/printer/color device/printer/postscript) • Queries: /device/printer/color • /device/*/postscript
Experimental Evaluation • 200 XML documents produced by the Niagara Generator (www.cs.wisc.edu/niagara) • 4 hash functions using the MD5 message digest algorithm (RFC1321) • Size of the filter: 78000 bits, about 2% of the size of the documents • Levels of the documents: 4 • Elements per document: 50 • No repetition between element names • Length of queries: 3 (e.g. /device/camera/digital) • 90% of the elements forming the queries were contained in the documents • Metric: Percentage of false positives
Varying the query workload Workload type: /printer/digital
Summary of Results • Multi-level Bloom filters outperform Simple Bloom filters in evaluating path queries. • For 2% of the total size of the data, multi-level Bloom filters evaluate path queries for a false positives ratio below 3%, while Simple Blooms fail to recognize the correct paths, no matter how much the filter size increases. • Breadth Blooms work better than Depth Blooms. • Depth Blooms require more space but are suitable for handling queries for which Breadth Blooms present a high ratio of false positives (exp. 5)
Distribution • Each node stores: • local summary • merged summary of neighbours • merged summary constructed by applying the bit-wise OR per level • Nodes organized according to topological proximity • Two organizations of nodes: • hierarchical • horizons
Distribution: Hierarchical Organization Node C: Local filter Merged filter :E F G H Root filters: A, B, D
Bloom Filter Similarity • Nodes organized according to Bloom Filter Similarity • Measure: similarity measure based on theManhattan distance metric. Let two filters Band C of size m d(B, C) = |B[1] –C[1]| + |B[2] – C[2]| + … |B[m] – C[m]|. similarity(B, C) = m – d(B, C).
Bloom Filter Similarity (cont’d) B 1 0 0 1 0 0 1 1 C 0 1 1 0 1 0 0 1 similarity(B, C) =8 - (1 + 0 + 0 + 1 + 0 + 1+ 0 + 1) = 4 For multi-level Bloom filters similarity is defined as the sum of each pair of corresponding levels
Content-Based Organization • When a node joins the system: • it broadcasts its local summary and attaches to the most «similar» node available
Performance in Distributed Setting • Hierarchical organization of nodes • Metric: Number of hops • Parameters: • Variable number of nodes • Number of hierarchies: 5 • Maximum out-degree: 5 • Every 10% of all docs 70% similar • Length of queries: 2 • 10% of the documents have results • 70% of the documents contain the elements of the path query • One document per node
Summary of Results • The content-based organization is much more efficient in finding all the results for a query, than the proximity organization. • They both perform similarly in discovering the first result. • The content-based organization outperforms the proximity one when the nodes that satisfy a given query are limited. • Both Simple and multi-level Blooms can be efficiently used as distributed filters. • For path queries, multi-level Blooms outperform Simple ones.
Conclusions • We introduced two novel data structures: Breadth and Depth Bloom Filters that exploit both the content and structure of the XML documents given a small space overhead. • The new data structures outperform simple Bloom Filters with respect to false positives when addresing regular path expression queries • Distributed in large-scale systems to support efficient service discovery • Extended the use of Bloom filters to organize the nodes according to their content.
Future Work • Explore different policies for the filters distribution. • Explore different types of data summaries (e.g. Signatures) • Extend the data model to XML graphs and incorporate values into the indexes