Statistical structures for Internet-scale data management

Statistical structures for Internet-scale data management • Fateme Shirazi • Spring 2010 • Authors: Nikos Ntarmos, Peter Triantafillou, G. Weikum

Outline • Introduction • Background : Hash sketches • Compute aggregates and building histograms • Implementation • Results • Conclusion

Peer-to-Peer (P2P) • File sharing in overlay networks • Millions of users (peers) provide storage and bandwidth for searching and fetching files

Motivation • P2P file-sharing systems often need the total number of (unique) documents shared by their users • Distributed P2P search engines need to evaluate the significance of keywords • i.e. the ratio of indexed documents containing each keyword to the total number of indexed documents

Motivation • Internet-scale information retrieval systems need a method to deduce the rank/score of data items • Sensor networks need methods to compute aggregates • Traditionally, query optimizers rely on histograms over stored data to estimate the size of intermediate results

Overview Sketch • A large number of nodes form the system’s infrastructure • Nodes contribute and/or store data items and are involved in operations such as computing synopses and building histograms • In general, queries do not affect all nodes • Aggregation functions are computed over data sets defined dynamically by a filter predicate of the query

Problem Formulation • Relevant data items are stored in unpredictable ways in a subset of all nodes • A large number of different data sets are expected to exist, stored at (perhaps overlapping) subsets of the network • Relevant queries and synopses may be built and used over any of these data sets

Computational Model • Data stored in the P2P network is structured in relations • Each relation R consists of (k+l) attributes or columns: R(a1,…,ak,b1,…,bl) • Each tuple has an ID that is either one of the attributes of the tuple, or calculated otherwise (e.g. a combination of its attributes)

Outline • Introduction • Background : Hash sketches • Compute aggregates and building histograms • Experimental setup • Results • Conclusion

Distributed Hash Tables • A family of structured P2P network overlays exposing a hash-table-like interface (lookup service) • Examples of DHTs include Chord, Kademlia, Pastry, CAN… • Any node can efficiently retrieve the value associated with a given key

Chord • Nodes are assigned identifiers from a circular ID space, computed as the hash of their IP address • The node-ID space is partitioned among nodes, so that each node is responsible for a well-defined set (arc) of identifiers • Each item is also assigned a unique identifier from the same ID space • Each item is stored at the node whose ID is closest to the item’s ID
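The assignment of items to nodes can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the 16-bit ID space, the use of SHA-1, and the helper names `ring_id` and `responsible_node` are assumptions. "Closest" is interpreted here as the first node at or after the item's ID on the ring (the successor rule Chord uses).

```python
import bisect
import hashlib

ID_BITS = 16  # assumed size of the circular ID space: 2**16 identifiers


def ring_id(key: str) -> int:
    """Hash a string (e.g. an IP address or an item key) onto the ID space."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)


def responsible_node(node_ids: list[int], item_id: int) -> int:
    """Return the first node ID at or after item_id, wrapping around the ring."""
    ordered = sorted(node_ids)
    idx = bisect.bisect_left(ordered, item_id)
    return ordered[idx % len(ordered)]


# Node IDs taken from the ring drawn on the later DHS slides.
nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(responsible_node(nodes, 10))  # item ID 10 falls in the arc of node 14
print(responsible_node(nodes, 60))  # past the last node, wraps around to node 1
```

Each node's arc is the range of IDs between its predecessor (exclusive) and itself (inclusive), which is what makes the partition well defined.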

HashSketches • Estimate the number of distinct items in a multiset D of data in a database • Useful for application domains that need to count distinct elements: • Approximate query answering in very large databases • Data mining on the Internet graph • Stream processing

HashSketches • A hash sketch consists of a bit vector B[·] of length L • To estimate the number n of distinct elements in D, ρ(h(d)) is applied to every d ∈ D and the results are recorded in the bitmap vector B[0 . . . L−1] • [Figure: items d1 to d4 inserted into a bitmap whose bits, from LSB to MSB, read 1 1 0 0 0 0] Partially copied from slides of the author

Hash sketches: Insertions • [Figure: data items d1 … dn are mapped by h() to L-bit pseudo-random numbers PRN1 … PRNn, and ρ() of each number selects the bit of the hash sketch (bit vector B, bits b0 at the LSB up to the MSB) to set; the example keys “my item 1 key” to “my item 4 key” hash to the values 10111, 10010, 01101, 10011] Copied from slides of the author

HashSketches • Since h() distributes values uniformly over $[0, 2^L)$, $P(\rho(h(d)) = k) = 2^{-k-1}$ • If R is the position of the least-significant 0-bit in B, then $2^R \sim n$ • Example with items d1 to d4: $|D| \sim 2^2 = 4$ Partially copied from slides of the author
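A small sketch of insertion and estimation, assuming ρ(x) is the position of the least-significant 1-bit of x (the usual definition for this kind of sketch, and the one that gives the probability above). The function names, the SHA-1 based `h()`, and the bitmap lengths are illustrative assumptions. The example reuses the four hash values shown on the insertion slide.

```python
import hashlib


def h(item: str, bits: int = 32) -> int:
    """Hash an item key to a pseudo-uniform integer in [0, 2**bits)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


def rho(x: int) -> int:
    """Position of the least-significant 1-bit of x (x must be > 0)."""
    return (x & -x).bit_length() - 1


def insert_hashed(bitmap: list[int], hashed: int) -> None:
    """Record rho(hashed) in the bitmap; positions beyond the bitmap are ignored."""
    if hashed == 0:
        return
    r = rho(hashed)
    if r < len(bitmap):
        bitmap[r] = 1


def estimate(bitmap: list[int]) -> int:
    """Return 2**R, where R is the position of the least-significant 0-bit."""
    R = next((i for i, bit in enumerate(bitmap) if bit == 0), len(bitmap))
    return 2 ** R


# The four example hash values from the insertion slide.
bitmap = [0] * 6
for value in (0b10111, 0b10010, 0b01101, 0b10011):
    insert_hashed(bitmap, value)
print(bitmap, estimate(bitmap))  # [1, 1, 0, 0, 0, 0] 4

# Real keys go through h() first; inserting a duplicate changes nothing.
keys_bitmap = [0] * 32
for key in ("my item 1 key", "my item 2 key", "my item 1 key"):
    insert_hashed(keys_bitmap, h(key))
```

Because a duplicate item hashes to the same bit, the sketch counts distinct items only. A single bitmap gives a coarse power-of-two estimate; the results slides note that accuracy improves with more bitmaps per synopsis.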

Distributing Data Synopses • (1) the “conservative” but popular rendezvous-based approach • (2) the decentralized way of DHS, in which no node has any sort of special functionality Partially copied from slides of the author

Mapping DHS bits to DHT Nodes • [Figure: ring of nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56, with arcs of the ring assigned to bit positions Bit 0, Bit 1, Bit 2, Bit 3, …] Copied from slides of the author

DHS: Counting • [Figure: a counting node probes nodes on the ring N1, N8, N14, N21, N32, N38, N42, N48, N51, N56; the probe replies shown are “Bits >3 not set”, “Bit 2 not set. Retrying…”, “Bit 2 not set”, “Bit 1 not set. Retrying…”, “Bit 1 set!”] Copied from slides of the author

Computing Aggregates • COUNT-DISTINCT: estimate the number of (distinct) items in a multiset • COUNT: add the tuple IDs to the corresponding synopsis, instead of the values of the column in question • SUM: each node locally computes the sum of the values of the column in the tuples it stores and populates a local hash sketch • AVG: estimate the SUM and COUNT of the column and take their ratio

COUNT-DISTINCT • Both rendezvous-based hash sketches and DHS are applicable to estimating the number of (distinct) items in a multiset • Assume we want to estimate the number of distinct values in a column C of a relation R stored in our Internet-scale data management system

Counting with the Rendezvous Approach • Nodes first compute a rendezvous ID (e.g. h("attr1") = 47) • Each node then computes its synopsis locally and sends it to the node whose ID is closest to that ID (the “rendezvous node”) • The rendezvous node is responsible for combining the individual synopses (by bitwise OR) into the global synopsis • Interested nodes can then acquire the global synopsis by querying the rendezvous node
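The combining step can be illustrated with bitmaps stored as Python integers, so that merging is a plain bitwise OR. This is a sketch under assumptions: the function names and the example hash values are illustrative, ρ is taken to be the position of the least-significant 1-bit, and message passing between nodes is left out.

```python
from functools import reduce


def rho(x: int) -> int:
    """Position of the least-significant 1-bit of x (x must be > 0)."""
    return (x & -x).bit_length() - 1


def local_synopsis(hashed_values) -> int:
    """Build one node's hash sketch as an integer bitmask."""
    bitmap = 0
    for value in hashed_values:
        if value:  # an all-zero hash has no 1-bit to record
            bitmap |= 1 << rho(value)
    return bitmap


def combine(synopses) -> int:
    """What the rendezvous node does: bitwise OR of all received synopses."""
    return reduce(lambda a, b: a | b, synopses, 0)


def estimate(bitmap: int) -> int:
    """Return 2**R, where R is the position of the least-significant 0-bit."""
    R = 0
    while bitmap & (1 << R):
        R += 1
    return 2 ** R


# Two nodes hold different (already hashed) items of the same column.
node_a = local_synopsis([0b10111, 0b10010])           # sets bits 0 and 1
node_b = local_synopsis([0b01101, 0b10011, 0b10100])  # sets bits 0 and 2
global_synopsis = combine([node_a, node_b])
print(bin(global_synopsis), estimate(global_synopsis))  # 0b111 8
```

The OR of the local sketches equals the sketch that a single host would build over the union of the data, which is why the rendezvous node never needs the raw tuples.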

[Figures: Steps 1 to 3 of counting with the rendezvous approach]

Counting with DHS • In the DHS-based case, nodes storing tuples of R insert them into the DHS as follows: • (1) Nodes hash their tuples and compute ρ(hash) for each tuple • (2) For each tuple, nodes send a “set-to-1” message to a random ID in the corresponding arc • (3) Counting consists of probing random nodes in the arcs corresponding to increasing bit positions until a 0-bit is found
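The three steps can be simulated in a single process. Everything below is an illustrative assumption rather than the DHS protocol as specified by the authors: the 64-identifier ring, the equal-sized arc per bit position, the successor-based `owner()` lookup, the fixed number of retries per bit, and all class and function names.

```python
import bisect
import random

ID_SPACE = 64                # assumed ring size (matches node IDs N1 to N56)
NUM_BITS = 4                 # assumed sketch length
ARC = ID_SPACE // NUM_BITS   # assumption: one equal-sized arc per bit position


class Ring:
    def __init__(self, node_ids):
        self.node_ids = sorted(node_ids)
        self.bits = {n: set() for n in self.node_ids}  # bit positions set at each node

    def owner(self, key_id: int) -> int:
        """First node at or after key_id, wrapping around the ring."""
        idx = bisect.bisect_left(self.node_ids, key_id)
        return self.node_ids[idx % len(self.node_ids)]


def rho(x: int) -> int:
    """Position of the least-significant 1-bit of x (x must be > 0)."""
    return (x & -x).bit_length() - 1


def random_id_in_arc(r: int, rng: random.Random) -> int:
    return rng.randrange(r * ARC, (r + 1) * ARC)


def dhs_insert(ring: Ring, hashed: int, rng: random.Random) -> None:
    """Steps (1) and (2): compute rho, then 'set-to-1' at a random ID in that arc."""
    if hashed == 0:
        return
    r = rho(hashed)
    if r < NUM_BITS:
        ring.bits[ring.owner(random_id_in_arc(r, rng))].add(r)


def dhs_count(ring: Ring, rng: random.Random, retries: int = 5) -> int:
    """Step (3): probe arcs for increasing bit positions until a bit looks unset."""
    for r in range(NUM_BITS):
        probes = (ring.owner(random_id_in_arc(r, rng)) for _ in range(retries))
        if not any(r in ring.bits[node] for node in probes):
            return 2 ** r
    return 2 ** NUM_BITS


rng = random.Random(0)
ring = Ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
for value in (0b10111, 0b10010, 0b01101, 0b10011):
    dhs_insert(ring, value, rng)
print(dhs_count(ring, rng))
```

Because a bit is stored at only some of the nodes in its arc, a probe can miss it; retrying with another random node in the same arc reduces the chance of stopping too early, at the price of extra hops.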

[Figures: Steps 1 to 3 of counting with DHS]

Histograms • The most common technique used by commercial databases as a statistical summary • An approximation of the distribution of values in base relations • For a given attribute/column, a histogram is a grouping of attribute values into “buckets” • [Figure: example histograms; axis labels: Salary, Age]

Constructing histogram types • Equi-Width histograms • The most basic histogram variant • Partitions the attribute value domain into cells (buckets) of equal spread • Assigns to each bucket the number of tuples whose attribute value falls in it
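A centralized Equi-Width histogram is easy to write down, which helps when reading the distributed variants. The function name, the half-open domain [lo, hi), and the exact counts are assumptions of this sketch; in the distributed setting the per-bucket counts would be estimates rather than exact numbers.

```python
def equi_width_histogram(values, lo, hi, num_buckets):
    """Count the values in [lo, hi) that fall into each of num_buckets equal-width buckets."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        if lo <= v < hi:
            # Clamp guards against floating-point rounding at the upper edge.
            idx = min(int((v - lo) / width), num_buckets - 1)
            counts[idx] += 1
    return counts


# Four buckets of spread 250 over the domain [0, 1000).
print(equi_width_histogram([0, 5, 10, 15, 99, 250, 999], 0, 1000, 4))  # [5, 1, 0, 1]
```

The example shows the typical weakness of equal spreads on skewed data: most tuples land in one bucket while others stay empty.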

Other histogram types • Average Shifted Equi-Width histograms (ASH) • Consist of several Equi-Width histograms with different starting positions in the value space • The frequency of each value in a bucket is computed as the average of the estimations given by the individual histograms • Equi-Depth histograms • In an Equi-Depth histogram all buckets have equal frequencies but not (necessarily) equal spreads
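Both variants can be sketched centrally in a few lines. The function names are hypothetical, and two choices are assumptions of this sketch: the shifted grids of the ASH start at evenly spaced offsets of width/shifts, and the Equi-Depth boundaries are read off the sorted values by rank.

```python
import math


def ash_estimate(values, x, lo, width, shifts):
    """Average, over `shifts` shifted equi-width grids, of the count in the bucket holding x."""
    total = 0
    for s in range(shifts):
        origin = lo - s * width / shifts           # start of the s-th shifted grid
        bucket = math.floor((x - origin) / width)  # index of x's bucket in that grid
        start = origin + bucket * width
        total += sum(1 for v in values if start <= v < start + width)
    return total / shifts


def equi_depth_boundaries(values, num_buckets):
    """Lower bounds of buckets 2..num_buckets so that each bucket holds about n/num_buckets values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]


data = [1, 2, 3, 11, 12]
print(ash_estimate(data, 7, lo=0, width=10, shifts=2))   # 2.5
print(equi_depth_boundaries([1, 2, 3, 4, 5, 6, 7, 8], 4))  # [3, 5, 7]
```

In the ASH example the value 7 shares a bucket with {1, 2, 3} in the unshifted grid and with {11, 12} in the grid shifted by 5, so the averaged count is 2.5. The Equi-Depth boundaries adapt to the data: on skewed input the dense region gets narrow buckets.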

Outline • Introduction • Background : Hash sketches • Compute aggregates and building histograms • Implementation • Results • Conclusion

Implementation • 1. Generating the workload • 2. Populating the network with peers • 3. Randomly assigning data tuples from the base data to nodes in the overlay • 4. Then inserting all nodes into the P2P network • 5. Selecting random nodes, reconstructing histograms, and computing aggregates

Measures of Interest • To consider: • (1) The fairness of the load distribution across nodes in the network • (2) The accuracy of the estimation itself • (3) The number of hops needed to do the estimation • To show the trade-off of scalability vs. performance/load distribution between the DHS and rendezvous-based approaches

Fairness • To compute the fairness, the load on any given node is measured as the insertion/query/probe “hits” on the node • i.e. the number of times this node is the target of insertion/query/probe operations • A multitude of metrics are used. More specifically: • The Gini Coefficient • The Fairness Index • Maximum and total loads for DHS- and rendezvous-based approaches
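The slides do not define the Fairness Index. A common choice for load distributions is Jain's fairness index, and the sketch below assumes that is the metric meant; the function name and the handling of an all-zero load vector are also assumptions.

```python
def jain_fairness_index(loads):
    """Jain's index: (sum x)^2 / (n * sum x^2); 1.0 is perfectly fair, 1/n is the worst case."""
    n = len(loads)
    sum_sq = sum(x * x for x in loads)
    if n == 0 or sum_sq == 0:
        return 1.0  # assumption: no load at all is treated as perfectly fair
    return sum(loads) ** 2 / (n * sum_sq)


print(jain_fairness_index([5, 5, 5, 5]))  # every node hit equally often
print(jain_fairness_index([8, 0, 0, 0]))  # one node takes every hit
```

With node hits as the load, a rendezvous-style setup where a single node receives all hits scores 1/n, while an even spread scores 1.0.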

The Gini Coefficient • Based on the mean of the absolute difference of every possible pair of node loads • Takes values in the interval [0, 1), where a GC value of 0.0 is the best possible state, with 1.0 being the worst • Roughly represents the amount of imbalance in the system • Gini = A/(A+B) [Figure: Lorenz-curve diagram with areas A and B]
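A direct way to compute the coefficient from per-node loads is the relative mean absolute difference: the mean of |x_i − x_j| over all ordered pairs, divided by twice the mean load. The sketch below uses that form, which agrees with the A/(A+B) definition; the function name and the convention for an all-zero load vector are assumptions.

```python
def gini(loads):
    """Gini coefficient as mean absolute pairwise difference divided by twice the mean."""
    n = len(loads)
    total = sum(loads)
    if n == 0 or total == 0:
        return 0.0  # assumption: no load at all counts as perfectly balanced
    pair_diff_sum = sum(abs(a - b) for a in loads for b in loads)
    # mean difference = pair_diff_sum / n**2, mean load = total / n
    return pair_diff_sum / (2 * n * total)


print(gini([5, 5, 5, 5]))  # 0.0: perfectly balanced
print(gini([8, 0, 0, 0]))  # 0.75: one of four nodes carries all the load
```

When one of n nodes carries everything the value is (n − 1)/n, which approaches but never reaches 1.0, matching the half-open interval [0, 1).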

Estimation error • The mean error of the estimation is reported • Computed as the percentage by which the distributed estimation differs from the estimated aggregate computed in a centralized manner • (i.e. as if all data were stored on a single host)
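As a concrete reading of this definition, the error of one run is the absolute difference between the two estimates relative to the centralized one, and the reported figure is the mean over runs. The function names and the use of the absolute difference are assumptions of this sketch.

```python
def relative_error_pct(distributed: float, centralized: float) -> float:
    """Percentage by which the distributed estimate differs from the centralized one."""
    return 100.0 * abs(distributed - centralized) / centralized


def mean_error_pct(pairs) -> float:
    """Mean relative error over (distributed, centralized) estimate pairs."""
    errors = [relative_error_pct(d, c) for d, c in pairs]
    return sum(errors) / len(errors)


# Three hypothetical runs against the same centralized estimate of 100.
print(mean_error_pct([(110, 100), (90, 100), (100, 100)]))
```

Comparing against the centralized estimate, rather than the true value, isolates the error added by distribution from the error inherent in hash sketches.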

Hop-count Costs • The per-node average hop count for inserting all tuples to the distributed synopsis is measured and shown • The per-node hop count costs are higher for the DHS-based approach

Outline • Introduction • Background • Compute aggregates and building histograms • Implementation • Results • Conclusion

Results • The hop-count efficiency and the accuracy of rendezvous-based hash sketches and of the DHS are measured • Initially, single-attribute relations are created, with integer values in the interval [0, 1000) • following either a uniform distribution (depicted as a Zipf with θ equal to 0.0) • or a shuffled Zipf distribution with θ equal to 0.7, 1.0, and 1.2
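A workload of this kind can be generated as follows. This is a sketch under assumptions, not the authors' generator: "shuffled" is read as assigning the Zipf frequency ranks to the values in random order, and the function names, the seed parameter, and the tuple count are illustrative.

```python
import random


def zipf_weights(domain_size: int, theta: float) -> list[float]:
    """Unnormalized Zipf weights 1 / rank**theta; theta = 0.0 gives a uniform distribution."""
    return [1.0 / (rank ** theta) for rank in range(1, domain_size + 1)]


def shuffled_zipf_workload(num_tuples, domain_size=1000, theta=1.0, seed=0):
    """Draw attribute values from [0, domain_size) with Zipf-distributed frequencies."""
    rng = random.Random(seed)
    values = list(range(domain_size))
    rng.shuffle(values)  # decide at random which value gets which frequency rank
    return rng.choices(values, weights=zipf_weights(domain_size, theta), k=num_tuples)


for theta in (0.0, 0.7, 1.0, 1.2):
    workload = shuffled_zipf_workload(10_000, theta=theta)
    print(theta, len(set(workload)))  # number of distinct values actually drawn
```

Higher θ concentrates the tuples on fewer values, so the number of distinct values in a fixed-size sample shrinks as θ grows.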

Total query load (node hits) over time

Load distribution • The extra hop-count cost of the DHS-based approach pays off when it comes to load distribution fairness • The load on a node is the number of times it is visited (a.k.a. node hits) during data insertion and/or query processing

Gini Coefficient • [Figure panels: Rendezvous approach, DHS approach]

Evolution of the Gini coefficient • In the rendezvous-based approach a single node carries all the query load • The DHS-based approaches reach GC values of ≈0.5, which equal the GC values of the distribution of the distances between consecutive nodes in the ID space • These are thus the best respective values achievable by any algorithm using randomized assignment of items to nodes
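The ≈0.5 value for the distances between consecutive nodes can be checked with a short simulation. The sketch assumes node IDs drawn uniformly at random from a 2^32 ID space; the function names, node count, and seed are illustrative.

```python
import random


def gini(values):
    """Gini coefficient via the sorted-rank formula (equivalent to the pairwise form)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def spacing_gini(num_nodes: int, id_space: int = 2 ** 32, seed: int = 0) -> float:
    """Gini coefficient of the arc lengths between consecutive random node IDs."""
    rng = random.Random(seed)
    ids = sorted(rng.sample(range(id_space), num_nodes))
    gaps = [b - a for a, b in zip(ids, ids[1:])]
    gaps.append(ids[0] + id_space - ids[-1])  # wrap-around arc closing the ring
    return gini(gaps)


print(round(spacing_gini(5000), 3))
```

Since a randomly chosen ID lands on a node with probability proportional to that node's arc, the imbalance in arc lengths is a floor on how balanced any random-ID placement can be.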

Evolution of the Gini coefficient

Error for Computing COUNT Aggregate • [Figure panels: Rendezvous approach, DHS approach] • In both cases, the error is due to the use of hash sketches • Both approaches exhibit the same average error • As expected, the higher the number of bitmaps in the synopsis, the better the accuracy

Insertion hop count • [Figure panels: Rendezvous approach, DHS approach] • The insertion hop-count cost for all aggregates • Hop-count costs are higher for the DHS-based approach by approximately 8× for both the insertion and query cases

Conclusion • A framework for distributed statistical synopses for Internet-scale networks such as P2P systems • Extending techniques from centralized settings towards distributed settings • Developing DHT-based higher-level synopses like Equi-Width, ASH, and Equi-Depth histograms
