
Statistical structures for Internet-scale data management


Presentation Transcript


  1. Statistical structures for Internet-scale data management Fateme Shirazi, Spring 2010 Authors: Nikos Ntarmos, Peter Triantafillou, G. Weikum

  2. Outline • Introduction • Background: Hash sketches • Computing aggregates and building histograms • Implementation • Results • Conclusion

  3. Peer-to-Peer (P2P) • File sharing in overlay networks • Millions of users (peers) provide storage and bandwidth for searching and fetching files

  4. Motivation • P2P file-sharing systems often need the total number of (unique) documents shared by their users • Distributed P2P search engines need to evaluate the significance of keywords • i.e., the ratio of indexed documents containing each keyword to the total number of indexed documents

  5. Motivation • Internet-scale information retrieval systems need a method to deduce the rank/score of data items • Sensor networks need methods to compute aggregates • Traditionally, query optimizers rely on histograms over stored data to estimate the size of intermediate results

  6. Overview Sketch • A large number of nodes form the system’s infrastructure • They contribute and/or store data items and are involved in operations such as computing synopses and building histograms • In general, queries do not affect all nodes • Aggregation functions are computed over data sets selected dynamically by a filter predicate of the query

  7. Problem Formulation • Relevant data items are stored in unpredictable ways at a subset of all nodes • A large number of different data sets are expected to exist, stored at (perhaps overlapping) subsets of the network • Relevant queries and synopses may be built and used over any of these data sets

  8. Computational Model • Data stored in the P2P network is structured in relations • Each relation R consists of (k+l) attributes or columns: R(a1, …, ak, b1, …, bl) • The identifier of a tuple is either one of its attributes or calculated otherwise (e.g., from a combination of its attributes)

  9. Outline • Introduction • Background: Hash sketches • Computing aggregates and building histograms • Experimental setup • Results • Conclusion

  10. Distributed Hash Tables • A family of structured P2P network overlays exposing a hash-table-like interface (lookup service) • Examples of DHTs include Chord, Kademlia, Pastry, CAN, … • Any node can efficiently retrieve the value associated with a given key

  11. Chord • Nodes are assigned identifiers from a circular ID space, computed as the hash of their IP address • The node-ID space is partitioned among nodes, so that each node is responsible for a well-defined set (arc) of identifiers • Each item is also assigned a unique identifier from the same ID space • An item is stored at the node whose ID is closest to the item’s ID
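
A minimal Python sketch of the placement rule just described, assuming SHA-1 hashing and a toy 16-bit ID space (the helper names and addresses are illustrative, not from the paper): node and item IDs live on the same ring, and an item is stored at the first node clockwise from its ID.

```python
import hashlib
from bisect import bisect_left

M = 2 ** 16  # toy circular ID space; real Chord deployments use e.g. 2^160

def chord_id(key: str) -> int:
    """Hash a node address or item key onto the circular ID space."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % M

def responsible_node(item_key: str, node_ids: list[int]) -> int:
    """Return the node responsible for an item: its successor on the ring."""
    ring = sorted(node_ids)
    i = bisect_left(ring, chord_id(item_key))
    return ring[i % len(ring)]  # wrap around past the largest node ID

nodes = [chord_id(ip) for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")]
print(responsible_node("some item key", nodes))
```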

  12. Hash Sketches • Estimate the number of distinct items in a multi-set D of data in a database • Useful for application domains that need to count distinct elements: • Approximate query answering in very large databases • Data mining on the Internet graph • Stream processing

  13. Hash Sketches [figure: bit vector B, from LSB to MSB, with bits set by items d1–d4] • A hash sketch consists of a bit vector B[·] of length L • To estimate the number n of distinct elements in D, ρ(h(d)) is applied to all d ∈ D and the results are recorded in the bitmap vector B[0 . . . L−1] (Partially copied from slides of the author)

  14. Hash sketches: Insertions [figure: data items d1 … dn (e.g., “my item 1 key”) are hashed by h() to L-bit pseudo-random numbers PRN1 … PRNn; ρ(·) of each number selects which bit b0 … bL−1 of the bit vector B is set to 1] (Copied from slides of the author)

  15. Hash Sketches • Since h() distributes values uniformly over [0, 2^L), P(ρ(h(d)) = k) = 2^(−k−1) • If R is the position of the least-significant 0-bit in B, then 2^R ≈ n • In the example with items d1 … d4: |D| ≈ 2^2 = 4 (Partially copied from slides of the author)
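
To make slides 13–15 concrete, here is a minimal single-site hash-sketch sketch (Flajolet–Martin style); the hash function and bit-vector length are assumptions for illustration. It records ρ(h(d)) for every item and estimates n from R, the least-significant 0-bit, as 2^R (the classical estimator additionally divides by a correction factor ≈ 0.7735, omitted here to match the slide).

```python
import hashlib

L = 32  # length of the bit vector B

def h(item: str) -> int:
    """Pseudo-uniform hash of an item into [0, 2^L)."""
    return int(hashlib.sha1(item.encode()).hexdigest(), 16) % (1 << L)

def rho(x: int) -> int:
    """Position of the least-significant 1-bit of x; by convention rho(0) = L."""
    return (x & -x).bit_length() - 1 if x else L

def insert(sketch: list[int], item: str) -> None:
    """Set bit rho(h(item)); duplicates hit the same bit and are not re-counted."""
    sketch[min(rho(h(item)), L - 1)] = 1

def estimate(sketch: list[int]) -> int:
    """R = position of the least-significant 0-bit in B; the distinct count is roughly 2^R."""
    R = next((i for i, bit in enumerate(sketch) if bit == 0), L)
    return 2 ** R

sketch = [0] * L
for d in ("d1", "d2", "d3", "d4", "d4"):  # the duplicate d4 is absorbed
    insert(sketch, d)
print(estimate(sketch))  # a small power of two near the true count of 4
```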

  16. Distributing Data Synopses • (1) The “conservative” but popular rendezvous-based approach • (2) The decentralized Distributed Hash Sketches (DHS) approach, in which no node has any special functionality (Partially copied from slides of the author)

  17. Mapping DHS bits to DHT nodes [figure: Chord ring with nodes N1, N8, N14, N21, N32, N38, N42, N48, N51, N56; hash-sketch bit positions 0, 1, 2, 3, … are mapped to arcs of the node-ID space] (Copied from slides of the author)

  18. DHS: Counting [figure: a counting node probes the ring — bits > 3 not set; bit 2 not set, retrying…; bit 2 not set; bit 1 not set, retrying…; bit 1 set!] (Copied from slides of the author)

  19. Outline • Introduction • Background: Hash sketches • Computing aggregates and building histograms • Experimental setup • Results • Conclusion

  20. Computing Aggregates • COUNT-DISTINCT: estimate the number of (distinct) items in a multi-set • COUNT: add tuple IDs to the corresponding synopsis, instead of the values of the column in question • SUM: each node locally computes the sum of the values of the column over the tuples it stores and populates a local hash sketch • AVG: estimate the SUM and COUNT of the column and take their ratio
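
The SUM bullet above is terse; a hedged reading (one common realization, not necessarily the paper's exact construction) is to reuse the duplicate-insensitive COUNT-DISTINCT machinery: each node turns its local sum s into s distinct pseudo-items, so the global distinct count approximates the global SUM, and AVG follows as SUM / COUNT. A minimal illustration, reusing insert(), estimate(), and L from the hash-sketch sketch above; the node IDs and pseudo-item naming are invented.

```python
def insert_local_sum(sketch: list[int], node_id: str, local_sum: int) -> None:
    """Encode a node's local SUM as local_sum distinct pseudo-items.

    Naive O(local_sum) loop for clarity; range-efficient encodings exist,
    but the duplicate-insensitive idea is the same.
    """
    for i in range(local_sum):
        insert(sketch, f"{node_id}:{i}")  # insert() from the hash-sketch sketch above

sum_sketch = [0] * L
insert_local_sum(sum_sketch, "node-A", 7)
insert_local_sum(sum_sketch, "node-B", 5)
print(estimate(sum_sketch))  # rough estimate of the global SUM (7 + 5)
```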

  21. COUNT-DISTINCT • Both rendezvous-based hash sketches and DHS are applicable to estimating the number of (distinct) items in a multi-set • Assume we want to estimate the number of distinct values in a column C of a relation R stored in our Internet-scale data management system

  22. Counting with the Rendezvous Approach • Nodes first compute a rendezvous ID (e.g., hashing the attribute name: attr1 → h() → 47) • Each node then computes its synopsis locally and sends it to the node whose ID is closest to the rendezvous ID (the “rendezvous node”) • The rendezvous node is responsible for combining the individual synopses (by bitwise OR) into the global synopsis • Interested nodes can then acquire the global synopsis by querying the rendezvous node
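
A hedged sketch of the rendezvous mechanics described above: every node derives the same rendezvous ID by hashing the relation/column name, and the rendezvous node folds each incoming local sketch into the global one with a bitwise OR (the OR is what makes the synopsis duplicate-insensitive). chord_id() is the illustrative helper from the Chord sketch earlier; the naming convention is an assumption.

```python
def rendezvous_id(relation: str, column: str) -> int:
    """All nodes hash the same relation.column name, so they agree on one ID."""
    return chord_id(f"{relation}.{column}")  # chord_id() from the Chord sketch above

def merge_into_global(global_sketch: list[int], local_sketch: list[int]) -> list[int]:
    """Rendezvous node: OR-combine a node's local hash sketch into the global one."""
    return [g | b for g, b in zip(global_sketch, local_sketch)]
```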

  23. Step 1

  24. Step 2

  25. Step 3

  26. Counting with DHS • In the DHS-based case, nodes storing tuples of R insert them into the DHS as follows: • (1) Nodes hash their tuples and compute ρ(hash) for each tuple • (2) For each tuple, nodes send a “set-to-1” message to a random ID in the corresponding arc • (3) Counting consists of probing random nodes in the arcs corresponding to increasing bit positions, until a 0-bit is found
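
A hedged sketch of step (3): arc_nodes[k] is assumed to hold the IDs of the nodes owning bit position k, and probe(node, k) is an assumed call asking a node whether it has received a “set-to-1” for bit k; neither helper is from the paper. Because insertions go to a random ID inside an arc, a single probe can miss a bit that is actually set, hence the retrying behaviour seen on slide 18.

```python
import random

def dhs_bit_is_set(arc_nodes, probe, k, retries=3):
    """Probe a few random nodes in the arc of bit k before concluding it is unset."""
    return any(probe(random.choice(arc_nodes[k]), k) for _ in range(retries))

def dhs_count_estimate(arc_nodes, probe):
    """Walk increasing bit positions; the first unset position R yields the estimate 2^R."""
    for k in range(len(arc_nodes)):
        if not dhs_bit_is_set(arc_nodes, probe, k):
            return 2 ** k
    return 2 ** len(arc_nodes)  # every probed bit position was set
```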

  27. Step 1

  28. Step 2

  29. Step 3

  30. Histograms • The most common technique used by commercial databases as a statistical summary • An approximation of the distribution of values in base relations • For a given attribute/column, a histogram is a grouping of attribute values into “buckets” [figure: example histograms over Salary and Age attributes]

  31. Constructing histogram types • Equi-Width histograms • The most basic histogram variant • Partition the attribute value domain into cells (buckets) of equal spread • Assign to each bucket the number of tuples whose attribute value falls within it
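
A minimal, centralized illustration of the Equi-Width rule just described; the value range, bucket count, and salary values are arbitrary example parameters (in the paper the per-bucket counts are maintained with distributed synopses rather than a local loop).

```python
def equi_width_histogram(values, lo, hi, buckets):
    """Split [lo, hi) into `buckets` cells of equal spread and count tuples per cell."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        i = min(int((v - lo) // width), buckets - 1)  # clamp values at the top edge
        counts[i] += 1
    return counts

# Salaries grouped into 5 equal-width buckets over [20, 70)
print(equi_width_histogram([23, 31, 35, 47, 52, 52, 68], lo=20, hi=70, buckets=5))
# -> [1, 2, 1, 2, 1]
```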

  32. Other histogram types • Average-Shifted Equi-Width histograms (ASH) • Consist of several EWHs with different starting positions in the value space • The frequency of each value in a bucket is computed as the average of the estimations given by the constituent histograms • Equi-Depth histograms • In an Equi-Depth histogram all buckets have equal frequencies, but not (necessarily) equal spreads

  33. Outline • Introduction • Background: Hash sketches • Computing aggregates and building histograms • Implementation • Results • Conclusion

  34. Implementation • 1. Generating the workload • 2. Populating the network with peers • 3. Randomly assigning data tuples from the base data to nodes in the overlay • 4. Then inserting all nodes into the P2P overlay • 5. Selecting random nodes, reconstructing histograms, and computing aggregates

  35. Measures of Interest • To consider: • (1) The fairness of the load distribution across nodes in the network • (2) The accuracy of the estimation itself • (3) The number of hops needed to do the estimation • These show the trade-off of scalability vs. performance/load distribution between the DHS and rendezvous-based approaches

  36. Fairness • To compute fairness, the load on any given node is measured as the insertion/query/probe “hits” on that node • i.e., the number of times the node is the target of an insertion/query/probe operation • A multitude of metrics are used, more specifically: • The Gini Coefficient • The Fairness Index • Maximum and total loads for the DHS- and rendezvous-based approaches

  37. The Gini Coefficient • Based on the mean of the absolute differences between every possible pair of values • Takes values in the interval [0, 1), where a GC value of 0.0 is the best possible state, with 1.0 being the worst • The Gini Coefficient roughly represents the amount of imbalance in the system • Gini = A/(A+B), where A is the area between the line of equality and the Lorenz curve and B is the area under the Lorenz curve
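
A small sketch of this load-imbalance metric as defined above: the mean absolute difference over all pairs of node loads, normalised by twice the mean load. The example load vectors are invented.

```python
def gini_coefficient(loads: list[float]) -> float:
    """Gini coefficient of a load vector: 0.0 = perfectly balanced;
    values near 1.0 mean a single node takes (almost) all the load."""
    n = len(loads)
    mean = sum(loads) / n
    pairwise = sum(abs(x - y) for x in loads for y in loads)
    return pairwise / (2 * n * n * mean)

print(gini_coefficient([10, 10, 10, 10]))  # 0.0  -> all nodes equally loaded
print(gini_coefficient([0, 0, 0, 40]))     # 0.75 -> one node takes all the hits
```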

  38. Estimation error • The mean error of the estimation is reported • Computed as a percentage: the amount by which the distributed estimate differs from the aggregate computed in a centralized manner • (i.e., as if all data were stored on a single host)

  39. Hop-count Costs • The per-node average hop count for inserting all tuples into the distributed synopsis is measured and shown • The per-node hop-count costs are higher for the DHS-based approach

  40. Outline • Introduction • Background • Computing aggregates and building histograms • Implementation • Results • Conclusion

  41. Results • The hop-count efficiency and the accuracy of rendezvous-based hash sketches and of the DHS are measured • Initially, single-attribute relations are created, with integer values in the interval [0, 1000) • following either a uniform distribution (depicted as a Zipf with θ equal to 0.0) • or a shuffled Zipf distribution with θ equal to 0.7, 1.0, and 1.2
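
A hedged sketch of such a workload generator; the parameter names and the shuffling step reflect one reading of “shuffled Zipf” (popularity ranks assigned to random values), and the exact generator used in the paper may differ.

```python
import random

def shuffled_zipf_workload(n_tuples: int, n_values: int = 1000,
                           theta: float = 0.7, seed: int = 1) -> list[int]:
    """Draw n_tuples values from [0, n_values): the i-th most popular value has
    probability proportional to 1/(i+1)^theta, with ranks shuffled over the
    value domain; theta = 0.0 degenerates to the uniform distribution."""
    rng = random.Random(seed)
    values = list(range(n_values))
    rng.shuffle(values)  # decouple popularity rank from numeric value
    weights = [1.0 / (i + 1) ** theta for i in range(n_values)]
    return rng.choices(values, weights=weights, k=n_tuples)

workload = shuffled_zipf_workload(10_000, theta=1.0)
```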

  42. Total query load (node hits) over time

  43. Load distribution • The extra hop-count cost of the DHS-based approach pays off when it comes to load distribution fairness • The load on a node is the number of times it is visited (a.k.a. node hits) during data insertion and/or query processing

  44. Gini Coefficient [figure: Gini coefficient, rendezvous approach vs. DHS approach]

  45. Evolution of the Gini coefficient • In the rendezvous-based approach a single node carries all the query load • The DHS-based approaches converge to a GC of ≈ 0.5, which equals the GC value of the distribution of distances between consecutive nodes in the ID space • This is thus the best value achievable by any algorithm using randomized assignment of items to nodes

  46. Evolution of the Gini coefficient

  47. Error for computing the COUNT aggregate [figure: estimation error, rendezvous approach vs. DHS approach] • In both cases, the error is due to the use of hash sketches • Both approaches exhibit the same average error • As expected, the higher the number of bitmaps in the synopsis, the better the accuracy

  48. Insertion hop count [figure: insertion hop count, rendezvous approach vs. DHS approach] • The insertion hop-count cost for all aggregates • Hop-count costs are higher for the DHS-based approach by approximately 8×, for both the insertion and query cases

  49. Outline • Introduction • Background: Hash sketches • Computing aggregates and building histograms • Experimental setup • Results • Conclusion

  50. Conclusion • A framework for distributed statistical synopses for Internet-scale networks such as P2P systems • Extending techniques from centralized settings to distributed settings • Developing DHT-based higher-level synopses such as Equi-Width, ASH, and Equi-Depth histograms
