
Anthony Okorodudu CSE 6392 2006-4-25

Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks, by Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum.


Presentation Transcript


  1. Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. By Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum. Anthony Okorodudu, CSE 6392, 2006-4-25

  2. Outline • Introduction • Motivation • Related Work • Distributed Hash Tables (DHT) • Hash Sketches • Distributed Hash Sketches (DHS) • Counting with DHS • Conclusion

  3. Introduction • Peer-to-peer (P2P) started as a way of sharing files/CPU cycles among end-users • Evolved into the cutting-edge networks of today • Distributed Hash Tables (DHT) made this feasible • Probabilistic guarantees for degree of efficiency, fault tolerance, and availability • Data management systems of huge scale

  4. Motivation • Need for distributed counting mechanisms • File-sharing P2P systems: total number of documents shared by users • Sensor networks: compute aggregates in a duplicate-insensitive manner • Internet-scale DB system: build histograms for query access plans

  5. Central Goals • Efficiency: number of nodes contacted for counting must be small • Scalability and availability: large numbers of nodes may need to add elements to a (multi-) set • Access and storage load balancing: counting and related overheads should be fairly distributed across all nodes

  6. Central Goals (continued) • Accuracy: tunable, robust, and highly accurate cardinality estimation • Simplicity and ease of integration: special, solution-based indexing structures should be avoided • Duplicate (in)sensitivity: count total number of items as well as the number of unique items in multi-sets

  7. Distributed Counting Protocols • One-node-per-counter protocols • Gossip-based protocols • Broadcast/convergecast-type protocols • Sampling-based protocols

  8. One-node-per-counter • Select a node in the overlay of the DHT and use it to maintain the counter value • Poor scalability • Resembles a centralized system

  9. Gossip-based • Provide weak probabilistic semantics of “eventual consistency” for the outcome • Every node exchanges information with a set of nodes • Low bandwidth • Not efficient in terms of the number of nodes to be contacted • Low accuracy

  10. Broadcast/Convergecast-type • Broadcast phase • Querying node broadcasts the query through the network, creating a tree of nodes as the query propagates through the overlay • Convergecast phase • Each node sends its local part of the answer, along with the answers received from nodes deeper down the tree, to its “parent” node • Similar to gossip-based
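The two phases on this slide can be sketched in a few lines. This is a minimal single-process illustration, not the protocol of any particular system: the overlay is assumed to be a plain adjacency dictionary, and the names `broadcast_tree` and `convergecast` are hypothetical.

```python
from collections import deque


def broadcast_tree(overlay, root):
    """Broadcast phase: flood the query from `root` (breadth-first).

    Each node adopts as parent the first neighbor it hears the query from.
    Returns a dict mapping every reached node to its list of children.
    """
    children = {root: []}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in overlay[node]:
            if neighbor not in children:  # first time this node sees the query
                children[neighbor] = []
                children[node].append(neighbor)
                queue.append(neighbor)
    return children


def convergecast(children, local_counts, node):
    """Convergecast phase: a node's answer is its local count plus the
    answers reported by its children in the tree."""
    return local_counts[node] + sum(
        convergecast(children, local_counts, child) for child in children[node]
    )


# Hypothetical four-node overlay and per-node local counts.
overlay = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
local_counts = {"a": 1, "b": 2, "c": 3, "d": 4}
tree = broadcast_tree(overlay, "a")
total = convergecast(tree, local_counts, "a")
```

Note that the tree spans every node in the overlay, which is why this family of protocols fares poorly on the "number of nodes contacted" goal.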

  11. Sampling-based • Estimate the value of the counter by selectively querying a set of nodes in the network • Sampling-based techniques suffer from accuracy issues • Larger samples lead to higher accuracy, but more nodes need to be contacted • Sampling-based techniques are usually duplicate-sensitive
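A small sketch may help show why sampling is duplicate-sensitive. It assumes the total number of nodes is known and that each node holds a set of items; the function name `sample_estimate` and the data are made up for illustration.

```python
import random


def sample_estimate(node_items, sample_size, rng):
    """Estimate the total item count by querying `sample_size` random nodes
    and scaling their mean local count up to the whole network."""
    sampled = rng.sample(sorted(node_items), sample_size)
    mean_per_node = sum(len(node_items[n]) for n in sampled) / sample_size
    return mean_per_node * len(node_items)


# Hypothetical network: 10 nodes that all share the same 5 documents.
shared_docs = {"doc1", "doc2", "doc3", "doc4", "doc5"}
node_items = {f"node{i}": set(shared_docs) for i in range(10)}

estimate = sample_estimate(node_items, sample_size=3, rng=random.Random(0))
distinct = len(set().union(*node_items.values()))
```

Here the estimate is 50 although only 5 distinct documents exist: copies held by different nodes are counted once per node.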

  12. Distributed Hash Tables (DHT) • Family of structured P2P network overlays exposing a hash-table-like interface • insert(key, value) • lookup(key) • Highly efficient for point queries
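The interface above can be illustrated with a toy, single-process stand-in. This is only a sketch under stated assumptions: real DHTs route each request across the network in multiple hops, whereas here the class `ToyDHT`, its ring placement via truncated SHA-1, and the successor rule are simplifications chosen for illustration.

```python
import hashlib
from bisect import bisect_left


class ToyDHT:
    """Toy stand-in for a DHT: a key is stored on the first node whose
    identifier is >= the key's hash, wrapping around the ring."""

    def __init__(self, node_names):
        # Sorted (identifier, name) pairs place the nodes on the ring.
        self.ring = sorted((self._hash(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.storage = {name: {} for name in node_names}

    @staticmethod
    def _hash(text):
        # Assumption: 64 bits of SHA-1 serve as the identifier space.
        return int.from_bytes(hashlib.sha1(text.encode()).digest()[:8], "big")

    def responsible_node(self, key):
        index = bisect_left(self.ids, self._hash(key)) % len(self.ring)
        return self.ring[index][1]

    def insert(self, key, value):
        self.storage[self.responsible_node(key)][key] = value

    def lookup(self, key):
        return self.storage[self.responsible_node(key)].get(key)


dht = ToyDHT(["node-a", "node-b", "node-c"])
dht.insert("song.mp3", "held by peer 17")
```

A point query such as `dht.lookup("song.mp3")` touches exactly one responsible node, which is the property the counting scheme builds on.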

  13. Hash Sketches • First proposed as a means of estimating the cardinality of a multiset in a database • Used in many application domains for counting distinct elements in multi-sets • Approximate query answering in very large DBs, data mining on the internet graph, stream processing

  14. Hash Sketches (continued) • PCSA (Probabilistic Counting with Stochastic Averaging) algorithm assumes a pseudo-uniform hash function • Super-LogLog algorithm relaxes the pseudo-uniform hash function constraints of PCSA
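For readers unfamiliar with PCSA, a minimal centralized sketch follows. It is an illustration of the Flajolet-Martin technique, not the presenters' or the paper's code: the class name `PCSASketch`, the use of truncated SHA-1 as the pseudo-uniform hash, and the default sizes are all assumptions.

```python
import hashlib

PHI = 0.77351  # bias-correction constant from Flajolet and Martin's analysis


class PCSASketch:
    def __init__(self, num_bitmaps=64, bits=32):
        self.m = num_bitmaps
        self.bits = bits
        self.bitmaps = [0] * num_bitmaps  # one integer per bitmap vector

    @staticmethod
    def _hash(item):
        # Assumption: 64 bits of SHA-1 approximate a uniform hash.
        return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

    def add(self, item):
        h = self._hash(item)
        bucket, rest = h % self.m, h // self.m
        # rho = index of the least significant 1-bit of the remaining bits
        rho = (rest & -rest).bit_length() - 1 if rest else self.bits - 1
        self.bitmaps[bucket] |= 1 << min(rho, self.bits - 1)

    def estimate(self):
        # R_j = index of the lowest unset bit in bitmap j; average over bitmaps.
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


sketch = PCSASketch()
for i in range(10_000):
    sketch.add(i)
first_estimate = sketch.estimate()
for i in range(10_000):  # re-adding the same items sets no new bits
    sketch.add(i)
second_estimate = sketch.estimate()
```

Because an item always hashes to the same bit, inserting duplicates leaves the estimate unchanged, which is the duplicate-insensitivity the earlier slides ask for.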

  15. Distributed Hash Sketches (DHS) • Fully decentralized, scalable, and efficient mechanism capable of providing estimates on the cardinality of multi-sets • Satisfies all the central goals • Implemented using PCSA (DHS-PCSA) or super-LogLog (DHS-sLL) hash sketches

  16. DHS • O(log N) cost to insert an object in an N-node DHS • O(b * log N) bandwidth consumption if the size of the data is b bytes • Data items are deleted if not updated within their time-to-live, so deleting an item incurs no extra cost
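The time-to-live point can be sketched as soft state. This is a hedged illustration of the general idea only: the class `SoftStateStore`, its lazy expiry, and the explicit `now` parameter are assumptions made for clarity, not details taken from the paper.

```python
class SoftStateStore:
    """Entries vanish unless refreshed within `ttl` time units, so a delete
    is simply the absence of a refresh and costs no extra messages."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expiry = {}  # key -> time at which the entry expires

    def put(self, key, now):
        # Inserting and refreshing are the same operation.
        self.expiry[key] = now + self.ttl

    def live_keys(self, now):
        # Expired entries are dropped lazily when the store is read.
        self.expiry = {k: t for k, t in self.expiry.items() if t > now}
        return set(self.expiry)


store = SoftStateStore(ttl=10)
store.put("bit-3", now=0)
store.put("bit-7", now=0)
store.put("bit-3", now=8)       # refreshed; "bit-7" is left to age out
alive = store.live_keys(now=12)
```

At time 12 only the refreshed entry survives; the owner of the other entry "deleted" it by doing nothing.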

  17. DHS (continued) • Accuracy of hash sketches increases with multiple bitmap vectors • Either PCSA or super-LogLog algorithm is applied for counting
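As a rough guide to how accuracy improves with the number of bitmap vectors, the standard errors usually quoted for the two estimators (from the original analyses of PCSA and super-LogLog, stated here as background rather than taken from the slides) are, for m bitmap vectors or buckets:

```latex
\[
\sigma_{\text{PCSA}} \approx \frac{0.78}{\sqrt{m}},
\qquad
\sigma_{\text{sLL}} \approx \frac{1.05}{\sqrt{m}}
\]
% Worked example:
%   m = 64:  PCSA ~ 0.78/8  = 9.75%,  super-LogLog ~ 1.05/8  ~ 13.1%
%   m = 256: PCSA ~ 0.78/16 ~ 4.9%,   super-LogLog ~ 1.05/16 ~ 6.6%
```

Quadrupling the number of vectors therefore halves the expected relative error in both cases.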

  18. Counting with DHS

  19. Conclusion • Distributed Hash Sketches is a fully decentralized, scalable, and efficient mechanism for providing estimates on the cardinality of multi-sets in internet-scale information systems • DHS is implemented using either PCSA or super-LogLog hash sketches • DHS histograms can introduce great performance savings during query optimization

  20. References • N. Ntarmos, P. Triantafillou, and G. Weikum. Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. ICDE 2006.

  21. Thanks
