Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks
By Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum
Anthony Okorodudu
CSE 6392
2006-4-25

Outline
• Introduction
• Motivation
• Related Work
• Distributed Hash Tables (DHT)
• Hash Sketches
• Distributed Hash Sketches (DHS)
• Counting with DHS
• Conclusion

Introduction
• Peer-to-peer (P2P) started as a way of sharing files and CPU cycles among end users
• Evolved into the cutting-edge networks of today
• Distributed Hash Tables (DHT) made this feasible
• Probabilistic guarantees for degree of efficiency, fault tolerance, and availability
• Data management systems of huge scale

Motivation
• Need for distributed counting mechanisms
  • File-sharing P2P systems: total number of documents shared by users
  • Sensor networks: compute aggregates in a duplicate-insensitive manner
  • Internet-scale DB system: build histograms for query access plans

Central Goals
• Efficiency: number of nodes contacted for counting must be small
• Scalability and availability: large numbers of nodes may need to add elements to a (multi-)set
• Access and storage load balancing: counting and related overheads should be fairly distributed across all nodes

Central Goals (continued)
• Accuracy: tunable, robust, and highly accurate cardinality estimation
• Simplicity and ease of integration: special, solution-specific indexing structures should be avoided
• Duplicate (in)sensitivity: count the total number of items as well as the number of unique items in multi-sets

Distributed Counting Protocols
• One-node-per-counter protocols
• Gossip-based protocols
• Broadcast/convergecast-type protocols
• Sampling-based protocols

One-node-per-counter
• Select a node in the DHT overlay and use it to maintain the counter value
• Poor scalability
• Resembles a centralized system

Gossip-based
• Provide weak probabilistic semantics of “eventual consistency” for outcome
• Every node exchanges information with a set of nodes
• Low bandwidth
• Not efficient in terms of number of nodes to be contacted
• Low accuracy

Broadcast/Convergecast-type
• Broadcast phase
  • Querying node broadcasts the query through the network, creating a tree of nodes as the query propagates through the overlay
• Convergecast phase
  • Each node sends its local part of the answer, along with the answers received from nodes deeper down the tree, to its “parent” node
• Similar to gossip-based protocols

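The two phases above can be sketched as a bottom-up sum over the broadcast tree. This is a minimal illustration, not the protocol of any particular system: the `tree` and `local_counts` dictionaries and the function name are assumptions made for the example. It also returns how many nodes were contacted, which in this scheme is every node in the tree.

```python
def convergecast_count(tree, local_counts, root):
    """Sum local counts bottom-up over a broadcast tree.

    tree maps a node to the children it forwarded the query to
    (hypothetical structure); local_counts maps a node to its local count.
    Returns (total, nodes_contacted).
    """
    def visit(node):
        total = local_counts.get(node, 0)
        contacted = 1  # this node received the query
        for child in tree.get(node, []):
            child_total, child_contacted = visit(child)
            total += child_total          # child reports its subtree answer
            contacted += child_contacted
        return total, contacted

    return visit(root)


# Querying node "q" reaches "a" and "b"; "a" forwards to "c".
tree = {"q": ["a", "b"], "a": ["c"]}
local_counts = {"q": 1, "a": 2, "b": 3, "c": 4}
print(convergecast_count(tree, local_counts, "q"))
```

Note that plain summation counts an item once per node that holds it, so this sketch is duplicate-sensitive.
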
Sampling-based
• Estimate the value of the counter by selectively querying a set of nodes in the network
• Sampling-based techniques suffer from accuracy issues
• Larger samples lead to higher accuracy, but more nodes need to be contacted
• Sampling-based techniques are usually duplicate-sensitive

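A rough sketch of the idea, assuming uniform random sampling of nodes and scaling the sample mean up to the whole network (one of several possible estimators; the function name and data layout are hypothetical). The example data shows the duplicate sensitivity: even a full sample reports 5 items although only 3 are distinct.

```python
import random


def sampled_total(local_counts, sample_size, seed=0):
    """Estimate the network-wide item total from a random sample of nodes."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    nodes = rng.sample(sorted(local_counts), sample_size)
    sample_sum = sum(local_counts[n] for n in nodes)
    # Scale the per-node average in the sample up to all nodes.
    return len(local_counts) * sample_sum / sample_size


# Items held per node; "b" and "c" are each stored on two nodes.
holdings = {"n1": {"a", "b"}, "n2": {"b", "c"}, "n3": {"c"}}
local_counts = {node: len(items) for node, items in holdings.items()}
distinct = len(set().union(*holdings.values()))
print(sampled_total(local_counts, 3), "estimated vs", distinct, "distinct")
```
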
Distributed Hash Tables (DHT)
• Family of structured P2P network overlays exposing a hash-table-like interface
  • insert(key, value)
  • lookup(key)
• Highly efficient for point queries

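To make the interface concrete, here is a toy, single-process sketch. It assumes a consistent-hashing ring where each key is stored on the first node whose identifier follows the key's identifier; the class name `ToyDHT` and the node names are hypothetical, and real overlays route each request over multiple hops rather than looking up the owner directly.

```python
import hashlib
from bisect import bisect_left


def ring_id(name: str, bits: int = 32) -> int:
    """Map a string to a position on a 2**bits identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class ToyDHT:
    """In-memory stand-in for a DHT: no networking, no routing, no churn."""

    def __init__(self, node_names):
        self.ring = sorted((ring_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.store = {n: {} for n in node_names}

    def owner(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping around.
        idx = bisect_left(self.ids, ring_id(key)) % len(self.ring)
        return self.ring[idx][1]

    def insert(self, key: str, value) -> None:
        self.store[self.owner(key)][key] = value

    def lookup(self, key: str):
        return self.store[self.owner(key)].get(key)


dht = ToyDHT([f"node-{i}" for i in range(8)])
dht.insert("song.mp3", "peer-17")
print(dht.owner("song.mp3"), dht.lookup("song.mp3"))
```
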
Hash Sketches
• First proposed as a means of estimating the cardinality of a multiset in a database
• Used in many application domains for counting distinct elements in multi-sets
  • Approximate query answering in very large DBs, data mining on the internet graph, stream processing

Hash Sketches (continued)
• PCSA (Probabilistic Counting with Stochastic Averaging) algorithm assumes a pseudo-uniform hash function
• Super-LogLog algorithm relaxes the pseudo-uniform hash function constraints of PCSA

Distributed Hash Sketches (DHS)
• Fully decentralized, scalable, and efficient mechanism capable of providing estimates on the cardinality of multi-sets
• Satisfies all the central goals
• Implemented using PCSA (DHS-PCSA) or super-LogLog (DHS-sLL) hash sketches

DHS
• O(log N) cost to insert an object in an N-node DHS
• O(b * log N) bandwidth consumption if the size of the data is b bytes
• Data items expire if not refreshed within their time-to-live, so deleting an item incurs no extra cost

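The time-to-live rule can be illustrated with a small soft-state store: inserting an existing key simply refreshes its expiry, and an item is "deleted" by no longer refreshing it. This is a sketch under assumptions; the class name, the explicit `now` argument, and the lazy purge are illustrative choices, not details taken from the paper.

```python
class SoftStateStore:
    """Per-node storage where entries vanish unless refreshed in time."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.expiry = {}  # key -> time at which the entry expires

    def insert(self, key, now: float) -> None:
        # Insert and refresh are the same operation.
        self.expiry[key] = now + self.ttl

    def live_keys(self, now: float) -> set:
        # Purge lazily: expired entries are dropped when the store is read.
        self.expiry = {k: t for k, t in self.expiry.items() if t > now}
        return set(self.expiry)


store = SoftStateStore(ttl=10)
store.insert("a", now=0)
store.insert("b", now=5)
store.insert("a", now=8)          # refresh "a"; "b" is never refreshed
print(store.live_keys(now=16))    # "b" expired at 15, "a" lives until 18
```
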
DHS (continued)
• Accuracy of hash sketches increases with multiple bitmap vectors
• Either PCSA or super-LogLog algorithm is applied for counting

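For a sense of scale, the approximate standard errors reported in the original analyses of the two estimators (Flajolet and Martin for PCSA, Durand and Flajolet for super-LogLog), not in these slides, shrink with the square root of the number of bitmap vectors m:

```latex
\frac{\sigma(\hat{n})}{n} \approx \frac{0.78}{\sqrt{m}} \quad \text{(PCSA)},
\qquad
\frac{\sigma(\hat{n})}{n} \approx \frac{1.05}{\sqrt{m}} \quad \text{(super-LogLog)}
```

For example, m = 64 gives roughly 9.8% and 13.1%, while m = 256 gives roughly 4.9% and 6.6%.
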
Counting with DHS

Conclusion
• Distributed Hash Sketches (DHS) is a fully decentralized, scalable, and efficient mechanism for estimating the cardinality of multi-sets in internet-scale information systems
• DHS can be implemented using either PCSA or super-LogLog hash sketches
• DHS histograms can yield substantial performance savings during query optimization

References
• N. Ntarmos, P. Triantafillou, and G. Weikum. Counting at Large: Efficient Cardinality Estimation in Internet-Scale Data Networks. ICDE 2006.

Thanks