70 likes | 87 Views
This article explores methods for finding hot lists and the most popular items in data streams, with applications in retail sales, intrusion detection, fraud detection, and network congestion. It also discusses frequently used blocks in code execution and negative results in finding hot lists. The text language is English.
E N D
Finding hot-lists Given a data stream S, find the k most popular items Most popular product in retail sales data Hot-spots for intrusion detection, fraud detection, network congestion etc. Frequently used blocks in executed code
Finding hot-lists – negative result [AMS96]: Approximating the frequency of the most frequent item in a sequence requires Omega(n) memory bits. Proof using Razborov’s element disjointness in communication complexity.
Communication Complexity ALICE input A BOB input B Cooperatively compute function f(A,B) Minimize bits communicated Unbounded computational power Communication Complexity C(f) – bits exchanged by optimal protocol Π Protocols? 1-way versus 2-way deterministic versus randomized Cδ(f) – randomized complexity for error probability δ
Adaptive Sampling [GM98] - Sample elements from the input set - Frequently occurring elements will be sampled more often - Sampling probability determined at runtime, according to the allowed memory usage - Tradeoff between overhead and accuracy - Give an estimate of the sample’s accuracy
Concise Samples - Uniform random sampling - Maintain an <id, count> pair for each element - The sample size can be much larger than the memory size - For skewed input sets the gain is much larger - Sampling is not applied at every block - Vitter’s reservoir sampling
Comparison of Hot List Algorithms 500K values in [1,500] Zipf parameter = 1.5 Footprint = 100 20