Finding hot-lists

This presentation covers methods for finding hot lists, the k most popular items in a data stream, with applications in retail sales, intrusion detection, fraud detection, and network congestion, as well as frequently used blocks in executed code. It also presents a negative result on finding hot lists.

warkentin

Presentation Transcript


  1. Finding hot-lists
     - Given a data stream S, find the k most popular items
     - Most popular product in retail sales data
     - Hot-spots for intrusion detection, fraud detection, network congestion, etc.
     - Frequently used blocks in executed code
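As a point of reference for the problem statement on this slide, here is a minimal sketch of the exact (non-streaming) answer. It is not from the presentation: the function name `exact_hot_list` is hypothetical, and the approach keeps one counter per distinct item, which is the memory cost the sampling methods on later slides try to avoid.

```python
from collections import Counter

def exact_hot_list(stream, k):
    """Return the k most frequent items of `stream` as (item, count) pairs.

    Baseline only: memory grows with the number of distinct items,
    so this does not fit a fixed memory footprint.
    """
    counts = Counter()
    for item in stream:
        counts[item] += 1
    return counts.most_common(k)

# Toy example: product ids in a sales stream.
sales = ["tea", "milk", "tea", "bread", "tea", "milk"]
print(exact_hot_list(sales, 2))
```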

  2. Finding hot-lists – negative result
     - [AMS96]: Approximating the frequency of the most frequent item in a sequence requires Ω(n) memory bits.
     - Proof by reduction from the set disjointness problem in communication complexity, using Razborov's lower bound.

  3. Communication Complexity
     - ALICE holds input A; BOB holds input B
     - Cooperatively compute function f(A,B)
     - Minimize bits communicated
     - Unbounded computational power
     - Communication complexity C(f): bits exchanged by the optimal protocol Π
     - Protocols? 1-way versus 2-way; deterministic versus randomized
     - Cδ(f): randomized complexity for error probability δ

  4. Adaptive Sampling [GM98]
     - Sample elements from the input set
     - Frequently occurring elements will be sampled more often
     - Sampling probability determined at runtime, according to the allowed memory usage
     - Tradeoff between overhead and accuracy
     - Give an estimate of the sample's accuracy

  5. Concise Samples
     - Uniform random sampling
     - Maintain an <id, count> pair for each element
     - The sample size can be much larger than the memory size
     - For skewed input sets the gain is much larger
     - Sampling is not applied at every block
     - Vitter's reservoir sampling
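The bullets on this slide can be illustrated with a short sketch of a concise sample in the spirit of [GM98]. The details are assumptions rather than taken from the slides: the class name `ConciseSample`, the footprint accounting (1 unit for an item seen once, 2 units for an <id, count> pair), the threshold growth factor of 1.1, and the eviction rule used when the threshold is raised.

```python
import random
from collections import Counter

class ConciseSample:
    """Uniform sample stored as <id, count> pairs within a fixed footprint."""

    def __init__(self, footprint, growth=1.1, seed=None):
        self.footprint = footprint   # allowed memory, in abstract units
        self.growth = growth         # assumed factor for raising the threshold
        self.tau = 1.0               # each item is sampled with probability 1/tau
        self.counts = Counter()      # id -> number of sample points
        self.rng = random.Random(seed)

    def _size(self):
        # Assumed accounting: a singleton costs 1 unit, a pair costs 2.
        return sum(1 if c == 1 else 2 for c in self.counts.values())

    def _raise_threshold(self):
        # Keep each sample point with probability tau / new_tau, so the
        # sample stays uniform at the lower sampling rate.
        new_tau = self.tau * self.growth
        keep = self.tau / new_tau
        for item in list(self.counts):
            kept = sum(self.rng.random() < keep
                       for _ in range(self.counts[item]))
            if kept:
                self.counts[item] = kept
            else:
                del self.counts[item]
        self.tau = new_tau

    def add(self, item):
        if self.rng.random() < 1.0 / self.tau:
            self.counts[item] += 1
            while self._size() > self.footprint:
                self._raise_threshold()

    def sample_size(self):
        # Number of sample points represented; can exceed the footprint.
        return sum(self.counts.values())

    def hot_list(self, k):
        # Scale sample counts by tau to estimate frequencies in the stream.
        return [(item, c * self.tau) for item, c in self.counts.most_common(k)]
```

On skewed input most sample points collapse into a few pairs, so `sample_size()` can be far larger than the footprint, which is the gain the slide refers to.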

  6. Concise Samples

  7. Comparison of Hot List Algorithms
     - 500K values in [1,500]
     - Zipf parameter = 1.5
     - Footprint = 100
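For readers who want to recreate a workload like the one on this slide, the following is a rough sketch. It assumes that "Zipf parameter" means the exponent z in P(i) proportional to 1/i^z over the domain [1,500]; the slides do not spell this out, and the function name `zipf_stream` and the fixed seed are hypothetical.

```python
import random
from collections import Counter

def zipf_stream(n=500_000, domain=500, z=1.5, seed=0):
    """Draw n values from [1, domain] with P(i) proportional to 1 / i**z."""
    rng = random.Random(seed)
    values = range(1, domain + 1)
    weights = [1.0 / (i ** z) for i in values]
    return rng.choices(values, weights=weights, k=n)

# A smaller run, to inspect the skew of the distribution.
stream = zipf_stream(n=10_000)
print(Counter(stream).most_common(5))
```

A hot-list algorithm can then be run over such a stream with a footprint of 100 and its reported items compared against the exact counts.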
