Fishing for Patterns in Data Streams
Subhash Suri (with Hershberger, Shrivastava, Toth)
An Old Chestnut: Majority • A sequence of N items. • You have constant memory. • In one pass, decide whether some item is a majority, i.e., occurs > N/2 times. • Example: 2 9 9 9 7 6 4 9 9 9 3 9 (N = 12; item 9 is the majority).
Misra-Gries Algorithm ('82) • Keep one counter and one ID. • If the new item matches the stored ID, increment the counter. • Otherwise, if the counter is 0, store the new item with count = 1; else decrement the counter. • If the counter ends up > 0, its item is the only candidate for majority (see the sketch below).
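A minimal sketch of this one-counter scheme (Boyer-Moore voting, equivalently Misra-Gries with k = 1); the function name and comments are mine:

```python
def majority_candidate(stream):
    """One pass, O(1) memory: returns the only possible majority item.

    If some item occurs > N/2 times it is returned; otherwise the
    returned value may be a false positive, and a second pass over
    the data is needed to verify the actual count.
    """
    candidate, count = None, 0
    for x in stream:
        if count == 0:          # counter empty: adopt x as the candidate
            candidate, count = x, 1
        elif x == candidate:    # same as stored ID: increment
            count += 1
        else:                   # different item: cancel one occurrence
            count -= 1
    return candidate if count > 0 else None

# The slide's example: N = 12, item 9 is the majority.
print(majority_candidate([2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9]))  # -> 9
```

Without the second pass the answer is only a candidate: on a stream with no majority the function may still return some item.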
A Generalization: Frequent Items • Find all items occurring more than N/(k+1) times; at most k such items exist. • Algorithm (sketched below): • Maintain k items and their counters. • If the next item x is one of the k, increment its counter. • Else, if some counter is zero, put x there with count = 1. • Else (all counters non-zero), decrement all k counters.
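A sketch of the k-counter generalization, assuming a dict stands in for the k (ID, count) slots; a deleted entry plays the role of a zero counter:

```python
def frequent_items(stream, k):
    """One pass with at most k counters: returns candidates that may
    occur more than N/(k+1) times. Every true frequent item appears,
    and each count is an underestimate, off by at most N/(k+1)."""
    counters = {}
    for x in stream:
        if x in counters:             # x already tracked: increment
            counters[x] += 1
        elif len(counters) < k:       # a free (zero) counter: store x
            counters[x] = 1
        else:                         # all k counters busy: decrement all,
            for y in list(counters):  # erasing k+1 items (x and one copy
                counters[y] -= 1      # of each stored item)
                if counters[y] == 0:
                    del counters[y]
    return counters

# -> {9: 5, 3: 1}; 9's true count is 7, undercounted by 2 <= N/(k+1) = 4
print(frequent_items([2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9], k=2))
```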
Frequent Elements: Analysis • An item's counter is decremented only when all counters are non-zero; each such step erases k+1 items (the new item plus one copy of each stored item). • Since each of the N stream items can be erased at most once, there are fewer than N/(k+1) decrement steps. • So if x occurs > N/(k+1) times, it cannot be completely erased. • Similarly, x must get inserted at some point, because there are not enough other items to keep it out.
Data Stream Algorithms • Majority and Frequent are examples of data stream algorithms. • Data arrives as an online sequence x1, x2, …, potentially infinite. • The algorithm processes the data in one pass (in the given order). • The algorithm's memory is significantly smaller than the input data. • The goal is to summarize the data: compute useful patterns.
Streaming Data Sources • Internet traffic monitoring • Web logs and click streams • Financial and stock market data • Retail and credit card transactions • Telecom calling records • Sensor networks, surveillance • RFID • Instruction profiling in microprocessors • Data warehouses (random access too expensive).
Internet Traffic Analysis • Usage trends for engineering, provisioning, abuse detection, etc. • Discover the sources of large traffic. • Items = IP packets; item ID = flow ID (e.g., the sender's IP address). • Frequent items = heavy hitters; e.g., report all flows that consume more than 1% of the link bandwidth. • Often we count bytes instead of the number of occurrences.
Stream Data • Rapid, continuous arrival: several million packets/sec. • Huge volume: > 50 TB of header data per day for a gigabit router. • Real-time response required. • Small memory: fast but costly SRAM. • In this sea of data, spot unusual traffic patterns and anomalies.
Problem of False Positives • The Misra-Gries algorithm identifies all true heavy hitters, but not all reported items are necessarily heavy hitters. • How can we tell whether the non-zero counters correspond to true heavy hitters? A second pass is needed to verify. • False positives are problematic if heavy hitters are used for billing or punishment. • What guarantees can we achieve in one pass?
Approximation Guarantees • Find heavy hitters with a guaranteed approximation error [Demaine et al., Manku-Motwani, Estan-Varghese, …]. • Manku-Motwani (Lossy Counting, sketched below): • Suppose you want φ-heavy hitters: items with frequency > φN. • Use an approximation parameter ε, where ε << φ (e.g., φ = .01 and ε = .0001; i.e., φ = 1% and ε = .01%). • Identify all items with frequency > φN. • No reported item has frequency < (φ - ε)N. • The algorithm uses O((1/ε) log(εN)) memory.
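A hedged sketch of Lossy Counting with the parameters above (phi is the heavy-hitter threshold, epsilon the approximation error; variable names are mine). Entries are pruned at bucket boundaries, which is where the log(εN) memory bound comes from:

```python
import math

def lossy_counting(stream, phi, epsilon):
    """One-pass approximate heavy hitters (Manku-Motwani style).
    Reports every item with frequency > phi*N and no item with
    frequency < (phi - epsilon)*N."""
    width = math.ceil(1 / epsilon)        # bucket width w = ceil(1/epsilon)
    entries = {}                          # item -> (count, delta)
    n = 0
    for x in stream:
        n += 1
        bucket = math.ceil(n / width)     # current bucket id
        if x in entries:
            count, delta = entries[x]
            entries[x] = (count + 1, delta)
        else:
            entries[x] = (1, bucket - 1)  # delta = max possible undercount
        if n % width == 0:                # bucket boundary: prune small entries
            entries = {y: (c, d) for y, (c, d) in entries.items()
                       if c + d > bucket}
    # Report items whose estimated count clears (phi - epsilon) * N.
    return [y for y, (c, d) in entries.items() if c >= (phi - epsilon) * n]

stream = [2, 9, 9, 9, 7, 6, 4, 9, 9, 9, 3, 9]
print(lossy_counting(stream, phi=0.5, epsilon=0.25))  # -> [9]
```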
Misra-Gries Revisited • Running the MG algorithm with k = 1/ε counters also achieves the ε-approximation: it undercounts any item by at most εN. • In fact, MG uses only O(1/ε) memory. • Manku-Motwani is slightly better in per-item processing cost: • MG needs an extra data structure for decrementing all counters: O(log(1/ε)) per item. • Manku-Motwani is O(1) amortized per item. • See Demaine et al. for more details.
Patterns in Internet Traffic • There is no flow ID: knowledge of applications, connections, etc. has to be inferred by analysis. • Raw stream data is too low-level; patterns are visible only in multiple dimensions. • Useful patterns require paying attention to the Internet's hierarchy.
Hierarchy in Data • An example IP address hierarchy (figure: a tree rooted at UCSB, with nodes such as PHY, CS, GSL, CSIL, Lab1, and the CS WebServer). • We are interested in subgroups that emerge as heavy hitters. • A heavy hitter can be a single machine, or it can be formed by a group. • To avoid redundancy, a higher-level entity should exclude a lower node already tagged as heavy. • Is CS a heavy hitter even without its WebServer, or only because of it?
Dimensions in Data • IP packets can be summarized along multiple dimensions: src, dst, protocol, ports, etc. • Useful patterns may involve multiple dimensions: aggregation by IP source identifies servers; aggregation by ports identifies applications. • Learning which servers generate which kind of traffic requires fishing on both fields simultaneously. (Figure: boxes A, B, C in the SrcIP x DestIP plane, with labels Src1, Dest1, Subnet1, Subnet2, WebServer.)
A Geometric Formulation • A stream of points, e.g., IP packets in header space. • A set of implicitly defined boxes: patterns or classification rules. • A box B is heavy if it contains > φN points. • Discounted frequency of B: exclude the points that lie in heavy boxes properly contained in B. • Hierarchical Heavy Hitters: a box is a φ-HHH if its discounted frequency is > φN. (Figure: nested boxes A, B, C illustrating discounted frequency; a toy example follows.)
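To make the discounted-frequency definition concrete, here is an offline toy, not a streaming algorithm: 1-D intervals stand in for boxes, the hierarchy is assumed to be purely nested, and the data and φ are invented for illustration. Boxes are tested innermost first, so a heavy child's points are excluded from its ancestors:

```python
def discounted_hhh(points, boxes, phi):
    """boxes: closed intervals (lo, hi), assumed nested; returns the
    phi-HHHs, charging each point only to the innermost heavy box."""
    n = len(points)
    # For a nested hierarchy, sorting by length visits children first.
    boxes = sorted(boxes, key=lambda b: b[1] - b[0])
    hhh, covered = [], set()      # covered: point indices already charged
    for lo, hi in boxes:
        inside = {i for i, p in enumerate(points) if lo <= p <= hi}
        if len(inside - covered) > phi * n:   # discounted frequency test
            hhh.append((lo, hi))
            covered |= inside     # exclude these points higher up
    return hhh

points = [1, 2, 2, 3, 8, 9, 9, 9, 9, 15]
boxes = [(0, 16), (0, 4), (7, 10)]   # (0, 4) and (7, 10) nest inside (0, 16)
# (0, 16) holds 10 points in total, but only 1 after discounting its
# two heavy children, so it is not reported.
print(discounted_hhh(points, boxes, phi=0.3))  # -> [(7, 10), (0, 4)]
```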
Computing HHHs • The number of φ-HHHs is at most A/φ, where A is the size of the largest anti-chain in the hierarchy: the maximum number of overlapping but incomparable boxes (assumed constant). • Estan-Savage-Varghese (SIGCOMM '03): • Offline: requires multiple passes. • Does not find true HHHs: the difficulty is overlapping boxes. • Heuristic optimizations: first find 1-D heavy hitters, then take cross products for multi-dimensional HHHs, etc. • Cormode et al. (VLDB '03, SIGMOD '04): • Extends the Lossy Counting (Manku-Motwani) algorithm. • Heuristics to deal with the overlap problems. • Online and space-efficient, but no bounds on the discounted frequency.
The Discounted Frequency Problem • No existing space-efficient scheme offers a provable guarantee on the discounted frequency; worst-case bounds hold only for the total frequency. • Hence the problem of false positives. • Two sources of the problem: loss of information during merges, and overlapping boxes. (Figure: configurations of boxes A, B, C illustrating merging and overlap.)
Complexity of HHHs • Can we have provable guarantees, just as for flat heavy hitters? • Guaranteed separation between heavy and non-heavy boxes: • Every box identified as a φ-HHH should have discounted frequency > φN. • All other boxes should have discounted frequency < (φ - ε)N. • Hershberger-Shrivastava-Suri-Toth [PODS 05]: any φ-HHH algorithm with a fixed approximation guarantee in d dimensions must use Ω(1/φ^(d+1)) memory in the worst case.
The Lower Bound in 1-D • r intervals of length 2 each (call them literals). • The union of the r intervals is B (of length 2r). • Each literal is split into two unit-length sub-intervals. • If the stream points fall in the left (resp. right) sub-interval, we say the literal has orientation 0 (resp. 1).
The Construction • The stream arrives in 2 phases. • 1st phase: put 3φN/r points in each literal, either in its left or its right half. • 2nd phase: the adversary chooses either the left or the right half of each literal and puts φN points there; call these intervals sticks. • Heavy hitters: each stick is a φ-HHH. • The discounted frequency of B (the union interval) depends on the literals whose orientations in the 1st and 2nd phases differ. • An algorithm must therefore keep track of Ω(r) orientations after the 1st phase.
The Lower Bound • Suppose an algorithm A uses at most 0.01r bits of space. • After phase 1, it must encode the orientations of the r literals in 0.01r bits. • There are 2^r distinct orientations, so two orientations that differ in at least r/3 literals map to the same (0.01r)-bit code ==> they are indistinguishable to A (see the counting step below). • If the orientations in the 1st and 2nd phases are the same, the discounted frequency of B is 0, so B is not a HHH. • If r/3 literals differ, the discounted frequency of B is (r/3) * (3φN/r) = φN, so B is a φ-HHH. • So A misclassifies B on one of the two sequences.
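The counting step can be made explicit; a hedged sketch (the constants 0.01r and r/3 are the slide's, and H is the binary entropy function):

```latex
% Pigeonhole behind the indistinguishability claim.
% With 0.01r bits there are at most 2^{0.01r} memory states, but 2^{r}
% orientations, so at least this many orientations share one state:
\[
  \frac{2^{r}}{2^{0.01r}} \;=\; 2^{0.99r}.
\]
% A Hamming ball of radius r/3 contains at most
\[
  \sum_{i=0}^{\lfloor r/3 \rfloor} \binom{r}{i}
  \;\le\; 2^{H(1/3)\,r} \;\approx\; 2^{0.92r} \;<\; 2^{0.99r}
\]
% orientations, so two orientations sharing a state must differ in at
% least r/3 literals, and A cannot tell them apart.
```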
Completing the Lower Bound • Make r independent copies of the construction. • Use only one of them to complete the construction in the 2nd phase. • The algorithm needs Ω(r^2) bits to track all the orientations. • For r = 1/(4φ), this gives an Ω(1/φ^2) lower bound. (Figure: r stacked copies of the width-2r construction.)
Multi-Dimensional Lower Bound • The 1-D lower bound is information-theoretic; it applies to all algorithms. • For higher dimensions, we need a more restrictive model of computation. • Box counter model: an algorithm with memory m maintains m counters, each tracking the frequency of a box. • All deterministic heavy hitter algorithms fit this model. • In the box counter model, finding φ-HHHs in d dimensions with any fixed approximation requires Ω(1/φ^(d+1)) memory.
2-D (Multi-Dim) Construction • A box B and a set of descendants; B has side length 2r. • 1st phase: 2x2 (literal) boxes in the upper left quadrant, each with orientation 0 or 1. • 2nd phase: • Diagonal: boxes in the upper left quadrant, all with orientation 0. • Sticks: 1xr (or rx1) boxes. • Uniform: points spread over the lower right quadrant. (Figure: B with the literal, diagonal, stick, and uniform regions marked.)
Multi-Dimensional Lower Bound • Intuition: • Each stick combines with a diagonal box to form a skinny φ-HHH box. • Diagonal boxes pair up to form φ-HHHs. • The skinny boxes form a checkerboard pattern in the upper left quadrant. • Each literal is either fully covered or half covered by skinny boxes. • As in 1-D, the adversary picks the sticks. • The discounted frequency of B comes from the half-covered literals and the points in the uniform quadrant. (Figure: fully covered vs. half-covered literals among the diagonal, stick, and uniform regions.)
The Lower Bound • The algorithm must remember the Ω(r^2) literal orientations. • Otherwise, it cannot distinguish between two sequences in which the discounted frequency of B is m or 3m/2, respectively (for m = (20/29)φN). • As before, by making r copies of the construction, we get a lower bound of Ω(r^3). • The basic construction generalizes to d dimensions. • Adjusting the hierarchy yields the lower bound for any arbitrary approximation.
Hierarchical Heavy Hitters • In the most general setting, no space-efficient scheme may be possible for φ-HHHs with guaranteed approximation quality: Ω(1/φ^(d+1)) memory in the worst case. • Some tractable summaries: adaptive spatial partitioning; medians and quantiles; geometric summaries.
Conclusions • Age of data glut. • Growing need for real time analytics. • Locality and geometric structures. • Which geometric patterns are hard to detect in streams? • Which data mining tasks are feasible in streams?