A Fast and Compact Method for Unveiling Significant Patterns in High-Speed Networks Tian Bu¹, Jin Cao¹, Aiyou Chen¹, Patrick P. C. Lee² Bell Labs, Alcatel-Lucent¹; Columbia University² May 10, 2007
Outline • Motivation • Why heavy-key detection? • What are the challenges? • Sequential hashing scheme • Allows fast, memory-efficient heavy-key detection in high-speed networks • Results of trace-driven simulation
Motivation • Many anomalies in today’s networks: • Worms, DoS attacks, flash crowds, … • Input: a stream of packets in (key, value) pairs • Key: e.g., srcIPs, flows,… • Value: e.g., data volume • Goal: identify heavy keys that cause anomalies • Heavy hitters: keys with massive data in one period • E.g., flows that violate service agreements • Heavy changers: keys with massive data change across two periods • E.g., sources that start DoS attacks
Challenge • Keeping track of per-key values is infeasible • [Figure: a per-key counter table, keys 1, 2, 3, …, N with counter values v1, v2, v3, …, vN] • Number of keys = 2^32 if we keep track of source IPs • Number of keys = 2^104 if we keep track of 5-tuples (srcIP, dstIP, srcPort, dstPort, proto)
Goal • Find heavy keys using a “smart” design: • Fast per-packet update • Fast identification of heavy keys • Memory-efficient • High accuracy
Previous Work • Multi-stage filter [Estan & Varghese, 03] • Covers only heavy hitter detection, but not heavy changer detection • Deltoids [Cormode & Muthukrishnan, 04] • Covers both heavy hitter and heavy changer detection, but is not memory-efficient in general • Reversible sketch [Schweller et al., 06] • Space and time complexities of detection are sub-linear in the key space size
Our Contributions • Derive the minimum memory requirement subject to a targeted error rate • Propose a sequential hashing scheme that is memory-efficient and allows fast detection • Propose an accurate estimation method to estimate the values of heavy keys • Show via trace-driven simulation that our scheme is more accurate than the existing work
Minimum Memory Requirement • How to feasibly keep track of per-key values? • Use a hash array [Estan & Varghese, 2003] • M independent hash tables • K buckets in each table • [Figure: a hash array of Tables 1, 2, …, M, each with buckets 1, 2, …, K]
Minimum Memory Requirement • Record step: for each packet of key x and value v, • Find the bucket in Table i by hashing x: hi(x), for i = 1, …, M • Increment the counter of each such bucket by v • [Figure: key x hashed by h1, h2, …, hM into one bucket of each table; each hashed bucket is incremented by +v]
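As a rough illustration of the record step, here is a minimal Python sketch of a hash array update; the parameter values and names (M, K, h, record) are assumptions for illustration, not taken from the talk.

```python
import hashlib

M = 16          # number of independent hash tables (illustrative value)
K = 1024        # buckets per table (illustrative value)
tables = [[0] * K for _ in range(M)]

def h(i, key):
    """i-th hash function: hash the key together with the table index."""
    digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % K

def record(key, value):
    """Record step: add the packet's value to one bucket in every table."""
    for i in range(M):
        tables[i][h(i, key)] += value
```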
Minimum Memory Requirement • Detection step: find heavy buckets, i.e., buckets whose values (or changes) exceed the threshold • Heavy keys: keys whose associated buckets are all heavy buckets • [Figure: heavy buckets highlighted across Tables 1, …, M]
Minimum Memory Requirement • Input parameters: • N = size of the key space • H = max. number of heavy keys • ε = error rate, Pr(a non-heavy key is treated as a heavy key) • Objective: find all heavy keys subject to a targeted error rate ε • Minimum memory requirement: the size of the hash array, M*K, is minimized when • K = H / ln(2) • M = log2(N / (εH))
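To get a feel for these sizing formulas, the following sketch plugs in hypothetical numbers (H, ε, and the rounding are assumptions, not figures from the talk):

```python
import math

N = 2 ** 32    # key space size, e.g., source IPs
H = 1000       # assumed maximum number of heavy keys
eps = 0.01     # assumed targeted error rate

K = math.ceil(H / math.log(2))            # buckets per table: ~1443
M = math.ceil(math.log2(N / (eps * H)))   # number of tables: 29
print(K, M, K * M)                        # ~41,800 counters in total
```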
How to identify heavy keys? • Challenge: the hash array is irreversible • Many-to-one mapping from keys to buckets • Solution: enumerate all keys!! • Computationally expensive • [Figure: a heavy bucket in the hash array cannot be mapped back to a unique key]
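For concreteness, the brute-force detection that irreversibility forces on us would look roughly like the sketch below, reusing the hypothetical tables, h, and M from the earlier record-step sketch; scanning every possible key (e.g., all 2^32 source IPs) is what makes it impractical.

```python
def naive_detect(threshold, key_space):
    """Return keys all of whose buckets are heavy; requires scanning the whole key space."""
    heavy = []
    for key in key_space:          # e.g., every one of the 2**32 source IPs
        if all(tables[i][h(i, key)] > threshold for i in range(M)):
            heavy.append(key)
    return heavy
```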
Sequential Hashing Scheme • Basic idea: smaller keys first, then larger keys • Observation: if there are H heavy keys, then there are at most H unique sub-keys with respect to the heavy keys • Find all possible sub-keys of the H heavy keys • Enumerating a sub-key space is easier • [Figure: the entire IP space, 0.0.0.0 to 255.255.255.255 (size 2^32), with heavy keys 16.128.59.1 and 0.135.104.2 marked; their first octets form a sub-IP space of size 2^8]
Sequential Hashing Scheme - Record step • Input: (key x, value v) • Key x is split into words w1, w2, w3, …, wD • Array i (with Mi tables of K buckets each) is updated by hashing the prefix w1…wi and adding +v to one bucket in each of its tables • [Figure: Arrays 1, 2, …, D with M1, M2, …, MD tables of K buckets; each receives +v for the corresponding prefix of x]
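A minimal sketch of this record step, assuming the key is an IPv4 address split into D = 4 one-byte words and that Array i is updated with the prefix w1…wi; the names and per-array table counts are illustrative, not the sizes used in the talk.

```python
import hashlib

D = 4                        # number of words the key is split into (here: 4 bytes)
K = 1024                     # buckets per table (illustrative)
num_tables = [4, 4, 4, 4]    # tables per array, M1..MD (illustrative)
arrays = [[[0] * K for _ in range(num_tables[i])] for i in range(D)]

def hash_bucket(i, j, prefix):
    """Hash function for table j of Array i, applied to a key prefix."""
    digest = hashlib.sha1(f"{i}:{j}:{prefix.hex()}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % K

def record(key_bytes, value):
    """Update Array i using the prefix w1...wi of the key."""
    for i in range(D):
        prefix = key_bytes[: i + 1]
        for j in range(num_tables[i]):
            arrays[i][j][hash_bucket(i, j, prefix)] += value
```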
Try all w2’s Try all w3’s Try all wD’s Try all w1’s 1 … 2 (1 + )H w1w2’s (1 + )H w1w2w3’s (1 + )H w1w2…wD’s (1 + )H w1’s : Array 2 Array 3 K Array 1 Array D Sequential Hashing Scheme -Detection step Heavy bucket … • - intermediate error rate - targeted error rate
Estimation • Goal: find the values of heavy keys • Rank the importance of heavy keys • Eliminate more non-heavy keys • Use maximum likelihood • Bucket values due to non-heavy keys ~ Weibull • Estimation is solved by linear programming
Recap • Record step: the data stream is recorded into the hash arrays (Array 1, …, Array D) • Detection step: the hash arrays and a threshold yield candidate heavy keys • Estimation: turns the candidates into heavy keys + values
Experiments • Traces: • Abilene data collected at an OC-192 link • 1 hour long, ~50 GB traffic • Evaluation approach: • Compare our scheme and Deltoids [Cormode & Muthukrishnan, 04], both of which use the same number of counters • Metrics: • False positive rate • (# of non-heavy keys treated as heavy) / (# of returned keys) • False negative rate • (# of heavy keys missed) / (true # of heavy keys)
Results - Heavy Hitter Detection • [Figure: false +ve/-ve rates of sequential hashing] • Worst-case error rates: • Sequential hashing: 1.2% false +ve, 0.8% false -ve • Deltoids: 10.5% false +ve, 80% false -ve
Results - Heavy Changer Detection • [Figure: false +ve/-ve rates of sequential hashing] • Worst-case error rates: • Sequential hashing: 1.8% false +ve, 2.9% false -ve • Deltoids: 1.2% false +ve, 70% false -ve
Summary of Results • High accuracy of heavy-key detection while using a memory-efficient data structure • Fast detection • On the order of seconds • Accurate estimation • Provides more accurate estimates than least-square regression [Lee et al., 05]
Conclusions • Derived the minimum memory requirement for heavy-key detection • Proposed the sequential hashing scheme • Using a memory-efficient data structure • Allowing fast detection • Providing small false positives/negatives • Proposed an accurate estimation method to reconstruct the values of heavy keys
How to Determine H? • H = maximum number of heavy keys • H ≈ total data volume / threshold
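For instance (an illustrative calculation, not a figure from the talk): with about 50 GB of traffic in a measurement interval and a heavy-key threshold of 50 MB, H ≈ 50 GB / 50 MB = 1000.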
Tradeoff Between Memory and Computation • ε' = intermediate error rate • Large ε': fewer tables, more computation • Small ε': more tables, less computation