Reverse Hashing for Sketch Based Change Detection in High Speed Networks • Ashish Gupta, Elliot Parsons, with Robert Schweller, Theory Group • Advisor: Yan Chen • Class Presentation, June 2004, Network Security • Computer Science Department, Northwestern University
Overview • Anomaly Detection • Sketch Based Approaches and their problems • Reverse Hashing algorithms • Dealing with Multiple Anomalies • Evaluation • Conclusions • Future Work
Anomaly Detection • Goes beyond signature detection • Two popular types: • Heavy hitter detection • Change detection: very broad, ranging from simple change to statistical methods • Doing this online, in real time, is difficult • Heavy hitter: some solutions proposed • Heavy change: ? • Scalability with high speed traffic • Large number of flows: large memory required • Performance penalty • Scalable change detection: sketch to the rescue!
What is a sketch ? • Probabilistic summary of data streams • Widely used in database research to handle massive data streams • Array of hash tables: Tj[K] (j = 1, …, H)
What is a sketch? • Update (k, u): Tj[hj(k)] += u (for all j) • Estimate v(S, k): sum of updates for key k • [Figure: H hash tables (rows 1 … j … H), each with K buckets indexed 0 … K-1; key k is mapped to bucket hj(k) in table j]
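The update and estimate operations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `Sketch`, the linear hash with modulus `P`, and the use of the median of the H bucket values as the estimate are all assumptions made for the example.

```python
import random
from statistics import median


class Sketch:
    """H hash tables with K buckets each; the keys themselves are never stored."""

    def __init__(self, num_tables=5, num_buckets=4096, seed=0):
        rng = random.Random(seed)
        self.K = num_buckets
        # Assumed hash family: ((a*key + b) mod P) mod K, one (a, b) per table.
        self.P = (1 << 61) - 1  # a prime larger than any 32-bit key
        self.params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(num_tables)]
        self.tables = [[0] * num_buckets for _ in range(num_tables)]

    def bucket(self, j, key):
        """h_j(key): bucket index of `key` in table j."""
        a, b = self.params[j]
        return ((a * key + b) % self.P) % self.K

    def update(self, key, u):
        """Update (k, u): T_j[h_j(k)] += u for all j."""
        for j, table in enumerate(self.tables):
            table[self.bucket(j, key)] += u

    def estimate(self, key):
        """Simplified estimate of the sum of updates for `key`:
        the median of the H buckets the key maps to (an assumption)."""
        return median(t[self.bucket(j, key)] for j, t in enumerate(self.tables))


sketch = Sketch()
sketch.update(42, 700)
sketch.update(42, 300)
total_for_42 = sketch.estimate(42)
```

Note that `estimate` needs the key as input, which is exactly the limitation the following slides address.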
Using Sketch for anomaly detection • Requires very little space: • E.g. 5 hash tables with 16 K buckets = 360 K • High speed memory usable • Still able to reconstruct the values with high accuracy • Its main problem: • To look up the value of a key, you must already know the key • So we can tell that an anomaly occurred, but not which keys caused it!
How can we figure out the keys without storing them explicitly? • Our contribution
Step 1: Taking Intersections • Each hash table uses an independent hash function • Each key maps to a different bucket in each table • Each bucket maps to a large set of keys • Example: a key maps to buckets b1, b2, b3, b4, b5, whose reverse mapped key sets are A1, A2, A3, A4, A5 • Intersect A1, A2, A3, A4, A5 → really small set! • E[x] << 1 for 5 hash tables (ref. our paper)
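The intersection step can be tried out on a toy key space. The sketch below is an illustration under assumptions: a 16-bit key space, 64 buckets per table, a brute-force `reverse_map`, and a simple linear hash are all chosen here for brevity and are not taken from the slides.

```python
import random

KEY_BITS = 16       # toy key space of 2**16 keys (assumption)
NUM_TABLES = 5
NUM_BUCKETS = 64    # toy table size (assumption)
P = (1 << 31) - 1   # prime modulus for the illustrative hash

rng = random.Random(1)
params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(NUM_TABLES)]


def h(j, key):
    """Independent hash function of table j."""
    a, b = params[j]
    return ((a * key + b) % P) % NUM_BUCKETS


def reverse_map(j, bucket):
    """A_j: every key that table j sends to `bucket` (brute force)."""
    return {k for k in range(1 << KEY_BITS) if h(j, k) == bucket}


def intersect_candidates(buckets):
    """Intersect the reverse mapped sets of one anomalous bucket per table."""
    sets = [reverse_map(j, b) for j, b in enumerate(buckets)]
    return set.intersection(*sets)


culprit = 12345
anomalous_buckets = [h(j, culprit) for j in range(NUM_TABLES)]
set_sizes = [len(reverse_map(j, b)) for j, b in enumerate(anomalous_buckets)]
candidates = intersect_candidates(anomalous_buckets)
```

Each entry of `set_sizes` is on the order of a thousand keys, while `candidates` always contains the culprit and is much smaller. The brute-force enumeration is only feasible because the key space is tiny, which is the difficulty the next slides take up.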
The problem with simple intersection • Why is this difficult? • One-to-many mapping • Each set Ai can be very large! • E.g. for IP addresses the key space is 2^32. For 2^12 buckets → 2^20 keys per bucket!
Problem with Intersections • How do we store these huge mappings? • How do we take intersections of these huge sets? • Modular hashing: • Partition the key into separate words • Hash each word separately • [Figure: the 32-bit key 10010100 10101011 10010101 10100011 split into four 8-bit words]
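Modular hashing reduces the set size • [Figure: the 32-bit key 10010100 10101011 10010101 10100011 is split into four 8-bit words; h1(), h2(), h3(), h4() hash them to 010, 110, 001, 101, which together form the bucket index 010 110 001 101] • Greatly reduces size of reverse mapped sets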
Modular hashing • For 8-bit to 3-bit hashing: each bucket maps to 2^8/2^3 = 2^5 = 32 keys → small! • Only 32 elements per partition
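A minimal Python sketch of modular hashing with the slide's parameters (four 8-bit words, each hashed to 3 bits, giving a 12-bit bucket index). The per-word hash is assumed here to be a shuffled, balanced lookup table so that every 3-bit value has exactly 32 preimages; the slides do not say how the word hashes are built, and the names `word_hashes`, `modular_hash`, and `reverse_words` are made up for the example.

```python
import random

WORDS = 4        # a 32-bit key is split into four words
WORD_BITS = 8    # each word is 8 bits
HASH_BITS = 3    # each word is hashed down to 3 bits


def make_word_hash(rng):
    """Balanced 8-bit -> 3-bit lookup table: each output has 32 preimages."""
    table = [v % (1 << HASH_BITS) for v in range(1 << WORD_BITS)]
    rng.shuffle(table)
    return table


rng = random.Random(7)
word_hashes = [make_word_hash(rng) for _ in range(WORDS)]


def split_words(key):
    """Split a 32-bit key into four 8-bit words, most significant first."""
    mask = (1 << WORD_BITS) - 1
    return [(key >> (WORD_BITS * (WORDS - 1 - i))) & mask for i in range(WORDS)]


def modular_hash(key):
    """Hash each word separately and concatenate the 3-bit results."""
    bucket = 0
    for table, word in zip(word_hashes, split_words(key)):
        bucket = (bucket << HASH_BITS) | table[word]
    return bucket


def reverse_words(bucket):
    """For a 12-bit bucket index, return one candidate set per word."""
    result = []
    for i, table in enumerate(word_hashes):
        shift = HASH_BITS * (WORDS - 1 - i)
        target = (bucket >> shift) & ((1 << HASH_BITS) - 1)
        result.append({w for w in range(1 << WORD_BITS) if table[w] == target})
    return result


key = 0b10010100_10101011_10010101_10100011
bucket = modular_hash(key)
per_word_sets = reverse_words(bucket)
```

The reverse mapping is stored as four sets of 32 words each instead of one set of 2^20 keys, and intersections across hash tables can be taken word by word.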
Modular Hashing is Efficient • Very efficient in space and time: • If n is the key space, m is the hash space, and q is the number of words, set q = O(log n) • Space: logarithmic in key space • Run time (intersections): poly-log in key space
An Important problem: spatial locality • This hashing scheme is biased, not uniform • In network streams, there is strong spatial locality in IP addresses • E.g. many addresses fall into 120.105.56.* • These would be mapped into very few buckets → large number of collisions → low sketch accuracy • Solution: IP Mangling
IP Mangling removes correlations • Key idea : randomize the input data to destroy correlations • Must be reversible also !
Theory of Modular Linear Equations • f(x) = a·x mod n • a is chosen randomly • To be invertible: a and n must be relatively prime • Can be easily reversed: replace a by a^-1! • This function is highly effective in resolving the skewed distribution
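The mangling function f(x) = a·x mod n and its inverse can be sketched as follows. The choice n = 2^32 (so that any odd a is relatively prime to n) and the helper names are assumptions for illustration; the modular inverse uses Python's built-in three-argument `pow` with exponent -1, which requires Python 3.8 or later.

```python
import random

N = 1 << 32  # assumed modulus: the size of the IPv4 key space


def make_mangler(seed=0):
    """Pick a random odd a (so gcd(a, 2**32) == 1) and its inverse mod N."""
    rng = random.Random(seed)
    a = rng.randrange(1, N, 2)   # odd numbers only
    a_inv = pow(a, -1, N)        # modular inverse, Python 3.8+
    return a, a_inv


def mangle(x, a):
    """f(x) = a * x mod n."""
    return (a * x) % N


def unmangle(y, a_inv):
    """Reverse the mangling: replace a by its inverse."""
    return (a_inv * y) % N


def ip_to_int(a, b, c, d):
    """Pack dotted-quad octets into a 32-bit integer."""
    return (a << 24) | (b << 16) | (c << 8) | d


a, a_inv = make_mangler()
addresses = [ip_to_int(120, 105, 56, last) for last in range(1, 6)]
mangled = [mangle(x, a) for x in addresses]
recovered = [unmangle(y, a_inv) for y in mangled]
```

Because the mapping is a bijection on the key space, keys recovered from the sketch in mangled form can be mapped back to the original addresses.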
Recap • Intersections of reverse mapped sets: converges to culprit key • Modular Hashing: makes intersection time and space efficient • IP Mangling: removes non-uniformity of modular hashing
Handling Multiple Intersections… • A more complex problem • Illustration: each hash table now contains two anomalies → two culprit keys… • How do we take intersections now?
Handling Multiple Intersections… • Multiple possibilities…. • Take the union of keys from each hash table, and then the intersection → false positives
Handling Multiple Intersections… • Multiple possibilities…. • Try all possible combinations of intersections…. • Expensive and inaccurate(?)
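The two naive options (union per table followed by intersection, and intersecting every combination of anomalous buckets) can be compared on a toy example. Everything below is an illustrative assumption: a 12-bit key space, 3 tables of 16 buckets, brute-force reverse mapping, and a simple linear hash. This is not the bucket vector algorithm, which is only described in the authors' technical report.

```python
import random
from itertools import product

KEY_BITS, NUM_TABLES, NUM_BUCKETS = 12, 3, 16  # deliberately tiny (assumption)
P = (1 << 31) - 1

rng = random.Random(3)
params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(NUM_TABLES)]


def h(j, key):
    a, b = params[j]
    return ((a * key + b) % P) % NUM_BUCKETS


def reverse_map(j, bucket):
    """All keys that table j sends to `bucket` (brute force)."""
    return {k for k in range(1 << KEY_BITS) if h(j, k) == bucket}


culprits = [1234, 2001]
# Anomalous buckets seen in each table (one or two per table).
flagged = [sorted({h(j, c) for c in culprits}) for j in range(NUM_TABLES)]


def union_then_intersect(flagged):
    """Option 1: union the flagged buckets per table, then intersect tables."""
    per_table = [set().union(*(reverse_map(j, b) for b in buckets))
                 for j, buckets in enumerate(flagged)]
    return set.intersection(*per_table)


def all_combinations(flagged):
    """Option 2: intersect one flagged bucket per table, for every combination."""
    found = set()
    for combo in product(*flagged):
        found |= set.intersection(*(reverse_map(j, b)
                                    for j, b in enumerate(combo)))
    return found


option1 = union_then_intersect(flagged)
option2 = all_combinations(flagged)
false_positives = option1 - set(culprits)
```

In this brute-force setting the two options return the same candidate set, since intersection distributes over union; the difference is that the second enumerates a number of combinations that grows exponentially with the number of tables. Any keys in `false_positives` are non-culprit keys that happen to land in a flagged bucket in every table.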
Handling Multiple Intersections… • Bucket Vector Algorithm: a new algorithm • Efficient • Similar to trying all possible intersections, but takes polynomial time • Documented in our technical report
Evaluation • Got traffic traces from a large ISP • Each 5-minute interval: 7.5 GB of traces • Used the Change Detection Method described earlier
Evaluation • Efficacy depends on the number of heavy changers • This in turn depends on the change threshold • Lower threshold → larger number of heavy changes • To verify our results, used a naïve multi-pass algorithm as the ground truth
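A ground-truth computation of this kind can be sketched with exact per-key counters. The slides do not specify the naive algorithm or the change metric, so the following is an assumption: the change of a key is taken to be the difference of its totals in two consecutive intervals, and a key is a heavy changer when the absolute difference reaches the threshold. The function names are hypothetical.

```python
from collections import Counter


def heavy_changes(prev_updates, curr_updates, threshold):
    """Exact per-key change between two intervals (assumed metric).

    Each input is an iterable of (key, update) pairs. Returns a dict
    {key: change} for keys whose absolute change is >= threshold.
    """
    prev, curr = Counter(), Counter()
    for key, u in prev_updates:
        prev[key] += u
    for key, u in curr_updates:
        curr[key] += u
    return {k: curr[k] - prev[k]
            for k in prev.keys() | curr.keys()
            if abs(curr[k] - prev[k]) >= threshold}


def score(detected, truth):
    """Compare detected keys with ground truth: (false positives, false negatives)."""
    detected, truth = set(detected), set(truth)
    return detected - truth, truth - detected


prev_interval = [("a", 10), ("b", 5), ("a", 10)]
curr_interval = [("a", 100), ("c", 3)]
truth = heavy_changes(prev_interval, curr_interval, threshold=50)
fp, fn = score(detected={"a", "b"}, truth=truth)
```

Lowering `threshold` makes more keys qualify as heavy changers, which is the dependence noted on the slide.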
Our methods are quite effective • Detection is quite accurate, even with up to 20 heavy changes • Very few false positives and false negatives
The bucket vector algorithm is important • For multiple changes, the method of intersection is quite important • E.g. w/o bucket vector algorithm: [figure not recovered]
We can make the sketch more accurate • Use 6 hash tables instead of 5 • Makes intersections very accurate, with fewer false negatives
Conclusions • Sketches are a powerful method for scalable change detection • Our main contribution: we can reverse them • Greatly enhances their applicability in online systems • We can extract heavy changes from the sketches without storing any key information • Methods are accurate • Low number of false positives and false negatives • Methods are efficient • Runtime: only poly-logarithmic in key space • Space: logarithmic in key space
Future Work: Three areas • Application to Online real-time systems • Performance evaluation • Hardware design of our methods • More advanced applications: • Hierarchical change detection • Output the prefix changes not just the key changes ! • E.g. 129.105.100.* shows a big change ! • Advanced change detection methods: • Statistical methods