
Decompression-Free Inspection: DPI for Shared Dictionary Compression over HTTP

Authors: Anat Bremler-Barr (Interdisciplinary Center Herzliya), Shimrit Tzur David (Interdisciplinary Center Herzliya & The Hebrew University, Jerusalem), David Hay (The Hebrew University, Jerusalem), Yaron Koral


Presentation Transcript


  1. Decompression-Free Inspection: DPI for Shared Dictionary Compression over HTTP
     • Anat Bremler-Barr, Interdisciplinary Center Herzliya
     • Shimrit Tzur David, Interdisciplinary Center Herzliya & The Hebrew University, Jerusalem
     • David Hay, The Hebrew University, Jerusalem
     • Yaron Koral, Tel Aviv University

  2. Outline
     • Motivation
     • Background
       • AC algorithm
     • Our solution
       • The offline phase
       • The online phase
     • Experimental results

  3. Deep Packet Inspection (DPI)
     • Search for patterns in the packets' payload
       • Signature-based NIDS
       • Intrusion prevention
       • Web-application firewalls
       • Leakage prevention
       • Content filtering
     • Challenges:
       • Thousands of known malicious patterns
       • Real time, at link rate
     • The performance of security tools is dominated by the pattern matching engine (Fisk & Varghese 2002)

  4. Compressed HTTP
     • 84.1% of the top 1,000 sites compress their traffic (a 19% increase in 8 months)
     • Data compression is done by adding references to repeated data
     • There are two types of compression:
       • Intra-response compression: the references point to bytes within the response (Gzip/Deflate)
       • Inter-response/connection compression: the references point to bytes in a separate file, called a dictionary (Google's SDCH)

  5. Example – Intra-Response Compression
     • File1.html: abcdefgabcd
     • File2.html: abcdxyzbcdtr
     • Encode repeated strings by pointer: {distance, length}
     • Exchange after TCP connection setup:
       • GET File1.html returns abcdefg(7,4)
       • GET File2.html returns abcdxyz(6,3)tr
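The pointer encoding on this slide can be checked with a small decoder. This is a minimal sketch, assuming a compressed response is given as a list of literal strings and `(distance, length)` tuples; the function name `decode_intra` and this token representation are illustrative and are not Gzip/Deflate's actual bit format.

```python
def decode_intra(tokens):
    """Expand literals and back-references that point into the same response."""
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.extend(token)                  # literal bytes are copied as-is
        else:
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])     # copy one byte from `distance` back
    return "".join(out)


print(decode_intra(["abcdefg", (7, 4)]))        # File1.html from the slide
print(decode_intra(["abcdxyz", (6, 3), "tr"]))  # File2.html from the slide
```

Each response is decoded using only its own earlier bytes, which is why this form of compression cannot exploit repetition across responses.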

  6. Example – Inter-Response Compression
     • Dictionary: abcd
     • File1.html: abcdefgabcd
     • File2.html: abcdxyzbcdtr
     • Copy repeated strings from the dictionary: (address, length)
     • Exchange after TCP connection setup:
       • GET dictionary returns abcd
       • GET File1.html returns the delta file (0,4)efg(0,4)
       • GET File2.html returns the delta file (0,4)xyz(1,3)tr
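The same two files can be rebuilt from the dictionary and their delta files. The sketch below assumes 0-based dictionary addresses, as the slide's `(0,4)` and `(1,3)` suggest, and represents a delta file as a list of literal strings and `(address, length)` tuples; `decode_delta` is an illustrative name, not part of SDCH.

```python
def decode_delta(dictionary, delta):
    """Expand a delta file whose copy instructions point into a shared dictionary."""
    parts = []
    for token in delta:
        if isinstance(token, str):
            parts.append(token)                                  # literal bytes
        else:
            address, length = token
            parts.append(dictionary[address:address + length])   # copy from dictionary
    return "".join(parts)


dictionary = "abcd"
print(decode_delta(dictionary, [(0, 4), "efg", (0, 4)]))        # File1.html
print(decode_delta(dictionary, [(0, 4), "xyz", (1, 3), "tr"]))  # File2.html
```

Unlike the intra-response case, every copy instruction refers to bytes outside the response, so the referenced bytes are known before any response arrives.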

  7. Current NIDS Operation (1)
     [Diagram: Client, NIDS, Server. Request: GET \index.html, Accept-Encoding: SDCH. HTTP traffic: uncompressed. NIDS action: scan for intrusions.]

  8. Current NIDS Operation (2)
     [Diagram: Client, NIDS, Server. Request: GET \index.html, Accept-Encoding: SDCH. HTTP traffic: compressed. NIDS action: either do not scan, or decompress, scan, and compress again.]

  9. Our Solution
     [Diagram: Client, NIDS, Server. Request: GET \index.html, Accept-Encoding: SDCH. HTTP traffic: compressed. NIDS action: scan directly, with no decompression.]

  10. Our Solution: Decompression-Free Scanning
      • Focused on inter-response compression
      • Our algorithm works in two phases:
        • Offline phase: scanning the dictionary
        • Online phase: scanning the delta files
      • Works at the rate of the compressed traffic
      • Gains a 56% improvement compared with scanning the plain text directly

  11. Outline
      • Motivation
      • Background
        • Aho-Corasick (AC) algorithm
      • Our solution
        • The offline phase
        • The online phase
      • Experimental results

  12. Aho-Corasick (AC) Algorithm
      • Finite State Machine (FSM)
        • Regular states, accepting states
      • Goto function (black arrows)
        • g(state, symbol) → state
        • Each state corresponds to a label: the sequence of characters on its goto path from the root
        • The length of the label is the depth of the state
      • Failure function (red arrows)
        • f(state) → state
        • Taken when there is no goto transition
        • Goes to the state whose label is the longest proper suffix of the current state's label that is also a state label
      • Patterns: E, BE, BD, BCAA, BCD, CDBCAB
      • Example: g(S11,B) = S12, but g(S11,A) = ? Since f(S11) = S13, g(S11,A) is resolved as g(S13,A) = S14. The label of S14 is BCAA.
      [Figure: Aho-Corasick automaton for these patterns, states s0 to s14]
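The goto and failure functions can be sketched in a few lines. This is a textbook-style construction offered as an illustration, not the authors' implementation; state ids are assigned in insertion order and therefore differ from the slide's s0 to s14, so states are identified by their labels instead. The names `build_ac`, `step` and `scan` are illustrative.

```python
from collections import deque

PATTERNS = ["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"]  # pattern set from the slide


def build_ac(patterns):
    """Return (goto, fail, out, label) lists indexed by state id; state 0 is the root."""
    goto, fail, out, label = [{}], [0], [set()], [""]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                label.append(label[state] + ch)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pattern)  # accepting state
    # Breadth-first pass: the failure link is the longest proper suffix that is a label.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] |= out[fail[child]]
            queue.append(child)
    return goto, fail, out, label


def step(ac, state, ch):
    """Follow failure links until a goto transition on ch exists or the root is reached."""
    goto, fail = ac[0], ac[1]
    while state and ch not in goto[state]:
        state = fail[state]
    return goto[state].get(ch, 0)


def scan(ac, text, state=0):
    """Scan text; return the final state and a list of (end offset, pattern) matches."""
    matches = []
    for i, ch in enumerate(text):
        state = step(ac, state, ch)
        matches.extend((i, p) for p in sorted(ac[2][state]))
    return state, matches


ac = build_ac(PATTERNS)
label = ac[3]
s11, _ = scan(ac, "CDBCA")         # state labelled CDBCA (S11 on the slide)
print(label[ac[1][s11]])           # label of f(S11), i.e. of S13
print(label[step(ac, s11, "A")])   # label reached by g(S11, A), i.e. of S14
```

Reading `A` in the state labelled CDBCA has no goto transition, so the automaton falls back to the state labelled BCA and continues from there, which is the slide's g(S11,A) example.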

  13. Aho-Corasick Insights
      • The automaton remembers only its current state
      • The input text ends with the label of the current state
      • This label is the longest suffix of the text that can be a prefix of a match
      • No future pattern can begin before this label
      [Figure: Aho-Corasick automaton, states s0 to s14]

  14. Outline
      • Motivation
      • Background
        • Aho-Corasick (AC) algorithm
      • Our solution
        • The offline phase
        • The online phase
      • Experimental results

  15. Accelerator Algorithm Idea
      • The algorithm operates in two phases:
        • The offline phase: scan the dictionary and store information about the pattern matching results
        • The online phase: scan the delta file and skip almost all referenced bytes, which were already scanned for patterns

  16. The Offline Phase
      • The dictionary is scanned using AC (from its first byte and from s0)
      • We save the state after each byte
      • We also save information about the matched patterns that are found in the dictionary
      [Figure: Aho-Corasick automaton, states s0 to s14, with the saved state per dictionary byte]

  17. Challenges
      • Patterns/signatures: E, BE, BD, BCAA, BCD, CDBCAB
      • Delta file: ABDB(5,4)AAB(1,4)
      • We copy from an arbitrary position in the dictionary while the automaton is in an arbitrary state
      • We show that no matter in what state and at which symbol we start to copy, the resulting state is reachable via failure transitions from the saved state
      • Types of matches: right boundary, internal, left boundary
      [Figure: the dictionary and the uncompressed data; the dictionary contents are not recoverable from the transcript. Character row shown: A B D B C D B C A A B B E A A]

  18. The Online Phase
      • Scan the delta file:
        • Uncompressed bytes: scan using AC
        • Copy instruction (p,x):
          • Refers to data that we already scanned in the offline phase
          • We skip the scan for almost all of these bytes
      • Handling internal matches is trivial; see the paper for details

  19. The Online Phase – Right Boundary
      • When encountering a copy instruction (p,x), we want to stop scanning and jump to State[p+x-1]
      • If the label of that state is longer than the copied value:
        • The label begins before the copied value
        • The context of this state is not the same as in the online scan
        • We take failure transitions to find a state with a sufficiently short label
      • Otherwise:
        • The label of the state is contained in the copied value
        • This is the longest suffix that can lead to a match

  20. Example – Right Boundary
      • Uncompressed data: …B, followed by COPY(7,4): B C A B
      • Go to State[10] = s12; depth(s12) > 4
      • Go to f(s12) = s2; depth(s2) ≤ 4
      • Current state is s2
      [Figure: Aho-Corasick automaton, states s0 to s14]

  21. The Online Phase – Left Boundary
      • When encountering a copy instruction (p,x), we want to stop scanning and jump to State[p+x-1]
      • If the number of bytes we have read from the copied value is less than the depth of the current state:
        • The label of the state begins before the copied bytes
        • We keep scanning the copied value until we reach a state whose label is no longer than the number of bytes read
      • Otherwise:
        • The label of the state is contained in the copied value
        • Both the offline and the online scans have the same context

  22. Example – Left Boundary
      • Uncompressed data: …B, followed by COPY(5,4): C D B C
      • j=0, depth=1: continue
      • j=1, depth=2: continue
      • j=2, depth=3: continue
      • j=3: stop scanning (depth(s9) ≤ 3)
      [Figure: Aho-Corasick automaton, states s0 to s14]
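The offline phase and the left-boundary, internal and right-boundary handling can be put together in one sketch. It is an illustration under stated assumptions, not the authors' code: the dictionary string is hypothetical (the transcript does not show it) and was chosen so that COPY(5,4) and COPY(7,4) yield C D B C and B C A B as in the two examples; the delta file is the one from the Challenges slide; all function names are illustrative; internal matches are looked up with a simple binary search rather than the paper's method.

```python
from bisect import bisect_left
from collections import deque


def build_ac(patterns):
    """Aho-Corasick tables indexed by state id: goto, fail, out, depth."""
    goto, fail, out, depth = [{}], [0], [set()], [0]
    for pattern in patterns:
        s = 0
        for ch in pattern:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                depth.append(depth[s] + 1)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pattern)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
            queue.append(t)
    return goto, fail, out, depth


def step(ac, s, ch):
    goto, fail = ac[0], ac[1]
    while s and ch not in goto[s]:
        s = fail[s]
    return goto[s].get(ch, 0)


def offline_phase(ac, dictionary):
    """Scan the dictionary once from s0; save the state after each byte and all matches."""
    states, matches, s = [], [], 0
    for i, ch in enumerate(dictionary):
        s = step(ac, s, ch)
        states.append(s)
        matches.extend((i, p) for p in sorted(ac[2][s]))
    return states, matches  # matches are sorted by end position


def online_phase(ac, dictionary, states, dict_matches, delta):
    """Scan a delta file; return (final state, set of (end offset, pattern))."""
    _, fail, out, depth = ac
    s, pos, found = 0, 0, set()  # pos = offset in the uncompressed stream
    for token in delta:
        if isinstance(token, str):  # uncompressed bytes: plain AC scan
            for ch in token:
                s = step(ac, s, ch)
                found.update((pos, pat) for pat in out[s])
                pos += 1
            continue
        p, x = token  # copy instruction (p, x)
        j = 0
        # Left boundary: scan while the label still starts before the copied bytes.
        while j < x and depth[s] > j:
            s = step(ac, s, dictionary[p + j])
            found.update((pos + j, pat) for pat in out[s])
            j += 1
        if j < x:
            # Internal: offline matches ending in the skipped bytes, starting inside the copy.
            k = bisect_left(dict_matches, (p + j, ""))
            while k < len(dict_matches) and dict_matches[k][0] < p + x:
                end, pat = dict_matches[k]
                if len(pat) <= end - p + 1:
                    found.add((pos + end - p, pat))
                k += 1
            # Right boundary: jump to the saved state, then shorten its label to fit.
            s = states[p + x - 1]
            while depth[s] > x:
                s = fail[s]
        pos += x
    return s, found


def full_scan(ac, text):
    """Reference: plain AC scan over the decompressed text."""
    s, found = 0, set()
    for i, ch in enumerate(text):
        s = step(ac, s, ch)
        found.update((i, pat) for pat in ac[2][s])
    return s, found


ac = build_ac(["E", "BE", "BD", "BCAA", "BCD", "CDBCAB"])
dictionary = "ABEAACDBCAB"  # hypothetical; the slides do not show the dictionary
delta = ["ABDB", (5, 4), "AAB", (1, 4)]
states, dict_matches = offline_phase(ac, dictionary)
plain = "".join(t if isinstance(t, str) else dictionary[t[0]:t[0] + t[1]] for t in delta)
print(online_phase(ac, dictionary, states, dict_matches, delta) == full_scan(ac, plain))
```

The final line compares the skipping scan against a plain scan of the decompressed text; with this data both should report the same final state and the same matches. For the second copy, `(1,4)`, only one copied byte is scanned before the jump, and the BE and E matches come from the offline match list.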

  23. Outline
      • Motivation
      • Background
        • Aho-Corasick (AC) algorithm
      • Our solution
        • The offline phase
        • The online phase
      • Experimental results

  24. Experimental Results
      • Input:
        • google.com dictionary
        • Pages for the 1,000 most popular Google queries
      • Patterns:
        • Snort
        • The synthetic case: a patterns file for each input file, so that the input files have different percentages of matches, from 25% to 100%

  25. The Algorithm Overheads
      • Traversing the failure transitions
        • In the right boundary
      • Scanning the copied value
        • In the left boundary
      • Memory consumption:
        • The additional information of the offline phase
        • Total: 420 KB (per dictionary)
        • Can be further reduced by a variable-length pointer encoding

  26. Failure Transitions – Right Boundaries
      • If length ≥ depth, no failure transition is taken
      • In our experiments:
        • The average is 2.35 failure transitions per file
        • (with an average of 557 copy instructions per file)

  27. Scanning the Copy Value – Left Boundary
      • Compression ratio: compressed/uncompressed
      • Scan ratio: scanned/uncompressed
      • Snort:
        • Low percentage of matches
        • Scan ratio ≈ compression ratio
      • The synthetic case:
        • High percentage of matches
        • Unrealistic case
        • The scan ratio is between 1.05 and 1.2 times the compression ratio

  28. Regular Expression Results
      • Strings were extracted from the regular expressions and added to the pattern set
      • When needed, we use an off-the-shelf Perl-compatible regular expression engine to scan additional parts of the text
      • The overhead of the regular expressions is around 1%, which is almost negligible

  29. Questions?

  30. Regular Expression
      • Very common in security-oriented pattern sets
        • In Snort, 55% of the rules contain a regular expression
      • Composed of anchors and PCRE tokens
      • For example, in the pattern abc[1-9]*xyza{3,7}:
        • The anchors are abc and xyz
        • The PCRE tokens are [1-9]* and a{3,7}

  31. Dealing with Regular Expression
      • The anchors are extracted from the regular expression offline
      • The anchors are added to the pattern set
      • If all the anchors of a regular expression were matched:
        • Run an off-the-shelf regular expression engine until a mismatch, a full pattern match, or the end of the (limited) text
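The anchor pre-filter can be illustrated with Python's standard `re` module standing in for the off-the-shelf engine. This is a simplified sketch: the anchors are listed by hand rather than extracted automatically, a substring test stands in for the AC matches, and the function name `scan_regex` and the sample text are made up for the example.

```python
import re


def scan_regex(regex, anchors, text):
    """Invoke the regex engine only when every anchor string occurs in the text."""
    if not all(anchor in text for anchor in anchors):
        return None  # an anchor is missing, so the full expression cannot match
    return re.search(regex, text)


rule = r"abc[1-9]*xyza{3,7}"   # example pattern from the slides
anchors = ["abc", "xyz"]       # its anchors, as listed on the slides
hit = scan_regex(rule, anchors, "GET /abc42xyzaaaa HTTP/1.1")
print(hit.group(0) if hit else None)
```

The point of the design is that the comparatively expensive regular expression engine only runs on the small fraction of inputs where the cheap string matches already succeeded.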

  32. Regular Expression – Limited Search
      • In most cases, we can limit the search in at least one direction
      • If all tokens before the first anchor have a limited size, there is a bounded number of characters we should examine before the matched anchor
      • If all tokens after the last anchor have a limited size, there is a bounded number of characters we should examine after the matched anchor

  33. Memory Consumption
      • Doubling the size of the dictionary (for saving the offline scan results, one pointer per symbol)
      • Saving the matched list (for internal matches)
      • Our experiments:
        • Match list size: 40,000
        • Dictionary size: 116K symbols
        • Pointer size: 17 bits
      • Total memory consumption is 420 KB (per dictionary)
      • Can be further reduced by a variable-length pointer encoding
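The size of the per-symbol state table follows from the figures on this slide. The arithmetic below assumes that "116K" means 116,000 symbols and that pointers are tightly packed at 17 bits each; neither is stated in the transcript, and the variable names are illustrative.

```python
symbols = 116_000      # assumption: "116K symbols" read as 116,000
pointer_bits = 17      # pointer size from the slide

pointer_bytes = symbols * pointer_bits // 8
print(pointer_bytes)        # bytes needed for one saved state per dictionary symbol
print(2 ** pointer_bits)    # number of automaton states a 17-bit pointer can address
```

Under these assumptions the state table takes a little over two bytes per dictionary symbol; the rest of the reported 420 KB would then be attributable to the match list.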
