Improving the Performance of Network Intrusion Detection Using Graphics Processors

Giorgos Vasiliadis
Master Thesis Presentation
Computer Science Department, University of Crete

Motivation
• Pattern matching is a crucial component of network intrusion detection systems
  • Thousands of patterns
  • Must sustain high rates (e.g. gigabit)
• Multi-pattern search alone is not sufficient
• Parallel matching provides a scalable solution

Objectives
• To offload the pattern matching operations to the graphics card
  • Highly parallel computational devices
  • Low-cost
• Match thousands of network packets concurrently, instead of one at a time

Roadmap
• Introduction
• Design
• Evaluation
• Conclusions

Network Intrusion Detection Systems
• Passively monitor incoming and outgoing traffic for suspicious payloads
• Single entity located at the network edge
• Scans packet payloads for malicious content

Pattern Matching Algorithms
• Essential for any signature-based NIDS
• Algorithms were not necessarily motivated by IDS
• It is just string searching

The Aho-Corasick Algorithm
• Used in most modern NIDSes
• Compile patterns into a state machine
• The state machine is used to scan for all patterns simultaneously in linear time
• Next state: state := f(state, char)
• Example: P = {he, she, his, hers}, input text "she is a maniac"
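The state machine on this slide can be sketched in a few lines of Python. This is a minimal textbook construction (goto, failure and output tables), not the Snort implementation; the function names `build_automaton` and `search` are chosen here for illustration.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto/fail/output tables for a set of string patterns."""
    goto = [{}]   # goto[state][char] -> next state (trie edges)
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised on entering each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state.
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def search(text, automaton):
    """Return (start_offset, pattern) for every match, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches

automaton = build_automaton(["he", "she", "his", "hers"])
found = search("she is a maniac", automaton)
```

On the slide's input text, `found` contains "she" at offset 0 and "he" at offset 1; "his" and "hers" do not occur.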

The Problem
• Aho-Corasick search improves performance, but is not fast enough for high-speed networks
• Pattern matching accounts for up to 75% of the total CPU processing of a NIDS
• Parallel pattern matching provides a scalable solution
This Work
• To speed up the processing throughput of Network Intrusion Detection Systems by offloading the pattern matching operations to the GPU

Why use the GPU?
• The GPU is specialized for compute-intensive, highly parallel computation
• More transistors are devoted to data processing rather than data caching and flow control
• The fast-growing video game industry exerts strong economic pressure that forces constant innovation

NVIDIA GeForce 8 Series Architecture
• Many Multiprocessors
• Each multiprocessor contains 8 Stream Processors
• Different types of memory

The CUDA Programming Model
• Compute Unified Device Architecture SDK
• GPU can be used for non-graphics purposes
• GPU is capable of executing thousands of threads

Implementation within Snort
• Snort is the most widely used Network Intrusion Detection System
• Open-source
• Contains a large number of threat signatures

Architecture Outline
• Transfer packets to the GPU
• Parallel match
• Copy results from the GPU

Challenges
• Overhead of moving data to/from the GPU
  • Additional communication costs
• Parallelize packet inspection process
  • Map packet data to processing elements

Transferring Packets to the GPU (1/3)
• The PCI Express bus provides large transfer capacity
  • Up to 4 GB/s in each direction (v1.1, x16)

Transferring Packets to the GPU (2/3)
• Unfortunately, packets cannot be transferred directly to the memory space of the GPU

Transferring Packets to the GPU (2/3)
• Thus, network packets are first copied to host memory and then transferred via DMA to the GPU

Transferring Packets to the GPU (3/3)
• Network packets are copied to texture memory, instead of global memory
  • Texture fetches are cached
  • Random-access memory reads
  • Read-only memory

Pattern Matching on the GPU
• Each packet is scanned against a specific Aho-Corasick state machine, based on its destination port
• All state machines are represented as 2D matrices that are stored sequentially in the texture memory space
• Each stream processor searches its assigned data using the appropriate state machine in parallel
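The 2D-matrix representation can be illustrated with a dense transition table, one row per state and one column per byte value, so that each input byte costs a single lookup. The sketch below is plain Python rather than GPU code; `build_table`, `scan_packet`, the `machines` dictionary and the example port number are hypothetical names introduced for illustration, and matches are reported by the offset of their last byte.

```python
from collections import deque

ALPHABET = 256  # one column per possible byte value

def build_table(patterns):
    """Return (table, accept) where table[state][byte] is the next state."""
    table = [[0] * ALPHABET]
    accept = [[]]  # pattern IDs recognised on entering each state
    # Step 1: insert every pattern as a trie path (0 means "no edge yet").
    for pid, pat in enumerate(patterns):
        s = 0
        for b in pat:
            if table[s][b] == 0:
                table.append([0] * ALPHABET)
                accept.append([])
                table[s][b] = len(table) - 1
            s = table[s][b]
        accept[s].append(pid)
    # Step 2: breadth-first pass that fills missing edges from the
    # failure state, turning the trie into a complete 2D matrix.
    fail = [0] * len(table)
    queue = deque(t for t in table[0] if t)
    while queue:
        s = queue.popleft()
        for b in range(ALPHABET):
            t = table[s][b]
            if t:  # real trie edge
                fail[t] = table[fail[s]][b]
                accept[t] = accept[t] + accept[fail[t]]
                queue.append(t)
            else:  # missing edge: borrow the failure state's transition
                table[s][b] = table[fail[s]][b]
    return table, accept

def scan_packet(payload, dst_port, machines):
    """Scan one payload with the state machine chosen by destination port."""
    table, accept = machines[dst_port]
    state = 0
    hits = []
    for i, b in enumerate(payload):
        state = table[state][b]  # state := f(state, char)
        for pid in accept[state]:
            hits.append((pid, i))  # (pattern ID, offset of last matched byte)
    return hits

# Hypothetical rule group for destination port 80.
machines = {80: build_table([b"he", b"she", b"his", b"hers"])}
hits = scan_packet(b"ushers", 80, machines)
```

Because every row has the same width, such a table can be laid out as one contiguous block of memory, which is what makes it a natural fit for a texture.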

Parallelizing Packet Matching (1/3)
• Perform data-parallel pattern matching
• Distribute packets across Processing Elements
• The GeForce 8600 contains 32 Stream Processors organized in 4 Multiprocessors
• We have explored two different approaches for parallelizing the searching phase

Parallelizing Packet Matching (2/3)
• Approach 1: Assigning a Single Packet to each Multiprocessor
  • Stream processors search different parts of the packet concurrently
  • A multiprocessor can pipeline many packets to hide latencies

Parallelizing Packet Matching (3/3)
• Approach 2: Assigning a Single Packet to each Stream Processor
  • Each packet is processed by a different stream processor
  • A stream processor can pipeline many packets to hide latencies

Saving the results in the GPU
• Pattern matches for each packet are appended to a two-dimensional array in global device memory
• For each match, we store
  • the ID of the matched pattern
  • the index inside the packet where it was found
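One possible layout for such a two-dimensional result array is sketched below. The slide only states that each match stores a pattern ID and an index; the fixed row capacity `MAX_MATCHES`, the use of the first cell of each row as a match counter, and the helper names are all assumptions made for illustration.

```python
MAX_MATCHES = 4  # hypothetical per-packet capacity (assumption)

def new_result_table(num_packets):
    """One row per packet: [match_count, slot_1, ..., slot_MAX_MATCHES]."""
    return [[0] + [None] * MAX_MATCHES for _ in range(num_packets)]

def append_match(table, packet_idx, pattern_id, offset):
    """Append (pattern_id, offset) to a packet's row; False if the row is full."""
    row = table[packet_idx]
    if row[0] >= MAX_MATCHES:
        return False
    row[0] += 1
    row[row[0]] = (pattern_id, offset)
    return True

def matches_for(table, packet_idx):
    """Return only the filled slots of a packet's row."""
    row = table[packet_idx]
    return row[1:1 + row[0]]

results = new_result_table(num_packets=3)
append_match(results, 1, pattern_id=7, offset=42)
append_match(results, 1, pattern_id=2, offset=50)
```

A fixed-width row keeps every packet's results at a predictable offset, so each thread can write its matches without coordinating with the others.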

Copying the results from the GPU
• All pattern matches are copied back to the host main memory
• The CPU processes the results further

Software Mapping
• Network packets are classified and copied to a packet buffer
• Every time the buffer fills up, it is copied and processed by the GPU at once
• By using DMA-enabled memory copies and a double-buffer scheme, CPU and GPU execution can overlap

Pipelined Execution
• The CPU sends a batch of packets to the GPU for processing
• While the GPU is processing the packets, the CPU collects the next batch of packets
• The CPU synchronizes with the GPU when it retrieves the results of the first batch
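The overlap between the two sides can be illustrated with a small Python sketch in which a single worker thread stands in for the GPU. The `gpu_match` function is a hypothetical placeholder that just checks for one pattern; a real implementation would launch a kernel and use asynchronous DMA copies instead.

```python
from concurrent.futures import ThreadPoolExecutor

def gpu_match(batch, pattern):
    """Stand-in for the GPU kernel: indices of packets that contain pattern."""
    return [i for i, pkt in enumerate(batch) if pattern in pkt]

def pipelined_scan(packets, batch_size, pattern):
    """Return global indices of matching packets using two batches in flight."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as gpu:
        pending, pending_base = None, 0
        for base in range(0, len(packets), batch_size):
            # CPU side: collect the next batch while the previous one is scanned.
            batch = packets[base:base + batch_size]
            future = gpu.submit(gpu_match, batch, pattern)
            # Synchronisation point: fetch the results of the previous batch.
            if pending is not None:
                results.extend(pending_base + i for i in pending.result())
            pending, pending_base = future, base
        # Drain the last batch.
        if pending is not None:
            results.extend(pending_base + i for i in pending.result())
    return results

packets = [b"abc", b"xattackx", b"zzz", b"attack", b"q"]
flagged = pipelined_scan(packets, batch_size=2, pattern=b"attack")
```

The CPU blocks only when it asks for the results of the previous batch, which mirrors the synchronization step described on the slide.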

Evaluation Overview
• Technical equipment
  • 3.4 GHz Intel Pentium 4
  • 2 GB of memory
  • NVIDIA GeForce 8600GT
• Evaluation with Snort
  • 5467 content filtering rules
  • 7878 patterns associated with these rules

Transferring Packets to the GPU
• PCI Express x16 v1.1
  • 4 GB/s maximum theoretical throughput
• Divergence from the theoretical maximum data rates may be due to the 8b/10b encoding in the physical layer

Pattern Matching Throughput

Performance Analysis
• GPU costs are hidden

Throughput vs. Packet size
• We ran Snort using randomly generated patterns
• The packets contained random payload
• 2.3 Gbit/s for full packets
  • 3.2x faster compared to the CPU

Macrobenchmark (1/2)
• Experimental setup
  • Two PCs connected via a 1 Gbit/s Ethernet switch using commodity network cards

Macrobenchmark (2/2)
• Original Snort (AC) cannot process all packets at rates higher than 250 Mbit/s
• GPU-assisted Snort (AC1, AC2) begins to lose packets at 500 Mbit/s
  • Twice as fast

Conclusions
• Graphics cards can be used effectively to speed up Network Intrusion Detection Systems
  • Low-cost (the GeForce 8600 costs less than $100)
  • Worth the extra GPU programming effort
• Our results indicate that network intrusion detection at gigabit rates is feasible using graphics processors

Related Work
• Specialized hardware
  • Reprogrammable hardware (FPGAs) [3,4,13,14,31]
    • Very efficient in terms of speed
    • Poor flexibility
  • Network processors [5,8,12]
• Commodity hardware
  • Multi-core processors [25]
  • Graphics processors [17]

Previous Work
• Jacob et al.: Offloading IDS computation to the GPU (PixelSnort). ACSAC 2006
• Nen-Fu Huang et al.: A GPU-based Multiple-pattern Matching Algorithm for Network Intrusion Detection Systems. AINAW 2008

Publications
• G. Vasiliadis, S. Antonatos, M. Polychronakis, E. Markatos, S. Ioannidis. Gnort: High Performance Intrusion Detection Using Graphics Processors. RAID 2008
• G. Vasiliadis, S. Antonatos, M. Polychronakis, E. Markatos, S. Ioannidis. Regular Expression Matching on Graphics Hardware for Intrusion Detection. Under submission (Security and Privacy 2009)

Fin
Thank you

Future work
• Transfer the packets directly from the NIC to the memory space of the GPU
• Utilize multiple GPUs on multi-slot motherboards
• Content-based traffic applications
  • Virus scanners, anti-spam filters, firewalls, etc.

Dividing the Payload
• Approach 1 divides the packet payload into fragments
• Fragments are given to Stream Processors, so that the complete payload is scanned
• A signature (malicious content) may span fragments
• A single processor may not see the complete signature
• Fragments must overlap to prevent false negatives
• The overlap depends on the largest signature
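The need for overlapping fragments can be shown with a short Python sketch. It assumes that each fragment is extended by (longest pattern length - 1) bytes into its successor, which is sufficient for any match that starts inside the fragment to be seen in full. The helper names are hypothetical, and a naive `bytes.find` search stands in for the per-processor Aho-Corasick scan.

```python
def split_with_overlap(payload, num_fragments, max_pattern_len):
    """Return (start_offset, fragment) pairs that together cover the payload."""
    chunk = max(1, -(-len(payload) // num_fragments))  # ceiling division
    overlap = max_pattern_len - 1
    fragments = []
    for start in range(0, len(payload), chunk):
        end = min(start + chunk + overlap, len(payload))
        fragments.append((start, payload[start:end]))
    return fragments

def find_all(data, pattern):
    """Offsets of every occurrence of pattern in data (naive search)."""
    hits, pos = [], data.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = data.find(pattern, pos + 1)
    return hits

def scan_fragments(payload, patterns, num_fragments, overlap=True):
    """Scan each fragment independently; return sorted (pattern_id, offset)."""
    max_len = max(len(p) for p in patterns) if overlap else 1
    matches = set()  # a set removes matches seen twice in overlapping regions
    for start, frag in split_with_overlap(payload, num_fragments, max_len):
        for pid, pat in enumerate(patterns):
            for pos in find_all(frag, pat):
                matches.add((pid, start + pos))
    return sorted(matches)

payload = b"xxattackxx"  # "attack" straddles the midpoint of the payload
with_overlap = scan_fragments(payload, [b"attack"], num_fragments=2)
without_overlap = scan_fragments(payload, [b"attack"], num_fragments=2, overlap=False)
```

With two fragments and no overlap, the signature is split between them and goes undetected; with the overlap it is reported once at offset 2. The price is that bytes in the overlapping regions are scanned more than once.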

Parallel Matching Approaches

Parallelizing Packet Searching (1/2)
• Assigning a Single Packet to each Multiprocessor
  • Each packet is copied to the shared memory of the Multiprocessor
  • Stream Processors search different parts of the packet concurrently
  • Overlapping computation
    • Matching patterns may span consecutive chunks of the packet
  • Same amount of work per Stream Processor
    • Stream Processors will be synchronized

Parallelizing Packet Searching (2/2)
• Assigning a Single Packet to each Stream Processor
  • Each packet is processed by a different Stream Processor
  • No overlapping computation
  • Different amount of work per Stream Processor
    • Stream Processors of the same Multiprocessor will have to wait until all have finished

Pattern Matching Throughput
[Figure panels: Global Memory, Texture Memory]
• AC1 performs better for small data sets, but fails to scale when data increases
• On the contrary, AC2 scales better as the size of the data increases
• Texture memory provides better performance than global device memory

Single-Pattern Matching on GPU

Evaluation (1/2)
• Scalability as a function of the number of patterns
  • We ran Snort using randomly generated patterns
  • All patterns are matched against every packet
  • Payload trace contained 800-byte UDP packets of random payload
• Throughput remains constant as the number of patterns increases
  • 2.4x faster than the CPU

Macrobenchmark
