A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring
Patrick P. C. Lee (The Chinese University of Hong Kong), Tian Bu and Girish Chandranmenon (Bell Labs, Alcatel-Lucent)
April 2010
Outline • Motivation • MCRingBuffer, a multi-core ring buffer • Parallel network monitoring prototype • Conclusions
Network Traffic Monitoring • Monitoring data streams in today's networks is essential for network management: • accounting • resource provisioning • failure diagnosis • intrusion detection/prevention • Goal: achieve line-rate monitoring • Monitoring speed must keep up with link bandwidth (i.e., prepare for the worst) • Challenges: • Data volume keeps increasing (e.g., to Gigabit scales) • Single-CPU systems may no longer support line-rate monitoring
Can Multi-Core Help? • Can multi-core architectures help line-rate monitoring? • Parallelize packet processing [Figure: single-core case (raw packets feed one CPU) vs. multi-core case (raw packets feed a quad-core CPU)] • The answer should be "yes"… yet exploiting the full potential of multi-core is still challenging • Inter-core communication has overhead: • Upper layer: protocol messages • Lower layer: thread synchronization in shared data structures
Can Multi-Core Help? • Multi-core helps only if we minimize inter-core communication overhead • Let's focus on minimizing thread synchronization overhead • This benefits a broad class of multi-threaded network monitoring applications
Our Contribution • Design a lock-free, cache-efficient multi-core synchronization mechanism for high-speed network traffic monitoring • Why lock-free? • Allows concurrent thread accesses • Why cache-efficient? • Saves expensive memory accesses • We embed the mechanism into MCRingBuffer, a lock-free, cache-efficient shared ring buffer tailored for multi-core architectures
Producer/Consumer Problem • Classical OS problem • Ring buffer: a bounded buffer with a fixed number of slots • Thread synchronization: • Producer inserts elements when the buffer is not full • Consumer extracts elements when the buffer is not empty • First-in-first-out (FIFO): elements are extracted in the same order they are inserted [Figure: producer inserts an element into the ring buffer; consumer extracts it]
Producer/Consumer Problem • Ring buffer in a multi-core context: [Figure: producer and consumer run on separate cores of one CPU, each with a private L1 cache and a shared L2 cache; the control variables and the ring buffer sit in memory across the system bus] • Thread synchronization operates on control variables; make the operations as cache-friendly as possible
Lamport's Lock-Free Ring Buffer [Lamport, Comm. of ACM, 1977] • Operate on control variables read and write, which point to the next read and write slots, respectively [Figure: ring of N slots, indexed 0 to N-1, with read and write pointers]

    NEXT(x) = (x + 1) % N

    Insert(T element):
        wait until NEXT(write) != read
        buffer[write] = element
        write = NEXT(write)

    Extract(T* element):
        wait until read != write
        *element = buffer[read]
        read = NEXT(read)
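For concreteness, a minimal C rendering of this algorithm might look as follows (names such as lamport_insert and BUF_SIZE are ours; volatile stands in for the sequential-consistency assumption stated on the assumptions slide, and production code would use C11 atomics or memory barriers):

    #define BUF_SIZE 2048                     /* N: number of slots */
    #define NEXT(x)  (((x) + 1) % BUF_SIZE)

    typedef int elem_t;                       /* element type T; int for illustration */

    static elem_t buffer[BUF_SIZE];
    static volatile int read_idx  = 0;        /* next slot to extract */
    static volatile int write_idx = 0;        /* next slot to insert */

    void lamport_insert(elem_t element) {
        while (NEXT(write_idx) == read_idx)
            ;                                 /* buffer full: busy-wait */
        buffer[write_idx] = element;
        write_idx = NEXT(write_idx);          /* publish only after the data write */
    }

    void lamport_extract(elem_t *element) {
        while (read_idx == write_idx)
            ;                                 /* buffer empty: busy-wait */
        *element = buffer[read_idx];
        read_idx = NEXT(read_idx);            /* free the slot only after the read */
    }

Note that one slot is always left empty, so the full and empty conditions remain distinguishable without extra state.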
Previous Work • FastForward [Giacomoni et al., PPoPP, 2008]: • couples data/control operations • needs a special NULL data element defined by applications • Hardware-primitive ring buffers: • support multiple producers/multiple consumers • use hardware synchronization primitives (e.g., compare-and-swap) • Hardware primitives are expensive in general
MCRingBuffer Overview • Goal: use Lamport's ring buffer as a building block and further minimize the cost of thread synchronization • Properties: • Lock-free: allows concurrent accesses by the producer and consumer • Cache-efficient: improves cache locality of synchronization • Generic: no assumptions on data types or insert/extract patterns • Deployable: works on general-purpose multi-core CPUs • Components: • Cache-line protection • Batch updates of control variables
MCRingBuffer Assumptions • Assumptions inherited from Lamport's ring buffer: • single producer/single consumer • reads and writes of the control variables read/write are atomic • memory accesses follow sequential consistency
Cache-line Protection • Cache operates in units of cache lines • False sharing occurs when two threads access different variables on the same cache line: • the cache line is invalidated when one thread modifies a variable • the cache line is reloaded from memory when the other thread reads a different variable on it, even an unchanged one [Figure: read/write, modified frequently for thread synchronization, share a cache line with N; N (the ring buffer size) is reloaded from memory even though it is constant]
Cache-line Protection • Add padding bytes to avoid false sharing (CL = cache line size):

    int  read;
    int  write;
    char cachePad1[CL - 2 * sizeof(int)];
    int  N;
    char cachePad2[CL - sizeof(int)];

[Figure: read, write, and cachePad1 fill one cache line; N and cachePad2 fill another]
Cache-line Protection • Use cache-line protection to minimize memory accesses [Figure: variables grouped on separate cache lines: shared variables (read, write, cachePad1); consumer's local variables (localWrite, nextRead, cachePad2); producer's local variables (localRead, nextWrite, cachePad3); constants (N, cachePad4)] • Shared variables are the main controls of synchronization • Use local variables to "guess" shared variables • Goal: minimize the frequency of reading shared control variables (see the struct sketch after this slide)
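A minimal C sketch of this layout, assuming a 64-byte cache line (the struct name mc_ring_t is ours; rBatch, wBatch, and batchSize come from the batching slide that follows, and their placement here is our assumption):

    #define CACHE_LINE 64  /* assumed cache-line size in bytes */

    typedef struct {
        /* Shared control variables (written by one thread, read by the other) */
        volatile int read;
        volatile int write;
        char pad1[CACHE_LINE - 2 * sizeof(int)];

        /* Consumer's local variables (touched only by the consumer) */
        int localWrite;   /* consumer's cached "guess" of write */
        int nextRead;     /* consumer's private read position */
        int rBatch;       /* extracts since read was last published */
        char pad2[CACHE_LINE - 3 * sizeof(int)];

        /* Producer's local variables (touched only by the producer) */
        int localRead;    /* producer's cached "guess" of read */
        int nextWrite;    /* producer's private write position */
        int wBatch;       /* inserts since write was last published */
        char pad3[CACHE_LINE - 3 * sizeof(int)];

        /* Constants (read-only after initialization) */
        int N;            /* ring buffer capacity */
        int batchSize;    /* read/write are published once per batch */
        char pad4[CACHE_LINE - 2 * sizeof(int)];
    } mc_ring_t;

Because each group sits on its own cache line, the producer's updates to its local variables never invalidate the consumer's lines, and the constant N is never reloaded when read/write change.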
Batch Updates of Control Variables • Intuition: • nextRead/nextWrite are the positions where to read/write • Update read/write only after batchSize reads/writes

    Producer:
        buffer[nextWrite] = element
        nextWrite = NEXT(nextWrite)
        wBatch++
        if (wBatch >= batchSize) {
            write = nextWrite
            wBatch = 0
        }

    Consumer:
        *element = buffer[nextRead]
        nextRead = NEXT(nextRead)
        rBatch++
        if (rBatch >= batchSize) {
            read = nextRead
            rBatch = 0
        }

• Goal: minimize the frequency of writing shared control variables (a combined sketch follows below)
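Putting local "guesses" and batching together, the producer side might look as follows, reusing mc_ring_t and elem_t from the sketches above (this is our simplification of the paper's algorithm, with Lamport-style busy-waiting):

    void mc_insert(mc_ring_t *r, elem_t element, elem_t *buf) {
        int after = (r->nextWrite + 1) % r->N;
        if (after == r->localRead) {          /* local guess says the buffer is full */
            while (after == r->read)
                ;                             /* re-check the shared variable; busy-wait */
            r->localRead = r->read;           /* refresh the guess */
        }
        buf[r->nextWrite] = element;
        r->nextWrite = after;
        if (++r->wBatch >= r->batchSize) {    /* publish write once per batch */
            r->write = r->nextWrite;
            r->wBatch = 0;
        }
    }

Extract is symmetric: the consumer compares nextRead against its localWrite guess, refreshes localWrite from the shared write only when the guess says the buffer is empty, and publishes read once every batchSize extractions. The shared variables are thus touched only when a guess fails or a batch completes.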
Batch Updates of Control Variables • Limitation: • read/write advance on a per-batch basis, so inserted elements may not be extractable even when the buffer is not empty • However, if elements are raw packets in high-speed networks, read/write will be updated regularly
Correctness of MCRingBuffer • Correctness based on Lamport’s ring buffer: • Lamport’s: • Insert only if write – read < N • Extract only if read < write • We prove for MCRingBuffer: • Insert only if nextWrite – nextRead < N • Extract only if nextRead < nextWrite • Details in the paper.
Evaluation • Hardware: Intel Xeon 5355 Quad-core • sibling cores: pair of cores sharing L2 cache • non-sibling cores: pair of cores not sharing L2 cache • Ring buffers: • LockRingBuffer: lock-based ring buffer • BasicRingBuffer: Lamport’s ring buffer • MCRingBuffer: • batchSize = 1: cache-line protection • batchSize > 1: cache-line protection + batch control updates • Metrics: • Throughput: number of insert/extract pairs per second • Number of L2 cache misses: number of cache-line reload operations
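To make the throughput metric concrete, a minimal harness along the following lines could time 10M insert/extract pairs (our construction, reusing lamport_insert/lamport_extract from the earlier sketch; pinning the two threads to sibling or non-sibling cores, e.g. with pthread_setaffinity_np, is omitted for brevity):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define OPS 10000000L  /* 10M insert/extract pairs, as in Experiment 3 */

    static void *producer_thread(void *arg) {
        for (long i = 0; i < OPS; i++)
            lamport_insert((elem_t)i);
        return NULL;
    }

    static void *consumer_thread(void *arg) {
        elem_t e;
        for (long i = 0; i < OPS; i++)
            lamport_extract(&e);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&p, NULL, producer_thread, NULL);
        pthread_create(&c, NULL, consumer_thread, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("throughput: %.0f insert/extract pairs per second\n", OPS / secs);
        return 0;
    }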
Experiment 1 • Throughput vs. element size (buffer capacity = 2K elements) [Charts: throughput on sibling cores and on non-sibling cores] • MCRingBuffer with batchSize > 1 has a higher throughput gain (up to 5x) for smaller element sizes
Experiment 2 • Throughput vs. buffer capacity (element size = 128 bytes) [Charts: throughput on sibling cores and on non-sibling cores] • MCRingBuffer's throughput is invariant once the buffer capacity is large enough
Experiment 3 • Code profiling with the Intel VTune Performance Analyzer [Table: metric numbers for 10M inserts/extracts; element size = 8 bytes, capacity = 2K elements] • MCRingBuffer improves cache locality
Recap of Evaluation • MCRingBuffer improves throughput in various scenarios: • Different data sizes • Different buffer capacities • Sibling/non-sibling cores • MCRingBuffer has higher throughput gain via: • careful organization of control variables • careful accesses to control variables • MCRingBuffer’s gain does not require any special insert/extract patterns
Parallel Traffic Monitoring • Applying MCRingBuffer to parallel traffic monitoring [Figure: raw packets enter a Dispatcher; decoded packets flow through ring buffers to multiple SubAnalyzers, which send state reports to a MainAnalyzer]
Parallel Traffic Monitoring • Dispatch stage: • decode raw packets • distribute decoded packets by (srcIP, dstIP) (see the dispatch sketch after this slide) • SubAnalysis stage: • local analysis on address pairs • e.g., 5-tuple flow stats, vertical portscans • MainAnalysis stage: • global analysis: aggregate results of all SubAnalyzers • e.g., per-source volume, horizontal portscans • Evaluation results: MCRingBuffer helps scale up packet processing throughput (details in paper)
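One plausible way to implement the Dispatcher's distribution step is to hash each packet's address pair to a SubAnalyzer index; the sketch below is our illustration, not the paper's exact scheme (packet_t, dispatch_index, and NUM_SUB_ANALYZERS are assumed names):

    #include <stdint.h>

    #define NUM_SUB_ANALYZERS 3   /* assumed number of SubAnalyzer threads */

    /* Decoded packet fields needed for dispatching (illustrative subset) */
    typedef struct {
        uint32_t srcIP;
        uint32_t dstIP;
    } packet_t;

    /*
     * Map an address pair to a SubAnalyzer index. Keeping all packets of
     * a (srcIP, dstIP) pair on one core lets the per-pair analyses
     * (5-tuple flow stats, vertical portscans) run without shared state.
     */
    static int dispatch_index(const packet_t *p) {
        uint32_t h = p->srcIP ^ p->dstIP;   /* simple illustrative hash */
        return (int)(h % NUM_SUB_ANALYZERS);
    }

The Dispatcher would then insert the decoded packet into the ring buffer of SubAnalyzer dispatch_index(p), e.g. via mc_insert from the earlier sketch.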
Take-away Messages • Proposed a building block for parallel traffic monitoring: a lock-free, cache-efficient synchronization mechanism • Next question: • How do we apply MCRingBuffer to different network monitoring problems?