A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring
Patrick P. C. Lee (The Chinese University of Hong Kong), Tian Bu and Girish Chandranmenon (Bell Labs, Alcatel-Lucent)
April 2010
Outline • Motivation • MCRingBuffer, a multi-core ring buffer • Parallel network monitoring prototype • Conclusions
Network Traffic Monitoring • Monitoring data streams in today's networks is essential for network management: • accounting • resource provisioning • failure diagnosis • intrusion detection/prevention • Goal: achieve line-rate monitoring • Monitoring speed must keep up with link bandwidth (i.e., prepare for the worst) • Challenges: • Data volume keeps increasing (e.g., to Gigabit scales) • Single-CPU systems may no longer support line-rate monitoring
Can Multi-Core Help? • Can multi-core architectures help line-rate monitoring? • Parallelize packet processing [Figure: single-core case (raw packets feed one CPU) vs. multi-core case (raw packets feed a quad-core CPU)] • The answer should be "yes"… yet exploiting the full potential of multi-core is still challenging • Inter-core communication has overhead: • Upper layer: protocol messages • Lower layer: thread synchronization in shared data structures
Can Multi-Core Help? • Multi-core helps only if we minimize inter-core communication overhead • Let's focus on minimizing thread synchronization overhead • This benefits a broad class of multi-threaded network monitoring applications
Our Contribution • Design a lock-free, cache-efficient multi-core synchronization mechanism for high-speed network traffic monitoring • Why lock-free? • Allows concurrent thread accesses • Why cache-efficient? • Saves expensive memory accesses • We embed the mechanism into MCRingBuffer, a lock-free, cache-efficient shared ring buffer tailored for multi-core architectures
Producer/Consumer Problem • Classical OS problem • Ring buffer: a bounded buffer with a fixed number of slots • Thread synchronization: • Producer inserts elements when the buffer is not full • Consumer extracts elements when the buffer is not empty • First-in-first-out (FIFO): elements are extracted in the same order they are inserted [Figure: producer inserts an element into the ring buffer; consumer extracts it]
Producer/Consumer Problem • Ring buffer in a multi-core context: [Figure: producer and consumer run on separate cores of one CPU, each with a private L1 cache and a shared L2 cache; the control variables and the ring buffer sit in memory across the system bus] • Thread synchronization operates on control variables; make the operations as cache-friendly as possible
Lamport's Lock-Free Ring Buffer [Lamport, Comm. of ACM, 1977] • Operate on control variables read and write, which point to the next read and write slots, respectively [Figure: ring of N slots, indexed 0 to N-1, with read and write pointers]

    NEXT(x) = (x + 1) % N

    Insert(T element):
        wait until NEXT(write) != read
        buffer[write] = element
        write = NEXT(write)

    Extract(T* element):
        wait until read != write
        *element = buffer[read]
        read = NEXT(read)
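For concreteness, a minimal C rendering of this algorithm might look as follows (names such as lamport_insert and BUF_SIZE are ours; volatile stands in for the sequential-consistency assumption stated on the assumptions slide, and production code would use C11 atomics or memory barriers):

    #define BUF_SIZE 2048                     /* N: number of slots */
    #define NEXT(x)  (((x) + 1) % BUF_SIZE)

    typedef int elem_t;                       /* element type T; int for illustration */

    static elem_t buffer[BUF_SIZE];
    static volatile int read_idx  = 0;        /* next slot to extract */
    static volatile int write_idx = 0;        /* next slot to insert */

    void lamport_insert(elem_t element) {
        while (NEXT(write_idx) == read_idx)
            ;                                 /* buffer full: busy-wait */
        buffer[write_idx] = element;
        write_idx = NEXT(write_idx);          /* publish only after the data write */
    }

    void lamport_extract(elem_t *element) {
        while (read_idx == write_idx)
            ;                                 /* buffer empty: busy-wait */
        *element = buffer[read_idx];
        read_idx = NEXT(read_idx);            /* free the slot only after the read */
    }

Note that one slot is always left empty, so the full and empty conditions remain distinguishable without extra state.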
Previous Work • FastForward [Giacomoni et al., PPoPP, 2008]: • couples data/control operations • needs a special NULL data element defined by applications • Hardware-primitive ring buffers: • support multiple producers/multiple consumers • use hardware synchronization primitives (e.g., compare-and-swap) • Hardware primitives are expensive in general
MCRingBuffer Overview • Goal: use Lamport's ring buffer as a building block and further minimize the cost of thread synchronization • Properties: • Lock-free: allows concurrent accesses by the producer and consumer • Cache-efficient: improves cache locality of synchronization • Generic: no assumptions on data types or insert/extract patterns • Deployable: works on general-purpose multi-core CPUs • Components: • Cache-line protection • Batch updates of control variables
MCRingBuffer Assumptions • Assumptions inherited from Lamport's ring buffer: • single producer/single consumer • reads and writes of the control variables read/write are atomic • memory accesses follow sequential consistency
Cache-line Protection • Cache operates in units of cache lines • False sharing occurs when two threads access different variables on the same cache line: • the cache line is invalidated when one thread modifies a variable • the cache line is reloaded from memory when the other thread reads a different variable on it, even an unchanged one [Figure: read/write, modified frequently for thread synchronization, share a cache line with N; N (the ring buffer size) is reloaded from memory even though it is constant]
Cache-line Protection • Add padding bytes to avoid false sharing (CL = cache line size):

    int  read;
    int  write;
    char cachePad1[CL - 2 * sizeof(int)];
    int  N;
    char cachePad2[CL - sizeof(int)];

[Figure: read, write, and cachePad1 fill one cache line; N and cachePad2 fill another]
Cache-line Protection • Use cache-line protection to minimize memory accesses [Figure: variables grouped on separate cache lines: shared variables (read, write, cachePad1); consumer's local variables (localWrite, nextRead, cachePad2); producer's local variables (localRead, nextWrite, cachePad3); constants (N, cachePad4)] • Shared variables are the main controls of synchronization • Use local variables to "guess" shared variables • Goal: minimize the frequency of reading shared control variables (see the struct sketch after this slide)
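A minimal C sketch of this layout, assuming a 64-byte cache line (the struct name mc_ring_t is ours; rBatch, wBatch, and batchSize come from the batching slide that follows, and their placement here is our assumption):

    #define CACHE_LINE 64  /* assumed cache-line size in bytes */

    typedef struct {
        /* Shared control variables (written by one thread, read by the other) */
        volatile int read;
        volatile int write;
        char pad1[CACHE_LINE - 2 * sizeof(int)];

        /* Consumer's local variables (touched only by the consumer) */
        int localWrite;   /* consumer's cached "guess" of write */
        int nextRead;     /* consumer's private read position */
        int rBatch;       /* extracts since read was last published */
        char pad2[CACHE_LINE - 3 * sizeof(int)];

        /* Producer's local variables (touched only by the producer) */
        int localRead;    /* producer's cached "guess" of read */
        int nextWrite;    /* producer's private write position */
        int wBatch;       /* inserts since write was last published */
        char pad3[CACHE_LINE - 3 * sizeof(int)];

        /* Constants (read-only after initialization) */
        int N;            /* ring buffer capacity */
        int batchSize;    /* read/write are published once per batch */
        char pad4[CACHE_LINE - 2 * sizeof(int)];
    } mc_ring_t;

Because each group sits on its own cache line, the producer's updates to its local variables never invalidate the consumer's lines, and the constant N is never reloaded when read/write change.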
Batch Updates of Control Variables • Intuition: • nextRead/nextWrite are the positions where to read/write • Update read/write only after batchSize reads/writes

    Producer:
        buffer[nextWrite] = element
        nextWrite = NEXT(nextWrite)
        wBatch++
        if (wBatch >= batchSize) {
            write = nextWrite
            wBatch = 0
        }

    Consumer:
        *element = buffer[nextRead]
        nextRead = NEXT(nextRead)
        rBatch++
        if (rBatch >= batchSize) {
            read = nextRead
            rBatch = 0
        }

• Goal: minimize the frequency of writing shared control variables (a combined sketch follows below)
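Putting local "guesses" and batching together, the producer side might look as follows, reusing mc_ring_t and elem_t from the sketches above (this is our simplification of the paper's algorithm, with Lamport-style busy-waiting):

    void mc_insert(mc_ring_t *r, elem_t element, elem_t *buf) {
        int after = (r->nextWrite + 1) % r->N;
        if (after == r->localRead) {          /* local guess says the buffer is full */
            while (after == r->read)
                ;                             /* re-check the shared variable; busy-wait */
            r->localRead = r->read;           /* refresh the guess */
        }
        buf[r->nextWrite] = element;
        r->nextWrite = after;
        if (++r->wBatch >= r->batchSize) {    /* publish write once per batch */
            r->write = r->nextWrite;
            r->wBatch = 0;
        }
    }

Extract is symmetric: the consumer compares nextRead against its localWrite guess, refreshes localWrite from the shared write only when the guess says the buffer is empty, and publishes read once every batchSize extractions. The shared variables are thus touched only when a guess fails or a batch completes.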
Batch Updates of Control Variables • Limitation: • read/write advance on a per-batch basis, so inserted elements may not be extractable even when the buffer is not empty • However, if elements are raw packets in high-speed networks, read/write will be updated regularly
Correctness of MCRingBuffer • Correctness based on Lamport’s ring buffer: • Lamport’s: • Insert only if write – read < N • Extract only if read < write • We prove for MCRingBuffer: • Insert only if nextWrite – nextRead < N • Extract only if nextRead < nextWrite • Details in the paper.
Evaluation • Hardware: Intel Xeon 5355 Quad-core • sibling cores: pair of cores sharing L2 cache • non-sibling cores: pair of cores not sharing L2 cache • Ring buffers: • LockRingBuffer: lock-based ring buffer • BasicRingBuffer: Lamport’s ring buffer • MCRingBuffer: • batchSize = 1: cache-line protection • batchSize > 1: cache-line protection + batch control updates • Metrics: • Throughput: number of insert/extract pairs per second • Number of L2 cache misses: number of cache-line reload operations
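To make the throughput metric concrete, a minimal harness along the following lines could time 10M insert/extract pairs (our construction, reusing lamport_insert/lamport_extract from the earlier sketch; pinning the two threads to sibling or non-sibling cores, e.g. with pthread_setaffinity_np, is omitted for brevity):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define OPS 10000000L  /* 10M insert/extract pairs, as in Experiment 3 */

    static void *producer_thread(void *arg) {
        for (long i = 0; i < OPS; i++)
            lamport_insert((elem_t)i);
        return NULL;
    }

    static void *consumer_thread(void *arg) {
        elem_t e;
        for (long i = 0; i < OPS; i++)
            lamport_extract(&e);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&p, NULL, producer_thread, NULL);
        pthread_create(&c, NULL, consumer_thread, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("throughput: %.0f insert/extract pairs per second\n", OPS / secs);
        return 0;
    }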
Experiment 1 • Throughput vs. element size (buffer capacity = 2K elements) [Charts: throughput on sibling cores and on non-sibling cores] • MCRingBuffer with batchSize > 1 has a higher throughput gain (up to 5x) for smaller element sizes
Experiment 2 • Throughput vs. buffer capacity (element size = 128 bytes) [Charts: throughput on sibling cores and on non-sibling cores] • MCRingBuffer's throughput is invariant once the buffer capacity is large enough
Experiment 3 • Code profiling with the Intel VTune Performance Analyzer [Table: metric numbers for 10M inserts/extracts; element size = 8 bytes, capacity = 2K elements] • MCRingBuffer improves cache locality
Recap of Evaluation • MCRingBuffer improves throughput in various scenarios: • Different data sizes • Different buffer capacities • Sibling/non-sibling cores • MCRingBuffer has higher throughput gain via: • careful organization of control variables • careful accesses to control variables • MCRingBuffer’s gain does not require any special insert/extract patterns
Parallel Traffic Monitoring • Applying MCRingBuffer to parallel traffic monitoring [Figure: raw packets enter a Dispatcher; decoded packets flow through ring buffers to multiple SubAnalyzers, which send state reports to a MainAnalyzer]
Parallel Traffic Monitoring • Dispatch stage: • decode raw packets • distribute decoded packets by (srcIP, dstIP) (see the dispatch sketch after this slide) • SubAnalysis stage: • local analysis on address pairs • e.g., 5-tuple flow stats, vertical portscans • MainAnalysis stage: • global analysis: aggregate results of all SubAnalyzers • e.g., per-source volume, horizontal portscans • Evaluation results: MCRingBuffer helps scale up packet processing throughput (details in paper)
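One plausible way to implement the Dispatcher's distribution step is to hash each packet's address pair to a SubAnalyzer index; the sketch below is our illustration, not the paper's exact scheme (packet_t, dispatch_index, and NUM_SUB_ANALYZERS are assumed names):

    #include <stdint.h>

    #define NUM_SUB_ANALYZERS 3   /* assumed number of SubAnalyzer threads */

    /* Decoded packet fields needed for dispatching (illustrative subset) */
    typedef struct {
        uint32_t srcIP;
        uint32_t dstIP;
    } packet_t;

    /*
     * Map an address pair to a SubAnalyzer index. Keeping all packets of
     * a (srcIP, dstIP) pair on one core lets the per-pair analyses
     * (5-tuple flow stats, vertical portscans) run without shared state.
     */
    static int dispatch_index(const packet_t *p) {
        uint32_t h = p->srcIP ^ p->dstIP;   /* simple illustrative hash */
        return (int)(h % NUM_SUB_ANALYZERS);
    }

The Dispatcher would then insert the decoded packet into the ring buffer of SubAnalyzer dispatch_index(p), e.g. via mc_insert from the earlier sketch.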
Take-away Messages • Proposed a building block for parallel traffic monitoring: a lock-free, cache-efficient synchronization mechanism • Next question: • How do we apply MCRingBuffer to different network monitoring problems?