FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue
John Giacomoni, Tipp Moseley, and Manish Vachharajani
University of Colorado at Boulder, 2008.02.21
Why? Why Pipelines? • Multicore systems are the future • Many apps can be pipelined if the granularity is fine enough: roughly < 1 µs per stage (about 3.5x the cost of an interrupt handler)
Fine-Grain Pipelining Examples
• Network processing:
  • Intrusion detection (NID)
  • Traffic filtering (e.g., P2P filtering)
  • Traffic shaping (e.g., packet prioritization)
Core Placements
(Figure: pipeline stages IP, Dec, APP, Enc, OP mapped onto a 4x4 NUMA organization, ex: AMD Opteron Barcelona.)
Communication Overhead
(Chart: per-operation communication cost, with GigE as the reference line.)
• Locks: 320 ns
• Lamport: 160 ns
• FastForward: 28 ns
• Hardware: 10 ns
More Fine-Grain Pipelining Examples
• Network processing:
  • Intrusion detection (NID)
  • Traffic filtering (e.g., P2P filtering)
  • Traffic shaping (e.g., packet prioritization)
• Signal processing:
  • Media transcoding/encoding/decoding
  • Software-defined radios
• Encryption:
  • Counter-mode AES
• Other domains:
  • Fine-grain kernels extracted from sequential applications
FastForward • Cache-optimized point-to-point CLF (concurrent lock-free) queue • Fast • Robust against unbalanced stages • Hides die-to-die communication • Works with memory consistency models from strong through weak
Lamport's CLF Queue (1)

lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {};
    buf[head] = data;
    head = NH;
}

lamp_dequeue(*data) {
    while (head == tail) {};
    *data = buf[tail];
    tail = NEXT(tail);
}
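For readers who want to try the algorithm, the following is a minimal, self-contained C sketch of Lamport's single-producer/single-consumer queue from the pseudocode above. The buffer size, pointer-sized element type, and the NEXT() wrap-around macro are illustrative assumptions; a production version on a weakly ordered machine would also need memory barriers.

/* Minimal sketch of Lamport's single-producer/single-consumer queue.
 * Buffer size, element type, and NEXT() are illustrative assumptions. */
#include <stddef.h>

#define QUEUE_SIZE 4096                    /* power of two, assumed */
#define NEXT(i)    (((i) + 1) & (QUEUE_SIZE - 1))

static void *buf[QUEUE_SIZE];
static volatile size_t head = 0;           /* written by the producer */
static volatile size_t tail = 0;           /* written by the consumer */

void lamp_enqueue(void *data)
{
    size_t nh = NEXT(head);
    while (nh == tail)                     /* spin while the queue is full  */
        ;
    buf[head] = data;
    head = nh;
}

void lamp_dequeue(void **data)
{
    while (head == tail)                   /* spin while the queue is empty */
        ;
    *data = buf[tail];
    tail = NEXT(tail);
}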
Lamport's CLF Queue (2)
(lamp_enqueue code repeated from the previous slide; diagram: head and tail indexing into a shared circular buffer buf[0..n].)
Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.
Lamport's CLF Queue (3)
(lamp_enqueue code repeated; diagram: head and tail indexing into the shared circular buffer buf[0..n].)
Observe how cachelines will still ping-pong. What if the head/tail comparison was eliminated?
FastForward CLF Queue (1)

lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {};
    buf[head] = data;
    head = NH;
}

ff_enqueue(data) {
    while (0 != buf[head]);
    buf[head] = data;
    head = NEXT(head);
}
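The slides show only the enqueue side. As a point of reference, here is a hedged C sketch of the full single-producer/single-consumer pair; ff_dequeue is reconstructed by symmetry with NULL as the empty-slot sentinel, and the buffer size and blocking spin loops are illustrative choices, not the authors' exact code.

/* Sketch of the FastForward SPSC queue: head is private to the producer,
 * tail is private to the consumer, and slot contents (NULL vs. non-NULL)
 * signal empty/full, so the indices never ping-pong between caches.
 * ff_dequeue() is reconstructed by symmetry; it is not on the slide. */
#include <stddef.h>

#define QUEUE_SIZE 4096
#define NEXT(i)    (((i) + 1) & (QUEUE_SIZE - 1))

static void *volatile buf[QUEUE_SIZE];     /* slots start out NULL (empty) */
static size_t head = 0;                    /* touched only by the producer */
static size_t tail = 0;                    /* touched only by the consumer */

void ff_enqueue(void *data)                /* data must be non-NULL */
{
    while (buf[head] != NULL)              /* spin while the slot is full  */
        ;
    buf[head] = data;
    head = NEXT(head);
}

void ff_dequeue(void **data)
{
    while (buf[tail] == NULL)              /* spin while the slot is empty */
        ;
    *data = buf[tail];
    buf[tail] = NULL;                      /* mark the slot empty again    */
    tail = NEXT(tail);
}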
FastForward CLF Queue (2)
(ff_enqueue code repeated from the previous slide; diagram: producer and consumer indexing into buf[0..n].)
Observe how the head/tail cachelines will NOT ping-pong. BUT, "buf" will still cause cachelines to ping-pong.
FastForward CLF Queue (3)
(ff_enqueue code repeated; diagram: producer and consumer slipped apart in buf[0..n].)
Solution: temporally slip the stages by a cacheline. N:1 reduction in coherence misses per stage.
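To make the N:1 figure concrete (assuming 64-byte cache lines and 8-byte pointer-sized slots, so N = 64 / 8 = 8): once the stages are slipped by at least one cache line, the producer and consumer touch different lines at any instant, so each line of buf moves between caches once per 8 enqueue/dequeue operations instead of once per operation, roughly an 8:1 reduction in coherence misses.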
Maintaining Slip (Concepts)
• Use distance as the quality metric
• Explicitly compare head/tail
  • Causes cache ping-ponging
  • Perform rarely
Maintaining Slip (Method)

adjust_slip() {
    dist = distance(producer, consumer);
    if (dist < *Danger*) {
        dist_old = 0;
        do {
            dist_old = dist;
            spin_wait(avg_stage_time * (*OK* - dist));
            dist = distance(producer, consumer);
        } while (dist < *OK* && dist > dist_old);
    }
}
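Below is a fleshed-out C version of the routine above, continuing the FastForward queue sketch from earlier. The SLIP_DANGER and SLIP_OK thresholds, the distance() and spin_wait() helpers, and calling this from the consumer side are all illustrative assumptions; the talk marks the thresholds only as *Danger* and *OK*.

#include <stddef.h>
#include <time.h>

/* Assumed threshold values, in queue slots. */
#define SLIP_DANGER  16      /* minimum safe producer-consumer gap */
#define SLIP_OK      64      /* target gap to restore              */

/* Uses QUEUE_SIZE, head (producer index), and tail (consumer index)
 * from the queue sketch above. */

static double avg_stage_time_ns;            /* measured cost of one stage, ns */

static size_t distance(size_t producer, size_t consumer)
{
    return (producer - consumer) & (QUEUE_SIZE - 1);   /* slots in flight */
}

static void spin_wait(double ns)            /* crude busy-wait, illustration only */
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        clock_gettime(CLOCK_MONOTONIC, &t1);
    } while ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec) < ns);
}

void adjust_slip(void)                      /* called occasionally by the consumer */
{
    size_t dist = distance(head, tail);
    if (dist < SLIP_DANGER) {               /* stages have drifted too close together */
        size_t dist_old;
        do {
            dist_old = dist;
            /* Pause long enough for the producer to regain the desired gap,
             * then re-measure; give up if the gap stops growing. */
            spin_wait(avg_stage_time_ns * (double)(SLIP_OK - dist));
            dist = distance(head, tail);
        } while (dist < SLIP_OK && dist > dist_old);
    }
}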
Comparative Performance
(Chart: Lamport vs. FastForward.)

Thrashing and Auto-Balancing
(Charts: FastForward thrashing vs. FastForward balanced.)

Cache Verification
(Charts: FastForward thrashing vs. FastForward balanced.)

On/Off-Die Communications
(Figure: off-die communication vs. on-die communication paths.)

On/Off-Die Performance
(Chart: FastForward on-die vs. off-die.)
Proven Property • "In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order."
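A tiny test harness illustrating this property against the queue sketch above: one producer thread enqueues the values 1..N in order, and the consumer asserts that it dequeues exactly that order. N and the use of POSIX threads are illustrative choices, not part of the talk.

/* FIFO-order check for the ff_enqueue/ff_dequeue sketch above. */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define N 1000000

static void *producer(void *arg)
{
    (void)arg;
    for (uintptr_t i = 1; i <= N; i++)
        ff_enqueue((void *)i);             /* enqueue non-NULL values in order */
    return NULL;
}

int main(void)
{
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    for (uintptr_t expect = 1; expect <= N; expect++) {
        void *v;
        ff_dequeue(&v);
        assert((uintptr_t)v == expect);    /* consumer sees the producer's order */
    }
    pthread_join(p, NULL);
    puts("FIFO order preserved");
    return 0;
}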
Work in Progress
• Operating systems
  • 27.5 ns/op
  • 3.1 % cost reduction vs. the reported 28.5 ns
  • Reduced jitter
• Applications
  • 128-bit AES encrypting filter
  • Ethernet-layer encryption at 1.45 Mfps
  • IP-layer encryption at 1.51 Mfps
  • ~10 lines of code for each
Gazing into the Crystal Ball
(Chart repeated: Hardware 10 ns, FastForward 28 ns, Lamport 160 ns, Locks 320 ns, GigE reference.)
Shared Memory Accelerated Queues Now Available!
http://ce.colorado.edu/core
Questions? john.giacomoni@colorado.edu