FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue
John Giacomoni
Advisor: Dr. Manish Vachharajani
University of Colorado at Boulder

The Rise of Multicore

Why Pipelines?
[Figure: Triple DES with 32B blocks; axes: nanoseconds/block, number of threads]
• Data parallelism has limits
• Pipeline parallelism if:
  • Granularity is fine enough
    • ≈ < 1 µs
    • ≈ 3.5 x interrupt handler
  • Total/Partial order

Fine-Grain Pipelining Examples
• Network processing:
  • Intrusion detection (NID)
  • Traffic filtering (e.g., P2P filtering)
  • Traffic shaping (e.g., packet prioritization)
• Signal Processing
  • Software Defined Radios
• Encryption
  • Triple-DES
• Other Domains
  • ODE Solvers
  • Fine-grain kernels extracted from sequential applications

Core Placements
[Figure: pipeline stages (IP, APP, Dec, Enc, OP) placed on the cores of a 4x4 NUMA organization (ex: AMD Opteron Barcelona)]

Routing/Bridge Data Flow
[Figure: data flow among the IP, App, and OP stages and the OS]

Communication Overhead
• Hardware: 10 ns
• FastForward: 28 ns
• Lamport: 160 ns
• Locks: 190 ns
[Figure: chart of the overheads above, labeled GigE]

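One way to read the GigE label is as a per-frame time budget. This reading is an assumption; the slide does not spell it out. The sketch below computes how long one Ethernet frame occupies a link and what fraction of that time a single queue operation would consume. The function names `frame_time_ns` and `budget_fraction` are hypothetical; the 20 bytes of wire overhead are the standard 8-byte preamble/SFD plus the 12-byte inter-frame gap.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire: 8 B preamble/SFD + 12 B inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Time one frame occupies the link, in nanoseconds.
 * frame_bytes is the frame size including the FCS (64 for a minimum frame). */
uint64_t frame_time_ns(uint64_t frame_bytes, uint64_t link_bps) {
    assert(link_bps > 0);
    uint64_t bits = (frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;
    return bits * 1000000000ull / link_bps;
}

/* Fraction of one frame time consumed by a single queue operation
 * (hypothetical helper, not from the slides). */
double budget_fraction(double overhead_ns, uint64_t frame_bytes,
                       uint64_t link_bps) {
    return overhead_ns / (double)frame_time_ns(frame_bytes, link_bps);
}
```

For 64-byte frames on a 1 Gb/s link, `frame_time_ns(64, 1000000000ull)` yields 672 ns, so under this reading a 190 ns lock-based hand-off uses roughly 28% of the per-frame budget, while a 28 ns FastForward hand-off uses roughly 4%.
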
The Programming Abstraction “Stack”
• Sequential Programming
  • Improves programmer productivity
  • Very successful
  • Problematic on modern machines

The Complexity of Modern Systems
• Sequential Programming
• Parallel Programming
  • Very complex
  • Ignoring cross-layer behavior is very problematic
  • Can lead to incorrect behavior

Hardware Matters (Memory Consistency)
[Figure: code example; initially X = 0, Y = 0]
• Legal outputs for Y,X? 0,0  1,1  0,1  1,0
• Weak consistency permits!

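The code on this slide did not survive extraction. As a minimal sketch, assuming it resembles the classic message-passing test (one thread writes X and then Y, another reads Y and then X), the surprising outcome is Y,X = 1,0: the reader sees the later write but not the earlier one. The C11 version below uses a release store and an acquire load, which forbid that outcome; with plain or relaxed accesses a weakly ordered machine may produce it. All names here (`writer`, `reader`, `count_forbidden_outcomes`) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int X, Y;
static int seen_y, seen_x;   /* read by the main thread after pthread_join */

static void *writer(void *arg) {
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    /* Release: the write to X becomes visible before the write to Y. */
    atomic_store_explicit(&Y, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    /* Acquire: if Y == 1 is observed, X == 1 must be observed as well. */
    seen_y = atomic_load_explicit(&Y, memory_order_acquire);
    seen_x = atomic_load_explicit(&X, memory_order_relaxed);
    return NULL;
}

/* Runs the test `runs` times and returns how often Y,X = 1,0 was seen. */
int count_forbidden_outcomes(int runs) {
    int forbidden = 0;
    for (int i = 0; i < runs; i++) {
        pthread_t w, r;
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        int rc = pthread_create(&r, NULL, reader, NULL);
        assert(rc == 0);
        rc = pthread_create(&w, NULL, writer, NULL);
        assert(rc == 0);
        (void)rc;
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        if (seen_y == 1 && seen_x == 0)
            forbidden++;
    }
    return forbidden;
}
```

With the release/acquire pair in place, `count_forbidden_outcomes` returns 0 for any number of runs. Weakening both orderings to `memory_order_relaxed` removes that guarantee, which is the cross-layer hazard the slide points at.
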
Hardware Matters (Memory Consistency)
[Figure: code example; initially X = 0, B.Flag = 0, C.Flag = 0]
• What is the output? 3  2  1  0

The Big Picture
• Programming modern machines is a cross-cutting problem
  • Need to evaluate/account/consider every layer
• Omitted systems areas
  • Networking
  • File systems
  • Distributed systems
  • Business server development
• Omitted related areas
  • User interfaces
  • Security

FastForward
• Cache-optimized point-to-point CLF queue
  • Fast
  • Robust against unbalanced stages
  • Hides die-die communication
  • Works with strong to weak memory consistency models

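Lamport’s CLF Queue (1)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};    /* spin while the queue is full */
  buf[head] = data;
  head = NH;                /* publish the new entry */
}
lamp_dequeue(*data) {
  while (head == tail) {}   /* spin while the queue is empty */
  *data = buf[tail];
  tail = NEXT(tail);        /* release the slot */
}
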
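The pseudocode above can be made compilable roughly as follows. This is a sketch under stated assumptions, not the author's implementation: the capacity `LAMP_SIZE`, the `struct lamp_queue` layout, the `lamp_init` helper, and the choice to return -1 instead of spinning are all illustrative. C11 atomics stand in for whatever ordering guarantees the original relied on. Both operations read both indices, so `head` and `tail` are shared between producer and consumer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define LAMP_SIZE 8                        /* assumed capacity; one slot stays unused */
#define NEXT(i) (((i) + 1) % LAMP_SIZE)

struct lamp_queue {
    void *buf[LAMP_SIZE];
    atomic_size_t head;   /* written only by the producer, read by both */
    atomic_size_t tail;   /* written only by the consumer, read by both */
};

void lamp_init(struct lamp_queue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Returns 0 on success, -1 if the queue is full (the slide spins instead). */
int lamp_enqueue(struct lamp_queue *q, void *data) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t nh = NEXT(head);
    if (nh == atomic_load_explicit(&q->tail, memory_order_acquire))
        return -1;
    q->buf[head] = data;
    /* Release: the slot contents are visible before the new head. */
    atomic_store_explicit(&q->head, nh, memory_order_release);
    return 0;
}

/* Returns 0 on success, -1 if the queue is empty (the slide spins instead). */
int lamp_dequeue(struct lamp_queue *q, void **data) {
    assert(data != NULL);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return -1;
    *data = q->buf[tail];
    /* Release: the slot has been read before it is handed back. */
    atomic_store_explicit(&q->tail, NEXT(tail), memory_order_release);
    return 0;
}
```

A caller that wants the slide's blocking behavior can wrap either call in a loop, for example `while (lamp_enqueue(&q, item) != 0) {}`.
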
Lamport’s CLF Queue (2)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
[Figure: head and tail indices over the buffer buf[0] … buf[n]]
Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

Lamport’s CLF Queue (3)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
[Figure: head and tail indices over the buffer buf[0] … buf[n]]
Observe how cachelines will still ping-pong.
What if the head/tail comparison were eliminated?

FastForward CLF Queue (1)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
ff_enqueue(data) {
  while (0 != buf[head]) {}   /* spin until the slot is empty */
  buf[head] = data;
  head = NEXT(head);          /* head stays private to the producer */
}

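A compilable sketch of the FastForward idea follows, with several assumptions labeled in the comments. The slide shows only the enqueue side; the `ff_dequeue` below is an assumed mirror image in which the consumer writes NULL back into a slot to mark it empty. `FF_SIZE`, the struct layout, `ff_init`, the 64-byte alignment, and the non-blocking return codes are illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define FF_SIZE 8                          /* assumed capacity */
#define NEXT(i) (((i) + 1) % FF_SIZE)

struct ff_queue {
    _Atomic(void *) buf[FF_SIZE];          /* NULL marks an empty slot */
    _Alignas(64) size_t head;              /* private to the producer (64 B line assumed) */
    _Alignas(64) size_t tail;              /* private to the consumer */
};

void ff_init(struct ff_queue *q) {
    for (size_t i = 0; i < FF_SIZE; i++)
        atomic_init(&q->buf[i], NULL);
    q->head = 0;
    q->tail = 0;
}

/* Returns 0 on success, -1 if the next slot is still occupied (queue full). */
int ff_enqueue(struct ff_queue *q, void *data) {
    assert(data != NULL);                  /* NULL is reserved as the empty marker */
    if (atomic_load_explicit(&q->buf[q->head], memory_order_acquire) != NULL)
        return -1;
    atomic_store_explicit(&q->buf[q->head], data, memory_order_release);
    q->head = NEXT(q->head);               /* never read by the consumer */
    return 0;
}

/* Assumed counterpart: returns 0 on success, -1 if the next slot is empty. */
int ff_dequeue(struct ff_queue *q, void **data) {
    void *item = atomic_load_explicit(&q->buf[q->tail], memory_order_acquire);
    if (item == NULL)
        return -1;
    *data = item;
    atomic_store_explicit(&q->buf[q->tail], NULL, memory_order_release);
    q->tail = NEXT(q->tail);               /* never read by the producer */
    return 0;
}
```

The full/empty test now looks only at the slot itself, so neither side ever reads the other side's index. That removes the head/tail comparison of the Lamport queue; the slots of `buf` remain the only shared data.
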
FastForward CLF Queue (2)
ff_enqueue(data) {
  while (0 != buf[head]) {}
  buf[head] = data;
  head = NEXT(head);
}
[Figure: head and tail indices over the buffer buf[0] … buf[n]]
Observe how head/tail cachelines will NOT ping-pong.
BUT, “buf” will still cause the cachelines to ping-pong.

FastForward CLF Queue (3)
ff_enqueue(data) {
  while (0 != buf[head]) {}
  buf[head] = data;
  head = NEXT(head);
}
[Figure: head and tail indices over the buffer buf[0] … buf[n]]
Solution: Temporally slip stages by a cacheline.

Maintaining Slip (Concepts)
• Use distance as the quality metric
• Explicitly compare head/tail
  • Causes cache ping-ponging
  • Perform rarely

Maintaining Slip (Method)
adjust_slip() {
  dist = distance(producer, consumer);
  if (dist < DANGER) {                  /* stages are too close together */
    dist_old = 0;
    do {
      dist_old = dist;
      spin_wait(avg_stage_time * (OK - dist));
      dist = distance(producer, consumer);
    } while (dist < OK && dist > dist_old);   /* stop once OK or no progress */
  }
}

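The slip-adjustment loop above can be exercised in isolation with a sketch like the one below. Everything beyond the loop structure is an assumption: the thresholds `DANGER` and `OK`, the ring size, the modular `distance` helper, and the `wait_fn` callback that replaces `spin_wait` so the loop can be driven by a simulated producer. `fake_wait` is a hypothetical test stand-in that advances the head instead of waiting.

```c
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_SIZE 64   /* assumed ring size */
#define DANGER      8   /* assumed: one 64 B cache line of 8 B entries */
#define OK         16   /* assumed: two cache lines of lead */

typedef void (*wait_fn)(unsigned long ns, void *ctx);

/* Number of entries by which the producer leads the consumer. */
size_t distance(size_t head, size_t tail) {
    return (head + QUEUE_SIZE - tail) % QUEUE_SIZE;
}

/* Consumer-side check: if the lead fell below DANGER, wait until it
 * reaches OK or the producer stops making progress. Returns the final lead. */
size_t adjust_slip(atomic_size_t *head, atomic_size_t *tail,
                   unsigned long avg_stage_time_ns,
                   wait_fn spin_wait, void *ctx) {
    size_t dist = distance(atomic_load(head), atomic_load(tail));
    if (dist < DANGER) {
        size_t dist_old;
        do {
            dist_old = dist;
            spin_wait(avg_stage_time_ns * (OK - dist), ctx);
            dist = distance(atomic_load(head), atomic_load(tail));
        } while (dist < OK && dist > dist_old);
    }
    return dist;
}

/* Hypothetical stand-in for spin_wait: advances the head by `step`
 * entries per call, as if the producer ran during the wait. */
struct fake_producer { atomic_size_t *head; size_t step; int calls; };

void fake_wait(unsigned long ns, void *ctx) {
    struct fake_producer *p = ctx;
    (void)ns;
    p->calls++;
    atomic_store(p->head, (atomic_load(p->head) + p->step) % QUEUE_SIZE);
}
```

Because the check reads both indices, it reintroduces the head/tail comparison that FastForward otherwise avoids, which is why the slides recommend performing it rarely.
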
Comparative Performance
[Figure: Lamport vs. FastForward]

Thrashing and Auto-Balancing
[Figure: FastForward (Thrashing) vs. FastForward (Balanced)]

Cache Verification
[Figure: FastForward (Thrashing) vs. FastForward (Balanced)]

On/Off Die Communications
[Figure: off-die communication vs. on-die communication]

On/Off-die Performance
[Figure: FastForward (On-Die) vs. FastForward (Off-Die)]

Proven Property
• “In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order.”

Routing/Bridge Data Flow
[Figure: data flow among the IP, App, and OP stages and the OS]

FShm Forward (Bridge)
• AES encrypting filter
  • Link layer encryption
  • ~10 lines of code
• IDS
  • Complex Rules
• IPS
  • DDoS
• Data Recorders
  • Traffic Analysis
  • Forensics
  • CALEA
[Figure label: 64B* 1.36 Mfps]

Flexible Communication
• Pure software stack communicating via shared memory
• Abstracted at the driver/NIC boundary
• Cross-Domain modules (Kernel/Process, T/T, P/P, K/K)
• Compatible with existing OS/library/language services
• Can communicate with any device on the memory interconnect

FastForward Conclusions…
• Cross-layer optimization
• FastForward - Cache-optimized point-to-point CLF queue
  • Fast
  • Robust against unbalanced stages
  • Hides die-die communication
  • Works with strong to weak memory consistency models

Gazing into the Crystal Ball
• Hardware: 10 ns
• FastForward: 14 ns
• FastForward: 28 ns
• Lamport: 160 ns
• Locks: 190 ns
[Figure: chart of the overheads above, labeled GigE]

The Real World
[Figure: cycles per iteration vs. iteration]

The Really Real World
[Figure: cycles per iteration vs. iteration]

Bare Metal
[Figure: cycles per iteration vs. iteration]

Questions?
john.giacomoni@colorado.edu
http://www.cs.colorado.edu/~jgiacomo
http://ce.colorado.edu/core