FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue

John Giacomoni • Advisor: Dr. Manish Vachharajani • University of Colorado at Boulder

The Rise of Multicore

Why Pipelines? • Data parallelism has limits • Pipeline parallelism if: • Granularity is fine enough • ≈ < 1 µs • ≈ 3.5× interrupt handler • Total/Partial order [Figure: Triple DES with 32B blocks; nanoseconds/block vs. number of threads]

Fine-Grain Pipelining Examples • Network processing: • Intrusion detection (NID) • Traffic filtering (e.g., P2P filtering) • Traffic shaping (e.g., packet prioritization) • Signal Processing • Software Defined Radios • Encryption • Triple-DES • Other Domains • ODE Solvers • Fine-grain kernels extracted from sequential applications

Network Processing Scenarios

Core Placements [Figure: IP, APP, Dec, Enc, and OP stages placed on cores] 4x4 NUMA Organization (ex: AMD Opteron Barcelona)

Routing/Bridge Data Flow [Figure: IP, App, OP, OS]

Example 3-Stage Pipeline

Communication Overhead • Hardware: 10 ns • FastForward: 28 ns • Lamport: 160 ns • Locks: 190 ns • GigE
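To put these overheads in context, here is a rough sketch of the per-frame time budget at line rate. It assumes the "GigE" label on the slide refers to Gigabit Ethernet with minimum-size 64-byte frames, each of which also occupies 20 bytes of preamble and inter-frame gap on the wire. The function names are hypothetical and not from the talk.

```c
#include <assert.h>
#include <stdio.h>

/* Per-frame wire overhead in bytes: 8 (preamble + SFD) + 12 (inter-frame gap). */
#define WIRE_OVERHEAD_BYTES 20.0

/* Time one frame occupies the link, in nanoseconds. */
double frame_time_ns(double frame_bytes, double link_bits_per_sec) {
    assert(link_bits_per_sec > 0.0);
    return (frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0 * 1e9 / link_bits_per_sec;
}

/* Fraction of the per-frame budget used by one queue operation. */
double budget_fraction(double op_ns, double budget_ns) {
    return op_ns / budget_ns;
}

/* Print each overhead from the slide as a share of the 64-byte GigE budget. */
void print_budget_table(void) {
    const char *name[] = {"Hardware", "FastForward", "Lamport", "Locks"};
    const double ns[] = {10.0, 28.0, 160.0, 190.0};
    double budget = frame_time_ns(64.0, 1e9);
    for (int i = 0; i < 4; i++)
        printf("%-12s %5.0f ns  %4.1f%% of %.0f ns\n",
               name[i], ns[i], 100.0 * budget_fraction(ns[i], budget), budget);
}
```

Under these assumptions a frame arrives every 672 ns, so a 190 ns lock-based hand-off uses more than a quarter of the budget while a 28 ns hand-off uses about 4%.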

The Programming Abstraction “Stack” • Sequential Programming • Improves programmer productivity • Very successful • Problematic on modern machines

The Complexity of Modern Systems • Sequential Programming • Parallel Programming • Very complex • Ignoring cross-layer behavior very problematic • Can lead to incorrect behavior

Hardware Matters (Memory Consistency) • Initially: X = 0, Y = 0 • Output: Y,X • Legal outputs? 0,0 • 1,1 • 0,1 • 1,0 • Weak consistency permits!
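The code shown on this slide did not survive extraction. As an illustration only, the sketch below assumes it was the classic message-passing test: one thread stores X then Y, another reads Y then X. With relaxed C11 atomics a concurrent reader may legally observe Y = 1 and X = 0; with a release store and an acquire load, observing Y = 1 guarantees X = 1. All function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

atomic_int X, Y; /* static storage: both start at 0 */

/* Writer with no ordering constraints: the two stores may become
 * visible to another thread in either order. */
void writer_relaxed(void) {
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
}

/* Reader with no ordering constraints: may see Y == 1 but X == 0
 * when run concurrently with writer_relaxed(). */
void reader_relaxed(int *y, int *x) {
    assert(y != NULL && x != NULL);
    *y = atomic_load_explicit(&Y, memory_order_relaxed);
    *x = atomic_load_explicit(&X, memory_order_relaxed);
}

/* Writer that publishes X before Y using a release store. */
void writer_release(void) {
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    atomic_store_explicit(&Y, 1, memory_order_release);
}

/* Reader that pairs with writer_release(): if *y == 1 then *x == 1. */
void reader_acquire(int *y, int *x) {
    assert(y != NULL && x != NULL);
    *y = atomic_load_explicit(&Y, memory_order_acquire);
    *x = atomic_load_explicit(&X, memory_order_relaxed);
}
```

Run on two threads, the relaxed pair can produce any of the four outputs listed on the slide, while the release/acquire pair rules out 1,0.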

Hardware Matters (Memory Consistency) • Initially: X = 0, B.Flag = 0, C.Flag = 0 • What is the output? 3 • 2 • 1 • 0

The Big Picture • Programming modern machines is a cross-cutting problem • Need to evaluate/account/consider every layer • Omitted systems areas • Networking • File systems • Distributed systems • Business server development • Omitted related areas • User interfaces • Security

FastForward • Cache-optimized point-to-point CLF queue • Fast • Robust against unbalanced stages • Hides die-die communication • Works with strong to weak memory consistency models

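Lamport’s CLF Queue (1)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};   // spin while the queue is full
  buf[head] = data;
  head = NH;               // publish the new entry
}
lamp_dequeue(*data) {
  while (head == tail) {}  // spin while the queue is empty
  *data = buf[tail];
  tail = NEXT(tail);       // release the slot
}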
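A minimal C11 sketch of the queue above, under a few assumptions that are not on the slide: a fixed capacity `LAMP_SIZE`, non-blocking `try` variants that return -1 instead of spinning (so the code can be exercised from a single thread), and release/acquire atomics on `head` and `tail`. Note that both indices are shared between producer and consumer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define LAMP_SIZE 8                         /* hypothetical capacity */
#define LAMP_NEXT(i) (((i) + 1) % LAMP_SIZE)

typedef struct {
    void *buf[LAMP_SIZE];
    atomic_size_t head;  /* written by the producer, read by the consumer */
    atomic_size_t tail;  /* written by the consumer, read by the producer */
} lamp_queue;

void lamp_init(lamp_queue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Returns 0 on success, -1 if the queue is full. */
int lamp_try_enqueue(lamp_queue *q, void *data) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t nh = LAMP_NEXT(head);
    if (nh == atomic_load_explicit(&q->tail, memory_order_acquire))
        return -1;                           /* full: one slot is always kept free */
    q->buf[head] = data;
    atomic_store_explicit(&q->head, nh, memory_order_release);
    return 0;
}

/* Returns 0 on success, -1 if the queue is empty. */
int lamp_try_dequeue(lamp_queue *q, void **data) {
    assert(data != NULL);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return -1;                           /* empty */
    *data = q->buf[tail];
    atomic_store_explicit(&q->tail, LAMP_NEXT(tail), memory_order_release);
    return 0;
}
```

Every enqueue reads `tail` and every dequeue reads `head`, so each operation touches an index that the other core writes.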

Lamport’s CLF Queue (2)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
[Figure: head, tail, and buf[0] through buf[n] laid out in memory]

NUMA Cache Example [Figure]

Lamport’s CLF Queue (2)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
[Figure: head, tail, and buf[0] through buf[n] laid out in memory]
Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

Lamport’s CLF Queue (3)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};
  buf[head] = data;
  head = NH;
}
[Figure: head, tail, and buf[0] through buf[n] laid out in memory]
Observe how cachelines will still ping-pong. What if the head/tail comparison were eliminated?

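FastForward CLF Queue (1)
lamp_enqueue(data) {
  NH = NEXT(head);
  while (NH == tail) {};    // compares against the consumer's index
  buf[head] = data;
  head = NH;
}
ff_enqueue(data) {
  while(0 != buf[head]);    // spin until the slot itself is empty
  buf[head] = data;
  head = NEXT(head);        // head is private to the producer
}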
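A minimal C11 sketch of the FastForward idea, with several assumptions beyond the slide: a fixed capacity `FF_SIZE`, a 64-byte cacheline, non-blocking `try` variants instead of the spinning loop, and a dequeue (not shown on the slide) that mirrors the enqueue by treating a NULL slot as empty and writing NULL back once the value is taken. NULL is therefore reserved and cannot be stored as data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define FF_SIZE 8                          /* hypothetical capacity */
#define FF_NEXT(i) (((i) + 1) % FF_SIZE)

typedef struct {
    _Alignas(64) _Atomic(void *) buf[FF_SIZE]; /* NULL marks an empty slot */
    _Alignas(64) size_t head;                  /* touched only by the producer */
    _Alignas(64) size_t tail;                  /* touched only by the consumer */
} ff_queue;

void ff_init(ff_queue *q) {
    for (size_t i = 0; i < FF_SIZE; i++)
        atomic_init(&q->buf[i], NULL);
    q->head = 0;
    q->tail = 0;
}

/* Returns 0 on success, -1 if the next slot is still occupied (queue full). */
int ff_try_enqueue(ff_queue *q, void *data) {
    assert(data != NULL);                  /* NULL is the "empty" marker */
    if (atomic_load_explicit(&q->buf[q->head], memory_order_acquire) != NULL)
        return -1;
    atomic_store_explicit(&q->buf[q->head], data, memory_order_release);
    q->head = FF_NEXT(q->head);
    return 0;
}

/* Returns 0 on success, -1 if the next slot is empty (queue empty). */
int ff_try_dequeue(ff_queue *q, void **data) {
    assert(data != NULL);
    void *v = atomic_load_explicit(&q->buf[q->tail], memory_order_acquire);
    if (v == NULL)
        return -1;
    *data = v;
    atomic_store_explicit(&q->buf[q->tail], NULL, memory_order_release);
    q->tail = FF_NEXT(q->tail);
    return 0;
}
```

In this sketch the producer never reads `tail` and the consumer never reads `head`, so the only shared cachelines are those holding `buf`. Because full and empty are detected per slot, all `FF_SIZE` slots are usable.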

FastForward CLF Queue (2)
ff_enqueue(data) {
  while(0 != buf[head]);
  buf[head] = data;
  head = NEXT(head);
}
[Figure: head, tail, and buf[0] through buf[n] laid out in memory]
Observe how head/tail cachelines will NOT ping-pong. BUT, “buf” will still cause the cachelines to ping-pong.

FastForward CLF Queue (3)
ff_enqueue(data) {
  while(0 != buf[head]);
  buf[head] = data;
  head = NEXT(head);
}
[Figure: head, tail, and buf[0] through buf[n] laid out in memory]
Solution: Temporally slip stages by a cacheline.

Slip Timing

Slip Timing Lost

Maintaining Slip (Concepts) • Use distance as the quality metric • Explicitly compare head/tail • Causes cache ping-ponging • Perform rarely

Maintaining Slip (Method)
adjust_slip() {
  dist = distance(producer, consumer);
  if (dist < Danger) {                 // stages are too close together
    dist_old = 0;
    do {
      dist_old = dist;
      spin_wait(avg_stage_time * (OK - dist));
      dist = distance(producer, consumer);
    } while (dist < OK && dist > dist_old);  // keep waiting only while the gap grows
  }
}
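The slip-adjustment loop above might be written in C roughly as follows. This is a sketch under stated assumptions: the thresholds `SLIP_DANGER` and `SLIP_OK`, the queue size, the `slip_view` struct exposing both indices, and a `spin_wait` that simply counts loop iterations are all hypothetical stand-ins, since the talk does not define `distance`, `spin_wait`, or `avg_stage_time`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SLIP_QSIZE  64  /* hypothetical queue size in entries */
#define SLIP_DANGER 8   /* hypothetical: below this, stages are too close */
#define SLIP_OK     16  /* hypothetical: target distance to restore */

/* Hypothetical view of the producer and consumer indices. */
typedef struct {
    atomic_size_t head;  /* producer index */
    atomic_size_t tail;  /* consumer index */
} slip_view;

/* Entries between producer and consumer, accounting for wrap-around. */
size_t slip_distance(slip_view *v) {
    size_t h = atomic_load_explicit(&v->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&v->tail, memory_order_relaxed);
    return (h + SLIP_QSIZE - t) % SLIP_QSIZE;
}

/* Busy-wait for roughly `iters` loop iterations. */
void spin_wait(unsigned long iters) {
    for (volatile unsigned long i = 0; i < iters; i++) { }
}

/* Called rarely by the consumer, since reading both indices causes
 * cacheline traffic. Returns the last observed distance. */
size_t adjust_slip(slip_view *v, unsigned long avg_stage_iters) {
    assert(v != NULL);
    size_t dist = slip_distance(v);
    if (dist < SLIP_DANGER) {
        size_t dist_old;
        do {
            dist_old = dist;
            spin_wait(avg_stage_iters * (SLIP_OK - dist));
            dist = slip_distance(v);
        } while (dist < SLIP_OK && dist > dist_old);
    }
    return dist;
}
```

The loop gives up as soon as the distance stops growing, so a stalled producer cannot trap the consumer inside `adjust_slip`.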

Comparative Performance [Figure panels: Lamport; FastForward]

Thrashing and Auto-Balancing [Figure panels: FastForward (Thrashing); FastForward (Balanced)]

Cache Verification [Figure panels: FastForward (Thrashing); FastForward (Balanced)]

On/Off-Die Communications [Figure: off-die communication; on-die communication]

On/Off-Die Performance [Figure panels: FastForward (On-Die); FastForward (Off-Die)]

Proven Property • “In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order.”

Routing/Bridge Data Flow [Figure: IP, App, OP, OS]

FShm Forward (Bridge) • AES encrypting filter • Link layer encryption • ~10 lines of code • IDS • Complex Rules • IPS • DDoS • Data Recorders • Traffic Analysis • Forensics • CALEA • 64B*: 1.36 Mfps

Flexible Communication • Pure software stack communicating via shared memory • Abstracted at the driver/NIC boundary • Cross-Domain modules (Kernel/Process, T/T, P/P, K/K) • Compatible with existing OS/library/language services • Can communicate with any device on the memory interconnect

FastForward Conclusions… • Cross-layer optimization • FastForward - Cache-optimized point-to-point CLF queue • Fast • Robust against unbalanced stages • Hides die-die communication • Works with strong to weak memory consistency models

Gazing into the Crystal Ball • Hardware: 10 ns • FastForward: 28 ns • Lamport: 160 ns • Locks: 190 ns • GigE

Gazing into the Crystal Ball • Hardware: 10 ns • FastForward: 14 ns • FastForward: 28 ns • Lamport: 160 ns • Locks: 190 ns • GigE

The Real World [Figure: cycles per iteration vs. iteration]

The Really Real World [Figure: cycles per iteration vs. iteration]

The Pipelined Real World

Bare Metal

Bare Metal [Figure: cycles per iteration vs. iteration]

Questions? john.giacomoni@colorado.edu http://www.cs.colorado.edu/~jgiacomo http://ce.colorado.edu/core
