Overcoming The Memory Wall in Packet Processing: Hammers or Ladders?
Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar
The Memory Bottleneck
• An ideal packet processing system must
  • Achieve high packet throughput
  • Be easy to program
• A major source of difficulty: the memory bottleneck
  • Known as the Memory Wall
  • Exacerbated in packet processing: link bandwidths keep growing while applications become increasingly sophisticated
• But there is hope…
  • Locality in packet traffic
  • Massive packet-level parallelism
  • Throughput is the primary performance metric
State of the Art
• Many mechanisms
  • Hammers exploit locality to reduce the overhead: wide-words, result-caches, exposed memory hierarchy (a result-cache is sketched after this slide)
  • Ladders exploit parallelism to hide the overhead: hardware multithreading, asynchronous memory
• Little understanding of their relative effectiveness and interactions
• Many mechanisms + little understanding → manual, ad hoc designs
  • Hard to develop and maintain
  • Waste system resources
  • Complicate hardware design
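As an illustrative sketch of one hammer (not code from the paper): a result-cache exploits locality by memoizing an often-repeated computation, here a route lookup keyed by destination IP. The cache size and the `route_lookup` placeholder are assumptions made for illustration.

```c
/* Illustrative sketch, not taken from the paper: a "hammer" that exploits
 * locality by memoizing an often-repeated computation.  A small
 * direct-mapped result-cache maps a destination IP to the next hop that
 * was computed for it earlier; route_lookup() is a hypothetical stand-in
 * for the real longest-prefix-match routine, and RC_ENTRIES is assumed. */
#include <stdint.h>

#define RC_ENTRIES 1024

struct rc_entry { uint32_t dst; uint16_t next_hop; uint8_t valid; };
static struct rc_entry result_cache[RC_ENTRIES];

/* Placeholder for an expensive trie walk over the route table. */
static uint16_t route_lookup(uint32_t dst) { return (uint16_t)(dst >> 24); }

uint16_t cached_route_lookup(uint32_t dst) {
    struct rc_entry *e = &result_cache[dst % RC_ENTRIES];
    if (e->valid && e->dst == dst)      /* hit: reuse the earlier result  */
        return e->next_hop;
    e->next_hop = route_lookup(dst);    /* miss: compute once, then cache */
    e->dst = dst;
    e->valid = 1;
    return e->next_hop;
}
```

A d-cache, by contrast, caches the underlying data (e.g., trie nodes) rather than the computed result, which is why the two mechanisms are compared separately below.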
This Paper
What minimal set of mechanisms achieves the twin goals of ease of programming and high throughput?
Contributions
• A thorough comparative study of mechanisms
  • Hammers: wide-words, memory hierarchies, result-caches and d-caches
  • Ladders: multithreading, asynchronous memory
  • Real applications, traces, and control data
• Main findings
  • Best hammer: data-caching
  • Best ladder: multithreading
  • No single mechanism suffices; the minimal set should include data-caches and multithreading
• Hybrid system
  • Up to 400% higher throughput than current systems
  • Easy to program
Experimental Setup: Data and Tools
• Data
  • Packet traces from various locations in the Internet; focus on an edge (ANL) and a core (MRA) trace
  • Route prefixes from the RouteViews Project (U of Oregon)
• Tools
  • Enhanced SimpleScalar generates an execution trace
  • A multithreaded, multiprocessor simulator is driven by the execution trace (a minimal trace-replay loop is sketched below)
• Focus on application data (route tables, meters, etc.), which dominates both in size and access frequency
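The sketch below shows the trace-driven idea only; it is not the authors' simulator. It replays an assumed record format ("C <cycles>" for a compute burst, "M <address>" for a memory reference) as a single-threaded, uncached baseline, and the 100-cycle DRAM latency is an assumption.

```c
/* Minimal sketch of trace-driven replay (not the authors' simulator).
 * The trace format and MEM_LATENCY are assumptions; real execution
 * traces come from the enhanced SimpleScalar run. */
#include <stdio.h>

#define MEM_LATENCY 100   /* assumed DRAM latency in processor cycles */

int main(void) {
    char kind;
    unsigned long arg;
    unsigned long cycles = 0, busy = 0, refs = 0;

    /* Replay the trace: charge compute bursts as busy time and stall
     * the single-threaded, uncached processor on every memory ref. */
    while (scanf(" %c %lu", &kind, &arg) == 2) {
        if (kind == 'C')      { cycles += arg; busy += arg; }
        else if (kind == 'M') { cycles += MEM_LATENCY; refs++; }
    }
    if (cycles)
        printf("utilization = %.3f   memory refs/cycle = %.4f\n",
               (double)busy / cycles, (double)refs / cycles);
    return 0;
}
```

This uncached, single-threaded baseline is what the hammers and ladders compared next try to improve.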
Comparison of Hammers
• Exposed memory hierarchy (8 KB): static mapping of frequently accessed data
• Result-cache (8 KB): caches often-repeated computation
• Wide-words (32 bytes): exploit spatial locality in data accesses
• Data-cache (8 KB, 4-way, quad-word lines); a minimal model is sketched below
• Metric: fraction of memory references eliminated
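A minimal sketch of the d-cache configuration named above, written from its stated parameters (8 KB, 4-way, 32-byte lines) rather than taken from the paper's simulator; it replays a stream of data addresses and reports the fraction of references it would eliminate. LRU replacement and the hex input format are assumptions.

```c
/* Minimal sketch (not the paper's simulator): an 8 KB, 4-way
 * set-associative data cache with 32-byte lines and LRU replacement,
 * replayed over a stream of data addresses to estimate the fraction of
 * memory references it would eliminate (i.e., turn into hits). */
#include <stdio.h>

#define LINE_BYTES  32
#define WAYS        4
#define CACHE_BYTES (8 * 1024)
#define SETS        (CACHE_BYTES / (LINE_BYTES * WAYS))   /* 64 sets */

static unsigned long long tag[SETS][WAYS], lru[SETS][WAYS];
static int valid[SETS][WAYS];
static unsigned long long tick, hits, refs;

/* Returns 1 if the reference hits (is eliminated), 0 on a miss. */
static int dcache_access(unsigned long long addr) {
    unsigned long long line = addr / LINE_BYTES;
    unsigned set = (unsigned)(line % SETS);
    unsigned long long t = line / SETS;
    int w, victim = 0;

    refs++; tick++;
    for (w = 0; w < WAYS; w++) {
        if (valid[set][w] && tag[set][w] == t) {
            lru[set][w] = tick;                /* refresh recency */
            hits++;
            return 1;
        }
        if (lru[set][w] < lru[set][victim])    /* remember LRU way */
            victim = w;
    }
    valid[set][victim] = 1;                    /* miss: fill LRU way */
    tag[set][victim] = t;
    lru[set][victim] = tick;
    return 0;
}

int main(void) {
    unsigned long long addr;
    while (scanf("%llx", &addr) == 1)          /* one hex address per line */
        dcache_access(addr);
    if (refs)
        printf("fraction of memory references eliminated: %.3f\n",
               (double)hits / (double)refs);
    return 0;
}
```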
Comparison of Hammers
[Chart: fraction of memory accesses eliminated by each hammer, per application]
Data-cache: Improvement in Utilization
[Chart: processor utilization vs. cache size (0–64 KB) for bitmap, IXP, tswtcm, and BSOL/DRR/classify]
• D-caches are attractive but are not sufficient
Effectiveness of Ladders
• Multithreading dominates asynchronous memory: there is much more inter-packet parallelism than intra-packet parallelism
[Chart: processor utilization vs. number of threads (up to 64) for IXP, classify, patricia, and stream]
• Utilization improves linearly; the rate depends on c, the computation per memory reference
• Peak utilization is limited to c/(c+s), where s is the context-switch overhead (see the model sketched below)
• Accesses to shared read-write data can limit the utility of threads
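The linear-improvement and c/(c+s) statements above can be captured in a simple analytical sketch; this is my own summary, not an equation from the paper, and the memory latency m and the exact form of the linear term are assumptions.

```latex
% c = computation per memory reference, s = context-switch overhead,
% m = memory latency (an assumed parameter, not named on the slide).
% With n threads that switch on every memory reference, processor
% utilization is roughly
\[
  U(n) \;\approx\; \min\!\left( \frac{n\,c}{c + s + m},\; \frac{c}{c + s} \right)
\]
% The first term is the linear region, whose slope is set by c; the second
% is the saturation ceiling c/(c+s) quoted on the slide.
```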
BW Requirements of Multithreading
[Chart: memory bandwidth used (refs/cycle) vs. number of threads (1–8) for patricia-MRA and IXP-MRA, compared against the bandwidth available to a microengine of the IXP2800]
• Threads are useful but not sufficient (the bandwidth demand is modeled below)
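Under the same assumed model, the shape of these bandwidth curves follows directly from utilization; again this is my own modeling note, not a formula from the paper.

```latex
% One memory reference is issued per c cycles of useful computation, so the
% memory bandwidth demand in references per cycle is approximately
\[
  B(n) \;\approx\; \frac{U(n)}{c}
\]
% i.e., it grows with the thread count until it hits either the c/(c+s)
% ceiling or the bandwidth available to an IXP2800 microengine.
```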
Hybrid Systems
• No single mechanism suffices, so combine complementary mechanisms: d-caches and multithreading
  • D-caches reduce the number of context switches and the memory bandwidth
  • Threads hide the miss latencies
• Simplified programmability
  • D-caches are transparent
  • Multithreading becomes easier to use: switch on cache miss (see the sketch below)
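A minimal sketch of "switch on cache miss", assuming a cooperative threading model; the primitives (dcache_lookup, issue_mem_read, yield_until_ready) are hypothetical, only declared here, and are not an actual network-processor API, so this shows the shape of the idea rather than runnable firmware.

```c
/* Minimal sketch of "switch on cache miss" in a hybrid (d-cache +
 * multithreading) design.  All primitives below are hypothetical and
 * only declared, not implemented. */
#include <stdint.h>

typedef struct ctx {
    void    *packet;        /* packet currently owned by this thread  */
    uint64_t pending_addr;  /* outstanding miss address, 0 if none    */
} ctx_t;

int  dcache_lookup(uint64_t addr, void *out);  /* 1 = hit, 0 = miss   */
void issue_mem_read(uint64_t addr);            /* start async fetch   */
void yield_until_ready(ctx_t *self);           /* run other contexts
                                                  until the fetch lands */

/* What the programmer sees: on a hit the d-cache makes the access cheap
 * and transparent; only on a miss does this thread give up the processor
 * so another packet can run, hiding the DRAM latency. */
void load_word(ctx_t *self, uint64_t addr, void *out) {
    if (!dcache_lookup(addr, out)) {   /* miss: switch on cache miss     */
        issue_mem_read(addr);          /* asynchronous DRAM access       */
        self->pending_addr = addr;
        yield_until_ready(self);       /* other threads hide the latency */
        self->pending_addr = 0;
        dcache_lookup(addr, out);      /* line has been filled: hit now  */
    }
}
```

Because the d-cache filters most references, context switches and memory bandwidth are spent only on the misses, which is where the hybrid's gains on the next slide come from.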
Hybrid vs. Threads-only
[Chart: processor utilization vs. available bandwidth (refs/cycle) for IXP-MRA and patricia-MRA, Hybrid vs. Threads-only, at an available area of 64 thread equivalents]
• An order of magnitude less bandwidth
• A 4-fold improvement in utilization
Conclusions
• The memory bottleneck is a serious concern, aggravated in packet processing
• Thorough comparative study
  • Best hammer: data-caching
  • Best ladder: multithreading
  • Cannot rely exclusively on either locality or parallelism
• Hybrid systems that exploit both locality and parallelism
  • Higher performance
  • Significantly simplified programming
• Contrast with the state of the art
  • General-purpose processors: caches are used as facilitators of parallelism
  • Network processors: almost every mechanism other than the d-cache
Thank You!
Collaborators: Piyush Agarwal (UTCS), Stephen W. Keckler (UTCS), Erik Johnson (Intel), Aaron Kunze (Intel)
Questions?