
Overcoming The Memory Wall in Packet Processing: Hammers or Ladders?


Presentation Transcript


  1. Overcoming The Memory Wall in Packet Processing: Hammers or Ladders? Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar

  2. The Memory Bottleneck
  • An ideal packet processing system must
    • Achieve high packet throughput
    • Be easy to program
  • A major source of difficulty: the memory bottleneck
    • Known as the Memory Wall
    • Exacerbated in packet processing
      • Link bandwidths are growing
      • Applications are becoming increasingly sophisticated
  • But there is hope…
    • Locality in packet traffic
    • Massive packet-level parallelism
    • Throughput is the primary performance metric

  3. State of the Art
  • Many mechanisms
    • Hammers: exploit locality to reduce the overhead
      • Wide-words, result-caches, exposed memory hierarchy
    • Ladders: exploit parallelism to hide the overhead
      • Hardware multithreading, asynchronous memory
  • Little understanding
    • Relative effectiveness
    • Interactions
  • Many mechanisms + little understanding → manual, ad hoc designs
    • Hard to develop and maintain
    • Wastes system resources
    • Complicates hardware design

  4. This Paper
  What minimal set of mechanisms achieves the twin goals of ease of programming and high throughput?

  5. Contributions
  • A thorough comparative study of mechanisms
    • Hammers: wide-words, memory hierarchies, result- and d-caches
    • Ladders: multithreading, asynchronous memory
    • Real applications, traces, and control data
  • Main findings
    • Best hammer: data-caching
    • Best ladder: multithreading
    • No single mechanism suffices
    • Minimal set should include data-caches and multithreading
  • Hybrid system
    • Up to 400% higher throughput than current systems
    • Easy to program

  6. Experimental Setup: Applications

  7. Experimental Setup: Data and Tools
  • Data:
    • Packet traces: various locations in the Internet
      • Focus on an edge (ANL) and a core (MRA) trace
    • Route prefixes: RouteViews Project (U of Oregon)
  • Tools:
    • Enhanced SimpleScalar – generates execution trace
    • Multithreaded, multiprocessor simulator
      • Driven by the execution trace
  • Focus on application data (route tables, meters, etc.)
    • Dominates both in size and access frequency

  8. Comparison of Hammers
  • Exposed memory hierarchy (8KB)
    • Static mapping of frequently accessed data
  • Result-cache (8KB)
    • Caches often-repeated computation
  • Wide-words (32 bytes)
    • Exploit spatial locality in data accesses
  • Data-cache (8KB, 4-way, quad-word lines)
  • Metric: fraction of memory references eliminated
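The metric on this slide — the fraction of memory references a cache eliminates — can be estimated with a small LRU cache simulator configured like the slide's data-cache (8KB, 4-way, 32-byte lines). This is an illustrative sketch, not the paper's tool; the function name and address-trace format are assumptions:

```python
from collections import OrderedDict

def cache_hit_fraction(addresses, cache_bytes=8192, ways=4, line_bytes=32):
    """Simulate an LRU set-associative data cache over a trace of byte
    addresses; return the fraction of references eliminated (hit rate)."""
    num_sets = cache_bytes // (ways * line_bytes)
    sets = [OrderedDict() for _ in range(num_sets)]   # one LRU list per set
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        lru = sets[line % num_sets]
        if line in lru:
            lru.move_to_end(line)        # refresh LRU position on a hit
            hits += 1
        else:
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least-recently-used line
            lru[line] = True
    return hits / len(addresses)
```

A trace with temporal locality (repeated addresses) or spatial locality (neighbors sharing a 32-byte line) scores high; a random-stride trace scores near zero, which is what makes the per-application comparison interesting.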

  9. Comparison of Hammers
  [Figure: fraction of memory accesses eliminated by each hammer, per application]

  10. Data-cache: Improvement in Utilization
  [Figure: processor utilization (0–1) vs. cache size (0–60000 bytes) for bitmap, IXP, tswtcm, BSOL, DRR, classify]
  • D-caches are attractive but are not sufficient

  11. Effectiveness of Ladders
  • Multithreading dominates asynchronous memory
    • Much more inter-packet parallelism than intra-packet
  [Figure: processor utilization (0–1) vs. number of threads (0–70) for IXP, classify, patricia, stream]
  • Utilization improves linearly; the rate depends on “c” (c = computation per mem-ref)
  • Peak utilization is limited to “c/(c+s)” (s = context-switch overhead)
  • Accesses to shared read-write data can limit the utility of threads
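The linear-then-saturating curve this slide describes follows from a simple analytical model (a sketch under assumed parameters, not the paper's simulator): each thread computes c cycles per memory reference, a context switch costs s cycles, and a reference takes L cycles to return, so utilization grows as n·c/(c+s+L) with n threads until the latency is fully hidden, then flattens at the peak c/(c+s) quoted on the slide:

```python
def utilization(n_threads, c, s, latency):
    """Model processor utilization under hardware multithreading.
    c = compute cycles per memory reference, s = context-switch
    overhead, latency = memory access latency in cycles."""
    linear = n_threads * c / (c + s + latency)  # latency only partly hidden
    peak = c / (c + s)                          # latency fully hidden; only
                                                # compute and switch cycles remain
    return min(linear, peak)
```

For example, with c = 10, s = 2, and a 100-cycle memory, one thread keeps the processor under 10% busy, and no number of threads can push utilization past 10/12 ≈ 0.83 — which is why the slide notes the rate depends on c and the peak on c/(c+s).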

  12. BW Requirements of Multithreading
  [Figure: memory BW used (refs/cycle, 0–0.04) vs. number of threads (0–8) for patricia-MRA and IXP-MRA, against the BW available to a Micro Engine of the IXP2800]
  • Threads are useful but not sufficient

  13. Hybrid Systems
  • No single mechanism suffices
    • Combine complementary mechanisms: d-caches + multithreading
  • D-caches reduce the number of context switches and the memory BW
  • Threads hide miss latencies
  • Simplified programmability
    • D-caches are transparent
    • Multithreading becomes easier to use: switch on cache miss
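The switch-on-cache-miss policy can be sketched as a toy event-driven scheduler (all names and parameters are assumptions for illustration, not the paper's simulator): a thread runs until one of its references misses, then pays the switch cost and sleeps for the miss latency while another ready thread runs.

```python
import heapq
from collections import deque

def hybrid_utilization(thread_refs, compute=10, switch=2, miss_latency=100):
    """Toy switch-on-cache-miss scheduler. thread_refs[t] is a list of
    booleans, one per memory reference of thread t (True = cache hit).
    Each reference is preceded by `compute` useful cycles; a hit costs
    nothing extra, a miss pays `switch` cycles and sleeps the thread for
    `miss_latency` cycles. Returns the fraction of useful cycles."""
    ready = deque(range(len(thread_refs)))
    wake = []                        # (wake_time, thread) min-heap
    pos = [0] * len(thread_refs)     # next reference index per thread
    now = useful = 0
    while ready or wake:
        if not ready:
            now = max(now, wake[0][0])   # no runnable thread: processor idles
        while wake and wake[0][0] <= now:
            ready.append(heapq.heappop(wake)[1])
        t = ready.popleft()
        while pos[t] < len(thread_refs[t]):  # run until a miss or end of trace
            now += compute
            useful += compute
            hit = thread_refs[t][pos[t]]
            pos[t] += 1
            if not hit:
                now += switch
                heapq.heappush(wake, (now + miss_latency, t))
                break
    return useful / now
```

With one thread, every miss stalls the processor; adding a second thread overlaps one thread's miss with the other's computation, and a cache raises the hit fraction so fewer switches (and less memory BW) are needed — the two effects the slide calls complementary.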

  14. Hybrid vs. Threads-only
  [Figure: processor utilization (0–1) vs. available bandwidth (refs/cycle, 0–0.03); hybrid vs. threads-only for IXP-MRA and patricia-MRA, with available area of 64 thread equivalents]
  • Order of magnitude less bandwidth
  • 4-fold improvement in utilization

  15. Conclusions
  • The memory bottleneck is a serious concern
    • Aggravated in packet processing
  • Thorough comparative study
    • Best hammer: data-caching
    • Best ladder: multithreading
  • Cannot rely exclusively on either locality or parallelism
  • Hybrid systems exploit both locality and parallelism
    • Higher performance
    • Significantly simplified programming
  • Contrast with the state of the art
    • General-purpose processors: caches are used as facilitators of parallelism
    • Network processors: almost every mechanism other than the d-cache

  16. Thank You!
  Collaborators: Piyush Agarwal (UTCS), Stephen W. Keckler (UTCS), Erik Johnson (Intel), Aaron Kunze (Intel)
  Questions?
