Overcoming The Memory Wall in Packet Processing: Hammers or Ladders?
Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar
The Memory Bottleneck
• An ideal packet processing system must
  • Achieve high packet throughput
  • Be easy to program
• A major source of difficulty: the memory bottleneck
  • Known as the Memory Wall
  • Exacerbated in packet processing: link bandwidths keep growing while applications become increasingly sophisticated
• But there is hope…
  • Locality in packet traffic
  • Massive packet-level parallelism
  • Throughput is the primary performance metric
State of the Art
• Many mechanisms
  • Hammers exploit locality to reduce the overhead: wide-words, result-caches, exposed memory hierarchy (a result-cache is sketched after this slide)
  • Ladders exploit parallelism to hide the overhead: hardware multithreading, asynchronous memory
• Little understanding of their relative effectiveness and interactions
• Many mechanisms + little understanding → manual, ad hoc designs
  • Hard to develop and maintain
  • Waste system resources
  • Complicate hardware design
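As an illustrative sketch of one hammer (not code from the paper): a result-cache exploits locality by memoizing an often-repeated computation, here a route lookup keyed by destination IP. The cache size and the `route_lookup` placeholder are assumptions made for illustration.

```c
/* Illustrative sketch, not taken from the paper: a "hammer" that exploits
 * locality by memoizing an often-repeated computation.  A small
 * direct-mapped result-cache maps a destination IP to the next hop that
 * was computed for it earlier; route_lookup() is a hypothetical stand-in
 * for the real longest-prefix-match routine, and RC_ENTRIES is assumed. */
#include <stdint.h>

#define RC_ENTRIES 1024

struct rc_entry { uint32_t dst; uint16_t next_hop; uint8_t valid; };
static struct rc_entry result_cache[RC_ENTRIES];

/* Placeholder for an expensive trie walk over the route table. */
static uint16_t route_lookup(uint32_t dst) { return (uint16_t)(dst >> 24); }

uint16_t cached_route_lookup(uint32_t dst) {
    struct rc_entry *e = &result_cache[dst % RC_ENTRIES];
    if (e->valid && e->dst == dst)      /* hit: reuse the earlier result  */
        return e->next_hop;
    e->next_hop = route_lookup(dst);    /* miss: compute once, then cache */
    e->dst = dst;
    e->valid = 1;
    return e->next_hop;
}
```

A d-cache, by contrast, caches the underlying data (e.g., trie nodes) rather than the computed result, which is why the two mechanisms are compared separately below.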
This Paper
What minimal set of mechanisms achieves the twin goals of ease of programming and high throughput?
Contributions
• A thorough comparative study of mechanisms
  • Hammers: wide-words, memory hierarchies, result-caches and d-caches
  • Ladders: multithreading, asynchronous memory
  • Real applications, traces, and control data
• Main findings
  • Best hammer: data-caching
  • Best ladder: multithreading
  • No single mechanism suffices; the minimal set should include data-caches and multithreading
• Hybrid system
  • Up to 400% higher throughput than current systems
  • Easy to program
Experimental Setup: Data and Tools
• Data
  • Packet traces from various locations in the Internet; focus on an edge (ANL) and a core (MRA) trace
  • Route prefixes from the RouteViews Project (U of Oregon)
• Tools
  • Enhanced SimpleScalar generates an execution trace
  • A multithreaded, multiprocessor simulator is driven by the execution trace (a minimal trace-replay loop is sketched below)
• Focus on application data (route tables, meters, etc.), which dominates both in size and access frequency
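The sketch below shows the trace-driven idea only; it is not the authors' simulator. It replays an assumed record format ("C <cycles>" for a compute burst, "M <address>" for a memory reference) as a single-threaded, uncached baseline, and the 100-cycle DRAM latency is an assumption.

```c
/* Minimal sketch of trace-driven replay (not the authors' simulator).
 * The trace format and MEM_LATENCY are assumptions; real execution
 * traces come from the enhanced SimpleScalar run. */
#include <stdio.h>

#define MEM_LATENCY 100   /* assumed DRAM latency in processor cycles */

int main(void) {
    char kind;
    unsigned long arg;
    unsigned long cycles = 0, busy = 0, refs = 0;

    /* Replay the trace: charge compute bursts as busy time and stall
     * the single-threaded, uncached processor on every memory ref. */
    while (scanf(" %c %lu", &kind, &arg) == 2) {
        if (kind == 'C')      { cycles += arg; busy += arg; }
        else if (kind == 'M') { cycles += MEM_LATENCY; refs++; }
    }
    if (cycles)
        printf("utilization = %.3f   memory refs/cycle = %.4f\n",
               (double)busy / cycles, (double)refs / cycles);
    return 0;
}
```

This uncached, single-threaded baseline is what the hammers and ladders compared next try to improve.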
Comparison of Hammers
• Exposed memory hierarchy (8 KB): static mapping of frequently accessed data
• Result-cache (8 KB): caches often-repeated computation
• Wide-words (32 bytes): exploit spatial locality in data accesses
• Data-cache (8 KB, 4-way, quad-word lines); a minimal model is sketched below
• Metric: fraction of memory references eliminated
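A minimal sketch of the d-cache configuration named above, written from its stated parameters (8 KB, 4-way, 32-byte lines) rather than taken from the paper's simulator; it replays a stream of data addresses and reports the fraction of references it would eliminate. LRU replacement and the hex input format are assumptions.

```c
/* Minimal sketch (not the paper's simulator): an 8 KB, 4-way
 * set-associative data cache with 32-byte lines and LRU replacement,
 * replayed over a stream of data addresses to estimate the fraction of
 * memory references it would eliminate (i.e., turn into hits). */
#include <stdio.h>

#define LINE_BYTES  32
#define WAYS        4
#define CACHE_BYTES (8 * 1024)
#define SETS        (CACHE_BYTES / (LINE_BYTES * WAYS))   /* 64 sets */

static unsigned long long tag[SETS][WAYS], lru[SETS][WAYS];
static int valid[SETS][WAYS];
static unsigned long long tick, hits, refs;

/* Returns 1 if the reference hits (is eliminated), 0 on a miss. */
static int dcache_access(unsigned long long addr) {
    unsigned long long line = addr / LINE_BYTES;
    unsigned set = (unsigned)(line % SETS);
    unsigned long long t = line / SETS;
    int w, victim = 0;

    refs++; tick++;
    for (w = 0; w < WAYS; w++) {
        if (valid[set][w] && tag[set][w] == t) {
            lru[set][w] = tick;                /* refresh recency */
            hits++;
            return 1;
        }
        if (lru[set][w] < lru[set][victim])    /* remember LRU way */
            victim = w;
    }
    valid[set][victim] = 1;                    /* miss: fill LRU way */
    tag[set][victim] = t;
    lru[set][victim] = tick;
    return 0;
}

int main(void) {
    unsigned long long addr;
    while (scanf("%llx", &addr) == 1)          /* one hex address per line */
        dcache_access(addr);
    if (refs)
        printf("fraction of memory references eliminated: %.3f\n",
               (double)hits / (double)refs);
    return 0;
}
```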
Comparison of Hammers
[Chart: fraction of memory accesses eliminated by each hammer, per application]
Data-cache: Improvement in Utilization
[Chart: processor utilization vs. cache size (0–64 KB) for bitmap, IXP, tswtcm, and BSOL/DRR/classify]
• D-caches are attractive but are not sufficient
Effectiveness of Ladders
• Multithreading dominates asynchronous memory: there is much more inter-packet parallelism than intra-packet parallelism
[Chart: processor utilization vs. number of threads (up to 64) for IXP, classify, patricia, and stream]
• Utilization improves linearly; the rate depends on c, the computation per memory reference
• Peak utilization is limited to c/(c+s), where s is the context-switch overhead (see the model sketched below)
• Accesses to shared read-write data can limit the utility of threads
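The linear-improvement and c/(c+s) statements above can be captured in a simple analytical sketch; this is my own summary, not an equation from the paper, and the memory latency m and the exact form of the linear term are assumptions.

```latex
% c = computation per memory reference, s = context-switch overhead,
% m = memory latency (an assumed parameter, not named on the slide).
% With n threads that switch on every memory reference, processor
% utilization is roughly
\[
  U(n) \;\approx\; \min\!\left( \frac{n\,c}{c + s + m},\; \frac{c}{c + s} \right)
\]
% The first term is the linear region, whose slope is set by c; the second
% is the saturation ceiling c/(c+s) quoted on the slide.
```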
BW Requirements of Multithreading
[Chart: memory bandwidth used (refs/cycle) vs. number of threads (1–8) for patricia-MRA and IXP-MRA, compared against the bandwidth available to a microengine of the IXP2800]
• Threads are useful but not sufficient (the bandwidth demand is modeled below)
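Under the same assumed model, the shape of these bandwidth curves follows directly from utilization; again this is my own modeling note, not a formula from the paper.

```latex
% One memory reference is issued per c cycles of useful computation, so the
% memory bandwidth demand in references per cycle is approximately
\[
  B(n) \;\approx\; \frac{U(n)}{c}
\]
% i.e., it grows with the thread count until it hits either the c/(c+s)
% ceiling or the bandwidth available to an IXP2800 microengine.
```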
Hybrid Systems
• No single mechanism suffices, so combine complementary mechanisms: d-caches and multithreading
  • D-caches reduce the number of context switches and the memory bandwidth
  • Threads hide the miss latencies
• Simplified programmability
  • D-caches are transparent
  • Multithreading becomes easier to use: switch on cache miss (see the sketch below)
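A minimal sketch of "switch on cache miss", assuming a cooperative threading model; the primitives (dcache_lookup, issue_mem_read, yield_until_ready) are hypothetical, only declared here, and are not an actual network-processor API, so this shows the shape of the idea rather than runnable firmware.

```c
/* Minimal sketch of "switch on cache miss" in a hybrid (d-cache +
 * multithreading) design.  All primitives below are hypothetical and
 * only declared, not implemented. */
#include <stdint.h>

typedef struct ctx {
    void    *packet;        /* packet currently owned by this thread  */
    uint64_t pending_addr;  /* outstanding miss address, 0 if none    */
} ctx_t;

int  dcache_lookup(uint64_t addr, void *out);  /* 1 = hit, 0 = miss   */
void issue_mem_read(uint64_t addr);            /* start async fetch   */
void yield_until_ready(ctx_t *self);           /* run other contexts
                                                  until the fetch lands */

/* What the programmer sees: on a hit the d-cache makes the access cheap
 * and transparent; only on a miss does this thread give up the processor
 * so another packet can run, hiding the DRAM latency. */
void load_word(ctx_t *self, uint64_t addr, void *out) {
    if (!dcache_lookup(addr, out)) {   /* miss: switch on cache miss     */
        issue_mem_read(addr);          /* asynchronous DRAM access       */
        self->pending_addr = addr;
        yield_until_ready(self);       /* other threads hide the latency */
        self->pending_addr = 0;
        dcache_lookup(addr, out);      /* line has been filled: hit now  */
    }
}
```

Because the d-cache filters most references, context switches and memory bandwidth are spent only on the misses, which is where the hybrid's gains on the next slide come from.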
Hybrid vs. Threads-only
[Chart: processor utilization vs. available bandwidth (refs/cycle) for IXP-MRA and patricia-MRA, Hybrid vs. Threads-only, at an available area of 64 thread equivalents]
• An order of magnitude less bandwidth
• A 4-fold improvement in utilization
Conclusions
• The memory bottleneck is a serious concern, aggravated in packet processing
• Thorough comparative study
  • Best hammer: data-caching
  • Best ladder: multithreading
  • Cannot rely exclusively on either locality or parallelism
• Hybrid systems that exploit both locality and parallelism
  • Higher performance
  • Significantly simplified programming
• Contrast with the state of the art
  • General-purpose processors: caches are used as facilitators of parallelism
  • Network processors: almost every mechanism other than the d-cache
Thank You!
Collaborators: Piyush Agarwal (UTCS), Stephen W. Keckler (UTCS), Erik Johnson (Intel), Aaron Kunze (Intel)
Questions?