IXP Training Part 3 Programming Tips


Presentation Transcript


  1. IXP Training Part 3 Programming Tips 2011.04.12

  2. Outline • Memory Absolute • Instruction Selection • Task Partition • Memory Relative • Reducing Overhead • Reduce the number of memory accesses • Reduce average access latency • Hiding Overhead NCKU CSIE CIAL Lab

  3. Memory Absolute Tips • Instruction Selection • General Coding Skill • Use Hardware Instruction • Task Partition • Multi-Processing • Context-Pipelining

  4. Coding Skill • Loop Unrolling • Shift Operation • Inline Function • __inline & __forceinline • Branch Prediction • Branch Prediction Penalty
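The first two items on this slide can be sketched in portable C. This is a minimal illustration, not code from the training material: the function names are hypothetical, and whether unrolling pays off on a given microengine depends on the compiler and the available instruction store.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: one loop-condition branch per element. */
uint32_t sum_simple(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Unrolled by four: one loop-condition branch per four elements. */
uint32_t sum_unrolled4(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; i++)          /* leftovers when n is not a multiple of 4 */
        s += v[i];
    return s;
}

/* Shift instead of multiply: a word index becomes a byte offset (i * 4). */
uint32_t word_index_to_byte_offset(uint32_t i)
{
    return i << 2;
}
```

Both sum functions return the same value for any input; the unrolled version only trades code size for fewer branches.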

  5. Hardware Instruction • POP_COUNT • FFS • Multiply • CRC • Hashing • CAM

  6. POP_COUNT--Brief • Population Count • Reports the number of bits set in a 32-bit register • Example: • pop_count( 0x3121 ) = ? • 0011 0001 0010 0001 • Result = 5

  7. POP_COUNT--Naïve Implementation

     unsigned int pop_count_for(unsigned int x)
     {
         unsigned int y = 0;
         unsigned int i;

         /* Test each of the 32 bits in turn. */
         for (i = 0; i < 32; i++) {
             if ((x & 1) == 1)
                 y++;
             x = x >> 1;
         }
         return y;
     }

  8. POP_COUNT--Faster Implementation

     unsigned int pop_count_agg(unsigned int x)
     {
         /* Sum adjacent bits into 2-bit, 4-bit, then 8-bit fields. */
         x -= ((x >> 1) & 0x55555555);
         x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
         x = (((x >> 4) + x) & 0x0f0f0f0f);
         /* Fold the four byte counts into the low byte. */
         x += (x >> 8);
         x += (x >> 16);
         return (x & 0x0000003f);
     }

     Reference: http://aggregate.org/MAGIC/

  9. POP_COUNT--Hardware Instruction

     unsigned int pop_count_hardware(unsigned int x)
     {
         return pop_count(x);
     }

  10. POP_COUNT--Additional Information • Bitmap-RFC (Liu, TECS 2008)

  11. FFS • Finds the first (least-significant) set bit in the data and returns its zero-based position • Example: • ffs ( 0x3121 ) = 0 • 0011 0001 0010 0001 • ffs ( 0x3120 ) = 5 • 0011 0001 0010 0000 • ffs ( 0x3100 ) = 8 • 0011 0001 0000 0000
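For comparison with the hardware instruction, a portable software version might look like the sketch below. The name ffs_portable and the return value of -1 for a zero input are assumptions made for this illustration; the slide does not say what the hardware instruction returns when no bit is set.

```c
#include <stdint.h>

/* Returns the zero-based position of the least-significant set bit.
 * Returning -1 for x == 0 is an assumption of this sketch. */
int ffs_portable(uint32_t x)
{
    int pos = 0;

    if (x == 0)
        return -1;
    while ((x & 1u) == 0) {     /* shift right until bit 0 is set */
        x >>= 1;
        pos++;
    }
    return pos;
}
```

The loop takes up to 31 iterations in the worst case, which is the cost the single hardware instruction avoids.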

  12. Multiply • Specific Multiply Instruction • Multiply_24x8() • Multiply_16x16() • Multiply_32x32_hi() • Multiply_32x32_lo()
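The _hi and _lo suffixes suggest that the 32x32 variants return the upper and lower 32 bits of the 64-bit product. Under that assumption, the portable equivalents below show what the two results would contain; the function names are hypothetical and are not the microengine intrinsics themselves.

```c
#include <stdint.h>

/* Lower 32 bits of the 64-bit product a * b. */
uint32_t mul_32x32_lo(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Upper 32 bits of the 64-bit product a * b. */
uint32_t mul_32x32_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
```

For example, 0x10000 * 0x10000 is 2^32, so the high word is 1 and the low word is 0.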

  13. CRC • Example of a CRC operation

      crc_write( 0x42424242 );
      crc_32_be( source_address, bytes_0_3 );
      crc_32_be( dest_address, bytes_0_3 );
      …
      Cache_index = crc_read();
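The same idea, folding the source and destination addresses into a CRC and using the result as a cache index, can be sketched in portable C. This sketch uses the common reflected CRC-32 (polynomial 0xEDB88320) rather than the exact bit ordering or seed of the hardware CRC unit, so its values will not match crc_32_be. The names crc32_update and cache_index_from_addrs are hypothetical, and num_entries is assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 update (reflected form, polynomial 0xEDB88320). */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hash two 32-bit addresses into a cache index.
 * num_entries must be a power of two (assumption of this sketch). */
uint32_t cache_index_from_addrs(uint32_t src, uint32_t dst, uint32_t num_entries)
{
    uint8_t key[8];

    for (int i = 0; i < 4; i++) {           /* serialize most-significant byte first */
        key[i]     = (uint8_t)(src >> (24 - 8 * i));
        key[4 + i] = (uint8_t)(dst >> (24 - 8 * i));
    }
    uint32_t crc = crc32_update(0xFFFFFFFFu, key, sizeof key) ^ 0xFFFFFFFFu;
    return crc & (num_entries - 1);
}
```

The hardware unit computes the remainder without the inner bit loop, which is why the instruction is preferred on the microengine.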

  14. Hash • Hash_48() • Hash_64() • Hash_128() • Example:

      SIGNAL sig_hash;

      Hash48(data_out, data_in, count, sig_done, &sig_hash);
      __wait_for_all(&sig_hash);

  15. CAM--Brief • Each ME has 16 32-bit CAM entries • Each ME's CAM is private and cannot be accessed by other MEs • On a lookup, all entries are searched in parallel • On a hit, the index of the matching entry is returned • On a miss, the index of the entry to be replaced is returned

  16. Content Addressable Memory--Structure • cam_lookup_t

  17. CAM--Usage

      cam_lookup_t cam_result;

      cam_result = cam_lookup( data );
      if( cam_result.hit == 1 )
      {
          /* Hit: access the entry at cam_result.entry_num */
          …
      }
      else
      {
          /* Miss: cam_result.entry_num is the entry to replace */
          ……
          cam_write( cam_result.entry_num, data, 15 );
      }
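To make the lookup and write behavior concrete, here is a small software model of a 16-entry CAM. It is a sketch under assumptions: all soft_cam_* names are hypothetical, the replacement candidate is chosen as the least recently used entry, and the third argument of the write is treated as a small state value stored with the tag (which is how the 15 in the usage example is read here). The real hardware compares all entries in parallel; the loop below only imitates the result.

```c
#include <stdint.h>
#include <string.h>

#define CAM_ENTRIES 16

typedef struct {
    uint32_t tag[CAM_ENTRIES];
    uint8_t  state[CAM_ENTRIES];
    uint8_t  valid[CAM_ENTRIES];
    uint32_t last_use[CAM_ENTRIES];   /* timestamp used to pick the LRU entry */
    uint32_t clock;
} soft_cam_t;

typedef struct {
    unsigned hit;        /* 1 on a match, 0 otherwise */
    unsigned entry_num;  /* matching entry, or the entry to replace */
    unsigned state;      /* state stored with the matching entry */
} soft_cam_lookup_t;

void soft_cam_clear(soft_cam_t *cam)
{
    memset(cam, 0, sizeof *cam);
}

soft_cam_lookup_t soft_cam_lookup(soft_cam_t *cam, uint32_t tag)
{
    soft_cam_lookup_t r = {0, 0, 0};
    unsigned lru = 0;

    for (unsigned i = 0; i < CAM_ENTRIES; i++) {
        if (cam->valid[i] && cam->tag[i] == tag) {
            cam->last_use[i] = ++cam->clock;   /* a hit refreshes the entry */
            r.hit = 1;
            r.entry_num = i;
            r.state = cam->state[i];
            return r;
        }
        if (cam->last_use[i] < cam->last_use[lru])
            lru = i;
    }
    r.entry_num = lru;                         /* miss: suggest the LRU entry */
    return r;
}

void soft_cam_write(soft_cam_t *cam, unsigned entry, uint32_t tag, unsigned state)
{
    cam->tag[entry] = tag;
    cam->state[entry] = (uint8_t)state;
    cam->valid[entry] = 1;
    cam->last_use[entry] = ++cam->clock;
}
```

Used the same way as the slide's example: look up a tag, use entry_num on a hit, and on a miss fetch the data elsewhere and write the tag into the suggested entry.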

  18. Task Partition • Multi-Processing • More computing power • Easy to implement • Context-Pipelining • More usable resources • Hard to balance

  19. Memory Relative Tips--Reducing Overhead • Reduce the number of memory accesses • Wide-word Accesses • Result Caches • Reduce average access latency • Multi-level Memory Hierarchy • Data Cache

  20. Wide-Word Accesses--Brief • Fetch the needed data in batches • Reduces the number of accesses required • Useful when the data form a linked-list-like structure

  21. Wide-Word Accesses--Example

  22. Wide-Word Accesses--Usage (One Node per Access)

      __declspec(sram_read_reg) UINT32 A;
      SIGNAL sig_read;

      sram_read( &A, MEM_ADDR+(i*4), 1, sig_done, &sig_read );
      __wait_for_all( &sig_read );
      /* Access A */
      ......

      Result: 8 accesses are needed

  23. Wide-Word Accesses--Usage (Two Nodes per Access)

      __declspec(sram_read_reg) UINT32 A[2];
      SIGNAL sig_read;

      sram_read( &A, MEM_ADDR+(i*8), 2, sig_done, &sig_read );
      __wait_for_all( &sig_read );
      /* Access A */
      ......

      Result: 4 accesses are needed

  24. Wide-Word Accesses--Usage (Four Nodes per Access)

      __declspec(sram_read_reg) UINT32 A[4];
      SIGNAL sig_read;

      sram_read( &A, MEM_ADDR+(i*16), 4, sig_done, &sig_read );
      __wait_for_all( &sig_read );
      /* Access A */
      ......

      Result: 2 accesses are needed
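The access counts on the three usage slides (8, 4, and 2) can be reproduced with a host-side model. This is a sketch only: burst_search and MAX_BURST are hypothetical names, a memcpy stands in for one sram_read, and the local buffer stands in for the transfer registers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BURST 4   /* largest burst modeled here, in 32-bit words */

/* Linear search over contiguous words, fetching 'burst' words per access.
 * Returns the number of accesses issued; *found_at is the word index of
 * the key, or -1 if it is absent. */
size_t burst_search(const uint32_t *mem, size_t total_words,
                    size_t burst, uint32_t key, long *found_at)
{
    uint32_t xfer[MAX_BURST];   /* stands in for the transfer registers */
    size_t accesses = 0;

    *found_at = -1;
    if (burst == 0 || burst > MAX_BURST)
        return 0;

    for (size_t base = 0; base < total_words; base += burst) {
        size_t n = total_words - base < burst ? total_words - base : burst;

        memcpy(xfer, mem + base, n * sizeof xfer[0]);   /* one memory access */
        accesses++;

        for (size_t j = 0; j < n; j++) {
            if (xfer[j] == key) {
                *found_at = (long)(base + j);
                return accesses;
            }
        }
    }
    return accesses;
}
```

Scanning all 8 words for an absent key issues 8, 4, or 2 accesses for bursts of 1, 2, or 4 words, matching the results above.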

  25. Wide-Word Accesses--Experiment • Platform: IXP2800 • Total Accesses: 8 LW (8*4 Byte)

  26. Wide-Word Accesses--Limitation • Data must be contiguous • Suitable for linear search • Does not support random access • The number of transfer registers is fixed • Each thread has 16 read / write registers • The Tx-Regs may be reserved by others

  27. Result Cache--Brief • Caches the result computed by the application • If the same fields appear again, the cached result is returned • Memory accesses are reduced on a cache hit • Depends on the temporal locality of the traffic
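A minimal direct-mapped result cache might look like the sketch below. Everything in it is an assumption made for illustration: the names, the 64-entry size, the index hash, and demo_classify, which stands in for whatever expensive, memory-heavy lookup the application performs.

```c
#include <stdint.h>

#define RC_ENTRIES 64   /* power of two; the size is an arbitrary choice */

typedef struct {
    uint32_t key;
    uint32_t result;
    uint8_t  valid;
} rc_entry_t;

typedef struct {
    rc_entry_t e[RC_ENTRIES];
    unsigned   hits;
    unsigned   misses;
} result_cache_t;

typedef uint32_t (*slow_lookup_fn)(uint32_t key);

/* Hypothetical stand-in for the expensive lookup being cached. */
uint32_t demo_classify(uint32_t key)
{
    return key % 7;
}

static unsigned rc_index(uint32_t key)
{
    return (key ^ (key >> 16)) & (RC_ENTRIES - 1);
}

/* Returns the cached result on a hit; on a miss, runs the slow lookup
 * and overwrites whatever occupied the slot (direct-mapped replacement). */
uint32_t rc_lookup(result_cache_t *rc, uint32_t key, slow_lookup_fn slow)
{
    rc_entry_t *e = &rc->e[rc_index(key)];

    if (e->valid && e->key == key) {
        rc->hits++;
        return e->result;
    }
    rc->misses++;
    e->key = key;
    e->result = slow(key);
    e->valid = 1;
    return e->result;
}
```

A direct-mapped layout needs no replacement policy at all, which sidesteps the overhead of a set-associative design at the cost of more conflict misses.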

  28. Result Cache--IXP2400 • No hardware cache is supported in IXP2400 ME • Not easy to implement set-associative cache • Replacement policy will also be an overhead

  29. Result Cache--Design Consideration • Shared or private cache? • Size of cache? • Works with specific hardware? • Miss penalty handling?

  30. Result Cache--Experiment

  31. Multi-Level Memory Hierarchy--Brief • Reduces the average access latency • The number of accesses remains unchanged • If the data fits in a faster memory, place it there
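The effect can be written as a simple weighted average. The symbols below are introduced here for illustration and do not appear in the slides: f_i is the fraction of accesses served by memory level i, and T_i is that level's access latency.

```latex
T_{\mathrm{avg}} = \sum_{i} f_i \, T_i , \qquad \sum_{i} f_i = 1

% Moving a fraction \Delta of the accesses from a slower level j
% to a faster level k (T_k < T_j) lowers the average by
\Delta T_{\mathrm{avg}} = \Delta \, (T_j - T_k)
```

The total number of accesses is the same before and after; only the weights f_i shift toward the faster levels.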

  32. Multi-Level Memory Hierarchy--Data Placement • Small, read-only data: hard-code it • Small data that needs updating: Local Memory • Larger data: Scratchpad • Largest data: SRAM

  33. Multi-Level Memory Hierarchy--Packet Data Type • Packet-related data: temporary data, valid only for a specific packet; use Local Memory • Flow-related data: tied to a specific flow, shows spatial locality; use wide-word access • Application-related data: valid for a specific application, shows temporal locality; use a result cache

  34. Split-Cache (Z. Liu, IET-COM 2007) • Two separate hardware caches, one for application data and one for flow data

  35. Data Cache--Brief • A hardware cache mechanism that caches the data used in packet processing • App-Cache • Flow-Cache • However, it is not supported by the IXP2400

  36. Data Cache--CAM + Local Memory • The CAM combined with Local Memory acts like a hardware cache • However, the number of CAM entries is small • Each CAM entry may be paired with several Local Memory cache entries

  37. Memory Relative Tips--Hiding Overhead • Does not actually reduce the overhead, but overlaps it with other work • Hardware Multi-Threading • Asynchronous Memory

  38. Hardware Multi-Threading • A thread swaps itself out and lets another thread execute while it accesses memory • Each thread keeps its own set of registers, so no stack is needed for thread swapping • Round-robin scheduling • No thread preemption

  39. Asynchronous Memory--Brief • A thread is not blocked when it issues a memory request • Thus, a thread can issue multiple memory requests at a time

  40. Asynchronous Memory--Example (1 Issue)

      Read X
      __wait_for_all ( &sig_x )
      Read Y
      __wait_for_all ( &sig_y )
      // Use X and Y
      …

  41. Asynchronous Memory--Example (2 Issue)

      Read X
      Read Y
      __wait_for_all ( &sig_x, &sig_y )
      // Use X and Y
      …
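The difference between the two examples can be expressed with a simplified timing model. This sketch assumes each request has a fixed latency, ignores issue cost and memory-controller contention, and uses hypothetical function names; it is not a measurement of real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* 1 issue: each request is issued only after the previous one completes,
 * so the waits add up. */
uint32_t wait_serial(const uint32_t *latency, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += latency[i];
    return total;
}

/* Multiple issue: all requests are in flight together and the thread
 * waits for all signals, so the wait is bounded by the slowest request. */
uint32_t wait_overlapped(const uint32_t *latency, size_t n)
{
    uint32_t longest = 0;
    for (size_t i = 0; i < n; i++)
        if (latency[i] > longest)
            longest = latency[i];
    return longest;
}
```

With two reads of, say, 100 and 150 cycles (made-up figures), the serial version waits 250 cycles while the overlapped version waits 150.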

  42. Wide-Word Access + Multiple Issues

  43. Wide-Word Access + Multiple Issues (1LW, 2 Issue)

  44. Wide-Word Access + Multiple Issues (2LW, 2 Issue)

  45. Wide-Word Access + Multiple Issues (4LW, 2 Issue)

  46. Wide-Word Access + Multiple Issues (Experiment)

  47. Reference (1) • Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar, “Overcoming the memory wall in packet processing: hammers or ladders?”, ANCS 2005. • Duo Liu, Zheng Chen, Bei Hua, Nenghai Yu, Xinan Tang, “High-Performance Packet Classification Algorithm for Multithreaded IXP Network Processor”, ACM TECS 2008.

  48. Reference (2) • Z. Liu, K. Zheng, B. Liu, “Hybrid cache architecture for high-speed packet processing”, IET-COM 2007.
