
Block Design Review: Lookup for IPv4 MR, LC Ingress and LC Egress

Block Design Review: Lookup for IPv4 MR, LC Ingress and LC Egress. John DeHart jdd@arl.wustl.edu http://www.arl.wustl.edu/projects/techX. Revision History. 10/11/06 (JDD): Created 10/23/06 (JDD): Finished for presentation on 10/24/06 10/24/06 (JDD): Updates from comments during review.


Presentation Transcript


  1. Block Design Review: Lookup for IPv4 MR, LC Ingress and LC Egress • John DeHart • jdd@arl.wustl.edu • http://www.arl.wustl.edu/projects/techX

  2. Revision History • 10/11/06 (JDD): • Created • 10/23/06 (JDD): • Finished for presentation on 10/24/06 • 10/24/06 (JDD): • Updates from comments during review. • Added more TCAM info • Added information on format of Database entry files

  3. Guidelines for Design Reviews • Definition of interfaces In/Out • Block diagram of module • Including list of files where code for each block/module exists. • Macros: • List macros and files where they can be found • For each macro, provide a few lines of comments in the code that describes the macro. • Document local and global registers used by macro. • Memory assumptions • What addresses are pre-defined, etc… • Initialization of Memory • Data Structures • Control Blocks • Details of memory accesses, xfer register usage, signal usage. • Critical path • Testing • Develop a well defined acceptance test that convinces you that your block works • Document acceptance test • Pktgen “project” file? • Known bugs • Areas and suggestions for improvements.

  4. Contents • IPv4 MR: Rx -> Substr Decap -> Parse -> Lookup -> Header Format -> QM -> Tx • LC Ingress: Phy Int Rx -> Key Extract -> Lookup -> Hdr Format -> QM/Schd -> Switch Tx -> SWITCH • LC Egress: SWITCH -> Switch Rx -> Key Extract -> Lookup -> Hdr Format -> QM/Schd -> Phy Int Tx

  5. File locations • Code • src/applications/LC_Ingress/src/lookup/PL/lookup.uc • src/applications/LC_Egress/src/lookup/PL/lookup.uc • src/applications/IPv4_MR/src/lookup/PL/lookup.uc • Configuration and Database Entry Files: • src/applications/LC_Ingress/build/PL/LCI_config.txt • LC_Ingress_Database_64bKey_64bResult_BothQM.txt • src/applications/LC_Egress/build/PL/LCE_config.txt • LC_Egress_Database_24bKey_64bResult.txt • src/applications/IPv4_MR/build/PL/IPv4_config.txt • GM_Database_144Key_128bResult.txt • IDT Includes • src/IDT_NSE/data_plane_IXP2XXX/include/Iipc.uc • Which then includes Iipc.h from same directory • IDT Simulation Library • Typical Installed location: • C:/IDT_NSE/simulation/windows/IDT75K234.dll • Repository location: • src/IDT_NSE/simulation/windows/IDT75K234.dll • Directions for adding simulation library to a simulation project: • Simulation menu: select Options • Simulation Options window: select Foreign Model tab • Foreign Model DLLs panel: click on New(Insert) icon • Use locator to go to the Repository location listed above • Hit return after selecting dll file • Add Instance information in bottom panel • Click in Instance Name box and enter: IDT75K234 • Click in Priority box and enter: 1 • Click in Initialization String box and enter: IPv4_config.txt

  6. TCAM Documentation • Docs are sprinkled through the different installation directories • We have gathered most of the important stuff here: • /project/techX/DataSheets/IDT • The following documents are located in the above directory • Datasheet: (Under non-disclosure) • 75K72234_datasheet.pdf • User Manual: • 75K72234_UserManual.pdf • Instruction Latency Application Note: • 75K72234_latency.pdf • SLAM: Simulation • IDT75K234SLAM_UsersManual.pdf • Dataplane Macros: • NSEDataPlaneMacroAPIGuide.pdf • IMS API: • IMS_API.pdf

  7. WU Macros • LC Ingress: • dl_nn_ring_init • dl_source_1ME_NN_4words • dl_sink_1ME_NN_4words • IPv4_MR: • dl_nn_ring_init • dl_source_1ME_NN_9words • dl_sink_1ME_NN_4words • LC Egress: • dl_nn_ring_init • dl_source_1ME_NN_4words • dl_sink_1ME_NN_5words • Diagnostics: • GetTimeStamp • CompareTimeStamps

  8. IDT Macros • IipcStartTimestamp • Does CAP read and write to set bit in MISC_CONTROL to start the timestamp counter. • IipcFormContextFromCsrMeCtx • Sets up the Context field for the TCAM command word based on the ME and context • 128 Contexts per LA-1 Interface • IipcMakeBase • Forms the base address word for any instruction for this context • Address is a 22 bit WORD address, covers 16 MByte address space • IipcMakeDirectInstruction • Forms the command word for any of the 4 Direct instructions • Results of IipcMakeBase and IipcMakeDirectInstruction are passed as the two address parameters to sram[write]: • sram[write, $w00, iipc_base_word, iipc_command_word, count] • IipcDelayUsingFutureCount(cycles) • Sets the Future Count register to this many cycles • Sets the Future Count Signal register • Ctx_arb on that signal • IipcSramRead • Performs an SRAM read until the Done bit is set in the result. • We don’t use this any more. • We do the sram[read] ourselves now and check the Done bit. • This allows us to more easily perform diagnostics and performance testing.

  9. Lookup Initialization and Control • XScale utility to initialize NSE and Databases • Control Plane and XScale mechanisms to read and write TCAM entries while system is active.

  10. Lookup Miscellany • Bugs: No known bugs • Testing: • Minimal testing done so far • Some simple functional tests to show distribution of packets across all output ports based on Key fields for each of the three projects. • More complete test plan needed. • Still To Do: • Add information on how to configure Filters for Lookup engine. • Handle init_done signal from Rx • Turn on optimizer • Substrate only lookup for IPv4_MR GPENPE pkts • Add second database to IPv4 MR • DB1: GM/EM Database • DB2: Route Lookup • LD bit in Lookup Result • Clean up definition of DB Ids. • Consider making Lookup code one common file with #ifdef’s to differentiate • Consider removing #ifdef DONE_BIT_FIX code • Refers to a Done bit bug in the Dual Port QDR (which is what we have) • I have not seen this bug mentioned anywhere else. • I have not witnessed any such bug and I have not enabled this code • We’ll probably keep this code around, at least until we have done more thorough testing. • Performance testing with both LC Ingress and LC Egress operating. • Performance testing with second IPv4 MR Database • Data Structures: None • Performance • Current analysis is with OPTIMIZER turned off! • Turning it on should give immediate gains via branch and ctx_arb deferral slots. • How does TCAM perform when both LC Ingress and LC Egress are operating?

  11. TCAM Entries in Simulation • Four Parts to a TCAM Entry in simulation: • dbindex • Slot in database occupied by entry. • Start at 0 • Incremented by 1 for each entry • Not dependent on size • core • What is matched against a provided key • mask • Indicates what part of the entry (core) has to match the key supplied to give a hit • data • Results data • Configuration and Database Entry files • src/applications/LC_Ingress/build/PL/LCI_config.txt • LC_Ingress_Database_64bKey_64bResult_BothQM.txt • src/applications/LC_Egress/build/PL/LCE_config.txt • LC_Egress_Database_24bKey_64bResult.txt • src/applications/IPv4_MR/build/PL/IPv4_config.txt • GM_Database_144Key_128bResult.txt

  12. TCAM Entries in Simulation • LC Ingress Database entry from file: • src/applications/LC_Ingress/build/PL/LC_Ingress_Database_64bKey_64bResult_BothQM.txt
      {
        dbindex 0x0;
        core 0x51C0A80002110001;
        # SL Type: 0x5
        # Port: 1
        # IP DA=192.168.0.2
        # IP Proto: 17 (UDP)
        # UDP DPort: 0x0001
        # Exact Match everything, except wildcard Port
        mask 0xf0ffffffffffffff;
        data 0x0001004A01100001;
        # VLAN(16b)=0x0001
        # Stats_Index(16b)=74(0x4A)
        # DA=0x01
        # Port=1
        # QID=1
      }
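The core/mask semantics of the LC Ingress entry above can be sketched in Python. This is a minimal illustration, not the simulator's implementation: the names `TcamEntry`, `matches`, `lookup` and `make_lci_key` are hypothetical, the rule that mask bits set to 1 must match is inferred from the entry's comment (the Port nibble is 0 in the mask and is described as wildcarded), and returning the lowest-dbindex hit is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TcamEntry:
    dbindex: int  # slot in the database
    core: int     # value compared against the key
    mask: int     # assumed: 1 bits must match, 0 bits are wildcards
    data: int     # result returned on a hit


def matches(entry, key):
    """Ternary match: compare only the bit positions selected by the mask."""
    return (key & entry.mask) == (entry.core & entry.mask)


def lookup(entries, key):
    """Return the data of the first matching entry (lowest dbindex), or None.

    The lowest-dbindex-wins priority rule is an assumption for illustration.
    """
    for entry in sorted(entries, key=lambda e: e.dbindex):
        if matches(entry, key):
            return entry.data
    return None


def make_lci_key(sl_type, port, ip_da, proto, udp_dport):
    """Pack SL(4b) | Port(4b) | IP DA(32b) | Proto(8b) | UDP DPort(16b)."""
    return (sl_type << 60) | (port << 56) | (ip_da << 24) | (proto << 16) | udp_dport


# The LC Ingress entry from the slide.
LCI_ENTRY = TcamEntry(
    dbindex=0x0,
    core=0x51C0A80002110001,
    mask=0xF0FFFFFFFFFFFFFF,
    data=0x0001004A01100001,
)

if __name__ == "__main__":
    key = make_lci_key(0x5, 3, 0xC0A80002, 0x11, 0x0001)  # Port 3 is wildcarded
    print(hex(lookup([LCI_ENTRY], key)))
```

Because the Port nibble is masked out, a key arriving on any port with the same SL type, IP DA, protocol and UDP DPort hits this entry; changing any other field produces a miss.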

  13. TCAM Entries in Simulation • IPv4 MR Database entry from file: • src/applications/IPv4_MR/build/PL/GM_Database_144Key_128bResult.txt
      {
        dbindex 0x0;
        core 0x0AAA0002C0A84001C0A82002000100020011;
        # MR ID (VLAN) = 0x0AAA
        # Rx UDP DPort=0x0002
        # IP DA=192.168.64.1
        # IP SA=192.168.32.2
        # TCP/UDP SPort=0x0001
        # TCP/UDP DPort=0x0002
        # TCP_FLAGS_Proto=0x0011 (Proto=UDP, no TCP Flags)
        mask 0xffffffffffffffffffffffffffffffffffff;
        # Exact match everything
        data 0x0000003780FC99F95555666601000001;
        # Reserved(3b), Drop Bit(1b)
        # Reserved(12b)
        # Cntr_Index(16b)=55(0x37)
        # Tx IP DAddr=128.252.153.249
        # Tx UDP DPort=0x5555
        # Tx UDP SPort=0x6666
        # DA=0x01
        # Port=0
        # QID=1
      }
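As a cross-check of the field comments in the IPv4 MR entry above, the 144-bit key can be packed from its fields and the 128-bit result unpacked again. This is a sketch only: the field names and the `pack`/`unpack` helpers are hypothetical, and the field widths are taken from the entry comments and the 144-bit key layout given in this deck.

```python
import ipaddress

# (name, width in bits), most significant field first.
KEY_FIELDS = [
    ("mr_id", 16), ("rx_udp_dport", 16), ("ip_da", 32), ("ip_sa", 32),
    ("sport", 16), ("dport", 16), ("proto_flags", 16),
]
RESULT_FIELDS = [
    ("flags", 4), ("reserved", 12), ("cntr_index", 16), ("tx_ip_daddr", 32),
    ("tx_udp_dport", 16), ("tx_udp_sport", 16), ("da", 8), ("port", 4), ("qid", 20),
]


def pack(fields, values):
    """Concatenate the named fields into one integer, MSB field first."""
    word = 0
    for name, width in fields:
        value = values[name]
        if value < 0 or value >> width:
            raise ValueError("%s does not fit in %d bits" % (name, width))
        word = (word << width) | value
    return word


def unpack(fields, word):
    """Split an integer back into its named fields."""
    out = {}
    shift = sum(width for _, width in fields)
    for name, width in fields:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out


if __name__ == "__main__":
    key = pack(KEY_FIELDS, {
        "mr_id": 0x0AAA,
        "rx_udp_dport": 0x0002,
        "ip_da": int(ipaddress.IPv4Address("192.168.64.1")),
        "ip_sa": int(ipaddress.IPv4Address("192.168.32.2")),
        "sport": 0x0001,
        "dport": 0x0002,
        "proto_flags": 0x0011,
    })
    print(hex(key))
    result = unpack(RESULT_FIELDS, 0x0000003780FC99F95555666601000001)
    print(result["cntr_index"], ipaddress.IPv4Address(result["tx_ip_daddr"]))
```

Keeping the layouts as data tables rather than hand-written shifts makes it easy to compare them against the interface slides when a field width changes.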

  14. TCAM Entries in Simulation • LC Egress Database entry from file: • src/applications/LC_Egress/build/PL/LC_Egress_Database_24bKey_64bResult.txt
      {
        dbindex 0x0;
        core 0x11000100;
        # IP Proto (8b) = 0x11 (UDP)
        # UDP SPort (16b) = 1
        # Rsvd(8b) = 0
        mask 0xffffffff;
        # Exact Match.
        data 0x000101000021;
        # Rsvd(4b) = 0
        # VLAN(12b)=0x001
        # Rsvd(4b)=0
        # Port(4b)=1
        # Rsvd(4b)
        # QID(20b)=33 (0x00021)
      }
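The entry format shown on these slides is simple enough to read with a few lines of Python, which can help when checking a database file by hand. This sketch is not the IDT simulator's parser. It assumes that `#` comments run to the end of a line and that each field is written as `name 0x...;`; the function names `parse_entry` and `decode_lce_result` are hypothetical.

```python
import re

# The LC Egress entry from the slide, with assumed one-comment-per-line layout.
ENTRY_TEXT = """
{
  dbindex 0x0;
  core 0x11000100;
  # IP Proto (8b) = 0x11 (UDP)
  # UDP SPort (16b) = 1
  # Rsvd(8b) = 0
  mask 0xffffffff;
  # Exact Match.
  data 0x000101000021;
  # QID(20b)=33 (0x00021)
}
"""

FIELD_RE = re.compile(r"\b(dbindex|core|mask|data)\s+(0x[0-9a-fA-F]+)\s*;")


def parse_entry(text):
    """Strip '#' comments, then collect the 'name 0x...;' pairs as integers."""
    stripped = re.sub(r"#[^\n]*", "", text)
    return {name: int(value, 16) for name, value in FIELD_RE.findall(stripped)}


def decode_lce_result(data):
    """Split Rsvd(4b) | VLAN(12b) | Rsvd(4b) | Port(4b) | Rsvd(4b) | QID(20b)."""
    return {
        "vlan": (data >> 32) & 0xFFF,
        "port": (data >> 24) & 0xF,
        "qid": data & 0xFFFFF,
    }


if __name__ == "__main__":
    entry = parse_entry(ENTRY_TEXT)
    print(entry)
    print(decode_lce_result(entry["data"]))
```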

  15. Basics of TCAM Operation • Instruction is given to TCAM as an sram write: • Address bus gives instruction • 4 Direct Instructions: • Lookup: This is all we use right now. • MultiHit Lookup (MHL) or Simultaneous Multi-Database Lookup • Which one is determined by a bit in a config register • Preload • Indirect: Uses data field to specify subinstruction • Data bus gives: • Subinstruction for Indirect instructions (There are 16 subinstructions) • Data for all instructions • Our lookup keys go here. • Example: IPv4 MR Lookup (Key of 144 bits in 5 words): • Load xfer registers $w00, $w01, $w02, $w03, $w04 with the lookup key • sram[ write, $w00, iipc_base_word, iipc_command_word, 5 ] • More about iipc_base_word and iipc_command_word later • 5: number of data words needed for key • Result is read back from Context’s Results Mailbox • This is an SRAM read, not a TCAM Read instruction. • Example: IPv4_MR Lookup result of 4 words: • sram[read, $r00, iipc_base_word, 0, 4] • Result is valid only when the high order bit of the first word in the mailbox is set. • So, multiple reads may be necessary • We can predict the latency of the TCAM instruction • More about this later when we look at the macros used.
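The "multiple reads may be necessary" point above amounts to a small polling loop. The Python sketch below models it off-target. The real code issues sram[read] from microcode, so `read_mailbox` and `FakeMailbox` are hypothetical stand-ins, and the retry limit is an assumption added so the loop cannot spin forever.

```python
DONE_BIT = 1 << 31  # high-order bit of the first 32-bit mailbox word


def result_is_valid(first_word):
    """True when the Done bit of the first result word is set."""
    return bool(first_word & DONE_BIT)


def read_result(read_mailbox, max_reads=8):
    """Read the mailbox until the Done bit is set.

    read_mailbox is a hypothetical callable standing in for
    sram[read, $r00, iipc_base_word, 0, 4]; it returns a list of words.
    Returns (words, number_of_reads).
    """
    for reads in range(1, max_reads + 1):
        words = read_mailbox()
        if result_is_valid(words[0]):
            return words, reads
    raise TimeoutError("Done bit not set after %d reads" % max_reads)


class FakeMailbox:
    """Test double: returns zeros until the given read, then the result."""

    def __init__(self, ready_on_read, result):
        self.ready_on_read = ready_on_read
        self.result = result
        self.reads = 0

    def __call__(self):
        self.reads += 1
        if self.reads < self.ready_on_read:
            return [0] * len(self.result)
        return list(self.result)


if __name__ == "__main__":
    mailbox = FakeMailbox(3, [0x80000037, 0x80FC99F9, 0x55556666, 0x01000001])
    words, reads = read_result(mailbox)
    print(reads, [hex(w) for w in words])
```

Delaying the first read by the predicted instruction latency, as the slide suggests, is what keeps the number of reads close to one in practice.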

  16. LC Ingress Lookup • Pipeline: Phy Int Rx -> Key Extract -> Lookup -> Hdr Format -> QM/Schd -> Switch Tx -> SWITCH • Main functions: • Perform TCAM Lookup • Pass Through Data: • Buf Handle • IP Pkt Length and Ethernet Header Length • Single code path with possible loop around Result Read • NN communication • Uses 8 threads

  17. LC Ingress: Lookup Block Interfaces • Pipeline: Phy Int Rx -> Key Extract -> Lookup -> Hdr Format -> QM/Schd -> Switch Tx -> SWITCH • Input from Key Extract: • Buf Handle (32b) • Eth Hdr Len (8b), IP Pkt Length (16b), Reserved (8b) • Lookup Key[63-32] (32b) • Lookup Key[31-0] (32b) • Output to Hdr Format: • Buf Handle (32b) • Eth Hdr Len (8b), IP Pkt Length (16b), Reserved (8b) • Rsvd (4b), VLAN (16b), Stats Index (16b) • DAddr (8b), Port (4b), QID (20b) • Lookup Key: • SL (4b), Port (4b), D_Addr[31:8] (24b) • D_Addr[7:0] (8b), Protocol (8b), UDP DPort (16b) • Lookup Result: • Rsvd (4b), VLAN (16b), Stats Index (16b) • DAddr (8b), Port (4b), QID (20b)

  18. LC Ingress Lookup Block Diagram • Diagram elements: init signal • dl_source(): NN Dequeue (4W), Wait for prev ctx, Signal next ctx • Load Xfer Regs • Send Lookup Request: SRAM Write: 2W (mem access) • TimeStamp Delay, ctx_swap • Read Result: SRAM Read: 2W (mem access), ctx_swap • Check Done Bit • Reformat Output • dl_sink(): NN Enqueue (4W), Wait for prev ctx, Signal next ctx • Per-section costs shown on the diagram: • 15 cycles + 2 abort cycles • 7 cycles + 2 abort cycles • 1 cycle + 2 abort cycles • 1 cycle + 2 abort cycles • 5 cycles + 0 abort cycles • 12 cycles + 8 abort cycles • Totals: 41 processing cycles, 16 abort cycles

  19. IPv4 MR Lookup • Pipeline: Rx -> Substr Decap -> Parse -> Lookup -> Header Format -> QM -> Tx • Main functions: • Perform TCAM Lookup • Pass Through Data: • Buf Handle • IP Pkt Length and Offset • Slice Data Ptr • Exception Bits • Single code path with possible loop around Result Read • NN communication • Uses 8 threads

  20. IPv4 MR Lookup Block Interfaces • Pipeline: Rx -> DeMux -> Parse -> Lookup -> Header Format -> QM -> Tx • Input from Parse: • Buf Handle (32b) • IP Pkt Length (16b), IP Pkt Offset (16b) • Lookup Key[143-112]: Slice ID/Rx UDP DPort (32b) • Lookup Key[111-80]: DA (32b) • Lookup Key[79-48]: SA (32b) • Lookup Key[47-16]: Ports (32b) • L Flags (4b), Exception Bits (12b), Lookup Key[15-0]: Proto/TCP_Flags (16b) • Slice Data Ptr (32b) • Reserved (28b), Code (4b) • Output to Header Format: • Buf Handle (32b) • IP Pkt Length (16b), IP Pkt Offset (16b) • Slice ID (VLAN) (16b), Rx UDP DPort (16b) • Rsvd (1b), H (1b), LD (1b), D (1b), Exception Bits (12b), Cntr Index (16b) • Tx IP DAddr (32b) • Tx UDP DPort (16b), Tx UDP SPort (16b) • DA (8b), Port (4b), QID (20b) • Slice Data Ptr (32b) • Reserved (28b), Code (4b) • Lookup Key (144b): • Slice ID/Rx UDP DPort (32b) • IP DAddr (32b) • IP SAddr (32b) • SPort (16b), DPort (16b) • Proto/TCP_Flags (16b)

  21. IPv4 MR Functional Block Results • Lookup Key (144b): • Slice ID/Rx UDP DPort (32b) • IP DAddr (32b) • IP SAddr (32b) • SPort (16b), DPort (16b) • Proto/TCP_Flags (16b) • Lookup Result (128b), as stored in TCAM: • First word: TCAM Status Bits Done (1b), Hit (1b), MHit (1b); LD (1b), D (1b), Reserved (11b), Cntr Index (16b) • Tx IP DAddr (32b) • Tx UDP DPort (16b), Tx UDP SPort (16b) • DA (8b), Port (4b), QID (20b) • Lookup Result (128b), as given to HF: • First word: Rsvd (1b), Hit (1b), LD (1b), D (1b), Exception Bits (12b), Cntr Index (16b) • Tx IP DAddr (32b) • Tx UDP DPort (16b), Tx UDP SPort (16b) • DA (8b), Port (4b), QID (20b)

  22. IPv4 MR Lookup Block Diagram • Diagram elements: init signal • dl_source(): NN Dequeue (9W), Wait for prev ctx, Signal next ctx • Load Xfer Regs • Send Lookup Request: SRAM Write: 5W (mem access) • TimeStamp Delay, ctx_swap • Read Result: SRAM Read: 4W (mem access), ctx_swap • Check Done Bit • Reformat Output • dl_sink(): NN Enqueue (9W), Wait for prev ctx, Signal next ctx • Per-section costs shown on the diagram: • 25 cycles + 2 abort cycles • 7 cycles + 2 abort cycles • 2 cycles + 2 abort cycles • 1 cycle + 2 abort cycles • 5 cycles + 0 abort cycles • 17 cycles + 8 abort cycles • Totals: 57 processing cycles, 16 abort cycles

  23. LC Egress Lookup • Pipeline: SWITCH -> Switch Rx -> Key Extract -> Lookup -> Hdr Format -> QM/Schd -> Phy Int Tx • Main functions: • Perform TCAM Lookup • Pass Through Data: • Buf Handle • IP Pkt Length and Ethernet Header Length • IP Destination Address • Single code path with possible loop around Result Read • NN communication • Uses 8 threads

  24. LC Egress: Lookup Block Interfaces • Pipeline: SWITCH -> Switch Rx -> Key Extract -> Lookup -> Hdr Format -> QM/Schd -> Phy Int Tx • Input from Key Extract: • Buf Handle (32b) • IP DAddr (32b) • IP Pkt Length (16b), Eth Hdr Len (8b), Reserved (8b) • Lookup Key: IP Proto (8b), UDP SPort (16b), Reserved (8b) • Output to Hdr Format: • Buf Handle (32b) • IP DAddr (32b) • IP Pkt Length (16b), Eth Hdr Len (8b), Reserved (8b) • Lookup Result[63-32] (32b) • Lookup Result[31-0] (32b) • Lookup Key: • Protocol (8b), UDP SPort (16b), Reserved (8b) • Lookup Result: • VLAN (12b), Stats Index (16b), Rsvd (4b) • Rsvd (4b), Port (4b), Rsvd (4b), QID (20b)

  25. LC Egress Lookup Block Diagram • Diagram elements: init signal • dl_source(): NN Dequeue (4W), Wait for prev ctx, Signal next ctx • Load Xfer Regs • Send Lookup Request: SRAM Write: 1W (mem access) • TimeStamp Delay, ctx_swap • Read Result: SRAM Read: 2W (mem access), ctx_swap • Check Done Bit • Reformat Output • dl_sink(): NN Enqueue (5W), Wait for prev ctx, Signal next ctx • Per-section costs shown on the diagram: • 14 cycles + 2 abort cycles • 7 cycles + 2 abort cycles • 3 cycles + 2 abort cycles • 1 cycle + 2 abort cycles • 5 cycles + 0 abort cycles • 13 cycles + 8 abort cycles • Totals: 43 processing cycles, 16 abort cycles

  26. Performance

  27. Packet Sizes

  28. Cycle Budget (min eth packets) • To hit 5 Gb rate: • 76B per min IPv4 packet (64 min Eth + 12B IFS) • 1.4Ghz clock rate • 5 Gb/sec * 1B/8b * packet/76B = 8.22 Mp/sec • 1.4Gcycle/sec * 1 sec/ 8.22 Mp = 170.3 cycles per packet • compute budget: 170 cycles • latency budget: (threads*170) • 8 threads: 1360 cycles • To hit 10 Gb rate: • 76B per min IPv4 packet (64 min Eth + 12B IFS) • 1.4Ghz clock rate • 10 Gb/sec * 1B/8b * packet/76B = 16.44 Mp/sec • 1.4Gcycle/sec * 1 sec/ 16.44 Mp = 85.16 cycles per packet • compute budget: 85 cycles • latency budget: (threads*85) • 8 threads: 680 cycles

  29. Cycle Budget (IPv4 MN packets) • To hit 5 Gb rate: • 90B per min IPv4 packet (78 min IPv4MN + 12B IFS) • 1.4Ghz clock rate • 5 Gb/sec * 1B/8b * packet/90B = 6.94 Mp/sec • 1.4Gcycle/sec * 1 sec/ 6.94 Mp = 201.7 cycles per packet • compute budget: 201 cycles • latency budget: (threads*201) • 8 threads: 1608 cycles • To hit 10 Gb rate: • 90B per min IPv4 packet (78 min IPv4MN + 12B IFS) • 1.4Ghz clock rate • 10 Gb/sec * 1B/8b * packet/90B = 13.88 Mp/sec • 1.4Gcycle/sec * 1 sec/ 13.88 Mp = 100.86 cycles per packet • compute budget: 100 cycles • latency budget: (threads*100) • 8 threads: 800 cycles
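The budget arithmetic on the two Cycle Budget slides can be reproduced with a short Python helper. This is a sketch for checking the numbers, not project code: the function name `cycle_budget` is hypothetical, and the compute budget is truncated to whole cycles to match how the slides round.

```python
def cycle_budget(line_rate_gbps, bytes_per_packet, clock_ghz=1.4, threads=8):
    """Per-packet cycle budget for a given line rate and on-wire packet size.

    bytes_per_packet includes the 12B inter-frame spacing, as on the slides
    (76B for minimum Ethernet, 90B for minimum IPv4 MN packets).
    """
    packets_per_sec = line_rate_gbps * 1e9 / 8 / bytes_per_packet
    cycles_per_packet = clock_ghz * 1e9 / packets_per_sec
    compute_budget = int(cycles_per_packet)  # truncate, as the slides do
    return {
        "mpps": packets_per_sec / 1e6,
        "cycles_per_packet": cycles_per_packet,
        "compute_budget": compute_budget,
        "latency_budget": compute_budget * threads,
    }


if __name__ == "__main__":
    for rate in (5, 10):
        for size in (76, 90):
            b = cycle_budget(rate, size)
            print(rate, size, b["compute_budget"], b["latency_budget"])
```

The slides divide by a packet rate already rounded to two decimals, so their cycles-per-packet figures differ from the unrounded values in the last digit; the truncated compute and latency budgets are the same either way.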

  30. TCAM Instruction Latency Analysis • QDR Clock: 200 MHz, 5ns period • TCAM core Clock: 200 MHz, 5ns period • NPU Clock: 1400 MHz, 0.714 ns period • 1 QDR cycle == 1 TCAM cycle == 7 NPU cycles • TCAM Lookup Latencies: • QDR xfer: 1 cycle per word in key • Instruction Fifo: constant 2 cycles • Synchronizer: constant 3 cycles • Execution Latency: fct(key width, output data width) • Table in IDT Latency Application Note • Re-Synchronizer: constant 1 cycle

  31. TCAM Instruction Latency Analysis • IPv4 MR • Key: 144 bit (5 words) • Output data: 128 bit • QDR Xfer: 5 cycles • Constants: 2 + 3 + 1 = 6 cycles • Execution Latency: 36 cycles • Total Latency: 47 TCAM cycles (235 ns) (329 NPU cycles) • LC Ingress • Key: 64 bit (2 words) • Output data: 64 bit • QDR Xfer: 2 cycles • Constants: 2 + 3 + 1 = 6 cycles • Execution Latency: 32 cycles • Total Latency: 40 TCAM cycles (200 ns) (280 NPU cycles) • LC Egress • Key: 24 bit (1 word) • Output data: 64 bit • QDR Xfer: 1 cycle • Constants: 2 + 3 + 1 = 6 cycles • Execution Latency: 34 cycles • Total Latency: 41 TCAM cycles (195 ns) (273 NPU cycles) • Note: 41 TCAM cycles at 5 ns is 205 ns (287 NPU cycles); 195 ns (273 NPU cycles) corresponds to 39 TCAM cycles, so one of the LC Egress figures needs rechecking against the IDT Latency Application Note.
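The latency totals above follow one formula: QDR transfer cycles (one per key word), plus the three constant stages, plus the execution latency from the IDT application note. The Python sketch below restates it so the totals can be recomputed. The function name `lookup_latency` is hypothetical, and the execution latencies are the values quoted on this slide, not values looked up in the application note.

```python
TCAM_PERIOD_NS = 5              # 200 MHz QDR and TCAM core clock
NPU_CYCLES_PER_TCAM_CYCLE = 7   # 1400 MHz NPU clock / 200 MHz TCAM clock

INSTRUCTION_FIFO = 2   # constant cycles
SYNCHRONIZER = 3       # constant cycles
RESYNCHRONIZER = 1     # constant cycles


def lookup_latency(key_words, execution_cycles):
    """Return (TCAM cycles, nanoseconds, NPU cycles) for one lookup."""
    tcam_cycles = (key_words + INSTRUCTION_FIFO + SYNCHRONIZER
                   + RESYNCHRONIZER + execution_cycles)
    return (tcam_cycles,
            tcam_cycles * TCAM_PERIOD_NS,
            tcam_cycles * NPU_CYCLES_PER_TCAM_CYCLE)


# (key words, execution latency) as quoted on the slide.
LOOKUPS = {
    "IPv4 MR": (5, 36),
    "LC Ingress": (2, 32),
    "LC Egress": (1, 34),
}

if __name__ == "__main__":
    for name, (words, execution) in LOOKUPS.items():
        print(name, lookup_latency(words, execution))
```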

  32. TCAM Performance (Rates in M/sec) LC_Egress LC_Ingress IPv4 MR

  33. TCAM Performance (Rates in M/sec) LC_Egress LC_Ingress IPv4 MR

  34. IPv4: Performance Snapshot • IPv4 MR lookup • Unloaded • ~610 Cycles • Sections marked on the trace: dl_source & Xfer reg loads, sram write, Timestamp Delay setup, ctx_arb, Timestamp Delay, sram read, dl_sink processing, dl_sink • Ctx_arb vs br_signal optimization

  35. IPv4: Performance Snapshot • IPv4 MR lookup • Write issued at 33333 • Write issued at 34016 • 34016 – 33333 = 683 Cycles • Hack to Parse: loop and repeatedly call dl_sink with same buf_handle • Should guarantee that there is always something in NN ring for lookup to pick up • Hack to HF: set dlNextBlock to IX_DROP • Keep Tx from trying to transmit something bad.

  36. LC_Ingress: Performance Snapshots • LC Ingress lookup • Unloaded • >=563 Cycles

  37. LC_Ingress: Performance Snapshots • LC Ingress lookup • Write issued at 59888 • Write issued at 60494 • 60494 – 59888 = 606 Cycles • Hack to KE stub: loop and repeatedly call dl_sink with same buf_handle • Should guarantee that there is always something in NN ring for lookup to pick up • Hack to HF stub: set dl_next_block to IX_DROP • Keep Tx from trying to transmit something bad.

  38. LC_Egress: Performance Snapshots • LC Egress lookup • Unloaded • ~560 Cycles

  39. LC_Egress: Performance Snapshots • LC Egress lookup • Loaded with KE and HF hacks. • ~610 Cycles

  40. Performance Summary • Processing Cycles: • LC Ingress: 41 • IPv4 MR: 57 • LC Egress: 43 • Abort Cycles: • LC Ingress: 16 • IPv4 MR: 16 • LC Egress: 16 • Latency Cycles: • LC Ingress: 560 – 57 = 503? • IPv4 MR: 610 – 73 = 537? • LC Egress: 560 – 59 = 501? • Expected performance • LC Ingress: 10Gb/s • IPv4 MR: 5Gb/s + • LC Egress: 10Gb/s
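The latency figures in the summary are the unloaded snapshot time minus the cycles the thread itself occupies (processing plus abort cycles). The Python sketch below recomputes them and adds a simple comparison against a compute budget. The helper names are hypothetical, and treating processing plus abort cycles as the per-packet compute cost is an assumption drawn from how the subtraction is written on the slide.

```python
# Numbers taken from the block diagrams and the unloaded performance snapshots.
BLOCKS = {
    "LC Ingress": {"processing": 41, "abort": 16, "snapshot": 560},
    "IPv4 MR": {"processing": 57, "abort": 16, "snapshot": 610},
    "LC Egress": {"processing": 43, "abort": 16, "snapshot": 560},
}


def occupied_cycles(name):
    """Cycles the thread spends executing: processing plus abort cycles."""
    block = BLOCKS[name]
    return block["processing"] + block["abort"]


def latency_cycles(name):
    """Snapshot time not accounted for by the thread's own instructions."""
    return BLOCKS[name]["snapshot"] - occupied_cycles(name)


def fits_compute_budget(name, budget_cycles):
    """Assumed check: occupied cycles must not exceed the per-packet budget."""
    return occupied_cycles(name) <= budget_cycles


if __name__ == "__main__":
    for name in BLOCKS:
        print(name, occupied_cycles(name), latency_cycles(name))
```

For example, `fits_compute_budget("LC Ingress", 85)` compares the 57 occupied cycles against the 85-cycle compute budget for minimum Ethernet packets at 10 Gb/s.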

  41. Optimization Possibilities • There may still be some code we can move out of the processing loop, or at least move between the sram write or read and the ctx swap. • dl_sink has a possible improvement. • ctx_arb vs. br_signal/br_!signal

