Block Design Review: PlanetLab Line Card Header Format
David M. Zar
dzar@wustl.edu
http://www.arl.wustl.edu/projects/techX
Revision History
• 10/31/06 (DMZ):
  • Initial Draft
• 11/04/06 (DMZ):
  • Updates for performance issues
Line Card Centric Overview
[Block diagram. Ingress path blocks: Phy Int Rx, Key Extract, Lookup, Hdr Format, Port Splitter, QM/Schd, Switch Tx. Egress path blocks: Switch Rx, Key Extract, Lookup, Hdr Format, Port Splitter, QM/Schd, Phy Int Tx. The two paths meet at the SWITCH.]
• Port Splitter (Ingress and Egress):
  • Accepts packets on a NN ring
  • Forwards each packet based on its physical destination port number
    • 0-4 go to QM1 on a scratch ring
    • 5-9 go to QM2 on a scratch ring
  • Measured delay is about 120 cycles, including memory latency
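The Port Splitter rule above can be sketched in a few lines of Python. This is only a model of the port-to-QM mapping stated on the slide; the function name `select_qm` is hypothetical, and rejecting ports outside 0-9 is an assumption, since the slide does not say what happens to them.

```python
def select_qm(dest_port: int) -> str:
    """Pick the queue manager for a packet from its physical destination port.

    Ports 0-4 map to QM1 and ports 5-9 map to QM2, as stated on the slide.
    Raising on any other port is an assumption made for this sketch.
    """
    if 0 <= dest_port <= 4:
        return "QM1"
    if 5 <= dest_port <= 9:
        return "QM2"
    raise ValueError(f"physical destination port out of range: {dest_port}")


if __name__ == "__main__":
    for port in (0, 4, 5, 9):
        print(port, select_qm(port))
```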
Ingress Header Format
• Microengine Usage
  • One microengine
  • Eight identical threads
  • NN ring input from Lookup
  • NN ring output to Port Splitter
• Main functions:
  • Using data from Lookup, modify packet header in DRAM for proper routing to PE:
    • Destination MAC address
      • First five bytes are same as source MAC address
    • Source MAC address
      • Address of this LC
    • VLAN tag
  • Adjust pre-queue stats counters
  • Format input data for QM
    • QID
    • Port Number
    • Ethernet Frame Length
LC Ingress Functional Blocks
[Block diagram: Phy Int Rx, Key Extract, Lookup, Hdr Format, Switch Tx, with QM/Schd downstream of Hdr Format]
• Possible Input Packet Formats and Output Packet Format (fields in order; the 802.1Q Type and VLAN fields appear in the output format and in one of the two input formats):
  • Ethernet Header: DstAddr (6B), SrcAddr (6B), Type=802.1Q (2B), VLAN (2B), Type=IP (2B)
  • IP Header: Ver/HLen/Tos/Len (4B), ID/Flags/FragOff (4B), TTL (1B), Protocol = UDP (1B), Hdr Cksum (2B), Src Addr (4B), Dst Addr (4B), IP Options (0-40B)
  • UDP Header: Src Port (2B), Dst Port (2B), UDP length (2B), UDP checksum (2B)
  • UDP Payload (MN Packet)
  • Ethernet Trailer: PAD (nB), CRC (4B)
• Data from Lookup (one 32-bit word per line):
  • Buf Handle (32b)
  • IP Pkt Length (16b) | Eth Hdr Len (8b) | Reserved (8b)
  • DAddr (8b) | Port (4b) | QID (20b)
  • VLAN (16b) | Stats Index (16b)
• Data to QM/Schd (one 32-bit word per line):
  • Buffer Handle (32b)
  • Rsv (4b) | Port (4b) | QID (20b) | Rsv (4b)
  • Frame Length (16b) | Stats Index (16b)
MAC Address and VLAN Tag (Ingress)
• The source MAC address is fixed and set at boot time (_WU_get_mac_address)
• The destination MAC address differs from the source MAC address only in the last byte, which is obtained from the Lookup data.
• The VLAN tag is obtained from the Lookup data.
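As a rough illustration of the rules above, the following Python sketch builds the 18-byte VLAN Ethernet header from the source MAC address, the destination byte, and the VLAN tag. It models only the byte layout; it is not the microengine code in _WU_write_vlan_header. The function name `build_vlan_header` and its parameters are hypothetical, and the sample values are taken from the output frame shown on the Ingress Validation slide.

```python
import struct

ETHERTYPE_VLAN = 0x8100  # Type=802.1Q
ETHERTYPE_IPV4 = 0x0800  # Type=IP


def build_vlan_header(src_mac: bytes, dst_last_byte: int, vlan: int) -> bytes:
    """Return DstAddr, SrcAddr, Type=802.1Q, VLAN, Type=IP (18 bytes).

    The destination MAC reuses the first five bytes of the source MAC and
    takes its last byte from the Lookup data, as described on the slide.
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC address must be 6 bytes")
    dst_mac = src_mac[:5] + bytes([dst_last_byte & 0xFF])
    tags = struct.pack("!HHH", ETHERTYPE_VLAN, vlan & 0xFFFF, ETHERTYPE_IPV4)
    return dst_mac + src_mac + tags


if __name__ == "__main__":
    header = build_vlan_header(bytes.fromhex("010203040a0b"), 0x02, 0x0002)
    print(header.hex())
```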
Stats/Counters (Ingress/Egress)
• The Stats Index is obtained from the Lookup Data
• The pre-queue packet and byte counters are updated (_WU_update_counters)
  • Packet counter is incremented (atomic SRAM)
  • Byte count is incremented by the number of bytes in the entire Ethernet frame (_WU_get_enet_frame_length).
    • Frame_length = IP_pkt_len + 18
    • 18 is the VLAN Ethernet header length (6B DstAddr + 6B SrcAddr + 2B Type=802.1Q + 2B VLAN + 2B Type=IP)
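The counter update can be modeled with a short Python sketch. It only mirrors the arithmetic on the slide (one packet, IP_pkt_len + 18 bytes); the dictionary standing in for the SRAM counter table and the names `enet_frame_length` and `update_counters` are assumptions made for illustration.

```python
VLAN_ETH_HDR_LEN = 18  # DstAddr 6 + SrcAddr 6 + Type=802.1Q 2 + VLAN 2 + Type=IP 2


def enet_frame_length(ip_pkt_len: int) -> int:
    """Frame_length = IP_pkt_len + 18, as given on the slide."""
    return ip_pkt_len + VLAN_ETH_HDR_LEN


def update_counters(counters: dict, stats_index: int, ip_pkt_len: int) -> None:
    """Add one packet and one frame's worth of bytes to the pre-queue counters.

    `counters` is a plain dict used here as a stand-in for the SRAM table.
    """
    entry = counters.setdefault(stats_index, {"pkts": 0, "bytes": 0})
    entry["pkts"] += 1
    entry["bytes"] += enet_frame_length(ip_pkt_len)


if __name__ == "__main__":
    table = {}
    update_counters(table, stats_index=7, ip_pkt_len=56)
    update_counters(table, stats_index=7, ip_pkt_len=56)
    print(table)
```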
QM Data Formatting (Ingress and Egress)
• Data passed to QM (one 32-bit word per line):
  • Buffer Handle (32b)
  • Rsv (4b) | Port (4b) | QID (20b) | Rsv (4b)
  • Frame Length (16b) | Stats Index (16b)
• QID is extracted from Lookup data
• Port number is extracted from Lookup data
• Total Ethernet frame length is passed to QM
• Stats index is passed on for post-queue counters
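The three-word layout above can be sketched as bit packing in Python. The field widths come from the slide; reading each row with its leftmost field in the most significant bits is an assumption, and the function name `pack_qm_words` is hypothetical.

```python
from typing import Tuple


def pack_qm_words(buf_handle: int, port: int, qid: int,
                  frame_length: int, stats_index: int) -> Tuple[int, int, int]:
    """Pack the data handed to QM into three 32-bit words.

    Assumed bit positions (leftmost field = most significant bits):
      word 0: Buffer Handle [31:0]
      word 1: Rsv [31:28], Port [27:24], QID [23:4], Rsv [3:0]
      word 2: Frame Length [31:16], Stats Index [15:0]
    """
    word0 = buf_handle & 0xFFFFFFFF
    word1 = ((port & 0xF) << 24) | ((qid & 0xFFFFF) << 4)
    word2 = ((frame_length & 0xFFFF) << 16) | (stats_index & 0xFFFF)
    return word0, word1, word2


if __name__ == "__main__":
    words = pack_qm_words(0x12345678, port=3, qid=0xABCDE,
                          frame_length=74, stats_index=7)
    print([f"{w:08x}" for w in words])
```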
Ingress HF Block Diagram
[Flowchart. Blocks shown: dl_source(), NN Dequeue, _WU_get_enet_frame_length, _WU_write_vlan_header, _WU_update_counters, _WU_update_buffer_descriptor, NN Enqueue, dl_sink(), plus the init signal and the "Wait for prev ctx" / "Signal next ctx" synchronization points]
• Memory accesses:
  • _WU_write_vlan_header: DRAM, 4|5 4B writes (26 cycles)
  • _WU_update_counters: SRAM, 1 read and 1 write (10 cycles)
  • _WU_update_buffer_descriptor: SRAM, 3 writes (12 cycles)
• Per-block cycle counts as they appear in the diagram: 10, 17, 2, 26, 5, 10, 1, 16, 12
• Total cycles: 33+66=99
• Budget: 1400 MHz / (10 Gb/s / (8*90)) = 100.8 => 100 cycles
• Measured Latency: 745
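The budget arithmetic can be checked with a small Python sketch. The numbers (1400 MHz, 10 Gb/s, 90, and the nine per-block cycle counts) come from the slide; interpreting 90 as the packet size in bytes used for the budget is an assumption, and the name `cycle_budget` is hypothetical.

```python
def cycle_budget(clock_hz: float, line_rate_bps: float, pkt_bytes: int) -> float:
    """Microengine cycles available per packet at a given line rate."""
    packets_per_second = line_rate_bps / (8 * pkt_bytes)
    return clock_hz / packets_per_second


# Per-block cycle counts listed in the ingress diagram.
INGRESS_BLOCK_CYCLES = [10, 17, 2, 26, 5, 10, 1, 16, 12]

if __name__ == "__main__":
    budget = cycle_budget(1400e6, 10e9, 90)
    used = sum(INGRESS_BLOCK_CYCLES)
    print(f"budget: {budget:.1f} cycles, used: {used} cycles")
```

With the ingress figures the per-block counts sum to the 99 cycles quoted on the slide, which leaves roughly one cycle of headroom against the 100-cycle budget.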
Ingress Validation
• Send in non-tunneled packets and check that the output packets are in our internal, tunneled format.
  • Worked during development but not tested in integrated system at this point.
• Send in tunneled packets and check that the output packets are in our internal, tunneled format.
  • Example input (the bracketed CRC is stripped by RX):
    01020304 05060708 090a0b0c 81000aaa 08004500 00380000 0000ff11 3a61c0a8 0001c0a8 00020001 00010024 ffbd4500 001c0000 0000ff11 3a7dc0a8 0001c0a8 00020001 00020008 7e87 [6d7e d5be]
  • Resulting output:
    01020304 0a020102 03040a0b 81000002 08004500 00380000 0000ff11 3a61c0a8 0001c0a8 00020001 00010024 ffbd4500 001c0000 0000ff11 3a7dc0a8 0001c0a8 00020001 00020008 7e87
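One way to automate the check on the example frames is sketched below in Python. The hex strings are copied from the slide with the bracketed CRC left out; the helper name `changed_offsets` and the specific checks are illustrative choices, not the validation procedure actually used.

```python
import struct

# Example frames from the slide; the bracketed CRC is omitted because RX strips it.
IN_HEX = """01020304 05060708 090a0b0c 81000aaa 08004500 00380000 0000ff11
3a61c0a8 0001c0a8 00020001 00010024 ffbd4500 001c0000 0000ff11 3a7dc0a8
0001c0a8 00020001 00020008 7e87"""
OUT_HEX = """01020304 0a020102 03040a0b 81000002 08004500 00380000 0000ff11
3a61c0a8 0001c0a8 00020001 00010024 ffbd4500 001c0000 0000ff11 3a7dc0a8
0001c0a8 00020001 00020008 7e87"""

VLAN_ETH_HDR_LEN = 18


def changed_offsets(a: bytes, b: bytes) -> list:
    """Byte offsets at which two equal-length frames differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


frame_in = bytes.fromhex("".join(IN_HEX.split()))
frame_out = bytes.fromhex("".join(OUT_HEX.split()))

diff = changed_offsets(frame_in, frame_out)
dst_mac, src_mac = frame_out[0:6], frame_out[6:12]
tpid, vlan, ethertype = struct.unpack_from("!HHH", frame_out, 12)
# The IP total length sits 2 bytes into the IP header.
ip_total_len = struct.unpack_from("!H", frame_out, VLAN_ETH_HDR_LEN + 2)[0]

if __name__ == "__main__":
    print("changed offsets:", diff)
    print("dst", dst_mac.hex(), "src", src_mac.hex(), "vlan", hex(vlan))
    print("frame length:", ip_total_len + VLAN_ETH_HDR_LEN)
```

For these frames only bytes inside the MAC address and VLAN fields change; everything from the inner Type=IP field onward is identical in input and output.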
Egress Header Format
• Microengine Usage
  • One microengine
  • Eight identical threads
  • NN ring input from Lookup
  • NN ring output to Port Splitter
• Main functions:
  • Using data from Lookup, modify packet header in DRAM for proper routing to Switch:
    • Destination MAC address
      • First five bytes are same as source MAC address
      • Destination MAC address is looked up based on IP address from lookup
    • Source MAC address
      • Address of this LC
    • VLAN tag
  • Adjust pre-queue stats counters
  • Format input data for QM
    • QID
    • Port Number
    • Ethernet Frame Length
LC Egress Functional Blocks
[Block diagram: SWITCH, Switch Rx, Key Extract, Lookup, Hdr Format, Phy Int Tx, with QM/Schd downstream of Hdr Format]
• Input Packet Format and Output Packet Format (fields in order):
  • Ethernet Header: DstAddr (6B), SrcAddr (6B), Type=802.1Q (2B), VLAN (2B), Type=IP (2B)
  • IP Header: Ver/HLen/Tos/Len (4B), ID/Flags/FragOff (4B), TTL (1B), Protocol = UDP (1B), Hdr Cksum (2B), Src Addr (4B), Dst Addr (4B), IP Options (0-40B)
  • UDP Header: Src Port (2B), Dst Port (2B), UDP length (2B), UDP checksum (2B)
  • UDP Payload (MN Packet)
  • Ethernet Trailer: PAD (nB), CRC (4B)
• Data from Lookup (one 32-bit word per line):
  • Buf Handle (32b)
  • IP Pkt Length (16b) | Eth Hdr Len (8b) | Reserved (8b)
  • IP DAddr (32b)
  • Rsvd (4b) | VLAN (12b) | Stats Index (16b)
  • Rsvd (4b) | Port (4b) | Rsvd (4b) | QID (20b)
• Data to QM/Schd (one 32-bit word per line):
  • Buffer Handle (32b)
  • Rsv (4b) | Port (4b) | QID (20b) | Rsv (4b)
  • Ethernet Frame Length (16b) | Stats Index (16b)
MAC Address and VLAN Tag (Egress)
• The source MAC address is fixed and set at boot time (_WU_get_mac_address)
• The destination MAC address differs from the source MAC address only in the last nibble, which is obtained from the Lookup data.
  • _WU_ip_lookup takes the 32-bit destination IP address and uses the local CAM to obtain the least significant 4 bits of the MAC address.
  • The CAM state bits hold this value, which is why only 4 bits of data are returned.
• The VLAN tag is obtained from the Lookup data.
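The egress destination MAC derivation can be modeled as follows. This Python sketch uses a dictionary as a stand-in for the microengine CAM and its 4 state bits; the function name `egress_dst_mac` is hypothetical, and raising KeyError on a CAM miss is an assumption, since the slides do not describe miss handling.

```python
def egress_dst_mac(src_mac: bytes, dst_ip: int, cam: dict) -> bytes:
    """Derive the destination MAC from the source MAC and a CAM lookup.

    `cam` maps a 32-bit destination IP address to the 4-bit value that the
    real CAM would return in its state bits. Only the last nibble of the
    MAC address is replaced, as described on the slide.
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC address must be 6 bytes")
    nibble = cam[dst_ip] & 0xF  # KeyError on a miss (assumed behavior)
    last_byte = (src_mac[5] & 0xF0) | nibble
    return src_mac[:5] + bytes([last_byte])


if __name__ == "__main__":
    cam = {0xC0A80002: 0x2}  # 192.168.0.2 -> nibble 0x2 (illustrative entry)
    print(egress_dst_mac(bytes.fromhex("010203040a0b"), 0xC0A80002, cam).hex())
```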
Egress HF Block Diagram
[Flowchart. Blocks shown: dl_source(), NN Dequeue, _WU_get_enet_frame_length, _WU_ip_lookup, _WU_write_vlan_header, _WU_update_counters, _WU_update_buffer_descriptor, NN Enqueue, dl_sink(), plus the init signal and the "Wait for prev ctx" / "Signal next ctx" synchronization points]
• Memory accesses:
  • _WU_write_vlan_header: DRAM, 1 4B read and 4 4B writes (32 cycles)
  • _WU_update_counters: SRAM, 1 add and 1 incr (6 cycles)
  • _WU_update_buffer_descriptor: SRAM, 3 writes (10 cycles)
• Per-block cycle counts as they appear in the diagram: 10, 1, 2, 1, 32, 2, 6, 1, 10
• Total cycles: 65
• Measured Latency*: ~660
Egress Validation
• Send in our internal, tunneled packets and check that the output packets are valid tunneled IP packets.
  • For the PlanetLab demo, there are no non-tunneled output packets
• Check packet and byte counters for valid updates
• Check CAM for proper initialization (data watch)
HF Initialization (Ingress/Egress)
• All memory locations defined in dl_system.h:
  • Base address for HF
    • LC[I/E]_HF_SRAM_INIT_BASE
      • MAC_ADDR_HI32
      • MAC_ADDR_LO16
  • Pre-Queue Counters
    • LC[I/E]_LU_COUNTERS_SRAM_INIT_BASE
    • LC[I/E]_LU_PRE_Q_PKT_CNT_OFFSET: offset into counters structure for packet counter
    • LC[I/E]_LU_PRE_Q_BYTE_CNT_OFFSET: offset into counters structure for byte counter
• Thread 0 waits for signal from rx
• For Egress, the CAM is filled (_WU_hfe_initialize_ip_lookup) with data from LCE_HF_SRAM_INIT_BASE + 8. Each entry is 64 bits: cam_entry (32b), RSVD (28b), MAC_DEST (4b)
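To make the 64-bit entry format concrete, here is a Python sketch that decodes a block of initialization data into a table of cam_entry to MAC_DEST values. The field widths are from the slide; the big-endian byte order, the placement of cam_entry in the first word with MAC_DEST in the low nibble of the second, and the name `parse_cam_init` are all assumptions.

```python
import struct


def parse_cam_init(blob: bytes) -> dict:
    """Decode 64-bit entries: cam_entry (32b), RSVD (28b), MAC_DEST (4b).

    Assumes big-endian words, with cam_entry in the first 32-bit word and
    MAC_DEST in the 4 least significant bits of the second word.
    """
    if len(blob) % 8 != 0:
        raise ValueError("initialization data must be a multiple of 8 bytes")
    table = {}
    for offset in range(0, len(blob), 8):
        cam_entry, tail = struct.unpack_from("!II", blob, offset)
        table[cam_entry] = tail & 0xF
    return table


if __name__ == "__main__":
    # Two illustrative entries: 192.168.0.1 -> 0x1 and 192.168.0.2 -> 0x2.
    data = struct.pack("!IIII", 0xC0A80001, 0x1, 0xC0A80002, 0x2)
    print({hex(k): v for k, v in parse_cam_init(data).items()})
```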
File Locations (Ingress and Egress)
• Main code
  • Applications/LC_Ingress/src/hdr_format/PL/hdr_format.uc
  • Applications/LC_Egress/src/hdr_format/PL/hdr_format.uc
• Library
  • library/DataPlane/hdr_format_util.uc
Required Includes (Ingress and Egress)
• Files
  • build/PL/dispatch_loop/dl_system.h
    • memory locations
  • IXA_SDK_4.0/src/library/microblocks_library/
    • dl_meta: for metadata macros
  • IXA_SDK_4.0/src/library/dataplane_library/
    • dram: for DRAM read/write macros
    • sram: for SRAM read/write/add/incr macros
    • xbuf: for transfer buffer macros
Ingress Performance Anomalies
[Simulator screenshot not preserved]
• These stalls are in various SRAM and DRAM accesses: the command FIFO is FULL!
Ingress Anomalies (Explanation)
[Annotated diagram not preserved]
• The SRAM Controllers have a command FIFO
• These bus arbiters are shared across all memory interfaces
Ingress/Egress SRAM Issues
• It seems that atomic ADD/INCR instructions are expensive at the SRAM controller.
• If I remove them and instead read the SRAM, add the value myself, and write the SRAM, this is quicker and consumes less of the SRAM controller's time; thus, the command queue never backs up.
• With this new design, more instructions are executed, but there may be a few I could eliminate by optimizing the code.
• No stalling in the WU microblocks (QM, RX, and TX still stall, but these look normal).
Ingress/Egress Performance
• ~99 CPU cycles
• ~745 cycles latency
• Expected performance
  • Should have no trouble going at 10 Gb/s but does…
• Simulated performance (as of 11/06/2006)
  • ~10 Gb/s
  • With all other microengines in place (i.e. real simulation)
Ingress/Egress Future Work
• Determine source of I/O stalls
• Update Stubs projects for validation of Ingress/Egress blocks (done for Ingress)
• Extend both blocks for all possible packet formats
  • Ingress: inputs
  • Egress: outputs
• Possible instruction optimization to give a little headroom (99 cycles out of 100)
• Currently, the design will not work for standard IPv4 packets; PlanetLab VLAN packets are OK.