Flow Stats Module
James Moscola
September 6, 2007
SPP V1 LC Egress with 1x10Gb/s Tx
[Block diagram. Labeled blocks: Switch, RBUF, MSF, Rx1, Rx2, Key Extract, Lookup, TCAM, Hdr Format, QM1, QM2, Port Splitter, Flow Stats1, Flow Stats2, 1x10G Tx1, 1x10G Tx2, TBUF, RTM, XScale, Stats (1 ME), SRAM1, SRAM2, SRAM3. Labeled paths: NAT Miss Scratch Ring, NAT Pkt return. Links between blocks are marked SCR or NN.]
Overview of Flow Stats
• Main functions
  • Uniquely identify flows based on 5-tuple (plus Slice ID, giving the 6-tuple stored in each record)
    • Hash header values to get an index into a table of records
  • Maintain packet and byte counts for each flow
    • Compare packet header with header values in record, and increment if same
    • Otherwise, follow hash chain until correct record is found
  • Send flow information to XScale for archiving
• Secondary functions
  • Maintain hash table
    • Identify and remove flows that are no longer active
    • Invalid flows are removed so memory can be reused
Design Considerations
• Efficiently maintaining a hash table with chained collisions
  • Efficiently inserting and deleting records
  • Efficiently reading hash table records
• Synchronization issues
  • Multiple threads modifying hash table and chains
Flow Record
• Total Record Size = 11 32-bit words
• V is valid bit
  • Only needed at head of chain
  • ‘1’ for valid record
  • ‘0’ for invalid record
• Start timestamp (64 bits) is set when record starts counting flow
  • Reset to zero when record is archived
• End timestamp (64 bits) is set each time a packet is seen for the given flow
• Packet and Byte counters are incremented for each packet on the given flow
  • Reset to zero when record is archived
• Next Record Number is next record in hash chain
  • 0x1FFFF if record is tail
  • Address = (next_record_num * record_size) + base_addr

Record layout (the non-reserved fields of LW0 to LW3 are the members of the 6-tuple):
LW0   Source Address (32b)
LW1   Destination Address (32b)
LW2   SrcPort (16b) | DestPort (16b)
LW3   Reserved (12b) | Slice ID (12b) | Protocol (8b)
LW4   V (1b) | Reserved (14b) | Next Record Number (17b)
LW5   Packet Counter (32b)
LW6   Byte Counter (32b)
LW7   Start Timestamp_high (32b)
LW8   Start Timestamp_low (32b)
LW9   End Timestamp_high (32b)
LW10  End Timestamp_low (32b)
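The record layout and the address formula can be sketched in a few lines of Python. This is only an illustration: the function names `pack_record` and `record_address` are hypothetical, and the sketch assumes big-endian words with fields placed most-significant-first in the order listed, which the slide does not state explicitly.

```python
import struct

RECORD_WORDS = 11
RECORD_SIZE = RECORD_WORDS * 4   # 44 bytes per record
TAIL = 0x1FFFF                   # Next Record Number value that marks the tail


def pack_record(src, dst, sport, dport, slice_id, proto,
                valid, next_rec, pkts, nbytes, start_ts, end_ts):
    """Pack one flow record into 11 big-endian 32-bit words (assumed layout)."""
    lw2 = ((sport & 0xFFFF) << 16) | (dport & 0xFFFF)
    lw3 = ((slice_id & 0xFFF) << 8) | (proto & 0xFF)    # top 12 bits reserved
    lw4 = ((valid & 1) << 31) | (next_rec & 0x1FFFF)    # 14 reserved bits between
    return struct.pack(
        ">11I", src, dst, lw2, lw3, lw4, pkts, nbytes,
        start_ts >> 32, start_ts & 0xFFFFFFFF,          # 64-bit start timestamp
        end_ts >> 32, end_ts & 0xFFFFFFFF)              # 64-bit end timestamp


def record_address(next_record_num, base_addr, record_size=RECORD_SIZE):
    """Address = (next_record_num * record_size) + base_addr."""
    return next_record_num * record_size + base_addr
```

For example, `record_address(2, 0x100000)` gives the byte address of record number 2 when the table starts at 0x100000.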
Hash Table Memory
• Allocating 4 MBytes in SRAM Channel 3 for hash table
  • Supports ~95K records
  • Divided memory 75% for the main table and 25% for the collision table
• Memory required = Main_table_size + Collision_table_size + Bit_vector_size
  = .75*(#records * #bytes/record) + .25*(#records * #bytes/record) + .75*(#records/8)
  = (~71K records) + (~24K records) + bit vector
  = ~3 MBytes + ~1 MByte + ~8 KBytes
• Space for main table and collision table can be adjusted to tune performance
  • Larger main table means fewer collisions, but still need adequate space for collision table
[Diagram: memory laid out as Main Table (75%), Collision Table (25%), Bit Vector]
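The memory formula can be checked numerically. The sketch below plugs in the 95,104-record total and 44-byte record size from the constants listed at the end of the deck; the function name `table_budget` is hypothetical.

```python
def table_budget(total_records, record_size, main_fraction=0.75):
    """Split records between main and collision tables and size the bit vector."""
    main_records = int(total_records * main_fraction)
    collision_records = total_records - main_records
    main_bytes = main_records * record_size
    collision_bytes = collision_records * record_size
    bit_vector_bytes = main_records // 8     # one bit per main-table entry
    return {
        "main_records": main_records,
        "collision_records": collision_records,
        "main_bytes": main_bytes,
        "collision_bytes": collision_bytes,
        "bit_vector_bytes": bit_vector_bytes,
        "total_bytes": main_bytes + collision_bytes + bit_vector_bytes,
    }


budget = table_budget(total_records=95104, record_size=11 * 4)
fits_in_4_mbytes = budget["total_bytes"] <= 4 * 1024 * 1024
```

With these inputs the main table holds 71,328 records and the collision table 23,776, and the total stays just under 4 MBytes.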
Inserting Into Hash Table
• IXP has 3 different hash functions (48-bit, 64-bit, 128-bit), which take 7, 8, and 16 clock cycles respectively
  • Using 64-bit hash function provides required functionality and takes half the time of the 128-bit hash
  • Source Addr and Protocol are not hashed into the table address
  • HASH(D.Addr, S.Port, D.Port);
• Result of hash is used to address the main hash table
  • Records in the main table represent the head of a chain
  • If slot at head of chain is empty (valid_bit=0), store record there
  • If slot at head of chain is occupied, compare 6-tuple
    • If 6-tuple matches
      • If packet_count == 0 (an existing flow has a packet_count of 0 when its previous packets have just been archived)
        • Increment packet_counter for record
        • Add size of current packet to byte_counter
        • Set start and end time stamps
      • If packet_count > 0
        • Increment packet_counter for record
        • Add size of current packet to byte_counter
        • Set end time stamp
    • If the 6-tuple doesn’t match, a collision has occurred and the record needs to be stored in the collision table
[Diagram: Main Table, Collision Table, Bit Vector]
Hash Collisions
• Hash collisions are chained in linked list
  • Head of list is in the main table
  • Remainder of list is in collision table
• SRAM ring maintains list of free slots in collision table
  • When a collision occurs, a pointer to an open slot in the collision table can be retrieved from the SRAM ring
  • When a record is removed from the collision table, a pointer is returned to the SRAM ring for the invalidated slot
[Diagram: Main Table, Collision Table, Bit Vector, Free list (SRAM Ring)]
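The insertion and chaining rules from the last two slides can be modeled in software as follows. This is a minimal sketch, not the microengine code: the `FlowTable` class, the dict-based records, the `deque` standing in for the SRAM ring, and `stand_in_hash` (which replaces the IXP hardware hash) are all assumptions made for illustration.

```python
from collections import deque

TAIL = 0x1FFFF   # Next Record Number value that marks the tail of a chain


def stand_in_hash(key):
    """Placeholder for the IXP 64-bit hash over (D.Addr, S.Port, D.Port)."""
    src, dst, sport, dport, slice_id, proto = key
    return hash((dst, sport, dport))


class FlowTable:
    def __init__(self, main_size, collision_size, hash_fn=stand_in_hash):
        self.main_size = main_size
        # Slots 0..main_size-1 are chain heads; the rest form the collision table.
        self.records = [None] * (main_size + collision_size)
        # Free list of collision-table slots (the SRAM ring in the slides).
        self.free = deque(range(main_size, main_size + collision_size))
        self.hash_fn = hash_fn

    @staticmethod
    def _new_record(key, size, now):
        return {"key": key, "pkts": 1, "bytes": size,
                "start": now, "end": now, "next": TAIL}

    def update(self, key, size, now):
        """Count one packet of `size` bytes for flow `key`; return its slot."""
        idx = self.hash_fn(key) % self.main_size
        if self.records[idx] is None:              # head slot empty (valid bit 0)
            self.records[idx] = self._new_record(key, size, now)
            return idx
        while True:
            rec = self.records[idx]
            if rec["key"] == key:                  # 6-tuple matches
                if rec["pkts"] == 0:               # just archived: new interval
                    rec["start"] = now
                rec["pkts"] += 1
                rec["bytes"] += size
                rec["end"] = now
                return idx
            if rec["next"] == TAIL:                # end of chain, no match
                break
            idx = rec["next"]
        slot = self.free.popleft()                 # IndexError if table is full
        self.records[slot] = self._new_record(key, size, now)
        rec["next"] = slot                         # append to the chain
        return slot
```

Two flows that hash to the same head end up chained: the first occupies the main-table slot, and the second takes the first free collision slot.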
Hash Table Bit Vector
• Hash table can be sparse data structure
  • Reading entire table looking for valid records would be very time consuming
• Append bit vector to hash table
  • Each bit represents the head of a chain in the main table
  • Set bits indicate a valid chain starts at that location
  • One bit for each entry in main hash table
• When archiving records, read bit vector, compute the start addresses of valid chains, and read only valid records from hash table

Bit Vector example (bit positions 31 down to 0, with byte boundaries at 24, 16, and 8):
offset = 0   01000000 00000000 00000000 10000001
offset = 1   00000000 00000000 00000000 00000001
Example shows chains at 0, 7, 30, 32 are valid
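Decoding the bit vector into chain-head indices is a short loop. The sketch below assumes, as the example suggests, that bit 0 of the word at offset 0 corresponds to chain 0; the function name `valid_chain_heads` is hypothetical.

```python
def valid_chain_heads(bit_vector_words):
    """Return the main-table indices whose bit is set in the bit vector."""
    heads = []
    for offset, word in enumerate(bit_vector_words):
        for bit in range(32):
            if word & (1 << bit):
                heads.append(offset * 32 + bit)
    return heads


example = [
    0b01000000_00000000_00000000_10000001,   # offset = 0
    0b00000000_00000000_00000000_00000001,   # offset = 1
]
heads = valid_chain_heads(example)   # [0, 7, 30, 32]
```

Each index can then be turned into an SRAM address with the record-address formula, so only valid chains are read.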
Archiving Hash Table Records
• Send all valid records in hash table to XScale for archiving every 5 minutes
• For each record in the chain …
  • If packet count > 0 then
    • Record is valid
    • Send record to XScale via Scratch ring (maybe change to SRAM ring?)
    • Set packet count to 0
    • Set byte count to 0
    • Leave record in table
  • If packet count == 0 then
    • Flow has already been archived
    • No packet has arrived on flow in 5 minutes
    • Record is no longer valid
    • Delete record from hash table to free memory

Info Sent to XScale for each flow every 5 minutes:
LW0  Source Address (32b)
LW1  Destination Address (32b)
LW2  SrcPort (16b) | DestPort (16b)
LW3  Reserved (12b) | Slice ID (12b) | Protocol (8b)
LW4  Packet Counter (32b)
LW5  Byte Counter (32b)
LW6  Start Timestamp_high (32b)
LW7  Start Timestamp_low (32b)
LW8  End Timestamp_high (32b)
LW9  End Timestamp_low (32b)
Deleting Records from Hash Table
• While archiving records
  • If packet count is zero then remove record from hash table
  • Record has already been archived, and no packets have arrived in the last five minutes
• To remove a record
  • If record == head
    • If record != tail
      • Replace record with record.next
      • Free slot for the moved record
    • Else if record == tail
      • Valid_bit = 0
  • Else if record != head
    • Set previous record’s next pointer to record.next
    • Free slot for the deleted record
[Diagram: Main Table, Collision Table, Bit Vector, Free list (SRAM Ring)]
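The archive pass and the three deletion cases can be combined into one walk over a chain. The following is a sketch under assumptions: records are modeled as dicts in a Python list (with `None` for a head whose valid bit is 0), `send` stands in for the ring write to the XScale, and `archive_chain` is a hypothetical name. Clearing the bit-vector bit for an emptied chain is not shown on the slides and is omitted here.

```python
TAIL = 0x1FFFF   # Next Record Number value that marks the tail of a chain


def archive_chain(records, head, free_list, send):
    """Archive active records on one chain and delete idle ones."""
    prev = None          # index of the previous surviving record, if any
    idx = head
    while idx != TAIL:
        rec = records[idx]
        if rec is None:                       # head slot not valid
            return
        if rec["pkts"] > 0:
            # Active flow: send everything except the chain pointer, then reset.
            send({k: v for k, v in rec.items() if k != "next"})
            rec["pkts"] = 0
            rec["bytes"] = 0
            rec["start"] = 0                  # start timestamp resets on archive
            prev, idx = idx, rec["next"]
            continue
        # Idle flow: already archived and no packet since, so delete it.
        nxt = rec["next"]
        if prev is None:                      # record is the head
            if nxt != TAIL:
                records[idx] = records[nxt]   # replace record with record.next
                records[nxt] = None
                free_list.append(nxt)         # free slot of the moved record
                # Stay on idx: the moved record still has to be examined.
            else:
                records[idx] = None           # head is also tail: valid bit = 0
                idx = TAIL
        else:
            records[prev]["next"] = nxt       # unlink from the chain
            records[idx] = None
            free_list.append(idx)             # free slot of the deleted record
            idx = nxt
```

Running the pass twice over a chain with no new traffic empties it: the first pass archives and zeroes the active records, the second finds zero counts and deletes them.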
Memory Synchronization Issues
• Multiple threads reading/writing same block of memory
• Only allow 1 ME to modify structure of hash table
  • Inserting and deleting nodes
• Use global registers to indicate that the structure of the hash table is being modified
  • One global bitmask register to indicate which threads are modifying the structure
  • Eight global lock registers (1 per thread) to indicate what chain in the hash table is being modified
• When a thread wants to insert or delete a record from the hash table
  • Store pointer to head of chain in a global lock register
  • Set bitmask to signal other threads to check the global lock registers
  • If another thread is processing a packet that hashed to the same hash chain, that thread waits for the lock to clear and restarts processing its packet
    • Packet can also just be dropped, if counts don’t need to be perfect
    • However, such drops would most likely happen during bursts on a single flow ... exactly the behavior that flow stats should count and log
  • Otherwise, the thread can continue processing the packet normally
  • Clear pointer from shared register when done with insert/delete
• Use atomic increments where possible
Flow Stats Execution
• ME1
  • Init - Configure hash function
  • 8 threads
    • Read packet header
    • Hash packet header
    • Send header and hash result to ME2 for processing
• ME2 (thread numbers may need adjusting)
  • Init - Load SRAM ring with addresses for each slot in the collision table
  • Init - Set TIMESTAMP to 0
  • 7 threads
    • Insert records into hash table
    • Increment counter for records
  • 1 thread
    • Archive and delete hash table records
Diagram of Flow Stats Execution (ME1)
get buffer handle from QM          60 cycles
read buffer descriptor (SRAM)      150 cycles
read packet header (DRAM)          300 cycles
build hash key                     ~5 cycles
compute hash                       ~8 cycles
send packet info to ME2            60 cycles
send buffer handle to TX           60 cycles
Total                              ~643 cycles
Diagram of Flow Stats Execution (ME2)
• Incrementing Counters
  • Adds records to hash chain, but doesn’t remove them
• Best: ~360 cycles
• Worst: ~520 + 160x cycles
[Flowchart. Steps as labeled: get packet info from ME1 (60 cycles); read hash table record (SRAM); valid?; compare record to header (~10 cycles); match?; tail?; read next record in chain; count==0?; set register to lock chain; get record slot from freelist; insert new record; write START/END time & new counts; write END time & new counts; clear lock register. The remaining cycle annotations in the figure all read 150 cycles. Legend: Locking head of chain; Iterating through hash chain.]
Diagram of Flow Stats Execution (ME2)
• Archiving Records
  • Removes records from hash chain, but doesn’t add them
  • Processing of archiving records occurs every five minutes
[Flowchart. Steps as labeled: read current time; 5 minutes?; read 8 LWs of bit_vector; compute address of next record; read record from SRAM; count == 0?; send record to XScale; reset packet and byte counters; head of list?; tail of list?; set register to lock chain; read record.next; replace record with record.next; return record.next slot to freelist; set valid bit to zero; write next_ptr to previous list item; return record slot to freelist; clear lock register; done with 8 LWs?; done with all records? The cycle annotations in the figure all read 150 cycles. Legend: Locking head of chain; Waiting to archive.]
Return from Swap
• When returning from each CTX switch, always check global lock registers
  • If any locks are set, determine if another thread is modifying the same hash chain that the current packet has hashed to
  • If locks are on different chains, then just continue processing packet
  • If any locks are on the same chain, then restart processing current packet
[Flowchart: check global lock registers; set? (No: continue processing packet); compare lock vals to current chain; equal? (No: continue processing packet; Yes: restart processing packet)]
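The lock check performed after each context switch can be modeled as below. This is a sketch only: the `ChainLocks` class and its method names are hypothetical, plain Python attributes stand in for the global bitmask and the eight lock registers, and the model relies on the assumption that microengine threads switch context only at explicit swap points, so no additional locking of the registers themselves is shown.

```python
NUM_THREADS = 8


class ChainLocks:
    """One bitmask plus one lock register per thread, as on the slides."""

    def __init__(self):
        self.bitmask = 0                        # bit t set: thread t holds a lock
        self.lock_regs = [None] * NUM_THREADS   # head-of-chain pointer per thread

    def acquire(self, thread, chain_head):
        """Called before inserting into or deleting from a chain."""
        self.lock_regs[thread] = chain_head
        self.bitmask |= 1 << thread

    def release(self, thread):
        """Called when the insert or delete is finished."""
        self.bitmask &= ~(1 << thread)
        self.lock_regs[thread] = None

    def must_restart(self, thread, chain_head):
        """Check on return from swap: is another thread modifying my chain?"""
        if self.bitmask == 0:                   # fast path: no locks set
            return False
        for other in range(NUM_THREADS):
            if other == thread:
                continue
            if self.bitmask & (1 << other) and self.lock_regs[other] == chain_head:
                return True                     # same chain: restart this packet
        return False                            # locks are on other chains
```

A thread counting a packet on chain 17 would call `must_restart(my_thread, 17)` after every swap and re-read the chain from its head whenever the result is true.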
Flow Stats Interfaces
[Diagram: QM1, QM2, Flow Stats1, Flow Stats2, 1x10G Tx1, XScale, and SRAM3, with links marked SCR or NN]

Buffer handle word (V: Valid Bit):
V (1b) | Rsv (3b) | Port (4b) | Buffer Handle (24b)

Header and hash result (the information ME1 sends to ME2):
Source Address (32b)
Destination Address (32b)
SrcPort (16b) | DestPort (16b)
Reserved (12b) | Slice ID (12b) | Protocol (8b)
Reserved (9b) | Hash Result (17b)

Flow record (the same 10 words listed under Info Sent to XScale):
Source Address (32b)
Destination Address (32b)
SrcPort (16b) | DestPort (16b)
Reserved (12b) | Slice ID (12b) | Protocol (8b)
Packet Counter (32b)
Byte Counter (32b)
Start Timestamp_high (32b)
Start Timestamp_low (32b)
End Timestamp_high (32b)
End Timestamp_low (32b)
Flow Statistics Module
• Scratch rings
  • QM_TO_FS_RING_1: 0x3800 - 0x3BFF // for receiving from QM
  • QM_TO_FS_RING_2: 0x3C00 - 0x3FFF // for receiving from QM
  • FS_TO_TX_RING_1: 0x4000 - 0x43FF // for sending to TX
  • FS_TO_TX_RING_2: 0x4400 - 0x47FF // for sending to TX
  • FS_TO_XSCALE_RING: 0x???? - 0x???? // for receiving invalidate info from XScale
• SRAM ring
  • FS_FREELIST_RING: 0x???? - 0x???? // stores list of open slots in collision table
• LC Egress SRAM Channel 3 info for Flow Stats
  • SM_RECORD_SIZE = 11 * 4 // 11 32-bit words/record * 4 bytes/word
  • TOTAL_NUM_RECORDS = 95104 // MAX with 4 MB table is ~95K records (should be divisible by 32)
  • SM_NUM_HASH_TABLE_RECORDS = 71328 // <= TOTAL_NUM_RECORDS
  • SM_NUM_CHAINED_RECORDS = TOTAL_NUM_RECORDS - SM_NUM_HASH_TABLE_RECORDS
  • BIT_VECTOR_SIZE_IN_BYTES = TOTAL_NUM_RECORDS / 8
  • BIT_VECTOR_SIZE_IN_WORDS = TOTAL_NUM_RECORDS / 32
  • SM_HASH_TABLE_BASE = 0x100000
  • SM_HASH_COLLISION_TABLE_BASE = (SM_HASH_TABLE_BASE + (SM_RECORD_SIZE * SM_NUM_HASH_TABLE_RECORDS))
  • SM_VALID_RECORD_BIT_VECTOR_BASE = (SM_HASH_COLLISION_TABLE_BASE + (SM_RECORD_SIZE * SM_NUM_CHAINED_RECORDS))
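Evaluating the SRAM constants shows where each region lands. The sketch below transcribes them into Python, using SM_NUM_HASH_TABLE_RECORDS as the main-table size throughout; `REGION_END`, `WINDOW_END`, and `MAIN_ONLY_END` are hypothetical names added for the check, and the 4 MB window is assumed (not stated on the slide) to start at SM_HASH_TABLE_BASE.

```python
SM_RECORD_SIZE = 11 * 4                    # 44 bytes per record
TOTAL_NUM_RECORDS = 95104
SM_NUM_HASH_TABLE_RECORDS = 71328
SM_NUM_CHAINED_RECORDS = TOTAL_NUM_RECORDS - SM_NUM_HASH_TABLE_RECORDS
BIT_VECTOR_SIZE_IN_BYTES = TOTAL_NUM_RECORDS // 8
BIT_VECTOR_SIZE_IN_WORDS = TOTAL_NUM_RECORDS // 32

SM_HASH_TABLE_BASE = 0x100000
SM_HASH_COLLISION_TABLE_BASE = (
    SM_HASH_TABLE_BASE + SM_RECORD_SIZE * SM_NUM_HASH_TABLE_RECORDS)
SM_VALID_RECORD_BIT_VECTOR_BASE = (
    SM_HASH_COLLISION_TABLE_BASE + SM_RECORD_SIZE * SM_NUM_CHAINED_RECORDS)

# Hypothetical check values: end of the region versus an assumed 4 MB window.
REGION_END = SM_VALID_RECORD_BIT_VECTOR_BASE + BIT_VECTOR_SIZE_IN_BYTES
WINDOW_END = SM_HASH_TABLE_BASE + 4 * 1024 * 1024
MAIN_ONLY_END = SM_VALID_RECORD_BIT_VECTOR_BASE + SM_NUM_HASH_TABLE_RECORDS // 8
```

Under that assumption, the collision table starts at 0x3FE380 and the bit vector at 0x4FDA00. A bit vector sized from TOTAL_NUM_RECORDS would end 2,160 bytes past the window, while one sized from the main-table entries only (one bit per chain head, as on the Hash Table Memory slide) would end inside it.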