Flow Stats Module--Control John DeHart and James Moscola (Original FastPath Design) August 2008
[Figure: SPP V1 LC Egress with 1x10Gb/s Tx. Blocks shown: Switch, RBUF, MSF, Rx1, Rx2, Key Extract, Lookup (TCAM), Hdr Format, Port Splitter, QM0-QM3, Flow Stats1, Flow Stats2, Stats (1 ME), 1x10G Tx1, 1x10G Tx2, TBUF, RTM, NAT Miss Scratch Ring, NAT Pkt return, XScale, SRAM1-SRAM3, Freelist, Archive Records. Links between blocks are labeled SCR, SRAM or NN.]
[Figure: SPP V1 LC Egress with 10x1Gb/s Tx. Blocks shown: Switch, RBUF, MSF, Rx1, Rx2, Key Extract, Lookup (TCAM), Hdr Format, Port Splitter, QM0-QM3, Flow Stats1, Flow Stats2, Stats (1 ME), 5x1G Tx1 (P0-P4), 5x1G Tx2 (P5-P9), TBUF, RTM, NAT Miss Scratch Ring, NAT Pkt return, XScale, SRAM1-SRAM3, Freelist, Archive Records. Links between blocks are labeled SCR, SRAM or NN.]
Overview of Flow Stats • Main functions • Uniquely identify flows based on 6-tuple • Hash header values to get an index into a table of records • Maintain packet and byte counts for each flow • Compare packet header with header values in record, and increment if same • Otherwise, follow hash chain until correct record is found • Send flow information to XScale for archiving every five minutes • Secondary functions • Maintain hash table • Identify and remove flows that are no longer active • Invalid flows are removed so memory can be reused
Design Considerations • Efficiently maintaining a hash table with chained collisions • Efficiently inserting and deleting records • Efficiently reading hash table records • Synchronization issues • Multiple threads modifying hash table and chains
Flow Record • Total Record Size = 8 32-bit words • V is valid bit • Only needed at head of chain • ‘1’ for valid record • ‘0’ for invalid record • Start timestamp (16 bits) is set when record starts counting flow • Reset to zero when record is archived • End timestamp (16 bits) is set each time a packet is seen for the given flow • Packet and Byte counters are incremented for each packet on the given flow • Reset to zero when record is archived • Next Record Number is next record in hash chain • 0x1FFFF if record is tail • Address of next record = (next_record_num * record_size) + collision_table_base_addr
Record layout:
LW0: Source Address (32b)
LW1: Destination Address (32b)
LW2: SrcPort (16b) | DestPort (16b)
LW3: Reserved (6b) | TCP Flags (6b) | Slice ID (VLAN) (12b) | Protocol (8b)
LW4: V (1b) | Reserved (14b) | Next Record Number (17b)
LW5: Packet Counter (32b)
LW6: Byte Counter (32b)
LW7: Start Timestamp (16b) | End Timestamp (16b)
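The record layout can be sketched as a packing routine. This is an illustrative Python model only, not the microengine code; it assumes the fields in each word are listed from most significant to least significant bit, and the function and argument names are hypothetical.

```python
import struct

HASH_CHAIN_TAIL = 0x1FFFF  # next-record value that marks the tail of a chain

def pack_flow_record(src, dst, sport, dport, tcp_flags, slice_id, proto,
                     valid, next_rec, pkts, nbytes, start_ts, end_ts):
    """Pack one flow record into 8 big-endian 32-bit words (32 bytes).

    Assumption: fields are laid out MSB-first in the order the slide lists them.
    """
    lw2 = ((sport & 0xFFFF) << 16) | (dport & 0xFFFF)
    # LW3: Reserved(6) | TCP Flags(6) | Slice ID(12) | Protocol(8)
    lw3 = ((tcp_flags & 0x3F) << 20) | ((slice_id & 0xFFF) << 8) | (proto & 0xFF)
    # LW4: V(1) | Reserved(14) | Next Record Number(17)
    lw4 = ((valid & 1) << 31) | (next_rec & 0x1FFFF)
    # LW7: Start Timestamp(16) | End Timestamp(16)
    lw7 = ((start_ts & 0xFFFF) << 16) | (end_ts & 0xFFFF)
    return struct.pack(">8I", src, dst, lw2, lw3, lw4, pkts, nbytes, lw7)

# Example: a valid head-of-chain record with no successor.
record = pack_flow_record(0x80FC99CD, 0x80FC99D3, 52255, 443, 0x1F, 2, 6,
                          1, HASH_CHAIN_TAIL, 11, 1581, 0, 0)
words = struct.unpack(">8I", record)
```

The 32-byte result matches the stated record size of 8 words * 4 bytes.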
Timestamp Details • Timestamp on XScale is 64 bits • Storing 64-bit start and end timestamps would cause each flow record to be too large for a single SRAM read • Instead, only store the 16 bits of each timestamp required to represent a five-minute time interval • Clock frequency = 1.4 GHz • Timestamp increments every 16 clock cycles • Use bits 41:26 for 16-bit timestamps • (2^26 * 16 cycles) / 1.4 GHz = 0.767 seconds • (2^41 * 16 cycles) / 1.4 GHz = 25131.69 seconds (418 minutes) • Time interval that can be represented using these bits • 0.767 seconds through 418 minutes
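The arithmetic on this slide can be checked with a short Python sketch. The constants come from the slide; the function names are hypothetical.

```python
CLOCK_HZ = 1.4e9        # ME clock frequency
CYCLES_PER_TICK = 16    # the 64-bit timestamp increments every 16 cycles
TS_SHIFT = 26           # lowest bit of the 41:26 window
TS_MASK = 0xFFFF        # 16 bits are kept

def compress_timestamp(ts64):
    """Keep only bits 41:26 of the 64-bit timestamp."""
    return (ts64 >> TS_SHIFT) & TS_MASK

def bit_seconds(bit):
    """Time represented by a single timestamp bit, in seconds."""
    return (1 << bit) * CYCLES_PER_TICK / CLOCK_HZ

resolution = bit_seconds(26)   # about 0.767 s per 16-bit timestamp unit
top_bit = bit_seconds(41)      # about 25131.69 s, roughly 418 minutes
```

Because only 16 bits are kept, the compressed value wraps once bit 42 of the full timestamp changes, so comparisons between start and end timestamps have to tolerate wraparound.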
Hash Table Memory • Allocating 4 MBytes in SRAM Channel 3 for hash table • Supports ~130K records • Divided memory 75% for the main table and 25% for the collision table • Memory required = Main_table_size + Collision_table_size = .75*(#records * #bytes/record) + .25*(#records * #bytes/record) = ~98K records + ~32K records = ~3 MBytes + ~1 MByte • Space for main table and collision table can be adjusted to tune performance • Larger main table means fewer collisions, but still need adequate space for collision table
Inserting Into Hash Table • IXP has 3 different hash functions (48-bit, 64-bit, 128-bit) • Using 64-bit hash function is sufficient and takes less time than 128-bit hash function • Source Addr and Protocol are not included in the hash input • HASH(D.Addr, S.Port, D.Port); • Result of hash is used to address the main hash table • Since we want ~100K records in main table, result of hash is used to get as close to 100K entries as possible by adding a 16-bit and a 15-bit chunk from the hash result • hash_result(15:0) + hash_result(30:16) = record_number • Records in the main table represent the head of a chain • If slot at head of chain is empty (valid_bit=0), store record there • If slot at head of chain is occupied, compare 6-tuple • If 6-tuple matches • If packet_count == 0 then (existing flows will have a packet_count of 0 when previous packets on the flow have just been archived) • Increment packet_counter for record • Add size of current packet to byte_counter • Set start and end timestamps • If packet_count > 0 then • Increment packet_counter for record • Add size of current packet to byte_counter • Set end timestamp • If the 6-tuple doesn’t match, a collision has occurred and the record needs to be stored in the collision table
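The index fold and the counter update rules above can be sketched in Python. This is a model for illustration, not the microengine code; the dictionary keys and function names are hypothetical.

```python
def record_number(hash_result):
    """Fold the hash result into a main-table index: bits 15:0 plus bits 30:16."""
    return (hash_result & 0xFFFF) + ((hash_result >> 16) & 0x7FFF)

def count_packet(rec, pkt_len, now):
    """Update a matching record for one packet.

    A packet count of 0 means the flow was just archived (or is new),
    so the start timestamp is set again along with the end timestamp.
    """
    if rec["pkts"] == 0:
        rec["start"] = now
    rec["pkts"] += 1
    rec["bytes"] += pkt_len
    rec["end"] = now

# Largest possible index: 0xFFFF + 0x7FFF = 98302, which fits a 98304-entry table.
max_index = record_number(0x7FFFFFFF)

rec = {"pkts": 0, "bytes": 0, "start": 0, "end": 0}
count_packet(rec, 100, 7)
count_packet(rec, 60, 9)
```

The fold keeps every index below the 98304-entry main table size without a modulo operation.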
Hash Collisions • Hash collisions are chained in linked list • Head of list is in the main table • Remainder of list is in collision table • SRAM ring maintains list of free slots in collision table • Slots are numbered from 0 to #_Collision_Table_Slots • Same as next_record_number • To convert to memory address: (slot_num * record_size) + collision_table_base_addr • When a collision occurs, a pointer to an open slot in the collision table can be retrieved from the SRAM ring • When a record is removed from the collision table, a pointer to the invalidated slot is returned to the SRAM ring
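A minimal Python sketch of the free list and the slot-to-address conversion follows. A `deque` stands in for the SRAM ring, and the base address, record size and slot count are taken from the constants slide later in this deck; the function names are hypothetical.

```python
from collections import deque

RECORD_SIZE = 32                    # 8 words * 4 bytes
COLLISION_TABLE_BASE = 0xC0500000   # LCE_FS_COLLISION_TABLE_BASE
NUM_COLLISION_SLOTS = 32384         # NUM_COLLISION_TABLE_RECORDS

# Stand-in for the SRAM ring: initially every collision slot is free.
free_slots = deque(range(NUM_COLLISION_SLOTS))

def slot_address(slot_num):
    """Convert a slot number (same value as next_record_number) to an address."""
    return slot_num * RECORD_SIZE + COLLISION_TABLE_BASE

def alloc_slot():
    """Take a free slot for a colliding record; None if the table is full."""
    return free_slots.popleft() if free_slots else None

def release_slot(slot_num):
    """Return the slot of a deleted record to the free list."""
    free_slots.append(slot_num)

first = alloc_slot()
first_addr = slot_address(first)
release_slot(first)
```

Storing slot numbers rather than addresses lets the 17-bit Next Record Number field double as the free-list entry.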
Archiving Hash Table Records • Send all valid records in hash table to XScale for archiving every 5 minutes • For each record in the main table (i.e. start of chain) ... • For each record in hash chain ... • If record is valid ... • If packet count > 0 then • Send record to XScale via SRAM ring • Set packet count to 0 • Set byte count to 0 • Leave record in table • If packet count == 0 then • Flow has already been archived • No packet has arrived on flow in 5 minutes • Record is no longer valid • Delete record from hash table to free memory
Info sent to XScale for each flow every 5 minutes:
LW0: Source Address (32b)
LW1: Destination Address (32b)
LW2: SrcPort (16b) | DestPort (16b)
LW3: Reserved (6b) | TCP Flags (6b) | Slice ID (VLAN) (12b) | Protocol (8b)
LW4: Packet Counter (32b)
LW5: Byte Counter (32b)
LW6: Start Timestamp_high (32b)
LW7: Start Timestamp_low (32b)
LW8: End Timestamp_high (32b)
LW9: End Timestamp_low (32b)
Deleting Records from Hash Table • While archiving records • If packet count is zero then remove record from hash table • Record has already been archived, and no packets have arrived in the last five minutes • To remove a record • If ((record == head) && (record == tail)) • Valid_bit = 0 • Else if ((record == head) && (record != tail)) • Replace record with record.next • Free the slot for the moved record • Else if record != head • Set previous record’s next pointer to record.next • Free slot for the deleted record
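The archive pass and the three deletion cases can be modeled for a single chain as below. This is a Python sketch under simplifying assumptions: records are dictionaries, the collision table is a dictionary keyed by slot number, a list stands in for the ring to the XScale, and locking and timestamp resets are omitted. All names are hypothetical.

```python
TAIL = 0x1FFFF  # next-record value that marks the end of a chain

def emit(rec, sent):
    """Archive one record (stand-in for the SRAM ring), then reset its counters."""
    sent.append((rec["key"], rec["pkts"], rec["bytes"]))
    rec["pkts"] = rec["bytes"] = 0

def archive_chain(head, collision, free_slots, sent):
    """Run one archive pass over the chain that starts at main-table record `head`."""
    if not head["valid"]:
        return
    # Idle head: invalidate it if it is also the tail, else pull record.next forward.
    while head["pkts"] == 0:
        if head["next"] == TAIL:
            head["valid"] = False
            return
        slot = head["next"]
        head.update(collision.pop(slot))   # replace head with record.next
        free_slots.append(slot)            # free the slot of the moved record
    emit(head, sent)
    prev = head
    while prev["next"] != TAIL:
        slot = prev["next"]
        rec = collision[slot]
        if rec["pkts"] == 0:
            prev["next"] = rec["next"]     # unlink the idle record
            del collision[slot]
            free_slots.append(slot)
        else:
            emit(rec, sent)
            prev = rec

# Chain A(idle) -> B(active) -> C(idle)
head = {"valid": True, "key": "A", "pkts": 0, "bytes": 0, "next": 5}
collision = {5: {"key": "B", "pkts": 3, "bytes": 300, "next": 7},
             7: {"key": "C", "pkts": 0, "bytes": 0, "next": TAIL}}
free_slots, sent = [], []
archive_chain(head, collision, free_slots, sent)
```

After this pass B has moved into the main table and been archived, and both collision slots are back on the free list. A second pass with no new packets would find B idle and clear the valid bit.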
Memory Synchronization Issues • Multiple threads reading/writing same block of memory • Only allow 1 ME to modify structure of hash table • Inserting and deleting nodes • Use global registers to indicate that the structure of the hash table is being modified • Eight global lock registers (1 per thread) to indicate which chain in the hash table is being modified • When a thread wants to insert/delete a record from hash table • Store pointer to the head of the hash chain in the thread’s dedicated global lock register • If another thread is processing a packet that hashed to the same hash chain, it waits for the lock register to clear and restarts processing the packet • Otherwise, continue processing the packet normally • Clear global lock register when done with inserts/deletes • Value of 0xFFFFFFFF indicates that lock is clear
Flow Stats Execution • ME 1 • Init - Configure hash function • 8 threads • Read packet header • Hash packet header • Send header and hash result to ME2 for processing • ME 2 (thread numbers may need adjusting) • Init - Load SRAM ring with addresses for each slot in the collision table • Init - Set TIMESTAMP to 0 • 7 threads (ctx 1-7) • Insert records into hash table • Increment counter for records • 1 thread (ctx 0) • Archive and delete hash table records
Diagram of Flow Stats Execution (ME1) get buffer handle from QM 60 cycles 300 cycles read packet header (DRAM) 300 cycles read buffer descriptor (SRAM) 300 cycles 150 cycles send buffer handle to TX 60 cycles build hash key ~50 cycles compute hash 100 cycles send packet info to ME2 60 cycles ~570 cycles
Locking head of chain Iterating through hash chain Diagram of Flow Stats Execution (ME2) • Incrementing Counters • Adds records to hash chain, but doesn’t remove them Best: ~360 cycles Worst: ~520 +160x 60 cycles get packet info from ME1 150 cycles x read hash table record (SRAM) Yes No No valid? match? tail? compare record to header read next record in chain ~10 cycles 150 cycles Yes Yes Yes count==0? set register to lock chain No No 150 cycles set register to lock chain set register to lock chain set register to lock chain get record slot from freelist 150 cycles 150 cycles 150 cycles 150 cycles insert new record Write START/END time & new counts Write END time & new counts insert new record clear lock register clear lock register clear lock register clear lock register
Locking head of chain Waiting to archive Diagram of Flow Stats Execution (ME2) • Archiving Records • Removes records from hash chain, but doesn’t add them • Processing of archiving records occurs every five minutes count == 0? Yes head of list? Yes tail of list? Yes read current time No No No No 5 minutes? send record to XScale set register to lock chain set register to lock chain set register to lock chain Yes set register to lock chain read next record from main table write next_ptr to previous list item read record.next set valid bit to zero reset counters and timestamps Yes valid? clear lock register replace record with record.next clear lock register No clear lock register return record slot to freelist clear lock register read next record in chain return record.next slot to freelist Yes more records in chain? No No done with all records? Yes
Return from Swap • When returning from each CTX switch, always check global lock registers • If any of the global locks contain the address of the hash chain that the current thread is trying to modify, then the hash chain is locked and the current thread must restart processing on the current packet • If none of the global locks contain the address of the hash chain that the current thread is trying to modify, then the current thread can just continue processing that packet as usual [Flowchart: check global lock values; match current chain? Yes: restart processing packet; No: continue processing packet]
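The lock-register check can be sketched as follows. This Python model only illustrates the decision logic; it has no real concurrency, and it assumes a thread ignores its own register when checking (the slides do not say either way). Function names are hypothetical.

```python
LOCK_CLEAR = 0xFFFFFFFF   # value that indicates a lock register is clear
NUM_THREADS = 8           # one global lock register per thread

lock_regs = [LOCK_CLEAR] * NUM_THREADS

def lock_chain(ctx, head_addr):
    """Announce that thread `ctx` is inserting into or deleting from this chain."""
    lock_regs[ctx] = head_addr

def unlock_chain(ctx):
    """Clear the thread's lock register once the insert or delete is done."""
    lock_regs[ctx] = LOCK_CLEAR

def must_restart(ctx, head_addr):
    """Check run after every context swap: is another thread modifying our chain?"""
    return any(reg == head_addr
               for other, reg in enumerate(lock_regs) if other != ctx)

lock_chain(0, 0xC0200040)                    # archiver (ctx 0) locks a chain
blocked = must_restart(3, 0xC0200040)        # ctx 3 hashed to the same chain
unaffected = must_restart(3, 0xC0200060)     # a different chain is not affected
unlock_chain(0)
after_unlock = must_restart(3, 0xC0200040)
```

Because the registers hold chain head addresses rather than a single global flag, threads working on other chains are never stalled.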
[Figure: SPP V1 LC Egress with 1x10Gb/s Tx, reduced to the Flow Stats path (QM0-QM3, Flow Stats1, Flow Stats2, 1x10G Tx1, SRAM3, Freelist, XScale, Archive Records) and annotated with the data formats below.]
Buffer handle word: V (1b) | Rsv (3b) | Port (4b) | Buffer Handle (24b), where V is the valid bit
Packet info record (carries the hash result):
Source Address (32b)
Destination Address (32b)
SrcPort (16b) | DestPort (16b)
Rsv (2b) | TCP Flags (6b) | Packet Length (16b) | Protocol (8b)
Rsvd (3b) | Hash Result (17b) | Slice ID (VLAN) (12b)
Archive record:
Source Address (32b)
Destination Address (32b)
SrcPort (16b) | DestPort (16b)
Reserved (6b) | TCP Flags (6b) | Slice ID (VLAN) (12b) | Protocol (8b)
Packet Counter (32b)
Byte Counter (32b)
Start Timestamp_high (32b)
Start Timestamp_low (32b)
End Timestamp_high (32b)
End Timestamp_low (32b)
Flow Statistics Module
• Scratch rings
  • QM_TO_FS_RING_1: 0x2400 - 0x27FF // for receiving from QM
  • QM_TO_FS_RING_2: 0x2800 - 0x2BFF // for receiving from QM
  • FS1_TO_FS2_RING: 0x2C00 - 0x2FFF // for sending data from FS1 to FS2
  • FS_TO_TX_RING_1: 0x3000 - 0x33FF // for sending data to TX1
  • FS_TO_TX_RING_2: 0x3400 - 0x37FF // for sending data to TX2
• SRAM rings
  • FS2_FREELIST: 0x???? - 0x???? // stores list of open slots in collision table
  • FS2_TO_XSCALE: 0x???? - 0x???? // for sending record information to the XScale for archiving
• LC Egress SRAM Channel 3 info for Flow Stats
  • HASH_CHAIN_TAIL 0x1FFFF // indicates the end of a hash chain
  • ARCHIVE_DELAY 0x0188 // 5 minutes
  • RECORD_SIZE 8 * 4 = 32 // 8 32-bit words/record * 4 bytes/word
  • TOTAL_NUM_RECORDS 130688 // MAX with 4 MB table is ~130K records
  • NUM_HASH_TABLE_RECORDS 98304 // NUM_HASH_TABLE_RECORDS <= TOTAL_NUM_RECORDS (mod 32 = 0)
  • NUM_COLLISION_TABLE_RECORDS TOTAL_NUM_RECORDS - NUM_HASH_TABLE_RECORDS = 32384
  • LCE_FS_HASH_TABLE_BASE SRAM_CHANNEL_3_BASE_ADDR + 0x200000 = 0xC0200000
  • LCE_FS_HASH_TABLE_SIZE 0x400000
  • LCE_FS_COLLISION_TABLE_BASE (HASH_TABLE_BASE + (RECORD_SIZE * NUM_HASH_TABLE_RECORDS)) = 0xC0500000
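The derived constants on this slide can be cross-checked with a few lines of Python. The value of SRAM_CHANNEL_3_BASE_ADDR is not given directly; 0xC0000000 is inferred here from 0xC0200000 - 0x200000. The ARCHIVE_DELAY check assumes the delay is counted in 16-bit timestamp units (bit 26 of the 64-bit timestamp, about 0.767 seconds each).

```python
RECORD_SIZE = 8 * 4                      # 8 32-bit words * 4 bytes/word
TOTAL_NUM_RECORDS = 130688
NUM_HASH_TABLE_RECORDS = 98304
NUM_COLLISION_TABLE_RECORDS = TOTAL_NUM_RECORDS - NUM_HASH_TABLE_RECORDS

SRAM_CHANNEL_3_BASE_ADDR = 0xC0000000    # inferred, not stated on the slide
LCE_FS_HASH_TABLE_BASE = SRAM_CHANNEL_3_BASE_ADDR + 0x200000
LCE_FS_HASH_TABLE_SIZE = 0x400000
LCE_FS_COLLISION_TABLE_BASE = (LCE_FS_HASH_TABLE_BASE
                               + RECORD_SIZE * NUM_HASH_TABLE_RECORDS)

ARCHIVE_DELAY = 0x0188
TIMESTAMP_UNIT_SECONDS = (1 << 26) * 16 / 1.4e9   # about 0.767 s
archive_delay_seconds = ARCHIVE_DELAY * TIMESTAMP_UNIT_SECONDS

bytes_used = TOTAL_NUM_RECORDS * RECORD_SIZE
```

Under these assumptions the numbers are mutually consistent: the collision table starts exactly 3 MBytes after the main table, all records fit in the 4 MByte region, and 0x0188 units come to roughly 300 seconds.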
Overview of Flow Stats • 2 MEs in Fastpath to collect flow data for each pkt • Byte counter per flow • Pkt counter per flow • Archive data to XScale via SRAM ring every 5 minutes • XScale control daemon(s) to process data • Receive flow information from MEs • Reformat to put into PlanetFlow format • Maintain databases for PlanetLab archiving and for identifying internal flows (pre-NAT translation) when an external flow (post-NAT) has a complaint lodged against it.
Flow Record • Total Record Size = 8 32-bit words • V is valid bit • Only needed at head of chain • ‘1’ for valid record • ‘0’ for invalid record • Start timestamp (16 bits) is set when record starts counting flow • Reset to zero when record is archived • End timestamp (16 bits) is set each time a packet is seen for the given flow • Packet and Byte counters are incremented for each packet on the given flow • Reset to zero when record is archived • For TCP flows, the TCP Flags are OR’ed in from each packet • Next Record Number is next record in hash chain • 0x1FFFF if record is tail • Address of next record = (next_record_num * record_size) + collision_table_base_addr
Record layout:
LW0: Source Address (32b)
LW1: Destination Address (32b)
LW2: SrcPort (16b) | DestPort (16b)
LW3: Reserved (6b) | TCP Flags (6b) | Slice ID (VLAN) (12b) | Protocol (8b)
LW4: V (1b) | Reserved (14b) | Next Record Number (17b)
LW5: Packet Counter (32b)
LW6: Byte Counter (32b)
LW7: Start Timestamp (16b) | End Timestamp (16b)
Overview of Flow Stats Control • Main functions • Collection of Flow Information for PlanetLab Node • Used when a complaint is lodged about a misbehaving flow • Must be able to identify flow and the Slice that produced it. • Aggregation of Flow Information from: • Multiple GPEs • Multiple NPEs • Correlation with NAT records to identify internal flow and external flow • External flow will be what complaint will be about. • Internal flow will be what involved PlanetLab researcher will know about.
Overview of PlanetFlow • PlanetFlow • Unprivileged slice • Flow Collector: • Ulogd (fprobe-ulog) • Netlink socket • Uses VSys for privileged operations • Every 5 minutes dumps its cache to DB • DB: • On PlanetLab Node • 5-minute records • Flows spanning 5-minute intervals aggregated daily. • Central Archive • At Princeton? • Updated periodically by using rsync to retrieve new DB entries from ALL PlanetLab nodes.
SPP PlanetFlow [Figure: data flow between the GPEs, the CP, the Central Archive and the line card. Components shown: PF DB (on each GPE and at the Central Archive), Ext PF DB, dbAccumulator, rsync, NATd, FSd, SCD, Ingress XScale, Egress XScale, MEs (HF, LK, FS2), FlowStats SRAM Ring, NAT Scratch Rings. Data shown: NAT records, Flow records.] • Central Archive Record = <time, sliceID, Proto, SrcIP, SrcPort, DstIP, DstPort, PktCnt, ByteCnt> • Ext PF DB Record = <Central Archive Record>
Translations needed • NPE Flow Records: • VLAN to SliceID • Comes from SRM • IXP timestamp to wall clock time • SCD records wall clock time it started IXP • How do we manage time slip between clocks? • GPE Flow Records: • NAT Port translations • Src Port from GPE record becomes SPP Orig Src Port • Src Port from natd translation record becomes Src Port • natd provides port translation updates
SPP PlanetFlow Databases [Figure: PF DB on the GPE, Ext PF DB on the CP, and the Central Archive, each annotated with its record format.] • Database record format (PF DB, Ext PF DB, Central Archive): <time, sliceID, proto, srcIP, srcPort, dstIP, dstPort, pktCnt, byteCnt> • Flow records: <time, sliceID, proto, srcIP, srcPort, dstIP, dstPort, pktCnt, byteCnt> • NAT records: <time, proto, srcIP, intSrcPort, xlatedSrcPort>
Merging of DBs • NPE Flows • No NAT • Goes directly into Ext PF DB • SPP Orig Src Port == SrcPort • Do they need SliceID translation? • We use the VLAN, but this probably needs to be the PlanetLab version of a Slice ID. • SRM will provide a VLAN to SliceID translation • Where and When? • GPE Configured Flows • No NAT • Goes directly into Ext PF DB • SPP Orig Src Port == SrcPort • GPE NAT Flows • Find corresponding NAT Record, extract Translated SrcPort • Insert record into Ext PF DB with original SrcPort moved to SPP Orig Src Port • Set Src Port to translated SrcPort • CP Traffic?
Version Count Unix nSecs Unix Secs PlanetFlow Raw Data NetFlow Header (beginning of file and repeats every 30 flow records) 0005 0011 8e10638b 48a40477 00062638 0000371d 0000 0000 80fc99cd 80fc99d3 00000000 0000 0004 0000000b 0000062d 8dae5570 8dae558b cc1f 01bb 00 1f 0600 0000 0000 02000000 80fc99cd 80fc99d3 00000000 0000 0004 0000001a 000008b7 8dae54eb 8dae5533 cc1e 01bb 001e 0600 0000 0000 02000000 Uptime Engine Type (unused) Engine Id (unused) Pad16 (unused) Flow Sequence SA DA 128.252.153.205 128.252.153.211 IPv4 NextHop (Unused) In SNMP (if_nametoindex) Out SNMP (if_nametoindex) Pkt Count Byte Count NetFlow Flow Record 11 1581 First Switched (flow creation time) Last Switched (time of last pkt) Tcp flags Proto Src Tos Src Port Dst Port Pad 443 52255 Src As (Unused) Dst As (Unused) XID (SliceID) SA DA 128.252.153.205 128.252.153.211 IPv4 NextHop (Unused) NetFlow Flow Record In SNMP (if_nametoindex) Out SNMP (if_nametoindex) Pkt Count Byte Count 26 2231 First Switched (flow creation time) Last Switched (time of last pkt) Tcp flags Proto Src Tos Src Port Dst Port Pad 443 52254 Src As (Unused) Dst As (Unused) XID (SliceID)
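The header bytes in this dump follow the standard 24-byte NetFlow version 5 header, so they can be decoded with Python's `struct` module. This is a sketch for checking the dump, not part of fprobe-ulog; the tuple and function names are hypothetical, and the last 16 bits are called `pad16` as on the slide.

```python
import struct
from collections import namedtuple

NetFlowV5Header = namedtuple(
    "NetFlowV5Header",
    "version count uptime_ms unix_secs unix_nsecs flow_sequence "
    "engine_type engine_id pad16")

HEADER_FMT = ">HHIIIIBBH"                  # network byte order, 24 bytes
HEADER_LEN = struct.calcsize(HEADER_FMT)

def parse_header(data):
    """Decode the NetFlow v5 header at the start of `data`."""
    return NetFlowV5Header(*struct.unpack(HEADER_FMT, data[:HEADER_LEN]))

# Header bytes copied from the raw data on the slide.
raw = bytes.fromhex("0005 0011 8e10638b 48a40477 00062638 0000371d 0000 0000")
hdr = parse_header(raw)
```

For this dump the decoded version is 5 and the count is 0x11 = 17 valid flow records; the Unix seconds value 0x48a40477 falls in August 2008, matching the date of the deck.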
Version Count Unix nSecs Unix Secs SPP/PlanetFlow Raw Data NetFlow Header (beginning of file and repeats every 30 flow records) 0005 0011 8e10638b 48a40477 00062638 0000371d xx yy 0000 80fc99cd 80fc99d3 00000000 0000 0004 0000000b 0000062d 8dae5570 8dae558b cc1f 01bb 00 1f 0600 zzzz 0000 02000000 80fc99cd 80fc99d3 00000000 0000 0004 0000001a 000008b7 8dae54eb 8dae5533 cc1e 01bb 001e 0600 zzzz 0000 02000000 Uptime (msecs) SPP Engine Type SPP Engine Id Pad16 (unused) Flow Sequence SA DA 128.252.153.205 128.252.153.211 IPv4 NextHop (Unused) In SNMP (if_nametoindex) Out SNMP (if_nametoindex) Pkt Count Byte Count NetFlow Flow Record 11 1581 First Switched(msec) (flow creation time) Last Switched(msec) (time of last pkt) Tcp flags Proto Src Tos Src Port Dst Port Pad 443 52255 SPP Orig Src Port Dst As (Unused) XID (SliceID) SA DA 128.252.153.205 128.252.153.211 IPv4 NextHop (Unused) NetFlow Flow Record In SNMP (if_nametoindex) Out SNMP (if_nametoindex) Pkt Count Byte Count 26 2231 First Switched (flow creation time) Last Switched (time of last pkt) Tcp flags Proto Src Tos Src Port Dst Port Pad 443 52254 SPP Orig Src Port Dst As (Unused) XID (SliceID)
Issues and Notes • Time: • Keeping time in sync among various machines: • Flow Stats ME timestamps with IXP clock ticks. • Something has to convert this to a Unix time. • GPE(s) timestamp with Unix gettimeofday(). • CP collects flow records and aggregates based on time. • Proposal: • XScale, GPE(s) and CP will use NTP to keep their Unix times in sync • At the beginning of each reporting cycle, the Flow Stats ME should send a timestamp record just to allow the XScale and CP to keep the time in sync. • OR: Can the XScale read the IXP clock tick and report it to the CP along with the XScale’s Unix time? • What are the times that are recorded in the Header and Flow Records? • Header • Uptime (msecs): msecs since a base start time • Time since Unix Epoch: time since January 1, 1970 • Unix secs • Unix nSecs • Uptime and Unix (secs, nSecs) represent the SAME time • So that the Flow times can be calculated based on them. • Flow Record • First Switched (flow creation time): msecs since a base start time • Last Switched (last packet in flow seen time): msecs since base start time
Issues and Notes (continued) • NetFlow Header • Filled in AFTER 30 flow records are filled in OR we get a timeout (10 minutes) • COUNT field tells how many flow records are valid. • File or data packet is ALWAYS padded out to a size that would hold 30 flow records • Flow Sequence: Running total of number of flow records emitted. • Flow Header and Flow Records • Emitted in chunks of 30 flow records plus a Flow Header • Emitted either by writing to a file or sending over a socket to a mirror site. • Padded out to a size that would hold 30 flow records. • A flow is emitted when it has been inactive for at least a minute or when it has been active for at least 5 minutes. • Fprobe-ulog threads: • emit_thread • scan_thread • cap_thread • unpending_thread • Flow lists • flows[]: hashed array of flows, buckets chained off head of list • These are flows that have been reported over netlink socket • flows_emit: linked list of flows ready to be emitted.
Issues and Notes (continued) • VLANs and SliceIDs • NPE and LC use VLANs to differentiate Slices • Flow records must record slice IDs • SRM will provide VLAN to SliceID translation • GPE(s) do not differentiate Slices by VLAN. • All flows from a GPE will use the same VLAN • GPE keeps flow records locally using Slice ID • Flow Stats ME could ignore GPE flow packets if it was told what the default GPE VLAN was. • Otherwise, one of the fs daemons could drop the flow records for the GPE flows that the Flow Stats ME reports. • Slice ID: • What exactly is it? • Is the XID that is recorded by PlanetFlow actually the slice id or is it the VServer id?
Issues and Notes (continued) • NAT Port Translations • GPE flow records are the ones that need the NAT Port translation data • GPE flow records will come across from the GPE(s) to the CP via rsync or similar • natd will report NAT port translations with timestamps to the fs daemon • fs daemon will have to maintain NAT port translations (with their timestamps) for possible later correlation with GPE flow records • GPE(s) will all use the same default VLAN • SRM will send this VLAN to scd so it can write it to SRAM for the fs ME to read in • Fs ME will then filter out GPE flow records. • SRM-fsd messaging • srm will push out VLAN-to-SliceID translation creation and deletion messages • srm will wait ~10 minutes before re-using a VLAN • srm will send the delete VLAN message after waiting the 10 minutes. • fsd should not have to keep any history of VLAN/SliceID translations • It should get the creation before it receives any flow records for it • It should get the last flow record before it gets the deletion • fsd will also be able to query SRM for current translation • This will facilitate a restart of the fsd while the SRM maintains current state.
Issues and Notes (continued) • rsync of flow record files from GPE(s) to CP • A particular run of rsync may get a file that is still being written to by fprobe-ulog on the GPE • A subsequent rsync may get the file again with additional records in it. • Sample rsync command: • rsync --timeout 15 -avzu -e "ssh -i /vservers/plc1/etc/planetlab/root_ssh_key.rsa " root@drn02:/vservers/pl_netflow/pf /root/pf • This will report the files that have been copied over
Issues and Notes (continued) • Sample fprobe-ulog command: • /sbin/fprobe-ulog -M -e 3600 -d 3600 -E 60 -T 168 -f pf2 -q 1000 -s 30 -D 250000 • Started from /etc/rc.d/rc[2345].d/S56fprobe-ulog • All linked to /etc/init.d/fprobe-ulog • GPE Flow record collection daemon: fprobe-ulog • Scan thread • Collects flow records into a linked list • Emit thread • Periodically writes flow records out to a file • Every 600 seconds – ten minutes! • Daemon can also send flow records to a remote collector! • So we could have the GPEs emit their flow records directly to the flow stats daemon on the CP. • Sample command: • /sbin/fprobe-ulog -M -e 3600 -d 3600 -E 60 -T 168 -f pf2 -q 1000 -s 30 -D 250000 <remote>:<port>[/[<local>][/<type>]] … • There can be multiple remote host specifications • Where • remote: remote host to send to • port: destination port to send to • local: local hostname to use • type: m for mirror-site, r for rotate-site • send to all mirror-sites, rotate through rotate-sites.
SPP PlanetFlow [Figure: data flow between the GPEs, the CP, the Central Archive and the line card. Components shown: fprobe (on each GPE), fsd, srm, natd, scd, rsync, Ext PF DB, Central Archive, Ingress XScale, Egress XScale, MEs (HF, LK, FS2), FlowStats SRAM Ring, NAT Scratch Rings.] • Central Archive Record = <time, sliceID, Proto, SrcIP, SrcPort, DstIP, DstPort, PktCnt, ByteCnt> • Ext PF DB Record = <Central Archive Record>
Plan/Design • Flow Stats daemon, fsd, runs on CP • Collects flow records from GPE(s) and NPE(s) and writes them into a series of PlanetFlow2 files with names: • pf2.#, where # is (0-162) • Current file is closed after N minutes and # is incremented and new file is opened and started. • This mimics what fprobe-ulog does now on the GPE(s) • These files are then collected periodically by PLC for use and archiving • I don’t think there is any explicit indication that PLC has picked up the files but the timing must be such that we know it is done before we roll over the file names and overwrite an old file. • Gets NAT data from natd • Keep records of this with timestamps so we can correlate with flow records coming from GPE(s) • Check with Mart on how this will work • Gets VLAN to sliceID data from srm • srm will send start translation, stop translation msgs with a 10 minute wait period when stopping a translation to make sure we are done with flow records for that slice • FS ME archives records every 5 minutes. • Slices are long lived (right?) so this should not be a problem • Fsd can also request a translation from srm • This is in case fsd has to be restarted while srm and other daemons continue running.
Plan/Design (continued) • Fsd gathers records from GPE(s) and NPE(s) • Gathers flow records from GPE(s) via socket(s) from fprobe-ulog on GPE(s) • Come across as one data packet with up to 30 flow records • Packet is padded out to full 30 flow records with Count in Header indicating how many of them are valid • Update NetFlow header to indicate that this is an SPP and which SPP node it is using Engine Type and Engine ID fields • Update with NAT data and write immediately out to current pf2 file keeping its NetFlow header. • Gathers flow records from NPE(s) via socket from scd on XScale • Come across one flow record at a time • No NetFlow Header • Create NetFlow Header • With appropriate Uptime and UnixTime (secs, nsecs) • With SPP Engine Type and SPP Engine ID • Modify Flow Record times to be msecs correlated with Uptime • Update NPE flow record with SliceID from srm. • Collect NPE records for a period of time or until we get 30 and then write them out to current pf2 file with NetFlow header.
Plan/Design (continued) • FS ME and scd • Use a command field in records coming across from FS ME to scd • Use one command to set current time • When FS ME is starting an archive cycle, first it sends a timestamp command • When scd gets this timestamp command it associates it with a gettimeofday() time and sends the FS ME time and the gettimeofday() time to the fsd on the CP so it can associate ME times with Unix times. • Use another command to indicate flow records • Flow records can be sent directly on to fsd on CP
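The time correlation described here can be sketched in Python. This assumes the ME timestamps sent to the fsd are full-resolution counts that advance once every 16 cycles of the 1.4 GHz clock, as stated on the Timestamp Details slide; the function and parameter names are hypothetical.

```python
CLOCK_HZ = 1.4e9        # ME clock frequency
CYCLES_PER_TICK = 16    # the ME timestamp increments every 16 cycles

def me_to_unix(me_ts, ref_me_ts, ref_unix):
    """Convert an ME timestamp to Unix time using the most recent reference pair.

    ref_me_ts is the ME time carried by the timestamp command and ref_unix is
    the gettimeofday() value that scd associated with it.
    """
    elapsed_ticks = me_ts - ref_me_ts
    return ref_unix + elapsed_ticks * CYCLES_PER_TICK / CLOCK_HZ

# Reference pair from the start of an archive cycle (example values).
ref_me, ref_unix = 5_000_000_000, 1218708599.0
one_second_later = me_to_unix(ref_me + 87_500_000, ref_me, ref_unix)
one_second_earlier = me_to_unix(ref_me - 87_500_000, ref_me, ref_unix)
```

Refreshing the reference pair at every archive cycle limits how much drift between the IXP clock and the Unix clock can accumulate in the converted times.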
Version Count Unix nSecs Unix Secs PlanetFlow Raw Data NetFlow Header (beginning of file and repeats every 30 flow records) 0500 0b00 8385 1bd2 a148 31d4 0f00 f84d 0000 8134 0000 0000 fc80 cd99 bb42 04e0 0000 0000 0000 0400 0000 0500 0000 7c01 2e85 eeb2 6d85 d636 7b00 7b00 0000 0011 0000 0000 0002 0000 fc80 cd99 fc80 d399 0000 0000 0000 0400 0000 1a00 0000 b708 3785 9d52 3785 e352 b1b2 bb01 1e00 0006 0000 0000 0002 0000 Uptime Engine Id (unused) Eng. Type (unused) Pad16 (unused) Flow Sequence SA DA IPv4 NextHop (Unused) In SNMP (if_nametoindex) Out SNMP (if_nametoindex) Pkt Count Byte Count NetFlow Flow Record First Switched (flow creation time) Last Switched (time of last pkt) Tcp flags Src Tos Src Port Dst Port Pad Proto Src As (Unused) Dst As (Unused) XID (SliceID) SA DA IPv4 NextHop (Unused) NetFlow Flow Record In SNMP (if_nametoindex) Out SNMP (if_nametoindex) Pkt Count Byte Count First Switched (flow creation time) Last Switched (time of last pkt) Tcp flags Src Tos Src Port Dst Port Pad Proto Src As (Unused) Dst As (Unused) XID (SliceID) Each 16 bits has bytes swapped
SPP PlanetFlow Databases [Figure: PF DB on the GPE; Ext PF DB and Int PF DB on the CP; and the Central Archive, each annotated with its record format.] • Database record format (PF DB, Ext PF DB, Central Archive): <time, sliceID, proto, srcIP, srcPort, dstIP, dstPort, pktCnt, byteCnt> • Int PF DB record format: <time, sliceID, proto, srcIP, srcPort, dstIP, dstPort, pktCnt, byteCnt, PE ID, intSrcPort> • Flow records: <time, sliceID, proto, srcIP, srcPort, dstIP, dstPort, pktCnt, byteCnt> • NAT records: <time, proto, srcIP, intSrcPort, xlatedSrcPort>
SPP PlanetFlow [Figure: data flow between the GPEs, the CP, the Central Archive and the line card. Components shown: PF DB (on each GPE and at the Central Archive), Ext PF DB, Int PF DB, dbAccumulator, rsync, NATd, FSd, SCD, Ingress XScale, Egress XScale, MEs (HF, LK, FS2), FlowStats SRAM Ring, NAT Scratch Rings. Data shown: NAT records, Flow records.] • Central Archive Record = <time, sliceID, Proto, SrcIP, SrcPort, DstIP, DstPort, PktCnt, ByteCnt> • Ext PF DB Record = <Central Archive Record> • Int PF DB Record = <Central Archive Record, NPE/GPE ID, Internal Src Port>
Merging of DBs • NPE Flows • No NAT • Goes directly into Ext PF DB and into Int PF DB • Internal SrcPort == SrcPort • Do they need SliceID translation? • We use the VLAN, but this probably needs to be the PlanetLab version of a Slice ID. • SRM will provide a VLAN to SliceID translation • Where and When? • GPE Configured Flows • No NAT • Goes directly into Ext PF DB and into Int PF DB • Internal SrcPort == SrcPort • GPE NAT Flows • Find corresponding NAT Record, extract Translated SrcPort • Insert record with translated SrcPort into Ext PF DB • Insert record with internal SrcPort into Int PF DB • CP Traffic?
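The merge rules can be sketched in Python. Several points are assumptions made for illustration: records are dictionaries keyed by the field names from the database slides, NAT records are matched on protocol, source IP and internal source port only (timestamp matching is omitted), and the Int PF DB record is built as the Ext PF DB record plus the PE ID and the internal source port. The function name and `peId` key are hypothetical.

```python
def merge_flow(flow, nat_records, pe_id):
    """Return (ext_record, int_record) for one NPE or GPE flow record."""
    ext = dict(flow)
    for nat in nat_records:
        if (nat["proto"], nat["srcIP"], nat["intSrcPort"]) == \
           (flow["proto"], flow["srcIP"], flow["srcPort"]):
            # GPE NAT flow: the external record carries the translated port.
            ext["srcPort"] = nat["xlatedSrcPort"]
            break
    # No match (NPE flow or GPE configured flow): internal port == SrcPort.
    int_rec = dict(ext, peId=pe_id, intSrcPort=flow["srcPort"])
    return ext, int_rec

flow = {"time": 0, "sliceID": 2, "proto": 6, "srcIP": "128.252.153.205",
        "srcPort": 5000, "dstIP": "128.252.153.211", "dstPort": 443,
        "pktCnt": 11, "byteCnt": 1581}
nat = [{"time": 0, "proto": 6, "srcIP": "128.252.153.205",
        "intSrcPort": 5000, "xlatedSrcPort": 40000}]

ext_nat, int_nat = merge_flow(flow, nat, pe_id="GPE1")
ext_plain, int_plain = merge_flow(flow, [], pe_id="NPE1")
```

A real merge would also have to pick the NAT record whose timestamp covers the flow, since translated ports are reused over time.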