Control Update 1: Phase 0
Fred Kuhns
fredk@arl.wustl.edu
Applied Research Laboratory
Department of Computer Science and Engineering
Washington University in St. Louis
What’s in the slides?
• Some guiding requirements impacting design
• What will the overlay networks look like (to me)
  • simple picture summarizing the relationship between the diversified networking model and our current design
• Mapping IP to Ethernet addresses
  • simple picture depicting how we may associate the MAC layer next hop with a network layer next hop
• Basic slice creation (i.e. conventional PlanetLab slice)
• Creating an NP-based slice (what we add)
• Run-time (production) support
  • dynamic control and configuration requirements/needs
• Boot/configure time support
  • initial configuration of data plane and any debug needs
• Meta-Router control
  • local delivery and exception packets
• Configuration tool (cmd shell)
• Testing: packet generation using sp++.
Goals/Charge
• Create high performance PlanetLab node
• Maintain compatibility with existing plab nodes/interfaces
  • external interfaces same as existing plab
  • where possible conform to existing plab abstractions, models, interfaces and development paradigms.
• Extend interfaces
  • internal interfaces add NP abstractions, distributed resource management.
• Special issues/concerns
  • Node audit service: Meta-Net traffic (flow) accounting conforming to existing netflow stats
  • Virtual machine model and node manager interface: extending rspec to account for NPs
  • Slice model: extending to include heterogeneous nodes (realizing slivers)
Visualizing Ports, Links and Nodes
• Meta-router uses a single UDP port number (i.e. meta-port)
  • any host/router may send traffic to the advertised IP address/UDP port pair
  • only works if all meta-net traffic uses a single line card and physical port
• Meta-router uses a UDP port per physical interface in use.
• UDP tunnels act as meta-links
  • define a unique UDP tunnel between pairs of meta-routers
  • may have multiple UDP ports for each physical interface in use.
[Figure: meta-router ports (P0, P1, …) and the IP addresses they use (IPA, IPB, IPC, IPD).]
Mapping IP to Ethernet Destination: Simple Case
• A Meta-Router encapsulates its packets within a UDP datagram using the destination IP address and port number obtained from the lookup.
• The packet is then sent to the line card encapsulated within an Ethernet 802.1p/q frame. The Ethernet destination address is obtained from the lookup.
• The line card must replace the Ethernet header with one specifying the MAC layer next hop (eth addr).
• For the demo we will assume there is only one next hop Ethernet device.
Simplifying assumption: For a given physical output port, all packets use the same Ethernet header, in particular the same Ethernet destination address regardless of the IP destination address.
[Figure labels: Substrate Router; Meta-Routers MR1, MR2, MR3; Line Card; Ethernet links Eth1, Eth2; IP rtr; addresses IPW, IPX, IPY, IPZ.]
Mapping IP to Ethernet Destination: Not so Simple
• Context: In general we cannot assume there will only be one next hop Ethernet device.
• Problem: We cannot assume the destination IP address corresponds to the next hop Ethernet device (the current design’s built-in assumption).
• Solutions
  • Create table mapping packet IP destination addresses to next hop Ethernet addresses.
  • Line card performs IP route lookup to obtain the next hop IP address then uses ARP.
  • Meta-router supplies the next hop IP address then uses ARP.
  • Meta-router supplies the next hop Ethernet address.
[Figure labels: Substrate Router; Meta-Routers MR1, MR2, MR3; Line Card; Ethernet Switch; Ethernet links Eth1, Eth2, Eth3; two IP rtrs; addresses IPW, IPX, IPY, IPZ.]
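The first solution, a table from packet IP destination address to next hop Ethernet address, can be sketched in a few lines. This is only an illustration under assumptions: the names `NextHopTable`, `register`, `lookup` and `normalize_mac` are hypothetical, and the optional default entry stands in for the single-next-hop case assumed for the demo.

```python
import ipaddress


def normalize_mac(mac):
    """Return a MAC address as lowercase colon-separated hex, or raise ValueError."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"bad MAC address: {mac!r}")
    octets = [int(p, 16) for p in parts]
    if any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError(f"bad MAC address: {mac!r}")
    return ":".join(f"{o:02x}" for o in octets)


class NextHopTable:
    """Hypothetical software map: IP destination -> next hop Ethernet address."""

    def __init__(self, default_eth=None):
        # default_eth models the "only one next hop Ethernet device" assumption.
        self._default = normalize_mac(default_eth) if default_eth else None
        self._map = {}

    def register(self, ip_dst, eth_dst):
        """Record the tuple (IPdst, ETHdst)."""
        self._map[ipaddress.ip_address(ip_dst)] = normalize_mac(eth_dst)

    def lookup(self, ip_dst):
        """Return the Ethernet next hop for ip_dst, or the default if unknown."""
        return self._map.get(ipaddress.ip_address(ip_dst), self._default)


if __name__ == "__main__":
    table = NextHopTable(default_eth="00:00:00:00:00:01")
    table.register("192.168.0.2", "01:02:03:04:05:0A")
    print(table.lookup("192.168.0.2"))
    print(table.lookup("10.0.0.1"))
```

An exact-match dictionary is the simplest choice; the ARP-based alternatives listed above would replace `register` with a route lookup followed by address resolution.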
A meta-router may use multiple physical ports
[Figure labels: Meta-Router (NPE); Ethernet Switch (in chassis); Line Card; RTM; two external Ethernet switches (LAN or Router).]
Basic Slice Creation: No changes
• Slice information is entered into PLC database.
  • Current: Node manager polls PLC for slice data.
  • Planned: PLC contacts Node manager proactively.
• Node manager (pl_conf) periodically retrieves slice table.
  • updates slice information
  • creates/deletes slices
• Node manager (nm) instantiates new virtual machine (vserver) for slice.
• User logs into vserver using ssh
  • uses existing plab mechanism on GPE.
Default configuration: forward traffic to the (single) GPE, in this case the user’s ssh login session.
[Figure labels: NPE; GPE with root ctx, per-slice contexts, NM, RM, slice X, new slice (Y), preallocated ports (UDP), sys-sw, vnet; Ethernet Switch with Eth1, Eth2, Eth3; Line card (NPE).]
Line card lookup table (TCAM):
  filter    result
  TUNX      VLANX, Eth2
  …
  default   VLAN0, Eth1
Requesting NP
• User requests shared-NP
  • Specify code option
  • Request UDP port number for overlay tunnel
  • Request local UDP port for exception traffic
• Substrate Resource Manager
  • Configure SW: Assign local VLAN to new meta router. Enable VLAN on switch ports.
  • Configure NPE: allocates NP with requested code option (decision considers both current load and available options)
  • Configure LC(s): Allocate an externally visible UDP port number (from the preallocated pool of UDP ports for the external IP address). Add filter(s)
    • Ingress: packet’s destination port -to- local (chassis) VLAN and MAC destination address
    • Egress: IP destination address (??) -to- MAC destination address and RTM physical output port
  • Configure GPE: Open local UDP port for exception and local delivery traffic from NPE. Transfer local port (socket) and results to client slice
Exception and local delivery traffic: only need to install filter in TCAM.
Meta-network traffic uses UDP tunnels: only need to install filter in TCAM.
[Figure labels: GPE with root ctx, per-slice contexts, NM, RM, slice X, slice Y, preallocated ports (UDP), sys-sw, vnet; NPE; Ethernet Switch with Eth1, Eth2, Eth3 and VLANY; Line card (NPE).]
Line card lookup table (TCAM):
  filter    result
  TUNX      VLANX, Eth2
  TUNY      VLANY, Eth2
  …
  default   VLAN0, Eth1
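The lookup table in the figure (tunnel filter to VLAN and switch port, with a default entry pointing at the GPE) can be modelled as below. This is a sketch under assumptions: a Python dictionary keyed on the tunnel's UDP destination port stands in for the TCAM, the class and field names are hypothetical, and the port and VLAN numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterResult:
    vlan: int       # local (chassis) VLAN the packet is mapped to
    out_port: str   # switch port leading to the NPE or GPE


class IngressFilterTable:
    """Hypothetical stand-in for the line card TCAM ingress filters."""

    def __init__(self, default):
        self._default = default      # default entry: forward to the GPE
        self._by_udp_port = {}       # tunnel UDP destination port -> result

    def add(self, udp_dport, result):
        if udp_dport in self._by_udp_port:
            raise KeyError(f"filter for UDP port {udp_dport} already installed")
        self._by_udp_port[udp_dport] = result

    def remove(self, udp_dport):
        del self._by_udp_port[udp_dport]

    def lookup(self, udp_dport):
        """Return the matching filter result, or the default entry."""
        return self._by_udp_port.get(udp_dport, self._default)


if __name__ == "__main__":
    # Illustrative numbers only: VLAN 0 / Eth1 mirrors the "default" row.
    table = IngressFilterTable(default=FilterResult(vlan=0, out_port="Eth1"))
    table.add(20000, FilterResult(vlan=10, out_port="Eth2"))   # TUNX
    table.add(20001, FilterResult(vlan=11, out_port="Eth2"))   # TUNY
    print(table.lookup(20001))
    print(table.lookup(22))   # e.g. ssh traffic falls through to the GPE
```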
Software-maintained Tables/Maps
• Mappings/associations needed for creating filters
  • Meta-Port to Physical Interface Table
  • Line Card Next Hop Table
Configure Ethernet Switch: Step 1
• Allocate next unused VLAN id for meta-net.
  • In this scenario can a meta-net have multiple meta-routers instantiated on a node?
  • If so then do we allocate switch bandwidth and a VLAN id for the meta-net or for each meta-router?
• Configure Ethernet switch
  • enable VLAN id on applicable ports
  • need to know line card to meta-port (i.e. IP tunnel) mappings
  • if using external GigE switch then use SNMP (python module pysnmp)
  • if using Radisys blade then use SNMP???
  • set default QoS parameters, which are???
  • other ??
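Allocating the next unused VLAN id could look like the sketch below. The class name and the `first`/`last` bounds are assumptions for illustration; the only fixed facts used are that VID 0 and VID 4095 are not available as ordinary VLAN ids. The SNMP calls that would then enable the id on the switch ports are deliberately left out.

```python
class VlanAllocator:
    """Hypothetical allocator handing out the lowest unused VLAN id."""

    RESERVED = {0, 4095}   # VID 0 = priority-tagged frames, 4095 = reserved

    def __init__(self, first=2, last=4094):
        # 'first' and 'last' are assumed site policy, not taken from the slides.
        if first > last or {first, last} & self.RESERVED:
            raise ValueError("invalid VLAN id range")
        self._first = first
        self._last = last
        self._in_use = set()

    def allocate(self):
        """Return the lowest free VLAN id, or raise RuntimeError if none is left."""
        for vid in range(self._first, self._last + 1):
            if vid not in self._in_use:
                self._in_use.add(vid)
                return vid
        raise RuntimeError("no free VLAN ids")

    def release(self, vid):
        """Return a VLAN id to the pool when its meta-router is removed."""
        self._in_use.remove(vid)


if __name__ == "__main__":
    vlans = VlanAllocator()
    x = vlans.allocate()
    y = vlans.allocate()
    vlans.release(x)
    print(x, y, vlans.allocate())
```

Whether an id is allocated per meta-net or per meta-router (the open question above) only changes who calls `allocate`, not the allocator itself.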
Configure NPE: Step 2
• vlan table:
  • code option and instance number
• memory for code options
  • instance: base address, size and index/instance
    • each instance is given an instance number to use for indexing into a common code option block of memory
  • each code option is assigned a block of memory
  • code option: base address and size. Also max number of instances that can be supported.
• Select NPE to host client MR
  • Select eligible NPEs (those that have the requested code option)
  • Select best NPE based on current load and do what???
• Configure NPE
  • Add entry to SRAM table mapping VLAN:PORT to MR instance
    • What does this table look like?
    • Where is it?
  • Allocate memory block in SRAM for MR.
    • Where in SRAM are the eligible blocks located?
    • How do I reference the block?
    • 1) allocate memory for code option at load time 2) allocate memory dynamically
  • Allocate 3 counter blocks for MR
    • where are the blocks?
    • How are they referenced (i.e. named)? Using VM/PM address on NP?
  • Configure MR instance attributes
    • What attributes are needed by the different code options?
    • Tunnel header fields; Exception/local delivery IP header fields, QID, physical Port#; Ether src of NPE???
    • Set default QM QIDs, weights and number of queues?
    • ??
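The "instance number indexes into a common code option block" idea reduces to simple address arithmetic. The sketch below assumes the block is divided into equal-sized slices, one per instance; that layout, and the names `CodeOption`, `instance_size` and `instance_base`, are assumptions rather than the actual NPE memory map.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeOption:
    base: int            # base address of the code option's memory block
    size: int            # total size of the block in bytes
    max_instances: int   # max number of instances the block can support

    @property
    def instance_size(self):
        """Bytes available to each instance (assumes equal-sized slices)."""
        return self.size // self.max_instances

    def instance_base(self, instance):
        """Base address of the slice belonging to the given instance number."""
        if not 0 <= instance < self.max_instances:
            raise IndexError(f"instance {instance} out of range")
        return self.base + instance * self.instance_size


if __name__ == "__main__":
    # Illustrative values only: a 4 KB block holding up to 4 instances.
    opt = CodeOption(base=0x90000000, size=0x1000, max_instances=4)
    for i in range(opt.max_instances):
        print(i, hex(opt.instance_base(i)))
```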
Configure LC(s): Step 3
• User may request specific UDP port number
• Open UDP socket (on GPE)
  • open socket and bind to external IP address and UDP port number. This prevents other slices or the system from using the selected port
• Configure line card to forward tunnel(s) to correct NPE and MR instance
  • Add ingress and egress entries to TCAM
  • how do I know the IP-to-Ethernet destination address mapping for the egress filter?
  • For both ingress and egress allocate QID and configure QM with rate and threshold parameters for MR.
  • Do I need to allocate a queue (whatever this means)?
  • Need to keep track of QIDs (assign QID when creating an instance, etc.)
  • For egress I need to know the output physical port number. I may also need to know this for ingress (if we are using an external switch).
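Reserving the externally visible port by binding a socket to it can be sketched with the standard socket API. The function name `reserve_udp_port` is hypothetical, and the loopback address in the usage lines is a placeholder for the node's external IP address.

```python
import socket


def reserve_udp_port(ip_addr, port=0):
    """Bind a UDP socket to (ip_addr, port) so nothing else can claim the port.

    port=0 lets the kernel pick a free port; the caller reads it back with
    getsockname(). The socket must stay open for the reservation to hold.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((ip_addr, port))
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    held = reserve_udp_port("127.0.0.1")
    port = held.getsockname()[1]
    print("reserved UDP port", port)
    try:
        reserve_udp_port("127.0.0.1", port)   # a second claim on the same port
    except OSError as err:
        print("second bind refused:", err)
    held.close()
```

The reservation only holds while the socket is open and while neither socket sets an address-reuse option, so the resource manager would need to keep the descriptor for the lifetime of the meta-router.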
Configuring GPE: Step 4
• Assign local UDP port to client for receiving exception and local delivery traffic.
  • user may request specific port number.
  • use either a preallocated socket or open a new one.
  • use UNIX domain socket to pass socket back to client along with other results.
  • all traffic will use this UDP tunnel; this means the client must perform IP protocol processing of the encapsulated packet in user space.
    • for exception traffic this makes sense.
    • for local delivery traffic the client can use a tun/tap interface to send the packet back into the Linux kernel so it can perform more complicated processing (such as TCP connection management). Need to experiment with this.
  • should we assign a unique local IP address for each slice?
• Result of shared-NPE allocation and socket sent back to client.
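Passing an open socket to the client over a UNIX domain socket is done with `SCM_RIGHTS` ancillary data. The sketch below uses only standard `socket` module calls; the helper names `send_socket` and `recv_socket` and the result string are hypothetical, and a `socketpair` stands in for the real connection between resource manager and client.

```python
import array
import socket


def send_socket(channel, sock, results):
    """Send an open socket plus a non-empty results blob over a UNIX socket."""
    fds = array.array("i", [sock.fileno()])
    channel.sendmsg([results], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_socket(channel, bufsize=4096):
    """Receive one socket and the accompanying results blob."""
    fds = array.array("i")
    data, ancdata, _flags, _addr = channel.recvmsg(
        bufsize, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing partial integer before decoding.
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    if not fds:
        raise RuntimeError("no file descriptor received")
    # socket.socket(fileno=...) detects family and type from the descriptor.
    return socket.socket(fileno=fds[0]), data


if __name__ == "__main__":
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp.bind(("127.0.0.1", 0))
    manager_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    send_socket(manager_end, udp, b"status=ok")
    client_sock, results = recv_socket(client_end)
    print(results, client_sock.getsockname())
```

The client ends up with its own descriptor for the same kernel socket, so the manager may close its copy afterwards without affecting the client.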
Run-Time Support for Clients
• Managing entries in NPE TCAM (lookup)
  • add/remove entry
  • list entries
• NPE Statistics:
  • Allocate 2 blocks of counters: pre-queue and post-queue.
  • clear block counter pair (Byte/Pkt) ???
  • get block counter pair (Byte/Pkt)
    • specify block and index
    • get once, get periodic
  • get counter group (Byte/Pkt)
    • specify counter group as set of tuples: {(index, block), …}
• SRAM read/write
  • read/write MR instance specific SRAM memory block
  • relative address and byte count; writes include new value as byte array.
• Line card: Meta-interface packet counters, byte counters, rates and queue thresholds
  • get/set meta-interface rate/threshold
• Other
  • Register next hop nodes as the tuple (IPdst, ETHdst), where IPdst is the destination address in the IP packet and ETHdst is the corresponding Ethernet address.
  • Can we assume the destination Ethernet address is always the same?
  • Issue: how do we map this to LC and physical interface? We need this information to configure output TCAM entries on line cards.
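The statistics operations listed above (get or clear a byte/packet counter pair by block and index, and get a group given as a set of (index, block) tuples) might be exposed to clients roughly as follows. This is an in-memory model only; the class, method and constant names are hypothetical, and on real hardware the values would be read from the NPE rather than updated by `count`.

```python
PRE_QUEUE = 0    # assumed numbering of the two counter blocks
POST_QUEUE = 1


class CounterBlocks:
    """In-memory model of per-MR byte/packet counter pairs."""

    def __init__(self, n_blocks=2, n_counters=16):
        # Each counter is a [bytes, packets] pair.
        self._pairs = [[[0, 0] for _ in range(n_counters)]
                       for _ in range(n_blocks)]

    def count(self, block, index, nbytes, npkts=1):
        """Model the data path accounting for one or more packets."""
        pair = self._pairs[block][index]
        pair[0] += nbytes
        pair[1] += npkts

    def get(self, block, index):
        """Return the (bytes, packets) pair for one counter."""
        return tuple(self._pairs[block][index])

    def clear(self, block, index):
        self._pairs[block][index] = [0, 0]

    def get_group(self, group):
        """Return {(index, block): (bytes, packets)} for a set of tuples."""
        return {(index, block): self.get(block, index)
                for index, block in group}


if __name__ == "__main__":
    stats = CounterBlocks()
    stats.count(PRE_QUEUE, 3, nbytes=100)
    stats.count(PRE_QUEUE, 3, nbytes=60)
    stats.count(POST_QUEUE, 3, nbytes=100)
    print(stats.get_group({(3, PRE_QUEUE), (3, POST_QUEUE)}))
```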
Boot-time Support
• Initialize GPE
• Initialize NPE
• Initialize LC
• things to init
  • spi switch
  • memory
  • microengine code download
  • tables??
  • default line card tables
  • default code paths
  • TCAM
IP Meta Router: Control
• All meta-net traffic arrives via a UDP tunnel using a local IP address.
  • raw IP packets must be handled in user space.
  • complete exception traffic processing in user space.
  • local delivery traffic: can we inject it into the Linux kernel so it performs transport layer protocol processing? This would also allow the application to use the standard socket interface.
  • should we use two different IP tunnels, one for exception traffic and one for local delivery?
• Configuration responsibilities?
• Stats monitoring for demo?
  • get counter values
  • support for traceroute and ping
  • ONL-like monitoring tool
• Adding/removing routes:
  • static routing tables or do we run a routing protocol?
IP-Meta Router
• Internal packet format has changed.
  • see Jing’s slides
• Redirect: not in this version of the meta-router
XScale Control Software
• Substrate Interface
  • Raw interface for reading/writing arbitrary memory locations.
  • substrate stats?
  • add new meta-router
• Meta-router/Slice interface
  • all requests go through a local entity (managed)
  • not needed: authenticate client
  • validate request (verify memory location and operation)
• Node Initialization
  • ??
Command/Configuration Tool
• Simple command interpreter with syntax similar to Lisp
• Basic syntax:
    expr := cmd [arg]*
    arg := [‘(‘ expr ’)’ | array | scalar | string]
• Commands are either arithmetic expressions or some system defined operation (mem, vmem, set, etc.)
• Command arguments are typed scalar and array values: integers, doubles and strings
• Allows you to read/write any location in physical memory interactively or via a script.
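The prefix form of the grammar, `expr := cmd [arg]*` with parenthesised sub-expressions as arguments, can be parsed with a short recursive routine. The sketch below is not the tool's actual parser: it covers only that prefix form (no `$var = …` infix assignments, array literals or `\` line continuations), and all function names are hypothetical.

```python
import re

# One token: a parenthesis, a double-quoted string, or a run of other characters.
TOKEN = re.compile(r'\s*(\(|\)|"[^"]*"|[^\s()"]+)')


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_scalar(tok):
    """Convert a token to str (quoted), int, float, or leave it as a symbol."""
    if tok.startswith('"'):
        return tok[1:-1]
    for convert in (lambda t: int(t, 0), float):
        try:
            return convert(tok)
        except ValueError:
            pass
    return tok


def parse_expr(tokens):
    """expr := cmd [arg]*  where arg := '(' expr ')' | scalar | string."""
    if not tokens or tokens[0] in "()":
        raise ValueError("expected a command name")
    expr = [tokens.pop(0)]
    while tokens and tokens[0] != ")":
        tok = tokens.pop(0)
        if tok == "(":
            expr.append(parse_expr(tokens))
            if not tokens or tokens.pop(0) != ")":
                raise ValueError("missing ')'")
        else:
            expr.append(parse_scalar(tok))
    return expr


def parse(line):
    tokens = tokenize(line)
    tree = parse_expr(tokens)
    if tokens:
        raise ValueError("unbalanced ')'")
    return tree


if __name__ == "__main__":
    print(parse("mem write 0x10 (dw4 0x01010101 2)"))
```

Each command becomes a nested list with the command name first, which an interpreter could then dispatch on.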
Example Operations
    cmd> $a = (dw4 0x01010101 \
                   0x02020202 \
                   0x03030303)
    cmd> $b = $a + (dw4 0x01010101 0x02 4)
    <b,{0x2020202,0x2020204,0x3030307}>
    cmd> $c = 3 + $b[2] * 2 - 4
    <c, 101058061>
    cmd> (dw4 $c)
    <TEMP16, 0x606060d (RO)>
    cmd> $t = "text one" +\
              " two"
    <t, "text one two">
    cmd> set
    Symbol Table:
    <a,{0x1010101,0x2020202,0x3030303}>
    <b,{0x2020202,0x2020204,0x3030307}>
    <c,101058061>
    <t,"text one two">
    cmd> help
    Usage:
      <type> : type is one of {int, dw8, dw4, dw2, dw1, dbl}
      …
      load "file_name"
      mem : commands to manage internal memory maps
        mem read maps
        mem show maps
        mem read paddr [type] [count]
        mem write paddr value
      vmem : read/write to kernel virtual memory
        vmem read vaddr [type] [count]
        vmem write vaddr value
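The array arithmetic in the session can be checked outside the tool. The sketch below assumes that `+` on `dw4` arrays is element-wise addition of 32-bit words (the wrap-around mask is an assumption), and it reproduces the values of `b` and `c` listed in the symbol table.

```python
MASK32 = 0xFFFFFFFF   # dw4 = 4-byte word; wrap-around on overflow is assumed


def dw4_add(lhs, rhs):
    """Element-wise sum of two equal-length arrays of 32-bit words."""
    if len(lhs) != len(rhs):
        raise ValueError("array lengths differ")
    return [(x + y) & MASK32 for x, y in zip(lhs, rhs)]


a = [0x01010101, 0x02020202, 0x03030303]
b = dw4_add(a, [0x01010101, 0x02, 4])
c = 3 + b[2] * 2 - 4

if __name__ == "__main__":
    print("b =", [hex(v) for v in b])
    print("c =", c, hex(c))
```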
Reading Memory Maps
    cmd> mem read maps
    Adding symbols:
    <DRAM0_PADDR, 0> <DRAM0_VADDR, 0xa7480000> <DRAM0_SIZE, 0x20000000>
    <DRAM0_CSR_PADDR, 0xd0009000> <DRAM0_CSR_VADDR, 0xa73d0000> <DRAM0_CSR_SIZE, 0x1000>
    <DRAM1_CSR_PADDR, 0xd000a000> <DRAM1_CSR_VADDR, 0xa73f0000> <DRAM1_CSR_SIZE, 0x1000>
    <DRAM2_CSR_PADDR, 0xd000b000> <DRAM2_CSR_VADDR, 0xa7410000> <DRAM2_CSR_SIZE, 0x1000>
    <SRAM0_PADDR, 0x80000000> <SRAM0_VADDR, 0> <SRAM0_SIZE, 0>
    <SRAM1_PADDR, 0x90000000> <SRAM1_VADDR, 0xc7490000> <SRAM1_SIZE, 0x800000>
    <SRAM2_PADDR, 0xa0000000> <SRAM2_VADDR, 0xc7ca0000> <SRAM2_SIZE, 0x800000>
    <SRAM3_PADDR, 0xb0000000> <SRAM3_VADDR, 0xc84b0000> <SRAM3_SIZE, 0x800000>
    <SRAM0_CSR_PADDR, 0xcc010000> <SRAM0_CSR_VADDR, 0xa7440000> <SRAM0_CSR_SIZE, 0x1000>
    <SRAM1_CSR_PADDR, 0xcc410000> <SRAM1_CSR_VADDR, 0xa7450000> <SRAM1_CSR_SIZE, 0x1000>
    <SRAM2_CSR_PADDR, 0xcc810000> <SRAM2_CSR_VADDR, 0xa7460000> <SRAM2_CSR_SIZE, 0x1000>
    <SRAM3_CSR_PADDR, 0xccc10000> <SRAM3_CSR_VADDR, 0xa7470000> <SRAM3_CSR_SIZE, 0x1000>
    cmd> mem show maps
    DRAM Channel 0: kpa 0x00000000, kva 0xa7480000, Size 536870912 (cachable 0, bufferable 0)
    DRAM CSR Ch 0: kpa 0xd0009000, kva 0xa73d0000, Size 65536 (cachable 0, bufferable 0)
    DRAM CSR Ch 1: kpa 0xd000a000, kva 0xa73f0000, Size 65536 (cachable 0, bufferable 0)
    DRAM CSR Ch 2: kpa 0xd000b000, kva 0xa7410000, Size 65536 (cachable 0, bufferable 0)
    SRAM Channel 0: kpa 0x80000000, kva 0x00000000, Size 0 (cachable 0, bufferable 1)
    SRAM Channel 1: kpa 0x90000000, kva 0xc7490000, Size 8388608 (cachable 0, bufferable 1)
    SRAM Channel 2: kpa 0xa0000000, kva 0xc7ca0000, Size 8388608 (cachable 0, bufferable 1)
    SRAM Channel 3: kpa 0xb0000000, kva 0xc84b0000, Size 8388608 (cachable 0, bufferable 1)
    …
    SRAM Ring1 CSR: kpa 0xce400000, kva 0xd16a0000, Size 4096 (cachable 0, bufferable 1)
    SRAM Ring2 CSR: kpa 0xce800000, kva 0xd16b0000, Size 4096 (cachable 0, bufferable 1)
    SRAM Ring3 CSR: kpa 0xcec00000, kva 0xd16c0000, Size 4096 (cachable 0, bufferable 1)
    SRAM CSR Ch 0: kpa 0xcc010000, kva 0xa7440000, Size 4096 (cachable 0, bufferable 1)
    …
    cmd>
Possible configuration script
    set MYTABLE_START 0xXXXXXXX
    mem write $MYTABLE_START (dw4 0x00000000 \
                                  $DEFAULT_ADDR \
                                  $DEFAULT_VLAN)
    set ETHER_ADDR 00:e4:4d:33:00:00
    $ETHER_ADDR[5] = 2
    mem write $ETHER_TABLE[0] $ETHER_ADDR
    $ETHER_ADDR[5] = 3
    mem write $ETHER_TABLE[0] $ETHER_ADDR
    …
    mem write ($MYTABLE_START + 20) (mem read $SOMEPLACE dw4 1)
    …
Testing: Generating Packets
    sp++ <arguments>
    ---------Packet/data Sending Rate --------------
    [(-n|--pcnt) n] : Number of pkts to send. default = 100000.
    [(-x|--pps) rate] : Pkt/sec, default 100
    [--Kbps rate] : Kbps for IP datagrams. default 0 Kbps
    [--KBps KBps] : KBps for IP datagram
    ---------When pkts are sent, see below for description--------
    [(-m|--mode) m] : m is one of {cont|burst|swait}
    [(-p|--period) p] : send <b> pkts every <p> msecs. default = 0 msec
    [(-B|--batch) b] : Number of pkts to send in a batch, default = 0 pkts
    [--pdelay n] : nsec inter-packet gap
    ---------Packet Size, Specify only one --------------------------
    [--dlen b] : Size of payload in bytes, default = 4
    --------Flags affecting pkt size or content -----------------
    [--dtype type] : Packet data type (zero, seq, UDP)
    [--ftype type] : Type of frame to send (raw, udp, tcp, data)
    [--file name] : Name of file containing the raw packet data (ftype == raw)
    ---------Network addressing information -------------------------
    [--sa host] : Use local address "host", default INADDR_ANY
    [(-s|--sp) port] : Source port to use. default = 0
    [--da host] : Send to remote "host", required option
    [(-d|--dp) port] : Destination port number to use. default = 5050
    [--pr (udp|tcp)] : Transport protocol UDP or TCP. default = udp
    ---------Various Endsystem/Socket Control parameters ------------
    [--sbuf sz] : Set socket buffer size, not set by default
    ---------Parameters affecting the core processing steps----------
    [(-D|--dot) (0/1)] : print a dot '.' each time we have to retransmit
    [--rxtout ms] : Timeout for reply pkts, units msec. default = 100 msec
    ---------Debug/message flags -----------------------------------
    [--rt p] : Put process in the real-time scheduling class with prio 'p'.
    …
Example Command
• Example using a constant inter-packet gap
    sp++ -n 10 --pps 1 --mode cont --ftype raw \
         --file Rx_NPUA_Dev_0_Port_0.log --ifn eth2
    -n 10 : send a total of 10 packets
    --pps 1 : send at a rate of 1 packet per second
    --mode cont : use a constant inter-packet delay calculated from pps
    --ftype raw : selects the RAW packet interface protocol family
    --file Rx_NPUA_Dev_0_Port_0.log : read packet contents from file
    --ifn eth2 : send packets out interface eth2
• Or for low packet rates use burst mode:
    sp++ -n 10 --pps 1 --mode burst --ftype raw \
         --file Rx_NPUA_Dev_0_Port_0.log --ifn eth2
  The only difference is the option: --mode burst
• Printing the help message
    sp++ --help
    ...
• I have copied an example packet file into the bin directory (/opt/bin): Rx_NPUA_Dev_0_Port_0.log
• You must be root or use sudo to run sp++ since it opens a raw device socket.
• You can use tcpdump to watch the packets being sent:
    tcpdump -i eth2
Example File: Rx_NPUA_Dev_0_Port_0.log • 0102030405060708090a0b0c81000aaa08004500005000000000ff003a5ac0a80001c0a8000200010002003cff1b008500004500003000000000ff113a69c0a80001c0a8000200010002001cd3b4ddddddddddddddddddddddddddddddddddddddddcaa08273 • 0102030405060708090a0b0c81000aaa08004500005000000000ff003a5ac0a80001c0a8000200010002003cfedc00c400004500003000000000ff113a69c0a80001c0a8000200010002001cd3b4dddddddddddddddddddddddddddddddddddddddd306942f9 • 0102030405060708090a0b0c81000aaa08004500004c00000000ff003a5ec0a80001c0a80002000100020038ffa85500003000000000ff112a69c0a80001c0a8000200010002001cd3b4dddddddddddddddddddddddddddddddddddddddddb16526b
Testing Environment: Generating Traffic
• What sort of packet generation features are useful? What do you need?
• Generate packets identical to those used in simulation?
  • specify on command line?
• Do you need to generate arbitrary Ethernet headers or can we preconfigure the host’s Ethernet interface to use VLANs?
• Do you need to specify arbitrary UDP tunnel headers or can we use the standard socket mechanism to establish the tunnel?
• The encapsulated IP and transport headers will be built up by the program (sp) and thus must be specified on the command line.
  • or is there a default encapsulated header that will do and can be preconfigured at compile time? This could be overridden at run time.
Expected Ethernet Frame Format (see 802.3ac)
• Ethernet Hdr
  • Destination Address (6 B)
  • Source Address (6 B)
  • EtherType (vlan 0x8100)
  • prio, CFI, VID
  • Original EtherType
• Tunnel Headers
  • IP Hdr: Version, HdrLen, TOS, Total length, Identification, Flags, Fragment offset, TTL, Protocol, IP Header checksum, IP Source Address, IP Destination Address
  • UDP Hdr: sport, dport, length, cksum
• Encapsulated IP Datagram
  • IP header: Version, HdrLen, TOS, Total length, Identification, Flags, Fragment offset, TTL, Protocol, IP Header checksum, IP Source Address, IP Destination Address
  • transport header
  • Payload
• Frame Check Sequence (FCS)
Tag control information (TCI): Priority (3 bits), Canonical format indicator (CFI) (1 bit), VLAN ID (VID) (12 bits), followed by Length/Type (16 bits).
CFI should always be set to zero (CFI = 0).
VID = 0 identifies priority frames (what does this mean?).
VID = 4095 (0xfff) is reserved.
Minimum frame size is 64B
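Splitting the tag control information into its three fields is a matter of shifts and masks. The sketch below parses the first 18 bytes of a VLAN-tagged frame; the function name and the returned dictionary keys are hypothetical, and the sample bytes are the header that begins each packet in the example file Rx_NPUA_Dev_0_Port_0.log.

```python
import struct


def parse_vlan_header(frame):
    """Parse destination, source, 802.1Q tag and original EtherType."""
    if len(frame) < 18:
        raise ValueError("frame too short for a VLAN-tagged Ethernet header")
    dst, src, tpid, tci, ethertype = struct.unpack("!6s6sHHH", frame[:18])
    if tpid != 0x8100:
        raise ValueError(f"not a VLAN-tagged frame (type 0x{tpid:04x})")
    return {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        "prio": tci >> 13,           # top 3 bits
        "cfi": (tci >> 12) & 0x1,    # next bit
        "vid": tci & 0x0FFF,         # low 12 bits
        "ethertype": ethertype,      # original EtherType, e.g. 0x0800 for IP
    }


if __name__ == "__main__":
    header = bytes.fromhex("0102030405060708090a0b0c81000aaa0800")
    fields = parse_vlan_header(header)
    for name, value in fields.items():
        print(name, value)
```

Note that `bytes.hex` with a separator argument needs Python 3.8 or later.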