Substrate Control: Overview
Fred Kuhns (fredk@arl.wustl.edu)
Applied Research Laboratory, Washington University in St. Louis
Overview
• Last time
  • control software architecture
  • component responsibilities
  • basic abstractions: meta-interfaces and tunnels; TCAM, slice-oriented view of lookup filters and example IPv4 LPM entries
• This time
  • software architecture update
  • assignments and current efforts
  • assigning bandwidth and queue weights
  • allocating code options and NPE resources
System Block Diagram
[Block diagram of the SPP node: a Control Processor (CP) running the System Node Manager (SNM), System Resource Manager (SRM), Boot and Configuration Control (BCC) and standard services (tftp, dhcpd, routed*, sshd*, httpd*) against the Resource, Route and Slivers DBs and boot files (bootcd, cacert.pem, plnote.txt, sppnode.txt, nodeconf.xml); GPEs running user slivers, pl_netflow and vnet; NPEs built from NPU-A/NPU-B pairs with TCAMs and xscale control processors; and a Line Card holding the ARP table, FIB and NAT/tunnel filters (in/out) behind 10 x 1GbE external interfaces. Boards connect through the hub's Fabric Ethernet Switch (10 Gbps, data path) and Base Ethernet Switch (1 Gbps, control), with a shelf manager and a Power Control Unit (with its own IP address) on I2C (IPMI). All flow monitoring is done at the Line Card. Open notes on the slide: "ReBoot how??", "move pl_netflow to cp?", "manage LC Tables".]
The SPP Node
• Slice instantiation:
  • Allocate a VM instance on a GPE
  • May request a code option instance, NPE resources and interface bandwidth
• Slices share a common set of (global) IP addresses
  • UDP/TCP port space is shared across the GPEs/NPEs
• Line card TCAM filters direct traffic
  • Unregistered traffic originating outside the node is sent to the CP
  • Unregistered traffic originating within the node uses NAT (on the line card)
  • An application may register server ports; this causes a filter to be inserted in the line card directing traffic to the specific GPE (sketched below)
  • An application must register the ports (or tunnels) associated with fast path instances
• It is assumed that fast path instances will use tunnels (overlays) to send traffic between routing nodes
  • Currently we only support UDP tunnels but will extend to GRE and possibly others
[Diagram: ingress traffic from the Internet enters the Line Card (IP route, ARP and NAT lookup tables, TCAM) and crosses the fabric either to a fast path (FPx, with SRAM-resident code option state) on an NPE or, through the mi-mux, to application slivers (vmx) on a GPE running the PlanetLab OS; local delivery and exceptions use a UDP tunnel; the SRM, SNM, RMP, NMP and SCDs form the control path.]
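To make the port-registration bullet concrete, here is a minimal C sketch of a line-card filter entry and an insertion helper; the struct fields and register_server_port() are hypothetical illustrations under the slide's description, not the actual SPP data structures.

```c
#include <stdint.h>

/* hypothetical line-card filter entry (not the real TCAM layout) */
typedef struct lc_filter {
    uint32_t daddr;    /* shared global IP address owned by the node */
    uint16_t dport;    /* registered UDP/TCP server port */
    uint8_t  proto;    /* 17 = UDP, 6 = TCP */
    uint16_t out_vlan; /* VLAN steering matches to the owning GPE */
} lc_filter_t;

/* Insert a filter so matching ingress traffic bypasses the default
 * send-to-CP / NAT path and is delivered to the registering slice's GPE.
 * The returned table index stands in for a TCAM filter id. */
int register_server_port(lc_filter_t *tbl, int *n,
                         uint32_t ip, uint16_t port, uint8_t proto,
                         uint16_t vlan)
{
    tbl[*n] = (lc_filter_t){ .daddr = ip, .dport = port,
                             .proto = proto, .out_vlan = vlan };
    return (*n)++;
}
```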
Key Software Control Components
[Diagram: the Primary Hub (logical slot 1, channel 1) runs snmpd over the fabric and base switch VLAN tables; the SRM allocates VLANs, enables ports and collects stats, assigns slices to GPEs, performs boot management and PLC proxying, and manages filters, bandwidth allocations and stats against the Resource DB. Slices send requests to allocate or free resources and receive resource allocations and slice bindings. Each GPE runs the NMP and RMP in the root context of the PlanetLab OS (vmx slivers, vnet); the RMP handles slice-owned resource management. The Line Card and each NPE run an SCD controlling the MUX, TCAM, SRAM and fast paths (FPk, FPx). Exception and local delivery traffic includes a shim header carrying the RxMI. Node components not in the hub: the switch, external GPEs and development hosts.]
Software Control Components
[The original slide annotates each item with a status and owner (Done/Working/Todo/Starting/Testing/Pending; fredk, mike, mart, jas, jonathon); the per-item mapping is not recoverable here.]
• Utilities: parts of the BCC that generate config and distribution files
  • Node configuration and management: generates config files, dhcp, tftp, ramdisk
  • Boot CD and distribution file management (images, RPM and tar files) for the GPEs and CP
• Control processor:
  • Boot and Configuration Control (BCC): GPEs retrieve boot images and scripts from the BCC
  • System Resource Manager (SRM): runtime central resource manager
  • System Node Manager (SNM): interface with PlanetLab Central (PLC)
  • http daemon providing a node-specific interface to netflow data (planetflow)
  • netflow runtime database and data management
  • User authentication and ssh forwarding daemon
  • Routing protocol daemon (BGP/OSPF/RIP) for maintaining the FIB in the Line Card
• General Purpose Element (GPE):
  • Local Boot Manager (LBM): modified BootManager running on the GPEs
  • Resource Manager Proxy (RMP)
  • Node Manager Proxy (NMP): the required changes to the existing Node Manager software
• Network Processor Element (NPE):
  • Substrate Control Daemon (SCD)
  • TCAM library
  • kernel module to read/write memory locations (wumod)
  • command interpreter for configuring NPU memory (wucmd)
  • modified Radisys and Intel source; ramdisk; Linux kernel
• Line Card:
  • SCD: LC version of the SCD
  • ARP: protocol and error notifications. Lookup table entries hold either the next-hop IP or Ethernet address.
  • Sliver packets which cannot be mapped to an Ethernet address must receive error notifications.
  • netflow-like stat collection and reporting to the CP for web and PLC downloading
  • NAT lookup entries for unregistered traffic originating from a GPE or the CP
Slice-Centric View
• Allocate and free fast path: code option instance, NPE resources and interface BW
• Manage interfaces
  • Get interface attributes: {{ifn, type, ipaddr, linkBW, availBW}, ...}
  • If peering, get the peer's IP address
  • Allocate aggregate interface bandwidth
• Allocate external port number(s)
• Define meta-interfaces
  • Substrate adds line card filter(s)
  • Slice may specify a minimum BW
• Associate queues with meta-interfaces
  • Substrate has to map meta-interface numbers used in TCAM filters to the corresponding local addresses
• Manage queue parameters, get queue length
  • threshold, bandwidth (weight)
• Manage TCAM filters
  • add, remove, update, get, lookup
• Substrate remaps slice ids (qid, fid, mi, stats) to global identifiers (sketched below)
• One-time or periodic statistics
  • Periodic uses a polled or callback model
• Read and write SRAM
  • Substrate verifies address and length
  • Extended to also support DRAM memory
[Diagram: a slice's fast path with SRAM and DRAM blocks, per-meta-interface WRR queue sets (qi ... ql) with qlen/threshold/weight parameters, tunnels MI1 := {myIP, Port1} ... MIn := {myIP, Portn} with minimum bandwidths BW1,min ... BWn,min, TCAM filters, substrate slice state, max buffers, and a best-effort queue toward the GPE over the slice's VLAN.]
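A minimal C sketch of the slice-local-to-global remapping mentioned above; the table layout and MAX_SLICE_IDS bound are assumptions for illustration, not the actual SRM/SCD data structures.

```c
#include <stdint.h>

#define MAX_SLICE_IDS 256   /* assumed per-slice id space */

typedef struct slice_map {
    uint16_t xsid;                    /* substrate-internal slice id */
    uint32_t qid_map[MAX_SLICE_IDS];  /* slice qid -> global queue id */
    uint32_t fid_map[MAX_SLICE_IDS];  /* slice fid -> global filter id */
    uint32_t mi_map[MAX_SLICE_IDS];   /* slice mi  -> local address / VLAN */
} slice_map_t;

/* Translate a slice-local queue id before touching NPE state; rejects
 * out-of-range slice ids so a slice cannot reference another's queues. */
static int remap_qid(const slice_map_t *m, uint16_t slice_qid,
                     uint32_t *global)
{
    if (slice_qid >= MAX_SLICE_IDS)
        return -1;
    *global = m->qid_map[slice_qid];
    return 0;
}
```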
RMP Interface
• qlen get_queue_len(qid)
• retcode write_fltr(fid, key[N], mask[N], result[M])
• retcode update_result(fid, result[M])
• fltr get_fltr(fid), fltr_t get_fltr(key[N])
• result[M] lookup_fltr(key)
• retcode rem_fltr(fid), retcode rem_fltr(key[N])
• {uint32, tstamp} read_stats(index, location, what)
• handle create_periodic(id, P, cnt, type, loc, what)
• retcode delete_periodic(handle)
• retcode set_callback(handle, xport)
• stats_t get_periodic(handle)
• ret_t mem_write(offset[, len], data)
• data mem_read(offset, len)
• retcode alloc_fast_path(copt, atype, attrs)
• retcode free_fast_path()
• {entry, ...} get_interfaces(); entry = {ifn, type, ipaddr, linkBW, availBW}
• entry get_ifattrs(ifn)
• ipaddr get_ifpeer(ifn)
• retcode alloc_ifbw(ifn, bw)
• port alloc_port(ipaddr, port, proto)
• mi add_endpoint(ep_type, params, BW)
• mi add_udp_endpoint(ipaddr, port, BW)
• {ep_type, params} get_endpoint(mi)
• retcode bind_queue(mi, list_type, qid_list)
• retcode set_queue_params(qid, thresh, weight)
• {threshold, weight} get_queue_params(qid)
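As a C sketch, one plausible end-to-end setup sequence over the calls listed above; the typedefs, the concrete numbers, and conventions such as passing port 0 to let the substrate choose are assumptions, not part of the slide.

```c
#include <stdint.h>

typedef int      retcode;
typedef uint32_t mi_t;

/* prototypes paraphrased from the RMP interface list above */
retcode alloc_fast_path(uint16_t copt, uint16_t atype, const uint32_t *attrs);
retcode alloc_ifbw(int ifn, uint32_t bw);
int     alloc_port(uint32_t ipaddr, uint16_t port, uint8_t proto);
mi_t    add_udp_endpoint(uint32_t ipaddr, uint16_t port, uint32_t bw);
retcode bind_queue(mi_t mi, int list_type, const int *qid_list);
retcode set_queue_params(int qid, uint32_t thresh, uint32_t weight);

int setup_fast_path(uint32_t my_ip)
{
    /* {bw, pps, fltrs, queues, buffers, stats, sram, dram}; values would
     * be filled in per the slice's reservation request */
    uint32_t attrs[8] = {0};

    if (alloc_fast_path(/*IPv4*/ 0, /*Shared*/ 0, attrs) < 0)
        return -1;
    alloc_ifbw(1, 100000000);                 /* 100 Mb/s on interface 1 */
    int port = alloc_port(my_ip, 0, 17);      /* UDP; substrate picks port */
    mi_t mi = add_udp_endpoint(my_ip, (uint16_t)port, 100000000);
    int qids[] = { 0, 1 };                    /* slice-local queue ids */
    bind_queue(mi, 0, qids);                  /* associate queues with MI */
    set_queue_params(0, 64 * 1024, 10);       /* drop threshold, WRR weight */
    return 0;
}
```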
Short Term Milestones
• 21/09/07
  • SRM-SCD: alloc_fp(); does not include tcamLib. fred, mart.
  • RMP-SRM: noop(). fred, mike.
• 28/09/07 (1 week delta)
  • SRM-SCD: rem_fp(xsid); does not include tcamLib. fred, mart.
  • tcamLib: API tests; config file with search machine and multiple DBs; reasonably complex DB (say, using jdd's configurations for SIGCOMM). jonathon.
  • rudiments of the SNMP interface to the SRM. fred.
• 05/10/07 (1 week delta)
  • SCD: alloc_fp/free_fp using tcamLib; retest asynchronous copt_freed(). Includes configuration file with target search machines and default entries for both NPE and LC. mart, jonathon.
  • SCD: simple client-driven tests of TCAM operations (add, remove, update, lookup). mart, jonathon.
  • SCD-SRM: fp_freed(xsid). Asynchronous free when slice queues are non-empty. fred, mart.
  • RMP-SRM-SCD: alloc_fp(...) and free_fp(). mike, mart, fred.
• 12/10/07 (1 week delta)
  • RMP: send commands from slice to RMP using UNIX domain sockets. Map slice to its PlanetLab id (PlabID). fred, mike.
  • Configure HUB using SNMP from the SRM: initialization, hardware discovery, add/remove VLAN. fred.
• 19/10/07 (1 week delta)
  • IDT kernel module, locking. jonathon.
  • SRM: interface and bandwidth management. Verify interface management with a simple client: get_interfaces(), get_ifattrs(), get_ifpeer(), alloc_ifbw(). fred.
  • RMP-SCD: TCAM operations: write_fltr(), update_result(), get_fltr(), lookup_fltr(), rem_fltr(). Must add code to map the MI in a filter to its internal representation and prepend the VLAN tag. mike, jonathon, mart.
Example Outlining Slice Interface and Abstractions
• Physical port (interface) attributes: {ifn, type, ipaddr, linkBW, availBW}
  • ifn: interface number
  • type: {Internet, Peering}
• Operations: get_interfaces(), get_ifattrs(ifn), get_ifpeer(ifn), alloc_ifbw(ifn, bw)
• Slice interface and queue allocations: {Port, BW, QList}; QList = {{qid, weight, threshold}, ...}
[Diagram: two fast paths on an NPE, FP slice1 with queues q10..q1n (qid in 0...n-1) and FP slice2 with queues q20..q2m (qid in 0...m-1), each WRR-scheduled toward the Line Card; the slices' reserved bandwidths satisfy BW11 + BW21 = BW1 on the interface with address ipAddr and capacity linkBW; GPE traffic shares the same path.]
QM throughput estimates, up to 20 schedulers (checked below)
• 5 schedulers per microengine, 4 microengines
• 2.5 Gbps per microengine
• 1 Gbps per scheduler
• Add SCD commands
  • initialize static code option/substrate memory/tables
  • parse block
  • header format
  • queue manager
  • ???
  • load microengine code
• Use second VLAN tag to represent the meta-interface
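A trivial arithmetic check of these numbers in C; the interpretation that the per-microengine budget, rather than the per-scheduler cap, binds once more than two schedulers on a microengine are busy is my inference, not stated on the slide.

```c
#include <stdio.h>

int main(void)
{
    const int ues = 4, sched_per_ue = 5;             /* "up to 20 schedulers" */
    const double gbps_per_ue = 2.5, gbps_per_sched = 1.0;

    printf("schedulers:       %d\n", ues * sched_per_ue);        /* 20 */
    printf("aggregate:        %.1f Gb/s\n", ues * gbps_per_ue);  /* 10.0 */
    /* the 1 Gb/s per-scheduler cap can only be met while at most two
     * schedulers on a microengine are active; with all five busy each
     * gets the fair share below (inference) */
    printf("fair share/sched: %.2f Gb/s\n", gbps_per_ue / sched_per_ue);
    printf("per-sched cap:    %.1f Gb/s\n", gbps_per_sched);
    return 0;
}
```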
Single Interface Example
• LC ingress
  • One queue per slice with reserved bandwidth (really one per scheduler)
  • One queue for best-effort traffic to each GPE
  • One scheduler for the CP with queues for reserved traffic plus best effort
• LC egress
  • At least one scheduler for each physical interface
  • One queue for each active slice with an MI defined for the associated scheduler
  • One best-effort queue for each board (GPE, CP, NPE?)
• NPE
  • A slice binds queues to meta-interfaces, hence physical interfaces
  • A slice either reserves BW on a physical interface or is assigned the minimum
  • The substrate assigns a per-interface maximum weight for each slice
  • The substrate sets scheduler rates according to aggregate allocations (sketched below)
  • Manage scheduler rates to control aggregate traffic to interfaces and boards
[Diagram: on ingress, per-slice queues qxs1..qxsn feed SchedNPE1, and per-slice queues qps1..qpsn plus a best-effort queue qBE feed SchedGPE1 and SchedCP, demultiplexed on dst addr, proto and port/ICMP; on egress, slice queues qs1..qsn with weights w11..wpm and GPE/CP queues feed SchedI1 for interface 1 under rates BWI1 and BWNPE1,GPE1; the total weight over all slices i and queues j must be ≤ the max weight Wk for interface I1, each slice gets at least its minimum allocated BW, and the minimum weight is 1 MTU-sized packet; local delivery and exceptions ride the VLAN.]
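A minimal C sketch of the "substrate sets scheduler rates according to aggregate allocations" bullet; the structures are illustrative, not the actual queue-manager state.

```c
#include <stdint.h>

typedef struct queue { uint32_t resv_bw; } queue_t;             /* bits/s */
typedef struct sched { queue_t *q; int nq; uint64_t rate; } sched_t;

/* Program a scheduler's rate as the sum of its queues' reservations,
 * clamped to the physical link (or board) budget so aggregate traffic
 * to an interface or board stays controlled. */
static void set_sched_rate(sched_t *s, uint64_t link_bw)
{
    uint64_t sum = 0;
    for (int i = 0; i < s->nq; i++)
        sum += s->q[i].resv_bw;
    s->rate = sum < link_bw ? sum : link_bw;
}
```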
Two Interface Example; Setting Queue Weights
• Slice i, slice qid j and scheduler k (one WRR scheduler per interface).
[Diagram: two fast paths, FP slice1 (qid in 0...n-1, queues q10..q1n) and FP slice2 (qid in 0...m-1, queues q20..q2m), each splitting their queues across WRR schedulers for interface 1 (IP1, capacity BW1) and interface 2 (IP2, capacity BW2); slice 1 reserves BW11 toward interface 1 and BW12 toward interface 2, and GPE traffic shares both links.]
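A C sketch of one way to derive per-queue WRR weights for a scheduler k; the slides only state the constraints (total weight ≤ the interface's max weight, minimum weight of one MTU-sized packet), so scaling weights proportionally to reserved bandwidth is an assumption.

```c
#include <stdint.h>

/* For one interface's scheduler: weight proportional to the queue's
 * reserved share of the interface bandwidth, scaled into the
 * per-interface weight budget, floored at 1 (one MTU per round). */
static void set_weights(const uint32_t *bw, uint32_t *w, int n,
                        uint32_t if_bw, uint32_t max_weight)
{
    if (if_bw == 0)
        return;                        /* nothing reservable */
    for (int i = 0; i < n; i++) {
        uint64_t wi = (uint64_t)bw[i] * max_weight / if_bw;
        w[i] = wi ? (uint32_t)wi : 1;  /* minimum weight = 1 MTU packet */
    }
}
```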
Allocating Code Option Instance
• Slice to RMP: npeIP alloc_fast_path(copt, atype, attrs)
  • uint16_t copt: NPE code option {IPv4=0, I3=1}
  • uint16_t atype: reservation type {Shared=0, Firm=1}
  • uint32_t attrib[]: array of resource allocation parameters:
      attrib_t {
          uint32_t bw, pps;                        // bits/second, packets/second
          uint32_t fltrs, queues, buffers, stats;  // totals
          uint32_t sram, dram;                     // memory block size in bytes
      }
• RMP to SRM: {xsid, npeIP} alloc_fast_path(PlabID, copt, atype, attrs)
  • uint32_t PlabID: GPE/PlanetLab slice identifier. The SRM allocates an internal slice identifier (xsid) unique within the SPP node. All substrate operations use the xsid.
• SRM to SCD: set_fast_path(xsid, copt, VLAN, TParams, Mem)
  • uint16_t xsid: internal slice id
  • uint16_t VLAN
  • uint32_t TParams[] = {#Qs, #Fltrs, #Buffers, #Stats}
  • mem_t Mem[] = {SRAM:{Offset, Size}, DRAM:{Offset, Size}}
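As a C sketch, filling the attribute block from this slide and issuing the slice-side call; the concrete numbers are made up, and the npeIP return value is collapsed to an int for brevity.

```c
#include <stdint.h>

typedef struct attrib {
    uint32_t bw, pps;                        /* bits/s, packets/s */
    uint32_t fltrs, queues, buffers, stats;  /* totals */
    uint32_t sram, dram;                     /* block sizes in bytes */
} attrib_t;

/* slice-to-RMP call from the slide; prototype paraphrased */
uint32_t alloc_fast_path(uint16_t copt, uint16_t atype, const attrib_t *a);

int request_ipv4_fast_path(void)
{
    attrib_t a = {
        .bw = 100000000, .pps = 200000,      /* 100 Mb/s, 200 Kp/s (example) */
        .fltrs = 1024, .queues = 32, .buffers = 4096, .stats = 64,
        .sram = 256 * 1024, .dram = 4 * 1024 * 1024,
    };
    /* copt=0 (IPv4), atype=1 (Firm); returns the assigned NPE's IP */
    return (int)alloc_fast_path(0, 1, &a);
}
```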
Allocating NPE (Creating Meta-Router)
1. Slice PlabID requests code option copt with resources {params} (RMP to SRM over the 1 GbE base/control network).
2. If sufficient resources are available, the SRM assigns an internal slice identifier (xsid) and associates it with the allocation {Slice, VLAN, NPE:{copt, #fltrs, #Qs, #stats, #buffs, SRAM, DRAM}, EP{}, MI{}, GPE:{IP, control port}}.
3. Allocate and enable a VLAN (VLANk) to isolate internal slice traffic.
4. Send a message to the selected NPE's SCD informing it of the new allocation {xsid, VLAN, {params}}.
5. The SCD caches the assigned xsid and opens a local socket for exception and local delivery traffic.
6. Return status and the assigned global port number to the client vserver.
[Diagram: system tables kept by the SRM include interface attributes ifn:{type, ipaddr, linkBW, availBW}, VLAN maps (resvMap, availMap, usedMaps, xsidMap), endpoint (port) maps (servMap, resvMap) with range:{start, end}, SRAM free lists, TCAM state, BW maps, and an NPE table id:{addr, BW/Port, copts, fltrs, sram, Qs}; per-slice tables bind plabID, xsid, vlan, the control-interface mux table, allocated meta-interfaces mi:endpoint, fast-path resources (#Qs, #flts, #Stats, sram {start, size}) and the GPE board id; the SNM tracks the PLC sliver table and user login info against the Resource DB; data crosses the 10GbE fabric, control the 1GbE base switch.]
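A condensed C sketch of the SRM side of this sequence; every helper here (check_resources, assign_xsid, vlan_alloc_enable, scd_set_fast_path) is hypothetical glue illustrating the numbered steps, not the real SRM code.

```c
#include <stdint.h>

int check_resources(uint32_t plab_id, const uint32_t *params);  /* step 2 */
int assign_xsid(uint32_t plab_id);
int vlan_alloc_enable(uint16_t xsid);                           /* step 3 */
int scd_set_fast_path(uint16_t xsid, uint16_t copt,
                      uint16_t vlan, const uint32_t *params);   /* step 4 */

/* Returns the global port number for the slice's endpoint, or a negative
 * error.  Steps 5-6 (SCD caches the xsid and opens the exception/local
 * delivery socket; status returned to the vserver) happen downstream. */
int srm_alloc_fast_path(uint32_t plab_id, uint16_t copt,
                        const uint32_t *params)
{
    if (check_resources(plab_id, params) < 0)        /* step 2 */
        return -1;
    int xsid = assign_xsid(plab_id);
    int vlan = vlan_alloc_enable((uint16_t)xsid);    /* step 3 */
    if (vlan < 0)
        return -1;
    return scd_set_fast_path((uint16_t)xsid, copt,   /* step 4 */
                             (uint16_t)vlan, params);
}
```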
SRM: Allocating NPE Resources
• Actions required to allocate a code option instance and resources:
  • Select an NPE
    • Load balance across available NPEs: of the eligible NPEs, select the one with the greatest "head room" (sketched below).
    • Eligible if it has sufficient resources (SRAM, TCAM space, queues, etc.)
    • Select the NPE with the greatest firm BW and PPS. If tied, select the one with the greatest available soft resources, else pick the lowest numbered.
  • Either allocate the requested resources or return an error
    • The SRM keeps a memory map of SRAM (and DRAM) so it can perform allocation, though the absolute starting address is not required.
    • If compaction is necessary it must communicate with the SCD directly.
  • Allocate a VLAN and configure the switch.
  • Send a command to the selected NPE's SCD:
    • set_fast_path(xsid, copt, VLAN, Params)
    • The SCD updates its tables and creates local mappings.
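A C sketch of the selection policy described above; combining BW and PPS headroom into a single sum, and the single soft-resource score, are assumptions about how "greatest head room" is scored.

```c
#include <stdint.h>

typedef struct npe {
    int id;                      /* NPEs assumed listed in id order */
    int eligible;                /* enough SRAM, TCAM space, queues, ... */
    uint64_t firm_bw, firm_pps;  /* remaining firm capacity */
    uint64_t soft;               /* remaining soft-resource score */
} npe_t;

/* Pick the eligible NPE with the greatest firm headroom; ties fall back
 * to soft resources, then to the lowest id (which wins automatically
 * because `best` is only replaced on a strict improvement). */
static int select_npe(const npe_t *n, int cnt)
{
    int best = -1;
    for (int i = 0; i < cnt; i++) {
        if (!n[i].eligible)
            continue;
        if (best < 0) { best = i; continue; }
        uint64_t hi = n[i].firm_bw + n[i].firm_pps;        /* summing BW and */
        uint64_t hb = n[best].firm_bw + n[best].firm_pps;  /* PPS is assumed */
        if (hi > hb || (hi == hb && n[i].soft > n[best].soft))
            best = i;
    }
    return best;   /* -1: no eligible NPE, report an error to the RMP */
}
```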
Freeing Code Option Instance (Fast Path)
• Slice to RMP and RMP to SRM:
  • void free_fast_path(): requires asynchronous processing by the SRM and SCD.
• SRM to SCD:
  • {Success/Pending/Failed} rem_fast_path(xsid) (sketched below)
  • The SRM first sends a request to the SCD directing it to free all resources assigned to slice xsid.
  • The SCD first disables the fast path and GPE (how?) so no new packets will be processed.
  • Then it checks all queues assigned to xsid. If all are empty, the resources are freed and a Success response is sent to the SRM.
  • Otherwise, if packets remain in any of the queues, the SCD sends a Pending response to the SRM and periodically checks all queues assigned to xsid. When they are all empty, the SCD sends an asynchronous successful-deallocation message (which includes the slice's xsid) to the SRM, notifying it that all resources associated with xsid are now free.
  • If the SCD returns Success, the SRM marks the resources as available and removes the slice from its internal xsid (fast path) tables.
  • If the SCD returns Pending, the SRM registers a callback method which is invoked when the SCD sends the resource-freed message.
  • Whether the resources are freed immediately or asynchronously, the SRM returns Success to the RMP.
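A minimal C sketch of the SCD side of this two-phase free, assuming hypothetical helpers (all_queues_empty, disable_fast_path, free_resources, notify_srm_freed) for the operations the slide names.

```c
#include <stdbool.h>

enum { FP_SUCCESS, FP_PENDING, FP_FAILED };

/* assumed helpers, named after the slide's description */
bool all_queues_empty(int xsid);
void disable_fast_path(int xsid);   /* stop new packets entering */
void free_resources(int xsid);
void notify_srm_freed(int xsid);    /* asynchronous message with the xsid */

int rem_fast_path(int xsid)
{
    disable_fast_path(xsid);
    if (all_queues_empty(xsid)) {
        free_resources(xsid);
        return FP_SUCCESS;          /* SRM frees its tables immediately */
    }
    return FP_PENDING;              /* SRM registers a callback instead */
}

/* invoked periodically while a free is pending, until queues drain */
void poll_pending_free(int xsid)
{
    if (all_queues_empty(xsid)) {
        free_resources(xsid);
        notify_srm_freed(xsid);
    }
}
```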
Comments
• Pace SCD message processing
• Drop threshold using packets, or packets and length
• Limit BW over-allocations
• Use fast path, not slice
• GPE traffic to the NPE is turned off when freeing a fast path
• How long to wait for queues to drain?
• Turn off the FP using the VLAN table