This project aims to create a high-performance PlanetLab node that is compatible with the current version of PlanetLab. The goal is to substantially improve performance through a phased development process, with long-term goals of single-node appearance and dynamic code configuration.
A High Performance PlanetLab Node
Jon Turner, jon.turner@wustl.edu
http://www.arl.wustl.edu/arl
Objectives
• Create system that is essentially compatible with current version of PlanetLab.
  • secure buy-in of PlanetLab users and staff
  • provide base on which resource allocation features can be added
• Substantially improve performance.
  • NP blade with simple fast path can forward at 10 Gb/s for minimum-size packets
  • standard PlanetLab node today forwards 100 Mb/s with large packets
  • multiple GPEs per node allow more resources per user
• Phased development process.
  • long-term goals include appearance of single PlanetLab node and dynamic code configuration on NPs
  • phased development provides useful intermediate steps that defer certain objectives
• Limitations.
  • does not fix PlanetLab's usage model
  • each "slice" limited to a Vserver plus a slice of an NP
Development Phases
• Phase 0
  • node with single GP blade hosting standard PlanetLab software
  • Line Card and one or more NP blades to host NP-slices
  • NP-slice configuration server running in privileged Vserver
    • invoked explicitly by slices running in Vservers
  • NP blades with static slice code options (2), but dynamic slice allocation
• Phase 1
  • node with multiple GP blades, each hosting standard PlanetLab software and with its own externally visible IP address
  • separate control processor hosting NP-slice configuration server
  • expanded set of slice code options
• Phase 2
  • multiple GPEs in unified node with single external IP address
  • CP retrieves slice descriptions from PLC and creates local copy for GPEs
  • CP manages use of external port numbers
  • transparent login process
  • dynamic NP code installation
Phase 0 Overview
(Diagram: GPE, NPE, Switch, LC)
• System appears like a standard PlanetLab node.
  • single external IP address
  • alternatively, single address for whole system
• Standard PlanetLab mechanisms control GPE.
  • Node Manager periodically retrieves slice descriptions from PlanetLab Central
  • configures Vservers according to slice descriptions
  • supports user logins to Vservers
• Resource Manager (RM) runs in privileged Vserver on GPE and manages NP resources for user slices.
  • NP slices explicitly requested by user slices
  • RM assigns slices to NPEs (to balance usage)
  • reserves port numbers for users
  • configures Line Cards and NPs appropriately
• Line Cards demux arriving packets using port numbers.
Using NP Slices
(Diagram: GPE with VS, NPE, Switch, LC; exception packets use internal port numbers; LC uses dport to demux and determine MI, maps MI to sport; packets carried as IP header with daddr=thisNode / daddr=nextNode around the slice pkt)
• External NPE packets use UDP/IP.
• NPE slice has ≥1 external port.
• LC uses dport number to direct packet to proper NPE.
• NPE uses dport number to direct packet to proper slice.
• Parse block of NPE slice gets (see sketch below):
  • bare slice packet
  • input meta-interface, source IP addr and sport
• Format block of NPE slice provides:
  • bare slice packet
  • output meta-interface, dest IP addr and dport for next hop
• NPE provides multiple queues per slice.
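The following is a minimal C sketch of the per-packet information exchanged between the substrate and a slice's parse/format blocks, plus the dport-based demux idea. All type and function names (parse_in, format_out, slice_for_dport) are illustrative assumptions, not the actual NPE interface.

#include <stdint.h>

struct parse_in {            /* delivered to the slice's parse block */
    uint8_t  *slice_pkt;     /* bare slice packet (UDP payload) */
    uint32_t  pkt_len;
    uint8_t   input_mi;      /* input meta-interface */
    uint32_t  src_ip;        /* outer IP source address */
    uint16_t  sport;         /* outer UDP source port */
};

struct format_out {          /* produced by the slice's format block */
    uint8_t  *slice_pkt;     /* bare slice packet to send */
    uint32_t  pkt_len;
    uint8_t   output_mi;     /* output meta-interface */
    uint32_t  next_hop_ip;   /* outer IP destination for next hop */
    uint16_t  dport;         /* outer UDP destination port at next hop */
};

/* Demux idea: the UDP destination port selects first the NPE (at the LC),
 * then the slice within the NPE (illustrative linear lookup). */
static int slice_for_dport(const uint16_t *port_table, int nports, uint16_t dport) {
    for (int i = 0; i < nports; i++)
        if (port_table[i] == dport)
            return i;        /* index of owning slice */
    return -1;               /* unknown port: drop or send to exception path */
}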
Managing NP Usage
(Diagram: GPE with RM and VS, NPE, Switch, LC)
• Resource Manager assigns Vservers to NP-slices on request.
  • user specifies which of several processing code options to use
  • RM assigns to NP with requested code option
  • when choices are available, balance load
  • configure filters in LC based on port numbers
• Managing external port numbers (see sketch below).
  • user may request specific port number from RM when requesting NP-slice
  • RM opens UDP connection and attempts to bind port number to it
  • allocated port number returned to VS
• Managing port numbers for exception channel.
  • user Vserver opens UDP port and binds port number to it
  • port number supplied to RM as part of NP-slice configuration request
• Managing per-slice filters in NP.
  • requests made through RM, which forwards to NP's xScale
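A minimal sketch of how the RM could reserve an external UDP port number: open a UDP socket and try to bind the requested port, keeping the socket open to hold the reservation. The function name and error handling are illustrative.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the bound socket fd (kept open to hold the reservation), or -1. */
int reserve_udp_port(uint16_t requested_port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(requested_port);    /* 0 lets the kernel pick one */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;                            /* port already in use */
    }
    return fd;   /* RM keeps fd open; port number is returned to the VS */
}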
Execution Environment for Parse/Format
(Diagram: parse MEs ME1–ME3)
• Statically configure code for parse and format.
  • only trusted developers may provide new code options
  • must ensure that slices cannot interfere with each other
  • shut down NP & reload ME program store to configure new code option
• User specifies option at NP-allocation time.
• Demux determines code option and passes it along.
• Each slice may have its own static data area in SRAM.
• For IPv4 code option, user-installed filters determine outgoing MI, daddr, dport of next hop, or whether packet should go to exception channel.
• To maximize available code space per slice, pipeline MEs (see sketch below).
  • each ME has code for small set of code options
  • MEs just propagate packets for which they don't have code
    • ok to allow these to be forwarded out of order
  • each code option should be able to handle all traffic (5 Gb/s) in one ME
  • might load-balance over multiple MEs by replicating code segments
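A minimal sketch of the pipelined-ME idea: each microengine carries code for only a few code options and simply forwards packets whose option it does not implement to the next ME. The descriptor fields, handler table, and ring operations are hypothetical, not the real IXP API.

#include <stdint.h>
#include <stddef.h>

struct pkt_desc {
    uint8_t  code_option;    /* set by demux when the packet enters the NPE */
    uint8_t  slice_id;
    /* ... buffer handle, lengths, meta-interface, etc. ... */
};

typedef void (*parse_fn)(struct pkt_desc *);

/* Per-ME handler table: NULL means "this ME has no code for that option". */
#define NUM_OPTIONS 8
extern parse_fn my_options[NUM_OPTIONS];

extern struct pkt_desc *ring_get_from_prev(void);
extern void ring_put_to_next(struct pkt_desc *);   /* next ME, or queueing stage */

void me_main_loop(void) {
    for (;;) {
        struct pkt_desc *p = ring_get_from_prev();
        parse_fn fn = (p->code_option < NUM_OPTIONS) ? my_options[p->code_option] : NULL;
        if (fn)
            fn(p);           /* this ME owns the code option: process here */
        /* Either way, pass the packet downstream; packets this ME cannot
         * handle are processed by a later ME in the pipeline, which may
         * reorder them relative to processed packets. */
        ring_put_to_next(p);
    }
}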
Monitoring NPE Slice Traffic
• Three counter blocks per NPE slice.
  • pre-filter counters – parse block specifies counter pair to use
    • for IPv4 case, associate counters with meta-interface and type (UDP, TCP, ICMP, options, other)
  • pre-queue counters – format block specifies counter pair
    • for IPv4 case, extract from filter result
  • post-queue counters – format block specifies counter pair
    • for IPv4 case, extract from filter result
• xScale interface for monitoring counters (see sketch below).
  • specify groups of counters to poll and polling frequency
  • counters in a common group are read at the same time and returned with a single timestamp
  • by placing a pre-queue and post-queue counter pair in the same group, can determine number of packets/bytes queued in a specific category
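A minimal sketch of the counter-group idea on the xScale side: counters in a group are sampled together and stamped once, and the backlog in a category is the difference between its pre-queue and post-queue counters. Structure and function names are illustrative assumptions.

#include <stdint.h>

struct counter_pair { uint64_t pkts, bytes; };

struct counter_group {
    struct counter_pair pre_queue;    /* written by format block, before queueing */
    struct counter_pair post_queue;   /* written after the queue is served */
    uint64_t timestamp;               /* one timestamp for the whole group */
};

/* Assumed primitives for reading a hardware counter pair and the clock. */
extern struct counter_pair read_counter_pair(int slice_id, int block, int idx);
extern uint64_t now(void);

void poll_group(int slice_id, int idx, struct counter_group *g) {
    g->pre_queue  = read_counter_pair(slice_id, /*block=*/1, idx);
    g->post_queue = read_counter_pair(slice_id, /*block=*/2, idx);
    g->timestamp  = now();            /* single timestamp for the group */
}

/* Packets currently queued in this category. */
uint64_t queued_pkts(const struct counter_group *g) {
    return g->pre_queue.pkts - g->post_queue.pkts;
}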
Queue Management
• Bandwidth resources allocated on basis of external physical interfaces.
  • by default, each slice gets equal share of each external physical interface
  • NPE has scheduler for each external physical interface it sends to
• Each NP-slice has its own set of queues (see sketch below).
  • each queue is configured for a specific external interface
  • each slice has a fixed quantum for each external interface, which it may divide among the different queues as it wishes
  • mapping of packets to queues is determined by slice code option
    • may be based on filter lookup result
• Dynamic scheduling of physical interfaces.
  • different NPEs (and GPEs) may send to same physical interface
  • bandwidth of the physical interface must be divided among senders to prevent excessive queueing in LC
  • use form of distributed scheduling to assign shares
    • share based on number of backlogged slices waiting on interface
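A minimal sketch of dividing a slice's fixed per-interface quantum among its queues, written in the spirit of deficit round robin; the per-interface scheduler would invoke something like this for each slice per round. The structures, weights-sum-to-100 convention, and helper names are assumptions, not the actual NPE scheduler.

#include <stdint.h>
#include <stdbool.h>

#define MAX_QUEUES 8

struct slice_queue {
    uint32_t weight;      /* slice-chosen split of its quantum; sums to 100 */
    uint32_t deficit;     /* bytes this queue may still send in this round */
    /* ... packet list omitted ... */
};

struct slice_sched {
    uint32_t quantum;                 /* slice's bytes per round on this interface */
    struct slice_queue q[MAX_QUEUES];
};

extern bool     queue_nonempty(struct slice_queue *q);
extern uint32_t head_pkt_len(struct slice_queue *q);
extern void     send_head_pkt(struct slice_queue *q);

void serve_slice_round(struct slice_sched *s) {
    for (int i = 0; i < MAX_QUEUES; i++) {
        struct slice_queue *q = &s->q[i];
        q->deficit += s->quantum * q->weight / 100;   /* this queue's share */
        while (queue_nonempty(q) && head_pkt_len(q) <= q->deficit) {
            q->deficit -= head_pkt_len(q);
            send_head_pkt(q);
        }
        if (!queue_nonempty(q))
            q->deficit = 0;   /* idle queues don't accumulate credit */
    }
}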
Phase 0 Demonstration
(Diagram: GPE, NPE, Switch, LC, internet, local hosts)
• Possible NP slice applications.
  • basic IPv4 forwarder
  • enhanced IPv4 forwarder
    • use TOS and queue lengths to make ECN mark decisions and/or discard decisions
  • metanet with geodesic addressing and stochastic path tagging
• Run multiple NP-slices on each NP.
• On GPE, run pair of standard Plab apps, plus exception code for the NP-slices.
  • select sample Plab apps for which we can get help
• What do we show?
  • ability to add/remove NP-slices
  • ability to add/remove filters to change routes
  • performance charts of queueing performance
  • compare NP-slice to GP-slice and standard PlanetLab slice
Phase 1
(Diagram: GPE, NPE, CP, Switch, LC)
• New elements.
  • multiple GPE blades, each with external IP address
  • CP to manage NPE usage
  • expanded set of code options
• NP management divided between Local Resource Manager (LRM) running on GPEs and Global Resource Manager (GRM) on CP.
  • Vservers interact with LRM as before
  • LRM contacts GRM to allocate NP slices
  • port number management handled by LRM
• LC uses destination IP addr and dport to direct packets to correct NPE or GPE.
• Code options
  • multicast-capable IPv4 MR
  • ???
Phase 2 Overview
(Diagram: GPE, NPE, CP, Switch, LC)
• New elements.
  • multiple GPEs in unified node
  • CP manages interaction with PLC
  • CP coordinates use of external port numbers
  • transparent login service
  • dynamic NP code installation
• Line Cards demux arriving packets using IP filters and remap port numbers as needed.
  • requires NAT functionality in LCs to handle outgoing TCP connections, ICMP echo, etc.
  • other cases handled with static port numbers
Slice Configuration
(Diagram: GPE with NM, NPE, CP with NM and myPLC, PLC, Switch, LC)
• Slice descriptions created using standard PlanetLab mechanisms and stored in PLC database.
• CP's Node Manager periodically retrieves slice descriptions and makes local copy of relevant parts in myPLC database.
• GPEs' Node Managers periodically retrieve slice descriptions and update their local configuration.
Managing Resource Usage
(Diagram: GPE with LRM and VS, NPE, CP with GRM, Switch, LC)
• Vservers request NP-slices from Local Resource Manager (LRM).
  • LRM relays request to GRM, which assigns NPE slice
  • LRM configures NPE to handle slice
  • GRM configures filters in LC
• Managing external port numbers.
  • LRM reserves pool of port numbers by opening connections and binding port numbers
  • user may request specific port number from LRM pool
  • used for NP ports and externally visible "server ports" on GPEs
• Network Address Translation (see sketch below).
  • to allow outgoing TCP connections to be handled transparently
  • use LC filter to re-direct TCP control traffic to xScale
  • address translation created when outgoing connection request intercepted
  • similar issue for outgoing ICMP echo packets – insert filter to handle later packets with same id (both ways)
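A minimal sketch of the NAT state the LC/xScale might keep for outgoing TCP connections from GPEs behind the node's single external address: map an internal (GPE address, port) to an external port, and use the external port on inbound packets to find the internal destination. Structures and lookups are illustrative, not the actual LC design.

#include <stdint.h>
#include <stddef.h>

struct nat_entry {
    uint32_t int_ip;      /* internal GPE address */
    uint16_t int_port;    /* internal TCP source port */
    uint16_t ext_port;    /* externally visible port on the node's address */
    uint8_t  in_use;
};

#define NAT_TABLE_SIZE 1024
static struct nat_entry nat_table[NAT_TABLE_SIZE];

/* Called when an outgoing connection request is intercepted: record the
 * translation from (internal addr, port) to an allocated external port. */
struct nat_entry *nat_add(uint32_t int_ip, uint16_t int_port, uint16_t ext_port) {
    for (size_t i = 0; i < NAT_TABLE_SIZE; i++) {
        if (!nat_table[i].in_use) {
            nat_table[i] = (struct nat_entry){ int_ip, int_port, ext_port, 1 };
            return &nat_table[i];
        }
    }
    return NULL;   /* table full */
}

/* Inbound direction: the external dport selects the internal GPE and port. */
struct nat_entry *nat_lookup_inbound(uint16_t ext_dport) {
    for (size_t i = 0; i < NAT_TABLE_SIZE; i++)
        if (nat_table[i].in_use && nat_table[i].ext_port == ext_dport)
            return &nat_table[i];
    return NULL;
}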
Transparent Login Process
• Objective – allow users to log in to the system and configure things, in a similar way to PlanetLab.
  • currently, they SSH to a selected node and the SSH server authenticates and forks a process to run in the appropriate Vserver
  • seamless handoff, as new process acquires TCP state
• Tricky to replicate precisely.
  • if we SSH to CP and authenticate there, need to transfer session to appropriate GPE and Vserver
  • need general process migration mechanism to make this seamless
• Another approach is to authenticate on CP and use user-level forwarding to give impression of direct connection.
• Or, use alternate client that users invoke to access our system (see sketch below).
  • client contacts CP, informing it of slice user wants to login to
  • CP returns an external port number that is remapped by LC to SSH port on target host
  • client then opens SSH connection to target host through the provided external port number
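A minimal sketch of the alternate-client flow: the client asks the CP which external port the LC will remap to the SSH port of the target host for the given slice, then runs ssh against that port. The control-channel helper (request_login_port) and its protocol are assumptions.

#include <stdio.h>
#include <unistd.h>

/* Assumed helper: contacts the CP over a control connection, names the slice,
 * and returns the external port the LC remaps to SSH on the target host. */
extern int request_login_port(const char *cp_addr, const char *slice);

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <cp-address> <slice>\n", argv[0]);
        return 1;
    }
    int port = request_login_port(argv[1], argv[2]);
    if (port < 0) return 1;

    char portstr[16];
    snprintf(portstr, sizeof(portstr), "%d", port);
    /* The LC remaps <cp-address>:<port> to the SSH port on the target Vserver,
     * so a plain ssh connection lands in the right place. */
    execlp("ssh", "ssh", "-p", portstr, argv[1], (char *)NULL);
    perror("execlp");   /* only reached if exec fails */
    return 1;
}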
Specifying Parse and Format Code
• Use restricted fragment of C (example below).
  • all variables are static; user declares storage type
    • register, local, SRAM, DRAM
    • registers and local variables not retained between packets
  • loops with bounded iterations only
    • sample syntax: for (<iterator>) : <constant-expression> { loop body }
    • <constant-expression> can include the pseudo-constant PACKET_LENGTH, which refers to the number of bytes in the packet being processed
  • only non-recursive functions/procedures
  • no pointers
  • no floating point
• Compiler verifies that worst-case code path has bounded number of instructions and memory accesses.
  • at most C1 + C2*PACKET_LENGTH, where C1 and C2 are constants to be determined
• Limited code size (maybe 500-1000 instructions per slice).
• Implement as front-end that produces standard C and sends to Intel compiler for code gen; back-end to verify code path lengths.
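An illustrative sketch of what slice parse code could look like in the restricted fragment. The slides do not fully specify the surface syntax, so the storage-class keywords (sram, local) and helper names (pkt_byte, set_exception) are assumptions; the point is to show the stated rules: static variables with declared storage, bounded loops using the for(...) : <constant-expression> form with PACKET_LENGTH, no pointers, no recursion, no floating point.

sram  static unsigned int pkts_seen;      /* per-slice static data in SRAM */
local static unsigned int i;              /* not retained between packets */
local static unsigned int sum;

void parse() {
    pkts_seen = pkts_seen + 1;

    /* Bounded loop: at most PACKET_LENGTH iterations, so the compiler can
     * verify the C1 + C2*PACKET_LENGTH bound on the worst-case path. */
    sum = 0;
    for (i = 0; i < PACKET_LENGTH; i = i + 1) : PACKET_LENGTH {
        sum = sum + pkt_byte(i);          /* e.g., a simple payload checksum */
    }

    if (sum == 0)
        set_exception();                  /* hand unusual packets to the GPE */
}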
Dynamic Configuration of Slices
(Diagram: parse MEs ME1–ME3; bypass ME (MEB) used when reconfiguring)
• To add new slice (see sketch below):
  • configure bypass ME with code for old slices and new slice
  • swap in using scratch rings for input and output
  • reconfigure original ME with code image and swap back
    • requires MEs retain no state in local memory between packets
  • drain packets from "old" ME before accepting packets from new one
• Similar process required if MEs used in parallel.
  • configure spare ME with new code image and add to pool
  • iteratively swap out others and swap back in
  • for n MEs, need n swap operations
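A minimal sketch of the swap sequence the xScale control code might follow to add a new slice's code option without disturbing existing slices. The me_load/me_start/ring_redirect/me_drained calls are hypothetical control operations, not a real IXP SDK API.

#include <stdbool.h>

extern void me_stop(int me);
extern void me_load(int me, const void *code_image);   /* write ME program store */
extern void me_start(int me);
extern void ring_redirect(int in_ring, int to_me);     /* move scratch-ring input */
extern bool me_drained(int me);                        /* no packets in flight */

void add_slice_code(int orig_me, int bypass_me, int in_ring,
                    const void *image_with_new_slice) {
    /* 1. Load the bypass ME with code for the old slices plus the new slice. */
    me_load(bypass_me, image_with_new_slice);
    me_start(bypass_me);

    /* 2. Swap in: steer the input scratch ring to the bypass ME, then let the
     *    original ME drain before touching it (MEs keep no local state
     *    between packets, so nothing is lost). */
    ring_redirect(in_ring, bypass_me);
    while (!me_drained(orig_me))
        ;                                   /* busy-wait is fine on control path */

    /* 3. Reconfigure the original ME with the new code image and swap back. */
    me_stop(orig_me);
    me_load(orig_me, image_with_new_slice);
    me_start(orig_me);
    ring_redirect(in_ring, orig_me);

    /* 4. Drain and stop the bypass ME so it is free for the next update. */
    while (!me_drained(bypass_me))
        ;
    me_stop(bypass_me);
}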