Network Processor Algorithms: Design and Analysis

Network Processor Algorithms: Design and Analysis Stochastic Networks Conference Montreal July 22, 2004 Balaji Prabhakar Balaji Prabhakar Stanford University

Overview • Network Processors • What are they? • Why are they interesting to industry and to researchers • SIFT: a simple algorithm for identifying large flows • The algorithm and its uses • Traffic statistics counters • The basic problem and algorithms • Sharing processors and buffers • A cost/benefit analysis

IP Routers 19” 19” Capacity is sum of rates of line-cards Capacity: 160Gb/sPower: 4.2kW Capacity: 80Gb/sPower: 2.6kW 6ft 3ft 2ft 2.5ft 2.5ft Juniper M160 Cisco GSR 12416

A Detailed Sketch Output Scheduler Interconnection Fabric Switch Lookup Engine Packet Buffers Network Processor Lookup Engine Packet Buffers Network Processor Lookup Engine Packet Buffers Network Processor Line cards Outputs

Network Processors • Network processors are an increasingly important component of IP routers • They perform a number of tasks (essentially everything except Switching and Route lookup) • Buffer management • Congestion control • Output scheduling • Traffic statistics counters • Security … • They are programmable, hence add great flexibility to a router’s functionality

Network Processors • But, because they operate under severe constraints • very high line rates • heat constraints the algorithms that they can support should be lightweight • They have become very attractive to industry • They give rise to some interesting algorithmic and performance analytic questions

Rest Of The Talk SIFT: a simple algorithm for identifying large flows • The algorithm and its uses • with Arpita Ghosh and Costas Psounis • Traffic statistics counters • The basic problem and algorithms • with Sundar Iyer, Nick McKeown and Devavrat Shah • Sharing processors and buffers • A cost/benefit analysis • with Vivek Farias and Ciamac Moallemi

SIFT: Motivation Egress Buffer FIFO: SRPT: • Current egress buffers on router line cards serve packets in a FIFO manner • But, giving the packets of short flows a higher priority, e.g. using the SRPT (Shortest Remaining Processing Time) policy • reduces average flow delay • given the heavy-tailed nature of Internet flow size distribution, the reduction in delay can be huge

But … • SRPT is unimplementable • router needs to know residual flow sizes for all enqueued flows: virtually impossible to implement • Other pre-emptive schemes like SFF (shortest flow first) or LAS (least attained service) are like-wise too complicated to implement • This has led researchers to consider tagging flows at the edge, where the number of distinct flows is much smaller • but, this requires a different design of edge and core routers • more importantly, needs extra space on IP packet headers to signal flow size • Is something simpler possible?

SIFT: A randomized algorithm • Flip a coin with bias p (= 0.01, say) for heads on each arriving packet, independently from packet to packet • A flow is “sampled” if one its packets has a head on it T T T T T H H

SIFT: A Randomized Algorithm • A flow of size X has roughly 0.01Xchance of being sampled • flows with fewer than 15 packets are sampled with prob 0.15 • flows with more than 100 packets are sampled with prob 1 • the precise probability is: 1 – (1-0.01)X • Most short flows will not be sampled, most long flows will be

The Accuracy of Classification • Ideally, we would like to sample like the blue curve • Sampling with prob p gives the red curve • there are false positives and false negatives • Can we get the green curve? Prob (sampled) Flow size

SIFT+ • Sample with a coin of bias q = 0.1 • say that a flow is “sampled” if it gets two heads! • this reduces the chance of making errors • but, you have to have a count the number heads • So, how can we use SIFT at a router?

SIFT at a router B All flows B/2 Short flows sampling B/2 Long flows • Sample incoming packets • Place any packet with a head (or the second such packet) in the low priority buffer • Place all further packets from this flow in the low priority buffer (to avoid mis-sequencing)

Simulation results • Topology: Traffic Traffic Sinks Sources

Overall Average Delays

Average Delay for Short Flows

Average Delay for Long Flows

Implementation Requirements • SIFT needs • two logical queues in one physical buffer • to sample arriving packets • a table for maintaining id of sampled flows • to check whether incoming packet belongs to sampled flow or not • all quite simple to implement

A Big Bonus • The buffer of the short flows has very low occupancy • so, can we simply reduce it drastically without sacrificing performance? • More precisely, suppose • we reduce the buffer size for the small flows, increase it for the large flows, keep the total the same as FIFO

SIFT Incurs Fewer Drops Buffer_Size(Short flows) = 10; Buffer_Size(Long flows) = 290; Buffer_Size(Single FIFO Queue) = 300; SIFT ------ FIFO ------

Reducing Total Buffer Size • Suppose we reduce the buffer size of the long flows as well • Questions: • will packet drops still be fewer? • will the delays still be as good?

Drops With Less Total Buffer Buffer_Size(PRQ0) = 10; Buffer_Size(PRQ1) = 190; Buffer_Size(One Queue) = 300; OneQueue SIFT ------ FIFO ------

Delay Histogram for Short Flows SIFT ------ FIFO ------

Delay Histogram for Long Flows SIFT ------ FIFO ------

Why SIFT Reduces Buffers • The amount of buffering needed to keep links fully utilized • old formula: = 10 Gbps x 0.25 = 2.5 G • corrected to: ¼ 250 M • But, this formula is for large (elephant) flows, not for short (mice) flows • elephant arrival rate: 0.65 or 0.7 of C; hence they smaller buffers for them • mice buffers are almost empty due to high priority, mice don’t cause elephant packet drops • elephants use TCP to regulate their sending rate according to mice SIFT elephants

Conclusions for SIFT • A randomized scheme, preliminary results show that • it has a low implementation complexity • it reduces delays drastically (users are happy) • with 30-35% smaller buffers at egress line cards (router manufacturers are happy) • Leads to a 15 pkts or less lane on the Internet, could be useful • Further work needed • at the moment we have a good understanding of how to sample, and extensive (and encouraging) simulation tests • need to understand the effect of reduced buffers on end-to-end congestion control algorithms

Traffic Statistics Counters: Motivation • Switches maintain statistics, typically using counters that are incremented when packets arrive • At high line rates, memory technology is a limiting factor for the implementation of counters; for example, in a 40 Gb/s switch, each packet must be processed in 8 ns • To maintain a counter per flow at these line rates, we would like an architecture with the speed of SRAM, and the density (size) of DRAM

Hybrid Architecture • Shah, Iyer, Prabhakar, and McKeown (2001) proposed a hybrid SRAM/DRAM architecture DRAM SRAM Update counter in DRAM, empty corresponding counter in SRAM (once every b time slots) N counters Arrivals (at most one per time slot) Counter Management Algorithm … …

Counter Management Algorithm • Shah et al. place a requirement on the counter management algorithm (CMA) that it must maintain all counter values accurately • That is, given N and b, what should the size of each SRAM counter be so that no counts are missed?

Some CMAs • Round robin • maximum counter value is bN • Largest Counter First (LCF) • optimal in terms of SRAM memory usage • no counter can have a value larger than:

Analysis of LCF • This upper bound is proved by establishing a bound on the following potential (Lyapunov) function • let Qi(t) be the size of counter i at time t, then • E.g. for b = 2, • Hence, the size of the largest counter is at most

An Implementable Algorithm • LCF is difficult to implement • with one counter per flow, we would like to support at least 1 million counters • maintaining a sorted list of counters to determine the longest counter takes too much SRAM memory • Ramabhadran and Varghese (2003) proposed a simpler algorithm with the same memory usage as LCF

LCF with Threshold • The algorithm keeps track of the counters that have value at least as large as b • At any service time, let j be the counter with the largest value among those incremented since the previous service, and let c be its value • if c ¸ b, serve counter j • if c ·b, serve any counter with value at least b; if no such counter exists, serve counter j • Maintaining the counters with values at least b is a non-trivial problem; it is solved using a bitmap and an additional data structure • Is something even simpler possible?

Some Simpler Algorithms … • Possible approaches for a CMA that is simpler to implement: • arrival information (serve largest counter among those incremented) • random sampling • round-robin pointer • Trade-off between simplicity and performance: more SRAM is needed in the worst case for these schemes

An Alternative Architecture DRAM SRAM • Decision problem: given a counter with a particular value and the occupancy of the buffer, when should the counter value be moved to the FIFO buffer? What size counters does this lead to? • Interesting question with Poisson arrivals, exponential services, tractable N counters FIFO Buffer Counter Management Algorithm … …

The Cost of Sharing • We have seen that there is a very limited amount of buffering and processing capability in each line card • In order to fully utilize these resources, it will become necessary to share them amongst the packets arriving at each line card • But, sharing imposes a cost • we may need to traverse the switch fabric more often than needed • each of the two processors involved in a migration will need to do some processing; e.g. e local, 1 remote, instead of just 1 • or, the host processor may simply be worse at the processing; e.g. 1 local versus K (> 1) remote • Need to understand the tradeoff between costs and benefits • will focus on a specific queueing model • interested in simple rules • benefit measured in reduction of backlogs

The Setup Poisson (l) Poisson (l) Poisson (l) Poisson (l) K 1 1 exp(1) exp(1) exp(1) exp(1) • Does sharing reduce backlogs?

Additive Threshold Policy • Job arrives at queue 1 • Send the job to queue 2 if • Otherwise, keep the job in queue 1 • Analogous policy for jobs arriving at queue 2

Additive Thresholds - Queue Tails No Sharing

Additive Thresholds - Stability • Theorem: Additive policy is stable if and unstable if • For example, if Stable for Unstable for

Inference • The pros/cons of sharing • Reduction in backlogs • Loss of throughput

Multiplicative Threshold Policy • Job arrives at queue 1 • Send the job to queue 2 if • Otherwise, keep the job in queue 1 • Theorem: Multiplicative policy is stable for all l < 1 • Interestingly, this policy improves delays while preserving throughput!

Multiplicative Thresholds - Queue Tails No Sharing

Multiplicative Thresholds - Delay Average Delay

Conclusions • Network processors add useful features to a router’s function • There are many algorithmic questions that come up • simple, high performance algorithms are needed • For the theorist, there are many new and interesting questions; we have seen three examples briefly • SIFT: a sampling algorithm • Designing traffic statistics counters • Sharing: a cost-benefit analysis

Network Processor Algorithms: Design and Analysis