Winter 2012, Lecture 8a: Packet Buffers with Latency. EE384 Packet Switch Architectures. Sundar Iyer.
Reducing the size of the SRAM
Intuition
• If we use a lookahead buffer to peek at the requests "in advance", we can replenish the SRAM cache only when needed.
• This increases the latency from when a request is made until the byte is available.
• But because it is a pipeline, the issue rate is the same.
The Algorithm
(Figure: Q queues, each caching up to b − 1 bytes in SRAM, plus a lookahead buffer of Q(b − 1) + 1 requests.)
• Lookahead: the next Q(b − 1) + 1 arbiter requests are known.
• Compute: determine which queue will run into "trouble" soonest (green in the example below).
• Replenish: fetch b bytes for the "troubled" queue.
Example (Q = 4, b = 4): snapshots of the queues and the lookahead buffer from t = 0 to t = 8. The green queue is critical at t = 0, blue at t = 4, and red at t = 8.
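To make the replenishment rule concrete, here is a minimal Python simulation sketch (the names `earliest_critical` and `simulate`, and the stall-based bookkeeping, are ours, not from the lecture; DRAM is assumed to always hold bytes for every queue):

```python
from collections import deque

def earliest_critical(sram, lookahead):
    """Return the queue whose cached bytes run out soonest, scanning
    the window of known future requests; None if no queue underflows."""
    remaining = dict(sram)              # bytes left in SRAM per queue
    for q in lookahead:
        remaining[q] -= 1
        if remaining[q] < 0:            # q would underflow here: critical
            return q
    return None

def simulate(requests, Q, b):
    """Serve one per-byte request per timeslot; every b timeslots one
    queue may be replenished with b bytes from (assumed infinite) DRAM."""
    sram = {q: 0 for q in range(Q)}
    window = Q * (b - 1) + 1            # lookahead depth from the slide
    pending = deque(requests)
    t = 0
    while pending:
        if t % b == 0:                  # one DRAM fetch every b timeslots
            q = earliest_critical(sram, list(pending)[:window])
            if q is not None:
                sram[q] += b            # replenish the troubled queue
        if sram[pending[0]] > 0:        # head byte is cached: serve it
            sram[pending[0]] -= 1
            pending.popleft()
        t += 1                          # otherwise stall (this is the latency)
    return t

# e.g. simulate([0, 1, 2, 3] * 8, Q=4, b=4)
```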
Claim (Patient Arbiter): with an SRAM of size Q(b − 1), a requested byte is available no later than Q(b − 1) + 1 timeslots after it is requested.
Example: 160 Gb/s linecard, b = 2560, Q = 512: SRAM = 1.3 MBytes, and the delay bound is 65 µs (equivalent to 13 miles of fiber).
Controlling the latency
• What if the application can only tolerate a latency l < Q(b − 1) + 1 timeslots?
• Algorithm: serve a queue once every b timeslots, in the following order (see the sketch below):
• If there is an earliest critical queue, replenish it.
• If not, replenish the queue that will have the largest deficit l timeslots in the future.
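A self-contained sketch of that selection rule (the deficit bookkeeping is our own illustration of "largest deficit l timeslots in the future"):

```python
def pick_queue(sram, lookahead, l):
    """Choose the queue to replenish under a latency budget of l timeslots.
    sram: queue -> cached bytes; lookahead: known future requests."""
    # Rule 1: the earliest critical queue, i.e. the first queue that
    # would exhaust its cached bytes within the lookahead window.
    remaining = dict(sram)
    for q in lookahead:
        remaining[q] -= 1
        if remaining[q] < 0:
            return q
    # Rule 2: otherwise, the largest projected deficit l timeslots out
    # (requests arriving within l minus bytes already cached).
    deficit = {q: -cached for q, cached in sram.items()}
    for q in lookahead[:l]:
        deficit[q] += 1
    return max(deficit, key=deficit.get)
```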
Queue length vs. pipeline depth (figure, Q = 1000, b = 10): SRAM size plotted against pipeline latency x, falling from the queue length needed for zero latency to the queue length needed at the maximum latency.
Some Real Implementations
• Cache size can be further optimized based on:
• Using embedded DRAM
• Dividing queues across ingress, egress
• Grouping queues into aggregates (10 × 1G is a 10G stream)
• Knowledge of the scheduler pattern
• Granularity of the read scheduler
• Most implementations:
• Q < 1024 (original buffer cache), Q < 16K (modified schemes)
• 15+ unique designs
• 40G to 480G streams
• Cache sizes < 32 Mb (~8 mm² in a 28nm implementation)
Winter 2012, Lecture 8b: Statistics Counters. EE384 Packet Switch Architectures. Sundar Iyer.
What is a Statistics Counter?
• When a packet arrives, it is first classified to determine the type of the packet.
• Depending on the type(s), one or more counters corresponding to the matching criteria are updated (typically incremented).
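In code, the per-packet work is just classification followed by a handful of increments. This fragment is only schematic; the `classify` rules are invented for illustration:

```python
counters = {}   # counter name -> count

def classify(pkt):
    """Toy classifier: return the names of every counter this packet hits."""
    matched = ["all_packets"]
    if pkt.get("proto") == "tcp":
        matched.append("tcp_segments")
    matched.append("prefix:" + pkt["dst"].rsplit(".", 1)[0])
    return matched

def count(pkt):
    for name in classify(pkt):          # one or more criteria may match
        counters[name] = counters.get(name, 0) + 1

count({"proto": "tcp", "dst": "10.0.1.7"})
# counters == {'all_packets': 1, 'tcp_segments': 1, 'prefix:10.0.1': 1}
```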
Examples of Statistics Counters • Packet switches maintain counters for • Route prefixes, ACLs, TCP connections • SNMP MIBs • Counters are required for • Traffic Engineering • Policing & Shaping • Intrusion detection • Performance Monitoring (RMON) • Network Tracing
Motivation: build high-speed counters
Determine and analyze techniques for building very high speed (>100 Gb/s) statistics counters that support an extremely large number of counters.
Example: OC192c = 10 Gb/s; counters/pkt (C) = 10; Reff = 100 Gb/s; total counters (N) = 1 million; counter size (M) = 64 bits; counter memory = N × M = 64 Mb; memory operations = 150M reads and 150M writes per second.
Question
• Can we use the same packet-buffer hierarchy to build statistics counters?
• Answer: yes. Modify the MDQF algorithm:
• Number of counters = N
• Consider counters in base 1 (an increment mimics the arrival of a 1-bit cell)
• Create the Longest Counter First (LCF) algorithm
• Maximum size of counter (base 1): ~ b(2 + ln N)
• Maximum size of counter (base 2): ~ log2[b(2 + ln N)]
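A minimal sketch of the counter-cache idea (our own simplification: it ties the flush to every b-th increment rather than to timeslots): increments land in small SRAM counters, and once every b cycles the largest SRAM counter is flushed to its full-width DRAM counter, which is what keeps the SRAM counters small.

```python
class LCFCounterCache:
    """Small SRAM counters in front of full-width DRAM counters.
    Every b cycles the largest SRAM counter is flushed to DRAM
    (Longest/Largest Counter First)."""

    def __init__(self, n, b):
        self.sram = [0] * n         # small, fast counters
        self.dram = [0] * n         # full 64-bit counters
        self.b = b
        self.cycle = 0

    def increment(self, i):
        self.sram[i] += 1
        self.cycle += 1
        if self.cycle % self.b == 0:    # DRAM keeps up at 1/b the rate
            j = max(range(len(self.sram)), key=self.sram.__getitem__)
            self.dram[j] += self.sram[j]
            self.sram[j] = 0            # flushed counter restarts at 0

    def read(self, i):
        return self.dram[i] + self.sram[i]
```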
Optimality of LCF-CMA: can we do better?
Theorem: LCF-CMA is optimal in the sense that it minimizes the size of the counter maintained in SRAM.
Minimum size of the SRAM counter
Theorem: (speed-down factor b > 1) LCF-CMA minimizes the size of the counter (in bits) in the SRAM to:
f(b, N) = log2( ln(bN) / ln[b/(b − 1)] ) = Θ(log log N)
Implementation numbers
Example:
• OC192 line card: R = 10 Gb/s
• No. of counters per packet: C = 10
• Reff = R × C = 100 Gb/s
• Cell size: 64 bytes
• Teff = 5.12/2 = 2.56 ns (a 64-byte cell takes 5.12 ns at 100 Gb/s, and each update is a read plus a write)
• DRAM random access time: 51.2 ns
• Speed-down factor: b ≥ T/Teff = 20
• Total number of counters: 1000 to 1 million
• Size of counters: 64 bits (full, in DRAM), 8 bits (cached, in SRAM)
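As a back-of-the-envelope check (ours, not from the lecture), plugging b = 20 and N = 10⁶ into the bound from the previous slide gives roughly 8 bits per SRAM counter, matching the figure above:

```python
import math

def sram_counter_bits(b, N):
    """f(b, N) = log2( ln(bN) / ln(b / (b - 1)) ) from the theorem."""
    return math.log2(math.log(b * N) / math.log(b / (b - 1)))

print(sram_counter_bits(20, 10**6))     # ~8.4 -> 8-bit SRAM counters
```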
Conclusion • The SRAM size required by LCF-CMA is a very slow growing function of N • There is a minimal tradeoff in SRAM memory size • The LCF-CMA technique allows a designer to arbitrarily scale the speed of a statistics counter architecture using existing memory technology
Winter 2012, Lecture 8c: QoS and PIFO. EE384 Packet Switch Architectures. Sundar Iyer.
Contents • Parallel routers: work-conserving (Recap) • Parallel routers: delay guarantees
The Parallel Shared Memory Router
(Figure: arriving packets from ports A, B, C, each at rate R, are written into a pool of K = 8 shared memories and read out at departure; cells such as A1, A3, A4, A5, B1, B3, C1, C3 are spread across the memories.)
At most one operation per memory per time slot: a write or a read.
Problem
• How can we design a parallel output-queued work-conserving router from slower parallel memories?
Theorem 1 (sufficiency)
• A parallel output-queued router is work-conserving with 3N − 1 memories that can perform at most one memory operation per time slot.
Re-stating the Problem • There are K cages which can contain an infinite number of pigeons. • Assume that time is slotted, and in any one time slot • at most N pigeons can arrive and at most N can depart. • at most 1 pigeon can enter or leave a cage via a pigeon hole. • the time slot at which arriving pigeons will depart is known • For any router: • What is the minimum K, such that all N pigeons can be immediately placed in a cage when they arrive, and can depart at the right time?
Intuition for Theorem 1
(Figure: a memory at time t holding cells with departure times DT = t and DT = t + X.)
• Only one packet can enter a memory at time t.
• Only one packet can enter or leave a memory at time t.
• Only one packet can enter or leave a memory at any time.
Proof of Theorem 1
• When a packet arrives in a time slot, it must choose a memory not chosen by:
• the N − 1 other packets that arrive in that time slot;
• the N packets that depart in that time slot;
• the N − 1 other packets that can depart at the same time as this packet departs (in the future).
• Proof: that excludes at most (N − 1) + N + (N − 1) = 3N − 2 memories, so by the pigeonhole principle 3N − 1 memories that can perform at most one memory operation per time slot are sufficient for the router to be work-conserving.
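The proof is constructive, and the following sketch (our illustration; `assign_memories` and its bookkeeping are not from the lecture) shows the resulting assignment rule: exclude the memories claimed by this slot's other writes, this slot's reads, and the reads already reserved for the cell's own departure slot, then take any free memory. With K = 3N − 1 the choice never fails.

```python
def assign_memories(arrivals, K):
    """arrivals[t] lists the departure times of the <= N cells arriving
    in slot t (departure times assumed > t).  Returns a mapping
    (slot, cell index) -> memory, with each of the K memories doing at
    most one read or one write per time slot."""
    reads = {}                                  # departure slot -> busy memories
    schedule = {}
    for t, cells in enumerate(arrivals):
        writes = set()                          # memories written in slot t
        busy_now = reads.get(t, set())          # memories read in slot t
        for i, dt in enumerate(cells):
            excluded = writes | busy_now | reads.get(dt, set())
            # Pigeonhole: |excluded| <= (N-1) + N + (N-1) = 3N - 2 < K.
            m = next(m for m in range(K) if m not in excluded)
            schedule[(t, i)] = m
            writes.add(m)                       # the write happens now
            reads.setdefault(dt, set()).add(m)  # the read is reserved for dt
    return schedule

# N = 3 ports, K = 3*3 - 1 = 8 memories: a choice always exists.
print(assign_memories([[3, 4, 5], [3, 4, 6], [4, 5, 6]], K=8))
```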
The Parallel Shared Memory Router (same figure as before, K = 8)
From Theorem 1, with N = 3 ports, k = 7 memories don't suffice, but 8 memories do.
Summary: routers which give 100% throughput

| Router | Fabric | # Mem. | Mem. BW per Mem.¹ | Total Mem. BW | Switch BW | Switch Algorithm |
|---|---|---|---|---|---|---|
| Output-Queued | Bus | N | (N+1)R | N(N+1)R | NR | None |
| C. Shared Mem. | Bus | 1 | 2NR | 2NR | 2NR | None |
| Input Queued | Crossbar | N | 2R | 2NR | NR | MWM |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 2NR | Maximal |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 3NR | Time Reserve* |
| P. Shared Mem. | Bus | k | 3NR/k | 3NR | 2NR | C. Sets |
| DSM (Juniper) | Xbar | N | 3R | 3NR | 4NR | Edge Color |
| DSM (Juniper) | Xbar | N | 3R | 3NR | 6NR | C. Sets |
| DSM (Juniper) | Xbar | N | 4R | 4NR | 4NR | C. Sets |
| PPS - OQ | Clos | Nk | 2R(N+1)/k | 2N(N+1)R | 4NR | C. Sets |
| PPS - OQ | Clos | Nk | 4NR/k | 4NR | 4NR | C. Sets |
| PPS - Shared Memory | Clos | Nk | 2NR/k | 2NR | 2NR | None |

¹ Lower memory bandwidth per memory permits a higher random access time, which is better.
Contents • Parallel routers: work-conserving • Parallel routers: delay guarantees
Delay Guarantees
(Figure: one output with many logical FIFO queues 1 … m of constrained traffic, where weighted fair queueing sorts packets, is modelled as one output with a single PIFO queue of constrained traffic.)
Push In First Out (PIFO) models:
• Weighted Fair Queueing
• Weighted Round Robin
• Strict priority, etc.
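A PIFO is easy to state in code. In this sketch (ours), the `rank` argument stands in for whatever the policy computes, e.g. WFQ finish times or priority levels:

```python
import bisect

class PIFO:
    """Push In First Out: a packet may be inserted at any position
    (by rank), but departures always take the current head.  Once
    queued, the relative order of packets never changes."""

    def __init__(self):
        self._q = []                            # kept sorted by rank

    def push(self, rank, pkt):
        bisect.insort(self._q, (rank, pkt))     # arbitrary insertion point

    def pop(self):
        return self._q.pop(0)[1]                # always the head

pifo = PIFO()
pifo.push(2.0, "low-priority")
pifo.push(1.0, "high-priority")                 # pushed in ahead of the head
assert pifo.pop() == "high-priority"
```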
Delay Guarantees • Problem : • How can we design a parallel output-queued router from slower parallel memories and give delay guarantees? • This is difficult because • The counting technique depends on being able to predict the departure time and schedule it. • The departure time of a cell is not fixed in policies such as strict priority, weighted fair queueing etc.
Theorem 2
• Theorem 2 (sufficiency): a parallel output-queued router can give delay guarantees with 4N − 2 memories that can perform at most one memory operation per time slot.
Intuition for Theorem 2 (N = 3 port router)
(Figure: cells shown in departure order, e.g. 1, 1.5, 2, 2.5, 3, …, 9; a PIFO insertion places a new cell between existing ones.)
• FIFO: one window of memories of size N − 1 that can't be used: the N − 1 packets before the cell at the time of insertion.
• PIFO: two windows of memories of size N − 1 that can't be used: the N − 1 packets before and the N − 1 packets after the cell at the time of insertion.
Proof of Theorem 2
At time t, a cell C with departure time DT = t + T cannot use the memories:
• used to write the N − 1 other arriving cells at t;
• used to read the N departing cells at t;
• that will be used to read the N − 1 cells that depart before it;
• that will be used to read the N − 1 cells that depart after it.
That excludes at most (N − 1) + N + (N − 1) + (N − 1) = 4N − 3 memories, so 4N − 2 memories suffice.
Summary: routers which give delay guarantees

| Router | Fabric | # Mem. | Mem. BW per Mem. | Total Memory BW | Switch BW | Switch Algorithm |
|---|---|---|---|---|---|---|
| Output-Queued | Bus | N | (N+1)R | N(N+1)R | NR | None |
| C. Shared Mem. | Bus | 1 | 2NR | 2NR | 2NR | None |
| Input Queued | Crossbar | N | 2R | 2NR | NR | - |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 2NR | Marriage |
| CIOQ (Cisco) | Crossbar | 2N | 3R | 6NR | 3NR | Time Reserve |
| P. Shared M. | Bus | k | 4NR/k | 4NR | 2NR | C. Sets |
| DSM (Juniper) | Xbar | N | 4R | 4NR | 5NR | Edge Color |
| DSM (Juniper) | Xbar | N | 4R | 4NR | 8NR | C. Sets |
| DSM (Juniper) | Xbar | N | 6R | 6NR | 6NR | C. Sets |
| PPS - OQ | Clos | Nk | 3R(N+1)/k | 3N(N+1)R | 6NR | C. Sets |
| PPS - OQ | Clos | Nk | 6NR/k | 6NR | 6NR | C. Sets |
| PPS - Shared Memory | Clos | Nk | 2NR/k | 2NR | 2NR | - |
Distributed Shared Memory Router
(Figure: line cards 1 … N, each with rate R and its own memories, interconnected by a switch fabric.)
• The central memories are moved to the distributed line cards and shared.
• Memory and line cards can be added incrementally.
• From Theorem 1, 3N − 1 memories that can perform one operation per time slot, i.e. a total memory bandwidth of 3NR, suffice for the router to be work-conserving.
Corollary 1
• Problem: what is the switch bandwidth for a work-conserving DSM router?
• Corollary 1 (sufficiency): a switch bandwidth of 4NR is sufficient for a distributed shared memory router to be work-conserving.
• Proof: per time slot there are a maximum of 3 memory accesses and 1 port access per port, i.e. 3NR + NR = 4NR.
Corollary 2
• Problem: what is the switching algorithm for a work-conserving DSM router?
• Bus: no algorithm needed, but impractical.
• Crossbar: an algorithm is needed because only permutations are allowed.
• Corollary 2 (existence): an edge-coloring algorithm can switch packets for a work-conserving distributed shared memory router (a sketch follows).
• Proof: follows from König's theorem; any bipartite graph with maximum degree Δ has an edge coloring with Δ colors.
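For illustration, here is a compact bipartite edge-coloring sketch in the style of König's proof (ours, not the lecture's algorithm): requests form a bipartite graph between inputs and outputs, each color class is a matching, and each matching is one crossbar configuration. Left and right vertex names are assumed not to collide.

```python
def bipartite_edge_coloring(edges, delta):
    """Color the edges of a bipartite (multi)graph with delta = max
    degree colors so that no two edges sharing a vertex get the same
    color.  at[(v, c)] records the neighbour joined to v by color c."""
    at = {}

    def missing(v):                     # a color still free at v
        return next(c for c in range(delta) if (v, c) not in at)

    for u, v in edges:
        a, b = missing(u), missing(v)
        if a != b:
            # Walk the a/b-alternating path from v; in a bipartite graph
            # it can never reach u, so swapping its colors frees a at v.
            path, x, c = [], v, a
            while (x, c) in at:
                y = at[x, c]
                path.append((x, y, c))
                x, c = y, a + b - c
            for px, py, pc in path:     # remove, then re-add swapped
                del at[px, pc]
                del at[py, pc]
            for px, py, pc in path:
                at[px, a + b - pc] = py
                at[py, a + b - pc] = px
        at[u, a] = v                    # color the new edge with a
        at[v, a] = u
    return at

# Fabric requests (input, output); max degree 2 -> 2 configurations.
edges = [("in0", "out1"), ("in0", "out2"), ("in1", "out1"), ("in2", "out2")]
at = bipartite_edge_coloring(edges, delta=2)
config0 = {u: w for (u, c), w in at.items() if c == 0 and u.startswith("in")}
print(config0)   # one matching = one crossbar configuration
```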