380 likes | 567 Views
Winter 2012 Lecture 8a Packet Buffers with Latency. EE384 Packet Switch Architectures. Sundar Iyer. Reducing the size of the SRAM. Intuition If we use a lookahead buffer to peek at the requests “in advance”, we can replenish the SRAM cache only when needed.
E N D
Winter 2012 Lecture 8a Packet Buffers with Latency EE384 Packet Switch Architectures Sundar Iyer
Reducing the size of the SRAM Intuition • If we use a lookahead buffer to peek at the requests “in advance”, we can replenish the SRAM cache only when needed. • This increases the latency from when a request is made until the byte is available. • But because it is a pipeline, the issue rate is the same.
Requests in Lookahead Buffer Q(b-1) + 1 • Compute: Determine which queue will run into “trouble” soonest. green! Algorithm Queues Q • Lookahead: Next Q(b – 1) + 1arbiter requests are known. b - 1 Queues • Replenish: Fetch b bytes for the “troubled” queue. Q b - 1
Queues Queues Queues Requests in lookahead buffer Requests in lookahead buffer Requests in lookahead buffer t = 0; Green Critical t = 1 t = 2 Requests in lookahead buffer Requests in lookahead buffer Requests in lookahead buffer t = 3 t = 5 t = 4; Blue Critical Requests in lookahead buffer Requests in lookahead buffer Requests in lookahead buffer t = 6 t = 7 t = 8; Red Critical Example Q=4, b=4 EE384x
Example: 160Gb/s linecard, b=2560, Q=512: SRAM = 1.3MBytes, delay bound is 65ms (equivalent to 13 miles of fiber). Claim Patient Arbiter: With an SRAM of size Q(b – 1), a requested byte is available no later than Q(b – 1) + 1 after it is requested.
Controlling the latency • What if application can only tolerate a latency l < Q(b– 1) + 1 timeslots? • Algorithm: Serve a queue once every b timeslots in the following order: • If there is an earliest critical queue, replenish it. • If not, then replenish the queue that will have the largest deficit l timeslots in the future.
Queue length vs. Pipeline depthQ=1000, b = 10 Queue Length for Zero Latency SRAM Size Queue Length for Maximum Latency Pipeline Latency, x
Some Real Implementations • Cache size can be further optimized based on • Using embedded DRAM • Dividing queues across ingress, egress • Grouping queues into aggregates (10 X 1G is a 10 G stream) • Knowledge of the scheduler pattern • Granularity of the read scheduler • Most Implementations • Q < 1024 (original buffer cache),Q < 16K (modified schemes) • 15+ unique designs • 40G to 480G streams • Cache Sizes < 32 Mb (~ 8mm2 in 28nm implementation)
Winter 2012 Lecture 8b Statistics Counters EE384 Packet Switch Architectures Sundar Iyer
What is a Statistics Counter? • When a packet arrives, it is first classified to determine the type of the packet. • Depending on the type(s), one or more counters corresponding to the criteria are updated/incremented
Examples of Statistics Counters • Packet switches maintain counters for • Route prefixes, ACLs, TCP connections • SNMP MIBs • Counters are required for • Traffic Engineering • Policing & Shaping • Intrusion detection • Performance Monitoring (RMON) • Network Tracing
Motivation: Build high speed counters Determine and analyze techniques for building very high speed (>100Gb/s) statistics counters and support an extremely large number of counters. OC192c = 10Gb/s; Counters/pkt (C) = 10; Reff= 100Gb/s; Total Counters (N) = 1 million; Counter Size (M) = 64 bits; Counter Memory = N*M = 64Mb; MOPS =150M reads, 150M writes Stanford University
Question • Can we use the same packet-buffer hierarchy to build statistics counters? • Ans: Yes • Modify MDQF Algorithm • Numbers of Counters = N • Consider counters in base 1 • Mimics cells of size 1 bit per second • Create Longest Counter First Algorithm • Maximum size of counter (base 1): ~ b(2 + ln N) • Maximum size of counter (base 2): ~ log2 b (2 + ln N)
Optimality of LCF-CMACan we do better? Theorem: LCF-CMA, is optimal in the sense that it minimizes the size of the counter maintained in SRAM Stanford University
f(b, N) = log2 () ln bN ________ ln [b/(b-1)] = (log log N) Minimum Size of the SRAM Counter Theorem: (speed-down factor b>1)LCF-CMA, minimizes the size of counter (in bits) in the SRAM to: Stanford University
Implementation Numbers Example: • OC192 Line Card :R = 10Gb/sec. • No. of counters per packet :C = 10 • Reff= R*C :100Gb/s • Cell Size :64bytes • Teff = 5.12/2 :2.56ns • DRAM Random Access Time :51.2ns • Speed-down Factor, b >= T/Teff:20 • Total Number of Counters :1000, 1 million • Size of Counters :64 bits, 8 bits
Conclusion • The SRAM size required by LCF-CMA is a very slow growing function of N • There is a minimal tradeoff in SRAM memory size • The LCF-CMA technique allows a designer to arbitrarily scale the speed of a statistics counter architecture using existing memory technology
Winter 2012 Lecture 8c QoS and PIFO EE384 Packet Switch Architectures Sundar Iyer
Contents • Parallel routers: work-conserving (Recap) • Parallel routers: delay guarantees
Memory Memory Memory Memory Memory Memory Memory Memory A5 1 A4 A3 A4 A1 A1 A5 B1 C3 C3 C1 B3 B1 K=8 C1 The Parallel Shared Memory Router At most one operation – a write or a read per time slot A R B R C Arriving Packets Departing Packets
Problem • Problem : • How can we design a parallel output-queued work-conserving router from slower parallel memories? • Theorem 1: (sufficiency) • A parallel output-queued router is work-conserving with 3N –1 memories that can perform at most one memory operation per time slot
Re-stating the Problem • There are K cages which can contain an infinite number of pigeons. • Assume that time is slotted, and in any one time slot • at most N pigeons can arrive and at most N can depart. • at most 1 pigeon can enter or leave a cage via a pigeon hole. • the time slot at which arriving pigeons will depart is known • For any router: • What is the minimum K, such that all N pigeons can be immediately placed in a cage when they arrive, and can depart at the right time?
DT=t DT=t+X • Only one packet can enter or leave a memory at time t • Only one packet can enter or leave a memory at any time DT=t+X Intuition for Theorem 1 Time = t • Only one packet can enter a memory at time t Memory
Proof of Theorem 1 • When a packet arrives in a time slot it must choose a memory not chosen by • The N – 1 other packets that arrive at that timeslot. • The N other packets that depart at that timeslot. • The N - 1 other packets that can depart at the same time as this packet departs (in future). • Proof: • By the pigeon-hole principle, 3N –1 memories that can perform at most one memory operation per time slot are sufficient for the router to be work-conserving
Memory Memory Memory Memory Memory Memory Memory Memory A5 1 A4 A3 A4 A1 A1 A5 B1 C3 C3 C1 B3 B1 K=8 C1 The Parallel Shared Memory Router At most one operation – a write or a read per time slot A R B R C Arriving Packets Departing Packets From theorem 1, k = 7 memories don’t suffice .. but 8 memories do
Summary - Routers which give 100% throughput Mem. BW per Mem1 Switch Algorithm Total Mem. BW Switch BW Fabric # Mem. Output-Queued Bus N (N+1)R N(N+1)R NR None C. Shared Mem. Bus 1 2NR 2NR 2NR None MWM Input Queued Crossbar N 2R 2NR NR 2N 3R 6NR 2NR Maximal CIOQ (Cisco) Crossbar Time Reserve* 2N 3R 6NR 3NR P Shared Mem. Bus k 3NR/k 3NR 2NR C. Sets DSM (Juniper) N 3R 3NR 4NR Edge Color Xbar N 3R 3NR 6NR C. Sets N 4R 4NR 4NR C. Sets PPS - OQ Clos Nk 2R(N+1)/k 2N(N+1)R 4NR C. Sets Nk 4NR/k 4NR 4NR C. Sets PPS –Shared Memory Clos None Nk 2NR/k 2NR 2NR 1 Note that lower mem. bandwidth per memory implies higher random access time, which is better
Contents • Parallel routers: work-conserving • Parallel routers: delay guarantees
one output, single PIFO queue push constrained traffic Push In First Out (PIFO) Delay Guarantees one output, many logical FIFO queues Weighted fair queueing sorts packets 1 constrained traffic m PIFO models • Weighted Fair Queueing • Weighted Round Robin • Strict priority etc.
Delay Guarantees • Problem : • How can we design a parallel output-queued router from slower parallel memories and give delay guarantees? • This is difficult because • The counting technique depends on being able to predict the departure time and schedule it. • The departure time of a cell is not fixed in policies such as strict priority, weighted fair queueing etc.
Theorem 2 • Theorem 2: (sufficiency) • A parallel output-queued router can give delay guarantees with 4N –2 memories that can perform at most one memory operation per time slot.
DT = 3 DT= 2 DT= 1 9 8 7 6 5 4 3 2 1 FIFO: Window of memories of size N-1 that can’t be used 1.5 2.5 Departure Order 8 9 8 7 7 6 6 5 5 4 3 4 2.5 3 2 2 1 1 … N-1 packets before cell at time of insertion 8 7 6 7 6 5 4 5 4 3 2.5 3 2.5 2 2 1.5 1 1 Departure Order … N-1 packets after cell at time of insertion Intuition for Theorem 2N=3 port router Departure Order … PIFO: 2 windows of memories of size N-1 that can’t be used
DT=t Before C After C • Will be used to read the N-1 cells that depart before it. • Will be used to read the N-1 cells that depart after it. Proof of Theorem 2 Time = t A packet cannot use the memories: • Used to write the N-1 arriving cells at t. • Used to read the Ndeparting cells at t. DT=t+T Cell C
Input Queued - Crossbar N 2R 2NR NR Nk 2NR/k 2NR 2NR - Summary- Routers which give delay guarantees Switch Algorithm Total MemoryBW Switch BW Fabric # Mem. Mem. BW per Mem. Output-Queued Bus N (N+1)R N(N+1)R NR None C. Shared Mem. Bus 1 2NR 2NR 2NR None 2N 3R 6NR 2NR Marriage CIOQ (Cisco) Crossbar Time Reserve 2N 3R 6NR 3NR P. Shared M Bus k 4NR/k 4NR 2NR C. Sets DSM (Juniper) N 4R 4NR 5NR Edge Color Xbar N 4R 4NR 8NR C. Sets N 6R 6NR 6NR C. Sets PPS - OQ Clos Nk 3R(N+1)/k 3N(N+1)R 6NR C. Sets Nk 6NR/k 6NR 6NR C. Sets PPS –Shared Memory Clos
Distributed Shared Memory Router Switch Fabric Memories Memories Memories R R R Line Card 1 Line Card 2 Line Card N • The central memories are moved to distributed line cards and shared • Memory and line cards can be added incrementally • From theorem 1, 3N –1 memories which can perform one operation per time slot i.e. a total memory bandwidth of 3NR suffice for the router to be work-conserving
Corollary 1 • Problem: • What is the switch bandwidth for a work-conserving DSM router? • Corollary 1: (sufficiency) • A switch bandwidth of 4NR is sufficient for a distributed shared memory router to be work-conserving • Proof 1: • There are a maximum of 3 memory accesses and 1 port access
Corollary 2 • Problem: • What is the switching algorithm for a work-conserving DSM router? • Bus : No algorithm needed, but impractical • Crossbar : Algorithm needed because only permutations are allowed • Corollary 2: (existence) • An edge coloring algorithm can switch packets for a work-conserving distributed shared memory router • Proof : • Follows from König’s theorem - Any bipartite graph with maximum degree has an edge coloring with colors