Efficient Implementation of a Statistics Counter Architecture. Authors: Sriram Ramabhadran, George Varghese. Publisher: SIGMETRICS’03. Presenter: Yun-Yan Chang. Date: 2010/12/29. Outline: Introduction, Previous works, Scheme, LR(T), Aggregated bitmap, Implementation, Conclusion.
Efficient Implementation of a Statistics Counter Architecture • Authors: Sriram Ramabhadran, George Varghese • Publisher: SIGMETRICS’03 • Presenter: Yun-Yan Chang • Date: 2010/12/29
Outline • Introduction • Previous works • Scheme • LR(T) • Aggregated bitmap • Implementation • Conclusion
Introduction • Removes the sorting bottleneck of [1] by proposing a counter management algorithm (CMA) called LR(T) (Largest Recent with threshold T). • LR(T) avoids sorting by keeping only a bitmap that tracks the counters whose value is at least the threshold T.
Previous Work • D. Shah, S. Iyer, B. Prabhakar, and N. McKeown • Maintaining statistics counters in router line cards • Proposes a hybrid architecture in which DRAM stores the statistics counters and a small amount of SRAM enables counter updates at line rate. • Proposes a CMA called LCF (Largest Counter First), which picks the counter with the largest value to be updated to DRAM.
Previous Work (cont.) • Architecture • SRAM stores N counters of size m<M bits. • DRAM stores N counters of size M bits. • The SRAM counters hold recent updates and are periodically transferred to the corresponding DRAM counters. Figure 1. Statistics counter architecture
Previous Work (cont.) • Largest Counter First (LCF) • An algorithm that minimizes the size of the SRAM. • Selects the SRAM counter with the largest value. • If multiple counters have the same value, picks one arbitrarily. • Adds the selected value to the corresponding counter in the DRAM and resets the counter in the SRAM. • Bottleneck: • Sort: finding the largest counter requires sorting. • Difficult to implement at high speed.
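The LCF policy described above can be sketched as a toy software model. This is only an illustration under stated assumptions: the class name `HybridCounters`, the unit increments, and the linear scan in `flush_largest` are hypothetical choices, not the hardware design from [1]. The scan stands in for the "find the largest counter" step that the slide names as the bottleneck.

```python
class HybridCounters:
    """Toy model (assumption): N small SRAM counters backed by N DRAM counters."""

    def __init__(self, n, b):
        self.sram = [0] * n   # small counters holding recent increments
        self.dram = [0] * n   # full-width counters
        self.b = b            # one DRAM update per cycle of b SRAM updates
        self.pending = 0      # SRAM updates seen in the current cycle

    def increment(self, i):
        self.sram[i] += 1
        self.pending += 1
        if self.pending == self.b:
            self.pending = 0
            self.flush_largest()

    def flush_largest(self):
        # LCF: pick the largest SRAM counter (ties go to the lowest index here).
        j = max(range(len(self.sram)), key=self.sram.__getitem__)
        self.dram[j] += self.sram[j]   # transfer the value to DRAM
        self.sram[j] = 0               # reset the SRAM counter

    def total(self, i):
        # The true counter value is the DRAM part plus the unflushed SRAM part.
        return self.dram[i] + self.sram[i]
```

The true value of counter `i` is always `dram[i] + sram[i]`, so no increment is lost regardless of which counter the policy flushes; the policy only affects how large an SRAM counter can grow.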
LR(T) Algorithm • Algorithm description • Let j* be the counter with the largest value among the counters incremented in the last cycle of b updates to SRAM. • If the value of counter c_j* ≥ T, LR(T) updates counter j* to DRAM. • If c_j* < T, LR(T) updates any counter with value at least T to DRAM. • If no such counter exists, LR(T) updates counter j* to DRAM.
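The three selection rules above can be sketched in a few lines of Python. This is a minimal model under assumptions: the class name `LRT`, the unit increments, and the tie-breaking are hypothetical, and the linear scan for a counter with value at least T stands in for a bitmap lookup.

```python
class LRT:
    """Toy model (assumption) of the LR(T) counter management algorithm."""

    def __init__(self, n, b, threshold):
        self.sram = [0] * n
        self.dram = [0] * n
        self.b = b
        self.T = threshold
        self.recent = []      # counters incremented in the current cycle

    def increment(self, i):
        self.sram[i] += 1
        self.recent.append(i)
        if len(self.recent) == self.b:   # end of a cycle of b SRAM updates
            self._dram_update()
            self.recent = []

    def _choose(self):
        # j*: largest counter among those incremented in the last cycle.
        j_star = max(self.recent, key=lambda j: self.sram[j])
        if self.sram[j_star] >= self.T:
            return j_star
        # Otherwise pick any counter at or above T (a scan replaces the bitmap).
        for j, value in enumerate(self.sram):
            if value >= self.T:
                return j
        # No counter has reached T: fall back to j*.
        return j_star

    def _dram_update(self):
        j = self._choose()
        self.dram[j] += self.sram[j]
        self.sram[j] = 0
```

With `threshold=0` the first rule always fires, so only the last b updates matter, which is the LR(0) case.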
LR(T) Algorithm • Proof: • Threshold T = 0 allows a simple implementation, while T = b is optimal and minimizes the required SRAM size. • LR(0) • Only remembers the last b updates to SRAM when determining which counter to update to DRAM. • Let [symbol missing in source] be the maximum value a counter can reach under LR(0). • Theorem 1: [statement missing in source] • Implies an SRAM counter of size at least [expression missing in source]
LR(T) Algorithm • LR(b) • The threshold is raised from 0 to b. • b: time between DRAM accesses (one DRAM update per cycle of b SRAM updates) • Let [symbol missing in source] be the maximum value a counter can reach under LR(0). • Theorem 2: [statement missing in source] • Implies any counter is at most (b − 1)(N − 1) • The value of a counter cannot be larger than (b − 1) + log_d(N − 1), where [definition of d missing in source]
Aggregated bitmap • To minimize the required storage • Consider a fixed universe U of N elements labelled 1, 2, …, N. • Use a bitmap b_1 b_2 ... b_N to record which elements are contained in a set S. • b_i is set to 1 if element i ∈ S, otherwise to 0. • Implement functions: • add(i) Adds element i to set S • delete(i) Deletes element i from set S • test(i) Tests whether element i belongs to set S • find() Returns any element i that belongs to set S
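The four set operations listed above can be illustrated with a plain, non-aggregated bitmap. This is a minimal sketch: the class name `BitmapSet` and the use of a Python integer as the bit vector are assumptions made for brevity, and `find()` here returns the smallest member, which is one valid choice of "any element".

```python
class BitmapSet:
    """Set over the universe 1..N stored as a bitmap (bit i-1 represents element i)."""

    def __init__(self, n):
        self.n = n
        self.bits = 0

    def add(self, i):
        self.bits |= 1 << (i - 1)

    def delete(self, i):
        self.bits &= ~(1 << (i - 1))

    def test(self, i):
        return ((self.bits >> (i - 1)) & 1) == 1

    def find(self):
        if self.bits == 0:
            return None                    # the set is empty
        lowest = self.bits & -self.bits    # isolate the lowest set bit
        return lowest.bit_length()         # bit position k maps to element k+1
```

For example, after `add(5)` and `add(70)`, `find()` returns 5; after `delete(5)` it returns 70.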
Aggregated bitmap Figure 2: Aggregated bitmap for N = 128 elements and W = 16 word size.
Aggregated bitmap • Each group of W bits in the bitmap is aggregated to form a single node. • N: number of bits in the aggregated bitmap • W: the word size (N and W must be powers of 2) • Total: [expression missing in source] nodes • Total memory: [expression missing in source]
Aggregated bitmap • Each internal node in the tree contains two fields called lcount and rcount. • lcount is the number of 1s present in its left child. • rcount is the number of 1s present in its right child.
Aggregated bitmap • Pipelined implementation • Each operation proceeds top-down, starting at the root and moving from one level to the next. • At each level of the tree, there is potentially a memory read followed by a memory write. • Storing each level of the tree in a different memory bank permits simultaneous access to all levels of the tree.
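A top-down walk over the lcount/rcount fields can be sketched as follows. Several details are assumptions, not taken from the slides: the tree is binary, it is stored in heap layout (node k has children 2k and 2k+1), counts cover the whole left or right subtree, and `add`/`delete` test the leaf word first so that counts stay exact. The sketch is sequential and does not model the per-level memory banks or pipelining.

```python
class AggregatedBitmap:
    """Sketch (assumption): binary count tree over N/W leaf words of W bits each."""

    def __init__(self, n, w):
        self.w = w
        self.leaves = n // w                    # number of W-bit leaf words
        assert n % w == 0 and (self.leaves & (self.leaves - 1)) == 0
        self.words = [0] * self.leaves
        self.lcount = [0] * self.leaves         # internal nodes use indices 1..leaves-1
        self.rcount = [0] * self.leaves

    def test(self, i):
        word, bit = divmod(i - 1, self.w)
        return ((self.words[word] >> bit) & 1) == 1

    def _update(self, i, delta):
        word, bit = divmod(i - 1, self.w)
        node, lo, hi = 1, 0, self.leaves        # node covers leaf words [lo, hi)
        while hi - lo > 1:                      # walk from the root to the leaf
            mid = (lo + hi) // 2
            if word < mid:
                self.lcount[node] += delta
                node, hi = 2 * node, mid
            else:
                self.rcount[node] += delta
                node, lo = 2 * node + 1, mid
        self.words[word] ^= 1 << bit            # flip the bit in the leaf word

    def add(self, i):
        if not self.test(i):
            self._update(i, +1)

    def delete(self, i):
        if self.test(i):
            self._update(i, -1)

    def find(self):
        node, lo, hi = 1, 0, self.leaves
        while hi - lo > 1:                      # follow any non-empty subtree
            mid = (lo + hi) // 2
            if self.lcount[node] > 0:
                node, hi = 2 * node, mid
            elif self.rcount[node] > 0:
                node, lo = 2 * node + 1, mid
            else:
                return None
        word = self.words[lo]
        if word == 0:
            return None
        return lo * self.w + (word & -word).bit_length()
```

With N = 128 and W = 16 this gives 8 leaf words, so `find()` visits three internal nodes and one leaf word instead of scanning all 128 bits.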
Implementation • To implement LR(T), it is necessary to keep track of two things: • The largest value among all counters updated in the last cycle of b updates, along with the corresponding counter j*. • All counters above the threshold T. • Memory accesses for counter operations and bitmap operations proceed in parallel.
Implementation • Every cycle of b updates involves b SRAM update operations and one DRAM update operation. • SRAM update operation • Two accesses to update the SRAM counter • Two accesses for add • DRAM update operation • Two accesses to read and reset the SRAM counter • Four accesses for delete and find • Two DRAM accesses to update the DRAM counter Figure 3: Timing diagram for SRAM and DRAM updates for two successive cycles of b counter updates.
Conclusion • The schemes are compared for a reference system of one million 64-bit counters at a line rate of 10 Gbps with 10 counter updates per packet. Table 1: Cost-benefit comparison for different schemes.