
CoolCAMs: Power-Efficient TCAMs for Forwarding Engines



Presentation Transcript


  1. CoolCAMs: Power-Efficient TCAMs for Forwarding Engines Paper by Francis Zane, Girija Narlikar, Anindya Basu Bell Laboratories, Lucent Technologies Presented by Edward Spitznagel

  2. Outline • Introduction • TCAMs for Address Lookup • Bit Selection Architecture • Trie-based Table Partitioning • Route Table Updates • Summary and Discussion

  3. Introduction • Ternary Content-Addressable Memories (TCAMs) are becoming very popular for designing high-throughput forwarding engines; they are • fast • cost-effective • simple to manage • Major drawback: high power consumption • This paper presents architectures and algorithms for making TCAM-based routing tables more power-efficient

  4. TCAMs for Address Lookup • Fully-associative memory, searchable in a single cycle • Hardware compares query word (destination address) to all stored words (routing prefixes) in parallel • each bit of a stored word can be 0, 1, or X (don’t care) • in the event that multiple matches occur, typically the entry with lowest address is returned

  5. TCAMs for Address Lookup • TCAM vendors now provide a mechanism that reduces power consumption by selectively addressing smaller portions of the TCAM • The TCAM is divided into a set of blocks; each block is a contiguous, fixed-size chunk of TCAM entries • e.g. a 512k-entry TCAM could be divided into 64 blocks of 8k entries each • When a search command is issued, it is possible to specify which block(s) to use in the search • This saves power, since the main component of TCAM power consumption during a search is proportional to the number of entries searched

  6. Bit Selection Architecture • Based on the observation that most prefixes in core routing tables are between 16 and 24 bits long • over 98%, in the authors’ datasets • Put the very short (< 16 bits) and very long (> 24 bits) prefixes in a set of TCAM blocks that are searched on every lookup • The remaining prefixes are partitioned into “buckets,” one of which is selected by hashing for each lookup • each bucket is laid out over one or more TCAM blocks • In this paper, the hashing function is restricted to merely using a selected set of input bits as an index

  7. Bit Selection Architecture

  8. Bit Selection Architecture • A route lookup, then, involves the following: • hashing function (bit selection logic, really) selects k hashing bits from the destination address, which identifies a bucket to be searched • also search the blocks with the very long and very short prefixes • The main issues now are: • how to select the k hashing bits • Restrict ourselves to choosing hashing bits from the first 16 bits of the address, to avoid replicating prefixes • how to allocate the different buckets among the various TCAM blocks (since bucket size may not be an integral multiple of the TCAM block size)
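As a rough illustration of the bit-selection step described above, the following sketch (function and parameter names are mine, not the paper's) extracts k chosen bit positions from a 32-bit destination address and concatenates them into a bucket index:

```python
def bucket_index(dest_addr, bit_positions):
    """Form a bucket index by concatenating the selected hashing bits.

    dest_addr: 32-bit IPv4 destination address as an integer.
    bit_positions: positions (0 = most significant bit) chosen from
    the first 16 bits of the address, per the paper's restriction.
    """
    index = 0
    for pos in bit_positions:
        bit = (dest_addr >> (31 - pos)) & 1  # pull out one selected bit
        index = (index << 1) | bit           # append it to the index
    return index
```

With k bits selected, the index identifies one of 2^k buckets; the blocks holding that bucket are searched alongside the always-searched blocks for very short and very long prefixes.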

  9. Bit Selection: Worst-case power consumption • Given any routing table containing N prefixes, each of length L, what is the size of the largest bucket generated by the best possible hash function that uses k bits out of the first L? • Theorem III.1: there exists some hash function splitting the set of prefixes such that the size of the largest bucket is bounded • more details and proof in Appendix I of the paper • an ideal hash function would generate 2^k equal-sized buckets, each holding a fraction 1/2^k of the prefixes • e.g. 1/8 (0.125) for k = 3; 1/64 (0.015625) for k = 6

  10. Bit Selection Heuristics • We don’t expect to see the worst-case input, but it gives designers a power budget • Given such a power budget and a routing table, it suffices to find a set of hashing bits that produces a split not exceeding the power budget (a satisfying split) • Three heuristics • The first is simple: use the rightmost k of the first 16 bits. In almost all routing traces studied, this works well.

  11. Bit Selection Heuristics • Second heuristic: brute-force search, checking all possible subsets of k bits from the first 16 • Guaranteed to find a satisfying split • Since it examines all C(16, k) possible sets of k bits, running time is maximized at k = 8
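A minimal sketch of the brute-force heuristic, assuming the 16-bit prefix stems are available as integers (function names and the `budget` parameter are illustrative, not from the paper):

```python
from collections import Counter
from itertools import combinations

def max_bucket_size(prefixes16, bits):
    """Largest bucket when the 16-bit prefix stems are hashed on `bits`
    (bit 0 = most significant of the 16)."""
    buckets = Counter(
        tuple((p >> (15 - b)) & 1 for b in bits) for p in prefixes16
    )
    return max(buckets.values())

def brute_force_split(prefixes16, k, budget):
    """Second heuristic: examine every C(16, k) subset of the first 16
    bits and return the first one whose largest bucket fits the power
    budget, or None if no k-bit split satisfies it."""
    for bits in combinations(range(16), k):
        if max_bucket_size(prefixes16, bits) <= budget:
            return list(bits)
    return None
```

For k = 8 this scans C(16, 8) = 12870 subsets, the maximum of the binomial coefficient, which is why the running time peaks there.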

  12. Bit Selection Heuristics • Third heuristic: a greedy algorithm • Falls between the simple heuristic and the brute-force one, in terms of complexity and accuracy • To select k hashing bits, the algorithm performs k iterations, selecting one bit per iteration • number of buckets doubles each iteration • Goal in each iteration is to select a bit that minimizes the size of the biggest bucket produced in that iteration

  13. Bit Selection Heuristics • Third heuristic: greedy algorithm: pseudocode
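The pseudocode shown on this slide is not reproduced in the transcript; a sketch of the greedy selection in Python (a simple formulation of the idea, not the paper's exact pseudocode) might look like:

```python
from collections import Counter

def greedy_bits(prefixes16, k):
    """Third heuristic (sketch): select one bit per iteration, each
    time picking the bit that minimizes the size of the largest bucket
    produced in that iteration. prefixes16 holds the 16-bit prefix
    stems as ints (bit 0 = most significant)."""
    chosen = []
    for _ in range(k):
        best_bit, best_max = None, None
        for bit in range(16):
            if bit in chosen:
                continue
            # bucket the prefixes on the already-chosen bits plus `bit`
            trial = Counter(
                tuple((p >> (15 - b)) & 1 for b in chosen + [bit])
                for p in prefixes16
            )
            m = max(trial.values())
            if best_max is None or m < best_max:
                best_bit, best_max = bit, m
        chosen.append(best_bit)
    return chosen
```

Each of the k iterations doubles the number of buckets, and the greedy choice keeps the largest bucket as small as it can at each step, trading the brute-force guarantee for much lower running time.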

  14. Bit Selection Heuristics • Combining the heuristics, to reduce running time (in typical cases) • First, try the simple heuristic (use k rightmost bits), and stop if that succeeds. • Otherwise, apply the third heuristic (greedy algorithm), and stop if that succeeds. • Otherwise, apply the brute-force heuristic • Apply algorithm again whenever route updates cause any bucket to become too large.

  15. Bit Selection Architecture: Experimental Results • Evaluate the heuristics with respect to two metrics: running time and quality of splits produced. • Applied to real core routing tables; results are presented for two, but others were similar • Applied to synthetic table with ~1M entries, constructed by randomly picking how many prefixes share each combination of first 16 bits

  16. Bit Selection Results: Running Time • Running time on an 800 MHz PC • Required less than 1 MB of memory

  17. Bit Selection Results: Quality of Splits • let N denote the number of 16-24 bit prefixes • let cmax denote the maximum bucket size • The ratio N/cmax measures the quality (evenness) of the split produced by the hashing bits • it is the factor of reduction in the portion of the TCAM that needs to be searched

  18. Bit Selection Architecture: Laying out TCAM buckets • Blocks for very long and very short prefixes are placed at the beginning and end of the TCAM, respectively • This ensures that the longest prefix is selected, if more than one should match • Laying out buckets sequentially in any order: • any bucket of size c occupies no more than ⌈c/s⌉ + 1 TCAM blocks, where s is the TCAM block size • At most ⌈cmax/s⌉ + 1 TCAM blocks need to be searched for any lookup (plus the blocks for very long and very short prefixes) • Thus the actual power-savings ratio is not quite as good as the N/cmax mentioned before, but it is still good
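The block-count arithmetic reduces to a one-line helper; this sketch assumes sequential layout, so a bucket may straddle one block boundary (hence the extra block):

```python
import math

def blocks_to_search(c_max, s):
    """With buckets laid out sequentially over blocks of s entries, a
    bucket of c_max prefixes can straddle a block boundary, so at most
    ceil(c_max / s) + 1 data-TCAM blocks must be enabled per lookup."""
    return math.ceil(c_max / s) + 1
```

For example, with 8k-entry blocks, a 9000-prefix bucket can span three blocks, so three data blocks are enabled even though two would hold it if it were aligned.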

  19. Bit Selection Architecture: Remarks • Good average-case power reduction, but the worst-case bounds are not as good • hardware designers are thus forced to design for much higher power consumption than will be seen in practice • Assumes most prefixes are 16-24 bits long • may not always be the case (e.g. number of long (>24bit) prefixes may increase in the future)

  20. Trie-based Table Partitioning • Partitioning scheme using a Routing Trie data structure • Eliminates the two drawbacks of the Bit Selection architecture • worst-case bounds on power consumption do not match well with power consumption in practice • assumption that most prefixes are 16-24 bits long • Two trie-based schemes (subtree-split and postorder-splitting), both involving two steps: • construct a binary routing trie from the routing table • partitioning step: carve out subtrees from the trie and place into buckets • The two schemes differ in their partitioning step

  21. Trie-based Architecture • Trie-based forwarding engine architecture • use an index TCAM (instead of hashing) to determine which bucket to search • requires searching the entire index TCAM, but typically the index TCAM is very small

  22. Overview of Routing Tries • A 1-bit trie can be used for performing longest prefix matches • the trie consists of nodes, where a routing prefix of length n is stored at level n of the trie • Routing lookup process • starts at the root • scans the input and descends to the left (right) child if the next bit of input is 0 (1), until a leaf node is reached • the last prefix encountered is the longest matching prefix • count(v) = number of routing prefixes in the subtree rooted at v • the covering prefix of a node u is the prefix of the lowest common ancestor of u that is in the routing table (including u itself)
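A minimal 1-bit trie with longest-prefix-match lookup, following the description above (class and function names are mine):

```python
class TrieNode:
    def __init__(self):
        self.children = [None, None]  # index 0 = left, 1 = right
        self.prefix = None            # routing prefix stored here, if any

def insert(root, bits, prefix):
    """Store `prefix` at the trie node reached by its bits (MSB first)."""
    node = root
    for b in bits:
        if node.children[b] is None:
            node.children[b] = TrieNode()
        node = node.children[b]
    node.prefix = prefix

def longest_prefix_match(root, bits):
    """Descend on the destination address bits; the last prefix seen
    along the path is the longest matching prefix."""
    node, best = root, root.prefix
    for b in bits:
        node = node.children[b]
        if node is None:
            break
        if node.prefix is not None:
            best = node.prefix
    return best
```

A lookup thus visits at most W nodes, where W is the length of the longest prefix.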

  23. Routing Trie Example Routing Table: Corresponding 1-bit trie:

  24. Splitting into subtrees • Subtree-split algorithm: • input: b = maximum size of a TCAM bucket • output: a set of K TCAM buckets, each with size in the range [⌈b/2⌉, b], and an index TCAM of size K • Partitioning step: post-order traversal of the trie, looking for carving nodes • Carving node: a node whose count is at least ⌈b/2⌉ and whose parent’s count is > b • When we find a carving node v: • carve out the subtree rooted at v and place it in a separate bucket • place the prefix of v in the index TCAM, along with the covering prefix of v • the counts of all ancestors of v are decreased by count(v)

  25. Subtree-split: Algorithm
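The algorithm itself appears as a figure in the original slides. A simplified recursive sketch is given below: it carves a bucket as soon as a subtree's remaining count reaches ⌈b/2⌉, which keeps every bucket in the range [⌈b/2⌉, b], but it omits covering prefixes and the explicit parent-count check, so it is illustrative rather than faithful:

```python
import math

class Node:
    def __init__(self, prefix=None):
        self.children = [None, None]
        self.prefix = prefix  # routing prefix stored here, or None

def subtree_split(root, b):
    """Simplified subtree-split: post-order traversal carving out a
    bucket whenever at least ceil(b/2) uncarved prefixes remain in a
    subtree. The last bucket may hold as few as one prefix."""
    buckets = []

    def carve(node):
        if node is None:
            return []
        # collect uncarved prefixes from both children (post-order)
        remaining = carve(node.children[0]) + carve(node.children[1])
        if node.prefix is not None:
            remaining.append(node.prefix)
        if len(remaining) >= math.ceil(b / 2):
            buckets.append(remaining)  # one index-TCAM entry per bucket
            return []                  # ancestors see these as removed
        return remaining

    leftover = carve(root)
    if leftover:
        buckets.append(leftover)       # final bucket, size 1..b
    return buckets
```

Because each child passes up fewer than ⌈b/2⌉ prefixes, no carved bucket can exceed b, matching the size bound stated on the slides.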

  26. Subtree-split: Example b = 4

  27. Subtree-split: Example b = 4

  28. Subtree-split: Example b = 4

  29. Subtree-split: Example b = 4

  30. Subtree-split: Remarks • Subtree-split creates buckets whose sizes range from b/2 to b (except the last, which ranges from 1 to b) • At most one covering prefix is added to each bucket • The total number of buckets created ranges from N/b to 2N/b; each bucket results in one entry in the index TCAM • Using subtree-split in a TCAM with K buckets, at most K + 2N/K prefixes are searched from the index and data TCAMs during any lookup • Total complexity of the subtree-split algorithm is O(N + NW/b), where W is the length of the longest prefix

  31. Post-order splitting • Partitions the table into buckets of exactly b prefixes • an improvement over subtree-split, where the smallest and largest bucket sizes can vary by a factor of 2 • this comes at the cost of more entries in the index TCAM • Partitioning step: post-order traversal of the trie, looking for subtrees to carve out, but: • buckets are made from collections of subtrees, rather than just a single subtree • this is because the entire trie may not contain N/b subtrees of exactly b prefixes each

  32. Post-order splitting • postorder-split: does a post-order traversal of the trie, calling carve-exact to carve out subtree collections of size b • carve-exact: does the actual carving • if it is at a node with count = b, it simply carves out that subtree • if it is at a node with count < b whose parent has count ≤ b, it does nothing (since we will later have a chance to carve the parent) • if it is at a node with count x, where x < b, and the node’s parent has count > b, then: • carve out the subtree of size x at this node, and • recursively call carve-exact again, this time looking for a carving of size b - x (instead of b)

  33. Post-order split: Algorithm
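As with subtree-split, the pseudocode is a figure in the original slides. Since carve-exact in effect takes prefixes in post-order and packs them into groups of exactly b, a simplified sketch (covering prefixes omitted; `Node` has `children` and `prefix` attributes as in the trie sketch) is:

```python
class Node:
    def __init__(self, prefix=None):
        self.children = [None, None]
        self.prefix = prefix  # routing prefix stored here, or None

def postorder_split(root, b):
    """Simplified postorder-split: emit a bucket whenever exactly b
    prefixes have accumulated in post-order, so a bucket may combine
    several neighboring subtrees. The last bucket holds 1..b prefixes."""
    buckets, pending = [], []

    def visit(node):
        if node is None:
            return
        visit(node.children[0])
        visit(node.children[1])
        if node.prefix is not None:
            pending.append(node.prefix)
            if len(pending) == b:
                buckets.append(pending.copy())  # exactly b prefixes
                pending.clear()

    visit(root)
    if pending:
        buckets.append(pending.copy())
    return buckets
```

The equal bucket sizes are what improve on subtree-split's factor-of-2 spread; the cost, not visible in this sketch, is the up-to-W covering prefixes per bucket in the index TCAM.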

  34. Post-order split: Example b = 4

  35. Post-order split: Example b = 4

  36. Post-order split: Example b = 4

  37. Postorder-split: Remarks • Postorder-split creates buckets of size b (except the last, which ranges from 1 to b) • At most W covering prefixes are added to each bucket, where W is the length of the longest prefix in the table • The total number of buckets created is exactly ⌈N/b⌉; each bucket results in at most W + 1 entries in the index TCAM • Using postorder-split in a TCAM with K buckets, at most (W + 1)K + N/K + W prefixes are searched from the index and data TCAMs during any lookup • Total complexity of the postorder-split algorithm is O(N + NW/b)

  38. Post-order split: Experimental results • Algorithm running time:

  39. Post-order split: Experimental results • Reduction in routing table entries searched

  40. Route Table Updates • Briefly explore performance in the face of routing table updates • Adding routes may cause a TCAM bucket to overflow, requiring repartitioning of the prefixes and rewriting the entire table into the TCAM • Apply real-life update traces (about 3.5M updates each) to the bit-selection and trie-based schemes, to see how often recomputation is needed

  41. Route Table Updates • Bit-selection architecture: • apply the brute-force heuristic on the initial table; note the size cmax of the largest bucket • recompute the hashing bits when any bucket grows beyond cthresh = (1 + t) × cmax, for some threshold t • when recomputing, first try the static heuristic; if needed, then try the greedy algorithm; fall back on brute force if necessary • Trie-based architecture: • similar threshold-based strategy • subtree-split: use a bucket size of ⌈2N/K⌉ • post-order splitting: use a bucket size of ⌈N/K⌉
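The threshold rule above reduces to a one-line check (names are illustrative, not from the paper):

```python
def needs_repartition(bucket_sizes, c_max, t):
    """Recompute the split only once some bucket grows beyond
    c_thresh = (1 + t) * c_max, where c_max is the largest bucket size
    recorded right after the last (re)computation."""
    return max(bucket_sizes) > (1 + t) * c_max
```

A larger t means fewer expensive recomputations but a looser bound on the searched portion of the TCAM between them.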

  42. Route Table Updates • Results for bit-selection architecture:

  43. Route Table Updates • Results for trie-based architecture:

  44. Route Table Updates: “post-opt” algorithm • Post-opt: post-order split algorithm, with clever handling of updates: • with post-order split, we can transfer prefixes between neighboring buckets easily (few writes to the index and data TCAMs are needed) • so, if a bucket becomes overfull, we can usually just transfer one of its prefixes to a neighboring bucket • repartitioning, then, is only needed when both neighboring buckets are also full

  45. Summary • TCAMs would be great for routing lookup, if they didn’t use so much power • CoolCAMs: two architectures that use partitioned TCAMs to reduce power consumption in routing lookup • Bit-selection Architecture • Trie-based Table Partitioning (subtree-split and postorder-splitting) • each scheme has its own subtle advantages/disadvantages, but overall they seem to work well

  46. Discussion
