
Performance Modeling for Fast IP Lookups

Performance Modeling for Fast IP Lookups. Girija Narlikar; joint work with Francis Zane. Bell Laboratories, Lucent Technologies. Appeared in Proc. SIGMETRICS ’01. What is IP Lookup? Input: table of IP address prefixes (networks), stream of packets


Presentation Transcript


  1. Performance Modeling for Fast IP Lookups. Girija Narlikar; joint work with Francis Zane. Bell Laboratories, Lucent Technologies. Appeared in Proc. SIGMETRICS ’01.

  2. What is IP Lookup?
     • Input: table of IP address prefixes (networks), stream of packets
     • Output: longest matching prefix for each packet
     • Applications: routing, accounting, clustering
     [Figure: route table mapping destination-address prefixes 010*, 0110*, 10*, 11*, 1100* to router actions a1 to a5]
     lookup(10011001) = 10*; lookup(11001011) = 1100*; lookup(11011010) = 11*
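The lookup examples on this slide can be reproduced with a brute-force longest-prefix match. This is a minimal sketch for illustration only: the function name `longest_prefix_match` and the representation of prefixes as bit strings are assumptions, not the authors' implementation, and a linear scan is far slower than the trie structures the talk goes on to discuss.

```python
def longest_prefix_match(prefixes, addr_bits):
    """Return the longest prefix that addr_bits starts with, or None."""
    best = None
    for p in prefixes:
        # A prefix matches when the address begins with its bits;
        # among matches, keep the longest one.
        if addr_bits.startswith(p) and (best is None or len(p) > len(best)):
            best = p
    return best

# Prefixes from the slide, written without the trailing '*'.
TABLE = ["010", "0110", "10", "11", "1100"]

print(longest_prefix_match(TABLE, "10011001"))  # 10
print(longest_prefix_match(TABLE, "11001011"))  # 1100
print(longest_prefix_match(TABLE, "11011010"))  # 11
```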

  3. Hardware vs. Software
     • Core routers: ASICs perform IP lookup
       • Worst-case performance
     • Edge routers: software IP lookup
       • E.g., PCs, network processors (e.g., IXP, Cport, Xstream, …)
       • Average-case performance
       • Memory hierarchy matters
     [Figure: CPU, L1 cache (2 cycles), L2 cache (10 cycles), memory (100 cycles)]

  4. Memory gets bigger and slower: cache performance must be considered

  5. Goal
     • Optimize IP lookup data structures based on characteristics of the route table, input traffic, and hardware platform (memory hierarchy and processor)
     • Optimal hardware design of lookup engine for characteristic traffic and tables
     Approach
     • Build an accurate performance model to predict performance of data structures

  6. Results
     • Optimizing data structures for input traffic and hardware yields higher performance
     • Impact of hardware improvements can be predicted

  7. Simple lookup solution: binary tree
     [Figure: binary tree with edges labeled 0/1 from the root, holding prefixes 010*, 0110*, 10*, 11*, 1100*]
     ~100K entries, 32-bit addresses: too many memory accesses

  8. Optimization to binary tree
     Multi-level trie with larger strides
     [Figure: two-level trie with stride = 2 at each level, over prefixes A = 010*, B = 0110*, C = 10*, D = 11*, E = 1100*]
     Trade-off between # accesses and space
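The trade-off between accesses and space can be seen in a small fixed-stride trie. What follows is a sketch under stated assumptions: the names `Node`, `build`, `lookup`, and `total_slots` are hypothetical, prefixes are bit strings, and shorter prefixes are expanded to fill every slot they cover (longer prefixes overwrite them). It is not the authors' data structure, which also places splay trees at the trie leaves.

```python
class Node:
    """One trie node: a table with 2**stride slots."""
    def __init__(self, stride):
        self.stride = stride
        self.prefix = [None] * (1 << stride)  # longest prefix covering each slot
        self.child = [None] * (1 << stride)   # next-level node per slot, if any

def build(prefixes, strides):
    """Build a fixed-stride trie; assumes no prefix is longer than sum(strides)."""
    root = Node(strides[0])
    # Insert shortest first so a longer prefix overwrites shared slots.
    for p in sorted(prefixes, key=len):
        node, level, rest = root, 0, p
        while len(rest) > node.stride:
            idx = int(rest[:node.stride], 2)
            if node.child[idx] is None:
                node.child[idx] = Node(strides[level + 1])
            rest = rest[node.stride:]
            node = node.child[idx]
            level += 1
        # Expand the remaining bits to every slot they cover.
        pad = node.stride - len(rest)
        base = (int(rest, 2) << pad) if rest else 0
        for k in range(1 << pad):
            node.prefix[base + k] = p
    return root

def lookup(root, addr_bits):
    """Return (longest matching prefix, number of trie nodes visited)."""
    node, pos, best, accesses = root, 0, None, 0
    while node is not None:
        idx = int(addr_bits[pos:pos + node.stride], 2)
        accesses += 1
        if node.prefix[idx] is not None:
            best = node.prefix[idx]
        pos += node.stride
        node = node.child[idx]
    return best, accesses

def total_slots(node):
    """Count table slots in the whole trie (a rough proxy for space)."""
    return len(node.prefix) + sum(total_slots(c) for c in node.child if c)

TABLE = ["010", "0110", "10", "11", "1100"]
two_level = build(TABLE, [2, 2])
one_level = build(TABLE, [4])
print(lookup(two_level, "11001011"), total_slots(two_level))  # ('1100', 2) 12
print(lookup(one_level, "11001011"), total_slots(one_level))  # ('1100', 1) 16
```

On this five-prefix table the single-level trie answers in one access but needs 16 slots, while the two-level trie needs 12 slots and up to two accesses.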

  9. Large Strides = Good Performance?
     [Figure: 1-level trie with stride = 4; its 16 entries replicate prefixes A = 010*, B = 0110*, C = 10*, D = 11*, E = 1100*, with unused entries marked "-"]
     More space, more replication, more cache misses: poor performance?

  10. Non-uniform distribution of prefixes

  11. Non-uniform accesses to prefixes

  12. Optimizing for cache performance
      [Figure: plot with regions labeled "too many cache misses", "too many memory accesses", and "optimal"]

  13. Performance Model
      Inputs
      • Hardware parameters
      • Distribution of packets to prefixes in route table
      • Lookup data structure
      Output
      • Average lookup time

  14. Space of data structures
      Multi-level tries with splay trees (binary, self-adjusting) at the trie leaves
      Goal: find the appropriate number of levels and stride values

  15. Performance model
      Average lookup time = M1 × t_L1 + M2 × t_L2 + H × t_trie + T × t_tree
      M1 = # L1 cache misses; t_L1 = L1 miss latency
      M2 = # L2 cache misses; t_L2 = L2 miss latency
      H = # trie nodes visited; t_trie = time to visit a trie node
      T = # tree nodes visited; t_tree = time to visit a tree node
      [Figure: CPU, L1 cache, L2 cache, memory, annotated with t_L1 and t_L2]
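The model is a weighted sum, so it can be evaluated directly once the per-lookup counts are known. The sketch below uses the Pentium-II miss latencies quoted on the hardware platforms slide (38 ns and 100 ns); the per-lookup counts and the `t_trie` and `t_tree` values are made-up illustrative numbers, and the function name is an assumption.

```python
def average_lookup_time(m1, m2, h, t, t_l1, t_l2, t_trie, t_tree):
    """Average lookup time = M1*t_L1 + M2*t_L2 + H*t_trie + T*t_tree.

    m1, m2: average L1 and L2 cache misses per lookup
    h, t:   average trie nodes and tree nodes visited per lookup
    t_*:    corresponding per-event costs (here in nanoseconds)
    """
    return m1 * t_l1 + m2 * t_l2 + h * t_trie + t * t_tree

# Hypothetical per-lookup averages; only t_l1 and t_l2 come from the talk.
estimate_ns = average_lookup_time(
    m1=0.5, m2=0.125, h=2, t=1.5,
    t_l1=38, t_l2=100,     # Pentium-II miss latencies from the slides
    t_trie=10, t_tree=20,  # assumed values, for illustration only
)
print(estimate_ns)  # 81.5
```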

  16. Predicting cache misses
      [Figure: N memory blocks with access probabilities p_1, p_2, …, p_N mapped onto C cache blocks, each labeled 1/C]
      P(miss for mem blk i) =
      • direct-mapped: p_i × (1/C − p_i) / (1/C) = p_i × (1 − C·p_i)
      • fully associative: p_i × (1 − p_i)^C
      • n-way associative: p_i × (1 − C·p_i/n)^n
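The n-way expression covers the other two cases: n = 1 gives the direct-mapped form and n = C the fully associative one. A small sketch of how the per-block estimate might be summed into an expected miss count per access follows. The function names are hypothetical, and clamping the bracketed term at zero (for blocks with p_i > n/C, where the approximation would otherwise go negative) is an added assumption, not something stated on the slide.

```python
def miss_probability(p_i, cache_blocks, ways):
    """p_i * (1 - C*p_i/n)**n for an n-way associative cache with C blocks."""
    # Clamp at zero: an assumption added here so very hot blocks
    # do not produce a negative estimate.
    stay_out = max(0.0, 1 - cache_blocks * p_i / ways)
    return p_i * stay_out ** ways

def expected_misses(probs, cache_blocks, ways):
    """Expected cache misses per memory access, summed over all blocks."""
    return sum(miss_probability(p, cache_blocks, ways) for p in probs)

# Uniform accesses over N = 1024 memory blocks, C = 256 cache blocks.
uniform = [1 / 1024] * 1024
direct = expected_misses(uniform, cache_blocks=256, ways=1)
four_way = expected_misses(uniform, cache_blocks=256, ways=4)
fully = expected_misses(uniform, cache_blocks=256, ways=256)
print(direct)  # 0.75
print(four_way, fully)
```

With uniform accesses the direct-mapped estimate is 1 − C/N = 0.75, which matches the intuition that a quarter of the blocks fit in the cache.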

  17. Obtaining access counts
      Input: number of hits to each prefix
      • Trie nodes: count(v) = Σ_{u ∈ child(v)} count(u)
      • Splay tree nodes: assume E(S) accesses per lookup in splay tree S
        • 3E(S) in theory; T = weighted average of E(S)
      • Accumulate access counts in 32n global counters to search for an n-level trie
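The trie-node rule, where each node's count is the sum of its children's counts, amounts to a bottom-up pass over the trie. Here is a minimal sketch; the dictionary-based tree encoding, the node names, and the hit counts are all hypothetical, and leaves are assumed to carry the measured hits for their prefixes.

```python
def access_counts(children, leaf_hits, root):
    """Compute count(v) for every node: leaf hits, summed up toward the root.

    children:  dict mapping a node to the list of its child nodes
    leaf_hits: dict mapping each leaf to the number of packets that hit it
    """
    counts = {}

    def visit(v):
        kids = children.get(v, [])
        if kids:
            # Internal node: count(v) = sum of count(u) over children u.
            counts[v] = sum(visit(u) for u in kids)
        else:
            counts[v] = leaf_hits.get(v, 0)
        return counts[v]

    visit(root)
    return counts

# Toy trie: root has children a and b; a has leaves a0 and a1.
children = {"root": ["a", "b"], "a": ["a0", "a1"]}
leaf_hits = {"a0": 5, "a1": 3, "b": 2}
counts = access_counts(children, leaf_hits, "root")
print(counts["a"], counts["root"])  # 8 10
```

Dividing each count by the total number of packets gives the per-node access probabilities that the cache-miss estimate takes as input.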

  18. Hardware platforms
      Processor            L1 cache             L1 miss (t_L1)  L2 cache                L2 miss (t_L2)
      400 MHz Pentium-II   16KB on-chip, 4-way  38 ns           512KB off-chip, 4-way   100 ns
      700 MHz Pentium-III  16KB on-chip, 4-way  10 ns           256KB on-chip, 8-way    100 ns
      t_tree and t_trie obtained from VTune and confirmed via data fitting

  19. Packet Traces
      Distribution of packets to prefixes in 52K Mae-East BGP table
      • Synthetic traces: Rand-Net, Rand-IP
      • Real traces: ISP, SDC
      [Figure: packet distribution plots for Rand-Net, Rand-IP, ISP, and SDC]

  20. Model Validation
      Using measured (not predicted) M1, M2, T, H
      Average lookup time = M1 × t_L1 + M2 × t_L2 + T × t_tree + H × t_trie
      [Figure: measured vs. model lookup times for a 1-level trie; traces Rand-Net, ISP, Rand-IP, SDC]

  21. Model Validation (cont’d)
      [Figure: measured vs. model lookup times for a 1-level trie; traces Rand-Net, ISP, Rand-IP, SDC]

  22. Model Validation (cont’d)
      [Figure: measured vs. model lookup times for a 2-level trie; traces Rand-Net, ISP, SDC]

  23. “Best” lookup data structures
                 Pentium II                     Pentium III
      Trace      Meas.  Model  Struct.          Meas.  Model  Struct.
      Rand-Net   242    235    T2(16,24)        168    164    T3(16,24,28)
      ISP        197    202    T2(21,24)        131    149    T3(21,24,27)
      Rand-IP    140    142    T2(16,24)         89    108    T3(16,24,28)
      SDC         89    104    T1(21)            50     62    T1(21)

  24. Using suboptimal structures
      % loss in performance (Pentium II)
                   Trace with optimal structure
      Input trace  Rand-Net  ISP    Rand-IP  SDC
      Rand-Net     -         15.7   0.0      58.3
      ISP          29.9      -      29.9     11.3
      Rand-IP      0.0       33.1   -        35.0
      SDC          31.5      20.4   31.5     -

  25. Impact of hardware improvements. Example: L2 cache size (Pentium III)

  26. Processor and L2 speeds (ISP trace, Pentium II architecture; model predictions)

  27. Conclusions
      • Average-case performance for IP lookup can be predicted to within ~10%: the memory hierarchy cannot be ignored
      • “Best” data structure depends on input trace and lookup hardware
      • Performance model could be used to design future lookup architectures
        • Can search the space of hardware configurations under cost constraints

  28. Total data structure space (Pentium III)

  29. L1 cache size (Pentium III)
