Performance Modeling for Fast IP Lookups
Girija Narlikar
Joint work with Francis Zane
Bell Laboratories, Lucent Technologies
Appeared in Proc. SIGMETRICS ’01
What is IP Lookup?
• Input: Table of IP address prefixes (networks), stream of packets
• Output: Longest matching prefix for each packet
• Applications: routing, accounting, clustering
[Figure: router table mapping destination-address prefixes (11, 1100, 010, 0110, 10) to actions a1 to a5]
Examples: lookup(10011001) = 10, lookup(11001011) = 1100, lookup(11011010) = 11
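The lookup examples on this slide can be reproduced with a minimal sketch of longest-prefix matching. It uses a plain linear scan over bit strings, which is only an illustration of the problem statement and not one of the data structures studied in the talk; the function name `longest_prefix_match` is hypothetical.

```python
def longest_prefix_match(prefixes, addr_bits):
    """Return the longest entry of `prefixes` that is a prefix of `addr_bits`.

    Prefixes and addresses are strings of '0'/'1' characters.
    Returns None when no prefix matches.
    """
    best = None
    for p in prefixes:
        if addr_bits.startswith(p) and (best is None or len(p) > len(best)):
            best = p
    return best


# Prefixes taken from the slide's example table.
PREFIXES = ["11", "1100", "010", "0110", "10"]

for addr in ("10011001", "11001011", "11011010"):
    print(addr, "->", longest_prefix_match(PREFIXES, addr))
```

A linear scan costs time proportional to the table size per packet, which is why the following slides turn to tree-based structures.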
Hardware vs. Software
• Core routers: ASICs perform IP lookup
  • Worst-case performance
• Edge routers: Software IP lookup
  • E.g., PCs, network processors (e.g., IXP, Cport, Xstream, …)
  • Average-case performance
  • Memory hierarchy matters
[Figure: memory hierarchy: CPU, L1 cache (2 cycles), L2 cache (10 cycles), memory (100 cycles)]
Memory gets bigger and slower: cache performance must be considered
Goal
• Optimize IP lookup data structures based on characteristics of the route table, input traffic, and hardware platform (memory hierarchy and processor)
• Optimal hardware design of lookup engine for characteristic traffic and tables
Approach
• Build an accurate performance model to predict performance of data structures
Results
• Optimizing data structures for input traffic and hardware yields higher performance
• Impact of hardware improvements can be predicted
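Simple lookup solution: binary tree
Prefixes: 010, 0110, 10, 11, 1100
[Figure: binary tree branching on one address bit (0 or 1) per level from the root, with marked nodes for 10, 11, 010, 0110, and 1100]
~100K entries, 32-bit addresses: too many memory accesses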
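A minimal sketch of the binary-tree lookup follows, assuming prefixes and addresses are given as bit strings. The `Node` class and the `visited` counter are illustrative additions; the counter makes the "too many memory accesses" point concrete, since each visited node is a potential memory access.

```python
class Node:
    """One node of a binary trie: two child slots and an optional stored prefix."""
    __slots__ = ("children", "prefix")

    def __init__(self):
        self.children = [None, None]
        self.prefix = None


def build_binary_trie(prefixes):
    """Insert each bit-string prefix, consuming one bit per level."""
    root = Node()
    for p in prefixes:
        node = root
        for bit in p:
            i = int(bit)
            if node.children[i] is None:
                node.children[i] = Node()
            node = node.children[i]
        node.prefix = p
    return root


def lookup(root, addr_bits):
    """Return (longest matching prefix or None, number of nodes visited below the root)."""
    node, best, visited = root, None, 0
    for bit in addr_bits:
        node = node.children[int(bit)]
        if node is None:
            break
        visited += 1
        if node.prefix is not None:
            best = node.prefix
    return best, visited


trie = build_binary_trie(["010", "0110", "10", "11", "1100"])
print(lookup(trie, "11001011"))
```

With 32-bit addresses a lookup in such a tree may walk up to 32 nodes, one per address bit.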
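Optimization to binary tree
Multi-level trie with larger strides
Prefixes: A = 010, B = 0110, C = 10, D = 11, E = 1100
[Figure: two-level trie with stride = 2 at each level; the root is indexed by 00, 01, 10, 11, and the second-level nodes hold the expanded entries (-, A, A, B) and (D, E, D, D)]
Trade-off between # accesses and space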
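The multi-level trie can be sketched as below, assuming bit-string prefixes no longer than the sum of the strides and addresses at least that long. Prefixes that end inside a level are expanded to fill every matching slot of that level. Unlike the figure, which pushes D down into the second-level node, this sketch keeps D at the root and remembers the best match seen so far; that simplification and all function names are assumptions of the example.

```python
def build_multibit_trie(prefixes, strides):
    """Build a multi-level trie; level k consumes strides[k] address bits."""
    def new_node(level):
        size = 1 << strides[level]
        return {"entries": [None] * size, "children": [None] * size}

    root = new_node(0)
    # Shorter prefixes first, so longer ones overwrite their expanded slots.
    for p in sorted(prefixes, key=len):
        node, level, rest = root, 0, p
        while len(rest) > strides[level]:
            idx = int(rest[:strides[level]], 2)
            if node["children"][idx] is None:
                node["children"][idx] = new_node(level + 1)
            node = node["children"][idx]
            rest = rest[strides[level]:]
            level += 1
        # Expand the remaining bits to this level's stride.
        pad = strides[level] - len(rest)
        base = (int(rest, 2) << pad) if rest else 0
        for k in range(1 << pad):
            node["entries"][base + k] = p
    return root


def lookup(root, addr_bits, strides):
    """Return (longest matching prefix or None, number of trie nodes visited)."""
    node, best, pos, visited = root, None, 0, 0
    for stride in strides:
        if node is None:
            break
        visited += 1
        idx = int(addr_bits[pos:pos + stride], 2)
        if node["entries"][idx] is not None:
            best = node["entries"][idx]
        node = node["children"][idx]
        pos += stride
    return best, visited


STRIDES = [2, 2]
trie = build_multibit_trie(["010", "0110", "10", "11", "1100"], STRIDES)
print(lookup(trie, "11001011", STRIDES))
```

Compared with a one-bit-per-level tree, a lookup for 1100 here touches two nodes instead of four, at the cost of storing four slots per node.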
Large Strides = Good Performance?
Prefixes: A = 010, B = 0110, C = 10, D = 11, E = 1100
[Figure: 1-level trie with stride = 4; its 16 entries hold - (5 entries), A (2), B (1), C (4), D (3), E (1)]
More space, more replication, more cache misses: poor performance?
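The replication in the 1-level trie can be checked with a short sketch that expands every prefix into a flat table of 2^stride slots. It assumes non-empty bit-string prefixes no longer than the stride; `expand_to_table` is a hypothetical helper name.

```python
from collections import Counter


def expand_to_table(prefixes, stride):
    """Expand prefixes into a direct-indexed table with 2**stride slots."""
    table = [None] * (1 << stride)
    # Shorter prefixes first, so longer (more specific) ones overwrite them.
    for p in sorted(prefixes, key=len):
        pad = stride - len(p)
        base = int(p, 2) << pad
        for k in range(1 << pad):
            table[base + k] = p
    return table


# Labels as used on the slide.
LABELS = {"010": "A", "0110": "B", "10": "C", "11": "D", "1100": "E"}

table = expand_to_table(LABELS, 4)
slots = [LABELS.get(entry, "-") for entry in table]
print(slots)
print(Counter(slots))
```

Five prefixes occupy 16 slots here, and each extra bit of stride doubles the table, which is the space cost the slide weighs against the single memory access per lookup.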
Optimizing for cache performance
[Figure: trade-off plot with regions labeled "too many memory accesses", "optimal", and "too many cache misses"]
Performance Model
Inputs
• Hardware parameters
• Distribution of packets to prefixes in route table
• Lookup data structure
Output
• Average lookup time
Space of data structures
• Multi-level tries with splay trees at trie leaves
[Figure: trie with splay trees (binary, self-adjusting) attached at its leaves]
Goal: find the appropriate number of levels and stride values
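Performance model
$$\text{Average lookup time} = M_1 \times t_{L1} + M_2 \times t_{L2} + H \times t_{\text{trie}} + T \times t_{\text{tree}}$$
• $M_1$ = # L1 cache misses; $t_{L1}$ = L1 miss latency
• $M_2$ = # L2 cache misses; $t_{L2}$ = L2 miss latency
• $H$ = # trie nodes visited; $t_{\text{trie}}$ = time to visit a trie node
• $T$ = # tree nodes visited; $t_{\text{tree}}$ = time to visit a tree node
[Figure: CPU, L1 cache, L2 cache, and memory, with $t_{L1}$ and $t_{L2}$ marked between the levels]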
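The model is a weighted sum, which the following sketch evaluates directly. The argument names mirror the symbols on the slide; the per-lookup counts and the trie-node time in the example call are made-up illustration values, not measurements from the talk.

```python
def average_lookup_time(m1, m2, h, t, t_l1, t_l2, t_trie, t_tree):
    """Evaluate M1*tL1 + M2*tL2 + H*ttrie + T*ttree.

    m1, m2: average L1 and L2 cache misses per lookup
    h, t:   average trie nodes and tree nodes visited per lookup
    t_*:    corresponding per-event times, all in the same unit (e.g. ns)
    """
    return m1 * t_l1 + m2 * t_l2 + h * t_trie + t * t_tree


# Hypothetical per-lookup counts for a 2-level trie with no splay-tree visits.
estimate = average_lookup_time(
    m1=0.5, m2=0.1, h=2, t=0,
    t_l1=38, t_l2=100,      # example miss latencies in ns
    t_trie=10, t_tree=15,   # hypothetical node-visit times in ns
)
print(estimate)
```

Because the terms are independent, the same function can be re-evaluated with different latencies to ask how a hardware change would shift the average.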
Predicting cache misses
[Figure: access probabilities $p_1, p_2, \dots, p_N$ for N memory blocks, mapped onto C cache blocks each labeled 1/C]
P(miss for memory block i) =
• $p_i \times (1 - C p_i)$ (direct-mapped)
• $p_i \times (1 - p_i)^C$ (fully associative)
• $p_i \times (1 - C p_i / n)^n$ (n-way associative)
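The three formulas are instances of the n-way expression, with n = 1 for a direct-mapped cache and n = C for a fully associative one, as the sketch below shows. Clamping the bracketed term at zero for very hot blocks is an assumption added here so the result stays a valid probability; it is not stated on the slide. Function names are hypothetical.

```python
def miss_prob(p, cache_blocks, ways):
    """p * (1 - C*p/n)**n for a block with access probability p.

    cache_blocks is C, ways is n. The inner term is clamped at 0
    (an assumption of this sketch) for blocks with p > n/C.
    """
    inner = max(0.0, 1.0 - cache_blocks * p / ways)
    return p * inner ** ways


def direct_mapped(p, cache_blocks):
    return miss_prob(p, cache_blocks, 1)


def fully_associative(p, cache_blocks):
    return miss_prob(p, cache_blocks, cache_blocks)


def expected_misses(probs, cache_blocks, ways):
    """Sum the per-block miss probabilities over all memory blocks."""
    return sum(miss_prob(p, cache_blocks, ways) for p in probs)


# Example: 1000 equally likely memory blocks and a 100-block, 4-way cache.
uniform = [1.0 / 1000] * 1000
print(expected_misses(uniform, cache_blocks=100, ways=4))
```

Summing over the blocks touched by a lookup structure gives one way to estimate the miss counts that feed the lookup-time model.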
Obtaining access counts
Input: number of hits to each prefix
• Trie nodes: $\text{count}(v) = \sum_{u \in \text{child}(v)} \text{count}(u)$
• Splay tree nodes: assume $E(S)$ accesses per lookup in splay tree $S$
  • $3E(S)$ in theory; $T$ = weighted average of $E(S)$
• Accumulate access counts in 32n global counters to search for an n-level trie
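The trie-node rule can be sketched as a post-order traversal that sums child counts into each parent. The dictionary-based tree representation and the node names are assumptions made for the example, and hits are assumed to be recorded only at the leaves.

```python
def accumulate_counts(children, leaf_counts, root):
    """Apply count(v) = sum of count(u) over the children u of v.

    children:    dict mapping a node to the list of its child nodes
    leaf_counts: dict mapping each leaf to its number of hits
    Returns a dict with a count for every node reachable from root.
    """
    counts = dict(leaf_counts)

    def visit(v):
        kids = children.get(v, [])
        if kids:
            counts[v] = sum(visit(u) for u in kids)
        return counts.get(v, 0)

    visit(root)
    return counts


# Hypothetical three-level tree with hit counts on the leaves.
children = {"root": ["a", "b"], "a": ["a0", "a1"]}
leaf_counts = {"a0": 5, "a1": 3, "b": 2}
counts = accumulate_counts(children, leaf_counts, "root")
print(counts)
```

Dividing each node's count by the root's count turns the totals into per-lookup access probabilities.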
Hardware platforms
Processor           | L1 cache            | L1 miss (tL1) | L2 cache              | L2 miss (tL2)
400 MHz Pentium-II  | 16KB on-chip, 4-way | 38 ns         | 512KB off-chip, 4-way | 100 ns
700 MHz Pentium-III | 16KB on-chip, 4-way | 10 ns         | 256KB on-chip, 8-way  | 100 ns
ttree and ttrie obtained from VTune and confirmed via data fitting
Packet Traces
Distribution of packets to prefixes in 52K Mae-East BGP table
• Synthetic traces: Rand-Net, Rand-IP
• Real traces: ISP, SDC
[Figure: distribution of packets to prefixes for the Rand-Net, Rand-IP, ISP, and SDC traces]
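Model Validation
Using measured (not predicted) $M_1$, $M_2$, $T$, $H$
$$\text{Average lookup time} = M_1 \times t_{L1} + M_2 \times t_{L2} + T \times t_{\text{tree}} + H \times t_{\text{trie}}$$
[Figure: measured vs. model lookup time, 1-level trie, for the Rand-Net, ISP, Rand-IP, and SDC traces]
Model Validation (cont’d)
[Figure: measured vs. model, 1-level trie, for the Rand-Net, ISP, Rand-IP, and SDC traces]
Model Validation (cont’d)
[Figure: measured vs. model, 2-level trie, for the Rand-Net, ISP, and SDC traces]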
“Best” lookup data structures
         | Pentium II                   | Pentium III
Trace    | Meas. | Model | Struct.      | Meas. | Model | Struct.
Rand-Net | 242   | 235   | T2(16,24)    | 168   | 164   | T3(16,24,28)
ISP      | 197   | 202   | T2(21,24)    | 131   | 149   | T3(21,24,27)
Rand-IP  | 140   | 142   | T2(16,24)    | 89    | 108   | T3(16,24,28)
SDC      | 89    | 104   | T1(21)       | 50    | 62    | T1(21)
Using suboptimal structures
% loss in performance (Pentium II)
            | Trace with optimal structure
Input trace | Rand-Net | ISP  | Rand-IP | SDC
Rand-Net    | -        | 15.7 | 0.0     | 58.3
ISP         | 29.9     | -    | 29.9    | 11.3
Rand-IP     | 0.0      | 33.1 | -       | 35.0
SDC         | 31.5     | 20.4 | 31.5    | -
Impact of hardware improvements
Example: L2 cache size
[Figure: effect of L2 cache size, Pentium III]
Processor and L2 speeds
[Figure: effect of processor and L2 speeds, ISP trace, Pentium II architecture (model)]
Conclusions
• Possible to predict (within ~10% accuracy) average-case performance for IP lookup: the memory hierarchy cannot be ignored
• “Best” data structure depends on input trace and lookup hardware
• Performance model could be used to design future lookup architectures
  • Can search space of hardware configurations under cost constraints
Total data structure space
[Figure: total data structure space, Pentium III]
L1 cache size
[Figure: effect of L1 cache size, Pentium III]