Towards a Billion Routing Lookups per Second in Software • Author: Marko Zec, Luigi Rizzo, Miljenko Mikuc • Publisher: SIGCOMM Computer Communication Review, 2012 • Presenter: Yuen-Shuo Li • Date: 2012/12/12
Outline • Idea • Introduction • Search Data Structure • Lookup Algorithm • Updating • Performance
Idea Can a software routing implementation compete in a field generally reserved for specialized lookup hardware?
Introduction(1/3) Software routing lookups have fallen out of the focus of the research community in the past decade, as the performance of general-purpose CPUs had not been on par with quickly increasing packet rates. Software vs. Hardware?
Introduction(2/3) The landscape has changed, however. • The performance potential of general-purpose CPUs has increased significantly over the past decade. • instruction-level parallelism, shorter execution pipelines, etc. • The increasing interest in virtualized systems and software defined networks calls for a communication infrastructure that is more flexible and less tied to the hardware.
Introduction(3/3) DXR, an efficient IPv4 lookup scheme for software running on general-purpose CPUs. • The primary goal was not to implement a universal routing lookup scheme: the data structures and algorithms have been optimized exclusively for the IPv4 protocol. • It distills a real-world BGP snapshot with 417,000 prefixes into a structure consuming only 782 Kbytes, less than 2 bytes per prefix, and achieves 490 MLps. MLps: million lookups per second
Search Data Structure(1/11) Building of the search data structure begins by expanding all prefixes from the database into address ranges.
Search Data Structure(2/11) Neighboring address ranges that resolve to the same next hop are then merged.
Search Data Structure(3/11) • Each entry only needs the start address and the next hop: the search data we need!
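A minimal sketch of this construction step, assuming a hypothetical range_entry layout (start address plus next-hop index); resolving overlaps between more- and less-specific prefixes during expansion is omitted here:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical range entry: the start of an address range and the index
 * of its next hop.  The end of a range is implied by the start of the
 * following one. */
struct range_entry {
    uint32_t start;     /* first IPv4 address covered (host byte order) */
    uint16_t nexthop;   /* index into the next-hop table */
};

/* Merge neighboring ranges that resolve to the same next hop, in place.
 * Assumes 'r' is already sorted by start address; returns the new count. */
static size_t
merge_ranges(struct range_entry *r, size_t n)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        if (out > 0 && r[out - 1].nexthop == r[i].nexthop)
            continue;               /* same next hop: absorb into previous */
        r[out++] = r[i];
    }
    return (out);
}
```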
Search Data Structure(5/11) k = 16 (chunks and lookup table) • To shorten the search, the entire range is split into chunks, and the initial k bits of the address reach the corresponding chunk through a lookup table.
Search Data Structure(6/11) k = 16, chunks • Each entry in the lookup table must contain the position and size of the corresponding chunk in the range table. • A special value for size indicates that the 19 bits are an offset into the next-hop table. • Lookup table entry (32 bits): 1-bit format flag, 12-bit size, 19-bit position.
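A sketch of how such a 32-bit entry could be packed; only the 1 + 12 + 19 bit split comes from the slide, while the exact field order and the choice of special size value are assumptions:

```c
#include <stdint.h>

/* Assumed field layout: format flag in the top bit, then a 12-bit size
 * and a 19-bit position.  A size of 0 is used here as the "special
 * value" meaning the 19 position bits are an offset into the next-hop
 * table (assumption). */
#define DXR_FMT_SHIFT   31
#define DXR_SIZE_SHIFT  19
#define DXR_SIZE_MASK   0xfffu      /* 12 bits */
#define DXR_POS_MASK    0x7ffffu    /* 19 bits */
#define DXR_SIZE_DIRECT 0u

static inline uint32_t
dxr_entry_make(unsigned is_short, uint32_t size, uint32_t position)
{
    return ((uint32_t)(is_short & 1) << DXR_FMT_SHIFT) |
           ((size & DXR_SIZE_MASK) << DXR_SIZE_SHIFT) |
           (position & DXR_POS_MASK);
}

static inline uint32_t dxr_entry_size(uint32_t e)  { return ((e >> DXR_SIZE_SHIFT) & DXR_SIZE_MASK); }
static inline uint32_t dxr_entry_pos(uint32_t e)   { return (e & DXR_POS_MASK); }
static inline unsigned dxr_entry_short(uint32_t e) { return (e >> DXR_FMT_SHIFT); }
```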
Search Data Structure(8/11) Range table • With k bits already consumed to index the lookup table, the range table entries only need to store the remaining 32-k address bits, and the next hop index.
Search Data Structure(9/11) Range table • A further optimization, especially effective for large BGP views, where the vast majority of prefixes are /24 or less specific: range entries can be stored in either a long or a short format.
Search Data Structure(10/11) Range table • One bit in the lookup table entry indicates whether a chunk is stored in long or short format. • Long format (32 bits): 16 address bits + 16-bit next-hop index. • Short format (16 bits): 8 address bits + 8-bit next-hop index.
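A sketch of the two assumed entry layouts for k = 16; field names and the short-format applicability condition are illustrative, not the authors' exact definitions:

```c
#include <stdint.h>

/* Long format (32 bits): the remaining 32 - k address bits of the range
 * start plus a 16-bit next-hop index (shown for k = 16). */
struct range_long {
    uint16_t start;     /* remaining 16 address bits */
    uint16_t nexthop;   /* next-hop index */
};

/* Short format (16 bits): only the upper 8 of the remaining address bits
 * and an 8-bit next-hop index.  Usable when all range boundaries in the
 * chunk fall on /24 boundaries and its next-hop indices fit in 8 bits
 * (assumed condition). */
struct range_short {
    uint8_t start;      /* upper 8 of the remaining address bits */
    uint8_t nexthop;    /* next-hop index */
};
```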
Lookup Algorithm(1/4) • The k leftmost bits of the key are used as an index into the lookup table. • Example: 1.2.4.0 = 00000001.00000010.00000100.00000000; with k = 16, the index is the first 16 bits, 00000001.00000010.
Lookup Algorithm(2/4) • If the entry points to the next hop, the lookup is complete. • A special value for size indicates that the 19 bits are an offset into the next-hop table.
Lookup Algorithm(3/4) • Otherwise, the (position, size) information in the entry selects the portion of the range table on which to perform a binary search of the (remaining bits of the) key.
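Putting the pieces together, a minimal sketch of the lookup path for long-format chunks, reusing the hypothetical entry helpers and struct range_long from the sketches above (the short-format branch is analogous); it returns a next-hop index:

```c
#include <stdint.h>

uint16_t
dxr_lookup(const uint32_t *lookup_tbl, const struct range_long *range_tbl,
    uint32_t dst)
{
    uint32_t e = lookup_tbl[dst >> 16];     /* index by the k = 16 leftmost bits */
    uint32_t size = dxr_entry_size(e);
    uint32_t pos = dxr_entry_pos(e);

    if (size == DXR_SIZE_DIRECT)            /* next hop encoded directly */
        return ((uint16_t)pos);

    /* Binary search the chunk for the last entry whose start address is
     * <= the remaining 16 bits of the key. */
    const struct range_long *chunk = &range_tbl[pos];
    uint16_t key = (uint16_t)(dst & 0xffff);
    uint32_t lo = 0, hi = size - 1;

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo + 1) / 2;
        if (chunk[mid].start <= key)
            lo = mid;
        else
            hi = mid - 1;
    }
    return (chunk[lo].nexthop);
}
```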
Updating(1/3) Lookup structures store only the information necessary for resolving next-hop lookups, so a separate database, which stores detailed information on all the prefixes, is required for rebuilding the lookup structures.
Updating(2/3) Updates are handled as follows: • updates covering multiple chunks (prefix length < k) are expanded. • then each chunk is processed independently and rebuilt from scratch. • Find the best matching route for the first IPv4 address belonging to the chunk, translating it to a range table entry, and continue through the chunk's address space. • If the range table heap contains only a single element when the process ends, the next hop can be directly encoded in the lookup table.
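A rough sketch of rebuilding one chunk from the separate prefix database, assuming a hypothetical rib_lookup() that returns the best-matching next hop for an address together with the first address above it where that best match changes:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical longest-prefix-match query against the detailed prefix
 * database.  Returns the next-hop index for 'addr' and stores in
 * '*next_change' the first address above 'addr' where the result changes
 * (0 if the match covers the rest of the address space). */
uint16_t rib_lookup(uint32_t addr, uint32_t *next_change);

/* Rebuild the range entries of one chunk from scratch (k = 16 sketch).
 * Returns the number of long-format entries written to 'out'; if it
 * returns 1, the caller can encode the next hop directly in the lookup
 * table instead. */
size_t
rebuild_chunk(uint32_t chunk_idx, struct range_long *out, size_t max)
{
    uint32_t addr = chunk_idx << 16;        /* first address in the chunk */
    uint32_t end = addr | 0xffff;           /* last address in the chunk */
    size_t n = 0;

    for (;;) {
        uint32_t next;
        uint16_t nh = rib_lookup(addr, &next);

        /* Neighboring ranges with the same next hop stay merged. */
        if (n == 0 || out[n - 1].nexthop != nh) {
            if (n == max)
                break;                      /* out of space, caller handles it */
            out[n].start = (uint16_t)(addr & 0xffff);
            out[n].nexthop = nh;
            n++;
        }
        if (next == 0 || next > end)
            break;                          /* chunk fully covered */
        addr = next;
    }
    return (n);
}
```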
Updating(3/3) The rebuild time can be reduced: • by coalescing multiple updates into a single rebuild; we have implemented this by delaying the reconstruction for several milliseconds after the first update is received • by processing chunks in parallel, if multiple cores are available
Performance(1/11) Our primary test vectors were three IPv4 snapshots from routeviews.org BGP routers.
Performance(2/11) Here we present the results obtained using the LINX snapshot, selected because it is the most challenging for our scheme: • it contains the highest number of next hops • and an unusually high number of prefixes more specific than /24
Performance(3/11) Our primary test machine had a 3.6 GHz AMD FX-8150 8-core CPU and 4 GB of RAM. Each pair of cores shares a private 2 MB L2 cache block, for a total of 8 MB of L2 cache. All 8 cores share 8 MB of L3 cache. We also ran some benchmarks on a 2.8 GHz Intel Xeon W3530 4-core CPU with smaller but faster L2 caches (256 KB per core), and 8 MB of shared L3 cache. • All tests ran on the same platform/compiler (FreeBSD 8.3, amd64, gcc 4.2.2) • Each test ran for 10 seconds
Performance(4/11) We used three different request patterns: • REP (repeated): • each key is looked up 16 times before moving to the next one. • RND (random): • each key is looked up only once. The test loop has no data dependencies between iterations. • SEQ (sequential): • same as RND, but the timing loop has an artificial data dependency between iterations so that requests become effectively serialized.
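The three patterns could be generated with timing loops like the sketch below (names are placeholders; the SEQ loop only illustrates how a data dependency serializes the lookups, whereas the authors' harness additionally preserves the intended key distribution):

```c
#include <stdint.h>
#include <stddef.h>

extern uint16_t dxr_lookup_addr(uint32_t dst);  /* placeholder lookup wrapper */

volatile uint16_t sink;                         /* keeps results from being optimized away */

void
bench_rep(const uint32_t *keys, size_t n)       /* REP: 16 lookups per key */
{
    for (size_t i = 0; i < n; i++)
        for (int j = 0; j < 16; j++)
            sink = dxr_lookup_addr(keys[i]);
}

void
bench_rnd(const uint32_t *keys, size_t n)       /* RND: independent iterations */
{
    for (size_t i = 0; i < n; i++)
        sink = dxr_lookup_addr(keys[i]);
}

void
bench_seq(const uint32_t *keys, size_t n)       /* SEQ: serialized iterations */
{
    uint16_t prev = 0;

    /* Feeding the previous result into the next key creates a data
     * dependency, so the CPU cannot overlap consecutive lookups. */
    for (size_t i = 0; i < n; i++)
        prev = dxr_lookup_addr(keys[i] ^ prev);
}
```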
Performance(6/11) Average lookup throughput with D18R is consistently higher than with D16R on our test systems.
Performance(7/11) The distribution of binary search steps for uniformly random queries, observed using the LINX database (n = 0 means a direct match in the lookup table).
Performance(8/11) • Interference between multiple cores causes a modest slowdown of the individual threads.
Performance(9/11) Comparison with DIR-24-8-BASIC.