Shangri-La: Achieving High Performance from Compiled Network Applications while Enabling Ease of Programming. Michael K. Chen, Xiao Feng Li, Ruiqi Lian, Jason H. Lin, Lixia Liu, Tao Liu, Roy Ju. Discussion Prepared by Jennifer Chiang
Shangri-La: Some Insight • A synonym for paradise • Legendary place from James Hilton’s novel Lost Horizon • Goal: achieve a perfect compiler
Introduction • Problem: Programming network processors is challenging. • Tight memory access and instruction budgets must be met to sustain high line rates. • Traditionally done with hand-coded assembly. • Solution: Researchers have recently proposed high-level programming languages for packet processing. • Challenge: Can these languages be compiled into code as competitive as hand-tuned assembly?
Shangri-La Compiler from a 10,000-Foot View • Consists of a programming language, compiler, and runtime system targeting the Intel IXP multi-core network processor. • Accepts packet programs written in Baker. • Maximizes processor utilization: hot code paths are mapped across processing elements. • No hardware caches on the target: delayed-update, software-controlled caches hold frequently accessed data. • Packet handling optimizations reduce per-packet memory access and instruction counts. • Custom stack model maps stack frames to the fastest levels of the target processor’s memory hierarchy.
Baker Programming Language • Baker programs are structured as a dataflow of packets from Rx to Tx. • Module: container holding related PPFs, wirings, support code, and shared data. • PPF (packet processing function): C-like code that performs the actual packet processing; holds temporary local state and accesses global data structures. • CC (communication channel): input and output channel endpoints of PPFs wired together; asynchronous FIFO-ordered queues.
Baker Program Example • [Diagram: a module containing PPFs wired together by CCs]
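The slide’s diagram cannot be reproduced here, so below is a minimal, self-contained C sketch of the same dataflow shape: a PPF pulling packets from an input channel endpoint and pushing them to an output endpoint over a bounded FIFO. All names (packet_handle, chan_t, classify_ppf) are hypothetical stand-ins, not actual Baker syntax.

```c
#include <stdio.h>

#define CHAN_CAP 16

typedef struct { int id; } packet_handle;   /* stand-in for a packet */

typedef struct {                            /* bounded FIFO channel  */
    packet_handle *slots[CHAN_CAP];
    int head, tail;
} chan_t;

static int chan_put(chan_t *c, packet_handle *p) {
    if ((c->tail + 1) % CHAN_CAP == c->head) return 0;   /* full */
    c->slots[c->tail] = p;
    c->tail = (c->tail + 1) % CHAN_CAP;
    return 1;
}

static packet_handle *chan_get(chan_t *c) {
    if (c->head == c->tail) return NULL;                 /* empty */
    packet_handle *p = c->slots[c->head];
    c->head = (c->head + 1) % CHAN_CAP;
    return p;
}

/* PPF: drain the input endpoint, process each packet, and push it
 * to the output endpoint. */
static void classify_ppf(chan_t *in, chan_t *out) {
    packet_handle *p;
    while ((p = chan_get(in)) != NULL) {
        /* ... inspect headers, attach metadata ... */
        chan_put(out, p);
    }
}

int main(void) {
    chan_t rx = {0}, to_tx = {0};
    packet_handle pkt = { .id = 1 };
    chan_put(&rx, &pkt);          /* packet arrives on Rx            */
    classify_ppf(&rx, &to_tx);    /* PPF moves it downstream to Tx   */
    printf("forwarded packet %d\n", chan_get(&to_tx)->id);
    return 0;
}
```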
Packet Support • Protocols are specified using Baker’s protocol construct. • Metadata stores state associated with a packet but not contained in the packet itself; useful for state set by one PPF and used later by another PPF. • packet_handle: used to manipulate packets. • [Diagram: packet data, metadata, and packet_handle]
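A rough C analogue of these three abstractions (the struct names and fields are assumptions for illustration, not Baker’s actual declarations):

```c
#include <stdint.h>

/* Fields a Baker protocol construct would describe (abridged IPv4). */
struct ipv4_hdr {
    uint8_t  version_ihl;
    uint8_t  tos;
    uint16_t total_len;
    uint8_t  ttl;
    uint32_t dst_addr;
};

/* Metadata: per-packet state kept outside the packet bytes, e.g. set
 * by a classification PPF and read later by a forwarding PPF. */
struct packet_meta {
    uint16_t input_port;
    uint16_t next_hop;
};

/* packet_handle: the handle PPFs use to manipulate a packet. */
struct packet_handle {
    struct ipv4_hdr   *data;   /* packet contents (DRAM on the IXP)  */
    struct packet_meta meta;   /* associated state (SRAM on the IXP) */
};
```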
IXP2400 Network Processor • Intel XScale core: processes control packets, executes noncritical application code, and handles initialization and management of the network processor. • 8 MEs (microengines): lightweight, multi-threaded pipelined processors running a special ISA designed for processing packets. • 4 levels of memory: Local Memory, Scratch Memory, SRAM, DRAM.
Aggregation • Throughput model: t = (n / p) × k • n = number of MEs • p = total number of pipeline stages • k = throughput of the slowest pipeline stage • t = overall throughput • Latency of a packet through the system can be tolerated, but minimum forwarding rates must be guaranteed. • To maximize throughput, the compiler pipelines code or duplicates it across multiple processing elements (a worked example follows below). • Techniques: pipelining, merging, duplication
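A worked example under one plausible reading of this model (illustrative numbers, not from the paper): with n = 8 MEs split evenly across p = 4 pipeline stages, each stage is duplicated on 8/4 = 2 MEs; if the slowest stage sustains k = 1.5 Mpps on a single ME, the overall throughput is t = (8/4) × 1.5 = 3 Mpps.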
Delayed-Update Software Controlled Caching • Caching candidates: frequently read data structures with high hit rates that are infrequently written. • Updates to these structures rely only on the coherency of a single atomic write to guarantee correctness. • Reduces the frequency and cost of coherency checks. • Penalty: updates take effect late, so a few packets may be handled with stale data (packet delivery errors). A sketch follows below.
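A minimal C11 sketch of the mechanism as described above, with assumed names (route_table, g_table, lookup, update) that are not from the paper: readers take a lock-free fast path, and a writer publishes a whole new table with one atomic pointer store.

```c
#include <stdatomic.h>
#include <stdint.h>

struct route_table { uint16_t next_hop[256]; };

/* One atomic pointer is the only coherency point readers ever touch.
 * Assumes an initial table is installed at startup. */
static _Atomic(struct route_table *) g_table;

/* Fast path: lock-free read, no per-access coherency check. */
uint16_t lookup(uint8_t prefix) {
    struct route_table *t =
        atomic_load_explicit(&g_table, memory_order_acquire);
    return t->next_hop[prefix];
}

/* Slow path: build the new table off to the side, then publish it
 * with a single atomic store. Until readers pick up the new pointer
 * they keep using the old table, so a few packets may be forwarded
 * on stale routes: the transient "delivery error" penalty. */
void update(struct route_table *fresh) {
    atomic_store_explicit(&g_table, fresh, memory_order_release);
}
```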
PAC (Packet Access Combining) • Packet data is always stored in DRAM. • If every packet access mapped to a DRAM access, packet forwarding rates would quickly be limited by DRAM bandwidth. • In the code generation stage of the compiler, multiple protocol field accesses are combined into a single wide DRAM access; the sketch below illustrates the idea. • The same can be done for SRAM metadata accesses.
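An illustrative C sketch of the effect PAC aims for, not the compiler’s actual generated code; DRAM and dram_read_wide() here are stand-ins modeling a single wide access:

```c
#include <stdint.h>
#include <string.h>

static uint8_t DRAM[1 << 20];   /* stand-in for IXP packet memory */

/* Models one wide DRAM burst access. */
static void dram_read_wide(uint32_t addr, void *buf, unsigned bytes) {
    memcpy(buf, &DRAM[addr], bytes);
}

struct hdr_fields { uint8_t ttl; uint16_t total_len; uint32_t dst; };

/* Without PAC: three narrow DRAM accesses, one per field.
 * With PAC: one wide access into a local buffer, then cheap extracts. */
struct hdr_fields read_fields(uint32_t pkt_addr) {
    uint8_t buf[32];                     /* covers all needed fields */
    dram_read_wide(pkt_addr, buf, sizeof buf);

    struct hdr_fields f;                 /* extraction is local work, */
    memcpy(&f.total_len, buf + 2, 2);    /* no further DRAM traffic   */
    f.ttl = buf[8];
    memcpy(&f.dst, buf + 16, 4);
    return f;
}
```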
Stack Layout Optimization • Goal: allocate as many stack frames as possible to the limited amount of fast memory. • The stack can grow into SRAM, but SRAM’s high latency hurts performance. • Assign Local Memory to procedures higher in the program call graph. • Assign SRAM only when Local Memory is completely exhausted. • Utilize physical and virtual stack pointers. • A simplified sketch of this placement policy follows below.
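A greedy C sketch of the placement policy just described, under assumptions not in the slides (the 640-byte Local Memory budget and the pre-sorted frame order are illustrative; virtual/physical stack pointer handling is omitted):

```c
#include <stdio.h>

#define LOCAL_MEM_BYTES 640   /* per-ME Local Memory budget (illustrative) */

enum region { LOCAL_MEM, SRAM };

struct frame { const char *proc; unsigned bytes; enum region placed; };

/* frames[] is assumed pre-sorted so procedures higher in the call
 * graph come first; each frame goes to Local Memory until the budget
 * is exhausted, then falls back to SRAM. */
void place_frames(struct frame *frames, int n) {
    unsigned local_left = LOCAL_MEM_BYTES;
    for (int i = 0; i < n; i++) {
        if (frames[i].bytes <= local_left) {
            frames[i].placed = LOCAL_MEM;
            local_left -= frames[i].bytes;
        } else {
            frames[i].placed = SRAM;   /* slower, but larger capacity */
        }
    }
}

int main(void) {
    struct frame fs[] = {
        { "rx_loop",   256, SRAM },
        { "classify",  320, SRAM },
        { "log_error", 512, SRAM },    /* cold path: spills to SRAM */
    };
    place_frames(fs, 3);
    for (int i = 0; i < 3; i++)
        printf("%-10s -> %s\n", fs[i].proc,
               fs[i].placed == LOCAL_MEM ? "Local Memory" : "SRAM");
    return 0;
}
```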
Experimental Results • 3 benchmarks: L3-Switch, Firewall, MPLS. • The significant impact of PAC is evident in the large reduction in per-packet SRAM and DRAM accesses. • Code generated by Shangri-La for all 3 benchmarks achieved 100% forwarding rates at 2.5 Gbps, meeting the design spec of the IXP2400. • The same throughput target was also achieved by hand-coded assembly written specifically for these processors.
Conclusions • Shangri-La provides a complete framework for aggressively compiling network programs. • Reduces both instruction and memory access counts. • Achieved its goal of a 100% packet forwarding rate at 2.5 Gbps.