Introduction • Linus Svensson • D5, linus@sm.luth.se • Åke Östmark • D5, ake@sm.luth.se
Why We Are Here • The architecture of a Network Processor Unit (NPU) • Master’s thesis - a joint operation between Luleå University of Technology and SwitchCore AB
Today's Topics • Background • Ethernet and internetworks • Switches and routers • NPU (Network Processor Unit) • Why an NPU? • Pros and cons of NPUs • The architecture of our NPU • Design difficulties and design choices • The architecture, strengths and weaknesses • The big picture • From idea to silicon
Ethernet • Most widespread network technology used in LANs (Local Area Networks) • 10 Mb/s (Ethernet) • 100 Mb/s (Fast Ethernet) • 1000 Mb/s (Gigabit Ethernet) • Packet-switched network • Host-to-host delivery on the same network • Switches forward packets from one section to another using the datagram paradigm
Ethernet • Datagram paradigm • Each packet contains enough information for a switch to forward it correctly • I.e. each packet contains the complete destination address • In Ethernet, packets are referred to as frames
Ethernet Frame Format • Preamble • 64 bits used for synchronisation • Header • 48-bit globally unique destination address • 48-bit globally unique source address • 16-bit type field used for classification
Ethernet Frame Format • Body • 46-1500 bytes of data • CRC • 32-bit CRC (Cyclic Redundancy Check) for error detection
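The frame layout above can be sketched in Python as a small parser. This is only an illustration: `parse_ethernet_frame` and the example addresses are hypothetical, the preamble is assumed to have been stripped already, and the CRC is sliced off but not verified.

```python
import struct

HEADER_LEN = 14  # 6-byte destination + 6-byte source + 2-byte type
CRC_LEN = 4      # 32-bit CRC trailer


def parse_ethernet_frame(frame: bytes) -> dict:
    """Split a frame (preamble already removed) into its fields."""
    if len(frame) < HEADER_LEN + CRC_LEN:
        raise ValueError("frame too short")
    # "!" = network byte order; 6s6sH = two 48-bit addresses and a 16-bit type
    dst, src, eth_type = struct.unpack("!6s6sH", frame[:HEADER_LEN])
    return {
        "dst": dst.hex(":"),
        "src": src.hex(":"),
        "type": eth_type,
        "body": frame[HEADER_LEN:-CRC_LEN],
        "crc": frame[-CRC_LEN:],
    }


# Made-up example: broadcast destination, type 0x0800 (IP), 46-byte body,
# and a dummy all-zero CRC (not a valid checksum).
example = bytes.fromhex("ffffffffffff" "020000000001" "0800") + bytes(46) + bytes(4)
fields = parse_ethernet_frame(example)
print(fields["dst"], fields["src"], hex(fields["type"]), len(fields["body"]))
```

A switch only needs the destination field to pick an out-port, which is the point of the datagram paradigm.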
Internetworks • Internetwork • Several physical networks combined into one logical internetwork • Also called internet (with lowercase “i”) • Most famous is the world spanning Internet (with capital “I”) • Host-to-host delivery between different networks
Internet Protocol (IP) • Most widespread protocol used in internetworks • Routers forward packets from one network to another using the datagram paradigm
IP Packet Format • 12 bytes of status fields, e.g. version, length, etc. • 32-bit globally unique source address • 32-bit globally unique destination address • Optional fields of variable length • Body
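As a rough Python sketch of the packet format above: the first 12 bytes hold the status fields, followed by the two 32-bit addresses, any options, and the body. `parse_ipv4_header` and the sample addresses are hypothetical; only the version, header length and total length are decoded from the status fields.

```python
import ipaddress
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Pull out the fields named on the slide from a raw IPv4 packet."""
    if len(packet) < 20:
        raise ValueError("packet shorter than the minimum IPv4 header")
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4        # header length in bytes
    total_len = struct.unpack("!H", packet[2:4])[0]
    src, dst = struct.unpack("!4s4s", packet[12:20])
    return {
        "version": version,
        "header_len": header_len,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
        "options": packet[20:header_len],      # empty when header_len == 20
        "body": packet[header_len:total_len],
    }


# Made-up packet: 20-byte header without options plus a 4-byte body.
sample = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 24, 0, 0, 64, 17, 0,
    bytes([10, 0, 0, 1]), bytes([192, 168, 1, 20]),
) + b"data"
header = parse_ipv4_header(sample)
print(header["src"], "->", header["dst"], header["body"])
```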
IP Over Ethernet • IP packets are encapsulated in Ethernet frames
Devices • SwitchCore CXE-2010 • A 16-port Gigabit Ethernet Switch-on-a-chip • Full 4K VLAN support • Includes support of IEEE 802.1p • Cisco 1710 • Security Access Router • Secure Internet, intranet, and extranet access with VPN and firewall • Advanced QoS features
Features • What if we want: • Load Balancing • distributing client requests across multiple servers • Multi-Protocol Label Switching (MPLS) • next hop chosen based on the label
Features • What if we don’t want: • QoS • Security features • The Network Processor Unit (NPU) • A programmable CPU chip that is optimized for networking and communications functions • Quick adaptation to new standards/features
Conditions For the Work • 1 GE (1000 Mbit/s) port • 8 FE (100 Mbit/s) ports • Scalable • Add more ports • Remove ports • Feasible to make an ASIC prototype
NPU components: • Processor Core • Embedded software • Network Interface • Packet buffers • Queues • Tables • Switch fabric
Design Choices • Processor core • RISC based • Network specific • Network Interface • FE • MII (Media Independent Interface) • RMII (Reduced MII) • GE • GMII (Gigabit MII) • RGMII (Reduced GMII)
Design Choices • Queues • A packet ready for transmission • Tables • Data structure for IP & MAC addresses • Switch fabric • The internal interconnect architecture. How to transport from in-port to out-port?
Design Choices • Packet buffers • Internal and/or external • How many times do we need to access a (buffer) memory? • Write the packet when it is received from the network • Read the packet for processing • Write the modified packet for transmission • Read the packet when transmitting • For N ports, the memory needs to run at 4N times the port speed
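A rough Python sketch of the 4N estimate above. It assumes every packet byte crosses the buffer memory four times (the four accesses listed) and that the demands of ports with different speeds simply add up; `buffer_bandwidth_mbps` is a hypothetical helper, not part of the actual design.

```python
ACCESSES_PER_PACKET = 4  # write on receive, read to process, write back, read to transmit


def buffer_bandwidth_mbps(port_speeds_mbps):
    """Worst-case memory bandwidth if all ports share one packet buffer."""
    return ACCESSES_PER_PACKET * sum(port_speeds_mbps)


# Port mix from this work: 8 FE ports and 1 GE port on a single shared buffer.
ports = [100] * 8 + [1000]
required = buffer_bandwidth_mbps(ports)
print(required, "Mbit/s")
```

For N equal ports this reduces to 4N times the port speed, which is why splitting the ports over several buffers eases the requirement.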
Design Choices • 8 FE ports • 1 GE port • Inter-arrival time: • 1.5*10^6 + 8*1.5*10^5 = 2.7*10^6 packets/s • -> New packet every 370 ns • Cycle budget example: • 100 MHz -> 37 cycles to process every packet • 200 MHz -> 74 cycles to process every packet
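The packet-rate and cycle-budget figures above can be reproduced with a short Python calculation. The slides give only the rounded rates; the sketch assumes the usual worst case of minimum-size frames, i.e. 64 bytes of frame plus 8 bytes of preamble and a 12-byte inter-frame gap per packet. All names are hypothetical.

```python
# Assumption: worst case is back-to-back minimum-size frames.
WIRE_BITS_PER_MIN_FRAME = (64 + 8 + 12) * 8  # frame + preamble + inter-frame gap


def packets_per_second(link_bps):
    """Maximum packet rate of one link at minimum frame size."""
    return link_bps / WIRE_BITS_PER_MIN_FRAME


# 1 GE port plus 8 FE ports feeding a single processor.
total_pps = packets_per_second(1_000_000_000) + 8 * packets_per_second(100_000_000)
inter_arrival_ns = 1e9 / total_pps


def cycle_budget(clock_hz):
    """Clock cycles available per packet for a single processing unit."""
    return clock_hz / total_pps


print(f"{total_pps:.3g} packets/s, one packet every {inter_arrival_ns:.0f} ns")
for mhz in (100, 200):
    print(f"{mhz} MHz -> {cycle_budget(mhz * 1_000_000):.1f} cycles per packet")
```

The exact values come out slightly different from the rounded slide figures (about 373 ns instead of 370 ns), but the conclusion is the same: a single core at these clock rates has only a few tens of cycles per packet.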
Design Choices • Model of operation • Route processing • Packet forwarding: ~200 cycles • Special services • Target technology • ~150 MHz
Design Decisions • Parallel Processor Architecture • 2 FE ports • 125 MHz • 1 Integer Unit • 1 GE port • 125 MHz • 5 Integer Units • -> Cycle budget of 420 cycles for each packet • Interactive voice can tolerate somewhere between 100 and 200 milliseconds of end-to-end delay without people noticing it • 420 cycles at 125 MHz -> 0.00336 ms
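The 420-cycle budget can be checked with a similar Python sketch. It assumes minimum-size frames (672 bit times on the wire per packet, an assumption not stated on the slides) and that packets are spread evenly over the integer units serving a group of ports; `cycles_per_packet` is a hypothetical helper.

```python
WIRE_BITS_PER_MIN_FRAME = (64 + 8 + 12) * 8  # assumed worst case: 672 bit times
CLOCK_HZ = 125_000_000


def cycles_per_packet(num_ports, link_bps, integer_units):
    """Cycles each packet may use when its port group shares `integer_units` IUs."""
    group_bps = num_ports * link_bps
    # Packets arrive every WIRE_BITS_PER_MIN_FRAME / group_bps seconds.
    return CLOCK_HZ * integer_units * WIRE_BITS_PER_MIN_FRAME / group_bps


fe_budget = cycles_per_packet(2, 100_000_000, 1)    # 2 FE ports on 1 IU
ge_budget = cycles_per_packet(1, 1_000_000_000, 5)  # 1 GE port on 5 IUs
delay_ms = fe_budget / CLOCK_HZ * 1000              # time spent on one packet

print(fe_budget, ge_budget, delay_ms)
```

Both port groups get the same budget, and the resulting per-packet processing time is several orders of magnitude below the 100 to 200 ms that interactive voice tolerates.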
Design Decisions • Tables • MAC Address lookup, fixed length: • CAM (Content Addressable Memory) • Pros: Fast • Cons: Expensive • Like a cache • IP Address lookup, longest match: • Possibly large table • External SRAM
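The two lookup styles above differ in kind, which a small Python sketch may make clearer. A dictionary stands in for the exact-match CAM, and a linear scan stands in for the longest-match IP lookup; real hardware would use a far more efficient structure, and the table contents and names here are made up.

```python
import ipaddress

# Exact match on a fixed-length key: a dict plays the role of the CAM.
mac_table = {"02:00:00:00:00:01": 3, "02:00:00:00:00:02": 7}

# Longest match: several prefixes may cover an address, the most specific wins.
routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "default"),
    (ipaddress.ip_network("10.0.0.0/8"), "port 1"),
    (ipaddress.ip_network("10.1.0.0/16"), "port 2"),
]


def longest_prefix_match(address, table):
    """Return the next hop of the most specific prefix containing `address`."""
    ip = ipaddress.ip_address(address)
    best = None
    for network, next_hop in table:
        if ip in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return best[1] if best else None


print(mac_table.get("02:00:00:00:00:01"))
print(longest_prefix_match("10.1.2.3", routes))
```

The MAC lookup is a single probe, while the IP lookup has to compare prefix lengths, which is one reason the IP table is the larger and slower of the two.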
Internal packet buffers: • Pros: Fast, lower pin count • Cons: Limited size of memory • 2 FE ports per buffer • Pros: Reduces contention, reduces the 4N problem • Cons: Less efficient use of memory
Virtual output queues: • Pros: No Head Of Line (HOL) blocking, Possible to select any packet from buffer memory • Cons: Expensive in hardware
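Head-of-line blocking can be illustrated with a toy Python model of one scheduling step. It is a simplification under stated assumptions: packets are `(name, output)` pairs, each free output accepts at most one packet per step, and `fifo_step` and `voq_step` are hypothetical names rather than anything in the actual design.

```python
from collections import deque


def fifo_step(packets, free_outputs):
    """Single input FIFO: only the head packet may leave."""
    queue = deque(packets)
    if queue and queue[0][1] in free_outputs:
        return [queue.popleft()[0]]
    return []  # head is blocked, so everything behind it waits too


def voq_step(packets, free_outputs, num_outputs):
    """Virtual output queues: one queue per output, so no packet blocks another."""
    voqs = [deque() for _ in range(num_outputs)]
    for name, output in packets:
        voqs[output].append(name)
    return [voqs[out].popleft() for out in sorted(free_outputs) if voqs[out]]


# Packet A wants busy output 0, packet B behind it wants free output 1.
arrivals = [("A", 0), ("B", 1)]
print(fifo_step(arrivals, {1}))
print(voq_step(arrivals, {1}, num_outputs=2))
```

With the single FIFO nothing is sent because A blocks B, while the virtual output queues let B through. The hardware cost comes from keeping one queue per output for every input.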
Strengths in the Architecture • More bandwidth • More RU and TU • New types of RU and TU • More processing power • More PU per RU/TU • More IU per PU • New types of PU • New types of IU
Strengths in the Architecture • New functionality • New types of shared resources • Semaphores • Multipurpose CPU • New software • Each IU can run different software
Weaknesses in the Architecture • Not everything scales well • Shared resources • Number of IUs in a PU
From Idea to Silicon • ASIC design flow
Layout
    -- Combinational ALU: the opcode selects which result is driven
    ALU : process(alu_RegA, alu_RegB, In_Ctrl_Ex)
    begin
      case In_Ctrl_Ex.OP is
        when ALU_ADD => alu_Result <= alu_RegA + alu_RegB;
        when ALU_SUB => alu_Result <= alu_RegA - alu_RegB;
        when ALU_AND => alu_Result <= alu_RegA and alu_RegB;
        when ALU_OR  => alu_Result <= alu_RegA or alu_RegB;
        when ALU_XOR => alu_Result <= alu_RegA xor alu_RegB;
        when ALU_NOR => alu_Result <= alu_RegA nor alu_RegB;
        -- Any other opcode: the result is don't-care ('-')
        when others  => alu_Result <= (others => '-');
      end case;
    end process;