Anatomy of an IP Router Vahid Tabatabaee Fall 2007
References • Panos C. Lekkas, Network Processors: Architectures, Protocols, and Platforms, McGraw-Hill. • James Aweya, "IP Router Architectures: An Overview", Nortel Networks, Ottawa, Canada. • Florian Brodersen, Alexander Klinetschek, "Anatomy of a High Performance IP Router", Communication Network Seminar 2003/04, Hasso-Plattner-Institute, University of Potsdam, Jan. 2004. • Steve Kohalmi, Tim Hale, "Anatomy of an IP Service Edge Switch", Quarry Technologies, 2002. • Cisco Systems CRS-1 router documents.
Basic IP Router Components • Network Interfaces • Processing Modules: packet forwarding and packet processing; may cache the routing table • Buffering Modules • Interconnection Unit (switch fabric): transfers packets between ingress and egress interface (line) cards • Route Processor: path computation, routing table maintenance • The processing and buffering modules may be replicated either fully or partially on the network interfaces.
Basic Functions of a Router • Route Processing (slow path, or control plane): routing protocols (OSPF, RIP, …), path computation, routing table maintenance, reachability propagation • Packet Forwarding (fast path, or data plane)
Packet Forwarding • IP Packet Validation • Version number • Header length field • Checksum • Destination IP address parsing and table lookup • Local delivery in the network • Unicast delivery to an output port • Multicast delivery to a set of output ports • Packet Lifetime Control • Adjust the time-to-live (TTL) field • A packet with a positive TTL is delivered to a local address • A packet delivered to an output port has its TTL decremented and rechecked before forwarding • Packet Fragmentation • Check if the packet size is larger than the MTU of the network • If so, fragment the packet.
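The validation step above can be sketched in code. This is a generic IPv4 header checksum check (the standard one's-complement sum over 16-bit words), not code from any particular router; the sample header values are made up for illustration:

```python
def ipv4_checksum_ok(header: bytes) -> bool:
    """An IPv4 header is valid when the one's-complement sum of all
    16-bit words (checksum field included) folds to 0xFFFF."""
    if len(header) % 2:
        header += b"\x00"
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                          # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF

# Build a minimal 20-byte header with a correct checksum.
hdr = bytearray(20)
hdr[0] = 0x45                                   # version 4, IHL = 5 words
hdr[8] = 64                                     # TTL
tmp = sum((hdr[i] << 8) | hdr[i + 1] for i in range(0, 20, 2))
while tmp >> 16:
    tmp = (tmp & 0xFFFF) + (tmp >> 16)
csum = ~tmp & 0xFFFF
hdr[10], hdr[11] = csum >> 8, csum & 0xFF       # store checksum
print(ipv4_checksum_ok(bytes(hdr)))             # True
```

Note that decrementing the TTL (as the forwarding step does) invalidates the stored checksum, which is why routers update the checksum incrementally after each hop.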
First Generation of Routers • Similar to a typical computer layout. • All functionality is implemented in software. • Single CPU, single Memory, Single Bus!
Problems with first generation routers • Processing speed is limited by the single CPU. • The CPU must process both packets destined to the router and packets passing through it. • Major packet processing tasks such as table lookups are memory intensive operations and cannot be made faster by simple processor upgrades. • Software implementation is inefficient, since packet processing is a small set of operations repeated on every packet. • The slow path and fast path are implemented on the same CPU; therefore, the slow path can interfere with the fast path. • The routing table has grown from 20,000 entries in 1994 to 260,000 entries today. • Moving data from one interface to another is time consuming and often exceeds the packet processing time. Source: http://bgp.potaroo.net/
Problems with first generation routers • The routing table lookup speed cannot be improved if we use traditional memories. • The conventional bus structure for the interconnection is very inefficient: • Every packet has to cross the bus at least twice. • The whole packet (not just the header) is transferred.
How fast should a router be? • An OC-48 link data rate: 2.488 Gbps • The packet rate matters more than the data rate. • The bottleneck is set by the minimum packet size, which depends on the technology. • E.g. Packet-over-SONET (PoS): 40 byte IP payload + 6 byte PPP/HDLC overhead: 2.405 Gbps / (8 x 46) = 6.53 MPPS • The aggregate packet rate for a 16 port system: 16 x 6.53 = 104.48 MPPS • One decision every 9.57 ns. • SDRAM access time is about 10 ns for sequential locations and practically around 20-50 ns.
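The slide's arithmetic can be reproduced with a short back-of-envelope script (all input values are taken from the slide itself):

```python
# OC-48 PoS: 2.405 Gbps of payload bandwidth after SONET overhead;
# the worst case is a minimum packet of 40 B IP + 6 B PPP/HDLC framing.
payload_rate = 2.405e9            # bits/s available to packets
min_packet_bits = (40 + 6) * 8    # smallest packet on the wire

pps = payload_rate / min_packet_bits          # per-port packet rate, ~6.53 M
aggregate = 16 * pps                          # 16-port system, ~104.5 M
budget_ns = 1e9 / aggregate                   # time per forwarding decision

print(f"{pps / 1e6:.2f} Mpps per port")
print(f"{aggregate / 1e6:.2f} Mpps aggregate")
print(f"{budget_ns:.2f} ns per decision")
```

The decision budget of under 10 ns is what makes the 20-50 ns practical SDRAM access time the limiting factor: a single naive memory lookup already blows the budget.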
What is the solution? • Take advantage of parallelism: • NICs became more intelligent and took over most packet forwarding. • ASICs in the NICs (line cards) handle packet classification and forwarding. • Most packets do not go to the CPU card (control card). • Switching interface: • Use a switching element to pass packets between line cards directly and simultaneously.
Modern Switch Based Architecture • Classification and forwarding decisions are done in line cards. • High speed interconnection mechanism (switching) between the line cards. • This provides a fast data path. • Standard CPU (RISC processor) is used for the control plane (slow path). • Hardware and/or software implementation for classification and forwarding in the line card.
Functional Blocks in a Modern Switch • The PHY Interface • Responsible for transmitting and receiving information • Conversion of the bit stream from digital form to analog signal and vice versa. • Switch Fabric • The router has a bus or a backplane • The switch fabric reads a packet from the input port and routes it to the output port. • Packet processing • Fast path (data path): handles all operations executed in real time on packets (e.g. framing/parsing, classification, modification, compression/encryption, queueing) • Slow path (control path): operations executed on the packet flows (e.g. address resolution, route calculation, routing table updates, …) • Host processing • Network management, device configuration, diagnostics • Implemented in software on a CPU
Line cards in Modern Switches • Line card handles packet processing such as: • Classification • Forwarding • Traffic Policing and shaping • Monitoring and Statistics
Data Path Diagram • Line card (ingress and egress): Optics → CDR & Serdes → Framer/Mapper → Network Processor → Traffic Manager → Switch Interface • Switch card: Switching Element and Scheduling Element • The Network Processor and Traffic Manager are the packet processing units. Source: Light Reading Report
Data Path Functions • Network Processor (ingress line card): parse; identify flow; determine egress port; mark QoS parameters; append TM or SF header • Ingress Traffic Manager: police; manage congestion (WRED discard); queue packets in class-based VOQs; segment packets into switch cells (segmentation + header) • Switch Fabric: queue cells in class-based VOQs; flow control the TM per class-based VOQ; schedule class-based VOQs to egress ports (SF arbiter) • Egress Traffic Manager (egress line card): reassemble cells into packets; class-based queueing of outgoing packets; shape outgoing traffic; schedule egress traffic (egress scheduler & shaper)
Switch Card to Line Card Connection • This connection has to pass through the backplane. • Serdes (serializer-deserializer) links are used for this connection. • Each serdes signal runs over two wires and two pins (differential signaling). • The speed is usually around 3.125 Gbps. • They run some sort of coding (8b/10b encoding), • so the actual data rate is around 2.5 Gbps. • There are attempts to provide 10 Gbps serdes.
How many serdes do we need? • How fast should the connection between the switch card and the line card be? • The line speed is not enough: • Switch fabric throughput is less than 100% due to contention. • The network processor, traffic manager and switch fabric add their own headers. • There is also cell tax (the overhead of segmenting packets into fixed-size cells).
Speedup • Effective speedup = R_SF / R_TM • In commercial systems, speedup usually refers to R_SF / R_L. • A higher speedup factor: • increases system design complexity, • increases power consumption, • creates signal integrity issues. • The required speedup factor is around 2. • (Diagram: line card elements at rate R_L; traffic manager header and fragmentation (cell tax) bring the rate to R_TM; the switch fabric header brings it to R_SF across the switch interface to the switching and scheduling elements on the switch card.)
Redundancy • We have spare switch cards and control cards in the system. • The redundancy models: • Passive redundancy (N:1): one inactive switch card in the system that starts to work after a failure. • Passive redundancy (1:1, N:N): for each active switch card, we have one inactive card. • Load-sharing redundancy (N-1): all cards are active and, when a failure happens, performance degrades gracefully. • Active redundancy (1+1): two sets of fabrics carrying the same traffic. Source: www.idt.com/content/switchblock.jpg
Example • In a 16 port 10 Gbps switch with 2X speedup and N:N redundancy, how many 2.5 Gbps serdes do we need? • We need 20 Gbps active and 20 Gbps redundant data rate for each line card. • This means 16 serdes for each line card. • For 16 line cards we need 256 serdes in this system.
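The serdes budget works out as follows (all figures are the slide's own):

```python
line_rate = 10e9       # bits/s per line card
speedup = 2            # switch-fabric overspeed factor
serdes_rate = 2.5e9    # usable rate per serdes after 8b/10b coding
n_cards = 16

active = line_rate * speedup / serdes_rate   # 8 active serdes per card
per_card = 2 * active                        # doubled for N:N redundancy
total = n_cards * per_card

print(int(per_card))   # 16 serdes per line card
print(int(total))      # 256 serdes in the system
```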
Example • What is the effective speedup of this system for 40 byte IP packets if the traffic manager header size is 12 bytes, the switch fabric header size is 8 bytes and the payload size of a cell is 52 bytes? • Solution: in the earlier OC-48 example we showed that there can be 6.53 MPPS (40 byte packets) on an OC-48 line. • Similarly, on an OC-192 there can be up to 9.622 Gbps / (8 x 46) = 26.15 MPPS. • Each packet is encapsulated in one cell, since 40 + 6 < 52. • The maximum number of cells a line card can generate is (2.5 x 8 Gbps) / ((52 + 8 + 12) x 8) = 34.722 M cells/s. • Effective speedup = 34.722 / 26.15 = 1.33
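The same calculation as a script, using only the values stated in the example:

```python
# Arrival side: OC-192 payload rate, minimum packet of 40 B IP + 6 B PPP/HDLC.
oc192_payload = 9.622e9
packet_bits = (40 + 6) * 8
pkt_rate = oc192_payload / packet_bits        # ~26.15 Mpps arriving

# Fabric side: each packet fits in one cell (40 + 6 < 52), and every cell
# carries an 8 B switch fabric header and a 12 B traffic manager header.
cell_bits = (52 + 8 + 12) * 8
fabric_rate = 2.5e9 * 8                       # 8 active serdes toward the fabric
cell_rate = fabric_rate / cell_bits           # ~34.72 M cells/s

print(round(cell_rate / pkt_rate, 2))         # effective speedup ~1.33
```

Note how the headers and the partially filled 52-byte cell eat into the nominal 2X overspeed, leaving only 1.33X effective.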
Traces Per Serdes • Typical LVDS speed is 1.25 Gbps. • For 2.5 Gbps we need 2 channels. • LVDS is differential, i.e. 2 traces per channel. • LVDS is unidirectional, i.e. x2 for full duplex. • A full duplex 2.5 Gbps link using LVDS therefore requires 8 traces. • In the previous example we would have 256 x 8 = 2048 traces on the backplane.
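The multiplication chain above, spelled out:

```python
channels = 2.5e9 / 1.25e9       # 2 LVDS channels per 2.5 Gbps serdes
traces_per_serdes = channels * 2 * 2   # x2 differential pair, x2 full duplex
n_serdes = 256                   # from the redundancy example

print(int(traces_per_serdes))            # 8 traces per serdes
print(int(n_serdes * traces_per_serdes)) # 2048 backplane traces
```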
The line card chassis: 8 service cards and 8 physical layer interface module cards
16-slot Single-Shelf System • Physical layer interface modules, switching cards, routing processors (control plane). • The distributed route processor (DRP) is an optional component that provides enhanced routing capabilities. • The DRP contains two symmetric multiprocessors (SMPs), each of which performs routing functions. • Processor-intensive tasks (such as BGP speakers and IS-IS) can be offloaded from the route processors (RPs) to the DRPs.
Multishelf Systems 2 to 72 line card shelves 1 to 8 fabric card shelves
How to handle packet processing? • Off-the-shelf CPU • This would usually be a RISC processor. • In low end systems it could be a CISC processor. • ASIC • Specialized high performance ASIC to handle packet processing. • Ideal approach for companies such as IBM and Intel, since they are manufacturers of integrated circuits.
Off-the-shelf CPU Systems • Packet processing is implemented in software running on the CPU. • Modifications, upgrades and debugging are accomplished by simple software updates and downloads. • The update time is much shorter, which is good for both user and developer. • Not very efficient: many clock cycles are spent on tasks not related to packet processing. • The fastest off-the-shelf CPU can handle about 1 gigabit per second. • The trend is toward deeper packet processing (more on this later).
Memory Bottleneck • The pipelined architecture of CPUs enables them to perform billions of instructions per second. • However, to sustain the pipeline they must fetch data from memory and store it back continuously. • This can be done with a very sophisticated multi-level hierarchy of different memory technologies and interleaved memory banks. • This comes at prohibitive cost, design complexity and power consumption. • Hence the typical processor pipeline often ends up empty, which reduces system throughput. • Network traffic statistics are completely different from local traffic on a computer bus: they do not have the same spatial and temporal locality properties. Hence, typical processor cache systems are not effective.
Sub-optimal Instruction Set • The instruction set we need for packet processing requires specific bit level operations. • These instructions must run at wire speed. • They are not available as standard instructions on an off-the-shelf CPU. • Hence, we have to assemble multiple standard instructions to perform the intended functionality.
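A small illustration of the point (a generic example, not from the slides): even reading two IPv4 fields packed into one byte takes a shift plus a mask, and wider header fields take several such steps. A network processor typically exposes these as single bit-field extraction instructions:

```python
# First byte of an IPv4 header packs two fields into two nibbles.
first_byte = 0x45

version = first_byte >> 4        # shift: upper nibble = IP version
ihl_words = first_byte & 0x0F    # mask: lower nibble = header length in words
header_bytes = ihl_words * 4     # multiply: convert 32-bit words to bytes

print(version, header_bytes)     # 4 20
```

Three general-purpose instructions for one logical "extract field" operation is the kind of overhead that adds up at tens of millions of packets per second.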
Packet processing with ASIC • An ASIC typically delivers higher performance. • An ASIC is not programmable: • adding new functionality → new design, • adding a new protocol → new design, • a new design is costly for both the vendor and the user. • ASIC design is very time consuming: • the design cycle takes 12 to 18 months; • if we need a modification we may need to recode the whole design. • Many start-up failures are due to this time delay.
ASIC development is costly • Expensive and time-consuming to change. • Testing an ASIC requires designing a whole system around it. • Expensive development tools (design and verification). • Requires ASIC designers (much more expensive). • The tape-out of a chip costs around a million dollars.
So is there a middle ground solution? • Can we have a technology that: • has the flexibility of programmable processors, • has the high speed of ASICs? • The solution is called the Network Processor! • Network processors are programmable like a CPU, but their performance is close to an ASIC.
Network Processor value proposition • Shorter time to market: • instead of 18 months, it takes about 6 months to complete the development cycle of the packet processing part. • Longer time in market: • new features can be embedded into a deployed network processor based product; • increased time in market reduces the cost of product ownership over the life of the product. • Just-in-time delivery of new features: • we can modify the design and add new features in the field without penalizing the customer. • Greater focus on other issues of business management: • most functions are already coded in a standard way; • developers can focus on differentiating features.
Packet Processing Stages • Remove Link Layer Headers and Decryption: • Ethernet • PPP (Point-to-Point Protocol) frame • PPP over ATM • PPP over Ethernet over ATM • Identify Ingress Subscriber: • Extract information from the link layer protocol header about the owner of the packet. • Filtering: • Permit or deny specific traffic flows, based on various attributes of the IP and higher layer headers. • Traffic Classification: • Allow different traffic management, QoS, security and routing policies to be applied to different types of flows. • Traffic Metering, Marking & Policing: • Control the Peak and Committed Information Rate. • Determine the PHB in the DiffServ model (changing the priority).
Packet Processing Stages • Custom Routing Policies: • Direct some traffic through specific paths (internet, VPN, specific destination). • A Virtual Private Routed Network allows users to network in privacy over their own routed network using their own private address space. • Send suspicious traffic to explicit locations for special processing. • NAT & NAPT (Network Address [Port] Translation): • Address translation at the source if the user is using a private address space. • Static one-to-one with NAT and dynamic many-to-one with NAPT. • Route Table Look-up: • Best matching prefix look-up on the destination IP address. • Enforcing the PHB / Per-Flow Behavior (Link Sharing): • Priority, WRR, WFQ scheduling, WRED (weighted random early detection). • Egress Side Processing (QoS, filtering, encryption, NAT, egress subscriber identification, traffic classification, link sharing) • Statistics Collection
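The route table look-up stage above is a best-matching-prefix search. A minimal sketch, using Python's standard `ipaddress` module and a linear scan over prefixes sorted longest-first (real routers use tries or TCAMs; the table entries here are made-up examples):

```python
import ipaddress

# Hypothetical forwarding table: (prefix, output port).
table = [
    (ipaddress.ip_network("10.0.0.0/8"), "port1"),
    (ipaddress.ip_network("10.1.0.0/16"), "port2"),
    (ipaddress.ip_network("0.0.0.0/0"), "default"),
]
# Longest prefix first, so the first hit is the best match.
table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)

def lookup(dst: str) -> str:
    """Return the output port for the most specific matching prefix."""
    addr = ipaddress.ip_address(dst)
    for net, port in table:
        if addr in net:
            return port
    raise KeyError(dst)

print(lookup("10.1.2.3"))   # port2 (the /16 beats the /8)
print(lookup("192.0.2.1"))  # default
```

The linear scan makes the "longest match wins" rule explicit, but its cost grows with table size, which is exactly why the 260,000-entry tables mentioned earlier force hardware-assisted lookup structures.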
Deep Packet Processing • In deep packet processing we need to look at the contents of the packet, not just the header. • Why do we need deep packet processing? • Deep packet inspection for firewalls and intrusion detection systems. • Shaping or discarding P2P traffic. • Server load balancing: distribution of traffic among servers based on the web destination. • Network monitoring and analysis.
Packet Processing Implementation Issues • We need multiple table look-ups for each packet. • Access to the whole packet, not just the IP header, is necessary. • There can be tens of thousands of simultaneously active subscribers comprising millions of application flows. • On a fully loaded Gigabit Ethernet connection about 1.5 million packets per second must be processed. • Modern general purpose processors are optimized for numeric computation rather than packet processing. • Memory read and write speeds become bottlenecks. • Caching and high-speed memory burst capability do not help, since packet processing requires: • large tables, • short entries, • random access queries.
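The ~1.5 Mpps figure follows from minimum-size Ethernet frames: 64 bytes of frame plus 8 bytes of preamble and 12 bytes of inter-frame gap occupy 84 byte-times on the wire.

```python
# Worst-case packet rate on a fully loaded Gigabit Ethernet link.
wire_bytes = 64 + 8 + 12      # min frame + preamble + inter-frame gap
pps = 1e9 / (wire_bytes * 8)  # ~1.488 million packets per second
print(f"{pps / 1e6:.3f} Mpps")
```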
How do network processors do this? • Specialized circuitry and micro-engines perform all generic packet processing functions. • They also usually embed a major programmable module, usually a tailor-made RISC CPU (and sometimes more than one), running: • a real time operating system, • handshake communication with other parts of the system.
Network Processor Categories • Platform network processor objectives: • handle most packet processing functions, • minimize the number of components and the hardware cost, • optimize the trade-off between performance and flexibility, • accelerate the software development cycle. • Peripheral network processors: • designed to optimize a specific function, • e.g. compression chips, IP security.
The other side of the argument • Every single task can be done at wire speed. • But what about multiple tasks at the same time? • What is a realistic scenario to consider? • The challenge of benchmarking. • Programming complexity.