Introduction

INF5061:Multimedia data communication using network processors Introduction 2/9 - 2005

Overview • Course topic and scope • Background • software-based network systems • challenges and new requirements • evolution of network processors • (Very) short overview of some example network processors

INF5061:The Course

Lecturers • Carsten Griwodzemail: griff @ ifi • Pål Halvorsenemail: paalh @ ifi

About INF5061: Topic & Scope • Content: The course gives … • … an overview of network processor cards (architectures and use) • … an introduction of how to program Intel IXP network processors • … some ideas of how to use network processors

About INF5061: Topic & Scope • Lab-assignments:An important part of the course are lab-assignments where the students should make a program for the Intel IXP2400 network processor • wwpingbump – download and run • protocol statistics – extend the wwpingbump to give processor, interface and protocol statistics • packet bridge with ARP support – forward packet to correct interface (of 3 available) • transparent load balancer – balance load and forward packets to the right machine in a cluster of two with same IP address • HTTP protocol translator – add support in the transparent load balancer for HTTP streaming having an RTSP/RTP server

About INF5061: Exam (10sp) • Prerequisite – mandatory assignments: • lab assignment 2: protocol statistics • presentation of a relevant paper • Graded assignments: • lab assignment 4: transparent load balancer • deliver code • short demo/explanation of code (to lecturers only) • lab assignment 5: HTTP protocol translator • deliver code and a short report • present and demonstrate to the class at the end of the course • Final exam: oral exam (???/12-2005) • selected chapters from the Comer book and IXP documentation • lecture slides (including slides from presented papers) • content of lab assignments

About INF5060: “Exam” (5sp) • Mandatory assignment: • lab assignment 5: HTTP protocol translator • deliver code and a short report • present and demonstrate to the class at the end of the course • approved assignment gives a “passed” course (INF5060)

Available Resources • Book: Douglas E. Comer: “Network Systems Design using Network Processors – Intel IXP2xxx Version”, Pearson Prentice Hall, 2004 • Other resources will be placed at • http://www.ifi.uio.no/~paalh/INF5061 • Login: inf5061 • Password: ixp • Manuals for IXP2400: …/~paalh/INF5061/IXP2400 • Code: …/~paalh/INF5061/code

Disclaimer • In the field of network processors, I am a tyro • Definition: Tyro \Ty’ro\, n.; pl. Tyros. A beginner in learning; one who is in the rudiments of any branch of study; a person imperfectly acquainted with a subject; a novice • Then, by definition, in the field of network processors, we are all tyros • In our defense, when it comes to network processors, everyone is a tyro

Background and Motivation

Software-Based Network System • Uses conventional, shared hardware (e.g., a PC) • Software • runs the entire system • allocates memory • controls I/O devices • performs all protocol processing • First generation network systems:

user space kernel space Review of General Data Path on Conventional Computer Hardware Architectures sending: receiving: forwarding: application application application communication system communication system communication system transport (TCP/UDP) network(IP) link

Review of Conventional Computer Hardware Architectures Intel D850MD Motherboard - Intel Hub Architecture (850 Chipset) : RDRAM connectors CPU socket RDRAM interface system bus hub interface PCI bus Memory Controller Hub I/O Controller Hub PCI connectors

system bus (64-bit, 400/533 MHz) I/O controller hub memory controller hub RDRAM RDRAM RAM interface (two 64-bit, 200 MHz) RDRAM RDRAM hub interface (four 8-bit, 66 MHz) PCI slots PCI slots PCI bus (32-bit, 33 MHz) PCI slots Forwarding Example for an Intermediate Node: Intel Hub Architecture application user space kernel space Note:- one single average MPEG-II DVD stream require ~330-660 packets per second of 1500 Bytes (4-8 Mbps) - then use smaller packets, add concurrent clients, other applications, … communication system Pentium 4 Processor registers cache(s) communication system application network card

Main Packet Processing Costs • Copying: used when moving a packet from one memory location to another • expensive (proportional to packet size) • should be avoided whenever possible (use pointers) • Checksuming: used to detect errors • expensive (proportional to packet size) • transport layer: payload + header • network layer: header • Fragmentation/reassembly: needed when packet is larger than smallest MTU • generate headers + header checksum • receiving many small data fragments

Question: • Which is growing faster? • network bandwidth • processing power • Note: if network bandwidth is growing faster • CPU may be the bottleneck • need special-purpose hardware • conventional hardware will become irrelevant • Note: if processing power is growing faster • no problems with processing • network/busses will be bottlenecks

Growth Of Technologies Mbps year

Packet Rates and Software Processing • Packet rates (packets per second): • Packet processing (MIPS, assuming 5K instructions per packet): • the Comer book uses 10K instructions as an upper bound per packet • it varies according to which protocols are used, implementation, data size, etc. • more if moved through a fire wall • engineering rule: 1GHz general purpose CPU = 1Gbps network data rate • Note; this is only processing – time must be added to handle interrupts and move data into memory • Thus, software running on a general-purpose processor is insufficient to handle high-speed networks because the aggregate packet rate exceeds the capabilities of the CPU

The Network System Challenges • Data rates in general keep increasing • Network rate > CPU rate > memory, busses and I/O interfaces • Protocols and applications keep evolving • System design, implementation and testing is time consuming and expensive • Systems often contain errors • Special-purpose hardware (ASIC) designed for one type of system can usually not be reused • Host machine must inspect all incoming packets • … • Challenge: find ways to improve the design and manufacture of complex networking systems

Statement of Hope • If there is hope, it lies in … • 1990: … faster CPUs • 1995: … the application specific integrated circuit (ASIC) designers • 2002: … the programmers! • Programmability • we need a programmable device with more capability than a conventional CPU • key to low-cost hardware for next generation network systems • compared to ASIC designs, it is more flexible, easier and faster to upgrade, and thus, less expensive

General idea: To optimize computation, move operations that account for the most CPU time from software into hardware Onboard… address recognition and filtering onboard buffering DMA buffer and operation chaining Add hardware to NIC off-the-shelf chips for layer 2 ASICs for layer 3 Allows each NIC to operate independently effectively a multiprocessor total processing power increased dramatically First Generation

Designed for greater scale Decentralized architecture additional computational power on each NIC NIC implements classification and forwarding High-speed internal interconnection mechanism interconnects NICs provides fast data path Multiple network interfaces High-speed hardware interconnects NICs General-purpose processor only handles exceptions Sufficient for medium speed interfaces (100 Mbps) Second Generation (early 1990s)

Almost all packet processing off-loaded from CPU Special-purpose ASICs handle lower layer functions Embedded (RISC) processor handles layer 4 CPU only handles low-demand processing Functionality partitioned further Additional hardware on each NIC Onboard… classification forwarding traffic policing monitoring and statistics … Third Generation (late 1990s)

Third Generation (late 1990s) • Enough, are third generation sufficient?? • Almost!! • But not quite! ;-( • What’s the problem? • high cost • long time to market • difficult to test • expensive and time-consuming to change • even trivial changes require silicon respin • 18-20 month development cycle • little reuse across products and versions • require in-house expertise (ASIC designers) • …

Network Processors: The Idea in a Nutshell • Devise new hardware building blocks, but make them programmable • Include support for protocol processing and I/O • General-purpose processor(s) for control tasks • Special-purpose processor(s) for packet processing and table lookup • Include functional units for tasks such as checksum computation, hashing, … • Integrate as much as possible onto one chip • Call the result a network processor

Review of Conventional Computer Hardware Architectures Intel D850MD Motherboard - Intel Hub Architecture (850 Chipset) : RDRAM connectors CPU socket RDRAM interface system bus hub interface PCI bus Memory Controller Hub I/O Controller Hub PCI connectors

Network Processors: Main Idea Traditional system: - slow - resource demanding - shared with other operations Network processors: - a computer within the computer - special, programmable hardware - offloads host resources

Designing a Network Processor • Depends on • operations network processor will perform • role of network processor in overall system • Goals • generality: sufficient for all protocols, all protocol processing tasks and all possible networks • high speed: scale to high bit rates and high packet rates • Key point: A network processor is not designed to process a specific protocol or part of a protocol. Instead, designers seek a minimal set of instructions that are sufficient to handle an arbitrary protocol processing task at high speed

Where to Place Network Processors • Thus, network processors is somewhere in the middle performance • Goal: increase performance and reduce costs ASIC designs • Increase performance: • known issues: • – must partition packet processing into • separate functions • – to achieve highest speed, must handle • each function with separate hardware • unknown issues: • – which functions to choose • – what hardware building blocks to use • – how to interconnect building blocks network processors software on conventional prosessor cost • Decrease costs: • Economics driving a gold rush • – NPs will dramatically lower production • costs for network systems • – good NP designs worth lots of $$

Explosion of Commercial Products • 1990  2000: network processors transformed from interesting curiosity to mainstream product • used to reduce both overall costs and time to market • 2002: over 30 vendors with a vide range of architectures • e.g., • Multi-Chip Pipeline (Agere) • Augmented RISC Processor (Alchemy) • Embedded Processor Plus Coprocessors (Applied Micro Circuit Corporation) • Pipeline of Homogeneous Processors (Cisco) • Pipeline of Heterogeneous Processors (EZchip) • Configurable Instruction Set Processors (Cognigine) • Extensive And Diverse Processors (IBM) • Flexible RISC Plus Coprocessors (Motorola) • Internet Exchange Processor (Intel) • …

Agere PayloadPlus:A Short Overview

Agere PayloadPlus (APP) • Agere PayloadPlus (APP) • consists of both programmable hardware and software • consists of both data and control planes (i.e., slow and fast plane) • APP defines HW architectures, SW mechanisms, interconnection mechanisms and interfaces, BUT does not specify how to implement them. • Several versions of APP exist differing in the number and types of functional units, degree of parallelism and internal bandwidth (2. generation: 5 models)

Classifier extract packets from ingress classify packet send statistics to state engine reassemble blocks pass packet to forwarder together with classification decision State engine initiate, configure and control classifier and traffic manager receives control from classifier update statistics (e.g., packet count) check packets against profiles(and inform classifier) Forwarder get packet from classifier perform traffic shaping and management fragment packet (if necessary) modify headers (if necessary) APP Conceptual Pipeline

APP550 Chip

APP550 Chip • Memory interfaces: • two types of physical memory • fast cycle RAM (FCRAM) for fast memory accesses • double data rate SRAM (DDR-SRAM) for high throughput • the different memory types are usually used like this:

APP550 Chip • Media interfaces: • several to form fast data paths • two external connections: • cell-oriented (ATM) • packet-oriented (Ethernet)

APP550 Chip • Scheduling interface interfaces: • an external scheduling interface • external logic can use information about queues • PCI bus interfaces: • allows communication with host CPU • mainly to control the whole operation

APP550 Chip • Coprocessor interfaces: • APP550 should be able to process a packet • BUT, to accommodate special cases, e.g., adding additional headers a co-processor interface is provided

APP550 Chip

APP550 Chip • Stream Editor (SED) • two parallel engines • modify outgoing packets (e.g., checksum, TTL, …) • configurable, but not programmable • Packet (protocol data unit) assembler • collect all blocks of a frame • not programmable • Pattern Processing Engine • patterns specified by programmer • programmable using a special high-level language • only pattern matching instructions • parallelism by hardware using multiple copies and several sets of variables • access to different memories • Reorder Buffer Manager • transfers data between classifier and traffic manager • ensure packet order due to parallelism and variable processing time in the pattern processing • Traffic Manager • schedule packets and shape traffic flow • programmable via scripts • sends packets to output interface • according to implemented policy: • discard packets • choose queue • State Engine • gather information (statistics) for scheduling • verify flow within bounds • provide an interface to the host • configure and control other functional units

APP550 Full Duplex • Clock rate for APP550 is 233 MHz • One chip cannot manage packet at wire speed in both directions – often two in parallel (one each direction) • all features needed in both direction? • classification only one direction  checks outgoing packets and enqueues using special queue

Intel IXP1200 / 2400:A Short Overview

IXA is a broad term to describe the Intel network architecture (HW & SW, control- & data plane) IXP: Internet Exchange Processor processor that implements IXA IXP1200 is the first IXP chip (4 versions) IXP2xxx has now replaced the first version IXP1200 basic features 1 embedded 232 MHz StrongARM 6 packet 232 MHz µengines onboard memory 4 x 100 Mbps Ethernet ports multiple, independent busses low-speed serial interface interfaces for external memory and I/O busses … IXP2400 basic features 1 embedded 600 MHz XScale 8 packet 600 MHz µengines 3 x 1 Gbps Ethernet ports … IXA: Internet Exchange Architecture

IXP1200 Architecture PCI bus: - allow IXP to connect to I/O devices - enable use of host CPU - rate 2.2 Gbps SRAM bus: - shared bus (several external units) - usually control rather than data - rate 3.71 Gbps Serial line: - connects to the RISC - intended for control and management - rate 38 Kbps • SDRAM bus: - provide access to external SDRAM memory used to store packets - can also pass addresses, control/store operations, etc. - rate 7.42 Gbps • IX (Intel eXchange) bus: • enable higher rates compared to PCI • form fast path (IXP and high-speed interfaces) • - interface to other IXP cards • - 4.4 Gbps

IXP1200 Architecture RISC processor: - StrongARM running Linux - control, higher layer protocols and exceptions - 232 MHz Access units: - coordinate access to external units Scratchpad: - on-chip memory - used for IPC and synchronization Microengines: - low-level devices with limited set of instructions - transfers between memory devices - packet processing - 232 MHz

IXP1200 Processor Hierarchy General-Purpose Processor: - used for control and management - running general applications RISC processor: - chip configuration interface (serial line) - control, higher layer protocols and exceptions I/O processors (microengines): - transfers between memory devices - packet processing • Coprocessors: • - real-time clock and timers • IX bus controller • hashing unit • ... Physical interface processors: - implement layer 1 & 2 processing

IXP1200 Memory Hierarchy

IXP1200 Memory Hierarchy • Different memory types… • …are organized into different addressable data units (words or longwords) • …have different access times • …connected to different busses • Therefore, to achieve optimal performance, programmers must understand the organization and allocate items from the appropriate type

Linux 2.4 vs. IXP 1200 Intel P4 host machine IXP Performance Improvement: Forwarding • The forwarding latency improvement itself may only be relevant to very time-sensitive interactive applications • Offloading at least equally important

Introduction

Introduction

Presentation Transcript

Introduction to introduction to introduction to … Optimization

INTRODUCTION/ INTRODUCTION

Introduction

INTRODUCTION

Introduction

Introduction