Commercial Network Processor Architectures: Agere PayloadPlus. Vahid Tabatabaee, Fall 2007
References • Network Processors: Architectures, Protocols, and Platforms, Panos C. Lekkas, McGraw-Hill • Agere PayloadPlus Family White Papers • "Payload+: Fast Pattern Matching & Routing for OC-48," David Kramer, Roger Bailey, David Brown, Sean McGee, Jim Greene, Robert Corley, David Sonnier (Agere Systems), in Hot Chips: A Symposium on High Performance Chips, Aug. 19-21, 2001 • Agere Product Brief documents for FPP, RSP, ASI, and FPL • Agere white paper: "The Case for a Classification Language," Feb. 2003
General Information • Agere PayloadPlus is a comprehensive network processing solution for OC-48. • It was later expanded to support OC-192 through the NP10/TM10 (renamed APP750NP and APP750TM). • The product line has since been discontinued. • Originally it was a 3-chip solution; it was later integrated into a single-chip solution. • We review the original 3-chip solution and the single-chip APP550, whose documentation is available on the Agere website.
The Big Picture The network processor family has a pipelined architecture and, in the original 3-chip solution, includes: • Fast Pattern Processor (FPP) • Takes data from the PHY chip • Protocol recognition • Classification based on layers 2 to 7 • Table lookup with millions of entries of variable length • Reassembly • Routing Switch Processor (RSP) • Queueing • Packet modification • Traffic shaping • QoS processing • Segmentation • Agere System Interface (ASI) • Management • Tracks state information • Support for RMON (Remote Monitoring)
The 3 Chip Solution • POS-PHY: Packet over SONET PHY (physical layer) interface • UTOPIA: Universal Test and Operations PHY Interface for ATM • FBI: Functional Bus Interface Source: http://nps.agere.com/support/non-nda/docs/FPP_Product_Brief.pdf
Main Responsibilities and Interfaces • The FPP receives data from the PHY over a standard interface that can be POS-PHY Level 3 (POS-PL3) or a UTOPIA 2 or 3 interface. • The FPP classifies traffic based on information contained in layers 2 to 7. • The FPP sends packets to the RSP over POS-PL3. • The RSP is responsible for • Queueing, packet modification, shaping, QoS tagging, and segmentation. • The ASI chip is responsible for • Handling exceptions, maintaining state information, interfacing to the host processor, and configuring the FPP and RSP over the CBI interface. • The Management-Path Interface (MPI) enables the FPP to receive management frames from the local host. • The Functional Bus Interface (FBI) connects the FPP to the ASI so that function calls can be processed externally.
Memory • 64-bit standard PC-133 synchronous dynamic random access memory (SDRAM) • 133 MHz pipelined zero bus turnaround (ZBT) synchronous static random access memory (SSRAM) • PayloadPlus can use standard off-the-shelf DRAM for table lookups and does not need expensive and power-hungry Content Addressable Memory (CAM). • Typical power limit for a line card is 150 W.
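To make the no-CAM point concrete, here is a minimal C sketch of a classification table held entirely in commodity (SD)RAM as a hash table with chaining; bounding the chain length is what keeps lookups close to deterministic. The structure, field names, and hash are illustrative assumptions, not Agere's actual table format.

```c
/* Minimal sketch: a classification table held in ordinary (SD)RAM using a
 * hash table with chaining, instead of a CAM.  Names and layout are
 * illustrative only; they are not Agere's actual table format. */
#include <stdint.h>
#include <string.h>

#define TABLE_BUCKETS (1u << 20)          /* ~1M buckets in DRAM */

struct table_entry {
    uint8_t  key[16];                     /* e.g. a flow or VPI/VCI key   */
    uint8_t  key_len;                     /* keys may have variable length */
    uint32_t result;                      /* classification result / tag  */
    struct table_entry *next;             /* chain for hash collisions    */
};

static struct table_entry *buckets[TABLE_BUCKETS];

/* Simple FNV-1a hash over a variable-length key. */
static uint32_t hash_key(const uint8_t *key, uint8_t len)
{
    uint32_t h = 2166136261u;
    for (uint8_t i = 0; i < len; i++) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h & (TABLE_BUCKETS - 1);
}

/* A lookup touches one bucket plus a short chain: a bounded number of DRAM
 * reads, which is how deterministic lookup time can be approached without
 * a CAM (real devices bound chain length by table design). */
int table_lookup(const uint8_t *key, uint8_t len, uint32_t *result)
{
    for (struct table_entry *e = buckets[hash_key(key, len)]; e; e = e->next) {
        if (e->key_len == len && memcmp(e->key, key, len) == 0) {
            *result = e->result;
            return 1;
        }
    }
    return 0;   /* no match */
}
```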
FPP Features • Programmable classification from layer 2 to 7 • Pipelined, multithreaded processing of PDUs • High-level Functional Programming Language (FPL) that implicitly takes care of multiple threads • ATM reassembly at OC-48 rates (eliminates external SAR) • Table lookup with millions of entries • Eliminates need for external CAMs • Deterministic performance regardless of table size • Configurable UTOPIA/POS interfaces
FPP Protocol Data Unit (PDU) • The FPP is a pipelined, multithreaded processor that can simultaneously analyze and classify up to 64 protocol data units (PDUs). • Each incoming PDU is assigned its own processing thread, called a context. • Each PDU consists of one or more 64-byte blocks. • The context is a processing path that keeps track of: • All blocks of the PDU • Input port number of the PDU • Data offset for the PDU • Last-block information • Program variables associated with the PDU • Classification information of the PDU
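As a rough illustration of the per-context bookkeeping listed above, the C sketch below collects the same fields into a structure; all names and sizes are hypothetical and do not reflect the FPP's real internal layout.

```c
/* Illustrative sketch of a per-PDU processing context, mirroring the fields
 * listed above.  Field names and sizes are hypothetical, not the FPP's
 * actual internal layout. */
#include <stdint.h>

#define BLOCK_SIZE   64        /* PDUs are split into 64-byte blocks      */
#define MAX_CONTEXTS 64        /* up to 64 PDUs processed concurrently    */

struct pdu_block {
    uint8_t  data[BLOCK_SIZE];
    uint32_t next_block;       /* link to the next block of the same PDU  */
};

struct pdu_context {
    uint32_t first_block;      /* head of the block chain for this PDU    */
    uint32_t last_block;       /* last-block information                  */
    uint8_t  input_port;       /* input port number of the PDU            */
    uint16_t data_offset;      /* data offset for the PDU                 */
    uint32_t program_vars[4];  /* program variables associated with PDU   */
    uint32_t classification;   /* classification result for the PDU       */
    uint8_t  in_use;           /* is this context (thread) allocated?     */
};

static struct pdu_context contexts[MAX_CONTEXTS];
```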
FPP Block Diagram Source: http://www.hotchips.org/archives/hc13/3_Tue/13agere.pdf Source: http://nps.agere.com/support/non-nda/docs/FPP_Product_Brief.pdf
FPP Functional Description • The input framer frames incoming data into 64-byte blocks. • It writes blocks into the data buffer (SDRAM) and into the block buffers and context memory. • The block buffer stores the data being processed, along with associated context data needed to execute FPP operations on the incoming data. • The output interface sends PDUs and their classification information to the RSP. • The Pattern Processing Engine (PPE) performs pattern matching to determine how incoming PDUs are classified. • The Queue Engine manages FPP replay contexts, provides addresses for the block buffers, and maintains information on blocks, PDUs, and connection queues.
FPP Functional Description (two pass) • The FPP processes bit streams in two passes. • In the first pass, the PDU blocks are read into the queue engine memory: • The data is divided into separate 64-byte blocks • The data offset of each block is determined • Links between the individual blocks that compose a PDU are established • The PDU type is identified • In the second pass (replay phase), as the PDU is replayed from the queue engine: • The PDU is processed as a whole entity • Pattern matching is executed • At the same time, the PDU is transmitted toward the output unit
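The two passes can be pictured with the hedged C sketch below: the first pass cuts the byte stream into linked 64-byte blocks, and the second pass replays the chain as one PDU while streaming it toward the output. The data structures and function names are invented for illustration.

```c
/* Sketch of the two-pass idea: pass 1 segments the input into linked
 * 64-byte blocks; pass 2 replays the linked blocks as one PDU and
 * classifies/forwards it.  All names are illustrative. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 64
#define MAX_BLOCKS 1024

struct block {
    uint8_t data[BLOCK_SIZE];
    size_t  len;                /* bytes used in this block           */
    int     next;               /* index of next block, -1 at the end */
};

static struct block blocks[MAX_BLOCKS];
static int next_free = 0;

/* Pass 1: cut the byte stream into 64-byte blocks and link them. */
int first_pass(const uint8_t *pdu, size_t len)
{
    int head = -1, prev = -1;
    for (size_t off = 0; off < len; off += BLOCK_SIZE) {
        int b = next_free++;
        blocks[b].len  = (len - off < BLOCK_SIZE) ? len - off : BLOCK_SIZE;
        blocks[b].next = -1;
        memcpy(blocks[b].data, pdu + off, blocks[b].len);
        if (prev >= 0) blocks[prev].next = b; else head = b;
        prev = b;
    }
    return head;                /* handle used to replay the PDU later */
}

/* Pass 2 (replay): walk the block chain, treating it as one PDU, and run
 * pattern matching while the data is streamed toward the output side. */
void second_pass(int head, void (*classify_and_send)(const uint8_t *, size_t))
{
    for (int b = head; b >= 0; b = blocks[b].next)
        classify_and_send(blocks[b].data, blocks[b].len);
}
```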
FPP Top Level Flow Source: http://nps.agere.com/support/non-nda/docs/FPL_Product_Brief.pdf
RSP (Traffic Manager) Features • 64K queues • Programmable shaping (such as VBR, UBR, CBR) • Programmable discard policies (RED, WRED, EPD) • Programmable QoS (CBR, VBR, UBR) • Programmable CoS (Fixed Priority, Round Robin, WRR, WFQ, GFR) • Programmable packet modification • Support for multicast • Generates required checksums/CRC
RSP overview Source: http://www.hotchips.org/archives/hc13/3_Tue/13agere.pdf http://nps.agere.com/support/non-nda/docs/RSP_Product_Brief.pdf
RSP Functional Description • The RSP acts on the classification and analysis results produced by the FPP for each incoming PDU. • It supports up to 64 logical input ports. • For each PDU, a command from the FPP instructs the RSP how to handle it. • The PDU is added to a queue and stored in the PDU SDRAM. • The RSP supports up to 64K programmable queues. • Processed data is output on a configurable 32-bit interface. • There is also an 8-bit POS-PHY Level 3 management interface. • The RSP uses custom logic and three Very Long Instruction Word (VLIW) compute engines to process PDUs.
VLIW Compute Engines • The compute engines operate in a pipelined fashion. • Each compute engine is dedicated to a processing function: • The Traffic Management Engine enforces discard policies and keeps queue statistics. • The Traffic Shaper Engine ensures QoS and CoS for each queue. • The Stream Editor Engine performs the necessary PDU modifications. • In each queue definition, the RSP includes the destination, scheduling information, and pointers to programs for each of the three VLIW compute engines (see the sketch below). • Therefore, the RSP can run multiple protocols at the same time. • The external CPU can also add queue definitions, for example to set up ATM virtual circuits.
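The per-queue program pointers can be sketched as follows; the record layout, field names, and installation routine are assumptions made for illustration and are not the RSP's actual queue-definition format.

```c
/* Illustrative sketch of a queue definition carrying per-queue program
 * pointers for the three compute engines, so that different queues (and
 * hence different protocols) can be handled concurrently.  This mirrors
 * the description above; it is not the RSP's real record format. */
#include <stdint.h>

typedef void (*engine_program)(void *pdu, void *queue_state);

struct queue_definition {
    uint32_t       destination;        /* output port / channel          */
    uint32_t       sched_weight;       /* scheduling information         */
    uint32_t       sched_rate_cps;     /* e.g. cells per second for VBR  */
    engine_program traffic_mgmt;       /* discard policy + statistics    */
    engine_program traffic_shaper;     /* QoS/CoS enforcement            */
    engine_program stream_editor;      /* PDU modification               */
};

/* The host CPU (via the ASI) can install a new definition at run time,
 * e.g. when an ATM virtual circuit is set up. */
void install_queue(struct queue_definition *table, uint32_t qid,
                   const struct queue_definition *def)
{
    table[qid] = *def;
}
```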
RSP Data Flow The RSP has 3 major processing stages: • Prepares and queues the PDU for scheduling • Assembles the blocks into a PDU in SDRAM • Determines the destination queue • Determines whether the PDU should be queued; if so, it is added to the appropriate queue for scheduling • Selects the next PDU block to be scheduled • Selects the physical port • Selects the logical port • Selects the scheduler • Selects the QoS queue • Selects the CoS queue • Modifies and transmits the PDU on the appropriate output ports • Adjusts the QoS transmit intervals and CoS priority • Performs PDU modifications • Performs the AAL5 CRC if necessary http://nps.agere.com/support/non-nda/docs/RSP_Product_Brief.pdf
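For the AAL5 CRC step mentioned above, a minimal bitwise CRC-32 looks like the sketch below, assuming the common AAL5 convention (generator 0x04C11DB7, MSB-first processing, all-ones preset, final complement); real devices use table-driven or hardware CRC logic instead.

```c
/* Minimal bitwise CRC-32 sketch for the AAL5 trailer check mentioned above.
 * Assumes the usual AAL5 convention: standard CRC-32 generator 0x04C11DB7,
 * MSB-first bit order, all-ones preset, and a final complement. */
#include <stdint.h>
#include <stddef.h>

uint32_t aal5_crc32(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;            /* all-ones preset */
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)buf[i] << 24;     /* feed the next byte, MSB first */
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
    return ~crc;                           /* ones-complement of remainder */
}
```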
Hierarchical Scheduling (Internal Scheduling Logic) • Channels: The 32-bit output data interface can be configured as 1 to 4 POS-PHY or UTOPIA channels; there is also an 8-bit management output. • Physical ports: Physical output ports are assigned to channels. There are up to 32 physical ports, since there are 32 back-pressure signals. • Logical ports: The RSP supports up to 256 logical output ports. • Schedulers: A set of schedulers is defined for each logical port. The RSP supports CBR, VBR, and UBR schedulers. • QoS queues: Each QoS queue is assigned to a single scheduler. • CoS queues: Up to 16 CoS queues feed a single QoS queue. http://nps.agere.com/support/non-nda/docs/RSP_Product_Brief.pdf
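A highly simplified sketch of walking this hierarchy is shown below; it uses plain round-robin at every level and invented structure names, whereas the real RSP mixes CBR/VBR/UBR schedulers and programmable CoS policies.

```c
/* Simplified sketch of the selection hierarchy described above:
 * physical port -> logical port -> scheduler -> QoS queue -> CoS queue.
 * Round-robin is used at every level purely for illustration. */
#include <stddef.h>

struct cos_queue    { int has_pdu; /* ... PDU storage ... */ };
struct qos_queue    { struct cos_queue *cos; size_t n_cos; size_t rr; };
struct scheduler    { struct qos_queue *qos; size_t n_qos; size_t rr; };
struct logical_port { struct scheduler *sch; size_t n_sch; size_t rr; };
struct phys_port    { struct logical_port *lp; size_t n_lp; size_t rr;
                      int backpressure; /* one back-pressure signal per port */ };

/* Return a CoS queue with data, or NULL, descending the hierarchy with a
 * simple round-robin pointer at each level. */
struct cos_queue *select_next(struct phys_port *pp, size_t n_pp)
{
    for (size_t i = 0; i < n_pp; i++) {
        struct phys_port *p = &pp[i];
        if (p->backpressure) continue;                 /* port not ready */
        struct logical_port *lp = &p->lp[p->rr++ % p->n_lp];
        struct scheduler    *s  = &lp->sch[lp->rr++ % lp->n_sch];
        struct qos_queue    *q  = &s->qos[s->rr++ % s->n_qos];
        struct cos_queue    *c  = &q->cos[q->rr++ % q->n_cos];
        if (c->has_pdu) return c;
    }
    return NULL;
}
```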
ASI • The ASI integrates the FPP and RSP with the host processor. • It enables the designer to: • Centralize initialization and configuration of the NP system and its physical interfaces • Send routing and VPI/VCI updates to the system • Implement various routing and management protocols • Handle any exceptions that occur • The ASI enables high-speed, flow-oriented state maintenance: • Gathering Remote Network Monitoring (RMON) statistics • Time-stamping packets • Checking packet sequence • Policing ATM and frame relay up to OC-48 rates • An 8-bit POS-PHY interface over which the ASI sends packets to the FPP and receives them from the RSP
How Does ASI Work? • It has a PCI interface for communication with the host processor. • A 32-bit high-speed interface (FBI) receives function calls from the FPP. • Two ALUs process FPP external function requests for: • Maintaining state and statistics • Policing (leaky bucket) • Two SSRAM interfaces allow memory access for different tasks without contention. http://nps.agere.com/support/non-nda/docs/ASIProductBrief.pdf
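The flow-oriented state maintenance can be pictured with the small sketch below: a per-flow record holding RMON-style counters, a timestamp, and a sequence check, updated by one external function call. The record layout and routine are illustrative assumptions, not the ASI's actual data format.

```c
/* Hedged sketch of per-flow state that external function calls might
 * maintain (RMON-style counters, timestamps, sequence checking).
 * The record layout and update routine are illustrative only. */
#include <stdint.h>

struct flow_state {
    uint64_t packets;        /* RMON-style packet counter      */
    uint64_t bytes;          /* RMON-style byte counter        */
    uint64_t last_timestamp; /* time of the last packet seen   */
    uint32_t expected_seq;   /* next expected sequence number  */
    uint32_t seq_errors;     /* out-of-sequence packets seen   */
};

/* One "function call" from the classifier: update statistics and check the
 * packet sequence for the given flow. */
void flow_update(struct flow_state *f, uint32_t pkt_len,
                 uint32_t seq, uint64_t now)
{
    f->packets++;
    f->bytes         += pkt_len;
    f->last_timestamp = now;
    if (seq != f->expected_seq)
        f->seq_errors++;
    f->expected_seq = seq + 1;
}
```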
ASI Configuration Capabilities • The ASI enables the host processor to configure up to 8 devices. • The configuration bus is compatible with both Intel and Motorola bus formats. • It is used to: • Initialize and configure the FPP and RSP • Load the program code for the FPP and RSP • Load dynamic updates to the FPP tables and RSP queues • Configure third-party external framers and physical interfaces
Policy and Conformance Checking • The ASI performs conformance checking, or policing, for up to 64K connections at OC-48 rates. • It only does marking, not scheduling or shaping. • Several variations of the GCRA (leaky-bucket) algorithm can be used. • For the dual-leaky-bucket case, the ASI indicates whether cells or frames are compliant and, if not, from which bucket the nonconformance was derived.
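A sketch of the GCRA in its virtual-scheduling form, used here only for marking, is given below; the increment I and limit L stand for the contracted rate and tolerance, and the dual-bucket wrapper reports which bucket rejected the arrival. Names and the dual-bucket pairing are illustrative assumptions.

```c
/* Sketch of the GCRA in its virtual-scheduling form, used purely for
 * marking (no shaping or scheduling).  Times are in arbitrary ticks. */
#include <stdint.h>

struct gcra {
    int64_t tat;    /* theoretical arrival time */
    int64_t inc;    /* increment I (1/rate)     */
    int64_t limit;  /* limit L (tolerance)      */
};

/* Returns 1 if the arrival at time `now` conforms, 0 if it should be marked
 * as nonconforming; TAT is only advanced for conforming arrivals. */
int gcra_conforms(struct gcra *g, int64_t now)
{
    if (now < g->tat - g->limit)
        return 0;                                /* too early: nonconforming */
    g->tat = (now > g->tat ? now : g->tat) + g->inc;
    return 1;
}

/* Dual leaky bucket (e.g. a peak-rate and a sustained-rate bucket): report
 * which bucket, if any, the nonconformance came from. */
int dual_gcra_check(struct gcra *pcr, struct gcra *scr, int64_t now,
                    int *failed_bucket)
{
    /* Test both buckets before updating either, so an arrival rejected by
     * one bucket does not consume credit in the other. */
    int ok_pcr = now >= pcr->tat - pcr->limit;
    int ok_scr = now >= scr->tat - scr->limit;
    if (!ok_pcr) { *failed_bucket = 1; return 0; }
    if (!ok_scr) { *failed_bucket = 2; return 0; }
    pcr->tat = (now > pcr->tat ? now : pcr->tat) + pcr->inc;
    scr->tat = (now > scr->tat ? now : scr->tat) + scr->inc;
    *failed_bucket = 0;
    return 1;
}
```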
FPL • FPL is a functional language for classification. • In a functional language, the programmer tells the computing resources what to do rather than how to do it. • In FPL you describe the protocols and the actions needed to process them. • In C you have to spell out how to process each protocol. • FPL code is therefore much shorter and easier to debug and modify.
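Without inventing FPL syntax, the flavor of "describe the protocol, not the parsing code" can be suggested in C by driving classification from a data table of patterns and actions rather than hand-written control flow; every offset, mask, and action below is a made-up example.

```c
/* Classification driven by a declarative rule table (pattern + action)
 * rather than hand-written parsing code.  The offsets, masks, and actions
 * are made-up examples, not real FPL rules. */
#include <stdint.h>
#include <stddef.h>

enum action { ACT_DROP, ACT_FORWARD, ACT_TO_HOST };

struct rule {
    size_t      offset;   /* byte offset into the PDU       */
    uint8_t     mask;     /* which bits of that byte matter */
    uint8_t     value;    /* expected value after masking   */
    enum action action;   /* what to do on a match          */
};

/* Example rule set, most specific first (first match wins): the protocol
 * description lives in data, so adding a protocol means adding a row, not
 * rewriting control flow. */
static const struct rule rules[] = {
    { 9, 0xFF, 0x01, ACT_TO_HOST },    /* e.g. IPv4 protocol = ICMP  */
    { 0, 0xF0, 0x40, ACT_FORWARD },    /* e.g. IPv4 version nibble   */
    { 0, 0xF0, 0x60, ACT_TO_HOST },    /* e.g. IPv6 version nibble   */
};

enum action classify(const uint8_t *pdu, size_t len)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (rules[i].offset < len &&
            (pdu[rules[i].offset] & rules[i].mask) == rules[i].value)
            return rules[i].action;
    return ACT_DROP;                   /* default action */
}
```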
FPL Main Features • Fast pattern matching and classification of the data stream. • Defining functions for the FPP to execute based on the recognized patterns • Easy to read semantics • Dynamic updating of the code in the FPP • Software development tool set
Two Pass Processing • Recall the two-pass processing in the FPP. • The first pass performs preliminary processing, such as identifying the PDU type. • In the second pass (replay), the FPP can simply transmit the PDU with its conclusions, or process a higher-level protocol. • The queue engine allows you to process PDUs embedded in higher-layer protocols during the replay phase.
Sample FPL Program Flow Source: http://nps.agere.com/support/non-nda/docs/FPL_Product_Brief.pdf
FPL code example Source: http://www.hotchips.org/archives/hc13/3_Tue/13agere.pdf
Dynamic FPL Program Changes • Certain types of FPL statements can be added to or deleted from the program image dynamically. • FPL supports two types of pattern statement structures: • Single-rule patterns have a single pattern to match with one or two functions to perform • These are called flows • These cannot be added or removed dynamically • Multiple-rule pattern statements allow you to define tables • These are used to define IP routing tables • These are called trees • You can add or delete statements from existing trees • You cannot add a new tree dynamically
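To illustrate why tree-style tables lend themselves to dynamic updates, the sketch below implements a plain binary trie for IP prefixes with run-time insert and delete; this is not FPL's internal representation, just a common structure with the same update property.

```c
/* Sketch of a tree-style lookup structure whose entries can be added and
 * deleted at run time, in the spirit of FPL "trees" used for IP routing
 * tables.  This is a plain binary trie keyed on prefix bits, not FPL's
 * actual internal representation. */
#include <stdint.h>
#include <stdlib.h>

struct trie_node {
    struct trie_node *child[2];
    int      has_entry;       /* does a prefix end here?  */
    uint32_t next_hop;        /* associated result        */
};

static struct trie_node *new_node(void) {
    return calloc(1, sizeof(struct trie_node));
}

/* Add (or overwrite) a prefix of `plen` bits: a dynamic table update. */
void trie_insert(struct trie_node *root, uint32_t prefix, int plen,
                 uint32_t next_hop)
{
    struct trie_node *n = root;
    for (int i = 0; i < plen; i++) {
        int bit = (prefix >> (31 - i)) & 1;
        if (!n->child[bit]) n->child[bit] = new_node();
        n = n->child[bit];
    }
    n->has_entry = 1;
    n->next_hop  = next_hop;
}

/* Deleting an entry clears the flag at that node (node reclamation omitted
 * for brevity). */
void trie_delete(struct trie_node *root, uint32_t prefix, int plen)
{
    struct trie_node *n = root;
    for (int i = 0; i < plen && n; i++)
        n = n->child[(prefix >> (31 - i)) & 1];
    if (n) n->has_entry = 0;
}

/* Longest-prefix match for a 32-bit address; returns 1 if a match exists. */
int trie_lookup(const struct trie_node *root, uint32_t addr, uint32_t *next_hop)
{
    int found = 0;
    const struct trie_node *n = root;
    for (int i = 0; n && i <= 32; i++) {
        if (n->has_entry) { *next_hop = n->next_hop; found = 1; }
        if (i == 32) break;
        n = n->child[(addr >> (31 - i)) & 1];
    }
    return found;
}
```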
Performance of the Network Processor • The drop in performance is due to the N+1 problem
Network Processor Performance • Performance evaluation for a mixture of packet sizes • Performance drops when the number of computations per packet increases