CRS-1 overview TAU – Mar 07. Rami Zemach. Cisco’s high end router CRS-1. Future directions. Agenda. CRS-1’s NP Metro (SPP). CRS-1’s Fabric. CRS-1’s Line Card. What drove the CRS?. A sample taxonomy. OC768 Multi chassis Improved BW/Watt & BW/Space New OS (IOS-XR)
CRS-1 overview TAU – Mar 07 Rami Zemach
Agenda • Cisco’s high end router CRS-1 • CRS-1’s Fabric • CRS-1’s Line Card • CRS-1’s NP Metro (SPP) • Future directions
What drove the CRS? • OC-768 • Multi chassis • Improved BW/Watt & BW/Space • New OS (IOS-XR) • Scalable control plane
Multiple router flavours: a sample taxonomy
• Core
  • OC-12 (622 Mbps) and up (to OC-768 ≈ 40 Gbps)
  • Big, fat, fast, expensive
  • E.g. Cisco HFR, Juniper T-640
  • HFR: 1.2 Tbps each, interconnect up to 72 giving 92 Tbps, start at $450k
• Transit/Peering-facing
  • OC-3 and up, good GigE density
  • ACLs, full-on BGP, uRPF, accounting
• Customer-facing
  • FR/ATM/…
  • Feature set as above, plus fancy queues, etc.
• Broadband aggregator
  • High scalability: sessions, ports, reconnections
  • Feature set as above
• Customer-premises (CPE)
  • 100 Mbps
  • NAT, DHCP, firewall, wireless, VoIP, …
  • Low cost, low-end, perhaps just software on a PC
Routers are pushed to the edge
• Over time routers are pushed to the edge as:
  • BW requirements grow
  • # of interfaces scales
• Different routers have different offerings:
  • Interface types (core is mostly Ethernet)
  • Features (sometimes the same feature is implemented differently)
  • User interface
  • Redundancy models
  • Operating system
• Customers look for:
  • Investment protection
  • Stable network topology
  • Feature parity
• Transparent scale
What does scaling mean? • Interfaces (BW, number, variance) • BW • Packet rate • Features (e.g. support link BW in a flexible manner) • More routes • Wider ecosystem • Effective management (e.g. capability to support more BGP peers and more events) • Fast control (e.g. distribute routing information) • Availability • Serviceability • Scaling is both up and down (logical routers)
Low BW, feature rich – centralized
[Figure: shared-bus router. Labels: CPU, route table, off-chip buffer memory, shared bus, line interfaces each with a MAC.]
• Typically <0.5 Gb/s aggregate capacity
High BW – distributed
[Figure: switched-backplane router. Labels: “crossbar” switched backplane, CPU card (CPU, routing table, memory), line cards each with local buffer memory, forwarding table, line interface and MAC.]
• Typically <50 Gb/s aggregate capacity
Distributed architecture challenges (examples) • HW wise • Switching fabric • High BW switching • QoS • Traffic loss • Speedup • Data plane (SW) • High BW / packet rate • Limited resources (CPU, memory) • Control plane (SW) • High event rate • Routing information distribution (e.g. forwarding tables)
CRS-1 System View
[Figure: multi-shelf system. Labels: shelf controllers, system controllers, 100m.]
• Line card shelves: contain route processors, line cards, system controllers
• Fabric shelves: contain fabric cards, system controllers
• NMS (full system view)
• Out-of-band GE control bus to all shelf controllers
CRS-1 System Architecture
[Figure: line cards (1 of 8 … 8 of 8), each an interface module plus a modular service card with Cisco SPPs and 8K queues, connect through the mid-plane to the S1/S2/S3 fabric stages in the fabric chassis; route processors attach alongside the line cards.]
• Forwarding plane
  • Up to 1152 x 40G
  • 40G throughput per LC
• Multistage switch fabric
  • 1296 x 1296 non-blocking buffered fabric
  • Roots of the fabric architecture are in Jon Turner’s early work
• Distributed control plane
  • Control SW distributed across multiple control processors
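The capacity figures on this slide and on the taxonomy slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the deck (1152 line cards, 40G per card, up to 72 line card shelves); the factor of two, which counts ingress plus egress, is an assumption about how the headline "1.2 Tbps" and "92 Tbps" figures were derived, not something the slides state.

```python
# Figures quoted on the slides.
LINE_CARD_GBPS = 40      # "40G throughput per LC"
MAX_LINE_CARDS = 1152    # "Up to 1152x40G"
MAX_LC_SHELVES = 72      # "interconnect up to 72"

# Line card slots per shelf implied by the two maxima.
slots_per_shelf = MAX_LINE_CARDS // MAX_LC_SHELVES

# One-way (ingress only) system capacity in Tbps.
one_way_tbps = MAX_LINE_CARDS * LINE_CARD_GBPS / 1000

# Assumption: headline figures count ingress plus egress.
two_way_tbps = 2 * one_way_tbps
per_shelf_tbps = 2 * slots_per_shelf * LINE_CARD_GBPS / 1000

print(slots_per_shelf, one_way_tbps, two_way_tbps, per_shelf_tbps)
```

Under that assumption the per-shelf and full-system numbers land at roughly 1.28 Tbps and 92 Tbps, which matches the rounded figures on the taxonomy slide.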
Switch Fabric challenges • Scale - many ports • Fast • Distributed arbitration • Minimum disruption with QoS model • Minimum blocking • Balancing • Redundancy
Previous solution: GSR – Cell based XBAR with centralized scheduling • Each LC has variable width links to and from the XBAR, depending on its bandwidth requirement • Central scheduling, iSLIP based • Two request-grant-accept rounds • Each arbitration round lasts one cell time • Per destination LC virtual output queues • Supports • H/L priority • Unicast/multicast
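The request-grant-accept rounds can be illustrated with a textbook iSLIP sketch. This is not the GSR scheduler itself: the function name, the data layout and the omission of priorities and multicast are all simplifications made for illustration. Each output grants the requesting input closest to its round-robin pointer, each input accepts the granting output closest to its own pointer, and pointers advance only for matches made in the first round.

```python
def islip_schedule(requests, grant_ptr, accept_ptr, iterations=2):
    """requests[i] is the set of outputs for which input i has a non-empty VOQ.
    grant_ptr / accept_ptr are per-output / per-input round-robin pointers,
    updated in place. Returns a dict mapping matched input -> output."""
    n = len(requests)
    matched_in = {}      # input -> output
    matched_out = set()  # outputs already taken
    for it in range(iterations):
        # Grant phase: every unmatched output picks one unmatched requester.
        grants = {}      # input -> list of outputs that granted it
        for out in range(n):
            if out in matched_out:
                continue
            requesters = [i for i in range(n)
                          if i not in matched_in and out in requests[i]]
            if requesters:
                winner = min(requesters,
                             key=lambda i: (i - grant_ptr[out]) % n)
                grants.setdefault(winner, []).append(out)
        # Accept phase: every input that received grants picks one output.
        for inp, outs in grants.items():
            out = min(outs, key=lambda o: (o - accept_ptr[inp]) % n)
            matched_in[inp] = out
            matched_out.add(out)
            if it == 0:
                # Pointers move one past the match, first round only.
                grant_ptr[out] = (inp + 1) % n
                accept_ptr[inp] = (out + 1) % n
    return matched_in


grant_ptr = [0, 0, 0]
accept_ptr = [0, 0, 0]
requests = [{0, 1}, {0}, {1}]
match = islip_schedule(requests, grant_ptr, accept_ptr)
print(match)
```

In this example the first round matches input 0 to output 0 and leaves output 1 idle, because its grant went to input 0 as well; the second round lets output 1 grant input 2, which is the reason for running more than one round per cell time.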
CRS Cell based Multi-Stage Benes • Multiple paths to a destination • For a given input to output port, the no. of paths is equal to the no. of center stage elements • Distribution between S1 and S2 stages. Routing at S2 and S3 • Cell routing
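A toy model makes the path-count statement concrete. The class below is a hypothetical sketch, not the CRS-1 fabric logic: it assumes S1 spreads cells over the center-stage (S2) elements in plain round-robin order, which the slide does not specify, and it treats S2 and S3 as pure routing stages determined by the destination port.

```python
class ThreeStageFabric:
    """Toy three-stage fabric: S1 distributes, S2 and S3 route."""

    def __init__(self, num_s2, ports_per_s3):
        self.num_s2 = num_s2              # center-stage elements
        self.ports_per_s3 = ports_per_s3  # output ports behind each S3
        self._next_s2 = 0                 # round-robin spray pointer (assumed)

    def paths(self, out_port):
        """All (s2, s3) paths to out_port: one per center-stage element."""
        s3 = out_port // self.ports_per_s3
        return [(s2, s3) for s2 in range(self.num_s2)]

    def send_cell(self, out_port):
        """Pick the next S2 for this cell, then route by destination."""
        s2 = self._next_s2
        self._next_s2 = (self._next_s2 + 1) % self.num_s2
        s3 = out_port // self.ports_per_s3
        return (s2, s3, out_port % self.ports_per_s3)


fabric = ThreeStageFabric(num_s2=4, ports_per_s3=8)
hops = [fabric.send_cell(19) for _ in range(5)]
print(len(fabric.paths(19)), hops)
```

Consecutive cells to the same destination take different center-stage elements, so they can arrive out of order; that is the cost the reassembly window has to absorb.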
Fabric speedup • Q-fabric tries to approximate an output buffered switch • to minimize sub-port blocking • Buffering at output allows better scheduling • In single stage fabrics a 2X speedup very closely approximates an output buffered fabric * • For multi-stage fabrics the speedup factor needed to approximate output buffered behavior is not known • CRS-1 fabric’s ~5X speedup • constrained by available technology • * Balaji Prabhakar and Nick McKeown, Stanford Computer Systems Laboratory technical report CSL-TR-97-738, November 1997.
Fabric Flow Control Overview
• Discard - time constant in the 10’s of mS range
  • Originates from ‘from fab’ and is directed at ‘to fab’
  • Very fine granularity: discard down to the level of individual destination raw queues
• Back Pressure - time constant in the 10’s of mS range
  • Originates from the Fabric and is directed at ‘to fab’
  • Operates per priority at increasingly coarse granularity:
    • Fabric Destination (one of 4608)
    • Fabric Group (one of 48 in phase one and 96 in phase two)
    • Fabric (stop all traffic into the fabric per priority)
Reassembly Window • Cells transiting the Fabric take different paths between Sprayer and Sponge. • Cells of the same packet can therefore arrive out of order. • The Reassembly Window for a given Source is defined as the worst-case differential delay that two cells of the same packet encounter as they traverse the Fabric. • The Fabric limits the Reassembly Window
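Reassembly at the receiving side (the Sponge) can be sketched as follows. The cell fields used here (source, packet id, sequence number, last-cell flag) are hypothetical; the slides do not describe the real cell header. The point of the sketch is only that cells are held until every cell of the packet has arrived, and the bounded reassembly window is what bounds how long and how much the receiver must hold.

```python
class Reassembler:
    """Collects out-of-order cells and emits a packet once it is complete."""

    def __init__(self):
        self._partial = {}  # (source, packet_id) -> {seq: payload}
        self._total = {}    # (source, packet_id) -> cell count, once known

    def receive(self, source, packet_id, seq, payload, last):
        key = (source, packet_id)
        cells = self._partial.setdefault(key, {})
        cells[seq] = payload
        if last:
            # The last cell tells us how many cells the packet has.
            self._total[key] = seq + 1
        total = self._total.get(key)
        if total is not None and len(cells) == total:
            del self._partial[key]
            del self._total[key]
            return b"".join(cells[i] for i in range(total))
        return None  # still waiting for cells inside the window


r = Reassembler()
first = r.receive(source=1, packet_id=7, seq=2, payload=b"C", last=True)
second = r.receive(source=1, packet_id=7, seq=0, payload=b"A", last=False)
third = r.receive(source=1, packet_id=7, seq=1, payload=b"B", last=False)
print(first, second, third)
```

A real implementation would also time out incomplete packets after the reassembly window expires; that is left out here to keep the sketch short.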
Linecard challenges • Power • COGS • Multiple interfaces • Intermediate buffering • Speed up • CPU subsystem
Cisco CRS-1 Line Card
[Figure: line card block diagram and board layout, shown over several build slides. PLIM labels: four OC192 framer-and-optics blocks, interface module ASIC. Modular Services Card labels: RX Metro (ingress Metro), ingress queuing, from-fabric ASIC, TX Metro (egress Metro), egress queuing, Squid GW, line card CPU, fabric serdes, power regulators. The two cards meet at the midplane. Numbered steps 1-8 trace the ingress and egress packet flow.]
Metro Subsystem • What is it ? • Massively Parallel NP • Codename Metro • Marketing name SPP (Silicon Packet Processor) • What were the Goals ? • Programmability • Scalability • Who designed & programmed it ? • Cisco internal (Israel/San Jose) • IBM and Tensilica partners
Metro Subsystem
• Metro
  • 2500 balls
  • 250 MHz, 35 W
• QDR2 SRAM
  • 250 MHz DDR
  • 5 channels
  • Policing state, classification results, queue length state
• TCAM
  • 125 MSPS
  • 128k x 144-bit entries
  • 2 channels
• FCRAM
  • 166 MHz DDR
  • 9 channels
  • Lookups and table memory
Metro Top Level • Packet In: 96 Gb/s BW • Packet Out: 96 Gb/s BW • Control Processor Interface: proprietary, 2 Gb/s • 18 mm x 18 mm, IBM 0.13 µm • 18M gates • 8 Mbit SRAM and RAs
Gee-whiz numbers • 188 32-bit embedded RISC cores • ~50 BIPS • 175 Gb/s memory BW • 78 MPPS peak performance
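These numbers imply a per-packet instruction budget, which is a useful way to read them. The arithmetic below takes the core count and peak packet rate from this slide and the 250 MHz clock from the Metro Subsystem slide; it assumes one instruction per core per cycle, which is an idealization, and the 40 Gb/s comparison at the end is an interpretation rather than something the slides state.

```python
CORES = 188       # "188 32-bit embedded RISC cores"
CLOCK_HZ = 250e6  # Metro clock from the Metro Subsystem slide
PEAK_PPS = 78e6   # "78 MPPS peak performance"

# Assumption: every core retires one instruction per cycle.
peak_ips = CORES * CLOCK_HZ

# Instructions available per packet when running at the peak packet rate.
instr_per_packet = peak_ips / PEAK_PPS

# Packet size at which a 40 Gb/s stream produces PEAK_PPS packets per second.
bytes_per_packet_at_40g = 40e9 / 8 / PEAK_PPS

print(peak_ips / 1e9, instr_per_packet, bytes_per_packet_at_40g)
```

The result is about 47 billion instructions per second (in line with the "~50 BIPS" bullet), roughly 600 instructions per packet at peak rate, and a packet size of about 64 bytes at 40 Gb/s.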
Why Programmability? Simple forwarding – not so simple
[Figure: forwarding data structures. Lookup (millions of routes, L3 info, hundreds of load-balancing entries per leaf) feeds load balancing and adjacencies (L3 load balance entry, 100k+ adjacencies with L2 info, L2 adjacency, pointer to statistics counters), held in SRAM/DRAM. Policy-based routing (PBR) uses an associative TCAM with a 1:1 data table in SRAM/DRAM.]
• Increasing pressure to add 1-2 levels of indirection for high availability and increased update rates
Example features:
• IPv4 unicast lookup algorithm
• MPLS – 3 labels
• Link bundling (v4)
• Load balancing L3 (v4)
• 1 policer check
• Marking
• TE/FRR
• Sampled NetFlow
• WRED
• ACL
• IPv4 multicast
• IPv6 unicast
• Per prefix accounting
• GRE/L2TPv3 tunneling
• RPF check (loose/strict) v4
• Load balancing L3 (v6)
• Link bundling (v6)
• Congestion control
Programmability also means:
• Ability to juggle feature ordering
• Support for heterogeneous mixes of feature chains
• Rapid introduction of new features (feature velocity)
Metro Architecture Basics: Packet Distribution
[Figure: on-chip packet buffer with 96G links, 188 PPEs, resource fabric, resources.]
• Packet tails stored on-chip
• ~100 bytes of packet context sent to PPEs
• Run-to-completion (RTC)
  • Simple SW model
  • Efficient heterogeneous feature processing
• RTC and non-flow-based packet distribution means scalable architecture
• Costs
  • High instruction BW supply
  • Need RMW and flow ordering solutions
Metro Architecture Basics: Packet Gather
[Figure: on-chip packet buffer with 96G links, 188 PPEs, resource fabric, resources.]
• Gather of packets involves:
  • Assembly of final packets (at 100 Gb/s)
  • Packet ordering after variable length processing
  • Gathering without new packet distribution
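Packet ordering after variable-length processing is essentially a reorder-buffer problem: PPEs finish in any order, but packets must leave in arrival order. The sketch below is a generic reorder buffer under the assumption of a single global sequence number assigned at distribution time; the slides do not describe Metro's actual ordering mechanism, and a real design would likely order per flow or per interface instead.

```python
class OrderedGather:
    """Releases packets in arrival order, whatever order PPEs finish in."""

    def __init__(self):
        self._next_seq = 0  # sequence number of the next packet to release
        self._done = {}     # seq -> finished packet waiting for its turn

    def complete(self, seq, packet):
        """Called when a PPE finishes the packet tagged with seq.
        Returns the packets that can now leave, in order."""
        self._done[seq] = packet
        released = []
        while self._next_seq in self._done:
            released.append(self._done.pop(self._next_seq))
            self._next_seq += 1
        return released


g = OrderedGather()
out_a = g.complete(1, "p1")  # finished early, must wait for p0
out_b = g.complete(0, "p0")  # releases p0, then the waiting p1
out_c = g.complete(2, "p2")
print(out_a, out_b, out_c)
```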
Metro Architecture Basics: Resources
[Figure: on-chip packet buffer with 96G links, 188 PPEs, resource fabric, resources.]
• Packet buffer accessible as a resource
• Resource fabric is parallel wide multi-drop busses
• Resources consist of:
  • Memories
  • Read-modify-write operations
  • Performance-heavy mechanisms
Metro Resources
• Statistics: 512k
• Interface tables
• Policing: 100k+
• Queue depth state
• Lookup engine: 2M prefixes
• TCAM
• Table DRAM (10’s of MB)
• Lookup engine uses the Tree Bitmap algorithm
  • FCRAM and on-chip memory
  • High update rates
  • Configurable performance vs density
• Reference: Will Eatherton et al., “Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates”, CCR April 2004 (vol. 34 no. 2) pp 97-123.
Packet Processing Element (PPE) • 16 PPE clusters • Each cluster has 12 PPEs • 0.5 sq mm per PPE
Packet Processing Element (PPE)
• Tensilica Xtensa core with Cisco enhancements
  • 32-bit, 5-stage pipeline
  • Code density: 16/24-bit instructions
• Small instruction cache and data memory
• Cisco DMA engine: allows 3 outstanding descriptor DMAs
• 10’s of Kbytes of fast instruction memory
[Figure: PPE block diagram. Labels: global instruction memory, instruction bus, cluster instruction memory (to 12 PPEs), ICACHE, 32-bit RISC processor core, memory-mapped registers, data memory, scratch pad, Cisco DMA, cluster data mux unit (to 12 PPEs), packet distribution (distribution header, packet header), packet gather, from resources, to resources.]
Programming Model and Efficiency
Metro programming model
• Run-to-completion programming model
• Queued descriptor interface to resources
• Industry-leveraged tool flow
Efficiency data points
• 1 ucoder for 6 months: IPv4 with common features (ACL, PBR, QoS, etc.)
• CRS-1 initial shipping datapath code was done by ~3 people
Challenges • Constant power battle • Memory and IO • Die size allocation • PPEs vs HW acceleration • Scalability • On-chip BW vs off-chip capacity • Procket NPU 100 MPPS - limited scaling • Performance
Future directions • POP convergence • Edge and core differences blur • Smartness in the network • More integrated services into the routing platforms • Feature sets needing acceleration expanding • Must leverage feature code across platforms/markets • Scalability (# of processors, amount of memory, BW)
Summary • Router business is diverse • Network growth pushes routers to the edge • Customers expect scale on the one hand … and a smart network on the other • Routers become massively parallel processing machines
Questions ? Thank You
CRS-1 Positioning • Core router (overall BW, interface types) • 1.2 Tbps, OC-768c interface • Distributed architecture • Scalability/Performance • Scalable control plane • High Availability • Logical Routers • Multi-Chassis Support
Network planes • Networks are considered to have three planes / operating timescales • Data: packet forwarding [μs, ns] • Control: flows/connections [ms, secs] • Management: aggregates, networks [secs, hours] • Coupling between planes is in descending order (control-data tighter, management-control looser)
Exact Matches in Ethernet Switches: Trees and Tries
[Figure: a binary search tree of depth log2 N over N entries, branching on < and >, next to a binary search trie branching on bits 0 and 1, with example keys 010 and 111.]
• Binary search tree: lookup time depends on table size but is independent of address length; storage is O(N)
• Binary search trie: lookup time is bounded and independent of table size; storage is O(NW)
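The two trade-offs can be seen side by side in a small sketch. A sorted array searched with `bisect` stands in for the binary search tree (about log2 N comparisons), and a dictionary-based binary trie walks exactly W bits regardless of how many keys are stored. The class and function names are illustrative only, and the 3-bit key width matches the 010 / 111 example keys on the slide.

```python
import bisect


def sorted_lookup(keys, values, key):
    """Binary search over sorted keys: O(log N) steps, O(N) storage."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None


class BinaryTrie:
    """One level per key bit: W steps per lookup, up to O(N*W) nodes."""

    def __init__(self, width):
        self.width = width
        self.root = {}

    def insert(self, key, value):
        node = self.root
        for shift in range(self.width - 1, -1, -1):
            bit = (key >> shift) & 1  # most significant bit first
            node = node.setdefault(bit, {})
        node["value"] = value

    def lookup(self, key):
        node = self.root
        for shift in range(self.width - 1, -1, -1):
            node = node.get((key >> shift) & 1)
            if node is None:
                return None  # no entry shares this bit prefix
        return node.get("value")


trie = BinaryTrie(width=3)
trie.insert(0b010, "port A")
trie.insert(0b111, "port B")
print(trie.lookup(0b010), trie.lookup(0b011))
print(sorted_lookup([0b010, 0b111], ["port A", "port B"], 0b111))
```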
Exact Matches in Ethernet Switches: Multiway tries
[Figure: 16-ary search trie. Each node holds entries (0000, ptr) … (1111, ptr); example keys 000011110000 and 111111111111.]
• 16-ary search trie
• Ptr=0 means no children
• Q: Why can’t we just make it a 2^48-ary trie?
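A 16-ary trie over 48-bit MAC addresses consumes 4 bits per level, so a lookup takes 12 steps. The sketch below uses Python lists as nodes and `None` for the "no children" pointer; the function names and the 8-byte pointer size used in the final calculation are assumptions made for illustration. The last line hints at the answer to the closing question: a single 2^48-way node is just a direct-indexed table with one slot per possible address.

```python
STRIDE = 4                   # bits consumed per level (16-ary)
FANOUT = 1 << STRIDE
KEY_BITS = 48                # Ethernet MAC address width
LEVELS = KEY_BITS // STRIDE  # 12 levels


def new_node():
    return [None] * FANOUT   # None plays the role of "ptr = 0"


def insert(root, mac, port):
    node = root
    for level in range(LEVELS - 1):
        shift = KEY_BITS - STRIDE * (level + 1)
        idx = (mac >> shift) & (FANOUT - 1)
        if node[idx] is None:
            node[idx] = new_node()
        node = node[idx]
    node[mac & (FANOUT - 1)] = port  # last nibble indexes the leaf slot


def lookup(root, mac):
    node = root
    for level in range(LEVELS - 1):
        shift = KEY_BITS - STRIDE * (level + 1)
        node = node[(mac >> shift) & (FANOUT - 1)]
        if node is None:
            return None              # no children on this path
    return node[mac & (FANOUT - 1)]


root = new_node()
insert(root, 0x001B44113AB7, 3)
print(lookup(root, 0x001B44113AB7), lookup(root, 0x001B44113AB8))

# Assumption: 8-byte pointers. One 2^48-ary node would need this many bytes.
POINTER_BYTES = 8
flat_table_bytes = (1 << KEY_BITS) * POINTER_BYTES
print(flat_table_bytes / 2**50, "PiB")
```

Wider strides trade memory for fewer lookup steps; at the extreme of one 48-bit stride the lookup is a single step, but the table needs petabytes of memory to hold a few thousand addresses.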