
Can Google Route?






Presentation Transcript


  1. Can Google Route? Building a High-Speed Switch from Commodity Hardware. Guido Appenzeller, Matthew Holliman, Q2/2002

  2. Outline • Motivation: Cost of switches • The Basic Unit • Switching Fabrics • Analysis & Reality Check

  3. Price Survey of Gigabit Switches: Apples and Oranges
  • PC Gigabit Ethernet card: $34.99 (32-bit), $49.35 (64-bit)
  • Layer 2 switches: 4-port $460, 8-port $830, 16-port $1,500, 24-port $2,200
  • Layer 3 switches: 24-port $7,467
  • Routers (high end):
    • Cisco 7600: max 15-80 Gbit, base price $60k, line cards $10k, total $100k-$250k???
    • Cisco 12400: max 160 Gbit, base price $120k, line cards $25k-$50k, total $300k-$1M???
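The "apples and oranges" point is easiest to see as cost per gigabit of switching capacity. A minimal sketch using only the list prices above; the $500k Cisco total is an assumed midpoint of the slide's $300k-$1M range, not a quoted price:

    # Cost per gigabit from the list prices on slide 3. The Cisco total
    # is an assumed midpoint, flagged as such.
    devices = {
        "Layer 2 24-port switch":     (2_200, 24),    # 24 x 1 Gbit ports
        "Layer 3 24-port switch":     (7_467, 24),
        "Cisco 12400 (assumed $500k)": (500_000, 160),  # max 160 Gbit
    }
    for name, (price_usd, gbit) in devices.items():
        print(f"{name}: ${price_usd / gbit:,.0f} per Gbit")
    # Layer 2: ~$92/Gbit, Layer 3: ~$311/Gbit, core router: ~$3,125/Gbit

The two-orders-of-magnitude spread is what the next slide plots.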

  4. Cost/Gigabit vs. Total Switching Capacity: cost/gigabit dramatically increases with aggregate speed. [Chart: cost per Gbit plotted against total switching capacity for the devices above.]

  5. Let’s Build a Switch the Google Way: use large numbers of cheap PC components • Use cheap PC boards ($250) • 16 MBytes of memory • No disk • Use cheap copper Gigabit Ethernet cards • Use Clos networks to build larger fabrics out of smaller ones. [Diagram: PCs linked by Ethernet, each performing address lookup and output queuing.]

  6. PC Data Flow: the PCI bus is the bottleneck. [Diagram: PC bus layout. The CPU and DRAM connect through the MCH at 3.2 GB/s; the Gig-E NICs share a 0.5 GB/s PCI bus. Sequence: Ethernet frames arrive in the NIC FIFO, the NIC raises an interrupt, PCI bus arbitration/address, the NIC transfers packet(s) in a PCI burst write, the transaction terminates, CPU processing follows.]

  7. Computational Flow: interrupts are a bottleneck for short packets. [Diagram: per-packet pipeline. Incoming packet buffered (DRAM access, ~100 ns), read packet header, lookup, copy packet to output queue; NIC transfers packet(s), interrupt costs ~5 μs.] • Packet processing is done from/to DRAM • Packets are written from/to the network cards in bursts to save IRQ overhead and PCI bandwidth (a sketch of such a loop follows)
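A minimal sketch of what such a burst-oriented loop could look like; Packet, lookup() and nic_transmit() are hypothetical placeholders for illustration, not the authors' driver code:

    from collections import namedtuple

    Packet = namedtuple("Packet", "dst payload")   # toy stand-in for a frame

    def lookup(dst):
        """Placeholder address lookup: map a destination to an output port."""
        return hash(dst) % 4                       # 4 output ports (toy)

    def nic_transmit(port, batch):
        """Placeholder for one PCI burst write to the output NIC."""
        print(f"port {port}: sent {len(batch)} packets in one burst")

    def on_interrupt(burst):
        """One interrupt covers a whole burst of packets the NIC has already
        DMA'd into DRAM, so the 5-11 us interrupt/driver latency is amortized
        over every packet in the burst."""
        out = {}
        for pkt in burst:
            out.setdefault(lookup(pkt.dst), []).append(pkt)  # copy to output queue
        for port, batch in out.items():
            nic_transmit(port, batch)              # one burst write per NIC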

  8. Per-Port Throughput vs. Burst Size: we need a 66 MHz, 64-bit system. [Plot: per-port read/write bandwidth (units of 100 Mbit/s) versus burst size (0-1500 bytes) for 33 MHz/32-bit, 66 MHz/32-bit, 33 MHz/64-bit and 66 MHz/64-bit PCI buses; only the 66 MHz, 64-bit configuration clears the 1 Gbit threshold at large burst sizes.]
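The shape of that plot follows from a simple transaction model: every PCI transaction pays a fixed arbitration/addressing overhead before data moves one bus word per cycle, so small bursts waste most of the bus. A back-of-the-envelope sketch; the 8-cycle overhead is an assumed round number, not a measured chipset figure:

    def pci_gbit_per_s(burst_bytes, clock_hz, width_bytes, overhead_cycles=8):
        """Toy model of effective PCI throughput for a given burst size."""
        data_cycles = burst_bytes / width_bytes       # one bus word per cycle
        total_cycles = overhead_cycles + data_cycles  # plus fixed overhead
        return burst_bytes * 8 * clock_hz / total_cycles / 1e9

    for clock_mhz, width_bits in [(33, 32), (66, 32), (33, 64), (66, 64)]:
        gbps = pci_gbit_per_s(1500, clock_mhz * 1e6, width_bits // 8)
        print(f"{clock_mhz} MHz, {width_bits}-bit: {gbps:.2f} Gbit/s at 1500-byte bursts")
    # Only the 66 MHz/64-bit bus (~4 Gbit/s) leaves about 1 Gbit/s per port
    # once four ports share it; small bursts are dominated by the overhead.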

  9. CPU Clock Cycles per Byte vs. Packet Size: for 100% throughput we need to aggregate short packets. [Plot, log scale: CPU clocks per byte versus packet size (0-1500 bytes), for N = 1, 10, 30 packets per burst and latencies of 5 μs (HW latency only) and 11 μs (with driver latency); the average packet size and the 1 Gbit threshold are marked.]
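Those curves follow a two-term model: a fixed interrupt/driver latency amortized over N packets of P bytes, plus a constant per-byte processing cost. A sketch assuming a 1 GHz CPU and about 2 cycles/byte of copy-and-lookup work (both values are assumptions, not numbers from the slide):

    def cycles_per_byte(pkt_bytes, n_per_burst, latency_us,
                        clock_hz=1e9, per_byte_cycles=2.0):
        """Toy model: amortized per-interrupt cost plus constant per-byte work."""
        fixed = latency_us * 1e-6 * clock_hz / (n_per_burst * pkt_bytes)
        return fixed + per_byte_cycles

    CLOCK_HZ = 1e9                    # assumed CPU clock
    LINE_BYTES = 1e9 / 8              # 1 Gbit/s port rate in bytes/s
    budget = CLOCK_HZ / LINE_BYTES    # 8 cycles/byte available per port

    for n in (1, 10, 30):
        cpb = cycles_per_byte(64, n, latency_us=11)
        verdict = "OK" if cpb <= budget else "too slow"
        print(f"64-byte packets, N={n:2d}: {cpb:6.1f} cycles/byte ({verdict})")
    # N=1 is hopeless for short packets; batching ~30 packets per interrupt
    # brings the cost under the 1 Gbit threshold.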

  10. PC Performance Summary: today’s PCs are just fast enough to operate as a 4x4 switch • To build a 4x4 half-duplex (2x2 full-duplex) switch we need: • a 66 MHz/64-bit PCI bus • 1 GByte/s of memory bandwidth • NICs with enough buffering to aggregate short packets into bursts (about 2 KBytes) • software that runs without interrupts • e.g. Linux in halted mode (the bandwidth figures are sanity-checked below)
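As a sanity check, a quick sketch assuming each forwarded byte crosses the PCI bus twice (NIC to DRAM, DRAM to NIC) and touches DRAM four times (buffer write, copy read, output-queue write, transmit read); the four-touch count is an inference from the flow on slide 7:

    forwarded = 2e9 / 8           # 2 Gbit/s of forwarded traffic, in bytes/s
    pci_load  = 2 * forwarded     # every byte crosses the PCI bus twice
    dram_load = 4 * forwarded     # written, read, re-written, read again
    print(f"PCI:  {pci_load / 1e9:.1f} GB/s")    # 0.5 GB/s: needs 66 MHz/64-bit PCI
    print(f"DRAM: {dram_load / 1e9:.1f} GB/s")   # 1.0 GB/s memory bandwidth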

  11. Building Larger Switches: a Clos network for a 16x16 switch made of 4x4 switches. Requires: • 12 PCs • 48 network cards • 8 Gbit capacity • Total cost: $5,400. 3Com is cheaper.

  12. Building Larger Switches: a Clos network for a 256x256 switch out of 16x16 switches. Requires: • 576 PCs • 2,304 network cards • 128 Gbit capacity • Total cost: $260k. Now we are cheaper! Well, sort of… (both bills of materials are checked below) [Diagram: three-stage Clos fabric with 16 nodes per stage.]
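Both bills of materials can be checked mechanically. A sketch assuming the $250 PC board from slide 5 and the $49.35 64-bit NIC from slide 3, four NICs per PC:

    PC_USD, NIC_USD = 250, 49.35

    def fabric_cost(pcs, nics):
        return pcs * PC_USD + nics * NIC_USD

    # 16x16: three stages of four 4x4 PCs, four NICs each
    print(f"16x16:   12 PCs,   48 NICs: ${fabric_cost(12, 48):,.0f}")      # ~$5,400
    # 256x256: three stages of sixteen 16x16 fabrics
    print(f"256x256: 576 PCs, 2304 NICs: ${fabric_cost(576, 2304):,.0f}")  # ~$260k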

  13. Switch Size vs. Basic Units Needed: scaling is slightly worse than n log(n) • How many 4x4 switches do we need for an NxN switch? • 4x4: 1 switch • 16x16: 12 switches • 256x256: 576 switches • 4^(2^n) x 4^(2^n): 3^n · 4^(2^n) / 4 switches • In general: an NxN switch needs (N/4) · log4(N) · 1.5^(log2(log4(N))) switches • Could you build a switch with fewer basic units? • Maybe, but not by much • The theoretical limit is O((N/4) · log4(N)) • The differing term 1.5^(log2(log4(N))) is small (a sketch verifying the closed form follows)
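The recurrence behind these numbers: a k² x k² Clos fabric takes three stages of k k x k switches, i.e. 3k copies of the previous generation. A short sketch checking the slide's closed form against that recursion:

    def units(n):
        """Number of 4x4 switches in a 4^(2^n) x 4^(2^n) fabric."""
        if n == 0:
            return 1
        k = 4 ** (2 ** (n - 1))         # port count of the previous generation
        return 3 * k * units(n - 1)     # 3k copies of the smaller fabric

    for n in range(4):
        ports = 4 ** (2 ** n)
        assert units(n) == 3 ** n * ports // 4    # closed form from the slide
        print(f"{ports}x{ports}: {units(n)} basic switches")
    # 4x4: 1, 16x16: 12, 256x256: 576, 65536x65536: 442368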

  14. Scheduling: The Problem. How do we do scheduling? • For an n = k Clos network we need dynamic matching • For 256x256 the matching algorithm is time-consuming • How do we pass traffic information between inputs, the scheduler and the nodes? • Extra network connections are costly • Timing is critical

  15. Solution: Buffered Clos Networks. Two ideas: 1. Add a randomization stage (Chang et al.) • Now we can use round robin as the scheduler • This is deterministic and requires no synchronization between nodes 2. Use the PCs’ capability to buffer data • Each node has a few MBytes • If there is a collision, re-send the packet • We use randomization

  16. Randomized, Buffered Clos Network. Stage 1: • Pseudo-random (no coordination needed) • Never blocks. Stage 2: • Round robin (no coordination needed) • Never blocks. (A toy sketch of both stages follows.) [Diagram: three-stage fabric, stages labeled "Pseudo Random" and "Round Robin".]
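A toy sketch of the two stages, illustrative only and not the authors' scheduler: stage 1 sprays each packet to a pseudo-random middle node; in stage 2 each middle node serves its VOQs in a fixed rotation, so within any timeslot no two middle nodes pick the same output and no coordination is ever needed:

    import random
    from collections import deque

    K = 4                                   # k x k basic units
    # voq[m][o]: packets buffered at middle node m, destined for output o
    voq = [[deque() for _ in range(K)] for _ in range(K)]

    def stage1(pkt, out_port):
        """Load balancing: forward to a pseudo-random middle node."""
        voq[random.randrange(K)][out_port].append(pkt)

    def stage2(t):
        """Round robin: in timeslot t, middle node m serves output (t + m) % K.
        The offsets are all distinct, so no output is ever contended."""
        delivered = []
        for m in range(K):
            out = (t + m) % K
            if voq[m][out]:
                delivered.append((out, voq[m][out].popleft()))
        return delivered

    for t in range(3):                      # tiny demo: a few arrivals, one slot
        stage1(f"pkt{t}", out_port=t % K)
    print(stage2(t=0))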

  17. Stability Analysis of the Switch • First stage: • The matching is random, so the distribution of packets over the middle column of nodes is i.i.d. uniform • No blocking can occur • Queueing at the middle stage is equivalent to an input-queued router (IQR) with k inputs, VOQs and round-robin scheduling • We know such an IQR has 100% throughput under i.i.d. uniform traffic • Second stage: • No blocking can occur; 100% throughput if all VOQs in the middle stage are occupied • Queueing at the middle stage is equivalent to an output-queued router with k inputs • An output-queued router has 100% throughput • Therefore the system has 100% throughput for any admissible traffic

  18. Reality Check • This might look like a good idea… • Cheap • Scalable: the switch can grow • Some redundancy: a node failure reduces throughput by at most 1/16 • …but it is probably not exactly what carriers want: • High power consumption (50 kW vs. 2.5 kW) • Major space requirements (10-20 racks) • Packet reordering can happen (TCP won’t like it) • Maintenance: one PC will fail per day! (the fabric has 576 of them)

  19. Research Outlook: why this could still be interesting • We could do this in hardware • Implement it in VLSI • Build it from the chipsets that 24x24 switch manufacturers use • We could use better base units • E.g. 1.15 Tbit half duplex (fastest in the world?) • 576x576 using 24x24 Netgear switches (GS524T) • Cost: $158k (we might get a volume discount from Netgear; the numbers are checked below) • So far we don’t use the intelligence of the nodes • We could re-configure matchings periodically • Distribute the lookup
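Those Netgear numbers check out the same way as slides 11-12. A quick sketch assuming the $2,200 24-port list price from slide 3 (24² = 576, so three stages of 24 switches suffice):

    stages, k = 3, 24
    switches = stages * k                # 72 GS524T units
    cost_usd = switches * 2_200          # $158,400  (slide: $158k)
    capacity = 576 * 2                   # 576 1-Gbit ports, counting both directions
    print(f"{switches} switches, ${cost_usd:,}, {capacity / 1000:.2f} Tbit half duplex")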

  20. Backup

  21. Randomized/HoQ Buffered Clos Network. Stage 1: • Pseudo-random (no coordination needed) • Never blocks. Stage 2: • Head of queue (no coordination needed) • If it blocks, buffer the packet and re-send. Note: we can overprovision the middle layer! [Diagram: same three-stage fabric as slide 16.]
