
Scaling Internet Routers Using Optics UW, October 16th, 2003

Scaling Internet Routers Using Optics UW, October 16th, 2003. Nick McKeown. Joint work with research groups of: David Miller, Mark Horowitz, Olav Solgaard. Students: Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu. Department of Electrical Engineering, Stanford University


Presentation Transcript


  1. Scaling Internet Routers Using Optics. UW, October 16th, 2003. Nick McKeown. Joint work with research groups of: David Miller, Mark Horowitz, Olav Solgaard. Students: Isaac Keslassy, Shang-Tse Chuang, Kyoungsik Yu. Department of Electrical Engineering, Stanford University. Paper: http://klamath.stanford.edu/~nickm/papers/sigcomm2003.pdf Web site: http://klamath.stanford.edu/or

  2. Backbone router capacity. [Chart: router capacity per rack, axis from 1Gb/s to 1Tb/s; capacity grows 2x every 18 months.]

  3. Backbone router capacity. [Chart: axis from 1Gb/s to 1Tb/s; traffic grows 2x every year, router capacity per rack grows 2x every 18 months.]

  4. Extrapolating. [Chart: axis from 1Tb/s to 100Tb/s; traffic grows 2x every year, router capacity grows 2x every 18 months.] 2015: 16x disparity.
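The 16x figure follows from the two doubling periods on the chart. The sketch below reproduces the arithmetic; it assumes the extrapolation starts in 2003 (the year of the talk), which the slide does not state explicitly, and the function names are illustrative.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` when doubling every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


def disparity(years: float) -> float:
    """Traffic growth (2x per year) divided by router capacity growth (2x per 18 months)."""
    return growth_factor(years, 1.0) / growth_factor(years, 1.5)


# Assumption: the baseline year is 2003, the date of the talk.
YEARS = 2015 - 2003

print("traffic growth:", growth_factor(YEARS, 1.0))
print("capacity growth:", growth_factor(YEARS, 1.5))
print("disparity:", disparity(YEARS))
```

Over 12 years traffic doubles 12 times and capacity doubles 8 times, so the gap is 2^4 = 16.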

  5. Consequence • Unless something changes, operators will need: • 16 times as many routers, consuming • 16 times as much space, • 256 times the power, • Costing 100 times as much. • Actually need more than that…

  6. Stanford 100Tb/s Internet Router. [Diagram: Electronic Linecard #1 through Electronic Linecard #625 (each: line termination, IP packet processing, packet buffering) connected to an Optical Switch; link labels 40Gb/s, 160Gb/s and 160-320Gb/s; 100Tb/s = 640 * 160Gb/s.] Goal: Study scalability • Challenging, but not impossible • Two orders of magnitude faster than deployed routers • We will build components to show feasibility

  7. Throughput Guarantees • Operators increasingly demand throughput guarantees: • To maximize use of expensive long-haul links • For predictability and planning • Despite lots of effort and theory, no commercial router today has a throughput guarantee.

  8. Requirements of our router • 100Tb/s capacity • 100% throughput for all traffic • Must work with any set of linecards present • Use technology available within 3 years • Conform to RFC 1812

  9. What limits router capacity? [Chart: approximate power consumption per rack.] Power density is the limiting factor today.

  10. Trend: Multi-rack routers. Reduces power density. [Diagram labels: Crossbar, Linecards, Switch, Linecards.]

  11. [Examples pictured: Juniper TX8/T640, Alcatel 7670 RSP, Avici TSR, Chiaro.]

  12. Limits to scaling • Overall power is dominated by linecards • Sheer number • Optical WAN components • Per packet processing and buffering. • But power density is dominated by switch fabric

  13. Trend: Multi-rack routers. Reduces power density. Limit today ~2.5Tb/s • Electronics • Scheduler scales <2x every 18 months • Opto-electronic conversion. [Diagram labels: Switch, Linecards.]

  14. Multi-rack routers. [Diagram: linecards (In, Out, WAN) connected to a switch fabric.]

  15. Question • Instead, can we use an optical fabric at 100Tb/s with 100% throughput? • Conventional answer: No. • Need to reconfigure switch too often • 100% throughput requires complex electronic scheduler.

  16. Outline • How to guarantee 100% throughput? • How to eliminate the scheduler? • How to use an optical switch fabric? • How to make it scalable and practical?

  17. 100% Throughput. [Diagram: N inputs and N outputs, each at rate R, joined by a full mesh whose link rates are marked "?".] Switch capacity = N²R. Router capacity = NR.

  18. If traffic is uniform. [Diagram: inputs and outputs at rate R, joined by a full mesh of links at rate R/N.]

  19. Real traffic is not uniform. [Diagram: the same mesh of R/N links, with some links labelled R and "?".]

  20. Two-stage load-balancing switch. [Diagram: inputs at rate R feed a load-balancing stage followed by a switching stage, then outputs at rate R; all internal links run at R/N.] 100% throughput for weakly mixing, stochastic traffic. [C.-S. Chang, Valiant]
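A rough fluid-model sketch of why the two-stage arrangement copes with non-uniform traffic on R/N links. It assumes an admissible traffic matrix `lam` (no input or output exceeds rate R) and ideal uniform spreading in the first stage; packet-level effects such as queueing and mis-sequencing are ignored. The function and variable names are illustrative, not taken from the talk.

```python
def stage_loads(lam):
    """Per-link loads in a two-stage load-balanced switch (fluid model).

    lam[i][k] is the offered rate from input i to output k.
    Stage 1 spreads each input's traffic evenly over all N intermediate ports.
    Stage 2 carries, from each intermediate port, 1/N of every output's traffic.
    """
    n = len(lam)
    row_sums = [sum(row) for row in lam]
    col_sums = [sum(lam[i][k] for i in range(n)) for k in range(n)]
    first = [[row_sums[i] / n for _ in range(n)] for i in range(n)]
    second = [[col_sums[k] / n for k in range(n)] for _ in range(n)]
    return first, second


N, R = 4, 1.0

# A deliberately non-uniform pattern: input i sends everything to output i+1.
lam = [[0.0] * N for _ in range(N)]
for i in range(N):
    lam[i][(i + 1) % N] = R

first, second = stage_loads(lam)
worst_direct = max(max(row) for row in lam)  # load on a single-stage mesh link
worst_balanced = max(max(max(r) for r in first), max(max(r) for r in second))

print("worst link, direct mesh:", worst_direct)        # needs a full-rate link
print("worst link, load-balanced:", worst_balanced)    # stays at R/N
```

In this model a direct mesh would need links of rate R for the permutation pattern, whereas both stages of the load-balanced switch stay at R/N.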

  21. [Diagram: the two-stage load-balancing switch (all internal links at R/N) with packets labelled 1, 2 and 3 at the inputs.]

  22. [Diagram, continued: the same two-stage switch with the labelled packets 1, 2 and 3 shown inside the switch.]

  23. Chang’s load-balanced switch: Good properties • 100% throughput for broad class of traffic • No scheduler needed → Scalable

  24. Chang’s load-balanced switch: Bad properties • Packet mis-sequencing • Pathological traffic patterns → Throughput 1/N-th of capacity • Uses two switch fabrics → Hard to package • Doesn’t work with some linecards missing → Impractical. FOFF: Load-balancing algorithm • Packet sequence maintained • No pathological patterns • 100% throughput - always • Delay within bound of ideal • (See paper for details)

  25. Single Mesh Switch. [Diagram: each linecard holds both an In and an Out port at rate R; the linecards are joined by a single mesh of links at rate 2R/N.]

  26. Packaging. [Diagram: linecards with In and Out ports at rate R, joined across a backplane by links at rate 2R/N.]

  27. Many fabric options. [Diagram: linecards (In/Out) attached to any permutation network over channels C1, C2, …, CN.] N channels each at rate 2R/N. Options • Space: Full uniform mesh • Time: Round-robin crossbar • Wavelength: Static WDM
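The "Time: Round-robin crossbar" option can be sketched as a fixed cyclic sequence of permutations. This is a minimal illustration assuming the usual cyclic-shift pattern (input i connects to output (i + t) mod N in slot t); the slide does not specify which permutation sequence is used.

```python
def round_robin_schedule(n: int):
    """schedule[t][i] is the output that input i is connected to in time-slot t."""
    return [[(i + t) % n for i in range(n)] for t in range(n)]


N = 4
schedule = round_robin_schedule(N)

for t, slot in enumerate(schedule):
    print(f"slot {t}: " + ", ".join(f"{i}->{o}" for i, o in enumerate(slot)))

# Every slot is a permutation, and over N slots each input meets each output
# exactly once, so each input/output pair receives 1/N of the port rate
# without any scheduler.
```

If each port runs at 2R, the time-shared pair rate is 2R/N, the same as each channel of the uniform mesh.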

  28. Static WDM switching. Array Waveguide Router (AWGR): passive and almost zero power. 4 WDM channels, each at rate 2R/N. [Diagram: the four inputs carry (A, A, A, A), (B, B, B, B), (C, C, C, C) and (D, D, D, D); each of the four outputs receives (A, B, C, D).]
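A small sketch of the static wavelength routing pictured here. It assumes the commonly used cyclic rule, where wavelength w entering input port i leaves on output port (i + w) mod N; the slide does not state the port mapping, so the rule and the names below are illustrative.

```python
def awgr_output(in_port: int, wavelength: int, n: int) -> int:
    """Assumed cyclic AWGR routing: the output depends only on input port and wavelength."""
    return (in_port + wavelength) % n


def received_at_outputs(sources):
    """received[out][w] is the source whose wavelength w arrives at output `out`."""
    n = len(sources)
    received = [[None] * n for _ in range(n)]
    for i, src in enumerate(sources):
        for w in range(n):
            received[awgr_output(i, w, n)][w] = src
    return received


rx = received_at_outputs(["A", "B", "C", "D"])
for out, row in enumerate(rx):
    print(f"output {out} receives, by wavelength: {row}")
```

Each input sends one wavelength to every output and each output receives one wavelength from every input, which gives a full mesh with no active switching.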

  29. Linecard dataflow. [Diagram: In and Out at rate R; WDM multiplexers and demultiplexers over wavelengths λ1, λ2, …, λN; packets numbered 1 to 4 moving through the stages.]

  30. Problems of scale • For N < 64, WDM is a good solution. • We want N = 640. • Need to decompose.

  31. Decomposing the mesh. [Diagram: linecards 1 to 8 on each side, joined by a mesh of links at rate 2R/8.]

  32. Decomposing the mesh. [Diagram: linecards 1 to 8 on each side; labels WDM and TDM; link rates 2R/8 and 2R/4.]

  33. When N is too large: Decompose into groups (or racks). [Diagram: Group/Rack 1 through Group/Rack G, each holding linecards 1 to L with links at rate 2R, connected through an Array Waveguide Router (AWGR) using wavelengths λ1, λ2, …, λG.]

  34. When a linecard is missing • Each linecard spreads its data equally over every other linecard. • Problem: If one is missing, or failed, then the spreading no longer works.

  35. When a linecard fails. [Diagram: three-linecard mesh with links at 2R/3; link loads shown include 2R/3 + 2R/6, 2R/3 + 2R/3 = (4/3)R, and 2R/3 + 2R/6 + 2R/3 + 2R/6 = 2R.] • Solution: • Move light beams • Replace AWGR with MEMS switch. • Reconfigure when linecard added, removed or fails. • Finer channel granularity • Multiple paths.

  36. Solution: Use transparent MEMS switches. [Diagram: Group/Rack 1 through Group/Rack G (G = 40), each holding linecards 1 to L with links at rate 2R, interconnected by MEMS switches with ports 1 to G.] MEMS switches reconfigured only when linecard added, removed or fails. Theorems: 1. Require L+G-1 MEMS switches 2. Polynomial time reconfiguration algorithm

  37. Hybrid Architecture: Logical View. [Diagram: three-stage network over Groups 1 to G. First stage: in each group, Linecards 1 to L feed an LxM Local Switch. Middle stage: GxG Middle Switches 1 to M. Final stage: in each group, an MxL Local Switch feeds Linecards 1 to L.]

  38. Hybrid Electro-Optical Architecture. [Diagram: the same three-stage network over Groups 1 to G. Electronic switches: an LxM Crossbar and an MxL Crossbar per group, serving Linecards 1 to L. Fixed lasers and optical receivers connect them to static GxG MEMS switches 1 to M.]

  39. Number of MEMS Switches. [Diagram: Linecards 1 to 4 on each side, attached to crossbars and joined through static MEMS; all links labelled R.]

  40. Number of MEMS Switches. [Diagram: Linecards 1 to 3 on each side, attached to crossbars and joined through static MEMS; link labels include R, 4R/3, 2R/3 and R/3.]

  41. Number of MEMS needed for a schedule • Li: number of linecards in group i, 1 ≤ i ≤ G. • Group i needs to send to group j: [equation not captured in transcript] • Assume each group can send at most R to each MEMS. Number of MEMS needed between groups i and j: [equation not captured in transcript]

  42. Number of MEMS needed for a schedule • The number of MEMS needed for group i to send to group j is Aij. • The total number of MEMS needed for group i is the sum of the Aij’s

  43. Constraints for the TDM Schedule • Latin Square: In any period N, each transmitting linecard is connected to each receiving linecard exactly once. • MEMS constraint: In any time-slot, there are at most Aij connections between transmitting group i and receiving group j, where: [equation not captured in transcript]

  44. Example • Assume L1=3, L2=2, L3=1 • Then [equation not captured in transcript] • E.g., at most 2 packets from the first group to the first group at each time-slot
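The formula for Aij did not survive in this transcript. The sketch below assumes Aij = ceil(Li * Lj / N), with N the total number of linecards. That choice is an assumption: it is consistent with the statement above (at most 2 connections from the first group to itself when L1=3, L2=2, L3=1) and with the L+G-1 bound quoted earlier, but it should be checked against the paper.

```python
def mems_needed(group_sizes):
    """Assumed formula: A[i][j] = ceil(L_i * L_j / N), where N = sum of group sizes."""
    n = sum(group_sizes)
    return [[(li * lj + n - 1) // n for lj in group_sizes] for li in group_sizes]


group_sizes = [3, 2, 1]            # L1, L2, L3 from the example
A = mems_needed(group_sizes)
per_group = [sum(row) for row in A]

for i, row in enumerate(A, start=1):
    print(f"group {i}: A = {row}, total MEMS links = {sum(row)}")

# Compare with the L + G - 1 bound, taking L as the largest group size.
bound = max(group_sizes) + len(group_sizes) - 1
print("L + G - 1 =", bound)
```

Under this assumption the first group needs 2 + 1 + 1 = 4 MEMS connections, within the bound of 5.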

  45. Bad TDM Transmit Schedule

  46. Good TDM Transmit Schedule
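The schedules pictured on these two slides are not in the transcript. As a stand-in, the sketch below checks two illustrative schedules for L1=3, L2=2, L3=1 against both constraints: a plain round-robin schedule (which turns out to violate the MEMS constraint) and a hand-constructed one that satisfies it. Both schedules, and the assumed limit Aij = ceil(Li * Lj / N), are illustrations rather than the ones on the slides.

```python
def mems_limits(group_sizes):
    """Assumed limit: A[i][j] = ceil(L_i * L_j / N)."""
    n = sum(group_sizes)
    return [[(li * lj + n - 1) // n for lj in group_sizes] for li in group_sizes]


def is_latin(schedule):
    """Each slot is a permutation and each sender meets each receiver exactly once."""
    n = len(schedule)
    slots_ok = all(sorted(slot) == list(range(n)) for slot in schedule)
    pairs_ok = all(sorted(slot[i] for slot in schedule) == list(range(n)) for i in range(n))
    return slots_ok and pairs_ok


def satisfies_mems(schedule, group_sizes):
    """In every slot, group-to-group connection counts stay within the limits."""
    limits = mems_limits(group_sizes)
    grp = [g for g, size in enumerate(group_sizes) for _ in range(size)]
    g_count = len(group_sizes)
    for slot in schedule:
        used = [[0] * g_count for _ in range(g_count)]
        for tx, rx in enumerate(slot):
            used[grp[tx]][grp[rx]] += 1
        if any(used[a][b] > limits[a][b] for a in range(g_count) for b in range(g_count)):
            return False
    return True


GROUPS = [3, 2, 1]  # linecards 0-2, 3-4 and 5

# Round-robin: slot 0 connects every linecard to itself, so group 1 -> group 1
# needs 3 connections at once, more than the limit of 2.
BAD = [[(i + t) % 6 for i in range(6)] for t in range(6)]

# Hand-constructed schedule; GOOD[t][i] is the receiver for sender i in slot t.
GOOD = [
    [4, 1, 0, 2, 3, 5],
    [0, 4, 2, 1, 5, 3],
    [3, 0, 1, 5, 2, 4],
    [5, 2, 3, 0, 4, 1],
    [1, 5, 4, 3, 0, 2],
    [2, 3, 5, 4, 1, 0],
]

for name, sched in (("round-robin", BAD), ("hand-built", GOOD)):
    print(name, "latin:", is_latin(sched), "MEMS ok:", satisfies_mems(sched, GROUPS))
```

Both schedules are Latin squares; only the second respects the per-slot group limits, which is the distinction the configuration algorithm has to enforce.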

  47. Configuration Algorithm • Assign connections between groups, so MEMS constraint is satisfied. • Assign group connections to specific linecards, so there is exactly one connection per linecard pair in the schedule. Comments: • Algorithm is surprisingly complex. • Best running time so far: 40 seconds for 640 linecards.

  48. Challenges. Low-cost, low-power optoelectronic conversion? How to build a 250ms 160Gb/s buffer? [Diagram: linecard dataflow at R = 160Gb/s with Address Lookup and Pkt Switch blocks, In and Out at rate R, and WDM over wavelengths λ1, λ2, …, λG.]
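To give a sense of scale for the buffering challenge, the short calculation below works out how much data a 250ms buffer holds at 160Gb/s. It uses decimal units (1 Gb = 10^9 bits) as an assumption; the names are illustrative.

```python
LINE_RATE_BPS = 160 * 10**9   # 160 Gb/s
BUFFER_MS = 250               # 250 ms of buffering

buffer_bits = LINE_RATE_BPS * BUFFER_MS // 1000
buffer_bytes = buffer_bits // 8

print("buffer size:", buffer_bits / 1e9, "Gb")
print("buffer size:", buffer_bytes / 1e9, "GB")
```

That is 40 Gb, or 5 GB, per linecard, which has to be written and read at line rate.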

  49. What we are building. Chip #1: 160Gb/s Packet Buffer. [Diagram labels: Buffer Manager, 90nm ASIC, 250ms DRAM, 320Gb/s, 160Gb/s in, 160Gb/s out.] Chip #2: 16 x 55 Opto-electronic crossbar. [Diagram labels: CMOS ASIC, 16 x 10Gb/s To Linecards, 55 x 10Gb/s To Optical Fabric, 1500nm Optical source, Optical Detector, Optical Modulator.]

  50. 100Tb/s Load-Balanced Router. [Diagram: Linecard Rack 1 through Linecard Rack G = 40, each holding L = 16 160Gb/s linecards; a Switch Rack (< 100W) holding 40 x 40 MEMS switches numbered 1, 2, …, 55, 56.]
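The headline numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the talk (L = 16, G = 40, R = 160Gb/s, the L+G-1 theorem and the 2R/N mesh rate); the variable names are illustrative.

```python
L = 16          # linecards per rack
G = 40          # linecard racks
R_GBPS = 160    # line rate per linecard, Gb/s

n_linecards = L * G
capacity_tbps = n_linecards * R_GBPS / 1000
mems_required = L + G - 1                   # from the theorem on slide 36
mesh_channel_gbps = 2 * R_GBPS / n_linecards

print("linecards:", n_linecards)
print("aggregate capacity:", capacity_tbps, "Tb/s")
print("MEMS switches required:", mems_required)
print("per-pair mesh rate 2R/N:", mesh_channel_gbps, "Gb/s")
```

The result of 55 required MEMS switches matches the "16 x 55" crossbar on the previous slide; the diagram here numbers switches up to 56.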
