Ethernet Data Center Routing Challenges and 802.1aq/SPB new work PETER ASHWOOD-SMITH peterashwoodsmith@huawei.com
[Diagram labels: A) "Tweak Bridge Priorities Here"; B) S1 … S16]
802.1aq’s 16 ECT can give a perfect spread going 2 hops over 16 uplinks. However:
A) We need to tweak the 2nd-layer switch priorities to guarantee all 16 are used.
B) We need at least 16 subnets (C-/S-VLANs) to assign one per 802.1aq B-VID.
Can we eliminate ‘tweaking’*?
• David Allan et al. have a presentation on this, so I won’t spend much time on it.
• In general, a network with N equal-cost paths from ‘some source’ to ‘some destination’ requires #ECT about 25-40% greater than N (to statistically capture them all).
• Therefore, when #ECT == N, some ‘tweaking’ is usually required (for a DC it is trivial to do, however).
• Dave et al. suggest non-independence between ECT algorithms as a way to address this (maximize diversity) …
*Tweaking = adjusting Bridge Priorities up/down from their defaults.
“Example” 802.1aq switching cluster (assume 100GE NNI links/groups)
[Diagram: upper layer A1 … A16, lower layer B1 … B32, servers S1,1 … S32,160. Labels: 32 x 100GE per An; 16 x 100GE and 160 x 10GE per Bn; 5120 x 10GE in total; 16 x 32 x 100GE = 51.2T using 48 x 2T switches. Good numbers: “16” and “2” levels.]
• 48-switch non-blocking 2-layer L2 fabric
• 16 at the “upper” layer, A1..A16
• 32 at the “lower” layer, B1..B32
• 16 uplinks per Bn and 160 UNI links per Bn
• 32 downlinks per An
• (16 x 100GE per Bn) x 32 = 512 x 100GE = 51.2T
• 160 x 10GE server links (UNI) per Bn
• (32 x 160)/2 = 2560 servers @ 2 x 10GE each
• uFIB = 16 x 48 B-MAC = 768 entries
• mFIB = 16 subnets x 48 sources = 768 entries
• Total: 1536 FIB entries per node
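The sizing arithmetic on this slide can be rechecked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and every input is a figure quoted on the slide.

```python
# Sizing sketch for the two-layer example fabric.
# All inputs are the slide's figures; the names are illustrative only.
UPPER = 16             # upper-layer switches A1..A16
LOWER = 32             # lower-layer switches B1..B32
UPLINKS_PER_B = 16     # 100GE NNI uplinks per Bn
UNI_PER_B = 160        # 10GE server (UNI) links per Bn
LINKS_PER_SERVER = 2   # each server attaches with 2 x 10GE
ECT = 16               # one B-VID per ECT algorithm
SUBNETS = 16           # one subnet per B-VID

switches = UPPER + LOWER                   # 48
nni_links = UPLINKS_PER_B * LOWER          # 512 x 100GE
nni_tbps = nni_links * 100 / 1000          # 51.2 Tb/s of uplink capacity
uni_links = UNI_PER_B * LOWER              # 5120 x 10GE
uni_tbps = uni_links * 10 / 1000           # 51.2 Tb/s of server capacity
servers = uni_links // LINKS_PER_SERVER    # 2560
ufib = ECT * switches                      # 768 unicast entries
mfib = SUBNETS * switches                  # 768 multicast entries
fib_per_node = ufib + mfib                 # 1536

print(switches, nni_links, nni_tbps, uni_tbps, servers, fib_per_node)
```

Uplink and server-facing capacity come out equal, which is what makes the fabric non-blocking.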
[Diagram labels: ECT-ALG#12, Source Node (1), S1 … S16]
For a given ECT-ALGk, Aj is a member of every SPF-TREE(B*, ECT-ALGk).
Properly tuned, no two ECT-ALGorithms will use the same Aj as a fork point.
Subnet Ni maps to I-SIDj and then to a unique A (j mod 16)
[Diagram: A1 … A16 over B1 … B32; I-SIDi and I-SIDj each transit a single A.]
So load spreading allows each Ai to transit a complete subnet.
Problem #1: unable to spread further, such that Ai and Aj (i != j) each handle a subset of the flows in I-SIDj.
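As a rough illustration of the "j mod 16" mapping, the sketch below assigns each I-SID to one upper-layer switch. The I-SID values are hypothetical. Shifting the 0-based result of `j mod 16` onto the names A1..A16 is also an assumption, since the slide does not say how the index is numbered.

```python
NUM_A = 16  # upper-layer switches A1..A16


def transit_switch(isid: int) -> str:
    """Return the single A switch that carries all traffic of this I-SID.

    Assumption: the 0-based value (isid mod 16) is shifted onto A1..A16.
    """
    return f"A{isid % NUM_A + 1}"


# Hypothetical I-SID values, one per subnet.
by_switch: dict[str, list[int]] = {}
for isid in range(100, 132):
    by_switch.setdefault(transit_switch(isid), []).append(isid)

for name in sorted(by_switch, key=lambda n: int(n[1:])):
    print(name, by_switch[name])
```

Each I-SID lands on exactly one A switch, so no flow of that subnet can use another A. That is the granularity limit stated as Problem #1.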
This is an issue under failure of Aj
[Diagram: A1 … A16 over B1 … B32, carrying I-SIDi and I-SIDj.]
Recovery will move the entire subnet's traffic to another Ai node.
A preferable solution is to spread the affected load over the remaining A*.
Possible solution: head-end hashing (unicast only)
[Diagram: A1 … A16 over B1 … B32, carrying I-SIDi and I-SIDj; legend: Unicast, Mcast.]
Allow unicast I-SIDi and I-SIDj traffic to be hashed, based on smaller flows, to different B-VIDs (ECT-ALGorithms).
This breaks the symmetry and congruence rules but allows edge balancing at a smaller granularity.
No changes to multicast.
Requires learning <C-DA, B-DA>, independent of B-VID.
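The following sketch shows head-end hashing in its simplest form, together with one possible way to handle a failed A switch. It is not the presentation's method: the flow key, the use of CRC32, the B-VID numbering and the "rehash only the affected flows" rule are all assumptions made for illustration.

```python
import zlib

# Hypothetical B-VID numbers, one per ECT-ALG (and hence one per A switch).
B_VIDS = list(range(1, 17))


def pick_b_vid(c_sa: str, c_da: str, candidates: list[int]) -> int:
    """Hash a customer flow onto one of the candidate B-VIDs."""
    key = f"{c_sa}>{c_da}".encode()
    return candidates[zlib.crc32(key) % len(candidates)]


def pick_with_failures(c_sa: str, c_da: str, failed: set[int]) -> int:
    """Keep a flow on its first choice unless that B-VID has failed.

    Only affected flows are rehashed, over the surviving B-VIDs, so the
    failed switch's load is spread instead of moving as one block.
    """
    first = pick_b_vid(c_sa, c_da, B_VIDS)
    if first not in failed:
        return first
    survivors = [b for b in B_VIDS if b not in failed]
    return pick_b_vid(c_sa, c_da, survivors)


# Flows inside one subnet (one I-SID) now spread over several B-VIDs.
flows = [(f"srv{s}", f"srv{d}") for s in range(40) for d in range(40) if s != d]
used = {pick_b_vid(s, d, B_VIDS) for s, d in flows}
print(sorted(used))
```

Because the choice is made per flow rather than per I-SID, the two directions of a conversation may take different paths, which is the loss of symmetry and congruence noted on the slide.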
Interconnection of fabrics creates more than 16 paths (exponential)
[Diagram: two fabrics (A1 … A16 over B1 … B32) joined through C1 and C2; path counts O(16) within a fabric, O(16x2) via the C layer, O(16x2x16) across both fabrics.]
The number of paths can grow exponentially with increasing levels.
A constant number of ECT paths is always << the number of paths available in many networks.
Growing 802.1aq ECT to, say, 32 or even 100 ECMP causes larger unicast FIBs.
Horizontal growth: not too bad, but needs more ECT-ALGORITHMS.
[Diagram: the fabric extended with A17, B33 and B34 alongside A1 … A16 and B1 … B32.]
Horizontal growth by 1 just increases the number of ECT by 1.
Not too big a problem, but we would need to define new ECT (via Opaque).
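General Issue
[Diagram: source S to destination D; O(degree) fan-out per hop, O(diameter) hops; the head end chooses a path from N x B-VID.]
#paths ~= O(degree^diameter)
So head-end ECT in the worst case requires O(exp) B-VIDs, i.e. a number of B-VIDs that grows exponentially.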
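To get a feel for the growth, the sketch below evaluates the worst-case bound for a few diameters. It assumes the slide's garbled expression is degree raised to the diameter, and it takes one B-VID per selectable path as the head-end cost. The degree of 16 is borrowed from the example fabric.

```python
def worst_case_paths(degree: int, diameter: int) -> int:
    """Upper bound on distinct paths: a 'degree'-way choice at each of 'diameter' hops."""
    return degree ** diameter


for diameter in (1, 2, 3, 4):
    # Head-end selection needs one B-VID per path it wants to be able to pick.
    print(f"diameter={diameter}: up to {worst_case_paths(16, diameter)} paths")
```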
A feasible solution …
[Diagram: S to D over a single B-VID; each hop chooses a path from N x next hop.]
Re-assign traffic to a path at each hop: tandem “ECMP”, just like IP.
Need to keep O(degree) next hops.
Only need one B-VID, which removes O(diameter) from the state cost.
The flip side is that you have no control; you just hope for a fine-scale statistical distribution.
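A minimal sketch of tandem (per-hop) selection follows. The topology, the node names and the CRC32-based choice are all hypothetical. The point is only that each node stores its own short list of next hops and picks one per flow, with no end-to-end path identifier.

```python
import zlib

# Hypothetical topology: each node's equal-cost next hops toward destination "D".
NEXT_HOPS = {
    "S": ["A1", "A2"],
    "A1": ["B1", "B2"],
    "A2": ["B1", "B2"],
    "B1": ["D"],
    "B2": ["D"],
}


def forward(flow: str, src: str = "S", dst: str = "D") -> list[str]:
    """Walk a flow hop by hop; every node makes its own hashed choice."""
    path = [src]
    node = src
    while node != dst:
        choices = NEXT_HOPS[node]
        # Mixing the node name into the key keeps successive hops from
        # making the same correlated choice for every flow.
        index = zlib.crc32(f"{node}|{flow}".encode()) % len(choices)
        node = choices[index]
        path.append(node)
    return path


for flow in ("flow-1", "flow-2", "flow-3", "flow-4"):
    print(flow, forward(flow))

# State per node is bounded by its degree, not by the number of end-to-end paths.
print(max(len(hops) for hops in NEXT_HOPS.values()))
```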
What about loops in this mode?
The 802.1aq ingress check is very strong in the case of a single next hop, and hence a single possible ingress for an SA.
The 802.1aq ingress check is weakened in the case of multiple next hops, and hence multiple possible ingresses for an SA.
However, the 802.1aq Agreement Protocol functions correctly in the context of multiple possible next hops for the same B-VID (refer to Mick’s proof). But …
Agreement Protocol Concerns
Is it too complex? It is clearly non-trivial; we need implementation/emulation experience.
Is it overly Draconian? For example, the bounds on movement are what is required for a mathematical proof by induction. However, there are probably many cases where further movement would not loop. What is the degree of ‘overkill’?
Is it marketable? This is unfortunately a legitimate concern!
802.1aq can be deployed without AP until we introduce hash-based forwarding, at which point we require a symmetric AP and/or an on-data-path loop detection/drop mechanism.
We believe that an on-data-path loop detection mechanism is required for hash-based ECMP until we have more experience with AP.
Recommend we standardize a TTL TAG, either stand-alone or as a new form of I-TAG.
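The idea behind an on-data-path TTL check can be sketched as below. This shows only the generic hop-count technique. The slide does not define a TAG format, so the function name, the per-hop decrement and the drop-at-zero rule are assumptions.

```python
def forward_with_ttl(next_hop: dict[str, str], src: str, dst: str, ttl: int):
    """Forward a frame hop by hop, dropping it once its TTL is used up."""
    node = src
    hops = 0
    while node != dst:
        if ttl == 0:
            return ("dropped", hops)   # loop or over-long path: discard the frame
        ttl -= 1                       # each transit node decrements the TTL
        node = next_hop[node]
        hops += 1
    return ("delivered", hops)


# Hypothetical forwarding state: one consistent, one with a transient loop.
healthy = {"X": "Y", "Y": "Z", "Z": "D"}
looping = {"X": "Y", "Y": "Z", "Z": "X"}

print(forward_with_ttl(healthy, "X", "D", ttl=8))
print(forward_with_ttl(looping, "X", "D", ttl=8))
```

With consistent state the frame arrives well inside its TTL budget. With the loop it is discarded after a bounded number of hops instead of circulating until the topology reconverges.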
View of New Work Requirements
R1) New ECT-ALGorithms with improved spreading properties.
R2) Allow optional head-end hash assignment of 802.1aq SPBM UNI known unicast traffic to one of multiple next-hop interfaces/B-VIDs. Very similar to link aggregation. Minimally HASH(seed, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO]).
R3) Allow optional tandem hash assignment of 802.1aq SPBM B-VID NNI unicast traffic to one of multiple next-hop interfaces. Essentially a new SPBM ECT-ALG with its own B-VID (i.e. new ECT-ALGorithms, all usable at the same time). Minimally HASH(seed, B-VID, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO]).
R4) Minor OA&M changes in support of R2 and R3, because symmetry/congruence is broken.
R5) More experience with AP (emulations, simulations, etc.), plus the addition of a TTL to a new I-TAG or a TTL-TAG.
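The two HASH(...) input lists in R2 and R3 differ only in whether the B-VID is part of the key. The sketch below follows the field order given on the slide. The string encoding, the CRC32 function and the helper names `flow_hash` and `choose` are assumptions, and a real implementation would hash the raw header fields.

```python
import zlib
from typing import Optional, Sequence


def flow_hash(seed: int, c_sa: str, c_da: str, c_vid: int,
              ip: Optional[Sequence] = None, b_vid: Optional[int] = None) -> int:
    """Hash the R2 field list, or the R3 list when a B-VID is supplied.

    'ip' is the optional (IP.SA, IP.DA, IP.PROTO) triple from the slide.
    """
    fields: list = [seed]
    if b_vid is not None:          # R3: tandem hashing also keys on the B-VID
        fields.append(b_vid)
    fields += [c_sa, c_da, c_vid]
    if ip is not None:             # optional IP fields refine the flow key
        fields += list(ip)
    key = "|".join(str(f) for f in fields).encode()
    return zlib.crc32(key)


def choose(options: Sequence, h: int):
    """Map a hash value onto one of the available interfaces or B-VIDs."""
    return options[h % len(options)]


# R2-style head-end choice among hypothetical B-VIDs 1..16.
h2 = flow_hash(1, "00:00:00:00:00:01", "00:00:00:00:00:02", 10,
               ip=("10.0.0.1", "10.0.0.2", 6))
print(choose(list(range(1, 17)), h2))

# R3-style tandem choice among hypothetical next-hop interfaces.
h3 = flow_hash(1, "00:00:00:00:00:01", "00:00:00:00:00:02", 10, b_vid=4001)
print(choose(["if1", "if2", "if3"], h3))
```

The seed lets different nodes, or different ECT-ALGs, make different choices for the same flow. Changing only the seed changes the hash value, though not necessarily the option selected after the modulo.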