Ethernet Routing for Large Scale Distributed Data Center Fabrics. Dave Allan, János Farkas, Panagiotis Saltsidis, Jeff Tantsura. Ericsson
Introduction • This is a concept and architecture for a distributed Cloud • One purpose is to illustrate the capabilities and the scalability of “state of the art” Ethernet • The components of the proposed architecture are either already standardized or progressing through standardization • The architecture is built on • IEEE Shortest Path Bridging – MAC mode (SPBM) • As standardized in IEEE 802.1aq-2012 • IETF Ethernet Virtual Private Network (EVPN) as extended for SPBM interworking • This is being standardized in draft-ietf-l2vpn-spbm-evp
A Bit of History • Key antecedents to SPB • Provider Backbone Bridges (PBB) [802.1ah] • Full MAC-in-MAC encapsulation • 24-bit I-SID, which is a 24-bit L2 Virtual Network ID • PBB Traffic Engineering (PBB-TE) [802.1Qay] • Enabled external control of bridge forwarding with complete route freedom, i.e. • Software Defined Networking (SDN) with geographical separation [Figure: evolution of the Ethernet frame header, from 802.1D-1990 (DstAddr, SrcAddr, Ethertype, Payload) through 802.1Q-1998 (adds a VID tag) and Provider Bridges (PB) 802.1ad-2005 (S-tag with S-VID, C-tag with C-VID) to Provider Backbone Bridges (PBB) 802.1ah-2008 (B-DA, B-SA, B-tag with B-VID, I-tag with I-SID, followed by the customer frame)]
What is Shortest Path Bridging (802.1aq SPB)? • SPB is a routed Ethernet solution specified by the IEEE: link state routing for bridges • IS-IS aspects documented in IETF RFC 6329 • All control functionality has been collapsed into a single protocol (IS-IS) • Unicast and multicast tree construction, VLAN registration etc. • Two SPB modes are defined: • SPBM: SPB MAC • MAC based • Designed to leverage the scalability provided by PBB MAC-in-MAC • No flooding and learning • Managed environments • SPBV: SPB VID • VID based • Applicable to all types of VLANs • Flooding and learning • Plug & play
What is important to understand about SPBM? • It is compute based: computation instead of signaling • It uses multiple shortest path trees instead of shared spanning trees • Unicast and multicast frames follow the same path between any two points in a given VLAN • So no frame misordering & you get meaningful OAM support • It uses loop mitigation AND loop prevention • It uses edge based load spreading • It is backwards compatible with, and is consistent with the full body of Ethernet standardization (IEEE 802.1) • CFM, EVB, lossless Ethernet etc. • It implements the full MEF 12.1 set of service constructs • E-LINE, E-LAN, E-TREE
Problems Already Solved • Ability to utilize more richly connected topologies • SPBM supports up to 16 way multi-pathing and is extensible to go further • Each multipath instance is a full mesh of the network • Large scale virtualization • PBB data plane scales to billions of virtual networks (24-bit I-SID over 12-bit B-VID: 2^24 × 2^12) • Operational simplicity • All information contained in a single control protocol (IS-IS) • Single touch adds, moves and changes • Computed multicast • Reduced CP messaging combined with a computation driven convergence of unicast & multicast is a virtuous circle…
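The scaling figure quoted on this slide can be checked with a few lines of arithmetic. This is only a sketch of the raw upper bound implied by the field widths; it ignores any reserved I-SID or B-VID values, and the variable names are illustrative.

```python
# Field widths taken from the slide: 24-bit I-SID carried over a 12-bit B-VID.
I_SID_BITS = 24
B_VID_BITS = 12

i_sids = 2 ** I_SID_BITS   # service instances per B-VLAN
b_vids = 2 ** B_VID_BITS   # backbone VLAN identifiers

print(i_sids)              # 16777216
print(b_vids)              # 4096
print(i_sids * b_vids)     # 68719476736, i.e. tens of billions of combinations
```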
Solution Objectives Ubiquity and reach • Interconnect different flavors of “Ethernet”, across the dominant WAN technology (MPLS) Preserve operational simplicity • Preserve “single touch” add/move/delete automation • Minimal configuration • Alignment of BGP and IS-IS control plane paradigms • Break the scaling barriers of a single routing domain • Combined SPBM-EVPN allows much larger topologies • Domain isolation to “divide and conquer” state • Operate each SPBM domain on a “need to know” basis • Non-relevant information is excluded from routing advertisement • Minimize Filtering Database (FDB) state
Solution Overview • There are a number of aspects to the solution • Topology hiding and abstraction • “Need to know” filtering • Independence of local multi-pathing • Multicast summarization [Figure: two SPBM data center networks, DCN1 (B-VID1) and DCN2 (B-VID2), each with a BEB and a PE, joined by an EVPN over MPLS LSPs; the tenant virtual network I-SID1 spans both sites and the EVPN and carries the tenant’s overlay, e.g. IP subnet or VLAN]
SPBM and EVPN • Shortest Path Trees (SPT) are the basic connectivity construct for SPBM • They are edge rooted shortest path trees, much finer grained than shared spanning trees, but they are still TREEs • Which constrains the set of network interconnect mechanisms • The set of fine grained MAC based trees is aggregated into Backbone VLANs (B-VLAN), where each B-VLAN delineates full mesh connectivity • EVPN is IP/MPLS based, and uses BGP to sort out mirroring of attached Ethernet networks • But once in EVPN we can map SPBM connectivity to any paradigm • The trick is interconnecting them
Mapping between SPBM & EVPN • Trees have ROOTs… • Which means interworking needs to pin waypoints, which then allow the required design strategies to work • For SPBM-EVPN interworking, we make the interworking function on the EVPN-PE into a “pinned waypoint” • This has the desirable effect of keeping “churn” in subtending SPBM networks out of BGP • An EVPN-PE that is a “pinned waypoint” for a set of VLANs is known as a “designated forwarder”
Designated Forwarder • The set of EVPN-PEs attached to an SPBM network self elect which subset of VLANs they will act as Designated Forwarder (DF) for • This is based on local B-VID • The DF is then responsible for relaying all required state associated with the subset of VLANs it owns between the two control planes, and for the interworking of data plane traffic between the SPBM and EVPN networks • This state is simply a list of I-SID/B-MAC tuples • No topology information is leaked: the DF condenses all topology behind it down to a single node representation in the peer network • The DF also “re-roots” all (S,G) multicast trees that transit it by “blindly” rewriting “S” (Source)
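The slide says only that DF self election is based on the local B-VID; it does not give the algorithm. The sketch below assumes one plausible scheme, modulo carving over a sorted list of PE identifiers, so that every PE computes the same assignment independently. The function name `elect_df` and the PE names are hypothetical.

```python
def elect_df(pe_ids, b_vids):
    """Assign each local B-VID to exactly one attached PE.

    Assumed scheme (not taken from the slides): sort the PE identifiers and
    pick ordered[b_vid % number_of_pes]. Because every PE runs the same pure
    function on the same inputs, no extra signaling is needed.
    """
    ordered = sorted(pe_ids)
    return {vid: ordered[vid % len(ordered)] for vid in sorted(b_vids)}


# Two hypothetical PEs attached to one SPBM network with three B-VIDs.
assignment = elect_df(["PE2", "PE1"], [10, 11, 12])
print(assignment)   # {10: 'PE1', 11: 'PE2', 12: 'PE1'}
```

Any deterministic function of shared inputs would serve equally well here; the point is only that the PEs agree without negotiating.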
DF Control Plane Interworking • DF has a Control Plane Interworking function • It proxies B-MAC/I-SID announcements from ISIS-SPB into BGP for the set of I-SIDs it is DF for • It will only proxy B-MAC/I-SID announcements from EVPN into ISIS-SPB if there is already locally registered interest in the I-SID • BGP has the whole picture, IS-IS is “need to know” [Figure: the Control Plane Interworking Function in the PE sits between the DC PBBN side (IS-IS PDUs, IS-IS database) and the MPLS WAN side (BGP PDUs, BGP database)]
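The two proxy rules above can be sketched as a pair of filters. This is a minimal illustration of the “need to know” behaviour, not the authors' implementation; the class, field and method names are hypothetical, and real announcements carry far more than a (B-MAC, I-SID) pair.

```python
from dataclasses import dataclass, field


@dataclass
class InterworkingFunction:
    df_isids: set          # I-SIDs this PE is Designated Forwarder for
    local_interest: set    # I-SIDs already registered in the local SPBM domain
    to_bgp: list = field(default_factory=list)    # announcements relayed into BGP
    to_isis: list = field(default_factory=list)   # announcements relayed into IS-IS

    def from_isis(self, b_mac, i_sid):
        # IS-IS -> BGP: proxy only for I-SIDs this PE is DF for.
        if i_sid in self.df_isids:
            self.to_bgp.append((b_mac, i_sid))

    def from_bgp(self, b_mac, i_sid):
        # BGP -> IS-IS: proxy only if the I-SID already has local interest.
        if i_sid in self.local_interest:
            self.to_isis.append((b_mac, i_sid))


iwf = InterworkingFunction(df_isids={100}, local_interest={100})
iwf.from_isis("BEB1", 100)   # relayed into BGP
iwf.from_isis("BEB1", 200)   # dropped: not DF for I-SID 200
iwf.from_bgp("DF2", 100)     # relayed into IS-IS
iwf.from_bgp("DF2", 300)     # dropped: no local interest in I-SID 300
print(iwf.to_bgp, iwf.to_isis)
```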
EVPN-SPBM data plane [Figure: frame headers for traffic between VM1 (behind BEB1 in DCN1) and VM2 (behind BEB2 in DCN2). C-SA, C-DA and I-SID1 are carried unchanged end to end. In DCN1 the frame carries B-VID1 with B-MACs BEB1 and DF1; across the EVPN it is MPLS encapsulated with B-MACs DF1 and DF2 and no B-VID; in DCN2 it carries B-VID2 with B-MACs DF2 and BEB2]
DF Data Plane Procedures • Islands are decoupled by keeping B-Tags out of the EVPN core • What the core sees is MPLS encapsulated B-MACs and I-SIDs • B-Tags stripped by PEs on ingress to EVPN • B-Tags locally added by PEs on egress from EVPN • So the core is independent of however multi-pathing is implemented in each subtending island, or whether a PBBN exists at all (e.g. PBB-PEs) • Multicast MACs are aggregated at SPBM ingress [Figure: unicast interworking at the DF. PBBN to MPLS: Ethernet frames, strip tags, B-MAC lookup, add label stack, MPLS packets. MPLS to PBBN: MPLS packets, strip label stack, B-MAC lookup, add tags, Ethernet frames]
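The strip-on-ingress, add-on-egress rule for B-Tags can be sketched as two small functions. This is a simplified model under stated assumptions: the `PbbFrame` type, the label values and the function names are hypothetical, and the B-MAC lookup and any B-MAC rewriting at the DF are left out.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PbbFrame:
    b_da: str
    b_sa: str
    b_vid: Optional[int]   # None once the B-Tag has been stripped
    i_sid: int
    payload: bytes


def ingress_to_evpn(frame, label_stack):
    """SPBM -> EVPN: strip the B-Tag, then add the MPLS label stack."""
    return (tuple(label_stack), replace(frame, b_vid=None))


def egress_from_evpn(packet, local_b_vid):
    """EVPN -> SPBM: strip the label stack, then add a locally chosen B-Tag."""
    _labels, frame = packet
    return replace(frame, b_vid=local_b_vid)


frame_in = PbbFrame(b_da="DF1", b_sa="BEB1", b_vid=1, i_sid=100, payload=b"data")
packet = ingress_to_evpn(frame_in, [16, 17])   # the core never sees B-VID 1
frame_out = egress_from_evpn(packet, 2)        # the far island uses its own B-VID
print(packet[1].b_vid, frame_out.b_vid, frame_out.i_sid)   # None 2 100
```

Because the B-VID never crosses the core, each island is free to choose its own multi-pathing scheme.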
Add Multicast in the MPLS Core • Objective is to get away from the inefficiencies of edge based replication in the PEs while minimizing the multicast state impact in the core • VLAN emulation can use lots of Multicast Distribution Trees (MDTs) • These can be aggregated into shared MDTs between larger sites • Shared MDTs can substantially reduce the amount of multicast state needed in the MPLS core to service large sites • Smaller sites are more likely to benefit from service specific MDTs • So we will support both
Shared Multicast Distribution Trees • Issue is how to resolve VLANs to shared trees without getting into resolution servers or provisioning • One way to do this is to algorithmically “name” the tree • (*,G) or (S,G) where G is a sorted list of leaf node IDs • Via BGP every PE has sufficient information to construct the names of the MDTs • mLDP permits arbitrary opaque identifiers for MDTs to be used as a multicast FEC so the algorithmically constructed names can be used directly in signaling
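The algorithmic naming idea can be sketched in a few lines. The string layout below is an assumption chosen to match the “PE2+PE3+PE5” notation used in this deck; the actual encoding of an mLDP opaque value is not specified here, and `mdt_name` is a hypothetical helper.

```python
def mdt_name(leaf_pe_ids, source=None):
    """Build a deterministic MDT name from the set of leaf PEs.

    G is the sorted, de-duplicated list of leaf node IDs, so every PE that
    learns the same membership via BGP derives the same name without a
    resolution server or provisioning. The root is '*' for a (*,G) tree or
    the source PE for an (S,G) tree.
    """
    leaves = "+".join(sorted(set(leaf_pe_ids)))
    root = "*" if source is None else source
    return f"({root},{leaves})"


# Each PE lists the members in its own order but derives the same name.
print(mdt_name(["PE3", "PE2", "PE5"]))                 # (*,PE2+PE3+PE5)
print(mdt_name(["PE5", "PE3", "PE2"]))                 # (*,PE2+PE3+PE5)
print(mdt_name(["PE2", "PE3", "PE5"], source="PE3"))   # (PE3,PE2+PE3+PE5)
```

The sort here is plain string order; any ordering works as long as every PE applies the same one.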
Example • PE2, PE3 and PE5 are DFs for a common set of VLANs [Figure, repeated on each build slide of this example: CEs attach through an RSTP network, an 802.1ad PBN and 802.1aq SPBM networks running IS-IS to PE1 through PE6 and a PBB PE7, which are joined by an EVPN + mLDP core using BGP]
Example • PE3: “I am PE 3, and I have 10 VLANs that need (*,G) multicast to myself and PEs 2 and 5, so the FEC is PE2+PE3+PE5”
Example • PE2: “I am PE 2, and I have 10 VLANs that need (*,G) multicast to myself and PEs 3 and 5, so the FEC is PE2+PE3+PE5”
Example • PE5: “I am PE 5, and I have 10 VLANs that need (*,G) multicast to myself and PEs 2 and 3, so the FEC is PE2+PE3+PE5”
Example • Resulting MDT [Figure: the shared tree for FEC PE2+PE3+PE5 shown in the core]
What does this get me? • mLDP, like PIM, is rather chatty, and is based on transactional convergence • If I had 10000 VLANs spread across the 3 sites in the example I WOULD have 10000 (*,G) or 30000 (S,G) trees • For 3 dual homed sites, there are ONLY 8 possible (*,G) and 24 possible (S,G) shared trees • It becomes practical to simply “nail them up” and modify the membership set of each tree at the ingress • Result is both scalable and stable
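The tree counts on this slide can be reproduced by enumeration. The sketch assumes that a shared tree spans all three sites and uses exactly one of the two PEs at each site, which is the reading that yields the slide's figures; the site and PE names are hypothetical.

```python
from itertools import product

# Three dual homed sites, two PEs each (names are placeholders).
sites = {
    "site1": ["A1", "A2"],
    "site2": ["B1", "B2"],
    "site3": ["C1", "C2"],
}

# A (*,G) shared tree is one choice of PE per site: 2 * 2 * 2 combinations.
star_g_trees = [frozenset(choice) for choice in product(*sites.values())]

# An (S,G) shared tree additionally fixes which member PE is the source.
s_g_trees = [(src, tree) for tree in star_g_trees for src in tree]

print(len(star_g_trees), len(s_g_trees))   # 8 24

# Compare with one tree (or one tree per source site) for every VLAN.
vlans = 10000
print(vlans, vlans * len(sites))           # 10000 30000
```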
Key Insights & Next steps • Assumption of rich mesh hidden from SPBM in the first place • Exposing a large, highly regular Clos topology in link state simply burdens the control plane • Some topological summarization is required in the first place to usefully scale individual sites to 100,000+ servers with existing technology • There is a lot that can be done to engineer an SPBM network, both with the vanilla standard and with techniques currently under research • Deterministic aggregated trees lend themselves to “demand engineering” with automation • Work needs to be done to seamlessly extend this into the EVPN realm
Summary • The totality, completeness and self-consistency of IEEE data center networking solutions is impressive • From OAM to Edge Virtual Bridging • SPB permits this to scale orders of magnitude beyond what Ethernet was previously capable of • Adding EVPN as a form of “multi-area” solution adds orders of magnitude beyond what SPB alone can do…