TIPC as TML
draft-maloy-tipc-01.txt
Jon Maloy, Ericsson; Steven Blake, Modularnet; Maarten Koning, WindRiver; Jamal Hadi Salim, Znyx; Hormuzd Khosravi, Intel
IETF-61, Washington DC, Nov 2004
TIPC
• A transport protocol for cluster environments
  • Connectionless and connection-oriented; reliable or unreliable
  • Reliable or unreliable multicast
  • Usage not limited to the ForCES context
• A framework for detecting, supervising and maintaining cluster topology
• Available as a portable open source code package under the BSD licence
  • 12000 lines of C code, 112 kbyte Linux kernel module
  • Runs on 4 operating systems so far, with more to come
• Proven concept, used and deployed in several Ericsson products
ForCES Protocol Framework
[Figure: CE and FE protocol stacks. CE PL and FE PL (ForCES Protocol) exchange ForCES Protocol Messages; each sits on its TML (CE TML, FE TML), which sits on a transport (IP, TCP, RapidIO, Ethernet…).]
TIPC as L2 TML
[Figure: CE PL and FE PL (ForCES Protocol) exchange ForCES Protocol Messages; each sits on a TIPC TML over an L2 transport (RapidIO, Ethernet…).]
[Figure: CE PL and FE PL (ForCES Protocol) over TIPC TML over L2 transport (RapidIO, Ethernet…), drawn with Interface Adaptation boxes; the PLs exchange ForCES Protocol Messages.]
Fulfilling Requirements (1)
• Reliability
  • Reliable transport in all modes
  • Can be made unreliable per socket/direction
• Security
  • Only secure within closed networks
  • No explicit authentication/encryption support yet, but planned
  • Not IP-based: no router will forward TIPC messages
• Congestion Control
  • At three levels: connection/transport, signalling link and carrier level
  • Gives feedback to the PL layer if a connection is broken or a message is rejected
• Multicast/Broadcast
  • Supported
Fulfilling Requirements (2)
• Timeliness
  • Immediate delivery (no Nagle algorithm)
  • Inter-node delivery time in the order of 100 microseconds
• HA Considerations
  • L2 link failure detection and failover handled transparently for the user
  • Connection abortion with error code if no redundant carrier is available
  • Peer node failure detection after 0.5-1.5 seconds
• Encapsulation
  • 24 byte extra header
  • 40 bytes extra for connectionless
• Priorities
  • Supports 4 message importance priorities, determining congestion levels and abort/rejection levels
  • Are 8 levels really needed?
Connection Directly on TIPC
[Figure: CE (CE Object, FB X, FB Y) and FE (FE Object, LFB 1, LFB 2) joined by TIPC.]
Connections via FE/CE Object
[Figure: CE (CE Object, FB X, FB Y) and FE (FE Object, LFB 1, LFB 2) joined by TIPC.]
Connection Usage
• Traffic Data Connection: low priority; reliable CE->FE, unreliable FE->CE
• Control Connection: high priority; reliable in both directions
[Figure: CE (CE Object, FB X, FB Y) and FE (FE Object, LFB 1, LFB 2) joined by TIPC.]
Functional Addressing: Unicast
• Function Address
  • Persistent, reusable 64-bit port identifier assigned by user
  • Consists of type number and instance number
• Function Address Sequence
  • Sequence of function addresses with same type
[Figure: Server Process, Partition A calls bind(type = foo, lower=0, upper=99); Server Process, Partition B calls bind(type = foo, lower=100, upper=199); Client Process calls sendto(type = foo, instance = 33) and the message is labelled foo,33.]
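The unicast lookup on this slide can be modelled in a few lines of Python. This is only an illustrative sketch: the `NameTable` and `Binding` classes and the string server names are hypothetical and are not part of the TIPC API. It shows how a (type, instance) function address resolves to whichever server bound a sequence covering that instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    """One bound function address sequence (hypothetical model type)."""
    type: str
    lower: int
    upper: int
    server: str


class NameTable:
    """Toy name table: maps (type, instance) to the servers bound to it."""

    def __init__(self):
        self._bindings = []

    def bind(self, type_, lower, upper, server):
        if lower > upper:
            raise ValueError("lower must not exceed upper")
        self._bindings.append(Binding(type_, lower, upper, server))

    def lookup(self, type_, instance):
        # A binding matches when the type is equal and the instance
        # falls inside the bound [lower, upper] range.
        return [b.server for b in self._bindings
                if b.type == type_ and b.lower <= instance <= b.upper]


table = NameTable()
table.bind("foo", 0, 99, "partition-A")
table.bind("foo", 100, 199, "partition-B")
print(table.lookup("foo", 33))  # ['partition-A']
```

In this model, the slide's sendto(type = foo, instance = 33) resolves to Partition A, because 33 lies in the range 0-99.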
Address Mapping - Unicast
[Figure: on the CE, FB X (RSVP77) in the CE Object calls tml_bind(RSVP,77) on the TML API, which calls bind(RSVP,77,77) on the TIPC API. On the FE, LFB 1 (Meter44) in the FE Object calls tml_bind(meter,44) on the TML API, which calls bind(meter,44,44) on the TIPC API. CE and FE are joined by TIPC.]
Connection Setup
[Figure: CE is node 8, FE is node 17. On the CE, FB X (RSVP77) has called tml_bind(RSVP,77), mapped to bind(RSVP,77,77). On the FE, LFB 1 (Meter44) calls tml_connect(RSVP,77, CEID=8), mapped to connect(RSVP,77,node=8).]
• If instance numbers are coordinated over the whole cluster, there is no need for LFBs to know the CEID
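The mapping from TML calls to TIPC calls shown on these two slides can be sketched as follows. The function bodies are assumptions made for illustration: the slides give only the call names and arguments, so the returned tuples merely record which TIPC call would be issued and with what arguments.

```python
def tml_bind(func_type, instance):
    """Map a TML bind onto a TIPC bind of the one-element range
    [instance, instance], as in tml_bind(RSVP,77) -> bind(RSVP,77,77)."""
    return ("bind", func_type, instance, instance)


def tml_connect(func_type, instance, ceid=None):
    """Map a TML connect onto a TIPC connect.

    With a CEID, the lookup is restricted to that node, as in
    connect(RSVP,77,node=8). Without one, the lookup is cluster-wide,
    which only works if instance numbers are unique in the cluster.
    """
    if ceid is None:
        return ("connect", func_type, instance)
    return ("connect", func_type, instance, ("node", ceid))


print(tml_bind("RSVP", 77))             # ('bind', 'RSVP', 77, 77)
print(tml_connect("RSVP", 77, ceid=8))  # ('connect', 'RSVP', 77, ('node', 8))
print(tml_connect("RSVP", 77))          # ('connect', 'RSVP', 77)
```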
Functional Addressing: Multicast
• Based on Function Address Sequences
• Any partition overlapping with the range used in the destination address will receive a copy of the message
• Client defines "multicast group" per call
[Figure: Server Process, Partition A calls bind(type = foo, lower=0, upper=99); Server Process, Partition B calls bind(type = foo, lower=100, upper=199); Client Process calls sendto(type = foo, lower = 33, upper = 133) and a copy labelled foo,33,133 goes to each partition.]
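The overlap rule above can be illustrated with a small Python sketch. The binding tuples and the `mcast_destinations` helper are hypothetical names used only for this model; the range test itself is the standard closed-interval overlap check.

```python
def overlaps(lo1, hi1, lo2, hi2):
    """True when the closed ranges [lo1, hi1] and [lo2, hi2] intersect."""
    return lo1 <= hi2 and lo2 <= hi1


# (type, lower, upper, server) tuples, mirroring the bind() calls on the slide.
bindings = [
    ("foo", 0, 99, "partition-A"),
    ("foo", 100, 199, "partition-B"),
]


def mcast_destinations(bindings, type_, lower, upper):
    """Every server whose bound range overlaps the destination range."""
    return [server for (t, lo, hi, server) in bindings
            if t == type_ and overlaps(lo, hi, lower, upper)]


print(mcast_destinations(bindings, "foo", 33, 133))
# ['partition-A', 'partition-B']
```

The destination range 33-133 straddles the boundary between the two partitions, so both receive a copy. A range such as 0-50 would reach Partition A only.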
Address Mapping - Multicast
[Figure: on the CE, FB X (RSVP77) in the CE Object calls tml_mcast(meter_mc, group=X), mapped to sendto(meter_mc,X,X). On the FE, Meter13 and Meter44 in the FE Object each call tml_join(meter_mc,X), mapped to bind(meter_mc,X,X). CE and FE are joined by TIPC.]
Why TIPC in ForCES?
• Congestion control at three levels
  • Connection level, signalling link level and media level
  • Based on 4 importance priorities
• Simple to configure
  • Each node needs to know its own identity, that is all
  • Automatic neighbour detection using multicast/broadcast
• Lightweight, reactive connections
  • Immediate connection abortion at node/process failure or overload
• Topology Subscription Service
  • Functional and physical topology
Functional View
[Figure: layered block diagram of TIPC. Labels, top to bottom:
  User Adapter API: Socket API Adapter, Port API Adapter, Other API Adapters
  Address Subscription, Address Resolution, Address Table Distribution
  Connection Supervision, Route/Link Selection, Reliable Multicast
  Neighbour Detection, Link Establish/Supervision/Failover
  Node Internal, Fragmentation/De-fragmentation, Packet Bundling, Congestion Control, Sequence/Retransmission Control
  Bearer Adapter API: Ethernet, UDP, SCTP, Infiniband, Mirrored Memory]
Network Topology
[Figure: Zone <1> contains Cluster <1.1> and Cluster <1.2>, the latter with Node <1.2.3>. Zone <2> contains Cluster <2.1> with Slave Node <2.1.3333>. An Internet/Intranet cloud also appears in the diagram.]
Location Transparency
• Location of server not known by client
• Lookup of physical destination performed on-the-fly
• Efficient, no secondary messaging involved
[Figure, three builds: the first shows only Node <1.1.1>, the second adds Node <1.1.2>, the third adds Node <1.1.3>. In every build, Server Process, Partition A calls bind(type = foo, lower=0, upper=99), Server Process, Partition B calls bind(type = foo, lower=100, upper=199), and Client Process calls the same sendto(type = foo, lower = 33, upper = 133), with the message labelled foo,33,133.]
Address Binding
• Many sockets may bind to same partition
  • Closest-First or Round-Robin algorithm chosen by client
• Same socket may bind to many partitions
• Same socket may bind to different functions
[Figure, three builds. In each, Client Process calls sendto(type = foo, lower = 33, upper = 133), with the message labelled foo,33,133.
  Build 1: Server Process, Partition A and Server Process, Partition A' both call bind(type = foo, lower=0, upper=99).
  Build 2: Server Process, Partition B calls bind(type = foo, lower=100, upper=199); Server Process, Partition A+B' calls bind(type = foo, lower=0, upper=99) and bind(type=foo, lower=100, upper=199).
  Build 3: Server Process, Partition B calls bind(type = foo, lower=100, upper=199); Server Process, Partition A calls bind(type = foo, lower=0, upper=99) and bind(type=bar, lower=0, upper=999).]
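When several sockets are bound to the same partition, the client chooses how one of them is selected. The sketch below models the two policies named on the slide. The `Selector` class is hypothetical, and reading "Closest-First" as "prefer a socket on the client's own node" is an assumption; the slides do not define the policy in detail.

```python
import itertools


class Selector:
    """Chooses among sockets bound to the same partition (toy model)."""

    def __init__(self, bindings):
        # bindings: list of (node, socket_name) pairs.
        if not bindings:
            raise ValueError("at least one binding is required")
        self._bindings = list(bindings)
        self._cycle = itertools.cycle(self._bindings)

    def round_robin(self):
        """Return the bound sockets in turn, wrapping around."""
        return next(self._cycle)

    def closest_first(self, client_node):
        """Prefer a socket on the client's node; else fall back to the first."""
        local = [b for b in self._bindings if b[0] == client_node]
        return (local or self._bindings)[0]


sel = Selector([("1.1.1", "partition-A"), ("1.1.2", "partition-A'")])
print([sel.round_robin()[1] for _ in range(3)])
# ['partition-A', "partition-A'", 'partition-A']
print(sel.closest_first("1.1.2"))  # ('1.1.2', "partition-A'")
```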
Functional Topology Subscription
• Function Address/Address Partition bind/unbind events
[Figure: Server Process, Partition A calls bind(type = foo, lower=0, upper=99); Server Process, Partition B calls bind(type = foo, lower=100, upper=199); Client Process calls subscribe(type = foo, lower = 0, upper = 500) and receives the events foo,0,99 and foo,100,199.]
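A subscription can be pictured as a callback fired whenever a bind or unbind overlaps the subscribed range. The following is a minimal sketch with a hypothetical `NameTable` class; it reports only events that occur after the subscription is made, which is a simplification of this model rather than a statement about TIPC.

```python
class NameTable:
    """Toy name table that notifies subscribers of bind/unbind events."""

    def __init__(self):
        self._subs = []

    def subscribe(self, type_, lower, upper, callback):
        self._subs.append((type_, lower, upper, callback))

    def _notify(self, event, type_, lower, upper):
        for (t, lo, hi, cb) in self._subs:
            # Deliver the event when the types match and the ranges overlap.
            if t == type_ and lo <= upper and lower <= hi:
                cb(event, type_, lower, upper)

    def bind(self, type_, lower, upper):
        self._notify("bind", type_, lower, upper)

    def unbind(self, type_, lower, upper):
        self._notify("unbind", type_, lower, upper)


events = []
table = NameTable()
table.subscribe("foo", 0, 500, lambda *e: events.append(e))
table.bind("foo", 0, 99)      # Partition A comes up
table.bind("foo", 100, 199)   # Partition B comes up
table.bind("bar", 0, 999)     # different type: no event for this subscriber
table.unbind("foo", 0, 99)    # Partition A goes away
print(events)
# [('bind', 'foo', 0, 99), ('bind', 'foo', 100, 199), ('unbind', 'foo', 0, 99)]
```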
Network Topology Subscription
• Node/Cluster/Zone availability events
• Same mechanism as for function events
[Figure: on Node <1.1.3>, TIPC calls bind(type = node, lower=0x1001003, upper=0x1001003); on Node <1.1.2>, TIPC calls bind(type = node, lower=0x1001002, upper=0x1001002). Client Process on Node <1.1.1> calls subscribe(type = node, lower = 0x1001000, upper = 0x1001009) and receives the events node,0x1001003 and node,0x1001002.]
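The slide pairs Node <1.1.3> with the instance number 0x1001003, which suggests that the zone, cluster and node numbers are packed into one 32-bit value. The sketch below assumes a layout of 8 bits for the zone and 12 bits each for cluster and node. That layout is inferred from the slide's numbers and is an assumption here; the function names are hypothetical.

```python
def tipc_addr(zone, cluster, node):
    """Pack <zone.cluster.node> into 32 bits (assumed 8/12/12 bit layout)."""
    if not (0 <= zone < 2**8 and 0 <= cluster < 2**12 and 0 <= node < 2**12):
        raise ValueError("address field out of range")
    return (zone << 24) | (cluster << 12) | node


def tipc_unpack(addr):
    """Inverse of tipc_addr: return the (zone, cluster, node) fields."""
    return (addr >> 24) & 0xFF, (addr >> 12) & 0xFFF, addr & 0xFFF


print(hex(tipc_addr(1, 1, 3)))  # 0x1001003
print(tipc_unpack(0x1001002))   # (1, 1, 2)

# The subscription range from the slide covers nodes 0..9 of cluster <1.1>.
lower, upper = 0x1001000, 0x1001009
print(lower <= tipc_addr(1, 1, 3) <= upper)   # True
print(lower <= tipc_addr(1, 1, 10) <= upper)  # False
```

Under this assumed layout, a node subscription is just a function-address subscription whose instance range spans a block of packed node addresses.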
ForCES Applied on TIPC
[Figure: a Network Equipment box containing one Control Element (OSPF, RIP, COPS, CLI, SNMP, Other Applications) and one Forwarding Element holding LFB <IPv4F,1>, LFB <IPv4F,5>, LFB <CNT,17> and LFB <CNT,32>, linked by ForCES Protocol/TIPC.]
[Figure: the same Network Equipment with three Control Elements and two Forwarding Elements, linked by ForCES Protocol/TIPC, with the Internet on both sides.]
CONNECTIONS
• Establishment based on functional addressing
  • Selectable lookup algorithm, partitioning, redundancy etc.
• No protocol messages exchanged during setup/shutdown
  • Only payload-carrying messages
• Traditional TCP-style connection setup/shutdown as alternative
• End-to-end flow control
• SOCK_SEQPACKET
• SOCK_STREAM
• SOCK_RDM for connectionless and multicast
• SOCK_DGRAM can easily be added if needed
  • Same with "Unreliable SOCK_SEQPACKET"
CONNECTIONS
• No protocol messages exchanged during setup/shutdown
  • Only payload-carrying messages
[Figure, three builds: (1) Client Process calls sendto(type = foo, instance = 117) and the message foo,117 reaches Server Process, Partition B; (2) the server calls connect(client) and send(); (3) the client calls connect(server).]
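The three steps above can be modelled to show that only payload messages cross the wire. This is a toy model under stated assumptions: the `Endpoint` class and the shared `wire` list are hypothetical, name lookup is skipped (the destination is passed as an object), and connect() is modelled as a purely local state change.

```python
class Endpoint:
    """Toy socket endpoint; every message on the wire is recorded."""

    def __init__(self, name, wire):
        self.name = name
        self.wire = wire      # shared list acting as the network
        self.peer = None      # set by connect(); local state only
        self.inbox = []

    def sendto(self, dest, payload):
        self.wire.append((self.name, dest.name, payload))
        dest.inbox.append((self, payload))

    def connect(self, peer):
        # No message is sent: the endpoint just remembers its peer.
        self.peer = peer

    def send(self, payload):
        if self.peer is None:
            raise RuntimeError("not connected")
        self.sendto(self.peer, payload)


wire = []
client = Endpoint("client", wire)
server = Endpoint("server-B", wire)

client.sendto(server, "request")   # step 1: first payload message
sender, _ = server.inbox[0]
server.connect(sender)             # step 2: server connects to the sender...
server.send("reply")               # ...and answers with payload
sender, _ = client.inbox[0]
client.connect(sender)             # step 3: client connects to the server

print(wire)
# [('client', 'server-B', 'request'), ('server-B', 'client', 'reply')]
```

Both endpoints end up connected after two payload messages, with no separate handshake traffic in this model.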
CONNECTIONS
• Immediate "abortion" event in case of:
  • peer process crash
  • peer node crash
  • communication failure
  • node overload
[Figure, four builds: Client Process and Server Process, Partition B, shown from the second build onward on Node <1.1.5> and Node <1.1.3>, with an abort marker at one endpoint.]
Network Redundancy
• Retransmission protocol and congestion control at signalling link level
• Normally two links per node pair, for full load sharing and redundancy
• Smooth failover in case of single link failure, with no consequences for user level connections
[Figure: Client Process and Server Process, Partition B on Node <1.1.5> and Node <1.1.3>.]
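Load sharing and failover over a link pair can be sketched as below. The `LinkPair` class and link names are hypothetical, and alternating between links per message is an assumed policy; the slides state only that two links share load and that a single link failure is invisible to user connections.

```python
class LinkPair:
    """Toy model of two links between a node pair, with failover."""

    def __init__(self, names=("link-A", "link-B")):
        self.up = {name: True for name in names}
        self._next = 0

    def fail(self, name):
        self.up[name] = False

    def pick(self):
        """Choose the link for the next message among those still up."""
        active = [name for name, ok in self.up.items() if ok]
        if not active:
            # No redundant carrier left: the connection must be aborted.
            raise ConnectionAbortedError("no carrier available")
        link = active[self._next % len(active)]
        self._next += 1
        return link


pair = LinkPair()
print([pair.pick() for _ in range(4)])
# ['link-A', 'link-B', 'link-A', 'link-B']
pair.fail("link-A")
print(pair.pick())  # link-B: traffic continues on the surviving link
```

Only when both links are down does pick() raise, which corresponds to the connection abortion with error code mentioned under HA considerations.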
Remaining Work
Implementation
• Reliable multicast not fully implemented yet (expected end of Q1)
• Re-stabilization after most recent changes
• Re-implementation of multi-cluster neighbour detection and link setup
Protocol
• Fully manual inter-cluster link setup
• Guaranteeing Name Table consistency between clusters
• Slave node Name Table reduction
• ?????