Reza Sherafat Kazemzadeh * Hans-Arno Jacobsen
University of Toronto
IEEE SRDS, October 6, 2011
Partition-Tolerant Distributed Publish/Subscribe Systems
Content-Based Publish/Subscribe
• Example: stock quote dissemination among publishers and subscribers in NY, London, and Toronto.
• Publishers (P) publish into the pub/sub broker network; subscribers (S) register their interests.
• Trader 1: sub = [STOCK=IBM]; Trader 2: sub = [CHANGE>-8%]
[Figure: publishers and subscribers connected through a content-based pub/sub broker overlay]
Goals
• Fault-tolerance (against concurrent failures):
  • Broker crashes
  • Link failures
  • Recoveries
• Reliability:
  • Publications match subscriptions
  • Per-source in-order delivery
  • Exactly-once delivery (no loss, no duplicates) after some point in time
• Assumptions:
  • Clients are light-weight (the broker network is responsible for reliability)
  • There is a time t after which the system provides guaranteed delivery
System Architecture
• Tree dissemination networks: one path from source to destination.
  • Pros: simple, loop-free; preserves publication order (difficult for non-tree content-based P/S)
  • Cons: trees are highly susceptible to failures
• Primary tree: the initial spanning tree formed as brokers join the system.
• ∆-neighborhood knowledge:
  • Knowledge of other brokers within distance ∆ (join algorithm)
  • Knowledge of routing paths within the neighborhood (subscription propagation algorithm)
  • Maintaining neighborhood knowledge allows brokers to reconfigure the overlay on the fly after failures
  • ∆ is a configuration parameter; it ensures handling of ∆-1 concurrent failures in the worst case
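The ∆-neighborhood described above can be sketched as a bounded breadth-first search over the primary tree. This is a minimal illustration, not the paper's implementation; the adjacency map and broker names are hypothetical.

```python
from collections import deque

def delta_neighborhood(adj, broker, delta):
    """Brokers within `delta` hops of `broker` on the primary tree
    (the broker itself excluded), found by bounded BFS."""
    seen = {broker}
    queue = deque([(broker, 0)])
    neighborhood = set()
    while queue:
        node, dist = queue.popleft()
        if dist == delta:
            continue  # do not expand past delta hops
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                neighborhood.add(nbr)
                queue.append((nbr, dist + 1))
    return neighborhood

# Hypothetical chain of brokers A - B - C - D - E
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
       "D": ["C", "E"], "E": ["D"]}
```

For example, with ∆ = 1 broker C knows only {B, D}; with ∆ = 2 it knows {A, B, D, E}, matching the nested 1-, 2-, 3-neighborhoods on the slide.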
Overview of the Approach
• Single-chain example (brokers arranged along one primary-tree path).
Overlay Management Algorithm
Maintains end-to-end connectivity despite failures in the overlay.
Overlay Partitions
• Once the primary tree is set up, brokers communicate with their immediate primary-tree neighbors through FIFO links.
• Overlay partitions: broker crashes or link failures create "partitions", and the brokers "on the partition" become unreachable from neighboring brokers.
• Active connections: at each point in time, a broker tries to maintain a connection to its closest reachable neighbor in the primary tree.
• Only active connections are used by brokers.
• Example: with broker D failed, C's partition detector forms pid1 = <C, {D}> and C establishes an active connection to E, bypassing D.
Overlay Partitions: 2 Adjacent Failures
• What if there are more failures, particularly adjacent ones?
• If ∆ is large enough, the same process can be used for larger partitions.
• Example: with D and E failed, C forms pid1 = <C, {D}> and then pid2 = <C, {D, E}>, and establishes an active connection to F.
Overlay Partitions: ∆ Adjacent Failures
• Worst-case scenario: ∆-neighborhood knowledge is not sufficient to reconnect the overlay.
• Brokers "on" and "beyond" the partition are unreachable.
• Example: with D, E, and F failed and ∆ = 3, C accumulates pid1 = <C, {D}>, pid2 = <C, {D, E}>, and pid3 = <C, {D, E, F}>, but can establish no new active connection.
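The bypass behavior in these three slides can be sketched as follows: scan the primary-tree path up to ∆ hops for the closest live broker, and record a partition id for the failed brokers being bypassed. This is an illustrative sketch under assumed data shapes (ordered path lists, pid as a detector/failed-set pair), not the paper's exact protocol code.

```python
def select_active_connection(path, failed, delta):
    """`path` lists this broker's primary-tree neighbors toward one side,
    ordered by distance (path[0] is the immediate neighbor). Return the
    closest live broker within delta hops, or None when delta adjacent
    failures make the partition too wide to bypass."""
    for broker in path[:delta]:
        if broker not in failed:
            return broker
    return None

def partition_id(detector, failed_on_path):
    """pid = <detecting broker, set of failed brokers being bypassed>."""
    return (detector, frozenset(failed_on_path))

# Hypothetical neighbors beyond broker C, ordered by distance
path_from_C = ["D", "E", "F", "G"]
```

With ∆ = 3, one failure yields an active connection to E, two adjacent failures yield F, and three adjacent failures yield no connection, mirroring the slides' pid1/pid2/pid3 progression.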
Subscription Propagation Algorithm
How are correct routing tables maintained despite overlay partitions?
Subscription Propagation Algorithm
• Establishes end-to-end routing state among brokers while taking overlay partitions into account.
• Subscriptions are dynamically inserted by subscribers and propagated along branches of the primary tree over active connections.
• The primary tree is the "basis" for constructing end-to-end forwarding paths.
• Each subscription contains: SUB = <Id, Predicates, Anchor>
  • Predicates specifies the subscriber's interest, e.g., [STOCK="IBM"]
  • Anchor is a reference to a broker along the subscription's propagation path
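The SUB = <Id, Predicates, Anchor> record above can be sketched as a small data structure. The matching shown is equality-only for brevity; the paper's predicate language also supports inequalities such as [CHANGE>-8%], and the `pid_tags` field anticipates the partition tags discussed on later slides.

```python
from dataclasses import dataclass

@dataclass
class Subscription:
    id: str            # unique subscription identifier
    predicates: dict   # subscriber's interest, e.g. {"STOCK": "IBM"}
    anchor: str        # reference to a broker along the propagation path
    pid_tags: frozenset = frozenset()  # partition ids the subscription bypassed

    def matches(self, publication):
        # equality-only matching sketch; real content-based matching
        # evaluates richer predicates
        return all(publication.get(k) == v
                   for k, v in self.predicates.items())
```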
Subscription Propagation in the Absence of Overlay Partitions
• The subscription's anchor field is updated to point to a broker up to ∆ hops closer to the subscriber.
• Accepting a subscription means adding it to the routing tables.
• A subscription is accepted (i.e., used for matching) only after confirmations are received.
• Observation: matching publications are delivered to a subscriber once its local broker accepts the subscription.
Subscription Propagation in the Presence of Overlay Partitions
• Broker B sends s via its active link to bypass the partition and awaits the corresponding confirmation.
• Once B receives the confirmation and accepts s, it tags the confirmation with the pids of the partitions that s bypassed.
• Brokers relay this tag in their confirmation messages toward the subscriber's local broker, which accepts s and stores the pid tag along with s in its routing table.
Publication Forwarding Algorithm
How are accepted subscriptions and their partition tags used to achieve reliable delivery?
Publication Forwarding in the Absence of Overlay Partitions
• Forwarding only uses subscriptions accepted by brokers.
• Steps in forwarding a publication p:
  • Identify the anchors of accepted subscriptions that match p
  • Determine active connections toward the matching subscriptions' anchors
  • Send p on those active connections and wait for confirmations
  • If there are local matching subscribers, deliver p to them
  • If no downstream matching subscriber exists, issue a confirmation toward P
  • Once all confirmations arrive, discard p and send a confirmation toward P
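The first steps above (match, resolve anchors to active connections, deliver locally) can be sketched as one broker-local function. The dict-based message shapes and the equality-only matcher are simplifying assumptions, not the paper's data model.

```python
def matches(predicates, publication):
    """Equality-only predicate match (the paper supports richer predicates)."""
    return all(publication.get(k) == v for k, v in predicates.items())

def forward_publication(pub, accepted_subs, active_conn, local_preds):
    """One broker's forwarding step for publication `pub`:
    - find accepted subscriptions matching pub,
    - resolve their anchors to active connections (next hops),
    - identify local subscribers to deliver to."""
    next_hops = {active_conn[s["anchor"]]
                 for s in accepted_subs
                 if matches(s["predicates"], pub)}
    local = [sid for sid, preds in local_preds.items()
             if matches(preds, pub)]
    return next_hops, local
```

A broker would then await confirmations on the returned next hops before discarding the publication, as the remaining steps describe.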
Publication Forwarding in the Presence of Overlay Partitions
• Key forwarding invariant for reliability: no publication is delivered to a subscriber after being forwarded by brokers that have not accepted its subscription.
• Case 1: subscription s has been accepted with no pid tags. It is safe to bypass intermediate brokers.
Publication Forwarding (cont'd)
• Case 2: subscription s has been accepted with some pid tags.
• Case 2a: the publisher's local broker has accepted s, and all intermediate forwarding brokers are ensured to have done so as well: it is safe to deliver publications from sources beyond the partition.
• Depending on when the bypass link was established, either recovery or subscription propagation ensures that C accepts s prior to receiving p.
Publication Forwarding (cont'd)
• Case 2: subscription s has been accepted with some pid tags.
• Case 2b: the publisher's local broker has not accepted s: it is unsafe to deliver publications from this publisher (by the invariant). Such publications are instead tagged with the pid; s was accepted at S with the same pid tag.
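One plausible encoding of the three cases (1, 2a, 2b) is sketched below. The exact rule in the paper also interacts with the recovery procedure; this function only illustrates the decision structure, and the string results and tag-containment test are assumptions for the sketch.

```python
def delivery_decision(sub_pid_tags, publisher_broker_accepted,
                      pub_pid_tags=frozenset()):
    """Sketch of the delivery-safety cases:
    Case 1 : s accepted with no pid tags             -> deliver.
    Case 2a: pid tags, publisher's broker accepted s -> deliver.
    Case 2b: pid tags, publisher's broker has NOT accepted s -> unsafe,
             unless the publication itself carries the same pid tag
             under which the subscriber accepted s."""
    if not sub_pid_tags:
        return "deliver"          # Case 1
    if publisher_broker_accepted:
        return "deliver"          # Case 2a
    if pub_pid_tags and pub_pid_tags <= sub_pid_tags:
        return "deliver"          # Case 2b with matching pid tag
    return "hold"                 # Case 2b, unsafe
```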
Evaluation
Using a mix of simulation and experimental deployments on a large-scale testbed.
Simulation Results
• Size of brokers' neighborhoods as a function of ∆ (∆ = 1 … 4).
• Network size of 1000; broker fanout of 3.
[Figure: size of ∆-neighborhoods]
Impact of Failures on End-to-End Broker Reachability
• Using a graph simulation tool.
• Overlay setup: network size of 1000 brokers with fanout = 3.
• Failure injection: up to 100 broker failures; we randomly marked a given number of nodes as failed.
• Measurements: we counted the broker pairs whose intermediate primary-tree path contains ∆ consecutive failed brokers in a chain.
[Figure: unreachable pairs vs. number of failures, for ∆ = 1 … 4]
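The reachability criterion measured above (a primary-tree path containing ∆ consecutive failed brokers) can be checked with a simple scan. This is an illustrative sketch of the criterion, not the simulation tool itself.

```python
def has_wide_partition(path, failed, delta):
    """True iff `path` (the primary-tree path between two brokers) contains
    at least `delta` consecutive failed brokers, i.e. a gap wider than
    delta-neighborhood knowledge can bypass."""
    run = 0
    for broker in path:
        run = run + 1 if broker in failed else 0
        if run >= delta:
            return True
    return False
```

For instance, on path 1-2-3-4-5 with brokers 2 and 3 failed, ∆ = 2 cannot bypass the gap but ∆ = 3 can; non-adjacent failures (2 and 4) are bypassable even with ∆ = 2.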
Experimental Deployments: Impact of Failures on Publication Delivery
• 500 brokers deployed on 8-core machines in a cluster; overlay fanout = 3.
• We measured the aggregate publication delivery count in an interval of 120 s.
• The "Expected" bar is the number of publications that must be delivered despite failures (this excludes traffic to/from failed brokers).
[Figure: delivery counts for ∆ = 1 … 4 against the expected count]
Conclusions
• We developed a reliable P/S system that tolerates concurrent broker and link failures:
  • The configuration parameter ∆ determines the level of resiliency against failures (in the worst case).
  • Dissemination trees are augmented with neighborhood knowledge.
  • Neighborhood knowledge allows brokers to maintain network connectivity and make forwarding decisions despite failures.
• We studied the performance of the system when the number of failures far exceeds ∆:
  • A small value of ∆ ensures good connectivity.
Questions… Thanks for your attention!
Challenges
• Why does the "end-to-end" principle not work? Responsibility falls on the P/S messaging system:
  • Publishers and subscribers are decoupled and unaware of each other.
• Routing paths are established by dynamically inserted subscriptions:
  • Subscription propagation is itself subject to broker/link failures. → Subscription propagation algorithm
• Selective delivery makes in-order delivery over redundant paths difficult:
  • Subscribers are only interested in a subset of what is published. → We use a special form of tree dissemination
Store-and-Forward
• A copy is first preserved on disk.
• Intermediate hops send an ACK to the previous hop after preserving the copy.
• ACKed copies can be dismissed from disk.
• Upon failures, unacknowledged copies survive and are re-transmitted after recovery.
• This ensures reliable delivery but may cause delays while the machine is down.
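The store-and-forward discipline above can be sketched per hop: persist before ACKing upstream, discard on the downstream ACK, and re-send survivors after recovery. A dict stands in for the on-disk store; the class and method names are hypothetical.

```python
class StoreAndForwardHop:
    """Minimal store-and-forward broker hop sketch."""
    def __init__(self):
        self.stored = {}  # msg_id -> message; stands in for the disk copy

    def receive(self, msg_id, msg):
        self.stored[msg_id] = msg  # preserve a copy first...
        return "ack"               # ...then ACK the previous hop

    def on_downstream_ack(self, msg_id):
        self.stored.pop(msg_id, None)  # ACKed copies can be dismissed

    def retransmit_after_recovery(self):
        # unacknowledged copies survive the failure and are re-sent
        return sorted(self.stored)
```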
Mesh-Based Overlay Networks [Snoeren et al., SOSP 2001]
• Use a mesh network to concurrently forward messages on disjoint paths.
• Upon failures, the message is delivered using alternative routes.
• Pros: minimal impact on delivery delay.
• Cons: imposes additional traffic and the possibility of duplicate delivery.
Replica-Based Approach [Bhola et al., DSN 2002]
• Replicas on physical machines are grouped into virtual nodes.
• Replicas have identical routing information.
• We compare against this approach.
Publication Forwarding (cont'd)
• Case 2: subscription s has been accepted with some pid tags.
• Case 2b (partition barrier): the publisher's local broker has also not accepted s.
• Example: p1 matches both r and s and is tagged with the pid before crossing the barrier; p2 matches s; s was accepted at S with the same pid tag.
Subscription Propagation with Partitions
• Partition islands: if partition brokers are reachable from the other side of the partition, simply confirm (and accept) subscriptions over the available connections.
• Intuition: publications from P may only be lost if they arrive at B, but this cannot happen since there is no link toward B from F.
• The correctness proof argues on the precedence of acceptance and creation of links.
• Example: the lead broker confirms the subscription; A and B will accept during recovery.
Subscription Propagation with Partition Barriers
• If a portion of the network that includes publishers is on/beyond a partition barrier, there is no way to communicate the subscription information for the duration of the failures.
• The lead broker "partially confirms" the subscription and tags the confirmation with the partition information.
• Accepting brokers store the partition information along with the subscription.
• This ensures liveness.
Publication Forwarding
• Only accepted subscriptions are stored in the SRT and used for matching.
• At each point in time, a broker has a number of connections to its nearest reachable neighbors; this set of active connections may change over time.
• Publication forwarding steps:
  • Store the publication in a FIFO internal message queue
  • Match it and compute the set of {from} brokers for the subscriptions that match
  • For each partially confirmed subscription, tag the publication with the partition information
  • Send the publication to the closest reachable neighbors toward {from}
  • Once all confirmations arrive, discard the publication and issue a confirmation toward the publisher
Evaluations
• Size of brokers' neighborhoods as a function of ∆ (∆ = 1 … 4).
• Network size of 1000; broker fanouts of 3 and 7.
[Figures: size of ∆-neighborhoods for both fanouts]
Overlay Links Management
• Sessions: FIFO communication links between brokers.
• Active sessions: broker A's session to B is active if A has no session to another broker C on the path between A and B.
• Example: primary tree with ∆ = 2.
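The active-session definition above is a simple path check, sketched below. The path representation (a list of brokers from A to B inclusive) is an assumption for the illustration.

```python
def is_active_session(my_sessions, path_a_to_b):
    """Broker A's session to B (= path_a_to_b[-1]) is active iff A has no
    session to another broker strictly between A and B on the primary
    tree. `path_a_to_b` runs from A to B, both endpoints included."""
    between = set(path_a_to_b[1:-1])  # brokers strictly between A and B
    return not (between & set(my_sessions))
```

On path A-C-D-B, a session to B is inactive while A also holds a session to the closer broker D, and becomes active once that closer session is the only one gone.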
Agenda
• Challenges of reliability and fault-tolerance in P/S
• Our approach:
  • Topology neighborhood knowledge
  • Subscription propagation
  • Publication forwarding
  • Recovery procedure
• Evaluation results
• Conclusions