Locality Sensitive Distributed Computing David Peleg, Weizmann Institute
Structure of mini-course • Basics of distributed network algorithms • Locality-preserving network representations • Constructions and applications
Part 1: Basic distributed algorithms • Model • Broadcast • Tree constructions • Synchronizers • Coloring, MIS
The distributed network model Point-to-point communication network
The distributed network model Described by undirected weighted graph G(V,E,w) V={v1,…,vn} - Processors (network sites) E - bidirectional communication links
The distributed network model w: E → R+ edge weight function representing transmission costs (usually satisfies the triangle inequality) Unique processor ID's: ID: V → S, S={s1,s2,…} an ordered set of integers
Communication Processor v has deg(v,G) ports (external connection points) Edge e represents pair ((u,i),(v,j)) = link connecting u's port i to v's port j
Communication • Message transmission from u to neighbor v: • u loads M onto port i • v receives M in input buffer of port j
Communication (cont) Assumption: At most one message can occupy a communication link at any given time (a link is available for the next transmission only after the previous message has been removed from the input buffer by the receiving processor) Allowable message size = O(log n) bits (messages carry a fixed number of vertex ID's, e.g., sender and destination)
Issues unique to distributed computing There are several inherent differences between the distributed and the traditional centralized-sequential computational models
Communication • In centralized setting: issue nonexistent • In distributed setting: communication has its limits (in speed and capacity) and does not come “for free” • It should be treated as a computational resource, like time or memory (often the dominating consideration)
Communication as a scarce resource One common model: LOCAL Assumes local processing comes for free (Algorithm pays only for communication)
Incomplete knowledge In centralized-sequential setting: Processor knows everything (inputs, intermediate results, etc.) In distributed setting: Processors have very partial picture
Partial topological knowledge Model of anonymous networks: identical nodes, no ID's, no topology knowledge Intermediate models: estimates for network diameter, # nodes, etc.; unique identifiers; neighbor knowledge
Partial topological knowledge (cont) Permissive models: Topological knowledge of large regions, or even entire network Structured models: Known sub-structure, e.g., spanning tree / subgraph / hierarchical partition / routing service available
Other knowledge deficiencies • know only local portion of the input • do not know who else participates • do not know current stage of other participants
Coping with failures In centralized setting: Straightforward - Upon abnormal termination or system crash: Locate source of failure, fix it and go on. In distributed setting: Complication - When one component fails, others continue Ambitious goal: ensure protocol runs correctly despite occasional failures at some machines (including “confusion-causing failures”, e.g., failed processors sending corrupted messages)
Timing and synchrony Fully synchronous network: • All link delays are bounded • Each processor keeps a local clock • Local pulses satisfy the following property: a message sent from v to neighbor u at pulse p of v arrives at u before its pulse p+1 Think of the entire system as driven by a global clock
Timing and synchrony Machine cycle of processors - composed of 3 steps: • Send msgs to (some) neighbors • Wait to receive msgs from neighbors • Perform some local computation
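As a concrete illustration (not part of the original lecture), here is a minimal Python sketch of this send / receive / compute cycle, assuming the network is an adjacency-list dict and each processor exposes hypothetical send and receive hooks:

```python
from collections import defaultdict

def run_synchronous(graph, processors, pulses):
    """Drive a fully synchronous execution for a fixed number of pulses.
    graph      : dict {node: list of neighbor nodes}
    processors : dict {node: object with send(pulse) -> [(neighbor, msg)]
                       and receive(msgs, pulse)}   (hypothetical interface)
    """
    for pulse in range(pulses):
        outbox = defaultdict(list)
        # Step 1: each processor sends messages to (some of) its neighbors
        for v, proc in processors.items():
            for u, msg in proc.send(pulse):
                assert u in graph[v]        # may only message direct neighbors
                outbox[u].append((v, msg))
        # Steps 2-3: each processor receives the messages sent at this pulse
        # and performs its local computation before the next pulse
        for v, proc in processors.items():
            proc.receive(outbox[v], pulse)
```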
Asynchronous model • Algorithms are event-driven : • No access to global clock • Messages sent from processor to neighbor arrive within finite but unpredictable time
Asynchronous model Clock can't tell if message is coming or not: perhaps “the message is still on its way” Impossible to rely on ordering of events (might reverse due to different message transmission speeds)
Nondeterminism Asynchronous computations are inherently nondeterministic (even when protocols do not use randomization)
Nondeterminism Reason: Message arrival order may differ from one execution to another (e.g., due to other events concurrently occurring in the system – queues, failures) Run same algorithm twice on same inputs - get different outputs / “scenarios”
Complexity measures • Traditional (time, memory) • New (messages, communication)
Time For synchronous algorithm P: Time(P) = (worst-case) # pulses during execution For asynchronous algorithm P? (Even a single message can incur arbitrary delay!)
Time For asynchronous algorithm P: Time(P) = (worst-case) # time units from start to end of execution, assuming each message incurs delay ≤ 1 time unit (*)
Note: • Assumption (*) is used only for performance evaluation, not for correctness. • (*) does not restrict set of possible scenarios – any execution can be “normalized” to fit this constraint • “Worst-case” means all possible inputs and all possible scenarios over each input Time
Memory Mem(P) = (worst-case) # memory bits used throughout the network MaxMem(P) = maximum local memory
Message complexity Basic message = O(log n) bits Longer messages cost proportionally to length Sending basic message over edge costs 1 Message(P) = (worst case) # basic messages sent during execution
Distance definitions Length of path (e1,...,es) = s dist(u,w,G) = length of shortest u-w path in G Diameter: Diam(G) = max_{u,v∈V} {dist(u,v,G)}
Distance definitions (cont) Radius: Rad(v,G) = max_{w∈V} {dist(v,w,G)} Rad(G) = min_{v∈V} {Rad(v,G)} A center of G: vertex v s.t. Rad(v,G)=Rad(G) Observe: Rad(G) ≤ Diam(G) ≤ 2Rad(G)
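A small Python sketch of these definitions, assuming an unweighted graph given as an adjacency-list dict (so distances are hop counts); function names are illustrative:

```python
from collections import deque

def dist_from(graph, u):
    """BFS distances from u; graph is a dict {node: list of neighbors}."""
    d = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in d:
                d[w] = d[v] + 1
                queue.append(w)
    return d

def rad(graph, v):                      # Rad(v,G): eccentricity of v
    return max(dist_from(graph, v).values())

def radius(graph):                      # Rad(G) = min_v Rad(v,G)
    return min(rad(graph, v) for v in graph)

def diameter(graph):                    # Diam(G) = max_{u,v} dist(u,v,G)
    return max(rad(graph, v) for v in graph)
```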
Broadcast Goal: Disseminate message M originated at source r0 to all vertices in the network
Basic lower bounds • Thm: For every broadcast algorithm B: • Message(B) ≥ n-1 • Time(B) ≥ Rad(r0,G) = Ω(Diam(G))
Tree broadcast • Algorithm Tcast(r0,T) • Use spanning tree T of G rooted at r0 • Root broadcasts M to all its children • Each node v getting M forwards it to its children
Tree broadcast (cont) Assume: Spanning tree known to all nodes (Q: what does it mean in distributed context?)
Tree broadcast (cont) • Claim: For spanning tree T rooted at r0: • Message(Tcast) = n-1 • Time(Tcast) = Depth(T)
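A minimal round-by-round Python sketch of Tcast supporting this claim; representing T by a dict of children lists is an assumption of the sketch, not the protocol's message-passing form:

```python
def tcast(children, r0):
    """Simulate Tcast(r0, T) level by level.
    children : dict {node: list of its children in the spanning tree T}
    Returns (messages sent, rounds) = (n-1, Depth(T)).
    """
    messages, rounds = 0, 0
    frontier = [r0]                      # nodes holding M at the current tree level
    while frontier:
        next_level = [c for v in frontier for c in children[v]]
        messages += len(next_level)      # each node forwards M to all of its children
        if next_level:
            rounds += 1                  # one synchronous pulse per tree level
        frontier = next_level
    return messages, rounds
```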
Tcast on BFS tree BFS (Breadth-First Search) tree = shortest-paths tree: the level of each v in T is dist(r0,v,G)
Tcast (cont) • Corollary: For BFS tree T w.r.t. r0: • Message(Tcast) = n-1 • Time(Tcast) ≤ Diam(G) • (Optimal in both) But what if there is no spanning tree?
The flooding algorithm • Algorithm Flood(r0) • Source sends M on each outgoing link • For other vertex v: • On receiving M first time over edge e: • store in buffer; forward on every edge ≠ e • On receiving M again (over other edges): • discard it and do nothing
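A minimal synchronous Python sketch of Flood, assuming an adjacency-list dict; recording the edge on which M first arrives also yields the parent pointers of the tree T discussed below:

```python
def flood(graph, r0):
    """Synchronous simulation of Flood(r0).
    graph : dict {node: list of neighbors}
    Returns (parent pointers of the implicit tree T, number of messages sent).
    """
    parent = {r0: None}        # source has no parent; others set on first receipt
    messages = 0
    frontier = [r0]            # nodes that received M for the first time this round
    while frontier:
        next_frontier = []
        for v in frontier:
            for u in graph[v]:
                if u == parent[v]:
                    continue            # do not send M back on the edge it arrived on
                messages += 1
                if u not in parent:     # first copy of M to reach u: store and forward
                    parent[u] = v
                    next_frontier.append(u)
                # later copies are simply discarded by u
        frontier = next_frontier
    return parent, messages
```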
Flooding - correctness • Lemma: • Alg. Flood yields correct broadcast • Time(Flood) = Θ(Rad(r0,G)) = Θ(Diam(G)) • Message(Flood) = Θ(|E|) • in both the synchronous and asynchronous models • Proof: • Message complexity: each edge delivers M at most once in each direction
Neighborhoods Gl(v) = l-neighborhood of v = vertices at distance l or less from v (Figure: nested neighborhoods G0(v) ⊆ G1(v) ⊆ G2(v))
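Under the same adjacency-list assumption as above, a one-function sketch of Gl(v):

```python
def neighborhood(graph, v, l):
    """Return Gl(v): the set of vertices at distance at most l from v."""
    reached, frontier = {v}, {v}
    for _ in range(l):
        # expand one hop, keeping only vertices not already reached
        frontier = {u for w in frontier for u in graph[w]} - reached
        reached |= frontier
    return reached
```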
Time complexity Verify (by induction on t) that: after t time units, M has already reached every vertex at distance ≤ t from r0 (= every vertex in the t-neighborhood Gt(r0)) Note: In the asynchronous model, M may have reached additional vertices (messages may travel faster)
Time complexity • Note: Algorithm Flood implicitly constructs a directed spanning tree T rooted at r0, defined as follows: • The parent of each v in T is the node from which v received M for the first time Lemma: In the synchronous model, T is a BFS tree w.r.t. r0, with depth Rad(r0,G)
Flood time Note: In the asynchronous model, T may be deeper (depth up to n-1) Note: Time is still O(Diam(G)) even in this case!
Broadcast with echo Goal: Verify successful completion of broadcast Method: Collect acknowledgements on a spanning tree T
Broadcast with echo • Converge(Ack) process - code for v • Upon getting M do: • For v leaf in T: • - Send up an Ack message to parent • For v non-leaf: • - Collect Ack messages from all children • - Send Ack message to parent
Semantics of Ack from v: “joint ack” for the entire subtree Tv rooted at v, signifying that each vertex in Tv received M r0 receives Ack from all children only after all vertices received M • Claim: On tree T, • Message(Converge(Ack)) = O(n) • Time(Converge(Ack)) = O(Depth(T))
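A compact Python sketch combining Tcast with Converge(Ack); it is written recursively for brevity rather than as a message-driven protocol, and the children map and deliver callback are assumptions of the sketch:

```python
def broadcast_with_echo(children, r0, M, deliver):
    """Deliver M over the tree and echo acknowledgements back to the root.
    children : dict {node: list of its children in the tree T}
    deliver  : callback deliver(node, M), invoked when node receives M
    Returns only after the root has (implicitly) collected an Ack from each
    child, i.e. after every vertex of the tree has received M.
    """
    def down_then_ack(v):
        deliver(v, M)                  # v receives M
        for c in children[v]:          # forward M to each child and wait for
            down_then_ack(c)           # its joint Ack for the subtree T_c
        # here every vertex in the subtree Tv has received M,
        # so v can send its own Ack to its parent
    down_then_ack(r0)
```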
Tree selection • Tree broadcast alg: take the same tree used for broadcast; time / message complexities grow by a constant factor • Flooding alg: use the tree T defined by the broadcast Synch. model: BFS tree - complexities double Asynch. model: no guarantee