Dariusz Kowalski University of Connecticut & Warsaw University joint work with Alex Shvartsman

Performing Tasks in Asynchronous Environments
Dariusz Kowalski (University of Connecticut & Warsaw University), joint work with Alex Shvartsman (University of Connecticut & MIT)

Do-All problem ([DHW] et al.)
The DA(p,t) problem abstracts the basic problem of cooperation in a distributed setting: p processors must perform t tasks, and at least one processor must learn that all tasks have been performed [Dwork Halpern Waarts 92/98].
Tasks are:
• known to every processor
• similar: each takes a similar number of local steps
• independent: may be performed in any order
• idempotent: may be performed concurrently
Performing Work with Asynchronous Processors

Do-All: synchronous model with crashes
Model: processors are synchronous and may fail by crashing.
Solutions: the problem is well understood, with results close to optimal.
• Shared-memory model: communication by read/write
  • Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
• Message-passing model: communication by exchanging messages
  • Dwork, C., Halpern, J., Waarts, O.: Performing work efficiently in the presence of faults. SIAM Journal on Computing, 27 (1998)
  • De Prisco, R., Mayer, A., Yung, M.: Time-optimal message-efficient work performance in the presence of faults. Proc. of 13th PODC (1994)
  • Chlebus, B., De Prisco, R., Shvartsman, A.A.: Performing tasks on synchronous restartable message-passing processors. Distributed Computing, 14 (2001)

Do-All: asynchronous models
Models:
• Shared-memory model: communication by read/write; widely studied, but solutions are far from optimal
  • Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
  • Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
  • Kedem, Z., Palem, K., Raghunathan, A., Spirakis, P.: Combining tentative and definite executions for very fast dependable parallel computing. Proc. of 23rd STOC (1991)
• Message-passing model: communication by exchanging messages; no interesting solutions until recently

Shared-Memory vs. Message-Passing
Shared-Memory (atomic registers):
• processors communicate by read/write in shared memory
• atomicity guarantees that a read returns the last written value
• one read/write operation per local clock cycle
• information propagates and information is persistent
Hence cooperation is always possible, although it may be delayed. Here processor scheduling is the major challenge.
Message-Passing:
• processors communicate by exchanging messages
• duration of a local step may be unbounded
• message delays may be unbounded
• information may not propagate: send/receive depend on delay

Message-delay-sensitive approach
Even if message delays are bounded by d (d-adversary), cooperation may be difficult.
Observation: If $d = \Omega(t)$ then work must be $\Omega(t \cdot p)$.
This means that cooperation is difficult and that addressing scheduling alone is not enough: algorithm design and analysis must be d-sensitive.
Message-delay-sensitive approach:
• Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. of the ACM, 35 (1988)

Measures of efficiency
Termination time: the first time when all tasks are done and at least one processor knows about it
• Used only to define work and message complexity
• Not interesting on its own: if all processors but one are delayed then trivially time is $\Omega(t)$
Work: the sum, over all processors, of the number of local steps taken until termination time
Message complexity (message-passing model): the number of point-to-point messages sent until termination time

Structure of the presentation
Part 1: Shared-memory model
• Model and bibliography
• Improving the AW algorithm in shared memory by better scheduling of processors (task load-balancing)
Part 2: Message-passing model
• Model: asynchrony, message delay, and modeling issues
• Delay-sensitive lower bounds for Do-All
• Progress-tree Do-All algorithms
• Simulating shared-memory and Anderson-Woll (AW)
• Asynchronous message-passing progress-tree algorithm
• Permutation Do-All algorithms

Shared-Memory - model and goal
We consider the following model:
• p asynchronous processors with PID in {0,…,p-1}
• processors communicate by read/write in shared memory
• atomicity: a read returns the last written value
• one read/write operation per local clock cycle
Write-All: write 1's into t locations of a given array
Goal: improve the scheduling of cooperating asynchronous processors, leading to better load balancing with respect to tasks

Write-All: Selected Bibliography
Introducing the Write-All problem
• Kanellakis, P.C., Shvartsman, A.A.: Efficient parallel algorithms can be made robust. PODC (1989), Distributed Computing (1992)
AW algorithm with work $O(t p^{\varepsilon})$
• Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Randomized algorithm with work $\Theta(t + p \log p)$
• Martel, C., Subramonian, R.: On the complexity of Certified Write-All algorithms. J. Algorithms 16 (1994)
First work-optimal deterministic algorithm for $t = \Omega(p^4 \log p)$
• Malewicz, G.: A work-optimal deterministic algorithm for the asynchronous Certified Write-All problem. PODC (2003)

Progress tree algorithms [BKRS, AW]
• Shared memory
• p processors, t tasks (p = t)
• q permutations of [q]
• q-ary progress tree of depth $\log_q p$
• nodes are binary completion bits
• permutations establish the order in which the children are visited
• p processors traverse the tree and use the q-ary expansion of their PID to choose permutations [Anderson Woll]
(Figure: q-ary progress tree; the children of each node are labeled 1, 2, 3, …, q.)

Algorithm AWT [Anderson Woll]
• Progress tree data structure is stored in shared memory
p = t = 9, q = 3
$\Psi$: list of 3 schedules from $S_3$
T: ternary tree of 9 leaves (progress tree), values 0-1
PID(j): j-th digit of the ternary representation of PID
(Figure: ternary progress tree with root 0, interior nodes 1-3 and leaves 4-12. PIDs 0, 3, 6 have digit 0; PIDs 1, 4, 7 have digit 1; PIDs 2, 5, 8 have digit 2. Example: $7 = 21_3$.)
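The digit PID(j) that selects a permutation at tree level j can be computed with a few lines of arithmetic. This is a minimal sketch, assuming digits are numbered from the least significant one (this matches the slide's grouping of PIDs 0, 3, 6 under digit 0); `pid_digits` is a hypothetical helper name, not part of the original algorithm description.

```python
def pid_digits(pid: int, q: int, depth: int) -> list[int]:
    """Return the q-ary digits of pid, least significant first.

    The result is padded with zeros to `depth` digits, one digit per
    level of the progress tree (assumed numbering, see lead-in).
    """
    digits = []
    for _ in range(depth):
        digits.append(pid % q)
        pid //= q
    return digits


if __name__ == "__main__":
    # Slide example: 7 = 21 in base 3, so the digits are 1 (low) and 2 (high).
    print(pid_digits(7, 3, 2))  # [1, 2]
    # PIDs that share the least significant digit 0 when q = 3 and p = 9.
    print([pid for pid in range(9) if pid_digits(pid, 3, 2)[0] == 0])  # [0, 3, 6]
```

With p = 9 and q = 3 each processor needs two digits, one per level of the ternary tree.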

Contention of permutations
$S_n$: group of all permutations on set $[n]$, with composition $\circ$ and the identity permutation
$\rho, \pi$: permutations in $S_n$
$\Psi$: set of q permutations from $S_n$
• $i$ is lrm (left-to-right maximum) in $\pi$ if $\pi(i) > \max_{j<i} \pi(j)$
• $\mathrm{LRM}(\pi)$: number of lrm in $\pi$ [Knuth]
• $\mathrm{Cont}(\Psi, \rho) = \sum_{\pi \in \Psi} \mathrm{LRM}(\rho^{-1} \circ \pi)$
• Contention of $\Psi$: $\mathrm{Cont}(\Psi) = \max_{\rho \in S_n} \mathrm{Cont}(\Psi, \rho)$ [AW]
Theorem [AW]: For any n > 0 there exists a set $\Psi$ of n permutations from $S_n$ with $\mathrm{Cont}(\Psi) \le 3nH_n = \Theta(n \log n)$.
[Knuth] Knuth, D.E.: The art of computer programming Vol. 3 (third edition). Addison-Wesley Pub Co. (1998)
(Figure: example permutation 3 5 2 4 6 1 9 7 8 11 10.)
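The definitions of lrm and contention can be checked by brute force on small inputs. The sketch below is illustrative only: it assumes permutations are stored as 0-indexed tuples, enumerates all of $S_n$ (so it is usable only for tiny n), and the function names are hypothetical.

```python
from itertools import permutations


def lrm(seq):
    """Count left-to-right maxima: entries larger than everything before them."""
    count, best = 0, float("-inf")
    for value in seq:
        if value > best:
            count += 1
            best = value
    return count


def compose_inverse(rho, pi):
    """Return rho^{-1} o pi for permutations of 0..n-1 given as tuples."""
    inverse = [0] * len(rho)
    for index, value in enumerate(rho):
        inverse[value] = index
    return tuple(inverse[value] for value in pi)


def cont_wrt(psi, rho):
    """Contention of the list psi with respect to one permutation rho."""
    return sum(lrm(compose_inverse(rho, pi)) for pi in psi)


def contention(psi):
    """Contention of psi: maximum over all rho in S_n (brute force)."""
    n = len(psi[0])
    return max(cont_wrt(psi, rho) for rho in permutations(range(n)))


if __name__ == "__main__":
    # The example permutation from the slide has lrm values 3, 5, 6, 9, 11.
    print(lrm((3, 5, 2, 4, 6, 1, 9, 7, 8, 11, 10)))  # 5
    # A schedule and its reverse: contention is n + 1 = 4 for n = 3.
    print(contention([(0, 1, 2), (2, 1, 0)]))  # 4
```

Pairing a schedule with its reverse keeps contention low because a sequence and its reverse share only the overall maximum as a left-to-right maximum.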

Procedure “Oblivious Do”
n: number of jobs and units
$\Psi = \langle \pi_0, \dots, \pi_{n-1} \rangle$: list of n schedules from $S_n$
Procedure Oblivious:
  Forall processors PID = 0 to n-1
    for i = 1 to n do
      perform Job($\pi_{PID}(i)$)
Execution of Job($\pi_{PID}(i)$) by processor PID is primary if job $\pi_{PID}(i)$ has not been previously performed.
Lemma [AW]: In algorithm Oblivious with n units, n jobs, and using the list $\Psi$ of n permutations from $S_n$, the number of primary job executions is at most $\mathrm{Cont}(\Psi)$.
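To see how primary executions arise, the procedure can be simulated under an explicit interleaving. This is a sketch under stated assumptions, not the analysis from [AW]: each job is modeled as two events per processor (start, then finish), an execution counts as primary if no processor had finished that job when it started, and `count_primary` and the `steps` encoding are hypothetical.

```python
def count_primary(psi, steps):
    """Count primary job executions of procedure Oblivious.

    psi[pid] is the schedule (sequence of job ids) of processor pid.
    steps is a sequence of PIDs; each occurrence lets that processor
    either start its next job or finish the job it is running.
    """
    done = set()                    # jobs finished by some processor
    position = [0] * len(psi)       # next index in each schedule
    running = [None] * len(psi)     # job currently in progress, if any
    primary = 0
    for pid in steps:
        if running[pid] is None:
            if position[pid] == len(psi[pid]):
                continue            # schedule exhausted
            job = psi[pid][position[pid]]
            position[pid] += 1
            running[pid] = job
            if job not in done:     # nobody finished it before this start
                primary += 1
        else:
            done.add(running[pid])
            running[pid] = None
    return primary


if __name__ == "__main__":
    lockstep = [0, 1] * 4
    # Identical schedules: every execution is primary.
    print(count_primary([(0, 1), (0, 1)], lockstep))  # 4
    # Opposite schedules: each job is primary only once.
    print(count_primary([(0, 1), (1, 0)], lockstep))  # 2
```

In both runs the count stays within the contention of the schedule list (4 and 3 respectively for n = 2), as the Lemma states.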

AWT(q) - new progress tree traversal algorithm
• Instead of using q permutations on set [q], we use q permutations on set [n], where $n = q^2 \log q$
p = 6, t = 16, q = 2, n = 4
$\Psi$: list of 2 schedules from $S_4$
T: 4-ary tree of 16 leaves (progress tree), values 0-1
PID(j): j-th digit of the q-ary (here binary) representation of PID
(Figure: 4-ary progress tree with root 0, interior nodes 1-4 and leaves 5-20. Even PIDs have digit 0; odd PIDs have digit 1. Example: $5 = 101_2$.)

Main result
• Set $n = q^2 \log q$ and let $\Psi$ be a list of q schedules from $S_n$
• Define $\mathrm{Cont}(\Psi, R) = \max_{\rho \in R} \mathrm{Cont}(\Psi, \rho)$ for a set $R$ of permutations from $S_n$
Lemma: For sufficiently large q and any set $R$ of at most $\exp(q^2 \log^2 q)$ permutations on set $[q^2 \log q]$, there is a list $\Psi$ of q schedules from $S_n$ such that $\mathrm{Cont}(\Psi, R) \le q^2 \log q + 6q \log q$.
• Take $q = \log p$ and apply the Lemma above
Theorem: For every $\varepsilon > 0$, sufficiently large p and $t = \Omega(p^{2+\varepsilon})$, algorithm AWT(q) performs work O(t).

Message-Passing - model and goals
We consider the following model:
• p asynchronous processors with PID in {0,…,p-1}
• processors communicate by message passing
• in one local step each processor can send a message to any subset of processors
• messages incur delays between send and receive
• processing of all received messages can be done during one local step
Goal: understand the impact of message delay on the efficiency of algorithmic solutions for Do-All

Lower bound - randomized algorithms
Theorem: Any randomized algorithm solving DA with t tasks using p asynchronous message-passing processors performs expected work $\Omega(t + p\,d \log_{d+1} t)$ against any d-adversary.
Proof (sketch): The adversary partitions the computation into stages, each containing d time units, and constructs the delay pattern stage after stage:
• it delays all messages sent in a stage so that they are received at the end of the stage
• it delays a linear number of processors (those which want to perform more than a (1-1/(3d)) fraction of the undone tasks) during the stage
• the selection is on-line and has good properties with high probability

Simulating shared-memory algorithms
Write-All algorithm AWT
• Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Quorum systems & atomic memory services
• Attiya, H., Bar-Noy, A., Dolev, D.: Sharing memory robustly in message passing systems. J. of the ACM, 42 (1995)
• Lynch, N., Shvartsman, A.: RAMBO: A Reconfigurable Atomic Memory Service. Proc. of 16th DISC (2002)
Emulating asynchronous shared-memory algorithms
• Momenzadeh, M.: Emulating shared-memory Do-All in asynchronous message passing systems. Masters Thesis, CSE, University of Connecticut (2003)

Atomic memory is not required
• We use q-ary progress trees as the main data structure that is “written” and “read”; note that atomicity is not required
• If two writes occur (each writing the entire tree), then a subsequent read may obtain a third value that was never written
(Figure: two writes of a progress tree followed by a read.)
• Property of monotone progress:
  • 1 at a tree node i indicates that all tasks attached to the leaves in the sub-tree rooted at i have been performed
  • If 1 is written at a node i in the progress tree of a processor, it remains 1 forever
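Monotone progress is what makes a mixed, never-written tree value harmless: every 1 in it still certifies completed work. A minimal sketch of this idea follows, assuming a progress tree is stored as a flat list of 0/1 bits and that a processor combines its own tree with a received one by a bitwise OR; the flat layout and the `merge` helper are assumptions made for illustration, not details given on the slide.

```python
def merge(local, received):
    """Combine two progress trees node by node.

    A node is 1 in the result if it is 1 in either input, so bits only
    ever change from 0 to 1 (monotone progress).
    """
    return [a | b for a, b in zip(local, received)]


if __name__ == "__main__":
    # Node order: root, left child, right child.
    write_a = [0, 1, 0]   # one writer completed the left subtree
    write_b = [0, 0, 1]   # another writer completed the right subtree
    # A reader that sees parts of both writes obtains a value that
    # neither writer wrote, yet every 1 in it is still correct.
    print(merge(write_a, write_b))  # [0, 1, 1]
```

Because a merge never clears a bit, repeating or reordering merges cannot lose recorded progress.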

Algorithm DAq - traverse progress tree
• Instead of using shared memory, processors broadcast their progress trees as soon as local progress is recorded
p = t = 9, q = 3
$\Psi$: list of 3 schedules from $S_3$
T: ternary tree of 9 leaves (progress tree), values 0-1
PID(j): j-th digit of the ternary representation of PID
(Figure: ternary progress tree with root 0, interior nodes 1-3 and leaves 4-12. PIDs 0, 3, 6 have digit 0; PIDs 1, 4, 7 have digit 1; PIDs 2, 5, 8 have digit 2. Example: $7 = 21_3$.)

Algorithm DAq - case $p \ge t$

Procedure DOWORK

Algorithm DAq - analysis
Modification of algorithm DAq for p < t:
• We partition the t tasks into p jobs of size t/p and let algorithm DAq work with these jobs.
• It takes a processor O(t/p) work (instead of constant) to process such a job (job unit).
• In each step, a processor broadcasts at most one message to the p-1 other processors.
We obtain:
Theorem 4: For any constant $\varepsilon > 0$ there is a constant q such that algorithm DAq has work $W(p,t,d) = O(t p^{\varepsilon} + p\,d \lceil t/d \rceil^{\varepsilon})$ and message complexity $O(p \cdot W(p,t,d))$ against any d-adversary (d = o(t)).
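The partitioning step for p < t can be sketched as follows. This is one possible way to form the jobs, assuming contiguous blocks of task identifiers whose sizes differ by at most one when p does not divide t; the slide only states that jobs have size t/p, and `partition_into_jobs` is a hypothetical name.

```python
def partition_into_jobs(t: int, p: int) -> list[list[int]]:
    """Split task ids 0..t-1 into p contiguous jobs of nearly equal size."""
    base, extra = divmod(t, p)
    jobs, start = [], 0
    for job_index in range(p):
        # The first `extra` jobs take one additional task.
        size = base + (1 if job_index < extra else 0)
        jobs.append(list(range(start, start + size)))
        start += size
    return jobs


if __name__ == "__main__":
    print(partition_into_jobs(10, 3))
    # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each job then plays the role of a single task in DAq, at a cost of O(t/p) local steps per job.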

Permutation algorithms - case $p \ge t$
Algorithms proceed in a loop:
• select the next task using the ORDER+SELECT rule
• perform the selected task
• send messages, receive messages, and update state
ORDER+SELECT rules:
PARAN1: initially processor PID permutes the tasks randomly; PID selects the first task remaining on its schedule
PARAN2: no initial order; PID selects a task from the remaining set randomly
PADET: initially processor PID chooses schedule $\pi_{PID}$ in $\Psi$; PID selects the first task remaining on schedule $\pi_{PID}$
$\Psi$: list of p schedules from $S_t$
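The three ORDER+SELECT rules differ only in how a schedule is fixed and how the next task is picked. The sketch below shows that difference for a single processor and leaves out messaging entirely; the function names, the `done` set standing for the tasks the processor knows to be performed, and the use of Python's `random.Random` are all assumptions made for illustration.

```python
import random


def paran1_order(tasks, rng):
    """PARAN1 ORDER: a random permutation of the tasks, fixed at the start."""
    order = list(tasks)
    rng.shuffle(order)
    return order


def select_first_remaining(schedule, done):
    """PARAN1 / PADET SELECT: first task on the schedule not known done."""
    for task in schedule:
        if task not in done:
            return task
    return None


def paran2_select(tasks, done, rng):
    """PARAN2 SELECT: a uniformly random task among those not known done."""
    remaining = [task for task in tasks if task not in done]
    return rng.choice(remaining) if remaining else None


if __name__ == "__main__":
    rng = random.Random(0)
    tasks = range(1, 6)
    # PADET: the schedule is a fixed permutation taken from the list of schedules.
    padet_schedule = [3, 1, 5, 2, 4]
    print(select_first_remaining(padet_schedule, done={3, 1}))  # 5
    # PARAN1: same SELECT rule, but on a randomly shuffled schedule.
    print(sorted(paran1_order(tasks, rng)))  # [1, 2, 3, 4, 5]
    # PARAN2: the choice is random, but never a task already known done.
    print(paran2_select(tasks, {1, 2, 3, 4}, rng))  # 5
```

In the full algorithms the `done` set would grow both from the processor's own work and from received messages.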

d-Contention of permutations
We introduce the notion of d-Contention:
• $i$ is d-lrm in $\pi$ if $|\{ j < i : \pi(i) < \pi(j) \}| < d$
• $\mathrm{LRM}_d(\pi)$: number of d-lrm in $\pi$
• $\mathrm{Cont}_d(\Psi, \rho) = \sum_{\pi \in \Psi} \mathrm{LRM}_d(\rho^{-1} \circ \pi)$
• d-Contention of $\Psi$: $\mathrm{Cont}_d(\Psi) = \max_{\rho} \mathrm{Cont}_d(\Psi, \rho)$
Theorem: For sufficiently large p and n, there is a list $\Psi$ of p permutations from $S_n$ such that, for every integer d > 1, $\mathrm{Cont}_d(\Psi) \le n \log n + 5pd \ln(e + n/d)$. Moreover, a random $\Psi$ is good with high probability.
(Figure: d-lrm for d = 2 in the example permutation 3 5 2 4 6 1 9 7 8 11 10.)
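The d-lrm condition can be evaluated directly from its definition. The following sketch counts d-lrm positions in a sequence with a quadratic scan, which is enough for small examples; `lrm_d` is a hypothetical name, and for d = 1 the count coincides with the ordinary left-to-right maxima.

```python
def lrm_d(seq, d):
    """Count positions with fewer than d larger entries before them."""
    count = 0
    for index, value in enumerate(seq):
        larger_before = sum(1 for earlier in seq[:index] if earlier > value)
        if larger_before < d:
            count += 1
    return count


if __name__ == "__main__":
    example = (3, 5, 2, 4, 6, 1, 9, 7, 8, 11, 10)
    print(lrm_d(example, 1))  # 5  (entries 3, 5, 6, 9, 11)
    print(lrm_d(example, 2))  # 9  (all entries except 2 and 1)
```

Raising d admits entries that are preceded by a few larger ones, which is why the d-contention bound grows with d.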

d-Contention and work
Lemma: For algorithms PADET and PARAN1, the respective worst-case work and expected work are at most $\mathrm{Cont}_d(\Psi)$ against any d-adversary.
Example: p = 2, t = 11, d = 2
Order of tasks to perform: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
(Figure: the schedules of the two processors, 1 3 2 5 7 4 9 8 6 11 10 and 2 4 6 8 10 11 9 7 5 3 1.)

Permutation algorithms - results
Theorem: Randomized algorithms PARAN1 and PARAN2 perform expected work $O(t \log p + p\,d \log(t/d))$ and have expected communication $O(t p \log p + p^2 d \log(t/d))$ against any d-adversary (d = o(t)).
Corollary: There exists a deterministic list of schedules $\Psi$ such that algorithm PADET performs work $O(t \log p + p \min\{t,d\} \log(2 + t/d))$ and has communication $O(t p \log p + p^2 \min\{t,d\} \log(2 + t/d))$ when $p \ge t$.

Conclusions and open problems
• Work-optimal Write-All algorithm for $t = \Omega(p^{2+\varepsilon})$
• First message-delay-sensitive analysis of the Do-All problem for asynchronous processors in the message-passing model
  • lower bounds for deterministic and randomized algorithms
  • deterministic and randomized algorithms with subquadratic (in p and t) work for any message delay d as long as d = o(t)
• Among the interesting open questions are
  • is there work-optimal scheduling for $t = \Omega(p \log p)$
  • for algorithm PADET: how to construct the list $\Psi$ of permutations efficiently
  • closing the gap between the upper and the lower bounds
  • investigating algorithms that simultaneously control work and message complexity
