This study explores the challenges of executing tasks in a distributed supercomputing network with unreliable and potentially malicious processes. Various algorithmic approaches and fault detection methods are investigated to ensure reliable and accurate task execution.
Robust Network Supercomputing with Malicious Processes (Reliably Executing Tasks Upon Estimating the Number of Malicious Processes). Kishori M. Konwar*, Sanguthevar Rajasekaran, Alexander A. Shvartsman. *Computer Science & Engineering Department, University of Connecticut, Storrs, CT
Motivation • Internet supercomputing is becoming an increasingly powerful tool for harnessing massive amounts of computational resources • availability of high-bandwidth Internet connections • there is an enormous number of processes around the world • it comes at a cost substantially lower than acquiring a supercomputer or building a cluster of powerful machines
PrimeNet Server • PrimeNet Server is a distributed, massively parallel scientific-computing Internet supercomputer • Supported by Entropia.com, it ranks among the most powerful computers in the world • The project comprises about 30,000 PCs and laptops • It currently sustains 22,296 billion floating-point operations per second (gigaflops), i.e., operations that involve fractional numbers
SETI@home • The SETI@home project is a massive distributed cooperative computer • Used for the analysis of gigabytes of data for the Search for Extraterrestrial Intelligence (SETI) • It comprises millions of volunteer machines around the world • The SETI@home project has reported its speed to be more than 57,290 billion floating-point operations per second
Reliability Issues • The master and perhaps certain workers are reliable • they correctly execute the tasks assigned by the server • However, workers are commonly unreliable • they may return incorrect results to the master due to unintended failures caused, e.g., by over-clocked processors • they may deceptively claim to have performed the assigned work so as to obtain incentives, such as a higher rank
Some Previous Studies • [FGLS05] assumed that the worker processes might act maliciously and hence deliberately return wrong results • the goal is to design algorithms that enable the master to accept correct results with high probability at a lower cost • they provided a randomized algorithm • unfortunately, the cost complexity results depend on several parameters and are hard to interpret
Some Previous Studies (cont’d) • [GM05] considered the problem of maximizing the expected number of correct results • the tasks are dependent • any worker computes correctly with probability p < 1, and any incorrectly computed task corrupts all dependent tasks • the goal is to compute a schedule that maximizes the expected number of correct results under a given time constraint • they showed the optimization problem to be NP-hard • provided some solutions on a restricted DAG
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Models of Computation • Processes take steps in lock step, i.e., in synchrony • Processes communicate by exchanging messages • The tasks are independent and idempotent • Processes are subject to failures and can maliciously return incorrect results • Workers P = {1, 2, ..., n} and a master M
Work Complexities • [CDS01] define work complexity as available processor steps • all steps taken by the processes during the execution of the algorithm are counted, including the steps of idling and waiting non-faulty processes • [DHW92] define work as the number of performed tasks, counting multiplicities • this approach does not charge for idling and waiting and is called task-oriented work
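A hedged formalization of the two work measures (the symbols S, W_T, and t_i are notation assumed here for illustration, not taken from the slides):

% Available-processor-steps work [CDS01]: every step of every process is
% charged, including idling and waiting by non-faulty processes.
S \;=\; \sum_{i=1}^{n} t_i, \qquad t_i = \text{number of steps taken by process } i
% Task-oriented work [DHW92]: only performed tasks are charged, counting
% multiplicities (the same task executed twice counts twice).
W_T \;=\; \#\{\text{task executions performed, over all processes and rounds}\}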
A Few Comments • We say that an event E occurs with high probability (w.h.p.) to mean that Pr[E] = 1 − O(n^−α) for some constant α > 0.
Modeling Failures • Failure model Fa • An f-fraction, 0 < f < ½, of the n workers may fail • Each possibly faulty worker independently exhibits faulty behavior with probability 0 < p < ½ • The master has no a priori knowledge of f and p.
Modeling Failures (cont’d) • Failure model Fb • There is a fixed bound on the f-fraction, 0 < f < ½, of the n workers that can be faulty • Any worker from the remaining (1 − f)-fraction of the workers fails with probability 0 < p < ½, independently of other workers • The master knows the values of f and p.
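A minimal simulation sketch of worker behavior under these models (the function names, the choice of corruption, and the sampling of the faulty set are illustrative assumptions, not part of the slides):

import random

def make_faulty_candidates(n, f):
    # Mark an f-fraction of the n workers as possibly faulty (illustrative).
    return set(random.sample(range(n), int(f * n)))

def worker_response(w, task, faulty_candidates, p):
    # Worker w executes task(); a possibly faulty worker returns a wrong
    # value with probability p (correct value plus one, purely illustrative).
    correct = task()
    if w in faulty_candidates and random.random() < p:
        return correct + 1
    return correct

# This sketch follows model Fa, where the master knows neither f nor p.
# Under Fb the bound f and the probability p are known to the master, and it is
# the remaining (1 - f)-fraction that fails independently with probability p.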
Algorithmic Template
• procedure for master process M, task T:
    Choose a set S ⊆ P
    Send task T to each processor p ∈ S
    Wait for the results from the processes in S
    Decide on the result value v from the responses
• procedure for worker w ∈ P:
    Wait to receive a task from master M
    Upon receiving a task from M:
      Execute the task
      Send the result to M
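A minimal Python sketch of this template (a single-process simulation: the names master and worker and the plurality-vote decision rule are assumptions made here; a real deployment would exchange messages over the network rather than call functions):

from collections import Counter
import random

def worker(task):
    # A worker simply executes the task it receives and returns the result.
    return task()

def master(task, workers, sample_size):
    # Choose a set S of workers, send them the task, and collect the responses.
    S = random.sample(workers, sample_size)
    responses = [w(task) for w in S]
    # Decide on the result value v from the responses (here: plurality vote).
    value, _ = Counter(responses).most_common(1)[0]
    return value

# Usage sketch: ten identical honest workers, one task.
print(master(lambda: 6 * 7, [worker] * 10, sample_size=5))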
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
(ε, δ)-approximation algorithm • Z is a random variable distributed in the interval [0, 1] with mean μ_Z • Z1, Z2, Z3, ... are independently and identically distributed according to the random variable Z • An (ε, δ)-approximation algorithm, with 0 < ε < 1, δ > 0, for estimating μ_Z satisfies Pr[μ_Z(1 − ε) ≤ μ̃_Z ≤ μ_Z(1 + ε)] > 1 − δ, where μ̃_Z is the estimated value of μ_Z
Stopping Rule Algorithm [Dagum, Karp, Luby, and Ross 1995]
Input parameters (ε, δ) with 0 < ε < 1, δ > 0
Let Υ1 = 1 + (1 + ε)Υ // λ = e − 2 ≈ 0.72 and Υ = 4λ log(2/δ)/ε²
Initialize N ← 0, S ← 0
While S < Υ1 do: N ← N + 1, S ← S + Z_N
Output: μ̃_Z ← Υ1/N
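A runnable sketch of the stopping rule (the sampling interface sample_z, which must return i.i.d. draws of Z in [0, 1], is an assumed parameter; the constants follow the pseudocode above):

import math
import random

def stopping_rule_estimate(sample_z, eps, delta):
    # (eps, delta)-estimate of E[Z] for a [0,1]-valued random variable Z,
    # following the Dagum-Karp-Luby-Ross stopping rule.
    lam = math.e - 2                                   # λ ≈ 0.72
    upsilon = 4 * lam * math.log(2 / delta) / eps**2   # Υ
    upsilon1 = 1 + (1 + eps) * upsilon                 # Υ1
    n, s = 0, 0.0
    while s < upsilon1:
        n += 1
        s += sample_z()                                # one i.i.d. sample of Z
    return upsilon1 / n                                # estimate of μ_Z

# Usage sketch: estimate the mean of a Bernoulli(0.3) random variable.
print(stopping_rule_estimate(lambda: float(random.random() < 0.3), eps=0.1, delta=0.05))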
Stopping Rule Theorem
Theorem (Stopping Rule Theorem) [Dagum, Karp, Luby, and Ross] Let Z be a random variable in [0, 1] with μ_Z = E[Z] > 0. Let μ̃_Z be the estimate produced and let N_Z be the number of experiments that SRA runs with respect to Z on input ε and δ. Then,
(i) Pr[μ_Z(1 − ε) ≤ μ̃_Z ≤ μ_Z(1 + ε)] > 1 − δ,
(ii) E[N_Z] ≤ Υ1/μ_Z, and
(iii) Pr[N_Z > (1 + ε)Υ1/μ_Z] ≤ δ/2
Work Complexity of Af,p Theorem: Algorithm Af,p is an (ε, δ)-approximation algorithm, 0 < ε < 1, δ > 0, for the estimation of f and p, with task-oriented work complexity O(log² n), work (available processor steps) complexity O(n log n), message complexity O(log² n), and time complexity O(log n), with high probability.
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Detection of Faulty Processors • Lemma: It is not possible to perform all the n tasks correctly in the failure model Fa with linear complexity (i.e., O(n)) with high probability.
Detection of Faulty Processors
• procedure for master process M:
    Initially, F ← ∅
    For t = 0, ..., k log n, k > 0:
      Choose a set S ⊆ P \ F
      Send each process p ∈ S a “test” task
      Wait for the results from the processes in S
      If the response is faulty then
        F ← F ∪ {p : p is a faulty process}
      End If
    End For
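A simulation sketch of this detection loop (the verification of a “test” task against a known answer, the choice of S as all remaining workers, and the corruption model are illustrative assumptions):

import math
import random

def detect_faulty(n, worker_response, k=2.0):
    # Blacklist every worker that ever answers a verifiable "test" task
    # incorrectly; returns the set F of detected faulty workers.
    test_task, test_answer = (lambda: 42), 42      # the master knows the test answer
    F = set()
    rounds = int(k * math.log(n)) + 1
    for _ in range(rounds):
        S = [p for p in range(n) if p not in F]    # S = P \ F (all remaining workers)
        for p in S:
            if worker_response(p, test_task) != test_answer:
                F.add(p)                           # caught returning a faulty result
    return F

# Usage sketch: workers 0-4 are honest, workers 5-9 lie with probability 0.4.
lying = lambda p, t: t() + 1 if (p >= 5 and random.random() < 0.4) else t()
print(detect_faulty(10, lying))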
Detection of Faulty Processors • Lemma: The algorithm detects all faulty processes among the n workers in O(log n) time with O(n) work, with high probability • The analysis relies on the following bound for probabilistic recurrences of the form T(x) = a(x) + T(h(x)) • Theorem [Karp 04]: Suppose that a(x) is a non-decreasing, continuous function that is strictly increasing on {x : a(x) > 0}, and m(x) is a continuous function with E[h(x)] ≤ m(x). Then for every positive real x and every positive integer t, Pr[T(x) > u(x) + t·a(x)] ≤ (m(x)/x)^t, where u(x) is the solution to the equation u(x) = a(x) + u(m(x)), i.e., u(x) = Σ_{i≥0} a(m^i(x)) with m^0(x) := x and m^{i+1}(x) := m(m^i(x)).
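As a hedged illustration (not from the slides), here is the standard Quickselect-style instantiation of the theorem, under the assumption that one pass over a size-x input costs a(x) = x and leaves an expected residual size of at most m(x) = 3x/4:

a(x) = x, \qquad m(x) = \tfrac{3}{4}x, \qquad m^{i}(x) = \left(\tfrac{3}{4}\right)^{i} x
u(x) = \sum_{i \ge 0} a\!\left(m^{i}(x)\right) = \sum_{i \ge 0} \left(\tfrac{3}{4}\right)^{i} x = 4x
\Pr\bigl[T(x) > 4x + t\,x\bigr] \;\le\; \left(\tfrac{3}{4}\right)^{t}

So the total cost exceeds 4x + t·x with probability that decays geometrically in t.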
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Performing Tasks under Fa
procedure for master process M:
  Initially, C ← ∅, J ← set of n tasks
  Randomly choose a set, possibly with repetition, S ⊆ P of |S| = kn/log n workers, where k > 0 is a constant
  For i = 1, ..., k' log n, k' > 0:
    Send to each worker p ∈ S a “test” task
    Collect the responses from all the workers
  End For
  If all the responses from a worker p ∈ S are correct then
    C ← C ∪ {p}
  End If
  For i = 1, ..., n/|C|:
    Send |C| jobs from J, not sent in previous iterations, one to each worker in C
    Collect the responses from the |C| workers
  End For
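A simulation sketch of this two-phase algorithm: screen a random sample of workers with test tasks, then distribute the real tasks over the screened pool. The test task, the corruption model, and the chunking of J are illustrative assumptions.

import math
import random

def perform_tasks_Fa(tasks, n, worker_response, k=1.0, k2=2.0):
    # Phase 1: screen ~k*n/log n randomly chosen workers with ~k2*log n test tasks.
    test_task, test_answer = (lambda: 0), 0
    S = [random.randrange(n) for _ in range(max(1, int(k * n / math.log(n))))]
    C = set(S)
    for _ in range(int(k2 * math.log(n)) + 1):
        for p in list(C):
            if worker_response(p, test_task) != test_answer:
                C.discard(p)                       # p failed a test task
    C = sorted(C)                                  # the screened worker pool
    assert C, "no sampled worker passed the tests (sketch assumes f < 1/2)"
    # Phase 2: send the real tasks in rounds of |C| tasks, one per screened worker.
    results = []
    for i in range(0, len(tasks), len(C)):
        chunk = tasks[i:i + len(C)]
        results += [worker_response(p, t) for p, t in zip(C, chunk)]
    return results

# Usage sketch with honest workers only.
print(perform_tasks_Fa([lambda v=v: v for v in range(20)], n=10,
                       worker_response=lambda p, t: t()))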
Work and Time Complexities Theorem: The algorithm performs all n tasks correctly in O(log n) time and has O(n) work and task-oriented work complexities, with high probability.
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Performing Tasks under Fb
procedure for master process M:
  For t = 0, ..., k log n, k > 0:
    Choose a random permutation π ∈ S_n
    For each j ∈ [n]:
      Send task T_j to processor π(j)
    End For
    Collect the responses from all the workers
  End For
  For each j ∈ [n]:
    Choose the majority of the results of computation for task T_j as the result
  End For
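A simulation sketch of the permutation-and-majority scheme (the number of rounds, the corruption model, and the plurality vote via collections.Counter are illustrative assumptions):

import math
import random
from collections import Counter

def perform_tasks_Fb(tasks, n, worker_response, k=2.0):
    # Run ~k*log n rounds; in each round a random permutation pi assigns
    # task j to worker pi(j); finally take the per-task majority of the results.
    rounds = int(k * math.log(n)) + 1
    votes = [[] for _ in tasks]
    for _ in range(rounds):
        pi = list(range(n))
        random.shuffle(pi)                         # random permutation of the workers
        for j, task in enumerate(tasks):
            votes[j].append(worker_response(pi[j], task))
    # Majority (plurality) decision per task.
    return [Counter(v).most_common(1)[0][0] for v in votes]

# Usage sketch: workers 0-2 lie half the time, the rest are honest.
resp = lambda p, t: t() + 1 if (p < 3 and random.random() < 0.5) else t()
print(perform_tasks_Fb([lambda v=v: v * v for v in range(10)], n=10, worker_response=resp))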
Work and Time Complexities Theorem: The algorithm performs all n tasks correctly in O(log n) time and has task-oriented work and work complexities O(n log n), for 0 < p, f < ½ and (1 − f)(1 − p) > ½, with high probability.
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Conclusions • Perform tasks under the above models when the tasks are dependent • the dependency graph can be a DAG • quantify the work and time complexities in terms of characteristics of the DAG