This study explores the challenges of executing tasks in a distributed supercomputing network with unreliable and potentially malicious processes. Various algorithmic approaches and fault detection methods are investigated to ensure reliable and accurate task execution.
Robust Network Supercomputing with Malicious Processes (Reliably Executing Tasks Upon Estimating the Number of Malicious Processes). Kishori M. Konwar*, Sanguthevar Rajasekaran, Alexander A. Shvartsman. *Computer Science & Engineering Department, University of Connecticut, Storrs, CT
Motivation • Internet supercomputing is becoming an increasingly powerful tool for harnessing massive amounts of computational resources • availability of high-bandwidth Internet connections • there is an enormous number of processes around the world • it comes at a cost substantially lower than acquiring a supercomputer or building a cluster of powerful machines
PrimeNet Server • PrimeNet Server is a distributed, massively parallel scientific-computing Internet supercomputer • Supported by Entropia.com, it ranks among the most powerful computers in the world • The project comprises about 30,000 PCs and laptops • It currently sustains 22,296 billion floating-point operations per second (gigaflops), i.e., operations that involve fractional numbers
SETI@home • The SETI@home project is a massive distributed cooperative computer • Used for the analysis of gigabytes of data for the Search for Extraterrestrial Intelligence (SETI) • It comprises millions of volunteer machines around the world • The SETI@home project has reported its speed to be more than 57,290 billion floating-point operations per second
Reliability Issues • The master and perhaps certain workers are reliable • they correctly execute the tasks assigned by the server • However, workers are commonly unreliable • they may return incorrect results to the master due to unintended failures caused, e.g., by over-clocked processors • they may deceptively claim to have performed the assigned work so as to obtain incentives, such as a higher rank
Some Previous Studies • [FGLS05] assumed that the worker processes might act maliciously and hence deliberately return wrong results • the goal is to design algorithms that enable the master to accept correct results with high probability at a lower cost • they provided a randomized algorithm • unfortunately, the cost complexity results depend on several parameters and are hard to interpret
Some Previous Studies (cont’d) • [GM05] considered the problem of maximizing the expected number of correct results • the tasks are dependent • any worker computes correctly with probability p < 1, and any incorrectly computed task corrupts all dependent tasks • the goal is to compute a schedule that maximizes the expected number of correct results under a given time constraint • they showed the optimization problem to be NP-hard • provided some solutions on a restricted DAG
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Models of Computation • Processes take steps in lock step, i.e., in synchrony • Processes communicate by exchanging messages • The tasks are independent and idempotent • Processes are subject to failures and can maliciously return incorrect results • Workers P = {1, 2, ..., n} and a master M
Work Complexities • [CDS01] define work complexity as available processor steps • all steps taken by the processes during the execution of the algorithm are counted, including the steps of idling and waiting non-faulty processes • [DHW92] define work as the number of performed tasks, counting multiplicities • this approach does not charge for idling and waiting and is called task-oriented work
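A hedged formalization of the two work measures (the symbols S, W_T, and t_i are notation assumed here for illustration, not taken from the slides):

% Available-processor-steps work [CDS01]: every step of every process is
% charged, including idling and waiting by non-faulty processes.
S \;=\; \sum_{i=1}^{n} t_i, \qquad t_i = \text{number of steps taken by process } i
% Task-oriented work [DHW92]: only performed tasks are charged, counting
% multiplicities (the same task executed twice counts twice).
W_T \;=\; \#\{\text{task executions performed, over all processes and rounds}\}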
A Few Comments • We say that an event E occurs with high probability (w.h.p.) to mean that Pr[E] = 1 − O(n^−α) for some constant α > 0.
Modeling Failures • Failure model Fa • An f-fraction, 0 < f < ½, of the n workers may fail • Each possibly faulty worker independently exhibits faulty behavior with probability 0 < p < ½ • The master has no a priori knowledge of f and p.
Modeling Failures (cont’d) • Failure model Fb • There is a fixed bound on the f-fraction, 0 < f < ½, of the n workers that can be faulty • Any worker from the remaining (1 − f)-fraction of the workers fails with probability 0 < p < ½, independently of other workers • The master knows the values of f and p.
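A minimal simulation sketch of worker behavior under these models (the function names, the choice of corruption, and the sampling of the faulty set are illustrative assumptions, not part of the slides):

import random

def make_faulty_candidates(n, f):
    # Mark an f-fraction of the n workers as possibly faulty (illustrative).
    return set(random.sample(range(n), int(f * n)))

def worker_response(w, task, faulty_candidates, p):
    # Worker w executes task(); a possibly faulty worker returns a wrong
    # value with probability p (correct value plus one, purely illustrative).
    correct = task()
    if w in faulty_candidates and random.random() < p:
        return correct + 1
    return correct

# This sketch follows model Fa, where the master knows neither f nor p.
# Under Fb the bound f and the probability p are known to the master, and it is
# the remaining (1 - f)-fraction that fails independently with probability p.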
Algorithmic Template
• procedure for master process M, task T:
    Choose a set S ⊆ P
    Send task T to each processor p ∈ S
    Wait for the results from the processes in S
    Decide on the result value v from the responses
• procedure for worker w ∈ P:
    Wait to receive a task from master M
    Upon receiving a task from M:
      Execute the task
      Send the result to M
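A minimal Python sketch of this template (a single-process simulation: the names master and worker and the plurality-vote decision rule are assumptions made here; a real deployment would exchange messages over the network rather than call functions):

from collections import Counter
import random

def worker(task):
    # A worker simply executes the task it receives and returns the result.
    return task()

def master(task, workers, sample_size):
    # Choose a set S of workers, send them the task, and collect the responses.
    S = random.sample(workers, sample_size)
    responses = [w(task) for w in S]
    # Decide on the result value v from the responses (here: plurality vote).
    value, _ = Counter(responses).most_common(1)[0]
    return value

# Usage sketch: ten identical honest workers, one task.
print(master(lambda: 6 * 7, [worker] * 10, sample_size=5))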
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
(ε, δ)-approximation algorithm • Z is a random variable distributed in the interval [0, 1] with mean μ_Z • Z1, Z2, Z3, ... are independently and identically distributed according to the random variable Z • An (ε, δ)-approximation algorithm, with 0 < ε < 1, δ > 0, for estimating μ_Z satisfies Pr[μ_Z(1 − ε) ≤ μ̃_Z ≤ μ_Z(1 + ε)] > 1 − δ, where μ̃_Z is the estimated value of μ_Z
Stopping Rule Algorithm [Dagum, Karp, Luby, and Ross 1995]
Input parameters (ε, δ) with 0 < ε < 1, δ > 0
Let Υ1 = 1 + (1 + ε)Υ // λ = e − 2 ≈ 0.72 and Υ = 4λ log(2/δ)/ε²
Initialize N ← 0, S ← 0
While S < Υ1 do: N ← N + 1, S ← S + Z_N
Output: μ̃_Z ← Υ1/N
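A runnable sketch of the stopping rule (the sampling interface sample_z, which must return i.i.d. draws of Z in [0, 1], is an assumed parameter; the constants follow the pseudocode above):

import math
import random

def stopping_rule_estimate(sample_z, eps, delta):
    # (eps, delta)-estimate of E[Z] for a [0,1]-valued random variable Z,
    # following the Dagum-Karp-Luby-Ross stopping rule.
    lam = math.e - 2                                   # λ ≈ 0.72
    upsilon = 4 * lam * math.log(2 / delta) / eps**2   # Υ
    upsilon1 = 1 + (1 + eps) * upsilon                 # Υ1
    n, s = 0, 0.0
    while s < upsilon1:
        n += 1
        s += sample_z()                                # one i.i.d. sample of Z
    return upsilon1 / n                                # estimate of μ_Z

# Usage sketch: estimate the mean of a Bernoulli(0.3) random variable.
print(stopping_rule_estimate(lambda: float(random.random() < 0.3), eps=0.1, delta=0.05))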
Stopping Rule Theorem
Theorem (Stopping Rule Theorem) [Dagum, Karp, Luby, and Ross] Let Z be a random variable in [0, 1] with μ_Z = E[Z] > 0. Let μ̃_Z be the estimate produced and let N_Z be the number of experiments that SRA runs with respect to Z on input ε and δ. Then,
(i) Pr[μ_Z(1 − ε) ≤ μ̃_Z ≤ μ_Z(1 + ε)] > 1 − δ,
(ii) E[N_Z] ≤ Υ1/μ_Z, and
(iii) Pr[N_Z > (1 + ε)Υ1/μ_Z] ≤ δ/2
Work Complexity of Af,p Theorem: Algorithm Af,p is an (ε, δ)-approximation algorithm, 0 < ε < 1, δ > 0, for the estimation of f and p, with task-oriented work complexity O(log² n), work (available processor steps) complexity O(n log n), message complexity O(log² n), and time complexity O(log n), with high probability.
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Detection of Faulty Processors • Lemma: It is not possible to perform all the n tasks correctly in the failure model Fa with linear complexity (i.e., O(n)) with high probability.
Detection of Faulty Processors
• procedure for master process M:
    Initially, F ← ∅
    For t = 0, ..., k log n, k > 0:
      Choose a set S ⊆ P \ F
      Send each process p ∈ S a “test” task
      Wait for the results from the processes in S
      If the response is faulty then
        F ← F ∪ {p : p is a faulty process}
      End If
    End For
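A simulation sketch of this detection loop (the verification of a “test” task against a known answer, the choice of S as all remaining workers, and the corruption model are illustrative assumptions):

import math
import random

def detect_faulty(n, worker_response, k=2.0):
    # Blacklist every worker that ever answers a verifiable "test" task
    # incorrectly; returns the set F of detected faulty workers.
    test_task, test_answer = (lambda: 42), 42      # the master knows the test answer
    F = set()
    rounds = int(k * math.log(n)) + 1
    for _ in range(rounds):
        S = [p for p in range(n) if p not in F]    # S = P \ F (all remaining workers)
        for p in S:
            if worker_response(p, test_task) != test_answer:
                F.add(p)                           # caught returning a faulty result
    return F

# Usage sketch: workers 0-4 are honest, workers 5-9 lie with probability 0.4.
lying = lambda p, t: t() + 1 if (p >= 5 and random.random() < 0.4) else t()
print(detect_faulty(10, lying))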
Detection of Faulty Processors • Lemma: The algorithm detects all faulty processes among the n workers in O(log n) time with O(n) work, with high probability • The analysis relies on the following bound for probabilistic recurrences of the form T(x) = a(x) + T(h(x)) • Theorem [Karp 04]: Suppose that a(x) is a non-decreasing, continuous function that is strictly increasing on {x : a(x) > 0}, and m(x) is a continuous function with E[h(x)] ≤ m(x). Then for every positive real x and every positive integer t, Pr[T(x) > u(x) + t·a(x)] ≤ (m(x)/x)^t, where u(x) is the solution to the equation u(x) = a(x) + u(m(x)), i.e., u(x) = Σ_{i≥0} a(m^i(x)) with m^0(x) := x and m^{i+1}(x) := m(m^i(x)).
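As a hedged illustration (not from the slides), here is the standard Quickselect-style instantiation of the theorem, under the assumption that one pass over a size-x input costs a(x) = x and leaves an expected residual size of at most m(x) = 3x/4:

a(x) = x, \qquad m(x) = \tfrac{3}{4}x, \qquad m^{i}(x) = \left(\tfrac{3}{4}\right)^{i} x
u(x) = \sum_{i \ge 0} a\!\left(m^{i}(x)\right) = \sum_{i \ge 0} \left(\tfrac{3}{4}\right)^{i} x = 4x
\Pr\bigl[T(x) > 4x + t\,x\bigr] \;\le\; \left(\tfrac{3}{4}\right)^{t}

So the total cost exceeds 4x + t·x with probability that decays geometrically in t.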
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Performing Tasks under Fa
procedure for master process M:
  Initially, C ← ∅, J ← set of n tasks
  Randomly choose a set, possibly with repetition, S ⊆ P of |S| = kn/log n workers, where k > 0 is a constant
  For i = 1, ..., k' log n, k' > 0:
    Send to each worker p ∈ S a “test” task
    Collect the responses from all the workers
  End For
  If all the responses from a worker p ∈ S are correct then
    C ← C ∪ {p}
  End If
  For i = 1, ..., n/|C|:
    Send |C| jobs from J, not sent in previous iterations, one to each worker in C
    Collect the responses from the |C| workers
  End For
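A simulation sketch of this two-phase algorithm: screen a random sample of workers with test tasks, then distribute the real tasks over the screened pool. The test task, the corruption model, and the chunking of J are illustrative assumptions.

import math
import random

def perform_tasks_Fa(tasks, n, worker_response, k=1.0, k2=2.0):
    # Phase 1: screen ~k*n/log n randomly chosen workers with ~k2*log n test tasks.
    test_task, test_answer = (lambda: 0), 0
    S = [random.randrange(n) for _ in range(max(1, int(k * n / math.log(n))))]
    C = set(S)
    for _ in range(int(k2 * math.log(n)) + 1):
        for p in list(C):
            if worker_response(p, test_task) != test_answer:
                C.discard(p)                       # p failed a test task
    C = sorted(C)                                  # the screened worker pool
    assert C, "no sampled worker passed the tests (sketch assumes f < 1/2)"
    # Phase 2: send the real tasks in rounds of |C| tasks, one per screened worker.
    results = []
    for i in range(0, len(tasks), len(C)):
        chunk = tasks[i:i + len(C)]
        results += [worker_response(p, t) for p, t in zip(C, chunk)]
    return results

# Usage sketch with honest workers only.
print(perform_tasks_Fa([lambda v=v: v for v in range(20)], n=10,
                       worker_response=lambda p, t: t()))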
Work and Time Complexities Theorem: The algorithm performs all n tasks correctly in O(log n) time and has O(n) work and task-oriented work complexities, with high probability.
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Performing Tasks under Fb
procedure for master process M:
  For t = 0, ..., k log n, k > 0:
    Choose a random permutation π ∈ S_n
    For each j ∈ [n]:
      Send task T_j to processor π(j)
    End For
    Collect the responses from all the workers
  End For
  For each j ∈ [n]:
    Choose the majority of the results of computation for task T_j as the result
  End For
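A simulation sketch of the permutation-and-majority scheme (the number of rounds, the corruption model, and the plurality vote via collections.Counter are illustrative assumptions):

import math
import random
from collections import Counter

def perform_tasks_Fb(tasks, n, worker_response, k=2.0):
    # Run ~k*log n rounds; in each round a random permutation pi assigns
    # task j to worker pi(j); finally take the per-task majority of the results.
    rounds = int(k * math.log(n)) + 1
    votes = [[] for _ in tasks]
    for _ in range(rounds):
        pi = list(range(n))
        random.shuffle(pi)                         # random permutation of the workers
        for j, task in enumerate(tasks):
            votes[j].append(worker_response(pi[j], task))
    # Majority (plurality) decision per task.
    return [Counter(v).most_common(1)[0][0] for v in votes]

# Usage sketch: workers 0-2 lie half the time, the rest are honest.
resp = lambda p, t: t() + 1 if (p < 3 and random.random() < 0.5) else t()
print(perform_tasks_Fb([lambda v=v: v * v for v in range(10)], n=10, worker_response=resp))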
Work and Time Complexities Theorem: The algorithm performs all n tasks correctly in O(log n) time and has task-oriented work and work complexities O(n log n), for 0 < p, f < ½ and (1 − f)(1 − p) > ½, with high probability.
Overview • Models of Computation • Stopping Rule Algorithm based solution • Detection of Faulty Processors • Performing Tasks with Faulty Workers • Conclusions
Conclusions • Perform tasks under the above models when the tasks are dependent • the dependency graph can be a DAG • quantify the work and time complexities in terms of characteristics of the DAG