Research on Fault Tolerance
-
Highly Fault-Tolerant Parallel Computation
When the function of a processor is critical, it is common to
use three in place of the one so that even if one fails, its
instructions will be overridden by the other two.
Similarly, one could protect against the failure of
k processors by following the instructions of a majority
of 2k+1.
But, what if these processors are components of a large parallel
machine?
Is it necessary to replicate each processor 2k+1 times to
insure against the loss of any k?
We show that, in many cases, one can do better.
- Fault diagnosis in a small constant number of parallel testing rounds
Assume that processors can be used to test whether others are faulty, we
show how to find the faulty ones provided that a strict majority
function correctly.
Daniel A. Spielman
Last modified: Thu Aug 2 16:53:55 EDT 2001