Highly Fault-Tolerant Parallel Computation

Author: Daniel A. Spielman.

Bibliographic Information: Appeared in Proceedings of the 37th Annual IEEE Conference on Foundations of Computer Science, pp. 154-163, 1996.


We re-introduce the {\em coded model} of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope for exponentially reliable computation with devices in which each component can fail with some constant probability.

We consider fine-grained parallel computations in which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time $t$ on $w$ processors can be performed on a faulty machine in the coded model using $w \log^{O(1)}w $ processors and time $t \log^{O(1)}w$. The failure probability of the computation will be $t \cdot exp(-w^{1/4})$. We show that any computation that can be performed efficiently by a parallel machine can be performed in the coded model by an exponentially reliable fault-tolerant circuit whose size is greater by a polylogarithmic factor: Any circuit of width $w$ and height $h$ can be simulated in the coded model by a circuit of size $h w \log^{O(1)} w$ that will fail with probability at most $h \cdot \mbox{{\rm exp}}(-w^{1-1/\log \log w})$.

The codes used to communicate with our fault-tolerant circuits can be encoded and decoded in $\O{n \log^{2} n}$ sequential time and are independent of the circuit they are used to communicate with.

We show how techniques from this paper can be used to self-correct many linear functions in parallel with arbitrarily small overhead.

You can download this paper as Postscript, compressed Postscript, PDF, or compressed PDF.