*For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.*

Like SynchronousAgreement except that we replace crash failures with **Byzantine failures**, where a faulty process can ignore its programming and send any messages it likes. Since we are operating under a universal quantifier, this includes the case where the Byzantine processes appear to be colluding with each other under the control of a centralized adversary.

# 1. Lower bounds

## 1.1. Minimum number of rounds

We've already seen an f+1 lower bound on rounds for crash failures (see IndistinguishabilityProofs). Applies *a fortiori* to Byzantine failures, since Byzantine failures can simulate crash failures.

## 1.2. Minimum number of processes

We can also show that we need n ≥ 3f+1 processes. For n = 3 and f = 1 the intuition is that Byzantine B can play A and C off against each other, telling A that C is Byzantine and C that A is Byzantine. Since A is telling C the same thing about B that B is saying about A, C can't tell the difference and doesn't know who to believe. Unfortunately, this tragic soap opera is not a real proof, since we haven't actualy shown that B can say exactly the right thing to keep A and C from guessing B is evil.

The real proof (see also AttiyaWelch §5.2.3): consider an artifical execution where (non-Byzantine) A, B, and C are duplicated and then placed in a ring A_{0} B_{0} C_{0} A_{1} B_{1} C_{1}, where the digits indicate inputs. We'll still keep the same code for n=3 on A_{0}, B_{0}, etc, but when A_{0} tries to send a message to what it thinks of as just C we'll send it to C_{1} but messages from B_{0} will instead go to C_{0}. For any adjacent pair of processes (e.g. A_{0} and B_{0}), the behavior of the rest of the ring could be simulated by a single Byzantine process (e.g. C), so each process in the 6-process ring behaves just as it does in some 3-process execution with 1 Byzantine process. It follows that all of the processes terminate and decide in the unholy 6-process Frankenexecution^{1} the same value that they would in the corresponding 3-process Byzantine execution. So what do they decide?

Given two processes with the same input e.g. A_{0} and B_{0}, the giant execution is indistinguishable from an A_{0} B_{0} C execution where C is Byzantine. Validity says A_{0} and B_{0} must both decide 0. Since this works for any pair of processes with the same input, we have each process deciding its input. But now consider the execution of C_{0} A_{1} B, where B is Byzantine. In the big execution, we just proved that C_{0} decides 0 and A_{1} decides 1, but since the C_{0} A_{1} B execution is indistinguishable from the big execution to C_{0} and A_{1}, they do the same thing here and violate agreement.

This shows that with n=3 and f=1, we can't win. We can generalize this to n = 3f. Suppose that there were an algorithm that solved Byzantine agreement with n=3f processes. Group the processes into groups of size f, and let each of the n=3 processes simulate one group (giving everybody the same input). Then we get a protocol for n=3 and f=1, an impossibility.

## 1.3. Minimum connectivity

So far, we've been assuming a complete communication graph. If the graph is not complete, we may not be able to tolerate as many failures. In particular, we need the connectivity of the graph (minimum number of nodes that must be removed to split it into two components) to be at least 2f+1. See LynchBook §6.5 for the full proof. The essential idea is that if we have an arbitrary graph with a vertex cut of size k < 2f+1, we can simulate it on a 4-process graph where A is connected to B and C (but not D), B and C are connected to each other, and D is connected only to B and C. Here B and C each simulate half the processes in the size-k cut, A simulates all the processes on one side of the cut and D all the processes on the other side. We then construct an 8-process artificial execution with two non-faulty copies of each of A, B, C, and D and argue that if one of B or C can be Byzantine then the 8-process execution is indistinguishable to the remaining processes from a normal 4-process execution. An argument similar to the n ≥ 3f+1 proof then shows we violate one of validity or agreement.

Conversely, if we have connectivity 2f+1, then the processes can simulate a general graph by sending each other messages along 2f+1 predetermined vertex-disjoint paths and taking the majority value as the correct message. Since the f Byzantine processes can only corrupt one path each (assuming the non-faulty processes are careful about who they forward messages from), we get at least f+1 good copies overwhelming the f bad copies. This reduces the problem on a general graph with sufficiently high connectivity to the problem on a complete graph, allowing Byzantine agreement to be solved if the other lower bounds are met.

# 2. Upper bounds

## 2.1. Exponential information gathering gets n = 3f+1

We'll show that a variant of Exponential Information Gathering for SynchronousAgreement works with n ≥ 3f+1 in f+1 rounds.

Recall EIG gives us at each node a set of pairs (path, value) where path spans all sequences of 0 to n distinct ids and value is the input value forwarded along that path. We think of the set of paths as a tree where w is the parent of wj for each path w and each id j not in w. To apply EIG in the Byzantine model, ill-formed messages received from j are treated as missing messages, but otherwise the data-collecting part of EIG proceeds as in the crash failure model. However, we compute the decision value recursively as follows. First replace any pair (w, ⊥) with |w| = f+1 with (w, 0). Then for each path w, define val'(w, i) = majority value among val'(wj, i) for all j, or val(w, i) if |w| = f+1. Finally, decide val'(< >, i).

In computing the final decision we ignore any values attached to short paths, replacing them with the computed majority values. One way to think about this is that I don't trust j to give me the right value for wj, so instead a take a majority of values of wj that j allegedly reported to other people---which are in turn not computed from the untrustworthy accounts of those unreliable witnesses but from majorities repeating what they said.

### 2.1.1. Proof of correctness

This is just a sketch of the proof from LynchBook §6.3.2; essentially the same argument appears in AttiyaWelch §5.2.4.

- Lemma
- If i, j, and k are all non-faulty then for all w, val(wk, i) = val(wk, j) = val(w, k).
- Proof
- Trivial: k announces the same value val(w, k) to both i and j.
- Lemma
- If j is non-faulty then val'(wj, i) = val(wj, i) for all non-faulty i and all w.
- Proof
By induction on f+1-|wj|. If |wj| = f+1, then val'(wj, i) = val(wj, i) = val(w, j) = val(wj, i). If |wj| < f+1, then val(wj, k) = val(w, j) for all non-faulty k. It follows that val(wjk, i) = val(w, j) for all non-faulty i and k (that do no appear in w). The bad guys report at most f bad values val(wj, k'), but the good guys report at least n-f-|wj| good values val(wj, k). Since n ≥ 3f + 1 and |wj| <= f we have n - f - |wj| ≥ 3f+1 - f - f ≥ f+1 good values, which are a majority.

- Proof of Validity
- If everybody has the same input v, we always get a majority of val' values equal to this input at each step, and we win.
- Proof of Agreement
Call a node

*common*if val'(w, i) = val'(w, j) for all non-faulty i, j. Lemma above says wk is common if k is good. With n ≥ 3f+1, taking majorities ensures any parent of only common nodes is also common. Observe that every path has a common node on it, since a path travels through f+1 nodes and one of them is good. Now suppose root is not common: it must have a not-common kid, that node must have a not-common kid, etc. But this constructs a path with a not-common kid which we just proved can't happen.

## 2.2. Phase king gets constant-size messages

The following algorithm, due to Berman, Garay, and Perry , achieves Byzantine agreement in 2(f+1) rounds using constant-size messages, provided n ≥ 4f+1. The description here is drawn from AttiyaWelch §5.2.5. There is also a variant that achieves n = 3f+1 in 3(f+1) rounds (see http://cm.bell-labs.com/who/garay/bit.ps), but it's more complicated.

### 2.2.1. The algorithm

The basic idea of the algorithm is that we avoid the recursive majority voting of EIG by running a vote in each of f+1 "phases" through a "phase king," some process chosen in advance to run the phase. Since the number of phases exceeds the number of faults, we eventually get a non-faulty phase king. The algorithm is structured so that one non-faulty phase king is enough to generate agreement and subsequent faulty phase kings can't undo the agreement.

Here is pseudo-code for the algorithm:

Each processes i maintains an array pref

_{i}[j] where j ranges over all process ids.- There are also utility values majority and multiplicity.
Initially, pref

_{i}[i] = input and pref_{i}[j] = 0 for j <> i.- Phase k consists of rounds 2k-1 and 2k:
- In round 2k-1 (first round of phase k):
send pref

_{i}[i] to all processes (including myself)set pref

_{i}[j] = v_{j}where v_{j}is the value received from process j.set majority = majority value in pref

_{i}and multiplicity = # of times majority appears in pref_{i}.

- In round 2k (second round of phase k):
- if i = k then send majority to all processes (only the phase king does this)
- receive king-majority from the phase king (everybody does this)
if multiplicity > n/2 + f then

pref

_{i}[i] = majority

- else
pref

_{i}[i] = king-majority

if k = f+1 then decide pref

_{i}[i].

- In round 2k-1 (first round of phase k):

What is going on here is that in each phase, everybody announces their current preference (initially the inputs). If the majority of these preferences is large enough (e.g. all inputs are the same), everybody adopts the majority preference. Otherwise everybody adopts the preference of the phase king.

### 2.2.2. Proof of correctness

Termination is immediate from the algorithm.

For Validity, suppose all inputs are v. We'll show that all non-faulty i have pref_{i}[i] = v after every phase. In the first round of each phase, process i receives at least n-f messages containing v; since n ≥ 4f + 1 we have n-f ≥ 3f+1 and n/2 + f <= (4f+1)/2 + f = 3f+1/2, and thus these n-f messages exceed the n/2+f threshold for adopting them as the new preference. So all non-faulty processes ignore the phase king and stick with v, eventually deciding v after round 2(f+1).

For Agreement, we'll ignore all phases up to the first phase with a non-faulty phase king. Let k be the first such phase, and assume that the pref values are set arbitrarily at the start of this phase. We want to argue that at the end of the phase, all non-faulty processes have the same preference. There are two ways that a process can set its new preference in the second round of the phase:

The process i observes a majority of more than n/2+f identical values v and ignores the phase king. Of these values, more than n/2 of them were sent by non-faulty processes. So the phase king also receives these values (even if the faulty processes change their stories) and chooses v as its majority value. Similarly, if any other process j observes a majority of n/2+f identical values, the two >n/2 non-faulty parts of the majorities overlap, and so j also chooses v.

- The process i takes its value from the phase king. We've already shown that i then agrees with any j that sees a big majority; but since the phase king is non-faulty, process i will agree with any process j that also takes its new preference from the phase king.

This shows that after any phase with a non-faulty king, all processes agree. The proof that the non-faulty processes continue to agree is the same as for Validity.

### 2.2.3. Performance of phase king

It's not hard to see that this algorithm sends exactly (f+1)(n^2+n) messages of 1 bit each (assuming 1-bit inputs). The cost is doubling the minimum number of rounds and reducing the tolerance for Byzantine processes. As mentioned earlier, a variant of phase-king with 3-round phases gets optimal fault-tolerance with 3(f+1) rounds (but 2-bit messages). Still better is a rather complicated descendant of the EIG algorithm due to Garay and Moses (see http://cm.bell-labs.com/who/garay/3t.pdf), which gets f+1 rounds with n ≥ 3f+1 while still having polynomial message traffic.

# 3. Weak Byzantine agreement

(See LynchBook §6.6.)

Like regular Byzantine agreement but Validity is only required to hold if there are no faulty processes at all, i.e. if there is a single faulty process, the non-faulty processes can output any value regardless of their inputs (as long as they agree on it). Sadly, this weakening doesn't improve things much: even weak Byzantine agreement can be solved only if n ≥ 3f+1.

Proof: As in the strong Byzantine agreement case, we'll construct a many-process Frankenexecution to figure out a strategy for a single Byzantine process in a 3-process execution. The difference is that now the number of processes in our synthetic execution is much larger, since we want to build an execution where at least some of our test subjects think they are in a non-Byzantine environment. The trick is to build a very big, highly-symmetric ring so that at least some of the processes are so far away from the few points of asymmetry that might clue them in to their odd condition that the protocol terminates before they notice.

Fix some protocol that allegedly solves weak Byzantine agreement, and let r be the number of rounds for the protocol. Construct a ring of 6r processes A_{01} B_{01} C_{01} A_{02} B_{02} C_{02} ... A_{0r} B_{0r} C_{0r} A_{10} B_{10} C_{10} ... A_{1r} B_{1r} C_{1r} where each X_{ij} runs the code for process X in the 3-process protocol with input i. For each adjacent pair of processes, there is a 3-process Byzantine execution which is indistinguishable from the 6r-process execution for that pair: since Agreement holds in all Byzantine executions, each adjacent pair decides the same value in the big execution and so either everybody decides 0 or everybody decides 1 in the big execution.

Now we'll show that means that Validity is violated in some no-failures 3-process execution. We'll extract this execution by looking at the execution of processes A0_{r/2} B0_{r/2} C0_{r/2}. The argument is that up to round r, any input-0 process that is at least r steps in the ring away from the nearest 1-input process acts like the corresponding process in the all-0 no-failures 3-process execution. Since A0_{r/2} is 3r/2 > r hops away from A1r and similarly for C0_{r/2}, our 3 stooges all decide 0 by Validity. But now repeat the same argument for A1_{r/2} B1_{r/2} C1_{r/2} and get 3 new stooges that all decide 1. This means that somewhere in between we have two adjacent processes where one decides 0 and one decides 1, violating agreement in the corresponding 3-process execution where the rest of the ring is replaced by a single Byzantine process.

CategoryDistributedComputingNotes

Not a real word. (1)