1. Abstract
First, what is Byzantine fault tolerance?
In simple terms, a distributed system is Byzantine fault tolerant if it continues to work correctly even when some of its nodes fail, stop, or behave arbitrarily.
This paper proposes a new replication-based algorithm that can tolerate Byzantine faults.
Most previous work solves this problem under the assumption that the system is synchronous.
The proposed algorithm works in an asynchronous environment and achieves performance comparable to the earlier work.
2. Introduction
Let’s start with a quote directly from the paper:
This paper presents a new, practical algorithm for state machine replication that tolerates Byzantine faults.
This nicely frames the paper's area of work: state machine replication is the core.
The algorithm offers both liveness and safety provided at most floor((n-1)/3) replicas are simultaneously faulty out of a total of n.
Most of the earlier work falls into two categories: techniques too inefficient to be practical, and techniques that assume a synchronous system (i.e., bounded message delays).
Examples of synchronous systems are Rampart and SecureRing.
The algorithm uses only one message round trip to execute read-only operations and two to execute read-write operations.
The team implemented a replication library and used it to build a Byzantine-fault-tolerant distributed file system that supports the NFS protocol. The results show that the system is only 3% slower than the standard NFS daemon.
The paper makes the following contributions:
It describes the first state-machine replication protocol that correctly survives Byzantine faults in asynchronous networks.
It describes several important optimizations that allow the algorithm to perform well in real systems.
It describes the implementation of a Byzantine-fault tolerant distributed file system.
It provides experimental results that quantify the cost of the replication technique.
3. System Model
The first point is the asynchronous environment. The team assumes an asynchronous distributed system where the network (to connect nodes) may fail to deliver, delay, duplicate, or deliver messages out of order.
The failure model for nodes is Byzantine, meaning faulty nodes may behave arbitrarily.
One important thing to note: node failures are assumed to be independent, so steps must be taken to make this assumption hold in practice. This can be done by using different implementations of the service code and operating system on different nodes, and by giving each node a different root password and a different administrator.
The team uses cryptographic techniques to prevent spoofing and replays and to detect corrupted messages.
The messages transferred in the algorithm contain public key signatures, message authentication codes, and message digests.
The digest of a message is signed, and the signature is appended to the plaintext message.
All replicas in the system know the public keys of other replicas.
The algorithm assumes that the adversary cannot delay correct nodes indefinitely. The adversary is also computationally bound, and therefore unable to subvert the cryptographic techniques.
A few examples of this bound are the following:
The adversary cannot produce a valid signature of a non-faulty node.
The adversary cannot compute the information summarized by a digest from the digest.
The adversary cannot find two messages with the same digest.
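The digest-then-sign pattern described above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's code: it uses SHA-256 for the digest and an HMAC as a stand-in for the public-key signature (the real algorithm signs the digest with the sender's private key), purely to keep the example self-contained.

```python
import hashlib
import hmac

def digest(message: bytes) -> bytes:
    # Collision-resistant digest of the message (the paper's D(m)).
    return hashlib.sha256(message).digest()

def sign(message: bytes, key: bytes) -> bytes:
    # Stand-in for a public-key signature over the digest: the signer
    # signs D(m), not the whole message. HMAC is used here only so the
    # sketch runs without a crypto library.
    return hmac.new(key, digest(message), hashlib.sha256).digest()

def verify(message: bytes, signature: bytes, key: bytes) -> bool:
    # A tampered message produces a different digest, so verification fails.
    return hmac.compare_digest(sign(message, key), signature)

msg = b"<REQUEST, o, t, c>"
key = b"replica-0-secret"
sig = sign(msg, key)
assert verify(msg, sig, key)
assert not verify(b"tampered", sig, key)
```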
4. Service Properties
The algorithm can implement any deterministic replicated service with a state and some operations.
Clients request the replicated service to invoke operations and block waiting for a reply.
The algorithm provides both safety and liveness, as said previously.
Safety means that the replicated service satisfies linearizability. Safety is provided regardless of how many faulty clients are using the service.
The safety property is insufficient to guard against faulty clients. So, the algorithm limits the amount of damage a faulty client can do by providing access control.
The algorithm does not rely on synchrony to provide safety. It relies on synchrony only to provide liveness.
The paper defines delay(t) as the time between the moment t when a message is sent for the first time and the moment it is received by its destination (the sender keeps retransmitting until it is received).
The algorithm guarantees liveness, meaning clients eventually receive replies to their requests. It assumes that at most floor((n-1)/3) replicas are faulty and that delay(t) does not grow faster than t indefinitely.
The algorithm has optimal resiliency. To provide safety and liveness properties when up to f replicas are faulty, we need a minimum of 3f + 1 replicas in the asynchronous system.
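These resiliency bounds are simple arithmetic, and a tiny sketch makes them concrete (Python is used here only for illustration):

```python
def max_faulty(n: int) -> int:
    # Maximum number of simultaneously faulty replicas tolerated: floor((n-1)/3).
    return (n - 1) // 3

def min_replicas(f: int) -> int:
    # Minimum replicas needed to tolerate f Byzantine faults: 3f + 1.
    # With n = 3f + 1, any two quorums of 2f + 1 replicas intersect in at
    # least f + 1 replicas, so at least one honest replica is in both.
    return 3 * f + 1

assert max_faulty(4) == 1    # 4 replicas tolerate 1 fault
assert min_replicas(1) == 4
assert max_faulty(7) == 2    # 7 replicas tolerate 2 faults
```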
5. The Algorithm
The algorithm is a form of state machine replication.
The replicas move through a succession of configurations called views. In a view, one replica is the primary, and the others are backups. Views are numbered consecutively. View change happens when it appears that the primary has failed.
The simple steps of the algorithm are as follows:
A client sends a request to the primary.
The primary multicasts the request to the backups.
The replicas execute the request and send a reply to the client.
The client waits for f+1 replies from different replicas with the same result.
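The four steps above can be sketched as a toy simulation. This is a deliberately simplified Python illustration, not the protocol itself (it skips the three-phase ordering entirely); it only shows why f+1 matching replies from different replicas let the client trust a result even when one replica lies.

```python
from collections import Counter

class Replica:
    """Toy deterministic replica: same start state + same requests => same result."""
    def __init__(self, faulty: bool = False):
        self.state = 0
        self.faulty = faulty

    def execute(self, request: str) -> str:
        if self.faulty:
            return "garbage"        # a Byzantine replica may reply with anything
        self.state += 1             # deterministic state transition
        return f"ok:{request}:{self.state}"

def client_request(request: str, replicas: list, f: int) -> str:
    # Steps 1-3: the request reaches every replica (via the primary),
    # and each replica executes it and replies.
    replies = [r.execute(request) for r in replicas]
    # Step 4: the client accepts a result once f+1 different replicas agree.
    result, count = Counter(replies).most_common(1)[0]
    if count >= f + 1:
        return result
    raise RuntimeError("no result with f+1 matching replies")

replicas = [Replica(), Replica(), Replica(), Replica(faulty=True)]  # n=4, f=1
print(client_request("write(x)", replicas, f=1))
```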
There are two requirements for replicas:
They must be deterministic.
They must start with the same state.
The algorithm ensures the safety property by guaranteeing that all non-faulty replicas agree on a total order for the execution of requests despite failures.
5.1 The Client
A client c requests the execution of state machine operation o by sending a REQUEST message <REQUEST, o, t, c> to the primary.
Here, the timestamp t ensures exactly-once semantics for executing client requests. The timestamps for a client's requests are totally ordered: later requests have higher timestamps than earlier ones.
Each message the replicas send to the client includes the current view number.
The client waits for f + 1 replies with valid signatures from different replicas. If the client does not receive replies soon enough, it broadcasts the request to all replicas.
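The exactly-once behavior driven by the timestamp can be sketched as follows. This is an illustrative Python sketch under one assumption worth flagging: a replica caches the last reply sent to each client and re-sends it for a duplicate request instead of re-executing the operation.

```python
class ServiceReplica:
    """Sketch of per-client timestamp tracking for exactly-once execution."""
    def __init__(self):
        self.last = {}   # client id -> (timestamp, cached reply)

    def handle_request(self, o: str, t: int, c: str) -> str:
        # A client's timestamps are totally ordered, so any request whose
        # timestamp is not newer than the last executed one is a duplicate
        # (e.g., a retransmission): re-send the cached reply, don't re-execute.
        if c in self.last and t <= self.last[c][0]:
            return self.last[c][1]
        reply = self.execute(o)
        self.last[c] = (t, reply)
        return reply

    def execute(self, o: str) -> str:
        return f"result({o})"
```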
5.2 Normal-Case Operation
The state of each replica includes
The state of the service.
A message log containing messages the replica has accepted.
An integer denoting the replica’s current view.
When the primary receives a client request, a three-phase protocol is started to atomically multicast the request to the replicas. Buffered requests are multicast later as a group.
The three phases are pre-prepare, prepare, and commit.
The pre-prepare and prepare phases are used to totally order requests sent in the same view.
The prepare and commit phases are used to ensure that requests that commit are totally ordered across views.
In the pre-prepare phase, the primary assigns a sequence number, n, to the request, multicasts a PRE-PREPARE message with the request message m piggybacked to all the backups, and appends the message to its log.
A replica (including the primary) accepts prepare messages and adds them to its log, provided their signatures are correct, their view number equals the replica's current view, and their sequence number is between the low and high water marks.
The replicas verify whether the prepares match the pre-prepare by checking that they have the same view, sequence number, and digest.
So, in simple terms, The pre-prepare and prepare phases of the algorithm guarantee that non-faulty replicas agree on a total order for the requests within a view.
Replicas multicast a COMMIT message to the other replicas once the prepared predicate becomes true.
This starts the commit phase. Replicas accept commit messages and insert them in their log provided they are correctly signed, the view number in the message is equal to the replica’s current view, and the sequence number is between the low and high water marks.
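The acceptance checks and the prepared predicate described above can be sketched directly. This Python sketch uses an assumed in-memory log representation (a list of named tuples); the field names are illustrative, but the checks mirror the text: correct signature, matching view, sequence number within the water marks, and a matching PRE-PREPARE plus 2f PREPAREs with the same view, sequence number, and digest.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "type view seq digest sender")

def accepts(msg: Entry, current_view: int, h: int, H: int,
            signature_ok: bool = True) -> bool:
    # A replica logs a PREPARE/COMMIT only if it is correctly signed, its
    # view equals the replica's current view, and its sequence number lies
    # between the low and high water marks h and H.
    return signature_ok and msg.view == current_view and h < msg.seq <= H

def prepared(log: list, digest: str, v: int, n: int, f: int) -> bool:
    # prepared is true once the log holds a matching PRE-PREPARE plus
    # 2f PREPAREs from distinct backups with the same view, seq, and digest.
    has_preprepare = any(e.type == "PRE-PREPARE" and e.view == v
                         and e.seq == n and e.digest == digest for e in log)
    prepare_senders = {e.sender for e in log
                       if e.type == "PREPARE" and e.view == v
                       and e.seq == n and e.digest == digest}
    return has_preprepare and len(prepare_senders) >= 2 * f
```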
Also, the algorithm ensures that any request that commits locally at a non-faulty replica will eventually commit at f + 1 or more non-faulty replicas.
After executing the requested operation, replicas send a reply to the client.
The diagram above shows the algorithm's operation in the normal case of no primary faults. Replica 0 is the primary, replica 3 is faulty, and C is the client.
5.3 View Changes
The view-change protocol provides liveness.
A backup is said to be waiting for a request if it has received a valid request that has not yet been executed.
View changes are triggered by timeouts that prevent backups from waiting indefinitely for requests to execute.
A backup starts a timer when it receives a request and the timer is not already running. It stops the timer when it is no longer waiting to execute the request, and restarts it if, at that point, it is waiting to execute some other request.
If a backup's timer expires in view v, the backup starts a view change to move the system to view v + 1. It stops accepting messages (other than checkpoint, view-change, and new-view messages) and multicasts a VIEW-CHANGE message to all replicas.
When the primary of view v + 1 receives 2f valid view-change messages for view v + 1 from other replicas, it multicasts a NEW-VIEW message to all other replicas.
The primary enters view v + 1 and can now accept messages for view v + 1.
A backup accepts a NEW-VIEW message for view v + 1 if it is appropriately signed and if the view-change messages it contains are valid for view v + 1.
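The trigger side of the view change can be sketched as a small state machine. This is an illustrative Python sketch only; it models the timeout and the 2f-vote threshold, and omits the message contents (checkpoint proofs, prepared sets) that real VIEW-CHANGE and NEW-VIEW messages carry.

```python
class ViewChangeProtocol:
    """Sketch of view-change triggering for a system tolerating f faults."""
    def __init__(self, f: int, view: int = 0):
        self.f = f
        self.view = view
        self.votes = {}          # target view -> set of replica ids seen

    def on_timer_expired(self):
        # Backup's timer expired in view v: announce a move to view v+1.
        # (At this point it also stops accepting messages other than
        # checkpoint, view-change, and new-view messages.)
        return ("VIEW-CHANGE", self.view + 1)

    def on_view_change_msg(self, target_view: int, sender: int):
        # The would-be primary of target_view announces NEW-VIEW once it
        # holds 2f valid VIEW-CHANGE messages from other replicas.
        self.votes.setdefault(target_view, set()).add(sender)
        if len(self.votes[target_view]) >= 2 * self.f:
            self.view = target_view   # enter the new view
            return ("NEW-VIEW", target_view)
        return None
```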
5.4 Non-Determinism
Until now, we have assumed that state machine replicas are deterministic. In practice, some services involve some form of non-determinism.
The algorithm has a solution for this. The primary needs to select the non-deterministic value either independently or based on values provided by the backups.
Suppose the primary selects the non-deterministic value independently. In that case, it concatenates the value with the associated request. It executes the three-phase protocol to ensure that non-faulty replicas agree on a sequence number for the request and value.
If replicas have to participate in selecting the non-deterministic value to satisfy a service’s specification, the algorithm adds an extra phase to the protocol. The primary obtains authenticated values proposed by the backups. It concatenates 2f + 1 of them with the associated request and starts the three-phase protocol for the concatenated message.
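Both cases can be sketched compactly. This Python sketch is illustrative: the function names and the representation of a "concatenated" message as a tuple are assumptions, but the structure follows the text, with the primary either choosing the value itself or folding in 2f + 1 backup-proposed values before the three-phase protocol runs.

```python
def order_independent_value(request: str, value):
    # Case 1: the primary selects the non-deterministic value itself
    # (e.g., a timestamp) and concatenates it with the request, so every
    # replica executes with the identical value after ordering.
    return (request, value)

def order_with_backup_values(request: str, authenticated_values: list, f: int):
    # Case 2: the primary collects values proposed by the backups and
    # concatenates 2f + 1 of them with the request; the three-phase
    # protocol then runs on the combined message.
    if len(authenticated_values) < 2 * f + 1:
        raise ValueError("need at least 2f + 1 proposed values")
    return (request, tuple(authenticated_values[: 2 * f + 1]))
```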
I have left out the safety and liveness proofs and garbage collection, as these are advanced topics best studied when implementing the paper.
6. Optimizations
6.1 Reducing Communication
The algorithm uses three optimizations to reduce the cost of communication.
Avoid sending most large replies.
Reduce the number of message delays for an operation invocation from 5 to 4.
Improve the performance of read-only operations that do not modify the service state.
6.2 Cryptography
The algorithm uses digital signatures only for view-change and new-view messages, which are sent rarely. It authenticates all other messages using message authentication codes (MACs).
MACs cannot prove that a message is authentic to a third party. The team modified the algorithm to circumvent the problem by taking advantage of specific invariants.
The invariant is that no two different requests prepare with the same view and sequence number at two non-faulty replicas.
MACs can be computed three orders of magnitude faster than digital signatures. Each node, including active clients, shares a 16-byte secret session key with each replica.
The algorithm computes message authentication codes (MACs) by applying MD5 to the concatenation of the message with the secret key. It then uses the ten least significant bytes of the final MD5 digest.
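This MAC construction is simple enough to sketch directly. An illustrative Python version follows, with one assumption flagged: "least significant bytes" is taken to mean the trailing bytes of the 16-byte MD5 digest. (MD5 is long broken for collision resistance today; it is kept here only because it is what the paper describes.)

```python
import hashlib

def mac(message: bytes, session_key: bytes) -> bytes:
    # MAC per the paper's description: MD5 over the message concatenated
    # with the 16-byte secret session key, keeping 10 bytes of the digest.
    # Assumption: "least significant" = the trailing bytes of the digest.
    assert len(session_key) == 16
    return hashlib.md5(message + session_key).digest()[-10:]

key = b"0123456789abcdef"              # 16-byte shared session key
tag = mac(b"<COMMIT, v, n, D(m), i>", key)
assert len(tag) == 10                  # a MAC is far shorter than a signature
```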
7. Implementation
As we have understood the algorithm, let’s understand the implementation done by the team behind this paper.
The team created a replication library and used it to implement a replicated NFS.
7.1 The Replication Library
The replication library has a single procedure - invoke.
The client calls invoke with one argument, an input buffer containing a request to invoke a state machine operation. It returns a pointer to a buffer containing the operation result. On the server side, the replication code makes many upcalls to procedures implemented by the server.
There are procedures to:
Execute requests (execute),
Maintain checkpoints of the service state (make checkpoint, delete checkpoint),
Obtain the digest of a specified checkpoint (get digest),
Obtain missing information (get checkpoint, set checkpoint).
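The upcall interface can be sketched as an abstract class. The paper names only the calls, so every signature below is an assumption made for illustration, and Python is used in place of the library's actual C++-style interface; EchoService is a hypothetical minimal stand-in used to exercise it.

```python
from abc import ABC, abstractmethod

class ServiceUpcalls(ABC):
    """Upcalls the replication code makes into the server (sketch)."""

    @abstractmethod
    def execute(self, request: bytes) -> bytes: ...
    @abstractmethod
    def make_checkpoint(self, seqno: int) -> None: ...
    @abstractmethod
    def delete_checkpoint(self, seqno: int) -> None: ...
    @abstractmethod
    def get_digest(self, seqno: int) -> bytes: ...
    @abstractmethod
    def get_checkpoint(self, seqno: int) -> bytes: ...
    @abstractmethod
    def set_checkpoint(self, data: bytes) -> None: ...

class EchoService(ServiceUpcalls):
    """Hypothetical minimal service showing how a server fills in the upcalls."""
    def __init__(self):
        self.checkpoints = {}
    def execute(self, request): return b"reply:" + request
    def make_checkpoint(self, seqno): self.checkpoints[seqno] = b"state"
    def delete_checkpoint(self, seqno): self.checkpoints.pop(seqno, None)
    def get_digest(self, seqno): return bytes([sum(self.checkpoints[seqno]) % 256])
    def get_checkpoint(self, seqno): return self.checkpoints[seqno]
    def set_checkpoint(self, data): self.checkpoints[0] = data
```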
UDP is used for point-to-point communication between nodes. UDP over IP multicast is used for multicast to the group of replicas.
The algorithm tolerates out-of-order delivery and rejects duplicates.
7.2 BFS: A Byzantine-Fault-tolerant File System
Using the replication library, the team implemented BFS, a Byzantine-fault-tolerant NFS service.
The diagram below shows the architecture of BFS:
As with any regular NFS setup, the file system is mounted on the client machine.
Application processes run unmodified and interact with the mounted file system through the NFS client in the kernel. The team relies on user-level relay processes to mediate communication between the standard NFS client and the replicas.
The relay receives NFS protocol requests, calls the invoke procedure, and returns the result to the NFS client.
Each replica runs a user-level process with the replication library and the team's NFS V2 daemon, snfsd (for simple nfsd).
The replication library receives requests from the relay, interacts with snfsd by making upcalls, and packages NFS replies into replication protocol replies that it sends to the relay.
7.3 Maintaining Checkpoints
The snfsd executes file system operations directly on a memory-mapped file and maintains a copy-on-write bit for every 512-byte block in that file.
When the replication code invokes the make_checkpoint upcall, snfsd sets all the copy-on-write bits and creates a (volatile) checkpoint record, containing the current sequence number.
When a block of the memory-mapped file is modified, snfsd checks the copy-on-write bit for the block and, if it is set, stores the block’s current contents and its identifier in the checkpoint record for the last checkpoint.
After that, it overwrites the block with its new value and resets its copy-on-write bit.
To serve a get_checkpoint upcall, snfsd first searches for the block in the checkpoint record of the stable checkpoint, and then searches the checkpoint records of any later checkpoints.
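The copy-on-write mechanics described above can be sketched in a few lines. This illustrative Python sketch (class and method names are assumptions) keeps the state in a bytearray and saves a block's old contents into the latest checkpoint record on the first write after a checkpoint, exactly as the text describes.

```python
class CheckpointStore:
    """Sketch of snfsd-style copy-on-write checkpointing over 512-byte blocks."""
    BLOCK = 512

    def __init__(self, state: bytearray):
        self.state = state
        self.n_blocks = len(state) // self.BLOCK
        self.cow = [False] * self.n_blocks
        self.records = []            # list of (seqno, {block index: old bytes})

    def make_checkpoint(self, seqno: int) -> None:
        # Set every copy-on-write bit and open a fresh (volatile)
        # checkpoint record holding the current sequence number.
        self.cow = [True] * self.n_blocks
        self.records.append((seqno, {}))

    def write_block(self, i: int, data: bytes) -> None:
        lo, hi = i * self.BLOCK, (i + 1) * self.BLOCK
        if self.cow[i]:
            # First write since the checkpoint: save the old contents in
            # the record for the last checkpoint, then reset the bit.
            self.records[-1][1][i] = bytes(self.state[lo:hi])
            self.cow[i] = False
        self.state[lo:hi] = data
```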
7.4 Computing Checkpoint Digests
The snfsd computes a digest of a checkpoint state as part of a make_checkpoint upcall.
The snfsd uses an incremental collision-resistant one-way hash function called AdHash for it. This function divides the state into fixed-size blocks and uses some other hash function (e.g., MD5) to compute the digest of the string obtained by concatenating the block index with the block value for each block.
The digest of the state is the sum of the digests of the blocks modulo some large integer.
To compute the digest for the state incrementally, the snfsd maintains a table with a hash value for each 512-byte block.
When make_checkpoint is called, the following things happen:
The snfsd obtains the digest d for the previous checkpoint state.
It computes new hash values for each block whose copy-on-write bit is reset.
It adds the new hash values to d, subtracts the old hash values from d, and updates the table to contain the new hash values.
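The incremental property of AdHash is easy to demonstrate. In this illustrative Python sketch, the modulus M is an assumption (the paper only says "some large integer"), and MD5 plays the role of the per-block hash as in the text; the test shows that updating one block incrementally gives the same digest as rehashing the whole state.

```python
import hashlib

M = 2**127 - 1   # assumed large modulus; the paper says "some large integer"

def block_hash(index: int, block: bytes) -> int:
    # Hash of the block index concatenated with the block value.
    h = hashlib.md5(index.to_bytes(8, "big") + block).digest()
    return int.from_bytes(h, "big")

def full_digest(blocks: list) -> int:
    # Digest of the state = sum of the block digests modulo M.
    return sum(block_hash(i, b) for i, b in enumerate(blocks)) % M

def incremental_update(d: int, index: int, old_block: bytes,
                       new_block: bytes) -> int:
    # d' = d - h(index, old) + h(index, new)  (mod M):
    # only the modified block is rehashed.
    return (d - block_hash(index, old_block) + block_hash(index, new_block)) % M

blocks = [b"a" * 512, b"b" * 512]
d = full_digest(blocks)
d2 = incremental_update(d, 1, blocks[1], b"c" * 512)
assert d2 == full_digest([b"a" * 512, b"c" * 512])
```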
8. Performance Evaluation
The team has done great work evaluating the system's performance. For curious readers, I suggest reading the original paper’s facts and figures.
9. Related Work
Most previous work on replication techniques ignored Byzantine faults or assumed a synchronous system model.
Rampart and SecureRing are two systems that are most closely related to this paper’s work.
10. Conclusions
The paper has described a new state-machine replication algorithm.
It is a practical algorithm that works correctly in an asynchronous system and can tolerate Byzantine faults.
It also improves the performance of previous algorithms by more than an order of magnitude.