Insights from Paper (Part I) - ZooKeeper: Wait-free coordination for Internet-scale systems
Introduction
ZooKeeper is a service for coordinating processes of distributed applications.
To begin, let’s clarify the concept of coordination with an example. Consider configuration: a configuration is a list of one or more parameters.
All processes running in a distributed system need to know these parameters. How do all of the processes stay in sync with this knowledge?
They must coordinate for that, and it is this coordination we are talking about in the current context.
Configuration is not the only such concern in a distributed system; others include group membership and leader election. The configuration itself can also be dynamic.
What could be the possible solution?
To effectively coordinate various tasks, one solution is to create a tailored coordination service specific to each task type. For example, a leader election service can be implemented in a distributed system.
Another option is to develop low-level primitives that users can use as a foundation to construct high-level primitives for coordination tasks.
ZooKeeper follows the latter approach by providing an API that developers use to create their own primitives, resulting in a wide range of coordination options to meet diverse needs.
The main benefit of this approach is its flexibility in accommodating different coordination needs.
The next important point is what kind of primitives ZooKeeper provides. Let’s take another example. ZooKeeper could have exposed locks directly, and users could have built blocking primitives on top of them. But blocking primitives let slow or faulty clients hold up faster ones, and the service would have to handle failure detection and other associated machinery.
So the team decided to build an API that manipulates simple wait-free data objects, organized hierarchically as in file systems.
In this respect, ZooKeeper’s API resembles Chubby’s, but without the lock methods and without open and close.
The wait-free property is important but not sufficient for coordination. The client needs order guarantees for operations.
ZooKeeper provides FIFO ordering of all operations from a given client and linearizable writes. The paper shows that these two guarantees are sufficient to implement consensus for any number of processes.
The ZooKeeper service comprises an ensemble of servers. It uses replication for high availability and performance.
The team built ZooKeeper using a simple pipelined architecture. This has two main benefits:
Low latency, even when hundreds of thousands of requests are outstanding
FIFO order execution of operations from a single client
The critical gain from guaranteeing a FIFO client order is that clients can submit operations asynchronously. With asynchronous operations, a client can have multiple outstanding operations at a time.
ZooKeeper provides linearizable writes by implementing a leader-based atomic broadcast protocol called Zab. I will cover the Zab paper in the future.
ZooKeeper targets read-heavy workloads. Read operations are processed locally by the server a client is connected to and are not totally ordered through Zab. This is why ZooKeeper’s read throughput scales with the number of servers.
One important technique ZooKeeper relies on is client-side caching of data, which speeds up reads. The watch mechanism enables clients to cache data without ZooKeeper managing the client cache directly: a watch notifies the client when the data it cached has changed.
This paper discusses how the team implements some coordination primitives with ZooKeeper.
The ZooKeeper service
ZooKeeper provides a client library. Using this library, a client can call ZooKeeper APIs and submit requests. The library also manages the network connection between the client and the servers.
Before we proceed, let’s define some terminology.
Client: A user of the ZooKeeper service
Server: A process providing the ZooKeeper service
Znode: An in-memory data node in ZooKeeper
Data tree: The hierarchical namespace of znodes
Update/write: Any operation that modifies the data tree
Service Overview
In a nutshell, ZooKeeper provides its clients with an abstraction of a hierarchical namespace of znodes, and clients manipulate these znodes using the ZooKeeper API in the client library.
The diagram below shows the hierarchical namespace.
To refer to a given znode, the team used the standard UNIX notation for file system paths. For example, the znode p_3 is denoted by the path /app1/p_3.
Clients can create two different kinds of znodes:
Regular: Clients create and delete them explicitly.
Ephemeral: Clients create them explicitly, and they are either deleted explicitly or removed automatically by the system when the session that created them terminates.
All znodes store data. Ephemeral znodes cannot have children.
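To make this concrete, below is a minimal sketch using ZooKeeper’s official Java client. The connect string localhost:2181 and the paths /app1 and /app1/p_1 are illustrative assumptions, not from the paper.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CreateZnodes {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Open a session; the watcher fires once the connection is up.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000,
                event -> connected.countDown());
        connected.await();

        // Regular znode: stays until deleted explicitly.
        zk.create("/app1", "app data".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this session ends.
        // It stores data like any znode but cannot have children.
        zk.create("/app1/p_1", "process data".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        zk.close(); // the session ends: /app1/p_1 disappears, /app1 stays
    }
}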
ZooKeeper implements watches to allow clients to receive notifications of changes.
When a client issues a read operation with a watch flag set, the operation completes as normal, and the server promises to notify the client when read data changes at the server.
Watches are registered within a session, and they are one-time triggers: a watch is unregistered once it is triggered or once the session closes.
Session events (like connection loss) are also sent to watch callbacks. This helps clients know that watch events may be delayed.
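Here is a hedged sketch of watches with the Java client (the path /app1 and the connect string are assumptions): reading with a Watcher registers a one-time notification, and session events arrive through the same callback.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000,
                event -> connected.countDown());
        connected.await();

        Stat stat = new Stat();
        // Read /app1 and register a one-time watch on it.
        byte[] data = zk.getData("/app1", event -> {
            // Fires once when /app1's data changes; session events
            // (e.g. a disconnect) also land here, warning the client
            // that change notifications may be delayed.
            System.out.println(event.getState() + " " + event.getType());
        }, stat);
        System.out.println("read version " + stat.getVersion());

        Thread.sleep(60_000); // keep the session open to receive the event
        zk.close();
    }
}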
Data model
We have seen the hierarchical namespace. The ZooKeeper data model is thus essentially a file system with a simplified API and only full data reads and writes on znodes.
Znodes map to abstractions of the client application. Typically, they correspond to meta-data used for coordination purposes. For example, in the previous diagram, the /app1 and /app2 znodes represent two different applications.
ZooKeeper allows clients to store some information that can be used for meta-data or configuration in a distributed computation. Note, however, that znodes are not designed for general-purpose data storage the way files in a file system are.
Sessions
A client connects to ZooKeeper and initiates a session. Sessions have an associated timeout.
A session can be closed in two ways. First, the client closes it explicitly. Second, ZooKeeper closes it when it does not receive anything from the client for more than the session timeout period.
Client API
In the last section we covered the data model. Now let’s look at the APIs ZooKeeper exposes to manipulate znodes.
A few important requests in the API are listed below:
create(path, data, flags): Creates a znode with path name path, stores data[] in it, and returns the name of the new znode. flags lets the client select the type of znode (regular or ephemeral) and set the sequential flag.
delete(path, version): Deletes the znode path if that znode is at the expected version.
exists(path, watch): Returns true if the znode with path name path exists. The watch flag enables a client to set a watch on the znode.
getData(path, watch): Returns the data and meta-data, such as version information, associated with the znode. ZooKeeper does not set the watch if the znode does not exist.
setData(path, data, version): Writes data[] to znode path if the version number is the current version of the znode.
getChildren(path, watch): Returns the set of names of the children of a znode.
sync(path): Waits for all updates pending at the start of the operation to propagate to the server that the client is connected to. The path is currently ignored.
All methods have both a synchronous and an asynchronous version available through the API.
When an application needs to execute a single ZooKeeper operation and has no concurrent tasks to execute, it makes the synchronous call, which blocks.
The asynchronous API enables an application to have multiple outstanding ZooKeeper operations and other tasks executed in parallel.
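Here is a sketch of the asynchronous API with the Java client, assuming a pre-existing /queue parent znode (a hypothetical path). The three creates are all outstanding at once, and the FIFO client order guarantee means the server applies them in the order they were sent.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AsyncDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000,
                event -> connected.countDown());
        connected.await();

        CountDownLatch done = new CountDownLatch(3);
        AsyncCallback.StringCallback cb = (rc, path, ctx, name) -> {
            // rc is the result code; name is the actual created path.
            System.out.println(KeeperException.Code.get(rc) + " -> " + name);
            done.countDown();
        };

        // Three outstanding operations at once; no call blocks the client.
        for (int i = 1; i <= 3; i++) {
            zk.create("/queue/job-" + i, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT,
                    cb, null);
        }

        done.await();
        zk.close();
    }
}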
Each of the update methods takes an expected version number. If the actual version number of the znode does not match the expected version number, the update fails with an unexpected version error. A version number of -1 skips the version check.
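A small sketch of such a conditional update follows; it assumes an already-open ZooKeeper handle and uses the hypothetical znode /app1/config. The client reads the current version, then writes with that expected version, so a concurrent writer causes a detectable failure.

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdate {
    // Assumes an open ZooKeeper handle and an existing /app1/config znode.
    static boolean updateConfig(ZooKeeper zk, byte[] newData) throws Exception {
        Stat stat = new Stat();
        zk.getData("/app1/config", false, stat); // learn the current version
        try {
            // Succeeds only if nobody changed the znode since our read.
            zk.setData("/app1/config", newData, stat.getVersion());
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false; // concurrent update detected; caller may retry
        }
    }
}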
ZooKeeper guarantees
ZooKeeper provides two basic ordering guarantees:
Linearizable writes: All requests that update the state of ZooKeeper are serializable and respect precedence.
FIFO client order: All requests from a given client are executed in the order that they were sent by the client.
Let’s understand linearizability. The original definition comes from Herlihy’s linearizability paper. ZooKeeper’s definition is not the same; the paper calls its variant A-linearizability (asynchronous linearizability). So what is the difference?
The original definition says a client can only have one outstanding operation at a time.
In the ZooKeeper definition, a client can have multiple outstanding operations at a time, but they are served in FIFO order.
Note that a system satisfying A-linearizability also satisfies the original linearizability. Because only update requests are A-linearizable, ZooKeeper can process read requests locally at each replica.
To see how these guarantees work together, consider an example from the paper: a system comprising a number of processes elects a leader, and the new leader must change a large number of configuration parameters and notify the other processes once it finishes.
We have two important requirements:
As the new leader starts making changes, we do not want other processes to start using the configuration that is being changed;
If the new leader dies before the configuration has been fully updated, we do not want the processes to use this partial configuration.
With ZooKeeper, the new leader can designate a path as the ready znode. Other processes will only use the configuration when that znode exists.
The new leader makes the configuration change by deleting ready, updating the various configuration znodes, and creating ready. All of these changes can be pipelined and issued asynchronously to quickly update the configuration state.
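A sketch of the leader side using the asynchronous Java API follows. The znode layout under /config (two parameter znodes plus ready) is an assumed example; all four calls are issued back to back without waiting.

import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderConfigUpdate {
    // Pipeline the whole update; FIFO client order means followers can
    // never observe ready alongside a half-written configuration.
    static void publishConfig(ZooKeeper zk, byte[] p1, byte[] p2) {
        AsyncCallback.VoidCallback delCb = (rc, path, ctx) -> log(rc, path);
        AsyncCallback.StatCallback setCb = (rc, path, ctx, stat) -> log(rc, path);
        AsyncCallback.StringCallback createCb = (rc, path, ctx, name) -> log(rc, path);

        zk.delete("/config/ready", -1, delCb, null);        // hide the config
        zk.setData("/config/param-1", p1, -1, setCb, null);
        zk.setData("/config/param-2", p2, -1, setCb, null);
        zk.create("/config/ready", new byte[0],             // publish again
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT,
                createCb, null);
        // All four calls are outstanding at once; none blocks the leader.
    }

    private static void log(int rc, String path) {
        System.out.println(KeeperException.Code.get(rc) + " " + path);
    }
}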
Let’s check if both previous requirements are met.
Because of the ordering guarantees, if a process sees the ready znode, it must also see all the configuration changes made by the new leader.
If the new leader dies before the ready znode is created, the other processes know that the configuration has not been finalized and do not use it.
It looks good, but there is an issue here. What happens if a process sees that ready exists before the new leader starts to make a change and then starts reading the configuration while the change is in progress?
The above problem is solved by the ordering guarantee for the notifications.
If a client is watching for a change, the client will see the notification event before it sees the new state of the system after the change is made.
There is one more problem to solve. Let’s understand it with an example. Suppose there are two clients, A and B, that have their own communication channel outside ZooKeeper.
If A changes the shared configuration in ZooKeeper and tells B of the change through the shared communication channel, B would expect to see the change when it re-reads the configuration.
If B’s ZooKeeper replica is slightly behind A’s, then B may not be able to see the change when it re-reads the configuration.
This problem is solved by ZooKeeper using a sync request. Sync causes a server to apply all pending write requests before processing the read.
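Client B’s fix might look like the following sketch. In the Java client, sync is asynchronous-only, so we wait for its callback before reading (recall that the path argument is currently ignored by the server).

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class SyncRead {
    // Read path, but first make B's replica apply all pending writes.
    static byte[] readAfterSync(ZooKeeper zk, String path) throws Exception {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await(); // the replica has caught up with pending writes
        return zk.getData(path, false, null);
    }
}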
ZooKeeper provides two more guarantees:
If a majority of ZooKeeper servers are active and communicating, the service will be available.
If the ZooKeeper service responds successfully to a change request, that change persists across any number of failures, as long as a quorum of servers is eventually able to recover.
Examples of primitives
In this section, we will look at a few examples of higher-level primitives that can be built from ZooKeeper’s low-level API.
Configuration Management
ZooKeeper can be used to implement dynamic configuration in a distributed application. We started with configuration in the introduction section. Let’s move ahead and learn how to do this.
A configuration can be stored in a znode zc.
When a process starts, it reads zc and receives the required configuration. The process also sets the watch flag to true.
If the configuration in zc ever changes, the process is notified; it then reads the new configuration and again sets the watch flag to true for future changes.
One good point to note here: the process is not sent a notification for every single change to zc if it has not re-read zc in between. Whenever the process comes to read the new configuration, it sees all the accumulated changes in one go.
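A sketch of this loop with the Java client, using the hypothetical path /app1/config as zc: each read re-arms the one-time watch, and a single read after a notification picks up all intervening changes.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;
    private volatile byte[] config;

    ConfigWatcher(ZooKeeper zk) throws Exception {
        this.zk = zk;
        readConfig(); // initial read, watch armed
    }

    private void readConfig() throws Exception {
        // Passing `this` as the watcher re-arms the one-time watch.
        config = zk.getData("/app1/config", this, null);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
            try {
                readConfig(); // one read sees all changes since the event
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}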
Rendezvous
There are cases where the final system configuration is not known in advance. For example, a client may want to start a master process and several worker processes. The client is responsible for letting the worker processes know the address and port of the master process, but it cannot do so up front because these are not known before the master starts.
ZooKeeper handles this scenario using a rendezvous znode zr, which is created by the client.
The client passes the full pathname of zr as a startup parameter to the master and worker processes. When the master starts, it fills zr with information about the addresses and ports it is using. When workers start, they read zr with the watch set to true.
If zr has not been filled in by the time the workers start, they wait to be notified when zr is updated. If zr is an ephemeral node, the master and worker processes can additionally watch for zr to be deleted and clean themselves up when the client ends.
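A sketch of both sides, where zr is a path like /rendezvous/job-1 created empty by the client (the path and the host:port payload are assumptions):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class Rendezvous {
    // Master side: publish its address and port in zr.
    static void announce(ZooKeeper zk, String zr, String hostPort)
            throws Exception {
        zk.setData(zr, hostPort.getBytes(), -1);
    }

    // Worker side: block until the master has filled in zr.
    static byte[] awaitMaster(ZooKeeper zk, String zr) throws Exception {
        while (true) {
            CountDownLatch changed = new CountDownLatch(1);
            byte[] data = zk.getData(zr, event -> changed.countDown(), null);
            if (data != null && data.length > 0) {
                return data; // the master already announced itself
            }
            changed.await(); // woken when zr is updated
        }
    }
}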
Group Membership
ZooKeeper can provide group membership primitives using ephemeral nodes.
Let’s say a znode zg represents the group. When a process that is a member of the group starts, it creates an ephemeral child znode under zg. A unique member name or the sequential flag can be used to give the newly created znode a unique name.
If the process fails or ends, the znode that represents it under zg is automatically removed.
Any process can obtain group membership information by simply listing the children of zg. If a process wants to monitor changes in membership, it can set the watch flag to true and will be notified of changes.
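A sketch of joining and listing a group, with zg as a pre-existing znode such as /groups/app1 (an assumed path):

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMembership {
    // Join: an ephemeral, sequentially named child represents this process.
    static String join(ZooKeeper zk, String zg) throws Exception {
        // If this process dies, the znode (and the membership) vanishes.
        return zk.create(zg + "/member-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    }

    // List current members; the watch fires on the next join or leave.
    static List<String> members(ZooKeeper zk, String zg) throws Exception {
        return zk.getChildren(zg, true);
    }
}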
Simple Locks
An application can implement lock primitives using ZooKeeper’s API.
The lock is represented by a znode.
A client tries to acquire a lock by creating the designated znode of the ephemeral type. If the client succeeds, the client holds the lock. Otherwise, the client can read the znode with the watch flag true.
A client releases the lock when it dies or explicitly deletes the znode.
The clients waiting for a lock try again to acquire a lock once they observe (by notification) the znode being deleted.
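A sketch of this simple lock in Java (the lock path is an assumed example); note how every waiter watches the same znode, which is exactly what produces the herd effect discussed next.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLock {
    // Block until we own the lock znode (e.g. /locks/resource-1).
    static void lock(ZooKeeper zk, String path) throws Exception {
        while (true) {
            try {
                zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.EPHEMERAL);
                return; // we created it: we hold the lock
            } catch (KeeperException.NodeExistsException e) {
                // Someone else holds the lock: watch it and wait.
                CountDownLatch deleted = new CountDownLatch(1);
                if (zk.exists(path, event -> deleted.countDown()) != null) {
                    deleted.await(); // herd effect: every waiter wakes up here
                }
                // else: deleted between create and exists; retry immediately
            }
        }
    }

    static void unlock(ZooKeeper zk, String path) throws Exception {
        zk.delete(path, -1); // or let the session's death remove it
    }
}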
There are two challenges in this implementation:
Many clients may be waiting to acquire the lock. When it is released, they all vie for it even though only one client can acquire it. This creates a herd effect.
It only implements exclusive locking.
Simple Locks without Herd Effect
Lock
1 n = create(l + "/lock-", EPHEMERAL|SEQUENTIAL)
2 C = getChildren(l, false)
3 if n is lowest znode in C, exit
4 p = znode in C ordered just before n
5 if exists(p, true) wait for watch event
6 goto 2
Unlock
1 delete(n)
The challenge in the previous approach was that all waiting clients woke up and contended for the lock once it was released. To solve this, we line up the clients requesting the lock, and each client obtains the lock in order of request arrival.
In the Lock code above, the client uses the SEQUENTIAL flag when creating its znode in line 1. In line 3, the client checks whether its znode has the lowest sequence number. If yes, the client holds the lock. Otherwise, the client waits for the deletion of the znode that either holds the lock or will receive the lock before this client’s znode.
This means the client only needs to watch the znode that precedes its own znode, so we avoid the herd effect. Once the watched znode disappears, the client must check whether it now holds the lock.
The lock is released either by deleting the znode explicitly or automatically when the client crashes and its ephemeral znode is cleaned up.
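A Java sketch of the same recipe follows; the comments refer to the pseudocode line numbers above. The lexicographic sort works because ZooKeeper zero-pads sequence numbers to a fixed width.

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SequentialLock {
    // l is the lock's parent znode, e.g. /locks/resource-1 (must exist).
    static String lock(ZooKeeper zk, String l) throws Exception {
        // line 1: create our ephemeral, sequential entry in the queue
        String n = zk.create(l + "/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String name = n.substring(n.lastIndexOf('/') + 1);
        while (true) {
            // line 2: read the current queue
            List<String> children = zk.getChildren(l, false);
            Collections.sort(children);
            int i = children.indexOf(name);
            // line 3: lowest znode means we hold the lock
            if (i == 0) return n;
            // lines 4-5: watch only the znode just before ours
            String p = l + "/" + children.get(i - 1);
            CountDownLatch gone = new CountDownLatch(1);
            if (zk.exists(p, event -> gone.countDown()) != null) {
                gone.await();
            }
            // line 6: the predecessor vanished; re-check, since it may
            // have been abandoned rather than released by the lock holder
        }
    }

    static void unlock(ZooKeeper zk, String n) throws Exception {
        zk.delete(n, -1);
    }
}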
Read/Write Locks
Implementing read/write locks requires only small changes to the lock procedure: we have separate write lock and read lock procedures.
The write lock differs only in the znode name, as shown in line 1 of the code below.
The read lock changes so that it only waits on write-lock znodes with lower sequence numbers; this check is done in lines 3 and 4.
Write Lock
1 n = create(l + "/write-", EPHEMERAL|SEQUENTIAL)
2 C = getChildren(l, false)
3 if n is lowest znode in C, exit
4 p = znode in C ordered just before n
5 if exists(p, true) wait for event
6 goto 2
Read Lock
1 n = create(l + "/read-", EPHEMERAL|SEQUENTIAL)
2 C = getChildren(l, false)
3 if no write znodes lower than n in C, exit
4 p = write znode in C ordered just before n
5 if exists(p, true) wait for event
6 goto 3
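One point from the paper worth noting: when a write lock is released, all clients waiting for read locks are notified together. This looks like a herd effect, but it is the desired behavior, since all of those readers may now hold the lock and should proceed at once.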
Double Barrier
A double barrier helps clients to synchronize the beginning and the end of a computation.
Processes can start their computation once enough processes have joined the barrier; this threshold value is defined by the barrier.
In ZooKeeper, a barrier is represented with a znode b. When a process p registers with b, it creates a child node of b. When a process unregisters, it removes the child.
To enter, processes watch for the existence of a ready child of b. This node will be created by the process that causes the number of children to exceed the barrier threshold.
To leave, processes watch for a particular child to disappear and only check the exit condition once that znode has been removed.
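A sketch of the enter side in Java follows (the barrier path, child names, and threshold handling are my assumptions; the leave side, which symmetrically waits for children to disappear, is omitted for brevity).

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DoubleBarrier {
    // Enter barrier b (e.g. /barriers/job-7) with the given threshold.
    static void enter(ZooKeeper zk, String b, String name, int threshold)
            throws Exception {
        // Register this process as a child of b.
        zk.create(b + "/" + name, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        CountDownLatch released = new CountDownLatch(1);
        // Watch for the ready child; if it already exists, we can proceed.
        if (zk.exists(b + "/ready", event -> released.countDown()) != null) {
            return;
        }
        // Count registered processes, ignoring a concurrently created ready.
        long count = zk.getChildren(b, false).stream()
                .filter(c -> !c.equals("ready")).count();
        if (count >= threshold) {
            try { // we crossed the threshold: release everyone
                zk.create(b + "/ready", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) { }
            return;
        }
        released.await(); // wait until some process creates ready
    }
}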
In the next part of this post, we will discuss the rest of the paper, which includes ZooKeeper applications, implementation, evaluation, and related work.
References
Original paper: https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf
Linearizability: https://dl.acm.org/doi/pdf/10.1145/78969.78972