Why `fsync()`: Losing Unsynced Data on a Single Node Leads to Global Data Loss

Regardless of the replication mechanism, you must fsync() your data to prevent global data loss in non-Byzantine protocols.

Denis Rystsov

May. 18, 23 · Opinion

Likes (1)

Comment

Save

2.6K Views

In this deep dive blog post, I will be exploring the critical and often misunderstood dynamics of data loss within replicated systems. Our main focus will be on the potentially catastrophic impact of losing unsynchronized data on a single node that could cause global data loss. In the end, I will illustrate it with a demo.

What Is fsync?

By default, the disk write API is asynchronous. Therefore, when an application uses the operating system’s API to write data to disk, the OS copies the data and may confirm the write request without waiting for the data to reach the disk. This behavior improves latency and throughput, but it reduces safety.

Although this behavior is safer than buffering the data on the application level, it’s not completely consistent. The OS copies the data, so even if the application crashes, the data remains intact, and the OS completes the write. However, if the machine loses power or the OS crashes, recently written data can still be lost.

With fsync, the application asks the OS to return control flow only when all recently written data has reached the disk. This ensures that the application works synchronously with the disk, guaranteeing that all data is written to the disk before continuing with the program execution.

What Is Replication For?

Replication is a technique used to improve the availability and durability of an application’s data. When an application or its data is hosted on a single node, if that node goes down, the application becomes unavailable, and the data becomes inaccessible. Replication solves this problem by storing the data on different nodes, keeping it in sync, and making it accessible for reading and writing even in the presence of network partitioning and crashes (i.e., fail-stop faults).

Replication ensures that data is consistent across all nodes and that it is as consistent as it would be on a single node (linearizability). This means that even if one node fails, the application and its data remain available, and users can continue to access and use the application without interruption.

Can Replication Reduce the Safety Risks of Running a System Without fsync?

Node crashes are the main vulnerability of running a system without fsync. However, consistent replication can withstand node crashes without compromising data consistency. So it appears that replication resolves this vulnerability and makes fsyncs unnecessary for replicated systems.

Of course, data loss can still occur if all nodes lose power or experience simultaneous OS crashes. However, it is possible to minimize the likelihood of this event, e.g., by using availability zones, making things feel safe.

The argument presented above is a common misunderstanding. Even the loss of power on a single node, resulting in local data loss of unsynchronized data, can lead to silent global data loss in a replicated system that does not use fsync, regardless of the replication protocol in use.

Note: The caveat is that most replication protocols only tolerate fail-stop faults, which means that while nodes may crash, they must have the same state (data) upon restart as they did at the moment of the crash.

Running a system without fsync removes node crashes from the category of fail-stop faults, and we cannot use replication to justify the absence of fsync.

Can a Replication Protocol Support Faults Beyond Fail-Stop and Tolerate the Lack of fsync?

Yes, Byzantine faults are a broader class of faults that assume nodes can exhibit any type of adverse behavior, including the loss of unsynchronized writes. The “Reaching Agreement in the Presence of Faults” paper shows that to tolerate n Byzantine faults, a system must be replicated to 3n+1 nodes.

However, setting a replication factor of four in Redpanda or Kafka is not enough to protect against a single Byzantine fault. A system must use cutting-edge Byzantine fault-tolerant (BFT) replication protocols, which neither of these systems currently employ.

BFT protocol was not chosen due to its complexity, maturity, and performance characteristics. It takes time to mature a research protocol for industry use. For example, Paxos, the first consistent replication protocol, required a decade of distributed system research to support essential industry features, such as the ability to replace crashed nodes.

The loss of unsynchronized data is only a small subset of Byzantine faults. Is there a protocol that covers this subset and provides the same level of guarantees as Raft, with 2n+1 nodes instead of 3n+1?

It’s a good computer science question, and I don’t know the answer in general, but with n=1, it’s impossible.

Proof of Impossibility

The Raft/Paxos replication protocols are designed with fault tolerance in mind. As long as a majority of the nodes are available to elect a leader (2 RTTs), the system can guarantee availability and consistency. However, if a node is allowed to lose unsynced data, this invariant cannot be maintained in any replication protocol. Let’s assume the opposite and prove this by contradiction.

Consider a replicated system consisting of three nodes ({A, B, C}) with RF=3. Suppose node {A} gets isolated from a client and the peers, and the client writes sequential records {1,2,3,4,5} to the replicated system. As soon as the client receives an ack, the following happens:

Node {B} gets isolated from the peers.
Node {C} crashes, loses unsynced data, and restarts only with {1,2,3} data.
Node {A} gets reconnected with node {C} and the client.

Assume that after a short period of time (2 RTTs), the cluster should become available to serve reads and writes. However, nodes {A, C} don’t have {4,5} suffixes, so if they do become available, there will be data loss. This contradicts the system’s guarantee of consistency and proves the impossibility.

In conclusion, it is impossible for a replicated system to tolerate the loss of unsynced data and maintain the same level of consistency and availability as the Raft/Paxos protocols with RF=3.

Counter-Example

A counterexample can sometimes be more convincing than proof, so I decided to reproduce the data loss. Apache Kafka doesn’t use fsync by default, so I chose to use it to demonstrate the possibility of global data loss caused by local data loss on a single node.

Clone the example repo.

     C 
   
   git clone https://github.com/redpanda-data/kafka-fsync
cd kafka-fsync

Build a container with a locally-deployed Kafka cluster (three Kafka processes and one Apache ZooKeeper™ process).

     C 
   
   docker build -t kafka-fsync .

Start the container and log into it.

     C 
   
   docker run -d --name kafka_fsync -v $(pwd):/fsync kafka-fsync
docker exec -it kafka_fsync /bin/bash

Create data directories.

     C 
   
   cd /fsync
./create.dirs.sh

Start the ZooKeeper process.

/root/apache-zookeeper-3.8.1-bin/bin/zkServer.sh --config . start

Start three Kafka processes.

     C 
   
   nohup /root/kafka_2.12-3.4.0/bin/kafka-server-start.sh kafka1/server.properties >> /fsync/kafka1/kafka.log 2>&1 & echo $! > /fsync/kafka1/pid &
nohup /root/kafka_2.12-3.4.0/bin/kafka-server-start.sh kafka2/server.properties >> /fsync/kafka2/kafka.log 2>&1 & echo $! > /fsync/kafka2/pid &
nohup /root/kafka_2.12-3.4.0/bin/kafka-server-start.sh kafka3/server.properties >> /fsync/kafka3/kafka.log 2>&1 & echo $! > /fsync/kafka3/pid &

Create topic1 with RF=3.

     C 
   
   /root/kafka_2.12-3.4.0/bin/kafka-topics.sh --create --topic topic1 --partitions 1 --replication-factor 3 --bootstrap-server 127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094

Kill Kafka process 1. Because the OS or node wasn't crashed, kafka1 hasn't experienced local data loss.

     C 
   
   cat kafka1/pid | xargs kill -9

Write ten records with acks=all.

     C 
   
   python3 write10.py

Output:

     C 
   
 
 
   wrote key0=value0 at offset=0
wrote key1=value1 at offset=1
...
wrote key8=value8 at offset=8
wrote key9=value9 at offset=9 
  

Let's figure out which node is the leader.

     C 
   
   /root/kafka_2.12-3.4.0/bin/kafka-topics.sh --describe --topic topic1 --bootstrap-server 127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094

On my machine, it was kafka3. Let's kill ZooKeeper (isolate ZooKeeper from the cluster) to "freeze" leadership on the third node.

     C 
   
   cat zookeeper/zookeeper_server.pid | xargs kill -9

Kill the rest of Kafka processes to freeze time. Since the OS is intact, there’s no local data loss at this point.

     C 
   
   cat kafka2/pid kafka3/pid | xargs kill -9

Now, simulate local data loss by removing the last ten bytes from the last leader (kafka3, in my case).

     C 
   
   truncate -s -10 kafka3/data/topic1-0/00000000000000000000.log

Restart ZooKeeper.

     C 
   
   /root/apache-zookeeper-3.8.1-bin/bin/zkServer.sh --config . start

Let's give it a minute to remove ephemeral info. Then start the former leader (kafka3, in my case).

     C 
   
   nohup /root/kafka_2.12-3.4.0/bin/kafka-server-start.sh kafka3/server.properties >> /fsync/kafka3/kafka.log 2>&1 & echo $! > /fsync/kafka3/pid &

Wait until it becomes a leader.

     C 
   
   /root/kafka_2.12-3.4.0/bin/kafka-topics.sh --describe --topic topic1 --bootstrap-server 127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094

Then start the "empty" kafka1 process.

     C 
   
   nohup /root/kafka_2.12-3.4.0/bin/kafka-server-start.sh kafka1/server.properties >> /fsync/kafka1/kafka.log 2>&1 & echo $! > /fsync/kafka1/pid &

Again, let's wait until the two nodes' ISR form.

     C 
   
   /root/kafka_2.12-3.4.0/bin/kafka-topics.sh --describe --topic topic1 --bootstrap-server 127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094

Then, let's write another ten records for Kafka.

     C 
   
   python3 write10.py

Ideally, you should see.

     C 
   
 
 
   wrote key0=value0 at offset=10
wrote key1=value1 at offset=11
...
wrote key8=value8 at offset=18
wrote key9=value9 at offset=19 
  

But what you actually see is.

     C 
   
 
 
   wrote key0=value0 at offset=9
wrote key1=value1 at offset=10
...
wrote key8=value8 at offset=17
wrote key9=value9 at offset=18 
  

So, by causing local data loss on a single node (it may happen without the fsync), we caused global data loss, and Kafka lost the record key9=value9 at offset=9.

Conclusion

The use of fsync is essential for ensuring data consistency and durability in a replicated system. The post highlights the common misconception that replication alone can eliminate the need for fsync and demonstrates that the loss of unsynchronized data on a single node still can cause global data loss in a replicated non-Byzantine system.

Apache ZooKeeper Data loss Data (computing) kafka Protocol (object-oriented programming) Replication (computing)

Published at DZone with permission of Denis Rystsov. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending