Is ElasticSearch SET/GET Eventually Consistent?
The author takes a deep dive into the intricacies of SET/GET in ElasticSearch to determine if there is eventual consistency.
ElasticSearch does much of the heavy lifting of horizontal scalability for us, managing failures, nodes, and shards. I was getting into it a few days ago on a new project I was working on, and I wanted to know whether the SET/GET operation is eventually consistent. I started by thinking, well, it's NoSQL; there are replicas; it should be eventually consistent. But then I read some documentation that led me to think it might be consistent and not eventually consistent.
If you know better than my findings below, please don't hesitate to let me know! But first, allow me to summarize some of the concepts I have learned. Then I will give my take on SET/GET eventual consistency. (I also used a local cluster test to confirm it.)
cluster.name: Nodes with the same cluster name belong to the same cluster. The cluster reorganizes itself as we add/remove data, meaning it manages moving data between nodes if needed.
Master node: Not involved in searching. One node is elected as the master node and is in charge of adding/deleting indexes and adding/removing nodes from the cluster.
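A quick way to see both in practice is the cluster APIs. A minimal sketch, assuming a local cluster listening on the default port 9200 and the Python requests library:

```python
import requests

# The cluster name that nodes must share to join the same cluster.
health = requests.get("http://localhost:9200/_cluster/health").json()
print(health["cluster_name"], health["status"])

# The currently elected master node (it coordinates cluster changes,
# not searches).
print(requests.get("http://localhost:9200/_cat/master?v").text)
```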
Any node: You can talk to any node for searching and indexing, including the master. The entry point node (any node) knows where the data resides, so it will communicate with the right node to SET/GET the data and get back to us (the entry point node) with the results.
Index: The logical namespace that points to one or more shards. It's like a database in a relational database. An index groups together one or more shards.
1 index + multiple shards: One index can have one or more shards; again, it's like a database.
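As an illustration, here is how shard counts are set at index creation. A minimal sketch; the index name blogs and the shard/replica counts are made up for this example:

```python
import requests

# Three primary shards, fixed forever at creation time; one replica of
# each, which can be changed later.
settings = {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}
resp = requests.put("http://localhost:9200/blogs", json=settings)
print(resp.json())
```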
Shard: Documents are stored in shards. It's a single instance of Lucene and a complete search engine in its own right.
Application > index > shard: Applications talk to shards via indexes (logical namespace pointers to shards).
Cluster grows: Shards are moved between nodes.
Primary shard: Each document lives on a single primary shard; the data is on exactly one primary shard.
Replica shard: A copy of a primary shard. It protects against hardware failure on the primary and also serves read/GET requests.
Number of shards: You can have multiple primary shards for an index.
Who handles what: Reads and searches are handled by either the primary or a replica shard. The more copies, the higher the read throughput.
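Since replicas can be added after the fact, read throughput can be scaled without reindexing. A minimal sketch, reusing the hypothetical blogs index from above:

```python
import requests

# Raise the replica count; unlike number_of_shards, this setting is
# dynamic and takes effect on a live index.
resp = requests.put(
    "http://localhost:9200/blogs/_settings",
    json={"number_of_replicas": 2},
)
print(resp.json())
```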
Concurrency: ElasticSearch uses optimistic concurrency control (versioning).
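To make the versioning concrete, here is a minimal sketch of optimistic concurrency control, using the older index/type/id URL scheme and the version query parameter (replaced by if_seq_no/if_primary_term in recent versions); the document and index names are made up:

```python
import requests

# Index a document; the response carries its current _version.
resp = requests.put("http://localhost:9200/blogs/post/1",
                    json={"title": "hello"}).json()
version = resp["_version"]

# Update only if nobody changed the document in the meantime; a stale
# version yields HTTP 409 (version conflict) instead of a lost update.
resp = requests.put("http://localhost:9200/blogs/post/1",
                    params={"version": version},
                    json={"title": "hello, world"})
print(resp.status_code)  # 200 on success, 409 on conflict
```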
Distributed document store: When you index a document, it is stored on a single primary shard, chosen as shard = hash(routing) % number_of_primary_shards. This explains why the number of primary shards can be set only when an index is created and never changed: if the number of primary shards were ever changed, all previous routing values would be invalid and documents would never be found (see the routing sketch below).
Coordinating node: The node that got our request; it forwards it to the correct node for the read/write.
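Here is that routing formula as a runnable sketch. Real ElasticSearch uses a murmur3 hash of the routing value (the document ID by default); zlib.crc32 stands in here only to keep the example deterministic:

```python
import zlib

number_of_primary_shards = 3

def shard_for(routing: str) -> int:
    # The same routing value always lands on the same shard.
    return zlib.crc32(routing.encode()) % number_of_primary_shards

print(shard_for("doc-42"))

# If number_of_primary_shards later changed to, say, 5, most existing
# routing values would map to a different shard and lookups would miss
# the documents already stored.
```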
CREATE, INDEX, and DELETE requests are write operations and must be successfully completed on the primary shard before they can be copied to any associated replica shards.
Replication: sync (the default) waits for a successful response from the replicas before answering the client.
async: Success is reported as soon as the primary finishes; the documentation advises avoiding async.
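Both modes were set per request in older versions. A minimal sketch, assuming a pre-2.0 cluster (the replication query parameter was removed in ElasticSearch 2.0):

```python
import requests

doc = {"title": "consistency test"}

# sync (the default): the primary answers only after the replicas have
# acknowledged the write.
requests.put("http://localhost:9200/blogs/post/2",
             params={"replication": "sync"}, json=doc)

# async: the primary answers as soon as it has indexed the document
# itself; replicas catch up in the background.
requests.put("http://localhost:9200/blogs/post/3",
             params={"replication": "async"}, json=doc)
```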
Quorum: By default, write operations require a quorum (a majority of the shard copies) to be available before the write is attempted.
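The quorum the older consistency parameter used works out to int((primary + number_of_replicas) / 2) + 1. A minimal sketch of that arithmetic:

```python
def quorum(number_of_replicas: int) -> int:
    # One primary plus its replicas; a strict majority of those copies.
    return (1 + number_of_replicas) // 2 + 1

print(quorum(2))  # 3 copies total -> 2 must be available
print(quorum(1))  # 2 copies total -> 2; note the docs applied the quorum
                  # check only when number_of_replicas is greater than 1
```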
Read miss: It is possible that while a document is being indexed, it is already on the primary but not yet copied to a replica. The replica would report that the document does not exist, while the primary would return the document successfully. In that sense, reads are not consistent but eventually consistent.
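You can poke at that window by pinning a GET to a specific copy. A minimal sketch, assuming the hypothetical blogs index with replicas and the older _primary/_replica preference values (removed in recent versions):

```python
import requests

url = "http://localhost:9200/blogs/post/4"

# Immediately after an index request, the primary has the document...
on_primary = requests.get(url, params={"preference": "_primary"})

# ...while a replica that has not received the copy yet may answer 404.
on_replica = requests.get(url, params={"preference": "_replica"})

print(on_primary.status_code, on_replica.status_code)
```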
Are ElasticSearch SET/GET Reads Eventually Consistent?
ElasticSearch read consistency is eventually consistent, but it can also be consistent! The realtime flag is per shard, so if we hit a replica shard that hasn't received the data yet, the read may still be real-time, but we won't get the most recent data; at most, we get what is in that shard's transaction log.
realtime: true + replication: sync ==> reads are consistent for the same client, because replication: sync means the primary waits for the written data to be replicated to all replicas before acknowledging.
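Put together, that claim looks like this. A minimal sketch of the same-client flow, again assuming a pre-2.0 cluster where the replication parameter exists (realtime defaults to true on GET anyway):

```python
import requests

# The write returns only after the primary and all replicas have it.
requests.put("http://localhost:9200/blogs/post/5",
             params={"replication": "sync"},
             json={"title": "read your own write"})

# The same client reads it back; realtime=true consults the transaction
# log, so the document is visible even before a segment refresh.
resp = requests.get("http://localhost:9200/blogs/post/5",
                    params={"realtime": "true"})
print(resp.json()["_source"])
```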
How did I get to that conclusion? See the documentation:
The default value for replication is sync. This causes the primary shard to wait for successful responses from available replica shards before returning.
In addition, it says:
By the time the client receives a successful response, the document change has been executed on the primary shard and on all replica shards. Your change is safe.
The documentation also says this:
It is possible that, while a document is being indexed, the document will already be present on the primary shard but not yet copied to the replica shards. In this case, a replica might report that the document doesn’t exist, while the primary would have returned the document successfully. Once the indexing request has returned success to the user, the document will be available on the primary and all replica shards.
It's possible for the document to be only on the primary and not on the replicas. That makes sense if the document was written to the primary and a replica didn't get it yet. But in that case, as the section above says, the client would not have gotten an OK response.
Now there is also the realtime flag in the story:
The translog is also used to provide real-time CRUD. When you try to retrieve, update, or delete a document by ID, it first checks the translog for any recent changes before trying to retrieve the document from the relevant segment. This means that it always has access to the latest known version of the document, in real-time.
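The flag can be turned off per GET, which makes the translog's role visible. A minimal sketch: with realtime=false, the GET sees only what a refresh has already made searchable in the Lucene segments:

```python
import requests

url = "http://localhost:9200/blogs/post/5"

with_translog = requests.get(url, params={"realtime": "true"})
segments_only = requests.get(url, params={"realtime": "false"})

# Right after a write, the first call can find the document in the
# translog while the second may still miss it.
print(with_translog.status_code, segments_only.status_code)
```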
To the client that waits until the data is replicated, the read is consistent: with replication: sync, a success result is returned only after the write has been replicated. Together with the realtime flag, this ensures that even if the operation exists only in the transaction log, it will be returned to the client. But if I'm client2 and did not do the write, I might read exactly inside the window where the write has finished on the primary but has not been replicated yet. In that case, the read is eventually consistent. Of course, I encourage you to tell me if you think this is not the case!