What to Do When Data Goes out of Sync?
We've all said this at some point. So, how do we fix it? Read this article to find out.
Oh! The data is out of sync.
I am sure many of us have heard this more than once while building systems to support scale or give users a better experience.
We have all seen situations where we want the contents of the database in some other system, for example: in a Hadoop cluster for analytics, in Elasticsearch for better search, or in a cache so that the applications are nice and fast.
If we were doing this for a system where the data does not change, it would be very easy. We could take a snapshot of the database and load the data into the other system.
However, reality has a different story to tell. By the time we are done with loading the snapshot, the data is already stale, which is not really great in today's world.
So, what do we do in case we need real-time data in other systems?
I guess we all end up asking our applications to write to multiple systems. What this means is that every time the application writes to the database, it updates the cache for faster retrievals, reindexes search systems, and sends the data for analytics.
Is there any problem with this approach? Probably not, until we hear that the cache is out of sync or holds stale data, or that changes did not show up in the analytics because the sync job failed or has not pushed the data yet. Over time, this approach runs into race conditions and reliability issues. What we end up with is data drift across multiple systems, a big team of engineers rebuilding caches and making sure data is available across all systems, and tons of monitoring infrastructure.
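To make the race condition concrete, here is a minimal sketch. The `database` and `cache` dictionaries and the two helper functions are hypothetical in-memory stand-ins, not a real client API. Two clients update the same key at almost the same time, and their writes reach the two systems in different orders.

```python
# Hypothetical in-memory stand-ins for a real database and a real cache.
database = {}
cache = {}

def write_db(key, value):
    database[key] = value

def write_cache(key, value):
    cache[key] = value

# Each client performs two independent writes. Nothing forces the
# database and the cache to see those writes in the same order.
interleaving = [
    (write_db, "price", 10),     # client A writes the database
    (write_db, "price", 20),     # client B writes the database
    (write_cache, "price", 20),  # client B updates the cache
    (write_cache, "price", 10),  # client A updates the cache last
]

for step, key, value in interleaving:
    step(key, value)

print(database)  # {'price': 20}
print(cache)     # {'price': 10}
```

Both clients succeeded and no error was raised, yet the cache now disagrees with the database and will stay that way until something rewrites the key.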
Now, let's look at this from a different angle. Let's treat writes to the database as a stream. Every time the database changes, that change is a new message in the stream. If we apply the messages to another system in the same order, we end up with an exact copy of the data in that system. This is typically how database replication works.
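A rough sketch of the idea, assuming a simple in-memory model: the `ChangeEvent`, `Database`, and `Replica` names are made up for illustration and do not correspond to any particular database or library. The database appends every write to an ordered log, and each downstream system replays that log from the last offset it has seen.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class ChangeEvent:
    offset: int                  # position of the change in the stream
    op: str                      # "upsert" or "delete"
    key: str
    value: Optional[Any] = None

class Database:
    """Toy source of truth that records every write in an ordered log."""
    def __init__(self):
        self.rows = {}
        self.log = []

    def upsert(self, key, value):
        self.rows[key] = value
        self.log.append(ChangeEvent(len(self.log), "upsert", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.log.append(ChangeEvent(len(self.log), "delete", key))

class Replica:
    """Toy downstream system (cache, search index, ...) fed only by the log."""
    def __init__(self):
        self.rows = {}
        self.next_offset = 0

    def catch_up(self, log):
        # Apply unseen events strictly in log order.
        for event in log[self.next_offset:]:
            if event.op == "upsert":
                self.rows[event.key] = event.value
            else:
                self.rows.pop(event.key, None)
            self.next_offset = event.offset + 1

db = Database()
cache = Replica()
search_index = Replica()

db.upsert("user:1", "Ada")
db.upsert("user:2", "Grace")
cache.catch_up(db.log)         # the cache follows the stream closely
db.upsert("user:1", "Ada L.")
db.delete("user:2")
cache.catch_up(db.log)
search_index.catch_up(db.log)  # a late consumer replays from the start

print(cache.rows == db.rows == search_index.rows)
```

The application writes only to the database here. Because every consumer applies the same events in the same order, a consumer that falls behind or starts late can still converge on the same state by replaying from its own offset.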
This approach to building systems is called Change Data Capture (CDC). It is already being used by companies like Yelp, Facebook, and LinkedIn.
I am very excited about this, as it lets us unlock the value of data we already have. We can feed the data into a central hub where it can be enriched with event streams and data from other databases in real time. This makes it much easier to experiment with minimal risk of data corruption.
I will write another post on how to implement it.