Data Masking: Static vs Dynamic
The problem of data masking comes up surprisingly often in the world of IT. Any time you need to share some potentially sensitive data, you may need to hide, obfuscate, randomize, or otherwise dissimulate some of that data -- we'll call that the secret data.
In this article, we'll focus on the mechanics of data masking and gloss over a massive issue: data classification -- knowing who can access what data. Data classification is a whole different problem, especially in organizations with huge amounts of sensitive data. I'll refer you to a different article that touches on this topic. For the rest of this article, we'll assume that this problem has been solved and that we know who can access what data. The question is -- how do we hide the secret data?
Data masking is not just for databases -- it can be applied to documents, spreadsheets, and so on, but here we'll focus on databases.
There are many ways to do data masking, but in general, they can be divided into two categories, each one with its own upsides and downsides.
What Is Static Masking?
Static masking is the simplest solution. Given a database that contains some secret data, you copy that database and edit the copy to mask whatever data needs to be masked. You can then provide the copy to the client, and they can do whatever they want with it.
Of course, this may not be a trivial process for a large data set. Imagine a relational database with thousands of tables and billions of rows (or more). But there are some (expensive) tools that will help you with that task.
Advantages
It should be obvious that static masking is a very clean concept. It's the same idea as taking a pair of scissors and cutting out parts of a document. The secret data is not present, or at least not readable, in the copy, so there is no risk of leakage. The final user simply does not have the secret data.
For simple databases, you may not even need any tools: a few simple SQL scripts (or whatever language your database uses) might be enough.
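To make the "few simple SQL scripts" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` table, its `email` and `ssn` columns, the salt, and the helper names are all assumptions made up for illustration; a real database would need its own rules for each secret column.

```python
import hashlib
import sqlite3


def pseudonymize(value):
    """Replace a value with a stable, made-up e-mail address."""
    if value is None:
        return None
    # "demo-salt" is a placeholder; a real salt must be kept secret.
    digest = hashlib.sha256(("demo-salt:" + value).encode("utf-8")).hexdigest()
    return f"user_{digest[:10]}@example.invalid"


def make_masked_copy(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the whole database, then overwrite the secret columns in the copy."""
    masked = sqlite3.connect(":memory:")
    source.backup(masked)  # the source database is never modified

    # Expose the Python helper to SQL so the UPDATE below can call it.
    masked.create_function("pseudonym", 1, pseudonymize)
    masked.execute(
        """
        UPDATE customers
           SET email = pseudonym(email),
               ssn   = 'XXX-XX-' || substr(ssn, -4)
        """
    )
    masked.commit()
    return masked
```

Only the connection returned by `make_masked_copy` would be handed to the client. Note that keeping the last four digits of `ssn` is itself a policy decision: it leaves part of the secret in the copy, so it only makes sense if the data classification allows it.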
Because the secret data is not present, you can give a physical copy of the masked database to the client and let them run it on their own machines.
Disadvantages
The duplication of the data can be a problem. It requires more storage, and it means one more copy of the database floating around. This is not usually a problem if, for instance, you are releasing a database to the public, since there will then be only one version of the masked database.
But if different clients have different requirements, you may need to make many copies of the database, each one with a potentially different set of rules about which data is masked. And, of course, if you have different rules for different clients, you now have to worry about each client getting access only to their own custom version of the data set and not anyone else's. It can get challenging to track all that.
Another problem is that the copies are snapshots of the database and may need to be updated at regular intervals. Each time you do this is an opportunity for a mistake.
Finally, we live in the era of big data. Some data sets are truly enormous, and making and distributing a copy of such data sets can be a daunting proposition.
What Is Dynamic Masking?
Dynamic masking takes a different approach. Instead of making a copy of the data and changing the copy, the data is modified on the fly, as it is accessed, before it reaches the user, thereby providing each user of the same database with a potentially different view of the data. Note that this does not affect the database -- it only affects how the user sees the data.
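As a rough illustration of "modified on the fly," the sketch below masks rows as they are read, based on the role of the user asking for them. The roles, column names, and `POLICY` table are hypothetical; this is not how any particular product implements dynamic masking, only the general shape of the idea.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]
Mask = Callable[[object], object]


def redact(_value: object) -> str:
    """Hide the value completely."""
    return "****"


def last_four(value: object) -> str:
    """Keep only the last four characters visible."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


def unchanged(value: object) -> object:
    return value


# Hypothetical policy: role -> column -> mask to apply.
POLICY: Dict[str, Dict[str, Mask]] = {
    "analyst": {"ssn": redact, "card": last_four},
    "support": {"ssn": redact},
    "auditor": {},  # sees everything
}


def masked_rows(rows: Iterable[Row], role: str) -> Iterator[Row]:
    """Yield masked copies of the rows; the stored rows are never modified."""
    rules = POLICY.get(role)
    for row in rows:
        if rules is None:
            # Unknown role: fail closed and redact every column.
            yield {column: redact(value) for column, value in row.items()}
        else:
            yield {
                column: rules.get(column, unchanged)(value)
                for column, value in row.items()
            }
```

Two users reading the same row through `masked_rows` can get different results, while the underlying data stays exactly as it was.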
This assumes that you control the database and that the client is accessing it through some sort of network. If the users controlled the database, they could easily bypass the masking.
Generally speaking, dynamic masking can be done either by the database itself or by a layer between the database server and the database client.
For instance, Microsoft SQL Server offers some dynamic data masking capabilities, which may be sufficient for many scenarios. It's a powerful feature, but it does have some limitations. PostgreSQL has the Anonymizer extension.
Some third-party solutions provide data masking outside of the database, but they typically rely on special drivers or special clients. A more generalized approach is based on proxy filtering, which relies on deep packet inspection and modification to mask data before it reaches the client.
Advantages
The biggest advantage of dynamic masking is that, in theory, it allows you to use just one database for everyone. This avoids most of the issues we identified earlier with static masking.
Dynamic data masking also means that you can update the data masking rules, typically on the fly, and restrict or broaden access to certain data for certain clients at any time. And masking can depend on more than just who the user is: it can also depend on their IP address, the time of day, or what DEFCON level we're at -- you get the picture.
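A masking decision that depends on context might look something like the sketch below. Everything in it is an assumption chosen for illustration: the trusted network range, the business hours, the list of privileged users, and the rule that all three conditions must hold before unmasked data is shown.

```python
from dataclasses import dataclass
from datetime import time
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class RequestContext:
    user: str
    client_ip: str
    local_time: time


# Hypothetical settings; in practice these would come from configuration
# that can be changed without touching the data.
TRUSTED_NETWORK = ip_network("10.0.0.0/8")
BUSINESS_HOURS = (time(8, 0), time(18, 0))
PRIVILEGED_USERS = {"alice"}


def should_mask(ctx: RequestContext) -> bool:
    """Mask unless the user, the network, and the time of day all check out."""
    privileged = ctx.user in PRIVILEGED_USERS
    on_trusted_network = ip_address(ctx.client_ip) in TRUSTED_NETWORK
    in_hours = BUSINESS_HOURS[0] <= ctx.local_time < BUSINESS_HOURS[1]
    return not (privileged and on_trusted_network and in_hours)
```

Because the rule is evaluated on every request, changing `PRIVILEGED_USERS` or `BUSINESS_HOURS` takes effect immediately, with no copies of the data to regenerate.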
Clients get access to new and updated data immediately, so data currency problems disappear.
Dynamic data masking implies that you are controlling the database. You can (and probably should) monitor what the clients are doing. This is critical for forensic analysis if there is a problem later on (think Cambridge Analytica). In some environments, it may even be possible to enforce data confidentiality contractually, as long as you keep a close eye on how the clients are using the database.
Disadvantages
Dynamic masking is potentially less secure since users are, in fact, connecting to a database that contains the secret data. It turns out to be non-trivial to mask data reliably if the client accesses it using a sophisticated query language such as SQL. For instance, Microsoft specifically warns about this issue in their SQL Server data masking documentation. This can be managed by using query control if that's an option.
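To see why this is hard, consider a deliberately naive masking layer that hides a column in the results but still lets the client's WHERE clause run against the real values. The sketch below uses Python's sqlite3 module with a made-up `staff` table and `salary` column; it is not how SQL Server or any other product works internally, but it shows the general kind of inference that becomes possible when the client can write arbitrary queries.

```python
import sqlite3


def run_masked(conn, sql, params=()):
    """Naive masking: run the query as-is, then hide 'salary' in the output."""
    cursor = conn.execute(sql, params)
    columns = [description[0] for description in cursor.description]
    return [
        tuple("****" if column == "salary" else value
              for column, value in zip(columns, row))
        for row in cursor
    ]


def infer_salary(conn, name, low=0, high=1_000_000):
    """Recover a masked salary by binary search on a WHERE predicate.

    Assumes the real salary is an integer between low and high.
    """
    while low < high:
        mid = (low + high) // 2
        rows = run_masked(
            conn,
            "SELECT name, salary FROM staff WHERE name = ? AND salary > ?",
            (name, mid),
        )
        if rows:   # a row came back, so the real salary is above mid
            low = mid + 1
        else:      # no row, so the real salary is mid or below
            high = mid
    return low
```

Every result the client sees shows `****` for the salary, yet about twenty queries are enough to pin down the exact value. Restricting which queries a client may run is one way to close this gap.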
Dynamic masking can also be a more complex solution, with more moving parts. The more complex the solution, the more likely it is that something will go wrong.
Conclusion
As is so often the case, there is no perfect solution: there is only a series of trade-offs that need to be weighed against the requirements.
If your data set is of a manageable size (and that is very much a relative concept here), it may be practical for you to make a copy of your database and do the masking on the copy. If you're OK with the disadvantages we have outlined, that's a great way to do it. Simple solutions are often the most secure.
But if it's impractical or undesirable to duplicate the data set, especially if you have multiple clients with multiple masking requirements, then dynamic masking may be your only realistic option. In that case, you'll have to consider whether the database can satisfy your requirements or whether a third-party solution is required. Even if you end up using the data masking capabilities provided by your database, you may still benefit from using a third-party tool to manage permissions and data classifications.