Smarter Data Compression With RavenDB
This is a thorough but accessible dive into the theory and rationale behind data compression algorithms and the way databases use them.
If you had to guess the highest priorities of database designers, the ability to compress data would probably be high on your list. The more you can compress your data, the more of it fits on your disk - seems pretty essential. But you may have noticed that most of the popular database engines offer little or no data compression features out of the box. RavenDB is an odd duck in that regard. Why is that?
The usefulness and efficiency of compressing data are very dependent on circumstances. For one thing, data storage space isn’t usually a limiting factor for most applications. The speed of the database often is.
But compression doesn’t have to come at the cost of speed; perhaps counterintuitively, it can actually increase it. As data is moved around, it passes through all sorts of bottlenecks - such as I/O or limited network bandwidth - that take far more time than compression and decompression. So it saves a lot of time if you can make your data more compact before you move it around. If you’re on the cloud, it saves both time and money. There is also virtual memory to consider, which is scarcer than storage. So compression is useful for much more than just saving storage space - assuming it’s used appropriately.
There are other reasons that databases often don’t invest in compression capabilities. The performance of any compression algorithm is wholly dependent on the properties of the data itself. So in many cases, the user is in a better position to decide when and how to compress the data.
For compression to make sense as a database feature, it has to be a design priority from day one. The database needs to know how to get the best performance out of its algorithm in different scenarios, given different types and amounts of data. At the same time, compression should ideally be effortless and transparent to the user and not leave them to figure out compression trade-offs on their own.
And of course, many free and open source compression technologies exist and are in common use, like JPEG or GZIP, so databases don’t necessarily want to try reinventing the wheel. On the other hand, it is much easier on the user when the database they’re already using also offers powerful compression in the same package, rather than having to mix and match different tools that may not be compatible in the future.
In this article, we will discuss how RavenDB approaches compression for these types of data:
- Collections of documents
- Document revisions
- TCP and HTTP traffic
- Large text fields
- Map-reduce indexes
- Time-series data
The Speed Paradox
Data compression can either be a force multiplier for everything your application does, or it can be worse than useless. The most basic reason is that [de]compression comes at the cost of CPU and takes some minimum amount of time.
Compression algorithms are judged on three main metrics: compression speed, decompression speed, and compression ratio (the uncompressed size divided by the compressed size - the higher, the better). There is a spectrum between algorithms that are high speed-low ratio and algorithms that are low speed-high ratio, and each one is designed to occupy some niche along that spectrum.
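As a quick illustration of the ratio - measured here with Python’s built-in zlib, not with any particular database engine - you can compute it directly:

```python
import zlib

# A deliberately redundant payload: the same small JSON-like record repeated many times.
data = b'{"name": "Order", "status": "shipped", "lines": []}' * 1000

compressed = zlib.compress(data, 6)   # level 6 is zlib's default speed/ratio trade-off
ratio = len(data) / len(compressed)   # uncompressed size divided by compressed size

print(f"{len(data)} bytes -> {len(compressed)} bytes, ratio {ratio:.0f}:1")
```

Swapping in a faster or stronger algorithm, or a different level, simply moves you along that same speed-ratio spectrum.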
Both at the extremes of that spectrum and within the settings of any given algorithm, there is a tendency to hit a barrier of diminishing returns: speed can drop very fast the harder you push for the best compression ratio, and vice versa.
There is also an important difference between ‘lossless’ and ‘lossy’ algorithms. Lossy compression can improve speed or ratio, but at the cost of permanently altering the data. A familiar example is the JPEG image format. RavenDB only uses lossless compression, as do most other databases that offer compression.
As mentioned above, extra storage space isn’t always worth the cost in speed. Storage is continuously getting cheaper and more plentiful - both because of the decreasing cost of a given GB of physical disk space and because of the increasing availability of cloud storage. (Using RavenDB Cloud is a great way to take advantage of this.)
For many applications today, the highest priority is the speed of the round-trip when users communicate with the server. Every time you press a button on a website and have to wait for it to update, you’d rather the server didn’t waste your time on unnecessary compression.
Of course, if your data needs to travel through a time-consuming bottleneck, compression saves time by reducing the amount of data that has to pass through it. This includes something as simple as writing and reading data to and from the disk. Disk I/O - usually measured in I/O operations per second, or IOPS - is typically extremely slow compared to what the CPU can do in the same time. By the same token, on the cloud, I/O time usually costs a lot more money than CPU time.
What is true of I/O is also true of network bandwidth. This is why RavenDB compresses all TCP and HTTP traffic by default - both between RavenDB instances and between server and client.
In contrast, if you aren’t expecting your data to go through any significant bottlenecks, compression might then become the slowest part of your data pipeline.
More Data In, Better Performance Out
An algorithm’s potential efficiency is ultimately determined by the properties of the information you feed it. All algorithms depend on exploiting repetitive patterns in your data. This is also called data redundancy and is closely related to information entropy.
Any pattern that is repeated many times can be summarized in something like the following way: you store just one copy of the repetitive pattern, and you record all the locations where the pattern is found in the uncompressed data. If a given pattern is repeated often enough, this can result in an arbitrarily large compression ratio. If two chunks of information are of equal size, they are by no means equally compressible: a megabyte on your disk could contain random noise, or it could contain only zeroes.
This rule of thumb generally holds: a large chunk of information tends to have more redundancy than any smaller piece of the same data. For example, if you look at most sentences in English (like this one), there might not be a single word that appears more than once. But as you increase your sample - even to just two or three sentences - you see a dramatic increase in redundancy.
It’s important to emphasize that this is not an iron law - compression performance does not always correlate with the amount of data you feed it. Sometimes as you increase the amount of input, you find yourself spending more and more CPU in exchange for smaller and smaller improvements in the compression ratio. In those cases, there might be a Goldilocks zone between too little and too much input. For example, a table of data objects that are similar to each other will probably be more compressible than a combination of different tables.
But in general, small chunks of data are usually less redundant, and that makes compression difficult. RavenDB’s document compression feature has a unique approach to small chunks - this is one of its biggest advantages. We discuss this further below after we introduce the Zstd algorithm.
Compression in Document Databases
Document databases like RavenDB are “NoSQL databases” - they’re different from the more common relational databases in that there is no rigid schema for data. A document is analogous to a row, and the document’s fields are analogous to columns. Documents can be grouped in collections, which are roughly analogous to tables, but different documents in the same collection don’t have to have the same structure or even any of the same fields.
The relational schema lends itself to compression because the structure of a row is redundant across a whole table. But that doesn’t mean a document database has to be less efficient. You can choose to treat RavenDB like a relational database by creating a collection in which all documents have the same structure. In that case, the redundancies are just as compressible as they would be in a schema. You can also choose to take advantage of NoSQL freedom and leave fewer redundancies to compress, or you can do anything in between.
Furthermore, document databases usually store data as JSON. This is a common data format used for many other purposes, such as the bodies of HTTP requests and responses. This is convenient because the brackets, quotation marks, commas, and other text elements used in the format are highly redundant.
The “Chunk” Compression Problem
Returning to the previous topic: if we want to increase the amount of input data, we face a problem. A limitation of many algorithms is that they process data in indivisible chunks. A chunk of information is compressed all at once and must also be decompressed all at once.
The result is that if you compressed 10 kilobytes of data as one unit, but now you only want to decompress one kilobyte of that data, you're out of luck. You have to pay the cost of decompressing all 10 kB in CPU and other resources. There is no universal term for this type of compression, so I am calling it “chunk” compression here. Another way to think of it would be “isolated” compression.
The reason is that with many algorithms, there is not necessarily a one-to-one correspondence between a specific part of the compressed data and a specific part of the uncompressed data. One part of the compressed data might have a partial relationship with a bunch of different parts of the uncompressed data. Likewise, different parts of compressed data might depend on each other in order to be interpreted correctly during decompression.
Uses of Chunk Compression
Chunk compression is limited but useful - and in many databases, it’s all that’s on offer. An example is the popular database MySQL, which has native compression, but only in chunks called “pages.” These pages can be 1, 2, 4, 8, or 16 kilobytes. This means that you can’t take advantage of any redundancies that only appear at scales larger than 16 kilobytes.
The assumption behind this feature is that there are limited situations in which you really want to decompress more than 16 kb at a time. Also, there might be diminishing returns for trying to compress more data at a time, as mentioned earlier. Therefore, MySQL’s data compression is flawed but may be adequate for many applications.
RavenDB makes good use of chunk compression in a few specific cases: field compression, map-reduce index compression, and time-series compression.
When you store a lot of text in one field, RavenDB automatically compresses it. The great thing about this is that it is not necessary to decompress a field in order to read or modify any of the other fields in the document. Even the name of a compressed field can be accessed separately from the text itself. This works seamlessly with RavenDB’s out-of-the-box full-text search and indexing capabilities.
Map-reduce indexes are a special type of index that allows you not only to query your data but to perform fast and complex aggregation queries. Map-reduce data is highly redundant, so RavenDB compresses all map-reduce indexes automatically with very high compression ratios.
Similarly, RavenDB’s time-series feature compresses data using Gorilla compression and the LZ4 algorithm. Time-series data consists of a series of numerical values (entries), each associated with a timestamp. It tends to be highly compressible because you can store only the differences between consecutive entries rather than recording each entry in full. In fact, RavenDB is typically able to record an entry in just 4 bits, and that’s before the actual compression.
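To see why storing differences pays off, here is a minimal sketch of delta encoding in Python. It only illustrates the general idea - RavenDB’s actual time-series storage uses Gorilla-style bit packing plus LZ4, not this code:

```python
def delta_encode(values):
    """Keep the first value, then store only the difference from the previous one."""
    deltas = [values[0]]
    for prev, curr in zip(values, values[1:]):
        deltas.append(curr - prev)
    return deltas

def delta_decode(deltas):
    """Rebuild the original series by accumulating the differences."""
    values = [deltas[0]]
    for d in deltas[1:]:
        values.append(values[-1] + d)
    return values

# Regularly sampled timestamps produce tiny, repetitive deltas.
timestamps = [1700000000, 1700000060, 1700000120, 1700000180]
print(delta_encode(timestamps))   # [1700000000, 60, 60, 60]
assert delta_decode(delta_encode(timestamps)) == timestamps
```

Regularly sampled data turns into long runs of identical small deltas, which take very few bits to represent and compress extremely well afterward.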
But when necessary, RavenDB uses Zstd to circumvent the problems of chunk compression.
The Zstd Compression Algorithm
Zstandard, or Zstd, is an open-source algorithm developed by Facebook in 2016. Like all members of the LZ family of algorithms, it is lossless. It scores very high in both speed and compression ratio. With 22 possible ‘levels’ of compression, it occupies a wide range in the middle of the speed-ratio spectrum. Its performance is similar to, if not better than, many other algorithms in the same parts of the spectrum.
Zstd has especially high decompression speeds. Usually, decompression speed increases roughly in proportion to compression speed and decreases in proportion to compression ratio. In Zstd, decompression speed is very high and relatively constant across different compression levels. This makes Zstd especially helpful in all scenarios in which the speed of reads is the highest priority.
But by far the most interesting aspect of Zstd is its dictionary training feature. All LZ-family algorithms use dictionaries, but they use them for chunk compression - one dictionary per chunk. Zstd changes the game by training a dictionary on sample data and then reusing that dictionary to compress whatever data it is fed. This also means that small pieces of data can be compressed and decompressed individually, without dragging a whole chunk along with them.
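As a rough sketch of that idea - using the python-zstandard bindings with arbitrary sizes, not anything RavenDB-specific - a dictionary is trained once on sample data and then reused to compress many small pieces independently:

```python
import zstandard as zstd

# A batch of small, similarly structured documents: each one is too short to be
# very redundant on its own, which is exactly where a shared dictionary helps.
samples = [
    b'{"orderId": %d, "customer": "customers/%d", "status": "shipped"}' % (i, i % 50)
    for i in range(1000)
]

# Train a shared dictionary on the samples (4 KB is an arbitrary size for this sketch).
dictionary = zstd.train_dictionary(4 * 1024, samples)

compressor = zstd.ZstdCompressor(dict_data=dictionary)
decompressor = zstd.ZstdDecompressor(dict_data=dictionary)

# Each document is compressed on its own, yet still benefits from the shared
# dictionary - and can later be decompressed on its own as well.
doc = b'{"orderId": 12345, "customer": "customers/7", "status": "pending"}'
compressed = compressor.compress(doc)
assert decompressor.decompress(compressed) == doc
```

The same dictionary can serve one document or a million of them; nothing ties it to any particular chunk of compressed data.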
Dictionary Compression
To understand how dictionaries work, we need to understand how compression works without a dictionary.
For example, let's compress the word "Mississippi." With a typical non-dictionary compression algorithm, it might come out something like this:
“Mis” (-1,1) (-3,4) “p” (-1,1) (-3,1)
Unique parts of the word are in quotes, and the redundant parts are encoded by the parentheses. How do we decompress this? The decompressed word starts with the first text in quotes - "Mis." Next, each set of parentheses adds a copy of some part of the existing text: the first number tells you how far back to go to find the start of the repeated sequence, and the second tells you how many characters to copy.
The first set of parentheses tells you to go one character back to “s” and only repeat that one character. Now we have “Miss.” Next, we go three characters back, to “i,” then four characters forward. Those are “i,” “s,” “s”... how do we get the fourth character? In this format, we can continue the sequence to the new “i” that we have just added. This gives us “Mississi.” Next, add “p,” and so on.
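Here is a tiny Python sketch of that decoding process - purely illustrative, since real LZ-family decoders work on raw bytes and are far more sophisticated:

```python
def lz_decode(tokens):
    """Decode a mix of literal strings and (offset, length) back-references."""
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.extend(token)            # literal text is copied as-is
        else:
            offset, length = token       # negative offset: count back from the current end
            for _ in range(length):
                out.append(out[offset])  # copying one character at a time lets a
                                         # reference overlap the text it is producing
    return "".join(out)

tokens = ["Mis", (-1, 1), (-3, 4), "p", (-1, 1), (-3, 1)]
print(lz_decode(tokens))  # Mississippi
```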
The compression ratio here isn’t great because the repeated sequences are very short, but don’t worry about that right now. The point is that in this traditional approach to compression, the raw data being decompressed is used as the repository for redundant sequences in the data that is still compressed, leading to this complex recursive pattern we’ve just analyzed.
This is why a compressed chunk is also decompressed as a chunk: there is no way to access the data "ippi" without having to decompress "Mississ" first.
How does a dictionary change things? A dictionary is a standalone repository of sequences - whether or not they happen to repeat within any one piece of data - kept separate from the compressed data itself. So, for example, if we create a dictionary for Mississippi, we can record the sequence “ssi” or “iss” in it directly. This is a little simpler than what we did before: there, we had to use one reference to create the sequence “iss,” and only then could the second reference point to it. With a dictionary that contains “ssi,” Mississippi can now be compressed into something like this (an illustrative notation, not Zstd’s actual format):
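“Mi” (dict: “ssi”) (dict: “ssi”) “ppi”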
Rather than having to point to pieces of data that precede them, references are free to point to any part of the dictionary. Dictionaries can be very efficient, but it is Zstd’s dictionaries that realize the full potential of dictionary compression. Zstd separates the dictionary from the compressed data entirely. That way, a single dictionary can be used to compress one kilobyte or a whole terabyte.
It also finally frees us from chunk compression because references don’t depend on other parts of a compressed chunk; they point to something external.
RavenDB’s Effective Use of Zstd
We can finally talk about RavenDB’s Documents Compression feature. RavenDB is not the only database that uses Zstd. It isn’t even the only document-oriented database that uses Zstd - MongoDB does as well. However, more than two years after they introduced Zstd in version 4.2, MongoDB still isn’t making use of Zstd’s dictionary training capabilities.
RavenDB applies compression to one collection of documents at a time. RavenDB trains Zstd on the first few documents to create a dictionary, and that dictionary is then used to compress all further data that enters the collection.
Better yet, the database monitors the compression ratio of new documents. If the ratio falls below a certain threshold, RavenDB automatically trains a new dictionary on the most recently modified documents in the collection. If this new dictionary performs better than the last one, it replaces it. This means new data entering the collection continues to be compressed with no interruption, and the compression ratio remains above the threshold.
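Conceptually - and this is only a conceptual sketch with the python-zstandard bindings and made-up thresholds, not RavenDB’s actual code - the adaptive part looks something like this:

```python
import zstandard as zstd

RATIO_THRESHOLD = 2.0   # hypothetical threshold, just for this sketch
DICT_SIZE = 4 * 1024    # arbitrary dictionary size

class CollectionCompression:
    """Miniature sketch of per-collection dictionary compression with retraining."""

    def __init__(self, training_docs):
        self.recent_docs = list(training_docs)            # documents used as training samples
        self._adopt(zstd.train_dictionary(DICT_SIZE, self.recent_docs))

    def _adopt(self, dictionary):
        self.dictionary = dictionary
        self.compressor = zstd.ZstdCompressor(dict_data=dictionary)

    def compress(self, doc):
        compressed = self.compressor.compress(doc)
        self.recent_docs = (self.recent_docs + [doc])[-256:]   # remember recent documents

        if len(doc) / len(compressed) < RATIO_THRESHOLD:
            # The data has drifted away from the training samples: train a candidate
            # dictionary on recent documents and adopt it only if it actually does better.
            candidate = zstd.train_dictionary(DICT_SIZE, self.recent_docs)
            retried = zstd.ZstdCompressor(dict_data=candidate).compress(doc)
            if len(retried) < len(compressed):
                self._adopt(candidate)
                return retried
        return compressed
```

A real implementation also has to keep older dictionaries around so that documents compressed with them remain readable - a detail this sketch glosses over.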
This entire process is transparent to the user. All you need to do is choose which collections to compress, and RavenDB does the rest. Zstd compression can also be applied to all document revisions in the database. Revisions are a record of every change made to every document in a database. They provide a way to track your data's past, allowing you to restore old data or recover from an unexpected server failure.
Hopefully, this article gave you some insight into compression, both in general and in the context of database engines. And now you understand how RavenDB provides efficient compression for a wide range of scenarios, all with minimum fuss. It is one of the few major databases designed this carefully with compression in mind. You can also see how a database’s choice of algorithm can be crucial: both to its own capabilities and to how the user can take advantage of it.
There are many free compression technologies available. But whenever you add something to your stack, it can easily result in unexpected compatibility issues. If you choose your database wisely, you won't have to figure out compression on your own or pay through the nose for cloud resources. RavenDB’s mission in life is to give you ease and peace of mind. You can focus on your product, knowing that the data is well taken care of at the backend.
Published at DZone with permission of Mor Hilai.