Implementing Time Series in MongoDB
If you're looking to use time granularities, you can set up a time series in MongoDB (with a workaround) while still benefiting from its document-oriented infrastructure.
I recently designed a solution for MongoDB to store information that has time as its main axis of analysis. This information had to be easy to query and aggregate using many different time granularities (months, weeks, days, etc.). It also had to be stored in a way that did not consume too much disk space and that was optimal for MongoDB to maintain. In short, I needed to transform MongoDB into a time series database.
Time Series
Let’s start from the beginning. What is a time series? Citing Wikipedia:
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.
InfluxDB's Key Concepts page gives us an extremely easy example to understand what a time series is. Imagine you have two scientists who have to record the number of two types of insects in two different locations. Using a tabular view, we come across something like this.
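The data, roughly as it appears in InfluxDB's documentation (intermediate readings are omitted):

time                    butterflies    honeybees    location    scientist
2015-08-18T00:00:00Z    12             23           1           langstroth
2015-08-18T00:00:00Z    1              30           1           perpetua
2015-08-18T00:06:00Z    11             28           1           langstroth
2015-08-18T00:06:00Z    3              28           1           perpetua
...
2015-08-18T06:12:00Z    7              22           2           perpetua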
As you can see, the main dimension of this datastore is time: events stored in it span through time. What we can refer to as the basic time granularity is set to minutes in our example. This means that we cannot produce an analysis with a granularity finer than the basic one (minutes, in this case).
Columns like butterflies and honeybees are called fields. Fields are made up of field keys and field values. Field keys (butterflies and honeybees) are strings and they store metadata: the field key butterflies tells us that the field values from 12 to 7 refer to butterflies, and the field key honeybees tells us that the field values from 23 to 22 refer to, well, honeybees.
Field values are the data; a field value in a time series database is always associated with a timestamp. The collection of field-key and field-value pairs makes up a field set.
Last, but not least, columns like location and scientist are called tags. Also in this case, a tag is made up of a tag key and a tag value. We can look at tags as indexes that help access to the time series. They are not mandatory, but they help a lot.
We have a time series now. But are these databases only a bunch of tables that force the presence of a timestamp? No, the main characteristic of a time series database is that it should have powerful tools to aggregate data (fields) over time. Let's say we need to know how many butterflies were counted by scientist "perpetua" in "location 1" during the last year: it should be easy to retrieve this information from the database.
Time Series in MongoDB
MongoDB is a general purpose document-oriented database. This means that information inside the database is stored as documents. MongoDB uses BSON format, a binary variant of JSON documents. A document in MongoDB looks like the following.
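For instance, a simple document about one of our scientists could look like this (an illustrative example; the field names and values are made up from our insects scenario):

{
    // A hypothetical document: values can be strings, dates, numbers,
    // arrays, and nested documents
    _id: ObjectId("55cba2476c522cafdb053add"),
    scientist: "langstroth",
    registered: ISODate("2015-08-01T00:00:00.000Z"),
    locations: [1, 2],
    counters: {butterflies: 12, honeybees: 23}
}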
Among the types that are available for field values, we find strings, numbers, timestamps, arrays, documents, and so on. How can we design a document schema such that MongoDB can manage time series?
For a number of reasons that will be analyzed later in this document, the best way to treat time series in MongoDB is using a subdocument for each level of aggregation we want to manage. Let’s convert the above example about scientists and insects to MongoDB.
Suppose that you want to maintain minutes as the maximum granularity over time. Then, imagine that you also want to give access to partially aggregated information by hours, days, and months. The final optimal document schema you should use is the following.
{
    _id: "1234567890",
    // The document is indexed by month
    date: ISODate("2015-08-01T00:00:00.000Z"),
    // In the main document we find the tags
    location: 1,
    scientist: "langstroth",
    days: [
        // We have a subdocument in the days array for every day in the month
        {
            date: ISODate("2015-08-18T00:00:00.000Z"),
            // In each subdocument we find field keys and values
            butterflies: 23,
            honeybees: 51,
            hours: [
                // We have a subdocument for every hour in a day
                {
                    date: ISODate("2015-08-18T00:00:00.000Z"),
                    butterflies: 23,
                    honeybees: 51,
                    minutes: [
                        {
                            // And finally, we find the minutes granularity. Also here
                            // we have a subdocument for each minute in an hour
                            date: ISODate("2015-08-18T00:00:00.000Z"),
                            butterflies: 12,
                            honeybees: 23
                        },
                        // Omitted...
                        {
                            date: ISODate("2015-08-18T00:06:00.000Z"),
                            butterflies: 11,
                            honeybees: 28
                        },
                        // Omitted...
                    ]
                },
                // Omitted...
            ]
        },
        // Omitted...
    ]
}
Such a big document, isn’t it? As you can see, the trick here is to have a subdocument level for each granularity we need in our analysis. Tags are in the main document, let’s say at level 0. Fields are partially aggregated at each level (1, 2, etc.). The aggregation over time is determined by the value of the date property at each level. Documents are always complete: this means that we will find a subdocument for each minute/hour/day, whether the field values are 0 or not.
Why This? Why That?
So far so good. Now, the question is: why do we use this complex document schema? What are the pros and cons?
First of all, if we modeled our events using a 1:1 approach with respect to the documents, we would end up with one document per event, as in the example below.
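For instance, a single minute reading would become its own small document (a sketch built from the data of our example, not the schema we finally adopted):

{
    // One document per event: a single minute reading with its tags
    date: ISODate("2015-08-18T00:06:00.000Z"),
    location: 1,
    scientist: "langstroth",
    butterflies: 11,
    honeybees: 28
}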
While this approach is valid in MongoDB, it doesn’t take advantage of the expressive nature of the document model. Moreover, to aggregate results that span an interval, MongoDB needs to access only very few documents when the subdocument approach is used.
Another good question is why we are using arrays for the days, hours, and minutes information, instead of a dedicated JSON property for each element. Arrays work very well with the MongoDB Aggregation Framework. In detail, using the $unwind operator it is possible to flatten the internal structure of each document, which makes it an easy job to also query the information stored inside subdocuments.
For example, using the following aggregation pipeline, it is possible to easily retrieve the number of butterflies reported by scientist langstroth, in location 1, during the days between 2015-08-18 and 2015-08-20.
db.test.aggregate([
    // Select the documents for the given tag set
    {$match: {location: 1, scientist: 'langstroth'}},
    // Flatten the days array: one document per day subdocument
    {$unwind: '$days'},
    // Keep only the days inside the requested interval
    {$match: {'$and': [{'days.date': {'$gte': ISODate("2015-08-18T00:00:00.000Z")}},
                       {'days.date': {'$lte': ISODate("2015-08-20T00:00:00.000Z")}}]}},
    // Extract the daily butterflies counter
    {$project: {_id: 0, location: 1, scientist: 1, 'butterflies': '$days.butterflies'}},
    // Sum the counters over the interval
    {$group: {_id: {location: '$location', scientist: '$scientist'}, butterflies: {'$sum': '$butterflies'}}}
])
In the example, we are using three levels of subdocuments, i.e. days, hours, and minutes. Clearly, not all these levels are mandatory. However, this layout increases performance when updating the document: each array is smaller, because it contains only a small piece of the overall information, so MongoDB can walk through it faster during the update process.
Another important thing to note is that the main document must be inserted into the collection in its full form, which means with all the levels of granularity already filled. Clearly, at the beginning, all the field values in each subdocument will be equal to zero. However, this is an important requirement to take into consideration: this way, no update will cause an existing document to grow or be moved on disk, which allows MongoDB to perform better on the collection.
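To make the requirement concrete, the pre-allocated "all zero" document for a month and a tag set could be built by a helper along these lines (just a sketch: the helper and its name are illustrative and not part of the original solution):

// Build the pre-allocated, all-zero document for a given month and tag set
function zeroDocument(monthDate, location, scientist) {
    var year = monthDate.getUTCFullYear();
    var month = monthDate.getUTCMonth();
    // Number of days in the given month
    var daysInMonth = new Date(Date.UTC(year, month + 1, 0)).getUTCDate();
    var days = [];
    for (var d = 1; d <= daysInMonth; d++) {
        var hours = [];
        for (var h = 0; h < 24; h++) {
            var minutes = [];
            for (var m = 0; m < 60; m++) {
                // One all-zero subdocument per minute of the hour
                minutes.push({
                    date: new Date(Date.UTC(year, month, d, h, m)),
                    butterflies: 0,
                    honeybees: 0
                });
            }
            // One all-zero subdocument per hour of the day
            hours.push({
                date: new Date(Date.UTC(year, month, d, h)),
                butterflies: 0,
                honeybees: 0,
                minutes: minutes
            });
        }
        // One all-zero subdocument per day of the month
        days.push({
            date: new Date(Date.UTC(year, month, d)),
            butterflies: 0,
            honeybees: 0,
            hours: hours
        });
    }
    // The main document: tags at level 0, counters at every level
    return {
        date: monthDate,
        location: location,
        scientist: scientist,
        butterflies: 0,
        honeybees: 0,
        days: days
    };
}

For example, db.test.insert(zeroDocument(ISODate("2015-08-01T00:00:00.000Z"), 1, "langstroth")) would pre-allocate the document for August 2015, location 1, and scientist "langstroth".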
But this requirement opens up an important issue about the management of time series using MongoDB: Who is responsible for inserting the “all zero” document for each tag set inside the collection?
Which Came First, the Chicken or the Egg?
This is the real and central issue using MongoDB to model time series. We need a procedure that inserts the documents before we can use and update them.
First attempt: We can develop a process that periodically inserts those documents for us. Nice try, dude. However, this approach is not possible for those use cases in which the domain of the tags is not known a priori. Returning to our example, imagine that your system is collecting butterfly and honeybee counts from scientists all over the world. It is impractical to know the names of all these scientists in advance.
Second attempt: Try to take advantage of the $setOnInsert clause in an update (upsert) statement. From the MongoDB documentation we have:
If an update operation with upsert: true results in an insert of a document, then $setOnInsert assigns the specified values to the fields in the document. If the update operation does not result in an insert, $setOnInsert does nothing.
Bingo! We found it! We can insert the whole document in a $setOnInsert clause the first time we try to update the collection. Nope. Due to a bug explained in this Stackoverflow question, it is not possible to use the same property both in the $setOnInsert clause and in the update operators. S**t!
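In other words, an upsert along the following lines (just a sketch) is rejected by MongoDB, because the path days appears in $setOnInsert while days.0.butterflies appears in $inc, and the two paths conflict:

db.test.update(
    {date: ISODate("2015-08-01T00:00:00.000Z"), location: 1, scientist: "langstroth"},
    {
        // On the first insert, pre-allocate the whole zero-filled document...
        $setOnInsert: {days: [/* all-zero subdocuments */]},
        // ...and, in the same statement, increment the counters: this conflicts
        $inc: {"days.0.butterflies": 1, "days.0.honeybees": 1}
    },
    {upsert: true}
);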
Three-Step Initialization
So, have we reached a dead end, a cul-de-sac? At first glance, it may seem so. Fortunately, my colleagues and I found a workaround. We can call it three-step initialization. Just to recap, we want to be able to initialize the main document with all the required subdocuments properly set.
Remember that MongoDB guarantees the atomicity of operations on a single document. With this fact in mind, we can operate in the following way:
- Try to update the document, properly incrementing the counters at a specified time chunk. Do not do any upsert, just an old-fashioned update operation. Remember that the execution of an update statement returns the number of documents written. If the number of documents written is greater than zero, you’re done.
- If the number of documents written by the update is zero, it means that the document to update is not yet present in the collection. Try to insert the whole document for the specified tags, setting all the counters (field values) to zero. The execution of an insert statement also returns the number of documents written. If it returns zero or throws an exception, never mind: it means that some other process has already inserted the document for the same tags.
- Execute the same update as in the first step again.
The code should look something like the following snippet. Here, we want to add 1 to the butterflies and honeybees field values for the date 2015-08-01T00:06:00.000Z and the tags location 1 and scientist “langstroth”.
// First of all, try the update
var result = db.test.update(
    {date: ISODate("2015-08-01T00:00:00.000Z"), location: 1, scientist: "langstroth"},
    {$inc: {
        butterflies: 1,
        honeybees: 1,
        "days.0.butterflies": 1,
        "days.0.honeybees": 1,
        "days.0.hours.0.butterflies": 1,
        "days.0.hours.0.honeybees": 1,
        "days.0.hours.0.minutes.6.butterflies": 1,
        "days.0.hours.0.minutes.6.honeybees": 1
    }},
    {upsert: false}
);
// If the update does not succeed, then try to insert the document
if (result.nModified === 0) {
    try {
        db.test.insert(/* Put here the whole document */);
    } catch (err) {
        console.log(err);
    }
    // Here we are sure that the document exists.
    // Retry to execute the update statement
    db.test.update(/* Same update as above */);
}
Clearly, what makes the above procedure work is the guarantee of atomicity on single-document modifications. I know, the procedure is a little bit clunky, but we have not found anything better at the moment. Do you have any better idea? If so, try to explain it in the comments section. Thanks!
Published at DZone with permission of Riccardo Cardin, DZone MVB.