Working With Time-series Data on Redshift
Working with data ordered in time has some unique challenges that we should take into consideration when we design our data warehouse solution.
Time series data, a sequence of data points ordered in time, often arises in analytics, especially when we start working with user-generated events. The interaction of a user with our product is a sequence of events where time is important. The same is also true if we consider the interactions of a recipient with one of our MailChimp campaigns.
In this post, we will see how to work efficiently with time series data, using Redshift as a data warehouse and data coming from events triggered by users interacting with our product.
We will see what options we have to optimize our data model and what tools Redshift has in its arsenal to organize the data and achieve faster query times.
Time series data is unique for a number of reasons. Primarily, rows are rarely updated. When we work with time series data, we expect to have an ever-growing table in our data warehouse.
From a data warehouse maintenance perspective, this is important. First of all, we need to somehow guarantee predictable query times regardless of the point in time at which we want to analyze our data.
Second, older data might not remain relevant in the future, or we might not want to use it at all because it could skew our analytic results. So, it should be easy to discard older data, something that also affects the cost of running our infrastructure.
The second characteristic of time series data is that its structure is usually quite simple. If, for example, we consider user events, the minimum required information is the following:
- The event that happened
- The time that the event happened
- A user associated with the event
The simple triplet above, taken as a time series, contains enough information to help us understand the behavior of our user.
Of course, more data can be added, like custom attributes with which we’d like to enrich the time series, to perform some deeper analysis. But the overall structure of the points in the time series will be flat in most cases.
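To make that flatness concrete, the triplet above could be captured in a table as simple as the sketch below; the table and column names here are hypothetical, chosen only for illustration:
CREATE TABLE user_events (
    user_id    varchar(64),   -- the user associated with the event
    event      varchar(200),  -- the event that happened
    event_time timestamp      -- the time the event happened
);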
A Time Series Example With Mixpanel
Let’s look, for example, at the structure of the events that Mixpanel, a typical service for tracking user events, produces.
{
  "event": "Viewed report",
  "properties": {
    "distinct_id": "foo",
    "time": 1329263748,
    "origin": "invite",
    "origin_referrer": "http://mixpanel.com/projects/",
    "$initial_referring_domain": "mixpanel.com",
    "$referrer": "https://mixpanel.com/report/3/stream/",
    "$initial_referrer": "http://mixpanel.com/",
    "$referring_domain": "mixpanel.com",
    "$os": "Linux",
    "origin_domain": "mixpanel.com",
    "tab": "stream",
    "$browser": "Chrome",
    "Project ID": "3",
    "mp_country_code": "US"
  }
}
From the above example, we see that data is structured around the event together with a set of properties that relate to it, including of course the time that the event was generated.
The rest of the properties are metadata that can help us better understand the event and the user who triggered it.
Amazon Redshift is great; it allows you to quickly spin up a data warehouse that scales to petabytes of data. Like any other technology, to get the maximum out of it, we should model our database according to the characteristics of the technology we will use and the queries we plan to perform on our data.
For Amazon Redshift, we need to have in mind the following.
Deleting a large amount of data is not a trivial task in Amazon Redshift. So, as we are planning to work with an ever-growing dataset of time-dependent data, it is best to have a clear strategy for how we will delete old data.
Amazon Redshift is a distributed data warehouse, so data is spread across different nodes. It is possible to change the way data is distributed, and you should always strive for maximum data locality if possible.
Last but not least, the definition of sort keys may impact the performance of your queries. Sort keys are also used by the query optimizer to come up with optimal query plans.
With the above three characteristics of Amazon Redshift in mind, we can design an optimal table model for time series data that has a structure like the Mixpanel data above.
We anticipate that our analysis will revolve around the generated events, and thus data from the same event should be stored on the same node if possible. For this reason, a good choice is to select a KEY distribution style, where data is clustered together based on a predefined key, which in this case will be the event column.
In this way, it will be guaranteed that data from the same event type will be allocated together.
Regarding the Sort Keys
By definition, as we work with time-dependent data, a sort key on a timestamp column should exist. With Mixpanel data, the natural choice is to make the properties_time column the leading sort key. This key guarantees that our data is always sorted by the time the events were generated and thus adheres better to the nature of time series data.
To optimize the tables a bit further, and as we anticipate working a lot with the events, it is a good idea to add the event field as a second sort key. But make sure that this one follows the time field: time should always be the first sort key when we work with time series data.
The above sort keys and distribution strategy are the bare minimum when we have to work with user event data. Depending on what kind of analysis we plan to perform, we can also add more sort keys, or choose a different distribution key, using the rest of the attributes that describe our events.
For example, in an attribution analysis, we might want to use the initial_referring_domain column as a distribution and sort key. Just keep in mind that, although it is possible to change distribution strategies and sort keys later, it is a cumbersome process, so it is best to define your requirements well before you start feeding your Amazon Redshift cluster with time series data.
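As a rough sketch of what such an attribution-oriented table could look like (the table name and column sizes below are assumptions made for illustration, not part of the Mixpanel schema):
CREATE TABLE events_attribution (
    properties_time          timestamp,
    event                    varchar(200),
    initial_referring_domain varchar(200) distkey  -- cluster rows by referring domain
)
diststyle key
compound sortkey(properties_time, initial_referring_domain);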
To give a concrete example, if we consider the columns event and properties_time, a query to create a table that sorts based on time and event, while distributing the data based on the event type, will look like the following:
CREATE TABLE events (
    properties_time timestamp,            -- when the event was generated
    event           varchar(200) distkey  -- the event type, also the distribution key
)
diststyle key
compound sortkey(properties_time, event); -- time first, then event
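To see how these choices pay off, here is the kind of query this layout favors; the event name and date range below are made up for illustration:
SELECT date_trunc('day', properties_time) AS day,
       count(*) AS viewed_reports
FROM events
WHERE event = 'Viewed report'
  AND properties_time >= '2017-01-01'
  AND properties_time <  '2017-02-01'
GROUP BY 1
ORDER BY 1;
The range filter on properties_time lets Redshift use the leading sort key to skip blocks, while the filter on event is served locally by the slices that hold that event type thanks to the KEY distribution.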
About Deleting Your Old Data
If you do not have to deal with a huge number of rows, hundreds of millions, you probably shouldn't worry too much about how to delete older data. But as we are working with time series data, at some point we might need to implement a strategy that allows us to delete older data efficiently.
A recommended strategy for this is to use time series tables.
For example, create one new table for each month and store the corresponding data in it. If your data retention period is shorter than a month, you can create weekly tables instead. By doing so, you can delete your data by executing a DROP TABLE command, which is much faster than any DELETE command and scales well when you want to delete large amounts of data.
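A minimal sketch of this pattern might look like the following; the month suffixes and the UNION ALL view are illustrative assumptions, not prescriptions from the article:
-- one table per month, each inheriting the layout of the events table above
CREATE TABLE events_2017_01 (LIKE events);
CREATE TABLE events_2017_02 (LIKE events);

-- a view that presents the monthly tables as a single stream of events
CREATE VIEW events_all AS
    SELECT * FROM events_2017_01
    UNION ALL
    SELECT * FROM events_2017_02;

-- retiring a month: redefine the view without the old table, then drop it
CREATE OR REPLACE VIEW events_all AS
    SELECT * FROM events_2017_02;
DROP TABLE events_2017_01;
Dropping the monthly table is a metadata operation, so it avoids the expensive scan, delete, and vacuum cycle that a DELETE over the same rows would require.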
Over to You
The above simple principles are what should guide your choices in designing Amazon Redshift tables for time series data. Following them will help you get the most out of your Amazon Redshift cluster and give you enough mileage before you need to consider a cluster upgrade.