Time-Based Indexing in Elasticsearch Using Java
Anybody who uses Elasticsearch for indexing time-based data such as application logs is accustomed to the index-per-day pattern: use an index name derived from the timestamp of the logging event rounded to the nearest day (viz. myapp_logs_index_07_11_2019, myapp_logs_index_08_11_2019, etc.), and new indices pop into existence as soon as they are required. It's a classic use case.
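As a quick illustration of the pattern above, here is a minimal Java sketch of how such a daily index name can be derived. The indexFor helper and the myapp_logs_index_ prefix are just the illustrative names from the example, not part of any Elasticsearch API:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DailyIndexName {

    // Round the event's timestamp down to the day and append it to the
    // index prefix, yielding names like myapp_logs_index_07_11_2019.
    static String indexFor(LocalDate eventDate) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("dd_MM_yyyy");
        return "myapp_logs_index_" + eventDate.format(fmt);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(LocalDate.of(2019, 11, 7)));  // myapp_logs_index_07_11_2019
    }
}
```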
Need for Time-Based Indexing
Most traditional use cases for search engines involve a relatively static collection of documents that grows slowly. Searches look for the most relevant documents, regardless of when they were created.
With application logs, the number of documents in the index grows rapidly, often accelerating over time. Documents are almost never (with logs, never) updated, and searches mostly target the most recent documents. As documents age, they lose value.
If we were to have one big index for documents of this type, we would soon run out of space. Logging events just keep coming without pause or interruption. We could delete the old events with a scroll query and a bulk delete, but this approach is very inefficient: when you delete a document, it is only marked as deleted, and it won't be physically removed until the segment containing it is merged away.
Purging old data with time-based indexing is easy — just delete old indices.
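With the daily naming scheme, purging reduces to deleting any index whose date suffix falls outside the retention window. The shouldDelete helper and the 7-day retention below are illustrative assumptions for this sketch, not part of any Elasticsearch API:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class RetentionCheck {

    // Parse the dd_MM_yyyy suffix off the index name and flag the index
    // for deletion if it is older than the retention window.
    static boolean shouldDelete(String indexName, LocalDate today, int retentionDays) {
        String suffix = indexName.substring(indexName.length() - "dd_MM_yyyy".length());
        LocalDate indexDate = LocalDate.parse(suffix, DateTimeFormatter.ofPattern("dd_MM_yyyy"));
        return indexDate.isBefore(today.minusDays(retentionDays));
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2019, 11, 8);
        System.out.println(shouldDelete("myapp_logs_index_07_11_2019", today, 7));  // false
        System.out.println(shouldDelete("myapp_logs_index_01_10_2019", today, 7));  // true
    }
}
```

The indices this check flags can then be dropped outright (e.g. with a DELETE on the index name), which is exactly the cheap purge that one big index cannot offer.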
Rollover API
Elasticsearch provides support for time-based indexing through its Rollover API. It is offered in two forms that I found particularly interesting:
- REST-based APIs
- Java APIs
For testing and playing around with how rollover actually works, the REST endpoint is the most convenient, since it's so easy to set up and run. We will talk about both approaches in this blog.
The Rollover API follows the rollover pattern, which essentially works as follows:
- There is one alias used for indexing that points to the active index.
- Another alias points to active and inactive indices and is used for searching.
- The active index can have as many shards as you have hot nodes to take advantage of the indexing resources of all your expensive hardware.
- When the active index is too full or too old, it is rolled over, a new index is created, and the indexing alias switches atomically from the old index to the new.
- The old index is moved to a cold node and shrunk down to one shard, which can also be force-merged and compressed. However, this will not be covered in this blog.
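The trigger in the fourth bullet ("too full or too old") can be pictured as a simple OR over the configured conditions. This standalone sketch only mirrors the decision Elasticsearch makes server-side; the method and parameter names are illustrative:

```java
public class RolloverConditions {

    // Rollover fires as soon as ANY configured condition is met, mirroring
    // the server-side max_age / max_docs / max_size checks.
    static boolean shouldRollover(long ageSeconds, long docCount, long sizeBytes,
                                  long maxAgeSeconds, long maxDocs, long maxSizeBytes) {
        return ageSeconds >= maxAgeSeconds
                || docCount >= maxDocs
                || sizeBytes >= maxSizeBytes;
    }

    public static void main(String[] args) {
        // 30s old, 12 docs, 1 KB of data: the doc count alone triggers rollover.
        System.out.println(shouldRollover(30, 12, 1024, 60, 10, 5L * 1024 * 1024));  // true
        // 30s old, 3 docs, 1 KB: no condition met yet.
        System.out.println(shouldRollover(30, 3, 1024, 60, 10, 5L * 1024 * 1024));   // false
    }
}
```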
REST-Based Method
We're going to create two aliases: logs-search for searches and logs-write for indexing.
1. First, we create a new index template with a search alias. We will refer to the indices through this alias only for searches.
PUT localhost:9200/_template/logs
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "aliases": {
    "logs-search": {}
  }
}
2. Next, we create the first index, attaching the write alias in the payload. (Rollover conditions are not part of index creation; they are passed to the rollover endpoint in step 4.)
PUT localhost:9200/logs-000001
{
  "aliases": {
    "logs-write": {}
  }
}
3. We index some data using the write alias (note that this is the alias, not the actual index name).
POST localhost:9200/logs-write/_doc/861233345
{
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}
You may see a response similar to this:
{
  "_index": "logs-000001",
  "_type": "_doc",
  "_id": "861233345",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}
4. You need to keep hitting the rollover endpoint to perform the rollover. When any of the three specified conditions is met, the rollover happens and a new index is created with the next name in the sequence, viz. logs-000002. The write alias now points to this new active index. The rollover API is smart enough to detect naming patterns based on numbers or dates and increment to the next value.
POST localhost:9200/logs-write/_rollover
{
  "conditions": {
    "max_age": "5s",
    "max_docs": 5,
    "max_size": "5mb"
  }
}
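The "smart" naming mentioned above can be illustrated for the numeric case: strip the trailing counter, increment it, and zero-pad it back. This is a local sketch of that convention, not the actual Elasticsearch implementation:

```java
public class NextIndexName {

    // logs-000001 -> logs-000002: increment the numeric counter after the
    // last dash, preserving its zero padding.
    static String next(String current) {
        int dash = current.lastIndexOf('-');
        String counter = current.substring(dash + 1);
        long incremented = Long.parseLong(counter) + 1;
        return current.substring(0, dash + 1)
                + String.format("%0" + counter.length() + "d", incremented);
    }

    public static void main(String[] args) {
        System.out.println(next("logs-000001"));  // logs-000002
    }
}
```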
5. To verify that the rollover did, indeed, happen, try writing some new data to the index (again using the alias):
POST localhost:9200/logs-write/_doc/1233
In the response, you can see that the document was written to logs-000002, which is the rolled-over index:
{
  "_index": "logs-000002",
  "_type": "_doc",
  "_id": "861233345",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}
6. For searches, however, you would use the search alias, which continues to point to all the logs-* indices because of the index template we defined in step one. If we were to use the logs-write alias for searching, it would point to only the rolled-over index (just one), and we would not see the documents from the previous indices.
GET localhost:9200/logs-search/_search
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 20,
"successful": 20,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 1,
"hits": [
{
"_index": "logs-000001",
"_type": "_doc",
"_id": "8611234677862",
"_score": 1,
"_source": {
"user": "kimchy",
"post_date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch"
}
},
{
"_index": "logs-000002",
"_type": "_doc",
"_id": "861123467jahd",
"_score": 1,
"_source": {
"user": "kimchy",
"post_date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch"
}
},
{
"_index": "logs-000003",
"_type": "_doc",
"_id": "861123467jahd",
"_score": 1,
"_source": {
"user": "kimchy",
"post_date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch"
}
},
{
"_index": "logs-000004",
"_type": "_doc",
"_id": "861123467jahd",
"_score": 1,
"_source": {
"user": "kimchy",
"post_date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch"
}
},
{
"_index": "logs-000001",
"_type": "_doc",
"_id": "8611234677",
"_score": 1,
"_source": {
"user": "kimchy",
"post_date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch"
}
},
{
"_index": "logs-000002",
"_type": "_doc",
"_id": "8611234677",
"_score": 1,
"_source": {
"user": "kimchy",
"post_date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch"
}
}
]
}
}
As you can see, the search result contains data from different indices (logs-000001 through logs-000004).
7. Fetching from multiple indices was possible because the logs-search alias points to multiple indices. To verify this, use the cat aliases endpoint:
GET localhost:9200/_cat/aliases?v
alias        index        filter  routing.index  routing.search
logs-search  logs-000002  -       -              -
logs-write   logs-000002  -       -              -
logs-search  logs-000001  -       -              -
Also, notice that logs-write points to only one index at a time, which is what we desire.
Rollover Java API
For the Java API, refer to the code here.
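As a starting point, here is a minimal sketch of the same rollover call using the Java High Level REST Client. It assumes the elasticsearch-rest-high-level-client dependency on the classpath and a cluster at localhost:9200, and mirrors the alias and conditions from the REST examples above; treat it as a sketch under those assumptions and check the client documentation for your Elasticsearch version:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.rollover.RolloverRequest;
import org.elasticsearch.client.indices.rollover.RolloverResponse;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class RolloverExample {

    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Roll over the index behind the logs-write alias; as with the
            // REST call, rollover fires when any one condition is met.
            RolloverRequest request = new RolloverRequest("logs-write", null);
            request.addMaxIndexAgeCondition(TimeValue.timeValueSeconds(5));
            request.addMaxIndexDocsCondition(5);
            request.addMaxIndexSizeCondition(new ByteSizeValue(5, ByteSizeUnit.MB));

            RolloverResponse response = client.indices().rollover(request, RequestOptions.DEFAULT);
            System.out.println("rolled over: " + response.isRolledOver()
                    + ", new index: " + response.getNewIndex());
        }
    }
}
```

Passing null as the second constructor argument lets Elasticsearch derive the new index name from the counter suffix, just as in the REST examples.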
Feel free to leave a comment below if you have any questions!