An Introduction to Elasticsearch
How to start querying data and documents with Elasticsearch with a few detailed examples.
Elasticsearch is an open source, RESTful search engine built on top of Apache Lucene and released under the Apache license. It is Java-based and can search and index document files in diverse formats.
Elasticsearch has been compared to Apache Solr and offers several notable features:
- Provides a scalable search solution.
- Performs near-real-time searches.
- Provides support for multi-tenancy.
- Allows an index to be recovered easily in case of a server crash.
- Uses JavaScript Object Notation (JSON) and Java application programming interfaces (APIs).
- Automatically indexes JSON documents.
- Each index can have its own settings.
- Searches can be done with Lucene-based querystrings.
Indices and Types
Every time you store data in Elasticsearch, it gets saved inside an index, under a type. Compared to MongoDB, an index is similar to a database and a type is similar to a collection. Compared to SQL, an index would be like a database and a type like a table.
Convention:
localhost:9200/{index}/{type}/
Important note: different types in the same index cannot share a field name if that field has a different configuration or field type.
For example, the following two documents can't coexist since they share the same index and both have a city attribute, but of different types (string and object, respectively):
localhost:9200/test/users/1
{
"city": "cityID123"
}
localhost:9200/test/city/1
{
"city": {
"name": "Toronto"
}
}
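The constraint can be illustrated with a small Python sketch. This is not how Elasticsearch validates mappings internally; the find_conflicts helper and the simplified mapping dictionaries are assumptions made purely for illustration.

```python
def find_conflicts(index_mappings):
    """Return field names declared with different types across the
    types of one index.

    index_mappings: {type_name: {field_name: field_type}}
    (a simplified, hypothetical stand-in for real mappings).
    """
    seen = {}          # field name -> first field type encountered
    conflicts = set()
    for type_name, fields in index_mappings.items():
        for field, field_type in fields.items():
            if field in seen and seen[field] != field_type:
                conflicts.add(field)
            seen.setdefault(field, field_type)
    return sorted(conflicts)


# Field types implied by the two documents above for the index "test".
test_index = {
    "users": {"city": "string"},
    "city": {"city": "object"},
}
print(find_conflicts(test_index))  # ['city']
```

An empty list would mean the types can live together in the same index.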
When developing with Elasticsearch, there are 3 main steps we have to consider: mapping, indexing, and searching data.
1. Mapping
Mapping is used to define how Elasticsearch should store and index a particular document and its fields.
However, if no mapping is defined for a field before indexing, Elasticsearch will dynamically assign a generic type to that field. Although this may sound tempting, it is not a good idea: generic types are very basic and, most of the time, do not meet query expectations.
For the rest of this tutorial, we will base our examples on the following data schema:
{
"first_name": "bam",
"last_name": "margera",
"gender": "male",
"level": "super awesome",
"age": 36
}
To make things more efficient, we're going to create the index, type, and mapping for the schema in one request, like the following:
PUT localhost:9200/test/
{
"mappings": {
"users": {
"properties": {
"age": {
"type": "long"
},
"first_name": {
"type": "string"
},
"gender": {
"type": "string"
},
"level": {
"type": "string"
},
"last_name": {
"type": "string"
}
}
}
}
}
This creates an index called test and a type called users, with the 5 fields that it contains.
Note that field types can have the following values: string, date, long, double, boolean, ip, object, nested, geo_point, and geo_shape.
If everything goes well, we should get the following response:
{
"acknowledged": true
}
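As a rough sketch, the same request can be assembled from Python with only the standard library. The helper name build_create_index_request and the http://localhost:9200 base URL are assumptions for illustration; the request is built but not sent, so nothing here requires a running node.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, type_name, field_types):
    """Build (but do not send) a PUT request that creates an index
    with one type and an explicit mapping."""
    body = {
        "mappings": {
            type_name: {
                "properties": {
                    name: {"type": ftype} for name, ftype in field_types.items()
                }
            }
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request(
    "http://localhost:9200",  # assumed local node
    "test",
    "users",
    {
        "age": "long",
        "first_name": "string",
        "gender": "string",
        "level": "string",
        "last_name": "string",
    },
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it against a running node:
# urllib.request.urlopen(request)
```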
Now that we have told Elasticsearch what kind of data we want to insert, let's go ahead and index, or store, it.
2. Indexing
Indexing, or storing, is the process of inserting data into Elasticsearch with the Index API to make it searchable.
So let's index 3 simple documents:
POST localhost:9200/test/users/
{
"first_name": "Bam",
"last_name": "Margera",
"gender": "male",
"level": "super awesome",
"age": 36
}
POST localhost:9200/test/users/
{
"first_name": "Stephanie",
"last_name": "Hodge",
"gender": "female",
"level": "awesome",
"age": 34
}
POST localhost:9200/test/users/
{
"first_name": "Johnny",
"last_name": "Knoxville",
"gender": "male",
"level": "awesome",
"age": 45
}
When any of the preceding docs is indexed successfully, we should see a response like this:
{
"_index": "test",
"_type": "users",
"_id": "AVRQDOka0YBBUjDwpzQQ",
"_version": 1,
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": true
}
Where _id is an ID generated by Elasticsearch: a 20-character, URL-safe, Base64-encoded GUID string.
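The length follows from Base64 arithmetic: 20 characters at 6 bits each carry 120 bits, or 15 bytes. The sketch below only demonstrates that arithmetic with random bytes; it is an assumption-laden illustration and not the generator Elasticsearch actually uses.

```python
import base64
import os

# 15 bytes = 120 bits = exactly 20 Base64 characters, so no "=" padding.
raw = os.urandom(15)  # random bytes stand in for the real GUID bytes
candidate_id = base64.urlsafe_b64encode(raw).decode("ascii")

print(candidate_id, len(candidate_id))
```

The URL-safe alphabet replaces "+" and "/" with "-" and "_", which is why such an ID can be placed directly in a request path.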
We can also specify our own id after the type like this:
POST localhost:9200/test/users/MyID123
{
"first_name": "Bam",
"last_name": "Margera",
"gender": "male",
"level": "super awesome",
"age": 36
}
Now that we have our data indexed, let's move forward to query it.
3. Searching
In this section we will cover Elasticsearch queries, filters, and aggregations for search.
To search in a specific index and type, the following convention is used:
POST localhost:9200/test/users/_search
Sending this request returns a response like:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "users",
"_id": "AVRQQlCE0YBBUjDwpzQZ",
"_score": 1,
"_source": {
"first_name": "Bam",
"last_name": "Margera",
"gender": "male",
"level": "super awesome",
"age": 36
}
},
... the other 2 docs go here ...
]
}
}
Looking at this response, we can see that the data we inserted is found in the hits.hits array, inside each hit's _source object. Since we didn't actually specify anything to search for, we get a _score of 1 for all docs.
At the top level, hits.total is the total number of docs matched (here by an empty search query), and max_score is the highest score any matching document received for the query. In our case it's 1, since no query was specified.
In _shards.total, the value is the number of shards (each one a Lucene index) that Elasticsearch created for that index. The default is 5 unless we specify otherwise at index creation time. More details about shards can be found in the Elasticsearch documentation.
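A minimal sketch of reading those fields from a parsed response in Python follows. The summarize_hits helper is a hypothetical name, and the sample dictionary is a trimmed copy of the response above with only the first hit included.

```python
def summarize_hits(response):
    """Pull the fields discussed above out of a search response dict."""
    hits = response["hits"]
    return {
        "shards": response["_shards"]["total"],
        "total": hits["total"],
        "max_score": hits["max_score"],
        "sources": [hit["_source"] for hit in hits["hits"]],
    }


# Trimmed copy of the response above; only the first hit is included.
sample = {
    "took": 4,
    "timed_out": False,
    "_shards": {"total": 5, "successful": 5, "failed": 0},
    "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
            {
                "_index": "test",
                "_type": "users",
                "_id": "AVRQQlCE0YBBUjDwpzQZ",
                "_score": 1,
                "_source": {
                    "first_name": "Bam",
                    "last_name": "Margera",
                    "gender": "male",
                    "level": "super awesome",
                    "age": 36,
                },
            }
        ],
    },
}

summary = summarize_hits(sample)
print(summary["total"], summary["max_score"], summary["shards"])
print(summary["sources"][0]["first_name"])
```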
a. Queries
Queries are what we use to get results with scoring (relevance).
To ask a question like:
- level = "super awesome"
we can use the match query, which performs full-text search on analyzed fields:
POST localhost:9200/test/users/_search
{
"query": {
"match": {
"level": "super awesome"
}
}
}
The response will be:
{
"took": 19,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.2712221,
"hits": [
{
"_index": "test",
"_type": "users",
"_id": "AVRQQlCE0YBBUjDwpzQZ",
"_score": 0.2712221,
"_source": {
"first_name": "Bam",
"level": "super awesome",
...
}
},
{
"_index": "test",
"_type": "users",
"_id": "AVRQRtYW0YBBUjDwpzQa",
"_score": 0.09848769,
"_source": {
"first_name": "Stephanie",
"level": "awesome",
...
}
},
{
"_index": "test",
"_type": "users",
"_id": "AVRQRx-E0YBBUjDwpzQf",
"_score": 0.09848769,
"_source": {
"first_name": "Johnny",
"level": "awesome",
...
}
}
]
}
}
As we can see, the user Bam scored the highest with 0.2712221 since his level was "super awesome", whereas Stephanie and Johnny scored an equal 0.09848769 since their level was just "awesome".
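All three users match because the match query analyzes "super awesome" into separate terms and, by default, accepts documents containing any of them. The toy Python sketch below only shows which terms overlap; the analyze function is a crude assumption (lowercase and split on whitespace) and the code does not reproduce the real relevance formula or the scores above.

```python
def analyze(text):
    """Very rough stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def matched_terms(query_text, field_value):
    """Return the query terms that also occur in the field value."""
    field_terms = set(analyze(field_value))
    return [term for term in analyze(query_text) if term in field_terms]


levels = {
    "Bam": "super awesome",
    "Stephanie": "awesome",
    "Johnny": "awesome",
}
for name, level in levels.items():
    print(name, matched_terms("super awesome", level))
# Bam ['super', 'awesome']
# Stephanie ['awesome']
# Johnny ['awesome']
```

Bam matches both terms while the other two match only one, which is consistent with Bam ranking first.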
For exact values, such as non-analyzed fields, numbers, dates, and Booleans, it's better to use the term query:
- age = 36
POST localhost:9200/test/users/_search
{
"query": {
"term": {
"age": 36
}
}
}
This query will return only Bam.
To combine more than one query, we can use the bool query to find:
- level = "super awesome" AND "age" < 40
POST localhost:9200/test/users/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"level": "super awesome"
}
},
{
"range": {
"age": {
"lt": 40
}
}
}
]
}
}
}
Here must is an array whose clauses are combined with AND. bool also supports should, implying OR, and must_not.
We also used the range query on age with "lt" (less than) 40; range also supports "lte", "gt", and "gte".
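The meaning of the four range operators can be sketched locally in Python. This only mirrors the comparison semantics on plain numbers; the in_range helper and the ages dictionary are assumptions for illustration and have nothing to do with how Elasticsearch executes a range query.

```python
import operator

# Map each range keyword to the comparison it stands for.
RANGE_OPERATORS = {
    "lt": operator.lt,    # less than
    "lte": operator.le,   # less than or equal
    "gt": operator.gt,    # greater than
    "gte": operator.ge,   # greater than or equal
}


def in_range(value, conditions):
    """True if value satisfies every condition, e.g. {"gte": 30, "lt": 40}."""
    return all(
        RANGE_OPERATORS[name](value, bound) for name, bound in conditions.items()
    )


ages = {"Bam": 36, "Stephanie": 34, "Johnny": 45}
under_40 = [name for name, age in ages.items() if in_range(age, {"lt": 40})]
print(under_40)  # ['Bam', 'Stephanie']
```

Combined with the level match under must, only Bam satisfies both conditions.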
b. Filters
Filters are non-scoring queries that can be used when the score has no importance. A filter returns a boolean answer, "yes" or "no", for each document, and the score is always 1.
Executing the following filter has no effect on the score, but it will return only 2 docs:
POST localhost:9200/test/users/_search
{
"filter": {
"match": {
"gender": "male"
}
}
}
We can also combine this with the previous query:
- level = "super awesome" AND only return gender = "male"
POST localhost:9200/test/users/_search
{
"query": {
"match": {
"level": "super awesome"
}
},
"filter": {
"match": {
"gender": "male"
}
}
}
This will return only 2 users, Bam and Johnny, scoring 0.2712221 and 0.09848769 respectively, since Bam has a more relevant level than Johnny.
Although this works fine, it is bad for performance, since it executes the query first and then applies the filter to the returned results.
To make Elasticsearch apply the filter first, limiting the number of docs before the query runs, we should wrap everything in a bool clause and add the filter next to must:
POST localhost:9200/test/users/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"level": "super awesome"
}
}
],
"filter": {
"match": {
"gender": "male"
}
}
}
}
}
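When several searches share this shape, the body can be assembled with a small helper. The sketch below is one possible way to do it in Python; bool_query and its parameter names are hypothetical, and the function only builds the JSON body without sending it.

```python
import json


def bool_query(must=None, should=None, must_not=None, filter_clause=None):
    """Assemble a bool query body, leaving out clauses that were not given."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_clause,
    }
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


body = bool_query(
    must=[{"match": {"level": "super awesome"}}],
    filter_clause={"match": {"gender": "male"}},
)
print(json.dumps(body, indent=2))
```

The printed body has the same structure as the request above: the scoring clause sits under must, and the non-scoring condition sits under filter.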
More More More...
- level = "super awesome", and age < 40 but only return gender = "male"
We would write:
POST localhost:9200/test/users/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"level": "super awesome"
}
},
{
"range": {
"age": {
"lt": 40
}
}
}
],
"filter": {
"match": {
"gender": "male"
}
}
}
}
}
This will return only 1 user, Bam, scoring 1.0253175.
Important note: we can also combine more than 1 filter using bool.
So as Elasticsearch states: "As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filters for everything else."
c. Aggregations
Aggregations are a big part of Elasticsearch; they are used to calculate stats about our data. They are divided into 3 different types:
- Metrics Aggregations
- Bucket Aggregations
- Pipeline Aggregations
In this tutorial I'm going to cover the Terms Aggregation, which is part of the Bucket Aggregations.
- How many females and males do we have in our index/type?
We can write the following:
POST localhost:9200/test/users/_search
{
"size": 0,
"aggs" : {
"genders" : {
"terms" : { "field" : "gender" }
}
}
}
We set "size" to 0 since we don't want to see any search results, just the aggs results. "aggs" is a predefined Elasticsearch property, followed by "genders", which is a name that we can choose freely; we could call it "xyz" if we wanted. "terms" indicates that we are performing a terms aggregation, and "field" specifies the field we want to aggregate on, here gender.
Response:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"genders": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "male",
"doc_count": 2
},
{
"key": "female",
"doc_count": 1
}
]
}
}
}
What we want is everything inside the buckets array, which tells us that we have 1 female and 2 males.
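Conceptually, a terms aggregation groups documents by the value of a field and counts each group. The Python sketch below reproduces that counting locally on the three example users; it illustrates the idea only and is not how Elasticsearch computes buckets across shards.

```python
from collections import Counter

# The three users indexed earlier, reduced to the fields needed here.
users = [
    {"first_name": "Bam", "gender": "male"},
    {"first_name": "Stephanie", "gender": "female"},
    {"first_name": "Johnny", "gender": "male"},
]

# Count documents per distinct gender value, largest group first.
counts = Counter(user["gender"] for user in users)
buckets = [{"key": key, "doc_count": n} for key, n in counts.most_common()]

print(buckets)
# [{'key': 'male', 'doc_count': 2}, {'key': 'female', 'doc_count': 1}]
```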
The power of aggs is that they can be combined with any filter or query.
So using the last filter we created, we can simply say:
{
"query": {
"bool": {
"must": [
{
"match": {
"level": "super awesome"
}
},
{
"range": {
"age": {
"lt": 40
}
}
}
],
"filter": {
"match": {
"gender": "male"
}
}
}
},
"aggs" : {
"genders" : {
"terms" : { "field" : "gender" }
}
}
}
This will return Bam with a male count = 1 : )
TADAH!
Published at DZone with permission of Hasan Rahhal. See the original article here.