How to Keep Elasticsearch in Sync with Relational Databases?
Hibernate Search is a library that keeps your local Lucene indexes or Elasticsearch cluster in sync with your database.
This article was published in Java Advent Calendar on December 6, 2020.
Many businesses want to take advantage of Elasticsearch’s powerful search capabilities by using it alongside an existing relational database. In this context, it’s not unusual to use Elasticsearch as a caching layer. This raises a basic but important need: keeping Elasticsearch synchronized with the database.
Roughly, synchronization follows the steps below:
- A field containing the insertion or update time is added to the table that will be kept synchronized with Elasticsearch
- A boolean field marking a record as deleted is added to the same table
- A scheduler periodically runs a query that uses both fields to request only the records that have been inserted, modified, or deleted since its last execution
- If there are inserted, updated, or deleted records, the business logic is invoked to apply the corresponding operations to Elasticsearch and, for records marked as deleted, to the database as well
- The scheduler’s run time is stored for use in the next execution
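The steps above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the class and field names (PollingSync, Row, table, index, lastRun) are hypothetical, and two in-memory maps stand in for the database table and the Elasticsearch index.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-ins for the database table and the Elasticsearch index.
class PollingSync {

    // A table row carrying the two extra columns: modification time and deletion flag.
    static final class Row {
        final int id;
        final String payload;
        final Instant modifiedAt;
        final boolean deleted;

        Row(int id, String payload, Instant modifiedAt, boolean deleted) {
            this.id = id;
            this.payload = payload;
            this.modifiedAt = modifiedAt;
            this.deleted = deleted;
        }
    }

    final Map<Integer, Row> table = new HashMap<>();    // stands in for the database table
    final Map<Integer, String> index = new HashMap<>(); // stands in for the search index
    Instant lastRun = Instant.EPOCH;                    // stored scheduler run time

    // One scheduler tick: collect rows changed since the last run and apply them.
    int syncOnce(Instant now) {
        List<Row> changed = new ArrayList<>();
        for (Row row : table.values()) {
            if (row.modifiedAt.isAfter(lastRun)) {
                changed.add(row);
            }
        }
        for (Row row : changed) {
            if (row.deleted) {
                index.remove(row.id); // drop the document from the index
                table.remove(row.id); // purge the soft-deleted row from the table
            } else {
                index.put(row.id, row.payload); // insert or update the document
            }
        }
        lastRun = now; // remembered for the next execution
        return changed.size();
    }
}
```

In a real application, syncOnce would be triggered by a scheduler at a fixed interval, which is exactly where the tuning problem discussed in this section comes from.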
This pattern rests on some assumptions and has disadvantages. First, the scheduler interval is based on an estimate of how often the database is updated. If the database is updated more frequently than assumed, users are likely to see stale data. If it is updated less frequently, resources are wasted on needless polling, even though one of the main purposes of a cache layer is to reduce I/O operations on the database.
A further overhead is that any database query that bypasses the cache must be written to exclude records marked as “deleted”.
Hibernate Search
Hibernate Search is a library that keeps your local Apache Lucene indexes or Elasticsearch cluster in sync with the data it extracts from Hibernate ORM, based on your domain model. You can add this ability to your application with a few settings and annotations.
Base Components
Hibernate Search is based on two key components. Since using the library efficiently depends on understanding them, let’s take a closer look.
Mapper
The mapper component maps your entities to an index and provides APIs to perform indexing and searching. The mapper is configured both through annotations on the entities and through key-value configuration properties.
Backend
The backend is the abstraction over the full-text engines. It implements generic indexing and searching interfaces for use by the mapper and delegates to the engine you chose for your application, for instance the Lucene library or a remote Elasticsearch cluster. The backend is configured partly by the mapper, which tells it which indexes must exist and what fields they must have, and partly through configuration properties.
The mapper and backend work together to provide the following main features:
- Mass indexing to import data from a database
- Automatic indexing to keep indexes in sync with a database
- Searching to query an index
Dependencies
In order to use Hibernate Search, you will need at least two direct dependencies. One of these dependencies is related to the mapper component.
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-mapper-orm</artifactId>
    <version>6.0.0.CR2</version>
</dependency>
The other is related to the backend component and depends on whether your application runs on a single node or on multiple nodes. For Lucene:
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-backend-lucene</artifactId>
    <version>6.0.0.CR2</version>
</dependency>
The Lucene backend indexes the entities on a single node and stores the indexes on the local filesystem. The indexes are accessed through direct calls to the Lucene library, without going through the network. Hence, the Lucene backend is only relevant to single-node applications, and it is a reasonable choice for them.
For Elasticsearch:
<dependency>
    <groupId>org.hibernate.search</groupId>
    <artifactId>hibernate-search-backend-elasticsearch</artifactId>
    <version>6.0.0.CR2</version>
</dependency>
The Elasticsearch backend allows indexing of the entities from multiple nodes and stores the indexes on a remote Elasticsearch cluster. These indexes are not tied to the application and are therefore accessed through calls to REST APIs.
Note that you can use both Lucene and Elasticsearch backends at the same time.
Configuration
The configuration properties of Hibernate Search are sourced from Hibernate ORM, so they can be added to any file from which Hibernate ORM takes its configuration.
These files can be:
- hibernate.properties
- hibernate.cfg.xml
- persistence.xml
In addition to these files, the application properties files of Java runtimes such as Quarkus and Spring can also be used for configuration when you work with them.
Hibernate Search provides sensible defaults for all configuration properties, but there are a few basic configuration parameters that you will need to set explicitly in some cases.
- hibernate.search.backend.directory.root This setting defines where indexes are stored in the file system. It applies when you use the Lucene backend. By default, indexes are stored in the current working directory
- hibernate.search.backend.hosts This setting defines the Elasticsearch host and port, so it applies when you use the Elasticsearch backend. By default, the backend attempts to connect to localhost:9200
- hibernate.search.backend.protocol This setting defines the protocol. You set it explicitly when you need to use https, because its default value is http
- hibernate.search.backend.username and hibernate.search.backend.password These settings define the username and password for basic HTTP authentication
- hibernate.search.backend.analysis.configurer This setting defines a bean reference pointing to the analysis configurer implementation. You use it when you need to customize analysis
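As an illustration, the settings above could appear in a hibernate.properties file roughly as follows. All values are placeholders chosen for this sketch: the directory path, host name, credentials, and the configurer class name are hypothetical, and only the properties that belong to the backend you actually use are relevant.

```properties
# Lucene backend: directory where index files are stored (placeholder path)
hibernate.search.backend.directory.root = /var/lib/meetup/indexes

# Elasticsearch backend: connection settings (placeholder host and credentials)
hibernate.search.backend.hosts = es.example.com:9200
hibernate.search.backend.protocol = https
hibernate.search.backend.username = search_user
hibernate.search.backend.password = change-me

# Custom analysis: bean reference to a configurer class (hypothetical class name)
hibernate.search.backend.analysis.configurer = class:com.example.search.CustomAnalysisConfigurer
```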
Coding Time
Let’s assume that JUG Istanbul uses a meetup app for the meetings it organizes and that the data is stored in a relational database. The domain model contains an Event entity and a Host entity.
Adding a few settings to the application and a few annotations to the entities is sufficient to take advantage of Elasticsearch’s powerful search capabilities via Hibernate Search. The entities look as follows.
Note: As the reader is assumed to be familiar with the basic concepts of Elasticsearch, these concepts will not be explained in detail.
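import java.util.List;
import javax.persistence.*;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.*;

@Entity
@Indexed //(1)
public class Host
{
    @Id
    @GenericField //(2)
    private int id;
    @KeywordField //(3)
    private String firstname;
    @KeywordField
    private String lastname;
    @FullTextField(analyzer = "english") //(4)
    private String title;
    @OneToMany(mappedBy = "host", cascade = CascadeType.ALL, orphanRemoval = true, fetch = FetchType.LAZY)
    @IndexedEmbedded //(5)
    private List<Event> events;
}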
1) The @Indexed annotation registers the Host entity for indexing by the full-text search engine, i.e. Elasticsearch.
2) The @GenericField annotation maps the id field to an index field.
3) The @KeywordField annotation maps the firstname and lastname fields to non-analyzed index fields, which means that their values are not tokenized.
4) The @FullTextField annotation maps the title field to an index field intended for full-text search. In addition, it specifies an analyzer named “english”, which tokenizes and filters the string. This makes it possible to match on individual words (“tokens”) instead of the full string and, for example, to return documents containing “consultant” when searching for “consultation”.
5) The @IndexedEmbedded annotation includes the associated Event entities in the Host index. Thanks to the bidirectional relation, Hibernate Search can also automatically re-index Host when one of its events is updated.
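The practical difference between a keyword field and a full-text field can be illustrated with a toy sketch in plain Java. This is not how Lucene or Elasticsearch implement analysis: the class and method names are hypothetical, tokenization is reduced to lowercasing and splitting on non-word characters, and stemming (which is what lets “consultation” match “consultant” with the “english” analyzer) is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy illustration only: real analyzers are far more elaborate than this.
class FieldMatchingSketch {

    // Keyword-style field: the whole value is a single term and is compared as-is.
    static boolean keywordMatches(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    // Full-text-style field: value and query are both tokenized,
    // and a hit requires at least one shared token.
    static boolean fullTextMatches(String fieldValue, String query) {
        List<String> fieldTokens = tokenize(fieldValue);
        for (String token : tokenize(query)) {
            if (fieldTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    // Simplified analysis: lowercase, then split on runs of non-word characters.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\W+"));
    }
}
```

With a title of "Senior Consultant", the query "consultant" finds nothing on the keyword-style field, because the stored term is the whole string, but it matches on the full-text-style field, because one of the tokens is equal to the query token.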
@Entity
@Indexed
public class Event
{
    @Id
    private int id;

    @FullTextField(analyzer = "english")
    private String name;

    @ManyToOne
    private Host host;
}
These are our entities, so let’s look at how to perform search and CRUD operations on them.
// Spring Web mapping annotations are assumed here; the HTTP verbs are inferred from the operation names.
@GetMapping(path = "/search/hosts", produces = "application/json")
public List<Host> allHosts(){
    // The Hibernate Search session is obtained from the injected entity manager
    SearchSession searchSession = Search.session(entityManager);
    SearchResult<Host> result = searchSession.search(Host.class)
            .where( f -> f.matchAll())
            .fetch(20);
    logger.info("Hit count is {}", result.total().hitCount());
    return result.hits();
}

@PostMapping(path = "/event/add", consumes = "application/json", produces = "application/json")
public Event addEvent(@RequestBody Event event){
    entityManager.persist(event);
    return event;
}

@PutMapping(path = "/event/update", consumes = "application/json", produces = "application/json")
public Event updateEvent(@RequestBody Event event){
    entityManager.merge(event);
    return event;
}

@PostMapping(path = "/host/add", consumes = "application/json", produces = "application/json")
public Host addHost(@RequestBody Host host){
    entityManager.persist(host);
    return host;
}

@PutMapping(path = "/host/update", consumes = "application/json", produces = "application/json")
public Host updateHost(@RequestBody Host host){
    entityManager.merge(host);
    return host;
}

@DeleteMapping(path = "/event/delete/{id}", produces = "text/plain")
public String deleteEventById(@PathVariable("id") int id){
    Event event = entityManager.find(Event.class, id);
    entityManager.remove(event);
    return String.join(" : ", "Removed", event.toString());
}
When you look at the addEvent, updateEvent, and deleteEventById methods, you won’t notice any difference from standard JPA usage. The difference shows in search methods such as allHosts above and, in the full example, searchEventsByName and searchHostsByName. In these methods, the indexes are queried through the Hibernate Search session, which is obtained from the injected entity manager, by specifying a predicate in the where clause.
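A method such as searchEventsByName could look roughly like the following sketch. It assumes the Event entity shown earlier, the Hibernate Search 6 ORM mapper on the classpath, and an injected EntityManager. The wrapper class name EventSearchSketch and the limit of 20 hits are choices made for this illustration, not taken from the example project.

```java
import java.util.List;
import javax.persistence.EntityManager;
import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;

// Hypothetical wrapper class; Event is the entity defined earlier in the article.
class EventSearchSketch {

    private final EntityManager entityManager;

    EventSearchSketch(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    // Full-text match on the analyzed "name" field of the Event index.
    List<Event> searchEventsByName(String name) {
        SearchSession searchSession = Search.session(entityManager);
        return searchSession.search(Event.class)
                .where(f -> f.match().field("name").matching(name))
                .fetchHits(20); // return at most 20 matching entities
    }
}
```

Because the name field is mapped with @FullTextField, the query string goes through the same “english” analyzer as the indexed values, so the match is made on tokens and not on the full string.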
You can find other details of this example, such as configuration, mass indexing, and custom analyzer usage, in this GitHub repository, which shows how to use Hibernate Search with the Spring and Quarkus Java runtimes.
Conclusion
Today, the use of full-text search engines such as Elasticsearch as a cache is widespread. In that case, it is essential to keep Elasticsearch synchronized with the database. Hibernate Search meets this requirement elegantly. It indexes your domain model with the help of a few annotations and keeps your local Apache Lucene indexes or Elasticsearch cluster in sync with the data it extracts from Hibernate ORM. While it provides these facilities, it does not pull the developer away from the familiar JPA syntax.
Published at DZone with permission of Hüseyin Akdoğan. See the original article here.