Postgres Full-Text Search
Let's take a look at Postgres full-text search function and see how to build a query document, create a tsquery, and matching.
Join the DZone community and get the full member experience.
Join For FreeSearch is one of the most important features while building an application. It leads to an increase in the popularity of various search engines like Elastic Search, Solr, etc. However, all of them require a different server to be set up and data sync processes to be implemented.
Would it not be better to have a solution that could minimize overheads and allow us to focus on the main objective, which is Search? Yes, and that's where the Postgres Full-Text Search comes into the picture. It searches for the data right where it is stored in your tables. There is no need to set up a different server.
Postgres FTS (Full-Text Search) supports the following features:
1) Fuzzy Match
2) Ranking
3) Phrase Search
4) Support for multiple languages
Querying Document
Data residing in Postgres tables can be queried using three steps:
- Build Query Document i.e create a tsvector from the description of a company.
- Create a tsquery from your search keywords.
- Get the matching results using match operator @@.
Let's delve deep into each of the above to understand more in detail.
Building Query Document
Our description of the department is in text format. We need to convert it to tsvector, which is basically a list of unique lexemes along with their position in the description. Lexemes are a set of tokens that represent a set of different variants of the same work to look alike. For example:
"In the English language, run, runs, ran, and running are forms of the same lexeme, conventionally written as run."
The to_tsvector function is used to convert plain text description into a list of lexemes.
select * from
to_tsvector('english', 'This department delivers the actual product to customers');
to_tsvector
----------------
'actual':5 'custom':8 'deliv':3 'depart':2 'product':6
In the above example, we parsed a simple text to create a list of tokens. The output contains the unique lexemes along with their position in which they appear in the above text.
Creating a tsquery
We now need to create a tsquery from our search keywords. For that, we will use a Postgres function to_tsquery to get a list of keywords. Let's see an example:
select * from to_tsquery('english', 'product | customer');
to_tsquery
----------------
'product' | 'custom'
In the above example, we passed two arguments to to_tsquery function.
1) arg1: It specifies the name of the dictionary that we want to Postgres to use to understand our text
2) arg2: It is text of keywords separated by boolean operators.
The output of tsquery contains the list of tokens and their logical relationship.
Postgres supports various logical operators like OR(|), AND(&) , NOT(!) and parentheses as well.
Matching Document
Now that we know how to create a tsvector and tsquery, we will see how to search for a word in a text. For this, we use @@ operator to perform a match between a tsvector and tsquery.
select to_tsvector('english', 'This department delivers the actual product to customers')
@@ to_tsquery('english', 'product | customer');
Output
----------
true
select to_tsvector('english', 'This department delivers the actual product to customers')
@@ to_tsquery('english', 'project');
Output
---------
false
As you can see, if there is a match between the text description and keywords, it returns true else false.
Ranking Results
The ts_rank function is used to get the relevancy score of the match. In simple words, the higher the rank the more relevant the match is. Let's see with an example:
1)
select ts_rank(
to_tsvector('english', 'This department delivers the actual product to customers') ,
to_tsquery('english', 'product')
);
ts_rank
-----------
0.0607927106
2)
select ts_rank(
to_tsvector('english', 'This department delivers the actual product to customers') ,
to_tsquery('english', 'product & customers')
);
ts_rank
-----------
0.0985008553
3)
select ts_rank(
to_tsvector('english', 'This department delivers the actual product to customers') ,
to_tsquery('english', 'project')
);
ts_rank
-----------
0
We have mentioned three different SQL queries in the above snippet. Relevant scores across three cases differ depending on the number of words matched with the text description. In case there is no match, rank is calculated as 0.
The value of rank varies from 0 to 1. Higher rank indicates a stronger match.
Phrase Search
In Postgres 9.6, the capability to do a phrase search is also introduced. A special operator <-> can be used to specify the distance between two words in the text. Prior to this version, it was not possible, and Postgres did not consider the position of words during a match.
1)
select
to_tsvector('english', 'This department delivers the actual product to customers')
@@
to_tsquery('english', 'product & customer');
Output
--------
true
2)
select
to_tsvector('english', 'This department delivers the actual product to customers')
@@
to_tsquery('english', 'product <-> customer');
Output
---------
false
In case 1, the match operator will simply check the existence of both words in the text irrespective of the position.
In case 2, the match operator will check the existence of both words in the text and also considers that both these words should appear adjacent to each other.
Optimising Search Results
1) Performance of the Postgres full-text search can be improved by pre-generating the tsvector. It will save a considerable time during match and will result in faster performance. Secondly, a GIN index can be created on the tsvector column to make search more efficient.
2) Quality of search results can be ensured by using the same dictionary for creating tsquery and tsvector. A list of all available dictionaries in Postgres can be obtained by the following command:
select cfgname from pg_catalog.pg_ts_config;
Output
------------
simple
danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
portuguese
romanian
russian
spanish
swedish
turkish
If a different dictionary is used to create tsvector and tsquery, then it may lead to some discrepancies in search results.
Opinions expressed by DZone contributors are their own.
Comments