Postgres Full-Text Search

Let's take a look at Postgres full-text search function and see how to build a query document, create a tsquery, and matching.

Sagar Sharma

Updated Aug. 01, 18 · Tutorial

Likes (5)

Comment

Save

19.3K Views

Search is one of the most important features while building an application. It leads to an increase in the popularity of various search engines like Elastic Search, Solr, etc. However, all of them require a different server to be set up and data sync processes to be implemented.

Would it not be better to have a solution that could minimize overheads and allow us to focus on the main objective, which is Search? Yes, and that's where the Postgres Full-Text Search comes into the picture. It searches for the data right where it is stored in your tables. There is no need to set up a different server.

Postgres FTS (Full-Text Search) supports the following features:

1) Fuzzy Match

2) Ranking

3) Phrase Search

4) Support for multiple languages

Querying Document

Data residing in Postgres tables can be queried using three steps:

Build Query Document i.e create a tsvector from the description of a company.
Create a tsquery from your search keywords.
Get the matching results using match operator @@.

Let's delve deep into each of the above to understand more in detail.

Building Query Document

Our description of the department is in text format. We need to convert it to tsvector, which is basically a list of unique lexemes along with their position in the description. Lexemes are a set of tokens that represent a set of different variants of the same work to look alike. For example:

"In the English language, run, runs, ran, and running are forms of the same lexeme, conventionally written as run."

The to_tsvector function is used to convert plain text description into a list of lexemes.

select * from 
to_tsvector('english', 'This department delivers the actual product to customers');

to_tsvector
----------------
'actual':5 'custom':8 'deliv':3 'depart':2 'product':6

In the above example, we parsed a simple text to create a list of tokens. The output contains the unique lexemes along with their position in which they appear in the above text.

Creating a tsquery

We now need to create a tsquery from our search keywords. For that, we will use a Postgres function to_tsquery to get a list of keywords. Let's see an example:

select * from to_tsquery('english', 'product | customer');

to_tsquery
----------------
'product' | 'custom'

In the above example, we passed two arguments to to_tsquery function.

1) arg1: It specifies the name of the dictionary that we want to Postgres to use to understand our text
2) arg2: It is text of keywords separated by boolean operators.

The output of tsquery contains the list of tokens and their logical relationship.

Postgres supports various logical operators like OR(|), AND(&) , NOT(!) and parentheses as well.

Matching Document

Now that we know how to create a tsvector and tsquery, we will see how to search for a word in a text. For this, we use @@ operator to perform a match between a tsvector and tsquery.

select to_tsvector('english', 'This department delivers the actual product to customers') 
@@ to_tsquery('english', 'product | customer');

Output
----------
true

select to_tsvector('english', 'This department delivers the actual product to customers') 
@@ to_tsquery('english', 'project');

Output
---------
false

As you can see, if there is a match between the text description and keywords, it returns true else false.

Ranking Results

The ts_rank function is used to get the relevancy score of the match. In simple words, the higher the rank the more relevant the match is. Let's see with an example:

1) 
select ts_rank(
  to_tsvector('english', 'This department delivers the actual product to customers') , 
  to_tsquery('english', 'product')
);

ts_rank
-----------
0.0607927106

2)
select ts_rank(
  to_tsvector('english', 'This department delivers the actual product to customers') , 
  to_tsquery('english', 'product & customers')
);

ts_rank
-----------
0.0985008553

3)
select ts_rank(
  to_tsvector('english', 'This department delivers the actual product to customers') , 
  to_tsquery('english', 'project')
);

ts_rank
-----------
0

We have mentioned three different SQL queries in the above snippet. Relevant scores across three cases differ depending on the number of words matched with the text description. In case there is no match, rank is calculated as 0.

The value of rank varies from 0 to 1. Higher rank indicates a stronger match.

Phrase Search

In Postgres 9.6, the capability to do a phrase search is also introduced. A special operator <-> can be used to specify the distance between two words in the text. Prior to this version, it was not possible, and Postgres did not consider the position of words during a match.

1)
select 
to_tsvector('english', 'This department delivers the actual product to customers') 
  @@ 
to_tsquery('english', 'product & customer');

Output
--------
true

2)
select 
to_tsvector('english', 'This department delivers the actual product to customers') 
  @@ 
to_tsquery('english', 'product <-> customer');

Output
---------
false

In case 1, the match operator will simply check the existence of both words in the text irrespective of the position.

In case 2, the match operator will check the existence of both words in the text and also considers that both these words should appear adjacent to each other.

Optimising Search Results

1) Performance of the Postgres full-text search can be improved by pre-generating the tsvector. It will save a considerable time during match and will result in faster performance. Secondly, a GIN index can be created on the tsvector column to make search more efficient.

2) Quality of search results can be ensured by using the same dictionary for creating tsquery and tsvector. A list of all available dictionaries in Postgres can be obtained by the following command:

select cfgname from pg_catalog.pg_ts_config;

Output
------------
simple
danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
portuguese
romanian
russian
spanish
swedish
turkish

If a different dictionary is used to create tsvector and tsquery, then it may lead to some discrepancies in search results.

Database PostgreSQL

Opinions expressed by DZone contributors are their own.

Related

Trending