Just Use PostgreSQL, a Quick-Start Guide: Exploring Essential and Extended Capabilities of the Most Beloved Database
Tour some of PostgreSQL's most notable capabilities and use cases, including various extensions for generative AI.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures.
PostgreSQL has been in development and use for over 35 years. Throughout those years, the project that started as an open-source relational database for transactional workloads has turned into one of the most reliable and comprehensive database solutions for a variety of use cases. The depth and breadth of PostgreSQL's capabilities have become so rich that you might hear "just use Postgres" if you ask for advice about database options for your next application.
What explains PostgreSQL's meteoric rise in popularity? Why have over 90,000 developers ranked PostgreSQL as the #1 database on StackOverflow? Why has DB-Engines recognized PostgreSQL as the DBMS of 2023? In short, it's reliable and enterprise-ready, it's expandable by design, and it's a true open-source database that is being developed and stewarded by the community.
So let's take a quick tour through some of PostgreSQL's notable capabilities to see what the database offers to application developers.
Starting PostgreSQL in Minutes
For those getting started with PostgreSQL, let's first see how to start the database on your laptop and generate a sample dataset within a few minutes.
Launching PostgreSQL in Docker
The fastest way to get started is by launching a database container in Docker:
mkdir ~/postgresql_data/
docker run --name postgresql \
-e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=password \
-p 5432:5432 \
-v ~/postgresql_data/:/var/lib/postgresql/data -d ankane/pgvector:latest
This command starts the postgresql container using the postgresql_data directory as a volume for the database's data, logs, and configuration. Once started, the database will listen for incoming connections on port 5432. The container uses the latest version of PostgreSQL with the pgvector extension (ankane/pgvector:latest), which is used at the end of this guide. Note that you can always replace ankane/pgvector:latest with postgres:latest if pgvector is not necessary.
Once started, you can connect to the database with the psql tool that is shipped with PostgreSQL:
docker container exec -it postgresql psql -U postgres
The psql prompt should welcome you as follows:
psql (15.4 (Debian 15.4-2.pgdg120+1))
Type "help" for help.
postgres=#
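To double-check which server you're connected to, you can run a quick query right from the prompt:
-- Show the PostgreSQL version string reported by the server
SELECT version();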
Generating Mock Data
It's not a problem if you don't have a dataset handy for your first experiments with PostgreSQL. The database comes with built-in capabilities that let you generate mock data of varying complexity.
Imagine that we're working with 100 temperature sensors deployed across 10 cities in the United States. Let's see how we can generate a sample dataset without leaving the boundaries of our previously opened psql session.
First, create the sensor table with a few essential columns:
CREATE TABLE sensor (
id int PRIMARY KEY,
name text,
city text
);
Then use the generate_series function of PostgreSQL to generate records for 100 sensors, placing them randomly across 10 cities in the United States:
INSERT INTO sensor (id, name, city)
SELECT
gs.id, -- Setting the sensor ID
'sensor_' || gs.id, -- Generating a unique name for the sensor
(ARRAY[
'New York',
'Los Angeles',
'Chicago',
'Miami',
'Boston',
'Philadelphia',
'Seattle',
'San Francisco',
'Dallas',
'Atlanta'
])[floor(random() * 10) + 1] -- Selecting a random city for the sensor
FROM generate_series(1, 100) AS gs(id); -- Generating IDs for 100 sensors.
Lastly, go ahead and take a look at a subset of the generated data:
SELECT * FROM sensor ORDER BY id LIMIT 5;
The output should be as follows:
id | name | city
----+-----------+---------------
1 | sensor_1 | New York
2 | sensor_2 | Philadelphia
3 | sensor_3 | Dallas
4 | sensor_4 | Boston
5 | sensor_5 | New York
(5 rows)
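Because the cities are picked at random, the distribution differs from run to run. If you're curious how the 100 sensors landed across the 10 cities, a quick aggregate shows it:
-- Count the randomly placed sensors per city (results will vary)
SELECT city, count(*) AS sensors
FROM sensor
GROUP BY city
ORDER BY sensors DESC;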
Exploring the Depth and Breadth of PostgreSQL's Capabilities
By definition, PostgreSQL supports all the capabilities you expect from a relational SQL database, including standard DML and DDL statements, ACID transactions, foreign keys, indexes, materialized views, and more. At the same time, the core PostgreSQL capabilities go far beyond what developers usually expect from a relational database. Let's take a look at a few of these capabilities in action.
Modern SQL
Modern SQL is a category of contemporary SQL capabilities that let you solve various tasks in a concise and efficient manner. Markus Winand might have coined the term after launching a dedicated resource on the topic. PostgreSQL supports several modern SQL capabilities, including common table expressions (CTEs), recursive queries, and window functions. Let's take a look at CTEs and window functions in action.
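Recursive queries aren't part of the sensor example below, but as a minimal illustration, here is a recursive CTE that produces one row per day for the last week, a pattern that's handy for filling gaps in time-based reports:
-- A recursive CTE that generates one row per day for the last 7 days
WITH RECURSIVE days AS (
    SELECT CURRENT_DATE - 7 AS day
    UNION ALL
    SELECT day + 1 FROM days WHERE day < CURRENT_DATE
)
SELECT day FROM days;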
Assume that our sensors continuously send information about the current temperature, and we'd like to monitor the highest temperature reported by every sensor.
First, let's create the sensor_measurement table to keep track of the reported temperatures:
CREATE TABLE sensor_measurement (
id SERIAL PRIMARY KEY,
temperature NUMERIC,
sensor_id INT,
time TIMESTAMP,
FOREIGN KEY (sensor_id) REFERENCES sensor(id)
);
Next, use the generate_series function one more time to generate 1,000 measurements (10 measurements for each sensor from the sensor table):
INSERT INTO sensor_measurement (temperature, sensor_id, time)
SELECT
round((random() * 100)::numeric, 2) AS temperature, -- Generates a random temperature between 0 and 100 degrees Fahrenheit
s.id AS sensor_id,
NOW() - (INTERVAL '1 day' * round((random() * 30)::numeric)) AS time -- Random timestamp within the last 30 days
FROM
sensor s,
generate_series(1, 10) AS gs; -- Inserts 10 measurements for each sensor
Finally, let's find the highest temperature reported by each sensor:
WITH RankedTemperatures AS (
SELECT sensor_id, temperature,
RANK() OVER (PARTITION BY sensor_id ORDER BY temperature DESC) AS temperature_rank
FROM sensor_measurement
)
SELECT sensor_id, temperature
FROM RankedTemperatures
WHERE temperature_rank = 1
ORDER BY sensor_id;
The output should be as follows:
sensor_id | temperature
-----------+-------------
1 | 59.87
2 | 85.76
3 | 94.99
4 | 90.09
5 | 99.40
...truncated
The query uses the RANK() window function to slice and rank the data by sensor ID (PARTITION BY sensor_id). The ranking is calculated as part of the WITH RankedTemperatures AS common table expression. Then, the query returns the rows where temperature_rank = 1, which correspond to the highest temperature reported by each sensor.
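The same question can also be answered with PostgreSQL's DISTINCT ON clause, which keeps only the first row of each group according to the ORDER BY. Unlike RANK(), it returns exactly one row per sensor even when several measurements tie for the top temperature:
-- For each sensor, keep only the row with the highest temperature
SELECT DISTINCT ON (sensor_id) sensor_id, temperature
FROM sensor_measurement
ORDER BY sensor_id, temperature DESC;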
JSON
Even though PostgreSQL is a well-known SQL database, it has supported JSON as a first-class citizen for over 12 years. PostgreSQL comes with specialized data types, operators, and index methods that make it seamless to use the database for workloads typical of document databases.
Continuing our example with the sensors, assume that we decided to store each sensor's technical specification as a JSON object in the sensor table and query that data directly using PostgreSQL's JSON capabilities.
First, alter the sensor table by adding the spec column of the JSONB type:
ALTER TABLE sensor ADD COLUMN spec JSONB;
Next, generate the technical specification for each sensor with the jsonb_build_object function:
UPDATE sensor
SET spec = jsonb_build_object(
'size', jsonb_build_object(
'width', (random() * 10 + 10)::int, -- Width between 10 and 20 inches
'height', (random() * 10 + 10)::int, -- Height between 10 and 20 inches
'depth', (random() * 10 + 10)::int -- Depth between 10 and 20 inches
),
'weight', (random() * 10 + 1)::numeric(4,2), -- Weight between 1 and 11 pounds
'max_temperature_range', jsonb_build_object(
'min', (random() * 10)::int, -- Min temperature between 0 and 10 F
'max', (random() * 100 + 100)::int -- Max temperature between 100 and 200 F
)
);
Lastly, query the JSON objects directly to find the sensors with a weight over five pounds and a maximum temperature range greater than 150°F:
SELECT id, name, city, spec
FROM sensor
WHERE (spec->>'weight')::numeric > 5 AND
(spec->'max_temperature_range'->>'max')::numeric > 150;
The output should be as follows:
id | name | city | spec
----+-----------+---------------+----------------------------------------------------------------------------------------------------------------------
3 | sensor_3 | Seattle | {"size": {"depth": 10, "width": 19, "height": 16}, "weight": 8.01, "max_temperature_range": {"max": 161, "min": 9}}
6 | sensor_6 | Boston | {"size": {"depth": 15, "width": 15, "height": 16}, "weight": 5.86, "max_temperature_range": {"max": 157, "min": 4}}
...truncated
As you can see, the query uses the specialized ->> and -> operators to extract JSON fields at different levels of the JSON structure and then filter the data. Check out the PostgreSQL documentation to learn more about the database's JSON-related capabilities.
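If queries like this run frequently, the spec column can also be indexed. As a sketch (the index name is illustrative), a GIN index combined with the containment operator @> lets PostgreSQL answer structural JSON queries without scanning every row:
-- Index the JSONB column to speed up containment (@>) queries
CREATE INDEX sensor_spec_idx ON sensor USING GIN (spec);

-- Find sensors whose specification reports a minimum temperature of exactly 5°F
SELECT id, name
FROM sensor
WHERE spec @> '{"max_temperature_range": {"min": 5}}';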
Full-Text Search
As a SQL database, PostgreSQL comes with rich capabilities for querying and working with text data. On top of that, the database supports advanced full-text search capabilities that let you analyze large or complex text documents. While traditional text search looks for exact matches of words, full-text search considers linguistic components such as stems, synonyms, and ranking by relevance to provide you with the most applicable search results.
Suppose our temperature sensors occasionally fail for various reasons. The maintenance team tracks these incidents in the sensor_failure table and uses the full-text search capabilities to easily discover similar failures, root causes, and solutions.
First, let's create the sensor_failure table:
CREATE TABLE sensor_failure (
id SERIAL PRIMARY KEY,
sensor_id INT,
failure_description TEXT,
FOREIGN KEY (sensor_id) REFERENCES sensor(id)
);
Next, preload a few sample failure reports into the table:
INSERT INTO sensor_failure (sensor_id, failure_description)
VALUES
(1, 'The sensor failed to report data due to a network connectivity issue. After investigation, it was found that a faulty router caused the problem.'),
(2, 'Temperature readings were inconsistent and showed significant fluctuations. The root cause was identified as a loose connection in the wiring.'),
(3, 'The sensor stopped functioning after exposure to extreme weather conditions. It was determined that water ingress damaged the internal components.'),
(4, 'Power supply interruption caused the sensor to reboot multiple times. The issue was traced back to a malfunctioning power adapter.'),
(5, 'Calibration drift resulted in inaccurate humidity measurements. The cause was a worn-out calibration sensor that needed replacement.'),
(6, 'The sensor experienced intermittent data loss due to network issues. A faulty router was identified as the root cause.'),
(7, 'Sensor experienced hardware failure due to a short circuit. It was found that a nearby lightning strike caused the electrical surge.'),
(8, 'Corrosion on the sensor contacts caused intermittent data loss. The root cause was prolonged exposure to a high-humidity environment.'),
(9, 'The sensor failed to communicate with the server due to network problems. A faulty router was discovered to be the issue.'),
(10, 'Firmware update failed, rendering the sensor unresponsive. The failure was due to an incomplete download of the update file.');
To perform a full-text search over the incidents, PostgreSQL first needs to preprocess the raw text data: the failure descriptions have to be parsed into tokens, such as words and phrases, and then converted into more meaningful units of text called lexemes.
Add the failure_lexemes column to the table, asking PostgreSQL to use the English language rules for tokenizing and normalizing the text. Also, create a GIN index for the lexemes to expedite the search:
ALTER TABLE sensor_failure ADD COLUMN failure_lexemes tsvector
GENERATED ALWAYS AS (to_tsvector('english', failure_description)) STORED;
CREATE INDEX failure_lexemes_idx ON sensor_failure USING GIN (failure_lexemes);
Finally, use the @@ operator to search for incidents related to the "network issue router":
SELECT sensor_id, failure_description
FROM sensor_failure
WHERE failure_lexemes @@ to_tsquery('english', 'network & issue & router');
The output should be as follows:
sensor_id | failure_description
-----------+--------------------------------------------------------------------------------------------------------------------------------------------------
1 | The sensor failed to report data due to a network connectivity issue. After investigation, it was found that a faulty router caused the problem.
6 | The sensor experienced intermittent data loss due to network issues. A faulty router was identified as the root cause.
9 | The sensor failed to communicate with the server due to network problems. A faulty router was discovered to be the issue.
(3 rows)
The search phrase gets normalized to 'network' & 'issue' & 'router'. The order of the words doesn't matter as long as the words have matches in the document.
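The full-text machinery can also rank matches by relevance, as mentioned earlier. As a sketch, the ts_rank function scores each matching report against the query so that the closest match comes first:
-- Rank the matching failure reports by relevance to the search phrase
SELECT sensor_id,
       ts_rank(failure_lexemes, to_tsquery('english', 'network & issue & router')) AS relevance
FROM sensor_failure
WHERE failure_lexemes @@ to_tsquery('english', 'network & issue & router')
ORDER BY relevance DESC;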
Tapping Into PostgreSQL Extensions
While the core PostgreSQL capabilities, such as JSON and full-text search support, make this relational database truly unique, it's the ecosystem of extensions and broader derived solutions that make PostgreSQL one of a kind. Extensibility and pluggability were foundational principles for PostgreSQL. Today, you can find hundreds of extensions and solutions that open up new use cases for PostgreSQL. For instance, there are extensions that let PostgreSQL handle time series and geospatial data easily, function as a message queue, or scale horizontally while tolerating various types of outages.
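To see which extensions your particular installation can enable, you can query the pg_available_extensions view:
-- List the extensions that can be enabled with CREATE EXTENSION
SELECT name, default_version, comment
FROM pg_available_extensions
ORDER BY name;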
How about generative AI (GenAI)? PostgreSQL has extensions for apps leveraging large language models (LLMs) and other machine learning models. Let's look deeper at pgvector, the foundational extension for GenAI apps using PostgreSQL.
PostgreSQL for AI
pgvector is the extension that turns PostgreSQL into a vector database. It adds a new data type, operators, and index types to work with vectorized data (embeddings) in the database. This extension is also used as a foundation for other extensions, such as pg_vectorize, pgvectorscale, and PostgresML, which bring additional capabilities for GenAI workloads.
Suppose we want to leverage AI for our sensor incident reports. For instance, if someone asks a question like "What are the most recent network issues caused by a faulty router?", we want PostgreSQL to find the answer easily.
First, enable the pgvector extension and add the failure_vector column with 3 dimensions:
CREATE EXTENSION vector;
ALTER TABLE sensor_failure ADD COLUMN failure_vector vector(3);
Next, let's assign random vectors to all failure reports:
UPDATE sensor_failure
SET failure_vector = ARRAY[
round((random())::numeric, 2),
round((random())::numeric, 2),
round((random())::numeric, 2)
]::vector;
Then, let's set closer vectors for the incidents mentioning 'network & issue & router':
UPDATE sensor_failure
SET failure_vector = ARRAY[
0.9 + round((random() * 0.02 - 0.01)::numeric, 2),
0.8 + round((random() * 0.02 - 0.01)::numeric, 2),
0.7 + round((random() * 0.02 - 0.01)::numeric, 2)
]::vector
WHERE failure_lexemes @@ to_tsquery('english', 'network & issue & router');
Finally, assuming that the vector for the question "What are the most recent network issues caused by a faulty router?" is [0.9, 0.8, 0.7], we can perform the vector similarity search as follows:
SELECT
sensor_id,
failure_description,
failure_vector,
1 - (failure_vector <=> '[0.9, 0.8, 0.7]') AS cosine_similarity
FROM sensor_failure
WHERE 1 - (failure_vector <=> '[0.9, 0.8, 0.7]') > 0.90
ORDER BY cosine_similarity DESC LIMIT 3;
The output should be as follows:
sensor_id | failure_description | failure_vector | cosine_similarity
-----------+--------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+--------------------
9 | The sensor failed to communicate with the server due to network problems. A faulty router was discovered to be the issue. | [0.9,0.8,0.7] | 1
1 | The sensor failed to report data due to a network connectivity issue. After investigation, it was found that a faulty router caused the problem. | [0.91,0.81,0.7] | 0.9999870975983235
6 | The sensor experienced intermittent data loss due to network issues. A faulty router was identified as the root cause. | [0.9,0.81,0.69] | 0.9999486655364522
In real life, you'll use an embedding model from providers like OpenAI, Meta, Google, or others to generate embeddings for sensor incident reports and user search prompts. Consequently, the failure_vector column will have hundreds or thousands of dimensions. Check out this step-by-step guide to see how to build GenAI apps using pgvector with an LLM.
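On tables with more than a handful of rows, you'll also want an index for the similarity search. As a sketch, assuming pgvector 0.5.0 or newer (the index name is illustrative), an HNSW index with the cosine operator class matches the <=> searches above:
-- Approximate nearest neighbor index for cosine-distance (<=>) searches
CREATE INDEX failure_vector_idx ON sensor_failure
USING hnsw (failure_vector vector_cosine_ops);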
Conclusion
"Just use Postgres" has become a motto of the PostgreSQL community for a good reason. Over the years, the database has gained a broad set of core and extended capabilities that allow it to be used for a variety of use cases far beyond transactional workloads.
However, don't be misled by the motto. The community is not trying to say that PostgreSQL is a Swiss Army knife or the only database you need. Instead, "just use Postgres" is a suggestion to check whether PostgreSQL can meet the new demands coming from your applications. If it can, then you'll benefit from running a single database in production; if not, you can add a specialized database solution to your stack. The choice is yours!
This is an excerpt from DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures.