ClickHouse: A Blazingly Fast DBMS With Full SQL Join Support
ClickHouse is a popular open-source real-time analytics database. In this article, we discuss its support for different types of Joins.
Join the DZone community and get the full member experience.
Join For FreeClickHouse is an open-source real-time analytics database built and optimized for use cases requiring super-low latency analytical queries over large amounts of data. To achieve the best possible performance for analytical applications, combining tables in a data denormalization process is typical. Flattened tables help minimize query latency by avoiding joins at the cost of incremental ETL complexity, typically acceptable in return for sub-second queries.
However, denormalizing data isn't always practical for some workloads, for instance, those coming from more traditional data warehouses. Sometimes, part of the source data for analytical queries needs to remain normalized. These normalized tables take less storage and provide flexibility with data combinations, but they require joins at query time for certain types of analysis.
Fortunately, contrary to some misconceptions, joins are fully supported in ClickHouse! In addition to supporting all standard SQL JOIN types, ClickHouse provides additional JOIN types useful for analytical workloads and time-series analysis. ClickHouse allows you to choose between six different algorithms for the join execution or allow the query planner to adaptively choose and dynamically change the algorithm at runtime, depending on resource availability and usage.
Join Types Supported in Clickhouse
We use Venn diagrams and example queries on a normalized IMDB dataset originating from the relational dataset repository to explain the available join types in ClickHouse.
Instructions for creating and loading the tables are here. The dataset is also available in our playground for users wanting to reproduce queries.
We are going to use four tables from our example dataset:
The data in that four tables represent movies. A movie can have one or many genres. The roles in a movie are played by actors. The arrows in the diagram above represent foreign-to-primary-key-relationships. e.g., the movie_id
column of a row in the genres table contains theid
value from a row in the movies table.
There is a many-to-many relationship between movies and actors. This many-to-many relationship is normalized into two one-to-many relationships by using the roles table. Each row in the roles table contains the values of the id fields of the movies table and the actors' table.
Inner Join
The Inner Join returns, for each pair of rows matching on join keys, the column values of the row from the left table, combined with the column values of the row from the right table. If a row has more than one match, then all matches are returned (meaning that the cartesian product is produced for rows with matching join keys).
This query finds the genre(s) for each movie by joining the movies table with the genres table:
SELECT
m.name AS name,
g.genre AS genre
FROM movies AS m
INNER JOIN genres AS g ON m.id = g.movie_id
ORDER BY
m.year DESC,
m.name ASC,
g.genre ASC
LIMIT 10;
┌─name───────────────────────────────────┬─genre─────┐
│ Harry Potter and the Half-Blood Prince │ Action │
│ Harry Potter and the Half-Blood Prince │ Adventure │
│ Harry Potter and the Half-Blood Prince │ Family │
│ Harry Potter and the Half-Blood Prince │ Fantasy │
│ Harry Potter and the Half-Blood Prince │ Thriller │
│ DragonBall Z │ Action │
│ DragonBall Z │ Adventure │
│ DragonBall Z │ Comedy │
│ DragonBall Z │ Fantasy │
│ DragonBall Z │ Sci-Fi │
└────────────────────────────────────────┴───────────┘
10 rows in set. Elapsed: 0.126 sec. Processed 783.39 thousand rows, 21.50 MB (6.24 million rows/s., 171.26 MB/s.)
Note that the Inner
keyword can be omitted. The behavior of the Inner Join can be extended or changed by using one of the following other join types.
(Left/Right/Full) Outer Join
The Left Outer Join behaves like Inner Join; plus, for non-matching left table rows, ClickHouse returns default values for the right table’s columns.
A Right Outer Join query is similar and also returns values from non-matching rows from the right table together with default values for the columns of the left table.
A Full Outer Join query combines the left and right outer join and returns values from non-matching rows from the left and the right table, together with default values for the columns of the right and left table, respectively.
Note that ClickHouse can be configured to return NULLs instead of default values (however, for performance reasons, that is less recommended).
This query finds all movies that have no genre by querying for all rows from the movies table that don’t have matches in the genres table and therefore gets (at query time) the default value 0 for the movie_id
column:
SELECT m.name
FROM movies AS m
LEFT JOIN genres AS g ON m.id = g.movie_id
WHERE g.movie_id = 0
ORDER BY
m.year DESC,
m.name ASC
LIMIT 10;
┌─name──────────────────────────────────────┐
│ """Pacific War, The""" │
│ """Turin 2006: XX Olympic Winter Games""" │
│ Arthur, the Movie │
│ Bridge to Terabithia │
│ Mars in Aries │
│ Master of Space and Time │
│ Ninth Life of Louis Drax, The │
│ Paradox │
│ Ratatouille │
│ """American Dad""" │
└───────────────────────────────────────────┘
10 rows in set. Elapsed: 0.092 sec. Processed 783.39 thousand rows, 15.42 MB (8.49 million rows/s., 167.10 MB/s.)
Note that the Outer
keyword can be omitted.
Cross Join
The Cross Join produces the full cartesian product of the two tables without considering join keys. Each row from the left table is combined with each row from the right table.
The following query, therefore, is combing each row from the movies table with each row from the genres table:
SELECT
m.name,
m.id,
g.movie_id,
g.genre
FROM movies AS m
CROSS JOIN genres AS g
LIMIT 10;
┌─name─┬─id─┬─movie_id─┬─genre───────┐
│ #28 │ 0 │ 1 │ Documentary │
│ #28 │ 0 │ 1 │ Short │
│ #28 │ 0 │ 2 │ Comedy │
│ #28 │ 0 │ 2 │ Crime │
│ #28 │ 0 │ 5 │ Western │
│ #28 │ 0 │ 6 │ Comedy │
│ #28 │ 0 │ 6 │ Family │
│ #28 │ 0 │ 8 │ Animation │
│ #28 │ 0 │ 8 │ Comedy │
│ #28 │ 0 │ 8 │ Short │
└──────┴────┴──────────┴─────────────┘
10 rows in set. Elapsed: 0.024 sec. Processed 477.04 thousand rows, 10.22 MB (20.13 million rows/s., 431.36 MB/s.)
While the previous example query alone didn’t make much sense, it can be extended with a where
clause for associating matching rows to replicate Inner
join behavior for finding the genre(s) for each movie:
SELECT
m.name,
g.genre
FROM movies AS m
CROSS JOIN genres AS g
WHERE m.id = g.movie_id
ORDER BY
m.year DESC,
m.name ASC
LIMIT 10;
┌─name───────────────────────────────────┬─genre─────┐
│ Harry Potter and the Half-Blood Prince │ Action │
│ Harry Potter and the Half-Blood Prince │ Adventure │
│ Harry Potter and the Half-Blood Prince │ Family │
│ Harry Potter and the Half-Blood Prince │ Fantasy │
│ Harry Potter and the Half-Blood Prince │ Thriller │
│ DragonBall Z │ Action │
│ DragonBall Z │ Sci-Fi │
│ DragonBall Z │ Fantasy │
│ DragonBall Z │ Comedy │
│ DragonBall Z │ Adventure │
└────────────────────────────────────────┴───────────┘
10 rows in set. Elapsed: 0.441 sec. Processed 783.39 thousand rows, 21.50 MB (1.78 million rows/s., 48.78 MB/s.)
An alternative syntax for Cross Join specifies multiple tables in the from
clause separated by commas.
ClickHouse is rewriting a Cross Join to an Inner Join if there are joining expressions in the where
section of the query.
We can check that for the example query via EXPLAIN SYNTAX (that returns the syntactically optimized version into which a query gets rewritten before being executed):
EXPLAIN SYNTAX
SELECT
m.name AS name,
g.genre AS genre
FROM movies AS m
CROSS JOIN genres AS g
WHERE m.id = g.movie_id
ORDER BY
m.year DESC,
m.name ASC,
g.genre ASC
LIMIT 10;
┌─explain─────────────────────────────────────┐
│ SELECT │
│ name AS name, │
│ genre AS genre │
│ FROM movies AS m │
│ ALL INNER JOIN genres AS g ON id = movie_id │
│ WHERE id = movie_id │
│ ORDER BY │
│ year DESC, │
│ name ASC, │
│ genre ASC │
│ LIMIT 10 │
└─────────────────────────────────────────────┘
11 rows in set. Elapsed: 0.077 sec.
The Inner Join clause in the syntactically optimized Cross Join query version contains the all
keyword, that got explicitly added in order to keep the cartesian product semantics of the Cross Join even when being rewritten into an Inner Join, for which the cartesian product can be disabled.
And because, as mentioned above, the Outer
keyword can be omitted for a Right Outer Join, and the optional all
keyword can be added. You can write All Right Join
and it will work all right.
Left/Right Semi Join
A Left Semi Join query returns column values for each row from the left table that has at least one join key match in the right table. Only the first found match is returned (the cartesian product is disabled).
A Right Semi Join query is similar and returns values for all rows from the right table with at least one match in the left table, but only the first found match is returned.
This query finds all actors/actresses that performed in a movie in 2023. Note that with a normal (Inner) join, the same actor/actress would show up more than one time if they had more than one role in 2023:
SELECT
a.first_name,
a.last_name
FROM actors AS a
LEFT SEMI JOIN roles AS r ON a.id = r.actor_id
WHERE toYear(created_at) = '2023'
ORDER BY id ASC
LIMIT 10;
┌─first_name─┬─last_name──────────────┐
│ Michael │ 'babeepower' Viera │
│ Eloy │ 'Chincheta' │
│ Dieguito │ 'El Cigala' │
│ Antonio │ 'El de Chipiona' │
│ José │ 'El Francés' │
│ Félix │ 'El Gato' │
│ Marcial │ 'El Jalisco' │
│ José │ 'El Morito' │
│ Francisco │ 'El Niño de la Manola' │
│ Víctor │ 'El Payaso' │
└────────────┴────────────────────────┘
10 rows in set. Elapsed: 0.151 sec. Processed 4.25 million rows, 56.23 MB (28.07 million rows/s., 371.48 MB/s.)
Left/Right Anti Join
A Left Anti Join returns column values for all non-matching rows from the left table.
Similarly, the Right Anti Join returns column values for all non-matching right table rows.
An alternative formulation of our previous outer join example query is using an anti join for finding movies that have no genre in the dataset:
SELECT m.name
FROM movies AS m
LEFT ANTI JOIN genres AS g ON m.id = g.movie_id
ORDER BY
year DESC,
name ASC
LIMIT 10;
┌─name──────────────────────────────────────┐
│ """Pacific War, The""" │
│ """Turin 2006: XX Olympic Winter Games""" │
│ Arthur, the Movie │
│ Bridge to Terabithia │
│ Mars in Aries │
│ Master of Space and Time │
│ Ninth Life of Louis Drax, The │
│ Paradox │
│ Ratatouille │
│ """American Dad""" │
└───────────────────────────────────────────┘
10 rows in set. Elapsed: 0.077 sec. Processed 783.39 thousand rows, 15.42 MB (10.18 million rows/s., 200.47 MB/s.)
Left/Right/Inner Any Join
A Left Any Join is the combination of the Left Outer Join + the Left Semi Join, meaning that ClickHouse returns column values for each row from the left table, either combined with the column values of a matching row from the right table or combined with default column values for the right table, in case no match exists. If a row from the left table has more than one match in the right table, ClickHouse only returns the combined column values from the first found match (the cartesian product is disabled).
Similarly, the Right Any Join is the combination of the Right Outer Join + the Right Semi Join.
And the Inner Any Join is the Inner Join with a disabled cartesian product.
We demonstrate the Left Any Join with an abstract example using two temporary tables (left_table
and right_table
) constructed with the values table function:
WITH
left_table AS (SELECT * FROM VALUES('c UInt32', 1, 2, 3)),
right_table AS (SELECT * FROM VALUES('c UInt32', 2, 2, 3, 3, 4))
SELECT
l.c AS l_c,
r.c AS r_c
FROM left_table AS l
LEFT ANY JOIN right_table AS r ON l.c = r.c;
┌─l_c─┬─r_c─┐
│ 1 │ 0 │
│ 2 │ 2 │
│ 3 │ 3 │
└─────┴─────┘
3 rows in set. Elapsed: 0.002 sec.
This is the same query using a Right Any Join:
WITH
left_table AS (SELECT * FROM VALUES('c UInt32', 1, 2, 3)),
right_table AS (SELECT * FROM VALUES('c UInt32', 2, 2, 3, 3, 4))
SELECT
l.c AS l_c,
r.c AS r_c
FROM left_table AS l
RIGHT ANY JOIN right_table AS r ON l.c = r.c;
┌─l_c─┬─r_c─┐
│ 2 │ 2 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 3 │ 3 │
│ 0 │ 4 │
└─────┴─────┘
5 rows in set. Elapsed: 0.002 sec.
This is the query with an Inner Any Join:
WITH
left_table AS (SELECT * FROM VALUES('c UInt32', 1, 2, 3)),
right_table AS (SELECT * FROM VALUES('c UInt32', 2, 2, 3, 3, 4))
SELECT
l.c AS l_c,
r.c AS r_c
FROM left_table AS l
INNER ANY JOIN right_table AS r ON l.c = r.c;
┌─l_c─┬─r_c─┐
│ 2 │ 2 │
│ 3 │ 3 │
└─────┴─────┘
2 rows in set. Elapsed: 0.002 sec.
ASOF Join
The ASOF Join, implemented for ClickHouse in 2019 by Martijn Bakker and Artem Zuikov, provides non-exact matching capabilities. If a row from the left table doesn’t have an exact match in the right table, then the closest matching row from the right table is used as a match instead.
This is particularly useful for time-series analytics and can drastically reduce query complexity.
We will do time-series analytics of stock market data as an example. A quotes table contains stock symbol quotes based on specific times of the day. The price is updated every 10 seconds in our example data. A trades table lists symbol trades - a specific volume of a symbol got bought at a specific time:
In order to calculate the concrete cost of each trade, we need to match the trades with their closest quote time.
This is easy and compact with the ASOF Join, where we use the ON
clause for specifying an exact match condition and the AND
clause for specifying the closest match condition — we are looking for the closest
row from the quotes table exactly or before the date of a trade:
SELECT
t.symbol,
t.volume,
t.time AS trade_time,
q.time AS closest_quote_time,
q.price AS quote_price,
t.volume * q.price AS final_price
FROM trades t
ASOF LEFT JOIN quotes q ON t.symbol = q.symbol AND t.time >= q.time
FORMAT Vertical;
Row 1:
──────
symbol: ABC
volume: 200
trade_time: 2023-02-22 14:09:05
closest_quote_time: 2023-02-22 14:09:00
quote_price: 32.11
final_price: 6422
Row 2:
──────
symbol: ABC
volume: 300
trade_time: 2023-02-22 14:09:28
closest_quote_time: 2023-02-22 14:09:20
quote_price: 32.15
final_price: 9645
2 rows in set. Elapsed: 0.003 sec.
Note that the ON
clause of the ASOF Join is required and specifies an exact match condition next to the non-exact match condition of the AND
clause.
ClickHouse currently doesn't support (yet) joins without any part of the join keys performing strict matching.
Summary
This blog post showed how ClickHouse supports all standard SQL Join types, plus specialized joins to power analytical queries. We described and demonstrated all supported join types.
Published at DZone with permission of Tom Schreiber. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments