How SQL DISTINCT and ORDER BY are Related
These two pieces of SQL may seem similar, but just how similar are they? We take an in-depth look.
Join the DZone community and get the full member experience.
Join For FreeOne of the things that confuse SQL users all the time is how DISTINCT
and ORDER BY
are related in a SQL query.
The Basics
Running some queries against the Sakila database, most people quickly understand:
SELECT DISTINCT length FROM film
This returns results in an arbitrary order because the database can (and might apply hashing rather than ordering to remove duplicates):
length |
-------|
129 |
106 |
120 |
171 |
138 |
80 |
...
Most people also understand:
SELECT length FROM film ORDER BY length
This will give us duplicates, but in order:
length |
-------|
46 |
46 |
46 |
46 |
46 |
47 |
47 |
47 |
47 |
47 |
47 |
47 |
48 |
...
And, of course, we can combine the two:
SELECT DISTINCT length FROM film ORDER BY length
Resulting in...
length |
-------|
46 |
47 |
48 |
49 |
50 |
51 |
52 |
53 |
54 |
55 |
56 |
...
Then Why Doesn't This Work?
Maybe somewhat intuitively, we may want to order the lengths differently, e.g. by title:
SELECT DISTINCT length FROM film ORDER BY title
Most databases fail this query with an exception like Oracle's:
ORA-01791: not a SELECTed expression
At first sight, this seems funny, because this works after all:
SELECT length FROM film ORDER BY title
Yielding:
length |
-------|
86 |
48 |
50 |
117 |
130 |
...
We could add the title to illustrate the ordering:
length |title |
-------|----------------------------|
86 |ACADEMY DINOSAUR |
48 |ACE GOLDFINGER |
50 |ADAPTATION HOLES |
117 |AFFAIR PREJUDICE |
130 |AFRICAN EGG |
So, How Are These Different?
We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order, executed by the optimizer.
When we write something like this:
SELECT DISTINCT length FROM film ORDER BY length
The logical order of operations is:
FROM
clause, loading theFILM
tableSELECT
clause, projecting theLENGTH
columnDISTINCT
clause, removing distinct tuples (with projectedLENGTH
columns)ORDER BY
clause, ordering by theLENGTH
column
If we look at this step by step, we have:
Step 1: SELECT * FROM film
The intermediary data set is something like:
film_id |title |length | ...
--------|----------------------------|-------| ...
1 |ACADEMY DINOSAUR |86 | ...
2 |ACE GOLDFINGER |48 | ...
3 |ADAPTATION HOLES |50 | ...
4 |AFFAIR PREJUDICE |117 | ...
5 |AFRICAN EGG |130 | ...
... |... |... | ...
Step 2: SELECT length ...
The intermediary data set is something like:
length |
-------|
86 |
48 |
50 |
117 |
130 |
...
86 | <-- duplicate
Step 3: SELECT DISTINCT length ...
Now we're getting a new random order (due to hashing) and no duplicates anymore:
length |
-------|
129 |
106 |
120 |
171 |
138 |
...
Step 4: ... ORDER BY length
And we're getting:
length |
-------|
46 |
47 |
48 |
49 |
50 |
...
It seems obvious.
So Why Did This Work?
Remember, this query worked:
SELECT length FROM film ORDER BY title
Even if after projecting the LENGTH
column, it seems as though it is no longer available for sorting, it really is, according to the SQL standard and to common sense. There is a concept called extended sort key columns in the SQL standard, which means the above query has a slightly different order of operations (apart from the fact that there is no DISTINCT
operation):
FROM
clause, loading theFILM
tableSELECT
clause, projecting theLENGTH
column from the select list and theTITLE
from the extended sort key columnsORDER BY
clause, ordering by theTITLE
columnSELECT
clause (implicit), projecting only theLENGTH
column, discarding theTITLE
column
Again, this is what happens logically. Database optimisers may choose other ways to implement this. By example:
Step 1: SELECT * FROM film
Same as before
film_id |title |length | ...
--------|----------------------------|-------| ...
1 |ACADEMY DINOSAUR |86 | ...
2 |ACE GOLDFINGER |48 | ...
3 |ADAPTATION HOLES |50 | ...
4 |AFFAIR PREJUDICE |117 | ...
5 |AFRICAN EGG |130 | ...
... |... |... | ...
Step 2: SELECT length, title...
We get that synthetic extended sort key column TITLE
along with the LENGTH
column that we requested
length |title |
-------|----------------------------|
86 |ACADEMY DINOSAUR |
114 |ALABAMA DEVIL |
50 |ADAPTATION HOLES |
117 |AFFAIR PREJUDICE |
168 |ANTITRUST TOMATOES |
...
Step 3: ...ORDER BY title
...we can now order by that column
length |title |
-------|----------------------------|
86 |ACADEMY DINOSAUR |
48 |ACE GOLDFINGER |
50 |ADAPTATION HOLES |
117 |AFFAIR PREJUDICE |
130 |AFRICAN EGG |
...
Step 4: SELECT length
...and finally discard it because we never wanted it
length |
-------|
86 |
48 |
50 |
117 |
130 |
So Why Can't We Use DISTINCT?
If we try to run this:
SELECT DISTINCT length FROM film ORDER BY title
We would get an additional DISTINCT
operation in our logical set of operations:
FROM
clause, loading theFILM
tableSELECT
clause, projecting theLENGTH
column from the select list and theTITLE
from the extended sort key columnsDISTINCT
clause, removing duplicate(LENGTH, TITLE)
values... OoopsORDER BY
clause, ordering by theTITLE
columnSELECT
clause (implicit), projecting only theLENGTH
column, discarding theTITLE
column
The problem is, since we have synthetically added the extended sort key columnTITLE
to the projection in order to be able to ORDER BY
it, DISTINCT
wouldn't have the same semantics anymore as can be seen here:
SELECT count(*)
FROM (
SELECT DISTINCT length FROM film
) t;
SELECT count(*)
FROM (
SELECT DISTINCT length, title FROM film
) t;
Yielding:
140
1000
All titles are distinct. There is no way this query can be executed reasonably. Either DISTINCT
doesn't work (because the added extended sort key column changes its semantics), or ORDER BY
doesn't work (because after DISTINCT
we can no longer access the extended sort key column).
A more constructed example. T contains this data:
CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (1, 2);
INSERT INTO t VALUES (2, 3);
INSERT INTO t VALUES (1, 4);
INSERT INTO t VALUES (2, 5);
A B
-----
1 1
1 2
2 3
1 4
2 5
What would this query produce?
SELECT DISTINCT a FROM t ORDER BY b;
Clearly, we should only get 2 rows with values 1, 2, because of DISTINCT a
:
A
--
1
2
Now, how do we order these by B
? There are 3 values of B associated A = 1
and 2 values of B associated with A = 2
:
A B
------------------
1 Any of 1, 2, 4
2 Any of 3, 5
Should we get 1, 2 or 2, 1 as a result? Impossible to tell.
But There Are Some Exceptions
The way I read the SQL standard, the following exception should be possible. The SQL standard ISO/IEC 9075-2:2016(E), 7.17 <query expression>, Syntax Rules 28) d) i) 6) references the "Left normal form derivation." But I may be reading this wrong, see also a discussion on the PostgreSQL mailing list:
https://www.postgresql.org/message-id/20030819103859.L69440-100000%40megazone.bigpanda.com
In any case, it still makes sense to me. For instance, we can form expressions on the columns in the select list. This is totally fine in MySQL (strict mode) and Oracle:
SELECT DISTINCT length
FROM film
ORDER BY mod(length, 10), length;
It will produce:
length |
-------|
50 |
60 |
70 |
80 |
90 |
100 |
110 |
120 |
130 |
140 |
150 |
160 |
170 |
180 |
51 |
61 |
71 |
PostgreSQL doesn't allow this because the expression MOD(LENGTH, 10)
is not in the select list. How to interpret this? We're looking again at the order of SQL operations:
FROM
clause, loading theFILM
tableSELECT
clause, projecting theLENGTH
column from the select list.MOD(LENGTH, 10)
does not have to be put in the extended sort key columns, because it can be fully derived from the select list.DISTINCT
clause, removing duplicateLENGTH
values ... all fine, because we don't have the verboten extended sort key columnsORDER BY
clause, ordering by themod(LENGTH, 10), LENGTH
columns. Totally fine, because we can derive all of these order by expressions from expressions in the select list
Makes sense, right?
Back to our constructed table T:
A B
-----
1 1
1 2
2 3
1 4
2 5
We are allowed to write:
SELECT DISTINCT a, b FROM t ORDER BY a - b;
We would get:
A B
-----
1 4
2 5
2 3
1 2
1 1
Again, the order by expressions can be derived completely from the select list. This also works in Oracle:
SELECT DISTINCT a - b FROM t ORDER BY abs(a - b);
The select list contains a column A - B
, so we can derive any ORDER BY
expression from it. But these don't work:
SELECT DISTINCT a - b FROM t ORDER BY a;
SELECT DISTINCT a - b FROM t ORDER BY b;
SELECT DISTINCT a - b FROM t ORDER BY b - a;
It is easy to build an intuition for why these don't work. Clearly, the data set we want is:
A - B A B B - A
------------------------------------------
-3 Any of 1, 2 Any of 4, 5 3
-1 Any of 2, 1 Any of 3, 2 1
0 Any of 1 Any of 1 0
Now, how are we supposed to order these by A, B
or B - A
? It looks as though we should be able to sort by B - A
in this case. We could derive a complicated transformation of expressions that can be reasonably transformed into each other, such as A - B = -(B - A)
, but this simply isn't practical. The expression in the projection is A - B
, and that's the only expression we can re-use in the ORDER BY
. For example, we could even do this in Oracle:
SELECT DISTINCT a - b FROM t ORDER BY abs((a - b) + (a - b));
Or start using aliases:
SELECT DISTINCT a - b AS x FROM t ORDER BY abs(x + x);
PostgreSQL DISTINCT ON
PostgreSQL has a nice feature for when you want to order by something from within a group of non-distinct values. Remember how this wasn't possible?
SELECT DISTINCT length FROM film ORDER BY title
Well, this is:
SELECT DISTINCT ON (title) length FROM film ORDER BY title
And we're getting now:
length |
-------|
86 |
48 |
50 |
117 |
130 |
169 |
62 |
...
What we're essentially doing is, we take all distinct lengths, and for each group of identical lengths, we're taking the top title as a criteria to order by. In a way, this is syntax sugar for this:
SELECT length
FROM (
SELECT length, MIN(title) title
FROM film
GROUP BY length
) t
ORDER BY title
Which is what most people really want, when they ORDER BY
something they cannot really order by.
Conclusion
The SQL language is quirky. This is mostly because the syntactical order of operations doesn't match the logical order of operations. The syntax is meant to be human readable (remember Structured English Query Language?) but when reasoning about a SQL statement, we would often like to directly write down the logical order of operations.
In this article, we haven't even touched the implications of adding
Which add more fun rules to what's possible and what isn't. Our previous article on the true logical order of SQL operations explains this completely.
Published at DZone with permission of Lukas Eder. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments