Thursday, 12 December, 2024
The missing SQL sub-queries
Those of you who’ve used SQL for a while will no doubt be familiar with sub-queries.
To begin with, SQL didn’t have sub-queries - they were 'only' added in SQL-92.
Before that, to execute certain types of queries, you either (so I’m told!) needed to make more use of temporary tables, or wrangle more complicated joins.
Indeed, before they introduced sub-queries, the use-case of filtering the results after the group-by was already common enough that they introduced HAVING
(SQL-86/SQL-1) - exactly the same relational operator as WHERE
but just executed at a different time.
SQL sub-queries can be introduced in three main places:
-
As part of
FROM
/JOIN
- these are known in the spec as 'table sub-queries'.-- this query does a frequency distribution SELECT order_count, COUNT(*) AS frequency FROM (SELECT COUNT(*) AS order_count FROM orders GROUP BY customer_id) order_counts GROUP BY order_count;
-
As a 'semi-join' in various predicates:
SELECT order_id FROM orders WHERE customer_id IN ( SELECT _id FROM customers c WHERE c.status = 'gold') ORDER BY order_date DESC LIMIT 10;
-
As a single value in an expression (either a predicate or a projection) - 'scalar sub-queries':
SELECT c._id AS customer_id, c.name, ( SELECT o.order_total FROM orders o WHERE o.customer_id = c._id ORDER BY o.order_date DESC LIMIT 1 ) AS latest_order_total FROM customers c;
You can also return an array of results here, so long as the inner query returns exactly one column:
SELECT c._id AS customer_id, c.name, ARRAY( SELECT o.order_total FROM orders o WHERE o.customer_id = c._id ORDER BY o.order_total DESC LIMIT 3 ) AS top_orders FROM customers c;
We’ll specifically focus on the latter in this article.
Completing the picture:
The more eagle-eyed readers will notice that we’re missing a couple of constructs:
Rows \ Columns | 1 | 0..N |
---|---|---|
0..1 |
Scalar sub-query |
?? |
0..N |
|
?? |
In my pre-XTDB life, I certainly spent a lot of development time (and maybe you have too?) getting data out of a flat database and re-shaping it into a tree to be rendered in a UI - it’s possible in SQL, but notoriously unwieldy. In my case, it was for a social newsfeed:
-
a page that showed a number of different posts. For each post:
-
the first paragraph
-
details of the author
-
a handful of people who’d 'liked' the post
-
their names and avatars
-
-
the top few comments on those posts. For each comment:
-
the comment itself,
-
the name/avatar of the respondent.
-
a count and sample of people who’ve 'liked' the comment
-
their names
-
-
-
On an individual’s profile, the story is reversed:
-
comments they’ve written (or posts they’ve liked, etc)
-
the comment itself
-
a small preview of the post
-
the post author’s name, avatar, etc.
-
-
The UI particularly appreciates a tree-like data structure here, so that the various components in the UI can navigate through to the data they need to render. For the former query, we settled on something like the following:
{
"feed": [
{
"post_id": ...,
"summary": "...",
"author": {
"user_id": ...,
"name": "James",
"avatar": "https://..."
},
"likes": 1424,
"likers": ["Bob", "Dave", "Sally"],
"comments": 32,
"top_comments": [
{
"comment_id": ...,
"comment": ...,
"author": {
"user_id": ...,
"name": "Dave",
"avatar": "https://..."
}
"likes": 14,
... etc ...
}
]
}
]
}
Your first thought (quite reasonably) may well be to reach for constructs like ARRAY_AGG
/JSONB_AGG
etc.
However, in SQL, when you have multiple joins from a top-level entity (let’s say post → comments and post → likers), naive joins will get you an inadvertent cross product of the child tables:
-- here, you'll get `comment-count x like-count`
-- elements in each of the arrays
SELECT p.post,
ARRAY_AGG(c.comment),
ARRAY_AGG(l.liker_name)
FROM posts p
LEFT JOIN comments c USING (post_id)
LEFT JOIN likes l USING (post_id)
WHERE p._id = ?
GROUP BY p.post
You may also be thinking 'GraphQL!' at this point (and you’d be right). Even with GraphQL, though, you need to be careful to avoid falling into the well-known "n + 1" query problem, where it’s all-too-easy to end up making one query for the top of the tree, then N queries for the second level, N² queries for the third, etc.
There is a better way: let’s take advantage of sub-queries:
SELECT p.post,
ARRAY(SELECT ... FROM comments c WHERE c.post_id = p._id)
AS comments,
ARRAY(SELECT ... FROM likes l WHERE l.post_id = p._id)
AS likes,
(SELECT ... FROM users WHERE u._id = p.author_id)
AS author
FROM posts p
WHERE _id = ?
Ah, but: both ARRAY
sub-queries and scalar sub-queries only allow exactly one column in the sub-query - d’oh.
In Postgres, you can use JSONB_BUILD_OBJECT
to create JSON objects - but again, this gets quite cumbersome when you have a deeply nested tree.
Introducing NEST_ONE
and NEST_MANY
Luckily, XTDB has good support for composite data structures - maps/structs/objects and arrays/vectors/lists (whatever your language calls them!).
NEST_ONE
and NEST_MANY
allow 0..N columns in the sub-query, and will wrap those columns up into a nested map for you - the former for when you have a one-to-one join, where it’ll inline the map directly; the latter for one-to-many, where you’ll get a list of maps.
SELECT p.post,
NEST_MANY(SELECT c.comment,
-- turtles all the way down
NEST_ONE(SELECT ... FROM users u
WHERE u._id = c.user_id)
AS comment_author
FROM comments c
WHERE c.post_id = p._id)
AS comments,
NEST_MANY(SELECT ... FROM likes l WHERE l.post_id = p._id)
AS likes,
NEST_ONE(SELECT ... FROM users WHERE u._id = p.author_id)
AS author
FROM posts p
WHERE _id = ?
Note that we’ve kept the majority of the above query - just specifying NEST_ONE
and NEST_MANY
as required.
This query will then yield tree-shaped results similar to the JSON response above.
Indeed, in XTDB, you’re free to re-order the FROM
and SELECT
clauses, so that the query better reflects both the shape of the output tree, as well as the underlying order of operations:
FROM posts p
WHERE _id = ?
SELECT p.post,
NEST_MANY(FROM comments c
WHERE c.post_id = p._id
SELECT c.comment,
-- turtles all the way down
NEST_ONE(FROM users u
WHERE u._id = c.user_id
SELECT ...)
AS comment_author)
AS comments,
NEST_MANY(FROM likes l WHERE l.post_id = p._id
SELECT ...)
AS likes,
NEST_ONE(FROM users WHERE u._id = p.author_id
SELECT ...)
AS author
So, to complete the picture:
Rows \ Columns | 1 | 0..N |
---|---|---|
0..1 |
Scalar sub-query |
|
0..N |
|
|
To give this a try, you can either:
-
Visit https://play.xtdb.com
-
Use a GitHub Codespace (60h free personal usage / month)
-
Run XTDB locally:
-
docker run --name xtdb -p 5432:5432 ghcr.io/xtdb/xtdb:latest
-
psql -h localhost
-
As always, let us know what you think in the comments, or at hello@xtdb.com!
James