Skip to content
Logo of XTDB

Thursday, 12 December, 2024

The missing SQL sub-queries

James Henderson

Those of you who’ve used SQL for a while will no doubt be familiar with sub-queries.

To begin with, SQL didn’t have sub-queries - they were 'only' added in SQL-92. Before that, to execute certain types of queries, you either (so I’m told!) needed to make more use of temporary tables, or wrangle more complicated joins. Indeed, before they introduced sub-queries, the use-case of filtering the results after the group-by was already common enough that they introduced HAVING (SQL-86/SQL-1) - exactly the same relational operator as WHERE but just executed at a different time.

SQL sub-queries can be introduced in three main places:

  1. As part of FROM/JOIN - these are known in the spec as 'table sub-queries'.

    -- this query does a frequency distribution
    
    SELECT order_count, COUNT(*) AS frequency
    FROM (SELECT COUNT(*) AS order_count
          FROM orders
          GROUP BY customer_id) order_counts
    GROUP BY order_count;
  2. As a 'semi-join' in various predicates:

    SELECT order_id
    FROM orders
    WHERE customer_id IN (
      SELECT _id FROM customers c WHERE c.status = 'gold')
    ORDER BY order_date DESC
    LIMIT 10;
  3. As a single value in an expression (either a predicate or a projection) - 'scalar sub-queries':

    SELECT c._id AS customer_id, c.name,
           (
             SELECT o.order_total
             FROM orders o
             WHERE o.customer_id = c._id
             ORDER BY o.order_date DESC
             LIMIT 1
           ) AS latest_order_total
    FROM customers c;

    You can also return an array of results here, so long as the inner query returns exactly one column:

    SELECT c._id AS customer_id, c.name,
           ARRAY(
             SELECT o.order_total
             FROM orders o
             WHERE o.customer_id = c._id
             ORDER BY o.order_total DESC
             LIMIT 3
           ) AS top_orders
    FROM customers c;

We’ll specifically focus on the latter in this article.

Completing the picture:

The more eagle-eyed readers will notice that we’re missing a couple of constructs:

Rows \ Columns 1 0..N

0..1

Scalar sub-query

??

0..N

ARRAY sub-query

??

In my pre-XTDB life, I certainly spent a lot of development time (and maybe you have too?) getting data out of a flat database and re-shaping it into a tree to be rendered in a UI - it’s possible in SQL, but notoriously unwieldy. In my case, it was for a social newsfeed:

  • a page that showed a number of different posts. For each post:

    • the first paragraph

    • details of the author

    • a handful of people who’d 'liked' the post

      • their names and avatars

    • the top few comments on those posts. For each comment:

      • the comment itself,

      • the name/avatar of the respondent.

      • a count and sample of people who’ve 'liked' the comment

        • their names

On an individual’s profile, the story is reversed:

  • comments they’ve written (or posts they’ve liked, etc)

    • the comment itself

    • a small preview of the post

      • the post author’s name, avatar, etc.

The UI particularly appreciates a tree-like data structure here, so that the various components in the UI can navigate through to the data they need to render. For the former query, we settled on something like the following:

{
  "feed": [
    {
      "post_id": ...,
      "summary": "...",
      "author": {
        "user_id": ...,
        "name": "James",
        "avatar": "https://..."
      },

      "likes": 1424,
      "likers": ["Bob", "Dave", "Sally"],

      "comments": 32,
      "top_comments": [
        {
          "comment_id": ...,
          "comment": ...,
          "author": {
            "user_id": ...,
            "name": "Dave",
            "avatar": "https://..."
          }
          "likes": 14,
          ... etc ...
        }
      ]
    }
  ]
}

Your first thought (quite reasonably) may well be to reach for constructs like ARRAY_AGG/JSONB_AGG etc. However, in SQL, when you have multiple joins from a top-level entity (let’s say post → comments and post → likers), naive joins will get you an inadvertent cross product of the child tables:

-- here, you'll get `comment-count x like-count`
-- elements in each of the arrays
SELECT p.post,
       ARRAY_AGG(c.comment),
       ARRAY_AGG(l.liker_name)
FROM posts p
  LEFT JOIN comments c USING (post_id)
  LEFT JOIN likes l USING (post_id)
WHERE p._id = ?
GROUP BY p.post

You may also be thinking 'GraphQL!' at this point (and you’d be right). Even with GraphQL, though, you need to be careful to avoid falling into the well-known "n + 1" query problem, where it’s all-too-easy to end up making one query for the top of the tree, then N queries for the second level, N² queries for the third, etc.

There is a better way: let’s take advantage of sub-queries:

SELECT p.post,
       ARRAY(SELECT ... FROM comments c WHERE c.post_id = p._id)
         AS comments,
       ARRAY(SELECT ... FROM likes l WHERE l.post_id = p._id)
         AS likes,
       (SELECT ... FROM users WHERE u._id = p.author_id)
         AS author
FROM posts p
WHERE _id = ?

Ah, but: both ARRAY sub-queries and scalar sub-queries only allow exactly one column in the sub-query - d’oh. In Postgres, you can use JSONB_BUILD_OBJECT to create JSON objects - but again, this gets quite cumbersome when you have a deeply nested tree.

Introducing NEST_ONE and NEST_MANY

Luckily, XTDB has good support for composite data structures - maps/structs/objects and arrays/vectors/lists (whatever your language calls them!).

NEST_ONE and NEST_MANY allow 0..N columns in the sub-query, and will wrap those columns up into a nested map for you - the former for when you have a one-to-one join, where it’ll inline the map directly; the latter for one-to-many, where you’ll get a list of maps.

SELECT p.post,
       NEST_MANY(SELECT c.comment,
                        -- turtles all the way down
                        NEST_ONE(SELECT ... FROM users u
                                 WHERE u._id = c.user_id)
                          AS comment_author
                 FROM comments c
                 WHERE c.post_id = p._id)
         AS comments,
       NEST_MANY(SELECT ... FROM likes l WHERE l.post_id = p._id)
         AS likes,
       NEST_ONE(SELECT ... FROM users WHERE u._id = p.author_id)
         AS author
FROM posts p
WHERE _id = ?

Note that we’ve kept the majority of the above query - just specifying NEST_ONE and NEST_MANY as required. This query will then yield tree-shaped results similar to the JSON response above.

Indeed, in XTDB, you’re free to re-order the FROM and SELECT clauses, so that the query better reflects both the shape of the output tree, as well as the underlying order of operations:

FROM posts p
WHERE _id = ?
SELECT p.post,
       NEST_MANY(FROM comments c
                 WHERE c.post_id = p._id
                 SELECT c.comment,
                        -- turtles all the way down
                        NEST_ONE(FROM users u
                                 WHERE u._id = c.user_id
                                 SELECT ...)
                          AS comment_author)

         AS comments,
       NEST_MANY(FROM likes l WHERE l.post_id = p._id
                 SELECT ...)
         AS likes,
       NEST_ONE(FROM users WHERE u._id = p.author_id
                SELECT ...)
         AS author

So, to complete the picture:

Rows \ Columns 1 0..N

0..1

Scalar sub-query

NEST_ONE

0..N

ARRAY sub-query

NEST_MANY

To give this a try, you can either:

As always, let us know what you think in the comments, or at hello@xtdb.com!

James