Monday, 11 May, 2026

How we 40x'd the performance of a class of temporal queries

James Henderson

Hey folks!

We’ve had a good week in XTDB-land: 2.2.0-beta1 landed, with some big news about XTDB’s architecture and what we’re looking to launch in future releases. I’m particularly excited about the external sources: opening XTDB up to being used as (e.g.) an event-sourcing data store, with temporal queries over your event streams. More to come on that soon!

In this post, though, I wanted to share news of a recent performance adventure we had with a customer, and how we ended up improving the performance of this particular class of temporal queries by 40-50x.

We’ve worked closely with this particular customer for well over a year now. Through XTDB consultancy, we have a great understanding of their data model, their query patterns and their performance requirements; they know that the XTDB core team is available to them for performance investigations just like this one.

If you too would like this kind of bespoke support, with optimisations targeted towards your use-case, please do get in touch via hello@xtdb.com - we’d love to hear from you!

Background reading

It’s likely worth (skim-) reading the “Building a Bitemporal Index” trilogy before this post, as it covers the core concepts of:

  1. the data shape we’re optimising for in this case;
  2. how we resolve the stored bitemporal events into query results;
  3. how we partition the data in storage for efficient scanning.

The data shape in this article is closest in shape to the “pricing feed data” example in part 1 of the trilogy:

  • relatively low cardinality: only in the thousands or so of distinct entities - we model each ticker as a single entity;
  • high update frequency: every new price received is an update to that entity;
  • each update being valid from now until the end-of-time (hereafter, ’∞’), replacing the previous version.
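As a minimal sketch (the table and ticker names here are illustrative, not from any customer schema), the write pattern is just repeated inserts - each new price supersedes the previous one in valid time:

-- Illustrative pricing-feed writes: each INSERT is valid from ‘now’
-- until ∞, superseding the previous version of the same _id.
INSERT INTO prices (_id, price) VALUES ('AAPL', 187.21);

-- ...the next tick arrives; the previous row’s validity is
-- automatically closed off at this point:
INSERT INTO prices (_id, price) VALUES ('AAPL', 187.34);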

Background: the XTDB primary index

The power of our primary index comes from its emergent behaviour in each of these different cases, even though it’s only a single strategy: for each of the data shapes analysed in part 1, the single set of rules we specify in part 3 yields a differently-shaped (and hence optimised) index.

  • For pricing-data specifically, the emergent behaviour of the storage is that we end up with a single, relatively small ‘current’ file, containing the most recent price for each ticker, and a wide ‘historical’ section - one partition for each week’s now-superseded updates. The current file is quick to read to serve the ‘as-of-now’ queries; the most recent historical files are likely to be hot in cache for the “give me the history for this one ticker over the last month” queries.
  • For trade lifecycle data, though, these same rules yield a smaller historical section containing the intermediate state of the trades, and a much larger ‘current’ section containing the final state of each of the trades. Again, the most recently completed trades are likely to be hot in cache; the intermediate states paged out unless they’re explicitly queried.

Call me biased, but this is why you should absolutely choose a temporal-aware database when you have any kind of time-oriented data. XTDB knows how time-oriented data actually behaves in practice, and optimises storage/queries for it accordingly. As you read the rest of this article, compare it to ‘just’ using an ordinary atemporal database, which stores your times as vanilla columns but otherwise has zero understanding of them.

Background: how we decide what to scan

For a database that stores all your history, this is one of our biggest questions: how can we quickly elide the majority of the database for the most common queries (‘as of now’), but still make the history available when you need it?

We store temporal metadata at each level - for each file, and for each page within each file:

  • min/max valid-from, valid-to, system-from. Notably, we don’t store system-to in the index - this is provided by the system-from of the next event.
  • ‘Recency’: defined as “the maximum timestamp T for which the event was believed valid for both valid-time = T and system-time = T”. Recency is discussed in more detail (with pictures!) in part 3 of the aforementioned trilogy.
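Reading that definition against half-open intervals gives a handy shorthand (our paraphrase here, not a formal definition from the trilogy):

recency(E) = min(valid_to(E), system_to(E))  -- as an exclusive upper bound

where system_to is implied by the system_from of the superseding event, or ∞ if the event hasn’t (yet) been superseded.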

Indeed, the files themselves are partitioned by recency - so this is a key metric in separating historical data from current data. For pricing data specifically, the events are superseded as soon as a new price arrives - so the recency of a given update will be the time of the next update for that ticker (or ∞ for the current price, albeit temporarily).

When we scan, we then check the temporal metadata of the file/page against the temporal bounds of the query, as well as any content filters that may be present. Rows that appear in the results must necessarily match both the temporal and the content predicates of the query.
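In pseudo-SQL terms - over a hypothetical file_metadata table, which XTDB doesn’t actually expose - the file-level check for a query over valid-time [q_from, q_to) is essentially an interval-overlap test:

-- Hypothetical sketch: file_metadata, q_from and q_to are illustrative.
-- A file can contribute result rows only if its valid-time envelope
-- overlaps the query’s bounds:
SELECT file_id
FROM file_metadata
WHERE min_valid_from < q_to   -- the file starts before the window ends...
  AND max_valid_to > q_from   -- ...and ends after the window starts

(The real check also considers system-time and recency, but the shape is the same.)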

However, just because a row doesn’t match the temporal and content predicates doesn’t mean it can’t affect the rows that do appear in the results: a later event for a given entity might no longer match the content predicate, but it can still supersede an earlier event that does.

Rows that don’t match the temporal bounds can still affect the values of the temporal columns in the results (e.g. _valid_from, _valid_to):

  • Let’s say we have an assertion inserted in October, where the price of a product is £40. The recency of this event (call it “E1”) is ∞, so it stays in the current file.
  • Then let’s say we have a Christmas promotion: this product’s price is reduced to £30 through December. The recency of this event (“E2”) is December 31st - recall “the maximum timestamp T for which the event was believed valid for both valid-time = T and system-time = T” - so it goes into a historical file.
  • We don’t update E1 when E2 comes in, to reduce both write and storage amplification - instead, we rely on metadata to avoid scanning E2 when we don’t need to.
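For concreteness, that scenario would be written roughly as follows (XTDB accepts explicit _valid_from/_valid_to columns on insert; the table name and dates are from the example above):

-- E1: inserted in October; valid from then until ∞
INSERT INTO products (_id, _valid_from, price)
VALUES ('p1', DATE '2025-10-01', 40);

-- E2: the Christmas promotion; valid for December only, so its recency
-- is end-of-December and it lands in a historical file
INSERT INTO products (_id, _valid_from, _valid_to, price)
VALUES ('p1', DATE '2025-12-01', DATE '2026-01-01', 30);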

E1 (£40, valid from October to ∞) and E2 (£30, valid December to January) as bitemporal rectangles, with the dashed blue line tracing the result of a FOR ALL VALID_TIME query — the full valid-time history this resolves to.

If we were to query the history of the price of this product (FROM products FOR ALL VALID_TIME SELECT _id, _valid_from, _valid_to, price - the dashed blue line in the figure above), we’d expect to see:

_id | _valid_from | _valid_to  | price
----|-------------|------------|------
p1  | 2025-10-01  | 2025-12-01 | £40
p1  | 2025-12-01  | 2026-01-01 | £30
p1  | 2026-01-01  |            | £40

Note that E2 in this case causes E1 to appear in the results twice - once for the period before the promotion, and once for the period after - even though it only appears once in storage.

Even if you queried for as-of-now (FROM products SELECT _id, _valid_from, _valid_to, price), E2 still has its effect - without E2, _valid_from would be 2025-10-01:

_id | _valid_from | _valid_to | price
----|-------------|-----------|------
p1  | 2026-01-01  |           | £40

The impact of this is that when we decide what to scan, we have to additionally consider:

  1. Files that match the temporal bounds of the query (for inclusion in the results).
  2. Files that overlap with the temporal bounds of the files included in group 1 (for their impact on the rows included in the results).

The performance issue

The customer in question has a specific table that’s included in nearly all of their Metabase dashboards. It might look like a static reference-data table - about 30k entities - but their data feeds deliver the latest state of each of those entities every 5 minutes, which they pour into XTDB (which now, thankfully, filters out unchanged versions for them). In practice, then, this particular table behaves more like the pricing-feed example than static reference data.

This table was getting slower and slower for as-of-now queries, as well as for the UPDATEs that were populating it. Argh - why?! It should have had a relatively small (30k-row) current file, which should have been the only one we needed to read? 🤔

Eventually we figured it out: some entities in the current file had stopped being updated, so the current rows for those entities had relatively early valid-froms compared to the rest of the rows in the file. The min valid-from summary of the whole file therefore overlapped with a lot more history than we’d expected, and this history was being pulled in for every as-of-now query.

A current file whose min valid-from is dragged back by a handful of long-stale entities, causing the file's temporal envelope to overlap a wide swathe of historical files.

D’oh.

Time to get the whiteboard out.

Fix 1: re-categorise the files

We noticed that there are in fact three categories of files that can be relevant to a given query - and that, under certain not-uncommon circumstances, we can safely elide the third:

  1. ‘emit’: files that match the temporal bounds and content predicates of the query, for inclusion in the results.
  2. ‘supersede’: files that match the temporal bounds but not the content predicates - these can filter and/or duplicate rows in the ‘emit’ files.
  3. ‘constrain’: files that don’t match either, but overlap with the temporal bounds of the ‘emit’ files, and so can affect the values in the temporal columns of the results.

The difference between ‘supersede’ and ‘constrain’ is that ‘constrain’ cannot affect the multiplicity of the emitted rows, only the values within them.

Observation 1, then (a simple one in hindsight):

If the query doesn’t project any of the temporal columns, we can safely ignore the files that can only contain ‘constrain’ rows.
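As an illustrative (anonymised) example, a typical dashboard query here is as-of-now and projects no temporal columns at all, so any file that could only contribute ‘constrain’ rows never needs to be read:

-- No _valid_from/_valid_to in the projection, and no explicit temporal
-- bounds - an ‘as of now’ read, eligible for the Observation 1 fast path:
FROM products
SELECT _id, price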

The client in question very helpfully supplied us with a file/metadata listing of the table, so that we could evaluate the optimisation before we implemented it. We found that, out of 291 files, for as-of-now queries without any temporal-column projection, 2 were ‘emit’ and 35 were ‘constrain’ - no ‘supersede’s in this case, mostly because there were no content predicates in the queries. This optimisation therefore allowed us to drop from 37 files to 2 (the small current file and the most recent historical file) for a significant proportion of their dashboard queries.

The as-of-now aggregates saw the biggest wins - up to 50x faster in some cases, with queries that previously took a few seconds now coming back in tens-to-hundreds of milliseconds. Queries that were already in that range saw smaller but still non-trivial improvements of 3-5x.

An anonymised slice of the queries from one of their dashboards:

Query   | Before | After | Speedup
--------|--------|-------|--------
Query A | 350ms  | 90ms  | 3.9×
Query B | 1500ms | 53ms  | 28.3×
Query C | 266ms  | 54ms  | 4.9×
Query D | 4800ms | 102ms | 47.1×
Query E | 5100ms | 116ms | 44.0×
Query F | 5600ms | 133ms | 42.1×
Query G | 5500ms | 106ms | 51.9×
Query H | 4800ms | 113ms | 42.5×

This client has been happy to deploy XTDB nightly releases through their QA environments all the way into production for a good few months now - we’re fortunate to work very closely with them. Safe to say, they were keen to get this one out!

Fix 2: UPDATE

Fix 1 was great for queries that don’t project temporal columns. Behind every UPDATE, though, is a query that still does.

Let’s say we have a statement like

UPDATE prices
  -- optional, defaults to 'FROM NOW TO ∞'
  [FOR PORTION OF VALID_TIME FROM x TO y]
SET price = 45
WHERE _id = 'p1'

In the general case, this needs to read the existing state of the entity between x and y, so we run a query along the following lines:

FROM prices FOR VALID_TIME FROM x TO y
WHERE _id = 'p1'
SELECT _id, 
  GREATEST(NOW, _valid_from) AS _valid_from, 
  LEAST(_valid_to, ∞) AS _valid_to, -- '∞' not valid SQL!
  45 AS price
  -- ...and also any other non-updated columns

Given this projects the temporal columns, the ‘constrain’ files are relevant to this query, so we couldn’t elide them. The query here therefore returns (pseudo-code):

_id | _valid_from | _valid_to | price
----|-------------|-----------|------
p1  | NOW         |           | £45

This then becomes the row that we insert.

The thing is, while we might be querying for _valid_from and _valid_to, triggering the inclusion of the ‘constrain’ files, we don’t really care about the exact values there - so long as they’re between x and y (NOW and ∞, in this case).

Observation 2, then (similarly simple in hindsight):

If the temporal columns are going to be clamped to a fixed range anyway, the ‘constrain’ files can’t affect the result - all we need is values within that range.

We therefore introduced a new query flag for the UPDATE query to use - FOR VALID_TIME ONLY:

FROM prices FOR VALID_TIME ONLY FROM x TO y
WHERE _id = 'p1'
SELECT _id, _valid_from, _valid_to, 45 AS price
  -- ...and also any other non-updated columns

This flag clamps the valid time columns to the given period, letting the query engine know that it needn’t look outside the given range.

The result?

UPDATE statements have seen the same performance improvements as the queries above.

Success!

Closing thoughts

Both of these fixes will land in 2.2.0-beta2 - or pick them up sooner on the nightly Docker tag.

“Normal” databases can’t make these optimisations. They don’t know the columns have temporal meaning; they don’t know the semantics of UPDATE ... FOR PORTION OF VALID_TIME. By and large, they store your timestamps as vanilla columns and hope that traditional indexing patterns and query planning still work. They don’t have the bitemporal event storage model that allows us to be intelligent about what we read for a given query.

If you’re always fighting with time-oriented requirements, and you want to try a database that has time at its core, give us a shout: hello@xtdb.com, or come join us on our Discord.

Until next time!

James
