Monday, 30 September, 2024
The Challenge of Data Compliance
By Eric Newcomer, CTO and Principal Analyst, Intellyx (guest author)
Regulations and laws governing the collection and use of personal data are getting stricter every year. Penalties for infractions are growing larger and are being imposed more often.
Keeping track of all of the various rules, regulations, and laws is a big part of the challenge, especially for a multinational or global organization. Regulations not only vary by country, but in the United States, they vary by state as well.
A more significant aspect of compliance is how the laws are enforced, and how organizations respond to issues, incidents, and breaches.
Internal audit teams are frequently tasked with ensuring the right controls are in place so that organizations avoid compliance and security issues. Regulators also frequently perform audits and checks to ensure organizations are implementing the correct safeguards.
And of course if an incident occurs, as it often does, such as a public breach of customer data, the consequences for an organization can be severe in terms of financial penalties and reputational impact.
But there’s another big organizational impact, which may be even more significant in terms of ongoing cost: resolving disputes. When challenged or taken to court, an organization is required to perform its own assessment of the issue and either agree with the complainant or defend itself. Such an internal process can be very expensive to staff and operate.
In compliance investigations, determining whether or not a violation occurred typically revolves around the value of a given set of data items and determining when or how they may have changed. Identifying when and how an error occurred can be difficult, however, because a record of all previous values for the items is typically not maintained.
The IT System Challenge
Mainstream database technology replaces prior values during an update operation, completely destroying the old values.
IT systems are designed to reflect the current state of the business, not a prior state. Transactions update the database in place, erasing prior values when storing new ones.
This is because when databases were invented, disk storage was very expensive, as were computers in general. Keeping costs down was important to making computers affordable. Disk space was reclaimed and reused wherever possible to keep down costs. Prior (i.e. not current) copies of data typically were not automatically preserved because that would consume expensive disk space. Queries typically returned only the most recent updates.
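The update-in-place behavior described above can be seen in a minimal sketch using Python's built-in sqlite3 module. The table name `account` and the column `credit_limit` are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database; the schema is a made-up example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, credit_limit INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 5000)")

# The update stores the new value in place of the old one.
conn.execute("UPDATE account SET credit_limit = 7500 WHERE id = 1")

# A query now returns only the current state; 5000 is no longer recoverable.
rows = conn.execute("SELECT credit_limit FROM account WHERE id = 1").fetchall()
print(rows)  # [(7500,)]
```

Once the update has run, nothing in the table records that the limit was ever 5000, or when it changed.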
The best practice advice at the time for tracking changes was to maintain a separate log of historical changes, then go back through it to see what the values were on a given date. But you had to figure out how to do it yourself.
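A do-it-yourself history log of this kind might look something like the following sketch, again using Python's sqlite3 module. The tables `account` and `account_history` and the helpers `set_limit` and `limit_as_of` are hypothetical names, and timestamps are supplied by the caller as ISO date strings so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, credit_limit INTEGER);
CREATE TABLE account_history (
    account_id INTEGER, new_limit INTEGER, changed_at TEXT
);
""")

def set_limit(conn, account_id, new_limit, changed_at):
    """Update the current value and append a history row in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO account_history (account_id, new_limit, changed_at) "
            "VALUES (?, ?, ?)",
            (account_id, new_limit, changed_at),
        )
        conn.execute(
            "INSERT OR REPLACE INTO account (id, credit_limit) VALUES (?, ?)",
            (account_id, new_limit),
        )

def limit_as_of(conn, account_id, as_of):
    """Return the most recent logged value on or before as_of, or None."""
    row = conn.execute(
        "SELECT new_limit FROM account_history "
        "WHERE account_id = ? AND changed_at <= ? "
        "ORDER BY changed_at DESC LIMIT 1",
        (account_id, as_of),
    ).fetchone()
    return row[0] if row else None

set_limit(conn, 1, 5000, "2024-01-10")
set_limit(conn, 1, 7500, "2024-06-01")

print(limit_as_of(conn, 1, "2024-03-15"))  # 5000
print(limit_as_of(conn, 1, "2024-06-01"))  # 7500
print(limit_as_of(conn, 1, "2023-12-31"))  # None
```

Every application that writes to the table has to go through `set_limit` for the log to stay complete, which is part of why this approach is hard to maintain by hand.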
Traditional Relational Databases
Traditional relational databases such as Oracle, DB2, and SQL Server are very popular for processing business transactions. Although storage prices have plummeted, the fundamental design of relational databases hasn’t changed.
It would seem like using date and time operations would be a simple thing for a relational database, but in fact calculating time can involve many variables that introduce complexity.
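One of those variables is the time zone. As a small illustration, the Python sketch below takes a single update instant and asks which calendar date it falls on. The instant and the fixed UTC offsets are assumptions chosen for the example; production code would normally use a time zone database rather than fixed offsets.

```python
from datetime import datetime, timedelta, timezone

# One update, recorded at a single instant in UTC (made-up example).
update_instant = datetime(2024, 9, 30, 23, 30, tzinfo=timezone.utc)

# Fixed offsets used for illustration only.
tokyo = timezone(timedelta(hours=9))      # UTC+9
new_york = timezone(timedelta(hours=-4))  # UTC-4 (daylight time)

date_utc = update_instant.date().isoformat()
date_tokyo = update_instant.astimezone(tokyo).date().isoformat()
date_new_york = update_instant.astimezone(new_york).date().isoformat()

print(date_utc)       # 2024-09-30
print(date_tokyo)     # 2024-10-01
print(date_new_york)  # 2024-09-30
```

The same update happened on 30 September in one office and on 1 October in another, so a question like "what was the value on 30 September?" needs a time zone before it has a single answer.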
Because of the way databases are designed to work efficiently with disk, and transaction processing systems are designed to reflect the current state of the business, tracing the evolution of data values over time isn’t easy to program.
Scanning through a big log of all changes to extract what you’re looking for isn’t easy, nor is keeping a copy of all updates in the database (or another database) and writing complicated SQL queries to find what you are looking for – which may be different each time you look for it.
Other Types of Databases
Other types of databases implement different approaches to handling the challenges of tracking changes and reporting on updates to data over time.
Some databases implement multiversion concurrency control (MVCC), which does not immediately overwrite data. Instead, MVCC databases create a new version of the data item.
As the name implies, this is done to improve concurrency control (MVCC avoids locking the data item during the read/update cycle) rather than as a way to preserve updates, so the new versions are only temporarily retained.
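The idea can be sketched with a toy in-memory store in Python. This is a simplified illustration of the versioning concept, not how any particular database implements MVCC; the class `MVCCStore` and its methods are hypothetical.

```python
class MVCCStore:
    """Toy multiversion store: writes append versions, reads use a snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # logical commit counter

    def write(self, key, value):
        # A write never overwrites; it appends a new version.
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        # A reader remembers the commit counter at the time it started.
        return self._clock

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def vacuum(self, oldest_active_snapshot):
        # Discard versions that no active reader can still see.
        for key in list(self._versions):
            versions = self._versions[key]
            keep_from = 0
            for i, (ts, _) in enumerate(versions):
                if ts <= oldest_active_snapshot:
                    keep_from = i
            self._versions[key] = versions[keep_from:]


store = MVCCStore()
store.write("limit", 5000)
old_snapshot = store.snapshot()
store.write("limit", 7500)

print(store.read("limit", old_snapshot))      # 5000
print(store.read("limit", store.snapshot()))  # 7500

store.vacuum(store.snapshot())                # no older readers remain
print(store.read("limit", old_snapshot))      # None
```

The older reader sees 5000 without blocking the writer, which is the concurrency benefit. After the cleanup step the old version is gone, which is why MVCC on its own does not provide a durable history.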
RDF databases typically use an overlay technique to write new tuples to the database without deleting old ones, but RDF databases are primarily used for specialized applications that require semantic analysis and graphs. They are rarely used in a system of record for capturing transactional updates to business state.
Time series databases are designed to capture data that changes rapidly, such as stock price quotes or manufacturing shop floor telemetry. But time series databases typically support a very simple data model, designed to support real time analysis of data changing over time.
In addition, some data warehouse technologies such as Snowflake and Apache Iceberg support a form of “time travel” to look back in time at previous data values. But like MVCC databases, they typically limit the timeframe over which you can look back, and require you to code the SQL yourself.
There is, however, a type of database engineered to solve this temporal problem of reviewing data values over time. It’s called a “bi-temporal” database, and it implements some of the temporal functionality added to the SQL standard (starting with SQL:2011) that allows the database to maintain historical versions of data and avoid deleting data when performing updates.
Instead, the bi-temporal database stores all versions by date, and allows you to set the date to any value and see what the data was as of that date.
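To make the idea concrete, here is a minimal Python sketch of a bi-temporal lookup. It is a simplified illustration under stated assumptions, not the design of any specific product: the names `Version`, `BitemporalStore`, `put`, and `as_of` are hypothetical, dates are ISO strings, and each version carries two dates, one for when the value became true in the real world and one for when the database recorded it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    key: str
    value: int
    valid_from: str   # when the value took effect in the real world
    recorded_at: str  # when the database learned about it


class BitemporalStore:
    """Toy append-only store: nothing is overwritten or deleted."""

    def __init__(self):
        self._versions = []

    def put(self, key, value, valid_from, recorded_at):
        self._versions.append(Version(key, value, valid_from, recorded_at))

    def as_of(self, key, valid_at, known_at):
        # Consider only versions already recorded by known_at
        # and already in effect by valid_at.
        candidates = [
            v for v in self._versions
            if v.key == key
            and v.valid_from <= valid_at
            and v.recorded_at <= known_at
        ]
        if not candidates:
            return None
        # Latest effective date wins; ties go to the latest recording,
        # so a correction supersedes the value it corrects.
        best = max(candidates, key=lambda v: (v.valid_from, v.recorded_at))
        return best.value


store = BitemporalStore()
store.put("limit", 5000, valid_from="2024-01-01", recorded_at="2024-01-01")
store.put("limit", 7500, valid_from="2024-06-01", recorded_at="2024-06-01")
# A correction entered in July: the June value should have been 7000.
store.put("limit", 7000, valid_from="2024-06-01", recorded_at="2024-07-15")

print(store.as_of("limit", "2024-06-15", known_at="2024-07-01"))  # 7500
print(store.as_of("limit", "2024-06-15", known_at="2024-08-01"))  # 7000
print(store.as_of("limit", "2024-03-01", known_at="2024-08-01"))  # 5000
```

The first two queries ask about the same real-world date but give different answers, because they differ in what the database had been told at the time. That distinction between what was true and what was recorded is what lets an investigator see both the incorrect value and the moment it was corrected.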
The Intellyx Take
Most regulatory, data privacy, and risk compliance regimes demand accuracy in the data they govern and accountability for errors in it. Meeting them often requires a historical view of that data to determine when and how it changed.
Internal auditors and external regulators alike frequently need to review such historical values when identifying and resolving anomalies and settling disputes.
When did an incorrect value occur? What was the value before it was incorrectly updated? How do we know for certain what the correct value is? These questions become especially pressing when a dispute or challenge arises, such as a consumer asking to resolve an error on a credit report, or a data privacy violation.
Resolving discrepancies in data governed by regulations and critical to risk calculations can take a lot of work. This would seem an obvious area for improved automation, such as what a bi-temporal database provides.
Copyright © Intellyx B.V. Intellyx is editorially responsible for this document. No AI bots were used to write this content. At the time of writing, JUXT is an Intellyx client. Image is public domain.