What are transactional databases?
Transactional databases are optimized for running production systems — everything from websites to banks to retail stores. They excel at reading and writing individual rows of data very quickly while maintaining data integrity.
Transactional databases aren’t specifically built for analytics, but they often become de facto analytic environments because they’re already in place as production databases. Because they’ve been around for decades, they’re familiar, accessible, and ubiquitous.
If your organization does not have a pre-existing separate analytics stack, one of the quickest ways to start doing analytics is to create a replica of your transactional database. This ensures that analytic queries don’t accidentally impede business-critical production queries while requiring minimal additional setup. The downside is that these databases are designed for processing transactions, not analysis. Using them for analytics is a great place to start, but you may run into limitations and need workarounds sooner than you would on an analytics-specific setup.
Transactional databases are row-stores, which means that data is stored on disk as rows rather than as columns. That’s great, for example, when you need to know everything about one customer in the user table, since you can grab only the data you need. But it’s not so great when you’re trying to count the customers in a particular ZIP code, since you have to load not just the ZIP column, but every other column in each row as well.
What are transactional databases really great for?
Ensuring data integrity
Transactional databases are architected to be ACID compliant, which ensures that writes to the database either succeed or fail together, maintaining a high level of data integrity when writing data to the database. Transactional databases are therefore critical for business transactions where a high level of data integrity is necessary (the canonical example is banking, where you want a whole transaction, a debit from one account and a credit to another, to either succeed or fail as a unit).
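The all-or-nothing behavior can be sketched with SQLite via Python’s sqlite3 module (the accounts table, names, and balances below are hypothetical):

```python
import sqlite3

# Hypothetical accounts table; the CHECK constraint forbids negative balances.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst as a single transaction: both writes
    succeed together or neither is applied."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        return True
    except sqlite3.IntegrityError:  # CHECK constraint fires if src would go negative
        return False

transfer(conn, "alice", "bob", 30)   # succeeds: balances become 70 / 80
transfer(conn, "alice", "bob", 500)  # fails: debit rejected, nothing changes
```

Because the debit and credit run inside one transaction, a failed transfer leaves both balances untouched rather than half-applied.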
Speed
Because transactional databases are designed to run production systems, they’re very good at operations that must complete in milliseconds. If you’re doing analytics on a transactional replica of your production database, the replica will likely be nearly in sync with the master database (i.e., sub-second latency).
Monitoring operational systems
Working with data from transactional databases to provide a real-time operational snapshot is a perfect analytical use case for transactional databases because there’s so little latency introduced by the replica. If you’re trying to monitor support workloads or inventory or another operational system and need to make decisions based on data that’s as fresh as possible, replicating the production database may be the best option.
Popular transactional databases
Popular transactional databases include MySQL, PostgreSQL, Microsoft SQL Server, and Oracle Database.

How do transactional databases differ from analytical data warehouses?
Transactional databases are row-stores, which means that an entire row of data is stored together, with the assumption that when a user wants to read a record of data, they will want to pull all the data available about that record. Queries against a transactional database scan each row of data entirely, and then display only the columns selected by the database query.
Analytical data warehouses by contrast, are column-stores, which store each field independently. This makes analytical data warehouses optimized for reading data, but not writing data, because writing to an analytical database means making a lot of simultaneous writes across multiple columns.
However, this also means that analytical databases are generally more efficient and faster at performing aggregations. Performing an aggregation on a dataset from a transactional database means scanning every single row prior to making the aggregation. It’s more efficient to scan only the column that you want to aggregate, making analytical data warehouses better at more complex analyses.
For companies with small data volumes, this issue may not come into play, but as the amount of data available increases, these differences become more apparent.
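The difference can be illustrated with plain Python containers (a toy sketch, not a real storage engine; the table contents are made up):

```python
# Row store: each record is kept together, like a transactional database page.
rows = [
    {"user_id": 1, "name": "Ann",  "zip": "94103"},
    {"user_id": 2, "name": "Ben",  "zip": "10001"},
    {"user_id": 3, "name": "Cara", "zip": "94103"},
]

# Column store: each field is stored independently.
columns = {
    "user_id": [1, 2, 3],
    "name":    ["Ann", "Ben", "Cara"],
    "zip":     ["94103", "10001", "94103"],
}

# Row store: every full record must be touched just to read one field.
count_row_store = sum(1 for r in rows if r["zip"] == "94103")

# Column store: only the zip column is read; the other fields never load.
count_col_store = columns["zip"].count("94103")

assert count_row_store == count_col_store == 2
```

On three rows the cost is identical, but when the table is wide and has millions of rows, reading one column instead of every column is the difference the text describes.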
What does it mean to be ACID compliant?
ACID is a set of properties that describe how transactional databases are architected to preserve the integrity of writes to the database.
- Atomicity - If even one part of the transaction fails, the entire transaction fails. Every part of the transaction must succeed in order for it to be committed to the database.
- Consistency - A transaction is either written to the database (bringing the database from one valid state to another) or the transaction is reverted.
- Isolation - Transactions that haven’t completed yet cannot be acted on or modified by other transactions.
- Durability - Once a transaction is written to the database, it will stay there, even in the event of a database failure.
Below is an example that highlights the importance of ACID as a function of transactional database design. Say two people are competing for the same airline seat: Person A is booking one seat, and Person B is booking several seats, including that one, all at once.
- Atomicity - If Person A successfully completes their transaction, Person B’s transaction fails because a seat in their basket has already been claimed.
- Consistency - Person A and Person B cannot both have a ticket to the same seat.
- Isolation - Person B’s transaction will be cancelled if and only if Person A’s transaction completes successfully and is written to the database, and vice versa.
- Durability - If the database fails after Person A successfully claims their seat, Person A will still be able to redeem their purchase at the airport.
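The seat-booking scenario can be sketched with SQLite via Python’s sqlite3 module (the seat numbers and table layout are hypothetical):

```python
import sqlite3

# Hypothetical bookings table; the PRIMARY KEY means each seat can be claimed once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (seat TEXT PRIMARY KEY, passenger TEXT)")
conn.commit()

def book(conn, passenger, seats):
    """Book all requested seats atomically: if any seat is already taken,
    the whole booking fails and no seats are claimed."""
    try:
        with conn:  # one transaction for the whole basket
            conn.executemany("INSERT INTO bookings VALUES (?, ?)",
                             [(s, passenger) for s in seats])
        return True
    except sqlite3.IntegrityError:  # a double-booked seat rejects the transaction
        return False

book(conn, "Person A", ["12A"])                # succeeds
ok_b = book(conn, "Person B", ["12A", "12B"])  # fails: 12A already claimed
# Person B holds no seats at all: atomicity means all-or-nothing.
```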
Location within the data stack
When transactional databases are used for analytics, they are mostly configured as a replica of the production database. This configuration is widely documented and fairly accessible for most DBAs or engineers to set up.
Cloud database vendors offer easy ways to provision a replica. This setup is preferable to doing analytics directly on your production database, because analytic queries can consume significant resources and degrade the performance of your production database, which should be dedicated to writing data. In most cases, the replica should be at most a few seconds behind the production database.
If you’re using a replica of your production database for analytics, you usually won’t be able to write any data to that replica. This is because the replica’s job is to mirror the production database exactly, and a write performed on the replica would not be mirrored on the master database.
There are exceptions to this, as you may be able to create a writeable part of your replica database, but this is certainly a limitation worth taking into consideration and can make certain types of operations more complex.
The lowest-hanging fruit when optimizing your transactional database for analytics is to decrease the isolation level. Isolation level controls what kinds of operations “lock” a table, so lowering isolation will decrease replication lag and moderate the database’s use of locking conditions. Because you’re changing these settings only on the read-only replica, this should be safe to do.
The single greatest performance boost for your transactional database is to set up good indexes. Indexes are, as the name suggests, the database’s way of ordering specified columns so that it knows immediately where different rows live on disk. Indexing foreign keys for commonly joined tables and timestamps for time-ordered tables can greatly improve analytic query performance by reducing the number of rows scanned for JOINs.
The one drawback to adding indexes is that each additional index takes up space on disk, so a good indexing strategy is about balancing faster performance without running out of space.
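As a sketch of the effect, SQLite’s EXPLAIN QUERY PLAN shows a query switching from a full scan to an index search once an index exists (the events table and its columns are hypothetical):

```python
import sqlite3

# Hypothetical events table with 1,000 rows spread across 50 users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)")
conn.executemany("INSERT INTO events (user_id, created_at) VALUES (?, ?)",
                 [(i % 50, f"2024-01-{i % 28 + 1:02d}") for i in range(1000)])

def plan(sql):
    """Return SQLite's query plan description as a single string."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT COUNT(*) FROM events WHERE user_id = 7"
before = plan(query)  # full table scan of events

conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
after = plan(query)   # now searches events using idx_events_user_id
```

The same query goes from touching every row to jumping straight to the matching ones, which is exactly the reduction in scanned rows described above.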
Reducing the redundancy of the data is also a great way to increase read performance. Storing data in third normal form is one common pattern for achieving this. It can dramatically reduce the width of commonly read rows and move less-used parts of the data into their own tables, which can then be JOINed in only when appropriate.
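A minimal sketch of this pattern, assuming a hypothetical users table whose rarely-read bio column is split into its own table:

```python
import sqlite3

# Hypothetical schema: the hot users table stays narrow, and wide,
# rarely needed fields live in a separate table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
    CREATE TABLE user_profiles (user_id INTEGER REFERENCES users(id), bio TEXT);
""")
conn.execute("INSERT INTO users VALUES (1, 'Ann', '94103')")
conn.execute("INSERT INTO user_profiles VALUES (1, 'A very long biography ...')")

# Most analytic queries touch only the narrow table...
zips = conn.execute("SELECT zip FROM users").fetchall()

# ...and JOIN in the wide columns only when actually needed.
row = conn.execute("""
    SELECT u.name, p.bio FROM users u
    JOIN user_profiles p ON p.user_id = u.id
""").fetchone()
```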
Having a sparse data matrix can also severely impact performance. The most common datasets that have sparse matrices are things like medical records, server logs, and collections of unstructured data. Columnar databases are designed, at the core, to handle a sparse data matrix.
When using a row store, you will likely need one or more workarounds. One is to split the columns of your table across multiple tables, which can reduce the number of columns read per query. Setting up an entity-attribute-value schema may also help if you have an extremely sparse matrix, but this increases the complexity of analytical queries. A tool with a robust modeling layer such as Looker can support such a strategy, whereas writing all that SQL from scratch can become unwieldy.
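A minimal sketch of an entity-attribute-value layout, using hypothetical medical-record attributes, including the pivot query that makes analysis more complex:

```python
import sqlite3

# EAV layout: one row per (entity, attribute, value), so sparse attributes
# simply have no row instead of a mostly-NULL column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (patient_id INTEGER, attribute TEXT, value TEXT)")
conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", [
    (1, "blood_type", "O+"),
    (1, "allergy", "penicillin"),
    (2, "blood_type", "AB-"),
    # patient 2 has no recorded allergy: no row, no wasted storage
])

# Analytical queries must pivot attributes back into columns, which is
# where the extra query complexity comes from.
pivoted = conn.execute("""
    SELECT patient_id,
           MAX(CASE WHEN attribute = 'blood_type' THEN value END) AS blood_type,
           MAX(CASE WHEN attribute = 'allergy'    THEN value END) AS allergy
    FROM observations
    GROUP BY patient_id
    ORDER BY patient_id
""").fetchall()
# pivoted == [(1, 'O+', 'penicillin'), (2, 'AB-', None)]
```

Every new attribute someone wants to analyze means another CASE expression, which is why a modeling layer that generates this SQL helps.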
Another great way to optimize queries is to perform expensive calculations ahead of time. Using data transformations to pre-compute aggregates or DISTINCTs from the raw rows will greatly speed up queries. The main tradeoff is that rolling up your data reduces resolution, and the rollups need to be constantly maintained and updated as new data is added.
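A minimal sketch of this rollup pattern, using a hypothetical orders table summarized into a daily revenue table:

```python
import sqlite3

# Hypothetical raw orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, day TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (day, amount) VALUES (?, ?)", [
    ("2024-01-01", 10.0), ("2024-01-01", 15.0), ("2024-01-02", 7.5),
])

# Expensive aggregation performed once, ahead of query time; the tradeoff is
# that this table must be rebuilt (or incrementally updated) as orders arrive.
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT day, SUM(amount) AS revenue, COUNT(*) AS order_count
    FROM orders GROUP BY day
""")

# Dashboards now read the small rollup instead of scanning raw orders.
rollup = conn.execute("SELECT * FROM daily_revenue ORDER BY day").fetchall()
# rollup == [('2024-01-01', 25.0, 2), ('2024-01-02', 7.5, 1)]
```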