What is an Analytical MPP database?
Analytical Massively Parallel Processing (MPP) Databases are databases that are optimized for analytical workloads: aggregating and processing large datasets. MPP databases tend to be columnar, so rather than storing each row in a table as an object (a feature of transactional databases, MPP databases generally store each column as an object. This architecture allows complex analytical queries to be processed much more quickly and efficiently.
These analytic databases distribute their datasets across many machines, or nodes, to process large volumes of data (hence the name). These nodes all contain their own storage and compute capabilities, enabling each to execute a portion of the query.
The proliferation and drop in cost of analytical MPP databases in the last decade has created a huge opportunity for data-driven organizations to operationalize and analyze larger datasets than ever before. These databases have been a wonderful addition to the growing toolkit for analysts, but also introduce additional complexity into architectures.
What are data warehouses really great for?
Typical analytical workloads
MPP databases are very good at the most common analytical workloads, which are generally characterized by queries on a subset of columns with aggregations over broad ranges of rows. This is due to their columnar architecture, which allows them to only access the fields needed to complete a query (as opposed to transactional databases, which must access all fields in a row).
A columnar architecture also gives MPP databases additional features that are useful for analytic workloads. These vary by database, but often include the ability to compress like data values, efficiently index very large tables, and handle wide, denormalized tables.
Organizations typically use analytical MPP databases as data warehouses, or centralized repositories that house all data generated within their organization, such as transactional sales data, web tracking data, marketing data, customer service data, inventory/logistical data, HR/recruiting data, and system log data. Because analytical MPP databases can handle huge data volumes, an organization can comfortably rely on these databases to not only store data, but also support analytical workloads from these various business functions.
Analytical MPP databases can easily scale their compute and storage capabilities linearly by adding more servers to the system. This the opposite of vertically scaling compute and storage capabilities, which involves upgrading to larger and more powerful individual servers, and which generally hits a wall at scale. Analytical MPP databases are able to scale out so quickly, easily, and efficiently that on-demand database vendors have automated that process to scale the system up or down depending on the size of the query.
Data warehouse solutions
Self-managed MPP databases
On-demand MPP databases
Analytical MPP architecture
Massively Parallel Processing as a term refers to the fact that tables loaded into these databases are distributed across each node in a cluster, and the fact that when a query is issued, every node works simultaneously to process the data that resides on it.
Analytical MPP databases were designed to run queries in parallel over many individual servers, or nodes. This means that by adding more nodes to a cluster, the same workload can be distributed to more servers and completed more quickly.
While analytical MPP databases are sometimes capable of scaling up in certain aspects, their biggest advantage is making it relatively easy to horizontally scale the database system by adding more nodes to the system.
Massively parallel processing (MPP)
While different systems take different approaches to coupling storage of data and computate for queries, employing different strategies for moving data from one node to another, all MPP systems are fast because a “leader” can make a query plan and then distribute the actual work of executing the query to many workers.
Hosted vs. on-premise implementations
Analytical MPP databases have been around for decades, but their cost and the complexity of managing them has dropped tremendously in the last decade. The only option until recently was to self-host these databases. But more recently, they have migrated to the cloud.
Some analytical data warehouses are solely available via a hosted architecture; Amazon Redshift, Snowflake, and Google BigQuery for example, are offered solely through the cloud. Others, like Teradata are able to be deployed both on-premise, packaged as appliances (software and hardware bundled), or deployed via a hosted model in the cloud.
HPE Vertica is the only database within the analytical MPP database category that offers software with which you can build your own analytical database on top of commodity hardware. For larger scale implementations, this type of solution provides more flexibility and customization for specialized use cases.
Constraints of an analytical MPP database
The main constraint associated with distributed MPP databases is that they require a different type of planning and setup than what’s required for transactional databases. Setting up the initial Extract-Transform-Load (ETL) pipeline that moves data from its source to your data warehouse can be quite a project (though many vendors now offer services to make this process simpler).
It is also challenging to plan resources needed within the cluster when dealing with multiple data sources, especially when the volume of data from those sources is not fully in your control. Many cloud-hosted mpp database technologies provide for auto-scaling, but not all do.
Batch loading / latency
Because of the way analytical MPP databases are architected, as a general rule, the closer to “real-time” you want your data, the more challenging the data engineering task will be. This is because these technologies are typically optimized for batch loads. Optimizing data storage for distributed reads across the cluster is a labor-intensive task. Some Distributed MPP databases specifically offer hybrid storage or additional services to address this issue. Many organizations maintain replicas of their transactional database just to enable real-time access to that slice of the dataset.
Optimizing an analytical MPP database
Considerations around the performance tuning for an analytical MPP database will dramatically differ depending on whether you’re using an on-demand MPP database solution or a managed MPP database solution and which you’re using.
It’s important to note that generally, MPP database solutions can handle far larger data volumes than transactional databases of similar cost and so will be able to handle much more data before you need to start thinking about optimization and tuning.
That said, when you do reach the stage where you need to tune performance, managed MPP solutions will handle more of the details for you than self-hosted solutions, which will require more work on your part.