What is a self-managed MPP?
Self-managed MPP databases are powerful clustered databases that allow for massive customization, flexibility, and functionality that must be managed manually by the customer, usually by a database administrator or DevOps team.
While self-managed MPP databases have traditionally been deployed on-premise, these databases now have the flexibility to be deployed in the cloud as well, which allows for a larger array of deployment options.
One key proficiency shared by all of the databases within the self-managed MPP category are their mature SQL dialects and integrations. This makes them ideal options within an enterprise data stack. For example, HPE Vertica, and Teradata all offer powerful connectors to Apache Hadoop. These integrations, combined with their high concurrency, enable these databases to support a large enterprise workforce.
What are on-premise MPP databases really great for?
Working alongside existing database technologies
Self-Managed MPPs are designed to integrate with and work alongside existing database services and workflows such as Hadoop. All of the examples of self-managed data warehouses have been around for many years (Teradata) or built on existing data warehouse technology, so they’re good candidates for enterprise workloads.
One of the biggest features of self-managed MPP databases is the amount of control they grant users over hardware selection, table architecture/storage, and query optimization. Leveraged wisely, these options allow organizations to design a structure that is highly performant and efficient.
Many self-managed MPP solutions provide mature SQL dialects and broad integrations that provide advanced analytics and user-defined functions. A few have special analytical capabilities and integrations (such as Vertica’s geospatial and machine learning libraries) that make them a great choice for specific analytical workflows.
Popular self-managed databases
Self-managed database architecture
Every single node within a self-managed MPP database has its own storage, memory, and compute resources. This is known as a “shared nothing” architecture because storage and compute resources are not shared across the entire system.
This “shared-nothing architecture” enables multiple nodes to work in parallel to process a query. When a query is issued, a leader node creates a plan and distributes it to each of the individual nodes. These nodes then process their portion of the query (requesting needed data from other nodes across the network). These intermediate results are then sent back to a “leader node” that then aggregates the results together.
Constraints of a self-managed database
Self-managed MPP databases give you flexibility and customization, but in exchange you take on managing some of the complexity yourself. So, if you don’t have the resources to manage a database on your own, you should probably look at an on-demand MPP database. Similarly, if you’re upset about the lack of flexibility offered by an on-demand solution, a self-managed solution might be a better fit for you.
Optimizing a managed MPP database
Managed MPP databases vary greatly in how they can be optimized because of differences in how they deal with caching, distribution/sorting, and storage. We’ll cover each in more detail in the guides to individual databases.
But there are a few general guidelines that can help:
- Flatten or denormalized schemas yield fewer joins, which improves parallelization and therefore query performance.
- Column encoding and compression save disk space and improve processing speed.
- Sorting and distributing data efficiently can make a big difference in performance.
- Avoid queries that operate on most or all columns at once, since fetching each column incurs additional costs.
- Keep tables clean and free up unused space with appropriate vacuum and purge policies.
- Keep table statistics up to date to help with efficient query planning.