Snowflake recently announced the general availability of materialized views (MVs). If you aren’t familiar with MVs, they are physical database objects that contain the result of a query, so they lie somewhere between a table, which is a physical object, and a view, which is based on a query but is a logical object.
MVs have many uses, including enhancing query performance, encapsulating complex logic, and aggregating data.
Say you frequently report on a customer’s month-to-date spend and you have a table that contains line-item information about a customer’s purchase history. Using MVs, you could create a view that rolls the purchases up by customer for the current month. And since that aggregated data would always be as current as the data in the base table, the result would be highly performant queries when accessing data on customer month-to-date spend.
Without MVs, your options would be to aggregate the data each time the query was run to ensure that the most current data was included, which could have performance implications depending on data volumes, or to create a rollup table that is refreshed on a schedule. The rollup table would yield excellent performance, but the data may not be current.
There are two cost components to be aware of, but not deterred by, with Snowflake MVs. One is the additional storage required for the MV, and the other is the cost of maintaining the MV (keeping it current), which is handled by a Snowflake service. That service is being continuously improved to reduce the maintenance cost. Over the last few weeks alone, Snowflake has released improvements to the service which can result in a 30%-40% cost savings over earlier versions (with more improvements coming soon). It’s worth keeping track of how much your MVs are costing to maintain vs. the amount of time saved by having faster queries. As we’ll see in our demo use case, that time savings can be significant.
In addition, it’s important to note that while the compute resources for creating the MVs come from your own warehouse, the compute resources used to maintain the MVs are provided by Snowflake. This is good news because it means that you don’t have to have an active warehouse in order for your MVs to stay up-to-date.
Looker users may be familiar with Looker’s Persistent Derived Tables ( PDTs), which are physical tables that are created based on the result of a query. This sounds quite a bit like an MV, so customers frequently ask whether they should be using MVs or PDTs for their use cases. Here are a few guidelines to help you decide:
You should use Looker PDTs when:
You should use Snowflake MVs when:
Snowflake’s Robert Fehrmann wrote a great blog post demonstrating how MVs can be used to provide different access paths into the same set of data, improving performance based on the types of queries that are run. Before continuing on, I would highly recommend giving it a read.
In Robert’s blog, it assumes that you understand that there are two different access paths available. From there, you can choose to query either the base table or the MV built on the base table depending on what access path you need. These are valid assumptions if you are writing your own SQL, but what if you wanted to accomplish this in Looker with your LookML model, so users wouldn’t even have to know that there are two database objects to choose between?
Let’s take a look at how to get everything set up in Looker.
In the following example, I followed all of the steps in Robert’s blog post with one exception. When I created the MV, I included all of the columns from the base table, so my MV create statement looked like this:
CREATE OR REPLACE MATERIALIZED VIEW MV_TIME_TO_LOAD (CREATE_MS, PAGE_ID, TIME_TO_LOAD_MS, METRIC2, METRIC3, METRIC4, METRIC5, METRIC6, METRIC7, METRIC8, METRIC9) CLUSTER BY (PAGE_ID) AS SELECT CREATE_MS, PAGE_ID, TIME_TO_LOAD_MS, METRIC2, METRIC3, METRIC4, METRIC5, METRIC6, METRIC7, METRIC8, METRIC9 FROM WEBLOG
After I built the table, loaded the 10 billion rows and created the MV, it was time to get things set up in Looker.
For the purposes of this example, I’ll assume that you are familiar with the basic Looker developer functions, such as creating a connection, creating a project and generating a LookML model. If you aren’t, this video explains how to do all of those things: Looker — Database To Dashboard
Before we start running queries, we’ll need to add a new measure for AVG_TIME_TO_LOAD_MS into our weblog view file. It will look like this:
Now we are ready to start running queries, so let’s follow the examples in Robert’s blog and test our performance.
First, we’ll query the weblog table filtering it on CREATE_MS. We’d expect this query to run pretty quickly because the WEBLOG table is clustered on CREATE_MS.
SELECT COUNT(*) CNT, AVG(TIME_TO_LOAD_MS) AVG_TIME_TO_LOAD FROM WEBLOG WHERE CREATE_MS BETWEEN 1000000000 AND 1000001000;
Using our weblog Explore, we’ll build the query by selecting the Count and Avg Time To Load MS and filtering on the Create MS
As expected, the query takes just over a second.
Now we’ll try selecting the same columns, but filtering on PAGE_ID instead, based on the query from the Snowflake blog post:
SELECT COUNT(*) CNT, AVG(TIME_TO_LOAD_MS) AVG_TIME_TO_LOAD FROM WEBLOG WHERE PAGE_ID=100000;
This query takes almost 90x longer to run due to the fact that there is not an efficient retrieval path for data filtered by page_id.
Let’s see how we can incorporate our MV, which is clustered by page_id, into our query to give us faster query times on data filtered by page_id.
Our current weblog view file looks like the screenshot below with PUBLIC.WEBLOG specified as the sql_table_name.
We’ll add in some logic using the Liquid templating language to dynamically choose between the WEBLOG table and the MV_TIME_TO_LOAD MV, depending on which filter criteria is selected. The logic is very simple — if the filter criteria includes the page_id, then use the MV that is clustered on page_id. Otherwise, use the base table which is clustered on create_time_ms.
If you aren’t using Liquid in your LookML, you’re missing out. Take a look at some of the many Looker Community articles on using Liquid in your LookML to get an idea of the use cases.
If everything is working as expected, Looker should dynamically select whether to use the base table or the Materialized View as the source of the query, depending on the filter criteria selected. You can see below that with the data filtered by Create MS, Looker will use the WEBLOG table as the source of the query.
But, if we change our filter criteria to page_id, Looker automatically updates the query to access the MV instead. With a couple of lines of code, Looker takes care of selecting the right source for the data based on the selected filter criteria to ensure that your self-service users always get the best performance possible.
Now, let’s try re-running our query filtered by page_id. This time Looker queries the MV and the results are returned in under two seconds, a 45x improvement over our previous attempt!
MVs provide another great tool for working with Snowflake. They can enhance the experience of end-users by improving query performance, or simplify getting access to data in complex structure by presenting a flattened view. MVs can also benefit developers by encapsulating complex logic in a reusable way.
As MVs are still a relatively new feature, we expect there will be more enhancements coming in future releases to make MVs even more useful. A few of the enhancements that we are hoping to see in the not-too-distant future include support for joins as well as the automatic selection of the most performant path between the base table and any associated MVs by the optimizer. If you have plans for how you will be using Snowflake’s MVs with Looker to enhance your end user’s experience, we’d love to hear them! Want to learn more about Looker + Snowflake? Reach out to our team for a demo today.