Optimizing Redshift for analytics
Nov 12, 2014
Query your data where it lives. (Don’t move it.)
Looker was built with massively parallel processing (MPP) databases like Amazon Redshift in mind. By querying an MPP data warehouse directly for just the data needed to answer a question, Looker is the most efficient BI path in terms of hardware, storage, and computing power.
Amazon Redshift is a fully managed, high-performance MPP data warehouse solution in the cloud that can scale up to a petabyte or more, while costing an order-of-magnitude less than legacy data warehousing solutions. If you can query your data where it sits in such a powerful database, why move it anywhere else?
To help make the most of Looker + Redshift, we put together a technical overview that covers the most important tips and must-knows when setting up a data warehouse on Redshift. Here are three quick tips:
1) Sort keys:
Every table in Redshift can have one or more sort keys. A sort key is like an index: Imagine looking up a word in a dictionary that’s not alphabetized — that’s what Redshift is doing if you don’t set up sort keys. Redshift stores data in 1MB blocks, storing the min and max values for each sort key present in that block. The main benefit of sort keys is that Redshift can skip over blocks of data when a sort key is present and the query is limited by that column, vastly boosting performance when querying large data sets.
If you have a 50 billion–row event table with three years of data, you may often need to run queries on just “today” or “last 7 days” in Looker. If you have a sort key on created_at and include a filter on that field in Looker, Redshift will be able to skip over 99% of rows (i.e., blocks of data) when executing the query.
2) SSD vs HDD clusters:
Redshift gives two options for storage: “Dense Compute” (SSD) or “Dense Storage” (HDD). Dense Compute node clusters use SSDs and more RAM, which costs more—especially when you have many terabytes of data—but can allow for much faster querying and a better interactive experience for your business users. If you have under 1TB, it usually makes sense to use SSDs; if over 1 TB, it depends on your use case. Looker is geared toward interactive on-the-spot exploration, so query speed is key.
3) Joins and distribution keys:
A rule of thumb we use at Looker is that each Redshift join can cost a 10-40% speed reduction on a query -- sometimes much more if not optimized and distributed efficiently or when joining very large tables. Adding this factor to the columnar design of databases like Redshift, it is often better to have long and wide tables than many separate tables. When many joins are necessary, it’s especially important to set the proper distribution keys and sort keys so that query results return quickly — which in turn fulfills the user’s curiosity and encourages further exploration.
See the full paper, Technical Overview: Looker for Amazon Redshift, for more tips on distribution styles, column compression, derived tables, and caching.