Looker Blog : Data Matters

Amazon Redshift Introduces Interleaved Sorts

Mike Xu, Analyst

May 11, 2015

Today Amazon released a brand new feature for Amazon Redshift called Interleaved Sorts. The feature was designed to improve filter query performance without the need for indices or projections used by traditional databases. The result is dramatically improved average query times across diverse use cases for large multi-faceted datasets.

Interleaved Sorting has significant ramifications for customers interactively querying big datasets and executing complex queries. Traditionally, analysts are restricted to a single facet that they can use for optimal query restriction and aggregation. A classic example is timeframes. Since most aggregate queries are timeframe restricted, large datasets typically use a timestamp as their primary sortkey. Interleaved Sorts removes this constraint and allows users to service a wider ranges of use cases on the same dataset. For example, in addition to timeframes, users can efficiently slice by cohorts like device, geo, behavior and other rich facets.

At Looker we were privileged to be a part of the preview for Interleaved Sorts and we are excited about the potential impact for our users. Looker was architected to take advantage of powerful analytical databases like Redshift by sitting directly in-database and relying on the computational power of Redshift to execute queries. Looker was designed for flexible data exploration and supports ad hoc, rapid transformation of data to answer a broad range of questions. Interleaved Sorts is a promising feature to support this approach and we expect it to greatly accelerate our Redshift users.

A common use case of Looker against large datasets is for event tracking data. These sets are extremely faceted with rich information about the users, their actions, workflows, and the application. We connect directly to the raw event streams that come from tools like Snowplow and Segment. Interleaved Sorting will improve performance against these datasets by orders of magnitude. Organizations can quickly find important insights in vast and sparse data.

There are some considerations when deciding when to utilize Interleaved Sorts as it is not a silver bullet for all use cases, it only benefits specific use cases. When conducting performance benchmarks we noticed that there is only significant improvement after getting close to 100M records. There is also an overhead cost to access the Interleaved Sorts that is more expensive than the standard Sortkey query. When using a restriction only on a single column that is the Sortkey or Interleaved Sort, the Sortkey will be considerably faster. However, once a query starts adding more restrictions on additional columns, Interleaved Sorts become an order of magnitude faster.

Here are our results from a Snowplow dataset. We have approximately ~30M records in this dataset and are using the traditional Snowplow schema.

With Interleaved Sorts

(collector_tstamp,geo_city,geo_region,useragent,os_name)

Looker Meta Data Sources Diagram

With Sortkey on collector_tstamp Looker Meta Data Sources Diagram

This query is just looking at the number of records from 100 days ago. Here we can see there is a performance disadvantage from the overhead cost of accessing the interleave sorts, this impacts the query slightly. Once we start trying queries more accurate to real life use cases we can see the performance increase:

With Interleaved Sorts

(collector_tstamp,geo_city,geo_region,useragent,os_name)

Looker Meta Data Sources Diagram

With Sortkey on collector_tstamp Looker Meta Data Sources Diagram

We are now executing the queries from Looker’s Data Explorer, we’ve added some criteria on regions we care about as well removing specific users that are not accessing the application via a browser. We have also expanded the timeframe for our report. There is an order of magnitude performance increase from using restrictions with the Interleaved Sorts.

We hope that this new feature will enable our customers to achieve more rapid discovery of insights operationally and analytically. We're excited to help our customers architect their schemas to utilize Interleaved Sorts.

Next Previous

Subscribe for the Latest Posts