Today Amazon quietly added a bunch of features to Redshift, their cloud-based column store SQL data warehouse. Redshift is pretty amazing if you have large data sets and want to explore them quickly using robust SQL (with the ability to do complex joins, joined subqueries, and transformations).
When looking at large data sets, the most common aggregate operation is to count things. Think of a giant web log with 10 billion entries. How many users came to the website each day? How many different web pages were hit?
To compute this you might write an expression like this:
SELECT DATE_TRUNC('day',event_time), COUNT(DISTINCT user_id), COUNT(DISTINCT url) FROM weblog GROUP BY 1
SQL's COUNT(DISTINCT) works by making a sorted list or tree for each grouping (in this case month) and then keeps a sorted list of all the elements it encounters. If you have a very busy website, this can be millions for each grouping. COUNT(DISTINCT) in this large case are very slow and extremely memory intensive.
If you think about COUNT(DISTINCT url), the computation would require many copies of all the URLs. You can easily see how you might use up all the memory on your cluster.
HyperLogLog (HLL) is an algorithm for computing approximate counts. It’s pretty magical -- basically a low overhead probabilistic counter. It’s accurate within about 2%, which for very large datasets is generally quite acceptable. The algorithm uses a fixed amount of space, so it’s not at all memory intensive.
This week, the folks at Amazon implemented this algorithm and built it directly into Redshift's SQL. We're very excited about it. We’ve already incorporated the functionality into Looker, and have three customers using it in production. If we re-write the computation using the Amazon HLL syntax, it looks like this:
SELECT DATE_TRUNC('day',event_time), APPROXIMATE COUNT(DISTINCT user_id), APPROXIMATE COUNT(DISTINCT url) FROM weblog GROUP BY 1
Our initial very simple tests indicate a 5x improvement on a single grouping. Performance improvements will vary widely, depending on the number of distinct values, how much you are grouping and how much memory your cluster has. In the majority of cases, the result is very fast data exploration for end users over extremely large datasets (multi-billions of rows).
For problems with large data sets and large numbers of transactions entities, APPROXIMATE COUNT(DISTINCT) will be game changing. Thank you Amazon! We hope to see this functionality soon in the other MPP/column-store databases.