Looker Blog : Data Matters

Time Series Data, Moving or Rolling Aggregates in SQL

Mike Xu, Analyst

Mar 25, 2014

When looking at time series data, it's good to rely on a metric that reveals an underlying trend — something robust enough to deal with volatility and short-term fluctuations. A question that Looker users frequently pose is: How does average sale price fluctuate over time? This question points to a moving average and sum calculation within SQL using a monthly interval. There are several ways of accomplishing this. I'm going to demonstrate two approaches: correlated subqueries and derived tables.

My example uses a simple purchases table to create rolling sums on revenue. The sample code below can be modified to calculate many different aggregates and for comparing other timeframes, such as daily or hourly. Here's how the raw data looks:

TABLE: purchases
id timestamp price
1 2014-03-03 00:00:04 230
2 2014-03-03 00:01:14 210
3 2014-03-03 00:02:34 250
4 ... ...

Here is the result set we expect:

Derived Table: Purchases weekly rolling on monthly interval
week revenue avg_rev rev_4_weeks_prior avg_rev_4_weeks_prior
2014-03-03 51,312 208.12 193,210 202.91
2014-03-10 54,310 202.87 221,409 210.41
2014-03-17 57,942 218.82 259,682 205.23
... ... ... ... ...

The finished result set makes it faster and easier to create reports and to discover interesting insights. We can also add dimensions to the final form so we can see how different facets alter our rolling aggregate. Now let's take a look at how to get to this form.

Approach 1: Correlated Subquery

The easiest way to produce our desired result set is by using a simple correlated subquery .

    -- 2nd pass
    
    SELECT
    
    pw
    
    .
    
    week
     
    AS
     
    week
    
    ,
     
    pw
    
    .
    
    revenue
     
    AS
     
    revenue
    -- 1st subselect
    
    ,
     
    (
    
    SELECT
    
    SUM
    
    (
    
    p2
    
    .
    
    revenue
    
    )
    
    FROM
     
    purchases
     
    AS
     
    p2
    
    WHERE
     EXTRACT(week FROM 
    p2.timestamp)
     
    <=
     
    pw
    
    .
    
    week
    
    AND
     EXTRACT(week FROM p2.timestamp) 
    >=
     
    pw
    
    .
    
    week
     
    +
     
    'INTERVAL -3 weeks'
    
    )
     
    AS
     
    rev_4_weeks_prior
    
    -- 2nd subselect
    
    ,
     
    (
    
    SELECT
    
    AVG
    
    (
    
    p3
    
    .
    
    revenue
    
    )
    
    FROM
     
    purchases
     
    AS
     
    p3
    
    WHERE
     EXTRACT(week FROM p3.timestamp) 
    <=
     
    pw
    
    .
    
    week
    
    AND
     EXTRACT(week FROM p3.timestamp) 
    >=
     
    pw
    
    .
    
    week
     
    +
     
    'INTERVAL -3 weeks'
    
    )
     
    AS
     
    avg_rev_4_weeks_prior
    
    FROM
    
    -- 1st pass
    
    (
    
    SELECT
    
    EXTRACT
    
    (
    
    week
     
    FROM
     
    p
    
    .
    
    timestamp
    
    )
     
    AS
     
    week
    
    ,
     
    SUM
    
    (
    
    p
    
    .
    
    revenue
    
    )
     
    AS
     
    revenue
    ​    
    , 
    AVG
    (p.revenue) 
    AS
     avg_rev
    
    
    FROM
     
    purchases
     
    AS
     
    p
    
    GROUP
     
    BY
     
    1
    
    )
     
    AS
     
    pw
    
    GROUP
     
    BY
     
    1
    
    
    

In the first pass, you can see we're grabbing all the weeks for which we have data and summing the revenue for that week. In the second pass, we're calculating for each week the revenue from four weeks prior.

Unfortunately, there are some drawbacks to this technique. The simplicity and elegance of the query comes at great cost to performance. Correlated subqueries run a SELECT statement for each row and column within the result set. With a large dataset (one that has many weeks, for example), this query may end up running for a long time. The correlated subquery approach is ideal for small datasets or for looking at a small subset of the data (using a WHERE clause to limit query range).

Please note that correlated subqueries are not implemented in every database. Popular MPP databases, such as Redshift and Vertica, only partially support correlated subqueries.

Approach 2: Derived Table

So how can we answer our question when dealing with very large datasets? We want to avoid scanning any row of the raw data more than once. The best method is to use derived tables . Here's the code:

    -- 1st pass on raw set
    
    WITH
     
    weekly_revenue
     
    AS
     
    (
    
    SELECT
    
    EXTRACT
    
    (
    
    week
     
    FROM
     
    p
    
    .
    
    timestamp
    
    )
     
    AS
     
    week
    
    ,
     
    SUM
    
    (
    
    p
    
    .
    
    revenue
    
    )
     
    AS
     
    weekly_revenue
    
    ,
     
    COUNT
    
    (
    
    *
    
    )
     
    AS
     
    purchases
    
    FROM
     
    purchases
    
    )
    
    -- 1st pass on derived table
    
    SELECT
    
    wr
    
    .
    
    week
     
    AS
     
    week
    
    ,
     
    wr
    
    .
    
    weekly_revenue
     
    AS
     
    revenue
    
    ,
     
    wr
    
    .
    
    weekly_revenue
    
    /
    
    wr
    
    .
    
    purch
     
    AS
     
    avg_rev
    
    ,
     
    SUM
    
    (
    
    wr2
    
    .
    
    revenue
    
    )
     
    AS
     
    rev_4_weeks_prior
    
    ,
     
    SUM
    
    (
    
    wr2
    
    .
    
    revenue
    
    )
    
    /
    
    SUM
    
    (
    
    wr2
    
    .
    
    purch
    
    )
     
    AS
     
    avg_rev_4_weeks_prior
    
    FROM
     
    weekly_revenue
     
    AS
     
    wr
    
    JOIN
     
    weekly_revenue
     
    AS
     
    wr2
    
    ON
     
    wr2
    
    .
    
    week
     
    <
     
    wr
    
    .
    
    week
    
    AND
     
    wr2
    
    .
    
    week
     
    >=
     
    wr
    
    .
    
    week + 'INTERVAL -3 weeks'
    
    GROUP
     
    BY
     
    1
    
    ,
    
    2
    
    

This query may seem a bit strange, so let's go over it step by step. First, we create a simple common table expression , or CTE, with every week and its revenue. Next, we join this CTE onto itself to create four rows, one for every prior week. We're fanning out the data and then summing on each of the new joined-on weekly_revenues. Here's how the form looks right after the JOIN, but before the SUM:

Intermediate Form: Fanned out weekly_revenue
wr.week wr.revenue wr.purch wr2.week wr2.revenue wr2.purch
2014-03-31 61,582 296 2014-03-10 51,312 247
2014-03-31 61,582 296 2014-03-17 54,310 317
2014-03-31 61,582 296 2014-03-24 57,942 286
2014-03-31 61,582 296 2014-03-31 59,429 272
... ... ... ... ... ...

There are now four rows for the week starting 2014-03-31. Each row has a weekly revenue from one of the four prior weeks. To get to the result set we want, we simply sum and average the wr2.weekly_revenue and then group by the original week's date and revenue value.

Derived Table: Our result set
week revenue avg_rev rev_4_weeks_prior avg_rev_4_weeks_prior
2014-03-31 61,582 208.12 222,993 198.76

This query will run much faster than the subquery method, since we only scan the raw table once.

(I used a WITH statement in this example query. Your syntax may be different, depending on your SQL database. You may prefer to use CREATE TEMPORARY TABLE or CREATE TABLE.)

Dealing with Gaps in the Data

Sometimes our data isn't perfect. There may be some weeks when no purchases are made. To make sure our query addresses that, we simply add a table with all the weeks we're interested in:

Derived Table: weeks
week
2014-03-03
2014-03-10
2014-03-17
2014-03-24
...

Now we can use this table for our queries by incorporating it with the second pass of our subquery or within the join of the derived table.

What's Next?

With these two approaches, we can now create rolling calculations in SQL — which enables us to understand complex metrics, such as ratios and medians, across meaningful intervals of time.

Next Previous

Subscribe for the Latest Posts