Looker Blog : Data Matters

Excluding Weekends from Time Differences

Scott Hoover, Analyst

Mar 27, 2014

Recently, the Looker Analytics team was posed with a unique challenge by a client: come up with a way to exclude weekends from an analysis that calculates the duration between two times.

The data in question captures the rate at which jobs move from various stages in production. Each row of the table of interest (hereinafter  velocity ) represents a single job — which, along with other attributes, has an approved and a completed time.

velocity  table

job_id created_at approved_at completed_at ... location_id partner_id
1 2013-07-31 04:00:00 2013-08-01 11:00:00 2013-08-04 13:00:00 ... 1 1
2 2013-03-07 07:00:00 2013-03-07 10:00:00 2013-03-12 14:00:00 ... 4 1
... ... ... ... ... ... ...
n 2013-05-07 09:00:00 2013-05-08 02:00:00 2013-05-16 12:00:00 ... 3 2

The first pass at addressing approved-to-complete velocity relied on a simple date difference:

    SELECT
     
    job_id
    
    ,
     
    approved_at
    
    ,
     
    published_at
    
    ,
     
    EXTRACT
    
    (
    
    'EPOCH'
     
    FROM
     
    (
    
    completed_at
     
    -
     
    approved_at
    
    ))
    
    /
    
    3600
     
    AS
     
    velocity_in_hours
    
    FROM
     
    velocity
    
    

The problem with this approach is that some jobs start in one week, extend over one or more weekends, and are completed in some subsequent week; however, work is not done on weekends, and the velocity for jobs that extend over one or more weekends is penalized in a sense. This is why Looker was asked to come up with a solution to exclude weekends from the analysis.

Solution

A few approaches were posed, but the final solution relied on a look-up table containing three columns: a  date_time  field, a  day_of_week  field, and an integer,  hour_seq,  indicating the number of hours since the first hour in this table (hereinafter  sequence ). The table would be used to get a corresponding integer (again, the number of hours since the beginning of the sequence table) for the  approved_at  and  completed_at  for each job and then subtract the two, yielding the number of hours between the two times.

The key to the solution was that the integers in the sequence table would not increment on weekend hours. The resulting table would look like this:

sequence  table

date_time day_of_week hour_seq
2008-01-01 00:00:00 2 1
2008-01-01 01:00:00 2 2
2008-01-01 02:00:00 2 3
... ... ...
2008-01-04 21:00:00 5 93
2008-01-04 22:00:00 5 94
2008-01-04 23:00:00 5 95
2008-01-05 00:00:00 6 95
2008-01-05 01:00:00 6 95
2008-01-05 02:00:00 6 95
... ... ...
2008-01-06 23:00:00 0 95
2008-01-07 00:00:00 1 96
2008-01-07 01:00:00 1 97

To achieve this, we needed to start with a table of hours from the beginning of time (or 2008, when the business started) going well into the future. We relied on recursive common table expressions (CTEs) in PostgreSQL to generate a sequence of numbers and then add hours accordingly, but this could achieved in a number of ways for dialects without recursive CTEs.

    WITH
     
    RECURSIVE
     
    seq
    
    (
    
    n
    
    )
     
    AS
     
    (
    
    SELECT
     
    1
    
    UNION
     
    ALL
    
    SELECT
     
    n
     
    +
     
    1
     
    FROM
     
    seq
    
    )
           
    --generate a sequence of numbers
    
    ,
     
    hours
     
    AS
     
    (
    
    SELECT
     
    TIMESTAMP
     
    '2008-01-01 00:00:00'
     
    +
     
    n
     
    *
     
    INTERVAL
     
    '1 hour'
     
    AS
     
    date_time
    
    FROM
     
    seq
    
    LIMIT
     
    87600
         
    )
           
    --generate ten years of hours from 2008-01-01 00:00:00 into the future
    
    SELECT
     
    date_time
    
    ,
     
    EXTRACT
    
    (
    
    dow
     
    FROM
     
    date_time
    
    )
     
    AS
     
    day_of_week
    
    ,
     
    SUM
    
    (
    
    CASE
    
    WHEN
     
    EXTRACT
    
    (
    
    dow
     
    FROM
     
    date_time
    
    )
     
    NOT
     
    IN
     
    (
    
    6
    
    ,
    
    0
    
    )
    
    THEN
     
    1
    
    ELSE
     
    NULL
    
    END
    
    )
     
    OVER
     
    (
    
    ORDER
     
    BY
     
    date_time
    
    )
     
    AS
     
    hour_seq
    
    FROM
     
    hours
    
    

The penultimate step was to twice join the sequence table into velocity and map each  created_at  and  completed_at  entry to its corresponding  hour_seq . The final velocity table looked something like this:

velocity  table

job_id created_at approved_at approved_int completed_at completed_int ... location_id partner_id
1 2013-07-31 04:00:00 2013-08-01
11:00:00
34979 2013-08-04
13:00:00
35015 ... 1 1
2 2013-03-07
07:00:00
2013-03-07
10:00:00
32458 2013-03-12
14:00:00
32534 ... 4 1
... ... ... ... ... ... ... ... ...
n 2013-05-07
09:00:00
2013-05-08
02:00:00
33506 2013-05-16
12:00:00
33660 ... 3 2

With a  velocity  table that had corresponding hour sequence numbers for each approved and completed time, we could simply difference these two values to see the time spent in production, in hours. This could be easily extended to capture any timeframe.

SELECT
 
approved_at

,
 
completed_at

,
 
completed_int
 
-
 
approved_int
 
AS
 
duration_in_hours

FROM
 
velocity

— fin —

Next Previous

Subscribe for the Latest Posts