Looker Blog : Data Matters

Aggregate Functions Gone Bad and the Joins Who Made Them that Way

Brett Sauve, Head of Content

Jun 10, 2014

NOTE: When this entry was originally written, Looker users needed to use the following strategies to properly calculate certain metrics. Since that time Looker has implemented Symmetric Aggregates, which make much of this unnecessary and a great deal easier.

However, this is still a useful entry to read for writing SQL in general, or for SQL dialects that do not support Symmetric Aggregates.

If you’re using SQL some of the first things you probably learned about were joins and aggregate functions (such as COUNT and SUM). One thing that is not always taught is how these two concepts can interact, and sometimes produce incorrect results. In this entry we’ll discuss the things to look out for, the concept of a “fanout”, and why it matters to SQL writers and Looker users alike.

Let's Start with a Friendly Join

Let’s start off with a simple example, where we’ll join together a couple tables. Our first table will be made up of our customers, and the number of visits they’ve made to our website:

customer

customer_id first_name last_name visits
1 Amelia Earhart 2
2 Charles Lindbergh 2
3 Wilbur Wright 4

Our second table will include all of the orders that those customers have placed. You can see that each order is linked to the customer who placed it via the customer's ID.

order

order_id amount customer_id
1 25.00 1
2 50.00 1
3 75.00 2
4 100.00 3

Joining these tables together in SQL would be pretty straightforward:

SELECT    *
FROM      customer
LEFT JOIN order
ON        customer.customer_id = order.customer_id

The result of that query would be this table:

customer_id first_name last_name visits order_id amount customer_id
1 Amelia Earhart 2 1 25.00 1
1 Amelia Earhart 2 2 50.00 1
2 Charles Lindbergh 2 3 75.00 2
3 Wilbur Wright 4 4 100.00 3

Aggregate Functions Gone Bad

Now that we have a joined table, we need to be careful about how we use aggregate functions like COUNT and SUM.

Aggregate Functions on a Single Table

Let's consider the customer table all by itself again. If we want to know the total number of customers we can execute a simple query like this:

SELECT COUNT(*)
FROM   customer

SQL will count up the rows in the table as follows:

customer

COUNT customer_id first_name last_name visits
1 1 Amelia Earhart 2
2 2 Charles Lindbergh 2
3 3 Wilbur Wright 4

We'll get a count of 3, which is correct.

Or, if we want to know the total number of customer visits, we can execute another straightforward query like this:

SELECT SUM(visits)
FROM   customer

SQL will add up the number of visits in the table as follows:

customer

customer_id first_name last_name visits
1 Amelia Earhart 2
2 Charles Lindbergh + 2
3 Wilbur Wright + 4
SUM 8

We'll get a result of 8, which is also correct.

Aggregate Functions on the Joined Tables

If we try to use the same aggregate functions on either of our joined tables, we'll start to get incorrect results.

Running a basic count on the joined table, we will no longer get the correct number of customers:

SELECT    COUNT(*)
FROM      customer
LEFT JOIN order
ON        customer.customer_id = order.customer_id

SQL will count up the rows in the table as follows:

COUNT customer_id first_name last_name visits order_id amount customer_id
1 1 Amelia Earhart 2 1 25.00 1
2 1 Amelia Earhart 2 2 50.00 1
3 2 Charles Lindbergh 2 3 75.00 2
4 3 Wilbur Wright 4 4 100.00 3

We'll get a result of 4, even though there are really only 3 customers. You can see that Amelia is counted twice.

Similarly, if we try to sum the number of visits, we will no longer get the correct result:

SELECT    SUM(visits)
FROM      customer
LEFT JOIN order
ON        customer.customer_id = order.customer_id

SQL will add up the number of visits in the table as follows:

customer_id first_name last_name visits order_id amount customer_id
1 Amelia Earhart 2 1 25.00 1
1 Amelia Earhart + 2 2 50.00 1
2 Charles Lindbergh + 2 3 75.00 2
3 Wilbur Wright + 4 4 100.00 3
SUM 10

We'll get a result of 10, even though there are only 8 visits. Amelia's 2 visits are added twice.

A Fanout Happened While You Weren't Looking

In the example we’ve been looking at the primary table (customer) had only 3 rows. When I say "primary table", I mean the table that is in the FROM clause of our SQL queries. After the join, we now have 4 rows. Since the joined table has more rows than the primary table, we say that a fanout has occurred.

If we want to avoid a fanout, we can write the join in the opposite order. So instead of this ...

SELECT    *
FROM      customer
LEFT JOIN order
ON        customer.customer_id =
          order.customer_id

Do this ...

SELECT    *
FROM      order
LEFT JOIN customer
ON        order.customer_id =
          customer.customer_id

Now our joined table looks like this:

order_id amount customer_id customer_id first_name last_name visits
1 25.00 1 1 Amelia Earhart 2
2 50.00 1 1 Amelia Earhart 2
3 75.00 2 2 Charles Lindbergh 2
4 100.00 3 3 Wilbur Wright 4

The eagle eyed observer will recognize that this is exactly like our other joined table, except for the order of the columns. That is true, and it’s why most writers of SQL think of our two different joins as exactly the same.

However, when we do the join in this order we do not have a fanout. Our original table (order) had 4 rows, and our joined table also has 4 rows, so there has been no fanout. Herein lies a key point: to help avoid fanouts, begin your joins with the most granular table.

Fanouts, Schmanouts, Who Cares?

In both the fanout case, and the non-fanout case, we still need to worry about the accuracy of aggregate functions. However, there is a subtle difference in the type of problems we’re going to see. If you use Looker, this is a very important concept:

No Fanout: You can trust aggregate functions on your primary table, but not necessarily on your joined tables.

Fanout: You cannot necessarily trust aggregate functions on either your primary table, or your joined tables.

To drive home this point, let’s take a look at our previous examples.

Fanout Example

If you'll recall from above, the following join resulted in a fanout, because while the customer table only had 3 rows, the joined table had 4 rows.

SELECT    *
FROM      customer
LEFT JOIN order
ON        customer.customer_id = order.customer_id
customer_id first_name last_name visits order_id amount customer_id
1 Amelia Earhart 2 1 25.00 1
1 Amelia Earhart 2 2 50.00 1
2 Charles Lindbergh 2 3 75.00 2
3 Wilbur Wright 4 4 100.00 3

Since we are in a fanout situation we cannot trust that aggregate functions will work on the primary table (customer). As we saw above, SUM(visits) will give us a value of 10, even though only 8 visits have actually occurred.

No Fanout Example

When we reversed the join we did not get a fanout, because the order table had 4 rows, and the joined table also had 4 rows.

SELECT    *
FROM      order
LEFT JOIN customer
ON        order.customer_id = customer.customer_id
order_id amount customer_id customer_id first_name last_name visits
1 25.00 1 1 Amelia Earhart 2
2 50.00 1 1 Amelia Earhart 2
3 75.00 2 2 Charles Lindbergh 2
4 100.00 3 3 Wilbur Wright 4

Without any fanout we can trust that aggregate functions will work on the primary table (order). For example, SUM(amount) will give us a value of 250.00, which is the correct amount of money collected.

2 Friendly Joins, 1 Frenemy Join

If we want to avoid fanouts it's important that we understand the 3 different types of joins.

1-to-1 (Friendly)

If one row of your primary table only ever matches up with one row of your joined table, you have a 1-to-1 join. This type of join will not result in a fanout, and aggregate functions will be accurate no matter where you use them.

Example: Suppose you have a person table and a DNA table. Since only one person can be matched with one DNA record, this is a 1-to-1 join.

Many-to-1 (Friendly)

If many rows of your primary table match up with the same row in your joined table, you have a many-to-1 join. This type of join will also not result in a fanout, and aggregate functions will at least be accurate on the primary table.

Example: Suppose you have a person table and a state-of-residence table. Since many people can live in one state, this is a many-to-1 join.

1-to-Many (Frenemy)

If one row of your primary table can match up with multiple rows in your joined table, you have a 1-to-many join. This type of join can result in a fanout, and aggregate functions are not necessarily accurate anywhere.

Example: Suppose you have a person table and a children table. Since one person can have more than one child, this is a 1-to-many join.

A Fanout Witch Hunt

Understand Your Join Type

The first, and preferred, method to check for a fanout is to understand the type of join that is occurring. 1-to-1 and many-to-1 joins won’t ever result in a fanout. However, if you know you are in a 1-to-many situation, then there will always be the risk of a fanout. Even if a fanout has not already occurred, it will be a risk in the future if new rows are added.

Count Rows Before and After the Join

The second method you can use to check for a fanout is to query a COUNT before and after the join. The queries would look like this:

SELECT COUNT(*)
FROM   my_primary_table

SELECT    COUNT(*)
FROM      my_primary_table
LEFT JOIN my_joined_table
ON        my_primary_table.my_column_1 = my_joined_table.my_column_2

If the count increases between the 2 queries, we know that a fanout has occurred. Since we’re looking for an increase, it’s important that the second query use a LEFT JOIN. We don’t want to artificially decrease the number of rows being reported just because a row in the primary table doesn’t have a corresponding row in the joined table.

Unfortunately, this method cannot tell you if there is a risk of a future fanout. To know that, you need to understand the type of join that is occurring.

Looker to the Rescue

If you're one of the lucky folks who use Looker, there are several LookML options that can help protect you from these pitfalls.

Joining with foreign_key Makes Fanouts Impossible

In Looker, you’ll define joins between your views (views represent an existing or derived table) in your model file. At Looker we recommend you implement joins using the foreign_key declaration when possible. In LookML, a join that uses this method will be written like this:

explore: person {
  join: dna {
    foreign_key: dna_id
  }
}

In plain English, we would interpret this to mean “take the primary key from dna, and join it to dna_id from person”.

It is impossible for a join performed via foreign_key to result in a fanout, as long as you've defined your primary keys correctly. Since the joined views are being joined via their primary key, which is unique, we know that only one row from the joined views can ever be matched with a row from the explore.

Sometimes the column relationships in your database will not allow you to use foreign_key, even though no fanout will occur. That’s perfectly fine, and in this situation you can use sql_on to define your join. The example above could be written like this:

explore: person {
  join: dna {
    sql_on: ${dna.id} = ${person.dna_id} ;;
  }
}

As we learned in this article, if you eliminate the possibility of a fanout, you can be sure that the measures in your explore can be trusted.

Aggregate Measures from Joined Views are Hidden Unless Join is 1-to-1

That sure is a mouthful, so let’s work through it. In this article, we've seen that aggregate functions don't always work for tables that have been joined in. For that reason, Looker will not display aggregate functions (except for counts) for joined views. That way, your users will not accidentally retrieve bad information.

The only type of aggregate function that is displayed are counts. To protect you from a bad count, Looker will automatically use a COUNT DISTINCT in these cases. This is safe because, even if a row from a joined table is repeated multiple times, a COUNT DISTINCT will only count it once.

Sometimes it is safe to use other aggregate functions, besides just COUNT DISTINCT, on joined views. This occurs when the join relationship is of the 1-to-1 type. If you know that a join is 1-to-1 you can tell Looker about this, using the relationship declaration. It looks like this:

explore: person {
  join: dna {
    foreign_key: dna_id
    relationship: one_to_one
  }
}

When you make this declaration Looker will start displaying aggregate functions for the joined view.

When a Fanout is Ok

Sometimes you know that a fanout is going to occur, but you've thought it through, and decided you don't need aggregate functions or that the ones you plan to use will work properly. In that case, Looker will definitely allow you to define your joins however you like. Just use sql_on to define your join. As a reminder, the syntax will look like:

explore: person {
  join: child {
    sql_on: ${child.parent_id} = ${person.id} ;;
  }
}

But I Want My Cake and I Want to Eat it Too!

There are some clever ways to avoid inaccuracy with aggregate functions, even if you have problematic joins. If you can force the information in your joined tables to have a 1-to-1 relationship with your primary table, you can use aggregate functions to your heart's content.

Group Your Data in a 1-to-1 Join

At the very beginning of this article we started with the customer table, then joined in order, and saw some bad behavior extracting total customers and total visits. As a reminder, the tables look like this:

customer
customer_id first_name last_name visits
1 Amelia Earhart 2
2 Charles Lindbergh 2
3 Wilbur Wright 4
order
order_id amount customer_id
1 25.00 1
2 50.00 1
3 75.00 2
4 100.00 3

One approach we might take, to get this data joined together in a safe way, is to group the order table by customer_id. In so doing, we'll create a single row to be matched with each row in customer, and will end up with a 1-to-1 relationship. The SQL to perform this grouping might be written as:

SELECT   customer_id,
         SUM(amount) AS total_amount
FROM     order
GROUP BY customer_id

The resulting table will look like this:

customer_id total_amount
1 75.00
2 75.00
3 105.00

If you compare this result set to the customer table, you can see that it will join quite nicely. To actually execute the join in SQL, you would need to use a sub query:

SELECT *
FROM   customer
LEFT JOIN
(
  SELECT   customer_id,
           SUM(amount) AS total_amount
  FROM     order
  GROUP BY customer_id
)
AS customer_totals
ON customer.customer_id = customer_totals.customer_id

The resulting table will look like this:

customer_id first_name last_name visits customer_id total_amount
1 Amelia Earhart 2 1 75.00
2 Charles Lindbergh 2 2 75.00
3 Wilbur Wright 4 3 105.00

Now any aggregate function calculations we want to use (such as a SUM on visits or total_amount) will work just fine.

Using Looker's Derived Tables to Do the Same Thing

Looker has a great feature called derived tables. They have many sophisticated features, but can be used simply to achieve the grouping of the order table that we want. This is how the LookML would be written:

view: customer_totals {
  derived_table: {
    sql:
      SELECT
        customer_id,
        SUM(amount) AS total_amount
      FROM order
      GROUP BY customer_id ;;
  }
}

You could now treat this data as if it were a real table in your database, just like any other. You can define measures and dimensions just like any other view. It can also be joined in LookML just like any other table. In this example, the LookML might be:

explore: customer {
  join: customer_totals {
    foreign_key: customer_id
    relationship: one_to_one
  }
}

The CliffsNotes Version

To summarize everything we've just covered:

  • Aggregate functions like SUM and COUNT can misbehave if used against joined tables

  • If a join has a 1-to-1 relationship, aggregate functions will work just fine

  • If a join has a many-to-1 relationship, aggregate functions will work on the primary table, but might not work on the joined tables

  • If a join has a 1-to-many relationship, aggregate functions may not work anywhere

  • If you're using Looker, protect yourself from many of these pitfalls by making your joins with foreign_key if possible, and setting relationship: one-to-one when appropriate

  • If you need to join tables that have a 1-to-many relationship, and want to use aggregate functions, try grouping the joined tables first such that they will have a 1-to-1 relationship with the primary table

Next Previous

Subscribe for the Latest Posts