Looker Blog : Data Matters

Name Game: Step 4 - Normalizing Populations with Cohorts

Lloyd Tabb, Founder & CTO

Jan 15, 2016

Step 4: Normalizing Population with Cohorts

A cohort is a group of people banded together and treated as a group. In this model, we are going to treat all people of a given gender born in a particular year in a particular state as a cohort.

Cohort Table: The Code

We're going to use LookML's derived table functionality to create a cohort table that takes the raw data and computes the number of people in each cohort. We'll declare dimensions for the cohorts so we can join this table to our usual names table. We'll make the year, state, and gender dimensions hidden so they do not appear in the field picker.

We'll also declare a measure cohort_size, so we can combine smaller cohorts into larger ones. This computation is very difficult to do in other systems but Looker's symmetric aggregates functionality correctly handles this computation.

- view: cohorts
  derived_table:
    sql: |
      SELECT
        year, state, gender, 
        SUM(number) as number,
        row_number() OVER () as id
      FROM names
      GROUP BY 1,2,3

  fields:
  - dimension: id
    primary_key: true
  - dimension: year
    hidden: true
  - dimension: state
    hidden: true
  - dimension: gender
    hidden: true
  - dimension: number
    type:

  - measure: cohort_size
    type: sum
    sql: ${number}

Code: The Join

We're going to join each name to the cohort that has the same gender, year, and state:

# build a percent of cohort to normalize data
- explore: names_step_4
  joins:
  - join: cohorts
    sql_on: |
      ${state} = ${cohorts.state}
      AND ${year} = ${cohorts.year}
      AND ${gender} = ${cohorts.gender}
    relationship: many_to_one

The View

And finally, we simply add a single measure, percentage_of_cohort. This measure will tell us what percentage this particular name or names were of the total cohort. Looker correctly computes this measure no matter what filters we add or which dimensions we group by.

# build out cohort analysis to normalize data by cohort
#  so you can compare names popularity between states or years
#
- view: names_step_4
  extends: names_step_3
  fields:

  - measure: percentage_of_cohort
    type: number
    sql: ${total_population} / ${cohorts.cohort_size}
    value_format_name: percent_2

Explore!

All of the queries below can be created by Exploring From Here. Select dimensions and measures, add filters and bonk the 'Run Button'.

James Population vs. Popularity

So far in this analysis, we've been using population to tell us the popularity of a given name. However, looking at this cohort graph we can see that we've been wrong. One line shows that the population of people with the name James has risen and fallen. The other line shows that from 1920 to 1950 fully 3% of all people were named 'James' but since then the popularity of the name has been on decline. Using cohorts, we can normalize the data to the population size and see actual trends in popularity.

Explore From Here

James Popularity by State

The cohorts work regardless of how we group the data. Here is a graph of popularity of the name 'James' by state. Percentages are against all the male people born in that state for all time.

Explore From Here

James Popularity by Region and Time

We can see the changes in popularity of the name James over time by region. The popularity calculation works regardless of how the data is grouped or filtered.

Explore From Here

Popularity of Some Male Names Over Time

Explore From Here

Here's a list of all of the steps:

Go to the next Name Game page, Step 5: Using Rank to Figure out Name Density and Popularity

Next Previous

Subscribe for the Latest Posts