# Name Game: step 4 - normalizing populations with cohorts

Jan 15, 2016

A cohort is a group of people banded together and treated as a group. In this model, we are going to treat all people of a given gender born in a particular year in a particular state as a cohort.

## Cohort table: the code

We're going to use LookML's derived table functionality to create a cohort table that takes the raw data and computes the number of people in each cohort. We'll declare dimensions for the cohorts so we can join this table to our usual names table. We'll make the year, state, and gender dimensions hidden so they do not appear in the field picker.

We'll also declare a measure cohort_size, so we can combine smaller cohorts into larger ones. This computation is very difficult to do in other systems but Looker's symmetric aggregates functionality correctly handles this computation.

``````view: cohorts {
derived_table: {
sql: SELECT
year, state, gender,
SUM(number) as number,
row_number() OVER () as id
FROM names
GROUP BY 1,2,3
;;
}

dimension: id {
primary_key: yes
}

dimension: year {
hidden: yes
}

dimension: state {
hidden: yes
}

dimension: gender {
hidden: yes
}

dimension: number {
#X# Invalid LookML inside "dimension": {"type":null}
}

measure: cohort_size {
type: sum
sql: \${number} ;;
}
}
``````

## Code: the join

We're going to join each name to the cohort that has the same gender, year, and state:

``````# build a percent of cohort to normalize data
explore: names_step_4 {
join: cohorts {
sql_on: \${state} = \${cohorts.state}
AND \${year} = \${cohorts.year}
AND \${gender} = \${cohorts.gender}
;;
relationship: many_to_one
}
}
``````

## The view

And finally, we simply add a single measure, percentage_of_cohort. This measure will tell us what percentage this particular name or names were of the total cohort. Looker correctly computes this measure no matter what filters we add or which dimensions we group by.

``````# build out cohort analysis to normalize data by cohort
#  so you can compare names popularity between states or years
#
view: names_step_4 {
extends: [names_step_3]

measure: percentage_of_cohort {
type: number
sql: \${total_population} / \${cohorts.cohort_size} ;;
value_format_name: percent_2
}
}
``````

## Explore!

All of the queries below can be created by Exploring from here. Select dimensions and measures, add filters and bonk the 'Run Button'.

## James population vs. popularity

So far in this analysis, we've been using population to tell us the popularity of a given name. However, looking at this cohort graph we can see that we've been wrong. One line shows that the population of people with the name James has risen and fallen. The other line shows that from 1920 to 1950 fully 3% of all people were named 'James' but since then the popularity of the name has been on decline. Using cohorts, we can normalize the data to the population size and see actual trends in popularity.

Explore from here

## James popularity by state

The cohorts work regardless of how we group the data. Here is a graph of popularity of the name 'James' by state. Percentages are against all the male people born in that state for all time.

Explore from here

## James popularity by region and time

We can see the changes in popularity of the name James over time by region. The popularity calculation works regardless of how the data is grouped or filtered.

Explore from here

## Popularity of some male names over time

Explore from here

Here's a list of all of the steps:

Go to the next Name Game page, Step 5: Using Rank to Figure out Name Density and Popularity