Government-collected datasets contain a wealth of information about our world--everything from which roads have the most potholes to who votes. And in recent years, governments and other public agencies have made huge strides in opening datasets to the public.
Projects like Data.gov, the NYC Open Data Portal, and Data.gov.uk catalog tens of thousands of fascinating datasets that anyone can download and explore. But even though we’ve come a long way, the promise of these datasets--to shed light on how government is performing, on where discrimination persists, on how to increase civic participation--is far from fulfilled.
The data is more accessible than it’s ever been, but for the vast majority of citizens, having to download a huge CSV to their computer or having to map latitudes and longitudes manually is such a barrier to entry that the data might as well be locked away in private repositories.
That’s why we’ve partnered with Google Cloud Platform to take some of those fascinating datasets and make them available for querying with Looker on Google BigQuery. That takes data from being nominally accessible, to actually explorable for anyone with an internet connection.
Of all the public datasets, one of the richest is the one collected by the U.S. Census Bureau. Many people don’t know this, but the Census does far more than just the decennial survey. It also includes surveys like the Current Population Survey and the American Community Survey that ask a sample of Americans literally hundreds of questions about their lives.
Since it’s an election year, we thought a nice dataset to start with would be the Community Population Survey’s Voting and Registration Supplement, which is collected in November after every election and goes back to 1994. The problem is, if you go to download this data, you’re presented with a data file that looks like this:
Not very user-friendly. To make sense of this, you need to consult the codebook, which is different for every survey and survey year and looks something like this:
Needless to say, without very specialized tools, the process of extracting meaning from this data is usually quite onerous. The data isn’t super tall--about 1.5 million rows--but each row is more than 300 columns wide. And since each respondent is given a different weight in the survey, making sense of the data is no easy task.
Luckily, giving meaning to data is exactly what Looker is good at. And with the power of Google’s BigQuery behind the scenes, we can slice and dice the data in millions of ways and get answers in seconds.
To transform the 60,000+ lines of codebooks into something useful, we’ve written a couple of Python scripts to rewrite the information in LookML (Looker’s modeling language). A few hundred lines of scripting transforms the codebooks from an impenetrable mess into code that Looker deploys to let you query the data directly in BigQuery, no special knowledge required.
And once we deploy Looker, what do we see?
We see that the percentage of voters who are white, non-Hispanic, and haven’t taken any college courses has been shrinking steadily, going from 34% in 1994 to just 20% in 2014.
We see that Baby Boomers made up the largest share of voters in 2012. And although there were far more Millennials eligible to vote than members of the Silent Generation, there were actually more voters from the Silent Generation. We’ll see how that looks in 2016 soon.
We can also easily see non-voting related trends, like the housing bubble. We see that California, Florida and Nevada, three states that were devastated by the housing bubble, saw big increases in population and homeowners from 2000 - 2006. But from 2006 to 2012, their population increases slowed and huge numbers of homeowners left or lost their houses.
But states like Texas, Georgia and North Carolina, which weathered the housing bubble relatively well, saw increases in homeowners from 2000 to 2006 and from 2006 to 2012 (albeit at a slower rate). These states have continued to see strong growth in both overall population and homeowners.
These analyses only scratch the surface of what we can discover when we have an easy way to explore the Census’ incredibly rich datasets. We’ve added a bunch more shareable facts that we’ve discovered at CensusLooker. And over the next weeks and months, we’ll be releasing additional analyses and datasets that we hope you’ll use as jumping off points for your own analyses.
But if you don’t want to wait, next week, I’ll be doing a Reddit AMA where you can ask me to look up any question that the Census can tell us about. I’ll provide answers live on Thursday, July 28 at 1 p.m. ET.
P.S. Here are a bunch of other questions that can easily be answered from this data set:
By state, what percentage of workers work for the government? (Washington, D.C. is actually #6, not #1.)
How many men vs. women have served in the Armed Forces since 2001? (Overwhelmingly male, but more gender-balanced than previous periods)
Which state has the smallest percentage of white, non-Hispanic residents? (Think tropical.)
Does when you immigrated predict your chance making >$60,000? (It’s not a perfect correlation by any chance, but earlier immigrants do seem to make more.)
How many teenagers are in the U.S.? (That’s a lot of teenage angst.)
Which Metropolitan Statistical Areas have the highest percentages of government workers? (Anyone know what’s in Gainesville, FL?)
Do naturalized citizens earn more than foreign-born, non-citizens? (In general, yes, though the difference is maybe less pronounced than you’d think.)
And that’s just a taste. If you have questions of your own, come to my Reddit AMA on July 28.