Fun with Data - Hacking Hacker News

Lloyd Tabb, Founder & CTO

Mar 29, 2016

Last year, Google's Felipe Hoffa uploaded complete data about Hacker News' posts to Google's BigQuery data engine. BigQuery is a giant clustered SQL engine in the cloud that can query enormous amounts of data very quickly. Felipe did some great initial analysis of the dataset, writing SQL by hand and graphing the results in Matlab.

Others on Hacker News picked up the thread and added their own analyses on Hacker News (meta, right?). I loved the idea of exploring this public dataset and wanted to surface some new insights, and also to make it accessible to anyone who wanted to play with the data, whether or not they knew SQL.

So I wrote up a quick model in LookML (Looker's YAML-based modeling language) to describe how the BigQuery tables relate to each other. (It's all of four files and fewer than 300 lines of code; you can view it here).

Below, I'll walk through the process of building out the model and the analyses it makes easy (want to know what domain has the most stories on Hacker News? Or what words are most common? Or which author's stories score the highest? Read on!). But if you're in a hurry, jump right to a dashboard showing some of our favorite findings or explore the data yourself.

Start with the Raw Data

Navigating to the Hacker News data in BigQuery, we see there are two tables, stories and comments. Both tables are relatively simple in structure.

Table Stories

Each story contains a story id, the author who made the post, when it was written, and the score the story achieved, which I believe corresponds to 'points' on news.ycombinator.com. Each story also has a title and the URL containing the content.
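
Before modeling anything, it's worth peeking at a few raw rows. A quick ad hoc query (a sketch in BigQuery's legacy SQL, the dialect used throughout this post) is enough:

SELECT
  id,
  author,
  score,
  time_ts,
  title,
  url
FROM [fh-bigquery:hackernews.stories]
LIMIT 10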

Getting Started

Let's start with stories. First, we run Looker's generator to create a LookML model for the stories table. Each field in the table gets an associated LookML dimension. These dimensions are used both in Looker's Explorer and to write the SQL statements that run in BigQuery.

Without changing any of this code, we can explore the data immediately.

- explore: stories

- view: stories
  sql_table_name: |
     [fh-bigquery:hackernews.stories]

  fields:
  - measure: count
    type: count
    drill_fields: detail*

  - dimension: id
    type: int
    sql: ${TABLE}.id

  - dimension: by
    type: string
    sql: ${TABLE}.[by]

  - dimension: score
    type: int
    sql: ${TABLE}.score

  - dimension: time
    type: int
    sql: ${TABLE}.[time]

  - dimension_group: post
    type: time
    timeframes: [time, date, week, month, year]
    sql: ${TABLE}.time_ts

  - dimension: title
    type: string
    sql: ${TABLE}.title

  - dimension: url
    type: string
    sql: ${TABLE}.url

  - dimension: text
    type: string
    sql: ${TABLE}.text

  - dimension_group: deleted
    type: time
    timeframes: [time, date, week, month]
    sql: ${TABLE}.deleted

  - dimension: dead
    type: yesno
    sql: ${TABLE}.dead

  - dimension: descendants
    type: int
    sql: ${TABLE}.descendants

  - dimension: author
    type: string
    sql: ${TABLE}.author

  sets:
    detail: [id, post_time, author, title]

Let's Start Exploring

The first and most obvious question is, 'How many stories are there?'

Explore From Here

Is the number of stories going up or down? Let's look by year.

Explore From Here

Explore From Here

Looks like stories peaked in 2013.
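
For reference, the SQL behind a by-year count looks roughly like this (a sketch; the column aliases Looker actually generates may differ):

SELECT
  YEAR(stories.time_ts) AS stories_post_year,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
GROUP EACH BY 1
ORDER BY 1
LIMIT 500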

How Do I Get My Story to the Top of Hacker News?

I've tried to post a couple of times, but my stories never seem to go anywhere. Let's find my stories. I'm going to filter by 'lloydt' (my Hacker News name) and let's take a look.

Explore From Here
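
A dimension filter like this simply becomes a WHERE clause. Assuming the filter is on the author dimension, the query is roughly:

SELECT
  YEAR(stories.time_ts) AS stories_post_year,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
WHERE stories.author = 'lloydt'
GROUP EACH BY 1
ORDER BY 1
LIMIT 500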

Clicking on any of the counts will lead to my stories. Clicking on the 4 in the 2013 row's count column shows my stories for that year.

Explore From Here

My best story scored a 4. Looking at the front page of Hacker News at the moment, we see posts with a variety of scores, but the lowest looks to be 7.

How Are the Scores Distributed?

If we use the score as a dimension (group by score, in SQL) and count the number of posts with each score, we get an idea of how likely a story is to achieve a given score. Looking at the table and graph below, we can see that many stories have a score in the 1-4 range, like mine.

Explore From Here

Explore From Here
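
Under the hood, this is just a group-by on the score dimension, roughly (again, a sketch of the generated SQL):

SELECT
  stories.score AS stories_score,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
GROUP EACH BY 1
ORDER BY 1
LIMIT 500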

Lucky 7

The goal here is to figure out whether a story made it to the front page of Hacker News. Many stories are posted each day and most don't get there. The distribution is clearly bifurcated: some stories make it, but most don't.

Unfortunately, there is no obvious way to see the split in the data.

Sometimes, just picking a somewhat arbitrary threshold will help us find it. In this case, I'm going to pick 7 as a threshold for an interesting story. Later, we can investigate different thresholds, but for now, I'm going to say if a story has a score of 7 or more, it's interesting.

Let's build a new dimension in LookML. Note that we can reuse the score definition (${score}) from the main model to create score_7_plus.

  - dimension: score_7_plus
    type: yesno
    sql: ${score} >= 7

Using the new score_7_plus dimension, we run a query and we can see that about 15% (300K/1959K) of the stories have a score of 7 or above.

Explore From Here

Looker, using the LookML model, is writing all the SQL for us behind the scenes and sending it to BigQuery. The query it sent on our behalf was:

SELECT
  CASE WHEN stories.score >= 7 THEN 'Yes' ELSE 'No' END AS stories_score_7_plus,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories]
 AS stories
GROUP EACH BY 1
ORDER BY 2 DESC
LIMIT 500

Who is the King of Hacker News?

Getting a front page story is no easy feat. Let's see if we can figure out whether someone out there does it consistently. We're going to use our lucky 7 as the threshold. To examine this, we keep our grouping by Score 7 Plus and additionally group by Author. Looker lets us pivot the results. We're also going to sort by the Yes Count column to find the person with the most posts with a score of 7 or more.

Explore From Here
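
Looker pivots the Yes/No column in the front end; the query underneath is roughly a two-way grouping (a sketch, not the exact SQL Looker emits):

SELECT
  stories.author AS stories_author,
  CASE WHEN stories.score >= 7 THEN 'Yes' ELSE 'No' END AS stories_score_7_plus,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
GROUP EACH BY 1, 2
ORDER BY 3 DESC
LIMIT 500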

It looks like the author cwan has had the most posts make it to the front page. To see what cwan posts about, we just click on his story count; all Looker counts drill into the detail behind them. Let's look at cwan's posts.

Explore From Here

Finding Top Posters About a Particular Subject

Filtering data to a smaller set can help us find trends about a particular subject.

We can go back to our original "top poster" query and filter it to posts with 'Facebook' in the title. We'll see a different set of people.

Notice that the top poster ssclafani has had 122 posts with 'Facebook' in the title, and 65 of them have a score of 7 or higher (about 50%).

iProject has had 323 posts about Facebook and only 29 have scored 7 or higher (about 10%).

Explore From Here
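
The title filter just adds a WHERE clause to the same grouped query. Roughly (a sketch using legacy SQL's CONTAINS operator; Looker may compile the filter differently):

SELECT
  stories.author AS stories_author,
  CASE WHEN stories.score >= 7 THEN 'Yes' ELSE 'No' END AS stories_score_7_plus,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories] AS stories
WHERE stories.title CONTAINS 'Facebook'
GROUP EACH BY 1, 2
ORDER BY 3 DESC
LIMIT 500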

Often, the devil is in the details. Many times I've clicked into a number, looked at the underlying data records, and spotted a pattern. Let's look at ssclafani's Facebook posts and see if we can find something interesting. By clicking into the 57, we can see her posts.

Explore From Here

Generalizing Hit Rate

Pivoting the data is helpful, but we're still doing some calculations by hand. We can create a couple of custom measures that will help us understand the data more readily.

We'll create a count of just the posts that scored 7 and above.

Then we'll create a measure that is the percentage of all posts that scored 7 and above.

LookML makes creating these measures easy.

Notice that we reuse the definition of score_7_plus.

  - measure: count_score_7_plus
    type: count
    drill_fields: detail*
    filters:
      score_7_plus: Yes

And we reuse the definition of count_score_7_plus in the following definition.

  - measure: percent_7_plus
    type: number
    sql: 100.0 * ${count_score_7_plus} / ${count}
    decimals: 2

With the new measures, we can rebuild and run our previous query. The percentage measure really helps us see that the author Slimy is quite good at placing stories; 65.22% of his stories score 7 or higher.

Explore From Here

Are There People Better than the Author Called 'Slimy'?

Another advantage of creating a new measure is that we can now sort by it. Let's sort by Percent 7 Plus and look at people who have posted more than 5 stories (again, an arbitrary number).

Explore From Here
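
Filtering on a measure (like requiring more than 5 stories) compiles to a HAVING clause rather than a WHERE clause. The query is roughly (a sketch; legacy BigQuery lets us reference the select aliases in HAVING and ORDER BY):

SELECT
  stories.author AS stories_author,
  COUNT(*) AS stories_count,
  COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END) AS stories_count_score_7_plus,
  100.0 * (COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END)) / (COUNT(*)) AS stories_percent_7_plus
FROM [fh-bigquery:hackernews.stories] AS stories
GROUP EACH BY 1
HAVING stories_count > 5
ORDER BY stories_percent_7_plus DESC
LIMIT 500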

Where Do the Stories Live?

Hacker News itself stores only titles and URLs that point elsewhere (plus comments). Let's take a look at where the stories that get posted actually live. To do this, we'll have to parse the host out of the URL, so we'll build a LookML dimension that does it. BigQuery's SQL has a regular expression extractor that makes this pretty easy. (LookML also lets us control the HTML used to display a value, which we'll take advantage of later.)

We add the dimension to our model:

  - dimension: url_host
    sql: REGEXP_EXTRACT(${url},'http://([^/]+)/')

And now we can look at stories by the host they were posted to. Let's sort by Score 7 Plus.

Explore From Here

And a peek at the SQL that Looker wrote for us:

SELECT
  REGEXP_EXTRACT(stories.url,'http://([^/]+)/') AS stories_url_host,
  COUNT(*) AS stories_count,
  COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END) AS stories_count_score_7_plus,
  100.0 * (COUNT(CASE WHEN stories.score >= 7 THEN 1 ELSE NULL END)) / (COUNT(*)) AS stories_percent_7_plus
FROM [fh-bigquery:hackernews.stories]
 AS stories
GROUP EACH BY 1
ORDER BY 3 DESC
LIMIT 500

Domains Are Better

Domains are probably more interesting than hosts. After all, www.techcrunch.com and techcrunch.com both appear in this list. So let's build another field that parses the domain out of the host. We have to be careful with hosts like 'bbc.co.uk', so when a host ends in a two-letter country code we grab an extra segment of the name.

  - dimension: url_domain
    sql: REGEXP_EXTRACT(${url_host},'([^\\.]+\\.[^\\.]+(?:\\.[a-zA-Z].)?)$')

Explore From Here

Are there domains that are more successful than others? Let's look at domains by Percent 7 Plus.

Explore From Here

Whoops, looks like a bunch of one-hit wonders. Let's eliminate domains that have fewer than 20 successful posts.

Explore From Here

Building a Better Indication that a Post Was on the Front Page

There is an old joke about a group of people that encounter a bear in the woods. They all start running from the bear. The joke is that you don't have to outrun the bear, you have to outrun the other people.

Hacker News scores are like that. We probably don't care what the actual score is, we just care that it's better than the other scores being posted on the same day.

We'll rank the scores for each day, starting with 1 as the best score of the day and moving down.

In order to compute the daily rank, we'll need a SQL window function and a derived table in LookML. The output is a two-column table with the id of the story and the rank of the story on the day it was posted.

- view: daily_rank
  derived_table:
     sql: |
        SELECT
           id
          , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
        FROM (
           SELECT
            id
            , DATE(time_ts) as post_date
            , score
           FROM [fh-bigquery:hackernews.stories]
           WHERE score > 0
        )
  fields:
  - dimension: id
    primary_key: true
    hidden: true
  - dimension: daily_rank
    type: number

We can join this table into our stories explore.

- explore: stories
  joins:
  - join: daily_rank
    sql_on: ${stories.id} = ${daily_rank.id}
    relationship: one_to_one

We can then look at our data by daily_rank and count the number of stories at each rank. The data looks right: there are some 3,000 days in the dataset, and one story at each rank for each day.

Explore From Here

The SQL that Looker wrote for this query is below. As the model gets more and more complex, so do the queries, but because Looker is doing the query-writing behind the scenes, asking the question remains simple.

SELECT
  daily_rank.daily_rank AS daily_rank_daily_rank,
  COUNT(*) AS stories_count
FROM [fh-bigquery:hackernews.stories]
 AS stories
LEFT JOIN (SELECT
   id
  , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
FROM (
   SELECT
    id
    , DATE(time_ts) as post_date
    , score
   FROM [fh-bigquery:hackernews.stories]
   WHERE score > 0
)
) AS daily_rank ON stories.id = daily_rank.id
GROUP EACH BY 1
ORDER BY 1
LIMIT 500

Let's Build a New Top 25 Set of Dimensions and Measures

Like we did before, building dimensions and measures into the model will allow us to think in these terms.

We build them in much the same way we built our Score 7 measures. Notice that we simply reference ${daily_rank.daily_rank} and Looker figures out how to write the SQL to make it all fit together.

  # Was this post in the top 25 on a given day?
  - dimension: rank_25_or_less
    type: yesno
    sql: ${daily_rank.daily_rank} <= 25

  # How many posts were in the top 25 out of this group of posts?
  - measure: count_rank_25_or_less
    type: count
    drill_fields: detail*
    filters:
      rank_25_or_less: Yes

  # What percentage of this group of posts were in the top 25?
  - measure: percent_rank_25_or_less
    type: number
    sql: 100.0 * ${count_rank_25_or_less} / ${count}
    decimals: 2

And the simple output: it looks like about 4% of posts make it into the top 25 on a given day.

Explore From Here
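
With the daily_rank join in place, the new measures compile to conditional counts over the joined table. Roughly (a sketch of the shape of the query, not the exact SQL):

SELECT
  COUNT(*) AS stories_count,
  COUNT(CASE WHEN daily_rank.daily_rank <= 25 THEN 1 ELSE NULL END) AS stories_count_rank_25_or_less,
  100.0 * (COUNT(CASE WHEN daily_rank.daily_rank <= 25 THEN 1 ELSE NULL END)) / (COUNT(*)) AS stories_percent_rank_25_or_less
FROM [fh-bigquery:hackernews.stories] AS stories
LEFT JOIN EACH (
  SELECT id, RANK() OVER (PARTITION BY post_date ORDER BY score DESC) AS daily_rank
  FROM (
    SELECT id, DATE(time_ts) AS post_date, score
    FROM [fh-bigquery:hackernews.stories]
    WHERE score > 0
  )
) AS daily_rank ON stories.id = daily_rank.id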

Now let's look at it by poster. Looks like Paul Graham (whose author name is "pg") has had lots of top 25 posts and a very high hit rate.

Explore From Here

Wow, Looking by Domain Gives an Amazing List

Rerunning the query, this time grouping by target domain and keeping only domains with lots of stories ranked 25 or better, gives us a fascinating list. The obvious ones are there -- nytimes, bbc.co.uk -- but scrolling down a little, I find domains I don't know about. Following the links (we'll talk about how to make these later) usually takes me to an interesting place.

Explore From Here

Common Words in Top Posts

We can now find top posts. Let's see if we can figure out why some posts rise to the top. Do top posts talk about something in particular? Let's see if we can find common words in their titles.

First, we're going to build a derived table with two columns: the story id and a word that appeared in the title.

- view: story_words
  derived_table:
    sql: |
      SELECT id, SPLIT(title," ") as word
      FROM [fh-bigquery:hackernews.stories] stories
  fields:
  - dimension: id
    primary_key: true
    hidden: true
  - dimension: word

Next we'll build an explore definition (the join relationships). We'll reuse both our prior views (the story view and the daily_rank view). The basis of this explore is a word, not a story.

- explore: story_words
  joins:
  - join: stories
    sql_on: ${story_words.id}=${stories.id}
    relationship: many_to_one
    type: left_outer_each
  - join: daily_rank
    sql_on: ${stories.id} = ${daily_rank.id}
    relationship: one_to_one
    type: left_outer_each

Now we can explore by word. Let's look at the words in the posts with a rank of 25 or less. Scroll down a little to look past some of the small common words.

Explore From Here

And again, the SQL Looker is writing for us:

SELECT
  story_words.word AS story_words_word,
  COUNT(DISTINCT CASE WHEN daily_rank.daily_rank <= 25 THEN stories.id ELSE NULL END, 1000) AS stories_count_rank_25_or_less
FROM (SELECT id, SPLIT(title," ") as word
FROM [fh-bigquery:hackernews.stories] stories
) AS story_words
LEFT JOIN EACH [fh-bigquery:hackernews.stories]
 AS stories ON story_words.id=stories.id
LEFT JOIN EACH (SELECT
   id
  , RANK() OVER (PARTITION BY post_date ORDER BY score DESC) as daily_rank
FROM (
   SELECT
    id
    , DATE(time_ts) as post_date
    , score
   FROM [fh-bigquery:hackernews.stories]
   WHERE score > 0
)
) AS daily_rank ON stories.id = daily_rank.id
GROUP EACH BY 1
ORDER BY 2 DESC
LIMIT 500

Of course clicking on any of the numbers will drill in and show us the stories.

Eliminating the Common Words with Shakespeare

The common words are a problem. It would be great to eliminate or at least flag them.

To do this, we're going to use an inspired little hack.

BigQuery provides a nice little table of all the words in Shakespeare. The table consists of the word, the corpus it appeared in and what year the corpus was written.

We are going to take the 1000 most common of those words and then flag any title words we encounter that appear in that list.

First, we write a little query to find the 1000 most common words in Shakespeare.

SELECT
    lower(word) as ssword
    , count(distinct corpus) as c
  FROM [publicdata:samples.shakespeare]
  GROUP BY 1
  ORDER BY 2
  DESC
  LIMIT 1000

With this word list, we can modify our derived table to add a new column, ssword, which, if NOT NULL, means the word appears in Shakespeare (and we consider it common).

- view: story_words
  derived_table:
    sql: |
     SELECT a.id as id, a.word as word, b.ssword as ssword
     FROM FLATTEN((
       SELECT id, LOWER(SPLIT(title," ")) as word
          FROM [fh-bigquery:hackernews.stories] stories
       ), word) as a
     LEFT JOIN (
        SELECT lower(word) as ssword
        , count(distinct corpus) as c
        FROM [publicdata:samples.shakespeare]
        GROUP BY 1
        ORDER BY 2
        DESC
        LIMIT 1000) as b
      ON a.word = b.ssword
  fields:
  - dimension: id
    primary_key: true
    hidden: true
  - dimension: word
  - dimension: is_common_word
    type: yesno
    sql: ${TABLE}.ssword IS NOT NULL

Now, rerunning our query with the common-word field, we can see that we've isolated some of the more common words.

Explore From Here

And now without the common words:

Explore From Here

Finally ... Which Words, If in the Title of the Story, Are Most Likely to Get You on the Front Page?

Explore From Here
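
If you wanted to write this one by hand, the shape is: join words to stories and daily ranks, group by word, require a minimum number of stories (100 here, as arbitrary as our earlier thresholds), and sort by the share that hit the top 25. This is only a sketch; the query Looker actually writes also applies the common-word flag and whatever filters are set in the explore.

SELECT
  story_words.word AS story_words_word,
  COUNT(DISTINCT stories.id, 1000) AS stories_count,
  COUNT(DISTINCT CASE WHEN daily_rank.daily_rank <= 25 THEN stories.id ELSE NULL END, 1000) AS stories_count_rank_25_or_less,
  100.0 * (COUNT(DISTINCT CASE WHEN daily_rank.daily_rank <= 25 THEN stories.id ELSE NULL END, 1000)) / (COUNT(DISTINCT stories.id, 1000)) AS stories_percent_rank_25_or_less
FROM (
  SELECT id, SPLIT(title," ") AS word
  FROM [fh-bigquery:hackernews.stories]
) AS story_words
LEFT JOIN EACH [fh-bigquery:hackernews.stories] AS stories ON story_words.id = stories.id
LEFT JOIN EACH (
  SELECT id, RANK() OVER (PARTITION BY post_date ORDER BY score DESC) AS daily_rank
  FROM (
    SELECT id, DATE(time_ts) AS post_date, score
    FROM [fh-bigquery:hackernews.stories]
    WHERE score > 0
  )
) AS daily_rank ON stories.id = daily_rank.id
GROUP EACH BY 1
HAVING stories_count > 100
ORDER BY stories_percent_rank_25_or_less DESC
LIMIT 500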

Comparing

Now, with a few clicks, we can start comparing by filtering the word to Microsoft, Google, and Facebook. Let's compare front page posts by year.

Explore From Here

Wiring This into an Application

The next step is to make a data discovery application and cross-wire all the research we've done so far. We can easily build a dashboard that shows posts over time, by domain, by author, and by word, along with success rates: making it to a score of 7, and going from a score of 7 into the top 25.

We wire up filters for author, domain, and word, so that changing any of them updates all the data on the dashboard.

For example, let's look at Paul Graham (author name "pg"). He is posting a little less over time, likes to post about ycombinator, and talks about yc, applications, hn, and startups. His posts look very successful.

One of the nice things we can do in Looker is create links when we render result cells. Dimensions have an html: parameter that is rendered with Liquid templating.

Using this mechanism, we can cross-link every place we display an author, domain, or word so that it points to a dashboard.

For example, we link the author to both the dashboard and the author's profile page on Hacker News, using emoji as the link icons. (In the html: block, Liquid's {{ value }} variable interpolates the author name into each link.)

  - dimension: author
    type: string
    sql: ${TABLE}.author
    html: |
      <a href="/dashboards/169?author={{ value }}"
        title="Goto Dashboard"
        target=new>⚡</a>
      <a href="https://news.ycombinator.com/user?id={{ value }}"
        title="Goto news.ycombinator.com"
        target=new>➚</a>

Other Ideas to Research

There is lots more to investigate.

  • What's really in a score? Is it related to the number of comments?
  • If it is, are there any cliffs in comments that might be a better indicator of "frontpageness"?
  • If there are multiple people that post the same URL with almost the exact same text, does timing matter?
  • What time of day is the best time to post? Does it matter?
  • Does velocity of comments matter in score? How soon after the post do comments need to happen before a score goes up?
  • Do some commenters matter more than others (for example, if you get a comment from someone who comments a lot, does that help your score more?)