Recommendations Are Easier Than You Think

Colin Zima, Chief Analytics Officer

Dec 22, 2014

Getting Started

When you get tempted into buying salsa in the chips aisle of the grocery store, or into adding one more book to your Amazon cart from the “Customers Who Bought This Item Also Bought” list, you have experienced how retailers use recommendations to drive more business.

By understanding what customers are purchasing in the same basket, or the different goods that specific customer groups purchase over their lifetime, retailers can customize promotions, increase average order sizes, and improve product usage. These models are not only for data scientists: using simple analytics, we can get started on the path to a full-blown recommendation model.

Recommendations in an Instant

If your goal is to recommend a website to someone at random, the top 10 websites on the internet aren’t a bad starting point. If you think about it, this is a very basic recommendation model. Unfortunately, not many people are going to thank you for recommending Google, but a savvier algorithm isn’t much harder.

The following formula is a pretty easy starting point for recommending websites:
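
P(B | A) = P(A ∩ B) / P(A)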

Unpacking our formula, the probability of visiting website B given that you visited website A is the probability of visiting both websites, divided by the probability of visiting site A. Put another way, if we know a website a user has visited (A), we can rank the other websites (B) that visitors to A frequent most often: a recommendation algorithm! Unfortunately this has some weaknesses, as the example below shows:

Unfortunately, we haven’t done much better than recommending the top websites on the internet. In this case, for visitors to ESPN.com, as many have visited Facebook as NBA.com, but clearly the relationship is tighter with NBA.com. Thankfully, we can make adjustments:
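
P(B | A) / P(B) = P(A ∩ B) / (P(A) × P(B))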

In this formula, we use the same math as above, but adjust by the frequency of our new site. Thus, in the example above, Facebook would be downweighted by its enormous reach, while NBA.com would not. Much better! Unfortunately, we have new problems:

Now we are doing a much better job of surfacing more obscure sites, but unfortunately we are now over-biasing towards them. In the example above, we can see that NBA.com is 4x more likely among visitors of ESPN.com, but ESPNBoston is 20x more likely! The problem is that even with this huge lift, a visit to ESPNBoston is still extremely unlikely. The issue is balance:

Great recommendations need to balance our two examples above: popularity is important, but it needs to be weighed against discovery. Fortunately, there are lots of ways to balance the two with very simple equations, so let’s dive into an example for an e-commerce store.

Our Store

For this analysis, we’ll use a hypothetical e-commerce store with a pretty typical database set-up:

  • An orders table of transactions per user
  • An order_items table of the SKUs included in each order
  • A users table containing every registrant in the store
  • A products table with details on each SKU

orders:

id  | created_at                | user_id
125 | 2014-07-02 09:27:38 +0000 | 1053
126 | 2014-07-02 09:29:04 +0000 | 1057
127 | 2014-07-02 09:31:12 +0000 | 1251

order_items:

id  | order_id | inventory_item_id | price
304 | 125      | 3076              | 137.50
305 | 125      | 14567             | 122.50
306 | 126      | 201               | 99.50

users:

id | created_at                | name
1  | 2013-08-25 18:12:03 +0000 | Kevin Martin
2  | 2013-08-25 19:29:04 +0000 | Grace Peters
3  | 2014-08-26 11:31:12 +0000 | Erica Johnstone

products:

id | category | item_name                                        | brand    | department
1  | Legging  | Retro Electric Patchwork Printed Cotton Leggings | Yelete   | Women
2  | Socks    | Gold Toe Men's Freshcare Dress Rib 3 Pack        | Gold Toe | Men
3  | Socks    | PACT Men's Signature Multistripe Socks           | PACT     | Men

Note that all four tables can be joined to order_items, resulting in a larger table with order-level, product-level, and user-level information for each order_item. Using this data set, we can begin to construct the fact tables that will let us quickly calculate relationships between fields like ‘category’, ‘item_name’, or ‘brand’. And we can calculate these affinities at the order level or the user level.

Ambitious users could even aggregate or filter using different attributes at both the user level (age, gender, join date, location) and the order level (time of day, day of year). For example, we could examine SKU-level co-purchase rates for males 18-35 in an e-commerce store, or page-view affinity for mobile traffic vs. web traffic on a news site. These are recommendation models.

Example Analysis

For this affinity analysis, we will use Jaccard similarity, but there are many different methodologies for performing the calculation. An example calculation will make the math clear. Let’s imagine a store with 3 products and 10 customers who have purchased 17 items in total.

user_id | purchase_a_yesno | purchase_b_yesno | purchase_c_yesno
1       | 1                |                  | 1
2       | 1                | 1                | 1
3       |                  |                  | 1
4       | 1                |                  | 1
5       | 1                |                  | 1
6       |                  | 1                |
7       |                  | 1                |
8       |                  | 1                |
9       |                  | 1                | 1
10      | 1                |                  | 1
Total   | 5                | 5                | 7

To calculate similarity scores between the items, we simply find the number of customers in the intersection of two products (A ∩ B), divided by the number of customers in the union of the same two products (A ∪ B).
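
Affinity = (A ∩ B) / (A ∪ B)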

Grouping product_a and product_b:

  • Union: 9 customers bought one of the two products (customer 3 did not)
  • Intersection: 1 customer bought both products (customer 2)
  • Affinity = 1/9 = 0.11

Grouping product_a and product_c:

  • Union: 7 customers bought one of the two products (customers 6,7,8 did not)
  • Intersection: 5 customers bought both products (customers 1,2,4,5,10)
  • Affinity = 5/7 = 0.71

Grouping product_b and product_c:

  • Union: all 10 customers bought one of the two products
  • Intersection: 2 customers bought both products (customers 2 and 9)
  • Affinity = 2/10 = 0.20
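
To make the mechanics concrete, here is a minimal sketch in PostgreSQL-flavored SQL that reproduces these three scores; the purchases table is hypothetical and simply encodes the toy data above in long format (one row per user, product purchase):

WITH purchases (user_id, product) AS (
  VALUES (1, 'a'), (1, 'c'),
         (2, 'a'), (2, 'b'), (2, 'c'),
         (3, 'c'),
         (4, 'a'), (4, 'c'),
         (5, 'a'), (5, 'c'),
         (6, 'b'), (7, 'b'), (8, 'b'),
         (9, 'b'), (9, 'c'),
         (10, 'a'), (10, 'c')
),
totals AS (
  SELECT product, COUNT(*) AS buyers FROM purchases GROUP BY product
)
SELECT p1.product AS product_x
, p2.product AS product_y
, COUNT(*) AS intersection_count                    -- customers who bought both
, t1.buyers + t2.buyers - COUNT(*) AS union_count   -- customers who bought either
, 1.0 * COUNT(*) / (t1.buyers + t2.buyers - COUNT(*)) AS affinity
FROM purchases p1
JOIN purchases p2 ON p1.user_id = p2.user_id AND p1.product < p2.product
JOIN totals t1 ON t1.product = p1.product
JOIN totals t2 ON t2.product = p2.product
GROUP BY p1.product, p2.product, t1.buyers, t2.buyers
ORDER BY product_x, product_y

This returns affinities of 0.11 for (a, b), 0.71 for (a, c), and 0.20 for (b, c), matching the calculations above.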

Similarity Scores in Practice

Generating affinity scores across large data sets is just as simple in practice as our example above: we simply need to generate the intersection and the union of each pair of items. Thankfully, there is a quick shortcut that speeds up the process.

The intersection is simply the total number of instances where both A and B occurred. The union looks a bit trickier to calculate, since it covers three separate cases (A only, B only, and both). But we can get it by adding all occurrences of Purchase A to all occurrences of Purchase B, and then subtracting the double-counted intersection.

Thus we are left with:
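
(A ∪ B) = A + B - (A ∩ B)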

In Looker, we can use derived tables to do these calculations quickly. Continuing with our e-commerce example, we first enumerate every (order, product) pair:

- view: order_product
  derived_table:
    persist_for: 24 hours
    indexes: [order_id]
    sql: |
      -- One de-duplicated row per (order, product) pair
      SELECT o.id as order_id
      , p.item_name as product
      FROM order_items oi
      JOIN orders o ON o.id = oi.order_id
      JOIN inventory_items ii ON oi.inventory_item_id = ii.id
      JOIN products p ON ii.product_id = p.id
      GROUP BY o.id, p.item_name

We then calculate a simple count distinct of the order ids for each item (the purchase count):

- view: total_order_product
  derived_table:
    persist_for: 24 hours
    indexes: [product]
    sql: |
      -- Number of distinct orders that contain each product
      SELECT p.item_name as product
      , count(distinct o.id) as product_frequency
      FROM order_items oi
      JOIN orders o ON o.id = oi.order_id
      JOIN inventory_items ii ON oi.inventory_item_id = ii.id
      JOIN products p ON ii.product_id = p.id
      GROUP BY p.item_name

Now, using these roll-up tables, we can examine the joint frequency of any two products by joining the (order, product) table to itself:

- view: order_purchase_affinity
  derived_table:
    persist_for: 24 hours
    indexes: [product_a]
    sql: |
      SELECT product_a
      , product_b
      , joint_frequency
      , top1.product_frequency as product_a_frequency
      , top2.product_frequency as product_b_frequency
      FROM (
        -- Self-join the (order, product) table to count how often each pair
        -- of products appears in the same order
        SELECT op1.product as product_a
        , op2.product as product_b
        , count(*) as joint_frequency
        FROM ${order_product.SQL_TABLE_NAME} as op1
        JOIN ${order_product.SQL_TABLE_NAME} as op2 ON op1.order_id = op2.order_id
        GROUP BY product_a, product_b
      ) as prop
      JOIN ${total_order_product.SQL_TABLE_NAME} as top1 ON prop.product_a = top1.product
      JOIN ${total_order_product.SQL_TABLE_NAME} as top2 ON prop.product_b = top2.product
      ORDER BY product_a, joint_frequency DESC, product_b

The result is a table with every product pair that was ordered together (product_a and product_b), along with the joint frequency (joint_frequency), and the raw frequency of each product individually (product_a_frequency and product_b_frequency).

The affinity is then simply:

joint_frequency / (product_a_frequency + product_b_frequency - joint_frequency)
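
As a minimal sketch, this could be exposed as a field on the order_purchase_affinity view (the field names are illustrative, and the exact syntax may vary by LookML version):

  fields:
  - dimension: product_a
    sql: ${TABLE}.product_a

  - dimension: product_b
    sql: ${TABLE}.product_b

  - dimension: affinity
    type: number
    sql: 1.0 * ${TABLE}.joint_frequency / (${TABLE}.product_a_frequency + ${TABLE}.product_b_frequency - ${TABLE}.joint_frequency)

Sorting by affinity within a given product_a then yields a ranked list of recommendations for that product.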

Extensions

The analysis does not need to stop at order affinity.

  • This same analysis could be done at the user level, to generate email or product recommendations (Amazon’s users who bought X), or at the pageview level, to generate article recommendations (readers who liked this article also read X).

  • This analysis could also be used to generate more general category or sub-category recommendations, rather than simply recommending products (like these kids’ shoes? check out our selection of kids’ hats).

  • Subsetting the ‘recommendation set’ based upon user attributes (age, location, gender, etc.) would allow for “personalized” recommendations that could cohort and target users more granularly (say, pointing LA denim buyers to t-shirts and Vermont denim buyers to outerwear); a sketch of this follows the list.

  • Finally, time-sliced subsetting could allow for seasonal recommendation changes (maintaining fresh article recommendations on a news site for example).
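
As a minimal sketch of the subsetting idea above, the order_product roll-up could be restricted to a cohort before the self-join; the gender and age columns below are hypothetical, since they are not in the sample users table:

- view: order_product_male_18_35
  derived_table:
    persist_for: 24 hours
    indexes: [order_id]
    sql: |
      -- Same (order, product) roll-up as before, limited to a hypothetical cohort
      SELECT o.id as order_id
      , p.item_name as product
      FROM order_items oi
      JOIN orders o ON o.id = oi.order_id
      JOIN users u ON o.user_id = u.id
      JOIN inventory_items ii ON oi.inventory_item_id = ii.id
      JOIN products p ON ii.product_id = p.id
      WHERE u.gender = 'Male'          -- hypothetical column
      AND u.age BETWEEN 18 AND 35      -- hypothetical column
      GROUP BY o.id, p.item_name

The downstream frequency and affinity tables would then be built from this cohort-specific roll-up in exactly the same way.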
