Dec 22, 2014

When you get tempted into buying salsa in the chips aisle of the grocery store or adding one more book to your Amazon cart that “Customers Who Bought This Item Also Bought”, you have experienced how retailers use recommendations to drive more business.

By understanding what customers are purchasing in the same basket or the different goods that specific customer groups purchase over their lifetime, retailers can customize promotions, increase average order sizes, and improve product usage. These models are not only for data scientists - using simple analytics we can get started on the path to a full blown recommendation model.

If your goal is to recommend you a website to someone at random, the top 10 websites on the internet aren’t a bad starting point. If you think about it, this is a very basic recommendation model. Unfortunately, not many people are going to thank you for recommending Google, but a savvier algorithm isn’t much harder.

The following formula is a pretty easy starting point for recommending websites:

Unpacking our formula, the probability of visiting website B given that you visited website A is the probability of visiting both websites, divided by the probability of just visiting site A. Put another way, if we know websites a user has visited (A), then we can rank other websites that users that A visitors frequent (B) - a recommendation algorithm! Unfortunately this has some weakness, as you can see in the image below:

Unfortunately, we haven’t done much better than recommending the top websites on the internet. In this case, for visitors to ESPN.com, as many have visited Facebook as NBA.com, but clearly the relationship is tighter with NBA.com. Thankfully, we can make adjustments:

In this formula, we use the same math as above, but adjust by the frequency of our new site. Thus in the example above, Facebook would be downweighted by it’s enormous reach, while NBA.com would not. Much better! Unfortunately, we have new problems:

Now we are doing a much better job of surfacing more obscure sites, but unfortunately we are now over-biasing towards them. In the example above, we can see that NBA.com is 4x more likely for vistors of ESPN.com, but ESPNBoston is 20x more likely! The problem is that even with this huge lift, ESPNBoston is extremely unlikely. The issue is balance:

Great recommendation needs to balance our two examples above - popularity is important but needs to be weighed against discovery. Fortunately there are lots of ways we can balance the two with very simple equations, so let’s dive in to an example for an e-commerce store.

For this analysis, we’ll use a hypothetical e-commerce store with a pretty typical database set-up:

- An
**orders**database of transactions per user - An
**order_items**database of SKUs included in each order - A
**user**table containing every registrant in the store - A
**products**table with details on each SKU

orders:

id | created_at | user_id |
---|---|---|

125 | 2014-07-02 09:27:83 +0000 | 1053 |

126 | 2014-07-02 09:29:04 +0000 | 1057 |

127 | 2014-07-02 09:31:12 +0000 | 1251 |

order_items:

id | order_id | inventory_id | price |
---|---|---|---|

304 | 125 | 3076 | 137.50 |

305 | 125 | 14567 | 122.50 |

306 | 126 | 201 | 99.50 |

users:

id | created_at | name |
---|---|---|

1 | 2013-08-25 18:12:03 +0000 | Kevin Martin |

2 | 2013-08-25 19:29:04 +0000 | Grace Peters |

3 | 2014-08-26 11:31:12 +0000 | Erica Johnstone |

products:

id | category | item_name | brand | dep |
---|---|---|---|---|

1 | Legging | Retro Electric Patchwork Printed Cotton Leggings | Yelete | Women |

2 | Socks | Gold Toe Men's Freshcare Dress Rib 3 Pack | Gold Toe | Men |

3 | Socks | PACT Men's Signature Multistripe Socks | PACT | Men |

Note that all four tables can be joined into order_items, resulting in a larger table with order level, product level, and user level information for each order_item. Using this data set, we can begin to construct important fact tables that will let us quickly calculate relationships between fields like ‘category’ , ‘item_name’, or ‘brand’. And we can calculate these affinities at the order level or the user level.

Ambitious users could even aggregate or filter using different attributes at both the user level (age, gender, join date, location) or order level (time of day, day of year). For example, we could examine SKU level co-purchase rates for males 18-35 for an e-commerce store, or page-view affinity for mobile traffic vs web traffic to a news site. This are recommendation models.

For this affinity analysis, we will using Jaccard Similarity, but there are many different methodologies for performing the calculation. An example calculation will make the math clear. Let’s imagine a store with 3 products and 10 customers that have purchased 17 items in total.

user_id | purchase_a_yesno | purchase_b_yesno | purchase_c_yesno |
---|---|---|---|

1 | 1 | 1 | |

2 | 1 | 1 | 1 |

3 | 1 | ||

4 | 1 | 1 | |

5 | 1 | 1 | |

6 | 1 | ||

7 | 1 | ||

8 | 1 | ||

9 | 1 | 1 | |

10 | 1 | 1 | |

Total | 5 | 5 | 7 |

To calculate similarity scores between the items, we simply find the number of customers in the intersection of two products (A ∩ B), divided by the union of the same two products (A U B).

Grouping product_a and product_b:

- Union: 9 customers bought one of the two products (customer 3 did not)
- Intersection: 1 customers bought both products (customer 2)
- Affinity = 1/9 = 0.11

Grouping product_a and product_c:

- Union: 7 customers bought one of the two products (customers 6,7,8 did not)
- Intersection: 5 customers bought both products (customers 1,2,4,5,10)
- Affinity = 5/7 = 0.71

Grouping product_b and product_c:

- Union: all 10 customers bought one of the two products
- Intersection: 2 customers bought both products (customers 2 and 9)
- Affinity = 2/10 = 0.20

Generating affinity scores across large data sets is as simple in practice as our example above. We need to simply generate the intersection of a given pair and the union of a given pair. Thankfully, we can use a quick shortcut to speed up the process (visualized below).

The intersection is simply the total number of instances where both A and B occurred (the purple square). The union appears a bit trickier to calculate, due to finding three individual cases. But note how simply adding all occurrences of Purchase A (the red, first column) and all occurrences of Purchase B (the blue, first row), and then subtracting the double-counted intersection (the purple square).

Thus we are left with:

In Looker, we can use derived tables to quickly do these calculations. Continuing our SQL example above, we first enumerate every order, product pair:

```
- view: order_product
derived_table:
persist_for: 24 hours
indexes: [order_id]
sql: |
SELECT o.id as order_id
, oi.inventory_item_id as inventory_item_id
, p.item_name as product
FROM order_items oi
JOIN orders o ON o.id = oi.order_id
JOIN inventory_items ii ON oi.inventory_item_id = ii.id
JOIN products p ON ii.product_id = p.id
GROUP BY order_id, item_name
```

We then calculate a simple count distinct of the order ids for each item (the purchase count):

```
- view: total_order_product
derived_table:
persist_for: 24 hours
indexes: [product]
sql: |
SELECT p.item_name as product
, count(distinct p.item_name, o.id) as product_frequency
FROM order_items oi
JOIN orders o ON o.id = oi.order_id
JOIN inventory_items ii ON oi.inventory_item_id = ii.id
JOIN products p ON ii.product_id = p.id
GROUP BY p.item_name
```

Now using these roll-up tables, we can examine the joint frequency of any two products, by joining the order, product table to itself:

```
- view: order_purchase_affinity
derived_table:
persist_for: 24 hours
indexes: [product_a]
sql: |
SELECT product_a
, product_b
, joint_frequency
, top1.product_frequency as product_a_frequency
, top2.product_frequency as product_b_frequency
FROM (
SELECT op1.product as product_a
, op2.product as product_b
, count(*) as joint_frequency
FROM ${order_product.SQL_TABLE_NAME} as op1
JOIN ${order_product.SQL_TABLE_NAME} op2 ON op1.order_id = op2.order_id
GROUP BY product_a, product_b
) as prop
JOIN ${total_order_product.SQL_TABLE_NAME} as top1 ON prop.product_a = top1.product
JOIN ${total_order_product.SQL_TABLE_NAME} as top2 ON prop.product_b = top2.product
ORDER BY product_a, joint_frequency DESC, product_b
```

The result is a table with every product pair that was ordered together (product_a and product_b), along with the joint frequency (joint_frequency), and the raw frequency of each product individually (product_a_frequency and product_b_frequency).

The affinity is then simply:

```
joint_frequency / (product_a_frequency + product_b_frequency - joint_frequency)
```

The analysis does not need to stop at order affinity.

This same analysis could be done at the user level, to generate email or product recommendations (Amazon’s users who bought X), or at the pageview level, to generate article recommendations (reader who liked this article also read X).

This analysis could also be used to generate more general category or sub-category recommendations, rather than simply recommending products (like these kid’s shoes, check out selection of kid’s hats).

Subsetting the ‘recommendation set’ based upon user attributes (age, location, gender, etc) would allow for “personalized” recommendations that could cohort and target users more granularly (say pointing LA denim buyers to t-shirts and Vermont denim buyers to outerwear).

Finally, time-sliced subsetting could allow for seasonal recommendation changes (maintaining fresh article recommendations on a news site for example).