
How I Built a Recommender System for a Leading Fitness Company

- 4 mins

Recommender systems harness data to connect people with what they are likely to want. One way to recommend content is to filter it for a user based on that user's similarity to other users. This is called collaborative filtering. Using this method, I created a recommender system for a leading fitness company and deployed it to production in its app.

The company’s app allows users to view videos, articles, and workout plans. The company’s leadership and I decided to focus on increasing user engagement through video recommendations. My recommender system suggests videos to users based on their viewing history.

A key challenge of this kind of recommender system is deciding how to make recommendations for users who haven’t watched any videos yet. This is called the cold start problem.

To overcome this challenge, I took a two-part approach: collaborative filtering for users with viewing history, and engagement-based recommendations for users without any.

Collaborative Filtering

Say Amanda has watched video A and video B. How can we determine what she would like to watch next? Suppose Bob has watched videos A, B, and C, and that no other users have watched both videos A and B. Then, Amanda’s viewing history is more similar to Bob’s than it is to that of any other user. Amanda and Bob appear to have similar preferences, so it is probably a good bet to recommend the other video Bob has watched — video C — to Amanda. This is essentially how collaborative filtering works. For a given person, find the other people who are most similar to her. Then, predict that this person will behave as the others have.

In the previous example, Amanda’s set of viewed videos is a subset of Bob’s set of viewed videos, and is not a subset of anyone else’s. However, what happens if no one else has viewed all the videos that Amanda has viewed? Say Amanda has now watched videos A, B, and C, while other users have watched only A and C, or A and B, but no one else has watched all three. In this case, we still want to figure out which other users are most similar to Amanda, but we need a measure of similarity. I used cosine similarity. If you imagine a given user’s video-watching history as a vector, then the cosine similarity between two users is the cosine of the angle between their vectors. For two users A and B, their cosine similarity is defined as:

$ \text{similarity} =\cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over |\mathbf{A}| |\mathbf{B}|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} } $

After computing the similarity between all users, I estimated the likelihood of each user watching each video. Then, for each user, I ordered the videos by the likelihood that they would watch them. The end result is, for each user, a list of their top five recommended videos.
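As an illustrative sketch (not the production code), user histories can be encoded as binary vectors, pairwise cosine similarities computed with the formula above, and each unseen video scored by a similarity-weighted sum of other users' views. The users and videos here are the toy examples from the text:

```python
import numpy as np

# Toy user-video watch matrix: rows = users, columns = videos A, B, C, D.
# A 1 means the user watched that video. Data is hypothetical.
watched = np.array([
    [1, 1, 0, 0],  # Amanda: watched A and B
    [1, 1, 1, 0],  # Bob:    watched A, B, and C
    [0, 1, 0, 1],  # Carol:  watched B and D
], dtype=float)

# Cosine similarity between every pair of users:
# (A . B) / (|A| |B|), computed for all pairs at once.
norms = np.linalg.norm(watched, axis=1, keepdims=True)
similarity = (watched @ watched.T) / (norms @ norms.T)

# Score each video for each user as a similarity-weighted sum of
# other users' views, then exclude videos the user already watched.
np.fill_diagonal(similarity, 0)   # ignore self-similarity
scores = similarity @ watched
scores[watched > 0] = -np.inf     # never re-recommend a watched video

# Top recommendation for Amanda (user 0):
top_video = int(np.argmax(scores[0]))
print(top_video)  # 2, i.e. video C, watched by her most similar user, Bob
```

Ranking the scores instead of taking a single argmax yields each user's top-five list described above.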

Here is an example of a recommendation for one user:

| member_id | video                     | score |
|-----------|---------------------------|-------|
| 1         | Foundation Cardio Blast 1 | 0.93  |

Which Videos Drive Engagement?

For users with no viewing history, I recommended the videos that most drive engagement. I defined engagement as the number of actions a user completed on the app, such as reading an article or watching a fitness class. For each video, I found how many actions a user completes, on average, in the month after viewing it. I’ll call this next-month engagement — a measure of how engaged a user tends to be given that they watched a certain video.
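A minimal sketch of the next-month engagement computation, assuming a hypothetical schema (the column names and numbers below are illustrative, not the company's actual data): join each video view to the viewer's action count in the following month, then average per video.

```python
import pandas as pd

# Hypothetical view log: one row per (member, video) view.
views = pd.DataFrame({
    "member_id": [1, 2, 3, 4],
    "video": ["Cardio Blast", "Cardio Blast", "Yoga Basics", "Yoga Basics"],
    "view_month": ["2020-01"] * 4,
})

# Hypothetical action counts per member in the month after viewing.
# In production one would match each view_month to the following month;
# here there is only one month, so a join on member_id suffices.
actions = pd.DataFrame({
    "member_id": [1, 2, 3, 4],
    "month": ["2020-02"] * 4,
    "n_actions": [12, 8, 3, 5],
})

# Average next-month actions per video = next-month engagement.
merged = views.merge(actions, on="member_id")
next_month = merged.groupby("video")["n_actions"].mean()
print(next_month)
# Cardio Blast averages 10 actions; Yoga Basics averages 4.
```

Sorting this series descending and taking the head gives the cold-start recommendation list.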

Here are the ten videos with highest next-month engagement:

The app recommends these videos to users with no viewing history.

The following plot shows view counts for the top ten most-watched videos compared to their next-month engagement:

As you can see, the most popular videos are not necessarily the most engaging.

I also used a random forest model to predict user engagement in the second month from user video watching history in the first month. This served as a sanity check for the engagement analysis.

Random forest models work by constructing a large number of decision trees, each of which uses only a subset of all features in order to make predictions. This can help avoid the curse of dimensionality — a collection of problems that arise because high-dimensional data creates a large search space, making it difficult for some machine learning algorithms to make predictions. Random forest models not only produce a predicted value of the outcome variable, but also a measure of variable importance. I used this assessment of variable importance to make inferences about what types of videos are the top drivers of engagement.
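A sketch of this sanity check under stated assumptions: the data below is synthetic (the real model was trained on actual watch histories), with two videos deliberately constructed to drive second-month engagement so the recovered importances can be checked against ground truth.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic first-month watch matrix: users x videos, binary.
n_users, n_videos = 500, 10
X = rng.integers(0, 2, size=(n_users, n_videos)).astype(float)

# Synthetic second-month engagement: videos 0 and 3 drive it, plus noise.
y = 5 * X[:, 0] + 3 * X[:, 3] + rng.normal(0, 1, n_users)

# Each tree in the forest sees a random subset of features per split,
# which helps with high-dimensional watch histories.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# feature_importances_ gives per-video importance in predicting
# next-month engagement; the two engineered driver videos dominate.
importance = model.feature_importances_
top_two = np.argsort(importance)[::-1][:2]
print(top_two)
```

In the real analysis, the analogous importance ranking was compared against the next-month engagement results as a consistency check.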

Below is a plot of variable importance. Note that the variables (i.e., videos) that are important in predicting engagement are among those with highest next-month engagement:

Here is an overview of the tools I used in this project:
