Machine Learning Part 6：Recommender System

All of this series is mainly based on the Machine Learning course given by Andrew Ng, which is hosted on cousera.org.

Recommender System

Predicting Movie Rating

-w923

Content Based Recommender system

-w1438

Note: For each movie i, we have some features, with $x^{(i)}_0=1$.

Problem Formulation

-w543

Optimization Objective To learn $\theta^{(j)} $ (parameters for user j):

\[\min _{\theta(j)} \frac{1}{2} \sum_{i: r(i, j)=1}\left(\left(\theta^{(j)}\right)^{T} x^{(i)}-y^{(i, j)}\right)^{2}+\frac{\lambda}{2} \sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}\]

To learn $\theta^{(j)}, \forall j$:

\[\min _{\theta^{(1)}, \ldots, \theta^{\left(n_{u}\right)}} \frac{1}{2} \sum_{j=1}^{n_{u}} \sum_{i: r(i, j)=1}\left(\left(\theta^{(j)}\right)^{T} x^{(i)}-y^{(i, j)}\right)^{2}+\frac{\lambda}{2} \sum_{j=1}^{n_{u}} \sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}\]

My Thoughts: After we get the $\theta \in R^{n+1}$ for each user, we can try to use PCA to divide the user into a limited number of representative groups: $Group_g,g=1,….,G$. Since for these group, we can have more ratings available, we can get more information about this group, and derive better $\theta$ for each of these group. After this, we find the approximation to $user_j$: which is a linear combination of $Group_g,g=1,….,G$. The r(i,j) is then approximated by the linear combination of the rating for the groups.

Collaborative Filtering

A algorithm which can find which feature to use. After we already get the $\theta$ for the user, based on their ratings on a movie, we can estimate the features of the movie.

Optimization Algorithm

\[\begin{array}{l}{\text { Given } \theta^{(1)}, \ldots, \theta^{\left(n_{u}\right)}, \text { to learn } x^{(i)}:} \\ {\quad \min _{x^{(i)}} \frac{1}{2} \sum_{j: r(i, j)=1}\left(\left(\theta^{(j)}\right)^{T} x^{(i)}-y^{(i, j)}\right)^{2}+\frac{\lambda}{2} \sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}}\end{array}\]

Further, we can learn the features $x_{k}$ for all the movies.

\[\begin{array}{l}{\text { Given } \theta^{(1)}, \ldots, \theta^{\left(n_{u}\right)}, \text { to learn } x^{(1)}, \ldots, x^{\left(n_{m}\right)}:} \\ {\qquad \begin{array}{l}{\min _{x^{(1)}, \ldots, x^{\left(n_{m}\right)}} \frac{1}{2} \sum_{i=1}^{n_{m}} \sum_{j: r(i, j)=1}\left(\left(\theta^{(j)}\right)^{T} x^{(i)}-y^{(i, j)}\right)^{2}+\frac{\lambda}{2} \sum_{i=1}^{n_{m}} \sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}}\end{array}} \\ \end{array}\]

-w495

Collaboration between users in the sense of better estimation of features: every user gives the system their preference, with which the system learn better feature x_i, and then further helping make the estimation of preference for each user.

Algorithm for Collaborative Filtering

$\text { Given } x^{(1)}, \ldots, x^{\left(n_{m}\right)}, \text { estimate } \theta^{(1)}, \ldots, \theta^{\left(n_{u}\right)}:$

${\text { Given } \theta^{(1)}, \ldots, \theta^{\left(n_{u}\right)}, \text { to learn } x^{(1)}, \ldots, x^{\left(n_{m}\right)}:}$

\[\min _{x^{(1)}, \ldots, x^{\left(n_{m}\right)}} \frac{1}{2} \sum_{i=1}^{n_{m}} \sum_{j: r(i, j)=1}\left(\left(\theta^{(j)}\right)^{T} x^{(i)}-y^{(i, j)}\right)^{2}+\frac{\lambda}{2} \sum_{i=1}^{n_{m}} \sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}\]

We combine this process: $\text { Minimizing } x^{(1)}, \ldots, x^{\left(n_{m}\right)} \text { and } \theta^{(1)}, \ldots, \theta^{\left(n_{u}\right)} \text { simultaneously: }$

\[J\left(x^{(1)}, \ldots, x^{\left(n_{m}\right)}, \theta^{(1)}, \ldots, \theta^{\left(n_{u}\right)}\right)=\frac{1}{2} \sum_{(i, j): r(i, j)=1}\left(\left(\theta^{(j)}\right)^{T} x^{(i)}-y^{(i, j)}\right)^{2}+\frac{\lambda}{2} \sum_{i=1}^{n_{m}} \sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}+\frac{\lambda}{2} \sum_{j=1}^{n_{u}} \sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}\]

The optimization problem:

\[\min_{x,\theta} J(x,\theta)\]

where: $x=x^{(1)}, \ldots, x^{\left(n_{m}\right)}$ $\theta=\theta^{(1)}, \ldots, \theta^{\left(n_{u}\right)}$

Note: when we combine the two learning process, we can get rid of $x^{(i)}_0=1$. This is because the features can be learned during the process. If the algorithm really thinks a constant feature is needed, it will make one by itself during the iteration.

In conclusion:

-w619

Note:

here $x,\theta \in R^n$ without the constant term.
The final predicted star rating is $\theta^T x$
Initialization of the x and $\theta$ is for the purpose of breaking the symmetry and ensuring the learned $x_i$ and $\theta_j$ are different.

Implementation/vectorization of Collaborative Filtering–Low Rank matrix Factorization

For the collaborative filtering, we basically deal with the problem of finding parameters given the result.

-w701

In other words:

$Y= X^T *\theta $ where $\Theta=[{\theta^{(1)}},\theta^{(2)},…,\theta^{(n_u)}]$ $X=[{x^{(1)}},x^{(2)},…,x^{(n_m)}]$

Mathematics about Low-Rank matrix factorization

$\min |\boldsymbol A - \boldsymbol UV^{T} |_2 \text{, subject to}~ rank(\boldsymbol UV^{T}) \leq r), where (|\cdot|_2$ denotes the Frobenius norm.

We start with the basic MF model, formulated as:

\[\min _{\mathbf{U}, \mathbf{V}}\left\|\mathbf{X}-\mathbf{U} \mathbf{V}^{T}\right\|+\mathcal{L}(\mathbf{U}, \mathbf{V})\]

where $X\in R^{m\times n}$ is the data matrix to be approximated, and $U\in R^{m\times k},V\in R^{n\times k}$ are two low-dimensional matrices ($k«min(m,m)$), $\mathcal{L}(U,V)$ is a regularization part to avoid overfitting.

Implementation/vectorization of Collaborative Filtering–Mean Normalization

If we do not have a mean normalization process, then for a guy that gives no rating, the $\theta$ will be set to be 0. In this case, no recommendation can be made, as the predicted ratings will be 0 for any movie.

-w815

Implementation of mean normalization

$Y=\left[\begin{array}{lllll}{5} & {5} & {0} & {0} & {?} \ {5} & {?} & {?} & {0} & {?} \ {?} & {4} & {0} & {?} & {?} \ {0} & {0} & {5} & {4} & {?} \ {0} & {0} & {5} & {0} & {?}\end{array}\right]$

then we can compute the average rating: $\mu=\left[\begin{array}{c}{2.5} \ {2.5} \ {2} \ {2.25} \ {1.25}\end{array}\right]$

bsfun(@minus,Y,mu):

\[Y=\left[\begin{array}{ccccc}{2.5} & {2.5} & {-2.5} & {-2.5} & {?} \\ {2.5} & {?} & {?} & {-2.5} & {?} \\ {?} & {2} & {-2} & {?} & {?} \\ {-2.25} & {-2.25} & {2.75} & {1.75} & {?} \\ {-1.25} & {-1.25} & {3.75} & {-1.25} & {?}\end{array}\right]\]

Now for user j and movie i, the predicted rating is: $<\theta^{(j)},x^{(i)} >+\mu_i$

In this case, even for the user who has not given any ratings, the ratings can be predicted as $0+\mu$.

Twitter Facebook LinkedIn

Machine Learning Part 6：Recommender System

Recommender System

Predicting Movie Rating

Collaborative Filtering

Algorithm for Collaborative Filtering

Implementation/vectorization of Collaborative Filtering–Low Rank matrix Factorization

Mathematics about Low-Rank matrix factorization

Implementation/vectorization of Collaborative Filtering–Mean Normalization

Implementation of mean normalization

Comments

You May Also Enjoy

Time-varing models for estimating Value-at-Risk(vars)

Quantitative Trading - 1. Introduction

Macroeconomics, Economic Principles in the Real World - Part 2

Macroeconomics, Economic Principles in the Real World - Part 1