Skip to content
Mighten's Blog

Collaborative Filtering: From UserCF and ItemCF to Matrix Factorization

Explore the evolution of Collaborative Filtering from UserCF and ItemCF to Matrix Factorization. This post explains how recommender systems leverage user similarity, item relationships, and latent factors to model user preferences.

Recommender System 18 min read

Hi, today I will talk about an early recommendation systems paradigm - Collaborative Filtering (CF) - which built on a simple yet powerful premise: the past behavior of users is the best predictor of their future preferences. 1

So, in this post, we will break down the evolution of Collaborative Filtering into 3 core stages: UserCF, ItemCF, and the revolutionary Matrix Factorization (MF).


In the early days of the web, the most intuitive way to recommend something was through word-of-mouth. If you wanted a movie recommendation, you would ask friends who shared your taste. UserCF digitized this social phenomenon, becoming one of the earliest successful algorithms powering personalized content feeds.

The core philosophy of UserCF is: “Find users who are similar to the target user, and recommend items that those like-minded users enjoyed.”

UserCF operates entirely on the user-item interaction matrix.

Imagine a streaming platform with 3 users: Alice, Bob, and Charlie:

graph LR
classDef user_low_sim fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px;
classDef user fill:#f8cecc,stroke:#b85450,stroke-width:2px;
classDef item fill:#fff2cc,stroke:#d6b656,stroke-width:2px;
Alice["👤 Alice<br/>
Inception: 5 😍<br/>
Interstellar: 4 😊<br/>
<em>The Matrix</em>: ?"]:::user
Bob["👤 Bob<br/>
Inception: 5 😍<br/>
Interstellar: 5 😍<br/>
<em>The Matrix</em>: 5 😍"]:::user
Charlie["👤 Charlie<br/>
The Notebook: 5 😍"]:::user_low_sim
Matrix["🎬 <em>The Matrix</em>"]:::item
Alice --->|High similarity| Bob
Alice -.->|Low similarity| Charlie
Bob --> Matrix
Charlie -.-> Matrix
Matrix -->|Recommend this item| Alice
  • Alice loves Inception and Interstellar.
  • Bob loves Inception, Interstellar, and The Matrix.
  • Charlie only likes The Notebook.
User \ MovieInceptionInterstellarThe MatrixThe Notebook
Alice5 😍4 😊?
Bob5 😍5 😍5 😍
Charlie5 😍

Because Alice and Bob share identical tastes in sci-fi movies, UserCF identifies Bob as Alice’s “neighbor.” Since Bob also highly rated The Matrix — a movie Alice hasn’t seen yet — the system still recommends The Matrix to Alice.

To predict a rating for user uu on item ii, UserCF executes a two-step mathematical process:

Step 1: Calculate User SimilarityPearson Correlation Coefficient

NOTE: Why NOT Cosine Similarity?

Cosine Similarity does NOT explicitly remove user rating bias because it operates on raw vectors.

simcosine(u,v)=uvuv=iIuvRu,iRv,iiIuRu,i2iIvRv,i2(1-1)sim_{\text{cosine}}(u, v) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \|\vec{v}\|} = \frac{\sum_{i \in I_{uv}} R_{u,i} R_{v,i}}{\sqrt{\sum_{i \in I_{u}} R_{u,i}^2} \sqrt{\sum_{i \in I_{v}} R_{v,i}^2}} \tag{1-1}

In the real world, different users harbor different inner benchmarks for their scores. A harsh critic might give a masterpiece a 33 out of 55, while a lenient optimist might give the exact same item a 55. Their preference trends are identical, but their absolute score vectors differ. Because it uses absolute values, Cosine Similarity misinterprets these numerical shifts as differences in actual taste.

Conversely, for explicit ratings, the Pearson Correlation Coefficient in the following text acts as Mean-Centered Cosine Similarity that can cancel out the User Rating Bias.

We measure the similarity between user uu and user vv using Pearson Correlation Coefficient over their co-rated items IuvI_{uv}, to cancel out the User Rating Bias caused by differing individual scoring standards:

sim(u,v)=iIuv(Ru,iRˉu)(Rv,iRˉv)iIuv(Ru,iRˉu)2iIuv(Rv,iRˉv)2(1-2)sim(u, v) = \frac{\sum_{i \in I_{uv}} (R_{u,i} - \bar{R}_u)(R_{v,i} - \bar{R}_v)}{\sqrt{\sum_{i \in I_{uv}} (R_{u,i} - \bar{R}_u)^2} \sqrt{\sum_{i \in I_{uv}} (R_{v,i} - \bar{R}_v)^2}} \tag{1-2}

Where:

  • Ru,iR_{u,i} is the rating of user uu on item ii, and Rˉu\bar{R}_u is the average rating given by user uu.

By subtracting each user’s average rating (Rˉ\bar{R}), Pearson Correlation Coefficient shifts the coordinate origin to the center of each user’s historical behavior (centering).

Pearson Correlation Coefficient successfully strips away individual rating biases and forces all users onto a standardized playing field where a score is judged by its relative preference rather than its raw value.

Step 2: Predict the Rating

Because we used the Pearson Correlation in Step 1 to strip away individual rating biases, we cannot simply use the raw absolute ratings of the similar users to predict the target user’s score. Doing so would reintroduce the exact bias we just removed.

Instead, we must look at the neighbors’ relative ratings (how much their rating for item ii deviates from their own personal average) and apply that weighted deviation to the target user’s baseline average. The prediction equation is:

R^u,i=Rˉu+vS(u,K)sim(u,v)(Rv,iRˉv)vS(u,K)sim(u,v)(1-3)\boxed{\hat{R}_{u,i} = \bar{R}_u + \frac{\sum_{v \in S(u, K)} sim(u, v) \cdot (R_{v,i} - \bar{R}_v)}{\sum_{v \in S(u, K)} |sim(u, v)|}} \tag{1-3}

Where:

  • Rˉu\bar{R}_u is the average rating given by the target user uu.
  • sim(u,v)sim(u, v) is the Pearson similarity between user uu and user vv. Note the absolute value in the denominator to properly normalize weights, as Pearson values can be negative.
  • (Rv,iRˉv)(R_{v,i} - \bar{R}_v) is the relative rating (deviation) given by user vv to item ii.

After obtaining the predicted recommendation scores R^u,i\hat{R}_{u,i} for various unrated items, the final recommendation list can be derived by sorting these items in descending order based on their scores. This completes the entire collaborative filtering recommendation process.

  • Pros:
    • Highly intuitive and easy to implement.
    • Captures social, trending, and serendipitous discoveries (recommending items from entirely different categories if similar users bought them).
  • Cons:
    • Scalability issues: As the user base grows into millions, computing the user-to-user similarity matrix becomes computationally prohibitive (O(U2)O(U^2)) — However, in a real-world scenario, the number of items is often much smaller than the number of users
    • Sparsity: Real-world user-item matrices are incredibly sparse, making similar-user identification inaccurate for those with minimal interactions. Consequently, UserCF performs poorly in low-frequency, hard-to-feedback scenarios like hotel bookings or big-ticket purchases.

Therefore, in the next section, I will talk about Item-Based Collaborative Filtering, or ItemCF.

As e-commerce giants like Amazon grew exponentially in the late 1990s and early 2000s, they realized their user base fluctuated wildly and grew much faster than their product catalog. UserCF could no longer scale.

Amazon popularized large-scale item-to-item collaborative filtering (ItemCF) in production systems, shifting the focus from user dynamics to item relationships.

The core philosophy of ItemCF is: “Find items that are similar to the ones the user has liked in the past.”

It assumes that a user’s preference is stable over time, and items can be characterized by the users who consume them.

Also imagine a streaming platform where Alice just finished watching Interstellar and loved it. The system needs to decide whether to recommend The Matrix to her:

graph LR
%% Define Nodes
Alice["👤 Alice"]
subgraph Liked_Items ["Items Alice Liked"]
direction TB
Item1["🎬 Inception"]
Item2["🎬 Interstellar"]
end
Target["🎬 The Matrix<br/>Normalized Score: <br/>6 / (4+2.5) * 5 = 4.6 😍"]
%% User Interest Edges (Left Side)
Alice -->|"like(Alice, Inception) = 5 😍"| Item1
Alice -->|"like(Alice, Interstellar) = 4 😊"| Item2
%% Item Similarity Edges (Right Side)
Item1 -->|"sim(Inception, Matrix) = 0.8"| Target
Item2 -->|"sim(Interstellar, Matrix) = 0.5"| Target
%% Styling
classDef user_node fill:#f8cecc,stroke:#6c8ebf,stroke-width:2px;
classDef item_node fill:#f8cecc,stroke:#b85450,stroke-width:2px;
classDef target_node fill:#fff2cc,stroke:#d6b656,stroke-width:2px,stroke-dasharray: 4 4;
class Alice user_node;
class Item1,Item2 item_node;
class Target target_node;
  • Alice highly rated Inception (5 😍) and Interstellar (4 😊).
  • Based on global historical data, The Matrix has an item-to-item similarity score of 0.8 with Inception and 0.5 with Interstellar.

Instead of looking at other users, the system looks at the similarity between the items themselves:

Item \ ItemInceptionInterstellarThe Matrix
Inception1.00.60.8
Interstellar0.61.00.5
The Matrix0.80.51.0

And the predicted score with normalization is: 5×0.8+4×0.50.8+0.54.6\frac{5 \times 0.8 + 4 \times 0.5}{0.8 + 0.5} \approx 4.6

So ItemCF System will recommend The Matrix to Alice.

NOTE:

  • Examples of explicit ratings in UserCF and ItemCF are used for demonstration purpose only; modern industrial recommender systems mostly optimize implicit feedback objectives rather than explicit ratings:
    • Watch time
    • Click
    • purchase
    • (etc.)

Step 1: Construct an m×nm \times n User-Item Matrix

Let the total number of users in the system be mm and the total number of items be nn. We define the User-Item Matrix as R\mathbf{R}, which is a matrix belonging to the m×nm \times n-dimensional real vector space, denoted as RRm×n\mathbf{R} \in \mathbb{R}^{m \times n}:

R=(R1,1R1,2R1,nR2,1R2,2R2,nRm,1Rm,2Rm,n)(2-1)\mathbf{R} = \begin{pmatrix} R_{1,1} & R_{1,2} & \cdots & R_{1,n} \\ R_{2,1} & R_{2,2} & \cdots & R_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ R_{m,1} & R_{m,2} & \cdots & R_{m,n} \end{pmatrix} \tag{2-1}

Where:

  • the matrix element Ri,jR_{i,j} located at the ii-th row and jj-th column denotes the rating of user ii for item jj
  • R_i,jRating ValueR\_{i,j} \in { \text{Rating Value} }

Step 2: Construct an n×nn \times n Item Similarity Matrix

For any item jj (where 1jn1 \le j \le n), the corresponding column vector is denoted as vj\mathbf{v}_j, which is an mm-dimensional column vector:

vj=(R1,jR2,jRm,j)Rm×1(2-2)\mathbf{v}_j = \begin{pmatrix} R_{1,j} \\ R_{2,j} \\ \vdots \\ R_{m,j} \end{pmatrix} \in \mathbb{R}^{m \times 1} \tag{2-2}

We denote the similarity between item ii and item jj as wi,jw_{i,j}, so the Item Similarity Matrix can be defined as W\mathbf{W}, which is a matrix belonging to the n×nn \times n-dimensional real vector space, denoted as WRn×n\mathbf{W} \in \mathbb{R}^{n \times n}:

W=(w1,1w1,2w1,nw2,1w2,2w2,nwn,1wn,2wn,n)(2-3)\mathbf{W} = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n,1} & w_{n,2} & \cdots & w_{n,n} \end{pmatrix} \tag{2-3}

Where:

  • the matrix element wi,jw_{i,j}, located at the ii-th row and jj-th column
  • wi,jw_{i,j} denotes the similarity between item ii and item jj
  • wi,j[0,1]w_{i,j} \in [0, 1]

Step 3: Construct Top-KK Most Similar Item Candidate Set

For a target user, retrieve all items to which they have given positive feedback.

For each of these interacted items, identify its Top-KK most similar items based on the similarity matrix W\mathbf{W}.

Then, aggregate these neighboring items to construct an Item Candidate Set for the user.

Step 4: Generate Recommendation List

If a candidate item pp is similar to multiple items in the user’s history, the final recommendation score is the accumulated value of these similarities. The equation is expressed as:

Ru,p=hHwp,hRu,hhHwp,h(2-4)\boxed{R_{u,p} = \frac{\sum_{h \in H} w_{p,h} \cdot R_{u,h}}{\sum_{h \in H} |w_{p,h}|}} \tag{2-4}

Where:

  • HH is the set of positive feedback items in the target user’s history
  • wp,hw_{p,h} is the similarity between the candidate item pp and the historical item hh
  • Ru,hR_{u,h} is the rating score assigned by user uu to item hh

Sort the items in descending order of their predicted scores, forming a final recommendation list.

  • Pros:
    • Computationally Stable: Item catalogs change much slower than user populations, meaning the item similarity matrix can be computed offline and updated less frequently.
    • High Explainability: Allows for straightforward explanations that build trust (e.g., “Because you bought Item X, you might like Y”).
  • Cons:
    • Limited Serendipity: Tends to recommend items very similar to what the user has already seen, creating a “filter bubble.”
    • New Item Cold Start: Cannot effectively recommend a brand-new item until users start interacting with it.

DifferencesUserCFItemCF
Core Ideauser similarityitem similarity
Interest StabilityDynamicStable
FocusFocuses on Time:
Tracking changing trends in real-time
Item interaction patterns:
Recommending similar styles, genres, or categories
Applicationnews recommendationse-commerce shopping, video streaming

When handling massive and sparse datasets, traditional Memory-based Collaborative Filtering methods (such as UserCF and ItemCF) expose the fundamental limitation of a lack of generalization:

  • popular items tend to be similar to a lot of other items — more likely to be recommended.
  • long-tail items tend to be less similar to other items due to sparsityless likely to be recommended

The breakthrough came from Model-Based Collaborative Filtering, specifically Matrix Factorization (MF), which completely redefined how we model user preferences. 2

Matrix Factorization shifts the paradigm from analyzing direct interactions (such as the sparse User-Item Matrix) to discovering the latent factors (hidden features) that explain why a user likes a particular item.

Suppose we have a small movie rating dataset with 3 users and 4 movies, shown in User-Item Rating Matrix below:

Name \ MovieInceptionInterstellarThe MatrixThe Notebook
Alice54? (Unknown)? (Unknown)
Bob555? (Unknown)
Charlie? (Unknown)? (Unknown)? (Unknown)5

Obviously, the original User-Item Matrix is sparse: many users have not rated many items.

Instead of directly filling these missing values, Matrix Factorization assumes that both users and items can be represented in a lower-dimensional latent space.

For example, after the model training, suppose that the MF model decompose the User-Item Matrix into 2 latent factors:

  • Factor 1: Science Fiction
  • Factor 2: Romance

Note:

  • For illustration only, we can interpret latent dimensions as Sci-Fi or Romance affinity; in real models, latent factors usually do NOT have explicit semantic meanings.
  • The MF model just simply SKIP the unrated items during factorization phase:
    • MF models do NOT assume Alice hates it
    • MF models do NOT assume she loves it either

After MF model training and Matrix Factorization, assume we get User Latent Matrix and Item Latent Matrix:

  • User Latent Matrix (U\mathbf{U}):
Name \ FactorSci-FiRomance
Alice4.80.2
Bob5.50.1
Charlie0.15.0
  • Item Latent Matrix (VT\mathbf{V}^{\text{T}}):
Factor \ ItemInceptionInterstellarThe MatrixThe Notebook
Sci-Fi0.90.90.90
Romance0.10.201.0

Next, by performing matrix product (R^=UVT\hat{\mathbf{R}} = \mathbf{U}\mathbf{V}^{\text{T}}), MF learns a function that can estimate missing entries when needed:

User \ MovieInceptionInterstellarThe MatrixThe Notebook
Alice4.344.364.32 (Predicted)0.20 (Predicted)
Bob4.964.974.950.10 (Predicted)
Charlie0.59 (Predicted)1.09 (Predicted)0.09 (Predicted)5.00

So, in the example above, we can recommend The Matrix to Alice since MF model predicts that Alice may rate The Matrix 4.32 out of 5.

Note: This section is the reading note of the book Deep Learning Recommender System 2.0 (深度学习推荐系统2.0) by Zhe Wang (王喆).3

Matrix Factorization decomposes an m×nm \times n User-Item Matrix RR into an m×km \times k User Matrix UU and the transpose of an n×kn \times k Item Matrix VV:

R=UVT(3-1)\mathbf{R} = \mathbf{U}\mathbf{V}^{\text{T}} \tag{3-1}

Where:

  • mm: Number of users.
  • nn: Number of items.
  • kk: Dimension of the latent factors (hidden features) — a hyperparameter that directly impacts the complexity of factorization and needs to be tuned:
    • Large kk: Higher model capacity, weaker generalization.
    • Small kk: Lower model capacity, better generalization.

Hence the predicted rating of user uu for item ii is:

r^ui=qiTpu(3-2)\hat{r}_{ui} = \mathbf{q}_i^{\text{T}} \mathbf{p}_u \tag{3-2}

Where:

  • pu\mathbf{p}_u: Latent vector for user uu (the uu-th row of UU)
  • qi\mathbf{q}_i: Latent vector for item ii (the ii-th row of VV)

Note: By default, both pu\mathbf{p}_u and qi\mathbf{q}_i are treated as column vectors. So, the inner product of two vectors produces the scalar value r^ui\hat{r}_{ui}.

Now, the basic idea of MF is clear. So, let us look deeper into the calculation of Matrix Factorization:

With the core concept of MF established, let’s first concentrate on the goal of factorization:

The primary optimization goal is to minimize the discrepancy between the ground-truth rating ruir_{ui} and the dot product of the latent vectors qiTpu\mathbf{q}_i^{\text{T}} \mathbf{p}_u, thereby preserving the original information of the user-item matrix as much as possible.

So, We can define the baseline objective function for Matrix Factorization as follows:

minq,p(u,i)K(ruiqiTpu)2(3-3)\min_{\mathbf{q}^*, \mathbf{p}^*} \sum_{(u,i) \in K} (r_{ui} - \mathbf{q}_i^{\text{T}} \mathbf{p}_u)^2 \tag{3-3}

Where:

  • KK is the set of all observed user-item rating pairs in the training sample.

In practice, we often add the L2 regularization to the objective function to avoid over-fitting.

So the final objective function is:

L=minq,p(u,i)K[(ruiqiTpu)2+λ(qi2+pu2)](3-4)\boxed{L = \min_{\mathbf{q}^*, \mathbf{p}^*} \sum_{(u,i) \in K} [(r_{ui} - \mathbf{q}_i^{\text{T}} \mathbf{p}_u)^2 + \lambda \left( \|\mathbf{q}_i\|^2 + \|\mathbf{p}_u\|^2 \right)]} \tag{3-4}

Where:

  • λ\lambda is the regularization hyperparameter that controls the penalty strength to avoid large vector magnitudes.

NOTE: L2 regularization penalizes large vector magnitudes because smaller weights force the model to be simpler and smoother, preventing it from over-fitting to the training data.

Next, let’s dive into how we actually compute the decomposition. There are four primary methods to consider:

  • Eigendecomposition: NOT applicable, as the user-item matrix R\mathbf{R} is rarely a square (n×nn \times n) matrix.
  • Singular Value Decomposition (SVD): Impractical, because its computational complexity reaches O(mn2)O(mn^2) for mm users and nn items, which scales poorly (some pratical truncated SVD variants are not discussed in this post).
  • Gradient Descent (GD): The most popular choice (especially SGD). It bypasses missing entries by optimizing a loss function only over observed ratings, updating latent vectors iteratively — highly scalable and flexible
  • Alternating Least Squares (ALS): An elegant alternative optimized for parallelization, which transforms a non-convex problem into a series of quadratic subproblems.

So we will demonstrate how to optimize the objective function with Gradient Descent and Alternating Least Squares in the next two parts.


In this part, we use the Gradient Descent to optimize the objective function.

First, We take the partial derivatives of Equation (3-4) with respect to qi\mathbf{q}_i and pu\mathbf{p}_u:

Lqi=2(ruiqiTpu)pu+2λqi(3-5)\frac{\partial L}{\partial \mathbf{q}_i} = -2\left(r_{ui} - \mathbf{q}_i^{\text{T}}\mathbf{p}_u\right)\mathbf{p}_u + 2\lambda\mathbf{q}_i \tag{3-5} Lpu=2(ruiqiTpu)qi+2λpu(3-6)\frac{\partial L}{\partial \mathbf{p}_u} = -2\left(r_{ui} - \mathbf{q}_i^{\text{T}}\mathbf{p}_u\right)\mathbf{q}_i + 2\lambda\mathbf{p}_u \tag{3-6}

Then, In each step, qi\mathbf{q}_i and pu\mathbf{p}_u are updated via gradient descent:

qiqiγLqi(3-7)\mathbf{q}_i \leftarrow \mathbf{q}_i - \gamma \frac{\partial L}{\partial \mathbf{q}_i} \tag{3-7} pupuγLpu(3-8)\mathbf{p}_u \leftarrow \mathbf{p}_u - \gamma \frac{\partial L}{\partial \mathbf{p}_u} \tag{3-8}

By substituting the partial derivatives and absorbing the coefficients into γ\gamma, we obtain the final update rules:

qiqi+γ((ruiqiTpu)puλqi)(3-9)\boxed{\mathbf{q}_i \leftarrow \mathbf{q}_i + \gamma \left( \left( r_{ui} - \mathbf{q}_i^{\text{T}}\mathbf{p}_u \right) \mathbf{p}_u - \lambda \mathbf{q}_i \right)} \tag{3-9} pupu+γ((ruiqiTpu)qiλpu)(3-10)\boxed{\mathbf{p}_u \leftarrow \mathbf{p}_u + \gamma \left( \left( r_{ui} - \mathbf{q}_i^{\text{T}}\mathbf{p}_u \right) \mathbf{q}_i - \lambda \mathbf{p}_u \right)} \tag{3-10}

Repeat Rule (3-9) and Rule (3-10) until MF model converges, or until any of the following termination criteria are met:

  • The maximum number of training steps is exceeded
  • The training loss falls below a predefined threshold θ\theta

In this part, we use the Alternating Least Squares to optimize the objective function.

While the objective function (3-4) is non-convex when optimizing pu\mathbf{p}_u and qi\mathbf{q}_i simultaneously, it becomes a standard quadratic convex problem if we fix one set of variables and solve for the other. ALS leverages this property by alternatingly fixing V\mathbf{V} to solve for U\mathbf{U}, and then fixing U\mathbf{U} to solve for V\mathbf{V}.

Step 1: Fix Item Latent Matrix V\mathbf{V}, Update User Latent Vector pu\mathbf{p}_u

When the item matrix V\mathbf{V} (and thus all qi\mathbf{q}_i) is treated as constant, we can solve for each user’s latent vector pu\mathbf{p}_u independently. For a specific user uu, we take the partial derivative of the objective function LL with respect to pu\mathbf{p}_u and set it to zero:

Lpu=iKu2(ruiqiTpu)qi+2λpu=0\frac{\partial L}{\partial \mathbf{p}_u} = \sum_{i \in K_u} -2\left(r_{ui} - \mathbf{q}_i^{\text{T}}\mathbf{p}_u\right)\mathbf{q}_i + 2\lambda\mathbf{p}_u = 0

Where KuK_u denotes the set of all items rated by user uu. By rearranging the terms to isolate pu\mathbf{p}_u, we get:

iKu(qiqiTpu)+λpu=iKuruiqi\sum_{i \in K_u} \left(\mathbf{q}_i\mathbf{q}_i^{\text{T}}\mathbf{p}_u\right) + \lambda\mathbf{p}_u = \sum_{i \in K_u} r_{ui}\mathbf{q}_i (iKuqiqiT+λIk)pu=iKuruiqi\left( \sum_{i \in K_u} \mathbf{q}_i\mathbf{q}_i^{\text{T}} + \lambda \mathbf{I}_k \right) \mathbf{p}_u = \sum_{i \in K_u} r_{ui}\mathbf{q}_i

Where Ik\mathbf{I}_k is the k×kk \times k identity matrix. By multiplying both sides by the inverse matrix, we obtain the exact closed-form solution for pu\mathbf{p}_u:

pu=(iKuqiqiT+λIk)1iKuruiqi(3-11)\boxed{\mathbf{p}_u = \left( \sum_{i \in K_u} \mathbf{q}_i\mathbf{q}_i^{\text{T}} + \lambda \mathbf{I}_k \right)^{-1} \sum_{i \in K_u} r_{ui}\mathbf{q}_i} \tag{3-11}

Step 2: Fix User Latent Matrix U\mathbf{U}, Update Item Latent Vector qi\mathbf{q}_i

Conversely, when the user matrix U\mathbf{U} (and thus all pu\mathbf{p}_u) is fixed, the problem becomes symmetric. For a specific item ii, we take the partial derivative of LL with respect to qi\mathbf{q}_i, set it to zero, and solve the linear system for the optimal qi\mathbf{q}_i:

qi=(uKipupuT+λIk)1uKiruipu(3-12)\boxed{\mathbf{q}_i = \left( \sum_{u \in K_i} \mathbf{p}_u\mathbf{p}_u^{\text{T}} + \lambda \mathbf{I}_k \right)^{-1} \sum_{u \in K_i} r_{ui}\mathbf{p}_u} \tag{3-12}

Where: KiK_i denotes the set of all users who have rated item ii.

Repeat Equation (3-11) and Equation (3-12) alternatively to update all users and items until the MF model converges, or until any of the following termination criteria are met:

  • The maximum number of training iterations is exceeded.
  • The training loss falls below a predefined threshold θ\theta.

Since each user’s update is completely independent of the others (and the same applies to items), ALS is highly parallelizable.


  • Pros:
    • Superb Sparsity Handling: Fills in the blanks of massive, sparse matrices exceptionally well by leveraging underlying latent structures.
    • High Predictive Accuracy: Outperformed classic CF algorithms by significant margins in industry benchmarks.
    • Memory Efficiency: Compresses huge user-item tracking matrices into much smaller latent vectors.
    • High Scalability: Matrices are decomposed to latent vectors — similar to Embeddings — for feature engineering pipelines, and thus can be seamlessly bridged with Deep Learning networks
  • Cons:
    • Inflexible Feature Integration:
      • Inconvenient to incorporate side information (richer data such as user demographics, item attributes, and contextual features) into matrix factorization
    • Cold Start Problem: It fails to provide effective recommendations when historical user-item interactions are completely absent (e.g., for brand-new users or items).

Collaborative Filtering is one of the foundational paradigms in modern recommender systems. Starting from the intuitive idea of “users with similar behaviors may share similar preferences,” UserCF models recommendations through user relationships, while ItemCF shifts the focus toward item-to-item interactions for better scalability and stability.

However, traditional memory-based approaches rely heavily on direct similarity computation and struggle with sparse data and generalization. Matrix Factorization addresses these limitations by learning latent representations of users and items, transforming sparse interaction patterns into compact mathematical embeddings.

This evolution from UserCF and ItemCF to MF reflects a broader shift in recommendation systems: from manually measuring similarity to automatically discovering hidden preference structures. Although modern industrial systems have moved toward deep learning and hybrid recommendation architectures, collaborative filtering remains a fundamental building block for understanding personalized recommendation.


  1. G. Linden, B. Smith and J. York, “Amazon.com recommendations: item-to-item collaborative filtering,” in IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan.-Feb. 2003, doi: 10.1109/MIC.2003.1167344.

  2. Y. Koren, R. Bell and C. Volinsky, “Matrix Factorization Techniques for Recommender Systems,” in Computer, vol. 42, no. 8, pp. 30-37, Aug. 2009, doi: 10.1109/MC.2009.263.

  3. Zhe Wang. “Deep Learning Recommender System 2.0” (“深度学习推荐系统2.0”), Beijing: Publishing House of Electronics Industry (“电子工业出版社”), Mar. 2025, ISBN: 9787121497469.

Comments