Collaborative Filtering: From UserCF and ItemCF to Matrix Factorization

Explore the evolution of Collaborative Filtering from UserCF and ItemCF to Matrix Factorization. This post explains how recommender systems leverage user similarity, item relationships, and latent factors to model user preferences.

June 18, 2026 Updated on July 11, 2026 Recommender System 18 min read

Hi, today I will talk about an early recommender systems paradigm - Collaborative Filtering (CF) - which built on a simple yet powerful premise: the past behavior of users is the best predictor of their future preferences. ¹

So, in this post, we will break down the evolution of Collaborative Filtering into 3 core stages: UserCF, ItemCF, and the revolutionary Matrix Factorization (MF).

In the early days of the web, the most intuitive way to recommend something was through word-of-mouth. If you wanted a movie recommendation, you would ask friends who shared your taste. UserCF digitized this social phenomenon, becoming one of the earliest successful algorithms powering personalized content feeds.

The core philosophy of UserCF is: “Find users who are similar to the target user, and recommend items that those like-minded users enjoyed.”

UserCF operates entirely on the user-item interaction matrix.

Imagine a streaming platform with 3 users: Alice, Bob, and Charlie:

graph LR

classDef user_low_sim fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px;
classDef user fill:#f8cecc,stroke:#b85450,stroke-width:2px;
classDef item fill:#fff2cc,stroke:#d6b656,stroke-width:2px;


Alice["👤 Alice<br/>
Inception: 5 😍<br/>
Interstellar: 4 😊<br/>
<em>The Matrix</em>: ?"]:::user

Bob["👤 Bob<br/>
Inception: 5 😍<br/>
Interstellar: 5 😍<br/>
<em>The Matrix</em>: 5 😍"]:::user

Charlie["👤 Charlie<br/>
The Notebook: 5 😍"]:::user_low_sim

Matrix["🎬 <em>The Matrix</em>"]:::item

Alice --->|High similarity| Bob
Alice -.->|Low similarity| Charlie

Bob --> Matrix
Charlie -.-> Matrix
Matrix -->|Recommend this item| Alice

Alice loves Inception and Interstellar.
Bob loves Inception, Interstellar, and The Matrix.
Charlie only likes The Notebook.

User \ Movie	Inception	Interstellar	The Matrix	The Notebook
Alice	5 😍	4 😊	?
Bob	5 😍	5 😍	5 😍
Charlie				5 😍

Because Alice and Bob share identical tastes in sci-fi movies, UserCF identifies Bob as Alice’s “neighbor.” Since Bob also highly rated The Matrix — a movie Alice hasn’t seen yet — the system still recommends The Matrix to Alice.

To predict a rating for user $u$ on item $i$ , UserCF executes a two-step mathematical process:

Step 1: Calculate User Similarity — Pearson Correlation Coefficient

NOTE: Why NOT Cosine Similarity?

Cosine Similarity does NOT explicitly remove user rating bias because it operates on raw vectors.
$sim_{\text{cosine}}(\boldsymbol{u}, \boldsymbol{v}) = \frac{\boldsymbol{u} \cdot \boldsymbol{v}}{\|\boldsymbol{u}\| \|\boldsymbol{v}\|} = \frac{\sum_{i \in I_{uv}} R_{u,i} R_{v,i}}{\sqrt{\sum_{i \in I_{u}} R_{u,i}^2} \sqrt{\sum_{i \in I_{v}} R_{v,i}^2}} \tag{1-1}$
In the real world, different users harbor different inner benchmarks for their scores. A harsh critic might give a masterpiece a $3$ out of $5$ , while a lenient optimist might give the exact same item a $5$ . Their preference trends are identical, but their absolute score vectors differ. Because it uses absolute values, Cosine Similarity misinterprets these numerical shifts as differences in actual taste.

Conversely, for explicit ratings, the Pearson Correlation Coefficient in the following text acts as Mean-Centered Cosine Similarity that can cancel out the User Rating Bias.

We measure the similarity between user $u$ and user $v$ using Pearson Correlation Coefficient over their co-rated items $I_{uv}$ , to cancel out the User Rating Bias caused by differing individual scoring standards:

sim(u, v) = \frac{\sum_{i \in I_{uv}} (R_{u,i} - \bar{R}_u)(R_{v,i} - \bar{R}_v)}{\sqrt{\sum_{i \in I_{uv}} (R_{u,i} - \bar{R}_u)^2} \sqrt{\sum_{i \in I_{uv}} (R_{v,i} - \bar{R}_v)^2}} \tag{1-2}

Where:

$R_{u,i}$ is the rating of user $u$ on item $i$ , and $\bar{R}_u$ is the average rating given by user $u$ .

By subtracting each user’s average rating ( $\bar{R}$ ), Pearson Correlation Coefficient shifts the coordinate origin to the center of each user’s historical behavior (centering).

Pearson Correlation Coefficient successfully strips away individual rating biases and forces all users onto a standardized playing field where a score is judged by its relative preference rather than its raw value.

Step 2: Predict the Rating

Because we used the Pearson Correlation in Step 1 to strip away individual rating biases, we cannot simply use the raw absolute ratings of the similar users to predict the target user’s score. Doing so would reintroduce the exact bias we just removed.

Instead, we must look at the neighbors’ relative ratings (how much their rating for item $i$ deviates from their own personal average) and apply that weighted deviation to the target user’s baseline average. The prediction equation is:

\boxed{\hat{R}_{u,i} = \bar{R}_u + \frac{\sum_{v \in S(u, K)} sim(u, v) \cdot (R_{v,i} - \bar{R}_v)}{\sum_{v \in S(u, K)} |sim(u, v)|}} \tag{1-3}

Where:

$\bar{R}_u$ is the average rating given by the target user $u$ .
$sim(u, v)$ is the Pearson similarity between user $u$ and user $v$ . Note the absolute value in the denominator to properly normalize weights, as Pearson values can be negative.
$(R_{v,i} - \bar{R}_v)$ is the relative rating (deviation) given by user $v$ to item $i$ .

After obtaining the predicted recommendation scores $\hat{R}_{u,i}$ for various unrated items, the final recommendation list can be derived by sorting these items in descending order based on their scores. This completes the entire collaborative filtering recommendation process.

Pros:
- Highly intuitive and easy to implement.
- Captures social, trending, and serendipitous discoveries (recommending items from entirely different categories if similar users bought them).
Cons:
- Scalability issues: As the user base grows into millions, computing the user-to-user similarity matrix becomes computationally prohibitive ( $O(U^2)$ ) — However, in a real-world scenario, the number of items is often much smaller than the number of users
- Sparsity: Real-world user-item matrices are incredibly sparse, making similar-user identification inaccurate for those with minimal interactions. Consequently, UserCF performs poorly in low-frequency, hard-to-feedback scenarios like hotel bookings or big-ticket purchases.

Therefore, in the next section, I will talk about Item-Based Collaborative Filtering, or ItemCF.

As e-commerce giants like Amazon grew exponentially in the late 1990s and early 2000s, they realized their user base fluctuated wildly and grew much faster than their product catalog. UserCF could no longer scale.

Amazon popularized large-scale item-to-item collaborative filtering (ItemCF) in production systems, shifting the focus from user dynamics to item relationships.

The core philosophy of ItemCF is: “Find items that are similar to the ones the user has liked in the past.”

It assumes that a user’s preference is stable over time, and items can be characterized by the users who consume them.

Also imagine a streaming platform where Alice just finished watching Interstellar and loved it. The system needs to decide whether to recommend The Matrix to her:

graph LR
    %% Define Nodes
    Alice["👤 Alice"]

    subgraph Liked_Items ["Items Alice Liked"]
        direction TB
        Item1["🎬 Inception"]
        Item2["🎬 Interstellar"]
    end

    Target["🎬 The Matrix<br/>Normalized Score: <br/>6 / (4+2.5) * 5 = 4.6 😍"]

    %% User Interest Edges (Left Side)
    Alice -->|"like(Alice, Inception) = 5 😍"| Item1
    Alice -->|"like(Alice, Interstellar) = 4 😊"| Item2

    %% Item Similarity Edges (Right Side)
    Item1 -->|"sim(Inception, Matrix) = 0.8"| Target
    Item2 -->|"sim(Interstellar, Matrix) = 0.5"| Target

    %% Styling
    classDef user_node fill:#f8cecc,stroke:#6c8ebf,stroke-width:2px;
    classDef item_node fill:#f8cecc,stroke:#b85450,stroke-width:2px;
    classDef target_node fill:#fff2cc,stroke:#d6b656,stroke-width:2px,stroke-dasharray: 4 4;

    class Alice user_node;
    class Item1,Item2 item_node;
    class Target target_node;

Alice highly rated Inception (5 😍) and Interstellar (4 😊).
Based on global historical data, The Matrix has an item-to-item similarity score of 0.8 with Inception and 0.5 with Interstellar.

Instead of looking at other users, the system looks at the similarity between the items themselves:

Item \ Item	Inception	Interstellar	The Matrix
Inception	1.0	0.6	0.8
Interstellar	0.6	1.0	0.5
The Matrix	0.8	0.5	1.0

And the predicted score with normalization is: $\frac{5 \times 0.8 + 4 \times 0.5}{0.8 + 0.5} \approx 4.6$

So ItemCF System will recommend The Matrix to Alice.

NOTE:

Examples of explicit ratings in UserCF and ItemCF are used for demonstration purpose only; modern industrial recommender systems mostly optimize implicit feedback objectives rather than explicit ratings:

Watch time

Click

purchase

(etc.)

Step 1: Construct an $m \times n$ User-Item Matrix

Let the total number of users in the system be $m$ and the total number of items be $n$ . We define the User-Item Matrix as $\mathbf{R}$ , which is a matrix belonging to the $m \times n$ -dimensional real vector space, denoted as $\mathbf{R} \in \mathbb{R}^{m \times n}$ :

\mathbf{R} = \begin{pmatrix} R_{1,1} & R_{1,2} & \cdots & R_{1,n} \\ R_{2,1} & R_{2,2} & \cdots & R_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ R_{m,1} & R_{m,2} & \cdots & R_{m,n} \end{pmatrix} \tag{2-1}

Where:

the matrix element $R_{i,j}$ located at the $i$ -th row and $j$ -th column denotes the rating of user $i$ for item $j$
$R_{i,j} \in \{\text{Rating Value}\}$

Step 2: Construct an $n \times n$ Item Similarity Matrix

For any item $j$ (where $1 \le j \le n$ ), the corresponding column vector is denoted as $\boldsymbol{v}_j$ , which is an $m$ -dimensional column vector:

\boldsymbol{v}_j = \begin{pmatrix} R_{1,j} \\ R_{2,j} \\ \vdots \\ R_{m,j} \end{pmatrix} \in \mathbb{R}^{m \times 1} \tag{2-2}

We denote the similarity between item $i$ and item $j$ as $w_{i,j}$ , so the Item Similarity Matrix can be defined as $\mathbf{W}$ , which is a matrix belonging to the $n \times n$ -dimensional real vector space, denoted as $\mathbf{W} \in \mathbb{R}^{n \times n}$ :

\mathbf{W} = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n,1} & w_{n,2} & \cdots & w_{n,n} \end{pmatrix} \tag{2-3}

Where:

the matrix element $w_{i,j}$ , located at the $i$ -th row and $j$ -th column
$w_{i,j}$ denotes the similarity between item $i$ and item $j$
$w_{i,j} \in [0, 1]$

Step 3: Construct Top- $K$ Most Similar Item Candidate Set

For a target user, retrieve all items to which they have given positive feedback.

For each of these interacted items, identify its Top- $K$ most similar items based on the similarity matrix $\mathbf{W}$ .

Then, aggregate these neighboring items to construct an Item Candidate Set for the user.

Step 4: Generate Recommendation List

If a candidate item $p$ is similar to multiple items in the user’s history, the final recommendation score is the accumulated value of these similarities. The equation is expressed as:

\boxed{R_{u,p} = \frac{\sum_{h \in H} w_{p,h} \cdot R_{u,h}}{\sum_{h \in H} |w_{p,h}|}} \tag{2-4}

Where:

$H$ is the set of positive feedback items in the target user’s history
$w_{p,h}$ is the similarity between the candidate item $p$ and the historical item $h$
$R_{u,h}$ is the rating score assigned by user $u$ to item $h$

Sort the items in descending order of their predicted scores, forming a final recommendation list.

Pros:
- Computationally Stable: Item catalogs change much slower than user populations, meaning the item similarity matrix can be computed offline and updated less frequently.
- High Explainability: Allows for straightforward explanations that build trust (e.g., “Because you bought Item X, you might like Y”).
Cons:
- Limited Serendipity: Tends to recommend items very similar to what the user has already seen, creating a “filter bubble.”
- New Item Cold Start: Cannot effectively recommend a brand-new item until users start interacting with it.

Differences	UserCF	ItemCF
Core Idea	user similarity	item similarity
Interest Stability	Dynamic	Stable
Focus	Focuses on Time: Tracking changing trends in real-time	Item interaction patterns: Recommending similar styles, genres, or categories
Application	news recommendations	e-commerce shopping, video streaming

When handling massive and sparse datasets, traditional Memory-based Collaborative Filtering methods (such as UserCF and ItemCF) expose the fundamental limitation of a lack of generalization:

popular items tend to be similar to a lot of other items — more likely to be recommended.
long-tail items tend to be less similar to other items due to sparsity — less likely to be recommended

The breakthrough came from Model-Based Collaborative Filtering, specifically Matrix Factorization (MF), which completely redefined how we model user preferences. ²

Matrix Factorization shifts the paradigm from analyzing direct interactions (such as the sparse User-Item Matrix) to discovering the latent factors (hidden features) that explain why a user likes a particular item.

Suppose we have a small movie rating dataset with 3 users and 4 movies, shown in User-Item Rating Matrix below:

Name \ Movie	Inception	Interstellar	The Matrix	The Notebook
Alice	5	4	? (Unknown)	? (Unknown)
Bob	5	5	5	? (Unknown)
Charlie	? (Unknown)	? (Unknown)	? (Unknown)	5

Obviously, the original User-Item Matrix is sparse: many users have not rated many items.

Instead of directly filling these missing values, Matrix Factorization assumes that both users and items can be represented in a lower-dimensional latent space.

For example, after the model training, suppose that the MF model decompose the User-Item Matrix into 2 latent factors:

Factor 1: Science Fiction
Factor 2: Romance

Note:

For illustration only, we can interpret latent dimensions as Sci-Fi or Romance affinity; in real models, latent factors usually do NOT have explicit semantic meanings.

The MF model just simply SKIP the unrated items during factorization phase:

MF models do NOT assume Alice hates it

MF models do NOT assume she loves it either

After MF model training and Matrix Factorization, assume we get User Latent Matrix and Item Latent Matrix:

User Latent Matrix ( $\mathbf{U}$ ):

Name \ Factor	Sci-Fi	Romance
Alice	4.8	0.2
Bob	5.5	0.1
Charlie	0.1	5.0

Item Latent Matrix ( $\mathbf{V}^{\text{T}}$ ):

Factor \ Item	Inception	Interstellar	The Matrix	The Notebook
Sci-Fi	0.9	0.9	0.9	0
Romance	0.1	0.2	0	1.0

Next, by performing matrix product ( $\hat{\mathbf{R}} = \mathbf{U}\mathbf{V}^{\text{T}}$ ), MF learns a function that can estimate missing entries when needed:

User \ Movie	Inception	Interstellar	The Matrix	The Notebook
Alice	4.34	4.36	4.32 (Predicted)	0.20 (Predicted)
Bob	4.96	4.97	4.95	0.10 (Predicted)
Charlie	0.59 (Predicted)	1.09 (Predicted)	0.09 (Predicted)	5.00

So, in the example above, we can recommend The Matrix to Alice since MF model predicts that Alice may rate The Matrix 4.32 out of 5.

Note: This section is the reading note of the book Deep Learning Recommender System 2.0 (深度学习推荐系统2.0) by Zhe Wang (王喆).³

Matrix Factorization decomposes an $m \times n$ User-Item Matrix $R$ into an $m \times k$ User Matrix $U$ and the transpose of an $n \times k$ Item Matrix $V$ :

\mathbf{R} = \mathbf{U}\mathbf{V}^{\text{T}} \tag{3-1}

Where:

$m$ : Number of users.
$n$ : Number of items.
$k$ : Dimension of the latent factors (hidden features) — a hyperparameter that directly impacts the complexity of factorization and needs to be tuned:
- Large $k$ : Higher model capacity, weaker generalization.
- Small $k$ : Lower model capacity, better generalization.

Hence the predicted rating of user $u$ for item $i$ is:

\hat{r}_{ui} = \boldsymbol{q}_i^{\text{T}} \boldsymbol{p}_u \tag{3-2}

Where:

$\boldsymbol{p}_u$ : Latent vector for user $u$ (the $u$ -th row of $U$ )
$\boldsymbol{q}_i$ : Latent vector for item $i$ (the $i$ -th row of $V$ )

Note: By default, both $\boldsymbol{p}_u$ and $\boldsymbol{q}_i$ are treated as column vectors. So, the inner product of two vectors produces the scalar value $\hat{r}_{ui}$ .

Now, the basic idea of MF is clear. So, let us look deeper into the calculation of Matrix Factorization:

With the core concept of MF established, let’s first concentrate on the goal of factorization:

The primary optimization goal is to minimize the discrepancy between the ground-truth rating $r_{ui}$ and the dot product of the latent vectors $\boldsymbol{q}_i^{\text{T}} \boldsymbol{p}_u$ , thereby preserving the original information of the user-item matrix as much as possible.

So, We can define the baseline objective function for Matrix Factorization as follows:

\min_{\boldsymbol{q}^*, \boldsymbol{p}^*} \sum_{(u,i) \in K} (r_{ui} - \boldsymbol{q}_i^{\text{T}} \boldsymbol{p}_u)^2 \tag{3-3}

Where:

$K$ is the set of all observed user-item rating pairs in the training sample.

In practice, we often add the L2 regularization to the objective function to avoid over-fitting.

So the final objective function is:

\boxed{L = \min_{\boldsymbol{q}^*, \boldsymbol{p}^*} \sum_{(u,i) \in K} [(r_{ui} - \boldsymbol{q}_i^{\text{T}} \boldsymbol{p}_u)^2 + \lambda \left( \|\boldsymbol{q}_i\|^2 + \|\boldsymbol{p}_u\|^2 \right)]} \tag{3-4}

Where:

$\lambda$ is the regularization hyperparameter that controls the penalty strength to avoid large vector magnitudes.

NOTE: L2 regularization penalizes large vector magnitudes because smaller weights force the model to be simpler and smoother, preventing it from over-fitting to the training data.

Next, let’s dive into how we actually compute the decomposition. There are four primary methods to consider:

Eigendecomposition: NOT applicable, as the user-item matrix $\mathbf{R}$ is rarely a square ( $n \times n$ ) matrix.
Singular Value Decomposition (SVD): Impractical, because its computational complexity reaches $O(mn^2)$ for $m$ users and $n$ items, which scales poorly (some pratical truncated SVD variants are not discussed in this post).
Gradient Descent (GD): The most popular choice (especially SGD). It bypasses missing entries by optimizing a loss function only over observed ratings, updating latent vectors iteratively — highly scalable and flexible
Alternating Least Squares (ALS): An elegant alternative optimized for parallelization, which transforms a non-convex problem into a series of quadratic subproblems.

So we will demonstrate how to optimize the objective function with Gradient Descent and Alternating Least Squares in the next two parts.

In this part, we use the Gradient Descent to optimize the objective function.

For each observed user-item pair $(u,i)$ , we take the partial derivatives of Equation (3-4) with respect to $\boldsymbol{q}_i$ and $\boldsymbol{p}_u$ :

\frac{\partial L}{\partial \boldsymbol{q}_i} = -2\left(r_{ui} - \boldsymbol{q}_i^{\text{T}}\boldsymbol{p}_u\right)\boldsymbol{p}_u + 2\lambda\boldsymbol{q}_i \tag{3-5}

\frac{\partial L}{\partial \boldsymbol{p}_u} = -2\left(r_{ui} - \boldsymbol{q}_i^{\text{T}}\boldsymbol{p}_u\right)\boldsymbol{q}_i + 2\lambda\boldsymbol{p}_u \tag{3-6}

Then, In each step, $\boldsymbol{q}_i$ and $\boldsymbol{p}_u$ are updated via gradient descent:

\boldsymbol{q}_i \leftarrow \boldsymbol{q}_i - \gamma \frac{\partial L}{\partial \boldsymbol{q}_i} \tag{3-7}

\boldsymbol{p}_u \leftarrow \boldsymbol{p}_u - \gamma \frac{\partial L}{\partial \boldsymbol{p}_u} \tag{3-8}

By substituting the partial derivatives and absorbing the coefficients into $\gamma$ , we obtain the final update rules:

\boxed{\boldsymbol{q}_i \leftarrow \boldsymbol{q}_i + \gamma \left( \left( r_{ui} - \boldsymbol{q}_i^{\text{T}}\boldsymbol{p}_u \right) \boldsymbol{p}_u - \lambda \boldsymbol{q}_i \right)} \tag{3-9}

\boxed{\boldsymbol{p}_u \leftarrow \boldsymbol{p}_u + \gamma \left( \left( r_{ui} - \boldsymbol{q}_i^{\text{T}}\boldsymbol{p}_u \right) \boldsymbol{q}_i - \lambda \boldsymbol{p}_u \right)} \tag{3-10}

Repeat Rule (3-9) and Rule (3-10) until MF model converges, or until any of the following termination criteria are met:

The maximum number of training steps is exceeded
The training loss falls below a predefined threshold $\theta$

In this part, we use the Alternating Least Squares to optimize the objective function.

While the objective function (3-4) is non-convex when optimizing $\boldsymbol{p}_u$ and $\boldsymbol{q}_i$ simultaneously, it becomes a standard quadratic convex problem if we fix one set of variables and solve for the other. ALS leverages this property by alternatingly fixing $\mathbf{V}$ to solve for $\mathbf{U}$ , and then fixing $\mathbf{U}$ to solve for $\mathbf{V}$ .

Step 1: Fix Item Latent Matrix $\mathbf{V}$ , Update User Latent Vector $\boldsymbol{p}_u$

When the item matrix $\mathbf{V}$ (and thus all $\boldsymbol{q}_i$ ) is treated as constant, we can solve for each user’s latent vector $\boldsymbol{p}_u$ independently. For a specific user $u$ , we take the partial derivative of the objective function $L$ with respect to $\boldsymbol{p}_u$ and set it to zero:

\frac{\partial L}{\partial \boldsymbol{p}_u} = \sum_{i \in K_u} -2\left(r_{ui} - \boldsymbol{q}_i^{\text{T}}\boldsymbol{p}_u\right)\boldsymbol{q}_i + 2\lambda\boldsymbol{p}_u = 0

Where $K_u$ denotes the set of all items rated by user $u$ . By rearranging the terms to isolate $\boldsymbol{p}_u$ , we get:

\sum_{i \in K_u} \left(\boldsymbol{q}_i\boldsymbol{q}_i^{\text{T}}\boldsymbol{p}_u\right) + \lambda\boldsymbol{p}_u = \sum_{i \in K_u} r_{ui}\boldsymbol{q}_i

\left( \sum_{i \in K_u} \boldsymbol{q}_i\boldsymbol{q}_i^{\text{T}} + \lambda \mathbf{I}_k \right) \boldsymbol{p}_u = \sum_{i \in K_u} r_{ui}\boldsymbol{q}_i

Where $\mathbf{I}_k$ is the $k \times k$ identity matrix. By multiplying both sides by the inverse matrix, we obtain the exact closed-form solution for $\boldsymbol{p}_u$ :

\boxed{\boldsymbol{p}_u = \left( \sum_{i \in K_u} \boldsymbol{q}_i\boldsymbol{q}_i^{\text{T}} + \lambda \mathbf{I}_k \right)^{-1} \sum_{i \in K_u} r_{ui}\boldsymbol{q}_i} \tag{3-11}

Step 2: Fix User Latent Matrix $\mathbf{U}$ , Update Item Latent Vector $\boldsymbol{q}_i$

Conversely, when the user matrix $\mathbf{U}$ (and thus all $\boldsymbol{p}_u$ ) is fixed, the problem becomes symmetric. For a specific item $i$ , we take the partial derivative of $L$ with respect to $\boldsymbol{q}_i$ , set it to zero, and solve the linear system for the optimal $\boldsymbol{q}_i$ :

\boxed{\boldsymbol{q}_i = \left( \sum_{u \in K_i} \boldsymbol{p}_u\boldsymbol{p}_u^{\text{T}} + \lambda \mathbf{I}_k \right)^{-1} \sum_{u \in K_i} r_{ui}\boldsymbol{p}_u} \tag{3-12}

Where: $K_i$ denotes the set of all users who have rated item $i$ .

Repeat Equation (3-11) and Equation (3-12) alternatively to update all users and items until the MF model converges, or until any of the following termination criteria are met:

The maximum number of training iterations is exceeded.
The training loss falls below a predefined threshold $\theta$ .

Since each user’s update is completely independent of the others (and the same applies to items), ALS is highly parallelizable.

Pros:
- Superb Sparsity Handling: Fills in the blanks of massive, sparse matrices exceptionally well by leveraging underlying latent structures.
- High Predictive Accuracy: Outperformed classic CF algorithms by significant margins in industry benchmarks.
- Space Efficiency: Reduces space complexity from $O(n^2)$ (storing full similarity matrices) to $O((m + n) \cdot k)$ (storing only latent vectors), where $k \ll m, n$ .
- High Scalability: Matrices are decomposed to latent vectors — similar to Embeddings — for feature engineering pipelines, and thus can be seamlessly bridged with Deep Learning networks
Cons:
- Inflexible Feature Integration:
  - Inconvenient to incorporate side information (richer data such as user demographics, item attributes, and contextual features) into matrix factorization
- Cold Start Problem: It fails to provide effective recommendations when historical user-item interactions are completely absent (e.g., for brand-new users or items).

Collaborative Filtering is one of the foundational paradigms in modern recommender systems. Starting from the intuitive idea of “users with similar behaviors may share similar preferences,” UserCF models recommendations through user relationships, while ItemCF shifts the focus toward item-to-item interactions for better scalability and stability.

However, traditional memory-based approaches rely heavily on direct similarity computation and struggle with sparse data and generalization. Matrix Factorization addresses these limitations by learning latent representations of users and items, transforming sparse interaction patterns into compact mathematical embeddings.

This evolution from UserCF and ItemCF to MF reflects a broader shift in recommender systems: from manually measuring similarity to automatically discovering hidden preference structures. Although modern industrial systems have moved toward deep learning and hybrid recommendation architectures, collaborative filtering remains a fundamental building block for understanding personalized recommendation.

G. Linden, B. Smith and J. York, “Amazon.com recommendations: item-to-item collaborative filtering,” in IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan.-Feb. 2003, doi: 10.1109/MIC.2003.1167344. ↩
Y. Koren, R. Bell and C. Volinsky, “Matrix Factorization Techniques for Recommender Systems,” in Computer, vol. 42, no. 8, pp. 30-37, Aug. 2009, doi: 10.1109/MC.2009.263. ↩
Zhe Wang. “Deep Learning Recommender System 2.0” (“深度学习推荐系统2.0”), Beijing: Publishing House of Electronics Industry (“电子工业出版社”), Mar. 2025, ISBN: 9787121497469. ↩

Comments