# Course:CPSC522/Using Subset Information with Matrix Factorization

Note: this entry in its current state is, to say the least, unsophisticated (possibly total nonsense); the explanation, itself dubious, is in the last section. It should be redesigned and redone in the near future, if I still have access to this site. This is not for the sake of a grade; I am going to show -- at least to myself, even if no one else cares -- that I can think and reflect.

## Title

This entry concerns the possible incorporation of information about subsets of domain entities into matrix factorization.

Author: Shunsuke Ishige

## Abstract

Matrix factorization, applied in recommender systems to match users with items, uses embedding vectors to represent domain entities such as users and items. In particular, it trains such vectors of latent factors, or hidden features, as model parameters. Entities, however, may be organized into sub-classes based on their properties, such as gender or occupation in the case of users. It can naturally be expected that entities sharing properties, and thereby belonging to the same sub-class, manifest similarities such as preferences observed as ratings of items. This entry very briefly examines the possible effect of incorporating sub-class information into matrix factorization.

### Builds on

Basic understanding of machine learning is assumed. Although some background information about collaborative filtering is provided in this entry, for a more detailed discussion, please see the relevant section of [2].

## Content

### Preliminaries

In this section, general contextual information about and a mathematical formulation of matrix factorization are briefly provided.

#### General Background: Collaborative Filtering

Collaborative filtering is the process of finding a relation between two entities, such as a movie and a prospective viewer, based on some measure of similarity; it is utilized for predicting user preferences in recommender systems [1, 2]. Matrix factorization is a formalism for an approach to collaborative filtering called latent factor models, which are distinguished from models that extrapolate from entities in a neighborhood determined by some metric, such as related products or users [1]. In particular, latent factors, or hidden properties of the entities, comprise the model parameters to be learnt [2]; the use of such factors thus dispenses with prior knowledge of those properties while still providing a measure of plausible relations between the entities [1].

#### Formalism of Matrix Factorization

The formalism used in this entry follows [1] and [2], with the notation being aligned with [2]; for a comprehensive overview of matrix factorization, see [3].

With $u$ denoting a user, $i$ an item, and $r$ a rating, let $E := \{(u, i, r)\}$. Then, given the vector embeddings defined by $u \mapsto up \in \mathbb{R}^k$ and $i \mapsto ip \in \mathbb{R}^k$, where $\mathbb{R}^k$ is the (inner-product) space of latent factors, the optimization of the latent factors, with the bias terms for items and users denoted by $ib$ and $ub$, respectively, is given by

$\min_{ib,\,ub,\,ip,\,up} \left( \sum_{(u,i,r) \in E} \left( r - \hat{r}(u,i) \right)^2 + \lambda \underbrace{L(ib, ip, ub, up)}_{\text{regularization term}} \right)$,

where $\hat{r}(u,i) := \underbrace{\mu + ib[i] + ub[u]}_{\text{bias terms}} + \underbrace{\sum_{p=1}^{k} ip[i,p] \cdot up[u,p]}_{\text{dot product: } ip^T up}$ $(*)$.

The optimization problem can be solved by stochastic gradient descent; please refer to the pseudo-code in [2].

Notes on the formula:

- $ip$ and $up$ are arrays of (hidden) properties associated with the items and users, respectively.
- As a result of the embedding, the mathematical machinery of a continuous space becomes available.
- As the summation form shows explicitly, the product or "interaction" [1] of such properties is element-wise.
- A related point is that the dot product is a vector operation, so that the relative orientation of the vectors matters.
- If we ignore the bias terms in $\hat{r}$ (although they, too, could be expressed in matrix format), the dot products collectively represented in matrix format would be $\begin{pmatrix} ip_1^T \\ \vdots \\ ip_n^T \end{pmatrix} \begin{pmatrix} up_1 & \cdots & up_n \end{pmatrix}$.
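The objective and its solver can be sketched concretely. The following is a minimal NumPy sketch of stochastic gradient descent on the biased objective above, not the pseudo-code of [2]; the learning rate, regularization constant, and initialization are arbitrary choices for illustration.

```python
import numpy as np

def sgd_factorize(ratings, n_users, n_items, k=10, epochs=100, lr=0.01, lam=0.1):
    """Minimal SGD for the biased matrix-factorization objective above.

    ratings: list of (u, i, r) triples (the set E).
    Returns (mu, ub, ib, up, ip): global mean, biases, and embeddings.
    """
    rng = np.random.default_rng(0)
    mu = np.mean([r for _, _, r in ratings])      # global mean rating
    ub = np.zeros(n_users)                        # user biases ub[u]
    ib = np.zeros(n_items)                        # item biases ib[i]
    up = rng.uniform(-0.1, 0.1, (n_users, k))     # user embeddings up[u]
    ip = rng.uniform(-0.1, 0.1, (n_items, k))     # item embeddings ip[i]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + ib[i] + ub[u] + ip[i] @ up[u])  # r - r_hat(u, i)
            # gradient steps on squared error plus L2 regularization
            ub[u] += lr * (err - lam * ub[u])
            ib[i] += lr * (err - lam * ib[i])
            up_u = up[u].copy()                   # pre-update copy for ip's step
            up[u] += lr * (err * ip[i] - lam * up[u])
            ip[i] += lr * (err * up_u - lam * ip[i])
    return mu, ub, ib, up, ip

def predict(model, u, i):
    """Evaluate r_hat(u, i) as in the equation marked (*)."""
    mu, ub, ib, up, ip = model
    return mu + ib[i] + ub[u] + ip[i] @ up[u]
```

On a small rating set, a few hundred epochs suffice to drive the training error well below the variance of the raw ratings.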

### Problem Statement

The problem discussed in this entry is, in nature, the new user problem in recommender systems -- the problem of finding efficacious criteria for recommendation/prediction for each new user [4]. The focus of this entry is, however, an admittedly cursory examination of the relationship between domain information expressed as subsets of entities and the latent factors in the vector embeddings representing those entities. The context of collaborative filtering was chosen because of the availability of convenient data and the relative simplicity of the formulation. So, even though incorporating extraneous information may not necessarily make sense in the context of recommender systems, the primary purpose of this entry justifies it.

The class of users can essentially be partitioned into finer sub-classes by way of demographic information. For instance, one user might belong to the class of students as well as to that of females, while another might be a male administrator. Presumably, the users of each sub-class, having similar properties, share preferences. Those properties are reflected in the learnt hidden features and should help make predictions for a new user based on class membership.

### Data and Procedures

#### Data

The data set used in this entry is MovieLens ml-1m [5], which comprises the records of 6,040 users rating about 3,900 movies of diverse genres on a categorical scale of 1 to 5. Each user rates multiple times, for a total of 1,000,209 ratings. The rating records include time stamps, while the user records include demographic information such as gender, occupation, and age. The rating and demographic information, contained in separate files, can be cross-referenced by the unique identification number of each user. See the ml-1m README for the exact data format.
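Assuming the "::"-separated record format documented in the ml-1m README (UserID::MovieID::Rating::Timestamp for ratings, UserID::Gender::Age::Occupation::Zip-code for users), loading and cross-referencing might be sketched as follows; the function names are mine, not the entry's actual code.

```python
def parse_ratings(lines):
    """Parse ml-1m ratings.dat lines: UserID::MovieID::Rating::Timestamp."""
    out = []
    for line in lines:
        user, movie, rating, ts = line.strip().split("::")
        out.append((int(user), int(movie), int(rating), int(ts)))
    return out

def parse_users(lines):
    """Parse ml-1m users.dat lines: UserID::Gender::Age::Occupation::Zip-code.

    Returns a dict keyed by user id, for cross-referencing with ratings.
    """
    out = {}
    for line in lines:
        user, gender, age, occupation, zipc = line.strip().split("::")
        out[int(user)] = {"gender": gender, "age": int(age),
                          "occupation": int(occupation), "zip": zipc}
    return out
```

The user id shared by the two files is what links a rating to the rater's demographic sub-class.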

#### Procedures

To consider the effect of incorporating subset information in predicting the rating by new users, two basic cases are considered:

1. basing the prediction solely (i.e. without learning) on the user's demographic class information; and
2. using latent factors containing similar information.

A few notes are in order. First, in terms of the distinction between neighborhood and latent factor models mentioned in the Preliminaries, case 1 is of the former kind, while case 2 is a mixture of the two. Second, although the primary interest here is case 2, case 1 is included for two reasons: (1) it helps estimate the extent to which demographic differences are actually reflected in statistical trends in the ratings, and (2) it provides a comparison in effectiveness with the other case. Finally, although movies may also share properties in terms of genres, in this evaluation only individual movies are considered, fixing that experimental variable.

The Python code used in the evaluation is based on the following two files: file 1 [6] and file 2 [7].

In the following, each of the above two cases is elaborated.

##### Case 1
*[Figure: subsetting of the user class]*

The reference class of a user can be defined as the subset determined by his or her demographic attributes, such as gender and occupation. Each reference class, such as the female students or the retired males, is expected to share certain properties that are reflected in its preferences for movies.

In case 1, the mode of the ratings (categorical, on a scale of 1 to 5) of a movie by the reference class is used as the prediction for the new user, with 0-1 loss as the error. Basing the prediction on the reference class of a new user should leverage the information about shared properties. To coarsely simulate the accumulation of data as new users join and rate (possibly multiple times), two different splittings of the data with respect to the time stamp are tested: (1) using the first half for computing the mode specific to a reference class, and (2) using the first two thirds as the available data. The prediction error is computed simply as the ratio of wrong predictions, with respect to the ratings by the new users (i.e., those not in the training set) in the remaining data in each case, to the total number of such ratings.

There might be a trade-off between the granularity of the subsetting of the base class 'user' and the amount of data available at the time. (See the accompanying figure: each partition can be considered a bin for data.) The finer the subsetting, the less data each class might contain for a given amount of total data available for the generic class 'user', so that the mode of a reference class might not reflect the actual distribution. Ideally, to see the possible effect of this trade-off over time, two different subsettings should be compared, such as male/female on the one hand and gender plus occupation on the other. However, due to time constraints, in the current version of the entry only gender is considered for subsetting.
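The case 1 procedure can be sketched as follows; the function and toy data layout are hypothetical, with `statistics.mode` standing in for the class-specific mode and the split fraction (1/2 or 2/3 above) as a parameter.

```python
from statistics import mode

def case1_error(ratings, gender_of, frac=0.5):
    """0-1 error of predicting each new user's rating of a single fixed
    movie by the mode of his or her gender class in the earlier portion
    of the data.

    ratings: list of (user, rating, timestamp) triples for one movie.
    gender_of: dict mapping user id -> 'M' or 'F'.
    """
    ratings = sorted(ratings, key=lambda t: t[2])   # order by time stamp
    cut = int(len(ratings) * frac)
    train, test = ratings[:cut], ratings[cut:]
    seen = {u for u, _, _ in train}
    # class-specific modes computed from the training portion only
    modes = {g: mode([r for u, r, _ in train if gender_of[u] == g])
             for g in ("M", "F")}
    # evaluate only on ratings by users not in the training set ("new" users)
    new = [(u, r) for u, r, _ in test if u not in seen]
    wrong = sum(1 for u, r in new if modes[gender_of[u]] != r)
    return wrong / len(new)
```

On data where each gender class rates consistently, the class mode predicts the new users' ratings exactly and the error is 0.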

##### Case 2
*[Figure: combining embedding vectors]*
*[Figure: data split]*

In this case, the effect of using subset information with latent factors/hidden features is considered. In particular, the rating data for a given movie are split as in case 1 (i.e., the first half or the first two thirds for training). When trained on the input data, the matrix factorization model yields calibrated vector embeddings of the users (as well as of the movie). To incorporate the gender sub-class information into the hidden features, the training is done with one entry of the embedding vector fixed to 1 for female users and to 0 for male users. This fixing should make the vectors within each sub-class of users similar to one another with respect to that dimension, i.e., the gender feature -- in terms of the effect on the corresponding item vector dimension and also, to some extent, of the way the remaining dimensions in the user vectors are calibrated. Note that only one model is trained, the one for all users in the training set, as opposed to two models with separate sets of parameters fit to gender-specific training data.
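The entry does not show its training code; one plausible implementation of the fixed gender feature, sketched under the assumption that dimension 0 of the user embedding is reserved for gender and simply excluded from the gradient updates, is the following.

```python
import numpy as np

def factorize_with_gender(ratings, genders, n_users, n_items, k=10,
                          epochs=100, lr=0.01, lam=0.1):
    """SGD matrix factorization with user dimension 0 pinned to gender
    (1 for female, 0 for male) and excluded from the updates.

    ratings: (u, i, r) triples; genders: 'F'/'M' per user id, in order.
    """
    rng = np.random.default_rng(0)
    mu = np.mean([r for _, _, r in ratings])
    ub, ib = np.zeros(n_users), np.zeros(n_items)
    up = rng.uniform(-0.1, 0.1, (n_users, k))
    ip = rng.uniform(-0.1, 0.1, (n_items, k))
    up[:, 0] = [1.0 if g == "F" else 0.0 for g in genders]  # fixed feature
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + ib[i] + ub[u] + ip[i] @ up[u])
            ub[u] += lr * (err - lam * ub[u])
            ib[i] += lr * (err - lam * ib[i])
            up_u = up[u].copy()
            up[u, 1:] += lr * (err * ip[i, 1:] - lam * up[u, 1:])  # skip dim 0
            ip[i] += lr * (err * up_u - lam * ip[i])  # item side learns dim 0
    return mu, ub, ib, up, ip
```

The item vectors are still free to calibrate their dimension 0, which then acts as a learned per-item "gender effect" weight.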

To predict the ratings by the new users in the remaining data (the test set) of the respective splits, the calibrated embedding vectors representing the female or male users in the training set are aggregated by taking the element-wise mean within each gender sub-class. This creates an approximate embedding vector for the new user, applied according to the gender of the new user, which is assumed to be known. Note that the new vector is created from existing calibrated vectors; the model is not retrained. The accompanying figures schematically represent this process.

The implementation of the element-wise aggregation exploits the facts that we fixed an item/movie and that the dot product distributes over addition. So, referring to the equation for prediction marked with $(*)$ in the Formalism section above, the gender-specific embedding is estimated for a new user, who is either female or male, by

$\hat{r}_{\text{new user}} := \frac{1}{n} \sum_{u} \hat{r}(u, i_0) = \mu + ib(i_0) + \frac{1}{n} \sum_{u} ub(u) + ip_0 \cdot \underbrace{\frac{\sum_{u} up(u)}{n}}_{:=\, up_{\text{new user}}}$, where $n = \mathrm{card}(\{u : u \text{ is female (male) in the training set}\})$.

The embedding and bias vectors are already optimized with respect to the existing users in the training set, as explained above, so that the predicted ratings $\hat{r}$ should be close to their true labels. In particular, ignoring the bias terms, $\frac{1}{n} \sum_{u} \hat{r}(u, i_0) = ip_0 \cdot \frac{\sum_{u} up(u)}{n}$; this definition of the new embedding in effect just approximates the average of the ratings by each gender group. (Or maybe there is some symmetry with respect to the vector $ip_0$ so that the vectors $up$ just cancel each other, making the new-user term 0? But can they be oriented so that the components along the item vector in particular cancel out? Alternatively, the new-user vector might be orthogonal to the item vector, so that the dot product is zero; but that seems to be beside the point. Perhaps, though, all of this might as well be nonsense.) Since the prediction in this case is a real number, the accuracy is estimated by the mean absolute error and the mean sum of squares error with respect to the true labels in the test set.
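The aggregation step can be sketched directly from the formula above; the tuple packaging of the trained model $(\mu, ub, ib, up, ip)$ as NumPy arrays is my assumption, not the entry's actual code.

```python
import numpy as np

def new_user_prediction(model, item, members):
    """Predict a new user's rating of `item` from the element-wise mean of
    the calibrated vectors (and the mean bias) of the training users in the
    new user's gender class, as in the formula derived from (*).

    model: (mu, ub, ib, up, ip) with up, ip as 2-D arrays.
    members: ids of the same-gender users in the training set.
    """
    mu, ub, ib, up, ip = model
    up_new = up[members].mean(axis=0)   # element-wise mean embedding
    ub_new = ub[members].mean()         # mean user bias
    return mu + ib[item] + ub_new + ip[item] @ up_new
```

By the linearity of the dot product, this value equals the mean of the existing member users' predicted ratings of the item, which is exactly the point made above.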

To estimate the effect of the subset information, the above process is repeated without fixing a feature for gender, aggregating all vectors from the training set to represent new users regardless of their gender.

### Hypothesis

Incorporating sub-class information should in general improve the accuracy of prediction.

### Results

As stated in the Data subsection above, the data set ml-1m includes ratings for about 3,900 movies. However, in this evaluation, movies are considered individually to simplify the design of the study. The number of ratings for a given movie varies considerably. Unless there is a noticeable difference between the ratings by male and female users, the subset information is not expected to be particularly useful.

The following summarizes the results for the movie with id = 1 in ml-1m. For case 2, the number of iterations is 100 and the number of features is 10. Since, at least for this movie, the use of the mode in case 1 indicates only a limited gender-specific difference in rating, the following formula summarizes the raw data (tables 2-5) in table 1: $\text{error fraction} := 100 \cdot \frac{error_{user} - error_{gender}}{error_{user}}$. Given this formula, a negative value indicates an increase in error.
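As a check on the summary formula, the two case 1 rows of table 1 can be reproduced from the raw errors in tables 2 and 4:

```python
def error_fraction(error_user, error_gender):
    """Percentage improvement of the gender-based predictor over the
    generic user predictor; negative values indicate an increase in error."""
    return 100 * (error_user - error_gender) / error_user

# half-split case 1 (table 2) reproduces the 1.12 entry of table 1
print(round(error_fraction(0.6005774783445621, 0.5938402309913379), 2))  # 1.12
# two-thirds-split case 1 (table 4) reproduces the -1.91 entry
print(round(error_fraction(0.6031746031746031, 0.6147186147186147), 2))  # -1.91
```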

table 1: summary

| split and case | error fraction |
| --- | --- |
| half-split case 1 | 1.12 |
| half-split case 2 (for mean absolute error) | 0.18 |
| two-thirds-split case 1 | -1.91 |
| two-thirds-split case 2 (for mean absolute error) | -0.16 |

table 2: half split case 1 (total ratings: 2077; 1039 by new users)

| set | error |
| --- | --- |
| random guessing | 0.7853705486044273 |
| user (mode: 4) | 0.6005774783445621 |
| gender (male mode: 4, female mode: 5) | 0.5938402309913379 |

table 3: half split case 2

| set | mean absolute error | mean sum of squares error |
| --- | --- | --- |
| user (prediction: 4.1503976076277524) | 0.6719347997415742 | 0.7054705550502061 |
| gender (male: 4.120188653463477, female: 4.223677469658462) | 0.6707133978929161 | 0.7074000302671102 |

table 4: two thirds split case 1 (ratings by new users: 693)

| set | error |
| --- | --- |
| random guessing | 0.7994227994227994 |
| user (mode: 4) | 0.6031746031746031 |
| gender (male mode: 4, female mode: 5) | 0.6147186147186147 |

table 5: two thirds split case 2

| set | mean absolute error | mean sum of squares error |
| --- | --- | --- |
| user (prediction: 4.145937907081121) | 0.6702061649941212 | 0.6950883361305636 |
| gender (male: 4.115530241968873, female: 4.221792058275481) | 0.671269515971787 | 0.6995159581605226 |

### "Concluding" Remarks (Self Criticism)

Insofar as the above data are concerned, roughly 1/6 of the positive difference (1/12 for the negative one) observed in case 1 is reflected in case 2.

Actually, even if a greater difference were observed for some movie, or at least the same degree of difference were consistently observed across multiple movies, that would provide no insight into how subset differences are reflected in embedding vectors; what the current version of this entry does, if anything, is simply to indicate whether subset differences are reflected in embedding vectors.

One way the above might be salvaged is to claim that the accuracy of prediction using sub-class information to form a vector for a new user outweighs the cost of retraining the model frequently (and, in practice, of collecting extra data). Additionally, if extra information is available, it might be questioned whether matrix factorization is the best approach in practice, so some defense of that choice would also be necessary. At any rate, the rationale for case 1 does not seem clear: what is that ratio supposed to show at all? The stated reason for fixing a feature in case 2 is mathematically vague: how exactly does it factor into the process of optimization? What is the proof or justification for it? What is to be expected from that change? If what is expected is not clearly known, data cannot confirm or disprove anything; in the first place, what data are to be collected cannot be known.

It is suspected, however, that the mathematics behind this is very complex. For simplicity, let us ignore the biases and assume that there is only one item embedding. Let $\hat{r}_1 = up_1 \cdot ip$ and $\hat{r}_2 = up_2 \cdot ip$. Then, assuming that those users belong to the same sub-class with a shared tendency in rating the movie, after training, $r \approx \hat{r}_1 \approx \hat{r}_2 \Rightarrow 0 \approx (up_1 - up_2) \cdot ip$. But we cannot conclude from this that $up_1 \approx up_2$, since these are vectors. In the $k = 2$ case, there may be a symmetry with respect to the single $ip$ vector, and there might be nice clusters of user vectors belonging to sub-classes, which can be separated. But, for an arbitrary $k$, there might not be such a simple geometric interpretation. Formally, if a vector representing a user could be in any of multiple places, the convergence would be in probability. If we are to give some sub-class interpretation to the hidden features (i.e., the values in each dimension), that would also be in terms of probability. This gets even more complex if we allow multiple items and thereby multiple corresponding embedding vectors; the "simple" case of a single item vector might not generalize in a straightforward manner. Would there be probabilistic clusters for sub-classes of users and, similarly, for items, for instance, those sharing genres?
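The point that $0 \approx (up_1 - up_2) \cdot ip$ does not force $up_1 \approx up_2$ can be checked numerically: adding to a user vector any difference vector orthogonal to $ip$ changes the embedding but leaves every prediction for that single item unchanged. The vectors below are arbitrary illustrative values.

```python
import numpy as np

ip = np.array([1.0, 2.0, 3.0])            # the single item embedding
up1 = np.array([4.0, 1.0, 2.0])           # one user embedding
delta = np.array([2.0, -1.0, 0.0])        # delta @ ip == 2 - 2 + 0 == 0
up2 = up1 + delta                          # a different embedding ...
print(up1 @ ip, up2 @ ip)                  # ... with the same prediction
print(np.allclose(up1, up2))               # False: the embeddings differ
```

With $k \geq 2$, the orthogonal complement of $ip$ is a whole $(k-1)$-dimensional subspace of such indistinguishable perturbations, which is the geometric source of the non-uniqueness worried about above.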

This problem should be approached more carefully in mathematical terms. Such a basic notion as convergence in matrix factorization should readily be found in the literature. But, at least for a single item vector as in this entry, a reasonable place to start might be the distribution in each dimension, since vectors are determined by their components. Each entry/dimension of the embedding vectors is initialized with values sampled from a uniform distribution. How would they be distributed after the model is trained? In particular, if the user embedding vectors are divided according to demographic sub-classes of users and the distribution of each dimension is examined, would they show any observable inter-class differences? There might in fact be probabilistic clusters of sub-class vectors, which might explain the apparent gender-specific difference observed (see tables 3 and 5) for the gender-specific new-user vectors formed by taking the element-wise mean.

Nested sub-classing is possible by incorporating more demographic information, such as occupation and age group, as in the set of female students or that of male technicians. The rather naive schematic figure (see the subsetting figure above) indicates the obvious fact that the finer-grained classes of female students and female retirees may have different intra-class behavioral tendencies, such as in movie rating; but when those classes are considered in terms of their immediate superset, namely the females, their members may be similar, compared with those in the class of males. The rating tendency of the females can be represented as a statistical combination of those of the sub-classes, the female students and retirees in this case. In light of the conjectured picture of probabilistic clusters, the vectors representing the female students and the female retirees might be closer to each other in probabilistic terms. In particular, the probabilistic characteristics of the class of females might be accounted for in terms of those sub-class clusters.

In conclusion, for now, this entry in the current state is little more than idle thinking.

## Annotated Bibliography

[1] Koren Y., Bell R., Volinsky C. Matrix Factorization Techniques for Recommender Systems. Computer, Vol. 42, Issue 8, 2009, pp 30-37. https://ieeexplore.ieee.org/document/5197422

[2] Poole D., Mackworth A. Artificial Intelligence: Foundations of Computational Agents, 2nd Edition. Cambridge UP, 2017. https://artint.info/index.html

[3] Bokde D. et al. Matrix Factorization Model in Collaborative Filtering Algorithms: A Survey. Procedia Computer Science, Vol. 49, 2015, pp 136-146. https://www.sciencedirect.com/science/article/pii/S1877050915007462

[4] Rashid A. M. et al. Getting to Know You: Learning New User Preferences in Recommender Systems. IUI '02: Proceedings of the 7th International Conference on Intelligent User Interfaces, Jan. 2002, pp 127-134. https://dl.acm.org/citation.cfm?id=502737

[5] Harper M. F., Konstan J. A. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19, Dec. 2015. http://dx.doi.org/10.1145/2827872

[6] Poole D. Assignment 1. CPSC 532 - Topics in AI: Statistical Relational Artificial Intelligence, UBC, Spring 2017. https://www.cs.ubc.ca/~poole/cs532/2017/

[7] Poole D., Mackworth A. AIPython: Python Code for AIFCA. https://artint.info/AIPython/