# Course:CPSC522/Regularization as an Alternative to Negative Sampling in KGs

## Title

The unintuitive nature of negative sampling in training SimplE led us to try regularization on relations as an alternative for the link prediction task in knowledge graphs. Our tests show that it is possible to train SimplE without any negative samples and achieve fair results by regularizing only on relation counts in the training set.

Collaborators:

## Abstract

Because most knowledge graphs contain only positive samples, state-of-the-art link prediction algorithms rely on generating negative samples to make learning embeddings possible. On this page, a regularization based on the number of training triples of each relation is proposed as a substitute for negative sampling. Two different regularization techniques are introduced and tested, showing that regularizing the norm of the relation embedding vectors achieves fair results.

### Builds on

Basic understanding of knowledge graphs, embeddings, stochastic gradient descent, and regularization.

### Related Pages

SimplE algorithm[1]

Knowledge graphs[2]

## Link Prediction in Knowledge Graphs

### Knowledge Graphs

A knowledge graph consists of (subject, predicate, object) triples, referred to as SPO triples. We will use the (head, relation, tail) notation for the triples on this page. The existence of a particular SPO triple indicates an existing fact, i.e., that the respective entities are in the given relationship. For instance, the information "Leonardo DiCaprio starred in Shutter Island" can be expressed via the triple (Leonardo_DiCaprio, starred_in, Shutter_Island). Under the open world assumption (OWA), a non-existing triple is interpreted as unknown, i.e., the corresponding relationship can be either true or false. Continuing with the above example, a missing starred_in edge between Leonardo_DiCaprio and some other movie is not interpreted to mean that DiCaprio did not star in that movie. This more cautious approach is justified, since KGs are known to be very incomplete. For example, sometimes just the main actors in a movie are listed, not the complete cast.

In schema-based approaches, entities and relations are represented via globally unique identifiers and all possible relations are predefined in a fixed vocabulary. For example, Freebase might represent the fact that Barack Obama was born in Hawaii using the triple (/m/02mjmr, /people/person/born-in, /m/03gh4), where /m/02mjmr is the unique machine ID for Barack Obama.[2]

On this page, we will assume a schema-based database under the open world assumption. In particular, the FB15K[3] data set, a subset of Freebase[4], is used in the experiments described on this page. This data set consists of 14,951 entities and 1,345 relations; defining ${\displaystyle E}$ to be the set of all entities in the data set, ${\displaystyle |E|=14951}$. The training set contains 483,142 triples, the validation set 50,000 triples, and the test set 59,071 triples.

The link prediction task in knowledge graphs is concerned with predicting the existence (or probability of correctness) of edges in the graph (i.e., triples). This is important since existing knowledge graphs are often missing many facts, and some of the edges they contain are incorrect. In the context of knowledge graphs, link prediction is also referred to as knowledge graph completion. For example, given the two triples (Barack_Obama, born_in, Hawaii) and (Barack_Obama, was_president_of, United_States), we might be able to deduce the triple (Hawaii, is_a_state_of, United_States).

### SimplE for Link Prediction in KGs

The SimplE algorithm[1] is a tensor factorization approach which uses embedding vectors to represent entities and relations. SimplE considers two vectors ${\displaystyle h_{e},t_{e}\in \mathbb {R} ^{d}}$ as the embedding of each entity ${\displaystyle e}$, and two vectors ${\displaystyle v_{r},v_{r^{-1}}\in \mathbb {R} ^{d}}$ for each relation ${\displaystyle r}$. The similarity function of SimplE for a triple ${\displaystyle (e_{i},r,e_{j})}$ is defined as ${\displaystyle {\frac {1}{2}}{\big (}\langle h_{e_{i}},v_{r},t_{e_{j}}\rangle +\langle h_{e_{j}},v_{r^{-1}},t_{e_{i}}\rangle {\big )}}$, i.e. the average of the scores for ${\displaystyle (e_{i},r,e_{j})}$ and ${\displaystyle (e_{j},r^{-1},e_{i})}$. In the score formula, ${\displaystyle \langle v,w,x\rangle =\sum _{j=1}^{d}v[j]\cdot w[j]\cdot x[j]}$, where ${\displaystyle v}$, ${\displaystyle w}$, and ${\displaystyle x}$ are vectors of size ${\displaystyle d}$.
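As an illustration, the similarity function can be sketched in a few lines of NumPy (a minimal sketch, not the authors' implementation; the function and variable names are our own):

```python
import numpy as np

def triple_product(v, w, x):
    """Three-way inner product <v, w, x> = sum_j v[j] * w[j] * x[j]."""
    return np.sum(v * w * x)

def simple_score(h_i, t_i, h_j, t_j, v_r, v_r_inv):
    """SimplE similarity for a triple (e_i, r, e_j): the average of the
    forward score <h_{e_i}, v_r, t_{e_j}> and the inverse-relation score
    <h_{e_j}, v_{r^-1}, t_{e_i}>."""
    return 0.5 * (triple_product(h_i, v_r, t_j) + triple_product(h_j, v_r_inv, t_i))
```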

#### SimplE Model Evaluation

After training the model, which essentially means finding embedding vectors that give high scores to the triples in the training set, the model is evaluated using the mean reciprocal rank (MRR). To do this evaluation, for each test triple ${\displaystyle (h,r,t)}$ we compute the scores of the triples ${\displaystyle (h',r,t)}$ for all ${\displaystyle h'\in E}$ and calculate the rank ${\displaystyle rank_{h}}$ of the triple having ${\displaystyle h}$, and we compute the scores of the triples ${\displaystyle (h,r,t')}$ for all ${\displaystyle t'\in E}$ and calculate the rank ${\displaystyle rank_{t}}$ of the triple having ${\displaystyle t}$. Finally, the MRR is defined as the mean of the inverse of these rankings over all triples in the test set: ${\displaystyle MRR={\frac {1}{2|tt|}}\sum _{(h,r,t)\in tt}{\bigg (}{\frac {1}{rank_{h}}}+{\frac {1}{rank_{t}}}{\bigg )}}$, where ${\displaystyle tt}$ is the set of test triples.
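The MRR computation can be sketched as follows (a minimal sketch assuming raw ranks; each test triple contributes two ranking tasks, one for head replacement and one for tail replacement, so the mean over all tasks matches the ${\displaystyle {\frac {1}{2|tt|}}}$ factor in the formula above):

```python
import numpy as np

def mrr(ranking_tasks):
    """Mean reciprocal rank over a list of (scores, true_index) pairs,
    one pair per ranking task (two tasks per test triple).

    `scores` holds the model score for every candidate entity; the rank of
    the true entity is 1 plus the number of candidates scored strictly higher.
    """
    inv_ranks = []
    for scores, true_idx in ranking_tasks:
        rank = 1 + np.sum(scores > scores[true_idx])
        inv_ranks.append(1.0 / rank)
    return float(np.mean(inv_ranks))
```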

#### SimplE Training with Negative Sampling

Since in the FB15K data set the training set consists only of positive examples, a model which always predicts "True" will get the minimum training loss for most feasible loss functions. In order to mitigate this issue, state-of-the-art training methods generate negative samples and train the model on a set including both positive and negative samples. To generate negative samples, for each positive triple in the data set, ${\displaystyle n}$ negative triples are generated by corrupting the positive triple. For a positive triple ${\displaystyle (h,r,t)}$, we randomly decide to corrupt the head or the tail. If the head is selected, we replace ${\displaystyle h}$ in the triple with an entity ${\displaystyle h'}$ randomly selected from ${\displaystyle E-\{h\}}$ and generate the corrupted triple ${\displaystyle (h',r,t)}$. If the tail is selected, we replace ${\displaystyle t}$ in the triple with an entity ${\displaystyle t'}$ randomly selected from ${\displaystyle E-\{t\}}$ and generate the corrupted triple ${\displaystyle (h,r,t')}$.
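The corruption procedure can be sketched as follows (a minimal sketch; handling of corrupted triples that happen to coincide with other positive triples is implementation-dependent and omitted here):

```python
import random

def corrupt(triple, entities):
    """Generate one negative sample from a positive (h, r, t) triple by
    replacing either the head or the tail with a different random entity."""
    h, r, t = triple
    if random.random() < 0.5:
        # Corrupt the head: pick h' from E - {h}.
        h_prime = random.choice([e for e in entities if e != h])
        return (h_prime, r, t)
    else:
        # Corrupt the tail: pick t' from E - {t}.
        t_prime = random.choice([e for e in entities if e != t])
        return (h, r, t_prime)
```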

Using the union of the training set and the generated negative samples, stochastic gradient descent is used to minimize the following loss function:

${\displaystyle loss=min_{\theta }\sum _{((h,r,t),l)\in T}softplus(-l.\phi (h,r,t))+\lambda \|\theta \|_{2}^{2}}$,

where ${\displaystyle T}$ is the union of the labeled training set and the labeled negative samples.

SimplE generates 10 negative samples for each positive sample in FB15K to train on the data set.

## Regularization as an Alternative to Negative Sampling

Since generating negative samples is not very intuitive, on this page we evaluate how regularization can be used to mitigate the problem of nonexistent negative samples in the training set. Our hypothesis is that some kind of regularization based on the number of training samples of a relation can be used to regularize the SimplE model, so that negative samples are not necessary for training. From this point on, the number of training samples of a relation ${\displaystyle r}$ will be denoted by ${\displaystyle n_{r}}$. Since we want to regularize using ${\displaystyle n_{r}}$, we will have a regularizer ${\displaystyle Reg_{r}(n_{r})}$, a function of ${\displaystyle n_{r}}$ and some feature of the model, whose purpose is to regularize relation ${\displaystyle r}$. Therefore, the general formula for the loss function with this regularization is as follows:

${\displaystyle loss=min_{\theta }\sum _{r\in {\mathcal {R}}}{\bigg [}\sum _{h,t:(h,r,t)\in {\textbf {Train}}}{\bigg (}softplus(-\phi (h,r,t)){\bigg )}+Reg_{r}(n_{r}){\bigg ]}+\lambda \|\theta \|_{2}^{2}}$,

where ${\displaystyle {\mathcal {R}}}$ is the set of all relations in the dataset and ${\displaystyle {\textbf {Train}}}$ is the training set. As you can see in the above formula, there is no label for each sample because all the samples are positive. It is also important to notice the position of ${\displaystyle Reg_{r}(n_{r})}$ in the loss expression. To make this loss function work with stochastic gradient descent, we need summation over all the samples, so the final formula for the loss function, which will be used to train the model, is as follows:

${\displaystyle loss=min_{\theta }\sum _{(h,r,t)\in {\textbf {Train}}}{\bigg (}softplus(-\phi (h,r,t))+{\frac {1}{n_{r}}}.Reg_{r}(n_{r}){\bigg )}+\lambda \|\theta \|_{2}^{2}}$,

which is equal to the previously mentioned loss function, since the ${\displaystyle {\frac {1}{n_{r}}}Reg_{r}(n_{r})}$ term is summed ${\displaystyle n_{r}}$ times for each relation ${\displaystyle r}$, contributing ${\displaystyle Reg_{r}(n_{r})}$ per relation in total.
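Counting ${\displaystyle n_{r}}$ and computing the per-sample loss above can be sketched as follows (a minimal sketch; `reg_r` stands for any regularizer ${\displaystyle Reg_{r}(n_{r})}$, the score is assumed given, and the surrounding SGD loop is omitted):

```python
import numpy as np
from collections import Counter

def softplus(x):
    # Numerically stable softplus: log(1 + e^x).
    return np.log1p(np.exp(-abs(x))) + max(x, 0.0)

def per_sample_loss(score, n_r, reg_r):
    """Per-sample training loss without negative samples: the softplus term
    for the positive triple plus the relation regularizer divided by n_r,
    so that summed over an epoch each relation contributes Reg_r(n_r) once."""
    return softplus(-score) + reg_r(n_r) / n_r

# n_r: how many times each relation occurs among the training triples.
train = [("a", "r1", "b"), ("c", "r1", "d"), ("e", "r2", "f")]
n = Counter(r for _, r, _ in train)
```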

### Test 1 - Regularizing on Predictions of the Model

Usually, a regularization term is based on the model structure or parameters, but as a first test, we decided to try a regularizer based on the model predictions. This is simpler than regularizing based on model structure or parameters, because it is hard to devise a regularizer which incorporates both ${\displaystyle n_{r}}$ and the model parameters. The regularization term in this test was ${\displaystyle Reg_{r}(n_{r})={\Big (}softplus(-\phi (h,r,t))-\alpha n_{r}{\Big )}^{2}}$, and therefore the loss function was

${\displaystyle loss=min_{\theta }\sum _{(h,r,t)\in {\textbf {Train}}}{\bigg (}softplus(-\phi (h,r,t))+{\frac {\lambda '}{n_{r}}}{\Big (}softplus(-\phi (h,r,t))-\alpha n_{r}{\Big )}^{2}{\bigg )}+\lambda \|\theta \|_{2}^{2}}$,

where ${\displaystyle \lambda '}$ is the regularization rate for the new regularizer and ${\displaystyle \alpha }$ is a normalizing factor that scales the values of ${\displaystyle n_{r}}$ into the range ${\displaystyle [0,1]}$ so that they are comparable with the values of the ${\displaystyle softplus}$ term.

A grid search over ${\displaystyle \lambda '}$ in the range ${\displaystyle [0.01,1]}$ and sensible values of ${\displaystyle \alpha }$ gave very bad results: the MRR scores for the trained models were all close to 0.0001, compared to MRR=0.72 for the SimplE implementation of Kazemi and Poole[1].

### Test 2 - Regularizing on Norm of the Embedding Vectors

To derive a regularization term based on the embedding vector of relation ${\displaystyle r}$ and ${\displaystyle n_{r}}$, a few plots were drawn, which can be seen in figure 1. Analyzing these plots shows that there is a correlation between the norm of the embedding vectors ${\displaystyle v_{r}}$ and ${\displaystyle n_{r}}$.

Figure 1. These plots show different features of the embedding vectors ${\displaystyle v_{r}}$ compared to ${\displaystyle n_{r}}$. The embedding vectors for relations are extracted from a SimplE model trained using negative samples. As observed, there seems to be a high correlation between the norm of the relation embedding vectors and the frequency of occurrence of the relation in the training set. Plots drawn using the other embedding vectors of the relations, ${\displaystyle v_{r^{-1}}}$, are very similar to these plots.

To make use of the correlation observed in the plots in figure 1 and derive a regularization term based on ${\displaystyle \|v_{r}\|^{2}}$, polynomial regression functions of degrees 1, 2, and 3 were fitted to the plot of ${\displaystyle \|v_{r}\|^{2}}$ vs. ${\displaystyle n_{r}}$. The resulting regression curves can be seen in figure 2.

Figure 2. Regressions fitted on the ${\displaystyle \|v_{r}\|^{2}}$ vs. ${\displaystyle n_{r}}$ plot and the ${\displaystyle \|v_{r^{-1}}\|^{2}}$ vs. ${\displaystyle n_{r}}$ plot.
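Fits like those in figure 2 can be produced with NumPy's `polyfit`; the arrays below are hypothetical stand-ins for the extracted statistics, not the actual FB15K values:

```python
import numpy as np

# Hypothetical data standing in for the extracted statistics: relation
# frequencies n_r and squared norms ||v_r||^2 of the trained embeddings.
n_r = np.array([10.0, 100.0, 500.0, 1000.0, 5000.0, 10000.0])
norm_sq = np.array([0.04, 0.05, 0.07, 0.09, 0.32, 0.60])

# Fit polynomial regressions of degrees 1, 2, and 3, as in figure 2.
fits = {deg: np.polyfit(n_r, norm_sq, deg) for deg in (1, 2, 3)}

# The degree-1 fit yields the slope and intercept (alpha, beta) used in
# the regularizer Reg_r(n_r) = (||v_r||^2 - (alpha * n_r + beta))^2.
alpha, beta = fits[1]
```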

To derive the regularization term, the degree-one polynomial regression curve was used. As a result, the regularizer was chosen to be ${\displaystyle Reg_{r}(n_{r})={\Big (}\|v_{r}\|^{2}-(\alpha n_{r}+\beta ){\Big )}^{2}}$. Finally, the loss function is as follows:

Figure 3. This figure shows the MRR scores achieved on the FB15K data set using different values of ${\displaystyle \lambda '}$.

${\displaystyle loss=min_{\theta }\sum _{(h,r,t)\in {\textbf {Train}}}{\bigg (}softplus(-\phi (h,r,t))+{\frac {\lambda '}{n_{r}}}{\Big (}\|v_{r}\|^{2}-(\alpha n_{r}+\beta ){\Big )}^{2}{\bigg )}+\lambda \|\theta \|_{2}^{2}}$,

where ${\displaystyle \alpha }$ and ${\displaystyle \beta }$ are derived from the fitted regression line (${\displaystyle \alpha =0.00005675}$ and ${\displaystyle \beta =0.0327}$), and ${\displaystyle \lambda '}$ is the regularization rate. The MRRs achieved with a grid search over ${\displaystyle \lambda '}$ can be seen in figure 3. Results as high as MRR=0.18 are achieved; while still well below the MRR=0.72 of the SimplE implementation of Kazemi and Poole[1], this shows the model can learn without any negative samples. The MRR appears to still be increasing with ${\displaystyle \lambda '}$ and does not seem to have saturated yet, which means that testing ${\displaystyle \lambda '}$ values greater than 0.7 might give even better results.
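With the stated values of ${\displaystyle \alpha }$ and ${\displaystyle \beta }$, the regularizer in this loss can be written as (a minimal sketch; the squared norm of the relation embedding is assumed given):

```python
def norm_regularizer(v_r_norm_sq, n_r, alpha=0.00005675, beta=0.0327):
    """Reg_r(n_r) = (||v_r||^2 - (alpha * n_r + beta))^2, with alpha and
    beta taken from the degree-1 regression line fitted in figure 2."""
    return (v_r_norm_sq - (alpha * n_r + beta)) ** 2
```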

#### Devised Regularizer Tested with Negative Sampling

Testing this new regularization term alongside negative sampling resulted in MRR≈0.10 for ${\displaystyle \lambda '\in [0.05,0.5]}$, meaning that this regularizer with high ${\displaystyle \lambda '}$ values does not increase the MRR score of the SimplE implementation of Kazemi and Poole. Further tests with smaller ${\displaystyle \lambda '}$ values are needed to make a general statement about the regularizer's impact on SimplE.

## Discussion and Future Work

The results of the second test show that regularization is powerful enough to achieve fair results without negative samples, and might some day completely substitute negative sampling in training embedding-based models for link prediction in KGs.

Future work includes multiple tests that can be done to further analyze the potential of the newly devised regularization term. Some of these tests include:

• Tests to check if the regression line on the ${\displaystyle \|v_{r}\|^{2}}$ vs. ${\displaystyle n_{r}}$ plot stays fixed among multiple data sets, so that this regularizer can potentially be transported to any data set without any change.
• Trying the degree-two regression to see if it achieves better results than the degree-one regression.

## Annotated Bibliography

1. Seyed Mehran Kazemi and David Poole. SimplE Embedding for Link Prediction in Knowledge Graphs. In Proc. Thirty-second Conference on Neural Information Processing Systems (NeurIPS 2018). Also a previous version in the Eighth International Workshop on Statistical Relational AI, July 2018.
2. Nickel, M.; Murphy, K.; Tresp, V.; and Gabrilovich, E. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104(1):11-33.
3. Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating Embeddings for Modeling Multi-relational Data. Advances in Neural Information Processing Systems (NIPS 26), 2013.
4. Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In ACM SIGMOD, pages 1247–1250. ACM, 2008.