Course:CPSC522/Regularization as an Alternative to Negative Sampling in KGs

Title

Unintuitiveness of negative sampling in training of SimplE has led us to try to use regularization on relations as an alternative in link prediction task in knowledge graphs. The tests show that it is possible to achieve good results by only regularizing on relation counts in the train set to train SimplE without any negative samples.

Principal Author: Ali Mohammad Mehr
Collaborators:

Abstract

Due to the fact that most knowledge graphs contain only positive samples, generation of negative samples has been the technique that state-of-the-art algorithms of link prediction in knowledge graphs have been using to make the learning of embeddings possible in this task. In this page, a regularization based on the number of samples of each relation in the training tuples is proposed to be a substitute for the negative sampling approach. In this regard, two different regularization techniques are introduced and tested, which show that regularizing the norm of the embedding vectors will achieve fair results.

Builds on

Basic understanding of knowledge graphs, embeddings, stochastic gradient descent, and regularization.

Related Pages

SimplE algorithm^[1]

Knowledge graphs^[2]

Link Prediction in Knowledge Graphs

Knowledge Graphs

A knowledge graph consists of (subject, predicate, object) triples, referred to as SPO triples. We will use the (head, relation, tail) notation for the triples in this page. The existence of a particular SPO triple indicates an existing fact, i.e., that the respective entities are in the given relationship. For instance, the information "Leonardo DiCaprio starred in Shutter Island" can be expressed via the triple (Leonardo_DiCaprio, starred_in, Shutter_Island). Under the open world assumption (OWA), a non-existing triple is interpreted as unknown, i.e., the corresponding relationship can be either true or false. Continuing with the above example, the missing edge is not interpreted to mean that Nimoy did not star in Star Wars. This more cautious approach is justified, since KGs are known to be very incomplete. For example, sometimes just the main actors in a movie are listed, not the complete cast.

In schema-based approaches, entities and relations are represented via globally unique identifiers and all possible relations are predefined in a fixed vocabulary. For example, Freebase might represent the fact that Barack Obama was born in Hawaii using the triple (/m/02mjmr, /people/person/born-in, /m/03gh4), where /m/02mjmr is the unique machine ID for Barack Obama.^[2]

On this page, we will assume we are using a database with the open world assumption which is schema-based. In particular, the FB15K^[3] data set which is a subset of Freebase^[4] is used in the experiments described on this page. This data set consists of 14,951 entities and 1,345 relations, defining $E$ to be the set of all entities in the data set, $|E|=14951$ . The training set contains of 483,142 triplets, the validation set 50,000 triples and the test set 59,071 triples.

Link Prediction Task

Link prediction task in knowledge graphs is concerned with predicting the existence (or probability of correctness) of edges in the graph (i.e., triples). This is important since existing knowledge graphs are often missing many facts, and some of the edges they contain are incorrect. In the context of knowledge graphs, link prediction is also referred to as knowledge graph completion. For example, by having the two triples (Barack_Obaba, born_in, Hawaii) and (Barack_Obaba, was_president_of, United_States) we might be able to deduct the triple (Hawaii, is_a_state_of, United States).

SimplE for Link Prediction in KGs

SimplE algorithm^[1] is a tensor factorization approach which uses embedding vectors to represent entities and relation. SimplE considers two vectors $h_{e},t_{e}\in \mathbb {R} ^{d}$ as the embedding of each entity $e$ , and two vectors $v_{r},v_{r^{-1}}\in \mathbb {R} ^{d}$ for each relation $r$ . The similarity function of SimplE for a triple $(e_{i},r,e_{j})$ is defined as ${\frac {1}{2}}(<h_{e_{i}},v_{r},t_{e_{j}}>+<h_{e_{j}},v_{r^{-1}},t_{e_{i}}>)$ , i.e. the average of the scores for $(e_{i},r,e_{j})$ and $(e_{j},v_{r^{-1}},e_{i})$ . In the score formula, $<v,w,x>=\sum _{j=1}^{d}v[j].w[j].x[j]$ , where $v$ , $w$ , and $x$ are vectors of size $d$ .

SimplE Model Evaluation

After training the model, which is essentially finding embedding vectors that give high scores for triples in train set, the model is evaluated using the mean reciprocal rank (MRR). To do this evaluation, for each test triple $(h,r,t)$ we compute the score of $(h',r,t)$ triples for all $h'\in E$ and calculate the ranking $rank_{h}$ of the triple having $h$ , and we compute the score of $(h,r,t')$ triples for all $t'\in E$ and calculate the ranking $rank_{t}$ of the triple having $t$ . Finally, the MRR is defined as the mean of the inverse of these rankings for all triples in the test set: $MRR={\frac {1}{2|tt|}}\sum _{(h,r,t)\in tt}{\frac {1}{rank_{h}}}+{\frac {1}{rank_{t}}}$ , where $tt$ is the set of test triples.

SimplE Training with Negative Sampling

Since in the FB15K data set the training set consists only of positive examples, a model which always predicts "True" will get the minimum training loss for most feasible loss functions. In order to mitigate this issue, state-of-the-art training methods generate negative samples and train the model on a set including both positive and negative samples. To generate negative samples, for each positive triple in data set, $n$ negative triples are generated by corrupting the positive triple. For a positive triple $(h,r,t)$ , we randomly decide to corrupt the head or tail. If the head is selected, we replace $h$ in the triple with an entity $h'$ randomly selected from $E-\{h\}$ and generate the corrupted triple $(h',r,t)$ . If the tail is selected, we replace $t$ in the triple with an entity $t'$ randomly selected from $E-\{t\}$ and generate the corrupted triple $(h,r,t')$ .

Using the union of the training set and the generated negative samples, stochastic gradient descent is used to minimize the following loss function:

$loss=min_{\theta }\sum _{((h,r,t),l)\in T}softplus(-l.\phi (h,r,t))+\lambda \|\theta \|_{2}^{2}$ ,

where $T$ is the union of labeled training set and labeled negative samples.

SimplE generates 10 negative samples for each positive sample in FB15K to train on the data set.

Regularization as an Alternative to Negative Sampling

Since generating negative samples is not quite intuitive, in this page, we will evaluate how regularization can be used to mitigate the problem of nonexistent negative samples in training set. Our hypothesis is that some kind of regularization on number of samples of a relation in training set can be used to regularize the SimplE model, so that negative samples are not necessary for training the model. From this point on, the number of samples of a relation $r$ in training set will be denoted as $n_{r}$ . Since we want to regularize using $n_{r}$ , we will have a regularizer $Reg_{r}(n_{r})$ which is a function of $n_{r}$ and some feature of the model whose point is to regularize relation $r$ . Therefore, the general formula for the loss function with this regularization will be as follows:

$loss=min_{\theta }\sum _{r\in {\mathcal {R}}}{\bigg [}\sum _{h,t:(h,r,t)\in {\textbf {Train}}}{\bigg (}softplus(-\phi (h,r,t)){\bigg )}+Reg_{r}(n_{r}){\bigg ]}+\lambda \|\theta \|_{2}^{2}$ ,

where ${\mathcal {R}}$ is the set of all relations in the dataset and ${\textbf {Train}}$ is the training set. As you can see in the above formula, there is no label for each sample because all the samples are positive. It is also important to notice the position of $Reg_{r}(n_{r})$ in the loss expression. To make this loss function work with stochastic gradient descent, we need summation over all the samples, so the final formula for the loss function, which will be used to train the model, is as follows:

$loss=min_{\theta }\sum _{(h,r,t)\in {\textbf {Train}}}{\bigg (}softplus(-\phi (h,r,t))+{\frac {1}{n_{r}}}.Reg_{r}(n_{r}){\bigg )}+\lambda \|\theta \|_{2}^{2}$ ,

which is equal to the previously mentioned loss function.

Test 1 - Regularizing on Predictions of the Model

Usually, the regularization term is based on model structure or parameters, but as a first test, we decided to first try a regularizer based on the model predictions. This is simpler than regularizing based on model structure or parameters because it is hard to devise a regularizer which will incorporate $n_{r}$ and model parameters. The tested regularization term in this test was $Reg_{r}(n_{r})={\Big (}softplus(-\phi (h,r,t))-\alpha n_{r}{\Big )}^{2}$ , and therefore the loss function was

$loss=min_{\theta }\sum _{(h,r,t)\in {\textbf {LB}}}{\bigg (}softplus(-\phi (h,r,t))+{\frac {\lambda '}{n_{r}}}{\Big (}softplus(-\phi (h,r,t))-\alpha n_{r}{\Big )}^{2}{\bigg )}+\lambda \|\theta \|_{2}^{2}$ ,

where $\lambda '$ is the regularization rate for the new regularizer and $\alpha$ is a normalizing factor to normalize the values of $n_{r}$ to the range $[0,1]$ so that they are comparable with the values of $softplus$ .

The result of a grid search over $\lambda '$ in range $[0.01,1]$ and sensible ⍺ gave very bad results: The MMR score for the trained models were all close to 0.0001, compared to MRR=0.72 for SimplE implementation of Kazemi and Poole^[1].

Test 2 - Regularizing on Norm of the Embedding Vectors

To derive a regularization term based on embedding vector of relation $r$ and $n_{r}$ , a few plots were drawn which can be seen in figure 1. Analyzing the plots in figure 1 shows that there is a correlation between the norm of embedding vectors $v_{r}$ and $n_{r}$ .

Figure 1. These plots show different features of the embedding vectors

v_{r}

compared to

n_{r}

. The embedding vectors for relations are extracted from SimplE model trained using negative samples. As it is observed, there seems to be a high correlation between the norm of the relation embedding vectors and the frequency of the occurrence of the relation in the training set. Plots drawn using the other embedding vectors of the relations

v_{r^{-1}}

are very similar to these plots.

To make use of the correlation observed in plots in figure 1 to derive a regularization term based on $\|v_{r}\|^{2}$ , polynomial regression functions of degrees 1, 2, and 3 were fitted to the plot of $\|v_{r}\|^{2}$ vs. $n_{r}$ . The resulting regression curves can be seen in figure 2.

Figure 2. Regressions fitted on the

\|v_{r}\|^{2}

vs.

n_{r}

plot and the

\|v_{r^{-1}}\|^{2}

vs.

n_{r}

plot.

To derive the regularization term, the degree one polynomial regression curve was used. As a result, the formula for the regularizer was decided to be $Reg_{r}(n_{r})={\Big (}\|v_{r}\|^{2}-(\alpha n_{r}+\beta ){\Big )}^{2}$ . Finally, the loss function will be as follows:

Figure 3. This figure shows the result of MRR scores achieved on FB15K data set using different values for

\lambda '

values.

$loss=min_{\theta }\sum _{(h,r,t)\in {\textbf {LB}}}{\bigg (}softplus(-\phi (h,r,t))+{\frac {\lambda '}{n_{r}}}{\Big (}|v_{r}|-(\alpha n_{r}+\beta ){\Big )}^{2}{\bigg )}+\lambda \|\theta \|_{2}^{2}$ ,

where $\alpha$ and $\beta$ are derived from the fitted regression line ( $\alpha =0.00005675$ and $\beta =0.0327$ ), and $\lambda '$ is the regularization rate. MRRs achieved with a grid search over $\lambda '$ can be seen in figure 3. As it can be seen in figure 3, results as high as MRR=0.18 are achieved, which is comparable to MRR=0.72 that is achieved by SimplE implementation of Kazemi and Poole^[1]. It seems like the MRR is increasing as $\lambda '$ increases and it does not seem to have saturated yet, which means that testing $\lambda '$ values greater than 0.7 might give even better results.

Devised Regularizer Tested with Negative Sampling

Testing this new regularization term alongside negative sampling resulted in MRR~0.10 for $\lambda '\in [0.05,0.5]$ meaning that this regularizer with high $\lambda '$ values does not increase MRR score of SimplE implementation of Kazemi and Poole. Further tests are needed with smaller $\lambda '$ values to make a general statement about the regularizers impact on SimplE.

Discussion and Future Work

Results of the second test show that regularization is powerful enough to achieve good results and might some day completely substitute negative sampling in training embedding-based models in link prediction of KGs.

Future work includes multiple tests that can be done to further analyze the potentials of the new devised regularization term. Some of these test include:

Tests to check if the regression line on the $\|v_{r}\|^{2}$ vs. $n_{r}$ plot stays fixed among multiple data sets, so that this regularizer can potentially be transported to any data set without any change.
Trying degree two regression to see if that achieves better results than the degree one regression.

Annotated Bibliography

↑ ^1.0 ^1.1 ^1.2 ^1.3 Seyed Mehran Kazemi and David Poole, SimplE Embedding for Link Prediction in Knowledge Graphs, in Proc. Thirty-second Conference on Neural Information Processing Systems(NeurIPS 2018). Also a previous version in Eighth International Workshop on Statistical Relational AI, July 2018.
↑ ^2.0 ^2.1 Nickel, M.; Murphy, K.; Tresp, V.; and Gabrilovich, E. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104(1):11-33.
↑ Antoine Bordes and Nicolas Usunier and Alberto Garcia-Dur\'an and Jason Weston and Oksana Yakhnenko, Translating Embeddings for Modeling Multi-relational Data, Advances in Neural Information Processing Systems (NIPS 26), 2013
↑ Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In ACM SIGMOD, pages 1247–1250. AcM, 2008.

To Add

Put links and content here to be added. This does not need to be organized, and will not be graded as part of the page. If you find something that might be useful for a page, feel free to put it here.

Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0

[:1-1] 1.0 ^1.1 ^1.2 ^1.3 Seyed Mehran Kazemi and David Poole, SimplE Embedding for Link Prediction in Knowledge Graphs, in Proc. Thirty-second Conference on Neural Information Processing Systems(NeurIPS 2018). Also a previous version in Eighth International Workshop on Statistical Relational AI, July 2018.

[:0-2] 2.0 ^2.1 Nickel, M.; Murphy, K.; Tresp, V.; and Gabrilovich, E. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104(1):11-33.

[3] Antoine Bordes and Nicolas Usunier and Alberto Garcia-Dur\'an and Jason Weston and Oksana Yakhnenko, Translating Embeddings for Modeling Multi-relational Data, Advances in Neural Information Processing Systems (NIPS 26), 2013

[4] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In ACM SIGMOD, pages 1247–1250. AcM, 2008.

[1]

[2]

[3]

[4]