Course:CPSC532:StaRAI2020:Negative Sampling on Knowledge Graphs


Title

In this page, I first define the concept of negative sampling, then describe popular negative sampling methods for knowledge graphs, and finally test the hypothesis that modifying these methods to take node labels into account provides better results.

Authors: Melika Farahani

Abstract

Negative sampling in knowledge graphs is an approach that can help improve link prediction and graph embedding models. There are several commonly used methods for negative sampling in knowledge bases (KBs), such as random sampling, corrupting positive instances, and relational sampling. These methods consider relations and relation types and try to find a subset of triples that are not in the KB's relation set, treating them as negative facts. Although these methods have their own advantages, they do not perform well in many cases. I propose the idea that, by considering node labels or assigning labels to the knowledge graph's nodes, we can improve on the previous methods.

Related Pages

Knowledge Graphs, Word2vec, Link Prediction

Content

Negative Sampling Concept

In contrast to sampling from the target area (positive samples), negative sampling draws samples from a set with negative meaning. Although this work focuses on negative sampling in knowledge graphs, I first give a negative sampling example in the context of word2vec for better understanding.

Negative Sampling in Word2vec

Negative sampling was first defined in the context of word embeddings and word2vec. Imagine we want to predict a word's context within a certain window. The words that fall inside the window can be considered positive words, and we can use them to train the model to predict the context word. However, the model also needs false labels, so we have to construct some negative cases. We can do this by picking non-surrounding words (called negative words) to serve as false labels.
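To make this concrete, here is a minimal Python sketch of the idea. The names (vocabulary, window_words, sample_negative_words) are hypothetical and only for illustration; real word2vec implementations draw negatives from a smoothed unigram distribution rather than uniformly.

    import random

    def sample_negative_words(vocabulary, window_words, num_negatives):
        """Pick words outside the context window to serve as false (negative) labels."""
        # Candidate negatives are all vocabulary words that are NOT in the current window.
        candidates = [w for w in vocabulary if w not in set(window_words)]
        return random.sample(candidates, min(num_negatives, len(candidates)))

    # Example: the centre word is "fox"; the surrounding words are the positives.
    vocabulary = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "pizza", "guitar"]
    window_words = ["quick", "brown", "jumps", "over"]
    print(sample_negative_words(vocabulary, window_words, num_negatives=2))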

Negative Sampling in Knowledge Graphs

Knowledge Graphs (KGs) capture knowledge as triples <entity, relation, entity>, with entities mapped to nodes and relations to edges. KGs contain only positive instances; negative instances are not marked in a knowledge graph. The tasks of link prediction and knowledge graph embedding have much in common with the word2vec task, so here too we need to construct negative cases. Negative samples in knowledge graphs are triples that are not available in the KB, and in most cases we are specifically looking for triples with a negative meaning. For example, if the graph contains the relation <Leonardo_Dicaprio, plays_in, Inception>, it is a positive sample, and we can consider <Leonardo_Dicaprio, plays_in, Interstellar> as a negative sample, since Leonardo_Dicaprio did not play in the movie Interstellar.
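As a toy illustration (the triples and the function name below are mine, not from any dataset), a knowledge graph can be stored as a set of triples, and any triple absent from that set is a candidate negative:

    # A toy knowledge graph stored as a set of (head, relation, tail) triples.
    kg = {
        ("Leonardo_Dicaprio", "plays_in", "Inception"),
        ("Leonardo_Dicaprio", "plays_in", "Titanic"),
    }

    def is_candidate_negative(triple, kg):
        # Absence from the KG only makes a triple a *candidate* negative:
        # because of incompleteness it may still be an unknown positive.
        return triple not in kg

    print(is_candidate_negative(("Leonardo_Dicaprio", "plays_in", "Interstellar"), kg))  # True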

Negative Sampling Methods

  • Random Sampling: The simplest form of sampling negative instances is to assume a closed-world hypothesis and consider any triple that does not appear in the knowledge graph as a negative instance. Let K ⊆ E × R × E denote the set of triples in the knowledge graph, where E is the set of entities and R the set of relations; a triple's presence in K marks it as a known fact, and its absence marks it as false under the closed-world assumption. The set of negatives is then the complement (E × R × E) \ K, and in this method we sample uniformly at random from that complement. However, the KG is incomplete, so this set also contains positive triples that simply are not present in the KG. Furthermore, the set can be very large, because incorrect facts far outnumber correct ones. A simple solution to the scalability problem is to randomly sample a small number of triples that are not in the relation set; this is also the easiest approach to implement. However, we do not expect its results to be useful, and they may be meaningless, since entities and relations can have different types. For example, if we have entities of types actor, movie, and food, and the graph contains a triple (actor_x, plays_in, movie_y) but not (actor_x, plays_in, food_z), that does not mean (actor_x, plays_in, food_z) is a meaningful negative sample. (A short code sketch of all three strategies follows this list.)
  • Corrupting Positive Instances: In this approach, for each relation r, we consider all entities s that have appeared as a source of r and all entities t that have appeared as a target of r. More formally, for every relation r, collect the sets S_r = {s | (s, r, t) ∈ K for some t} and T_r = {t | (s, r, t) ∈ K for some s}, and produce the sets of corrupted triples C_source = {(s', r, t) | s' ∈ S_r, (s', r, t) ∉ K} and C_target = {(s, r, t') | t' ∈ T_r, (s, r, t') ∉ K}. We then sample a number of negatives from C_source and C_target. This method produces negative instances that are closer to the positive ones than those produced by random sampling. Due to the incompleteness of the KG, some of these candidates could still be unknown positives.
  • Relational Sampling: If we assume that each source-target pair participates in only one relation, then sampling targets that are connected to the current source through relations other than the current one yields true negatives. More formally, for a positive triple (s, r, t), the negative candidate source set is {s' | (s', r', t) ∈ K for some r' ≠ r} and the negative candidate target set is {t' | (s, r', t') ∈ K for some r' ≠ r}. This is a common procedure in multi-class learning. However, this assumption may not hold in all knowledge graphs, since a source-target pair can participate in more than one relation.
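The following Python sketch contrasts the three strategies on a toy triple set. The function and variable names are mine (not from the paper or any library), and the incompleteness caveats discussed above still apply.

    import random

    def random_sampling(kg, entities, relations, n):
        """Random sampling: draw triples uniformly and keep those absent from the KG."""
        negatives = set()
        while len(negatives) < n:
            triple = (random.choice(entities), random.choice(relations), random.choice(entities))
            if triple not in kg:
                negatives.add(triple)
        return negatives

    def corrupt_positive(kg, triple):
        """Corrupting positive instances: replace the source (or target) of a positive
        triple with another entity that has appeared as a source (or target) of the
        same relation, keeping only triples absent from the KG."""
        s, r, t = triple
        sources = {h for (h, rel, _) in kg if rel == r}
        targets = {x for (_, rel, x) in kg if rel == r}
        candidates = [(s2, r, t) for s2 in sources if (s2, r, t) not in kg]
        candidates += [(s, r, t2) for t2 in targets if (s, r, t2) not in kg]
        return random.choice(candidates) if candidates else None

    def relational_sampling(kg, triple):
        """Relational sampling: corrupt the target with an entity connected to the
        source through a different relation, assuming each source-target pair
        participates in only one relation."""
        s, r, t = triple
        other_targets = {x for (h, rel, x) in kg if h == s and rel != r}
        candidates = [(s, r, t2) for t2 in other_targets if (s, r, t2) not in kg]
        return random.choice(candidates) if candidates else None

    kg = {("actor_x", "plays_in", "movie_y"),
          ("actor_y", "plays_in", "movie_w"),
          ("actor_x", "born_in", "city_z")}
    entities = ["actor_x", "actor_y", "movie_y", "movie_w", "city_z", "food_z"]
    relations = ["plays_in", "born_in"]
    print(random_sampling(kg, entities, relations, 2))
    print(corrupt_positive(kg, ("actor_x", "plays_in", "movie_y")))
    print(relational_sampling(kg, ("actor_x", "plays_in", "movie_y")))

On this toy data, relational_sampling returns (actor_x, plays_in, city_z), which already hints at the weakness discussed below: none of these strategies looks at the types of the entities involved.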

In "Analysis of the Impact of Negative Sampling on Link Prediction in Knowledge Graphs"[1] paper, several link prediction methods such as transE, rescal and distmult has been tested with different random sampling methods on two different dataset (FB15K and WN18). The results show that the different sampling methods have different effects on the two datasets. They have a very good plot for comparing them in different dataset in the figure 5 of their paper. However I don't attach it here due to the copy right issues.

My main goal in this project is to highlight one issue that all of the above approaches share on knowledge graphs: all of the mentioned methods use only the relation labels and do not consider the nodes' types. I took WN18 as an example for testing the methods, and the example below shows this weakness.

Consider these triples, which I took from WN18:

  • (central_nervous_system, has_part, brain)
  • (Poland, has_part, Europe)
  • (Philippines, has_part, Southeast_Asia)
  • (evolve, verb_group, develope)

In the Random Sampling method, we may choose triples like (evolve, has_part, Europe) or (Poland, verb_group, develope), neither of which makes any sense. Although we do not sample such examples in the Relational Sampling and Corrupting Positive Instances methods, we may still generate triples like <central_nervous_system, has_part, Europe> or <Poland, has_part, brain>. This problem arises because we do not assume any difference between Europe and brain.

To solve this issue I propose a new method, which I call "classified corruptive positive instances". The main idea is to consider the node labels (countries, body_part, verbs, etc.), or to classify the nodes if they do not already have labels. There are many general approaches for node classification, such as KNN or decision trees. However, since we have a graph-structured dataset, I tried a simple, common graph-based classification and tested it on the WN18 dataset. For each relation r, I build an undirected graph from the base knowledge graph, keeping only the edges of type r, and call this new graph G_r. In G_r there are some large connected components, and each of them corresponds to a different meaning (one for countries, one for body parts, one for vocabulary and grammar, etc.). If we apply the corrupting positive instances method on each of the components separately, we will not produce the meaningless samples mentioned above.

Although we could also apply this approach on different label clusters (if our nodes are already labeled), I tested it using connected components with r = "has_part" on WN18. The graph G_has_part has 5193 nodes and 901 connected components. 750 of these components have at most 3 nodes and 151 have more. I checked the 151 larger components manually, and each of them corresponds to one meaningful cluster. The three biggest clusters each have more than 400 nodes: one contains cities and locations that are parts of bigger states or countries (size 1541), one contains body parts (size 409), and one contains furniture or electronic parts (size 898). Therefore, with this approach we produce fewer meaningless samples. However, we may also eliminate some meaningful negative samples, since we are limiting ourselves to single components, and some small components may actually belong to a bigger component that we miss due to the KB's incompleteness. A good idea for future work would be to merge smaller components by getting feedback from link prediction.
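A rough sketch of the proposed procedure follows, assuming the networkx library for connected components. The helper name classified_corruption and the toy triples are mine, chosen to mirror the WN18 examples above rather than taken from the actual experiment.

    import random
    import networkx as nx

    def classified_corruption(kg, relation, triple):
        """Classified corruptive positive instances (sketch): build an undirected graph
        from the edges of the given relation, find its connected components, and corrupt
        the positive triple only with entities from the component it belongs to."""
        # Undirected graph G_r containing only edges of type `relation`.
        g = nx.Graph([(s, t) for (s, r, t) in kg if r == relation])

        s, r, t = triple
        # The cluster (connected component) in which the positive triple lives.
        component = nx.node_connected_component(g, s)

        # Corrupt source and target only within this component, keeping unseen triples.
        candidates = [(s2, r, t) for s2 in component if s2 not in (s, t) and (s2, r, t) not in kg]
        candidates += [(s, r, t2) for t2 in component if t2 not in (s, t) and (s, r, t2) not in kg]
        return random.choice(candidates) if candidates else None

    # Toy example mimicking the WN18 triples above: the body-part component and the
    # geography component never mix, so negatives like <Poland, has_part, brain> cannot occur.
    kg = {
        ("central_nervous_system", "has_part", "brain"),
        ("brain", "has_part", "cerebellum"),
        ("Europe", "has_part", "Poland"),
        ("Europe", "has_part", "Germany"),
    }
    print(classified_corruption(kg, "has_part", ("central_nervous_system", "has_part", "brain")))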

[1] Analysis of the Impact of Negative Sampling on Link Prediction in Knowledge Graphs



Some rights reserved
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0