Course:CPSC532:StaRAI2020:Query2Box and Faithful Embeddings for Knowledge Base Queries.

From UBC Wiki

Query Embedding Methods

Answering complex logical queries on large-scale incomplete knowledge graphs (KGs) is a fundamental yet challenging task [1]. Query2Box [1] and "Faithful Embeddings for Knowledge Base Queries" [2] tackle this problem by embedding KG entities, as well as the query itself, into a vector space such that the entities that answer the query are embedded close to the query.

Authors: Maulik Parmar

Abstract

"The deductive closure of an ideal KBs contain exactly the logical queries that the KB can answer"[2]. However, in reality, most KBs are incomplete and fail to answer queries that have real-world answers. Query embedding methods may disagree with deductive reasoning on answers that do not require generalization or relaxation[1]. In "Faithful Embeddings for Knowledge Base"[2], the authors propose a novel QE method "EmQL" that is more faithful to deductive reasoning and show that this method has better performance on complex queries than incomplete KBs.

Introduction

Query embedding (QE) methods represent KB entities and KB queries in a joint embedding space, supporting relaxation and generalization in KB inference for incomplete KBs. Graph query embedding encodes a query q and entities e as vectors such that their distance represents e's score as an answer to query q. Traditional logical inference finds deductively entailed answers to queries; knowledge base embedding (KBE) allows a system to generalize from triples in an incomplete KB to similar triples; and QE systems combine both methods, yielding a soft version of logical entailment.

In Query2Box [1], queries are embedded as hyper-rectangles (boxes), where the set of points inside a box corresponds to the set of answer entities of the corresponding query. This provides three important benefits:

  1. Boxes naturally model sets of entities they enclose.
  2. Boxes allow logical operations, such as intersection, to be adapted geometrically, much like operations on Venn diagrams.
  3. "Executing logical operators over boxes results in new boxes, which means that the operations are closed; thus, logical reasoning can be efficiently performed in QUERY2BOX by iteratively updating boxes according to the query computation graph"[1].

However, compared to deductive reasoning, most QE systems like Query2Box [1] perform poorly on queries that do not require generalization. The authors of "Faithful Embeddings for Knowledge Base Queries" [2] show that EmQL is more faithful to deductive reasoning, and that this leads to better performance on complex queries against incomplete KBs. A QE system is logically faithful if it behaves similarly to a traditional logical inference system with respect to entailed answers [2]. EmQL proposes two novel techniques for improving faithfulness while preserving the ability to generalize well:

  1. EmQL adds a non-parametric component, implementing some logical operations with neural retrieval over embedded triples, in contrast to Query2Box, which implements logical operations geometrically in embedding space.
  2. EmQL uses a randomized data structure called a count-min sketch to propagate the scores of logically entailed answers.

The main contributions of "Faithful Embeddings for Knowledge Base Queries" [2] are:

  1. A new QE system with expressive set and relational operators.
  2. A new analysis of the QE system using a check of faithfulness.
  3. Application of the QE system as a module in a KB question-answering (KBQA) system.

Logically faithful:

The authors say that a QE system is logically faithful if it behaves similarly to a traditional logical inference system with respect to entailed answers.

Representation

Query2Box [1] uses boxes (axis-aligned hyper-rectangles) to model sets of entities. The benefit of using a box is that, unlike a simple point embedding, a box has an interior; thus, if we have a set of entities, it is natural to model each entity embedding as a point inside the box. A box p is defined by Cen(p), the center of the box, and Off(p), a positive offset modeling the size of the box. Logical operations are designed so that their output is also a box, which means the operations are closed.
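To make the geometry concrete, here is a minimal NumPy sketch of an entity-to-box distance in the style of Query2Box, splitting the distance into an outside part and a down-weighted inside part; the dimension, test points, and the alpha = 0.2 weight are illustrative assumptions:

```python
import numpy as np

def box_distance(v, cen, off, alpha=0.2):
    """Distance from entity point v to the box with center `cen` and
    positive offset `off`: points inside the box get only a small,
    down-weighted distance; points outside get a larger one."""
    q_min, q_max = cen - off, cen + off
    # distance from v to the box surface (zero if v lies inside the box)
    dist_out = np.abs(np.maximum(v - q_max, 0) + np.maximum(q_min - v, 0)).sum()
    # distance from the center to v clipped into the box
    dist_in = np.abs(cen - np.minimum(q_max, np.maximum(q_min, v))).sum()
    return dist_out + alpha * dist_in

cen, off = np.zeros(4), np.ones(4)
inside, outside = np.full(4, 0.5), np.full(4, 2.0)
print(box_distance(inside, cen, off) < box_distance(outside, cen, off))  # True
```

An entity inside the box scores strictly better than one outside, which is exactly what makes the box a soft set membership test.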

EmQL [2] represents entity sets in a format that not only supports generalization but also precisely encodes the weights of sets defined by composing logic-like operations. Unlike Query2Box, EmQL represents sets of entities with point embeddings. Intuitively, this representation lets it perform neural retrieval over a KB of embedded triples: retrieval with point embeddings reduces to simple dot-product similarity, which is harder to do with box embeddings. EmQL uses a weighted centroid to represent a set of entities and assumes that the entities in a set are close to each other in embedding space, so the region around the centroid can be taken to represent the set.

A set X is represented with a pair (a_X, b_X), where e_i is the learned embedding of each entity i, w_X(i) holds the non-negative real weight of element i in X, a_X = Σ_i w_X(i) e_i is the weighted centroid of the elements of X, which identifies the general region of the elements of X, and b_X is an optional count-min sketch that encodes additional information about the weights of the elements of X. Count-min sketches [3] are widely used randomized data structures that can approximate a weight vector with limited storage. The centroid-sketch representation assumes that all elements of the same set are similar, i.e., they all lie in a sphere around a specific centroid. The use of b_X is optional, and the authors also report results without it.
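As a concrete illustration of this centroid-plus-sketch representation, the following NumPy sketch encodes a small weighted set; the hashing scheme, sketch dimensions, and random entity embeddings are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

class CountMinSketch:
    """Minimal count-min sketch: depth hash rows of fixed width. add()
    increments one counter per row; query() returns the minimum over the
    rows, which upper-bounds (and approximates) the true stored weight."""
    def __init__(self, depth=4, width=64, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((depth, width))
        # random coefficients for a simple per-row hash function
        self.a = rng.integers(1, 2**31 - 1, size=depth)
        self.b = rng.integers(0, 2**31 - 1, size=depth)
        self.width = width

    def _cols(self, key):
        return (self.a * key + self.b) % (2**31 - 1) % self.width

    def add(self, key, weight=1.0):
        self.table[np.arange(len(self.a)), self._cols(key)] += weight

    def query(self, key):
        return self.table[np.arange(len(self.a)), self._cols(key)].min()

# Centroid + sketch for a toy weighted set X (embeddings are random here;
# in EmQL they are learned).
E = np.random.default_rng(1).normal(size=(100, 16))   # 100 entities, dim 16
members = {3: 1.0, 7: 0.5}                            # entity id -> weight in X
centroid = sum(w * E[i] for i, w in members.items())  # a_X
sketch = CountMinSketch()                             # b_X
for i, w in members.items():
    sketch.add(i, w)
print(sketch.query(3))   # ~1.0 (exact here, barring hash collisions)
```

The sketch stores per-element weights far more compactly than a full weight vector over all entities, at the cost of occasional overestimates from hash collisions.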

"All EmQL operations are closed under composition, because they all take and return the same sparse-dense representation, and the computation graph constructed from an EmQL expression is similar in size and structure to the original EmQL expression"[2]. In Query2Box, a union is implemented by rewriting the query into a normal form. This can be sometimes disadvantageous as the normal form can be exponentially larger than the original expression.

Both methods support relation following, intersection, and union. EmQL also supports an additional relational operation, relational filtering: "Relational filtering, similar to an existential restriction in description logics, removes from X those entities that are not related to something in set Y via some relation in R" [2].

Logical Operation and Loss function

The logical operations defined in EmQL and Query2Box are summarized in the table below:

Operation     EmQL                                       Query2Box
Intersection  a(X∩Y) = a_X + a_Y ;  b(X∩Y) = b_X ⊙ b_Y   Attention over box centers with a shrunken offset
Union         a(X∪Y) = a_X + a_Y ;  b(X∪Y) = b_X + b_Y   Reformulate the query in disjunctive normal form

(where ⊙ is the Hadamard product).
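These set operations can be sketched in a few lines; here the count-min sketches are stood in for by dense weight vectors over a toy five-entity universe (a readability assumption; EmQL compresses them with sketches):

```python
import numpy as np

def emql_intersection(aX, bX, aY, bY):
    # centroids add; weights multiply elementwise (Hadamard product), so an
    # entity keeps non-zero weight only if it appears in both sets
    return aX + aY, bX * bY

def emql_union(aX, bX, aY, bY):
    # centroids add; weights add
    return aX + aY, bX + bY

E = np.eye(5)                          # toy one-hot entity embeddings
wX = np.array([1.0, 1.0, 0, 0, 0])     # X = {0, 1}
wY = np.array([0, 1.0, 1.0, 0, 0])     # Y = {1, 2}
aX, aY = wX @ E, wY @ E                # weighted centroids
_, w_inter = emql_intersection(aX, wX, aY, wY)
_, w_union = emql_union(aX, wX, aY, wY)
print(w_inter)   # [0. 1. 0. 0. 0.]  -> X ∩ Y = {1}
print(w_union)   # [1. 2. 1. 0. 0.]  -> X ∪ Y = {0, 1, 2}
```

Because both operations return the same (centroid, weights) representation they take as input, they compose freely, which is the closure property quoted above.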

Relation following:

(i) Query2Box:

Given an input box embedding p and a relation embedding r, Query2Box models projection (relation following) as p + r: the centers are summed and the offsets are summed, yielding a new box with a translated center and a larger offset.
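A minimal sketch of this projection step (the helper name q2b_project and the toy values are illustrative):

```python
import numpy as np

def q2b_project(cen, off, r_cen, r_off):
    """Query2Box relation following p + r: translate the box center by the
    relation's center embedding and grow the offset by the relation's offset."""
    return cen + r_cen, off + r_off

cen, off = np.zeros(3), np.ones(3)
new_cen, new_off = q2b_project(cen, off, np.ones(3), 0.5 * np.ones(3))
print(new_cen, new_off)  # center shifted to [1 1 1], offset grown to [1.5 1.5 1.5]
```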

(ii) EmQL:

In EmQL, relation following is implemented using an embedding matrix K for KB triples: for every triple t = r(x, y) in the KB, K contains a row concatenating the embeddings of r, x, and y. To compute Y = X.follow(R), a query vector is created and a maximum inner product search (MIPS) is performed against all triples in K to retrieve the top-k triples matching the query; these triples are then rescored with the sketches of X and R.

Let k_t = [r_t ; x_t ; y_t] be the representation of a retrieved triple t = r_t(x_t, y_t). Its score is

s_t = w̃_X(x_t) · w̃_R(r_t)

where w̃_X is the approximation of the weight vector w_X recovered from the count-min sketch b_X (and likewise w̃_R from b_R).

EmQL projects out the objects of the top-k retrieved triples as a sparse k-hot vector w_Y, in which the object y_t of each retrieved triple receives weight s_t. The output of the operation Y = X.follow(R) is obtained by converting w_Y to a set representation (a_Y, b_Y). Intuitively, this operation performs a neural retrieval over the set of embedded KB triples.
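Putting the retrieval view together, here is a toy, self-contained sketch of relation following as MIPS over a triple matrix; one-hot embeddings replace learned ones, and the sketch-based rescoring step is omitted (both are simplifying assumptions):

```python
import numpy as np

# Toy KB with one-hot entity/relation embeddings so scores are easy to read;
# EmQL learns these embeddings instead.
dim = 6
E = np.eye(dim)                     # entity embeddings (entity i -> row i)
R = np.eye(dim)                     # relation embeddings

triples = [(0, 1, 2), (0, 3, 4), (1, 1, 5)]   # (relation, subject, object) ids
# K: one row per triple r(x, y), concatenating the [r ; x ; y] embeddings
K = np.stack([np.concatenate([R[r], E[x], E[y]]) for r, x, y in triples])

def follow(aX, aR, k=2):
    """Y = X.follow(R): form the query [aR ; aX ; 0], take the top-k triples
    by inner product against K (MIPS), and project out their objects."""
    q = np.concatenate([aR, aX, np.zeros(dim)])
    scores = K @ q                                  # inner product with every triple
    top = np.argsort(-scores, kind="stable")[:k]    # top-k triple indices
    return [triples[t][2] for t in top]             # object ids of those triples

print(follow(E[1], R[0]))  # [2, 4]: r0(1, 2) matches best, then r0(3, 4)
```

The key point is that a single matrix-vector product against K scores every triple at once, so relation following scales with fast MIPS indexes rather than explicit graph traversal.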

Loss:

(i) Query2Box:

Given a training set of queries and their answers, Query2Box optimizes a negative sampling loss for its distance-based model:

L = −log σ(γ − dist_box(v; q)) − Σ_{i=1}^{k} (1/k) log σ(dist_box(v′_i; q) − γ)

where γ is a fixed scalar margin, v is a positive entity, v′_i is the i-th negative entity, and k is the number of negative entities.
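A small numeric sketch of this negative sampling loss, taking precomputed box distances as inputs (the helper name q2b_loss and the example distances are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def q2b_loss(pos_dist, neg_dists, gamma=1.0):
    """Negative sampling loss in the Query2Box style: push the positive
    entity's box distance below the margin gamma and the negatives'
    distances above it."""
    pos_term = -np.log(sigmoid(gamma - pos_dist))
    neg_term = -np.mean([np.log(sigmoid(d - gamma)) for d in neg_dists])
    return pos_term + neg_term

# A close positive with far negatives yields a smaller loss than the reverse.
print(q2b_loss(0.1, [3.0, 4.0]) < q2b_loss(2.5, [0.2, 0.3]))  # True
```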

(ii) EmQL:

The output of the query is an approximation Ŷ, encoded as (a_Ŷ, b_Ŷ), and the goal of EmQL training is to make Ŷ approximate the target output set Y. The training loss is the cross-entropy

L = cross-entropy(softmax(E a_Ŷ), w_Y)

where E is a matrix of entity embeddings and w_Y is the canonical k-hot encoding of Y.
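A minimal NumPy version of this training objective, using toy one-hot entity embeddings (the helper name emql_loss and the normalized k-hot target are illustrative assumptions):

```python
import numpy as np

def emql_loss(a_hat, E, target_ids):
    """Cross-entropy between softmax(E @ a_hat) -- every entity scored
    against the predicted centroid -- and the k-hot encoding of the
    target set Y (normalized here so it sums to one)."""
    logits = E @ a_hat
    log_probs = logits - np.log(np.sum(np.exp(logits)))   # log-softmax
    k_hot = np.zeros(E.shape[0])
    k_hot[target_ids] = 1.0 / len(target_ids)
    return -np.sum(k_hot * log_probs)

E = np.eye(4)            # 4 entities with one-hot embeddings, for clarity
a_hat = E[1] + E[2]      # predicted centroid for the set {1, 2}
# loss is lower when the target matches the centroid's members
print(emql_loss(a_hat, E, [1, 2]) < emql_loss(a_hat, E, [0, 3]))  # True
```

The loss pulls the centroid toward the embeddings of the true answer set, which is what makes the "region around the centroid" a usable set representation.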

Experiments and Results

While Query2Box is trained on examples from five reasoning tasks (1p, 2p, 3p, 2i, 3i), EmQL is trained on only two: relational following (a variation of 1p) and set intersection. Queries are generated using the query templates shown below [2].

Query  Template format
1p     X.follow(R)
2p     X.follow(R1).follow(R2)
3p     X.follow(R1).follow(R2).follow(R3)
2i     X1.follow(R1) ∩ X2.follow(R2)
3i     X1.follow(R1) ∩ X2.follow(R2) ∩ X3.follow(R3)
ip     (X1.follow(R1) ∩ X2.follow(R2)).follow(R)
pi     X1.follow(R1).follow(R2) ∩ X2.follow(R3)
2u     X1.follow(R1) ∪ X2.follow(R2)
up     (X1.follow(R1) ∪ X2.follow(R2)).follow(R)


"To test the ability to infer logically entailed answers, EmQL and Q2B were trained with the full KB instead of the training KB, so only reasoning (not generalization) is required to find answers. It is important for a query language to be also be faithful to the KB when answering compositional logical queries. "[2].

Hits@3 results on the Query2Box datasets FB15k and NELL995, as reported in "Faithful Embeddings for Knowledge Base Queries" [2].

The results in the table above show that EmQL outperforms Q2B on all tasks in this setting, with average Hits@3 raised from 36-51 into the 90s.

Conclusion

EmQL tackles the problem of faithfulness to deductive reasoning. Query2Box [1] and other embedding methods perform poorly on queries that do not require generalization. EmQL generalizes well and is differentiable, compact, scalable, and faithful with respect to deductive reasoning [2]. EmQL outperforms previous QE methods in the usual reasoning experiments and massively outperforms them with respect to faithfulness [2]. However, a major drawback of EmQL is that it requires the entities grouped in a set to be close in embedding space, so entity embeddings must be trained to have this property.

Ambiguities in the EmQL paper:

The following points are not made clear in the paper:

(1) How is the initial set of entity vectors created before any query operations are performed?

(2) The projected vector is a sparse k-hot vector, but it is not clear what its weights are or what the initial weights signify.

(3) It is also not clear how practical it is to fix k, since this restricts sets, and the output of every query, to at most k entities.

(4) Finally, it is not clear how EmQL representations for relation sets are generated.

Annotated Bibliography

(1) Hongyu Ren, Weihua Hu and Jure Leskovec, "Query2box: Reasoning over Knowledge Graphs in Vector Space".

(2) Haitian Sun, Andrew O. Arnold, Tania Bedrax-Weiss, Fernando Pereira and William W. Cohen, "Faithful Embeddings for Knowledge Base Queries".

(3) Graham Cormode and S. Muthukrishnan, "An improved data stream summary: the count-min sketch and its applications".



Some rights reserved
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0