Topological Data Analysis and Embedding Multi-relational Data

We discuss the application of topological data analysis in the context of embedding multi-relational data and the TransE algorithm.

Principal Author: Carl Kwan

Figures are my own creation unless otherwise noted.

Abstract

In 2013, “Translating Embeddings for Modeling Multi-relational Data” (TransE) was published. TransE provided an effective algorithm for embedding multi-relational data into ${\displaystyle \mathbb {R} ^{n}}$. However, the embedding of the data depends on the metric chosen for ${\displaystyle \mathbb {R} ^{n}}$, so it is natural to ask whether there are benefits to using other metrics.

Topological data analysis is a (relatively) new approach to data analysis that leverages the tools of topology to examine data. Among the benefits of topology is the ability to discern the “shape” of a space independently of any metric. This makes topology a potentially powerful tool in the context of TransE.

Builds On

Topological data analysis builds upon algebraic topology. While this article remains informal in its approach, it would be prudent to consider some prerequisites (the first item is recommended viewing - the rest are for the interested):

• A short, informal but excellent video introduction to persistent homology: Wright
• A longer video of a talk on topological data analysis: Carlsson
• An introductory paper on topological data analysis: Carlsson
• A standard (and free) introduction to algebraic topology: Hatcher
• A standard introduction to point set topology: Munkres
• Some excellent books on algebra and linear algebra: Artin, Lang, Jacobson

Finally, as this article applies topology to TransE, being familiar with TransE would be a good idea. The paper can be found here.

Related Pages

While this article focuses on applications to TransE, TDA can be applied to any field or problem involving large datasets. One possible application within this wiki would be to restaurant data.

TransE

The paper “Translating Embeddings for Modeling Multi-relational Data” was published in 2013.[1] Since then, numerous similar algorithms have been developed to embed data into ${\displaystyle \mathbb {R} ^{n}}$. Among the strengths of TransE is its ability to predict relationships between entities (link prediction). This is done by embedding data points, say ${\displaystyle h}$ and ${\displaystyle t}$, with a relationship between them, say ${\displaystyle \ell }$, into ${\displaystyle \mathbb {R} ^{n}}$ such that

${\displaystyle h+\ell \approx t}$

as members of ${\displaystyle \mathbb {R} ^{n}}$. This results in similar elements being clustered together in the embedding. The upshot is that link prediction for a data point ${\displaystyle p}$ reduces to looking at the clusters with which the points surrounding ${\displaystyle p}$ have relationships.

TransE embeds data as points in ${\displaystyle \mathbb {R} ^{n}}$ according to their relationship. For example, if ${\displaystyle h}$ is "Saving Private Ryan" and ${\displaystyle t}$ is "Steven Spielberg", then ${\displaystyle \ell }$ could be "directed by".
TransE embeddings often resemble clusters.
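As a concrete illustration, the scoring idea behind TransE can be sketched in a few lines of Python. The entity and relation vectors below are hypothetical toy values chosen for illustration, not learned embeddings:

```python
import numpy as np

# Toy entity and relation embeddings in R^2 (hypothetical values, not learned).
entities = {
    "Saving Private Ryan": np.array([1.0, 2.0]),
    "Jaws":                np.array([1.1, 2.2]),
    "Steven Spielberg":    np.array([3.0, 5.0]),
}
relations = {"directed by": np.array([2.0, 3.0])}

def score(h, l, t, p=2):
    """TransE dissimilarity d_p(h + l, t): lower means the triple (h, l, t) is more plausible."""
    return np.linalg.norm(entities[h] + relations[l] - entities[t], ord=p)

# Link prediction: rank candidate tails t for the query ("Saving Private Ryan", "directed by", ?).
ranked = sorted(entities, key=lambda t: score("Saving Private Ryan", "directed by", t))
print(ranked[0])  # the best-scoring candidate tail entity
```

Here the candidate minimizing ${\displaystyle d(h+\ell ,t)}$ is taken as the predicted tail, which is exactly the ${\displaystyle h+\ell \approx t}$ heuristic above.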

Now consider once again

${\displaystyle h+\ell \approx t.}$

In particular, what does ${\displaystyle \approx }$ mean in the context of ${\displaystyle \mathbb {R} ^{n}}$? The usual metric on ${\displaystyle \mathbb {R} ^{n}}$ is the Euclidean metric

${\displaystyle d_{2}(x,y)={\sqrt {\sum _{i=1}^{n}|x_{i}-y_{i}|^{2}}}}$

but we can also define other metrics on ${\displaystyle \mathbb {R} ^{n}}$ such as the metrics induced by ${\displaystyle \ell ^{p}}$ norms

${\displaystyle d_{p}(x,y)=\left(\sum _{i=1}^{n}|x_{i}-y_{i}|^{p}\right)^{1/p}}$

as well as the maximum metric

${\displaystyle d_{\infty }(x,y)=\sup _{i}|x_{i}-y_{i}|.}$

Note that when ${\displaystyle p=2}$ we get the Euclidean metric.
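These metrics are straightforward to compute; the following sketch uses NumPy's vector norm, where `ord=np.inf` gives the maximum metric:

```python
import numpy as np

def d_p(x, y, p):
    """l^p metric on R^n; p = np.inf gives the maximum metric d_infinity."""
    return np.linalg.norm(np.asarray(x) - np.asarray(y), ord=p)

x, y = [0.0, 0.0], [3.0, 4.0]
print(d_p(x, y, 1))       # 7.0  (taxicab metric d_1)
print(d_p(x, y, 2))       # 5.0  (Euclidean metric d_2)
print(d_p(x, y, np.inf))  # 4.0  (maximum metric d_infinity)
```

The same pair of points can be quite different distances apart depending on the choice of ${\displaystyle p}$, which is the crux of the metric-selection question for TransE.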

The unit circle under the metrics ${\displaystyle d_{1}}$ (green), ${\displaystyle d_{2}}$ (red), and ${\displaystyle d_{\infty }}$ (blue).

The TransE paper proposed ${\displaystyle d_{1}}$ and ${\displaystyle d_{2}}$ as choices when running the embedding algorithm. However, it is not known how the choice of metric affects link prediction; it is possible that a different metric would yield better performance. Topology, by contrast, can examine properties of spaces independently of any metric.

The TransE Algorithm.[1]

Topology

Informally speaking, topology looks at characteristics of spaces that are retained under continuous deformations. One way of doing this is by looking at collections of subsets of the space which we call a topology. For ${\displaystyle \mathbb {R} ^{n}}$, the usual topology is created by subsets of the form

${\displaystyle B_{r}(x)=\{y\in \mathbb {R} ^{n}\mid d(x,y)<r\}}$

i.e. “open balls” of radius ${\displaystyle r>0}$. This approach is commonly referred to as point-set topology.

An open ball in ${\displaystyle \mathbb {R} ^{2}}$ with radius ${\displaystyle r}$.

Another approach is to look at paths in the space and their compositions under continuous deformations. These paths under composition form algebraic structures - the simplest of which is called the Fundamental Group. This approach to topology is algebraic topology.

The paths on the right (purple and teal) can be deformed continuously into a point so they are equivalent. However, the paths on the left (blue, red, and green) are not the same as the purple and teal paths because they cannot be deformed into a point without making a discontinuous "jump" over the hole on the left.

Embeddings

The TransE paper used “embedding” to describe the placing of data entities, say ${\displaystyle h}$, ${\displaystyle t}$, as points in ${\displaystyle (\mathbb {R} ^{n},d)}$. The problem we face now is the feasibility of taking one “embedding” in ${\displaystyle (\mathbb {R} ^{n},d)}$ into another space ${\displaystyle (\mathbb {R} ^{n},d^{\prime })}$. In general, we obtain the following definition:

Definition. An injection ${\displaystyle f:(X,d)\hookrightarrow (Y,d^{\prime })}$ between metric spaces is called an embedding.

If our dataset of interest is a finite set of points in ${\displaystyle (\mathbb {R} ^{n},d)}$, then the question of finding an embedding into ${\displaystyle (\mathbb {R} ^{n},d^{\prime })}$ becomes trivial since there is no condition on ${\displaystyle d}$, ${\displaystyle d^{\prime }}$.

Example. For example, if ${\displaystyle S\subset \mathbb {R} ^{n}}$ are our data entities and ${\displaystyle \sigma :S\to S}$ a permutation, then a simple embedding would be

${\displaystyle \varphi (x)=\left\{{\begin{array}{cl}\sigma (x)&{\text{if }}x\in S\\x&{\text{otherwise}}\end{array}}\right.}$
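This trivial embedding can be made concrete in a short sketch; the one-dimensional point set and the particular permutation below are hypothetical, chosen only to illustrate that any such ${\displaystyle \varphi }$ is injective:

```python
# A permutation of a finite point set S extends to an injection on all of R,
# hence an embedding (hypothetical 1-D data entities for illustration).
S = [0.0, 1.0, 2.0]
sigma = {0.0: 1.0, 1.0: 2.0, 2.0: 0.0}  # a permutation of S

def phi(x):
    """sigma(x) if x is in S, x otherwise."""
    return sigma.get(x, x)

images = [phi(x) for x in S]
print(sorted(images))  # [0.0, 1.0, 2.0] -- S maps bijectively onto itself
print(phi(5.0))        # 5.0 -- points outside S are fixed
```

Of course, such an embedding scrambles distances arbitrarily, which motivates the stronger requirements below.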

A desirable embedding is one that preserves the clusters of data points so that we can continue to perform link prediction. The simplest condition we can impose on the embedding to satisfy this requirement is for it to be distance-preserving or, in other words, isometric.

Definition. A function ${\displaystyle \varphi :(X,d)\to (Y,d^{\prime })}$ is an isometry or isometric if

${\displaystyle d(x,y)=d^{\prime }(\varphi (x),\varphi (y))}$

for all ${\displaystyle x,y\in X}$.

Corollary. It is immediate that isometries must be injective: if two distinct points of ${\displaystyle X}$ were sent to the same point of ${\displaystyle Y}$, their distance in ${\displaystyle Y}$ would be ${\displaystyle 0}$ even though their distance in ${\displaystyle X}$ is positive.

But not all isometries are surjective.

Example. Let ${\displaystyle f(x)=x+1}$ on ${\displaystyle [0,\infty )}$ with metric ${\displaystyle d(x,y)=|x-y|}$. Now ${\displaystyle f}$ is trivially isometric but not surjective since ${\displaystyle 0}$ is not in its image.
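This example can be checked numerically; the sketch below verifies distance preservation on a few sample points of ${\displaystyle [0,\infty )}$ and observes that nothing maps below ${\displaystyle 1}$:

```python
def f(x):
    """The isometry f(x) = x + 1 on [0, infinity)."""
    return x + 1.0

def d(x, y):
    """The usual metric on the real line."""
    return abs(x - y)

# Isometry check: d(x, y) == d(f(x), f(y)) for sample points in [0, inf).
pts = [0.0, 0.5, 2.0, 10.0]
for x in pts:
    for y in pts:
        assert d(x, y) == d(f(x), f(y))

# Not surjective: f(x) >= 1 for all x >= 0, so 0 is not in the image.
print(min(f(x) for x in pts))  # 1.0
```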

But what about the isometries that are bijective?

Definition. A global or total isometry is a bijective isometry.[2]

Corollary. If ${\displaystyle \varphi :(X,d)\to (Y,d^{\prime })}$ is a global isometry, then so is ${\displaystyle \varphi ^{-1}:(Y,d^{\prime })\to (X,d)}$.

proof. Observe

${\displaystyle d^{\prime }(a,b)=d^{\prime }(\varphi (x),\varphi (y))=d(x,y)=d(\varphi ^{-1}\circ \varphi (x),\varphi ^{-1}\circ \varphi (y))=d(\varphi ^{-1}(a),\varphi ^{-1}(b))}$

where ${\displaystyle \varphi (x)=a}$, ${\displaystyle \varphi (y)=b}$.

An isometry preserves the distance between two points even though the definition of distance may be different depending on the metric space.

Despite the existence of global isometries, in practice one would be hard-pressed to obtain even a non-bijective isometry. The natural question for TransE, then, is “Given a TransE embedding in ${\displaystyle (\mathbb {R} ^{n},d_{p})}$, are there isometries to ${\displaystyle (\mathbb {R} ^{n},d_{q})}$?”

Fortunately, there are some positive results. The TransE paper mentions ${\displaystyle d_{1}}$ and ${\displaystyle d_{2}}$ as potential metrics. We begin with the results concerning them.

Theorem.

1. There is an isometry from ${\displaystyle (\mathbb {R} ^{n},d_{p})}$ to ${\displaystyle (\mathbb {R} ^{n},d_{q})}$ for any ${\displaystyle 1\leq p\leq q\leq 2}$. In particular, there is an isometry ${\displaystyle (\mathbb {R} ^{n},d_{2})\hookrightarrow (\mathbb {R} ^{n},d_{1}).}$
2. There is an isometry ${\displaystyle (\mathbb {R} ^{n},d_{2})\hookrightarrow (\mathbb {R} ^{n},d_{p})}$ for ${\displaystyle 2\leq p\leq \infty }$.
3. Incidentally, ${\displaystyle (\mathbb {R} ^{n},d_{p})\hookrightarrow (\mathbb {R} ^{n},d_{\infty })}$ isometrically for any ${\displaystyle p}$.

At this point, the existence of a metric space that embeds isometrically into every other relevant metric space would be enough to decide an optimal choice of metric for TransE. However, not all spaces embed isometrically in the other direction; instead, a space may embed with some error, called distortion.

Definition. Given an embedding ${\displaystyle f:(X,d)\to (Y,d^{\prime })}$, the distortion of ${\displaystyle f}$ is the smallest ${\displaystyle D\geq 1}$ such that

${\displaystyle r\cdot d(x,y)\leq d^{\prime }(f(x),f(y))\leq Dr\cdot d(x,y)}$

for all ${\displaystyle x,y\in X}$ and some scaling factor ${\displaystyle r>0}$.

Distortion means that distance is not preserved under an embedding between metric spaces.

Essentially, this bounds how much the embedding stretches distances relative to the preimage. It is immediate that when ${\displaystyle D=1}$, the embedding preserves all distances up to the uniform scale ${\displaystyle r}$; with ${\displaystyle r=1}$ it is an isometry.
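For a finite point set, the distortion of the identity map between two metrics can be computed directly: take the ratio of the largest to the smallest expansion factor ${\displaystyle d^{\prime }(x,y)/d(x,y)}$ over all distinct pairs (with ${\displaystyle r}$ absorbing the smallest factor). The helper below is an illustrative sketch of this computation:

```python
import numpy as np

def distortion(points, d_from, d_to):
    """Distortion of the identity map on a finite point set: the ratio of the
    largest to the smallest expansion factor d_to(x, y) / d_from(x, y)
    over all distinct pairs (the minimal D in the definition above)."""
    ratios = []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            ratios.append(d_to(points[i], points[j]) / d_from(points[i], points[j]))
    return max(ratios) / min(ratios)

d1 = lambda x, y: np.linalg.norm(np.asarray(x) - np.asarray(y), ord=1)
d2 = lambda x, y: np.linalg.norm(np.asarray(x) - np.asarray(y), ord=2)

# Three points in R^2: viewing a d_2 embedding in d_1 distorts by sqrt(2) here.
pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(distortion(pts, d2, d1))
```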

This provides a framework for assessing the significance of an embedding’s failure to be isometric. Indeed,

Theorem.

1. (Bourgain) Any metric space on ${\displaystyle m}$ points embeds into ${\displaystyle (\mathbb {R} ^{n},d_{2})}$ with distortion ${\displaystyle O(\log m)}$.
2. (Matousek) Any metric space on ${\displaystyle m}$ points embeds into ${\displaystyle (\mathbb {R} ^{n},d_{p})}$ with distortion ${\displaystyle O\left({\frac {\log m}{p}}\right)}$.
3. There are metric spaces on ${\displaystyle m}$ points for which every embedding into ${\displaystyle (\mathbb {R} ^{n},d_{2})}$ incurs distortion ${\displaystyle \Omega (\log m)}$.

If one needs to carry a TransE embedding into another metric space, and assuming one does not want to run TransE multiple times, then one should run TransE with ${\displaystyle d_{2}}$, since embeddings from ${\displaystyle (\mathbb {R} ^{n},d_{2})}$ into the other ${\displaystyle \ell ^{p}}$ spaces can be isometric.

Homology and Barcodes

This section is particularly informal and would benefit greatly from a viewing of Matthew Wright's introductory video.

We have been working under the assumption that embeddings should be isometric; in particular, we have assumed that distortion should be avoided. However, distortion may not be as debilitating as it first appears. Recall that a successful TransE embedding looks like distinct clusters of data points with relationships between them. To perform link prediction, we only require sufficient separation between the clusters; as long as that separation is maintained, we can proceed as usual. To track this requirement, we look at the “topological features”, or underlying “shape”, of the data.

Distortion is acceptable as long as the clusters remain "separated".

To investigate the shape of the embedding we look at its persistent homology. The formal definition of homology is beyond the scope of this article; roughly, it is an algebraic picture of the topological features of the data of interest, by which we mean structures such as connected components, holes, etc. However, a finite set of data points in ${\displaystyle \mathbb {R} ^{n}}$ carries the discrete topology, from which not much can be discerned.

Instead, we form neighbourhoods of equal radius ${\displaystyle r}$ around each point and connect points whose neighbourhoods overlap. As ${\displaystyle r}$ increases, the number of connected components decreases. This provides us with sufficient structure to investigate the shape of the data.

As the radius of the neighbourhoods increases, connect points that have overlapping neighbourhoods.

As ${\displaystyle r}$ increases, we record the number of connected components in a graphical structure known as a barcode. In the barcode, each count of connected components is recorded as an interval with respect to ${\displaystyle r}$. Ideally, the count with the longest interval is the one over which the clusters retain sufficient separation. The maximum of ${\displaystyle r}$ in the interval of interest should then provide a bound on the acceptable amount of distortion for the embedding.

The barcode for the example above. The initial short bars represent the values of ${\displaystyle r}$ for which no vertices were connected and there were 9 “connected” components. The longest bar in the center represents the values of ${\displaystyle r}$ for which there are 3 connected components, i.e. 3 clusters of data points that are sufficiently separated. The right bar represents the values of ${\displaystyle r}$ so large that all vertices are connected, at which point we may stop since there is only one connected component.
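The 0-dimensional information in such a barcode (the component count as ${\displaystyle r}$ grows) can be computed with a simple union-find sketch: points are connected when their radius-${\displaystyle r}$ neighbourhoods overlap, i.e. when they are at distance at most ${\displaystyle 2r}$. The point set and radii below are illustrative:

```python
import numpy as np
from itertools import combinations

def component_counts(points, radii):
    """For each radius r, connect points at distance <= 2r (overlapping
    radius-r neighbourhoods) and count connected components via union-find."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Pairwise distances, sorted so we can merge incrementally as r grows.
    edges = sorted(
        (float(np.linalg.norm(np.asarray(p) - np.asarray(q))), i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
    )
    counts, k = [], 0
    for r in sorted(radii):
        while k < len(edges) and edges[k][0] <= 2 * r:
            a, b = find(edges[k][1]), find(edges[k][2])
            if a != b:
                parent[a] = b
            k += 1
        counts.append(sum(1 for i in range(len(points)) if find(i) == i))
    return counts

# Three well-separated pairs: components drop 6 -> 3 -> 1 as r grows.
pts = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
print(component_counts(pts, [0.2, 1.0, 6.0]))  # [6, 3, 1]
```

The plateau at 3 components corresponds to the longest bar in a barcode like the one described above; dedicated TDA software packages compute the full persistence intervals far more efficiently.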

Given such a framework, one could imagine the following procedure.

1. Given some data, suppose we wish to embed first into a space with metric ${\displaystyle d}$ and then decide whether it is possible to embed into another space with metric ${\displaystyle d^{\prime }}$ while maintaining separated clusters.
2. Use TransE to produce an embedding with metric ${\displaystyle d}$.
3. Look up the distortion ${\displaystyle D}$ of an embedding from ${\displaystyle d}$ into ${\displaystyle d^{\prime }}$.
4. Compute the barcode for the connected components of the embedding.
5. Use the barcode to find a bound on the acceptable distortion.
6. If ${\displaystyle D}$ is within the acceptable amount of distortion, then the new embedding into ${\displaystyle d^{\prime }}$ is acceptable.
7. Otherwise, it is not.
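The final decision step can be operationalized in a minimal, simplified form. Assume the barcode gives a working radius `r_current` and the largest radius `r_max` at which the clusters stay separated, and that an embedding with distortion ${\displaystyle D}$ stretches distances by at most a factor of ${\displaystyle D}$. The function name and acceptance rule below are assumptions for illustration, not part of the TransE paper:

```python
def embedding_acceptable(D, r_current, r_max):
    """Simplified acceptance rule (an assumption, not from the TransE paper):
    distances may grow by a factor of up to D, so the working radius scaled
    by D must stay below r_max, the end of the barcode interval in which
    the clusters remain separated."""
    return D * r_current <= r_max

print(embedding_acceptable(1.4, r_current=2.0, r_max=3.0))  # True
print(embedding_acceptable(2.0, r_current=2.0, r_max=3.0))  # False
```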


Conclusion and Future Work

Among the initial objectives of this project was to consider the feasibility of topological applications in the context of TransE. Not only does this appear feasible, there seem to be concrete applications for topology. Under certain conditions, we see evidence for choosing ${\displaystyle d_{2}}$ as the metric to run with TransE, due to the existence of isometric embeddings into other metric spaces. Moreover, if we relax the requirement of isometry, we have a framework for deciding the extent of acceptable distortion. From this point, there are several directions to take.

The immediate and obvious sequel is an implementation of these algorithms and decision procedures along with experimental testing. Algorithms for topological data analysis already exist. In particular, there are software packages for computing persistence intervals.[3] Unification in the form of a codified procedure involving a single run of TransE, then the computation of acceptable distortion via persistent homology and barcode, and, finally, the embedding of data into other metric spaces - with or without distortion - would be a pleasure to see in practice.

Secondly, the partitioned nature of TransE’s output resembles the objective of clustering algorithms. Of special interest are clustering algorithms between metric spaces. Ideally, if ${\displaystyle \Pi (X,d)}$ is the partition of ${\displaystyle X}$ with metric ${\displaystyle d}$ returned by a clustering algorithm, such an algorithm would have:

• Scale-Invariance: ${\displaystyle \Pi (X,d)=\Pi (X,r\cdot d)}$ where ${\displaystyle r>0}$.

• Richness: the image of ${\displaystyle \Pi }$ is all the possible partitions of ${\displaystyle X}$.

• Consistency: if ${\displaystyle d^{\prime }}$ is another metric for ${\displaystyle X}$ and

1. for any ${\displaystyle x,y}$ in some cluster in ${\displaystyle \Pi (X,d)}$, we have ${\displaystyle d^{\prime }(x,y)\leq d(x,y)}$,

2. for any ${\displaystyle x,y}$ in distinct clusters in ${\displaystyle \Pi (X,d)}$, we have ${\displaystyle d^{\prime }(x,y)\geq d(x,y)}$,

then ${\displaystyle \Pi (X,d)=\Pi (X,d^{\prime })}$.

Unfortunately, it is a theorem of Kleinberg’s that no algorithm satisfies all three characteristics.[4] The trade-offs will have to be investigated further to decide what compromises can be made for the purposes of TransE.
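To see one of these axioms fail concretely, consider a hypothetical fixed-distance-threshold clustering (merge any two clusters containing points within distance ${\displaystyle t}$). The sketch below shows it violates scale-invariance: scaling the metric by a constant changes the partition.

```python
def threshold_clusters(points, d, t):
    """Single-linkage-style clustering with a fixed threshold t:
    clusters containing points within distance t are merged."""
    clusters = [{p} for p in points]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(d(p, q) <= t for p in clusters[i] for q in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return sorted(sorted(c) for c in clusters)

d = lambda x, y: abs(x - y)
pts = [0.0, 1.0, 5.0]
print(threshold_clusters(pts, d, t=2.0))                 # [[0.0, 1.0], [5.0]]
scaled = lambda x, y: 10 * d(x, y)                       # the metric 10*d
print(threshold_clusters(pts, scaled, t=2.0))            # [[0.0], [1.0], [5.0]]
```

The same point set yields different partitions under ${\displaystyle d}$ and ${\displaystyle 10\cdot d}$, so ${\displaystyle \Pi (X,d)\neq \Pi (X,r\cdot d)}$ here, exactly the failure Kleinberg's axioms rule out from coexisting.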

Finally, recent work in topological data analysis has focused on the application of sheaf theory.[5] Sheaves, loosely stated, attach algebraic data to the open subsets of a space in a manner compatible with restriction. Their applications to algebraic geometry, differential geometry, and, in particular, algebraic topology have been well explored. It remains of great interest what ramifications sheaves will have for topological data analysis and TransE.

Annotated Bibliography

1. A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating Embeddings for Modeling Multi-relational Data,” in Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 2013, pp. 2787–2795.
2. Much thanks to the peer who pointed out the distinction during the presentation.
3. https://en.wikipedia.org/wiki/Persistent_homology#Computation
4. J. Kleinberg. An Impossibility Theorem for Clustering. Advances in Neural Information Processing Systems (NIPS) 15, 2002.
5. Curry, J.: Sheaves, Cosheaves and Applications. arXiv:1303.3255 [math.AT]