Course:CPSC522/Pretraining Methods for Graph Neural Networks
Pre-Training Methods for Graph Neural Networks
Understanding fundamental ideas for pre-training graph neural networks from the following two papers,
- Paper 1 for Node Classification: When Does Self-Supervision Help Graph Convolutional Networks? (ICML 2020)
- Paper 2 for Graph Classification: Strategies for Pre-Training Graph Neural Networks (ICLR 2020)
Principal Author: Nikhil Shenoy
Collaborators:
Abstract
One of the common challenges in Machine Learning (ML) is for models to improve performance on unseen and out-of-distribution (OOD) data. In subfields of ML like Computer Vision (CV) and Natural Language Processing (NLP), pre-training, which refers to training a model on a pretext task, has been the go-to approach for improving generalization performance. However, in the case of graph datasets, effectively applying pre-training methods to Graph Neural Networks (GNNs) remains an active research problem. In this study, we try to understand pre-training methods that improve upon important graph-based machine learning tasks like node classification and graph classification. Our first paper uses self-supervision methods to improve generalization performance on node classification tasks and performs a thorough analysis of these methods. Our second paper, focused on the graph classification task, introduces a novel pre-training method and evaluates its effectiveness in comparison to existing approaches.
Builds On
This wiki page builds upon the knowledge of Graph Neural Networks and Graph Convolutional Networks as discussed in the foundation page.
Background
Many machine learning tasks can be defined over graphs, but we focus on the following two in this study,
- Node Classification: Classifying individual nodes
- Graph Classification: Classifying entire graphs
The terms self-supervised learning, pre-training and transfer learning are used throughout this article, so it is important to understand what each of them means.
Self-Supervised Learning
Self-supervised learning refers to a form of machine learning where models are trained without human-annotated labels, on pretext tasks whose labels are derived from the data itself, such as predicting missing parts of an image or the angle of a rotated image. In the context of graphs, semi-supervised graph-based learning assumes that nodes connected by edges of larger weight are likely to be similar and should therefore share a similar label distribution [1]. Only one prior work [2] performed self-supervision on graphs using deep learning before Paper 1 was published.
Pre-Training
Pre-training refers to the strategy of training a model on one task and then fine-tuning it on another. There are two challenges (Pan & Yang, 2009[3]; Hendrycks et al., 2019[4]) that pre-training aims to address in the context of graph-based deep learning. First, situations where graph labels are scarce or where obtaining labels is cost and resource intensive[5]. Second, out-of-distribution settings where the graphs in the test set are structurally different from the graphs in the training set.
Transfer Learning
Transfer learning refers to the strategy of using a pre-trained model as a starting point for a task, rather than training from scratch. This has been an attractive option in computer vision where performing transfer learning has resulted in improved generalization performance while reducing data and time costs associated with training. Studies [6][7][8] have shown that successful transfer learning in the case of graphs not only depends on the size of the pre-training dataset, but also depends on structural differences between the graphs in the pre-training dataset and the downstream task dataset.
Pre-training Methods for Node-Classification
In this section, we discuss the analysis done in the first paper (When Does Self-Supervision Help Graph Convolutional Networks?). The primary focus of this paper is to understand the following question,
For the node-classification task, can self-supervised training improve the generalization and robustness capacity of the GCNs?
Supervised Learning on Graph Convolutional Networks (GCNs)
As discussed in the foundation page, we can revisit the supervised learning task in the context of GCNs. Given a graph dataset $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with feature matrix $X$, adjacency matrix $A$, and label matrix $Y \in \mathbb{R}^{N \times d_y}$ ($d_y$ is the label dimension), where labels exist for the nodes in $\mathcal{V}_{\text{label}} \subseteq \mathcal{V}$, the model parameters in GCNs are learned by minimizing the supervised loss calculated between the output of the network and the true labels for labeled nodes,

$$\min_{\theta, \Theta} \; \mathcal{L}_{\text{sup}}(\theta, \Theta) = \frac{1}{|\mathcal{V}_{\text{label}}|} \sum_{v_n \in \mathcal{V}_{\text{label}}} L\big(\boldsymbol{z}_n, \boldsymbol{y}_n\big), \qquad Z = f_\theta(X, A)\,\Theta,$$

where $f_\theta$ is the feature extractor, $\Theta$ is the linear layer, $L(\cdot,\cdot)$ is the loss function for each example, $\boldsymbol{z}_n$ is the prediction for node $v_n$, and $\boldsymbol{y}_n$ is its true label vector.
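To make the setup concrete, here is a minimal PyTorch sketch (not from the paper) of supervised node classification with a GCN-style encoder and a linear head, where the loss is computed only over the labeled nodes. The toy random graph, the one-layer encoder and all names are illustrative assumptions.

```python
# Minimal sketch: supervised node classification with a GCN-style feature
# extractor f_theta and a linear head Theta; loss only on labeled nodes.
import torch
import torch.nn.functional as F

def normalize_adj(A):
    # \hat{A} = D^{-1/2} (A + I) D^{-1/2}, the usual GCN propagation matrix
    A_hat = A + torch.eye(A.size(0))
    d = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

class GCNEncoder(torch.nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.W0 = torch.nn.Linear(in_dim, hid_dim, bias=False)

    def forward(self, X, A_hat):
        return F.relu(A_hat @ self.W0(X))      # node embeddings f_theta(X, A)

# toy data: N nodes, only the first few are labeled
N, in_dim, hid_dim, n_classes = 6, 8, 16, 3
X, A = torch.randn(N, in_dim), (torch.rand(N, N) > 0.7).float()
A = ((A + A.t()) > 0).float()                   # make adjacency symmetric
A_hat = normalize_adj(A)
y = torch.randint(0, n_classes, (N,))
labeled = torch.tensor([0, 1, 2])               # V_label

encoder, head = GCNEncoder(in_dim, hid_dim), torch.nn.Linear(hid_dim, n_classes)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-2)

for _ in range(10):
    Z = head(encoder(X, A_hat))
    loss = F.cross_entropy(Z[labeled], y[labeled])  # L_sup over labeled nodes only
    opt.zero_grad(); loss.backward(); opt.step()
```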
Possible Schemes to Introduce Self-Supervision for GCNs
We go through three possible schemes to equip a GCN with a node-level self-supervised (ss) task,
Pre-training and Fine-tuning
Here, we are interested in designing a self-supervised learning task that can be performed before the supervised learning task. In the pre-training process, the feature extractor network $f_\theta$ is pre-trained with the self-supervised task as follows,

$$\min_{\theta, \Theta_{ss}} \; \mathcal{L}_{ss}(\theta, \Theta_{ss}) = \frac{1}{|\mathcal{V}|} \sum_{v_n \in \mathcal{V}} L_{ss}\big(\boldsymbol{z}^{ss}_n, \boldsymbol{y}^{ss}_n\big),$$

where $\Theta_{ss}$ is the linear transformation parameter, $L_{ss}(\cdot,\cdot)$ is the loss function of the self-supervised task, and $\boldsymbol{z}^{ss}_n$ and $\boldsymbol{y}^{ss}_n$ are the prediction and true label for the self-supervised task. The feature extractor $f_\theta$ can then be used for the downstream task. The specific tasks are described in the next section.
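A minimal sketch of the two-stage scheme follows, assuming self-supervised labels (e.g., cluster or partition indices) are already available; a plain linear layer stands in for the GCN feature extractor $f_\theta$, and all names and sizes are hypothetical.

```python
# Minimal sketch of the pre-training & fine-tuning scheme.
import torch
import torch.nn.functional as F

N, in_dim, hid_dim, n_ss_classes, n_classes = 100, 16, 32, 10, 3
X = torch.randn(N, in_dim)
y_ss = torch.randint(0, n_ss_classes, (N,))   # self-supervised labels (e.g. cluster ids)
y = torch.randint(0, n_classes, (N,))
labeled = torch.arange(20)                    # small labeled subset

encoder = torch.nn.Linear(in_dim, hid_dim)    # f_theta (placeholder for a GCN)

# Step 1: pre-train encoder + SS head Theta_ss on the self-supervised task
ss_head = torch.nn.Linear(hid_dim, n_ss_classes)
opt = torch.optim.Adam(list(encoder.parameters()) + list(ss_head.parameters()), lr=1e-2)
for _ in range(50):
    loss = F.cross_entropy(ss_head(encoder(X)), y_ss)        # L_ss over all nodes
    opt.zero_grad(); loss.backward(); opt.step()

# Step 2: fine-tune encoder + a fresh head Theta on the target task
head = torch.nn.Linear(hid_dim, n_classes)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-2)
for _ in range(50):
    loss = F.cross_entropy(head(encoder(X))[labeled], y[labeled])  # L_sup
    opt.zero_grad(); loss.backward(); opt.step()
```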
Self-training
Self-training is an iterative process introduced by [9] in the context of GCNs. Supervised learning is first performed on the labeled samples. Using the trained model, highly confident unlabeled samples are identified and assigned "pseudo" labels from the model's predictions. These samples are then included in the next round of training. The process is repeated for several rounds and can be formulated similarly to the supervised learning equation of GCNs.
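A minimal sketch of the pseudo-labeling loop under the same placeholder assumptions (a linear model stands in for the GCN plus head, and the 0.9 confidence threshold is an arbitrary illustrative choice):

```python
# Minimal sketch of self-training: train on labeled nodes, then add the most
# confident unlabeled predictions as pseudo-labels for the next round.
import torch
import torch.nn.functional as F

N, in_dim, n_classes, rounds = 200, 16, 4, 3
X = torch.randn(N, in_dim)
y = torch.randint(0, n_classes, (N,))
labeled = set(range(10))                       # initially labeled nodes
train_y = y.clone()                            # only the labeled subset is trusted initially

model = torch.nn.Linear(in_dim, n_classes)     # placeholder for GCN + linear head
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(rounds):
    idx = torch.tensor(sorted(labeled))
    for _ in range(100):                       # supervised training on current label set
        loss = F.cross_entropy(model(X)[idx], train_y[idx])
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                      # assign pseudo-labels to confident nodes
        probs = F.softmax(model(X), dim=1)
        conf, pred = probs.max(dim=1)
    for i in torch.nonzero(conf > 0.9).flatten().tolist():
        if i not in labeled:
            labeled.add(i)
            train_y[i] = pred[i]               # "pseudo" label from the model's prediction
```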
Multi-task learning
Considering a target task and a self-supervised task for a GCN, the output and the training process can be formulated as,

$$\min_{\theta, \Theta, \Theta_{ss}} \; \alpha_1 \mathcal{L}_{\text{sup}}(\theta, \Theta) + \alpha_2 \mathcal{L}_{ss}(\theta, \Theta_{ss}),$$

where $\alpha_1, \alpha_2 > 0$ weight the target (supervised) loss and the self-supervised loss, and the feature extractor $f_\theta$ is shared between the two tasks.
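A minimal sketch of the joint objective, again with a linear placeholder for the shared feature extractor and hypothetical loss weights:

```python
# Minimal sketch of multi-task learning: the target loss and the self-supervised
# loss share the encoder and are optimized jointly with weights alpha1, alpha2.
import torch
import torch.nn.functional as F

N, in_dim, hid_dim, n_classes, n_ss_classes = 100, 16, 32, 3, 10
X = torch.randn(N, in_dim)
y = torch.randint(0, n_classes, (N,)); labeled = torch.arange(20)
y_ss = torch.randint(0, n_ss_classes, (N,))       # e.g. partition indices

encoder = torch.nn.Linear(in_dim, hid_dim)        # shared f_theta (placeholder for a GCN)
head = torch.nn.Linear(hid_dim, n_classes)        # Theta, target task
ss_head = torch.nn.Linear(hid_dim, n_ss_classes)  # Theta_ss, self-supervised task
alpha1, alpha2 = 1.0, 0.5

params = list(encoder.parameters()) + list(head.parameters()) + list(ss_head.parameters())
opt = torch.optim.Adam(params, lr=1e-2)
for _ in range(100):
    H = encoder(X)
    loss = (alpha1 * F.cross_entropy(head(H)[labeled], y[labeled])   # L_sup
            + alpha2 * F.cross_entropy(ss_head(H), y_ss))            # L_ss
    opt.zero_grad(); loss.backward(); opt.step()
```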
Self-Supervision Methods specific to GCNs
While the previous section describes various schemes to introduce self-supervision, in this section we explore self-supervised pretext tasks specific to GCNs. In later sections, we show that using these pretext tasks for self-supervision benefits various supervised/downstream tasks. All three pretext tasks are shown visually in Figure 1.
Node Clustering
Given a node set $\mathcal{V}$ with the feature matrix $X$ as input, and a preset number of clusters $K$ as a hyperparameter, the clustering algorithm outputs a set of node sets $\{\mathcal{V}_k\}_{k=1}^{K}$ such that the node sets are pairwise disjoint and all node sets are non-empty. Once every node is assigned its cluster index as a pseudo-label, we can use a standard supervised learning setup to predict the node's cluster label.
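A minimal sketch of generating such pseudo-labels; k-means is used here purely for illustration and may differ from the clustering algorithm used in the paper.

```python
# Minimal sketch: node-clustering pseudo-labels from k-means on node features.
import numpy as np
from sklearn.cluster import KMeans

N, in_dim, K = 500, 16, 10          # K = preset number of clusters (hyperparameter)
X = np.random.randn(N, in_dim)      # node feature matrix

kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
ss_labels = kmeans.labels_          # cluster index of each node = self-supervised label
# These pseudo-labels are then predicted with a classification head on top of the
# GCN encoder, exactly as in the supervised setup sketched earlier.
```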
Graph Partitioning
This is a topology-based self-supervision method, motivated by the observation that two nodes connected by a "strong" edge (larger weight) are more likely to have the same label class [10]. Similar to node clustering, this method partitions the graph into roughly equal subsets such that the number of edges connecting nodes across subsets is minimized [11]. Given the node set $\mathcal{V}$, edge set $\mathcal{E}$ and the adjacency matrix $A$ as input, with a preset number of partitions $K$ as a hyperparameter, the algorithm outputs a set of node sets $\{\mathcal{V}_k\}_{k=1}^{K}$ such that the node sets are pairwise disjoint and all node sets are non-empty. With the node set partitioned, we assign partition indices as self-supervised labels.
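A minimal sketch of generating graph-partitioning pseudo-labels. This assumes the third-party `pymetis` bindings to the METIS partitioner are installed; the toy adjacency list and variable names are illustrative, and any balanced min-cut partitioner would play the same role.

```python
# Minimal sketch: balanced min-cut partition indices as self-supervised labels.
import pymetis

# toy graph as an adjacency list (node i -> list of neighbours)
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2, 4], [3, 5], [4]]
K = 2                                        # preset number of partitions

_, membership = pymetis.part_graph(K, adjacency=adjacency)
ss_labels = membership                       # partition index of each node = SS label
print(ss_labels)                             # e.g. [0, 0, 0, 1, 1, 1]
```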
Graph Completion
Motivated by image inpainting [12] in computer vision, the paper proposes graph completion, a regression task for self-supervised pre-training. The algorithm masks target nodes by removing their features and then aims to recover the masked node features by feeding the GCN the rest of the (unmasked) graph. The rationale behind choosing such a task is two-fold: 1) completion labels are free to obtain, and 2) graph completion can help improve the representation by teaching the network to extract features from the context.
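A minimal sketch of graph completion as a regression task; the toy graph, the mask ratio, and the simple one-layer propagation are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of graph completion: mask the features of some target nodes,
# propagate over the rest of the graph, and regress the original features back.
import torch
import torch.nn.functional as F

N, in_dim = 50, 16
X = torch.randn(N, in_dim)
A = (torch.rand(N, N) > 0.8).float(); A = ((A + A.t()) > 0).float()
A_hat = A + torch.eye(N); d = A_hat.sum(1)
A_hat = torch.diag(d.pow(-0.5)) @ A_hat @ torch.diag(d.pow(-0.5))

mask = torch.zeros(N, dtype=torch.bool)
mask[torch.randperm(N)[:8]] = True          # randomly chosen target nodes
X_masked = X.clone(); X_masked[mask] = 0.0  # remove their features

W0 = torch.nn.Linear(in_dim, 32)
W1 = torch.nn.Linear(32, in_dim)            # regression head: reconstruct features
opt = torch.optim.Adam(list(W0.parameters()) + list(W1.parameters()), lr=1e-2)

for _ in range(100):
    H = F.relu(A_hat @ W0(X_masked))        # encode the partially-masked graph
    X_rec = W1(A_hat @ H)
    # completion labels are "free": they are the original node features themselves
    loss = F.mse_loss(X_rec[mask], X[mask])
    opt.zero_grad(); loss.backward(); opt.step()
```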
All three methods are shown graphically in Figure 1. A summary of the methods is as follows,

| Task | Relies On | Primary Assumption | Type |
|---|---|---|---|
| Clustering | Nodes | Feature similarity | Classification |
| Partitioning | Edges | Connection density | Classification |
| Completion | Nodes & edges | Context-based representation | Regression |
Dataset Statistics
The paper uses three citation-network datasets, Cora, Citeseer and PubMed, for all the experiments.
Discussion: Self-Supervision helps generalizability (Table 2)
The three schemes for incorporating self-supervision into GCN training are examined in Table 2 and summarized in the points below,
- Pre-training and fine-tuning provides some performance improvement for a small dataset like Cora but does not do so for the larger datasets Citeseer and PubMed. This holds across different choices of self-supervision methods. Information learned through self-supervision in the pre-training stage may be lost during fine-tuning.
- Self-training and multi-task learning both improve performance when self-supervision is included compared to when it is not. In contrast to pre-training and fine-tuning, there is no switch in objective functions.
- The multi-task learning setup is more general than self-training (which relies on pseudo labels), as it allows assigning self-supervised labels in different ways based on graph structure and node features, without requiring labeled data.
Discussion: Multi-Task Self-Supervision on SOTAs (Table 3)
- Graph partitioning is generally beneficial to all SOTA network architectures on all datasets, whereas node clustering does not benefit SOTA network architectures on PubMed.
- Feature-based node clustering assumes that feature similarity implies target-label similarity and can group distant nodes with similar features together. When the dataset is large and the feature dimension is relatively low, feature-based clustering could be challenged in providing informative pseudo-labels.
- Topology-based graph partitioning assumes that connections in the topology imply similarity in labels, which is a safe assumption for the three datasets since they are all citation networks. Therefore, the prior represented by graph partitioning is general and effective in benefiting GCNs.
- Topology- and feature-based graph completion assumes feature similarity or smoothness in small neighbourhoods of graphs. Such a context-based feature representation can greatly improve target performance, especially when the neighbourhoods are small. However, the regression task can be challenged by denser graphs with larger neighbourhoods, where the completion task becomes more difficult.
Self-Supervision in Graph Adversarial Defense
In this section we try to understand the role of self-supervision in gaining robustness against various graph adversarial attacks.
Adversarial Attacks
The focus is on single-node direct evasion attacks: a node-specific attack on the attributes/links of the target node under certain constraints [14], while the trained model remains unchanged during/after the attack. The attacker generates a perturbed feature matrix $X'$ and adjacency matrix $A'$ as

$$(X', A') = \text{attack}\big(v_n, X, A, \theta^*, \Theta^*\big),$$

with (the attributes, links and label of) the target node $v_n$ and the trained model parameters $(\theta^*, \Theta^*)$ as inputs.
Adversarial Defense
In the graph domain, it is difficult to generate adversarial examples because of the low labeling rates in the transductive semi-supervised setting. Wang et al. [15] propose using unlabeled nodes to generate adversarial samples. Specifically, self-training is used to assign pseudo labels $\hat{\boldsymbol{y}}_n$ to unlabeled nodes. Then, two disjoint subsets $\mathcal{V}_1$ and $\mathcal{V}_2$ are randomly chosen from the unlabeled node set, and each target node in them is attacked to generate the perturbed feature and adjacency matrices $X'$ and $A'$. Adversarial training can then be formulated as supervised learning for labeled nodes and recovering pseudo labels for unlabeled nodes,

$$\min_{\theta, \Theta} \; \mathcal{L}_{\text{adv}}(\theta, \Theta) = \frac{1}{|\mathcal{V}_{\text{label}}|} \sum_{v_n \in \mathcal{V}_{\text{label}}} L\big(\boldsymbol{z}'_n, \boldsymbol{y}_n\big) + \frac{1}{|\mathcal{V}_1 \cup \mathcal{V}_2|} \sum_{v_n \in \mathcal{V}_1 \cup \mathcal{V}_2} L\big(\boldsymbol{z}'_n, \hat{\boldsymbol{y}}_n\big),$$

where $\boldsymbol{z}'_n$ is the network output for node $v_n$ computed on the perturbed inputs $(X', A')$.
Adversarial Defense with Self-Supervision
With self-supervision working in GCNs and adversarial training introduced in the section above, we can formulate adversarial training with self-supervision,

$$\min_{\theta, \Theta, \Theta_{ss}} \; \alpha_1 \mathcal{L}_{\text{adv}}(\theta, \Theta) + \alpha_2 \mathcal{L}_{ss}(\theta, \Theta_{ss}),$$

where the adversarial loss $\mathcal{L}_{\text{adv}}$ (computed on the perturbed inputs with true and pseudo labels) is combined with the self-supervised loss $\mathcal{L}_{ss}$ through weights $\alpha_1, \alpha_2 > 0$.
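A minimal sketch of this combined objective follows; the perturbed features `X_adv`, the attacked node subsets, and the pseudo-labels are assumed to have been generated already (attack generation and the perturbed adjacency matrix are out of scope here), and all names and weights are hypothetical.

```python
# Minimal sketch: adversarial training loss on perturbed inputs, combined with a
# weighted self-supervised loss on the clean inputs.
import torch
import torch.nn.functional as F

N, in_dim, hid, n_classes, n_ss = 100, 16, 32, 3, 10
X, X_adv = torch.randn(N, in_dim), torch.randn(N, in_dim)   # clean / perturbed features
y = torch.randint(0, n_classes, (N,)); labeled = torch.arange(20)
pseudo_y = torch.randint(0, n_classes, (N,))                # pseudo-labels from self-training
attacked_unlabeled = torch.arange(20, 60)                   # the chosen unlabeled subsets
y_ss = torch.randint(0, n_ss, (N,))                         # self-supervised labels

encoder = torch.nn.Linear(in_dim, hid)                      # placeholder for the GCN f_theta
head = torch.nn.Linear(hid, n_classes)
ss_head = torch.nn.Linear(hid, n_ss)
alpha1, alpha2 = 1.0, 0.5

params = list(encoder.parameters()) + list(head.parameters()) + list(ss_head.parameters())
opt = torch.optim.Adam(params, lr=1e-2)
for _ in range(100):
    Z_adv = head(encoder(X_adv))                            # predictions on perturbed input
    l_adv = (F.cross_entropy(Z_adv[labeled], y[labeled]) +  # labeled nodes: true labels
             F.cross_entropy(Z_adv[attacked_unlabeled], pseudo_y[attacked_unlabeled]))
    l_ss = F.cross_entropy(ss_head(encoder(X)), y_ss)       # self-supervised task on clean input
    loss = alpha1 * l_adv + alpha2 * l_ss
    opt.zero_grad(); loss.backward(); opt.step()
```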
Discussion: Does Self-Supervision help with Adversarial Defense?
Based on Table 4 and Table 5, introducing self-supervision into adversarial training improves the adversarial defense of GCNs,
- Node clustering and graph partitioning are more effective against feature attacks and link attacks.
- Graph completion boosts the adversarial accuracy by around 4.5% against link attacks and by over 8.0% against the combined link & feature attacks on Cora.
Pre-training Methods for Graph-Classification
In this section, we introduce the method and analysis of our second paper (Strategies for Pre-Training Graph Neural Networks) to improve upon graph classification tasks. This paper has two key contributions,
- Conduct the first systematic large-scale investigation of strategies for pre-training GNNs
- Develop an effective pre-training strategy for GNNs and demonstrate its effectiveness and its ability for out-of-distribution generalization on hard transfer-learning problems
Before we delve into the method introduced in this paper, let's go over the baseline pre-training strategies,
Node-Level Pre-Training
For node-level pre-training, the paper considers two self-supervision methods: context prediction and attribute masking. Both are shown visually in Figure 2 below.
Method 1: Context Prediction
In context prediction, the goal is to pre-train a GNN such that nodes appearing in similar structural contexts have similar embeddings. The following three steps are required to perform context prediction based node-level pre-training,
- Neighbourhood and Context Graphs. For every node $v$, the $K$-hop neighbourhood of $v$ contains all nodes and edges that are at most $K$ hops away from $v$ in the graph. The context graph of node $v$ is the subgraph between $r_1$ hops and $r_2$ hops away from $v$ (i.e., it is a ring of width $r_2 - r_1$). We require $r_1 < K$ so that some nodes are shared between the neighbourhood and the context graph; these shared nodes are referred to as context anchor nodes.
- Encoding the context into a fixed vector using an auxiliary GNN. To this end, an auxiliary GNN, referred to as the context GNN, is used to obtain node embeddings in the context graph. We then average the embeddings of the context anchor nodes to obtain a fixed-length context embedding. For node $v$ in graph $G$, we denote its corresponding context embedding as $\boldsymbol{c}_v^{G}$.
- Learning via Negative Sampling. Negative sampling [16][17] is used to jointly learn the main GNN and the context GNN. The main GNN encodes neighbourhoods to obtain node embeddings $\boldsymbol{h}_v^{(K)}$. The context GNN encodes context graphs to obtain context embeddings $\boldsymbol{c}_v^{G}$. In particular, the learning objective of context prediction is a binary classification of whether a particular neighbourhood and a particular context graph belong to the same node:

$$\sigma\big(\boldsymbol{h}_v^{(K)\top} \boldsymbol{c}_{v'}^{G'}\big) \approx \mathbb{1}\{v \text{ and } v' \text{ are the same node}\},$$

where $\sigma(\cdot)$ is the sigmoid function. We either let $v' = v$ and $G' = G$ (i.e., a positive neighbourhood-context pair), or we randomly sample $v'$ from a randomly chosen graph $G'$ (i.e., a negative neighbourhood-context pair). We use a negative sampling ratio of 1 (one negative pair per positive pair), and use the negative log likelihood as the loss function.
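A minimal sketch of this objective is given below; the neighbourhood and context embeddings are random placeholders standing in for the outputs of the main GNN and the context GNN, and negatives are formed by shuffling contexts within the batch, which is one simple way (an assumption, not necessarily the paper's) to pair a neighbourhood with a context from a different node.

```python
# Minimal sketch of the context-prediction objective: binary classification of
# whether a neighbourhood embedding h_v and a context embedding c_v' come from
# the same node, with one negative pair per positive pair.
import torch
import torch.nn.functional as F

B, dim = 32, 64
h_v = torch.randn(B, dim, requires_grad=True)       # K-hop neighbourhood embeddings
c_pos = torch.randn(B, dim, requires_grad=True)     # context embeddings of the same nodes
c_neg = c_pos[torch.randperm(B)]                    # negative sampling: shuffled contexts

def pair_logit(h, c):
    return (h * c).sum(dim=1)                       # sigma(h^T c) gives the pair probability

pos_logit, neg_logit = pair_logit(h_v, c_pos), pair_logit(h_v, c_neg)
loss = (F.binary_cross_entropy_with_logits(pos_logit, torch.ones(B)) +
        F.binary_cross_entropy_with_logits(neg_logit, torch.zeros(B)))
loss.backward()   # gradients flow back into both the main GNN and the context GNN
```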
Method 2: Attribute Masking
This is a simple self-supervision strategy where node/edge attributes are masked and the GNN is trained to predict those attributes [18] based on the neighbouring structure. Specifically, we randomly mask input node/edge attributes, for example atom types in molecular graphs, by replacing them with special masked indicators. We then apply GNNs to obtain the corresponding node/edge embeddings (edge embeddings are obtained as the sum of the node embeddings of the edge's end nodes). Finally, a linear model is applied on top of the embeddings to predict a masked node/edge attribute. In contrast to masking in language models [18], we operate on non-fully-connected graphs and aim to capture the regularities of node/edge attributes distributed over different graph structures. Furthermore, we allow masking edge attributes, going beyond masking node attributes.
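A minimal sketch of attribute masking on node attributes; the embedding lookup stands in for the GNN encoder (message passing is omitted), and the mask ratio and names are illustrative assumptions.

```python
# Minimal sketch of attribute masking: replace some node attributes (e.g. atom
# types) with a special MASK index, embed, and predict the original attribute
# at the masked positions with a linear head.
import torch
import torch.nn.functional as F

N, n_atom_types, hid = 40, 10, 32
MASK = n_atom_types                              # special "masked" indicator index
atom_type = torch.randint(0, n_atom_types, (N,)) # original node attributes

mask = torch.zeros(N, dtype=torch.bool)
mask[torch.randperm(N)[:6]] = True
inp = atom_type.clone(); inp[mask] = MASK        # masked input attributes

embed = torch.nn.Embedding(n_atom_types + 1, hid)   # placeholder for the GNN encoder
head = torch.nn.Linear(hid, n_atom_types)           # predicts the masked attribute
opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-2)

for _ in range(100):
    H = embed(inp)                               # in practice: GNN message passing here
    logits = head(H[mask])
    loss = F.cross_entropy(logits, atom_type[mask])
    opt.zero_grad(); loss.backward(); opt.step()
```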
Graph-Level Pre-Training
Graph level pre-training of GNNs can lead to useful graph embeddings. We go over two ways of performing graph level pre-training,
Method 1: Supervised Graph-Level Property Prediction
Graph-level embeddings can be injected with domain-specific knowledge by defining supervised graph-level prediction tasks. Specifically, we consider a practical method to pre-train graph representations: graph-level multi-task supervised pre-training, which jointly predicts a diverse set of supervised labels for individual graphs. For example, in molecular property prediction, we can pre-train GNNs to predict essentially all the properties of molecules that have been experimentally measured so far.
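A minimal sketch of the multi-task supervised objective; the graph embeddings are random placeholders standing in for pooled GNN outputs, and the number of tasks and all names are illustrative assumptions.

```python
# Minimal sketch of graph-level multi-task supervised pre-training: pool node
# embeddings into a graph embedding and jointly predict many binary properties
# with a multi-label BCE loss.
import torch
import torch.nn.functional as F

n_graphs, emb_dim, n_tasks = 64, 128, 100        # e.g. many bioassay-style tasks
graph_emb = torch.randn(n_graphs, emb_dim, requires_grad=True)  # pooled GNN output
labels = torch.randint(0, 2, (n_graphs, n_tasks)).float()       # per-task binary labels

head = torch.nn.Linear(emb_dim, n_tasks)         # one output per supervised task
logits = head(graph_emb)
loss = F.binary_cross_entropy_with_logits(logits, labels)
loss.backward()                                  # gradients reach the shared GNN encoder
```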
Method 2: Structural Similarity Prediction
A different approach would be to define a graph-level predictive task, where the goal would be to model the structural similarity of two graphs. Examples of such tasks include modeling the graph edit distance (Bai et al., 2019[19]) or predicting graph structure similarity (Navarin et al., 2018[20]). However, finding the ground truth graph distance values is a difficult problem, and in large datasets there is a quadratic number of graph pairs to consider.
Issues with using only Node-Level and Graph-Level Pre-training Strategy
As the experiments below show (Observations 2 and 3), pre-training only at the level of individual nodes or only at the level of entire graphs gives limited improvement and can even lead to negative transfer on downstream tasks. The two levels are complementary: node-level pre-training alone does not guarantee useful graph-level embeddings, while graph-level supervised pre-training alone does not ensure that individual node embeddings capture meaningful local information.
Novel Pre-Training Strategy
Altogether, our pre-training strategy is to first perform node-level self-supervised pre-training and then graph-level multi-task supervised pre-training. When the GNN pre-training is finished, we fine-tune the pre-trained GNN model on downstream tasks.
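The following outline summarizes this pipeline; the function names, arguments and dataset variables are hypothetical placeholders for the routines sketched in the earlier snippets, not an actual API.

```python
# High-level outline of the combined strategy (illustrative only).
def pretrain_node_level(gnn, unlabeled_graphs):
    """Node-level self-supervised pre-training: context prediction and/or attribute masking."""
    ...

def pretrain_graph_level(gnn, labeled_graphs):
    """Graph-level multi-task supervised pre-training on many graph labels."""
    ...

def finetune(gnn, downstream_dataset):
    """Attach a fresh prediction head and fine-tune end-to-end on the downstream task."""
    ...

# Strategy: node-level self-supervised -> graph-level supervised -> fine-tune, e.g.
# pretrain_node_level(gnn, unlabeled_molecules)
# pretrain_graph_level(gnn, labeled_molecules)
# finetune(gnn, downstream_molecule_dataset)
```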
Experiments
Datasets
Pre-training datasets
- For the chemistry domain, we use the ZINC15 database[21] for node-level self-supervised pre-training, and a preprocessed ChEMBL dataset[22] for graph-level multi-task supervised pre-training.
- For the biology domain, we use 395K unlabeled protein ego-networks derived from the PPI networks of 50 species for node-level self-supervised pre-training, and 88K labeled protein ego-networks for graph-level supervised pre-training, jointly predicting 5000 coarse-grained biological functions.
Downstream classification datasets.
- For the chemistry domain, graph benchmarks like the MUTAG[23] and PTC[24] molecule datasets, and MoleculeNet[25].
- For the biology domain, PPI networks from Zitnik et al. 2019[26]
GNN Architectures
Graph Isomorphism Networks (GINs)[24] are the most expressive GNN architecture with respect to the Weisfeiler-Lehman test[24] and a SOTA architecture for graph-level prediction tasks. Although less expressive, other architectures like GCN, GAT[27] and GraphSAGE[28] are also experimented with (refer to Table 7).
Discussion
- Observation (1): Table 7 shows that the most expressive GNN architecture (GIN), when pre-trained, achieves the best performance across domains and datasets. Compared with gains of pre-training achieved by GIN architecture, gains of pre-training using less expressive GNNs (GCN, GraphSAGE, and GAT) are smaller and can sometimes even be negative (Table 7).
- Observation (2): As seen from the shaded cells of Table 6, the strong baseline strategy that performs extensive graph-level multi-task supervised pre-training of GNNs gives surprisingly limited performance gain and yields negative transfer on many downstream tasks (2 out of 8 datasets in molecular prediction, and 13 out of 40 tasks in protein function prediction).
- Observation (3): From the upper half of Table 6 and the left panel of Figure 3, we see that another baseline strategy, which only performs node-level self-supervised pre-training, also gives limited performance improvement and is comparable to the graph-level multi-task supervised pre-training baseline.
- Observation (4): From the lower half of Table 6 and the right panel of Figure 3, we see that our pre-training strategy of combining graph-level multi-task supervised and node-level self-supervised pre-training avoids negative transfer across downstream datasets and achieves best performance.
Conclusion
In the case of node classification, the main takeaways with respect to self-supervision methods would be as follows,
- Among the three schemes to incorporate self-supervision into GCNs, multi-task learning works as a regularizer and consistently benefits GCNs.
- Pre-training & fine-tuning switches the objective function from the self-supervision loss to the target supervision loss, which causes "overwriting" and therefore yields limited performance gains.
- Through multi-task learning, self-supervised tasks provide informative priors that can benefit GCNs' generalizable target performance. Node clustering and graph partitioning provide priors on node features and graph structure respectively, whereas graph completion, with (joint) priors on both, helps GCNs with context-based feature representation.
- Whether a self-supervision task helps a SOTA GCN depends on whether the dataset allows for quality pseudo-labels corresponding to the task and whether self-supervised priors complement existing architecture-posed priors.
- Multi-task self-supervision in adversarial training improves GCN’s robustness against various graph attacks.
In the case of graph classification, the main takeaways with respect to self-supervision methods would be as follows,
- Using both node-level and graph-level pre-training in combination with an expressive GNN is crucial. This ensures that node embeddings capture local neighbourhood semantics which, when pooled together, yield meaningful graph-level representations.
- Expressive GNNs like Graph Isomorphism Network (GIN) see the most improvement when pre-trained.
- The paper makes an important step towards understanding transfer learning on graphs and addresses the issue of negative transfer observed in prior studies.
References
- ↑ Zhu, Xiaojin, and Andrew B. Goldberg. "Introduction to semi-supervised learning." Synthesis lectures on artificial intelligence and machine learning 3.1 (2009): 1-130.
- ↑ Sun, Ke, Zhouchen Lin, and Zhanxing Zhu. "Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes." Proceedings of the AAAI conference on artificial intelligence. Vol. 34. No. 04. 2020.
- ↑ Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on knowledge and data engineering 22.10 (2010): 1345-1359.
- ↑ Hendrycks, Dan, et al. "Using self-supervised learning can improve model robustness and uncertainty." Advances in neural information processing systems 32 (2019).
- ↑ Zitnik, Marinka, Rok Sosič, and Jure Leskovec. "Prioritizing network communities." Nature communications 9.1 (2018): 2544.
- ↑ Xu, Yuting, et al. "Demystifying multitask deep neural networks for quantitative structure–activity relationships." Journal of chemical information and modeling 57.10 (2017): 2490-2504.
- ↑ Ching, Travers, et al. "Opportunities and obstacles for deep learning in biology and medicine." Journal of The Royal Society Interface 15.141 (2018): 20170387.
- ↑ Wang, Jingshu, et al. "Data denoising with transfer learning in single-cell transcriptomics." Nature methods 16.9 (2019): 875-878.
- ↑ Sun, Ke, Zhouchen Lin, and Zhanxing Zhu. "Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes." Proceedings of the AAAI conference on artificial intelligence. Vol. 34. No. 04. 2020.
- ↑ Zhu, Xiaojin, and Andrew Goldberg. Introduction to semi-supervised learning. Morgan & Claypool Publishers, 2009.
- ↑ Karypis, George, and Vipin Kumar. "Multilevel algorithms for multi-constraint graph partitioning." SC'98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. IEEE, 1998.
- ↑ Yu, Jiahui, et al. "Generative image inpainting with contextual attention." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
- ↑ Sun, Ke, Zhouchen Lin, and Zhanxing Zhu. "Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes." Proceedings of the AAAI conference on artificial intelligence. Vol. 34. No. 04. 2020.
- ↑ Zügner, Daniel, Amir Akbarnejad, and Stephan Günnemann. "Adversarial attacks on neural networks for graph data." Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2018.
- ↑ Wang, Xiaoyun, Xuanqing Liu, and Cho-Jui Hsieh. "Graphdefense: Towards robust graph convolutional networks." arXiv preprint arXiv:1911.04429 (2019).
- ↑ Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013).
- ↑ Ying, Rex, et al. "Graph convolutional neural networks for web-scale recommender systems." Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2018.
- ↑ Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
- ↑ Bai, Yunsheng, et al. "Unsupervised inductive whole-graph embedding by preserving graph proximity." Proceedings of the seventh international conference on learning representations (ICLR 2019). 2019.
- ↑ Navarin, Nicolò, Dinh V. Tran, and Alessandro Sperduti. "Pre-training graph neural networks with kernels." arXiv preprint arXiv:1811.06930 (2018).
- ↑ Sterling, Teague, and John J. Irwin. "ZINC 15–ligand discovery for everyone." Journal of chemical information and modeling 55.11 (2015): 2324-2337.
- ↑ Gaulton, Anna, et al. "ChEMBL: a large-scale bioactivity database for drug discovery." Nucleic acids research 40.D1 (2012): D1100-D1107.
- ↑ Morris, Christopher, et al. "Tudataset: A collection of benchmark datasets for learning with graphs." arXiv preprint arXiv:2007.08663 (2020).
- ↑ Xu, Keyulu, et al. "How powerful are graph neural networks?." arXiv preprint arXiv:1810.00826 (2018).
- ↑ Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical science 9.2 (2018): 513-530.
- ↑ Zitnik, Marinka, et al. "Evolution of resilience in protein interactomes across the tree of life." Proceedings of the National Academy of Sciences 116.10 (2019): 4426-4433.
- ↑ Veličković, Petar, et al. "Graph attention networks." arXiv preprint arXiv:1710.10903 (2017).
- ↑ Hamilton, Will, Zhitao Ying, and Jure Leskovec. "Inductive representation learning on large graphs." Advances in neural information processing systems 30 (2017).