Course:CPSC522/Pretraining Methods for Graph Neural Networks
Pre-Training Methods for Graph Neural Networks
Understanding fundamental ideas for pre-training graph neural networks from the following two papers,
- Paper 1 for Node Classification: When Does Self-Supervision Help Graph Convolutional Networks? (ICML 2020)
- Paper 2 for Graph Classification: Strategies for Pre-Training Graph Neural Networks (ICLR 2020)
Principal Author: Nikhil Shenoy
Collaborators:
Abstract
One of the common challenges in Machine Learning (ML) is for models to improve performance on unseen and out-of-distribution (OOD) data. In subfields of ML like Computer Vision (CV) and Natural Language Processing (NLP), pre-training, which refers to training a model on a pretext task, has been the go-to approach for improving generalization performance. However, in the case of graph datasets, effectively applying pre-training methods to Graph Neural Networks (GNNs) remains an active research problem. In this study, we try to understand pre-training methods that improve upon important graph-based machine learning tasks like node classification and graph classification. Our first paper uses self-supervision methods to improve generalization performance on node classification tasks and performs a thorough analysis of these methods. Our second paper, focused on the graph classification task, introduces a novel pre-training method and evaluates its effectiveness in comparison to existing approaches.
Builds On
This wiki page builds upon the knowledge of Graph Neural Networks and Graph Convolutional Networks as discussed in the foundation page.
Background
Many machine learning tasks can be defined over graphs, but we focus on the following two in this study,
- Node Classification: Classifying individual nodes
- Graph Classification: Classifying entire graphs
The terms self-supervised learning, pre-training and transfer learning are used throughout this article, so it is important to understand what each of them means.
Self-Supervised Learning
Self-supervised learning refers to a form of machine learning where models are trained without human-annotated labels, on pretext tasks whose labels are derived from the data itself, such as predicting missing parts of an image or the angle of a rotated image. In the context of graphs, semi-supervised graph-based learning assumes that nodes connected by edges of larger weight are likely to be similar and should therefore share a similar label distribution [1]. Only one prior work [2] performed self-supervision on graphs using deep learning before Paper 1 was published.
Pre-Training
Pre-training refers to the strategy of training a model on one task and then fine-tuning it on another. There are two challenges (Pan & Yang, 2009[3]; Hendrycks et al., 2019[4]) that pre-training aims to address in the context of graph-based deep learning. First, situations where graph labels are scarce or where obtaining labels is cost and resource intensive[5]. Second, out-of-distribution settings where the graphs in the test set are structurally different from the graphs in the training set.
Transfer Learning
Transfer learning refers to the strategy of using a pre-trained model as a starting point for a task, rather than training from scratch. This has been an attractive option in computer vision where performing transfer learning has resulted in improved generalization performance while reducing data and time costs associated with training. Studies [6][7][8] have shown that successful transfer learning in the case of graphs not only depends on the size of the pre-training dataset, but also depends on structural differences between the graphs in the pre-training dataset and the downstream task dataset.
Pre-training Methods for Node-Classification
In this section, we discuss the analysis done in the first paper (When Does Self-Supervision Help Graph Convolutional Networks?). The primary focus of this paper is to understand the following question,
For the node-classification task, can self-supervised training improve the generalization and robustness capacity of the GCNs?
Supervised Learning on Graph Convolutional Networks (GCNs)
As discussed in the foundation page, we can revisit the supervised learning task in the context of GCNs. Given a graph dataset $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with feature matrix $X$, adjacency matrix $A$, and label matrix $Y \in \mathbb{R}^{N \times d_y}$ ($d_y$ is the label dimension), where labels exist for the nodes in $\mathcal{V}_{\text{label}} \subseteq \mathcal{V}$, the model parameters in GCNs are learned by minimizing the supervised loss calculated between the output of the network and the true labels for labeled nodes,

$$\min_{\theta, \Theta} \; \mathcal{L}_{\text{sup}}(\theta, \Theta) = \frac{1}{|\mathcal{V}_{\text{label}}|} \sum_{v_n \in \mathcal{V}_{\text{label}}} L\big(\boldsymbol{z}_n, \boldsymbol{y}_n\big), \qquad Z = f_\theta(X, A)\,\Theta,$$

where $f_\theta$ is the feature extractor, $\Theta$ is the linear layer, $L(\cdot,\cdot)$ is the loss function for each example, $\boldsymbol{z}_n$ is the prediction for node $v_n$, and $\boldsymbol{y}_n$ is its true label vector.
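To make the setup concrete, here is a minimal PyTorch sketch (not from the paper) of supervised node classification with a GCN-style encoder and a linear head, where the loss is computed only over the labeled nodes. The toy random graph, the one-layer encoder and all names are illustrative assumptions.

```python
# Minimal sketch: supervised node classification with a GCN-style feature
# extractor f_theta and a linear head Theta; loss only on labeled nodes.
import torch
import torch.nn.functional as F

def normalize_adj(A):
    # \hat{A} = D^{-1/2} (A + I) D^{-1/2}, the usual GCN propagation matrix
    A_hat = A + torch.eye(A.size(0))
    d = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

class GCNEncoder(torch.nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.W0 = torch.nn.Linear(in_dim, hid_dim, bias=False)

    def forward(self, X, A_hat):
        return F.relu(A_hat @ self.W0(X))      # node embeddings f_theta(X, A)

# toy data: N nodes, only the first few are labeled
N, in_dim, hid_dim, n_classes = 6, 8, 16, 3
X, A = torch.randn(N, in_dim), (torch.rand(N, N) > 0.7).float()
A = ((A + A.t()) > 0).float()                   # make adjacency symmetric
A_hat = normalize_adj(A)
y = torch.randint(0, n_classes, (N,))
labeled = torch.tensor([0, 1, 2])               # V_label

encoder, head = GCNEncoder(in_dim, hid_dim), torch.nn.Linear(hid_dim, n_classes)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-2)

for _ in range(10):
    Z = head(encoder(X, A_hat))
    loss = F.cross_entropy(Z[labeled], y[labeled])  # L_sup over labeled nodes only
    opt.zero_grad(); loss.backward(); opt.step()
```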
Possible Schemes to Introduce Self-Supervision for GCNs
We go through three possible schemes to equip a GCN with a node-level self-supervised (ss) task,
Pre-training and Fine-tuning
Here, we are interested in designing a self-supervised learning task that can be performed before the supervised learning task. In the pre-training process, the feature extractor network $f_\theta$ is pre-trained with the self-supervised task as follows,

$$\min_{\theta, \Theta_{ss}} \; \mathcal{L}_{ss}(\theta, \Theta_{ss}) = \frac{1}{|\mathcal{V}|} \sum_{v_n \in \mathcal{V}} L_{ss}\big(\boldsymbol{z}^{ss}_n, \boldsymbol{y}^{ss}_n\big),$$

where $\Theta_{ss}$ is the linear transformation parameter, $L_{ss}(\cdot,\cdot)$ is the loss function of the self-supervised task, and $\boldsymbol{z}^{ss}_n$ and $\boldsymbol{y}^{ss}_n$ are the prediction and true label for the self-supervised task. The feature extractor $f_\theta$ can then be used for the downstream task. The specific tasks are described in the next section.
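A minimal sketch of the two-stage scheme follows, assuming self-supervised labels (e.g., cluster or partition indices) are already available; a plain linear layer stands in for the GCN feature extractor $f_\theta$, and all names and sizes are hypothetical.

```python
# Minimal sketch of the pre-training & fine-tuning scheme.
import torch
import torch.nn.functional as F

N, in_dim, hid_dim, n_ss_classes, n_classes = 100, 16, 32, 10, 3
X = torch.randn(N, in_dim)
y_ss = torch.randint(0, n_ss_classes, (N,))   # self-supervised labels (e.g. cluster ids)
y = torch.randint(0, n_classes, (N,))
labeled = torch.arange(20)                    # small labeled subset

encoder = torch.nn.Linear(in_dim, hid_dim)    # f_theta (placeholder for a GCN)

# Step 1: pre-train encoder + SS head Theta_ss on the self-supervised task
ss_head = torch.nn.Linear(hid_dim, n_ss_classes)
opt = torch.optim.Adam(list(encoder.parameters()) + list(ss_head.parameters()), lr=1e-2)
for _ in range(50):
    loss = F.cross_entropy(ss_head(encoder(X)), y_ss)        # L_ss over all nodes
    opt.zero_grad(); loss.backward(); opt.step()

# Step 2: fine-tune encoder + a fresh head Theta on the target task
head = torch.nn.Linear(hid_dim, n_classes)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-2)
for _ in range(50):
    loss = F.cross_entropy(head(encoder(X))[labeled], y[labeled])  # L_sup
    opt.zero_grad(); loss.backward(); opt.step()
```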
Self-training
Self-training is an iterative process introduced by [9] in the context of GCNs. Supervised learning is first performed on the labeled samples. Using the trained model, highly confident unlabeled samples are identified and assigned "pseudo" labels from the model's predictions. These samples are then included in the next round of training. The process is repeated for several rounds and can be formulated similarly to the supervised learning equation of GCNs.
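A minimal sketch of the pseudo-labeling loop under the same placeholder assumptions (a linear model stands in for the GCN plus head, and the 0.9 confidence threshold is an arbitrary illustrative choice):

```python
# Minimal sketch of self-training: train on labeled nodes, then add the most
# confident unlabeled predictions as pseudo-labels for the next round.
import torch
import torch.nn.functional as F

N, in_dim, n_classes, rounds = 200, 16, 4, 3
X = torch.randn(N, in_dim)
y = torch.randint(0, n_classes, (N,))
labeled = set(range(10))                       # initially labeled nodes
train_y = y.clone()                            # only the labeled subset is trusted initially

model = torch.nn.Linear(in_dim, n_classes)     # placeholder for GCN + linear head
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(rounds):
    idx = torch.tensor(sorted(labeled))
    for _ in range(100):                       # supervised training on current label set
        loss = F.cross_entropy(model(X)[idx], train_y[idx])
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                      # assign pseudo-labels to confident nodes
        probs = F.softmax(model(X), dim=1)
        conf, pred = probs.max(dim=1)
    for i in torch.nonzero(conf > 0.9).flatten().tolist():
        if i not in labeled:
            labeled.add(i)
            train_y[i] = pred[i]               # "pseudo" label from the model's prediction
```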
Multi-task learning
Considering a target task and a self-supervised task for a GCN, the output and the training process can be formulated as,

$$\min_{\theta, \Theta, \Theta_{ss}} \; \alpha_1 \mathcal{L}_{\text{sup}}(\theta, \Theta) + \alpha_2 \mathcal{L}_{ss}(\theta, \Theta_{ss}),$$

where $\alpha_1, \alpha_2 > 0$ weight the target (supervised) loss and the self-supervised loss, and the feature extractor $f_\theta$ is shared between the two tasks.
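A minimal sketch of the joint objective, again with a linear placeholder for the shared feature extractor and hypothetical loss weights:

```python
# Minimal sketch of multi-task learning: the target loss and the self-supervised
# loss share the encoder and are optimized jointly with weights alpha1, alpha2.
import torch
import torch.nn.functional as F

N, in_dim, hid_dim, n_classes, n_ss_classes = 100, 16, 32, 3, 10
X = torch.randn(N, in_dim)
y = torch.randint(0, n_classes, (N,)); labeled = torch.arange(20)
y_ss = torch.randint(0, n_ss_classes, (N,))       # e.g. partition indices

encoder = torch.nn.Linear(in_dim, hid_dim)        # shared f_theta (placeholder for a GCN)
head = torch.nn.Linear(hid_dim, n_classes)        # Theta, target task
ss_head = torch.nn.Linear(hid_dim, n_ss_classes)  # Theta_ss, self-supervised task
alpha1, alpha2 = 1.0, 0.5

params = list(encoder.parameters()) + list(head.parameters()) + list(ss_head.parameters())
opt = torch.optim.Adam(params, lr=1e-2)
for _ in range(100):
    H = encoder(X)
    loss = (alpha1 * F.cross_entropy(head(H)[labeled], y[labeled])   # L_sup
            + alpha2 * F.cross_entropy(ss_head(H), y_ss))            # L_ss
    opt.zero_grad(); loss.backward(); opt.step()
```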
Self-Supervision Methods specific to GCNs
While the previous section describes various schemes to introduce self-supervision, in this section we explore self-supervised pretext tasks specific to GCNs. In later sections, we show that using these pretext tasks for self-supervision benefits various supervised/downstream tasks. All three pretext tasks are shown visually in Figure 1.
Node Clustering
Given a node set $\mathcal{V}$ with the feature matrix $X$ as input, and a preset number of clusters $K$ as a hyperparameter, the clustering algorithm outputs a set of node sets $\{\mathcal{V}_k\}_{k=1}^{K}$ such that the node sets are pairwise disjoint and all node sets are non-empty. Once every node is assigned its cluster index as a pseudo-label, we can use a standard supervised learning setup to predict the node's cluster label.
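A minimal sketch of generating such pseudo-labels; k-means is used here purely for illustration and may differ from the clustering algorithm used in the paper.

```python
# Minimal sketch: node-clustering pseudo-labels from k-means on node features.
import numpy as np
from sklearn.cluster import KMeans

N, in_dim, K = 500, 16, 10          # K = preset number of clusters (hyperparameter)
X = np.random.randn(N, in_dim)      # node feature matrix

kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
ss_labels = kmeans.labels_          # cluster index of each node = self-supervised label
# These pseudo-labels are then predicted with a classification head on top of the
# GCN encoder, exactly as in the supervised setup sketched earlier.
```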
Graph Partitioning
This is a topology-based self-supervision method, motivated by the observation that two nodes connected by a "strong" edge (larger weight) are more likely to have the same label class [10]. Similar to node clustering, this method partitions the graph into roughly equal subsets such that the number of edges connecting nodes across subsets is minimized [11]. Given the node set $\mathcal{V}$, edge set $\mathcal{E}$ and the adjacency matrix $A$ as input, with a preset number of partitions $K$ as a hyperparameter, the algorithm outputs a set of node sets $\{\mathcal{V}_k\}_{k=1}^{K}$ such that the node sets are pairwise disjoint and all node sets are non-empty. With the node set partitioned, we assign partition indices as self-supervised labels.
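A minimal sketch of generating graph-partitioning pseudo-labels. This assumes the third-party `pymetis` bindings to the METIS partitioner are installed; the toy adjacency list and variable names are illustrative, and any balanced min-cut partitioner would play the same role.

```python
# Minimal sketch: balanced min-cut partition indices as self-supervised labels.
import pymetis

# toy graph as an adjacency list (node i -> list of neighbours)
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2, 4], [3, 5], [4]]
K = 2                                        # preset number of partitions

_, membership = pymetis.part_graph(K, adjacency=adjacency)
ss_labels = membership                       # partition index of each node = SS label
print(ss_labels)                             # e.g. [0, 0, 0, 1, 1, 1]
```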
Graph Completion
Motivated by image inpainting [12] in computer vision, the paper proposes graph completion, a regression task for self-supervised pre-training. The algorithm masks target nodes by removing their features and then aims to recover the masked node features by feeding the GCN the rest of the (unmasked) graph. The rationale behind choosing such a task is two-fold: 1) completion labels are free to obtain, and 2) graph completion can help improve the representation by teaching the network to extract features from the context.
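A minimal sketch of graph completion as a regression task; the toy graph, the mask ratio, and the simple one-layer propagation are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch of graph completion: mask the features of some target nodes,
# propagate over the rest of the graph, and regress the original features back.
import torch
import torch.nn.functional as F

N, in_dim = 50, 16
X = torch.randn(N, in_dim)
A = (torch.rand(N, N) > 0.8).float(); A = ((A + A.t()) > 0).float()
A_hat = A + torch.eye(N); d = A_hat.sum(1)
A_hat = torch.diag(d.pow(-0.5)) @ A_hat @ torch.diag(d.pow(-0.5))

mask = torch.zeros(N, dtype=torch.bool)
mask[torch.randperm(N)[:8]] = True          # randomly chosen target nodes
X_masked = X.clone(); X_masked[mask] = 0.0  # remove their features

W0 = torch.nn.Linear(in_dim, 32)
W1 = torch.nn.Linear(32, in_dim)            # regression head: reconstruct features
opt = torch.optim.Adam(list(W0.parameters()) + list(W1.parameters()), lr=1e-2)

for _ in range(100):
    H = F.relu(A_hat @ W0(X_masked))        # encode the partially-masked graph
    X_rec = W1(A_hat @ H)
    # completion labels are "free": they are the original node features themselves
    loss = F.mse_loss(X_rec[mask], X[mask])
    opt.zero_grad(); loss.backward(); opt.step()
```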
All three methods are shown graphically in Figure 1. A summary of the methods is as follows,

| Task | Relies On | Primary Assumption | Type |
|---|---|---|---|
| Clustering | Nodes | Feature similarity | Classification |
| Partitioning | Edges | Connection density | Classification |
| Completion | Nodes & edges | Context-based representation | Regression |
Dataset Statistics
The paper uses three citation-network datasets, Cora, Citeseer and PubMed, for all the experiments.
Discussion: Self-Supervision helps generalizability (Table 2)
The three schemes for incorporating self-supervision into GCN training are examined in Table 2 and summarized in the points below,
- Pre-training and fine-tuning provides some performance improvement for a small dataset like Cora but does not do so for the larger datasets Citeseer and PubMed. This holds across different choices of self-supervision methods. Information learned through self-supervision in the pre-training stage may be lost during fine-tuning.
- Self-training and multi-task learning both improve performance when self-supervision is included compared to when it is not. In contrast to pre-training and fine-tuning, there is no switch in objective functions.
- The multi-task learning setup is more general than self-training (which relies on pseudo labels), as it allows assigning self-supervised labels in different ways based on graph structure and node features, without requiring labeled data.
Discussion: Multi-Task Self-Supervision on SOTAs (Table 3)
- Graph partitioning is generally beneficial to all SOTA network architectures on all datasets, whereas node clustering does not benefit SOTA network architectures on PubMed.
- Feature-based node clustering assumes that feature similarity implies target-label similarity and can group distant nodes with similar features together. When the dataset is large and the feature dimension is relatively low, feature-based clustering could be challenged in providing informative pseudo-labels.
- Topology-based graph partitioning assumes that connections in the topology imply similarity in labels, which is a safe assumption for the three datasets since they are all citation networks. Therefore, the prior represented by graph partitioning is general and effective in benefiting GCNs.
- Topology- and feature-based graph completion assumes feature similarity or smoothness in small neighbourhoods of graphs. Such a context-based feature representation can greatly improve target performance, especially when the neighbourhoods are small. However, the regression task can be challenged by denser graphs with larger neighbourhoods, where the completion task becomes more difficult.
Self-Supervision in Graph Adversarial Defense
In this section we try to understand the role of self-supervision in gaining robustness against various graph adversarial attacks.
Adversarial Attacks
The focus is on single-node direct evasion attacks: a node-specific attack on the attributes/links of the target node under certain constraints [14], while the trained model remains unchanged during/after the attack. The attacker generates a perturbed feature matrix $X'$ and adjacency matrix $A'$ as

$$(X', A') = \text{attack}\big(v_n, X, A, \theta^*, \Theta^*\big),$$

with (the attributes, links and label of) the target node $v_n$ and the trained model parameters $(\theta^*, \Theta^*)$ as inputs.
Adversarial Defense
In the graph domain, it is difficult to generate adversarial examples because of the low labeling rates in the transductive semi-supervised setting. Wang et al. [15] propose using unlabeled nodes to generate adversarial samples. Specifically, self-training is used to assign pseudo labels $\hat{\boldsymbol{y}}_n$ to unlabeled nodes. Then, two disjoint subsets $\mathcal{V}_1$ and $\mathcal{V}_2$ are randomly chosen from the unlabeled node set, and each target node in them is attacked to generate the perturbed feature and adjacency matrices $X'$ and $A'$. Adversarial training can then be formulated as supervised learning for labeled nodes and recovering pseudo labels for unlabeled nodes,

$$\min_{\theta, \Theta} \; \mathcal{L}_{\text{adv}}(\theta, \Theta) = \frac{1}{|\mathcal{V}_{\text{label}}|} \sum_{v_n \in \mathcal{V}_{\text{label}}} L\big(\boldsymbol{z}'_n, \boldsymbol{y}_n\big) + \frac{1}{|\mathcal{V}_1 \cup \mathcal{V}_2|} \sum_{v_n \in \mathcal{V}_1 \cup \mathcal{V}_2} L\big(\boldsymbol{z}'_n, \hat{\boldsymbol{y}}_n\big),$$

where $\boldsymbol{z}'_n$ is the network output for node $v_n$ computed on the perturbed inputs $(X', A')$.
Adversarial Defense with Self-Supervision
With self-supervision working in GCNs and adversarial training introduced in the section above, we can formulate adversarial training with self-supervision,

$$\min_{\theta, \Theta, \Theta_{ss}} \; \alpha_1 \mathcal{L}_{\text{adv}}(\theta, \Theta) + \alpha_2 \mathcal{L}_{ss}(\theta, \Theta_{ss}),$$

where the adversarial loss $\mathcal{L}_{\text{adv}}$ (computed on the perturbed inputs with true and pseudo labels) is combined with the self-supervised loss $\mathcal{L}_{ss}$ through weights $\alpha_1, \alpha_2 > 0$.
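A minimal sketch of this combined objective follows; the perturbed features `X_adv`, the attacked node subsets, and the pseudo-labels are assumed to have been generated already (attack generation and the perturbed adjacency matrix are out of scope here), and all names and weights are hypothetical.

```python
# Minimal sketch: adversarial training loss on perturbed inputs, combined with a
# weighted self-supervised loss on the clean inputs.
import torch
import torch.nn.functional as F

N, in_dim, hid, n_classes, n_ss = 100, 16, 32, 3, 10
X, X_adv = torch.randn(N, in_dim), torch.randn(N, in_dim)   # clean / perturbed features
y = torch.randint(0, n_classes, (N,)); labeled = torch.arange(20)
pseudo_y = torch.randint(0, n_classes, (N,))                # pseudo-labels from self-training
attacked_unlabeled = torch.arange(20, 60)                   # the chosen unlabeled subsets
y_ss = torch.randint(0, n_ss, (N,))                         # self-supervised labels

encoder = torch.nn.Linear(in_dim, hid)                      # placeholder for the GCN f_theta
head = torch.nn.Linear(hid, n_classes)
ss_head = torch.nn.Linear(hid, n_ss)
alpha1, alpha2 = 1.0, 0.5

params = list(encoder.parameters()) + list(head.parameters()) + list(ss_head.parameters())
opt = torch.optim.Adam(params, lr=1e-2)
for _ in range(100):
    Z_adv = head(encoder(X_adv))                            # predictions on perturbed input
    l_adv = (F.cross_entropy(Z_adv[labeled], y[labeled]) +  # labeled nodes: true labels
             F.cross_entropy(Z_adv[attacked_unlabeled], pseudo_y[attacked_unlabeled]))
    l_ss = F.cross_entropy(ss_head(encoder(X)), y_ss)       # self-supervised task on clean input
    loss = alpha1 * l_adv + alpha2 * l_ss
    opt.zero_grad(); loss.backward(); opt.step()
```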
Discussion: Does Self-Supervision help with Adversarial Defense?
Based on Table 4 and Table 5, introducing self-supervision into adversarial training improves the adversarial defense of GCNs,
- Node clustering and graph partitioning are more effective against feature attacks and link attacks.
- Graph completion boosts the adversarial accuracy by around 4.5% against link attacks and by over 8.0% against the combined link & feature attacks on Cora.
Pre-training Methods for Graph-Classification
In this section, we introduce the method and analysis of our second paper (Strategies for Pre-Training Graph Neural Networks) to improve upon graph classification tasks. This paper has two key contributions,
- Conduct the first systematic large-scale investigation of strategies for pre-training GNNs
- Develop an effective pre-training strategy for GNNs and demonstrate its effectiveness and its ability for out-of-distribution generalization on hard transfer-learning problems
Before we delve into the method introduced in this paper, let's go over the baseline pre-training strategies,
Node-Level Pre-Training
For node-level pre-training, the paper considers two self-supervision methods: context prediction and attribute masking. Both are shown visually in Figure 2 below.
Method 1: Context Prediction
In context prediction, the goal is to pre-train a GNN such that nodes appearing in similar structural contexts have similar embeddings. The following three steps are required to perform context prediction based node-level pre-training,
- Neighbourhood and Context Graphs. For every node $v$, the $K$-hop neighbourhood of $v$ contains all nodes and edges that are at most $K$ hops away from $v$ in the graph. The context graph of node $v$ is the subgraph between $r_1$ hops and $r_2$ hops away from $v$ (i.e., it is a ring of width $r_2 - r_1$). We require $r_1 < K$ so that some nodes are shared between the neighbourhood and the context graph; these shared nodes are referred to as context anchor nodes.
- Encoding the context into a fixed vector using an auxiliary GNN. To this end, an auxiliary GNN, referred to as the context GNN, is used to obtain node embeddings in the context graph. We then average the embeddings of the context anchor nodes to obtain a fixed-length context embedding. For node $v$ in graph $G$, we denote its corresponding context embedding as $\boldsymbol{c}_v^{G}$.
- Learning via Negative Sampling. Negative sampling [16][17] is used to jointly learn the main GNN and the context GNN. The main GNN encodes neighbourhoods to obtain node embeddings $\boldsymbol{h}_v^{(K)}$. The context GNN encodes context graphs to obtain context embeddings $\boldsymbol{c}_v^{G}$. In particular, the learning objective of context prediction is a binary classification of whether a particular neighbourhood and a particular context graph belong to the same node:

$$\sigma\big(\boldsymbol{h}_v^{(K)\top} \boldsymbol{c}_{v'}^{G'}\big) \approx \mathbb{1}\{v \text{ and } v' \text{ are the same node}\},$$

where $\sigma(\cdot)$ is the sigmoid function. We either let $v' = v$ and $G' = G$ (i.e., a positive neighbourhood-context pair), or we randomly sample $v'$ from a randomly chosen graph $G'$ (i.e., a negative neighbourhood-context pair). We use a negative sampling ratio of 1 (one negative pair per positive pair), and use the negative log likelihood as the loss function.
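A minimal sketch of this objective is given below; the neighbourhood and context embeddings are random placeholders standing in for the outputs of the main GNN and the context GNN, and negatives are formed by shuffling contexts within the batch, which is one simple way (an assumption, not necessarily the paper's) to pair a neighbourhood with a context from a different node.

```python
# Minimal sketch of the context-prediction objective: binary classification of
# whether a neighbourhood embedding h_v and a context embedding c_v' come from
# the same node, with one negative pair per positive pair.
import torch
import torch.nn.functional as F

B, dim = 32, 64
h_v = torch.randn(B, dim, requires_grad=True)       # K-hop neighbourhood embeddings
c_pos = torch.randn(B, dim, requires_grad=True)     # context embeddings of the same nodes
c_neg = c_pos[torch.randperm(B)]                    # negative sampling: shuffled contexts

def pair_logit(h, c):
    return (h * c).sum(dim=1)                       # sigma(h^T c) gives the pair probability

pos_logit, neg_logit = pair_logit(h_v, c_pos), pair_logit(h_v, c_neg)
loss = (F.binary_cross_entropy_with_logits(pos_logit, torch.ones(B)) +
        F.binary_cross_entropy_with_logits(neg_logit, torch.zeros(B)))
loss.backward()   # gradients flow back into both the main GNN and the context GNN
```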
Method 2: Attribute Masking
This is a simple self-supervision strategy where node/edge attributes are masked and the GNN is trained to predict those attributes [18] based on the neighbouring structure. Specifically, we randomly mask input node/edge attributes, for example atom types in molecular graphs, by replacing them with special masked indicators. We then apply GNNs to obtain the corresponding node/edge embeddings (edge embeddings are obtained as the sum of the node embeddings of the edge's end nodes). Finally, a linear model is applied on top of the embeddings to predict a masked node/edge attribute. In contrast to masking in language models [18], we operate on non-fully-connected graphs and aim to capture the regularities of node/edge attributes distributed over different graph structures. Furthermore, we allow masking edge attributes, going beyond masking node attributes.
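A minimal sketch of attribute masking on node attributes; the embedding lookup stands in for the GNN encoder (message passing is omitted), and the mask ratio and names are illustrative assumptions.

```python
# Minimal sketch of attribute masking: replace some node attributes (e.g. atom
# types) with a special MASK index, embed, and predict the original attribute
# at the masked positions with a linear head.
import torch
import torch.nn.functional as F

N, n_atom_types, hid = 40, 10, 32
MASK = n_atom_types                              # special "masked" indicator index
atom_type = torch.randint(0, n_atom_types, (N,)) # original node attributes

mask = torch.zeros(N, dtype=torch.bool)
mask[torch.randperm(N)[:6]] = True
inp = atom_type.clone(); inp[mask] = MASK        # masked input attributes

embed = torch.nn.Embedding(n_atom_types + 1, hid)   # placeholder for the GNN encoder
head = torch.nn.Linear(hid, n_atom_types)           # predicts the masked attribute
opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-2)

for _ in range(100):
    H = embed(inp)                               # in practice: GNN message passing here
    logits = head(H[mask])
    loss = F.cross_entropy(logits, atom_type[mask])
    opt.zero_grad(); loss.backward(); opt.step()
```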
Graph-Level Pre-Training
Graph level pre-training of GNNs can lead to useful graph embeddings. We go over two ways of performing graph level pre-training,
Method 1: Supervised Graph-Level Property Prediction
Graph-level embeddings can be injected with domain-specific knowledge by defining supervised graph-level prediction tasks. Specifically, we consider a practical method to pre-train graph representations: graph-level multi-task supervised pre-training, which jointly predicts a diverse set of supervised labels for individual graphs. For example, in molecular property prediction, we can pre-train GNNs to predict essentially all the properties of molecules that have been experimentally measured so far.
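A minimal sketch of the multi-task supervised objective; the graph embeddings are random placeholders standing in for pooled GNN outputs, and the number of tasks and all names are illustrative assumptions.

```python
# Minimal sketch of graph-level multi-task supervised pre-training: pool node
# embeddings into a graph embedding and jointly predict many binary properties
# with a multi-label BCE loss.
import torch
import torch.nn.functional as F

n_graphs, emb_dim, n_tasks = 64, 128, 100        # e.g. many bioassay-style tasks
graph_emb = torch.randn(n_graphs, emb_dim, requires_grad=True)  # pooled GNN output
labels = torch.randint(0, 2, (n_graphs, n_tasks)).float()       # per-task binary labels

head = torch.nn.Linear(emb_dim, n_tasks)         # one output per supervised task
logits = head(graph_emb)
loss = F.binary_cross_entropy_with_logits(logits, labels)
loss.backward()                                  # gradients reach the shared GNN encoder
```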
Method 2: Structural Similarity Prediction
A different approach would be to define a graph-level predictive task, where the goal would be to model the structural similarity of two graphs. Examples of such tasks include modeling the graph edit distance (Bai et al., 2019[19]) or predicting graph structure similarity (Navarin et al., 2018[20]). However, finding the ground truth graph distance values is a difficult problem, and in large datasets there is a quadratic number of graph pairs to consider.
Issues with using only Node-Level and Graph-Level Pre-training Strategy
As the experiments below show (Observations 2 and 3), pre-training only at the level of individual nodes or only at the level of entire graphs gives limited improvement and can even lead to negative transfer on downstream tasks. The two levels are complementary: node-level pre-training alone does not guarantee useful graph-level embeddings, while graph-level supervised pre-training alone does not ensure that individual node embeddings capture meaningful local information.
Novel Pre-Training Strategy
Altogether, our pre-training strategy is to first perform node-level self-supervised pre-training and then graph-level multi-task supervised pre-training. When the GNN pre-training is finished, we fine-tune the pre-trained GNN model on downstream tasks.
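The following outline summarizes this pipeline; the function names, arguments and dataset variables are hypothetical placeholders for the routines sketched in the earlier snippets, not an actual API.

```python
# High-level outline of the combined strategy (illustrative only).
def pretrain_node_level(gnn, unlabeled_graphs):
    """Node-level self-supervised pre-training: context prediction and/or attribute masking."""
    ...

def pretrain_graph_level(gnn, labeled_graphs):
    """Graph-level multi-task supervised pre-training on many graph labels."""
    ...

def finetune(gnn, downstream_dataset):
    """Attach a fresh prediction head and fine-tune end-to-end on the downstream task."""
    ...

# Strategy: node-level self-supervised -> graph-level supervised -> fine-tune, e.g.
# pretrain_node_level(gnn, unlabeled_molecules)
# pretrain_graph_level(gnn, labeled_molecules)
# finetune(gnn, downstream_molecule_dataset)
```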
Experiments
Datasets
Pre-training datasets
- For the chemistry domain, we use the ZINC15 database[21] for node-level self-supervised pre-training, and a preprocessed ChEMBL dataset[22] for graph-level multi-task supervised pre-training.
- For the biology domain, we use 395K unlabeled protein ego-networks derived from the PPI networks of 50 species for node-level self-supervised pre-training, and 88K labeled protein ego-networks for graph-level supervised pre-training, jointly predicting 5000 coarse-grained biological functions.
Downstream classification datasets.
- For the chemistry domain, graph benchmarks like the MUTAG[23] and PTC[24] molecule datasets, and MoleculeNet[25].
- For the biology domain, PPI networks from Zitnik et al. 2019[26]
GNN Architectures
Graph Isomorphism Networks (GINs)[24] are the most expressive GNN architecture with respect to the Weisfeiler-Lehman test[24] and a SOTA architecture for graph-level prediction tasks. Although less expressive, other architectures like GCN, GAT[27] and GraphSAGE[28] are also experimented with (refer to Table 7).
Discussion
- Observation (1): Table 7 shows that the most expressive GNN architecture (GIN), when pre-trained, achieves the best performance across domains and datasets. Compared with gains of pre-training achieved by GIN architecture, gains of pre-training using less expressive GNNs (GCN, GraphSAGE, and GAT) are smaller and can sometimes even be negative (Table 7).
- Observation (2): As seen from the shaded cells of Table 6, the strong baseline strategy that performs extensive graph-level multi-task supervised pre-training of GNNs gives surprisingly limited performance gain and yields negative transfer on many downstream tasks (2 out of 8 datasets in molecular prediction, and 13 out of 40 tasks in protein function prediction).
- Observation (3): From the upper half of Table 6 and the left panel of Figure 3, we see that another baseline strategy, which only performs node-level self-supervised pre-training, also gives limited performance improvement and is comparable to the graph-level multi-task supervised pre-training baseline.
- Observation (4): From the lower half of Table 6 and the right panel of Figure 3, we see that our pre-training strategy of combining graph-level multi-task supervised and node-level self-supervised pre-training avoids negative transfer across downstream datasets and achieves best performance.
Conclusion
In the case of node classification, the main takeaways with respect to self-supervision methods would be as follows,
- Among the three schemes to incorporate self-supervision into GCNs, multi-task learning works as a regularizer and consistently benefits GCNs.
- Pre-training & fine-tuning switches the objective function from the self-supervision loss to the target supervision loss, which causes "overwriting" and therefore yields limited performance gains.
- Through multi-task learning, self-supervised tasks provide informative priors that can benefit GCNs' generalizable target performance. Node clustering and graph partitioning provide priors on node features and graph structure respectively, whereas graph completion, with (joint) priors on both, helps GCNs with context-based feature representation.
- Whether a self-supervision task helps a SOTA GCN depends on whether the dataset allows for quality pseudo-labels corresponding to the task and whether self-supervised priors complement existing architecture-posed priors.
- Multi-task self-supervision in adversarial training improves GCN’s robustness against various graph attacks.
In the case of graph classification, the main takeaways with respect to self-supervision methods would be as follows,
- Using both node-level and graph-level pre-training in combination with an expressive GNN is crucial. This ensures that node embeddings capture local neighbourhood semantics which, when pooled together, yield meaningful graph-level representations.
- Expressive GNNs like Graph Isomorphism Network (GIN) see the most improvement when pre-trained.
- The paper makes an important step towards understanding transfer learning on graphs and addresses the issue of negative transfer observed in prior studies.
References
- ↑ Zhu, Xiaojin, and Andrew B. Goldberg. "Introduction to semi-supervised learning." Synthesis lectures on artificial intelligence and machine learning 3.1 (2009): 1-130.
- ↑ Sun, Ke, Zhouchen Lin, and Zhanxing Zhu. "Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes." Proceedings of the AAAI conference on artificial intelligence. Vol. 34. No. 04. 2020.
- ↑ Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on knowledge and data engineering 22.10 (2010): 1345-1359.
- ↑ Hendrycks, Dan, et al. "Using self-supervised learning can improve model robustness and uncertainty." Advances in neural information processing systems 32 (2019).
- ↑ Zitnik, Marinka, Rok Sosič, and Jure Leskovec. "Prioritizing network communities." Nature communications 9.1 (2018): 2544.
- ↑ Xu, Yuting, et al. "Demystifying multitask deep neural networks for quantitative structure–activity relationships." Journal of chemical information and modeling 57.10 (2017): 2490-2504.
- ↑ Ching, Travers, et al. "Opportunities and obstacles for deep learning in biology and medicine." Journal of The Royal Society Interface 15.141 (2018): 20170387.
- ↑ Wang, Jingshu, et al. "Data denoising with transfer learning in single-cell transcriptomics." Nature methods 16.9 (2019): 875-878.
- ↑ Sun, Ke, Zhouchen Lin, and Zhanxing Zhu. "Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes." Proceedings of the AAAI conference on artificial intelligence. Vol. 34. No. 04. 2020.
- ↑ Zhu, Xiaojin, and Andrew Goldberg. Introduction to semi-supervised learning. Morgan & Claypool Publishers, 2009.
- ↑ Karypis, George, and Vipin Kumar. "Multilevel algorithms for multi-constraint graph partitioning." SC'98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing. IEEE, 1998.
- ↑ Yu, Jiahui, et al. "Generative image inpainting with contextual attention." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
- ↑ Sun, Ke, Zhouchen Lin, and Zhanxing Zhu. "Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes." Proceedings of the AAAI conference on artificial intelligence. Vol. 34. No. 04. 2020.
- ↑ Zügner, Daniel, Amir Akbarnejad, and Stephan Günnemann. "Adversarial attacks on neural networks for graph data." Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2018.
- ↑ Wang, Xiaoyun, Xuanqing Liu, and Cho-Jui Hsieh. "Graphdefense: Towards robust graph convolutional networks." arXiv preprint arXiv:1911.04429 (2019).
- ↑ Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013).
- ↑ Ying, Rex, et al. "Graph convolutional neural networks for web-scale recommender systems." Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2018.
- ↑ Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
- ↑ Bai, Yunsheng, et al. "Unsupervised inductive whole-graph embedding by preserving graph proximity." Proceedings of the seventh international conference on learning representations (ICLR 2019). 2019.
- ↑ Navarin, Nicolò, Dinh V. Tran, and Alessandro Sperduti. "Pre-training graph neural networks with kernels." arXiv preprint arXiv:1811.06930 (2018).
- ↑ Sterling, Teague, and John J. Irwin. "ZINC 15–ligand discovery for everyone." Journal of chemical information and modeling 55.11 (2015): 2324-2337.
- ↑ Gaulton, Anna, et al. "ChEMBL: a large-scale bioactivity database for drug discovery." Nucleic acids research 40.D1 (2012): D1100-D1107.
- ↑ Morris, Christopher, et al. "Tudataset: A collection of benchmark datasets for learning with graphs." arXiv preprint arXiv:2007.08663 (2020).
- ↑ Xu, Keyulu, et al. "How powerful are graph neural networks?." arXiv preprint arXiv:1810.00826 (2018).
- ↑ Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical science 9.2 (2018): 513-530.
- ↑ Zitnik, Marinka, et al. "Evolution of resilience in protein interactomes across the tree of life." Proceedings of the National Academy of Sciences 116.10 (2019): 4426-4433.
- ↑ Veličković, Petar, et al. "Graph attention networks." arXiv preprint arXiv:1710.10903 (2017).
- ↑ Hamilton, Will, Zhitao Ying, and Jure Leskovec. "Inductive representation learning on large graphs." Advances in neural information processing systems 30 (2017).