
Graph Neural Networks for Roadgraph Encoding

This article analyzes the effect of varying graph neural network architectures on encoding road graph information.

Principal Author: Matthew Niedoba

Abstract

Modeling the behaviour of other vehicles is a critical component for ensuring safe operation of autonomous cars. One of the most important factors for modeling such behaviour is the geometry of the road on which vehicles are driving. Prior methods have incorporated road information through rendering the scene in a bird's eye view image or using fully connected attention mechanisms. However, both of these approaches fail to utilize the graph structure between lanes in the underlying roadgraph. In this page, we discuss how graph convolutional networks can be used to encode roadgraph information.

Builds on

This page builds on the concepts introduced in Graph Neural Networks. In addition, we make use of the attention mechanism, which is covered in Transformers.

Related Pages

Good examples of raster-based and attention-based roadgraph encoding schemes can be found in Cui et al.[1] and VectorNet[2], respectively.

Introduction

Self-driving cars have the potential to revolutionize transportation by reducing time spent driving and traffic fatalities. One key aspect of this technology is motion forecasting, which, along with perception and motion planning, forms one of the pillars of autonomous vehicle software. The aim of motion forecasting is to predict the future motion of all other agents in the scene around the autonomous vehicle. This includes predicting the future position and heading of all other vehicles, cyclists and pedestrians.

Many factors influence the motion of agents as they navigate the world. For example, the future motion of agents is limited by kinematic constraints parameterized by their past states. Cars have maximum braking and acceleration forces, and a fixed turning radius which limits how they can move. In addition, the motion of agents is influenced by the motions of other agents as they aim to safely navigate the driving environment.

Perhaps most critical amongst the factors which influence agent motion is the geometry of the road. Generally, agents tend to drift minimally as they follow the curvature of their lane, except to change lanes or exit the roadway. Therefore, it is essential to utilize road geometry when generating motion forecasting predictions.

In the motion forecasting literature, there are two main approaches to incorporating road information. One method renders the scene as a bird's-eye-view image which is consumed by a convolutional network[1]. Although these networks can leverage the vast literature on processing image data, the fixed resolution and size of the images pose challenges for capturing road geometry at larger scales. As an alternative, graph-based methods have recently been proposed which produce embeddings by aggregating information from each of the lanes that make up the roadway[2]. However, these approaches generally use self-attention to combine the lane segment information, which fails to exploit the natural adjacency structure of the roadgraph.

Hypothesis

We hypothesize that encoding roadgraph information using an architecture which utilizes the adjacency information of the roadgraph will lead to better quality encodings.

Dataset

An example roadgraph from the Argoverse 1 Motion Forecasting Dataset[3]. Each lane segment is shown in a different colour, with dots representing the component points of each lane centerline.

To test our hypothesis, we analyze graph neural network performance on the Argoverse 1 Motion Forecasting Dataset[3]. The dataset suite was developed by Argo AI for benchmarking performance on a variety of self-driving vehicle tasks. The motion forecasting dataset comprises 324,557 driving scenarios recorded by a self-driving car at a variety of locations across Miami and Pittsburgh, and is split into training, validation and test sets.

Each driving scenario is a five-second recording of driving behaviour. It consists of the position of every vehicle, cyclist and pedestrian detected by the self-driving car over the length of the segment, along with a high-definition map which captures the road geometry of the segment. In each log, there is an "ego" agent which has been manually selected for data quality and interesting behaviour. The goal of the dataset is to forecast the position of the ego agent three seconds into the future, conditioned on the states of all agents over the preceding two seconds and the map information.

The map information in Argoverse 1 is defined as a roadgraph. The nodes of this graph are lane segments: short polylines of up to ten 2D points which represent the approximate center of a driving lane. Each lane segment has associated properties, including boolean variables indicating whether it lies in an intersection and whether it is a turning lane. Lane segment nodes may be connected to other nodes in the graph via predecessor, successor, left-neighbor or right-neighbor connections.

We preprocess the dataset by scaling the (x, y) positions of the lanes to have unit variance and transforming them into a frame centered on the last observed location of the ego agent, rotated such that the last observed ego heading is zero.
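
A minimal sketch of this preprocessing step is shown below. The function name, argument shapes, and the assumption that the scale factor is a standard deviation estimated over the training set are our own illustrative choices, not part of the dataset API.

    import numpy as np

    def preprocess_lane(points, ego_xy, ego_heading, scale):
        """Transform lane centerline points into the normalized ego frame.

        points: (N, 2) array of lane centerline points.
        ego_xy: (2,) last observed ego position.
        ego_heading: last observed ego heading in radians.
        scale: positional standard deviation (assumed estimated on the training set).
        """
        # Translate so the last observed ego position is the origin.
        pts = points - ego_xy
        # Rotate by -ego_heading so the last observed ego heading becomes zero.
        c, s = np.cos(-ego_heading), np.sin(-ego_heading)
        rot = np.array([[c, -s], [s, c]])
        pts = pts @ rot.T
        # Scale positions to unit variance.
        return pts / scale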

Problem Formulation

We wish to measure the encoding ability of several graph neural network architectures. However, "encoding ability" is not clearly defined. The typical tasks associated with graph neural networks are node prediction, graph prediction and edge prediction. For this project, we focus on node prediction and use it as a proxy for the encoding capacity of the graph neural networks studied. We classify each lane segment node i according to the following label:

    y_i = \begin{cases} 1 & \text{if } \min_{p \in P_i} \lVert p - x_T \rVert_2 \le 2\,\text{m} \\ 0 & \text{otherwise,} \end{cases}

where P_i is the set of centerline points of lane segment i and x_T is the final logged position of the ego agent.

Although the 2 m threshold is somewhat arbitrary, we selected it because it is commonly used in the motion forecasting literature, corresponding to the Miss Rate metric which tracks how often the final positions of motion forecasting predictions miss the final logged agent position. Using this classification problem, we hope to capture which graph encoding schemes best encode information relevant to the motion forecasting task.
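As a concrete illustration, the label above could be computed per lane segment as follows (a minimal sketch; the function name and array shapes are our own):

    import numpy as np

    def node_label(centerline, ego_final_xy, threshold=2.0):
        # centerline: (P, 2) lane segment points; ego_final_xy: (2,) final logged position.
        dists = np.linalg.norm(centerline - ego_final_xy, axis=-1)
        # Positive label if any centerline point lies within the 2 m threshold.
        return float(dists.min() <= threshold)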

Methods

Shared Encoder

We use a shared encoder architecture consisting of a 3-layer MLP to convert the per-lane features into a preliminary per-node encoding. The MLP has a hidden dimension of 64 with ReLU nonlinearities.
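A minimal PyTorch sketch of this encoder is shown below. The input feature dimension depends on how the per-lane features are flattened and is an assumption here.

    import torch.nn as nn

    IN_DIM = 22   # e.g. 10 centerline points (x, y) plus two boolean flags; an assumption
    HIDDEN = 64

    # 3-layer MLP shared across all lane segment nodes.
    shared_encoder = nn.Sequential(
        nn.Linear(IN_DIM, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        nn.Linear(HIDDEN, HIDDEN),
    )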

Graph Neural Network

As a baseline, we implement a basic graph neural network. This architecture consists of just a 3-layer MLP which operates independently on each node's feature encoding. We implemented this method so that it has the same number of parameters as both the graph convolutional and graph attention networks. Because this baseline shares no information across nodes, it allows us to isolate the impact of message passing, whether via convolution operations or fully connected attention operations.
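A sketch of the baseline, assuming the same 64-dimensional hidden size as the shared encoder:

    import torch.nn as nn

    # Per-node MLP baseline: applied independently to each node encoding,
    # so no information is exchanged across the roadgraph. Each Linear(64, 64)
    # roughly matches the parameter count of one 64-to-64 graph convolution.
    baseline = nn.Sequential(
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 64),
    )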

Graph Convolutional Network

The first architecture we compare is the graph convolutional network (GCN). These networks update the features of each node using a shared weight matrix W. The features of neighbouring nodes connected by non-zero edges are aggregated, weighted by the degrees d_i and d_j of the target node i and the source node j. Formally, the graph convolution update can be expressed as

    h_i' = \sigma\left( \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{d_i d_j}} W h_j \right),

where \mathcal{N}(i) is the set of neighbours of node i and \sigma is a nonlinearity. Our GCN is constructed from 3 graph convolutions separated by ReLU nonlinearities.
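The following sketch implements such a network using the GCNConv layer from PyTorch Geometric; the choice of library is our own, for illustration.

    import torch
    import torch.nn as nn
    from torch_geometric.nn import GCNConv

    class RoadgraphGCN(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            # 3 graph convolutions, as described above.
            self.convs = nn.ModuleList([GCNConv(dim, dim) for _ in range(3)])

        def forward(self, x, edge_index):
            # x: (num_nodes, dim) node encodings; edge_index: (2, num_edges) adjacency.
            for i, conv in enumerate(self.convs):
                x = conv(x, edge_index)
                if i < len(self.convs) - 1:
                    x = torch.relu(x)  # ReLU between convolutions only
            return x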

Graph Attention Network

A modification to the graph convolutional network is the graph attention network (GAT). The main difference between GCN and GAT is the weighting of the neighbouring nodes. Instead of weighting via edge connections and node degrees, the network learns a weight for each node pairing according to the formula

    \alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\mathrm{LeakyReLU}\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}.

Here, the double bars indicate concatenation. Using these attention weights, the node features are updated according to the formula

    h_i' = \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j \right).

As with the GCN, we use a 3-layer architecture with ReLU nonlinearities between the GAT convolutions.
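A corresponding sketch using PyTorch Geometric's GATConv layer; the single attention head is an assumption.

    import torch
    import torch.nn as nn
    from torch_geometric.nn import GATConv

    class RoadgraphGAT(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            # 3 GAT layers; heads=1 keeps the feature dimension fixed at dim.
            self.convs = nn.ModuleList([GATConv(dim, dim, heads=1) for _ in range(3)])

        def forward(self, x, edge_index):
            for i, conv in enumerate(self.convs):
                x = conv(x, edge_index)
                if i < len(self.convs) - 1:
                    x = torch.relu(x)
            return x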

Fully Connected Attention Network

In VectorNet[4], the authors use a fully connected attention mechanism to share information between nodes in the roadgraph, applying scaled dot-product attention between node features following the convention introduced in [5]. We implement our attention network with the same structure as the GCN and GAT networks: three scaled dot-product self-attention layers with ReLU nonlinearities between them.
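A sketch of this fully connected variant, using PyTorch's built-in MultiheadAttention for the scaled dot-product operation (single head assumed). Unlike the GCN and GAT, no edge index is used: every node attends to every other node.

    import torch
    import torch.nn as nn

    class FullyConnectedAttention(nn.Module):
        def __init__(self, dim=64, num_layers=3):
            super().__init__()
            # Scaled dot-product self-attention over ALL node pairs (no adjacency mask).
            self.attns = nn.ModuleList(
                [nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
                 for _ in range(num_layers)])

        def forward(self, x):
            # x: (num_nodes, dim); add a batch dimension of 1 for MultiheadAttention.
            h = x.unsqueeze(0)
            for i, attn in enumerate(self.attns):
                h, _ = attn(h, h, h)
                if i < len(self.attns) - 1:
                    h = torch.relu(h)
            return h.squeeze(0)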

Results

We trained each of the above networks from scratch on the Argoverse 1 dataset[3] using the Adam optimizer[6] for 10 epochs. The training objective was binary cross-entropy loss with a weight of 80 on positive labels, corresponding to the empirical ratio of negative to positive labels. Each model was trained 3 times from random seeds. We measured the classification performance of each network using mean average precision (mAP), the area under the precision-recall curve. Table 1 shows the observed classification mAP for each model, reported as the average of the three runs, with an error bar corresponding to the maximum deviation from the mean observed over the three training runs.
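A minimal training-loop sketch under these settings. It assumes the model ends in a linear head producing one logit per node and that the loader yields PyTorch Geometric-style batches with x, edge_index and y attributes; these are our assumptions for illustration.

    import torch
    from sklearn.metrics import average_precision_score  # used for the mAP metric

    def train_node_classifier(model, loader, epochs=10):
        # Positives up-weighted by the empirical negative-to-positive ratio of 80.
        loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(80.0))
        optimizer = torch.optim.Adam(model.parameters())
        for _ in range(epochs):
            for batch in loader:
                logits = model(batch.x, batch.edge_index).squeeze(-1)
                loss = loss_fn(logits, batch.y.float())
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    # Evaluation: average precision over held-out nodes, e.g.
    # ap = average_precision_score(val_labels, torch.sigmoid(val_logits).numpy())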

Table 1. Node Classification mAP for Varying Graph Neural Network Architectures

    Model Architecture                     Mean Average Precision
    Baseline (GNN)                         46.4% ± 0.6%
    Graph Convolutional Network (GCN)      49.4% ± 0.3%
    Graph Attention Network (GAT)          47.6% ± 0.4%
    Fully Connected Attention Network      47.0% ± 0.4%

Performance is quite similar across the four architectures. However, the best performing model is the Graph Convolutional Network, which outperforms the GNN baseline by 3 percentage points. One possible explanation for the close performance is that the location of a lane alone may be sufficient to predict the final position of the ego agent, independent of the nearby nodes. Another hypothesis is that the receptive field of the convolutions is insufficient to improve significantly over the GNN baseline: each convolutional network in Table 1 contained only 3 convolutional layers, meaning the prediction for each node was informed only by nodes at most three connections away. To test whether more convolutions would further improve performance, we trained an additional set of models with 5 and 7 convolution operations.

Fig 1. Classification performance of the GNN architectures versus the number of layers on the roadgraph node classification task.

Examining Figure 1, we can see that for all four model types, performance is fairly similar across network depths. Notably, the performance of the graph convolutional network degrades somewhat as network depth increases. This may be due to the over-smoothing problem which is common in graph convolutional networks.

Conclusions

Using graph convolutional networks to encode roadgraph information produces a small improvement over both the baseline graph neural network and fully connected attention-based approaches. Further study is required to investigate whether graph convolutional networks lead to improvements in downstream tasks such as motion forecasting.

Annotated Bibliography

  1. Cui, H., Radosavljevic, V., Chou, F. C., Lin, T. H., Nguyen, T., Huang, T. K., ... & Djuric, N. (2019). Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In 2019 International Conference on Robotics and Automation (ICRA) (pp. 2090-2096). IEEE.
  2. Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., & Schmid, C. (2020). VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11525-11533).
  3. Chang, M. F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., ... & Hays, J. (2019). Argoverse: 3D tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8748-8757).
  4. Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., & Schmid, C. (2020). VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11525-11533).
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  6. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.


Some rights reserved
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0