Summary of Graph Attention Networks
This paper introduces Graph Attention Networks (GATs), a novel neural network architecture based on masked self-attention layers for graph-structured data.
A Graph Attention Network is composed of multiple Graph Attention and Dropout layers, followed by a softmax or a logistic sigmoid function for single/multi-label classification.
Graph Attention Layer
A single Graph Attention Layer is parameterized by a \in R^{2F'} and W \in R^{F' \times F}, with input h \in R^{N \times F} (N nodes and F features per node) and output h' \in R^{N \times F'}.

First, the self-attention coefficient

e_{ij} = LeakyReLU(\vec{a}^T [W\vec{h_i} \| W\vec{h_j}])

is computed, where \| denotes concatenation and the LeakyReLU negative slope \alpha is set to 0.2. e_{ij} is the attention coefficient between node i and node j, and indicates the importance of node j's features to node i.

Masked attention is employed to maintain the graph structure: e_{ij} is computed only for nodes j \in N_i, where N_i is some neighbourhood of node i in the graph. In this paper, the neighbourhood is fixed to the first-order neighbours of node i (and includes node i itself).

The coefficients are then normalized (across all choices of j) using the following softmax, to make them easily comparable across different nodes:

\alpha_{ij} = softmax_j(e_{ij}) = \frac{exp(e_{ij})}{\sum_{k \in N_i} exp(e_{ik})}
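As a concrete illustration, here is a minimal NumPy sketch of how a single head's coefficients e_{ij} and \alpha_{ij} could be computed on a toy graph. The graph, the sizes N, F, F', and all variable names are made up for the example; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F, F_out = 5, 8, 4                        # toy sizes: N nodes, F input features, F' output features

h = rng.normal(size=(N, F))                  # input node features
W = rng.normal(size=(F_out, F))              # shared weight matrix W
a = rng.normal(size=(2 * F_out,))            # attention vector a

adj = np.eye(N, dtype=bool)                  # each neighbourhood includes the node itself
adj[0, 1] = adj[1, 0] = True                 # a few toy edges
adj[1, 2] = adj[2, 1] = True

Wh = h @ W.T                                 # (N, F') transformed features W h_i

# e_ij = LeakyReLU(a^T [W h_i || W h_j]) splits into a "left" term (node i) and a "right" term (node j)
left = Wh @ a[:F_out]
right = Wh @ a[F_out:]
e = left[:, None] + right[None, :]
e = np.where(e > 0, e, 0.2 * e)              # LeakyReLU with negative slope 0.2

# Masked attention: keep only j in N_i, then normalize with a softmax over each row
e = np.where(adj, e, -np.inf)
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)    # alpha[i, j] = softmax_j(e_ij)
```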
The final output feature for node i is then computed as

\vec{h_i'} = \sigma(\sum_{j \in N_i} \alpha_{ij} W\vec{h_j})

where \sigma is the activation function.

Multi-head attention is used to stabilize the learning process of self-attention, by concatenating the outputs of K independent attention mechanisms:

\vec{h_i'} = \|_{k=1}^K \sigma(\sum_{j \in N_i} \alpha_{ij}^k W^k\vec{h_j})

where \alpha_{ij}^k is the normalized attention coefficient computed by the k-th attention mechanism a^k, and W^k is the corresponding weight matrix. Hence, the output \vec{h_i'} will have KF' instead of F' features.

For the final (prediction) layer of the network, concatenation is replaced by averaging over the K attention heads, and the final non-linearity is applied afterwards:

\vec{h_i'} = \sigma(\frac{1}{K} \sum_{k=1}^K \sum_{j \in N_i} \alpha_{ij}^k W^k\vec{h_j})
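To make the multi-head aggregation concrete, the following is a self-contained sketch of one GAT layer with K heads, concatenating head outputs for hidden layers and averaging them for the prediction layer. The function and argument names (gat_layer, average, ...) are mine, chosen for the example, and not taken from the paper's code.

```python
import numpy as np

def gat_layer(h, adj, Ws, a_vecs, activation, average=False):
    """One GAT layer with K = len(Ws) attention heads.

    h: (N, F) inputs; adj: (N, N) boolean adjacency with self-loops;
    Ws[k]: (F', F); a_vecs[k]: (2F',). Hidden layers concatenate the K head
    outputs (K*F' features); the prediction layer averages them (average=True).
    """
    heads = []
    for W, a in zip(Ws, a_vecs):
        F_out = W.shape[0]
        Wh = h @ W.T                                         # (N, F')
        e = (Wh @ a[:F_out])[:, None] + (Wh @ a[F_out:])[None, :]
        e = np.where(e > 0, e, 0.2 * e)                      # LeakyReLU, slope 0.2
        e = np.where(adj, e, -np.inf)                        # masked attention
        alpha = np.exp(e - e.max(axis=1, keepdims=True))
        alpha /= alpha.sum(axis=1, keepdims=True)            # softmax over N_i
        heads.append(alpha @ Wh)                             # sum_j alpha_ij W h_j
    out = np.mean(heads, axis=0) if average else np.concatenate(heads, axis=1)
    return activation(out)

# Toy usage: a hidden layer with K = 3 heads on a 6-node graph.
rng = np.random.default_rng(0)
N, F, F_out, K = 6, 8, 4, 3
adj = np.eye(N, dtype=bool)
adj[0, 1] = adj[1, 0] = True
h = rng.normal(size=(N, F))
Ws = [rng.normal(size=(F_out, F)) for _ in range(K)]
a_vecs = [rng.normal(size=(2 * F_out,)) for _ in range(K)]
elu = lambda x: np.where(x > 0, x, np.exp(x) - 1)
h_hidden = gat_layer(h, adj, Ws, a_vecs, elu)                # shape (N, K * F_out)
```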
During training, dropout is applied to each layer's inputs and to the normalized attention coefficients \alpha_{ij}. Hence, each node is exposed to a stochastically sampled neighbourhood.
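A quick sketch of what dropout on the attention coefficients looks like; the inverted-dropout scaling by 1/(1 - p) is a common convention I am assuming here, not something the summary above spells out.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.6                                            # drop probability used in the paper

alpha = rng.random((4, 4))
alpha /= alpha.sum(axis=1, keepdims=True)          # stand-in for normalized attention coefficients

keep = rng.random(alpha.shape) >= p                # sample which coefficients survive
alpha_dropped = np.where(keep, alpha / (1 - p), 0.0)   # inverted-dropout scaling (assumed convention)
# Each node i now aggregates over a stochastically sampled subset of its neighbourhood N_i.
```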
Advantages of GATs
- A single GAT attention head computing F' features can be computed in O(|V|FF' + |E|F'), on par with Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017); see the sketch after this list.
- Unlike GCNs, GATs allow (implicitly) assigning different importances to nodes in the same neighbourhood. Analyzing the learned attention weights may also help interpretability.
- The attention mechanism is applied in a shared manner across all edges in the graph, hence it does not require upfront access to the global graph structure or to the features of all of its nodes.
- The graph is not required to be undirected: \alpha_{ij} can simply be left out if edge j \rightarrow i is not present.
- GATs can be used for both inductive learning (evaluation on graphs completely unseen during training) and transductive learning.
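To see where the O(|V|FF' + |E|F') bound comes from, here is an edge-list sketch of a single head: every node is transformed once, and attention scores are computed only along existing edges. The edge-list layout and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def gat_head_sparse(h, edges, W, a):
    """h: (N, F); edges: integer array of (i, j) pairs with j in N_i (self-loops included);
    W: (F', F); a: (2F',)."""
    N, F_out = h.shape[0], W.shape[0]
    src, dst = edges[:, 0], edges[:, 1]
    Wh = h @ W.T                                        # O(|V| F F') for the shared transform
    e = Wh[src] @ a[:F_out] + Wh[dst] @ a[F_out:]       # O(|E| F') attention logits, one per edge
    e = np.where(e > 0, e, 0.2 * e)                     # LeakyReLU
    w = np.exp(e - e.max())                             # global shift for numerical stability
    denom = np.zeros(N)
    np.add.at(denom, src, w)                            # per-node softmax denominators
    alpha = w / denom[src]
    out = np.zeros((N, F_out))
    np.add.at(out, src, alpha[:, None] * Wh[dst])       # sum_j alpha_ij W h_j
    return out
```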
Experimental Setup
- Transductive Learning (Cora & Citeseer): Two-layer GAT model, with K = 8 attention heads and F' = 8 features in the first layer, followed by an exponential linear unit (ELU). The second layer is used for classification with C output features (the number of classes) and K = 1, followed by a softmax activation. L_2 regularization is set to 0.0005. Dropout with p = 0.6 is applied to both layers' inputs and to the normalized attention coefficients \alpha_{ij}.
- Transductive Learning (Pubmed): Two-layer GAT model, with K = 8 and F' = 8 in the first layer, followed by an ELU nonlinearity. The second layer is used for classification with C output features and K = 8 (averaged), followed by a softmax activation. L_2 regularization is set to 0.001 and dropout to p = 0.6.
- Inductive Learning (PPI): Three-layer GAT model, with K = 4 and F' = 256 features in the first two layers, followed by an ELU nonlinearity. The final layer has K = 6 and C = 121, followed by a logistic sigmoid activation. No regularization is used, as the training set is sufficiently large. Skip connections are added across the intermediate attentional layer.
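For quick reference, the three architectures above can be summarised as plain Python dicts. The field names are my own shorthand, not the authors' configuration format; the values come from the setup described above.

```python
gat_configs = {
    "cora_citeseer": {                       # transductive
        "layers": [
            {"heads": 8, "features": 8, "activation": "elu"},
            {"heads": 1, "features": "C (num classes)", "activation": "softmax"},
        ],
        "l2": 5e-4,
        "dropout": 0.6,                      # on layer inputs and attention coefficients
    },
    "pubmed": {                              # transductive
        "layers": [
            {"heads": 8, "features": 8, "activation": "elu"},
            {"heads": 8, "features": "C (num classes)", "activation": "softmax"},  # heads averaged
        ],
        "l2": 1e-3,
        "dropout": 0.6,
    },
    "ppi": {                                 # inductive
        "layers": [
            {"heads": 4, "features": 256, "activation": "elu"},
            {"heads": 4, "features": 256, "activation": "elu", "skip_connection": True},
            {"heads": 6, "features": 121, "activation": "sigmoid"},                # heads averaged
        ],
        "l2": None,                          # no regularization; training set is large
        "dropout": None,
    },
}
```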
Results
- GATs outperformed GCNs on Cora and Citeseer by 1.5% and 1.6% respectively, and matched GCN performance on the Pubmed dataset.
- GATs outperformed GraphSAGE by 20.5% on the PPI dataset.
Questions
Q: How would changing the neighbourhood size affect results? How about using different neighbourhood size in different layers?
Q: In the context of semi-supervised clustering, if a signal embedding is used as input h_i and the majority of the unlabelled nodes share the same h_i, how can the attention coefficients be effectively computed?