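[Paper Close Reading] Graph Neural Network-Based Anomaly Detection in Multivariate Time Series

Paper: [2106.06947] Graph Neural Network-Based Anomaly Detection in Multivariate Time Series (arxiv.org)

Code: https://github.com/d-ailin/GDN

The English here was typed entirely by hand, as a summary and paraphrase of the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post is closer to personal notes, so read it with caution.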

1. TL;DR

1.1. Takeaways

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.4. Proposed Framework

2.4.1. Problem Statement

2.4.2. Overview

2.4.3. Sensor Embedding

2.4.4. Graph Structure Learning

2.4.5. Graph Attention-Based Forecasting

2.4.6. Graph Deviation Scoring

2.5. Experiments

2.5.1. Datasets

2.5.2. Baselines

2.5.3. Evaluation Metrics

2.5.4. Experimental Setup

2.5.5. RQ1. Accuracy

2.5.6. RQ2. Ablation

2.5.7. RQ3. Interpretability of Model

2.5.8. RQ4. Localizing Anomalies

2.6. Conclusion

3. Reference

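1. TL;DR

1.1. Takeaways

（1）Pretty decent. It is actually a fairly simple model, the presentation is very clear, and code is available, so it is worth a read.

2. Section-by-Section Close Reading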

2.1. Abstract

①Existing work: deep learning has improved anomaly detection on high-dimensional data

②Challenge: existing methods do not explicitly learn the structure of existing relationships between variables

2.2. Introduction

①Real-world systems are usually monitored by a large number of sensors, whose readings can be used for anomaly detection

②Anomaly detection is usually treated as an unsupervised problem because anomalies are typically unlabeled and highly variable

③They propose the Graph Deviation Network (GDN)

2.3. Related Work

（1）Anomaly Detection

①Methods include autoencoders (AE), variational autoencoders (VAE), and so on

（2）Multivariate Time Series Modelling

①These methods model the behavior of a multivariate time series based on its past behavior. They include auto-regressive models, auto-regressive integrated moving average (ARIMA) models, CNN, LSTM, and GAN. However, they struggle with complex and highly non-stationary time series.

（3）Graph Neural Networks

①Limitations: existing GNNs are limited in representing many sensors with different behaviors, and they assume a known graph structure, which is not available here

2.4. Proposed Framework

2.4.1. Problem Statement

①Number of sensors: $N$

②Data format: a time series $\mathbf{s}_{\mathrm{train}}=\left[\mathbf{s}_{\mathrm{train}}^{(1)},\cdots,\mathbf{s}_{\mathrm{train}}^{(T_{\mathrm{train}})}\right],\mathbf{s}_{\mathrm{train}}^{(t)} \in \mathbb{R}^{N}$ , i.e., at each time tick $t$ the vector holds one reading from each of the $N$ sensors

③Number of training time ticks: $T_{train}$

④Training data: only normal data

⑤Test data: readings from the same $N$ sensors over a separate set of $T_{test}$ time ticks $\mathbf{s}_{\mathrm{test}}=\begin{bmatrix}\mathbf{s}_{\mathrm{test}}^{(1)},\cdots,\mathbf{s}_{\mathrm{test}}^{(T_{\mathrm{test}})}\end{bmatrix}$

⑥Output: a binary array of length $T_{test}$ with $\mathsf{a}(t)\in\{0,1\}$, where $\mathsf{a}(t)=1$ indicates that time tick $t$ is anomalous

2.4.2. Overview

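①The framework consists of four parts: sensor embedding, graph structure learning, graph attention-based forecasting, and graph deviation scoring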

②Overall framework:

2.4.3. Sensor Embedding

①Representation of each sensor: $\mathbf{v_{i}}\in\mathbb{R}^{d},\mathrm{~for~}i\in\{1,2,\cdots,N\}$

2.4.4. Graph Structure Learning

①Graph construction: a directed graph is used

②Nodes: sensors

③Edges: dependency relationships (An edge from one sensor to another indicates that the first sensor is used for modelling the behavior of the second sensor)

④Adjacency matrix: $A$

⑤Prior information represented by candidate relations:

$\mathcal{C}_i\subseteq\{1,2,\cdots,N\}\setminus\{i\}$

If there is no prior information, the candidate relations of sensor $i$ are all sensors except $i$ itself.

⑥Sensor similarity:

$\begin{aligned}&e_{ji}=\frac{\mathbf{v_{i}}^{\top}\mathbf{v_{j}}}{\|\mathbf{v_{i}}\|\cdot\|\mathbf{v_{j}}\|}\mathrm{~for~}j\in\mathcal{C}_{i}\\&A_{ji}=1\{j\in\mathsf{TopK}(\{e_{ki}:k\in\mathcal{C}_{i}\})\}\end{aligned}$

The similarity $e_{ji}$ is the cosine similarity between the embeddings of sensors $i$ and $j$, and the $k$ candidates most similar to sensor $i$ are selected as its neighbors ($A_{ji}=1$).
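The two formulas above can be sketched in a few lines of plain Python. This is only a toy illustration, not the authors' implementation: the function names `cosine_similarity` and `learn_graph` and the embedding values are made up for the example, and in the real model the embeddings are trained rather than fixed.

```python
import math

def cosine_similarity(u, v):
    """e_ji: cosine similarity between two embedding vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def learn_graph(embeddings, top_k, candidates=None):
    """Return A with A[j][i] = 1 when j is among the top_k candidates most similar to i."""
    n = len(embeddings)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        # Without prior information, every sensor except i itself is a candidate.
        cand = candidates[i] if candidates is not None else [j for j in range(n) if j != i]
        ranked = sorted(
            cand,
            key=lambda j: cosine_similarity(embeddings[i], embeddings[j]),
            reverse=True,
        )
        for j in ranked[:top_k]:
            adjacency[j][i] = 1  # directed edge from j to i
    return adjacency

# Hypothetical 2-dimensional embeddings for N = 4 sensors.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_adjacency = learn_graph(toy_embeddings, top_k=1)
```

Each column `i` of `toy_adjacency` contains exactly `top_k` ones, one for every sensor chosen to model sensor `i`.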

2.4.5. Graph Attention-Based Forecasting

①The expected behavior at time $t$ is forecast from the input:

$\mathbf{x^{(t)}}:=\begin{bmatrix}\mathbf{s^{(t-w)}},\mathbf{s^{(t-w+1)}},\cdots,\mathbf{s^{(t-1)}}\end{bmatrix}, \mathbf{x^{(t)}}\in\mathbb{R}^{N\times w}$

where $w$ denotes the size of the sliding window; the target to predict is the sensor data at the current time tick, $\mathbf{s^{(t)}}$
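As a quick illustration of how the input $\mathbf{x^{(t)}}$ and its target are cut out of the series, here is a minimal sketch. The helper name `make_windows` and the toy readings are assumptions for the example only.

```python
def make_windows(series, w):
    """series: list of T time ticks, each a list of N sensor readings.

    Returns a list of (x, target) pairs, where x has shape N x w
    (one row per sensor) and target is the reading vector s^(t).
    """
    samples = []
    for t in range(w, len(series)):
        history = series[t - w:t]                 # s^(t-w), ..., s^(t-1)
        x = [list(row) for row in zip(*history)]  # transpose from w x N to N x w
        samples.append((x, series[t]))
    return samples

# Toy series with T = 5 time ticks and N = 2 sensors (made-up values).
toy_series = [[0, 10], [1, 11], [2, 12], [3, 13], [4, 14]]
toy_samples = make_windows(toy_series, w=3)
```

With $T$ time ticks and window size $w$ this yields $T-w$ samples, since the first $w$ ticks have no complete history.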

（1）Feature Extractor

①Aggregation method:

$\mathbf{z}_{i}^{(t)}=\mathrm{ReLU}\left(\alpha_{i,i}\mathbf{W}\mathbf{x}_{i}^{(t)}+\sum_{j\in\mathcal{N}(i)}\alpha_{i,j}\mathbf{W}\mathbf{x}_{j}^{(t)}\right)$

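where $\mathbf{x}_{i}^{(t)}\in\mathbb{R}^{w}$ is the input feature of node $i$, $\mathcal{N}(i)=\{j\mid A_{ji}>0\}$ is the set of neighbors of node $i$ given by the learned adjacency matrix, $\mathbf{W}\in\mathbb{R}^{d\times w}$ is a trainable weight matrix shared by all nodes, and $\alpha_{i,j}$ are the attention coefficients, calculated by: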

$\begin{aligned} \mathbf{g}_{i}^{(t)}& =\mathbf{v}_{i}\oplus\mathbf{Wx}_{i}^{(t)} \\ \pi\left(i,j\right)& =\mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left(\mathbf{g}_{i}^{(t)}\oplus\mathbf{g}_{j}^{(t)}\right)\right) \\ \alpha_{i,j}& =\frac{\exp\left(\pi\left(i,j\right)\right)}{\sum_{k\in\mathcal{N}(i)\cup\{i\}}\exp\left(\pi\left(i,k\right)\right)}, \end{aligned}$

where $\oplus$ denotes concatenation and $\mathbf{a}$ denotes the vector of learned coefficients for the attention mechanism
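To make the aggregation and attention formulas concrete, here is a small plain-Python sketch. It is not the repository's code: the function names, the toy numbers, and the LeakyReLU negative slope of 0.2 are all assumptions (the notes do not state the slope), and subtracting the maximum score before `exp` is just a standard numerical-stability trick.

```python
import math

def matvec(W, x):
    """Multiply matrix W (d x w) by vector x (length w)."""
    return [sum(w_k * x_k for w_k, x_k in zip(row, x)) for row in W]

def leaky_relu(value, slope=0.2):  # slope is an assumed default
    return value if value > 0 else slope * value

def gat_features(x, v, W, a, neighbors):
    """x: N x w inputs, v: N x d embeddings, W: d x w, a: length 4d,
    neighbors[i]: list of j in N(i). Returns (z, alpha)."""
    n = len(x)
    wx = [matvec(W, x_i) for x_i in x]          # W x_i for every node
    g = [v[i] + wx[i] for i in range(n)]        # g_i: list concat of v_i and W x_i
    z, alpha = [], []
    for i in range(n):
        scope = [i] + list(neighbors[i])        # N(i) together with i itself
        scores = [
            leaky_relu(sum(a_k * g_k for a_k, g_k in zip(a, g[i] + g[j])))
            for j in scope
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        agg = [
            sum(weights[m] * wx[j][k] for m, j in enumerate(scope))
            for k in range(len(W))
        ]
        z.append([max(0.0, value) for value in agg])  # ReLU
        alpha.append(dict(zip(scope, weights)))       # alpha[i][j] = alpha_{i,j}
    return z, alpha

# Toy setup: N = 3 sensors, window w = 2, embedding size d = 2 (made-up values).
toy_x = [[1.0, 2.0], [3.0, 4.0], [-5.0, -6.0]]
toy_v = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_W = [[1.0, 0.0], [0.0, 1.0]]
toy_a = [0.1] * 8
toy_neighbors = [[1], [0, 2], [1]]
toy_z, toy_alpha = gat_features(toy_x, toy_v, toy_W, toy_a, toy_neighbors)
```

For every node, the coefficients in `toy_alpha[i]` sum to 1 over $\mathcal{N}(i)\cup\{i\}$. Setting `a` to all zeros makes the attention uniform, so the aggregation reduces to a plain average followed by ReLU.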

（2）Output Layer

①Aggregated representations obtained for all $N$ nodes: $\{\mathbf{z}_{1}^{(t)},\cdots,\mathbf{z}_{N}^{(t)}\}$

②Predicted value at time $t$ :

$\mathbf{\hat{s}}^{(\mathbf{t})}=f_\theta\left(\begin{bmatrix}\mathbf{v}_1\circ\mathbf{z}_1^{(t)},\cdots,\mathbf{v}_N\circ\mathbf{z}_N^{(t)}\end{bmatrix}\right)$

where $\circ$ denotes element-wise multiplication and $f_\theta$ denotes stacked fully connected layers with output dimensionality $N$
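A rough sketch of the output step follows. Two simplifications are assumed here and are not taken from the paper: $f_\theta$ is replaced by a single linear layer, and the $N$ per-node vectors are flattened into one long input vector. The name `predict` and all numbers are illustrative only.

```python
def predict(v, z, weights, bias):
    """Element-wise multiply each v_i with z_i, flatten, then apply one linear layer.

    v, z: N x d lists; weights: N x (N*d); bias: length N.
    The single linear layer is a stand-in for f_theta (an assumption).
    """
    features = []
    for v_i, z_i in zip(v, z):
        features.extend(v_k * z_k for v_k, z_k in zip(v_i, z_i))  # v_i o z_i
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]

# Toy values: N = 2 sensors, d = 2.
toy_v = [[1.0, 2.0], [3.0, 4.0]]
toy_z = [[0.5, 0.0], [1.0, 1.0]]
toy_weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
toy_bias = [0.0, 1.0]
toy_prediction = predict(toy_v, toy_z, toy_weights, toy_bias)
```

The result has one predicted value per sensor, which is what gets compared against $\mathbf{s}^{(t)}$.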

③Loss function (MSE):

$L_{\mathrm{MSE}}=\frac{1}{T_{\mathrm{train}}-w}\sum_{t=w+1}^{T_{\mathrm{train}}}\left\|\mathbf{\hat{s}}^{(\mathbf{t})}-\mathbf{s}^{(\mathbf{t})}\right\|_{2}^{2}$
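The loss is the squared L2 distance between prediction and observation, summed over sensors and averaged over the $T_{\mathrm{train}}-w$ time ticks that have a full window. A minimal sketch with made-up numbers (the name `mse_loss` is only for this example):

```python
def mse_loss(predictions, targets):
    """Average over time ticks of the squared L2 norm of (s_hat - s)."""
    total = 0.0
    for s_hat, s in zip(predictions, targets):
        total += sum((p - q) ** 2 for p, q in zip(s_hat, s))
    return total / len(predictions)

# Two time ticks, two sensors: (0 + 4) + (9 + 0) = 13, divided by 2 ticks.
toy_predictions = [[1.0, 2.0], [3.0, 4.0]]
toy_targets = [[1.0, 0.0], [0.0, 4.0]]
toy_loss = mse_loss(toy_predictions, toy_targets)
```

Note that the division is by the number of time ticks only, not by the number of sensors, which matches the formula above.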