【论文_序列转换模型架构_20230802v7】Attention Is All You Need 【Transformer】


https://arxiv.org/abs/1706.03762
20170612 v1

代码实现_notebook


∗Equal contribution. Listing order is random.
Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea.
提出用 self-attention 替代 RNNs,并开始努力评估这一想法。
Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work.
设计并实现了第一个 Transformer 模型,并参与了这项工作的各个核心方面。
Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail.
提出了缩放的点积注意、多头注意和无参数位置表示,并参与了几乎每一个细节。
Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor.
在我们的原始代码库和 tensor2tensor 中设计、实现、调整和评估了无数的模型变体。
Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations.
还试验了新的模型变体,负责我们的初始代码库,以及高效推理和可视化。
Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
花了无数个漫长的日子来设计和实现 tensor2tensor 的各个部分,更换了我们早期的代码库,极大地改善了结果并大大加快了我们的研究。

文章目录

  • 摘要
  • 1 引言
  • 2 背景
  • 3 模型架构
    • 3.1 编码器 和 解码器 堆叠
    • 3.2 Attention
      • 3.2.1 Scaled Dot-Product Attention
      • 3.2.2 Multi-Head Attention
      • 3.2.3 Attention 在我们的模型中的应用
    • 3.3 Position-wise 前馈网络
    • 3.4 Embeddings 和 Softmax
    • 3.5 Positional Encoding 位置编码
  • 4 Why Self-Attention
  • 5 训练
    • 5.1 训练数据 和 Batching
    • 5.2 硬件和时间表
    • 5.3 优化器
    • 5.4 正则化
  • 6 结果
    • 6.1 机器翻译
    • 6.2 模型变体
    • 6.3 English Constituency Parsing 英语成分句法分析
  • 7 结论
  • 致谢
  • 参考文献
  • 注意可视化

摘要

〔 一个新的简单的 sequence transduction(序列转换)模型架构 Transformer:性能更好,更具并行性,需要更少的训练时间。〕

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.
主流的序列 transduction 模型是基于复杂的循环或卷积神经网络,包括一个编码器和一个解码器
The best performing models also connect the encoder and decoder through an attention mechanism.
表现最好的模型还通过注意机制连接编码器和解码器。
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
我们提出了一个新的简单的网络架构,Transformer,完全基于注意机制,完全摒弃循环和卷积。
Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
在两个机器翻译任务上的实验表明,这些模型在质量上更优越,同时具有更好的并行性,并且需要的训练时间显著更少。
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.
我们的模型在 WMT 2014 英语-德语翻译任务上实现了 28.4 BLEU,比现有的最佳结果(包括 ensembles )提高了 2 个 BLEU 以上。
On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
在 WMT 2014 英法翻译任务中,我们的模型在 8 个 GPUs 上训练 3.5 天后,建立了一个新的单模型最先进的 BLEU 分数 41.8,这是 文献中最佳模型的训练成本的一小部分。
We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
我们通过将 Transformer 成功地应用于具有大量和有限训练数据的英语成分句法分析,证明了它可以很好地泛化到其它任务。

1 引言

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5].
循环神经网络,特别是长短时记忆[13] 和 门控循环[7]神经网络,已经被牢固地确立为 序列建模 和 transduction 问题(如语言建模 和 机器翻译)的最先进方法[35,2,5]。
Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
从那以后,大量的努力继续推动 循环语言模型 和 编码器-解码器架构的边界[38,24,15]。

Recurrent models typically factor computation along the symbol positions of the input and output sequences.
循环模型通常沿输入和输出序列的符号位置进行因子计算。
Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$.
将位置与计算时间中的步对齐,它们生成隐藏状态序列 $h_t$:$h_t$ 是前一个隐藏状态 $h_{t-1}$ 和位置 $t$ 的输入的函数。
This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
这种固有的顺序性排除了训练示例中的并行化,这在较长的序列长度下变得至关重要,因为内存约束限制了跨示例的 batching。
Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter.
最近的研究通过因式分解技巧[21] 和 条件计算[32]显著提高了计算效率,同时也提高了后者的模型性能。
The fundamental constraint of sequential computation, however, remains.
然而,顺序计算的基本约束仍然存在。 〔 顺序计算 ——> 无法并行 〕

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19].
注意机制已经成为各种任务中的序列建模和 transduction 模型的必要组成部分,允许对依赖关系进行建模,而不考虑它们在输入或输出序列中的距离[2,19]。
In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
然而,在除少数情况 [27] 外的所有情况下,这种注意机制都与循环网络结合使用。

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.
在这项工作中,我们提出了 Transformer,一种避免循环的模型架构,完全依赖于注意机制来得到输入和输出之间的全局依赖关系
The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
Transformer 允许更多的并行化,并且在 8 个 P100 GPUs 上经过 12 小时的训练后,可以达到翻译质量的新的最先进水平。

2 背景

学习远距离位置之间的依赖关系
Transformer:average attention-weighted positions
平均 注意加权的位置
代价:降低 有效分辨率effective resolution
Transformer:Multi-Head Attention
多头注意

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions.
减少顺序计算的目标也构成了 Extended Neural GPU[16]、ByteNet[18] 和 ConvS2S[9] 的基础,它们都使用卷积神经网络作为基本构建块,并行计算所有输入和输出位置的隐藏表示
In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet.
在这些模型中,将两个任意输入或输出位置的信号关联起来所需的运算量 随着位置之间的距离增加而增长,ConvS2S 为线性增长,ByteNet 为对数增长。
This makes it more difficult to learn dependencies between distant positions [12].
这使得学习远距离位置之间的依赖关系变得更加困难 [12]。
In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
在 Transformer 中,这被减少到一个恒定的操作数量,尽管其代价是由于平均 注意加权的位置而降低了有效分辨率,我们用 3.2 节中描述的多头注意抵消了这一影响。

Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
自注意,有时被称为内注意,是一种将单个序列的不同位置联系起来以计算该序列的表示的注意机制。
Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
self-attention 已经被成功地应用于阅读理解、生成式摘要(abstractive summarization)、文本蕴涵和学习任务无关的句子表征等多种任务中 [4,27,28,22]。

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
端到端记忆网络基于循环注意机制,而不是序列对齐的循环,并且在简单语言问答和语言建模任务上表现良好 [34]。

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.
然而,据我们所知,Transformer 是第一个完全依赖于 self-attention 来计算其输入和输出表示的 transduction 模型,不使用顺序对齐的 RNNs 或卷积。
In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].
在后续部分中,我们将描述 Transformer,阐述使用 self-attention 的动机,并讨论它相对于 [17,18] 和 [9] 等模型的优势。

3 模型架构

Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35].
大多数有竞争力的神经序列 transduction 模型具有编码器-解码器结构 [5,2,35]。
Here, the encoder maps an input sequence of symbol representations $(x_1, \ldots, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$.
这里,编码器将一个符号表示的输入序列 $(x_1, \ldots, x_n)$ 映射到一个连续表示的序列 $\mathbf{z} = (z_1, \ldots, z_n)$。
Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1, \ldots, y_m)$ of symbols one element at a time.
给定 $\mathbf{z}$,解码器随后生成一个符号输出序列 $(y_1, \ldots, y_m)$,每次生成一个元素。
At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
在每一步中,模型都是自回归的,在生成下一个符号时,使用之前生成的符号作为额外输入

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
Transformer 遵循这个整体架构,编码器和解码器都使用堆叠的 self-attention 层和逐点的全连接层,分别如图 1 的左半部分和右半部分所示。

(图 1:Transformer 模型架构;左半部分为编码器,右半部分为解码器。)

3.1 编码器 和 解码器 堆叠

Encoder: The encoder is composed of a stack of $N = 6$ identical layers.
Each layer has two sub-layers.
编码器由 6 个相同的层 堆叠而成。每一层有两个子层
The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
第一个子层是多头自注意机制,第二个子层是简单的、逐个位置完全连接的前馈网络
We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1].
我们在两个子层的每一层周围都使用了一个残差连接[11],然后是层标准化[1]。
That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
也就是说,每个子层的输出是 LayerNorm(x + Sublayer(x)) ,其中 Sublayer(x) 是子层本身实现的函数。
To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_\text{model} = 512$.
为了方便这些残差连接,模型中的所有子层以及 embedding 层产生的输出维度均为 $d_\text{model} = 512$。
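
〔 补充示意 〕下面用 numpy 给出 LayerNorm(x + Sublayer(x)) 这一子层连接方式的极简草图;函数名、eps 取值以及省略可学习的增益/偏置等细节均为本注释自拟,仅说明残差连接与层标准化的组合方式,并非论文官方实现:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # 对最后一维(d_model)做层标准化;可学习的增益/偏置在此省略
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # 每个子层的输出:LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

x = np.random.randn(2, 5, 512)              # (batch, 序列长度, d_model=512)
out = sublayer_connection(x, lambda t: t)   # 以“恒等子层”为例,输出形状仍为 (2, 5, 512)
```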

Decoder: The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
解码器也由 6 个相同的层堆叠而成。除了每个编码器层中的两个子层之外,解码器插入第三个子层,该子层对编码器 stack 的输出执行多头注意
Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.
与编码器类似,我们在每个子层周围使用残差连接,然后进行层标准化。
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions.
我们还修改了解码器 stack 中的自注意子层,以防止位置关注后续位置
This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.
这种 masking,加上输出 embeddings 偏移一个位置的事实,确保了位置 $i$ 的预测只能依赖于位置小于 $i$ 的已知输出。

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
注意函数可以描述为将 查询和一组键值对 映射到 输出,其中查询、键、值和输出都是向量。
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
输出 被计算为 值的加权和,其中分配给每个值的权重是由查询与相应键的兼容性函数计算的。

3.2.1 Scaled Dot-Product Attention

We call our particular attention “Scaled Dot-Product Attention” (Figure 2).
我们称这种特殊的注意为 “Scaled Dot-Product Attention”(图 2)。
The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$.
输入包括维度为 $d_k$ 的查询和键,以及维度为 $d_v$ 的值。
We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
我们计算查询与所有键的点积,将每个点积除以 $\sqrt{d_k}$,并应用 softmax 函数来获得值的权重。

(图 2:Scaled Dot-Product Attention(左)与 Multi-Head Attention(右)。)

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$.
在实践中,我们同时计算一组查询的注意函数,它们被打包成一个矩阵 $Q$。
The keys and values are also packed together into matrices $K$ and $V$.
键和值也被打包成矩阵 $K$ 和 $V$。
We compute the matrix of outputs as:
我们计算输出矩阵为:
$$\text{Attention}(Q,K,V)=\text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (1)$$

  • Q:queries ,查询
  • K:keys 键
  • V: values 值

The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention.
两个最常用的注意函数是 加性注意[2] 和 点积(乘法)注意。
Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$.
除了 $\frac{1}{\sqrt{d_k}}$ 的缩放因子之外,点积注意与我们的算法相同。
Additive attention computes the compatibility function using a feed-forward network with a single hidden layer.
加性注意使用一个具有单个隐藏层的前馈网络来计算兼容性函数(compatibility function)。
While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
虽然两者在理论复杂性上相似,但在实践中,点积注意更快,更节省空间,因为它可以使用高度优化的矩阵乘法代码来实现。

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3].
虽然对于较小的 $d_k$ 值,这两种机制的表现相似,但对于较大的 $d_k$ 值,加性注意优于未经缩放的点积注意 [3]。
We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients⁴.
我们怀疑,对于较大的 $d_k$ 值,点积的数值会变得很大,从而将 softmax 函数推入梯度极小的区域⁴。

  • 脚注 4:To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1.
    为了说明为什么点积会变得很大,假设 $q$ 和 $k$ 的分量是均值为 0、方差为 1 的独立随机变量。
    Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean 0 and variance $d_k$.
    那么它们的点积 $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ 的均值为 0,方差为 $d_k$。

To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
为了抵消这个影响,我们将点积缩放 $\frac{1}{\sqrt{d_k}}$ 倍。
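
按公式 (1),Scaled Dot-Product Attention 的计算流程可以用 numpy 写成如下极简草图(仅作示意,未包含 mask 与 dropout,函数名为本注释自拟,并非 tensor2tensor 的官方实现):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # 减去最大值以保证数值稳定
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # QK^T / sqrt(d_k)
    weights = softmax(scores, axis=-1)          # 在键的维度上归一化,得到值的权重
    return weights @ V                          # 值的加权和,形状 (n_q, d_v)

Q, K, V = np.random.randn(4, 64), np.random.randn(6, 64), np.random.randn(6, 64)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 64)
```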

3.2.2 Multi-Head Attention

Instead of performing a single attention function with $d_\text{model}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively.
我们发现,与其用 $d_\text{model}$ 维的键、值和查询执行单一的注意函数,不如用不同的、习得的线性投影将查询、键和值分别投影 $h$ 次,投影到 $d_k$、$d_k$ 和 $d_v$ 维,这样做是有益的。
On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values.
然后,在查询、键和值的每个投影版本上,我们并行地执行注意函数,产生 $d_v$ 维的输出值。
These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
将它们连接起来并再次进行投影,得到最终值,如图 2 所示。

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
多头注意让模型 在不同位置 共同注意 来自 不同表示子空间 的信息
With a single attention head, averaging inhibits this.
对于单一注意头,平均会抑制这一点。
$$\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\ldots,\text{head}_h)W^O$$

  • 其中 $\text{head}_i=\text{Attention}(QW_i^Q,\,KW_i^K,\,VW_i^V)$

其中投影矩阵为 $W_i^Q\in \mathbb{R}^{d_\text{model}\times d_k}$,$W_i^K\in \mathbb{R}^{d_\text{model}\times d_k}$,$W_i^V\in \mathbb{R}^{d_\text{model}\times d_v}$,$W^O\in \mathbb{R}^{h d_v\times d_\text{model}}$。

In this work we employ $h = 8$ parallel attention layers, or heads.
For each of these we use $d_k = d_v = d_\text{model}/h = 64$.
在这项工作中,我们使用 $h = 8$ 个并行注意层,即 8 个头。
对于每个头,我们使用 $d_k = d_v = d_\text{model}/h = 64$。
Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
由于每个 head 的维数降低,因此总计算成本与全维的 single-head attention单头注意 相似。
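
下面是多头注意的一个极简 numpy 草图(仅作示意:投影矩阵随机初始化,实际中 $W_i^Q$、$W_i^K$、$W_i^V$、$W^O$ 都是训练得到的参数;函数名为本注释自拟,并非官方实现):

```python
import numpy as np

def attention(Q, K, V):
    # 3.2.1 节公式 (1) 的缩放点积注意
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h                       # 64

W_Q = rng.normal(size=(h, d_model, d_k)) * 0.02
W_K = rng.normal(size=(h, d_model, d_k)) * 0.02
W_V = rng.normal(size=(h, d_model, d_v)) * 0.02
W_O = rng.normal(size=(h * d_v, d_model)) * 0.02

def multi_head_attention(Q, K, V):
    # 每个头在各自投影出的子空间里并行做注意,最后拼接再投影回 d_model
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

x = rng.normal(size=(10, d_model))             # 自注意时 Q = K = V = x
print(multi_head_attention(x, x, x).shape)     # (10, 512)
```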

3.2.3 Attention 在我们的模型中的应用

The Transformer uses multi-head attention in three different ways:
Transformer 以三种不同的方式使用多头注意:
———— In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
“编码器-解码器注意”层 中,查询来自前一个解码器层,而记忆 键和值来自编码器的输出。
This allows every position in the decoder to attend over all positions in the input sequence.
这使得 解码器中的每个位置 都注意 输入序列中的所有位置。
This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as[38, 2, 9].
这模仿了序列到序列模型中典型的编码器-解码器注意机制,如 [38,2,9]。
———— The encoder contains self-attention layers.
编码器包含自注意层。
In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder.
在自注意层中,所有的键、值和查询都来自同一个地方,在这种情况下,是编码器中前一层的输出
Each position in the encoder can attend to all positions in the previous layer of the encoder.
编码器中的每个位置 都可以注意 编码器前一层中的所有位置。
———— Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.
类似地,解码器中的自注意层 使得解码器中的每个位置 注意到解码器中的所有位置直至并包括该位置。
We need to prevent leftward information flow in the decoder to preserve the auto-regressive property.
我们需要阻止解码器中的向左信息流以保持自回归特性。 〔 ✅ 自回归特性是啥特性,为什么要求信息不能向左流?↓ 〕
We implement this inside of scaled dot-product attention by masking out (setting to -∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
我们通过 masking out(令其为 -∞)softmax 输入中对应于非法连接的所有值来实现 scaled dot-product attention缩放点积注意力。参见图 2。

———————— 补充 Start
自回归特性可以简单理解为 “基于历史生成未来”。例如,在语言生成任务中,模型生成一个词时,只会考虑之前已经生成的词,而不会考虑尚未生成的词。这种特性确保了生成过程的顺序性和因果性


———————— 补充 End
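
作为上面 3.2.3 节所述 masking 的补充示意,下面用 numpy 给出因果 mask 的极简草图:在 softmax 之前把非法连接(每个位置右侧的位置)置为 $-\infty$,对应的注意权重即为 0(仅作示意,函数名为本注释自拟):

```python
import numpy as np

def causal_mask(n):
    # 上三角(不含对角线)为 True:位置 i 不允许关注 j > i 的位置
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores):
    # scores: (n, n) 的 QK^T / sqrt(d_k);非法位置置为 -inf 后再做 softmax
    scores = scores.copy()
    scores[causal_mask(scores.shape[0])] = -np.inf
    scores -= scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

w = masked_softmax(np.random.randn(5, 5))
print(np.triu(w, k=1).sum())   # ≈ 0:每个位置分配给其右侧位置的权重为零
```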

3.3 Position-wise 前馈网络

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
除了注意子层外,编码器和解码器中的每一层都包含一个全连接的前馈网络,该网络分别相同地应用于每个位置。
This consists of two linear transformations with a ReLU activation in between.
它由两个线性变换组成,中间有一个 ReLU 激活。
$$\text{FFN}(x)=\max(0,\ xW_1+b_1)W_2+b_2 \qquad (2)$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer.
虽然线性变换在不同位置上是相同的,但它们在每一层之间使用不同的参数。
Another way of describing this is as two convolutions with kernel size 1.
另一种描述它的方式是两个 核大小为 1 的卷积
The dimensionality of input and output is $d_\text{model} = 512$, and the inner-layer has dimensionality $d_\text{ff} = 2048$.
输入和输出的维数为 $d_\text{model} = 512$,内层的维数为 $d_\text{ff} = 2048$。
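
按公式 (2),逐位置前馈网络可以写成如下 numpy 草图(权重随机初始化仅作形状示意,实际为训练得到的参数;函数名为本注释自拟):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def position_wise_ffn(x):
    # FFN(x) = max(0, xW1 + b1)W2 + b2,对每个位置独立且相同地作用
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(10, d_model))   # 10 个位置
print(position_wise_ffn(x).shape)    # (10, 512)
```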

3.4 Embeddings 和 Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_\text{model}$.
与其它序列 transduction 模型类似,我们使用习得的 embeddings 将输入 tokens 和输出 tokens 转换为维度为 $d_\text{model}$ 的向量。
We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.
我们还使用通常习得的 线性变换 和 softmax 函数 将解码器输出 转换为 预测的 next-token 概率。
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30].
在我们的模型中,我们在两个 embedding 层和 pre-softmax 线性变换之间共享相同的权重矩阵,类似于 [30]。
In the embedding layers, we multiply those weights by $\sqrt{d_\text{model}}$.
在 embedding 层中,我们将这些权重乘以 $\sqrt{d_\text{model}}$。
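
下面是“权重共享 + 乘以 $\sqrt{d_\text{model}}$”的一个极简 numpy 草图(词表大小取 1000 仅为演示,实际共享词表约 37000;权重随机初始化,函数名为本注释自拟):

```python
import numpy as np

vocab, d_model = 1000, 512
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab, d_model)) * 0.02   # 两个 embedding 层与 pre-softmax 线性变换共享的权重矩阵

def embed(token_ids):
    # 3.4 节:查表后将结果乘以 sqrt(d_model)
    return E[token_ids] * np.sqrt(d_model)

def output_logits(hidden):
    # pre-softmax 线性变换复用同一矩阵(取转置)
    return hidden @ E.T

ids = np.array([3, 17, 42])
logits = output_logits(embed(ids))   # (3, 1000),再经 softmax 得到 next-token 概率
```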

3.5 Positional Encoding 位置编码

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.
由于我们的模型不包含 recurrence 和 卷积,为了使模型利用序列的顺序,我们必须注入一些关于序列中 tokens 的相对或绝对位置的信息
To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks.
为此,我们在编码器和解码器堆栈底部的输入 embeddings 中添加了“位置编码”
The positional encodings have the same dimension $d_\text{model}$ as the embeddings, so that the two can be summed.
位置编码与 embeddings 具有相同的维数 $d_\text{model}$,因此两者可以相加。
There are many choices of positional encodings, learned and fixed [9].
位置编码有多种选择,既有习得的,也有固定的 [9]。

In this work, we use sine and cosine functions of different frequencies:
在这项工作中,我们使用了不同频率的正弦和余弦函数
$$PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$$
$$PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$$
〔 ✔ 正弦形式可以让模型外推到比训练期间遇到的序列长度更长的序列 〕
其中 pos 是位置,i 是维度。
也就是说,位置编码的每一个维度对应于一个正弦波。
波长形成一个从 2π 到 10000·2π 的几何级数。
我们选择这个函数,是因为我们假设它能让模型很容易地学会通过相对位置进行注意:对于任何固定的偏移量 k,$PE_{pos+k}$ 都可以表示为 $PE_{pos}$ 的线性函数。

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)).
我们还尝试使用习得的位置 embeddings [9] 代替,发现这两个版本产生的结果几乎相同(见表 3 行 (E) )。
We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
我们选择正弦版本是因为它可以让模型外推到比训练期间遇到的序列长度更长的序列。
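
上述正弦位置编码可以用 numpy 写成如下极简草图(仅作示意,假设 d_model 为偶数,函数名为本注释自拟):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)     # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)    # 偶数维用 sin
    pe[:, 1::2] = np.cos(angle)    # 奇数维用 cos
    return pe

pe = positional_encoding(100, 512)   # 使用时与输入 embeddings 逐位置相加
print(pe.shape)                      # (100, 512)
```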

4 Why Self-Attention

In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1,\ldots,x_n)$ to another sequence of equal length $(z_1,\ldots,z_n)$, with $x_i, z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder.
在本节中,我们将自注意层的各个方面与循环层和卷积层进行比较,这些层通常用于将一个可变长度的符号表示序列 $(x_1,\ldots,x_n)$ 映射到另一个等长序列 $(z_1,\ldots,z_n)$,其中 $x_i, z_i \in \mathbb{R}^d$,例如典型的序列 transduction 编码器或解码器中的隐藏层。
Motivating our use of self-attention we consider three desiderata.
促使我们使用自注意的,是以下三个方面的考量(desiderata)。

One is the total computational complexity per layer.
一个是每层的总计算复杂度。
Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
另一个是可以并行化的计算量,通过所需的最小顺序运算来衡量。

The third is the path length between long-range dependencies in the network.
第三个是网络中长范围依赖关系之间的路径长度。
Learning long-range dependencies is a key challenge in many sequence transduction tasks.
学习长范围依赖关系 是许多 序列 transduction 任务中的关键挑战
One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.
影响学习这种依赖关系能力的一个关键因素是网络中向前和向后信号必须经过的路径长度。
The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12].
输入和输出序列中任意位置组合之间的路径越短学习长范围依赖关系就越容易[12]。
Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
因此,我们还比较了由不同层类型组成的网络中任意两个输入和输出位置之间的最大路径长度。

As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations.
如表 1 所示,自注意层以恒定数量的顺序执行操作连接所有位置,而循环层需要 $O(n)$ 次顺序操作。
In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations.
就计算复杂度而言,当序列长度 $n$ 小于表示维数 $d$ 时,自注意层比循环层更快;机器翻译中最先进模型所使用的句子表示(如词块 [38] 和字节对 [31] 表示)大多属于这种情况。
To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size $r$ in the input sequence centered around the respective output position.
为了提高涉及非常长序列的任务的计算性能,可以将自注意限制为只考虑输入序列中以相应输出位置为中心、大小为 $r$ 的邻域。
This would increase the maximum path length to $O(n/r)$.
这会使最大路径长度增加到 $O(n/r)$。
We plan to investigate this approach further in future work.
我们计划在未来的工作中进一步研究这种方法。🌱

(表 1:不同层类型的每层复杂度、最少顺序操作数与任意两个位置之间的最大路径长度。)

A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions.
一个核宽度为 $k < n$ 的单个卷积层并不能连接所有的输入和输出位置对。〔 网络中任意两个位置之间最长路径的长度增加 〕
Doing so requires a stack of $O(n/k)$ convolutional layers in the case of contiguous kernels, or $O(\log_k(n))$ in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network.
要做到这一点,在连续核的情况下需要堆叠 $O(n/k)$ 个卷积层,在膨胀卷积 [18] 的情况下需要 $O(\log_k(n))$ 个,这会增加网络中任意两个位置之间最长路径的长度。
Convolutional layers are generally more expensive than recurrent layers, by a factor of $k$.
卷积层的开销通常比循环层大 $k$ 倍。
Separable convolutions [6], however, decrease the complexity considerably, to $O(k \cdot n \cdot d + n \cdot d^2)$.
然而,可分离卷积 [6] 可将复杂度大大降低到 $O(k \cdot n \cdot d + n \cdot d^2)$。
Even with $k = n$, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model.
然而,即使 $k = n$,可分离卷积的复杂度也等于自注意层和逐点前馈层的组合,而这正是我们在模型中采用的方法。

As side benefit, self-attention could yield more interpretable models.
作为附带好处,自注意可以产生更多可解释的模型
We inspect attention distributions from our models and present and discuss examples in the appendix.
我们从我们的模型中检查注意力分布,在附录中 给出并讨论 示例。
Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences.
不仅单独注意力头清楚地学会执行不同的任务,许多注意力头似乎表现出与句子的句法和语义结构相关的行为。

5 训练

This section describes the training regime for our models.
本节描述了我们模型的训练机制。

5.1 训练数据 和 Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs.
我们在标准的 WMT 2014 英语-德语数据集上进行训练,该数据集由大约 450 万句对 组成。
Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens.
句子使用字节对编码[3] 进行编码,具有大约 37000 个 tokens 的共享源-目标词汇表。
For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38].
对于英语-法语,我们使用了更大的 WMT 2014 英语-法语数据集,该数据集由 36M 个句子组成,并将 tokens 拆分为 32000 个单词块的词汇[38]。
Sentence pairs were batched together by approximate sequence length.
句子对按 近似序列长度 进行批处理。
Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
每个训练批包含一组句子对,其中包含大约 25000 个源 tokens 和 25000 个目标 tokens。

5.2 硬件和时间表

We trained our models on one machine with 8 NVIDIA P100 GPUs.
我们在一台带有 8 个 NVIDIA P100 GPUs 的机器上训练我们的模型。
For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds.
对于使用本文中描述的超参数的基础模型,每个训练步骤大约需要 0.4 秒。
We trained the base models for a total of 100,000 steps or 12 hours.
我们对基础模型进行了总共 10 万步或 12 小时的训练
For our big models,(described on the bottom line of table 3), step time was 1.0 seconds.
对于我们的大模型(见表 3 最后一行的描述),每步时间为 1.0 秒。
The big models were trained for 300,000 steps(3.5 days).
大模型训练了 30 万步(3.5 天)

5.3 优化器

We used the Adam optimizer [20] with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$.
我们使用 Adam 优化器 [20],其中 $\beta_1 = 0.9$,$\beta_2 = 0.98$,$\epsilon = 10^{-9}$。
We varied the learning rate over the course of training, according to the formula:
在训练过程中,我们根据以下公式改变了学习率:
$$lrate = d_\text{model}^{-0.5}\cdot\min\left(step\_num^{-0.5},\ step\_num\cdot warmup\_steps^{-1.5}\right) \qquad (3)$$
This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number.
这对应于在第一个 warmup_steps 训练步骤中线性增加学习率,然后按步数的倒数平方根成比例地降低学习率。
We used warmup_steps = 4000.
我们令 warmup_steps = 4000
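
公式 (3) 的学习率调度可以写成如下 Python 草图(仅作示意,函数名为本注释自拟):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    # 前 warmup_steps 步线性升温,之后按步数的平方根倒数衰减
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(lrate(1))        # 极小的初始学习率
print(lrate(4000))     # 峰值附近 ≈ 7.0e-4
print(lrate(100000))   # 之后缓慢衰减 ≈ 1.4e-4
```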

5.4 正则化

We employ three types of regularization during training:
我们在训练中使用三种类型的正则化

Residual Dropout
We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized.
我们将 dropout[33] 应用于每个子层的输出,然后将其添加到子层输入并标准化。
In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks.
此外,我们将 dropout 应用于编码器和解码器堆栈中的 embeddings 和位置编码之和。
For the base model, we use a rate of $P_\text{drop} = 0.1$.
对于基础模型,我们使用 $P_\text{drop} = 0.1$ 的比率。

Label Smoothing
During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ [36].
在训练过程中,我们使用了值为 $\epsilon_{ls} = 0.1$ 的标签平滑 [36]。
This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
这损害了困惑度,因为模型学会了变得更不确定,但提高了准确性和 BLEU 分数。〔评估机器翻译质量的指标, 越高越好 〕
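
下面是标签平滑目标分布的一个常见实现变体的 numpy 草图(把 $\epsilon_{ls}$ 平均分给其余 token;具体分配方式可能与 tensor2tensor 的实现略有差异,仅作示意):

```python
import numpy as np

def smoothed_targets(true_ids, vocab_size, eps=0.1):
    # 正确 token 的目标概率为 1 - eps,其余 eps 平均分给词表中的其它 token
    t = np.full((len(true_ids), vocab_size), eps / (vocab_size - 1))
    t[np.arange(len(true_ids)), true_ids] = 1.0 - eps
    return t

t = smoothed_targets(np.array([2, 0]), vocab_size=5)
print(t[0])   # [0.025 0.025 0.9 0.025 0.025],每行仍然归一化为 1
```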

6 结果

6.1 机器翻译

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4.
在 WMT 2014 英语 转 德语翻译任务中,大的 transformer 模型(表 2 中的Transformer (big))比之前报道的最佳模型(包括集成)高出 2.0 BLEU 以上,建立了新的最先进的 BLEU 分数 28.4。
The configuration of this model is listed in the bottom line of Table 3.
该模型的配置列在表 3 的最后一行。
Training took 3.5 days on 8 P100 GPUs.
训练时间为 3.5 天,使用的是 8 个 P100 GPUs。
Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
甚至我们的基础模型也超过了所有以前发表的模型和集成,训练成本只是任何竞争模型的一小部分。

(表 2:翻译质量(BLEU)与训练成本的对比。)

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model.
在 WMT 2014 英语 转 法语 翻译任务上,我们大的模型获得了 41.0 的 BLEU 分数,优于之前发布的所有单一模型,而训练成本不到之前最先进模型的 1/4。 〔 集成模型 没超过 〕
The Transformer (big) model trained for English-to-French used dropout rate Pdrop=0.1, instead of 0.3.
为英语转法语训练的 Transformer (big) 模型使用的 dropout 率为 $P_\text{drop} = 0.1$,而不是 0.3。

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals.
对于基础模型,我们使用通过平均最后 5 个检查点获得的单个模型,这些检查点每隔 10 分钟写入一次。
For the big models, we averaged the last 20 checkpoints.
对于大的模型,我们取最后 20 个检查点的平均值
We used beam search with a beam size of 4 and length penalty $\alpha = 0.6$ [38].
我们使用束搜索,束大小为 4,长度惩罚 $\alpha = 0.6$ [38]。
These hyperparameters were chosen after experimentation on the development set.
这些超参数是在开发集上实验后选择的。
We set the maximum output length during inference to input length +50, but terminate early when possible [38].
我们在推理期间将 最大输出长度 设置为 输入长度 +50,但在可能的情况下提前终止 [38]。

Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature.
表 2 总结了我们的结果,并将我们的翻译质量和训练成本与文献中的其它模型架构进行了比较。
We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU⁵.
我们通过将训练时间、使用的 GPUs 数量和每个 GPU 的持续单精度浮点运算能力的估计值相乘,来估计用于训练模型的浮点运算次数⁵。

  • ⁵ We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.
    对于 K80、K40、M40 和 P100,我们分别使用 2.8、3.7、6.0 和 9.5 TFLOPS 的值。
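
按正文给出的估计方法,可以用下面几行代码粗算大模型(8 个 P100,训练 3.5 天)的训练浮点运算量(仅作量级示意):

```python
# 训练浮点运算量 ≈ 训练时间(秒) × GPU 数量 × 每个 GPU 的持续 TFLOPS
days, gpus, flops_per_gpu = 3.5, 8, 9.5e12
total_flops = days * 24 * 3600 * gpus * flops_per_gpu
print(f"{total_flops:.1e}")   # ≈ 2.3e+19,即 10^19 量级
```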

6.2 模型变体

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013.
为了评估 Transformer 不同组件的重要性,我们以不同的方式改变了我们的基础模型,在开发集 newstest2013 上测量了英语到德语翻译的性能变化。
We used beam search as described in the previous section, but no checkpoint averaging.
We present these results in Table 3.
我们使用前一节描述的 束搜索,但没有使用检查点平均。
我们在表 3 中展示了这些结果。


Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model.
表 3:Transformer 架构的变体。未列出的值与基础模型的值相同。
All metrics are on the English-to-German translation development set, newstest2013.
所有指标都是基于英语到德语的翻译开发集 newstest2013。
Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
根据我们的字节对编码,列出的困惑度是按 wordpiece 计算的,不应与按词(per-word)计算的困惑度进行比较。

In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2.
在表 3 行(A) 中,我们在保持计算量不变的情况下,改变注意头的数量以及注意键和值维度,如 3.2.2 节所述。
While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
虽然 单头注意 比最佳设置差 0.9 BLEU,但过多的头也会降低质量。

In Table 3 rows (B), we observe that reducing the attention key size $d_k$ hurts model quality.
在表 3 行 (B) 中,我们观察到减小注意键大小 $d_k$ 会损害模型质量。
This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial.
这表明确定兼容性并不容易,一个比点积更复杂的兼容性函数可能是有益的。
We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting.
我们在行 (C) 和行 (D) 中进一步观察到,正如预期的那样,更大的模型更好,并且 dropout 对于避免过拟合非常有帮助。
In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.
在 行(E) 中,我们用习得的位置 embeddings[9] 替换 正弦位置编码,并观察到与基础模型几乎相同的结果。

6.3 English Constituency Parsing 英语成分句法分析

To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing.
为了评估 Transformer 是否可以泛化到其它任务,我们在英语成分句法分析上进行了实验。
This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input.
这项任务提出了具体的挑战:输出受到强烈的结构限制,并且比输入长得多。
Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].
此外,RNN 序列到序列模型还不能在小数据体系中获得最先进的结果 [37]。

We trained a 4-layer transformer with $d_\text{model} = 1024$ on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences.
我们在 Penn Treebank [25] 的 Wall Street Journal (WSJ) 部分(约 4 万个训练句子)上训练了一个 $d_\text{model} = 1024$ 的 4 层 transformer。
We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences [37].
我们还在半监督设置中训练它,使用更大的高置信度和 BerkleyParser 语料库,大约有 1700 万个句子。
We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.
我们仅在 WSJ 设置中使用了 16K tokens 的词汇表,在半监督设置中使用了 32K tokens 的词汇表。

We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set, all other parameters remained unchanged from the English-to-German base translation model.
我们只进行了少量的实验来选择 Section 22 开发集上的 dropout、注意 和 残差(第 5.4 节)、学习率和 束大小,所有其他参数从英语转德语的基础翻译模型保持不变。
During inference, we increased the maximum output length to input length + 300.
在推理过程中,我们将最大输出长度增加到 输入长度 + 300。
We used a beam size of 21 and $\alpha = 0.3$ for both WSJ only and the semi-supervised setting.
对于仅 WSJ 和半监督两种设置,我们都使用束大小 21 和 $\alpha = 0.3$。

Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
表 4 中的结果显示,尽管缺乏针对特定任务的调优,我们的模型的性能出奇地好,除了 Recurrent Neural Network Grammar [8] 之外,它的结果比之前报道的所有模型都要好。

(表 4:英语成分句法分析结果。)

In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences.
与 RNN 序列到序列模型[37]相比,Transformer 即使只在包含 40K 个句子的 WSJ 训练集上训练,其性能也优于 BerkeleyParser[29]。

7 结论

【 一句话介绍自己的工作:亮点 + 主要 idea】

In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
在这项工作中,我们提出了 Transformer,第一个完全基于注意的序列转导模型,用 多头自注意 取代了 编码器-解码器架构中最常用的 循环层

【 优势 (训练更快) + 测试基准的结果 (实现 SOTA)】

For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
对于翻译任务,Transformer 的训练速度明显快于基于循环层或卷积层的架构。
On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art.
在 WMT 2014 的英语到德语 和 WMT 2014 的英语到法语翻译任务上,我们都达到了一个新的最先进水平。
In the former task our best model outperforms even all previously reported ensembles.
在前一个任务中,我们的最佳模型甚至优于所有先前报道的模型的集成。

【 后续的研究计划 】

We are excited about the future of attention-based models and plan to apply them to other tasks.
我们对基于注意的模型的未来感到兴奋,并计划将其应用于其它任务。
We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.
我们计划将 Transformer 扩展到涉及文本以外的输入和输出模态的问题,并研究局部的、受限的注意机制,以高效地处理大量的输入和输出,如图像、音频和视频〔 可参考 diffusion transformer (DiT) 〕
Making generation less sequential is another research goal of ours.
让生成过程不那么依赖顺序计算,是我们的另一个研究目标。〔 即减少生成时的顺序性,而不是逐个位置地自回归生成 〕

The code we used to train and evaluate our models is available at https://github.com/tensorflow/tensor2tensor.
我们用来训练和评估模型的代码可以在 https://github.com/tensorflow/tensor2tensor 上找到。

致谢

Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.
我们感谢 Nal Kalchbrenner 和 Stephan Gouws 富有成效的评论、更正和启发。

参考文献

注意可视化

(附录图:若干注意力分布的可视化示例。)
