[transformer] Attention is all you need

2025/3/13 6:33:20 Source: https://blog.csdn.net/sinat_30618203/article/details/140462911

1. Purpose

Propose a new network architecture that uses neither CNNs nor RNNs and relies solely on self-attention.

2. Method

* Norm: Layer Normalization

1）Encoder

-> The k, v, and q of self-attention all come from the output of the previous encoder layer.

2）Decoder

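-> Because the prediction at position i may depend only on the outputs before i (auto-regressive), the output embeddings are offset by one position. Masked multi-head attention masks out the illegal connections in the input to the softmax.

-> In encoder-decoder attention, q comes from the previous decoder layer, while k and v come from the encoder output.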
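The masking described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the paper's implementation; the helper names `causal_mask` and `masked_softmax` and the toy score matrix are assumptions made for this example. Illegal connections (key position j > query position i) are set to negative infinity before the softmax, so they receive zero weight.

```python
import math

def causal_mask(n):
    """n x n additive mask: 0.0 where j <= i (allowed), -inf where j > i (illegal)."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(scores):
    """Add the causal mask to the raw scores, then apply softmax row by row."""
    mask = causal_mask(len(scores))
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]

# Hypothetical raw attention scores for a 3-token sequence.
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
weights = masked_softmax(scores)
```

Here `weights[0]` is `[1.0, 0.0, 0.0]`: the first position can attend only to itself, while the last row attends to all three positions.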

3）Attention

-> Scaled Dot-Product Attention
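In the original paper, scaled dot-product attention is defined as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d_k})\,V$. The following is a minimal sketch of that formula on plain Python lists; the function name and the toy matrices are assumptions made for illustration, and a real implementation would use a tensor library.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)])
    return output

# Toy inputs (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the rows of `V`; the first query matches the first key more closely, so its output lies closer to `V[0]`.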

-> Multi-Head Attention

Because each head has a reduced dimension, the total computational cost is similar to that of single-head attention with full dimensionality.
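A rough count makes the cost argument concrete. The sketch below uses the base configuration from the original paper ($d_{model} = 512$, $h = 8$, so $d_k = d_v = 64$); the sequence length `n` and the function name are assumptions for this example. It counts only the multiplications in $QK^T$ and in the weighted sum over $V$, ignoring the linear projections and the softmax.

```python
def attention_mults(n, d):
    """Multiplications for QK^T (n*n*d) plus weights @ V (n*n*d) with dimension d."""
    return 2 * n * n * d

d_model = 512      # model dimension (base configuration in the paper)
h = 8              # number of heads
d_k = d_model // h # per-head dimension
n = 10             # hypothetical sequence length

single_head = attention_mults(n, d_model)
multi_head = h * attention_mults(n, d_k)
```

Under this simplified count the two totals are identical, because $h \cdot d_k = d_{model}$.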

4）Point-wise Feed-Forward Network
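In the original paper this sub-layer is $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, applied to each position separately with the same parameters. A minimal sketch, assuming tiny hypothetical weights (the paper uses $d_{model} = 512$ and an inner dimension of 2048):

```python
def linear(x, W, b):
    """x: vector of length d_in, W: d_in x d_out, b: vector of length d_out."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between."""
    hidden = [max(0.0, value) for value in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same FFN independently to every position (row) of X."""
    return [ffn(x, W1, b1, W2, b2) for x in X]

# Hypothetical toy parameters: d_model = 2, inner dimension = 3.
W1 = [[1.0, -1.0, 0.5],
      [0.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.1, -0.1]

X = [[1.0, 2.0], [1.0, 2.0], [-1.0, 0.0]]  # three positions
Y = position_wise_ffn(X, W1, b1, W2, b2)
```

Positions 0 and 1 hold the same input vector and therefore produce the same output, which is what "position-wise" means here: no information flows between positions inside this sub-layer.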

5）Embeddings and softmax

The two embedding layers and the pre-softmax linear transformation share parameters. The embedding-layer weights are additionally multiplied by $\sqrt{d_{model}}$.
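The weight sharing can be sketched as one matrix used in two places. Everything below is a toy illustration: the matrix `E`, its values, and the function names are hypothetical, and the two embedding layers are represented by a single lookup function.

```python
import math

d_model = 4
# Hypothetical shared weight matrix: one row of d_model values per vocabulary entry.
E = [[0.0, 0.0, 0.0, 0.0],
     [0.1, 0.2, 0.3, 0.4],
     [0.4, 0.3, 0.2, 0.1]]

def embed(token_id):
    """Embedding lookup, with the weights scaled by sqrt(d_model)."""
    scale = math.sqrt(d_model)
    return [scale * w for w in E[token_id]]

def output_logits(hidden):
    """Pre-softmax linear transformation that reuses E: logits = hidden @ E^T."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]

x = embed(1)
logits = output_logits([0.1, 0.2, 0.3, 0.4])
```

Because the output projection reuses `E`, a hidden state that points in the direction of a token's embedding row receives a high logit for that token.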

6）Positional Encoding

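$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$

where pos is the position and i is the dimension.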
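A minimal sketch of the sinusoidal positional encoding from the original paper, in which even columns hold a sine and odd columns hold a cosine of the same angle. The function name and the small sizes are assumptions for this example.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # k is the even column index (the "2i" in the paper's formula).
        for k in range(0, d_model, 2):
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)          # PE(pos, 2i)
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)  # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=8, d_model=4)
```

At position 0 every sine column is 0 and every cosine column is 1, so `pe[0]` is `[0.0, 1.0, 0.0, 1.0]`. The table is added to the token embeddings before the first layer.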

For an analysis of positional encoding, see "让研究人员绞尽脑汁的Transformer位置编码 - 科学空间|Scientific Spaces".

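3. Advantages

All positions can be connected with a constant number of sequentially executed operations. When the sequence length n is smaller than the representation dimension d, self-attention is faster than a recurrent layer; for very long sentences, self-attention can be restricted to a neighborhood of size r.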
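The comparison follows from the per-layer complexities given in the original paper: $O(n^2 \cdot d)$ for self-attention, $O(n \cdot d^2)$ for a recurrent layer, and $O(r \cdot n \cdot d)$ for self-attention restricted to a neighborhood of size r. The sketch below plugs numbers into these leading terms; the constants are dropped and the specific values of n, d, and r are assumptions chosen for illustration.

```python
def self_attention_cost(n, d):
    """Leading term of per-layer cost for full self-attention: n^2 * d."""
    return n * n * d

def recurrent_cost(n, d):
    """Leading term of per-layer cost for a recurrent layer: n * d^2."""
    return n * d * d

def restricted_self_attention_cost(n, d, r):
    """Leading term when each position attends only to r neighbors: r * n * d."""
    return r * n * d

d = 512
short_n, long_n, r = 50, 2000, 64  # hypothetical sequence lengths and window size

short_attn = self_attention_cost(short_n, d)   # n < d: attention is cheaper
short_rnn = recurrent_cost(short_n, d)
long_attn = self_attention_cost(long_n, d)     # n > d: attention is more expensive
long_rnn = recurrent_cost(long_n, d)
long_restricted = restricted_self_attention_cost(long_n, d, r)
```

With n = 50 and d = 512, self-attention costs 1,280,000 versus 13,107,200 for the recurrent layer. With n = 2000 the order flips, and restricting attention to r = 64 neighbors brings the cost back below both.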
