[Deep Learning] Common Positional Encodings for the Transformer

Positional encoding compensates for the fact that self-attention, on its own, ignores the order of tokens in a sequence. The common schemes are the following:

1. Absolute Positional Encoding

Absolute positional encoding goes back to the original Transformer paper ("Attention Is All You Need"). It adds to each token's embedding a fixed vector that is determined solely by the token's position index.

(1) Sinusoidal Positional Encoding

Formula:
$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$
$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$
where:

$pos$ is the index of the token in the sequence,
$i$ indexes the sine/cosine dimension pairs ($0 \le i < d_{\text{model}}/2$),
$d_{\text{model}}$ is the model dimension (e.g. 512).

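🔹 Properties:

The positional encoding is added to the token embedding before the first Transformer layer, so every token carries global position information.
The formula is defined for every position, so encodings for lengths never seen in training can still be computed; how well a trained model actually generalizes to them is limited in practice.
Relative position information is preserved as well: for a fixed offset $k$, $PE_{pos+k}$ is a linear function (a rotation) of $PE_{pos}$, and neighboring tokens receive similar encodings.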

🔹 Code (PyTorch):

    import math
    import torch

    def positional_encoding(seq_len, d_model):
        # d_model is assumed to be even so that the sin/cos columns pair up
        pe = torch.zeros(seq_len, d_model)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
        # 1 / 10000^(2i / d_model), computed in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        return pe

    pos_encoding = positional_encoding(50, 512)  # 50 positions, 512 dimensions
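To make the relative-position property concrete, the following sketch checks numerically that the dot product of two sinusoidal encodings depends only on the offset between the positions. The helper name `sinusoidal_pe`, the sizes, and the random "token embeddings" are illustrative assumptions, not part of any particular model.

```python
import math
import torch

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal table of shape (seq_len, d_model); d_model is assumed even."""
    position = torch.arange(seq_len, dtype=torch.float64).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float64) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)

# sin(a)sin(b) + cos(a)cos(b) = cos(a - b), so the dot product of two
# encodings is a sum of cos(offset * frequency) terms: it depends on the
# offset between the positions, not on the positions themselves.
offset = 5
dot_a = pe[10] @ pe[10 + offset]
dot_b = pe[90] @ pe[90 + offset]
same_offset_same_dot = torch.allclose(dot_a, dot_b)

# Typical use: add the table to a batch of token embeddings (random here).
token_emb = torch.randn(2, 128, 64, dtype=torch.float64)
x = token_emb + pe.unsqueeze(0)  # broadcasts over the batch dimension
```

The check uses `float64` only to keep rounding error out of the comparison; a real model would normally build the table in its working precision.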

2. Relative Positional Encoding (RPE)

Absolute encodings tie each token to a fixed position, but what often matters more is the relative position between tokens, for example in dependency relations in NLP tasks.

(1) Relative Position Bias

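T5 uses this approach; Transformer-XL uses a related, more elaborate scheme that also injects relative-position terms into the attention score:
$\alpha_{ij} = \frac{q_i k_j^T}{\sqrt{d_k}} + b_{ij}$
where:

$q_i, k_j$ are the Query at position $i$ and the Key at position $j$,
$b_{ij}$ is a bias term that depends only on the relative position $j - i$.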

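🔹 Properties:

No positional vector is added to the input; the only change is a bias added to the attention scores.
Because the bias depends only on the offset between Query and Key, the same parameters apply at every absolute position, which improves generalization in architectures such as Transformer-XL and T5.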

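🔹 Code (PyTorch, simplified T5 style without distance bucketing):

    import torch

    class RelativePositionBias(torch.nn.Module):
        def __init__(self, num_heads, max_position=512):
            super().__init__()
            # one learnable bias per head for each offset in [-(max_position - 1), max_position - 1]
            self.relative_bias = torch.nn.Embedding(2 * max_position - 1, num_heads)

        def forward(self, qlen, klen):
            device = self.relative_bias.weight.device
            context_position = torch.arange(qlen, dtype=torch.long, device=device)[:, None]
            memory_position = torch.arange(klen, dtype=torch.long, device=device)[None, :]
            # shift the offsets j - i so that they become non-negative table indices
            relative_position = memory_position - context_position + (self.relative_bias.num_embeddings // 2)
            # clamp so that offsets beyond max_position reuse the outermost entries
            relative_position = relative_position.clamp(0, self.relative_bias.num_embeddings - 1)
            return self.relative_bias(relative_position)  # (qlen, klen, num_heads)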
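The formula for $\alpha_{ij}$ can be sketched end to end as follows: a small bias table is looked up by relative offset and added to the scaled dot-product scores before the softmax. The sizes, the clamping distance `max_distance`, and the random Query/Key tensors are assumptions chosen for illustration only.

```python
import math
import torch

torch.manual_seed(0)
num_heads, seq_len, d_k, max_distance = 2, 6, 8, 4

# One learnable scalar per (relative offset, head). Offsets are clamped to
# [-max_distance, max_distance], so the table has 2 * max_distance + 1 rows.
bias_table = torch.nn.Embedding(2 * max_distance + 1, num_heads)

pos = torch.arange(seq_len)
rel = (pos[None, :] - pos[:, None]).clamp(-max_distance, max_distance)  # entry (i, j) holds j - i
b = bias_table(rel + max_distance).permute(2, 0, 1)                     # (heads, query, key)

q = torch.randn(num_heads, seq_len, d_k)
k = torch.randn(num_heads, seq_len, d_k)

# alpha_ij = q_i . k_j / sqrt(d_k) + b_ij, followed by a softmax over the keys.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k) + b
attn = torch.softmax(scores, dim=-1)
```

Every pair of positions with the same offset shares one bias value, which is why the table size is independent of the sequence length.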

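3. Learnable Positional Encoding

Models such as BERT and ViT (Vision Transformer) do not use a fixed formula; they let the model learn a suitable positional encoding:
$\text{Embedding}(\text{position}, d_{\text{model}})$

A trainable embedding layer (analogous to a word embedding) maps each position index to a vector, so the model learns the position representation itself.
During training the positional vectors are updated together with the rest of the model, which gives more flexible, task-specific position information.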

🔹 Code (PyTorch):

    import torch.nn as nn

    class LearnablePositionalEncoding(nn.Module):
        def __init__(self, max_seq_len, d_model):
            super().__init__()
            # one trainable d_model-dimensional vector per position index
            self.pos_embedding = nn.Embedding(max_seq_len, d_model)

        def forward(self, positions):
            # positions: integer tensor with values in [0, max_seq_len)
            return self.pos_embedding(positions)

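🔹 Pros and cons:

✅ Flexible; adapts to different tasks.
❌ No extrapolation: positions at or beyond the maximum length have no learned vector, so sequence lengths not seen in training may not generalize.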
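The extrapolation limit is easy to see in a minimal sketch: a learned table with `max_seq_len` rows simply has no entry for a longer sequence. The sizes below are arbitrary assumptions, and the behavior shown is that of `nn.Embedding` on CPU.

```python
import torch
import torch.nn as nn

max_seq_len, d_model = 16, 32
pos_embedding = nn.Embedding(max_seq_len, d_model)

# Positions covered by the table: indices 0 .. max_seq_len - 1 work as usual.
ok = pos_embedding(torch.arange(max_seq_len))  # shape (16, 32)

# A longer sequence needs index 16, which has no row in the table.
try:
    pos_embedding(torch.arange(max_seq_len + 1))
    out_of_range_failed = False
except IndexError:
    out_of_range_failed = True
```

In practice this is why models with learned positions are usually run with inputs truncated or chunked to the maximum length they were trained with.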

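4. Rotary Positional Embeddings (RoPE)

RoPE is the prevailing choice in modern large language models such as LLaMA, GPT-NeoX, and PaLM. It introduces relative position information through a rotation.

Core idea

Position is folded into the dot product between Query and Key instead of being added explicitly to the token embedding.
Each pair of dimensions is treated as a complex number and rotated by an angle proportional to the token's position.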

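Formula (simplified):
$\langle q_m e^{i m\theta},\ k_n e^{i n\theta} \rangle = \mathrm{Re}\left[ q_m \overline{k_n}\, e^{i(m-n)\theta} \right]$
where:

$q_m, k_n$ are a Query at position $m$ and a Key at position $n$, with each pair of adjacent dimensions read as one complex number,
$\theta$ is the rotation frequency of that pair; pair $j$ uses $\theta_j = 10000^{-2j/d}$, the same frequencies as the sinusoidal encoding,
the score depends on the positions only through the offset $m - n$, which is how rotating by absolute position embeds relative position information.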

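🔹 Code (simplified):

    import torch

    def rotary_positional_embedding(x, theta):
        # x: (..., d) with d even; theta: rotation angles broadcastable to (..., d / 2)
        cos_theta = torch.cos(theta)
        sin_theta = torch.sin(theta)
        x1, x2 = x[..., ::2], x[..., 1::2]  # the two components of each pair
        # 2-D rotation of every (x1, x2) pair by its angle
        rotated = torch.stack([x1 * cos_theta - x2 * sin_theta, x1 * sin_theta + x2 * cos_theta], dim=-1)
        return rotated.flatten(-2)  # interleave back into the original (..., d) layout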
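The claim that RoPE turns absolute rotations into a relative-position score can be checked numerically. The sketch below is a minimal illustration under assumed names (`rope_angles`, `apply_rope`, `score`) and an assumed base of 10000; it rotates one Query and one Key at different positions and compares the resulting scores.

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    """Angles pos * theta_j with theta_j = base^(-2j/dim); shape (len, dim / 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float64) / dim))
    return positions.to(torch.float64)[:, None] * inv_freq[None, :]

def apply_rope(x, angles):
    """Rotate each (even, odd) pair of x by the matching angle."""
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)  # re-interleave the pairs

torch.manual_seed(0)
dim = 16
q = torch.randn(dim, dtype=torch.float64)
k = torch.randn(dim, dtype=torch.float64)

def score(m, n):
    """Attention logit when q sits at position m and k at position n."""
    ang = rope_angles(torch.tensor([m, n]), dim)
    return apply_rope(q, ang[0]) @ apply_rope(k, ang[1])

# The same offset (m - n = 3) at different absolute positions gives the same score.
relative_only = torch.allclose(score(5, 2), score(103, 100))

# A rotation never changes the length of the vector it is applied to.
norm_preserved = torch.allclose(apply_rope(q, rope_angles(torch.tensor([7]), dim)[0]).norm(), q.norm())
```

Production implementations typically precompute the cosine and sine tables once and apply them to whole batches of Queries and Keys; the per-vector form here is only meant to make the property visible.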

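🔹 Pros and cons

✅ Encodes relative position directly in the attention score, with no extra parameters, and generalizes better than absolute encodings.
✅ Widely used in long-context models such as LLaMA; going far beyond the training length usually still requires rescaling the rotation frequencies (e.g. position interpolation).
❌ Must be applied to Query and Key in every attention layer, so it costs somewhat more than adding a fixed encoding once at the input.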

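5. Hybrid Positional Encoding

Some methods replace or combine the schemes above:

ALiBi (Attention with Linear Biases): drops explicit positional vectors and instead subtracts from each attention score a head-specific penalty proportional to the Query-Key distance. Strictly speaking it is a relative bias method rather than a hybrid.
T5-style relative bias + RoPE: the two can in principle be combined for autoregressive tasks, since the bias is added to the attention scores while RoPE rotates Query and Key. LLaMA-2 itself uses RoPE alone.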
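The ALiBi bias described above can be sketched in a few lines. This is a simplified causal version that assumes the number of heads is a power of two, in which case the per-head slopes form the geometric sequence $2^{-8/n}, 2^{-16/n}, \dots, 2^{-8}$; the function name `alibi_bias` and the all-zero stand-in scores are illustrative assumptions.

```python
import torch

def alibi_bias(num_heads, seq_len):
    """Causal ALiBi bias of shape (num_heads, seq_len, seq_len).

    Assumes num_heads is a power of two, so the slopes are
    2^(-8/n), 2^(-16/n), ..., 2^(-8) for n = num_heads.
    """
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # i - j for keys at or before the query; future keys are masked out later
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)
    return -slopes[:, None, None] * distance[None, :, :].float()

bias = alibi_bias(num_heads=8, seq_len=5)

# The bias is added to the q.k scores (zeros here) before masking and softmax.
scores = torch.zeros(8, 5, 5) + bias
causal_mask = torch.tril(torch.ones(5, 5, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)
```

Because the penalty grows linearly with distance and has no learned parameters tied to a maximum length, the same function produces a valid bias for any `seq_len`.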

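Summary

| Method | Position type | Mechanism | Extrapolates to longer sequences? | Typical use |
|---|---|---|---|---|
| Sinusoidal PE | Absolute | Fixed formula | ✅ in principle, limited in practice | Original Transformer |
| Relative position bias (Relative PE) | Relative | Bias on attention scores | ✅ | T5 (Transformer-XL uses a related scheme) |
| Learnable PE | Absolute | Trained embedding table | ❌ | BERT, ViT |
| RoPE (rotary positional embeddings) | Relative | Rotation of Query and Key | ✅ usually with frequency rescaling | LLaMA, GPT-NeoX, PaLM |
| ALiBi (Attention with Linear Biases) | Relative | Linear distance penalty | ✅ | Long-context tasks |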