How to Calculate the KV Cache Size

Notation

First, define the symbols:

$B$: batch size
$L$: sequence length, i.e. the number of tokens whose keys and values are cached; in your question, $L = 1$
$N$: number of Transformer layers
$H$: number of attention heads
$D$: dimension per attention head, i.e. $\text{Hidden Size} / H$
$S$: size of the data type in bytes. For example:
FP32 (32-bit floating point): $S = 4$ bytes
FP16 (16-bit floating point): $S = 2$ bytes

Memory Calculation for the KV Cache

Each layer's multi-head attention needs to store a cache of keys and a cache of values. For a single layer, the two caches have the following sizes:

Key cache size:

$\text{Size}_{\text{Key}} = B \times H \times L \times D \times S$

Value cache size:

$\text{Size}_{\text{Value}} = B \times H \times L \times D \times S$

The total KV cache size per layer is therefore:

$\text{Size}_{\text{KV per layer}} = \text{Size}_{\text{Key}} + \text{Size}_{\text{Value}} = 2 \times B \times H \times L \times D \times S$

Since the model has $N$ layers, the total KV cache size is:

$\text{Total Size}_{\text{KV}} = N \times \text{Size}_{\text{KV per layer}} = 2 \times B \times H \times L \times D \times N \times S$
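The total-size formula above can be sketched as a small Python helper. The function name `kv_cache_bytes` and the argument values in the usage line are illustrative assumptions, not part of the original derivation; the body is a direct transcription of $2 \times B \times H \times L \times D \times N \times S$.

```python
def kv_cache_bytes(batch_size, num_heads, seq_len, head_dim, num_layers, dtype_bytes):
    """Total KV cache size in bytes: 2 * B * H * L * D * N * S."""
    # Keys and values each take B * H * L * D * S bytes in one layer,
    # hence the factor of 2.
    per_layer = 2 * batch_size * num_heads * seq_len * head_dim * dtype_bytes
    # Every one of the N layers keeps its own key/value cache.
    return num_layers * per_layer


# Hypothetical values, chosen only to exercise the function:
# B=2, H=4, L=8, D=16, N=3, S=2 (FP16).
example_bytes = kv_cache_bytes(2, 4, 8, 16, 3, 2)
print(example_bytes)
```

Because the formula is a plain product, the result scales linearly in every argument: doubling the sequence length or the batch size doubles the cache.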

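Worked Example

Assume the following parameters:

Batch size: $B = 1$

Sequence length: $L = 1$ (i.e. a single token)

Number of layers: $N = 12$ (for example, a small Transformer)

Hidden size: $\text{Hidden Size} = 768$

Number of attention heads: $H = 12$

Dimension per head:

$D = \frac{\text{Hidden Size}}{H} = \frac{768}{12} = 64$

Data type: FP32, $S = 4$ bytes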

Now compute the KV cache size per layer:

$\begin{aligned} \text{Size}_{\text{KV per layer}} &= 2 \times B \times H \times L \times D \times S \\ &= 2 \times 1 \times 12 \times 1 \times 64 \times 4 \\ &= 2 \times 12 \times 64 \times 4 \\ &= 6144\ \text{bytes} \end{aligned}$

Total KV cache size:

$\begin{aligned} \text{Total Size}_{\text{KV}} &= N \times \text{Size}_{\text{KV per layer}} \\ &= 12 \times 6144 \\ &= 73728\ \text{bytes} \end{aligned}$

That is exactly 72 KiB (73,728 / 1,024).
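As a quick check, the arithmetic for this small configuration can be reproduced in a few lines of Python. The variable names are illustrative assumptions; the numbers are the ones given in the example.

```python
# Parameters from the small example.
B, L, N, H = 1, 1, 12, 12
hidden_size = 768
S = 4  # FP32, bytes per element

D = hidden_size // H                     # dimension per head
per_layer_bytes = 2 * B * H * L * D * S  # keys + values for one layer
total_bytes = N * per_layer_bytes        # all layers
total_kib = total_bytes / 1024           # bytes -> KiB

print(D, per_layer_bytes, total_bytes, total_kib)  # 64 6144 73728 72.0
```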

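This number looks small, but the parameters are much larger in big models. For example, consider a large model with the following parameters (keeping $B = 1$ and $L = 1$):

Number of layers: $N = 96$

Hidden size: $\text{Hidden Size} = 12288$

Number of attention heads: $H = 96$

Dimension per head:

$D = \frac{12288}{96} = 128$

Data type: FP16, $S = 2$ bytes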

Compute the KV cache size per layer:

$\begin{aligned} \text{Size}_{\text{KV per layer}} &= 2 \times B \times H \times L \times D \times S \\ &= 2 \times 1 \times 96 \times 1 \times 128 \times 2 \\ &= 2 \times 96 \times 128 \times 2 \\ &= 49{,}152\ \text{bytes} \end{aligned}$

Total KV cache size:

$\begin{aligned} \text{Total Size}_{\text{KV}} &= N \times \text{Size}_{\text{KV per layer}} \\ &= 96 \times 49{,}152 \\ &= 4{,}718{,}592\ \text{bytes} \end{aligned}$

That is 4.5 MiB for a single cached token at batch size 1.
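The large-model figure is for a single cached token, and the formula is linear in $L$, so the cache grows in proportion to the number of tokens kept. The sketch below recomputes the per-token total and then evaluates the same formula at a hypothetical sequence length of 2048; that length and the helper name `total_kv_bytes` are assumptions chosen purely for illustration.

```python
# Parameters from the large example.
B, N, H = 1, 96, 96
hidden_size = 12288
S = 2  # FP16, bytes per element
D = hidden_size // H  # 128


def total_kv_bytes(seq_len):
    """Total KV cache in bytes for seq_len cached tokens: 2*B*H*L*D*N*S."""
    return 2 * B * H * seq_len * D * N * S


per_token_bytes = total_kv_bytes(1)
long_seq_bytes = total_kv_bytes(2048)  # hypothetical sequence length

print(per_token_bytes / 2**20)  # MiB for one token
print(long_seq_bytes / 2**30)   # GiB for 2048 tokens
```

Under these assumptions, one token costs 4.5 MiB and 2048 tokens cost 9 GiB, which is why the KV cache becomes a significant memory consumer for long sequences in large models.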