(Multi-head attention implemented by splitting in code) Self-Attention and Multi-Head Attention in the Transformer, Explained

冷不防 2022-09-11 08:11

Original post: [https://blog.csdn.net/qq\_37541097/article/details/117691873][https_blog.csdn.net_qq_37541097_article_details_117691873]

Paper title: Attention Is All You Need  
Paper link: [https://arxiv.org/abs/1706.03762][https_arxiv.org_abs_1706.03762]

If you would rather not read the article, you can watch the video I recorded on Bilibili: [https://b23.tv/gucpvt][https_b23.tv_gucpvt]

Transformers have recently become very popular in computer vision. The Transformer was published by Google in 2017 under the `Computation and Language` category and was originally proposed for natural language processing. (Earlier RNN models had a limited memory length and could not be parallelized: the data at time $t_{i+1}$ can only be computed after the data at time $t_i$ has been processed. The Transformer addresses both problems.) In that paper the authors present the concept of `Self-Attention` and then build `Multi-Head Attention` on top of it, so this post explains the theory behind `Self-Attention` and `Multi-Head Attention` in detail. Before reading on, I recommend watching Professor Hung-yi Lee's (李宏毅) lecture on the Transformer. This post is a summary based on that lecture plus what I learned from reading some source code.

--------------------

### Contents ###

* Preface
* Self-Attention
* Multi-Head Attention
* Computational cost of Self-Attention vs. Multi-Head Attention
* Positional Encoding

--------------------

# Preface #

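If you have searched online for material on self-attention or the Transformer, most of it simply reproduces a few figures and formulas from the original paper, like the figure below. The explanations are quite abstract and, for me at least, impossible to follow (maybe I am just not good enough). As Professor Hung-yi Lee says in his course, "those who don't understand won't understand no matter how long they look at it." So this post combines the content of his lecture with the formulas from the paper and explains them one by one.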

![attention is all you need][]

--------------------

# Self-Attention #

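I drew the figure below myself. To keep things easy to follow, assume the input sequence has length 2, i.e. there are only two input nodes $x_1, x_2$. The Input Embedding, shown as $f(x)$ in the figure, maps the inputs to $a_1, a_2$. Then $a_1$ and $a_2$ are each passed through three transformation matrices $W^q, W^k, W^v$ (these parameters are trainable and shared across all positions) to obtain the corresponding $q^i, k^i, v^i$. (In source code this is implemented directly with fully connected layers; the bias is ignored here for simplicity.)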

![self-attention][]

where

* $q$ stands for query; it will later be matched against every $k$
* $k$ stands for key; it will later be matched by every $q$
* $v$ stands for the information extracted from $a$
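The shared projections can be sketched in PyTorch. This is a minimal illustration, assuming an embedding size of 4 and bias-free `nn.Linear` layers; the names `to_q`, `to_k`, `to_v` and all sizes are hypothetical and not taken from any particular source code. It shows that applying the same layer to each $a_i$ separately gives the same rows as applying it once to the stacked sequence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dim = 4      # embedding size (assumed for illustration)
seq_len = 2  # two input nodes a_1, a_2, as in the figure

# One bias-free linear layer per role; each is shared by every position.
to_q = nn.Linear(dim, dim, bias=False)
to_k = nn.Linear(dim, dim, bias=False)
to_v = nn.Linear(dim, dim, bias=False)

a = torch.rand(seq_len, dim)  # rows are a_1, a_2

# Project the whole sequence at once.
q, k, v = to_q(a), to_k(a), to_v(a)

# Projecting each token on its own gives the same rows.
q_per_token = torch.stack([to_q(a[i]) for i in range(seq_len)])

print(q.shape)                                     # torch.Size([2, 4])
print(torch.allclose(q, q_per_token, atol=1e-6))   # True
```

Because the weights do not depend on the position, the per-token loop is never needed in practice; the stacked form is what makes the computation parallel.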

Suppose $a_1=(1, 1)$, $a_2=(1, 0)$ and $W^q= \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$. Then:

$$
q^1 = (1, 1) \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} = (1, 2), \qquad q^2 = (1, 0) \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} = (1, 1)
$$

As mentioned earlier, the Transformer can be parallelized, so this can be written directly as:
$$
\begin{pmatrix} q^1 \\ q^2 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 1 & 1 \end{pmatrix}
$$

In the same way we obtain $\begin{pmatrix} k^1 \\ k^2 \end{pmatrix}$ and $\begin{pmatrix} v^1 \\ v^2 \end{pmatrix}$. The matrix $\begin{pmatrix} q^1 \\ q^2 \end{pmatrix}$ is the $Q$ of the original paper, $\begin{pmatrix} k^1 \\ k^2 \end{pmatrix}$ is $K$, and $\begin{pmatrix} v^1 \\ v^2 \end{pmatrix}$ is $V$. Next, take $q^1$ and match it against every $k$ with a dot product, then divide by $\sqrt{d}$, where $d$ is the length of each $k$ vector (2 in this example), to obtain the corresponding $\alpha_{1, i}$:
$$
\begin{aligned}
\alpha_{1, 1} &= \frac{q^1 \cdot k^1}{\sqrt{d}}=\frac{1\times 1+2\times 0}{\sqrt{2}}=0.71 \\
\alpha_{1, 2} &= \frac{q^1 \cdot k^2}{\sqrt{d}}=\frac{1\times 0+2\times 1}{\sqrt{2}}=1.41
\end{aligned}
$$

Likewise, matching $q^2$ against every $k$ gives $\alpha_{2, i}$. Written as a single matrix multiplication:
$$
\begin{pmatrix} \alpha_{1, 1} & \alpha_{1, 2} \\ \alpha_{2, 1} & \alpha_{2, 2} \end{pmatrix}=\frac{\begin{pmatrix} q^1 \\ q^2 \end{pmatrix}\begin{pmatrix} k^1 \\ k^2 \end{pmatrix}^T}{\sqrt{d}}
$$

Softmax is then applied to each row, $(\alpha_{1, 1}, \alpha_{1, 2})$ and $(\alpha_{2, 1}, \alpha_{2, 2})$, to obtain $(\hat\alpha_{1, 1}, \hat\alpha_{1, 2})$ and $(\hat\alpha_{2, 1}, \hat\alpha_{2, 2})$. Here $\hat{\alpha}$ is the weight assigned to each $v$. At this point we have completed the ${\rm softmax}(\frac{QK^T}{\sqrt{d_k}})$ part of the ${\rm Attention}(Q, K, V)$ formula.

![self-attention][self-attention 1]  
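The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: $k^1=(1,0)$ and $k^2=(0,1)$ are read off the numerators of the $\alpha$ formulas rather than stated explicitly in the text, and the helper names `matmul` and `softmax` are made up for this example.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Softmax over one list of scores."""
    exps = [math.exp(x - max(row)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

A = [[1, 1], [1, 0]]    # rows are a_1, a_2
W_q = [[1, 1], [0, 1]]  # W^q from the text
K = [[1, 0], [0, 1]]    # rows are k^1, k^2 (inferred from the alpha formulas)
d = 2                   # length of each k vector

Q = matmul(A, W_q)                    # rows are q^1, q^2
K_T = [list(col) for col in zip(*K)]  # transpose of K
scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, K_T)]
weights = [softmax(row) for row in scores]

print(Q)                                                # [[1, 2], [1, 1]]
print([[round(s, 2) for s in row] for row in scores])   # [[0.71, 1.41], [0.71, 0.71]]
print([[round(w, 2) for w in row] for row in weights])  # [[0.33, 0.67], [0.5, 0.5]]
```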
We have now computed $\hat\alpha$, the weight for each $v$. Taking the weighted sum gives the final result:

$$
\begin{aligned}
b_1 &= \hat{\alpha}_{1, 1} \times v^1 + \hat{\alpha}_{1, 2} \times v^2=(0.33, 0.67) \\
b_2 &= \hat{\alpha}_{2, 1} \times v^1 + \hat{\alpha}_{2, 2} \times v^2=(0.50, 0.50)
\end{aligned}
$$

Written as a single matrix multiplication:

$$
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \hat\alpha_{1, 1} & \hat\alpha_{1, 2} \\ \hat\alpha_{2, 1} & \hat\alpha_{2, 2} \end{pmatrix}\begin{pmatrix} v^1 \\ v^2 \end{pmatrix}
$$

<img src="https://img-blog.csdnimg.cn/2021060816151080.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM3NTQxMDk3,size_16,color_FFFFFF,t_70#pic_center" alt="self-attention">

That completes `Self-Attention`. It can be summarized by a single formula from the paper:

$$
{\rm Attention}(Q, K, V)={\rm softmax}(\frac{QK^T}{\sqrt{d_k}})V
$$
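As a rough sketch, the whole formula fits in a few lines of PyTorch. The function name `attention` is a hypothetical choice, and $K$ and $V$ below are assumed to be identity matrices, which is consistent with the numbers in the worked example but not stated in the text.

```python
import math
import torch

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V

Q = torch.tensor([[1., 2.], [1., 1.]])  # rows are q^1, q^2
K = torch.tensor([[1., 0.], [0., 1.]])  # rows are k^1, k^2 (assumed)
V = torch.tensor([[1., 0.], [0., 1.]])  # rows are v^1, v^2 (assumed)

B = attention(Q, K, V)
print(B)  # rows are b_1, b_2: approximately (0.33, 0.67) and (0.50, 0.50)
```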

--------------------

# Multi-Head Attention #

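Now that we have covered the Self-Attention module, let's look at the Multi-Head Attention module, which is what is mostly used in practice. The paper says that multi-head attention lets the model jointly use information learned by different heads: `Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.` Once you understand the Self-Attention module, the Multi-Head Attention module is very simple.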

As in the Self-Attention module, each $a_i$ is first passed through $W^q, W^k, W^v$ to obtain the corresponding $q^i, k^i, v^i$. Then, according to the number of heads $h$, each of $q^i, k^i, v^i$ is split evenly into $h$ parts. In the figure below, for example, $h=2$ and $q^1$ is split into $q^{1,1}$ and $q^{1,2}$; $q^{1,1}$ belongs to head1 and $q^{1,2}$ belongs to head2.

![multi-head][]  
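In code, this kind of split is commonly done by reshaping the last dimension. The sketch below assumes a tensor layout of `[batch_size, num_tokens, embed_dim]` and made-up sizes; it illustrates the idea and is not taken from a specific repository.

```python
import torch

batch_size, num_tokens, embed_dim = 1, 2, 4
num_heads = 2
head_dim = embed_dim // num_heads  # each head gets an equal slice

# q holds q^1 and q^2 as rows: [batch_size, num_tokens, embed_dim]
q = torch.arange(batch_size * num_tokens * embed_dim, dtype=torch.float32)
q = q.reshape(batch_size, num_tokens, embed_dim)

# [batch, tokens, embed_dim] -> [batch, tokens, heads, head_dim]
#                            -> [batch, heads, tokens, head_dim]
q_heads = q.reshape(batch_size, num_tokens, num_heads, head_dim).permute(0, 2, 1, 3)

print(q[0, 0])           # q^1:     tensor([0., 1., 2., 3.])
print(q_heads[0, 0, 0])  # q^{1,1}: tensor([0., 1.])  (head 1)
print(q_heads[0, 1, 0])  # q^{1,2}: tensor([2., 3.])  (head 2)
```

Moving the head dimension in front of the token dimension lets one batched matrix multiplication handle all heads at once.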
At this point, anyone who has read the original paper will surely ask: doesn't the paper say that each head's $Q_i, K_i, V_i$ are obtained by projecting with $W^Q_i, W^K_i, W^V_i$?

$$
head_i = {\rm Attention}(QW^Q_i, KW^K_i, VW^V_i)
$$

In some source code I read on GitHub, however, the tensors are simply split evenly. The same split can also be expressed by setting $W^Q_i, W^K_i, W^V_i$ to suitable values. In the figure below, for example, multiplying Q by $W^Q_1$ yields the evenly split $Q_1$.

![multi-head][multi-head 1]  
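To see why the two views agree, here is a small check in plain Python. The numbers are arbitrary, and `W_q_1` and `W_q_2` are hypothetical selection matrices chosen so that multiplying by them picks out the first or second half of each row.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Q with two tokens and embedding size 4 (values are arbitrary)
Q = [[1, 2, 3, 4],
     [5, 6, 7, 8]]

# Selection matrices of shape [4, 2]:
# W_q_1 keeps columns 0-1, W_q_2 keeps columns 2-3
W_q_1 = [[1, 0], [0, 1], [0, 0], [0, 0]]
W_q_2 = [[0, 0], [0, 0], [1, 0], [0, 1]]

Q_1 = matmul(Q, W_q_1)
Q_2 = matmul(Q, W_q_2)

print(Q_1)  # [[1, 2], [5, 6]]
print(Q_2)  # [[3, 4], [7, 8]]

# Same result as slicing each row in half
print(Q_1 == [row[:2] for row in Q])  # True
print(Q_2 == [row[2:] for row in Q])  # True
```

More generally, one large learned projection followed by an even split is equivalent to $h$ smaller per-head projections whose weights are the column blocks of the large matrix.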
The method above gives the $Q_i, K_i, V_i$ for each $head_i$. Each head is then processed exactly as in Self-Attention to obtain its result:

$$
{\rm Attention}(Q_i, K_i, V_i)={\rm softmax}(\frac{Q_iK_i^T}{\sqrt{d_k}})V_i
$$

![multi-head][multi-head 2]  
Next, the results of all heads are concatenated. In the figure below, $b_{1,1}$ (the $b_1$ produced by $head_1$) and $b_{1,2}$ (the $b_1$ produced by $head_2$) are concatenated, and $b_{2,1}$ (the $b_2$ produced by $head_1$) and $b_{2,2}$ (the $b_2$ produced by $head_2$) are concatenated.

![multi-head][multi-head 3]  
The concatenated result is then fused through $W^O$ (a learnable parameter), as shown below, which gives the final outputs $b_1, b_2$.

![在这里插入图片描述][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM3NTQxMDk3_size_16_color_FFFFFF_t_70_pic_center]  
That completes `Multi-Head Attention`. It can be summarized by two formulas from the paper:

$$
{\rm MultiHead}(Q, K, V) = {\rm Concat(head_1,...,head_h)}W^O \\
{\rm where \ head_i = Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$
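Putting the steps together, a multi-head attention module might look roughly like the sketch below. This is an assumption about how such a module could be written, not code from any file referenced in this post: the class name `MultiHeadSelfAttention` and the attribute name `qkv` are illustrative choices, and the three projections are fused into one linear layer for brevity.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)  # W^q, W^k, W^v in one layer
        self.proj = nn.Linear(dim, dim)                 # W^O

    def forward(self, x):
        # x: [batch_size, num_tokens, dim]
        B, N, D = x.shape
        # [B, N, 3*D] -> [B, N, 3, heads, head_dim] -> [3, B, heads, N, head_dim]
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)

        # Scaled dot-product attention inside every head
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.softmax(dim=-1)  # [B, heads, N, N]
        out = attn @ v               # [B, heads, N, head_dim]

        # Concat heads: [B, heads, N, head_dim] -> [B, N, D], then apply W^O
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

x = torch.rand(2, 5, 16)
mha = MultiHeadSelfAttention(dim=16, num_heads=4)
print(mha(x).shape)  # torch.Size([2, 5, 16])
```

The output has the same shape as the input, so the module can be stacked or wrapped in a residual connection without further reshaping.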

--------------------

# Computational cost of Self-Attention vs. Multi-Head Attention #

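At the end of Section 3.2.2, the paper states that the two have roughly the same computational cost: `Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.` Below is a simple experiment. For now, ignore where the `model` file comes from. Its `Attention` class implements `Multi-head Attention`, including all the steps described above.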

* First create a `Self-Attention` module (single head) `a1` and set its `proj` attribute to Identity. (`proj` corresponds to the final $W^O$ projection in `Multi-Head Attention`, which the single-head version does not have, so setting it to Identity turns it into a no-op.)
* Then create a `Multi-Head Attention` module `a2` with 8 heads.
* Create a random input tensor; note its shape.
* Use fvcore to compute the FLOPs of each module.

```python
import torch
from fvcore.nn import FlopCountAnalysis

# `model` is the author's local file; it is not shown in this post
from model import Attention


def main():
    # Self-Attention (single head)
    a1 = Attention(dim=512, num_heads=1)
    a1.proj = torch.nn.Identity()  # remove Wo

    # Multi-Head Attention (8 heads)
    a2 = Attention(dim=512, num_heads=8)

    # [batch_size, num_tokens, total_embed_dim]
    t = (torch.rand(32, 1024, 512),)

    flops1 = FlopCountAnalysis(a1, t)
    print("Self-Attention FLOPs:", flops1.total())

    flops2 = FlopCountAnalysis(a2, t)
    print("Multi-Head Attention FLOPs:", flops2.total())


if __name__ == '__main__':
    main()
```

The terminal output is shown below. The FLOPs of the two are indeed close, with `Multi-Head Attention` slightly higher than `Self-Attention`:

```
Self-Attention FLOPs: 60129542144
Multi-Head Attention FLOPs: 68719476736
```

In fact, the difference in FLOPs comes only from the final $W^O$. If the $W^O$ of `Multi-Head Attention` is removed as well (i.e. the `proj` of `a2` is also set to Identity), the FLOPs of the two are identical:

```
Self-Attention FLOPs: 60129542144
Multi-Head Attention FLOPs: 60129542144
```
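These numbers can be reproduced by hand. The sketch below assumes that each multiply-accumulate in a matrix multiplication is counted once and that softmax and scaling are not counted. The post does not state these assumptions, but under them the totals match the printed values exactly. The function name and its arguments are hypothetical.

```python
def attention_macs(batch, tokens, dim, num_heads, with_proj):
    """Count multiply-accumulates of the matrix products in one attention layer."""
    head_dim = dim // num_heads
    qkv = batch * tokens * dim * (3 * dim)                     # x -> q, k, v
    scores = batch * num_heads * tokens * tokens * head_dim    # Q K^T in every head
    weighted = batch * num_heads * tokens * tokens * head_dim  # softmax(...) V in every head
    proj = batch * tokens * dim * dim if with_proj else 0      # W^O
    return qkv + scores + weighted + proj

single = attention_macs(32, 1024, 512, num_heads=1, with_proj=False)
multi = attention_macs(32, 1024, 512, num_heads=8, with_proj=True)
multi_no_proj = attention_macs(32, 1024, 512, num_heads=8, with_proj=False)

print(single)         # 60129542144
print(multi)          # 68719476736
print(multi_no_proj)  # 60129542144
```

The head count cancels out because `num_heads * head_dim` always equals `dim`, which is why only $W^O$ makes a difference.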

--------------------

# Positional Encoding #

If you look closely at the Self-Attention and Multi-Head Attention modules described above, the computation takes no position information into account. Suppose the Self-Attention module receives $a_1, a_2, a_3$ and produces $b_1, b_2, b_3$. From the point of view of $a_1$, $a_2$ and $a_3$ are equally close to it and have no order. If the input order is changed to $a_1, a_3, a_2$, the result $b_1$ is not affected at all. Below is an experiment in PyTorch. First, `nn.MultiheadAttention` is used to create a `Self-Attention` module (`num_heads=1`); note that $QKV$ are passed in directly in the forward pass. Then two $QKV$ variables with different orders, t1 and t2, are created (the only change is that $q^2, k^2, v^2$ and $q^3, k^3, v^3$ swap places), and each is fed through the Self-Attention module.

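```python
import torch
import torch.nn as nn

# batch_first=True (available in recent PyTorch versions) makes the inputs
# [batch, seq_len, embed_dim]. With the default batch_first=False the tensors
# below would be read as seq_len=1, batch=3, so the tokens would never
# attend to each other.
m = nn.MultiheadAttention(embed_dim=2, num_heads=1, batch_first=True)

t1 = [[[1., 2.],   # q1, k1, v1
       [2., 3.],   # q2, k2, v2
       [3., 4.]]]  # q3, k3, v3

t2 = [[[1., 2.],   # q1, k1, v1
       [3., 4.],   # q3, k3, v3
       [2., 3.]]]  # q2, k2, v2

# m(q, k, v) returns a tuple: (attention output, attention weights)
q, k, v = torch.as_tensor(t1), torch.as_tensor(t1), torch.as_tensor(t1)
print("result1: \n", m(q, k, v))

q, k, v = torch.as_tensor(t2), torch.as_tensor(t2), torch.as_tensor(t2)
print("result2: \n", m(q, k, v))
```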
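The same effect can be checked without any learned weights. The sketch below is a simplified stand-in, not `nn.MultiheadAttention`: it uses the inputs directly as $q$, $k$ and $v$ (an assumption made to keep the example small) and shows that swapping the second and third tokens leaves $b_1$ unchanged, while $b_2$ and $b_3$ simply swap places.

```python
import math

def self_attention(X):
    """Plain self-attention with q = k = v = X (no learned projections)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

def close(u, v):
    """Compare two vectors up to floating-point rounding."""
    return all(math.isclose(x, y, rel_tol=1e-9) for x, y in zip(u, v))

t1 = [[1., 2.], [2., 3.], [3., 4.]]  # a_1, a_2, a_3
t2 = [[1., 2.], [3., 4.], [2., 3.]]  # a_1, a_3, a_2

b = self_attention(t1)
b_swapped = self_attention(t2)

print(close(b[0], b_swapped[0]))  # True: b_1 is unchanged
print(close(b[1], b_swapped[2]))  # True: b_2 moved to position 3
print(close(b[2], b_swapped[1]))  # True: b_3 moved to position 2
```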

[https_blog.csdn.net_qq_37541097_article_details_117691873]: https://blog.csdn.net/qq_37541097/article/details/117691873
[https_arxiv.org_abs_1706.03762]: https://arxiv.org/abs/1706.03762
[https_b23.tv_gucpvt]: https://b23.tv/gucpvt
[attention is all you need]: /images/20220828/31f4ede8821e4de590b38f5d63d0b054.png
[self-attention]: /images/20220828/6fc4ed022b11441b8dfdf08e6b4ee3e8.png
[self-attention 1]: /images/20220828/5b0922cac2694dc9996aaeca4f375564.png
[multi-head]: /images/20220828/c6965a929b3b48689f1bb9bb663b865f.png
[multi-head 1]: /images/20220828/0fd02d09553f44a0b1d7c1012a3389fe.png
[multi-head 2]: /images/20220828/bc4e030e3b7a45189452ce9bcacccf2e.png
[multi-head 3]: /images/20220828/1f5d9f18aa554e7cbd5672abb2711a5c.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM3NTQxMDk3_size_16_color_FFFFFF_t_70_pic_center]: /images/20220828/e9a30fabc8e441979d965d36b57f0132.png