[code] Transformer For Summarization Source Code Reading

Basic Information

Author: Li Piji (李丕绩), Tencent AI Lab

Model: Transformer + copy mechanism for abstractive summarization

Dataset: CNN/Daily Mail

Parameters

WARNING: IN DEBUGGING MODE
USE COPY MECHANISM
USE COVERAGE MECHANISM
USE AVG NLL as LOSS
USE LEARNABLE W2V EMBEDDING
RNN TYPE: transformer
idx_gpu: 0
norm_clip: 2  # gradient clipping by norm
dim_x: 512
dim_y: 512
len_x: 401
len_y: 101
num_x: 1
num_y: 1
hidden_size: 512
d_ff: 1024
num_heads: 8  # 8 attention heads
dropout: 0.2
num_layers: 4
label_smoothing: 0.1
alpha: 0.9
beta: 5
batch_size: 5
testing_batch_size: 1
min_len_predict: 35
max_len_predict: 120
max_byte_predict: None
testing_print_size: 500
lr: 0.15
beam_size: 4
max_epoch: 50
print_time: 20  # how many times per epoch to print progress and save the model
save_epoch: 1
dict_size: 50003  # vocabulary size
pad_token_idx: 0
loading train set...
num_files =  13
num_batches =  3

Model Structure

Model(
  (tok_embed): Embedding(50003, 512, padding_idx=0)
  (pos_embed): LearnedPositionalEmbedding(
    (weights): Embedding(1024, 512)
  )
  (enc_layers): ModuleList(
    (0): TransformerLayer(
      (self_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (fc1): Linear(in_features=512, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=512, bias=True)
      (attn_layer_norm): LayerNorm()
      (ff_layer_norm): LayerNorm()
    )
    (1): TransformerLayer(
      (self_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (fc1): Linear(in_features=512, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=512, bias=True)
      (attn_layer_norm): LayerNorm()
      (ff_layer_norm): LayerNorm()
    )
    (2): TransformerLayer(
      (self_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (fc1): Linear(in_features=512, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=512, bias=True)
      (attn_layer_norm): LayerNorm()
      (ff_layer_norm): LayerNorm()
    )
    (3): TransformerLayer(
      (self_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (fc1): Linear(in_features=512, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=512, bias=True)
      (attn_layer_norm): LayerNorm()
      (ff_layer_norm): LayerNorm()
    )
  )
  (dec_layers): ModuleList(
    (0): TransformerLayer(
      (self_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (fc1): Linear(in_features=512, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=512, bias=True)
      (attn_layer_norm): LayerNorm()
      (ff_layer_norm): LayerNorm()
      (external_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (external_layer_norm): LayerNorm()
    )
    (1): TransformerLayer(
      (self_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (fc1): Linear(in_features=512, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=512, bias=True)
      (attn_layer_norm): LayerNorm()
      (ff_layer_norm): LayerNorm()
      (external_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (external_layer_norm): LayerNorm()
    )
    (2): TransformerLayer(
      (self_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (fc1): Linear(in_features=512, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=512, bias=True)
      (attn_layer_norm): LayerNorm()
      (ff_layer_norm): LayerNorm()
      (external_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (external_layer_norm): LayerNorm()
    )
    (3): TransformerLayer(
      (self_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (fc1): Linear(in_features=512, out_features=1024, bias=True)
      (fc2): Linear(in_features=1024, out_features=512, bias=True)
      (attn_layer_norm): LayerNorm()
      (ff_layer_norm): LayerNorm()
      (external_attn): MultiheadAttention(
        (out_proj): Linear(in_features=512, out_features=512, bias=True)
      )
      (external_layer_norm): LayerNorm()
    )
  )
  (attn_mask): SelfAttentionMask()
  (emb_layer_norm): LayerNorm()
  (word_prob): WordProbLayer(
    (external_attn): MultiheadAttention(
      (out_proj): Linear(in_features=512, out_features=512, bias=True)
    )
    (proj): Linear(in_features=1536, out_features=50003, bias=True)
  )
  (smoothing): LabelSmoothing()
)

Model structure:

1. Embedding representation: token embedding + positional embedding
2. Encoder: 4 TransformerLayer blocks (8-head attention)
3. Decoder: 4 TransformerLayer blocks (8-head attention, plus external attention over the encoder output)
4. Self-attention mask layer
5. Layer normalization
6. Word probability layer: maps to a probability distribution over the vocabulary

Source Code Analysis

1. prepare_data.py

This script preprocesses the data. Taking the test set as an example, it contains 11,489 article-summary pairs, organized as follows:

All pairs form one top-level list; each pair consists of two sub-lists, one for the article and one for the summary.

Within each article/summary entry there are again two sub-lists: the first is the tokenized sequence, the second is the original, untokenized text.
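
A rough illustration of the resulting nesting (the example sentences and variable names are made up; only the structure follows the description above):

# one article-summary pair in the processed test set (hypothetical example)
pair = [
    [   # article
        ["the", "cat", "sat", "on", "the", "mat", "."],   # tokenized sequence
        "The cat sat on the mat."                          # original raw text
    ],
    [   # summary
        ["cat", "sits", "on", "mat", "."],
        "Cat sits on mat."
    ],
]
test_set = [pair]  # the real test set holds 11,489 such pairs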

2. model.py

The Transformer summarizer model is built with PyTorch:

2.1. __init__

Sets the model hyperparameters; the most important ones are:

copy mechanism and coverage mechanism are enabled;
NLL is used as the loss function;
d_ff size = 1024;
context size = 512;
hidden size = 512;

It also defines several frequently used components:

label smoothing
learnable token embedding and positional embedding
word probability layer
embedding layer normalization

2.2. structure of encoder & decoder

Both the encoder and the decoder contain several (here 4) basic modules. Each module contains the following sub-modules: a self-attention block, two fully connected layers, an attention layer norm, and a feed-forward layer norm.

The self-attention block itself is the MultiheadAttention class, analyzed in section 3.2 below.

2.2.1. embedding(transformer.py)

The token embedding uses nn.Embedding and is learned during training. Parameters: vocabulary size = 50003, embedding dim = 512.

The positional embedding also uses nn.Embedding and is learned during training. Parameters: init_size = 1024 (the maximum position), embedding dim = 512; the weights are randomly initialized from a normal distribution.

The author also implements a SinusoidalPositionalEmbedding class that swaps the learnable positional embedding for a fixed one, i.e., the way positional information is injected in the original paper (a sketch of the sinusoidal version appears in section 3.4).
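
A minimal sketch of such a learned positional embedding, consistent with the printout above (nn.Embedding(1024, 512)); the actual class in transformer.py may differ in details such as offset handling and the init std:

import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    # each position 0 .. init_size-1 gets its own trainable 512-d vector
    def __init__(self, embedding_dim=512, init_size=1024):
        super().__init__()
        self.weights = nn.Embedding(init_size, embedding_dim)
        nn.init.normal_(self.weights.weight, std=0.02)  # normal-distribution init (std assumed)

    def forward(self, x, offset=0):
        # x: seq_len x bsz token ids; returns seq_len x 1 x embedding_dim, broadcast over the batch
        positions = torch.arange(offset, offset + x.size(0), device=x.device)
        return self.weights(positions).unsqueeze(1)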

2.2.2. encoder & decoder

# encoder
self.enc_layers = nn.ModuleList()
for i in range(self.num_layers):
    self.enc_layers.append(TransformerLayer(self.dim_x, self.d_ff, self.num_heads, self.dropout))

# decoder
self.dec_layers = nn.ModuleList()
for i in range(self.num_layers):
    self.dec_layers.append(TransformerLayer(self.dim_x, self.d_ff, self.num_heads, self.dropout, with_external=True))
        

PyTorch's nn.ModuleList() is used to stack layers; here num_layers = 4, so a stack of 4 TransformerLayer blocks forms the encoder, and likewise the decoder.

The only difference is in how the basic TransformerLayer unit is initialized: the decoder layers pass with_external=True, which adds an encoder-decoder (external) attention sub-layer.

2.3. encoding

Flow (a minimal sketch follows the list):

  1. Obtain the embedding representation (token embedding + positional embedding)
  2. Layer normalization
  3. Dropout
  4. Padding mask
  5. Run the N-layer encoder stack (TransformerLayer units; parameters are not shared between layers); each layer's output is the next layer's input (a layer takes the sequence representation x and the padding mask)
  6. Return the final encoded representation x
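
A minimal sketch of this encoding flow (the function signature, argument names, and mask shape are illustrative assumptions, not the repository's exact code):

import torch
import torch.nn.functional as F

def encode(model, x, padding_idx=0, dropout=0.2):
    # x: seq_len x bsz source token ids; `model` exposes the modules from the printout above
    h = model.tok_embed(x) + model.pos_embed(x)               # 1. token + positional embedding
    h = model.emb_layer_norm(h)                               # 2. layer normalization
    h = F.dropout(h, p=dropout, training=model.training)      # 3. dropout
    self_padding_mask = x.eq(padding_idx)                     # 4. padding mask over source tokens
    for layer in model.enc_layers:                            # 5. N stacked layers, parameters not shared
        h, _, _ = layer(h, self_padding_mask=self_padding_mask)
    return h, self_padding_mask                               # 6. final encoded representation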

2.4. decoding

Decoding covers two cases:

  1. coverage mechanism + copy mechanism
  2. coverage mechanism only

Flow (a minimal sketch follows the list):

  1. The first case takes two extra arguments, x_ext and max_ext_len: x_ext is the source sequence with OOV words mapped to ids in the extended vocabulary, and max_ext_len is the number of OOV words
  2. Obtain the embedding representation (token embedding + positional embedding)
  3. Layer normalization
  4. Dropout
  5. Padding mask
  6. Run the N-layer decoder stack; each layer's output is the next layer's input (a layer takes the sequence representation x, the padding mask, the self-attention mask, the external memories, and the external padding mask)
  7. Compute the word probability distribution from the final decoder state; there are two ways to do this, implemented in WordProbLayer
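
A matching sketch of the decoding flow (again, names and the SelfAttentionMask call signature are assumptions, not the repository's exact code):

import torch.nn.functional as F

def decode(model, y, memory, src_padding_mask, padding_idx=0, dropout=0.2):
    # y: tgt_len x bsz summary token ids; memory: encoder output returned by encode()
    h = model.tok_embed(y) + model.pos_embed(y)               # 2. token + positional embedding
    h = model.emb_layer_norm(h)                               # 3. layer normalization
    h = F.dropout(h, p=dropout, training=model.training)      # 4. dropout
    self_padding_mask = y.eq(padding_idx)                     # 5. padding mask over target tokens
    self_attn_mask = model.attn_mask(y.size(0))               #    causal mask so a position cannot attend forward
    for layer in model.dec_layers:                            # 6. N stacked decoder layers
        h, _, _ = layer(h,
                        self_padding_mask=self_padding_mask,
                        self_attn_mask=self_attn_mask,
                        external_memories=memory,
                        external_padding_mask=src_padding_mask)
    return h                                                  # 7. handed to WordProbLayer for the word distribution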

2.5. word probability projection (WordProbLayer.py)

A conditional branch handles two cases:

  1. Copy mechanism:

    1. Compute attention with the external_attention module (class MultiheadAttention).
    2. Input: query = decoder final hidden states; key = value = encoder hidden states. It returns (attention output, attention weights).
    3. Concatenate the decoder states, the (embedded) decoder inputs, and the external_attention output; apply a linear projection; apply softmax to obtain pred.
    4. If the source article contains OOV words (max_ext_len > 0), pred is concatenated with an all-zero tensor so that its last dimension equals fixed vocabulary size + number of OOV words.
    5. Compute a gate: concatenate the decoder states, the (embedded) decoder inputs, and the external_attention output; apply a linear projection; apply sigmoid to obtain gate.
    6. Final distribution: pred = gate * pred + (1 - gate) * attention_weights, where the copy term is added onto the source tokens' ids via scatter_add_ (see below).
  2. No copy: a simple linear projection through a fully connected layer followed by softmax gives the distribution over the vocabulary.
Function signature: Tensor.scatter_add_(dim, indexTensor, otherTensor) → Tensor
Usage: selfTensor.scatter_add_(dim, indexTensor, otherTensor)
# This call adds every value of otherTensor into selfTensor; the positions to add at are given by indexTensor, along dimension dim.
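
A hedged toy example of how the gated generation and copy distributions can be combined on the extended vocabulary with scatter_add_ (tensor names and shapes are assumptions; the repository's WordProbLayer may differ in detail):

import torch as T

tgt_len, bsz, vocab, src_len, max_ext_len = 3, 2, 10, 7, 4

pred = T.softmax(T.rand(tgt_len, bsz, vocab), dim=-1)              # generation distribution
attn = T.softmax(T.rand(tgt_len, bsz, src_len), dim=-1)            # copy distribution over source positions
gate = T.sigmoid(T.rand(tgt_len, bsz, 1))                          # copy gate
x_ext = T.randint(0, vocab + max_ext_len, (src_len, bsz))          # source ids in the extended vocabulary

# give OOV ids a slot by appending zeros, then scatter the copy probabilities onto the source ids
pred = T.cat([pred * gate, T.zeros(tgt_len, bsz, max_ext_len)], dim=-1)
index = x_ext.transpose(0, 1).unsqueeze(0).expand(tgt_len, bsz, src_len).contiguous()
pred = pred.scatter_add_(2, index, (1 - gate) * attn)              # pred = gate*pred + (1-gate)*attn at the copied ids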

2.6. loss

2.6.1. label_smoothing_loss (label_smoothing.py)

def label_smotthing_loss(self, y_pred, y, y_mask, avg=True):
    seq_len, bsz = y.size()

    y_pred = T.log(y_pred.clamp(min=1e-8))
    loss = self.smoothing(y_pred.view(seq_len * bsz, -1), y.view(seq_len * bsz, -1))
    if avg:
        return loss / T.sum(y_mask)
    else:
        return loss / bsz

The loss implementation uses the clamp function to clip values:

torch.clamp(input, min, max, out=None) → Tensor

It constrains the values of the input tensor to the range [min, max]: values above max become max, values below min become min, and values in between are left unchanged.
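
For example, clamping before the log keeps the loss finite when a predicted probability is numerically zero (the values below are just for illustration):

import torch as T

p = T.tensor([0.0, 1e-12, 0.3])
T.log(p)                  # tensor([   -inf, -27.631,  -1.204])  -> -inf would break the loss
T.log(p.clamp(min=1e-8))  # tensor([-18.421, -18.421,  -1.204])  -> bounded, gradients stay finite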

Then y_pred (the predicted probability of each word) and the ground truth y are passed to the smoothing module, which the author implements as class LabelSmoothing.

The class is initialized with the label smoothing factor and the padding token's index in the vocabulary. The target y is used to build the smoothed "model prob" distribution.

The KL divergence between this model prob and y_pred is returned as the loss.
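
A minimal sketch of such a LabelSmoothing module, modeled on the widely used "Annotated Transformer" version; the author's class may differ in argument order and naming:

import torch
import torch.nn as nn

class LabelSmoothing(nn.Module):
    def __init__(self, size, padding_idx, smoothing=0.1):
        super().__init__()
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.size = size                     # vocabulary size
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing

    def forward(self, log_probs, target):
        # log_probs: (N, vocab) log-probabilities; target: (N, 1) gold token ids
        true_dist = log_probs.detach().clone()
        true_dist.fill_(self.smoothing / (self.size - 2))   # spread mass over non-gold, non-pad tokens
        true_dist.scatter_(1, target, self.confidence)      # most mass on the gold token
        true_dist[:, self.padding_idx] = 0
        pad_rows = (target == self.padding_idx).squeeze(-1)
        true_dist[pad_rows] = 0                              # rows whose target is padding contribute nothing
        return self.criterion(log_probs, true_dist)          # KL divergence as the loss

The smoothed target keeps 1 - smoothing on the gold token and distributes the rest over the other non-padding tokens, which is why the KL divergence against the model's log-probabilities is a sensible loss.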

2.6.2. negative_log_likelihood

def nll_loss(self, y_pred, y, y_mask, avg=True):
    cost = -T.log(T.gather(y_pred, 2, y.view(y.size(0), y.size(1), 1)))
    cost = cost.view(y.shape)
    y_mask = y_mask.view(y.shape)
    if avg:
        cost = T.sum(cost * y_mask, 0) / T.sum(y_mask, 0)
    else:
        cost = T.sum(cost * y_mask, 0)
    cost = cost.view((y.size(1), -1))
    return T.mean(cost)
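
The T.gather call picks out, for every time step and batch element, the probability the model assigned to the gold token; a quick shape check (random values, using the seq_len x bsz x vocab layout assumed above):

import torch as T

y_pred = T.softmax(T.rand(4, 2, 10), dim=-1)   # seq_len=4, bsz=2, vocab=10
y = T.randint(0, 10, (4, 2))                   # gold token ids
gold = T.gather(y_pred, 2, y.unsqueeze(-1))    # 4 x 2 x 1: probability of each gold token
nll = -T.log(gold).squeeze(-1)                 # per-token negative log-likelihood, before masking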

3. transformer.py

3.1. class TransformerLayer

class TransformerLayer(nn.Module):

    def __init__(self, embed_dim, ff_embed_dim, num_heads, dropout, with_external=False, weights_dropout = True):
        super(TransformerLayer, self).__init__()
        self.self_attn = MultiheadAttention(embed_dim, num_heads, dropout, weights_dropout)
        self.fc1 = nn.Linear(embed_dim, ff_embed_dim)
        self.fc2 = nn.Linear(ff_embed_dim, embed_dim)
        self.attn_layer_norm = LayerNorm(embed_dim)
        self.ff_layer_norm = LayerNorm(embed_dim)
        self.with_external = with_external
        self.dropout = dropout
        if self.with_external:
            self.external_attn = MultiheadAttention(embed_dim, num_heads, dropout, weights_dropout)
            self.external_layer_norm = LayerNorm(embed_dim)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.normal_(self.fc1.weight, std=0.02)
        nn.init.normal_(self.fc2.weight, std=0.02)
        nn.init.constant_(self.fc1.bias, 0.)
        nn.init.constant_(self.fc2.bias, 0.)

    def forward(self, x, kv = None,
                self_padding_mask = None, self_attn_mask = None,
                external_memories = None, external_padding_mask=None,
                need_weights = False):
        # x: seq_len x bsz x embed_dim
        residual = x  # residual connection: in add & norm, the residual and the sub-layer output are summed first, then normalized
        if kv is None:
            x, self_attn = self.self_attn(query=x, key=x, value=x, key_padding_mask=self_padding_mask, attn_mask=self_attn_mask, need_weights = need_weights)
        else:
            x, self_attn = self.self_attn(query=x, key=kv, value=kv, key_padding_mask=self_padding_mask, attn_mask=self_attn_mask, need_weights = need_weights)

        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.attn_layer_norm(residual + x)  # add & norm: sum the residual and the attention output, then apply layer norm

        if self.with_external:
            residual = x
            x, external_attn = self.external_attn(query=x, key=external_memories, value=external_memories, key_padding_mask=external_padding_mask, need_weights = need_weights)
            x = F.dropout(x, p=self.dropout, training=self.training)
            x = self.external_layer_norm(residual + x)
        else:
            external_attn = None

        residual = x
        x = gelu(self.fc1(x))  # first fully connected layer + Gaussian Error Linear Unit (GELU) activation
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.fc2(x)  # second fully connected layer
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.ff_layer_norm(residual + x)

        return x, self_attn, external_attn

Basic structure:

  1. Self-attention layer (class MultiheadAttention): the number of heads is specified when the TransformerLayer unit is created
  2. External attention layer (class MultiheadAttention): when with_external is True, attention also considers external input, i.e., query = key = value no longer holds and the key and value may come from the source side; this is needed in the decoder
  3. Two fully connected layers, of shapes (embed_dim, ff_embed_dim) and (ff_embed_dim, embed_dim)
  4. Attention layer normalization + feed-forward layer normalization (class LayerNorm)
  5. Dropout
  6. Parameter initialization: mainly for the weights and biases of the two fully connected layers

Behavior (the forward function):

  1. Save the residual. In the add & norm operation, the computed output is first summed with the residual, then normalized
  2. Self-attention, using class MultiheadAttention; its forward function needs query, key, value, plus key_padding_mask and attn_mask
  3. Dropout
  4. Add & attention layer norm
  5. If with_external is True, additionally:
    1. External attention
    2. Dropout
    3. Add & external layer norm
  6. Save the residual again
  7. Fully connected layer fc1
  8. Gaussian Error Linear Unit (GELU) activation
  9. Dropout
  10. Fully connected layer fc2
  11. Dropout
  12. Add & feed-forward layer norm

The GELU activation function is defined as follows:

def gelu(x):
    cdf = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
    return cdf*x

3.2. class MultiheadAttention

MultiheadAttention is initialized with a few parameters:

  1. Number of attention heads: 8
  2. embed_dim (dim_x): 512
  3. dropout: 0.2 (plus a weights_dropout flag)

head_dim, the dimension of each attention head, is computed as head_dim = embed_dim // num_heads; embed_dim must be divisible by num_heads.

Attention is the scaled dot-product attention from the paper; the scaling factor is the square root of head_dim (the queries are multiplied by 1 / sqrt(head_dim)).

Input and output projections:

  1. in_proj_weight: (3 * embed_dim, embed_dim)
  2. in_proj_bias: (3 * embed_dim,). Both are defined up front: the first third belongs to the Query, the middle third to the Key, and the last third to the Value. A helper function _in_proj slices out whichever parts of q/k/v need to be projected, but the projection parameters always come from in_proj_weight and in_proj_bias
  3. out_proj: (embed_dim, embed_dim)

The forward pass distinguishes the concrete case and applies the input projections to q/k/v accordingly (a sketch follows the list):

  1. If q, k, and v are identical, it is self-attention
  2. If q differs but k and v are identical, it is encoder-decoder attention
  3. If q, k, and v are all different, it is general attention
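
A hedged sketch of how one in_proj_weight / in_proj_bias pair can serve all three projections by slicing (the helper name _in_proj follows the description above; the exact slicing code in the repository may differ):

import torch
import torch.nn.functional as F

embed_dim = 512
in_proj_weight = torch.empty(3 * embed_dim, embed_dim)  # rows 0..511 -> Q, 512..1023 -> K, 1024..1535 -> V
in_proj_bias = torch.zeros(3 * embed_dim)
torch.nn.init.normal_(in_proj_weight, std=0.02)

def _in_proj(x, start, end):
    # project x with the slice of the shared weight/bias that belongs to Q, K or V
    return F.linear(x, in_proj_weight[start:end, :], in_proj_bias[start:end])

def in_proj_q(query):
    return _in_proj(query, 0, embed_dim)

def in_proj_kv(key):
    # encoder-decoder attention: key == value, so one projection yields both, then split
    return _in_proj(key, embed_dim, 3 * embed_dim).chunk(2, dim=-1)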

The attention weights are masked with masked_fill_: given a ByteTensor mask, the positions where the mask element is 1 are filled with a given value (here -inf, so those positions receive zero weight after the softmax).

After masking, the attention weights go through softmax and then dropout.

The attention output is a bmm (batch matrix multiply) of the attention weights and the values, wrapped in another dropout.

A final output projection (out_proj) produces the MultiheadAttention output (see the sketch below).
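
Putting the pieces together, a single-head toy version of the masked, scaled attention described above (the real class splits the tensors into 8 heads before the bmm and applies out_proj afterwards; masks are assumed to be boolean):

import torch
import torch.nn.functional as F

def scaled_attention(q, k, v, key_padding_mask=None, attn_mask=None, dropout=0.2, training=True):
    # q: (bsz, tgt_len, d); k, v: (bsz, src_len, d)
    d = q.size(-1)
    q = q * d ** -0.5                                         # scale queries by 1/sqrt(head_dim)
    attn_weights = torch.bmm(q, k.transpose(1, 2))            # (bsz, tgt_len, src_len)
    if attn_mask is not None:                                 # e.g. causal mask in the decoder
        attn_weights.masked_fill_(attn_mask, float('-inf'))   # masked positions -> zero weight after softmax
    if key_padding_mask is not None:                          # ignore padding keys
        attn_weights.masked_fill_(key_padding_mask.unsqueeze(1), float('-inf'))
    attn_weights = F.softmax(attn_weights, dim=-1)
    attn_weights = F.dropout(attn_weights, p=dropout, training=training)
    output = torch.bmm(attn_weights, v)                       # weighted sum of the values
    return output, attn_weights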

3.3. class LayerNorm

def forward(self, x):
    u = x.mean(-1, keepdim=True)
    s = (x - u).pow(2).mean(-1, keepdim=True)
    x = (x - u) / torch.sqrt(s + self.eps)
    return self.weight * x + self.bias

Flow:

  1. Compute the mean over the last dimension
  2. Compute the variance
  3. Subtract the mean and divide by the standard deviation (with eps for numerical stability)
  4. Apply a learned elementwise affine transform (scale by weight, shift by bias)

3.4. Positional Embedding

Two classes are defined:

  1. A learnable positional encoding: class LearnedPositionalEmbedding
  2. A fixed positional encoding given by sinusoids: class SinusoidalPositionalEmbedding, defined exactly as in the "Attention Is All You Need" paper (a sketch follows)
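
A standalone sketch of the fixed sinusoidal table from the paper (the repository's class presumably wraps something like this in an nn.Module and caches or extends the table as needed):

import math
import torch

def sinusoidal_positions(num_positions, embedding_dim):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pe = torch.zeros(num_positions, embedding_dim)
    position = torch.arange(0, num_positions, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, embedding_dim, 2, dtype=torch.float)
                         * -(math.log(10000.0) / embedding_dim))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # (num_positions, embedding_dim), never updated during training

# e.g. a table matching the settings above: sinusoidal_positions(1024, 512)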

Original post: https://www.cnblogs.com/lauspectrum/p/11229125.html
