A brief description of the return values of the BERT pretrained model in transformers

When fine-tuning BERT with transformers, you often write code like the following:

outputs = self.bert(input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids,
                    position_ids=position_ids,
                    head_mask=head_mask)

In BertModel(BertPreTrainedModel), the return value outputs is documented as follows:

r"""
    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **last_hidden_state**: ``torch.FloatTensor`` of shape ``(batch_size, sequence_length, hidden_size)``
            Sequence of hidden-states at the output of the last layer of the model.
        **pooler_output**: ``torch.FloatTensor`` of shape ``(batch_size, hidden_size)``
            Last layer hidden-state of the first token of the sequence (classification token)
            further processed by a Linear layer and a Tanh activation function. The Linear
            layer weights are trained from the next sentence prediction (classification)
            objective during Bert pretraining. This output is usually *not* a good summary
            of the semantic content of the input, you're often better with averaging or pooling
            the sequence of hidden-states for the whole input sequence.
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
"""
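
To see these fields concretely, here is a minimal sketch. It assumes a recent transformers release in which the model call accepts return_dict=True and output_hidden_states=True, so the fields can be read by name; the docstring above describes the older tuple form, where the same values are read by position. The model is a tiny, randomly initialised BertModel built from a hand-written BertConfig, so nothing is downloaded and the numbers themselves are meaningless; only the shapes matter.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialised BERT. The sizes are arbitrary illustration
# values, not those of any released checkpoint.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config).eval()

input_ids = torch.randint(0, config.vocab_size, (3, 7))  # (batch_size, sequence_length)
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(
        input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
        return_dict=True,
    )

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
print(outputs.pooler_output.shape)      # (batch_size, hidden_size)
print(len(outputs.hidden_states))       # embeddings output + one entry per layer
```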

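Note that pooler_output is not the raw last-layer hidden state of the [CLS] token. It is the last-layer hidden state of the first token of the sequence (the [CLS] token) after a further Linear layer and Tanh activation. In the source of the forward function, the final part that builds the return value looks like this: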

        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output)

        outputs = (sequence_output, pooled_output,) + encoder_outputs[
            1:
        ]  # add hidden_states and attentions if they are here
        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)

As you can see, sequence_output is passed through a pooler layer, whose structure is as follows:

class BertPooler(nn.Module):
    def __init__(self, config):
        super(BertPooler, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
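
The behaviour of this layer can be checked in isolation. The sketch below is an illustration rather than the library's own code: it uses a standalone nn.Linear and nn.Tanh as stand-ins for the pooler's two sub-modules, with arbitrary sizes and random inputs. It shows that the pooler itself reads only position 0, although inside BERT that position's hidden state has already attended to the whole sequence through self-attention.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, seq_len, hidden_size = 2, 5, 8  # arbitrary illustration sizes

# Stand-ins for the pooler's dense layer and activation.
dense = nn.Linear(hidden_size, hidden_size)
activation = nn.Tanh()

hidden_states = torch.randn(batch_size, seq_len, hidden_size)
pooled = activation(dense(hidden_states[:, 0]))  # (batch_size, hidden_size)

# Overwrite every position except the first; the pooled output does not change.
perturbed = hidden_states.clone()
perturbed[:, 1:] = torch.randn(batch_size, seq_len - 1, hidden_size)
pooled_perturbed = activation(dense(perturbed[:, 0]))

print(pooled.shape)
print(torch.equal(pooled, pooled_perturbed))
```

Because of the Tanh, every component of the pooled vector lies in [-1, 1].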

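So BertModel does not simply bundle the raw encoder outputs and return them: the second element has been through the extra pooler layer. Generally speaking, for sentence-level tasks you can use pooled_output as a baseline; for further fine-tuning, build on last_hidden_state instead, for example by averaging or pooling over the token vectors as the docstring suggests.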
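
One common way to build a sentence vector from last_hidden_state is to average the token vectors while ignoring padding. The sketch below is one possible implementation, not something taken from the library: masked_mean_pool is a hypothetical helper name, and the tensors are hand-written toy values standing in for real model output.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over the sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                           # avoid division by zero
    return summed / counts

# Toy stand-in for model output: batch of 2, sequence length 3, hidden size 2.
last_hidden_state = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

sentence_vectors = masked_mean_pool(last_hidden_state, attention_mask)
print(sentence_vectors)  # rows: [2., 3.] and [2., 2.]
```

The padded position in the first row, with its large values, contributes nothing to the average, which is why the mask matters for batches with mixed sequence lengths.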

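The structure of last_hidden_state is as follows: index 0 along the sequence dimension is the [CLS] token, whose vector serves as the sentence vector, and the remaining positions hold the vectors of the individual tokens.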

Original article: https://www.cnblogs.com/webbery/p/12167552.html

Time: 2024-10-06 18:20:04
