1503.02531-Distilling the Knowledge in a Neural Network.md

原来交叉熵还有一个tempature，这个tempature有如下的定义：

q_i=\frac{e^{z_i/T}}{\sum_j{e^{z_j/T}}}

其中T就是tempature，一般这个T取值就是1,如果提高：

In [6]: np.exp(np.array([1,2,3,4])/2)/np.sum(np.exp(np.array([1,2,3,4])/2))
Out[6]: array([0.10153632, 0.1674051 , 0.27600434, 0.45505423])

In [7]: mx.nd.softmax(mx.nd.array([1,2,3,4]))
Out[7]: 

[0.0320586 0.08714432 0.23688284 0.6439143 ]
<NDArray 4 @cpu(0)>

也就是

Using a higher value for T produces a softer probability distribution over classes.

拥有更高的tempature的系统，其entropy会更高，也就是混乱性更高，方向不趋于一致，而这种不一致性，其实是一种信息，

可以描述数据中更多结构的信息。大模型通过强制的正则化，使得最后输出的信息，entropy更低。因此

Our more general solution, called “distillation”, is to raise the temperature of the final softmax until the cumbersome model produces a suitably soft set of targets. We then use the same high temperature when training the small model to match these soft targets. We show later that matching the logits of the cumbersome model is actually a special case of distillation.

~~也就是在训练大模型的时候就强制高tempature？但是感觉这样会更加重这种问题才对？~~

训练大模型的时候，正常训练。其logits使用的时候，用高T，小模型训练的时候，也使用高T，但是验证的时候，使用T1.

In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.

可以同时使用softlabel和数据集的label来做训练，但是softlabel使用不同的T的时候，需要将softlabel的loss相应的乘以$T^2$

使用softtarget的好处是，softtarget携带了更多的信息，因此可以用更少的数据来训练。

多个大模型蒸馏出来的模型，可能比多个模型组合有更好的性能。

多个模型如何蒸馏？用多个模型的输出，作为最终蒸馏模型的target，多个target的loss相加。也就是一种多任务学习。

confusion matrix 这个东西可以被用来探查模型最容易弄错的是哪些分类。

看错了，似乎论文最后只是在讨论训练多个speciallist model，但是并没有谈到如何把这些models组合回一个大模型。这可能是个问题。

原文地址：https://www.cnblogs.com/dwsun/p/9297200.html

时间： 2024-08-30 16:38:33

1503.02531-Distilling the Knowledge in a Neural Network.md的相关文章

[Paper Review]Distilling the Knowledge in a Neural Network,2015

Analogy: Many insects have a larval form that is optimized for extracting energy and nutrients from the environment and a completely different adult form that is optimized for the very different requirements of travelling and reproduction. The proble

Spark MLlib Deep Learning Convolution Neural Network (深度学习-卷积神经网络)3.1

3.Spark MLlib Deep Learning Convolution Neural Network (深度学习-卷积神经网络)3.1 http://blog.csdn.net/sunbow0 Spark MLlib Deep Learning工具箱,是根据现有深度学习教程<UFLDL教程>中的算法,在SparkMLlib中的实现.具体Spark MLlib Deep Learning(深度学习)目录结构: 第一章Neural Net(NN) 1.源码 2.源码解析 3.实例第二章D

Spark MLlib Deep Learning Convolution Neural Network (深度学习-卷积神经网络)3.2

3.Spark MLlib Deep Learning Convolution Neural Network(深度学习-卷积神经网络)3.2 http://blog.csdn.net/sunbow0 第三章Convolution Neural Network (卷积神经网络) 2基础及源码解析 2.1 Convolution Neural Network卷积神经网络基础知识 1)基础知识: 自行google,百度,基础方面的非常多,随便看看就可以,只是很多没有把细节说得清楚和明白: 能把细节说清

Spark MLlib Deep Learning Convolution Neural Network (深度学习-卷积神经网络)3.3

3.Spark MLlib Deep Learning Convolution Neural Network(深度学习-卷积神经网络)3.3 http://blog.csdn.net/sunbow0 第三章Convolution Neural Network (卷积神经网络) 3实例 3.1 测试数据按照上例数据,或者新建图片识别数据. 3.2 CNN实例 //2 测试数据 Logger.getRootLogger.setLevel(Level.WARN) valdata_path="/use

ufldl学习笔记与编程作业：Convolutional Neural Network(卷积神经网络)

ufldl出了新教程,感觉比之前的好,从基础讲起,系统清晰,又有编程实践. 在deep learning高质量群里面听一些前辈说,不必深究其他机器学习的算法,可以直接来学dl. 于是最近就开始搞这个了,教程加上matlab编程,就是完美啊. 新教程的地址是:http://ufldl.stanford.edu/tutorial/ 本节学习地址:http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/ 一直没更

Convolution Neural Network (CNN) 原理与实现

本文结合Deep learning的一个应用,Convolution Neural Network 进行一些基本应用,参考Lecun的Document 0.1进行部分拓展,与结果展示(in python). 分为以下几部分: 1. Convolution(卷积) 2. Pooling(降采样过程) 3. CNN结构 4. 跑实验下面分别介绍. PS:本篇blog为ese机器学习短期班参考资料(20140516课程),本文只是简要讲最naive最simple的思想,重在实践部分,原理课上详述.

机器学习公开课笔记(4)：神经网络(Neural Network)——表示

动机(Motivation) 对于非线性分类问题,如果用多元线性回归进行分类,需要构造许多高次项,导致特征特多学习参数过多,从而复杂度太高. 神经网络(Neural Network) 一个简单的神经网络如下图所示,每一个圆圈表示一个神经元,每个神经元接收上一层神经元的输出作为其输入,同时其输出信号到下一层,其中每一层的第一个神经元称为bias unit,它是额外加入的其值为1,通常用+1表示,下图用虚线画出. 符号说明: $a_i^{(j)}$表示第j层网络的第i个神经元,例如下图$a_1^{(

What are the advantages of ReLU over sigmoid function in deep neural network?

The state of the art of non-linearity is to use ReLU instead of sigmoid function in deep neural network, what are the advantages? I know that training a network when ReLU is used would be faster, and it is more biological inspired, what are the other

Deep learning与Neural Network

该文章转自深度学习微信公众号深度学习是机器学习研究中的一个新的领域,其动机在于建立.模拟人脑进行分析学习的神经网络,它模仿人脑的机制来解释数据,例如图像,声音和文本.深度学习是无监督学习的一种. 深度学习的概念源于人工神经网络的研究.含多隐层的多层感知器就是一种深度学习结构.深度学习通过组合低层特征形成更加抽象的高层表示属性类别或特征,以发现数据的分布式特征表示. Deep learning本身算是machine learning的一个分支,简单可以理解为neural network的发展.大