[Paper Review] Distilling the Knowledge in a Neural Network, 2015

Analogy: Many insects have a larval form that is optimized for extracting energy and nutrients from the environment and a completely different adult form that is optimized for the very different requirements of travelling and reproduction.

The problem in Machine Learning: In large-scale machine learning,
we typically use very similar models for the training stage and the
deployment stage despite their very different requirements: For tasks
like speech and object recognition, training must extract structure from
very large, highly redundant datasets but it does not need to operate
in real time and it can use a huge amount of computation. Deployment to a
large number of users, however, has much more stringent requirements on
latency and computational resources.

What is the knowledge in a Neural Network?
A conceptual block that may have prevented more investigation of this
very promising approach is that we tend to identify the knowledge in a
trained model with the learned parameter values and this makes it hard
to see how we can change the form of the model but keep the same
knowledge. A more abstract view of the knowledge, that frees it from any
particular instantiation, is that it is a learned mapping from input
vectors to output vectors.

Why knowledge transfer? 
For tasks like MNIST in which the cumbersome model almost always
produces the correct answer with very high confidence, much of the
information about the learned function resides in the ratios of very
small probabilities in the soft targets. For example, one version of a 2
may be given a probability of 10^-6 of being a 3 and 10^-9 of being a 7,
whereas for another version it may be the other way around. This is
valuable information that defines a rich similarity structure over the
data (i. e. it says which 2’s look like 3’s and which look like 7’s) but
it has very little influence on the cross-entropy cost function during
the transfer stage because the probabilities are so close to zero.
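
A minimal numpy sketch of this point (the probability values are made up): two teacher distributions that encode opposite similarity structure in their tiny probabilities give almost exactly the same cross-entropy against the same student prediction, so at a temperature of 1 this information is essentially invisible to the transfer loss.

```python
import numpy as np

def cross_entropy(target, pred):
    # -sum_i target_i * log(pred_i): the transfer-stage cost for one case
    return -np.sum(target * np.log(pred))

# Hypothetical teacher soft targets for two versions of a "2"
# (only the classes 2, 3 and 7 are shown; values are illustrative).
teacher_a = np.array([1 - 1e-6 - 1e-9, 1e-6, 1e-9])  # this 2 looks slightly like a 3
teacher_b = np.array([1 - 1e-6 - 1e-9, 1e-9, 1e-6])  # this 2 looks slightly like a 7

student = np.array([0.90, 0.06, 0.04])               # some student prediction

print(cross_entropy(teacher_a, student))  # ~0.10536
print(cross_entropy(teacher_b, student))  # ~0.10536 -- practically identical
# The rich similarity structure in the tiny probabilities barely moves the loss.
```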

Simple transfer:  An obvious way to transfer the generalization
ability of the cumbersome model to a small model is to use the class
probabilities produced by the cumbersome model as “soft targets” for
training the small model.
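
As a toy illustration (values invented), training on soft targets just means substituting the cumbersome model's probability vector for the usual one-hot label in the same cross-entropy objective:

```python
import numpy as np

# One MNIST-style training case whose true class is "2" (10 classes).
hard_target = np.eye(10)[2]                      # the usual one-hot label

# Hypothetical class probabilities produced by the cumbersome model:
soft_target = np.array([1e-4, 1e-3, 9.7e-1, 2e-2, 1e-4,
                        1e-4, 1e-4, 5e-3, 2e-3, 1e-3])
soft_target /= soft_target.sum()                 # renormalise the illustrative numbers

# The small model is trained with the same cross-entropy loss as before,
# but with soft_target in place of hard_target as the training signal.
```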

Redefine the Softmax Function:
A temperature T is added to the softmax function, so the probability of class i becomes

q_i = exp(z_i / T) / Σ_j exp(z_j / T)

where the z_i are the logits. With T = 1 this is the ordinary softmax; the larger T is, the softer (smoother) the resulting probability distribution over classes.
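
A quick numpy sketch of this temperature softmax (the logits here are arbitrary) shows how raising T softens the distribution:

```python
import numpy as np

def softmax_t(logits, T=1.0):
    # q_i = exp(z_i / T) / sum_j exp(z_j / T)
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0, 4.0])
print(softmax_t(logits, T=1.0))   # [0.032 0.087 0.237 0.644] -- sharply peaked
print(softmax_t(logits, T=5.0))   # [0.181 0.221 0.270 0.329] -- much softer
```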

Distillation:
In the simplest form of distillation, knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax. The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature of 1.
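
A minimal PyTorch-style sketch of this simplest form (the temperature value and tensor shapes are assumptions, not taken from the paper): the distilled model is trained to match the cumbersome model's softened distribution at temperature T, and is run with the ordinary softmax (T = 1) once training is done.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Cross-entropy between the teacher's and the student's softened distributions,
    # both computed at the same high temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=1)   # cumbersome model at high T
    log_q = F.log_softmax(student_logits / T, dim=1)      # distilled model at the same T
    return -(soft_targets * log_q).sum(dim=1).mean()

# Training (sketch): minimise distillation_loss over the transfer set.
# Deployment: run the trained distilled model with T = 1, i.e. the ordinary softmax.
```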

https://arxiv.org/pdf/1503.02531.pdf

Original post: https://www.cnblogs.com/rhyswang/p/12235699.html
