Maxout Networks

While researching for my master's thesis, I tried to understand the paper by Goodfellow et al. on Maxout units. I found it very hard to understand the details and thought that a clear explanation combined with a nice figure would be really helpful. So this is my shot at doing so.

Please note that nothing explained here was developed by me; this is just an explanation of the paper by Goodfellow et al.

Key facts about Maxout

  • Maxout is an activation function
  • it is designed to be combined with dropout
  • it minimizes the model averaging approximation error when using dropout
  • it is a piecewise linear approximation to an arbitrary convex function

Definition

$$h_i(x) = \max_{j \in [1,k]} z_{ij} \qquad \text{where} \quad z_{ij} = x^T W_{\cdots ij} + b_{ij}$$

$$W \in \mathbb{R}^{d \times m \times k}, \qquad b \in \mathbb{R}^{m \times k}$$

   
h   the Maxout function
i   index of the Maxout unit in the layer
x   the input
k   number of linear pieces per unit
W   3D tensor of learned weights
b   matrix of learned biases
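
To make the definition concrete, here is a minimal NumPy sketch of the forward pass of one Maxout layer. The concrete sizes d, m and k below are my own illustrative choices, not values from the paper:

```python
import numpy as np

d, m, k = 5, 3, 4                 # input size, number of Maxout units, linear pieces per unit
rng = np.random.default_rng(0)

W = rng.normal(size=(d, m, k))    # learned weights, W in R^{d x m x k}
b = rng.normal(size=(m, k))       # learned biases,  b in R^{m x k}
x = rng.normal(size=d)            # a single input vector

z = np.einsum('d,dmk->mk', x, W) + b   # z_ij = x^T W_{..ij} + b_ij, shape (m, k)
h = z.max(axis=1)                      # h_i(x) = max_j z_ij, shape (m,)
```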

Illustration

Now, here is what a single layer with three Maxout units looks like.

[Figure: a single Maxout layer — input units x, fully-connected units z, max-pooling units h]

“But wait, this looks more like at least two layers!”

Yes indeed, this is an important fact about Maxout. The activation function is implemented by a small sub-network whose parameters are learned as well (“Did somebody say ‘Network in Network’?”). So a single Maxout layer actually consists of two parts. (The first layer in the image is just the input layer. This doesn’t necessarily mean that it is the very first layer of the whole network; it can also be the output of a previous layer.)

The first is the linear part. It is a fully-connected layer with no activation function (such a layer is also called affine), so each unit in this layer just computes a weighted sum of all inputs, as defined in the second part of the Maxout definition above:

$$z_{ij} = x^T W_{\cdots ij} + b_{ij}$$

Attentive readers might notice the missing biases in the image above. Well done.

The three-dimensional tensor W contains the weights of this first part. The dots in the equation mean that all elements of the first dimension are taken. Consequently, $W_{\cdots ij}$ is the weight vector of the unit in row i and column j.

It might also be confusing that in the figure above the units of this first part are arranged in a two-dimensional grid. The first dimension of this grid (the number of rows) matches the number of Maxout units in the layer, whereas the second dimension (the number of columns) is the hyperparameter k, which is chosen when the whole architecture is defined. This parameter controls the complexity of the Maxout activation function: the higher k is, the more accurately an arbitrary convex function can be approximated. Basically, each column of units in this first part performs a linear regression of the input.

The second part is much simpler. It just performs max-pooling over each row of the first part, i.e. it takes the maximum of the outputs of each row.
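
In practice this two-part structure is often implemented as one ordinary fully-connected layer with m·k outputs, followed by a reshape into the m×k grid and a max over the last axis. A rough NumPy sketch under that assumption (the function name and shapes are mine, not from the paper):

```python
import numpy as np

def maxout_layer(X, W, b, m, k):
    """X: (n, d) batch of inputs, W: (d, m*k) weights, b: (m*k,) biases."""
    Z = X @ W + b                      # affine part: one big fully-connected layer, shape (n, m*k)
    Z = Z.reshape(X.shape[0], m, k)    # arrange the hidden units in the m x k grid
    return Z.max(axis=2)               # max-pooling part: maximum over the k pieces of each row, shape (n, m)

# example: a batch of 8 inputs of size 5, 3 Maxout units with k = 4 pieces each
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
W = rng.normal(size=(5, 3 * 4))
b = np.zeros(3 * 4)
H = maxout_layer(X, W, b, m=3, k=4)    # H has shape (8, 3)
```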

An easy example

Consider the function f(x) = x².

We can approximate this function with a single Maxout unit that uses three linear pieces (k = 3). In other words, the Maxout unit uses three hidden units.

This Maxout unit would look like this (biases included this time):

[Figure: a single Maxout unit — input x, fully-connected units z with biases, max-pooling unit h]

Each hidden unit calculates:

$$z_j = x \cdot w_j + b_j$$

This is a simple linear function. The max-pooling unit takes the maximum of these three linear functions.

Take a look at this picture. It shows the x² function and three linear functions that could be learned by the Maxout unit.
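
As a small numeric sketch of that picture, the snippet below evaluates a Maxout unit whose three linear pieces are hand-picked by me (the tangents of x² at −1, 0 and 1), not learned:

```python
import numpy as np

# three hand-picked linear pieces: tangents of x^2 at x = -1, 0, 1
w = np.array([-2.0, 0.0, 2.0])   # slopes
b = np.array([-1.0, 0.0, -1.0])  # biases

def maxout_unit(x):
    # h(x) = max_j (x * w_j + b_j)
    return np.max(x * w + b)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.1f}   maxout = {maxout_unit(x):.2f}   x^2 = {x**2:.2f}")
```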

Finally, try to imagine what this would look like with 4, 5, 6 or an arbitrary number of linear functions.
