(Repost) ResNet, AlexNet, VGG, Inception: Understanding various architectures of Convolutional Networks

by KOUSTUBH

  This blog post is from: http://cv-tricks.com/cnn/understand-resnet-alexnet-vgg-inception/

  

  Convolutional neural networks are fantastic for visual recognition tasks. Good ConvNets are beasts with millions of parameters and many hidden layers. In fact, a crude rule of thumb is: 'the higher the number of hidden layers, the better the network'. AlexNet, VGG, Inception and ResNet are some of the popular networks. Why do these networks work so well? How are they designed? Why do they have the structures they have? The answers to these questions are not trivial and certainly can't be covered in one blog post. However, in this post I shall try to discuss some of these questions. Network architecture design is a complicated process that takes a while to learn, and even longer to experiment with on your own. But first, let's put things in perspective:

  Why are ConvNets beating traditional computer vision?

  

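  The task of image classification is to correctly assign a given image to one of a set of predefined categories. Traditional approaches split this process into two modules: feature extraction and classification.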

     Feature extraction involves extracting a higher level of information from raw pixel values that can capture the distinction among the categories involved. This feature extraction is done in an unsupervised manner, wherein the classes of the images have nothing to do with the information extracted from the pixels. Some of the traditional and widely used features are GIST, HOG, SIFT, LBP etc. After the features are extracted, a classification module is trained with the images and their associated labels. A few examples of this module are SVM, Logistic Regression, Random Forest, decision trees etc.

  The problem with this pipeline is that the feature extraction cannot be tweaked according to the classes and images. So if the chosen feature lacks the representational power to distinguish the categories, the accuracy of the classification model will suffer, no matter which classification strategy you use. A common theme among the state of the art following the traditional pipeline has been to pick multiple feature extractors and combine them inventively to get a better feature. But this involves too many heuristics, as well as manual labor to tweak parameters according to the domain, to reach a decent level of accuracy. By decent I mean close to human-level accuracy. That's why it took years to build a good computer vision system (like OCR, face verification, image classifiers, object detectors etc.) that can work with the wide variety of data encountered in practical applications using traditional computer vision. We once produced better results using ConvNets for a company (a client of my start-up) in 6 weeks than they had achieved in close to a year using traditional computer vision.

  

  Another problem with this method is that it is completely different from how we humans learn to recognize things. Just after birth, a child is incapable of perceiving his surroundings, but as he progresses and processes data, he learns to identify things. This is the philosophy behind deep learning, wherein no hard-coded feature extractor is built in. It combines the extraction and classification modules into one integrated system that learns to extract discriminating representations from the images and to classify them, based on supervised data.

  One such system is the multilayer perceptron, aka a neural network, which consists of multiple layers of neurons densely connected to each other. A deep vanilla neural network has such a large number of parameters that it is practically impossible to train without overfitting, due to the lack of a sufficient number of training examples. But with Convolutional Neural Networks (ConvNets), the task of training the whole network from scratch can be carried out using a large dataset like ImageNet. The reason behind this is the sharing of parameters between neurons and the sparse connections in convolutional layers. This can be seen in Figure 2. In the convolution operation, the neurons in one layer are only locally connected to the input neurons, and the set of parameters is shared across the 2-D feature map.
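
  To make the effect of parameter sharing concrete, here is a rough sketch that counts parameters for a densely connected layer versus a convolutional layer. The input size (224x224x3), the 64 output units/filters and the 3x3 kernel are illustrative assumptions, not values from the text above.

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Weights + biases of a fully connected layer applied to a flattened image."""
    return in_h * in_w * in_c * out_units + out_units


def conv_params(kernel, in_c, out_c):
    """Weights + biases of a conv layer.

    The count does not depend on the image size, because the same
    kernel is shared across every spatial position.
    """
    return kernel * kernel * in_c * out_c + out_c


# Assumed example: 224x224 RGB input, 64 output units / filters.
fc = dense_params(224, 224, 3, 64)
conv = conv_params(3, 3, 64)
print(f"dense: {fc:,} parameters")   # dense: 9,633,856 parameters
print(f"conv:  {conv:,} parameters")  # conv:  1,792 parameters
```

  Under these assumptions the convolutional layer needs several thousand times fewer parameters than the dense layer for the same number of output channels.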

  In order to understand the design philosophy of ConvNets, one must ask: what is the objective here?

  a. Accuracy:

  If you are building an intelligent machine, it is absolutely critical that it be as accurate as possible. One fair point to raise here is that accuracy depends not only on the network but also on the amount of data available for training. Hence, these networks are compared on a standard dataset called ImageNet.

  The ImageNet project is an ongoing effort and currently has 14,197,122 images from 21,841 different categories. Since 2010, ImageNet has been running an annual competition in visual recognition where participants are provided with 1.2 million images belonging to 1000 different classes from the ImageNet dataset. So, each network architecture reports accuracy using these 1.2 million images of 1000 classes.

  b. Computation:

  Most ConvNets have huge memory and computation requirements, especially during training. Hence, this becomes an important concern. Similarly, the size of the final trained model becomes important to consider if you are looking to deploy a model to run locally on mobile. As you can guess, it generally takes a more computationally intensive network to produce more accuracy. So, there is always a trade-off between accuracy and computation.

  Apart from these, there are many other factors, like ease of training, the ability of a network to generalize well etc. The networks described below are the most popular ones and are presented in the order in which they were published; each also achieved better accuracy than the earlier ones.

AlexNet

This architecture was one of the first deep networks to push ImageNet classification accuracy by a significant stride in comparison to traditional methodologies. It is composed of 5 convolutional layers followed by 3 fully connected layers, as depicted in Figure 1.

AlexNet, proposed by Alex Krizhevsky, uses ReLU (Rectified Linear Unit) for the non-linear part, instead of the Tanh or Sigmoid functions which were the earlier standard for traditional neural networks. ReLU is given by

f(x) = max(0,x)

The advantage of ReLU over sigmoid is that it trains much faster, because the derivative of sigmoid becomes very small in the saturating region and therefore the updates to the weights almost vanish (Figure 4). This is called the vanishing gradient problem.

In the network, a ReLU layer is put after each and every convolutional and fully connected (FC) layer.

Another problem that this architecture addressed was over-fitting, which it reduced by using a Dropout layer after every FC layer. A Dropout layer has a probability, p, associated with it and is applied to every neuron of the response map separately. It randomly switches off the activation with probability p, as can be seen in Figure 5.
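
The saturation argument can be checked numerically. The following is a small sketch comparing the derivative of the sigmoid with the derivative of ReLU at a few sample inputs; the sample points are arbitrary choices for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)); at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


# The sigmoid gradient shrinks quickly as the input moves into the
# saturating region, while the ReLU gradient stays at 1 for x > 0.
for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

Because backpropagation multiplies these derivatives layer by layer, small sigmoid gradients compound with depth, whereas active ReLU units pass the gradient through unchanged.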
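
As a minimal sketch of the mechanism, the function below zeroes each activation independently with probability p. Note one assumption that is not in the text above: the surviving activations are rescaled by 1/(1-p) during training (so-called inverted dropout), so that nothing needs to change at test time. The function name and the use of plain Python lists are illustrative only.

```python
import random


def dropout(activations, p, training=True, rng=None):
    """Switch off each activation independently with probability p.

    Assumption: inverted dropout, i.e. kept activations are scaled by
    1 / (1 - p) during training so the expected value is unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # dropout is disabled at test time
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


# Usage: roughly half of the activations are zeroed with p = 0.5.
rng = random.Random(0)
out = dropout([1.0] * 10, p=0.5, rng=rng)
print(out)
```

Each call draws a fresh random mask, so every training step effectively trains a different thinned sub-network.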

Why does DropOut work?

The idea behind dropout is similar to that of model ensembles. With a dropout layer, each different set of switched-off neurons represents a different architecture, and all these different architectures are trained in parallel, with a weight given to each subset and the weights summing to one. For n neurons attached to DropOut, the number of subset architectures formed is 2^n. So it amounts to the prediction being averaged over this ensemble of models. This provides a structured model regularization which helps in avoiding over-fitting. Another view of why DropOut helps is that, since neurons are switched off at random, they tend to avoid developing co-adaptations among themselves, which enables them to develop meaningful features independently of the others.

VGG16

This architecture is from the VGG group, Oxford. It improves on AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3X3 kernel-sized filters one after another. For a given receptive field (the effective area of the input image on which an output depends), multiple stacked smaller kernels are better than one larger kernel, because multiple non-linear layers increase the depth of the network, which enables it to learn more complex features, and at a lower cost.

For example, three 3X3 filters stacked on top of each other with stride 1 have a receptive field of size 7, but the number of parameters involved is 3*(9C^2) = 27C^2, in comparison to 49C^2 parameters for a single kernel of size 7. Here, it is assumed that the number of input and output channels of each layer is C. Also, 3X3 kernels help in retaining finer-level properties of the image. The network architecture is given in the table.
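
The arithmetic in this example can be reproduced with a short sketch. The helper names and the choice C = 256 are hypothetical, picked only for illustration; biases are ignored, as in the text.

```python
def stacked_receptive_field(kernel, layers, stride=1):
    """Receptive field of `layers` identical conv layers stacked on each other."""
    rf, jump = 1, 1
    for _ in range(layers):
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # spacing between neighbouring outputs, in input pixels
    return rf


def conv_weights(kernel, channels, layers=1):
    """Weight count (no biases) with `channels` input and output channels per layer."""
    return layers * kernel * kernel * channels * channels


C = 256  # assumed channel count, for illustration only
print(stacked_receptive_field(3, layers=3))  # 7
print(stacked_receptive_field(7, layers=1))  # 7
print(conv_weights(3, C, layers=3))          # 27 * C^2 = 1769472
print(conv_weights(7, C))                    # 49 * C^2 = 3211264
```

Both options cover the same 7X7 input window, but the stacked 3X3 version uses 27/49 (about 55%) of the weights and passes through three non-linearities instead of one.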

You can see that in VGG-D, there are blocks in which the same filter size is applied multiple times to extract more complex and representative features. This concept of blocks/modules became a common theme in the networks after VGG.

The VGG convolutional layers are followed by 3 fully connected layers. The width of the network starts at a small value of 64 and increases by a factor of 2 after every sub-sampling/pooling layer, until it reaches 512. It achieves a top-5 accuracy of 92.3% on ImageNet.
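
As a rough illustration of this layout, the sketch below counts the parameters of a VGG16-style network. The layer list is an assumption based on the commonly cited VGG16 configuration (13 3X3 convolutional layers, 5 max-pooling layers, then FC layers of 4096, 4096 and 1000 units on a 224x224 RGB input); it is not taken from the table referenced above.

```python
# Assumed VGG16 layout: numbers are output channels of 3x3 conv layers,
# "M" marks a 2x2 max-pooling layer that halves the spatial size.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_param_count(cfg=VGG16_CFG, in_channels=3, input_size=224,
                      fc_units=(4096, 4096, 1000)):
    """Count weights and biases of the conv and FC layers."""
    total = 0
    channels, size = in_channels, input_size
    for item in cfg:
        if item == "M":
            size //= 2                                # pooling has no parameters
        else:
            total += 3 * 3 * channels * item + item   # 3x3 kernel weights + biases
            channels = item
    features = channels * size * size                 # flattened input to the first FC layer
    for units in fc_units:
        total += features * units + units
        features = units
    return total


print(f"{vgg16_param_count():,}")  # 138,357,544
```

Under these assumptions most of the roughly 138 million parameters sit in the three fully connected layers rather than in the convolutional layers.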

Time: 2024-08-29 22:24:32
