CS231n Notes: Lecture 9, CNN Architectures

Review: LeNet-5

1998, by LeCun et al.: conv layers (5x5 filters, stride 1) alternating with subsampling (pooling) layers, followed by fully connected layers; applied to digit recognition.

Case Study: AlexNet [Krizhevsky et al. 2012]

It uses many modern techniques but is still shaped by historical constraints: the network was split across two GPUs, so the feature maps are separated into two streams, and it uses normalization layers that are no longer common. Kind of obsolete now, but it was the first CNN-based ILSVRC winner (2012).

ZFNet [Zeiler and Fergus, 2013]

Based on AlexNet but:

  • CONV1: change from (11x11 stride 4) to (7x7 stride 2)
  • CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

In short, ZFNet keeps the AlexNet design and plays with the hyperparameters.

ImageNet top-5 error drops from 16.4% to 11.7%.

After that, the trend is toward deeper networks.

Case Study: VGGNet [Simonyan and Zisserman, 2014]

Small filters, Deeper networks.

Only use 3x3 conv filters with stride 1, pad 1, and 2x2 max pool with stride 2.

Why this setting? A stack of three 3x3 conv layers (stride 1) has the same effective receptive field as a single 7x7 conv layer (after the stack, each neuron looks at the same patch of pixels), but the network gets deeper, so it retains more non-linearities and uses fewer parameters: 3*(3*3*C*C) vs. 7*7*C*C weights for C channels per layer.
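
As a quick sanity check of that parameter count, here is a plain-Python sketch (the channel count C = 256 is just an example, not a number from the lecture):

```python
# Compare a single 7x7 conv layer with a stack of three 3x3 conv layers,
# both mapping C input channels to C output channels (biases ignored).
def conv_params(k, c_in, c_out):
    """Number of weights in one conv layer with k x k filters."""
    return k * k * c_in * c_out

C = 256  # example channel count
single_7x7 = conv_params(7, C, C)      # 7*7*C*C
stack_3x3 = 3 * conv_params(3, C, C)   # 3*(3*3*C*C)

print(single_7x7)  # 3211264
print(stack_3x3)   # 1769472  -> fewer parameters, same 7x7 receptive field
```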

Details:
- ILSVRC’14 2nd in classification, 1st in localization
- Similar training procedure as Krizhevsky 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
- Use ensembles for best results (kind of common practice)
- FC7 features generalize well to other tasks

Case Study: GoogLeNet [Szegedy et al., 2014]

Stacks Inception modules and avoids expensive FC layers; the network is deeper (22 layers) yet much smaller and faster than AlexNet (about 5M parameters, roughly 12x fewer).

Inception module: apply filters of different sizes (1x1, 3x3, 5x5) and max pooling in parallel to the same input, then concatenate their outputs depth-wise (all branches are padded so their spatial sizes match). To keep this affordable, 1x1 conv "bottleneck" layers are placed before the expensive 3x3/5x5 convs (and after the pooling) to reduce the number of channels and thus the computational cost.
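
A minimal PyTorch sketch of such a module (the branch channel counts are illustrative assumptions, not the ones GoogLeNet actually uses):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convs plus max pooling, concatenated depth-wise.
    1x1 "bottleneck" convs reduce channels before the 3x3/5x5 convs and after the pool."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1),            # bottleneck
            nn.Conv2d(32, 64, kernel_size=3, padding=1))    # keeps spatial size
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),            # bottleneck
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))            # bottleneck after pool

    def forward(self, x):
        # All branches preserve spatial size, so we can concatenate on the channel dim.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# e.g. InceptionModule(192)(torch.randn(1, 192, 28, 28)).shape -> (1, 192, 28, 28)
```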

Full GoogLeNet architecture: a stem network (vanilla conv and pooling layers to get started) + stacked Inception modules + classifier output (global average pooling and a single linear layer instead of expensive FC layers). In addition, auxiliary classification outputs are attached at intermediate layers to inject additional gradient into the lower layers during training.

Case Study: ResNet [He et al., 2015]

Very deep networks using residual connections

  • 152-layer model for ImageNet
  • ILSVRC’15 classification winner (3.57% top-5 error, better than the reported human performance). Swept all classification and detection competitions in ILSVRC’15 and COCO’15!
  • Use Residual Block to make the optimization easier for deep nets.
    • Instead of making the layers learn to approximate the desired output mapping H(x) directly, let them learn the residual mapping F(x) = H(x) - x, so the block output is F(x) + x. In other words, the layers learn a delta on top of the input. Why? The authors' hypothesis is that very deep plain networks are harder to optimize, not that they overfit; with residual learning, a layer can fall back to the identity simply by driving F(x) toward zero, so adding layers should not hurt. Another explanation: the identity shortcut gives better gradient flow during backpropagation, so the shallower layers also get updated. (See the sketch after this list.)
    • For deeper variants (ResNet-50 and beyond), a "bottleneck" design improves efficiency: a 1x1 conv reduces the depth, a 3x3 conv operates at the reduced depth, and another 1x1 conv restores it. When the shortcut and the block output differ in shape, a 1x1 projection (or zero-padding along the channel dimension) makes the element-wise sum possible.
  • Full ResNet architecture
    • Stack residual blocks
    • Every residual block has two 3x3 conv layers
    • Periodically, double # of filters and downsample spatially using stride 2 (/2 in each dimension)
    • Additional conv layer at the beginning
    • No FC layers at the end besides a single FC-1000 to output the class scores (global average pooling is applied after the last conv layer)
  • Training ResNet in practice:
    • Batch Normalization after every CONV layer
    • Xavier/2 initialization from He et al.
    • SGD + Momentum (0.9)
    • Learning rate: 0.1, divided by 10 when validation error plateaus
    • Mini-batch size 256
    • Weight decay of 1e-5
    • No dropout used
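
A minimal PyTorch sketch of a basic residual block following the recipe above (an illustration under the assumptions noted in the comments, not the exact reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers learning the residual; the block outputs residual + x."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)   # BN after every conv layer
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # When the shape changes (stride 2 or more filters), project x with a
        # 1x1 conv so the element-wise sum is possible.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))   # H(x) = F(x) + x
```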

Comparisons

Best accuracy: Inception-v4 (a ResNet + GoogLeNet hybrid). Most efficient: GoogLeNet. VGG: highest memory use and the most operations. ResNet: moderate efficiency depending on the model, high accuracy.

More architectures

Network in Network (NiN) [Lin et al. 2014]

Precursor to the GoogLeNet and ResNet "bottleneck" layers: within each conv layer, a "micronetwork" (an MLP applied at every spatial position, i.e. 1x1 conv layers) computes more abstract features for local patches.
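
Since the micronetwork amounts to 1x1 convolutions, one NiN "mlpconv" layer can be sketched in PyTorch roughly like this (channel counts are arbitrary examples):

```python
import torch.nn as nn

# One NiN "mlpconv" layer: a normal conv followed by a tiny MLP applied at
# every spatial position, implemented as 1x1 convolutions.
mlpconv = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(),   # micronetwork layer 1
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(),   # micronetwork layer 2
)
```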

Identity Mappings in Deep Residual Networks [He et al. 2016]

From He et al. again: an improved ResNet block design that moves BN and ReLU before each conv ("pre-activation"), creating a more direct identity path through the network for better information propagation and improved performance.

Wide Residual Networks [Zagoruyko et al. 2016]

Emphasis on the residual blocks rather than on extreme depth: use more filters (wider blocks) within each residual block, so the network can be shallower while matching accuracy, which also makes it more parallelizable and computationally efficient.

Aggregated Residual Transformations for Deep Neural Networks (ResNeXt) [Xie et al. 2016]

From the creators of ResNet. Increases the width of the residual block through multiple parallel pathways ("cardinality"), similar in spirit to the Inception module.

Deep Networks with Stochastic Depth [Huang et al. 2016]

Randomly drop a subset of layers during each training pass, bypassing them with the identity function; this yields effectively shorter networks and better gradient flow during training. The full deep network is used at test time.
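
A rough PyTorch sketch of the training-time behavior, assuming a residual-style block and an example survival probability (it ignores the output rescaling used in the paper):

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Wraps a residual branch; during training the branch is randomly skipped."""
    def __init__(self, branch: nn.Module, survival_prob: float = 0.8):
        super().__init__()
        self.branch = branch
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training and torch.rand(1).item() > self.survival_prob:
            return x                   # layer dropped: identity bypass only
        return x + self.branch(x)      # otherwise behaves like a residual block
```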

FractalNet: Ultra-Deep Neural Networks without Residuals [Larsson et al. 2017]

Argues that the key is transitioning effectively from shallow to deep, and that residual mappings are not strictly necessary: the fractal architecture contains both shallow and deep paths to the output. Sub-paths are dropped during training, and the full network is used at test time.

Densely Connected Convolutional Networks [Huang et al. 2017]

DenseNet: uses dense blocks in which each layer is connected to every other layer in a feedforward fashion, i.e. each layer's input is the concatenation of all preceding feature maps. This alleviates vanishing gradients, strengthens feature propagation, and encourages feature reuse.
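
A minimal PyTorch sketch of the concatenation pattern inside a dense block (layer count and growth rate are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all previous feature maps."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Input to each layer = concatenation of all earlier outputs.
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)
        return torch.cat(features, dim=1)
```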

SqueezeNet: AlexNet-level Accuracy With 50x Fewer Parameters and <0.5MB Model Size [Iandola et al. 2017]
