Scene classification: excerpt

Low-level :

- SIFT : It describes a patch by histograms of gradients computed over a 4 × 4 spatial grid. The
gradients are quantized into eight orientation bins, so the final feature vector has 128 dimensions (4 × 4 × 8).
- LBP : Some works adopt LBP to extract texture information from aerial images, see [54,58]. For a
patch, it first compares each pixel to its 8 neighbors: when the neighbor's value is less than the center
pixel's, it outputs "1", otherwise "0". Reading these 8 bits as a binary number gives a decimal code in
[0, 255] that describes the center pixel. The LBP descriptor is obtained by computing the histogram of
these codes over the patch, resulting in a feature vector with 256 dimensions.
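As a minimal numpy sketch (the function name is hypothetical, and the bit convention follows the comparison rule stated above):

```python
import numpy as np

def lbp_histogram(patch):
    """Basic LBP: compare each interior pixel to its 8 neighbours, read the
    comparison bits as an 8-bit code, then histogram the codes (256 bins)."""
    patch = np.asarray(patch, dtype=np.int32)
    center = patch[1:-1, 1:-1]
    # Offsets of the 8 neighbours, ordered clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = patch[1 + dy: patch.shape[0] - 1 + dy,
                          1 + dx: patch.shape[1] - 1 + dx]
        # Neighbour < center contributes a "1" bit (the convention used here).
        codes |= ((neighbour < center).astype(np.int32) << bit)
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist  # 256-dimensional descriptor
```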
- Color histogram : Color histograms (CH) are used for extracting the spectral information of aerial
scenes. In our experiments, color histogram descriptors are computed separately in the three
channels of the RGB color space. Each channel is quantized into 32 bins, and simply concatenating
the three channels yields a histogram feature of length 96.
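The concatenation described above can be sketched in a few lines of numpy (the function name is hypothetical):

```python
import numpy as np

def color_histogram(img, bins_per_channel=32):
    """Concatenated per-channel histograms of an RGB image (uint8, H x W x 3):
    32 bins x 3 channels = a 96-dimensional descriptor."""
    feats = []
    for c in range(3):
        hist, _ = np.histogram(img[..., c], bins=bins_per_channel,
                               range=(0, 256))
        feats.append(hist)
    return np.concatenate(feats)
```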
- GIST : Unlike the aforementioned descriptors, which focus on local information, GIST represents the dominant spatial structure of a scene by a set of perceptual dimensions (naturalness, openness, roughness,
expansion, ruggedness) based on the spatial envelope model [61], and is thus widely used for describing
scenes. The descriptor is computed by convolving the gray image with Gabor filters at S scales and
D orientations over a 4 × 4 spatial grid. Concatenating the mean response of each grid cell gives a
GIST descriptor with 16 × S × D dimensions.
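A simplified numpy sketch of this pipeline, assuming a hand-rolled real Gabor bank and FFT convolution (filter parameters are illustrative, not the settings of [61]):

```python
import numpy as np

def gabor_kernel(size, theta, freq, sigma):
    """Real-valued Gabor kernel: a Gaussian envelope times an oriented cosine."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def gist_descriptor(gray, scales=(0.1, 0.25), n_orient=4, grid=4):
    """GIST-style descriptor: filter with S x D Gabor filters and average the
    response magnitude over a grid x grid partition -> 16 * S * D values."""
    gray = np.asarray(gray, dtype=np.float64)
    H, W = gray.shape
    F = np.fft.fft2(gray)
    feats = []
    for freq in scales:                        # S scales
        for d in range(n_orient):              # D orientations
            k = gabor_kernel(15, np.pi * d / n_orient, freq, sigma=3.0)
            resp = np.abs(np.fft.ifft2(F * np.fft.fft2(k, s=(H, W))))
            # Average the response magnitude in each cell of the 4 x 4 grid.
            for i in range(grid):
                for j in range(grid):
                    cell = resp[i * H // grid:(i + 1) * H // grid,
                                j * W // grid:(j + 1) * W // grid]
                    feats.append(cell.mean())
    return np.array(feats)
```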

Mid-Level:

- Bag of Visual Words (BoVW) models an image by discarding the spatial information and representing it with the frequencies of local visual words [57]. The BoVW model and its variants are widely used
in scene classification. The visual words are typically produced by clustering
local image descriptors into a dictionary of a given size K, e.g. with the k-means algorithm.
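A didactic numpy sketch of both steps (a plain k-means for the dictionary and nearest-word histogramming; function names are hypothetical):

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Plain k-means to build the visual dictionary (K cluster centres)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each descriptor to its nearest centre, then recompute centres.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(0)
    return centers

def bovw_encode(descriptors, centers):
    """Histogram of nearest visual words -> K-dimensional image feature."""
    d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    words = d.argmin(1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()  # normalised word frequencies
```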
- Spatial Pyramid Matching (SPM) uses a sequence of increasingly coarse grids to build a spatial
pyramid (with L levels) coding of local image descriptors. By concatenating the weighted local image
features of each subregion at all scales, one obtains a global feature vector of ((4^L − 1)/3) × K dimensions,
which is much longer than BoVW with the same dictionary size K.
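A minimal numpy sketch of the pyramid coding, using uniform cell weights for simplicity (the original SPM weights finer levels more heavily); the function name is hypothetical:

```python
import numpy as np

def spm_encode(words, positions, K, L=3):
    """Spatial-pyramid encoding of quantised descriptors.
    words: visual-word index per descriptor; positions: (x, y) in [0, 1).
    Level l = 0..L-1 partitions the image into 2^l x 2^l cells, so the
    feature length is ((4**L - 1) // 3) * K."""
    feats = []
    for l in range(L):
        cells = 2 ** l
        for i in range(cells):
            for j in range(cells):
                in_cell = ((positions[:, 0] * cells).astype(int) == i) & \
                          ((positions[:, 1] * cells).astype(int) == j)
                # One K-bin word histogram per spatial cell.
                feats.append(np.bincount(words[in_cell],
                                         minlength=K).astype(float))
    return np.concatenate(feats)
```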
- Locality-constrained Linear Coding (LLC) is an effective coding scheme adapted from sparse coding
methods. It replaces the sparsity constraints with locality constraints, which project each local descriptor
into its local coordinate system [70,86]. The final feature is generated by max pooling of the projected
coordinates and has the same size as the dictionary.
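The approximated LLC solution can be sketched as a small regularised least-squares problem over each descriptor's k nearest codewords (a simplification of [70]; the function name and parameters are illustrative):

```python
import numpy as np

def llc_encode(X, B, knn=5, beta=1e-4):
    """Approximate LLC: code each descriptor over its k nearest codewords,
    then max-pool the codes into one K-dimensional image feature."""
    codes = np.zeros((len(X), len(B)))
    for n, x in enumerate(X):
        idx = np.argsort(((B - x) ** 2).sum(1))[:knn]  # k nearest words
        z = B[idx] - x                                  # shift to the origin
        C = z @ z.T + beta * np.eye(knn)                # local covariance
        w = np.linalg.solve(C, np.ones(knn))
        codes[n, idx] = w / w.sum()                     # sum-to-one constraint
    return codes.max(0)  # max pooling over descriptors
```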
- Probabilistic Latent Semantic Analysis (pLSA) improves the BoVW model with a topic model. A latent variable called topic is introduced and defined by the conditional probability distribution
of visual words in the dictionary; it serves as a connection between the visual words and the images.
By describing an image with its distribution over topics (with the number of topics set to T), one can
mitigate the effects of synonymy and polysemy while reducing the feature dimension to T.
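A compact (didactic, not memory-efficient) EM sketch of pLSA on an images × words count matrix, returning the T-dimensional topic mixture per image:

```python
import numpy as np

def plsa(counts, T, iters=50, seed=0):
    """Minimal pLSA EM. counts: D x W matrix of word counts per image.
    Returns P(topic | image), a T-dimensional feature per image."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_z_d = rng.random((D, T)); p_z_d /= p_z_d.sum(1, keepdims=True)  # P(z|d)
    p_w_z = rng.random((T, W)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w|z)
    for _ in range(iters):
        # E-step: posterior P(z | d, w) for every (image, word) pair.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]          # D x T x W
        post = joint / (joint.sum(1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected counts.
        nz = counts[:, None, :] * post                         # D x T x W
        p_w_z = nz.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = nz.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d
```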
- Latent Dirichlet Allocation (LDA) is a generative topic model evolved from pLSA; the main
difference is that it places a Dirichlet prior on the latent topic variable instead of a fixed
distribution, and it is also widely used for scene classification. As a result, it can handle
the problem of overfitting and increases robustness. The dimension of the final feature vector
equals the number of topics T.
- Improved Fisher Kernel (IFK) uses a Gaussian Mixture Model (GMM) to encode local image features and achieves good performance in scene classification. In essence, the feature of an
image obtained by Fisher vector encoding is a gradient vector of the log-likelihood. By computing
and concatenating the partial derivatives with respect to the means and variances of the Gaussian components,
the final feature vector is obtained, with dimension 2 × K × F (where F is the dimension of the
local feature descriptors and K denotes the size of the dictionary).
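Given a fitted diagonal GMM, the mean and variance gradients can be sketched as follows (a simplification without the power/L2 post-normalisation of the improved kernel; the function name is hypothetical):

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Fisher vector w.r.t. the means and variances of a diagonal GMM.
    X: N x F local descriptors; K Gaussians -> 2 * K * F feature."""
    N, F = X.shape
    K = len(weights)
    # Posterior (soft-assignment) probability of each Gaussian per descriptor.
    log_p = np.stack([
        -0.5 * (((X - means[k]) ** 2 / variances[k]).sum(1)
                + np.log(variances[k]).sum() + F * np.log(2 * np.pi))
        + np.log(weights[k])
        for k in range(K)], axis=1)                      # N x K
    log_p -= log_p.max(1, keepdims=True)
    gamma = np.exp(log_p); gamma /= gamma.sum(1, keepdims=True)
    fv = []
    for k in range(K):
        diff = (X - means[k]) / np.sqrt(variances[k])    # normalised residual
        g = gamma[:, k:k + 1]
        # Gradients of the log-likelihood w.r.t. mean and variance of Gaussian k.
        d_mu = (g * diff).sum(0) / (N * np.sqrt(weights[k]))
        d_sig = (g * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * weights[k]))
        fv.extend([d_mu, d_sig])
    return np.concatenate(fv)
```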
- Vector of Locally Aggregated Descriptors (VLAD) can be seen as a simplification of IFK
that aggregates descriptors based on a locality criterion in feature space. It uses
non-probabilistic k-means clustering in place of the GMM to generate the dictionary. Each local
patch descriptor is assigned to its nearest neighbor in the dictionary, and the per-dimension
differences between them are accumulated, resulting in an image feature vector of dimension K × F.
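The residual accumulation can be sketched as (an L2-normalised variant; the function name is hypothetical):

```python
import numpy as np

def vlad_encode(X, centers):
    """VLAD: assign each descriptor to its nearest centre and accumulate the
    per-dimension residuals -> K * F feature."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    nn = d.argmin(1)                      # hard nearest-neighbour assignment
    K, F = centers.shape
    v = np.zeros((K, F))
    for k in range(K):
        if np.any(nn == k):
            v[k] = (X[nn == k] - centers[k]).sum(0)  # accumulate residuals
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-12)           # L2 normalisation
```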

High-Level:

- CaffeNet: Caffe (Convolutional Architecture for Fast Feature Embedding) is one of the most
commonly used open-source frameworks for deep learning (deep convolutional neural networks in particular). Its reference model, CaffeNet, is almost a replication of AlexNet [88], which was proposed
for the ILSVRC 2012 competition [78]. The main differences are: (1) there is no data augmentation
during training; (2) the order of normalization and pooling is switched. It therefore performs quite
similarly to AlexNet, see [4,41]. For this reason, we only test CaffeNet in our experiments.
The architecture of CaffeNet comprises 5 convolutional layers, each followed by a pooling layer, and 3
fully connected layers at the end. In our work, we directly use the pre-trained model obtained using
the ILSVRC 2012 dataset [78], and extract the activations from the first fully-connected layer, which
results in a vector of 4096 dimensions for an image.
- VGG-VD-16: To investigate the effect of convolutional network depth on accuracy in the large-scale
image recognition setting, [89] gives a thorough evaluation of networks of increasing depth using
an architecture with very small (3 × 3) convolution filters, which yields a significant improvement
in accuracy and generalises well to a wide range of tasks and datasets. In our work, we use
one of its best-performing models, VGG-VD-16, because of its simpler architecture and slightly
better results. It is composed of 13 convolutional layers followed by 3 fully connected layers, for
16 layers in total. As with CaffeNet, we extract the activations of the first fully connected layer as the
feature vectors of the images.
- GoogLeNet: This model [81] won the ILSVRC-2014 competition [78]. Its main novelty lies in the
design of the "Inception modules", which are based on the idea of "network in network" [90]. The
Inception modules give GoogLeNet two main advantages: (1) using filters of different sizes at the
same layer preserves multi-scale spatial information; (2) the reduced number of parameters makes
the network less prone to overfitting and allows it to be deeper and wider.
Specifically, GoogLeNet is a 22-layer architecture with more than 50 convolutional layers distributed
inside the Inception modules. Unlike the CNN models above, GoogLeNet has only one fully
connected layer at the end; therefore, we extract the features of this fully connected layer for testing.

Date: 2024-10-15 00:44:41
