Spatio-temporal feature extraction and representation for RGB-D human action recognition

The paper proposes a novel and effective framework that largely improves the performance of human action recognition using both RGB videos and depth maps. The key contribution is the sparse coding-based temporal pyramid matching approach (ScTPM) for feature representation: thanks to the pyramid structure and the sparse representation of extracted features, temporal information is well preserved and the approximation error is reduced. In addition, a novel Center-Symmetric Motion Local Ternary Pattern (CS-Mltp) descriptor is proposed to capture spatio-temporal features from RGB videos at low computational cost. Using the ScTPM-represented 3D joint features and the CS-Mltp features, both feature-level fusion and classifier-level fusion are explored, which further improves the recognition accuracy.

From the feature extraction perspective, a new local spatio-temporal feature descriptor, the Center-Symmetric Motion Local Ternary Pattern (CS-Mltp), is proposed to describe gradient-like characteristics of RGB sequences in both the spatial and temporal directions.

From the feature representation perspective, our contribution lies in the design of a temporal pyramid matching approach based on sparse coding of the extracted features to represent the temporal patterns, referred to as Sparse coding Temporal Pyramid Matching (ScTPM).
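The paper's exact pyramid configuration is not reproduced in these notes, but the core of ScTPM can be illustrated: split a sequence's per-frame sparse codes into temporal segments at several pyramid levels, max-pool the codes within each segment, and concatenate. Below is a minimal sketch under those assumptions (the levels, the choice of max pooling, and the array shapes are illustrative, not the paper's exact settings):

```python
import numpy as np

def sctpm_pool(codes, levels=(1, 2, 4)):
    """Temporal pyramid pooling of per-frame sparse codes.

    codes : (T, K) array, sparse code of each frame's feature.
    levels: number of equal temporal segments at each pyramid level.
    Returns one concatenated max-pooled vector of length K * sum(levels).
    """
    T, K = codes.shape
    pooled = []
    for n_seg in levels:
        # split the frame axis into n_seg roughly equal temporal segments
        for seg in np.array_split(codes, n_seg, axis=0):
            # max pooling keeps the strongest activation of each codeword;
            # empty segments can only arise when T < n_seg
            pooled.append(seg.max(axis=0) if len(seg) else np.zeros(K))
    return np.concatenate(pooled)

# e.g. T=30 frames, K=256 codewords -> vector of length 256*(1+2+4)=1792
h = sctpm_pool(np.random.rand(30, 256))
```

Max pooling is the usual companion of sparse coding (as in ScSPM for images), since it keeps the strongest codeword response within each segment.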

From the classification perspective, we evaluate both feature-level and classifier-level fusion of the two sources based on a fast and simple linear SVM classifier.

In the field of video-based human action recognition, the common approach is to design local spatio-temporal feature extraction and representation algorithms, which mainly involve three steps: (1) local interest point detection, such as the Spatio-Temporal Interest Points (STIP) [15] and the Cuboid detector [4]; (2) local feature description, such as the histogram of oriented gradients (HOG) [3] and the histogram of optical flow (HOF) [8]; (3) feature quantization and representation, such as K-means and Bag-of-Words (BoW).
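As a concrete illustration of step (3), here is a hedged sketch of K-means + BoW quantization using scikit-learn; the descriptor dimension, vocabulary size, and all variable names are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

# Quantize local descriptors into a Bag-of-Words histogram. `train_desc`
# stacks the HOG/HOF descriptors from all training videos.
train_desc = np.random.rand(10000, 162)       # e.g. 162-D HOG/HOF as in STIP
kmeans = KMeans(n_clusters=500, n_init=4).fit(train_desc)

def bow_histogram(video_desc, kmeans):
    # assign every local descriptor to its nearest visual word
    words = kmeans.predict(video_desc)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)        # L1-normalize

h = bow_histogram(np.random.rand(300, 162), kmeans)
```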

The proposed CS-Mltp descriptor has advantages in several aspects: first, it can be easily combined with any detector for action recognition, since we adopt a 16-bin coding scheme; second, it encodes both shape and motion characteristics to ensure high performance with stability.
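The precise CS-Mltp coding is defined in the paper; as a rough, assumption-laden sketch of the underlying idea, a center-symmetric local ternary pattern on a 3x3 neighborhood compares the 4 center-symmetric pixel pairs, quantizes each difference into {-1, 0, +1}, and splits the ternary pattern into a positive and a negative binary code, each taking 2^4 = 16 values, which is consistent with the 16-bin scheme mentioned above. The motion part would apply the same coding to differences of consecutive frames:

```python
import numpy as np

def cs_ltp_codes(patch, thr=5.0):
    """Center-symmetric LTP for the 3x3 neighborhood around patch[1, 1].

    A rough sketch of the idea behind CS-Mltp (the paper's exact coding
    may differ): quantize the 4 center-symmetric differences into
    {-1, 0, +1} with threshold `thr`, then split the ternary pattern
    into a positive and a negative binary code, each in [0, 16).
    """
    pairs = [((0, 0), (2, 2)), ((0, 1), (2, 1)),
             ((0, 2), (2, 0)), ((1, 2), (1, 0))]
    pos, neg = 0, 0
    for bit, (a, b) in enumerate(pairs):
        d = float(patch[a]) - float(patch[b])
        if d > thr:
            pos |= 1 << bit       # +1 goes into the positive code
        elif d < -thr:
            neg |= 1 << bit       # -1 goes into the negative code
    return pos, neg

# spatial channel: apply to 3x3 patches of a gray frame;
# temporal (motion) channel: apply to the difference of consecutive frames
```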

Depth Features

Since 3D joint features are frame-based, the different numbers of frames in each sequence require algorithms to provide solutions for "temporal alignment". Most existing algorithms solve this problem through temporal modeling, which models the temporal evolution of different actions. For example, the HMM is widely used to model temporal evolution [19,32,5]. The conditional random field (CRF) [6] predicts the motion patterns in the manifold subspace. Dynamic temporal warping (DTW) [22] computes the optimal alignments of motion templates composed of 3D joints. However, the noisy joint positions extracted by the skeleton tracker [26] may undermine the performance of these models, and the limited number of training samples makes these algorithms prone to overfitting.
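For reference, DTW itself is a textbook algorithm; a minimal sketch of the alignment it computes (not the paper's exact motion-template matching) looks like this:

```python
import numpy as np

def dtw_distance(x, y):
    """Textbook dynamic time warping between two sequences of frame
    features (e.g. per-frame 3D-joint vectors), illustrating the
    'temporal alignment' that [22] performs with motion templates."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# sequences of different lengths (40 vs 55 frames, 60-D joint vectors)
d = dtw_distance(np.random.rand(40, 60), np.random.rand(55, 60))
```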

Feature and classifier fusion

Different from work in the literature based on the HMM, our algorithm is based on the proposed ScTPM, and we explore two kinds of fusion schemes. Feature-level fusion concatenates the histograms generated from the two sources to form a longer histogram representation as the input to the classifier, and classifier-level fusion combines the classification results on both sources to generate the final result.

Spatio-temporal feature extraction

Spatio-temporal feature representation

In sparse coding, the dictionary V is learned in the training phase, which collects a large number of features from the training samples, by iteratively optimizing Eqs. (5) and (6). In the coding phase, the sparse codes are obtained by optimizing Eq. (5) given the learned V.
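Eqs. (5) and (6) are not reproduced in these notes; assuming the standard sparse-coding objective min over U, V of ||X - UV||^2 + lambda * ||U||_1 with norm-constrained dictionary atoms, scikit-learn's DictionaryLearning performs the same alternating optimization and can stand in for both phases (all sizes below are illustrative):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Training phase: learn the dictionary V from features collected over the
# training set, alternating between the coding and dictionary-update steps
# (the roles played by Eqs. (5) and (6)).
X = np.random.rand(5000, 60)                      # 5000 training features, 60-D
dl = DictionaryLearning(n_components=256, alpha=1.0,
                        transform_algorithm='lasso_lars', max_iter=20)
codes_train = dl.fit_transform(X)                 # sparse codes of X

# Coding phase: with V fixed, solve only the sparse-coding step (Eq. (5))
codes_new = dl.transform(np.random.rand(30, 60))  # codes for new features
```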


Classification

For the multi-class case, the linear SVM is equivalent to learning L linear functions.
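A minimal stand-in using scikit-learn's LinearSVC, whose default one-vs-rest mode learns exactly one linear function per class (the data shapes here are made up for illustration):

```python
import numpy as np
from sklearn.svm import LinearSVC

# One-vs-rest linear SVM: for L action classes it learns L linear
# functions w_c^T x + b_c and predicts argmax over c.
X = np.random.rand(200, 1792)                 # e.g. ScTPM histograms
y = np.random.randint(0, 16, size=200)        # 16 action classes
clf = LinearSVC(C=1.0).fit(X, y)
print(clf.coef_.shape)                        # (16, 1792): one w per class
pred = clf.predict(np.random.rand(5, 1792))
```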

To perform RGB-D human action recognition, we fuse the features of depth maps and color images at two levels: (1) feature-level fusion, where the histograms generated from the two sources are simply concatenated to form a longer histogram representation as the input to the classifier; and (2) classifier-level fusion, where the classifiers for the two sources are trained separately and classifier combination is performed subsequently to generate the final result.
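A hedged sketch of both fusion levels, reusing the LinearSVC stand-in from above (synthetic data; the paper's actual combination rule may differ, see the note on product vs. sum below):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-ins for the histograms of the two sources
h_depth = np.random.rand(200, 1792)   # from 3D joint features (ScTPM)
h_rgb   = np.random.rand(200, 1792)   # from CS-Mltp features
y = np.random.randint(0, 16, size=200)

# (1) feature-level fusion: concatenate, then train one classifier
clf_cat = LinearSVC().fit(np.hstack([h_depth, h_rgb]), y)

# (2) classifier-level fusion: train per-source classifiers, then
# combine their scores, e.g. by sum (an element-wise product of
# probability-like scores is another common rule)
clf_d = LinearSVC().fit(h_depth, y)
clf_r = LinearSVC().fit(h_rgb, y)
scores = clf_d.decision_function(h_depth) + clf_r.decision_function(h_rgb)
pred = scores.argmax(axis=1)
```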

Classifier fusion

Different fusion methods (element-wise product, or sum) lead to different fusion results.

Summary:

This paper involves quite a lot of work. The idea itself is actually easy to come up with, but hardly anyone has executed it as solidly as the authors.

Features: 3D joint features and CS-Mltp (since the latter describes motion information, it involves frame differences between consecutive frames).

I still don't quite understand that Cuboid detector.

