The authors propose a novel and effective framework that substantially improves the performance of human action recognition using both RGB videos and depth maps. The key contribution is the proposal of a sparse coding-based temporal pyramid matching approach (ScTPM) for feature representation. Owing to the pyramid structure and the sparse representation of the extracted features, temporal information is well preserved and the approximation error is reduced. In addition, a novel Center-Symmetric Motion Local Ternary Pattern (CS-Mltp) descriptor is proposed to capture spatio-temporal features from RGB videos at low computational cost. Using the ScTPM-represented 3D joint features and the CS-Mltp features, both feature-level fusion and classifier-level fusion are explored, which further improves the recognition accuracy.
From the feature extraction perspective, a new local spatio-temporal feature descriptor, the Center-Symmetric Motion Local Ternary Pattern (CS-Mltp), is proposed to describe gradient-like characteristics of RGB sequences in both the spatial and temporal directions.
From the feature representation perspective, our contribution lies in the design of a temporal pyramid matching approach based on sparse coding of the extracted features to represent the temporal patterns, referred to as Sparse coding Temporal Pyramid Matching (ScTPM).
From the classification perspective, we evaluate both feature-level and classifier-level fusion of the two sources based on a fast and simple linear SVM classifier.
In the field of video-based human action recognition, the common approach is to design local spatio-temporal feature extraction and representation algorithms, which mainly involve three steps: (1) local interest point detection, such as Spatio-Temporal Interest Points (STIP) [15] and the Cuboid detector [4]; (2) local feature description, such as the histogram of oriented gradients (HOG) [3] and the histogram of optical flow (HOF) [8]; (3) feature quantization and representation, such as K-means and Bag-of-Words (BoW).
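To make the three-step pipeline concrete, here is a minimal sketch of the quantization step (step 3), with detection and description stubbed out by random descriptors; all names below, such as `bow_histogram`, are illustrative and not from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Steps 1+2 (stubbed): pretend each video yields a set of local descriptors,
# e.g. HOG/HOF vectors around detected interest points.
train_descriptors = [rng.normal(size=(120, 72)) for _ in range(10)]  # 10 videos

# Step 3a: build a visual vocabulary with K-means over all training descriptors.
K = 64
codebook = KMeans(n_clusters=K, n_init=10, random_state=0)
codebook.fit(np.vstack(train_descriptors))

# Step 3b: Bag-of-Words representation -- histogram of codeword assignments.
def bow_histogram(descriptors, codebook, K):
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()  # L1-normalize so videos of different lengths compare

video_hist = bow_histogram(train_descriptors[0], codebook, K)
print(video_hist.shape)  # (64,)
```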
The proposed CS-Mltp descriptor has advantages in several aspects: first, it can easily be combined with any detector for action recognition since we adopt a 16-bin coding scheme; second, it encodes both shape and motion characteristics to ensure high performance with stability.
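As a rough illustration of the idea, here is a simplified center-symmetric local ternary pattern computed on a temporal frame difference. The upper/lower pattern split (each 4 bits, hence 16-bin histograms) follows the usual LTP convention; the paper's exact CS-Mltp coding, threshold, and neighborhood may differ:

```python
import numpy as np

def cs_ltp_codes(img, t=5.0):
    """4 center-symmetric pixel pairs in a 3x3 neighborhood -> two 4-bit codes."""
    h, w = img.shape
    upper = np.zeros((h - 2, w - 2), dtype=np.int32)
    lower = np.zeros((h - 2, w - 2), dtype=np.int32)
    pairs = [
        (img[0:-2, 1:-1], img[2:, 1:-1]),    # N  vs S
        (img[0:-2, 2:],   img[2:, 0:-2]),    # NE vs SW
        (img[1:-1, 2:],   img[1:-1, 0:-2]),  # E  vs W
        (img[2:, 2:],     img[0:-2, 0:-2]),  # SE vs NW
    ]
    for bit, (a, b) in enumerate(pairs):
        d = a.astype(float) - b.astype(float)
        upper |= (d > t).astype(np.int32) << bit   # +1 states of the ternary code
        lower |= (d < -t).astype(np.int32) << bit  # -1 states
    return upper, lower  # each value in [0, 15]

def motion_cs_ltp_hist(frame_prev, frame_cur, t=5.0):
    diff = frame_cur.astype(float) - frame_prev.astype(float)  # temporal gradient
    upper, lower = cs_ltp_codes(diff, t)
    h_up = np.bincount(upper.ravel(), minlength=16)
    h_lo = np.bincount(lower.ravel(), minlength=16)
    return np.concatenate([h_up, h_lo]).astype(float)  # two 16-bin histograms

rng = np.random.default_rng(0)
f0, f1 = rng.integers(0, 256, size=(2, 64, 64))
print(motion_cs_ltp_hist(f0, f1).shape)  # (32,)
```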
Depth Features
Since 3D joint features are frame-based, the varying number of frames across sequences requires algorithms to provide a solution for "temporal alignment". Most existing algorithms address this through temporal modeling of how different actions evolve over time. For example, the HMM is widely used to model temporal evolution [19,32,5]. The conditional random field (CRF) [6] predicts motion patterns in a manifold subspace. Dynamic temporal warping (DTW) [22] computes optimal alignments of motion templates composed of 3D joints. However, the noisy joint positions extracted by the skeleton tracker [26] may undermine the performance of these models, and the limited number of training samples makes these algorithms prone to overfitting.
Feature and classifier fusion
Unlike prior work based on HMMs, our algorithm builds on the proposed ScTPM, and we explore two fusion schemes. Feature-level fusion concatenates the histograms generated from the two sources into a longer histogram representation as the input to the classifier, while classifier-level fusion combines the classification results from both sources to generate the final result.
Spatio-temporal feature extraction
Spatio-temporal feature representation
In sparse coding, the dictionary V is learned in the training phase, which collects a large number of features from the training samples and iteratively optimizes Eqs. (5) and (6). In the coding phase, the sparse codes are obtained by optimizing Eq. (5) given the learned V.
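Here is a minimal sketch of the ScTPM pipeline, assuming Eqs. (5) and (6) are the standard alternating code/dictionary updates of sparse coding (here delegated to scikit-learn) and max pooling over a 3-level temporal pyramid; the segment layout and pooling choice are my assumptions:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

K = 128  # dictionary size

# Training phase: learn dictionary V from features collected across training videos.
train_feats = np.random.default_rng(0).normal(size=(500, 60))  # e.g. frame-level 3D joint features
dl = DictionaryLearning(n_components=K, alpha=1.0, max_iter=20, random_state=0)
dl.fit(train_feats)
V = dl.components_  # (K, 60)

def sctpm(frame_feats, V, levels=3, alpha=1.0):
    """Sparse-code per-frame features, then max-pool over a temporal pyramid."""
    U = sparse_encode(frame_feats, V, algorithm='lasso_lars', alpha=alpha)  # (T, K)
    T = U.shape[0]
    pooled = []
    for level in range(levels):
        n_seg = 2 ** level  # 1, 2, 4 temporal segments per level
        for s in range(n_seg):
            lo, hi = (T * s) // n_seg, (T * (s + 1)) // n_seg
            pooled.append(np.abs(U[lo:hi]).max(axis=0))  # max pooling per segment
    return np.concatenate(pooled)  # ((1+2+4) * K,) for 3 levels

video = np.random.default_rng(1).normal(size=(45, 60))  # one sequence, 45 frames
print(sctpm(video, V).shape)  # (896,)
```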
Feature representation
Classification
For multi-class classification, the linear SVM is equivalent to learning L linear functions. To perform RGB-D human action recognition, we fuse the features of depth maps and color images at two levels: (1) feature-level fusion, where the histograms generated from the two sources are simply concatenated to form a longer histogram representation as the input to the classifier, and (2) classifier-level fusion, where the classifiers for the two sources are trained separately and classifier combination is performed subsequently to generate the final result.
Classifier fusion
Different fusion rules (element-wise product, or sum) yield different fusion results.
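To make both fusion levels and the two combination rules concrete, here is a minimal sketch with linear SVMs on toy pre-computed histograms; the softmax conversion of SVM scores and the sum/product rules below are common stand-ins, not necessarily the paper's exact combination:

```python
import numpy as np
from sklearn.svm import LinearSVC
from scipy.special import softmax

rng = np.random.default_rng(0)
n = 60
y = rng.integers(0, 4, size=n)          # 60 samples, L = 4 classes
H_depth = rng.random(size=(n, 896))     # ScTPM histograms from 3D joints (depth)
H_rgb = rng.random(size=(n, 448))       # ScTPM histograms from CS-Mltp (RGB)

# (1) Feature-level fusion: concatenate, then a single linear SVM.
clf_feat = LinearSVC().fit(np.hstack([H_depth, H_rgb]), y)

# (2) Classifier-level fusion: one SVM per source, then combine scores.
clf_d = LinearSVC().fit(H_depth, y)
clf_r = LinearSVC().fit(H_rgb, y)
P_d = softmax(clf_d.decision_function(H_depth), axis=1)  # scores -> pseudo-probabilities
P_r = softmax(clf_r.decision_function(H_rgb), axis=1)

pred_sum = (P_d + P_r).argmax(axis=1)   # sum rule
pred_prod = (P_d * P_r).argmax(axis=1)  # product rule
```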
Summary:
This paper involves a substantial amount of work. The core idea is actually easy to come up with, but few execute it as solidly as the authors do.
Features: 3D joint features and CS-Mltp (since the latter describes motion information, it involves frame differences between consecutive frames).
I still don't quite understand the Cuboid detector.