1. Problem Definition
There is no doubt that video-based research and applications have become a
popular field, covering intelligent surveillance, human-machine interaction,
content-based video retrieval and so on. However, it is also a research
direction full of challenges. We want to design and implement a simple system
whose job is to recognise a limited set of actions. For example, if we want the
system to recognise walking, running and jumping, then given an unlabelled video
containing one of these actions, the system should tell us which action it has
detected, as shown in Figure 1. For more details about human activity analysis,
please refer to [1].
2. Design
This section describes our thinking while designing the system and gives a
brief introduction to the relevant techniques in each module. The overall
processing flowchart is shown in Figure 2.
2.1 Foreground Objects Extraction
We are interested in the actions in a video, so the foreground object matters
more to us than the background. We need an efficient way to detect and segment
foreground objects in a video stream. Once we have the foreground object in
each frame, we can concentrate on analysing the useful data. A commonly used
approach to extracting foreground objects from an image sequence is background
subtraction, which applies when the video is captured by a stationary camera and
the background is assumed to be the same in all videos. This requirement is too
strict for us, because our videos may come from anywhere in the world and have
different backgrounds. We need a method that can adapt, at least to some degree,
to changes in the background. The Gaussian Mixture Model (GMM) [7] and optical
flow [2] are considered promising methods. In GMM, each pixel in the scene is
modelled by a mixture of K (typically 3 to 5) Gaussian distributions. The
optical flow method exploits the consistency of optical flow over a short period
of time and can detect foreground objects in complex outdoor scenes; its main
drawback is a high computation cost. After weighing the strengths and weaknesses
of the three methods, we choose GMM as the foundation of the foreground object
extraction module in the system.
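To make the per-pixel mixture idea concrete, here is a minimal sketch of a simplified mixture update for a single grayscale pixel, in the spirit of [7]. It is not the implementation used in this system. The class name `PixelGMM`, the learning rate `alpha`, the variance settings and the 2.5-sigma matching rule are assumptions chosen for illustration, and the update uses a constant learning rate rather than the likelihood-weighted one of the original method. `T` plays the role of the background threshold on cumulative weight.

```python
import math


class PixelGMM:
    """Simplified per-pixel Gaussian mixture for one grayscale pixel."""

    def __init__(self, k=3, alpha=0.05, T=0.7, init_var=225.0, min_var=4.0):
        self.k = k                # number of Gaussians K
        self.alpha = alpha        # learning rate (assumed value)
        self.T = T                # background threshold on cumulative weight
        self.init_var = init_var  # variance given to a newly created component
        self.min_var = min_var    # floor so a static pixel keeps a usable match range
        self.comps = []           # each component is [weight, mean, variance]

    def update(self, x):
        """Feed one intensity value; return True if it is classified as foreground."""
        # Rank components by weight / sigma: stable, frequent modes come first.
        self.comps.sort(key=lambda c: c[0] / math.sqrt(c[2]), reverse=True)

        # The first components whose cumulative weight exceeds T model the background.
        n_bg, cum = 0, 0.0
        for weight, _, _ in self.comps:
            cum += weight
            n_bg += 1
            if cum > self.T:
                break

        # A value matches a component if it lies within 2.5 standard deviations.
        matched = None
        for i, (_, mean, var) in enumerate(self.comps):
            if abs(x - mean) <= 2.5 * math.sqrt(var):
                matched = i
                break
        is_foreground = matched is None or matched >= n_bg

        if matched is None:
            new = [self.alpha, float(x), self.init_var]
            if len(self.comps) < self.k:
                self.comps.append(new)
            else:
                self.comps[-1] = new  # replace the lowest-ranked component
        else:
            # Raise the matched weight, decay the others, then adapt mean and variance.
            for i, c in enumerate(self.comps):
                c[0] = (1 - self.alpha) * c[0] + (self.alpha if i == matched else 0.0)
            c = self.comps[matched]
            diff = x - c[1]
            c[1] += self.alpha * diff
            c[2] = max((1 - self.alpha) * c[2] + self.alpha * diff * diff, self.min_var)

        # Keep the weights normalised so that they sum to one.
        total = sum(c[0] for c in self.comps)
        for c in self.comps:
            c[0] /= total
        return is_foreground
```

A full foreground extractor would keep one such model per pixel (or use a vectorised library implementation); a pixel that has shown the value 100 for a while reports a sudden 200 as foreground and a return to 100 as background.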
2.2 Action Feature Extraction
Action feature extraction is the key component of an action recognition system.
A method that both describes the characteristics of each action appropriately
and magnifies the differences between actions as much as possible would be a
breakthrough in the field. Here, the well-known Motion History Image (MHI) [6]
is used to generate a compact, descriptive representation of the motion
information in the video. MHI compacts the whole motion sequence into a single
image, which is in effect a weighted sum of past foreground objects whose
weights decay back through time. An MHI therefore contains the past foreground
objects within itself, with the most recent one brighter than the earlier ones.
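The decaying superposition described above can be sketched as follows. This is a minimal illustration assuming NumPy and binary foreground masks of equal shape. The function names and the values of `tau` (brightness of the newest motion) and `decay` (amount subtracted per step) are assumptions, not values from this system. The optional `step` argument subsamples the mask sequence, in the way the implementation section integrates only every N-th foreground object.

```python
import numpy as np


def update_mhi(mhi, mask, tau=255.0, decay=32.0):
    """One MHI step: moving pixels are set to tau, all others fade towards zero."""
    faded = np.maximum(mhi - decay, 0.0)
    return np.where(mask, tau, faded)


def build_mhi(masks, step=1, tau=255.0, decay=32.0):
    """Fold a sequence of binary foreground masks into a single motion history image."""
    masks = list(masks)
    mhi = np.zeros(masks[0].shape, dtype=float)
    for mask in masks[::step]:  # step > 1 keeps only every step-th mask
        mhi = update_mhi(mhi, mask.astype(bool), tau, decay)
    return mhi
```

With three one-row masks in which the moving pixel shifts one position to the right per frame, the result gets brighter from left to right, which is exactly the "most recent is brightest" property.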
2.3 Dimensionality Reduction
Because the obtained motion features are high-dimensional, dimensionality
reduction becomes an essential part of the system. On the one hand, it extracts
the main features of the data; on the other hand, it helps to weaken the
influence of noise. There are many algorithms, such as PCA [4,5], LPP [5],
LDA [5], Laplacian Eigenmaps [5] and the corresponding kernel versions. We
decided to use PCA in our dimensionality reduction module. PCA is a simple,
classical method whose advantage lies in preserving much of the intrinsic
information in the data. That is to say, the compressed data can be used to
reconstruct an approximation of the original.
2.4 Learning and Recognition
Our ultimate goal is for the system to recognize actions correctly, so the
classifier used for learning and recognition plays a crucial role. Classifiers
such as the Bayes classifier [9], decision trees [8] and support vector
machines (SVM) [10] have been widely used in machine learning. We chose SVM for
this module. Originally, SVM was a technique for binary classification; later it
was extended to regression and clustering problems. SVM maps feature vectors
into a higher-dimensional space using a kernel function and builds an optimal
separating hyperplane from the training data to perform classification.
3. Implementation and Experiments
3.1 Dataset
All the video samples in the experiments were made with Pose Pro 2012 and show
five people in the same scene waving hands, jumping and walking. The shooting
angle of the camera in the scene ranges from -50° to 50°. Figure 4 illustrates a
few frames of each action.
We split the whole dataset into a training set and a testing set, as shown in
Table 1. The suggested ratio between training data and testing data is 7:3.
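A 7:3 split can be sketched in a few lines. The helper name `split_dataset` and the fixed `seed` are assumptions for illustration. The text does not say whether the split was made per video or per person, so this sketch simply shuffles whatever list of samples it is given.

```python
import random


def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle the samples reproducibly and cut them into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

For ten samples this yields seven for training and three for testing, with no sample appearing in both parts.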
3.2 Extracting Foreground Objects
From Figure 5, we can see that the foreground object extraction is not as
perfect as we expected. What we get are the moving parts of the object, not the
whole object. GMM only picks up moving parts whose color differs noticeably from
the background, which appears to be a drawback of GMM. It does not affect the
subsequent motion feature extraction much, however: looked at another way, the
kind of action an object is performing is determined only by its moving parts.
In addition, the parameters of GMM, especially the background threshold T
(T=0.7 in our experiment), influence its performance greatly. Because the
optimal parameters change with the surrounding environment, tuning them is not
easy.
3.3 Constructing MHI
Figure 6 shows the corresponding motion history images. As mentioned before, an
MHI is in effect a weighted superimposition of foreground objects. There is no
need to include every extracted foreground object in the MHI. It is better to
sample the foreground object sequence and integrate only every N-th item (N=3 in
our experiment) into the MHI. This trick not only reduces the computation cost
but also magnifies the difference between similar actions such as walking and
running. The main defect of MHI is that, with only one camera, the shooting
angle is a major determinant of its performance. For instance, when the camera
is directly in front of a person, it may be difficult to tell from an MHI
whether he is walking, running or standing.
3.4 Dimensionality Reduction and Reconstruction
We compress the MHI to simplify the action features further and so reduce the
burden on the learning and recognition module. PCA is a linear method and easy
to implement. The compressed data can be used to reconstruct an approximation of
the original data. In Figure 7, the images in the first column are the original
motion history images, and the images in the second, third and fourth columns
are reconstructed from compressed data with 10, 20 and 60 principal components
respectively. It can easily be observed that the more principal components are
used, the more similar the reconstructed image is to the original. With 10
principal components the result already looks good, so we compress our MHIs
with 10 principal components here.
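Compression and reconstruction with PCA can be sketched as below. This is a minimal illustration assuming NumPy and a data matrix `X` whose rows are flattened MHIs. The function names are hypothetical, and the principal axes are obtained from an SVD of the centred data, which is one common way to compute PCA and not necessarily the one used here.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the data mean and the top principal axes (rows) of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_compress(X, mean, components):
    """Project samples onto the principal axes."""
    return (X - mean) @ components.T


def pca_reconstruct(Z, mean, components):
    """Map compressed samples back to the original space."""
    return Z @ components + mean


def reconstruction_error(X, n_components):
    """Mean squared error between X and its PCA reconstruction."""
    mean, comps = pca_fit(X, n_components)
    X_hat = pca_reconstruct(pca_compress(X, mean, comps), mean, comps)
    return float(np.mean((X - X_hat) ** 2))
```

Calling `reconstruction_error` with an increasing number of components gives a non-increasing error, which matches the trend visible in Figure 7.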
3.5 Optimizing Parameters of SVM
We choose the radial basis function (RBF) as the kernel function of the SVM.
Its parameter γ is critical to performance. The penalty factor C also plays an
important role in the soft margin, which allows for mislabelled samples. The
optimal parameters must be found for each specific dataset in order to obtain
the best recognition result. Cross-validation [3] and grid search [3] are used
in the optimization process.
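For reference, the RBF kernel measures the similarity of two feature vectors as exp(-γ‖x − y‖²). The small sketch below (the function name `rbf_kernel` is chosen for illustration) shows why γ matters: a larger γ makes the similarity fall off faster with distance, so the decision boundary becomes more local.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance) for two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Identical vectors always give 1.0, and for a fixed pair of distinct vectors the value shrinks as γ grows.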
3.5.1 Cross Validation
The cross-validation procedure helps guard against overfitting. In v-fold
cross-validation, we first divide the training set into v (v=10 in our
experiment) subsets of equal size. Each subset in turn is tested using the
classifier trained on the remaining v-1 subsets. Each instance of the training
set is thus predicted exactly once, so the cross-validation accuracy is the
percentage of data that are correctly classified.
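The v-fold procedure can be sketched in plain Python. The function names are hypothetical, and the nearest-centroid classifier on one-dimensional features is only a stand-in so that the example runs on its own; in the system described here the classifier would be the SVM. The folds are formed by simple round-robin assignment, so real data should be shuffled (or stratified by class) first.

```python
def k_fold_indices(n, v):
    """Assign indices 0..n-1 to v folds whose sizes differ by at most one."""
    folds = [[] for _ in range(v)]
    for i in range(n):
        folds[i % v].append(i)
    return folds


def cross_val_accuracy(X, y, v, fit, predict):
    """Predict every sample once, using a model trained on the other v-1 folds."""
    correct = 0
    for fold in k_fold_indices(len(X), v):
        held_out = set(fold)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        model = fit(train_X, train_y)
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)


def fit_centroids(X, y):
    """Stand-in classifier: remember the mean feature value of each class."""
    sums = {}
    for x, t in zip(X, y):
        total, count = sums.get(t, (0.0, 0))
        sums[t] = (total + x, count + 1)
    return {t: total / count for t, (total, count) in sums.items()}


def predict_centroid(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda t: abs(model[t] - x))
```

On two well-separated toy classes the cross-validation accuracy is 1.0, since every held-out sample lies next to the centroid of its own class.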
3.5.2 Grid Search
We recommend a “grid search” on C and γ to find approximately optimal
parameters using cross-validation. Various pairs of (C,γ) values are tried and
the one with the best cross-validation accuracy is picked. Trying exponentially
growing sequences of C and γ is a practical way to identify good parameters (for
example, C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3). Grid search
will not necessarily find the optimal pair (C,γ), but because searching the
parameter space more exhaustively is too expensive, we are satisfied with an
approximate optimum that is cheap to obtain. Furthermore, grid search is easily
parallelized because each (C,γ) pair is evaluated independently. Since a
complete grid search may still be time-consuming, we recommend a neighbor
search: use a coarse grid first to identify a “better” region, then conduct a
finer grid search on that region. We can specify a candidate interval for C (or
γ) together with the grid spacing; all grid points (C,γ) are then tried to find
the one giving the highest cross-validation accuracy. The best parameters are
then used to train on the whole training set and generate the final model.
Figure 8 is the contour plot of cross-validation accuracy during parameter
optimization.
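The coarse-to-fine search can be sketched as follows. All names are hypothetical, and the fine-grid width (±2 in the exponent) and spacing (0.25) are assumptions. `toy_score` is a synthetic stand-in for the cross-validation accuracy of an SVM trained with the given (C,γ), so the numbers it produces say nothing about the real system.

```python
import math


def exp_grid(start, stop, step):
    """Exponents start, start + step, ..., stop (inclusive)."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n + 1)]


def grid_search(score, c_exps, g_exps):
    """Evaluate score(C, gamma) on every grid point; return (best score, log2 C, log2 gamma)."""
    best = None
    for ce in c_exps:
        for ge in g_exps:
            s = score(2.0 ** ce, 2.0 ** ge)
            if best is None or s > best[0]:
                best = (s, ce, ge)
    return best


def coarse_to_fine(score):
    """Coarse grid over the suggested ranges, then a finer grid around the best point."""
    _, ce, ge = grid_search(score, exp_grid(-5, 15, 2), exp_grid(-15, 3, 2))
    return grid_search(score,
                       exp_grid(ce - 2, ce + 2, 0.25),
                       exp_grid(ge - 2, ge + 2, 0.25))


def toy_score(C, gamma):
    """Synthetic score peaking at C = 2^3.5, gamma = 2^-6.25 (illustration only)."""
    return -((math.log2(C) - 3.5) ** 2 + (math.log2(gamma) + 6.25) ** 2)
```

On the toy score, the coarse pass lands on (2^3, 2^-7) and the fine pass then recovers the peak at (2^3.5, 2^-6.25). Because every grid point is scored independently, the two loops in `grid_search` are also the natural place to parallelize.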
3.6 Actions Recognition
Table 2 shows the confusion matrix obtained from our system. Each element on
the diagonal is the number of correctly recognised samples of the corresponding
action in the testing set. The element in the i-th row and j-th column is the
number of samples of the j-th action that are labelled as the i-th action. The
recognition accuracy is quite satisfying and the system meets our basic
requirements, even though we only tested it on a special dataset.
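The row and column convention above can be made explicit with a short sketch. The function names and the toy labels in the usage note are invented for illustration and are not results from this system. Rows index the predicted action and columns the true action, as described in the text.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """m[i][j] counts samples of true class j that were labelled as class i."""
    index = {c: k for k, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        m[index[p]][index[t]] += 1
    return m


def accuracy(m):
    """Fraction of samples on the diagonal, i.e. correctly recognised."""
    total = sum(sum(row) for row in m)
    return sum(m[k][k] for k in range(len(m))) / total
```

For example, with classes `["wave", "jump", "walk"]`, if one of two "wave" samples is mislabelled as "jump" and the remaining three samples are recognised correctly, the matrix has a 1 in row "jump", column "wave", and the accuracy is 4/5.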
4. Summary
Clearly, the system has many theoretical flaws, several of which were pointed
out in the design section. Besides, some aspects need to be improved:
- Insufficient training data. Videos containing the same action may differ a
lot from each other because of illumination, shooting angle, background and so
on.
- The MHI used to extract action features places strict requirements on the
videos. It seems to fail to uncover deeper underlying properties of each
action. That said, it is sometimes difficult to distinguish one action from
another without an explicit definition.
- No ability to recognise an action while it is in progress from just a few
frames, let alone complex group actions.
Plenty of work still needs to be done to build an excellent and practical
action recognition system:
- Developing an algorithm that describes the features of an action more
efficiently and accurately.
- Comparing other dimensionality reduction algorithms, such as LDA, with PCA.
Perhaps LDA works better than PCA in supervised learning.
- Semi-supervised learning sometimes performs better than supervised learning.
Maybe we should give it a try.
In a word, we need to explore the intrinsic characteristics of the specific
data and the relationships between them.
References
[1] J.K. Aggarwal and M.S. Ryoo. Human activity analysis: A review. ACM
Computing Surveys (CSUR), 43(3):16, 2011.
[2] J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow
techniques. International Journal of Computer Vision, 12(1):43–77, 1994.
[3] C.W. Hsu, C.C. Chang, C.J. Lin, et al. A practical guide to support
vector classification, 2003.
[4] A. Hyvärinen, J. Karhunen, and E. Oja. Principal component
analysis, 2001.
[5] E. Kokiopoulou, J. Chen, and Y. Saad. Trace optimization and
eigen-problems in dimension reduction methods. Numerical Linear Algebra with
Applications, 18(3):565–602, 2011.
[6] H. Meng, N. Pears, M. Freeman, and C. Bailey. Motion history histograms
for human action recognition. Embedded Computer Vision, pages 139–162, 2009.
[7] C. Stauffer and W.E.L. Grimson. Learning patterns of activity using
real-time tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions
on, 22(8):747–757, 2000.
[8] Decision tree. http://en.wikipedia.org/wiki/Decision_tree.
[9] Naive bayes classifier.
http://en.wikipedia.org/wiki/Naive_Bayes_classifier.
[10] Support vector machine.
http://en.wikipedia.org/wiki/Support_vector_machine.