Action Classification from Still Images

As the final project in the Computer Vision course at my college with Prof. Marwan Torki, we worked on the PASCAL VOC 2010 Action Classification task. This project excited me to learn more about machine learning and computer vision in particular. Since that project, machine learning has become a hobby of mine, and the Kaggle platform has made it easier to learn new techniques from competitions.

Problem Description

We were asked to create a model that could successfully classify actions in still images. This was challenging because, from a single still image, it is hard to tell whether a person is walking or not.

Based on our approach (dense SIFT features with PCA and a random forest), we achieved an accuracy of 76%.

Of course, convolutional neural networks now achieve much better results.

Technical Aspects

In the summer of 2013, deep learning was not yet heavily used for image classification tasks; it needed a lot of training data and powerful hardware, while I already had experience in extracting features from images and building models.

Together with my teammate Omar Sourour, we did a literature review and tried many different features before reaching our best score.

We used MATLAB at the time; the Parallel Computing and Image Processing toolboxes made our lives easier. The only drawback was having to edit/write C MEX files for some algorithms that were not available in MATLAB, such as the random forest.

Our model was based on our literature review and online resources. We used SIFT to extract keypoint features from the input images: each image was divided into a grid, keypoints were extracted densely from the grid cells, and this gave us a large set of features (the SIFT descriptor plus RGB values) for each extracted keypoint.
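For illustration, here is a minimal sketch of that dense extraction step, assuming VLFeat's vl_dsift is on the path; the image name, step/bin sizes, and the way the RGB values are appended are placeholders rather than our exact settings:

```matlab
% Dense SIFT sketch using VLFeat's vl_dsift (assumed to be installed).
% Step/Size values below are illustrative, not our tuned parameters.
im   = imread('example.jpg');            % placeholder image path
gray = im2single(rgb2gray(im));          % vl_dsift expects single-precision grayscale

% Extract 128-D SIFT descriptors on a dense grid instead of sparse interest points.
[frames, descriptors] = vl_dsift(gray, 'Step', 8, 'Size', 4);

% frames(1:2,:) holds the (x,y) grid locations; append the RGB values there
% so every keypoint carries colour information as well.
locs     = round(frames(1:2, :));
rgbFeats = zeros(3, size(locs, 2));
for k = 1:size(locs, 2)
    rgbFeats(:, k) = squeeze(im(locs(2, k), locs(1, k), :));
end
imageFeatures = [double(descriptors); rgbFeats];   % 131 x N feature matrix
```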

We then reduced the dimensionality of the extracted features using PCA.
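A sketch of that step with MATLAB's pca function is below; allFeatures stands for the keypoint feature vectors stacked one per row over the training set, and the number of retained components is just a placeholder:

```matlab
% PCA reduction sketch: allFeatures is an N x 131 matrix (one feature vector
% per row); numComponents is a placeholder, not our tuned value.
numComponents = 60;

[coeff, score, ~, ~, explained, mu] = pca(allFeatures);
trainReduced = score(:, 1:numComponents);            % projected training features
fprintf('Variance kept: %.1f%%\n', sum(explained(1:numComponents)));

% At test time, centre with the training mean and project onto the same basis.
testReduced = bsxfun(@minus, testFeatures, mu) * coeff(:, 1:numComponents);
```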

The principal components were then fed to a random forest classifier trained using MATLAB.
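As a stand-in for the MEX-wrapped forest we actually used, the same step can be sketched with MATLAB's built-in TreeBagger (bagged decision trees); the variable names and tree count are placeholders:

```matlab
% Random forest sketch using TreeBagger in place of the C MEX library we used.
% trainX: PCA-reduced feature rows; trainY: corresponding action labels.
numTrees = 200;                                        % placeholder, not our tuned value
forest = TreeBagger(numTrees, trainX, trainY, 'Method', 'classification');

% Predict the action labels for new PCA-projected samples.
predictedLabels = predict(forest, testX);
```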

We used the Parallel Computing Toolbox to extract the SIFT features from the input images in parallel.
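A rough sketch of that parallel loop is below; imageFiles and extractDenseSift are placeholder names for the image list and the extraction routine sketched earlier:

```matlab
% Parallel feature extraction sketch with the Parallel Computing Toolbox.
parpool;                                  % open a worker pool (matlabpool on older releases)

numImages = numel(imageFiles);            % imageFiles: cell array of image paths (placeholder)
features  = cell(1, numImages);

parfor i = 1:numImages
    im = imread(imageFiles{i});
    features{i} = extractDenseSift(im);   % hypothetical helper wrapping the dense SIFT step
end
```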

This project gave me the boost and the excitement I needed to learn more about machine learning. It was awesome to make the computer predict a human pose or action from images.