Learning Speech Recognition from TF Speech Commands

Check out the code on GitHub.

At the end of 2017, Google launched a competition on Kaggle built around its Speech Commands dataset. In this competition we were challenged to predict simple commands from an input user speech clip, where each utterance is around 2 seconds long.

This competition attracted me as a way to gain more experience in speech recognition, which I believe is much like computer vision image problems; the only difference is that the input is audio, which is then transformed into a 2D feature set.

Introduction

When I joined this competition, I started learning more about speech recognition under the supervision of Prof. Mohamed Afify.

My first step in this contest was to learn about the features extracted from the input audio (MFCC, mel, and log-mel). I also started reading about how to make the model robust to noise in the input audio, so I worked on data augmentation to add white noise to the training utterances.
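As a rough sketch of the white-noise idea (the noise scale and function name here are illustrative, not the exact values I used):

```python
import numpy as np

def add_white_noise(waveform: np.ndarray, noise_factor: float = 0.005) -> np.ndarray:
    """Add Gaussian white noise to a mono waveform with samples in [-1, 1].

    noise_factor is an illustrative scale; in practice it is tuned or
    sampled randomly per utterance.
    """
    noise = np.random.randn(len(waveform))
    augmented = waveform + noise_factor * noise
    # Keep samples in a valid range after adding the noise.
    return np.clip(augmented, -1.0, 1.0).astype(waveform.dtype)
```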

To make experimentation easier, I wrote Python code driven by a config file, so that the input feature type and size, the model parameters, and the data augmentation techniques can all be chosen in one place.
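For illustration, such a config might look like the sketch below; the key names and values are placeholders rather than my exact schema:

```python
# Illustrative config; the keys mirror the parameters described later,
# but the names and values are assumptions, not the project's exact schema.
CONFIG = {
    "sample_rate": 16000,           # Hz, matches the Speech Commands recordings
    "clip_duration_ms": 1000,       # length each clip is padded/cropped to (illustrative)
    "window_size_ms": 30,           # analysis window for the spectrogram
    "window_stride_ms": 10,         # hop between successive windows
    "time_shift_ms": 100,           # random time shift applied during training
    "fingerprint_type": "log_mel",  # one of: "mfcc", "mel", "log_mel"
    "use_ctc": False,               # switch between CTC and plain classification models
    "augmentation": {
        "white_noise_prob": 0.5,
        "speed_range": [0.9, 1.1],
    },
}
```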

Dataset

The Speech Commands dataset is organized as one folder per keyword: each folder contains a set of WAV files, and the folder name specifies the keyword.
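A minimal sketch of walking that layout to collect (file, keyword) pairs; the helper name is just for illustration:

```python
from pathlib import Path

def list_utterances(data_dir: str):
    """Yield (wav_path, keyword) pairs from a Speech Commands style layout,
    where each sub-folder name is the keyword for the WAV files inside it."""
    for keyword_dir in sorted(Path(data_dir).iterdir()):
        if not keyword_dir.is_dir() or keyword_dir.name == "_background_noise_":
            continue  # skip the folder of background-noise recordings
        for wav_path in keyword_dir.glob("*.wav"):
            yield str(wav_path), keyword_dir.name
```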

Parameters Used for WAV Reading

These parameters specify the sampling rate of the input WAV files, the time shift in milliseconds, the training clip duration in milliseconds, the window size and window stride in milliseconds, the fingerprint type (MFCC, mel, or log-mel), and a CTC flag that indicates whether CTC-based models are used.
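As a rough illustration of how these parameters feed the feature extraction, the sketch below uses librosa; the defaults shown are placeholders, not my exact settings:

```python
import librosa

def extract_features(wav_path, sample_rate=16000, window_size_ms=30,
                     window_stride_ms=10, fingerprint_type="log_mel", n_mels=40):
    """Load a clip and compute mel / log-mel / MFCC features.
    Parameter values are illustrative defaults."""
    waveform, _ = librosa.load(wav_path, sr=sample_rate)
    n_fft = int(sample_rate * window_size_ms / 1000)       # window size in samples
    hop_length = int(sample_rate * window_stride_ms / 1000)  # stride in samples

    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sample_rate, n_fft=n_fft,
        hop_length=hop_length, n_mels=n_mels)

    if fingerprint_type == "mel":
        return mel
    log_mel = librosa.power_to_db(mel)
    if fingerprint_type == "log_mel":
        return log_mel
    # "mfcc": take the DCT of the log-mel energies.
    return librosa.feature.mfcc(S=log_mel, sr=sample_rate, n_mfcc=13)
```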

Models Used

Post Processing

After predicting the command found in an utterance, I tried using a language model to correct the words predicted by the previously trained CTC acoustic model.
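As a simplified stand-in for that step, the sketch below just snaps the decoded string to the nearest known command by edit distance; my actual post-processing used a language model, so treat this only as an illustration of the idea:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

COMMANDS = ["yes", "no", "up", "down", "left", "right",
            "on", "off", "stop", "go"]

def snap_to_command(decoded: str, commands=COMMANDS) -> str:
    """Map a (possibly misspelled) CTC decoding to the closest known command."""
    return min(commands, key=lambda c: edit_distance(decoded, c))
```

For example, a noisy decoding like "rihgt" snaps back to "right".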

The other trained models predicted the command class directly from each input utterance, with an "unknown" class covering any other command.

Data Augmentation

I used data augmentation to help the model generalize to speech utterances with speaking speeds not found in the training set; this technique improved the model.
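A minimal sketch of this kind of speed perturbation via plain resampling; the speed range and helper names are illustrative:

```python
import numpy as np

def change_speed(waveform: np.ndarray, speed: float) -> np.ndarray:
    """Resample the waveform by linear interpolation so it plays `speed` times
    faster (speed > 1) or slower (speed < 1); pitch shifts along with speed."""
    old_idx = np.arange(len(waveform))
    new_len = int(round(len(waveform) / speed))
    new_idx = np.linspace(0, len(waveform) - 1, new_len)
    return np.interp(new_idx, old_idx, waveform).astype(waveform.dtype)

def random_speed(waveform: np.ndarray, low: float = 0.9, high: float = 1.1) -> np.ndarray:
    """Apply a random speed factor per utterance (range is illustrative)."""
    return change_speed(waveform, np.random.uniform(low, high))
```

After perturbation the clip length changes, so it is padded or cropped back to the fixed training duration before feature extraction.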

Results

Evaluated on the Speech Commands test set.

Kaggle Results

My main focus in this contest was to get into speech recognition, so all of the experiments were for learning purposes; even so, I finished in the top 15% of this large contest of 1315 teams.
