SVHN Classification
Given the Street View House Numbers (SVHN) dataset, train a model that can classify the digits appearing in similar images.
Implementation
There are numerous considerations to be made when building out a custom model. The first thing to figure out is what data you're dealing with. The SVHN database provides two datasets: the first contains full images with number sequences, and the second is an MNIST-like dataset for single-digit classification. Although the task was to identify a sequence of numbers, I ended up choosing the second dataset due to the simplicity of its data format. Given more time, I would have liked to train on the first dataset as well. Expanding on my choice of dataset, I was mostly training on a CPU and wanted to do as much as I could to reduce training time. Given that my chosen dataset contained images that were 32 x 32 x 3, I thought it would be more advantageous to train on versus the alternative at 48 x 48 x 3. Along with the actual images used in training, the corresponding labels needed to be read into memory. Though a straightforward task, I made the decision to use one-hot encoding for the classifications. Here's the encoding scheme I used for the project:
# as opposed to the given labels:
[1]   # represents a 1
[10]  # represents a 10

# I added one last non-digit class and moved the 10 value to the 0th position:
[1,0,0,0,0,0,0,0,0,0,0]  # this now represents a 10
[0,0,0,0,0,0,0,0,0,0,1]  # this now represents a non-digit classification
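A minimal sketch of that relabeling, assuming the raw labels arrive as a flat integer array; the helper names and the NUM_CLASSES constant are my own for illustration:

```python
import numpy as np

NUM_CLASSES = 11  # indices 0-9 for digits (the "10" label moved to index 0) plus a non-digit class

def encode_labels(raw_labels):
    """One-hot encode SVHN-style integer labels into the 11-way scheme above."""
    raw_labels = np.asarray(raw_labels).ravel()
    one_hot = np.zeros((raw_labels.size, NUM_CLASSES), dtype=np.float32)
    # The original "10" label maps to index 0; labels 1-9 keep their own index.
    indices = np.where(raw_labels == 10, 0, raw_labels)
    one_hot[np.arange(raw_labels.size), indices] = 1.0
    return one_hot

def encode_non_digits(count):
    """Non-digit samples all get the extra 11th class (index 10)."""
    one_hot = np.zeros((count, NUM_CLASSES), dtype=np.float32)
    one_hot[:, 10] = 1.0
    return one_hot
```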
Preprocessing
To load the .mat data, I used the scipy library to read the files into memory for processing. Initially, I loaded both train.mat and test.mat, until I found out that the .fit method in Keras can split off a percentage of the training data via the validation_split parameter. I also normalized every image, which helps training converge more quickly. Though I did my final training with BGR images, I obtained comparable results with grayscale images. That being said, I'm not sure it matters all that much which channels are used in training, so long as the predictions are processed in the same manner.
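A sketch of the loading and normalization steps, assuming the cropped-digit files expose the standard SVHN 'X' and 'y' keys (the file name matches the one described above):

```python
import numpy as np
from scipy.io import loadmat

# Load the cropped-digit SVHN data; 'X' comes in as (32, 32, 3, N) and 'y' as (N, 1).
train = loadmat('train.mat')

# Move the sample axis to the front so Keras sees (N, 32, 32, 3),
# then normalize pixel values to [0, 1].
x_train = np.moveaxis(train['X'], -1, 0).astype('float32') / 255.0
y_train = train['y'].ravel()  # raw integer labels, one-hot encoded as in the previous section
```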
CNN Architecture
Every CNN starts with convolutional layers and ends with a fully connected output layer. When building my own custom model, I used two conv layers, each accompanied by a corresponding max pooling layer. Due to its efficiency, I chose the relu activation function for all conv layers (i.e. f(x) = max(0, x)). MaxPooling was added to reduce spatial dimensionality during training as well as to control overfitting.
After those top layers, I added fully connected layers to accept the output from the max pooling layers. Because I was training on a CPU with limited resources, the largest fc layer I could achieve was ~1000 nodes. Then, of course, since we're classifying digits, we need ten output indices, one per digit we want to predict. In my particular case, I added another index (and a corresponding dataset) to train an 11th class representing a non-digit classification. The last layer is a softmax output that takes the previous layer's activations and outputs them in normalized form: each value falls between 0 and 1, and together they sum to 1.
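A sketch of the described stack using the Keras Sequential API; the filter counts, kernel sizes, optimizer, and loss are my own assumptions, since the writeup only fixes the layer types, the ~1000-node dense layer, and the 11-way softmax:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # Two conv + max-pooling blocks with relu activations, f(x) = max(0, x)
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),

    # Fully connected layer; ~1000 nodes was the CPU-bound ceiling described above
    layers.Flatten(),
    layers.Dense(1000, activation='relu'),

    # 11-way softmax: indices 0-9 for digits plus an 11th non-digit class
    layers.Dense(11, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```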
Training
Given the above model, I trained several times, adjusting different hyperparameters to improve the accuracy and loss metrics. The first parameters I tried updating were batch_size and epochs. After a few mishaps in the early stages of training, I kept a few things constant, namely the digit to non-digit ratio. In the end, I had 72081 non-digit images and 73254 digit images. I also split my training set so that 30% was allocated to the validation set. Here was my first run's accuracy and loss:
I was able to achieve very high accuracies with just 7 epochs, though my best performing model under these constants came from simply bumping the epochs from 7 to 25. I also added an Early Stopping callback to the fit function to keep my best runs and stop once the loss passed a certain threshold. I chose 25 after running the model with 30 epochs and having it stop at 26, where the 26th epoch wasn't quite as accurate. I could have tuned my early stopping parameters further, but the differences were seemingly minimal:
Due to my CPU, I was unable to run VGG16 out of the box. The large 4096-node output layers would consistently OOM my device, so runs could never finish and output graphs. Because of this, I ended up using a ported version of VGG, which did not yield good results with the same hyperparameters described above. Similar results were achieved with the pretrained VGG16.
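For reference, a sketch of one way to run a scaled-down VGG16 on this data; this uses the keras.applications loader rather than the ported gist actually used here, and the reduced dense-layer size is an assumption made to fit CPU memory:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pretrained convolutional base without the 4096-node top layers that caused OOMs.
base = VGG16(weights='imagenet', include_top=False, input_shape=(32, 32, 3))

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(512, activation='relu'),   # much smaller than the stock 4096
    layers.Dense(11, activation='softmax'), # same 11-way output as the custom model
])
```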
References
1. *Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks*
Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, Vinay Shet - https://arxiv.org/abs/1312.6082
2. *On the Convergence of A Family of Robust Losses for Stochastic Gradient Descent*
Bo Han, Ivor W. Tsang, and Ling Chen - https://arxiv.org/pdf/1605.01623.pdf
3. *VGG-16 pre-trained model for Keras*
baraldilorenzo - https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3
4. *Rectified Linear Units Improve Restricted Boltzmann Machines*
Vinod Nair and Geoffrey E. Hinton
5. *Softmax*
Wikipedia - https://en.wikipedia.org/wiki/Softmax_function
6. *Very Deep Convolutional Networks for Large-Scale Image Recognition*
Karen Simonyan and Andrew Zisserman - https://arxiv.org/pdf/1409.1556v6.pdf