Given the Street View House Number dataset, train a model that can predict number classifications in similar images.
There are numerous considerations to be made when building out a custom model. The thing to first consider is figuring out what data you’re dealing with. In the SVHN databse, there are two types of datasets: dataset one being images containing number sequences and the dataset two, being a MNIST like dataset for single digit classification. Although the task was to identify a sequence of numbers, I ended up choosing the second dataset due to the simplicity of the data format. Given more time, I would’ve liked to retrain the first dataset. Expanding on my choice of dataset, I was mostly training on a CPU device and wanted to do as much as I could to reduce training time. Given that the choice dataset contained images that were
32 x 32 x 3, I thought it’d be more advantageous to train on versus the alternative,
48 x 48 x 3. Alongwith the actual images to be used in training, the corresponding labels needed to read into memory. Though a straight forward task, I made the decision to utilize one hot encoding for classifications. Here’s the encoding scheme I used for the project:
# as opposed to the given labels  // represents a 1  // represents a 10 # added one last non-digit class and moved the 10 value to the 0th position [1,0,0,0,0,0,0,0,0,0,0] // this now represents a 10 [0,0,0,0,0,0,0,0,0,0,1] // this now represents a non-digit classification
To load the
.mat data type, I utilized the
scipy library to read the file into memory for processing. Initially, I loaded in both
test.mat until I found out that the
.fit model in keras could split the training data by some percentage via the
validation_split parameter. I also normalized every image to further alleviate the time spent on convolutions. Though I did my final training steps with
BGR images, I did obtain comparable recents with greyscaled images. That being said, I’m not quite sure if it matters all the much which channels are used in training; so long as the predictions are processed in the same manner.
Every CNN starts with convolutional layers and end with a fully connected output layer. When building my own custom model, I used two conv layers accompanied by corresponding max pooling layers. Due to it’s efficiency, I chose the
relu activation function for all conv layers (e.g.
f(x) = max(0, x)).
MaxPooling was added to reduce spatial dimensionality for training as well as to control overfitting.
After adding those top layers, I added fully connected layers to accept the output from the max pooling layers. Due to training on a cpu device with limited resources, the largest fc layers I could acheive was
~1000 nodes. Then of course, since we’re classifying digits we need 10 indices per each digit we want to predict. In my particular case, I added another layer (and corresponding dataset) to train an 11th class that would represent a digit/non-digit classification. The last layer is a
softmax output that grabs the previous calculations and outputs values in a normalized form. Specifically, the numbers become ranges from 0 to 1 and add up to 1.
Given the above model, I trained several times adjusting many different hyperparameters to improve accuracy and loss metrics. The first parameter I attempted to update was
epochs. I kept a few things constant after a few mishaps on early stages of training. Namely, having the right digit to non-digit ratio. In the end, I had
72081 non-digit images and
73254 digit images. I also split my training set where
30% was allocated to the validation set. Here was my first run accuracy and loss:
I was able to acheive very high accuracies with just
7 epochs, though my best performing model given these constants was simply to bump the epoch from
25. I also added an Early Stopping callback to the fit function to evaluate my best runs and stop if the loss function passed a certain threshold. I chose
25 due to running this model with
30 and having it stop at
26, though the 26th run wasn’t quite as accurate. I could’ve updated my early stopping parameters but the differences were seemingly minimal:
Due to my CPU, I was unable to run the
VGG16 out of the box. The large
4096 output layers would consistently OOM my device and thus could not finish outputting graphs. Because of this, I ended up using a ported version of VGG which did not yield good results with the same
hyperparameters described above. Simar results were acheived for the
1. *Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks* Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, Vinay Shet - https://arxiv.org/abs/1312.6082 2. *On the Convergence of A Family of Robust Losses for Stochastic Gradient Descent* Bo Han, Ivor W. Tsang, and Ling Chen - https://arxiv.org/pdf/1605.01623.pdf 3. *VGG-16 pre-trained model for Keras* baraldilorenzo - https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3 4. *Rectified Linear Units Improve Restricted Boltzmann Machines* 5. *Softmax* Wikipedia - https://en.wikipedia.org/wiki/Softmax_function 6. *VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION* Karen Simonyan and Andrew Zisserman - https://arxiv.org/pdf/1409.1556v6.pdf