SVHN Classification

Given the Street View House Number dataset, train a model that can predict number classifications in similar images.



There are numerous considerations to be made when building out a custom model. The thing to first consider is figuring out what data you’re dealing with. In the SVHN databse, there are two types of datasets: dataset one being images containing number sequences and the dataset two, being a MNIST like dataset for single digit classification. Although the task was to identify a sequence of numbers, I ended up choosing the second dataset due to the simplicity of the data format. Given more time, I would’ve liked to retrain the first dataset. Expanding on my choice of dataset, I was mostly training on a CPU device and wanted to do as much as I could to reduce training time. Given that the choice dataset contained images that were 32 x 32 x 3, I thought it’d be more advantageous to train on versus the alternative, 48 x 48 x 3. Alongwith the actual images to be used in training, the corresponding labels needed to read into memory. Though a straight forward task, I made the decision to utilize one hot encoding for classifications. Here’s the encoding scheme I used for the project:

# as opposed to the given labels
[1] // represents a 1
[10] // represents a 10

# added one last non-digit class and moved the 10 value to the 0th position
[1,0,0,0,0,0,0,0,0,0,0] // this now represents a 10
[0,0,0,0,0,0,0,0,0,0,1] // this now represents a non-digit classification


To load the .mat data type, I utilized the scipy library to read the file into memory for processing. Initially, I loaded in both train.mat and test.mat until I found out that the .fit model in keras could split the training data by some percentage via the validation_split parameter. I also normalized every image to further alleviate the time spent on convolutions. Though I did my final training steps with BGR images, I did obtain comparable recents with greyscaled images. That being said, I’m not quite sure if it matters all the much which channels are used in training; so long as the predictions are processed in the same manner.

CNN Architecture

Every CNN starts with convolutional layers and end with a fully connected output layer. When building my own custom model, I used two conv layers accompanied by corresponding max pooling layers. Due to it’s efficiency, I chose the relu activation function for all conv layers (e.g. f(x) = max(0, x)). MaxPooling was added to reduce spatial dimensionality for training as well as to control overfitting. After adding those top layers, I added fully connected layers to accept the output from the max pooling layers. Due to training on a cpu device with limited resources, the largest fc layers I could acheive was ~1000 nodes. Then of course, since we’re classifying digits we need 10 indices per each digit we want to predict. In my particular case, I added another layer (and corresponding dataset) to train an 11th class that would represent a digit/non-digit classification. The last layer is a softmax output that grabs the previous calculations and outputs values in a normalized form. Specifically, the numbers become ranges from 0 to 1 and add up to 1.


Given the above model, I trained several times adjusting many different hyperparameters to improve accuracy and loss metrics. The first parameter I attempted to update was batch_size and epochs. I kept a few things constant after a few mishaps on early stages of training. Namely, having the right digit to non-digit ratio. In the end, I had 72081 non-digit images and 73254 digit images. I also split my training set where 30% was allocated to the validation set. Here was my first run accuracy and loss: 7

I was able to acheive very high accuracies with just 7 epochs, though my best performing model given these constants was simply to bump the epoch from 7 to 25. I also added an Early Stopping callback to the fit function to evaluate my best runs and stop if the loss function passed a certain threshold. I chose 25 due to running this model with 30 and having it stop at 26, though the 26th run wasn’t quite as accurate. I could’ve updated my early stopping parameters but the differences were seemingly minimal: 25

Due to my CPU, I was unable to run the VGG16 out of the box. The large 4096 output layers would consistently OOM my device and thus could not finish outputting graphs. Because of this, I ended up using a ported version of VGG which did not yield good results with the same hyperparameters described above. Simar results were acheived for the VGG16 pretrained. vgg


1. *Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks*
 Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, Vinay Shet -
2. *On the Convergence of A Family of Robust Losses for Stochastic Gradient Descent*
 Bo Han, Ivor W. Tsang, and Ling Chen -
3. *VGG-16 pre-trained model for Keras*
 baraldilorenzo -
4. *Rectified Linear Units Improve Restricted Boltzmann Machines*
5. *Softmax*
Wikipedia -
Karen Simonyan and Andrew Zisserman -