Figuring out Dog Breeds using CNNs

May 14, 2021

https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg?crop=1.00xw:0.669xh;0,0.190xh&resize=640:*

There are hundreds and hundreds of different dog breeds out there, and it often takes more than a trained human eye to distinguish them. Without a doubt, it is an extremely difficult task. So, I wanted to use my knowledge of deep learning and apply it to this domain.

The Task

The idea is to take any image as input. The image may contain a dog, a human, or neither of the two. After figuring out which of the three cases it is, we should either:

Output the correct dog breed if the image contains a dog
Output the closest-matching dog breed if the image contains a human
Output an error message if it contains neither
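The three cases above can be sketched as a small routing function. This is only an illustration: the names describe_image, is_dog, is_human and predict_breed are hypothetical, and checking for a dog before checking for a human is an assumption about the order.

```python
from typing import Callable


def describe_image(
    image_path: str,
    is_dog: Callable[[str], bool],
    is_human: Callable[[str], bool],
    predict_breed: Callable[[str], str],
) -> str:
    """Route an image to one of the three cases (dog, human, neither)."""
    # Assumption: the dog check runs first, then the human check.
    if is_dog(image_path):
        return f"Dog detected. Predicted breed: {predict_breed(image_path)}"
    if is_human(image_path):
        return f"Human detected. Closest matching breed: {predict_breed(image_path)}"
    return "Error: neither a dog nor a human was detected."


# Usage with stand-in detectors (hypothetical, for illustration only).
message = describe_image(
    "some_image.jpg",
    is_dog=lambda path: False,
    is_human=lambda path: True,
    predict_breed=lambda path: "Australian_shepherd",
)
print(message)
```

Passing the detectors in as functions keeps the routing logic independent of how each detector is implemented.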

The Data

The dataset of dog breeds comes from Udacity (available here: https://github.com/thomasgrusz/dog-breed-classifier). It contains 133 different dog breeds. These are spread across 8,351 total data points. They are also pre-split into training/validation/test datasets:

Training images — 6,680
Validation images — 835
Test images — 836

Australian_shepherd_00851 from training dataset

Along with the dog images, a dataset of human celebrities’ images is also provided. This is to provide some testing images for our final system.

Robert_Downey_Jr_0001 from human dataset

However, our task is quite complicated (with 133 different classes) and I wanted a larger dataset to train our model on. For that purpose, I utilized data augmentation techniques using ImageDataGenerator from Keras. This is a method that seeks to increase the size and diversity of the dataset using image transformation techniques. Here are the ones I used:

Random zoom transformations, ranging from 0.8x to 1.2x
Random rotations of up to 25 degrees, clockwise or counter-clockwise
Random height shifts, as a factor (1.3) of image height
Random width shifts, as a factor (1.3) of image width
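As a rough sketch, the transformations listed above map onto ImageDataGenerator arguments roughly as follows. The numeric values come from the list; the helper names make_augmenter and augmented_copies are hypothetical. Two caveats: the Keras documentation describes float shift values below 1 as fractions of the image size and values of 1 or more as pixel counts, so the 1.3 factor may be worth double-checking, and ImageDataGenerator is marked as deprecated in recent Keras releases.

```python
# Values taken from the list above; the keys are ImageDataGenerator arguments.
AUGMENTATION_ARGS = {
    "zoom_range": [0.8, 1.2],    # random zoom between 0.8x and 1.2x
    "rotation_range": 25,        # random rotation of up to 25 degrees
    "height_shift_range": 1.3,   # as stated in the article
    "width_shift_range": 1.3,    # as stated in the article
}


def make_augmenter():
    """Build the generator (imported lazily so TensorFlow is only needed here)."""
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    return ImageDataGenerator(**AUGMENTATION_ARGS)


def augmented_copies(augmenter, images, copies_per_image):
    """Yield `copies_per_image` randomly transformed versions of each image.

    `images` is an iterable of (height, width, channels) arrays.
    """
    for img in images:
        for _ in range(copies_per_image):
            yield augmenter.random_transform(img)
```

Calling list(augmented_copies(make_augmenter(), train_images, 20)) would produce about twenty copies per original image; the exact number of copies used for the article is not stated, so 20 is only a placeholder.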

Original data point and augmented copies

Using data augmentation, I generated 133,532 new images from the original 6,680. Combined with the originals, this brought my final training dataset to 140,212 images.

The Preprocessing

To load our datasets, I used datasets.load_files from sklearn, which returns the label arrays along with arrays of image filenames. Then, to load the images from those filenames, I used functions from keras.preprocessing, which let us load each image at our chosen dimensions of (224, 224, 3).
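A minimal sketch of this loading step might look like the following. The helper names are hypothetical, one-hot encoding the labels is an assumption (it is what a categorical cross-entropy loss expects), and depending on the Keras version the image helpers may live under keras.utils instead.

```python
import numpy as np

NUM_BREEDS = 133


def one_hot(targets, num_classes=NUM_BREEDS):
    """Turn integer class labels into one-hot rows."""
    return np.eye(num_classes, dtype="float32")[np.asarray(targets)]


def load_dataset(path):
    """Return (filenames, one-hot labels) for a folder with one subfolder per breed."""
    from sklearn.datasets import load_files

    # load_content=False keeps only the file paths instead of reading file bytes.
    data = load_files(path, load_content=False)
    return np.array(data["filenames"]), one_hot(data["target"])


def path_to_tensor(img_path, size=(224, 224)):
    """Load one image as a (1, 224, 224, 3) float array."""
    from tensorflow.keras.preprocessing import image

    img = image.load_img(img_path, target_size=size)
    x = image.img_to_array(img)          # shape (224, 224, 3)
    return np.expand_dims(x, axis=0)     # add a batch dimension
```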

Once the datasets are loaded, an intermediate step needs to be done to check if the image is of a dog, a human, or neither.

To check if the image is of a dog, I used transfer learning. By initializing a ResNet50 model with ImageNet weights, I was able to detect dogs (regardless of breed for now) with high accuracy.
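One common way to build such a detector, sketched here as an assumption rather than the exact code used, relies on the fact that in the ImageNet class ordering used by Keras the dog categories occupy indices 151 through 268. The function names are hypothetical.

```python
# Assumption: ImageNet class indices 151..268 (inclusive) are dog categories.
DOG_CLASS_RANGE = range(151, 269)


def is_dog_class(imagenet_index: int) -> bool:
    """True if a predicted ImageNet class index falls in the dog range."""
    return int(imagenet_index) in DOG_CLASS_RANGE


def load_resnet():
    """Load ResNet50 with ImageNet weights (downloads them on first use)."""
    from tensorflow.keras.applications.resnet50 import ResNet50

    return ResNet50(weights="imagenet")


def dog_detector(img_path: str, model) -> bool:
    """Predict the ImageNet class of one image and check it against the dog range."""
    import numpy as np
    from tensorflow.keras.applications.resnet50 import preprocess_input
    from tensorflow.keras.preprocessing import image

    img = image.load_img(img_path, target_size=(224, 224))
    batch = np.expand_dims(image.img_to_array(img), axis=0)
    prediction = model.predict(preprocess_input(batch))
    return is_dog_class(np.argmax(prediction))
```

Loading the model once with load_resnet() and passing it to dog_detector avoids rebuilding the network for every image.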

To check if the image is of a human, I used a CascadeClassifier from OpenCV. Using its haarcascade_frontalface_default XML classifier, I wrote a function that detects human faces in images, which serves as the human check.

How a Haar Cascade Classifier detects human faces (Credit: https://miro.medium.com/max/2884/1*JhFCP1CjF7fRYt9pLldMsw.jpeg)
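A face check along these lines could be sketched as follows. The function names are hypothetical, and loading the XML file from cv2.data.haarcascades assumes the opencv-python package, which bundles the cascade files.

```python
def load_face_cascade():
    """Load OpenCV's bundled frontal-face Haar cascade."""
    import cv2

    path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    return cv2.CascadeClassifier(path)


def count_faces(gray_image, cascade) -> int:
    """Number of face bounding boxes the cascade finds in a grayscale image."""
    faces = cascade.detectMultiScale(gray_image)
    return len(faces)


def face_detector(img_path: str, cascade) -> bool:
    """True if at least one frontal face is detected in the image file."""
    import cv2

    img = cv2.imread(img_path)           # BGR image, or None if unreadable
    if img is None:
        return False
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return count_faces(gray, cascade) > 0
```

Since the cascade looks for frontal faces, a person photographed from behind or in profile may not register as a human.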

If neither a dog nor a human face is detected, the system falls back to the error case.

The Experiments

Next up was the main task of devising a CNN architecture for our classification task. After some experimentation with different models and the validation dataset, I settled upon an architecture with the following parameters:

No. of convolutional blocks = 4
Filter size = (3, 3)
Strides = (2, 2) and (1, 1)
Filters = [16, 32, 64, 128]
optimizer = ‘rmsprop’
loss = ‘categorical_crossentropy’
Dropout = 0.4
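One way these parameters could fit together in Keras is sketched below. Several details are assumptions, since the list does not pin them down: the (1, 1) stride is applied to the convolutions and the (2, 2) stride to max-pooling layers, ReLU activations and "same" padding are used, and a global average pooling layer precedes a single dropout layer and the softmax output.

```python
FILTERS = [16, 32, 64, 128]   # one entry per convolutional block


def feature_map_size(input_size: int, num_blocks: int) -> int:
    """Spatial size after the blocks, assuming each block halves it via pooling."""
    size = input_size
    for _ in range(num_blocks):
        size //= 2
    return size


def build_model(num_classes=133, input_shape=(224, 224, 3)):
    """Assumed layout: 4 x (Conv2D -> MaxPooling2D), then pooling, dropout, softmax."""
    from tensorflow.keras import layers, models

    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters in FILTERS:
        # Assumption: stride (1, 1) for the convolution ...
        model.add(layers.Conv2D(filters, (3, 3), strides=(1, 1),
                                padding="same", activation="relu"))
        # ... and stride (2, 2) for the pooling step.
        model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dropout(0.4))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


print(feature_map_size(224, len(FILTERS)))  # 224 -> 112 -> 56 -> 28 -> 14
```

Under these assumptions the 224 x 224 input shrinks to a 14 x 14 feature map with 128 channels before the classifier head, which keeps the parameter count small.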

I first fit the model to my original, non-augmented training dataset of 6,680 images. I was able to get the following training and validation accuracies:

As we can see, there is an issue with overfitting as the training accuracy keeps climbing far higher than the validation accuracy. Despite dropout and regularization, the issue persists a little.

I then fit the model to the augmented dataset with 140,212 training images. Here are the training and validation accuracies:

Here we can notice that the issue of overfitting is much less pronounced. Not only that, but it appears that the validation accuracy is also a bit higher than for the model trained on just the original training images.

By evaluating both models on the test set, their performances become clearer. The results are as follows:

We see that the CNN trained on the augmented dataset performs best, with an F-score of 12.32%. Because there are so many classes, the F-scores were calculated using weighted averaging, so they need not fall between the reported precision and recall.
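To see why a weighted F-score can land outside the range spanned by precision and recall, here is a small self-contained sketch. It computes each metric per class and then averages by class support, which is the same idea as average="weighted" in scikit-learn; whether scikit-learn was used for the reported numbers is an assumption. The function name weighted_scores and the toy labels are illustrative.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all classes in y_true."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = hits[label] / predicted[label] if predicted[label] else 0.0
        r = hits[label] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / total               # fraction of samples with this true label
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# Toy example with two classes.
p, r, f = weighted_scores([0, 0, 1, 1], [0, 1, 1, 1])
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8333 0.75 0.7333
```

In this toy case the weighted F1 (0.7333) is lower than both the weighted precision (0.8333) and the weighted recall (0.75), because the F1 is averaged per class rather than computed from the averaged precision and recall.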

Since the task is quite difficult with 133 classes, it is also meaningful to look at the Top-5 Accuracy. This metric counts a prediction as correct if the true breed appears anywhere among the model’s five highest-scoring predictions. The model achieves an impressive top-5 accuracy of 35.64% despite not having a massive architecture or many trainable parameters. It was trained for 20 epochs and finished training within ~25 minutes.
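The top-5 metric described above can be sketched in a few lines. The function name top_k_accuracy and the toy scores are illustrative, not taken from the project code.

```python
def top_k_accuracy(probabilities, true_labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for probs, label in zip(probabilities, true_labels):
        # Class indices sorted from highest to lowest score, truncated to k.
        top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        correct += label in top_k
    return correct / len(true_labels)


# Toy example with 3 classes and 2 samples.
scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
labels = [2, 0]
print(top_k_accuracy(scores, labels, k=1))  # 0.5
print(top_k_accuracy(scores, labels, k=2))  # 1.0
```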

It is good to notice here that augmenting our dataset with image transformations led to an increase in the performance overall. Whether it was due to just more data points overall or due to more diverse samples, the model was able to perform better against unseen test data.

Now that we have our classifier, we can test the full system. Using some test images:

Great! Looks like our system is performing as intended.

The Analysis

Now we can look at the pairs of dog breeds that our model confuses most often, to see where and why it might be failing. To do this, I generated the confusion matrix for the model trained on augmented data, which has shape (133, 133) because there are 133 classes. The leading diagonal holds the correct predictions, while every other cell counts a misclassification pair of the form (true breed, predicted breed).
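A minimal sketch of this analysis, with rows as true breeds and columns as predicted breeds, could look like the following. The function names and the toy labels are illustrative.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def most_confused_pairs(matrix, top_n=3):
    """Largest off-diagonal cells as (true class, predicted class, count)."""
    pairs = [
        (count, t, p)
        for t, row in enumerate(matrix)
        for p, count in enumerate(row)
        if t != p and count > 0
    ]
    pairs.sort(key=lambda item: -item[0])   # highest count first
    return [(t, p, count) for count, t, p in pairs[:top_n]]


# Toy example with 3 classes.
matrix = confusion_matrix([0, 0, 0, 1, 1, 2], [1, 1, 0, 1, 2, 2], num_classes=3)
print(most_confused_pairs(matrix, top_n=2))  # [(0, 1, 2), (1, 2, 1)]
```

For the real model, num_classes would be 133 and the class indices would be mapped back to breed names before visualizing the pairs.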

Looking at a few of the non-diagonal cells with the highest values and visualizing the breeds, we get the following problem pairs:

In both cases, it is not hard to see why the CNN would make a mistake and confuse the two dog breeds. Since the CNN looks for certain features in the images, there may be similar activations in its feature maps if two pictures are this similar in terms of which portions are similarly colored/shaped.

Conclusion

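Even a small CNN architecture can make real headway on the daunting task of classifying 133 different dog breeds. The model trained on augmented data reached a weighted F-score of 12.32% and a top-5 accuracy of 35.64%, compared with about 3.8% (5 in 133) for top-5 guesses made uniformly at random. I suspect that is better than many people without breed knowledge would manage, although I did not measure human performance here.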

Moreover, data augmentation techniques prove to be very valuable as they can bolster model performance by increasing the size and diversity of training data.

Given more computational resources and time in the future, I would also like to apply transfer learning to the breed classifier itself, using large pre-trained models as feature extractors for even better performance. CNNs are undoubtedly quite powerful for image classification!

Link to code: https://github.com/AhsanSuheer/dog-breed-prediction