In part 1, we discussed the data requirement curse for non-linear machine learning models. We saw that we can use learning curve graphs to estimate the dataset size for target performance. In part 2, we will design this experiment using a benchmark dataset: Fashion-MNIST. The Fashion-MNIST dataset contains 60,000 training images (and 10,000 test images) of fashion and clothing items, taken from 10 classes. Each image is standardized, grayscale, and 28×28 pixels (784 total pixels). This is a more challenging classification problem than MNIST digits, and top results are achieved by deep learning convolutional neural networks with a classification accuracy of about 90% to 95% on the test dataset. Each training and test example is assigned to one of the following labels:

0: T-shirt/top, 1: Trouser, 2: Pullover, 3: Dress, 4: Coat, 5: Sandal, 6: Shirt, 7: Sneaker, 8: Bag, 9: Ankle boot

The figure below shows some examples of Fashion-MNIST data points we will use in our experiment.

[Figure: sample Fashion-MNIST data points]

Our goal is to see if we can reproduce a graph that validates our learning curve assumptions and estimate the minimum dataset size required for acceptable model performance (as close to the top results as possible).

Diving into the experiment

We will use Google Colab (Python 3) to design our experiment and the Keras API to access the Fashion-MNIST dataset and define our CNN architecture. First, let’s take a look at the imports we will need for this experiment.

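The block below is a minimal sketch of what those imports might look like, assuming the Keras API bundled with TensorFlow (the original notebook may import the standalone keras package instead):

```python
# Plotting for the sample images and the learning curve.
import matplotlib.pyplot as plt

# Keras pieces: the Fashion-MNIST loader, the layers for our CNN,
# the SGD optimizer, and the one-hot encoding utility.
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import to_categorical
```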

We use the matplotlib library for our display needs and Keras both for accessing the Fashion-MNIST dataset and for creating our CNN architecture. Next, let’s define three functions to make our code modular and our lives easier.

  1. Prepare the dataset for training and evaluation:

First, we reshape the data points so that they have a single color channel, then we convert the labels to a one-hot encoding for training. Lastly, each pixel in the dataset has a value between 0 and 255; we convert these from unsigned ints to float32 and normalize the values to the range 0-1.
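Here is a rough sketch of that preparation step (the function name prep_dataset is ours, for illustration):

```python
def prep_dataset(train_x, train_y, test_x, test_y):
    # Add a single color channel: (n, 28, 28) -> (n, 28, 28, 1).
    train_x = train_x.reshape((train_x.shape[0], 28, 28, 1))
    test_x = test_x.reshape((test_x.shape[0], 28, 28, 1))

    # One-hot encode the labels for the 10 classes.
    train_y = to_categorical(train_y)
    test_y = to_categorical(test_y)

    # Convert pixels from unsigned ints (0-255) to float32 and normalize to 0-1.
    train_x = train_x.astype('float32') / 255.0
    test_x = test_x.astype('float32') / 255.0

    return train_x, train_y, test_x, test_y
```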

  2. Create the CNN architecture:


We will use a very simple sequential model for this experiment. The model starts with a convolutional layer of 32 3×3 filters with ReLU activations. The convolutional layer is followed by a 2×2 max pooling layer, whose output is flattened to provide features for the classifier. We then use a dense layer with 100 nodes to interpret the features. Finally, another dense layer with 10 nodes (one per class) and softmax activation acts as the classifier.

All the layers will use the He weight initialization scheme to seed the weights. We will use stochastic gradient descent with a conservative learning rate of 0.01 and a momentum of 0.9 as the optimizer, and categorical crossentropy as the loss function. Remember that all of these hyperparameters are configurable, and they need to be tuned in order to have a good model to begin with. More complex problems like autonomous driving and speech recognition require very complex models, but because our problem statement is relatively simple, we are using a relatively simple network.
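Putting that description together, the model definition might look something like this sketch (the helper name define_model is ours):

```python
def define_model():
    model = Sequential([
        # 32 3x3 convolution filters, ReLU activation, He initialization.
        Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform',
               input_shape=(28, 28, 1)),
        # 2x2 max pooling, then flatten the feature maps for the classifier.
        MaxPooling2D((2, 2)),
        Flatten(),
        # Dense layer with 100 nodes to interpret the features.
        Dense(100, activation='relu', kernel_initializer='he_uniform'),
        # Output layer: 10 classes with softmax activation.
        Dense(10, activation='softmax'),
    ])
    # SGD with a conservative learning rate of 0.01 and momentum of 0.9,
    # with categorical crossentropy as the loss.
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=opt, loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```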

  3. Train and evaluate the model:


Our last function will train and evaluate the model. This step is the most important part of the journey but is the easiest to understand. We first use the model definition to create our CNN architecture, then train it using model.fit, and lastly evaluate the trained model using model.evaluate. In this experiment, we chose a batch size of 32 and train the model for 10 epochs.
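A sketch of this last function (again, the name train_and_evaluate is ours, and it relies on the define_model and prep_dataset sketches above):

```python
def train_and_evaluate(train_x, train_y, test_x, test_y):
    # Create the CNN from the model definition above.
    model = define_model()
    # Train with a batch size of 32 for 10 epochs.
    model.fit(train_x, train_y, epochs=10, batch_size=32, verbose=1)
    # Evaluate classification accuracy on the held-out test set.
    _, acc = model.evaluate(test_x, test_y, verbose=0)
    print('Test accuracy: %.2f%%' % (acc * 100.0))
    return acc
```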

That’s it. We have all the building blocks to run our learning curve graph experiment. Let’s first use all 60,000 images as training data to establish the model’s benchmark performance at the maximum dataset size. We will then run the experiment to see how far we can reduce the training data while maintaining comparable performance.

To run this experiment, all we need to do is use keras.datasets to access the Fashion-MNIST dataset, then prepare our data for training, and finally train and evaluate the model as shown below.
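In code, the full-dataset benchmark run could look roughly like this, using the helpers sketched above:

```python
# Load the Fashion-MNIST dataset (60,000 training and 10,000 test images).
(train_x, train_y), (test_x, test_y) = fashion_mnist.load_data()

# Prepare the data and run the benchmark on all 60,000 training images.
train_x, train_y, test_x, test_y = prep_dataset(train_x, train_y, test_x, test_y)
train_and_evaluate(train_x, train_y, test_x, test_y)
```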

Once we complete the training and evaluation, we should see results similar to the following:

In our runs, we observe that each epoch takes about 34s, which means the total time to train this model on 60,000 images is about 340s. This training yields about 91% classification accuracy on the 10,000-image test set. Not too bad, right? Now let’s see how much data we can eliminate while attaining comparable performance. Let’s start with 5,000 training images and increase the size by another 5,000 for each iteration. While there are more elegant ways of handling this, for the sake of simplicity we will change the training dataset size manually for each iteration, as shown below.
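One simple way to do this, continuing with the helpers sketched above, is to slice off the first n_samples prepared training examples and change that number by hand between runs:

```python
# Keep only the first n_samples training examples for this iteration
# (5,000 here; bump it by 5,000 for each subsequent run).
n_samples = 5000
subset_x = train_x[:n_samples]
subset_y = train_y[:n_samples]

train_and_evaluate(subset_x, subset_y, test_x, test_y)
```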

While using an experimental setup like this, it is advisable to check the sampled subsets for class imbalance; however, we will not include such strategies in this experiment. When we load the dataset from Keras, it is already shuffled. Due to the simplicity of the dataset, we will rely on this initial shuffling and on the categorical crossentropy loss to deal with any class imbalance introduced by sampling.

Now, let’s take a look at what our model’s performance looks like when using only 5,000 training images.

We observe 86% classification accuracy with about 30s of training time: a 5% reduction in accuracy for an 11x reduction in training time. On any given day this would be a great trade-off; however, our goal here is to see how close we can get to the model’s maximum performance with reduced training data. The table and the corresponding plot below show the mapping between the number of training examples, training time, and model performance.

From the above table, we might say that 35,000 training images is a good trade-off for the value. This gives approximately a 42% reduction in the training data requirement (25,000 fewer of the 60,000 images) and a 41% reduction in training time, for a loss of 0.91% accuracy. The right trade-off for your use case is subjective and should be driven by your project goals.

Now, this is all great, but what if we don’t have 60,000 images to start with? Can we still estimate the mapping between dataset size and performance? Yes! That is where extrapolation comes in. Let’s say you had only 30,000 images and wanted to know the theoretical trade-off so you could decide how much more data to collect or augment. This means you have data for only the first six entries in the table above (you can always get more granular to better seed the interpolation algorithm). Extrapolating the learning curve from the existing values gives a decent estimate of the model’s performance as the number of samples increases, so we can estimate the approximate dataset size required for a target performance. In the example below, using scipy’s interpolate library, we can estimate the model’s performance slightly beyond the observed data with decent accuracy (90.16% estimated vs. 90.23% actual).
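The sketch below shows one way this could be done with scipy. The accuracy values are illustrative placeholders (only the 86% at 5,000 images comes from the runs described above), so substitute the numbers from your own table; the original code may also use a different call from scipy’s interpolate module:

```python
import numpy as np
from scipy.interpolate import interp1d

# Dataset sizes we actually trained on: the first six entries of the table.
sizes = np.array([5000, 10000, 15000, 20000, 25000, 30000])
# Measured test accuracies (%) at each size. Illustrative placeholders only;
# replace these with the values from your own runs.
accuracies = np.array([86.0, 87.6, 88.4, 89.0, 89.5, 89.8])

# Fit a quadratic curve through the observed points and let it extrapolate
# beyond the largest dataset size we trained on.
learning_curve = interp1d(sizes, accuracies, kind='quadratic',
                          fill_value='extrapolate')

# Estimate the accuracy we would expect with 35,000 training images.
estimated = float(learning_curve(35000))
print('Estimated accuracy at 35,000 images: %.2f%%' % estimated)
```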

So what is the downside? Well, there are multitudes of extrapolation techniques out there. We assume that the learning curve follows a quadratic curve, but we might only be looking at a zoomed-in snapshot of a more complex curve. Additionally, the further out you want to estimate, the less accurate your predictions will be; extrapolation is simply an estimate that can give you a ballpark figure but cannot guarantee 100% accuracy.

Learning curve graphs significantly help us estimate the required dataset size with respect to target performance, the resources needed, and the time to completion of a project, but they are still just an estimation. Researchers also use techniques like statistical learning theory, power law function estimation, and estimation by analogy to figure out the rough dataset size required for a target model performance. However, learning curve graphs remain a powerful and relatively straightforward way to help developers understand and scope their projects, so they can approach them more thoughtfully.