Shashank, Chris, and Rob first worked together at ChemImage Corporation, leveraging machine learning with dense hyperspectral imaging datasets to identify chemical signatures. Now, they’re celebrating a big win: their up-and-coming company, Innotescus, created the second-best-performing model in Andrew Ng’s Data-Centric AI Competition. But why and, most importantly, how did they do it?

What we believe about Data-Centric AI

We at Innotescus, like many others in the ML community, learned the hard way that elegant algorithms and crazy math are just a part of this puzzle. We understand and appreciate the importance of having detailed insights into data, and using those insights to create objectively high-quality data for high-impact machine learning models.

Our goal is to illuminate the black box of machine learning, so being able to effectively curate a dataset is at the core of our focus. We feel strongly that the machine learning universe, and the world at large, would benefit immensely from explainable, approachable AI, and we want to lead the movement in that direction. The more we work to demystify the world of machine learning, the faster we can get to safer and more responsible technology that makes the world a better place.

Was there a competition?

A collaboration between DeepLearning.AI and Landing AI, the Data-Centric AI Competition aimed to elevate data-centric approaches to improving the performance of machine learning models. It flipped the traditional “model-first” approach on its head: instead of tuning a model, participants were asked to improve a dataset given a fixed model architecture.

How we did it

Our method can be split into two parts: data labeling and balancing data distributions.

Data Labeling

As Prof Andrew Ng and others have highlighted, and as we have seen in our own experiences, the first and arguably largest source of problems in creating a high-quality training dataset comes from data labeling errors and inconsistencies. Having a consistent set of rules for labeling and a strong consensus among annotators/field experts mitigates errors and greatly reduces subjectivity.

For this competition, we broke the dataset cleaning process into three parts:

  1. Identify noisy images

This was a no-brainer. We removed noisy images from the training set. These images clearly don’t correspond to a particular class and would be detrimental to model performance.

[Image: noisy images]
Imagine being the annotator assigned to these...

  2. Identify incorrect classes

    We corrected mislabeled data points. Human annotators are prone to mistakes, and having a systematic QA or review process helps identify and eliminate those errors.

[Image: mislabeled data points]

  3. Identify ambiguous data points

    We defined consistent rules for ambiguous data points. For example, in the images shown below, we consider a data point as class 2 if we see a clear gap between the two vertical lines (top row), even when they are at an angle. If there is no identifiable gap, we consider the data point as class 5 (bottom row). Having pre-defined rules let us resolve ambiguous cases consistently rather than case by case.

[Image: ambiguous data points]

This three-step process cut down the dataset to a total of 2,228 images, a 22% reduction from the provided dataset. This alone resulted in 73.099% accuracy on the test set, an approximately 9% boost from the baseline performance.

Balancing Data Distributions

When we collect training data in the real world, we capture a snapshot of data in time, which invariably introduces hidden biases into our training data. Biased data often leads to poor learning. One solution is to reduce ambiguous data points and ensure balance along the major dimensions of variance within the dataset.

  1. Rebalancing Training and Testing Datasets

    Real-world data has a lot of variance built into it. This variance almost always causes unbalanced distributions, especially when observing a specific feature or metric. When the data is augmented, these biases can get amplified. The result? Throwing more data at your model may drive you further away from your goal. We observed this with two of our submissions in this competition. The two submissions contained the same data, just split differently between training and validation (80:20 and 88:12 respectively). We saw that the addition of 800 images to the training set actually reduced the accuracy on the test set by about 1.5%. After this, our approach shifted from “more data” to “more balanced data.”
  2. Rebalancing Subclasses Using Embeddings

    The first imbalance we observed was in the distribution of uppercase and lowercase numerals within each class. For example, our “cleaned” data contained 90 images of lowercase class 1 and 194 images of uppercase class 1. Staying true to our hypothesis, we needed equal representation (500 images) from each case per class, which adds up to the 10,000-image limit set by the rules of the competition.
[Image: dive chart analytics, Andrew Ng competition]
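
The case-balancing target above reduces to simple arithmetic. The following is a minimal sketch, assuming each case of each class is topped up to 500 images with augmented copies; the function name `augmentation_deficit` is hypothetical, and only the class 1 counts come from the article.

```python
TARGET_PER_CASE = 500  # from the article: 500 images per case per class


def augmentation_deficit(counts, target=TARGET_PER_CASE):
    """Return how many extra images each subgroup needs to reach `target`."""
    return {group: max(target - n, 0) for group, n in counts.items()}


# Counts for class 1 quoted in the article; other classes would be tallied the same way.
class_1 = {"lowercase": 90, "uppercase": 194}
print(augmentation_deficit(class_1))  # {'lowercase': 410, 'uppercase': 306}
```

The deficit for the lowercase subset is much larger than for the uppercase one, which is why naive augmentation of the whole class would preserve the original imbalance.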

We then further explored clusters within each uppercase and lowercase subset. We subdivided each case into three clusters using K-means clustering on the PCA-reduced ResNet-50 embeddings.

[Image: PCA-reduced ResNet-50]
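
As a rough illustration of the clustering step, the sketch below runs PCA and a plain K-means (Lloyd's algorithm) using only NumPy. It is not our production code: the embeddings are random stand-ins for real ResNet-50 features, and the number of PCA components (10), the iteration cap, and the seeds are all assumptions, since the article does not state them. In practice, a library such as scikit-learn (`PCA`, `KMeans`) would be the usual choice.

```python
import numpy as np


def pca_reduce(embeddings, n_components):
    """Project the rows of `embeddings` onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of `points`."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster ends up empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


rng = np.random.default_rng(42)
# Stand-in for 2048-dimensional ResNet-50 embeddings of one case of one class.
fake_embeddings = np.vstack([
    rng.normal(loc=offset, scale=0.5, size=(30, 2048)) for offset in (-3.0, 0.0, 3.0)
])
reduced = pca_reduce(fake_embeddings, n_components=10)  # 10 components is an assumption
labels = kmeans(reduced, k=3)  # three clusters per case, as in the article
print(np.bincount(labels, minlength=3))  # images per cluster
```

Because K-means depends on its initialization, the cluster sizes can vary between seeds; the per-cluster counts are what feed the balancing step.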

Once we had these clusters (shown below), we simply balanced each of the sub-clusters with augmented images resulting from translation, scale and rotation.

[Image: augmented images]
A UMAP visualization of clusters, separated by color, obtained on lowercase class 1
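
To make the sub-cluster balancing concrete, here is a hedged sketch that plans how many augmented images each cluster needs and samples a random translation, scale, and rotation for each one as a 2x3 affine matrix. The parameter ranges, the choice to match the largest cluster, the helper names, and the example cluster sizes are all illustrative assumptions; applying the matrices to actual images is left to an image library.

```python
import math
import random


def random_affine(rng, max_shift=3.0, scale_range=(0.9, 1.1), max_angle_deg=10.0):
    """Sample a 2x3 affine matrix combining rotation, uniform scale, and translation."""
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    scale = rng.uniform(*scale_range)
    tx = rng.uniform(-max_shift, max_shift)  # horizontal shift in pixels
    ty = rng.uniform(-max_shift, max_shift)  # vertical shift in pixels
    cos_a, sin_a = scale * math.cos(angle), scale * math.sin(angle)
    return [[cos_a, -sin_a, tx],
            [sin_a, cos_a, ty]]


def plan_augmentations(cluster_sizes, seed=0):
    """For each cluster, sample enough transforms to match the largest cluster."""
    rng = random.Random(seed)
    target = max(cluster_sizes.values())
    return {
        cluster: [random_affine(rng) for _ in range(target - size)]
        for cluster, size in cluster_sizes.items()
    }


# Made-up cluster sizes for illustration only.
plan = plan_augmentations({"cluster_0": 40, "cluster_1": 25, "cluster_2": 25})
print({c: len(t) for c, t in plan.items()})  # {'cluster_0': 0, 'cluster_1': 15, 'cluster_2': 15}
```

Keeping the transforms small matters for this dataset: a large rotation or shift could change how a numeral reads, so the ranges would need to be tuned by eye.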
  3. Rebalancing Edge Cases with Hard Examples and Augmentation

    Towards the end of the competition, we observed that there were certain examples in our validation set that we consistently misclassified. Our goal here was to help the model classify these examples with higher confidence. We believe that these misclassifications are caused by an underrepresentation of “edge case” examples in our training set. Below is one such example: a III being misclassified as a IV.
[Image: misclassified image]
The ResNet-50 output for a hard example. The model classifies the image as a IV with a raw prediction value of 2.793, but the prediction value for class III is only slightly lower at 2.225.

From the model prediction output, we observed that even though we misclassify this example, the values for class III and class IV are very close. We wanted to identify more examples on or near the “decision boundary” and add them to our training set, so we defined a difficulty score as described below:

$$\text{difficulty score} = \frac{P_{o,\max} - P_{o,2\text{nd max}}}{P_{o,\max}}$$

where $P_{o,\max}$ is the maximum predicted output and $P_{o,2\text{nd max}}$ is the next-highest predicted value. This is the relative difference between the most likely and second-most likely predicted values. For the output shown in the example, the difficulty score is $(2.793 - 2.225) / 2.793 = 0.203$. We then added a constraint: if the difficulty score is less than 0.5, we consider the image a “hard example.” This process gave us an additional 880 images that we added to our training set.
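
The difficulty score described above can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the top raw output is positive (otherwise the ratio is not meaningful). Only the two largest values in the example (2.793 and 2.225) come from the article; the remaining outputs are filler.

```python
def difficulty_score(outputs):
    """Relative gap between the largest and second-largest raw model outputs."""
    top, second = sorted(outputs, reverse=True)[:2]
    if top <= 0:
        # Assumption: the score is only defined for a positive top output.
        raise ValueError("difficulty score assumes a positive top output")
    return (top - second) / top


def is_hard_example(outputs, threshold=0.5):
    """Flag examples near the decision boundary (threshold of 0.5 from the article)."""
    return difficulty_score(outputs) < threshold


# 2.793 (class IV) and 2.225 (class III) are from the article; the rest are made up.
raw_outputs = [2.793, 2.225, 0.4, -1.1]
print(round(difficulty_score(raw_outputs), 3))  # 0.203
print(is_hard_example(raw_outputs))  # True
```

A lower score means the top two classes are closer together, so the example sits nearer the decision boundary.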

[Image: difficulty score examples]

Additionally, in order to avoid overfitting, we cropped the images by a few pixels to reduce the white space around the Roman numerals and applied varying numbers of dilation and erosion iterations. Some of these additional examples are shown above.

However, the addition of these 880 “hard” images meant that we had to remove 880 existing images from our validation set. To do this, we studied the histogram distribution of the difficulty score for each class in the training dataset, and matched its distribution in the validation set. Matching the training and validation difficulty scores was the best way to avoid over- or underfitting to the training dataset, and ensure optimal model performance.
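
One way to match difficulty-score histograms is sketched below, under several assumptions: scores fall in [0, 1], they are grouped into ten equal-width bins, and the validation subset for a class is sampled so that each bin's share follows the training histogram. The function names, bin count, and sampling strategy are illustrative and not a description of the exact procedure we used.

```python
import random
from collections import defaultdict


def bin_index(score, n_bins):
    """Map a score in [0, 1] to one of `n_bins` equal-width bins."""
    return min(int(score * n_bins), n_bins - 1)


def match_histogram(train_scores, val_scores, n_keep, n_bins=10, seed=0):
    """Pick about `n_keep` validation indices whose binned scores follow the training histogram."""
    rng = random.Random(seed)
    train_counts = [0] * n_bins
    for s in train_scores:
        train_counts[bin_index(s, n_bins)] += 1
    val_by_bin = defaultdict(list)
    for i, s in enumerate(val_scores):
        val_by_bin[bin_index(s, n_bins)].append(i)
    kept = []
    for b in range(n_bins):
        # Quota for this bin, proportional to its share of the training scores.
        quota = round(n_keep * train_counts[b] / len(train_scores))
        candidates = val_by_bin[b]
        # Rounding and sparse bins can make the total differ slightly from n_keep.
        kept.extend(rng.sample(candidates, min(quota, len(candidates))))
    return sorted(kept)


# Toy scores for one class: training is half easy, half hard; validation is skewed.
train = [0.05] * 50 + [0.95] * 50
val = [0.05] * 10 + [0.95] * 30
kept = match_histogram(train, val, n_keep=20)
print(len(kept))  # 20
```

Running this per class keeps the validation set from being dominated by either very easy or very hard examples relative to training.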

Wrapping Up

The introduction of the Data-Centric AI Competition by Prof Andrew Ng and DeepLearning.AI provided us the perfect opportunity to test some of our hypotheses and improve the solutions we provide. We are grateful for this opportunity and want to thank everyone involved in setting up this competition. A special shoutout to the DeepLearning.AI moderators for making this journey easy and fun.