This blog walks through using NVIDIA’s TAO Toolkit and Innotescus’ data curation and analysis platform to improve a popular object detection model’s performance on the ‘person’ class by over 20%. 


AI applications are powered by machine learning models that are trained to accurately predict outcomes based on input data such as images, text, or audio. Training a machine learning model from scratch requires vast amounts of data and a considerable amount of human expertise, often making the process too expensive and time-consuming for most organizations. Transfer learning is the happy medium between building a custom model from scratch and choosing an off-the-shelf commercial model to integrate into an ML application. With transfer learning, ML practitioners can select a pretrained model that’s related to their solution, and re-train it on data reflecting their specific use case. Transfer learning strikes the right balance between the custom-everything approach – often too expensive – and an off-the-shelf approach – often too rigid – and enables ML practitioners to build tailored solutions with fewer resources.

NVIDIA’s TAO (train, adapt, and optimize) toolkit makes it easy for developers to apply transfer learning to pretrained models and create custom, production-ready models without the complexity of AI frameworks. To train these models, high-quality data is a must. While TAO focuses on the model-centric steps of the development process, Innotescus focuses on the data-centric steps.

Together, Innotescus and the TAO toolkit make it cost-effective for organizations to successfully apply transfer learning in their custom applications, arriving at high-performing solutions in little time.

In this article, we will address the challenges of building a robust object detection model by integrating NVIDIA’s TAO toolkit with Innotescus to alleviate several common pain points businesses encounter when building and deploying commercial solutions.

The YOLO Object Detection Model

Our goal in this project is to apply transfer learning to the YOLO object detection model in the TAO toolkit using data curated on Innotescus. 

Object detection, the ability to localize and classify objects with a bounding box in an image or video, is one of the most widely used applications of computer vision technology. Object detection addresses many complex real-world challenges like context and scene understanding, an important part of automating solutions for smart retail, autonomous driving, precision agriculture, and more.

Why do we want to use YOLO in particular for our model? Traditionally, deep learning-based object detectors operate through a two-stage process. In the first stage, the model identifies regions of interest in an image. In the second stage, each of these regions is classified. Typically, many regions are sent to the classification stage, and because classification is an expensive operation, two-stage object detectors tend to be slow. YOLO stands for “You Only Look Once.” As the name suggests, YOLO localizes and classifies simultaneously, which yields accurate real-time performance, an essential property for most deployable solutions. In April of 2020, the fourth iteration of YOLO was published; it has been tested across a multitude of applications and industries and has proven to be very robust.

The figure below shows the general pipeline for training object detection models. For each step of this more traditional development pipeline, we will discuss the typical challenges developers encounter and how the combination of TAO and Innotescus solves these problems.

A high-level AI workflow includes data collection, followed by curation to ensure high quality training data. The data then is used to train an AI model, which is then tested and deployed for inference.

Now we’re ready to interact with the platform via the API, which we’ll do as we walk through each step of the pipeline that follows.

Data Collection

First and foremost, we need data to train our model. Though it’s often overlooked, data collection is arguably the most important step in the development process. While collecting data, we need to ask ourselves a few questions:

  1. Is our training data adequately representative of each object of interest?
  2. Are we accounting for all the scenarios in which we expect our model to be deployed?
  3. Do we have enough data to train our model?

We can’t always answer these questions completely, but having a well-rounded game plan for data collection helps developers avoid issues during subsequent steps in the development process. Data collection is a time-consuming and expensive process, but because the models provided by TAO are pretrained, the data requirements for re-training are much smaller, saving organizations significant resources in this phase.

For this experiment, we will use images and annotations from the MS COCO Validation 2017 dataset, which are available here. This dataset has 5,000 images spanning 80 different classes, but we will only use the 2,685 images containing at least one person, as shown below.

A collage of images showing examples of the ‘person’ class.
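
As a rough illustration of how such a person-only subset could be selected, the sketch below filters a COCO-style annotation dictionary for images with at least one annotation of a given category. It assumes the standard COCO layout (`categories`, `images`, and `annotations` keys); the function name `images_with_category` and the toy data are hypothetical and are not taken from the original project.

```python
def images_with_category(coco, category_name="person"):
    """Return sorted ids of images with at least one annotation of the category."""
    # Look up the numeric id(s) for the requested category name.
    category_ids = {c["id"] for c in coco["categories"] if c["name"] == category_name}
    # Collect every image id referenced by an annotation of that category.
    image_ids = {
        a["image_id"] for a in coco["annotations"] if a["category_id"] in category_ids
    }
    return sorted(image_ids)


# Tiny in-memory stand-in for a COCO annotation file (hypothetical data).
toy_coco = {
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
    "images": [{"id": 10}, {"id": 11}, {"id": 12}],
    "annotations": [
        {"id": 100, "image_id": 10, "category_id": 1, "bbox": [5, 5, 20, 40]},
        {"id": 101, "image_id": 10, "category_id": 18, "bbox": [30, 30, 15, 10]},
        {"id": 102, "image_id": 11, "category_id": 18, "bbox": [0, 0, 10, 10]},
        {"id": 103, "image_id": 12, "category_id": 1, "bbox": [1, 2, 3, 4]},
    ],
}

person_image_ids = images_with_category(toy_coco)
print(person_image_ids)
```

With a real annotation file, the dictionary would typically come from `json.load` on the downloaded annotations rather than from an inline literal.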

With our authenticated instance of the Innotescus client, we can begin setting up a project and uploading our human-focused dataset:

A python code snippet showing project and dataset creation in the Innotescus platform.

data_type: The type of data this dataset will hold. Accepted values are DataType.IMAGE and DataType.VIDEO.

storage_type: The source of the data. Accepted values are StorageType.FILE_SYSTEM and StorageType.URL.

This dataset is now accessible through the Innotescus UI:

A gallery view shows the human-centric Coco Validation 2017 subset on the Innotescus platform. Users can easily browse, edit, and label using the simple graphical interface.

Data Curation

Now that we have our initial dataset, we can begin curating it to ensure it is well balanced. This phase is commonly estimated to take around 80% of the time spent on a machine learning project. Using TAO and Innotescus, we will highlight techniques like pre-annotation and review that save time during this step without sacrificing dataset size or quality.

Beyond a first pass at fixing and submitting pre-annotations, Innotescus allows for a more focused sampling of images and annotations for multi-stage review. This allows large teams to systematically and efficiently ensure high quality throughout the dataset.

Read the full blog here to learn more about pre-annotation and how Innotescus’ multi-stage review improves dataset quality.

Exploratory Data Analysis

Exploratory data analysis, or EDA, is the process of investigating and visualizing datasets from multiple statistical angles to get a holistic understanding of the underlying patterns, anomalies, and biases present in the data. It is an effective and necessary step to take before thoughtfully addressing the statistical imbalances your dataset contains. Innotescus provides pre-calculated metrics for understanding class, color, spatial, and complexity distributions for both data and annotations, and allows users to add their own layer of information in image and annotation metadata to incorporate application-specific information into the analytics.

With Innotescus, EDA is made simple and intuitive and provides ML practitioners with the information they need to make simple yet powerful augmentations to their dataset to eliminate bias early in their development process.

Cluster Rebalancing with Dataset Augmentation

The idea behind augmentation for cluster rebalancing is simple yet powerful; this technique showed a 21% boost in performance in the recent data-centric AI competition hosted by Andrew Ng and DeepLearning.AI. We will generate an N-dimensional feature vector for each data point (each bounding box annotation) and cluster all data points in that N-dimensional space. Once we cluster objects with similar features, we will augment the dataset such that each cluster has equal representation.

The plot shows the four unbalanced clusters of annotations.

When we look at the number of objects in each cluster, we can clearly see the imbalance, which informs how we should augment the data for re-training. The four clusters represent 854, 1,523, 1,481, and 830 images, respectively. Where an image has objects in more than one cluster, we assign that image to the cluster containing the majority of its objects for augmentation.

The code snippet shows how to calculate and display the number of objects and images in each cluster.
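
A minimal sketch of this kind of counting is shown below, assuming each bounding box annotation has already been given a cluster label and is represented as an `(image_id, cluster_id)` pair. The function name `cluster_counts`, the data layout, and the tie-breaking rule (lowest cluster id wins when no cluster holds a clear majority) are assumptions made for illustration, not details from the original snippet.

```python
from collections import Counter, defaultdict


def cluster_counts(annotations):
    """Count objects per cluster and assign each image to its dominant cluster.

    annotations: iterable of (image_id, cluster_id) pairs, one per bounding box.
    """
    annotations = list(annotations)  # allow generators to be traversed twice

    # Number of bounding boxes that fell into each cluster.
    objects_per_cluster = Counter(cluster for _, cluster in annotations)

    # For every image, tally how many of its boxes belong to each cluster.
    clusters_by_image = defaultdict(Counter)
    for image_id, cluster in annotations:
        clusters_by_image[image_id][cluster] += 1

    # Assign each image to the cluster holding most of its boxes.
    # Ties go to the lowest cluster id (an arbitrary, deterministic assumption).
    image_cluster = {
        image_id: min(counts, key=lambda c: (-counts[c], c))
        for image_id, counts in clusters_by_image.items()
    }
    images_per_cluster = Counter(image_cluster.values())
    return objects_per_cluster, images_per_cluster, image_cluster


# Hypothetical toy data: four images, seven boxes, four clusters.
toy_annotations = [
    ("a", 0), ("a", 0), ("a", 1),
    ("b", 1),
    ("c", 2), ("c", 3),
    ("d", 3),
]
objects_per_cluster, images_per_cluster, image_cluster = cluster_counts(toy_annotations)
for cluster in sorted(objects_per_cluster):
    print(cluster, objects_per_cluster[cluster], images_per_cluster[cluster])
```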

With our clusters well-defined, we will use the imgaug Python library to introduce simple augmentation techniques (translation, image brightness adjustment, and scale augmentation) to enhance our training data. We will augment such that each cluster contains 2,000 images, for a total of 8,000; as we augment images, imgaug ensures the annotation coordinates are altered appropriately as well.

The code snippet shows how we augment each cluster to achieve equal representation.
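
To make the arithmetic behind this rebalancing concrete, the sketch below works out how many augmented images each cluster would need to reach the 2,000-image target, using the per-cluster image counts quoted earlier. It covers only the bookkeeping, not the imgaug calls themselves; the function name `augmentation_plan`, the cluster ids 0 to 3, and the idea of splitting the work into full passes plus a remainder are illustrative assumptions.

```python
TARGET_PER_CLUSTER = 2000

# Image counts per cluster as quoted in the text; the ids 0-3 are labels chosen here.
images_per_cluster = {0: 854, 1: 1523, 2: 1481, 3: 830}


def augmentation_plan(counts, target):
    """For each cluster, compute how many new images are needed to reach target."""
    plan = {}
    for cluster, n in counts.items():
        if n <= 0:
            raise ValueError(f"cluster {cluster} has no images to augment")
        needed = max(target - n, 0)
        # One way to spread the work: augment every image `full_passes` times,
        # then augment `extra_samples` additional images once more.
        full_passes, extra_samples = divmod(needed, n)
        plan[cluster] = {
            "new_images": needed,
            "full_passes": full_passes,
            "extra_samples": extra_samples,
        }
    return plan


plan = augmentation_plan(images_per_cluster, TARGET_PER_CLUSTER)
total_new = sum(p["new_images"] for p in plan.values())
total_after = sum(images_per_cluster.values()) + total_new
for cluster, p in plan.items():
    print(cluster, p)
print("new images:", total_new, "total after augmentation:", total_after)
```

Under these counts, the two smallest clusters need more than one augmented copy of every image, while the two larger ones only need a subset of their images augmented once.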

Using the same UMAP visualization technique, with augmented data points now in red, we see that our dataset is now much more balanced, as it more closely resembles a Gaussian distribution.

A plot showing the newly-rebalanced dataset.

With our well-balanced, high quality training data, the next and final step is to train the model. Read the original blog here to learn how. 

Key Takeaways

Using NVIDIA’s TAO toolkit for pre-annotation and model training, and using Innotescus for data refinement, analysis, and curation, we were able to improve YOLOv4’s mean average precision on the person class by over 20%. Not only did we improve our performance on a selected class, but we were able to do so with less time and data than would have been required without transfer learning. Transfer learning is a great way to produce high-performing, application-specific models in settings with constrained resources, and using tools like the TAO toolkit and Innotescus makes doing so feasible for teams of all sizes and backgrounds.

Try it for Yourself

Interested in using Innotescus to enhance and refine your own dataset? Sign up for a free trial, or check out our webpage to learn more. 

Get started with the TAO toolkit for your AI model training by downloading the sample resources.