Can you tell the difference between a cat and a dog? Of course you can, but if you had to describe how you differentiate cats from dogs with 100 percent accuracy, you might scratch your head for a while. Most children learn to name objects through word books or videos that show the object and its associated word at the same time. Through the repetition of seeing objects with their associated names, or labels, whether in books or daily life, children gradually learn to identify common things.

Machine Learning, particularly Supervised Learning, is much like teaching a toddler: the model learns the same way, from images and labels. When ML scientists want to train a classification model for dogs, they need to collect a dataset with thousands of images of dogs and of other animals that look like dogs, such as wolves, coyotes, and foxes. Images alone are not enough, though. ML scientists also need associated labels for these images to tell the model what it’s looking at, just like the children’s word books. In the world of Machine Learning, these labels are often called annotations, and without them a supervised model has nothing to learn from.
National Geographic Word Book

Annotation is the meat and potatoes of Machine Learning, and it’s what Innotescus was built for. Annotations are essentially the right answers; they’re the desired outputs of our machine learning model, and they’re there to teach the model which inputs (i.e., cleaned data points) should lead to those outputs. The process of ‘teaching’ the model to successfully map these inputs to their corresponding outputs is called training. Through the training process, the model continuously adjusts its many parameters so that its predictions match as many annotations as possible. Without annotated data, the training process would have no way of measuring the model’s accuracy, and no way of knowing how to adjust the parameters after each training iteration to increase it.
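To make the idea of training concrete, here is a minimal sketch rather than a production recipe. It assumes a made-up one-feature dataset and a simple logistic model trained with gradient descent; none of the names or numbers come from a real project. The point is only that the gap between each prediction and its annotation is what drives every parameter update.

```python
import math

# Toy annotated dataset (made up for illustration): one feature per sample,
# with a label of 0 or 1 playing the role of the annotation.
inputs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 1, 1]


def predict_probability(weight, bias, x):
    """Logistic model: estimated probability that x belongs to class 1."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def accuracy(weight, bias):
    """Fraction of samples whose predicted class matches the annotation."""
    correct = sum(
        int(predict_probability(weight, bias, x) >= 0.5) == y
        for x, y in zip(inputs, labels)
    )
    return correct / len(inputs)


def train(epochs=10000, learning_rate=0.3):
    """Adjust the two parameters step by step using the annotation errors."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        # Prediction minus annotation: this gap drives each adjustment.
        errors = [
            predict_probability(weight, bias, x) - y
            for x, y in zip(inputs, labels)
        ]
        weight -= learning_rate * sum(e * x for e, x in zip(errors, inputs)) / len(inputs)
        bias -= learning_rate * sum(errors) / len(inputs)
    return weight, bias


weight, bias = train()
print("accuracy before training:", accuracy(0.0, 0.0))
print("accuracy after training:", accuracy(weight, bias))
```

Remove the `labels` list and the `errors` line has nothing to compute, which is the situation an unannotated dataset leaves a supervised model in.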

So, what do annotations look like? How do you tell a computer what’s what in, let’s say, an image? Image annotations can take several different formats, depending on the application; the most common types are listed below:

Image Classification annotations simply say what class is present in an image, which implies only one class per image. Classification annotations are best for models that are used to identify and sort objects, one at a time.

Object Detection annotations identify the class and location of each relevant object in the image using a bounding box. Object detection models are deployed in a wide range of tasks, from autonomous driving to crowd monitoring.

Bounding boxes are drawn around each human in an object detection task
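As a rough illustration of how the first two annotation types might be stored, here is a small sketch. The class names, field names, and pixel coordinates are all hypothetical and do not follow any particular tool’s export format; the corner-based box layout is just one common convention.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClassificationAnnotation:
    image_id: str
    label: str  # a single class for the whole image


@dataclass
class BoundingBox:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        """Box area in square pixels (zero if the corners are inverted)."""
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


@dataclass
class DetectionAnnotation:
    image_id: str
    boxes: List[BoundingBox]  # one box per object instance


classification = ClassificationAnnotation(image_id="img_001", label="dog")

detection = DetectionAnnotation(
    image_id="img_002",
    boxes=[
        BoundingBox("human", 10, 20, 60, 180),
        BoundingBox("human", 90, 25, 140, 175),
    ],
)

print(classification.label)
print(len(detection.boxes), "boxes; first box area:", detection.boxes[0].area())
```

A classification annotation carries one label per image, while a detection annotation carries a list of labeled boxes, one for each object.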

Instance segmentation annotations are like object detection annotations, but instead of a bounding box, they’re complex shapes, known as segmentation masks, that show which pixels belong to the relevant object. Like object detection tasks, instance segmentation tasks have one annotation for each instance of a class, and are used in similar scenarios where pixel-level accuracy is necessary.

Unique segmentation masks drawn for each human in an instance segmentation task

Semantic segmentation annotations are like instance segmentation annotations in that they’re also pixel-by-pixel, but they do not distinguish between multiple instances of the same class. In semantic segmentation, a single annotation covers all of the pixels that belong to one class, and each annotation refers to a separate class.

A single segmentation mask encompasses all humans in a semantic segmentation task
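The difference between the two segmentation types can be sketched with tiny binary masks. Everything below is made up for illustration: two hypothetical "human" instances on a 4x6 image, stored as nested lists where 1 marks a pixel belonging to the object. Merging the instance masks of a class with a pixel-wise OR gives the corresponding semantic mask.

```python
# Two hypothetical instance masks for the class "human" on a 4x6 image.
# 1 marks a pixel that belongs to the instance, 0 marks everything else.
person_a = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
person_b = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

# Instance segmentation: one (label, mask) pair per object.
instance_annotations = [("human", person_a), ("human", person_b)]


def to_semantic_masks(instances):
    """Merge instance masks into one mask per class (pixel-wise OR)."""
    semantic = {}
    for label, mask in instances:
        if label not in semantic:
            semantic[label] = [[0] * len(row) for row in mask]
        merged = semantic[label]
        for r, row in enumerate(mask):
            for c, value in enumerate(row):
                merged[r][c] = merged[r][c] | value
    return semantic


# Semantic segmentation: one mask per class, instances no longer separable.
semantic_annotations = to_semantic_masks(instance_annotations)
human_pixels = sum(sum(row) for row in semantic_annotations["human"])

print(len(instance_annotations), "instance masks")
print(len(semantic_annotations), "semantic mask covering", human_pixels, "pixels")
```

Going from instance masks to a semantic mask is a simple merge, but the reverse is not possible without extra information, because the semantic mask no longer records where one person ends and the next begins.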

These are the most common types of image annotation, but several others exist, including 3D cuboids, lines, and landmarks. Though the different types of annotations are simple enough to understand, the actual process of annotating an entire dataset introduces several challenges:


Data annotation requires a lot of manual work, and that work grows with the size of the dataset. Numerous studies conducted for different annotation task types indicate that about 80% of the time and resources in any machine learning project are spent on data preparation, of which over 30% is consumed by data annotation. While drawing bounding box annotations is relatively quick, studies conducted on the 164k-image COCO-Stuff dataset have shown that pixel-level annotations can take up to 90 minutes per image. Smarter tools that require limited human intervention are imperative to reduce the time required for these kinds of annotations.
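For a sense of scale, the figures quoted above can be turned into a back-of-the-envelope estimate. This is a rough sketch under loud assumptions: it applies the 90-minute upper bound to every one of the 164,000 images, which almost certainly overstates the real effort, and the 2,000 working hours per year is a hypothetical round number.

```python
def annotation_hours(num_images, minutes_per_image):
    """Total annotation effort in hours for a fixed per-image time."""
    return num_images * minutes_per_image / 60


# Assumption: every image takes the quoted worst case of 90 minutes.
upper_bound_hours = annotation_hours(164_000, 90)

# Assumption: one annotator works roughly 2,000 hours per year.
annotator_years = upper_bound_hours / 2_000

print(upper_bound_hours, "hours, or about", annotator_years, "annotator-years")
```

Even if the true per-image time is a fraction of that worst case, the total is still far more than a small team can absorb by hand.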

Domain Knowledge

Many applications require the annotator to have certain domain knowledge to accurately label the data. For example, a task could require a highly trained medical doctor to correctly classify malignant and benign cells in a dataset being used to train a model for early-stage cancer detection. As you may have guessed, building a team of annotators with the required domain knowledge, or even training existing annotators, is a time-consuming and costly endeavor. To mitigate this issue, ML scientists can quickly fine-tune a pre-trained model on a limited training dataset and then apply that model to the remaining data.

Annotation Accuracy

The performance of your model is going to largely depend on the accuracy of your labeled data. Since data annotation is a tedious manual task, it is all but inevitable that a large training dataset will contain some inaccurate labels. Below are a few common sources of error.

  • Mislabeled instances: Training a machine learning model is like teaching a toddler. If you show a child an apple and say it’s a grape, the next time the child sees the apple, she’ll call it a grape.
  • Label ambiguity: Real-world data is messy and noisy. This introduces variations in annotators’ interpretations of that data. For example, occluded objects, shadows, reflections, and glare all cause ambiguity while labeling, and introduce inconsistencies in a multi-annotator labeling environment.
  • Inaccurate localization: For robust model training, we need the localization of objects of interest to be as accurate as possible. For example, when training an object detection model, we need tight bounding boxes or boundary segments around the objects to get the model’s prediction as close to ground truth as possible.
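One common way to put a number on localization accuracy is Intersection over Union (IoU), the overlap between an annotated box and a reference box divided by their combined area. The article itself does not name a metric, so treat this as one reasonable choice; the box coordinates below are made up, and boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


reference_box = (10, 10, 50, 90)   # hypothetical reference annotation
tight_box = (10, 10, 50, 90)       # annotator hugged the object exactly
loose_box = (0, 0, 60, 100)        # annotator left a 10-pixel margin

print("tight:", iou(reference_box, tight_box))
print("loose:", round(iou(reference_box, loose_box), 3))
```

A perfectly tight box scores 1.0, and the score falls as the annotation drifts away from or balloons around the object, which makes IoU a handy check when reviewing annotators’ work.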

To build a training dataset with highly accurate annotations, we need to provide clear guidelines to build consensus between annotators and ensure consistency in the dataset.
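Consensus between annotators can also be checked mechanically. The sketch below assumes several annotators have each labeled the same objects and uses a simple majority vote, which is only one of many possible consensus rules; the object IDs, votes, and the `min_agreement` threshold are all hypothetical.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.5):
    """Return (label, agreement), or (None, agreement) if no clear majority."""
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    agreement = count / len(votes)
    if agreement > min_agreement:
        return label, agreement
    return None, agreement


# Hypothetical labels from three annotators for three objects.
votes_by_object = {
    "object_1": ["human", "human", "human"],
    "object_2": ["human", "car", "human"],
    "object_3": ["bag", "human", "scarf"],
}

results = {
    object_id: consensus_label(votes)
    for object_id, votes in votes_by_object.items()
}

for object_id, (label, agreement) in results.items():
    status = label if label is not None else "needs review"
    print(object_id, status, round(agreement, 2))
```

Objects without a clear majority are the ones worth sending back for review, and a pile of them around the same kind of object is usually a sign that the guidelines need another pass.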

Top-Left: Ground truth annotation.  Top-Right: Human mislabeled as Car.
Bottom-Left: Label ambiguity – are the bag and scarf parts of the human class?
Bottom-Right: Inaccurate localization – boundary extends beyond the object.


As datasets grow larger, the resources required for annotation can become unmanageable. Crowdsourcing is a rapidly growing solution for large scale labeling jobs, and while it can be a cost-effective and quick solution, it comes with two major disadvantages:

  • Non-expert annotators: Most commercial crowdsourcing platforms currently offer a large workforce of non-expert annotators. For applications that require domain knowledge, these platforms often produce unreliable training data.
  • Security: Your data is your secret ingredient. Sending it to a crowdsourcing platform exposes it to a large external workforce, which puts its security at considerable risk.

Data annotation is an indispensable part of building any machine learning solution. High-quality annotations are not only required for training robust supervised learning models; they can also be very useful for understanding hidden patterns in the data and for making informed choices when selecting an optimal ML technique for the application at hand.

Because the annotation process can introduce all kinds of errors in a dataset, finding the right tool to create and manage annotations is crucial. In our next post, we’ll dive deeper into how annotation tools handle these challenges to enable fast, accurate, and transparent data annotation.
