Data Annotation: The Meat and Potatoes of Machine Learning Part 2
Data annotation is one of the most important stages of preparing data to train a Machine Learning model. As we explained in our previous post, Supervised Learning models will simply learn to replicate their training data when they’re deployed in the field, so poor quality annotations make for disappointing model performance. Though the annotation process isn’t hard to understand, we listed several challenges that sneak up on most teams, from simple problems like a lack of time and manpower, to more complex problems such as a lack of domain knowledge and ambiguous labels. Here we’ll discuss some solutions to these issues:
Saving time annotating large datasets is a common goal among ML development teams, but doing so while maintaining annotation quality can be tricky. In computer vision alone, there are many annotation types, including image tag, bounding box, polygon, line, spline, 3D cuboid and landmark annotations. As the type of annotation varies, so do the complexity and time required for labeling. Tagging images for image classification and drawing bounding boxes for object localization can be completed quickly and with few resources, while semantic segmentation tasks require far more time and more sophisticated software. Cityscapes, a popular benchmark dataset for computer vision research, includes 5,000 finely annotated images. Its authors noted that annotation and quality control required more than 1.5 hours per image on average. That’s over 7,500 hours of labeling time!
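To make that arithmetic concrete, here is a quick back-of-the-envelope sketch. The function name and the 40-hour work week are our own assumptions for illustration; only the image count and the 1.5 hours per image come from the figures above.

```python
def labeling_hours(num_images, hours_per_image):
    """Total annotation effort in hours (lower bound if hours_per_image is a minimum)."""
    return num_images * hours_per_image


# Figures quoted above: 5,000 images at 1.5+ hours each.
total_hours = labeling_hours(5000, 1.5)

# Assumption: one annotator working 40-hour weeks.
HOURS_PER_WEEK = 40
annotator_weeks = total_hours / HOURS_PER_WEEK

print(f"{total_hours:.0f} hours, or about {annotator_weeks:.1f} annotator-weeks")
```

Even split across a team of ten, that is still several months of full-time labeling.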
To counter this problem, the AI research community has been actively investigating automated, semi-automated, interactive and guided segmentation approaches to annotation. By reducing the need for human input in the annotation process, scientists hope to decrease both the time and cost of annotating data. In recent years, improvements to classical methods like SLIC and GrabCut, along with novel methods that use deep learning to power interactive segmentation, have made image annotation easier by requiring the annotator to click just a few points in the image.
Here we use one of Innotescus’ assisted segmentation tools to create a complex mask.
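To give a feel for the click-driven idea, here is a minimal sketch of seeded region growing. To be clear, this is not SLIC, GrabCut, a deep learning method, or how any Innotescus tool works; it is a much simpler classical technique, and the function name, tolerance value and toy image are all assumptions made up for illustration. The annotator supplies a single click (the seed), and the mask expands to neighboring pixels of similar intensity.

```python
from collections import deque


def grow_region(image, seed, tolerance=10):
    """Return a boolean mask of pixels connected to `seed` whose intensity
    is within `tolerance` of the seed pixel (4-connected flood fill)."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    mask = [[False] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the up, down, left and right neighbors.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    mask[nr][nc] = True
                    queue.append((nr, nc))
    return mask


# Toy grayscale image: a dark object (values near 10) on a bright background.
image = [
    [10, 12, 11, 200],
    [9, 11, 13, 205],
    [10, 198, 202, 199],
    [11, 201, 200, 203],
]

# One "click" on the dark object at row 0, column 0.
mask = grow_region(image, seed=(0, 0))
selected = sum(sum(row) for row in mask)
print(f"{selected} pixels selected from a single click")
```

Real interactive tools are far more robust to texture and lighting, but the workflow is the same: a few clicks in, a full mask out.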
Many applications of deep learning require annotators with expertise in a certain domain. However, domain experts and ML scientists would rather perfect their models than spend months annotating training datasets. Because of this, ML projects often begin with sourcing and training annotators, adding cost from the get-go. Several alternatives have emerged: while ontology-based and knowledge-graph-based annotation algorithms are showing some promise, transfer learning is currently leading the way in overcoming the limitations of domain-specific annotation tasks.
Transfer Learning is an ML technique in which a model trained to do a particular task can be used for a related task with minimal fine-tuning and re-training. For example, a model trained to detect cars can be fairly easily re-purposed to detect trucks. Using transfer learning, we can potentially reduce human involvement in the annotation process to just a few tweaks. This not only reduces the number of domain experts needed to perform large-scale annotation tasks, but also significantly decreases the amount of time required for each annotation.
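One simple flavor of this idea can be sketched as follows: keep a pretrained model frozen, fit only a tiny new "head" on a handful of expert-labeled examples, and let it propose labels for the rest. Everything here is hypothetical. `frozen_backbone` is a stand-in for a real pretrained network, the nearest-centroid head is just one easy choice among many, and the vehicle measurements are made up.

```python
import math


def frozen_backbone(vehicle):
    """Hypothetical stand-in for a pretrained network whose weights stay fixed.
    Here it simply maps (length_m, height_m) to a small feature vector."""
    length_m, height_m = vehicle
    return (length_m, height_m, length_m / height_m)


def fit_head(examples):
    """Fit the only new part of the model: one centroid per class in feature space."""
    sums, counts = {}, {}
    for vehicle, label in examples:
        feats = frozen_backbone(vehicle)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def prelabel(vehicle, centroids):
    """Propose a label: the class whose centroid is nearest in feature space."""
    feats = frozen_backbone(vehicle)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))


# A handful of expert-labeled examples (made-up measurements in meters).
labeled = [
    ((4.2, 1.5), "car"), ((4.6, 1.4), "car"), ((4.4, 1.6), "car"),
    ((8.0, 3.2), "truck"), ((10.5, 3.8), "truck"), ((7.2, 3.0), "truck"),
]
centroids = fit_head(labeled)

# Unlabeled items get a proposed label that a human only needs to confirm or tweak.
proposals = [prelabel(v, centroids) for v in [(4.5, 1.5), (9.0, 3.5)]]
print(proposals)
```

In practice the backbone would be a real network and the head would usually be trained with gradient descent, but the division of labor is the same: the expensive, general part is reused, and only a small task-specific part needs labeled data.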
Annotation accuracy is arguably the biggest source of concern in the entire annotation process. Unintentional inconsistencies and errors in annotations are unavoidable, especially in large training datasets, and these mislabeled instances, sloppy object boundaries, and ambiguous object classes ultimately lead to poor model performance.
Unclear labeling guidelines can cause confusion. Does the back of a stop sign or an altered stop sign count as a stop sign?
Metrics such as Krippendorff’s alpha (K-alpha) and the Pearson correlation coefficient (PCC) can measure how well annotators agree, but ensuring consistent quality across the dataset starts with giving annotators clear and straightforward instructions, examples, and motivations. Aside from unambiguous instructions, the best way to deal with annotation accuracy issues is to facilitate multi-annotator labeling. This process collects multiple annotations for the same data, which are then used as inputs to a robust, automated consensus mechanism that produces a single high-quality output.
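A consensus mechanism can be as simple as a majority vote with an agreement threshold. The sketch below is one minimal way to do it, not a description of any particular tool; the function name, the 0.6 threshold and the example labels are assumptions for illustration. Items where annotators disagree too much get no label and are flagged for review instead.

```python
from collections import Counter


def consensus(labels, min_agreement=0.6):
    """Return (label, agreement) for one item.

    `label` is None when no class reaches `min_agreement`, meaning the item
    should go back for review. With a threshold above 0.5, a tie can never win.
    """
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(labels)
    if agreement < min_agreement:
        return None, agreement
    return top_label, agreement


# Three annotators label the same three images (made-up data).
annotations = {
    "img_001": ["stop_sign", "stop_sign", "stop_sign"],
    "img_002": ["stop_sign", "other", "stop_sign"],
    "img_003": ["stop_sign", "other", "unsure"],
}

results = {name: consensus(votes) for name, votes in annotations.items()}
needs_review = [name for name, (label, _) in results.items() if label is None]

for name, (label, agreement) in results.items():
    print(f"{name}: {label} (agreement {agreement:.2f})")
```

The flagged items are often the most valuable output: they tend to point straight at the cases where the labeling guidelines are ambiguous, like the back of a stop sign.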
Outsourcing data annotation is a logical way to reduce the cost and time of annotating data yourself, but it introduces the risk of poorly annotated or unsecured data. Again, finding the right tool to manage this workflow can make all the difference: tools like Innotescus, with end-to-end encryption, individual customer data isolation, and varying levels of project access, ensure nobody can gain access to your data without your explicit permission. Once annotators do gain access, Innotescus provides areas for project administrators to include clear annotation instructions to ensure a high-quality work product.
Though the issues brought up by annotating data can become surprisingly complicated, access to the right tools and a clear understanding of the final application make all the difference. Understanding and accounting for these issues before diving into annotation is a necessary step toward a successful deployment in the field. With a thoroughly cleaned and carefully annotated dataset, you’re now ready to visualize and evaluate your dataset and understand its actual contents, which take center stage in the next steps of the Machine Learning process.
We are a group of scientists, engineers, and entrepreneurs with a vision for better AI. With backgrounds primarily in Machine Learning and Computer Vision, the Innotescus team understands the importance of having full control over and insight into data used to train Machine Learning models.