Having the right data is only the first step on the road to training your machine learning model and deploying your solution. Though you're probably eager to jump right into the training process, it's important to first have a firm grasp of exactly what you're feeding your model; as the saying goes, garbage in, garbage out. Before you can profile your data to understand it, however, you need to clean it. If your data is not cleaned appropriately prior to training, the model will perform suboptimally, resulting in additional compute time and costs, a delayed deployment schedule, and lost profits. This seems straightforward, but making sure your data is ready for training is far more involved and time-consuming than most realize.


To build high-quality training datasets, ML scientists clean the data in an iterative process. Data cleaning is not simply about removing “bad” data; it is a complex process that involves removing and/or modifying data that might be incorrect, irrelevant, incomplete, or improperly formatted. Once scientists identify data points that may harm the performance of their machine learning models, they must find ways to transform them to maximize the dataset’s accuracy.

Data cleaning requires knowledge of the machine learning algorithm being employed and how it interprets the data, as well as the domain in which it is being deployed. For example, if your application is credit card fraud detection, your dataset might include attributes like the transaction amount, currency, time, and location. Here, data cleaning might involve ensuring the currency of all transactions is consistent, or identifying duplicate transactions and flagging them. Additionally, some attribute values might be missing or corrupted due to human or software errors. The decision to omit such observations depends on the ML algorithm consuming the data: while most ML techniques cannot easily handle missing values, algorithms such as nearest neighbors and Breiman’s random forests can accommodate them intrinsically. Hence, data cleaning is a complex and iterative process. In this blog, we list a few common data cleaning problems that you might have to deal with while building a high-quality dataset.
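
To make the fraud-detection example concrete, here is a minimal sketch in pandas; the file name, column names (amount, currency, timestamp, location), and exchange rates are assumptions for illustration, not a prescribed pipeline.

```python
import pandas as pd

# Hypothetical transactions table with columns: amount, currency, timestamp,
# location; the exchange rates below are purely illustrative.
transactions = pd.read_csv("transactions.csv")

# Standardize currency: convert every amount to USD with a lookup table.
usd_rates = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}
transactions["amount_usd"] = transactions["amount"] * transactions["currency"].map(usd_rates)

# Flag exact duplicate transactions (same amount, timestamp, and location).
transactions["is_duplicate"] = transactions.duplicated(
    subset=["amount_usd", "timestamp", "location"], keep="first"
)

# Count missing values per attribute so we can decide, based on the
# downstream algorithm, whether to drop or impute them.
print(transactions.isna().sum())
```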

Data formatting

Collecting data from different sources is necessary to maintain variability in the dataset and ensure model robustness. However, it also introduces potential mistakes and inconsistencies, so the data first needs to be standardized for further consumption. Data formatting is often misunderstood as referring only to standardizing file formats; for example, we may want all images in a dataset to be either JPEG or PNG. Format standardization includes much more, however: we may need to ensure that all images have the same dimensions if the downstream algorithm requires it, or that time-series data is properly synchronized. In the case of tabular or categorical data, all attributes must have consistent formats, whether the attribute is a date, currency, or address. Furthermore, we need to ensure the values for each attribute satisfy its constraints. For example, if the data contains a probability attribute, each value should be a float with a specified number of decimal places and should fall between 0 and 1.
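
As a sketch of what format standardization can look like in practice, the snippet below re-encodes and resizes images and enforces type and range constraints on a tabular probability attribute; the directories, file names, column names, and target image size are all assumptions for illustration.

```python
from pathlib import Path

import pandas as pd
from PIL import Image

# --- Image formatting: convert everything to 640x480 PNG (target size is an assumption) ---
Path("clean_images").mkdir(exist_ok=True)
for path in Path("raw_images").glob("*"):
    if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
        img = Image.open(path).convert("RGB").resize((640, 480))
        img.save(Path("clean_images") / f"{path.stem}.png")

# --- Tabular formatting: enforce consistent types and value constraints ---
df = pd.read_csv("records.csv")            # hypothetical table
df["date"] = pd.to_datetime(df["date"])    # one canonical date representation
df["probability"] = df["probability"].astype(float).round(3)
assert df["probability"].between(0, 1).all(), "probability outside [0, 1]"
```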

Corrupted data

Data corruption refers to structural errors introduced during measurement, transfer, or storage; it can be caused by a multitude of factors, including but not limited to malware attacks, voltage spikes, hardware issues, and interruptions in data transmission. Corrupted data is generally irretrievable and unusable. Two of the most effective and widely used ways to deal with corrupted data are to simply remove the affected data points from the dataset, or to replace the corrupted values with nulls.
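
Both strategies are easy to express in pandas. In the sketch below, the file name, the "reading" column, and the corruption threshold are assumptions; the point is only to contrast dropping affected rows with replacing corrupted values by nulls.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("sensor_readings.csv")     # hypothetical table with a "reading" column

# Suppose values outside the sensor's physical range indicate corruption;
# the threshold of 1000 is an assumption for illustration only.
corrupted = df["reading"].abs() > 1000

# Option 1: remove the corrupted data points entirely.
cleaned = df[~corrupted].copy()

# Option 2: keep the rows but replace corrupted values with nulls (NaN),
# deferring the drop-or-impute decision to a later stage.
df.loc[corrupted, "reading"] = np.nan
```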

Duplicate/irrelevant data

Data acquired from different sources over different periods of time often contains duplicate or irrelevant data points. For example, while collecting data for an autonomous driving application, say stop sign detection, we may end up collecting a lot of traffic data that does not contain any stop signs. On the other hand, we could end up capturing the same stop sign multiple times. Duplicate or irrelevant data points lead to major issues like data skew and class imbalance when training machine learning models. Irrelevant data can simply be removed from the repository; dealing with duplicate data, however, is a matter of choice and a popular topic of research. Identifying duplicate data points and applying adequate transformations to them can significantly enrich your datasets.
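
Here is a minimal sketch of both ideas, assuming a hypothetical annotation index with image_id, file_hash, and labels columns: irrelevant frames are filtered out, and exact duplicates are detected by file hash (near-duplicate detection, as noted above, is a richer problem than this).

```python
import pandas as pd

# Hypothetical annotation index: one row per collected frame, with columns
# image_id, file_hash, and labels (a comma-separated list of object classes).
index = pd.read_csv("annotation_index.csv")

# Irrelevant data: drop frames that contain no stop sign at all.
relevant = index[index["labels"].str.contains("stop_sign", na=False)].copy()

# Duplicate data: exact duplicates share a file hash; keep only the first copy.
deduplicated = relevant.drop_duplicates(subset="file_hash", keep="first")

print(f"kept {len(deduplicated)} of {len(index)} frames")
```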

Missing data

Missing data is arguably the most widespread problem in data cleaning. As with corrupted data, we could simply remove any data point containing missing or incomplete values. However, because missing values occur so frequently, removing those observations would dramatically reduce the amount of data available for training. Missing data can be classified into three major categories:

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Not missing at random (NMAR)

Depending on the type, we can employ several imputation strategies to account for missing data. In other words, we can fill in the missing values using the information we have from existing values. Such strategies include mean or median imputation, imputation using the most frequently occurring local values, imputation using K-nearest neighbor similarity, and more.
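
The sketch below shows a few of these strategies using scikit-learn's imputers on a toy feature matrix; the columns and values are invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy feature matrix with missing entries; the values are purely illustrative.
X = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [48_000, np.nan, 61_000, 75_000, 52_000],
})

# Mean imputation: replace each NaN with its column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Median imputation is more robust to skewed distributions;
# strategy="most_frequent" would impute the most common value instead.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: fill each NaN using the k most similar rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```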

Outlier removal

Outliers are observations with extreme values compared to other observations in the dataset, and may occur due to high variance in measurements or experimental errors. Outliers can cause problems with generalizability in many machine learning techniques, including regression models and poorly constructed deep learning models. Removing outliers from the dataset is generally the most effective and easiest solution; correctly identifying the outliers, however, can be tricky. Outliers fall into three main types:

  • Point outliers: Single observations that don’t follow the overall distribution
  • Collective outliers: Subsets of observations that don’t follow the overall distribution
  • Contextual outliers: Observations that are anomalous only within a specific context; such patterned outliers are often caused by noise

Collective outliers could be indicative of a previously unexplained phenomenon, so removing such observations might cause us to lose important information. Similarly, contextual outliers can often be dealt with using proper filtering or denoising techniques. To identify outliers in the dataset, we can use two popular methods:

  • Univariate method: This method looks for outliers in a single feature space. The simplest way to achieve this is with box plots or the interquartile range (IQR), as shown in the example below.
  • Multivariate method: This method looks for outliers in the n-dimensional feature space corresponding to n correlated features. The simplest way to do this is to train a predictive model and use it to identify outliers. Over the years, algorithms like Random Sample Consensus (RANSAC) have gained wide popularity for particularly effective outlier rejection.
Figure: Example of univariate outlier identification, generated using a modified MATLAB carsmall dataset containing 100 cars and their corresponding MPG values.
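
The sketch below illustrates both approaches on synthetic data: the IQR rule for the univariate case (a stand-in for the carsmall figure above) and scikit-learn's RANSACRegressor for the multivariate case. The distributions, thresholds, and injected outliers are assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)

# --- Univariate: IQR rule on synthetic MPG-like values ---
mpg = np.concatenate([rng.normal(24, 6, 97), [55, 58, 2]])   # three injected outliers
q1, q3 = np.percentile(mpg, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("univariate outliers:", mpg[(mpg < lower) | (mpg > upper)])

# --- Multivariate: RANSAC fits a model to the inliers and flags everything else ---
x = rng.uniform(0, 10, size=(100, 1))
y = 3 * x.ravel() + rng.normal(0, 1, 100)
y[:5] += 30                                  # corrupt a few points
ransac = RANSACRegressor().fit(x, y)
print("points flagged as outliers:", int(np.sum(~ransac.inlier_mask_)))
```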

Although demanding and labor-intensive, data cleaning is an unavoidable step in the machine learning development process. To use the data for the rest of that process, we must first make sure it’s properly prepared in all respects, which means finding thoughtful, automated ways to compensate for the variety of problems inherent to real-world data collection.

Once the data has been sufficiently cleaned, we are left with robust datasets that help scientists avoid many of the common pitfalls of machine learning. The next step, preliminary data profiling, involves dividing data into meaningful categories and sorting it into designated repositories. For example, cleaned data for autonomous driving applications can be divided based on traffic conditions (low, medium, and heavy), weather (sunny, overcast, and rainy), and more. Through such profiling, we can dig deeper into the data and analyze things like category-based distributions, which provide better visibility into possible limitations and edge cases. Through the process of profiling cleaned data, ML scientists gain deeper insights into the quality of their data that were previously obscured by the disarray of raw data. From these insights, they can either proceed to data annotation or return to the data collection and cleaning process to enhance their datasets.
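
As a small example of such profiling, the sketch below tabulates the joint distribution of two hypothetical categorical attributes of a cleaned driving dataset; the file and column names are assumptions.

```python
import pandas as pd

# Hypothetical metadata for a cleaned autonomous-driving dataset, with one row
# per frame and categorical columns such as "traffic" and "weather".
metadata = pd.read_csv("cleaned_metadata.csv")

# Category-based distributions expose imbalance and potential edge cases.
print(metadata.groupby(["traffic", "weather"]).size().unstack(fill_value=0))
```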
