“It is good to have an end to journey toward; but it is the journey that matters, in the end.”

— Ursula K. Le Guin, Influential Science Fiction Author

Rigorous data cleaning is essential for a successful, on-schedule machine learning project. Some consider data cleaning a single step in an ML project. Others look further and envision data cleaning as a waterfall process in its own right, composed of well-defined, discrete steps that follow one another, from inspection to correction to verification. With this approach, the results of the process are controlled and predictable.

We at Innotescus propose a change in this thinking. Data cleaning is treated most powerfully as an iterative process, one that starts with project goals, proceeds through inspection and cleaning, and arrives at insights… which then feed back into the goal statements. Not only does a consistent, multi-step, iterative process encourage addressing data-collection errors early, but it also provides in-depth insights into model development strategies, which improve ML model performance and influence a project’s aims. 

This blog post leverages our experience working with businesses that rely on our “AnnotationOps” platform for better quality data, faster annotation, and early insights into their data for successful computer vision projects. In this post, we will explore three related subjects within data cleaning:

  1. The fundamental steps in the data cleaning process 
  2. Why treating the process as iterative can make your projects more successful
  3. How you can tailor the process to fit your organization’s structure and needs

A Framework for an Iterative Data Cleaning Process

A comprehensive data cleaning process starts with goal-setting, moves through the execution steps, and arrives at insights and scale. We propose the 7-step cycle below (figure 1).

Guidelines for Each Step

  1. Set Goals
    • Set data quality goals for validity, accuracy, completeness, consistency and uniformity. Above all, keep the end application in mind so that you can distinguish a must-achieve benchmark from an acceptable range of results.
    • At the outset, create a data cleaning rulebook for the project. This guide will begin with goals, then capture detailed process guidelines and findings from each step in the cleaning cycle. This book becomes the heart of your iterative data cleaning methodology and a key component of the project documentation.
  2. Select a Sample 
    • Choose a subset of data to work on and make sure that it is representative of the live data, at least during the first iterations. Choosing a representative data set requires careful thought. Methods include simple random, systematic, stratified, and cluster sampling. A truly random sample can be computationally intensive, while a non-random subset may be sufficient for early hypothesis exploration.
    • Explicitly allocate time to consider how biases may be introduced into your data, your model, or even your application goals. Bias identification requires much more of a right-brain, holistic exploration than a traditional data sampling review. As with each step in this process, document your findings.
  3. Inspect
    • Exploratory Data Analysis, or EDA, is the process of visualizing datasets from multiple angles to quickly get a holistic understanding of the data. It is an effective way to analyze, and then respond to, the statistical characteristics your dataset presents. For example, you may discover outlier data points that are either erroneous and should be removed from the data set, or genuine and should be explored further to uncover new, real-world situations reflected in the data.
    • As you probe for errant data points, keep the 5 most common data cleaning needs top of mind: formatting, corruption, duplication, missing data and outliers. Make each of the five categories a sub-section in your rulebook.
  4. Clean
    • Request the eBook from Innotescus on techniques for resolving each of the 5 Common Machine Learning Data Cleaning Problems (listed in the previous bullet).  
    • Assess the available code libraries and API sets that transform data cleaning into a more efficient, higher level task. Document your data cleaning technology choices in the rulebook.
  5. Verify
    • Confirm that the data has been thoroughly cleaned by comparing the data set to the validation conditions established in your rulebook during the goal definition phase. Simple test cases for formats, boundary conditions, value ranges, etc. can be automated.
    • Ensure that your data set is clean enough to be fed to a “naive model,” which resembles your end product at its outset. A naive classification model starts without any tuning (feature engineering) related to the underlying real-world situation; it is essentially a blank slate.
  6. Report Findings and Insights
    • Document essential instructions in your data cleaning rulebook. This includes a list of erroneous data sources that should be avoided.
    • Much as any scientist would, keep a record of insights and hypotheses you’ve developed during the current iteration, then plan for the steps you intend to take in the next round.
  7. Scale the Data Cleaning Effort
    • After completing several iterative cycles, your team should have settled on a well-documented data cleaning rulebook. The near-final book should contain a section describing the data cleaning effort involved in terms of types of work, skill sets needed, intensity and duration.
    • Your ultimate goal is to scale the project by delegating tasks from the most senior team members, who are breaking new ground, to other more cost-effective resources. These may include the annotation team or a third-party data cleaning service provider.  
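
To make steps 2, 3 and 5 above more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not a prescribed implementation: the records, the field names (`source`, `label`, `width_px`), the allowed labels, the value range and the thresholds are all hypothetical stand-ins for whatever your own rulebook defines. It covers four of the five common problem categories; detecting corruption usually depends on the file format and is left out.

```python
import random
from collections import Counter, defaultdict

# Hypothetical records; every field name and value here is illustrative only.
RECORDS = [
    {"id": 1, "source": "cam_a", "label": "cat", "width_px": 640},
    {"id": 2, "source": "cam_a", "label": "cat", "width_px": 640},
    {"id": 3, "source": "cam_a", "label": "dog", "width_px": None},    # missing value
    {"id": 4, "source": "cam_b", "label": "dog", "width_px": 64000},   # outside range
    {"id": 4, "source": "cam_b", "label": "dog", "width_px": 64000},   # exact duplicate
    {"id": 5, "source": "cam_b", "label": "Dog ", "width_px": 800},    # formatting issue
    {"id": 6, "source": "cam_a", "label": "cat", "width_px": 1024},
    {"id": 7, "source": "cam_b", "label": "dog", "width_px": 480},
]


def stratified_sample(records, key, fraction, seed=0):
    """Step 2: draw the same fraction from every stratum (at least one record each)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


def count_issues(records, allowed_labels, width_range):
    """Step 3: count rule violations per problem category."""
    lo, hi = width_range
    issues = Counter()
    seen = set()
    for rec in records:
        row = tuple(rec.items())
        if row in seen:  # identical to an earlier record
            issues["duplication"] += 1
            continue
        seen.add(row)
        if rec["label"] not in allowed_labels:
            issues["formatting"] += 1
        if rec["width_px"] is None:
            issues["missing"] += 1
        elif not lo <= rec["width_px"] <= hi:
            issues["outlier"] += 1
    return issues


def verify(issues, max_allowed):
    """Step 5: pass only if every category stays within its rulebook threshold."""
    return all(issues[name] <= limit for name, limit in max_allowed.items())


sample = stratified_sample(RECORDS, key="source", fraction=0.5)
issues = count_issues(RECORDS, allowed_labels={"cat", "dog"}, width_range=(1, 10000))
print(len(sample), dict(issues), verify(issues, {"duplication": 0, "missing": 0}))
```

Keeping the thresholds in a plain dictionary means they can be copied straight from the rulebook and tightened from one iteration to the next without touching the checking code.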

Improving the Process Through Iteration, Prioritization and Customization

The above steps serve as a template for the data cleaning process. In actuality, the process can be rather fluid. Here, we briefly consider the effect on project success of fast iteration, effort prioritization, and process customization to suit your organization’s situation.

Studies have shown that more experienced data scientists repeatedly iterate between hypothesis, investigation, cleaning and analysis activities, especially during early brainstorming stages of a project. They are aware that statistical and neural network models are prone to generating misleading results when the data set has been partially cleaned; therefore, they proceed with caution. They may or may not formally document findings during every repetition of each step, but they are rigorous about recording the ultimate conclusions to guide the rest of the team. The best data scientists freely consult with the organization’s business champions when the findings suggest that the project aims can be ratcheted up. 

Through prioritization, you can optimize time spent on data cleaning by assessing the data’s characteristics and the project goals, and focusing on the most critical factors. Ask: which aspects of the dataset are likely to have the greatest influence on the ML model correctly distinguishing outcomes? Also, keep a record of the most common quality issues that tend to occur with particular classes of data, and the most effective approaches to resolve these issues. For example, some situations and data types are better suited to interpolation, while others are better served by replacement with null values. In the near future, we likely will see the commercialization of recommendation engines, themselves applying machine learning, that propose an efficient sequence of data cleaning steps.
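
The interpolation-versus-null choice mentioned above can be sketched as follows. This is a hedged illustration, not a recommendation for any particular dataset: the function names `interpolate_gaps` and `nullify` and the list of placeholder strings are assumptions made for this example. The working assumption is that an ordered numeric series (such as evenly spaced sensor readings) tolerates linear interpolation, while a categorical field is safer with an explicit null than with an invented value.

```python
def interpolate_gaps(values):
    """Fill None gaps between two known numbers by linear interpolation.

    Assumes the values are ordered and evenly spaced. Gaps at either end
    have no neighbour on one side and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def nullify(values, placeholders=("", "N/A", "unknown")):
    """Replace placeholder strings with None so that missing data is explicit."""
    return [None if v in placeholders else v for v in values]


readings = interpolate_gaps([10.0, None, None, 16.0, None])
colors = nullify(["red", "N/A", "", "blue"])
print(readings)  # [10.0, 12.0, 14.0, 16.0, None]
print(colors)    # ['red', None, None, 'blue']
```

Whichever approach you choose for a given class of data, recording the choice and the reason in the rulebook lets the next iteration reuse it or challenge it.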

Finally, the data cleaning cycle should be tailored to your organization’s structure and project design. Consider, for example, the model maintenance your application will need as its input data evolves. You should plan when, and by whom, model performance monitoring and additional data quality assurance activities will be carried out. The owner may be the data science team; however, your organization may also choose to foster a feedback channel from Customer Service, if that team uses the results of the ML application.

In sum, we strongly believe that the best data cleaning frameworks are iterative, dynamic and tailored. The 7-step template is a useful starting point, but the most effective results will stem from making the process your own.