Arguably, the most important challenge facing machine learning today involves both data science and social equity: reducing bias in ML data, models and practices. Machine learning is playing a rapidly expanding role in industries ranging from lending to insurance, healthcare, recruiting and more. As mathematician and speaker Cathy O’Neil points out in her TED Talk, the growing use of AI means that those who design and deploy machine learning hold substantial power over business and societal outcomes. A realization is emerging that bias in ML (and in society at large) needs to be addressed head-on.

This post provides examples of bias in the domain of ML-based vision recognition; highlights common causes of bias; and recommends ways that data scientists, project managers and other project participants can take greater responsibility for preventing bias in their ML projects.


Bias in Vision Projects

ML practitioners may consider vision recognition to be a technically challenging field, yet not fully appreciate the effort that needs to be concentrated on avoiding bias. Take, for example, ‘smart’ surveillance cameras that have been trained to spot illegal activities and even categorize the types of people engaged in those activities (e.g., “replay footage of all younger men who loiter in the store”). What if the training data were drawn primarily from night-time footage – might the ML model then be overly sensitized to the traits of people who tend to work night shifts?
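A simple data audit can surface this kind of skew before training begins. The sketch below (a minimal illustration; the metadata fields and threshold are hypothetical, not drawn from any real surveillance dataset) checks what fraction of training clips were captured at night:

```python
from collections import Counter

# Hypothetical training metadata: the hour (0-23) each clip was captured.
# In a real project this would come from the dataset's annotation files.
clip_hours = [23, 1, 2, 22, 3, 23, 0, 14, 2, 1, 22, 23]

def night_share(hours):
    """Fraction of clips captured at night (20:00-04:59)."""
    night_hours = set(range(20, 24)) | set(range(0, 5))
    counts = Counter(h in night_hours for h in hours)
    return counts[True] / len(hours)

share = night_share(clip_hours)
print(f"night-time share of training data: {share:.0%}")
if share > 0.7:  # threshold chosen for illustration only
    print("WARNING: footage heavily skewed toward night-time capture")
```

An audit like this does not fix the bias, but it turns a vague worry into a concrete, reviewable number that the team can discuss before the model is built.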

Sources of Bias

Bias can be introduced at many points in a project: in the original data (sample bias), filtering of data used (exclusion bias and handling of outliers), the labeling/annotation process (observation bias), construction of the model (prejudice bias), and elsewhere.
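Sample bias, in particular, can often be detected by comparing the composition of the training sample against a trusted reference distribution. The sketch below is a minimal illustration; the group labels and reference shares are invented for the example, and a real audit would use an appropriate reference such as census or customer-population data:

```python
from collections import Counter

# Hypothetical training sample: 70% group A, 20% B, 10% C.
sample_labels = ["A"] * 70 + ["B"] * 20 + ["C"] * 10

# Assumed reference (e.g., population) shares for the same groups.
reference = {"A": 0.45, "B": 0.35, "C": 0.20}

def representation_gaps(labels, reference):
    """Difference between each group's share in the sample and its
    share in the reference distribution (positive = over-represented)."""
    n = len(labels)
    counts = Counter(labels)
    return {g: counts.get(g, 0) / n - p for g, p in reference.items()}

for group, gap in representation_gaps(sample_labels, reference).items():
    direction = "over" if gap > 0 else "under"
    print(f"group {group}: {direction}-represented by {abs(gap):.0%}")
```

Reporting gaps per group, rather than a single summary score, makes it easier to see exactly who is missing from the data.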

Questions to pose to project participants (as well as fresh sets of eyes) early on should probe for conceivable sources of bias, e.g.:

  • Has the specification of “optimal outcomes” for the project been made open-mindedly (without assuming the status quo represents the optimal situation)?
  • How was sample data collected? What context may be absent from the available data?
  • Are annotators, categorizing data for training purposes, being asked to apply subjective opinions?

Means of Reducing Bias

Bias reduction can be addressed through best practices as well as technological support. An InformationWeek article recommends grounding best practices in a plan that includes dialogue to build awareness of potential issues, an evaluation program, and mitigation steps. Agreeing on mitigation steps up front, in particular, helps avoid defensiveness when critiques arise later in the project.

One effective planning and evaluation practice is “feature engineering”: the human-driven process of evaluating all data attributes that could serve as inputs to an ML model, understanding their effects, and ensuring that the useful attributes are incorporated, structured and weighted appropriately. Feature engineering inherently makes practitioners responsible for interpreting and fine-tuning outcomes. By contrast, popular “deep learning” techniques rely on a solely software-driven process that weights the attributes it is given and extrapolates patterns.
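The distinction is easiest to see in code. In the hedged sketch below (the record fields, thresholds, and lending scenario are all hypothetical), each engineered feature is an explicit, human-chosen transformation with a stated rationale, so a reviewer can question or remove any one of them:

```python
# Hypothetical loan-application records; field names are illustrative only.
applications = [
    {"income": 52000, "debt": 13000, "years_employed": 4},
    {"income": 38000, "debt": 22800, "years_employed": 1},
]

def engineer_features(app):
    """Build interpretable model inputs by hand, one deliberate choice at a time."""
    return {
        # Debt-to-income ratio: an explainable measure of financial burden.
        "dti": app["debt"] / app["income"],
        # Bucketed so small differences in tenure aren't over-weighted
        # (the 2-year cutoff is an assumption a reviewer could challenge).
        "stable_employment": app["years_employed"] >= 2,
    }

features = [engineer_features(a) for a in applications]
print(features)
```

Because every input is visible and documented, the team can debate whether a feature encodes bias (e.g., whether employment tenure proxies for age) before it ever reaches the model; with an end-to-end deep learning pipeline, that conversation is much harder to have.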


This overview should give the reader a sense that bias can be introduced into machine learning models through multiple vectors, most often inadvertently. Regardless of source, there can be serious consequences for those affected by the models’ outcomes. It is the responsibility of each person involved in planning, structuring, using and assessing an ML model to consider how explicit practices and enhanced tools can reduce the effects of bias on their business and on society.
