What are the odds that your ML project will reach production?

One claim puts the chances at 13%, or a little better than one in ten. At VentureBeat’s “VB Transform” summit, thought leaders from IBM and Gap identified the three greatest obstacles to successful deployment:

  1. Investing in AI and big data without a clear plan for converting learning into compelling customer benefits and business results. “Sometimes people think, all I need to do is throw money at a problem or put a technology in, and success comes out the other end,” said Deborah Leff, IBM’s CTO for Data Science and AI.
  2. Lack of data quality and access. Too often, useful data is trapped in silos because of access limitations, format variations, or quality issues.
  3. Obstacles to team collaboration. Gaining insights and acting on them requires shared goals and vocabulary that may not fully exist among business champions, analysts, data scientists, AI/ML modellers, developers and database

How can you tilt the odds in your favor? At Innotescus™, we have developed a platform for data annotation in computer vision projects, and we offer these considerations:

Manage Data Quality by Understanding Data Variability

Many ML projects over-concentrate on model development. While model development is essential, an even bigger hurdle to reaching the goal line is data quality and access.

In an earlier blog post, we addressed data cleaning problems and solutions for ML projects. One takeaway from that post was that not all errant data has to be “fixed.” Yes, data points that are missing or appear to be incorrect can be thrown out or replaced with null or interpolated values. But a smart step to take early in a project is to analyze the sample set’s variability, then determine the data cleaning and enrichment required to achieve a well-distributed, unbiased dataset of sufficient size.
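
As a minimal sketch of what an early variability check could look like, the Python snippet below summarizes how evenly annotation labels are spread across classes. It is an illustration only, not part of the Innotescus platform: the function name `label_balance`, the class names, and the counts are all hypothetical, and normalized entropy is just one of several reasonable balance measures.

```python
import math
from collections import Counter


def label_balance(labels):
    """Return per-class shares and a balance score between 0 and 1.

    The score is the Shannon entropy of the label distribution divided
    by its maximum possible value, so 1.0 means perfectly even classes
    and values near 0 mean one class dominates.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}
    if len(counts) < 2:
        # With zero or one class there is no spread to measure.
        return shares, 0.0
    entropy = -sum(p * math.log(p) for p in shares.values())
    return shares, entropy / math.log(len(counts))


# Hypothetical labels, for illustration only.
labels = ["crack"] * 6 + ["corrosion"] * 3 + ["misalignment"] * 1
shares, balance = label_balance(labels)

for cls, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{cls}: {share:.0%} of samples")
print(f"balance score: {balance:.2f}")
```

A low score, or a class with a very small share, is a prompt to look closer before spending time on model tuning; it does not by itself say which fix is appropriate.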

Deeply understanding your dataset not only leads to better model engineering; it also reduces iterative adjustments, project length, and computing resource needs.

One example from image classification is identifying and categorizing potential failures in industrial hardware, such as stress fatigue in metals or miscalibration of robot arms and their object manipulation. Image training data in these cases is sparse, and the images may exhibit poor lighting and shadows. Instead of concentrating on acquiring much more data or correcting image illumination, a data variability analysis may determine that bounding boxes (i.e., rectangular selection areas) can isolate the critical factors found in the photos, and that the data collection and annotation teams should instead concentrate on compiling images of under-represented flaws.
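
To make the last point concrete, here is a rough Python sketch of how a team might turn per-class annotation counts into collection targets for under-represented flaws. Everything in it is an assumption made for illustration: the function `collection_shortfall`, the flaw names, the counts, and the choice of "half the size of the largest class" as the target.

```python
import math


def collection_shortfall(counts, target_ratio=0.5):
    """Return how many more images each under-represented class needs.

    A class counts as under-represented when it has fewer images than
    target_ratio times the size of the largest class (an assumed rule
    of thumb, not a standard).
    """
    if not counts:
        return {}
    target = math.ceil(max(counts.values()) * target_ratio)
    return {cls: target - n for cls, n in counts.items() if n < target}


# Hypothetical annotated-image counts per flaw type.
counts = {"fatigue_crack": 120, "arm_miscalibration": 40, "grip_slip": 12}
shortfall = collection_shortfall(counts)

for cls, needed in shortfall.items():
    print(f"collect about {needed} more images of {cls}")
```

The threshold is a knob to tune per project; the useful part is that the data collection and annotation teams get a specific, shared list to work from.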

Parallelize Team Input on Data Quality

Your product manager is responsible for involving the data engineers, data scientists, algorithm experts, data annotators and others in the overall effort to design, develop, and deploy useful ML. Avoid the common mistake of isolating each group’s activities or widely staggering their efforts. Instead, each expert should gain exposure to the dataset early and be expected to comment on its implications in a common repository.

SAS’s CIO, Keith Collins, has stated that “when building an effective AI team, we can either look for a superhuman or a super team… Teamwork wins the day.” Your project objective should be to systematically facilitate interactions among the team to identify and solve issues as soon as they arise, in a way that satisfies all parties.

Adopt Data Tools that Create a Path to Success

A majority of ML/data science tools on the market are vertical, in the sense that they apply to a specific phase and technical mode in a project, such as data ETL or model sensitivity testing. To better convert learning into business results, you should also consider horizontal tools that incorporate lifecycle management, workflow, and collaboration.

In terms of data annotation, such tools should:

  • Make it easy to perform early data variability assessment.
  • Use statistical visualization and feature engineering capabilities to accelerate model development and time-to-market.
  • Facilitate the participation of team members, business colleagues, and even third parties in data labelling and quality assurance.

Innotescus satisfies all three of these points and enables better data, faster annotation, and deeper insights for impactful computer vision applications. It is designed to optimize model development over the entire lifecycle. If you are interested in improving your chances of success with an ML vision project, please contact us.
