The Devil is in the Data: Machine Learning Process Simplified
Artificial Intelligence (AI), Machine Learning (ML), Computer Vision (CV), and Deep Neural Networks (DNN): these four common buzzwords represent billions of dollars to modern businesses. Machine Learning is empowering computer systems to solve problems ranging from day-to-day tasks like email spam filtering to complex tasks such as early-stage cancer detection. Machine Learning may seem daunting and sound like the science fiction portrayed in The Matrix movies, but in reality, it is merely data, algorithms, and training iterations. In this blog post, we will break down the nine common steps of Machine Learning for people interested in joining the technical conversation.
There are three major categories of Machine Learning: Supervised, Unsupervised, and Reinforcement Learning. Here, owing to its performance and popularity, we will focus specifically on Supervised Learning. Supervised Learning, simply put, is where an algorithm or model learns a mapping function f, from the input variables X (categorical data, images, or text), to the output variables Y (the desired outcome or ground truth).
The aim is to iteratively train the model to estimate the mapping function so well that when you have a new input x, you can use the model to accurately predict the corresponding output y. It’s called Supervised Learning because the ground truth, typically provided by humans, supervises the model during training: each prediction is compared against the known answer, and the model is corrected accordingly.
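To make the mapping function f a little more concrete, here is a minimal sketch of Supervised Learning on a toy problem. It assumes each image has already been reduced to a single number (a hypothetical "fraction of red pixels") and learns the cut-off that best separates the labeled examples. The names fit_threshold and predict, and the data values, are illustrative assumptions; real models work with far richer inputs.

```python
def fit_threshold(xs, ys):
    """Pick the cut-off on one feature that makes the fewest training errors."""
    best_t, best_errors = None, len(xs) + 1
    for t in sorted(set(xs)):
        # Predict "stop sign" when the feature is at or above the cut-off,
        # then count how often that disagrees with the ground truth.
        errors = sum((x >= t) != y for x, y in zip(xs, ys))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def predict(threshold, x):
    """Apply the learned mapping to a new input x."""
    return x >= threshold


# Hypothetical training set: X is the input feature, Y is the ground truth.
X = [0.02, 0.05, 0.10, 0.30, 0.45, 0.60]
Y = [False, False, False, True, True, True]

threshold = fit_threshold(X, Y)
print(threshold, predict(threshold, 0.50), predict(threshold, 0.04))
```

The loop over candidate cut-offs plays the role of the iterative training described above: each candidate is scored against the ground truth, and the best one is kept.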
Common examples of Supervised Learning in industrial solutions include self-driving cars, fraud detection in banking services, and chatbots. These solutions require the Supervised Learning models to perform complex tasks, such as object detection, complex pattern recognition, and natural language processing. To accomplish these tasks, we need to have a clear understanding of the problem, a well-defined strategy, and a structured algorithm pipeline. Only then can we train a model that mimics or even outperforms human capabilities. The key to success for these models is representative, clean, and structured datasets.
Supervised Learning may sound intimidating, but the standard process generally follows nine steps from start to finish. To help you understand the process, consider the problem of self-driving cars stopping appropriately before stop signs. For the cars to stop in time, they first need to “see” and “recognize” the stop signs, so we need to train a Supervised Learning model that can take an image of a scene as input and determine, in near real-time, whether the scene contains stop signs. This prediction can then be used to guide the self-driving car to stop at a stop sign.
Next, we will explore each of the nine steps in the Supervised Learning process.
Step 1: Data Collection
Machine learning, in most cases, is like teaching a toddler. It requires love, patience, and LOTS OF DATA! Having large volumes of quality data is crucial for effective Supervised Learning. Data describing a scene or an event is usually collected from multiple sources and sensor streams. In our example, multiple cars with mounted cameras drive through the city for days collecting real-life traffic videos. These raw videos, or unstructured data files, then land in the hands of ML experts.
Step 2: Data Cleaning and Profiling
The deluge of unstructured data can be pure chaos. For the data to make any sense, it needs to be cleaned, structured, and brought to order. First, ML experts import the data from multiple sources into the appropriate repositories, standardize the data formats, and aggregate based on pertinent rules. Second, ML experts check for corrupted, duplicate, or missing data points, and discard unwanted data that could impact the overall quality of the dataset. For example, once multiple traffic videos are collected, ML experts will look for and remove corrupted or redundant files, if any exist. Finally, ML experts categorize videos captured during different conditions with labels, such as day, night, sunny, rainy, etc. This step provides a profile and the first insights into the datasets that will be used for training, validating, and testing the Supervised Learning model.
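As a rough illustration of the cleaning pass, the sketch below drops unusable and redundant files from a collection. It makes two simplifying assumptions: a "corrupted" file is one with an empty payload, and a "duplicate" is an exact byte-for-byte copy. The file names and the clean function are hypothetical; real pipelines would also decode the videos and look for near-duplicates.

```python
import hashlib


def clean(files):
    """Return the names of files to keep, given a {name: bytes} mapping."""
    seen_digests = set()
    kept = []
    for name in sorted(files):
        payload = files[name]
        if not payload:
            # Assumption: an empty payload stands in for a corrupted file.
            continue
        digest = hashlib.sha256(payload).hexdigest()
        if digest in seen_digests:
            # Exact copy of a file we already kept.
            continue
        seen_digests.add(digest)
        kept.append(name)
    return kept


# Hypothetical raw collection: one duplicate and one empty file.
raw = {
    "drive_01.mp4": b"frames-a",
    "drive_02.mp4": b"frames-a",
    "drive_03.mp4": b"",
    "drive_04.mp4": b"frames-b",
}
kept = clean(raw)
print(kept)
```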
Step 3: Data Annotation
Now, the cleaned and structured data needs to be annotated. Annotation is the process of assigning encoded values to raw data. Encoding values includes, but is not limited to, assigning class labels, drawing bounding boxes, and marking object boundaries. High-quality annotations are required both for teaching Supervised Learning models what the objects are and for measuring the performance of the trained models. Currently, annotating datasets takes up most of the time and resources in the ML solution design lifecycle. By most estimates, this process alone takes up to 80% of an ML expert’s time. For example, hours of video footage need to be annotated for stop sign recognition alone. Without these annotations, ML experts will not be able to teach the model what it should be looking for in the scene.
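To show what an annotation can look like in practice, here is a minimal sketch of a bounding-box record, along with intersection over union (IoU), a common way to compare a predicted box against the annotated one. The BoundingBox fields and the coordinate values are assumptions made for illustration, not a particular annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self):
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a, b):
    """Intersection over union of two boxes: 0.0 (disjoint) to 1.0 (identical)."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    return intersection / (a.area() + b.area() - intersection)


# Hypothetical human annotation and model prediction for the same frame.
annotation = BoundingBox("stop_sign", 0, 0, 2, 2)
prediction = BoundingBox("stop_sign", 1, 1, 3, 3)
print(iou(annotation, prediction))
```

The same annotation serves both purposes mentioned above: it tells the model what to look for during training, and it is the reference a prediction is scored against afterward.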
Step 4: Data Visualization
Once the laborious task of data annotation is over, ML experts design the algorithm pipeline to train a model. For efficient algorithm design and to avoid pitfalls in the process, we first try to understand the data by visualizing a representative sample, if not the entire dataset itself. Data visualization enables Exploratory Data Analysis (EDA) with graphs and summary statistics. EDA is crucial for discovering hidden patterns, identifying relevant correlations between different variables, and finding anomalies or class imbalances in the dataset. EDA can also help verify assumptions and test model hypotheses. In our example, EDA can help us understand key statistics such as how many stop signs there are in our dataset, and how many of them were collected in extreme conditions, such as heavy traffic, heavy rain, or extreme darkness. We can also start to identify any unintended biases in the datasets, like whether the stop signs in the scenes are uniformly or normally distributed in scale and location. EDA enables ML experts to design models that can appropriately handle multiple conditions and biases.
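A first EDA pass can be as simple as counting. The sketch below computes what share of the annotated stop signs was captured under each condition, which is enough to reveal a class imbalance. The record layout (condition, has_stop_sign) and the counts are made up for illustration.

```python
from collections import Counter


def condition_shares(records):
    """Share of stop-sign frames per capture condition."""
    counts = Counter(r["condition"] for r in records if r["has_stop_sign"])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {condition: n / total for condition, n in counts.items()}


# Hypothetical annotated frames.
records = (
    [{"condition": "sunny", "has_stop_sign": True}] * 6
    + [{"condition": "night", "has_stop_sign": True}] * 3
    + [{"condition": "rainy", "has_stop_sign": True}] * 1
    + [{"condition": "sunny", "has_stop_sign": False}] * 2
)

shares = condition_shares(records)
for condition, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{condition}: {share:.0%}")
```

In this toy dataset, rainy frames make up only a small slice of the stop signs, which is exactly the kind of imbalance worth knowing about before training.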
Step 5: Data Enrichment
With a sound understanding of the data distribution and its potential impact, ML scientists can then enrich datasets as necessary. Data enrichment is the process used to enhance, augment, and refine data points, making the dataset more robust and therefore more valuable. This step might include collecting more relevant data points, generating synthetic or augmented data points, or transforming existing data points. In our example, if we find the stop signs collected during rainy weather conditions represent merely a small fraction of our entire dataset, we can augment the dataset to include more such examples for the model to learn these particular conditions. Such augmentations can reduce the risk of overfitting a model to a particular condition.
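One simple form of augmentation is to transform existing images into new variants. The sketch below scales pixel brightness to create darker or brighter copies of a grayscale image, as a rough stand-in for under-represented lighting conditions. It is an illustrative assumption, not a rain simulator, and adjust_brightness is a hypothetical helper; production pipelines typically use dedicated image libraries.

```python
def adjust_brightness(image, factor):
    """Scale every pixel of a grayscale image (rows of 0-255 ints) by factor."""
    return [
        [min(255, max(0, int(round(pixel * factor)))) for pixel in row]
        for row in image
    ]


# Hypothetical 2x2 grayscale patch.
original = [[100, 200], [0, 40]]

darker = adjust_brightness(original, 0.5)    # mimic low light
brighter = adjust_brightness(original, 2.0)  # mimic glare, clipped at 255

# The augmented copies keep the same annotation as the original image.
augmented_dataset = [original, darker, brighter]
print(darker, brighter)
```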
Step 6: Feature Engineering
A perfect training dataset is not the only variable in the equation. ML experts often need to apply domain knowledge to choose algorithms or techniques to train effective predictive models. The traditional ML process requires transforming raw data into features that represent or describe the underlying problem. Expertise and domain knowledge are often required to handcraft a rich set of features and create impactful solutions. Without the right combination of features, even an otherwise adequate training dataset can lead to poor model performance. In our example, we can extract multiple relevant features that describe the stop signs, such as color and shape. This step essentially converts the image into a matrix of numbers that describes patterns in stop signs. These features are then fed into algorithms, such as random forests or support vector machines, which learn to recognize the same patterns and perform complex tasks by interpreting them. However, manual feature extraction is often a tedious task with a fair amount of trial and error, which might inject human bias. To reduce this source of error, neural networks automate the feature extraction process, generally by using a convolution operation. This ability of neural networks to extract complex non-linear relationships between features is one of the reasons for the increase in their popularity.
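The two approaches can be contrasted with a small sketch. The first function is a handcrafted color feature (the fraction of "red" pixels, with thresholds chosen arbitrarily for illustration). The second slides a small kernel over a grayscale image, which is the basic operation a convolutional layer applies. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation. In a neural network the kernel values would be learned; here they are fixed by hand, and all names and values are assumptions.

```python
def red_fraction(image):
    """Handcrafted feature: share of pixels that look red (RGB tuples)."""
    pixels = [pixel for row in image for pixel in row]
    red = sum(1 for r, g, b in pixels if r > 150 and g < 80 and b < 80)
    return red / len(pixels)


def correlate2d(image, kernel):
    """Slide kernel over a grayscale image with no padding ("valid" mode)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 2x2 RGB patch with one red pixel.
rgb_patch = [[(200, 30, 30), (90, 90, 90)], [(20, 120, 20), (10, 10, 200)]]
color_feature = red_fraction(rgb_patch)

# Hypothetical grayscale patch with a vertical edge, and a kernel that
# responds to left-to-right increases in brightness.
gray_patch = [[0, 0, 9, 9], [0, 0, 9, 9]]
edge_map = correlate2d(gray_patch, [[-1, 1]])
print(color_feature, edge_map)
```

The edge map is largest exactly where the brightness jumps, which is how stacked convolutions can pick up outlines such as the octagonal border of a stop sign.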
Step 7: Training and Validation
With the right dataset, or features, split into non-overlapping subsets for training, validation, and testing, the iterative training process for the model begins. ML experts closely monitor the training using different metrics, perform hyperparameter tuning as required, and wait…wait…wait. At the end of the iterative training process, we will have a model that detects stop signs! We did it! But for real-world solutions, we have to keep in mind the following two steps.
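The non-overlapping split mentioned above can be sketched in a few lines. The function below shuffles the dataset once with a fixed seed and slices it into three disjoint subsets. The 70/15/15 proportions and the name split_dataset are assumptions for illustration; suitable ratios depend on the dataset size and the problem.

```python
import random


def split_dataset(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then slice into disjoint train, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical dataset of 100 frame IDs.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The validation subset guides hyperparameter tuning during training, while the test subset is held back until the end for a final check on data the model has never seen.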
Step 8: Deployment
Industries and organizations have different thresholds for ideal performance. Once the model’s performance passes that threshold, organizations can start deploying their solutions to solve problems faster and better in the real world. In our example, once the algorithm passes the performance threshold, it would be used on actual self-driving cars to recognize stop signs.
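A deployment gate of this kind can be expressed as a simple check against held-out test results. The sketch below computes precision and recall for the stop sign class and compares both to minimum values. The thresholds, the data, and the function names are illustrative assumptions; each organization decides which metrics matter and how high the bar sits.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1 = stop sign present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def ready_to_deploy(y_true, y_pred, min_precision, min_recall):
    """True only if both metrics meet the organization's thresholds."""
    precision, recall = precision_recall(y_true, y_pred)
    return precision >= min_precision and recall >= min_recall


# Hypothetical test-set labels and model predictions.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

print(precision_recall(y_true, y_pred))
print(ready_to_deploy(y_true, y_pred, min_precision=0.7, min_recall=0.9))
```

For a safety-critical task like stopping at stop signs, a missed sign is costly, so a team might reasonably set the recall bar higher than the precision bar.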
Step 9: Improvement
It is in the ML experts’ best interests to continue refining their models and adjusting them for new business needs. Therefore, when such opportunities arise, ML experts will start from the beginning of the machine learning process to improve model performance.
Machine Learning is an exciting emerging technology that is rapidly changing how we see and solve complex problems in the field of AI. In fact, we are seeing a watershed time in human history, and a future full of potential. With the research community and industries applying a significant amount of resources toward developing practical AI solutions, the field of machine learning is making amazing advancements every day. Although it may seem complex and intimidating to people new to the field, the basic process used for developing efficient ML solutions is fairly straightforward, and it requires a lot of high-quality data. Data remains the most important element of Machine Learning: as they say, if you ask the data nicely, it will confess. In an era of ‘big data’ overflowing with sensors, data feeds, and smart devices, it is not merely the quantity of data that differentiates ML solutions; it is the quality of that data, and how well those seeking to use it understand it.
We are a group of scientists, engineers, and entrepreneurs with a vision for better AI. With backgrounds primarily in Machine Learning and Computer Vision, the Innotescus team understands the importance of having full control over and insight into data used to train Machine Learning models.