# The Final Stages: Choosing, Training, and Deploying a Machine Learning Model

We’re nearing the finish line! **We’ve collected data**, **meticulously and thoughtfully prepared it**, gone through **the process of feature engineering as needed**, and are ready to choose, train, and deploy a model. When designing a Machine Learning solution for a real-world problem, it’s important to remember that the goal is not just to train a model to make accurate predictions on a representative dataset, but to train a model to make accurate predictions on data points seen in the field. A good dataset and an impressive Machine Learning model by no means guarantee good performance in production. In this post, we will outline the major steps in selecting, training, and deploying a robust machine learning solution.

**Classical Machine Learning vs Deep Learning**

Before jumping into model selection, training, and deployment, it’s important to distinguish between two often conflated terms. In the traditional sense, Machine Learning is a set of algorithms that ingest data, parse it, learn from it, and apply what they have learned to make data-driven decisions. Because of the popularity of Deep Learning, the two terms are often used interchangeably, but they are not synonyms: Deep Learning is a subset of Machine Learning. Deep Learning algorithms are extremely data hungry and generally require high-end GPUs. Classical ML algorithms like Regression, **Support Vector Machines** (SVMs) and **Random Forests**, by contrast, are still very powerful techniques that can be used for solving complex real-world problems. When determining which route to take, we have to consider three major factors:

**Amount of training data:** Deep Learning algorithms tend to outperform classical ML algorithms when the amount of training data is large enough. With smaller available datasets, classical ML algorithms are usually preferred.

**Distinctive features:** If the problem being solved is relatively simple, and if ML scientists can extract highly separable features from the training dataset, classical ML algorithms are often chosen over deep learning techniques due to their simplicity. On the other hand, for complex problems like image classification and NLP, distinctive features are not readily accessible. For such problems, Deep Learning is the way to go.

**Infrastructure:** Deep learning techniques, as mentioned earlier, require immense amounts of data and compute resources to train models in a reasonable amount of time. When such infrastructure is unavailable, classical ML algorithms can be used to obtain reasonable performance.

**Training, Validation and Test Datasets**

Once you have carefully prepared your dataset for training an ML model, the first step is typically to divide your dataset into 3 subsets: training, validation, and testing.

**Training dataset:** This is the subset of your entire dataset that is used to fit the model. In other words, the ML model sees this part of the dataset and learns from it.

**Validation dataset:** This is the subset of the dataset that is used to evaluate the model during training. The ML model occasionally sees this data but does not directly learn from it. While the hyperparameters are still being tuned, the validation dataset is used to frequently evaluate the progress of training and to detect whether the model is overfitting or underfitting the training dataset.

**Testing dataset:** This is the subset of the dataset that is used to perform unbiased evaluation of the trained model. The ML model sees this data just once during the final performance evaluation of the completely trained model.

Now that we know what each of these data subsets is and what it is used for, how do we split the dataset? This mostly depends on two things: the number of data points in your dataset and the complexity of the model being trained.

In our **previous posts**, we discuss some of the rules of thumb used to figure out the amount of training data required. Some models don’t need substantial amounts of data for training; for models with few tunable hyperparameters, we could probably get away with a 60-20-20 split for training, validation, and testing respectively. On the other hand, for models with hundreds of millions of trainable parameters, it might be better to use an 80-20 split between training and testing, and employ strategies like **cross-validation**, which reuses the training portion for validation. While there is currently very little empirical evidence to form a golden rule for finding an optimal split ratio, there have been **impressive advances** made in addressing this issue.
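To make the split concrete, here is a minimal sketch in plain Python. The function names `split_dataset` and `k_fold_indices` are hypothetical helpers written for this post, and the 60-20-20 fractions and fold count are just the example values from the paragraph above, not a recommendation.

```python
import random

def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and cut them into train/validation/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left is the test set
    return train, val, test

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs so each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

data = list(range(100))
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # 60 20 20

for train_idx, val_idx in k_fold_indices(10, k=3):
    print(len(train_idx), len(val_idx))
```

With cross-validation, the model is trained once per fold and the validation scores are averaged, which is why it is attractive when there is too little data to set aside a fixed validation set.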

**Model Selection**

Arguably the biggest challenge in applied Machine Learning is choosing the optimal **ML model** from a wide range of model architectures with varying complexities and intended uses. There are three main criteria to remember when choosing a model: performance, training time, and deployment cost. Model selection involves not just choosing the right ML algorithm pipeline (Regression, Support Vector Machine, Neural Networks, etc.), but also choosing the optimal configuration within the chosen algorithm pipeline (tunable parameters, learning function, loss function, etc.).

The simplest way to perform model selection would be to fit a handful of candidate models to “sufficient” data and compare them. This, however, is impractical, mainly because it is nearly impossible to define what “sufficient” data is. Instead, we can use measures like the **Akaike information criterion** (AIC), the **Bayesian information criterion** (BIC), and **structural risk minimization** (SRM), which penalize higher numbers of model parameters and reward goodness of fit on the training set, thus favoring low-complexity models that still achieve good training performance. However, because they penalize model complexity heavily, these criteria sometimes end up sacrificing some performance for model simplicity. Alternatively, we can use learning curves and cross-validation to estimate the prediction error of each candidate model, and compare those estimates to make our selection.
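As a rough illustration of how these criteria trade fit against complexity, the sketch below computes AIC and BIC for least-squares fits. It assumes Gaussian errors, in which case AIC reduces to `n * ln(RSS / n) + 2k` and BIC to `n * ln(RSS / n) + k * ln(n)` up to a constant shared by all models. The candidate models and their residual sums of squares are made-up numbers for illustration only.

```python
import math

def aic_bic(rss, n_samples, n_params):
    """AIC and BIC for a least-squares fit, assuming Gaussian errors.

    Constants shared by all candidate models are dropped, so only the
    differences between models are meaningful.
    """
    fit_term = n_samples * math.log(rss / n_samples)      # rewards goodness of fit
    aic = fit_term + 2 * n_params                         # fixed penalty per parameter
    bic = fit_term + n_params * math.log(n_samples)       # penalty grows with dataset size
    return aic, bic

# Hypothetical candidates: (name, residual sum of squares on training data, parameter count)
candidates = [("linear", 120.0, 2), ("cubic", 106.0, 4), ("degree-9", 100.0, 10)]
n = 50  # number of training samples (assumed)

scores = {name: aic_bic(rss, n, k) for name, rss, k in candidates}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (aic, bic) in scores.items():
    print(f"{name:9s} AIC={aic:7.2f} BIC={bic:7.2f}")
print("AIC picks:", best_by_aic)
print("BIC picks:", best_by_bic)
```

In this toy setup the two criteria disagree: BIC's larger per-parameter penalty leads it to a simpler model than AIC, which is the kind of performance-for-simplicity trade mentioned above.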

**Training**

Having selected a strong ML model to solve your problem, it’s time to start training! There are a few key components involved in training a robust model.

**Objective function:** The objective function, often referred to as the loss function, is one of the most fundamental components of ML model training. An ML problem can be formulated in a variety of ways, but is most commonly represented as an optimization problem. This is done via an objective function, which represents the goal during training. The ML model learns by traversing the surface defined by the objective function to either minimize the losses or maximize the rewards. Some widely used objective functions include **mean square error** (MSE), **cross-entropy error** (also known as **log loss**), and **Huber loss**. Choosing the right objective function is critical for training a highly robust ML model.
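For reference, here is a minimal plain-Python sketch of three of these losses. The function names, the `delta` threshold for Huber loss, and the `eps` clipping constant are choices made for this example; in practice an ML framework would supply its own implementations.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences; large errors dominate the total."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for large ones, so outliers count less."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Log loss for binary labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.5, 3.0, 10.0]  # the last prediction is a deliberate outlier
print(mean_squared_error(y_true, y_pred))  # 9.0625
print(huber_loss(y_true, y_pred))          # 1.40625
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
```

The single outlier inflates MSE far more than Huber loss, which is one reason the choice of objective function affects how robust the trained model is.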

**Performance Metrics:** While the objective function measures the performance of the model, it is generally not intuitive and is only used during training. To evaluate the ML model more thoroughly, we need to define more intuitive performance metrics. The metrics defined for evaluation depend on the application and its scope. Some common metrics include accuracy, the confusion matrix, the area under the ROC curve (AUC), and the F1 score. Defining the metrics and setting a baseline for acceptable performance determines the trajectory of training.
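As a small worked example, the sketch below derives accuracy, precision, recall, and the F1 score from a binary confusion matrix. It assumes labels are 0 or 1 with 1 as the positive class; the helper names and the sample labels are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # guard against no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0     # guard against no positive labels
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(confusion_counts(y_true, y_pred))        # (3, 1, 1, 3)
print(classification_metrics(y_true, y_pred))  # every metric is 0.75 here
```

Accuracy alone can be misleading on imbalanced datasets, which is why precision, recall, and F1 are usually reported alongside it.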

**Hyperparameters:** Hyperparameters are adjustable settings of the ML model that are not learned from the data and must be tuned in order to obtain a model with optimal performance. The number and types of tunable hyperparameters directly depend on the model being trained. For example, if we are using Fully Convolutional Networks (FCNs), we have hyperparameters like batch size, optimizer (**AdaGrad**, **Adam**, SGD, etc.), learning rate, and number of epochs. On the other hand, simple classical ML algorithms like SVMs require tuning kernel types, penalty parameters, and number of iterations.
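One common way to tune hyperparameters is a grid search scored on the validation set. The sketch below is a toy version under stated assumptions: a one-variable linear model trained by gradient descent stands in for a real model, and the data, grid values, and function names are all invented for this example.

```python
import itertools

def train_linear(xs, ys, learning_rate, epochs):
    """Fit y = w*x + b with full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mse(xs, ys, w, b):
    """Mean squared error of the fitted line on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data generated from y = 2x + 1 (an assumption made for this example).
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
val_x, val_y = [0.5, 1.5, 2.5], [2.0, 4.0, 6.0]

# Hypothetical search grid over two hyperparameters.
grid = {"learning_rate": [0.001, 0.01, 0.1], "epochs": [10, 100, 500]}

results = []
for lr, epochs in itertools.product(grid["learning_rate"], grid["epochs"]):
    w, b = train_linear(train_x, train_y, lr, epochs)
    results.append((mse(val_x, val_y, w, b), lr, epochs))  # score on validation data only

best_loss, best_lr, best_epochs = min(results)
print(f"best: learning_rate={best_lr}, epochs={best_epochs}, val_mse={best_loss:.2e}")
```

The test set is deliberately left out of this loop: it is reserved for the single final evaluation once the hyperparameters have been chosen.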

**Repurposing Pre-Trained Networks via Transfer Learning**

As discussed earlier, Deep Learning algorithms are notoriously data hungry and require a lot of compute power to train in a reasonable amount of time. One effective way to overcome this hurdle is to use pre-trained models. The ML research community is constantly building and training effective deep learning models using millions of data points, so instead of reinventing the wheel, we can make use of these models and model architectures.

Deep Learning models are difficult to train because they empirically learn the model parameters (weights and biases), which are randomly initialized and take a lot of time and data to learn. In transfer learning, however, instead of starting the learning process with random initialization, we start with the parameters learned by a model trained to solve a similar problem and fine-tune them, saving a lot of training time. For example, the parameters learned by a model that is trained to recognize cars can be used to initialize a model that will learn to recognize other types of vehicles like trucks or airplanes.
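The effect of starting from pre-trained parameters can be shown with a deliberately tiny analogy. The sketch below is not a deep network: it assumes a one-variable linear model, a made-up “source” task (y = 2x + 1) whose learned parameters are reused, and a related “target” task (y = 2.2x + 0.8). It simply counts how many gradient descent steps are needed from each starting point.

```python
def steps_to_converge(xs, ys, w, b, learning_rate=0.1, tol=1e-4, max_steps=10_000):
    """Run gradient descent on MSE from (w, b); return steps until loss < tol."""
    n = len(xs)
    for step in range(max_steps):
        loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
        if loss < tol:
            return step
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return max_steps

# Target task data (hypothetical): y = 2.2x + 0.8
xs = [-1.5, -0.5, 0.5, 1.5]
ys = [2.2 * x + 0.8 for x in xs]

# "From scratch": start at zero. A real network would use random initialization.
cold_steps = steps_to_converge(xs, ys, w=0.0, b=0.0)

# "Transfer": start from parameters learned on the similar source task y = 2x + 1.
warm_steps = steps_to_converge(xs, ys, w=2.0, b=1.0)

print("steps from scratch:     ", cold_steps)
print("steps from pre-trained: ", warm_steps)
```

In real deep learning frameworks the same idea usually means loading published weights, optionally freezing the early layers, and fine-tuning the rest on the new dataset.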

**Deploying Your Trained Model**

Thoughtfully deploying a trained model is just as complex as creating it. The solution’s performance requirements and available resources constrain the deployment options, but there are typically many different ways to arrive at a sufficient answer. Some of the major considerations for deploying a Machine Learning solution include:

**Latency:** Deploying a solution on the cloud may allow ML scientists to take advantage of virtually limitless memory and processing resources, but network latency alone may render this solution unworkable.

**Precision:** More precise sensor inputs and network weights require more memory and processing power. Methods like quantizing inputs and weights, and using lighter-weight runtimes like TensorFlow Lite rather than full TensorFlow, allow solutions to fit more easily within the deployment constraints.
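To give a feel for what quantization does, here is a minimal sketch of symmetric 8-bit quantization in plain Python. The scheme, the function names, and the sample weights are assumptions made for illustration; deployment toolchains implement their own, more sophisticated variants.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.003, 0.5, -0.41]  # hypothetical network weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)          # [82, -127, 0, 50, -41]
print(max_error)  # rounding error, at most half of one quantization step
```

Storing each weight as an 8-bit integer instead of a 32-bit float cuts weight memory by a factor of four, at the cost of the small rounding error shown here.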

**Processing Power:** As performance requirements grow more demanding, the necessary hardware typically becomes more specialized, from CPUs and GPUs to FPGAs and even highly-specialized accelerators. With new specialized hardware comes a need for different skill sets to program and debug each one, and new toolchains to support the development process, adding to deployment costs and engineering complexity.

These are only a few of the considerations to take into account when deploying a trained network. The process of selecting the right hardware and integrating software containing a trained network into that hardware can be a sizable undertaking, but is ultimately what enables a trained model to interact with the ‘real’ world.

Once the training and deployment pipeline is set, ML scientists can go back to focusing on the data by continuing to monitor the performance of the model in the field. Though training and deploying properly can have a significant impact on performance, the performance of a model ultimately reflects the data used to train it. Now, ML scientists can get a clear sense of how the model performs in the field, and continue to improve it by retraining and redeploying using those new insights.