It’s no secret that a comprehensive, well-labeled dataset goes a long way towards an effective machine learning solution. Data collection is a large part of that, but there are several ways to build a dataset that do not rely solely on passively recording data. Data augmentation is a popular and relatively simple method of stretching your dataset, while synthetic data creation is a slightly more sophisticated way to round out a robust training set.

Because creating realistic synthetic data requires technical sophistication, it is best reserved for scenarios in which typical data collection proves too expensive, too slow, or otherwise ineffective. This can mean edge cases that are exceedingly rare or difficult to capture, data whose collection might violate privacy regulations, or data that is prohibitively expensive to gather. In all of these cases and more, synthetic data can be an effective way to work around the limitations imposed by standard data collection techniques.

Methods of Producing Synthetic Data

Variational Autoencoders

Variational Autoencoders (VAEs) are networks that encode and then decode their input data. The encoder compresses each input into a latent space representation that relies on fewer, more meaningful dimensions. In a VAE, this representation is a probability distribution (typically a mean and a variance per latent dimension) rather than a single point. The network is trained to minimize the difference between the input data and its corresponding output, together with a regularization term that keeps the latent distributions close to a simple prior such as a standard normal. This trains the encoder on the most relevant features and the decoder on the reconstruction of the data from these features. The second part of this process, the reconstruction of the encoded data, can be harnessed on its own: sampling new points from the latent space and decoding them creates altogether new data that still contains the statistically relevant features learned from the rest of the dataset.
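The pieces that distinguish a VAE from a plain autoencoder can be sketched in a few lines. This is a minimal illustration, not a trainable model: it assumes the common setup in which the encoder outputs a mean and a log-variance per latent dimension, and the reconstruction error is a squared difference. The function names and `toy_decoder` (a fixed linear map standing in for a trained decoder network) are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def vae_loss(x, x_hat, mu, log_var):
    """Squared reconstruction error plus the KL regularization term."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mu, log_var)


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: 2 latent dims -> 3 features."""
    weights = [[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def generate(decoder, latent_dim, n, seed=0):
    """Create new samples by decoding random draws from the prior N(0, I)."""
    rng = random.Random(seed)
    return [decoder([rng.gauss(0.0, 1.0) for _ in range(latent_dim)])
            for _ in range(n)]


if __name__ == "__main__":
    for sample in generate(toy_decoder, latent_dim=2, n=3):
        print(sample)
```

In a real VAE the encoder and decoder would be neural networks trained with an automatic-differentiation framework. The point of the sketch is that, once training is done, generation only needs the decoder and random draws from the prior.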

Synthetic Minority Oversampling Technique

Synthetic data can be particularly useful in imbalanced classification problems, where there are too few examples of the minority class for a model to effectively learn the decision boundary. One way to solve this problem is to oversample the minority class by simply duplicating its examples in the training dataset. While this balances the class distribution, it provides no new information to the model. Rather than duplicating existing information, data scientists can synthesize new examples around the minority class using the Synthetic Minority Oversampling Technique (SMOTE).


SMOTE is a method of synthesizing data to bolster datasets that include rare events or scenarios whose detection is crucial, such as cancer detection. In the basic process, the data scientist samples two data points: one from the minority class, and one of its nearest neighbors in feature space that also belongs to the minority class. The data scientist then creates a synthetic data point at a random position along the line segment between the two samples in feature space. Using SMOTE to create more samples in and around the minority class allows the model to better define the minority class and the crucial boundaries around it.
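The interpolation step can be sketched with the standard library alone. This is a minimal illustration assuming the standard SMOTE formulation, in which the second point is drawn from the base point's k nearest neighbors within the minority class. The function name `smote` and its parameters are hypothetical, and numeric features with Euclidean distance are assumed.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest minority-class neighbors.
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position on the connecting segment.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


if __name__ == "__main__":
    minority_class = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
    for point in smote(minority_class, n_new=3, k=2):
        print(point)
```

Because every synthetic point lies between two existing minority samples, the new data stays inside the region the minority class already occupies, which is the intended behavior but also means SMOTE cannot invent examples outside that region.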

Generative Adversarial Networks

Generative Adversarial Networks (GANs) generate synthetic data by training two networks against each other: a generative model, which creates synthetic data, and a discriminator model, which is trained to classify data as real or fake. The two are trained in alternation, and training ideally continues until the discriminator can no longer distinguish the generator’s synthetic data from real data, meaning it classifies samples correctly only about 50% of the time, which is no better than chance. Once trained, the generative model can be used to create synthetic data for the intended application.
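The alternating training loop can be sketched on a one-dimensional toy problem. This is an illustration under strong simplifying assumptions, not a practical GAN: the "real" data is drawn from a normal distribution with mean 4, the generator is a linear function of noise, the discriminator is a logistic regression, and the gradients are derived by hand for this tiny model. All names and hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, batch=32, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: G(z) = a * z + b
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            grad_w += -(1.0 - d) * x
            grad_c += -(1.0 - d)
        for x in fake:
            d = sigmoid(w * x + c)
            grad_w += d * x
            grad_c += d
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: minimize -log D(G(z)), the non-saturating loss.
        grad_a = grad_b = 0.0
        for z in noise:
            d = sigmoid(w * (a * z + b) + c)
            grad_a += -(1.0 - d) * w * z
            grad_b += -(1.0 - d) * w
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch
    return (a, b), (w, c)


def generate(a, b, n, seed=1):
    """Draw n synthetic samples from the generator."""
    rng = random.Random(seed)
    return [a * rng.gauss(0.0, 1.0) + b for _ in range(n)]


if __name__ == "__main__":
    (a, b), (w, c) = train_toy_gan()
    print("generator parameters:", a, b)
    print("samples:", generate(a, b, 5))
```

Because this toy discriminator is linear in its input, it can only judge where samples sit, not how spread out they are, so the generator should not be expected to reproduce the variance of the real data. Practical GANs use neural networks for both models and an automatic-differentiation framework for the updates, and their training is known to be sensitive to hyperparameters.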


Though the techniques outlined above leverage real data, they go a step further than augmentation in creating new data rather than simply altering existing data. Synthetic data can be a powerful way to address the shortcomings of data collection, which can become too slow or costly for certain types of data. Though technically more challenging than data augmentation, synthesizing new data can help train networks to address a greater variety of scenarios that, although less common or harder to record, are crucial to the successful deployment of a machine learning model.

