The Brave 1st Step of Machine Learning: Dealing with Data
Data is Unavoidable
Since you opened this blog post, approximately 822,260 Google searches were performed, 30,000 minutes of YouTube videos were uploaded, 60,000 tweets were tweeted, and 486,610 photos or videos received likes on Instagram. The sixth edition of DOMO’s report stated, “By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth”. According to Statista’s information, the world will create over 50 zettabytes in 2020, and the number will increase exponentially to 175 zettabytes in 2025. The prefix zetta denotes 10^21.
There are 7.8 billion people on earth, of which 5.2 billion are unique mobile phone users, 4.4 billion are internet users, and 3.5 billion are active social media users. Quartz reported that approximately 80% of web traffic happens on mobile phones. Spencer, a Carnegie Mellon University student, had an average screen time of nine hours per day last week, made up mostly of social networking apps, which generate and collect reams of personalized data. You and what you spend time on are being monitored and tracked by organizations – as they say, the internet never forgets.
The Data you Generate is Put to Work
Companies don’t just collect this information for fun; all this user-generated data is collected by mobile apps to analyze user preferences, create targeted marketing campaigns, and make more money. For example, the posts you like on Facebook or Instagram will help their algorithms determine which ads to show next, as well as in what order. However, social media companies aren’t the only ones who have leveraged machine learning to take advantage of the large amounts of data they’ve collected.
Whether you realize it or not, companies across many industries have been collecting data to train Machine Learning models for years. reCAPTCHA, a tool used to block bots and spam from websites, is perhaps one of the most recognizable data collection methods out there. Every time you (yes, you) prove to a website that you’re not a robot, reCAPTCHA uses your input not just to grant you access; it also stores that input to train ML models. reCAPTCHA began by asking users to type distorted text into a text box, then evolved to asking users to identify traffic signs and crosswalks in images. These tasks map directly to problems that ML can help solve: deciphering and digitizing words in historical texts and identifying important objects for autonomous vehicle operation.
Of course, data used to train ML models isn’t just stored in images and text, even for a single, albeit complex, application. For autonomous driving alone, vehicles are equipped with an array of sensors which collect 3D point clouds using LIDAR, video files from cameras, time series data from radars, and more – all to train ML models to replicate every task involved in driving a car.
Data exists in countless formats that vary broadly by industry; data can be held in a simple database of financial transactions to train a fraud detection network, or millions of nucleotide sequences stored in FASTA files to train a model that predicts a patient’s risk of developing an illness. Regardless of the ML application, the first step is always to identify and procure the appropriate type of data needed to properly train a network, and a lot of it.
How Much Data is Enough Data?
While there is clearly no shortage of data in today’s world, finding data that’s well-suited to train meaningful Machine Learning solutions isn’t as easy as it may sound. Collecting large datasets can be very expensive in terms of man-hours, equipment, licensing and storage. Thus, a crucial problem in developing highly effective machine learning solutions is to determine how much training data is required. This is a tricky question; the amount of training data required depends on a variety of factors. Here, we will address two of the most significant factors – the complexity of the problem and the complexity of the algorithm.
The Complexity of the Problem
The first factor in determining the training dataset size is the difficulty of the problem being solved. Supervised Learning algorithms learn the relationship between inputs and desired outputs by approximating the underlying mapping function. If the variation of the underlying unknown function is limited or controlled, we can expect the training data to show clear and repeating patterns.
Let’s consider a real world problem. In many industrial applications, Machine Learning can be used to segregate items based on simple similarity measures like color or size. Our imaginary task is to segregate red and green boxes in a controlled factory setting. Here, the amount of variation in the inputs is very limited, and we can achieve satisfactory performance with a few thousand data points. On the other hand, for applications like autonomous driving, the input conditions can change drastically (varying backgrounds, weather, illumination conditions, etc.), complicating the underlying mapping function. Such an application will require millions of data points to train an effective model.
Similarly, the variation in output, like the number of possible classes, and the associated margin of error allowed in that output, also factors heavily into the complexity of a problem. The number of classes is fairly straightforward – the more variation in desired output, the harder it is for the model to learn to generalize the underlying mapping function. Furthermore, if the room for error in an application’s output is low, like in spine MRI segmentation and analysis as shown below, the amount of data required for training a robust and accurate model will be higher to ensure that the training data adequately represents most, if not all, of the possible variations in conditions and outcomes.
The Complexity of the Algorithm
Effective machine learning techniques like deep learning are generally nonlinear algorithms, which, as the name suggests, try to learn complex nonlinear relationships between inputs and their corresponding outputs. These algorithms are more robust and flexible, but can have anywhere from thousands to millions of parameters to learn. Consider the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition in which software programs try to correctly detect and/or classify a variety of objects in the ImageNet dataset. The ImageNet dataset currently has over 14 million hand-annotated images, with over 20,000 categories. In 2012, we saw one of the earliest uses of a deep learning algorithm, AlexNet, to solve this challenge. AlexNet reduced the error rate by roughly 10 percentage points compared to the previous year’s winner. This was a huge breakthrough in visual object recognition and computer vision as a whole, but it didn’t come without a cost: AlexNet has about 61 million parameters to learn and optimize. Generally, the more parameters a machine learning model has to learn, the more training data it requires to avoid overfitting, where an under-constrained model memorizes its training examples instead of learning patterns that generalize.
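To see where a figure like 61 million comes from, here is a rough sketch that tallies weights and biases layer by layer. The layer shapes are not given in this post; they are assumed from the commonly cited AlexNet architecture (five convolutional layers, three of which split their input channels across two GPUs, followed by three fully connected layers). The helper names are ours.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel for a square convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return n_in * n_out + n_out


# Assumed layer shapes (commonly cited for AlexNet, not taken from this post).
# conv2, conv4 and conv5 see only half of the previous layer's channels
# because the original network was split across two GPUs.
alexnet_layers = {
    "conv1": conv_params(11, 3, 96),
    "conv2": conv_params(5, 48, 256),
    "conv3": conv_params(3, 256, 384),
    "conv4": conv_params(3, 192, 384),
    "conv5": conv_params(3, 192, 256),
    "fc6": dense_params(6 * 6 * 256, 4096),
    "fc7": dense_params(4096, 4096),
    "fc8": dense_params(4096, 1000),
}

total_params = sum(alexnet_layers.values())

for name, count in alexnet_layers.items():
    print(f"{name:>5}: {count:>12,}")
print(f"total: {total_params:>12,}")  # total: 60,965,224
```

Under these assumptions, the three fully connected layers account for the vast majority of the parameters, which is one reason later architectures tried to shrink or remove them.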
A Couple Tricks of the Trade
So, how can we figure out the right amount of data required to train a machine learning model? As frustrating as it is, there is no one correct answer to this question. However, years of experimentation have led data scientists to formulate a few practical ways to deal with this issue. For linear algorithms, like regression, the ‘10x’ rule seems to do the trick. This rule says that you need at least 10 times as many training examples as there are parameters in your model. For nonlinear algorithms, one of the simplest methods to estimate the required data size is the learning curve graph, a plot of model performance against training set size. Here, data scientists train their models on several different data sizes and evaluate the performance of each. The resulting learning curve can then be extrapolated to estimate how much data is needed to reach the desired performance goal. Though these methods work for many, they are still relatively messy and can only provide an initial estimate.
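Both tricks can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data is synthetic, the "true" line (intercept 1.0, slope 2.0) and the noise level are made up, and a one-variable least-squares fit stands in for whatever model you are actually training so that the sketch needs nothing beyond the standard library. The procedure (train at several sizes, measure the error, look at the trend) is the same for nonlinear models.

```python
import random


def rule_of_10(n_params):
    """Rule-of-thumb minimum number of examples for a linear model."""
    return 10 * n_params


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def learning_curve(sizes, trials=30, noise=0.5, seed=0):
    """Average error against the true line for each training set size."""
    rng = random.Random(seed)
    true_a, true_b = 1.0, 2.0  # hypothetical ground truth
    grid = [i / 100 for i in range(101)]  # evaluation points in [0, 1]
    curve = []
    for n in sizes:
        total = 0.0
        for _ in range(trials):
            xs = [rng.random() for _ in range(n)]
            ys = [true_a + true_b * x + rng.gauss(0, noise) for x in xs]
            a, b = fit_line(xs, ys)
            # Mean squared gap between the fitted line and the true line.
            total += sum(
                (a + b * x - (true_a + true_b * x)) ** 2 for x in grid
            ) / len(grid)
        curve.append((n, total / trials))
    return curve


print("10x rule for a 2-parameter line:", rule_of_10(2), "examples")
curve = learning_curve([5, 20, 100, 1000])
for n, err in curve:
    print(f"n={n:5d}  mean squared error vs. true line: {err:.4f}")
```

Plotting the printed pairs gives the learning curve. Where it flattens out, adding more data buys little, and where it is still falling steeply, more data is likely to help.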
How to Stretch your Data
If you are unsure if you have sufficient data, or just don’t have enough, does that mean that you cannot use powerful machine learning algorithms for your application? Here are some ways to deal with lack of data:
- Public datasets
The popularity and effectiveness of machine learning algorithms has led to multiple researchers, industries and enthusiasts collecting data to solve a variety of complex challenges. A quick Google search can provide us multiple sources of data trying to solve similar problems.
- Data augmentation
Data augmentation involves increasing the size of your dataset by transforming the existing data with a variety of simple methods like applying eigenvectors, adding noise, translating the image, or varying the scale. Data augmentation can drastically increase the size of your dataset at a very low cost, but since we are essentially transforming our existing data, data augmentation doesn’t introduce a lot of variability, and we run a slight chance of overfitting to the training data.
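As a minimal sketch of the idea, the snippet below treats a grayscale image as a plain list of pixel rows and produces a flipped copy, a shifted copy, and a noisy copy. The function names, the tiny 2x3 "image", and the noise level are all made up for illustration; in practice you would likely reach for an image library's built-in transforms.

```python
import random


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def translate_right(img, shift, fill=0):
    """Shift pixels `shift` columns to the right (shift >= 0), padding with `fill`."""
    width = len(img[0])
    shift = min(shift, width)
    return [[fill] * shift + row[: width - shift] for row in img]


def add_noise(img, sigma, rng):
    """Add Gaussian noise to every pixel and clamp the result to 0..255."""
    return [
        [min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
        for row in img
    ]


def augment(img, seed=0):
    """Return the original image plus three transformed copies."""
    rng = random.Random(seed)
    return [
        img,
        flip_horizontal(img),
        translate_right(img, 1),
        add_noise(img, 10.0, rng),
    ]


image = [
    [0, 50, 100],
    [150, 200, 250],
]
augmented = augment(image)
print(f"1 original image -> {len(augmented)} training images")
```

Every copy keeps the label of the original image, which is exactly why augmentation is cheap and also why it adds less variety than genuinely new data.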
- Synthetic data
Synthetic data is programmatically generated “fake” data that has the same basic schema and underlying statistical properties as real data. With advances in algorithms like Generative Adversarial Networks (GANs) and Modified Synthetic Minority Over-Sampling Techniques (Modified-SMOTE), it can be very difficult to differentiate between real and fake data. Using methods like these is a little computationally expensive and requires some amount of representative data to begin with, but can significantly improve model performance.
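To give a feel for how oversampling-style synthesis works, here is a rough sketch of the basic SMOTE idea: pick a real minority-class point, pick one of its nearest neighbours, and create a new point somewhere on the line between them. This is plain SMOTE in its simplest form, not the modified variant mentioned above, and the function name, the four sample points, and the parameter defaults are assumptions made for illustration.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between neighbours.

    Assumes `minority` holds at least two points of equal dimension.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other points, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority_points, n_new=4)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Because every synthetic point lies between two real ones, the new data stays inside the region the minority class already occupies. That keeps it plausible, but it also means the method cannot invent variation that the real samples do not hint at.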
- Ensemble methods
Ensembling techniques aggregate predictions from multiple weak learners/models through a weighted average or voting. Such aggregation results in significantly lower variance, more accurate prediction, and improved overall generalizability, particularly when dealing with lack of data.
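Voting is the easiest of these aggregation schemes to sketch. In the example below, the three models and their predictions are invented for illustration, with errors deliberately placed on mostly different samples. Real ensembles only help to the extent that their members' mistakes are at least partly independent.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class labels from several models, one sample at a time."""
    combined = []
    for sample_preds in zip(*predictions_per_model):
        label, _ = Counter(sample_preds).most_common(1)[0]
        combined.append(label)
    return combined


def accuracy(predicted, truth):
    """Fraction of samples where the prediction matches the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical labels and predictions from three weak binary classifiers.
truth = [1, 0, 1, 1, 0, 1]
model_a = [1, 0, 0, 1, 0, 0]
model_b = [1, 1, 1, 1, 0, 0]
model_c = [0, 0, 1, 1, 1, 1]

ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("model_a", model_a), ("model_b", model_b),
                    ("model_c", model_c), ("ensemble", ensemble)]:
    print(f"{name:>8}: accuracy {accuracy(preds, truth):.2f}")
```

Each individual model gets four of the six samples right, while the vote gets five. Using an odd number of models avoids ties in the binary case.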
- Transfer learning
As mentioned earlier, nonlinear techniques like deep learning algorithms have to learn anywhere from thousands to millions of parameters. Learning these parameters from scratch requires a lot of data. However, most deep learning models exhibit a curious property that can be exploited – the first few layers of a model tend to learn low-level, general patterns (such as edges and textures in images) that can be applied to many other datasets and tasks. This method of transferring knowledge learned on one task to another is aptly called Transfer Learning. Hence, we can use a pre-trained model or train a base model using available large datasets first, then fine-tune the model for our specific application using a smaller dataset. For example, if you’re working on an image classification problem, you can use a model pre-trained on ImageNet, and then fine-tune it for your specific problem using a more targeted dataset.
The first step of the machine learning process is always identifying and gathering the appropriate data to train your model. Understanding and acquiring the right amount and type of data is no small task, regardless of how plentiful data may seem. To get started, take a look at these publicly available datasets that have been contributed by ML researchers, practitioners, and enthusiasts just like you!
- Our favorite source for free datasets, collaboration and competition is Kaggle.
- UC Irvine Machine Learning Repository offers hundreds of datasets classified by ML problem type.
- Amazon Dataset repository is an ever-growing source where you can find datasets ranging from web pages to satellite imagery. What’s more, they come with usage examples.
- Waymo is another great source of data for a wide range of tasks in autonomous driving.
- VisualData is a repository of about 450 datasets helping you address computer vision tasks from segmentation to image captioning. https://www.visualdata.io/
- Since 2018, Microsoft Research Open Data has been collaborating across the research community to collect datasets for a variety of categories. https://msropendata.com/