Exploratory Data Analysis for your Dataset – Explained
There are 3 major ways to improve a Machine Learning model’s performance: using a higher quality training dataset, more computing power, or a better ML algorithm. Some ML scientists prefer to focus on fine-tuning their algorithms and generating higher AWS bills, instead of improving the quality of their training dataset, since, as we’ve discussed in previous posts, producing quality data annotations is time-consuming work. However, there is more to
What is EDA
Exploratory Data Analysis (EDA) is a crucial step in building any Machine Learning solution. One of the simplest definitions of EDA can be found on Wikipedia: “exploratory data analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods.” In other words, EDA is simply the process of visualizing datasets from multiple angles to quickly get a holistic understanding of the data, and identify useful hidden patterns and outliers.
Machine Learning techniques, especially the all-powerful deep learning ones, are iterative processes requiring lots of compute resources. Going blind into the battle of model selection, model tuning, and training often ends up costing an immense amount of time and resources. John Tukey, a mathematician, the inventor of the box plot, and one of the earliest proponents of EDA, once said that “the simple graph has brought more information to the data analyst’s mind than any other device.” Performing EDA through meaningful data visualization provides valuable insight into datasets and puts the data into a context that helps in mapping the right direction for building a robust ML solution. In particular, EDA sheds light on the following:
- Health of the dataset: EDA exposes possible errors in the dataset such as misleading data, wrong data types and unintentional redundancy. EDA helps us monitor the overall health of the dataset by iteratively visualizing the data quality.
- Outliers and anomalies: Visualization techniques like box plots and probability density function curves make it easy to identify outliers or anomalies in a dataset. Whether ML scientists seek to better understand or simply eliminate extreme observations, visualizing the distribution of the data alongside its outliers is useful for validating assumptions and testing hypotheses.
- Hidden patterns: Most of machine learning is training models to identify patterns and trends in data. However, it is very difficult to identify patterns in raw data, especially in higher dimensional datasets. Visualization techniques like facets, clustering and confusion matrices expose such hidden patterns with ease.
- Imbalances: Imbalanced datasets tend to skew the predictions of an ML model and cause poor performance. Identifying imbalance in the training datasets and accounting for it upfront saves an immense amount of time and resources throughout the rest of the development process. Simple visualization techniques like histograms and pie charts can very easily expose such imbalances.
- Model complexity: A good understanding of the complexity of the data helps us make data-driven decisions for designing the algorithm pipeline. For example, if the data is very noisy or has a lot of redundant information, adequate preprocessing will allow us to choose a simpler model without sacrificing downstream performance. On the other hand, if the distribution has clear and easily separable patterns, choosing unnecessarily dense learning models can result in overfitting.
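Several of the checks above can be approximated in a few lines before any plotting is done. The following is a minimal sketch, assuming a tabular dataset loaded into a pandas DataFrame; the column names `brightness` and `label`, the toy values, and the helper functions `health_report`, `iqr_outliers`, and `class_shares` are all hypothetical and chosen only for illustration. The outlier rule used here is the 1.5 × IQR fence that box plots conventionally draw, which is one common choice rather than the only one.

```python
import numpy as np
import pandas as pd


def health_report(df: pd.DataFrame) -> dict:
    """Summarize basic data-quality signals for a tabular dataset."""
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]


def class_shares(labels: pd.Series) -> pd.Series:
    """Fraction of samples per class, largest class first."""
    return labels.value_counts(normalize=True)


# Hypothetical toy dataset: one numeric feature and one class label.
df = pd.DataFrame({
    "brightness": [0.48, 0.52, 0.50, 0.47, 0.53, 0.51, 0.49, 0.99, 0.50, 0.50],
    "label": ["person"] * 7 + ["car"] * 2 + ["dog"],
})
df.loc[3, "brightness"] = np.nan                       # simulate a missing value
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)  # simulate a duplicated row

report = health_report(df)                           # dataset health
outliers = iqr_outliers(df["brightness"].dropna())   # outliers and anomalies
shares = class_shares(df["label"])                   # class imbalance

print(report)
print(outliers)
print(shares)
```

On this toy data the report flags one missing value and one duplicated row, the fence rule isolates the single extreme brightness value, and the class shares show that one label dominates the dataset.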
Types of EDA
Exploratory data analysis can take all sorts of forms depending on the type of data, but all forms of exploratory data analysis fall under one of three categories: univariate, bivariate, and multivariate analysis. These categories simply correspond to the number of variables being analyzed at a time – one, two, or more than two.
- Univariate Analysis: As the name suggests, univariate analysis focuses on analyzing data by studying the behavior of a single variable or feature at a time. Here, we look at key statistics like the distribution of the variable at different time windows or the frequency of occurrence of a certain value. We employ multiple visualization techniques including histograms, bar plots, box plots and violin plots to understand the behavior of the variable. In image processing, this might look like the distribution of classes in a dataset, as shown below:
The class distribution of the COCO 2017 Validation dataset has a clear bias towards humans
- Bivariate Analysis: Bivariate analysis focuses on analyzing the data by comparing two variables at a time. Here, we look at the empirical relationship between the two variables and the strength of their association using visualization techniques like scatter plots and mosaic plots. In a computer vision context, this might look like a scatter plot showing the relationship between the mean and standard deviation of each image’s pixel values within a dataset, as shown below:
A simple scatter plot showing the mean and standard deviation of the green pixel values in this dataset reveals outliers in both dimensions.
- Multivariate Analysis: Multivariate analysis focuses on analyzing the data by simultaneously comparing many variables. Here, we look at the correlation between multiple variables to understand the overall structure of the data, the impact of the variables on each other, and opportunities for dimensionality reduction. Visualization techniques like clustering, facets, contour plots, and covariance matrices are used to perform multivariate analysis.
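The three categories can be illustrated with the statistics that would feed each kind of plot. This is a rough sketch, assuming the images are available as a NumPy array of shape (N, height, width, 3) with one label per image; the random `images` array, the `labels` list, and the deliberately darkened first image are synthetic stand-ins, not real data.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for an image dataset: 200 RGB images of 32x32 pixels
# with values in [0, 1], plus one class label per image.
images = rng.random((200, 32, 32, 3))
images[0] *= 0.1  # make one image unusually dark so there is an outlier to find
labels = ["person"] * 140 + ["car"] * 40 + ["dog"] * 20

# Univariate: how often does each class occur? (input for a bar plot)
class_counts = Counter(labels)

# Bivariate: mean vs. standard deviation of the green channel, per image
# (input for a scatter plot).
green = images[..., 1].reshape(len(images), -1)
green_mean = green.mean(axis=1)
green_std = green.std(axis=1)

# Multivariate: correlation between several per-image statistics at once
# (input for a correlation heatmap).
channel_means = images.mean(axis=(1, 2))            # shape (200, 3): R, G, B
features = np.column_stack([channel_means, green_std])
correlation = np.corrcoef(features, rowvar=False)   # shape (4, 4)

# The darkened image stands out as the minimum of the green-channel mean.
darkest = int(np.argmin(green_mean))

print(class_counts.most_common())
print(darkest, green_mean[darkest], green_std[darkest])
print(correlation.round(2))
```

Each intermediate result maps onto one plot type: `class_counts` to a bar plot, the `green_mean` and `green_std` pair to a scatter plot, and `correlation` to a heatmap.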
EDA in computer vision (CV) has made significant advancements in recent years. Researchers working in this area divide visualization into two major approaches: statistical modeling and image feature visualization. Statistical modeling encompasses the methods above – visualizing image-specific summary statistics like color channel mean and variance or image entropy. Image feature visualization involves visualizing more complex image features like texture, shape, and oriented gradients, or understanding image datasets using powerful techniques like Facets, t-distributed stochastic neighbor embedding (t-SNE), principal component analysis (PCA), and multidimensional scaling (MDS).
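As an illustration of the dimensionality-reduction techniques mentioned above, the sketch below implements a bare-bones PCA with NumPy's singular value decomposition and projects flattened images onto two components. It assumes each image can be flattened into one feature row; the helper `pca_project` and the two synthetic groups of "dark" and "bright" images are hypothetical. In practice a library implementation would likely be used instead, and raw pixels are only one possible choice of feature.

```python
import numpy as np


def pca_project(features: np.ndarray, n_components: int = 2):
    """Project the rows of `features` onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    # Share of the total variance captured by each component.
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


rng = np.random.default_rng(seed=0)
# Hypothetical data: two groups of 8x8 grayscale images, dark and bright.
dark = rng.normal(loc=0.2, scale=0.05, size=(50, 8, 8))
bright = rng.normal(loc=0.8, scale=0.05, size=(50, 8, 8))
flat = np.concatenate([dark, bright]).reshape(100, -1)  # one row per image

embedding, explained_ratio = pca_project(flat, n_components=2)

print(embedding.shape)
print(explained_ratio)
```

Plotting the two columns of `embedding` against each other would show the dark and bright groups as two separate clusters along the first component, which is the kind of hidden structure these techniques are meant to surface.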
Common Python libraries for EDA
Some form of EDA is part of most data management tools, including Excel, Tableau Public, and R. However, due to its widespread use in the machine learning community, Python has seen the development of particularly powerful EDA libraries. Here, we list 5 common libraries used for EDA in a variety of ML applications and domains.
- Matplotlib: Matplotlib was one of the earliest data visualization libraries built for Python. It was designed to resemble the data visualization abilities of MATLAB over a decade ago and has since been one of the most widely used EDA libraries. While Matplotlib is a very powerful library, it generally takes a lot of experience with it to create useful graphs.
- TensorFlow: The TensorFlow Data Validation (TFDV) library was built to understand and visualize data at scale. TFDV can be easily employed to compute and visualize descriptive statistics from large scale datasets. On top of that, TFDV provides the ability to infer a schema and validate new data.
- Pandas: The pandas-profiling package provides a one-line function call for EDA: pandas_profiling.ProfileReport() produces a near-complete EDA report for categorical data. What’s more, you can export the report as an interactive HTML document.
- Seaborn: Seaborn is a data visualization library built on Matplotlib that makes creating complicated plots much easier. It offers a wide range of appealing visualization patterns with fewer lines of code and simpler syntax.
- Plotly: Plotly is unique in that it’s an online tool for data analytics and visualization that provides a well documented Python API, so it can seamlessly integrate with other Python libraries. Plotly provides a set of rich interactive visualizations that makes diving into your dataset fun.
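To give a feel for what working with these libraries looks like, here is a minimal Matplotlib sketch that draws a histogram and a box plot of a single feature side by side. The `values` array is synthetic, with a few extreme points added by hand, and the off-screen "Agg" backend is an assumption made so the example also runs on a machine without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical feature: 500 typical values plus a few extreme ones.
values = np.concatenate([rng.normal(50, 5, size=500), [95.0, 98.0, 2.0]])

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: shows the overall shape of the distribution.
ax_hist.hist(values, bins=30)
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("value")
ax_hist.set_ylabel("count")

# Box plot: points beyond the whiskers are drawn individually as outliers.
ax_box.boxplot(values)
ax_box.set_title("Box plot")
ax_box.set_ylabel("value")

fig.tight_layout()
# fig.savefig("eda_overview.png")  # uncomment to write the figure to disk
```

Seaborn and Plotly offer comparable plots with less setup or with interactivity, so the choice between them is mostly a matter of how much control and polish a given analysis needs.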
A dataset contains many stories, and exploratory data analysis helps Machine Learning scientists listen to those stories that the dataset has to tell. Through constructing visualizations for univariate, bivariate, or multivariate analysis, ML scientists can better understand the underlying patterns of their datasets. They can then make informed decisions to eliminate outliers or anomalies, reduce class imbalances, or enrich their dataset with new data. The number of visualization tools and libraries available reflects the importance of EDA to the ML development process, but even simple EDA in the form of box plots and histograms goes a long way. So, before diving into blindly increasing compute power or model complexity, making use of even simple EDA tools to fully understand the dataset will yield massive benefits in development efficiency and model performance in the long run. Without understanding the scope of the teacher’s knowledge – the training dataset – you will never understand what your student – the ML model – will learn, and how it will perform in the field.
We are a group of scientists, engineers, and entrepreneurs with a vision for better AI. With backgrounds primarily in Machine Learning and Computer Vision, the Innotescus team understands the importance of having full control over and insight into data used to train Machine Learning models.