A Beginner’s Guide to Exploratory Data Analysis

Analytics in Practice

Have you ever played the game Twenty Questions?  

In this classic guessing game, one person chooses a mystery object. Once they’ve chosen, they tell the rest of the players only the broad category this object fits into, such as “place.” The players then try to win the game by identifying the mystery object using only twenty yes or no questions.  

If you’re new to exploratory data analysis, I encourage you to think about it a bit like Twenty Questions. Just as in the game, the final answer is not clear from the get-go. You’ll need to begin by casting a wide net, which you will narrow as you learn more. Each line of inquiry will build upon those that came before it. Finally, just like in Twenty Questions, an inquisitive nature will be rewarded, because questions are the start of answers.

 

Casting a Wide Net: Visualize Your Raw Data 

While in the game of Twenty Questions you are given a category as a starting point, in the world of analytics you are given a dataset. Your first mission is to investigate what that dataset includes.  

Read each column name to get an idea of what information your dataset encompasses. Explore with an open mind and think about what insights different variables might provide for your analysis, report, or business.  

Your dataset is likely to be much too large to get a visual read on variables simply by skimming columns and rows. Instead, make use of graphical representations to explore variables one at a time, such as histograms, pie charts, or box plots. The intent here is not to memorize every data point, but to understand the general structure of your variables. 
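
As a rough illustration of exploring one variable at a time, here is a minimal Python sketch that bins a single numeric column and prints a text histogram. The `order_totals` values and the `text_histogram` helper are invented for this example; in practice you would likely reach for a plotting library instead.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return (bin_start, count) pairs for fixed-width bins, sorted by bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return sorted(counts.items())


# Hypothetical order totals, invented for illustration only.
order_totals = [12, 15, 18, 22, 25, 27, 29, 31, 34, 95]

histogram = text_histogram(order_totals, bin_width=10)

# Print one row per non-empty bin, with one '#' per observation.
for bin_start, count in histogram:
    bin_end = bin_start + 10 - 1
    print(f"{bin_start:>3}-{bin_end:<3} {'#' * count}")
```

Even in this crude form, the shape is visible: most values cluster between 10 and 39, and a single value sits far away in the 90s.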

If you’re using Alteryx Designer for your analysis, be sure to add a Browse Tool to take advantage of its holistic data profiling. There you can quickly identify the shape of your data through a series of automatically generated charts, graphs, and summary statistics. 

Narrowing In: Calculate Descriptive Statistics 

Have a big picture idea of what your dataset includes? Alright, now let’s take a closer look at the variables you’re particularly interested in.  

Descriptive statistics, sometimes referred to as summary statistics, give a quick and simple numerical description of your data. These include the mean, median, mode, minimum value, maximum value, and standard deviation. Descriptive statistics may help you identify things that you did not catch when you plotted your variables, such as outliers or skewed distributions.  

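You can calculate descriptive statistics for discrete or continuous quantitative variables, and for categorical variables as well, though the useful measures differ: a mean or standard deviation suits numeric data, while counts and the mode suit categories. 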
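
To make this concrete, here is a small sketch using Python's standard `statistics` module. Both columns (`order_totals` and `regions`) are hypothetical data made up for the example, not taken from any real dataset.

```python
import statistics
from collections import Counter

# Hypothetical columns, invented for illustration only.
order_totals = [12, 15, 18, 22, 25, 27, 29, 31, 34, 95]       # quantitative
regions = ["east", "west", "east", "south", "east", "west"]   # categorical

# Summary statistics for the quantitative variable.
summary = {
    "mean": statistics.mean(order_totals),
    "median": statistics.median(order_totals),
    "min": min(order_totals),
    "max": max(order_totals),
    "stdev": statistics.stdev(order_totals),  # sample standard deviation
}

# For a categorical variable, counts and the mode are the natural summaries.
region_counts = Counter(regions)
region_mode = statistics.mode(regions)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print("region counts:", dict(region_counts))
print("most common region:", region_mode)
```

Here the mean (30.8) sits above the median (26.0), a hint that the distribution is pulled to the right by the one large value of 95.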

 

Building Upon What You Know: Investigate Relationships 

Much like in the game of Twenty Questions, it’s important to think critically about what information you’ve gotten so far, and how it all fits together. As you unearth more insights about your data through exploration, you’ll also want to consider the relationships between variables. Bivariate analysis looks at the relationship between two variables measured on each subject, while multivariate analysis looks at many variables measured on each subject at once. 

It's also important to consider the degree of correlation between variables. If your data exploration leads you down the path of advanced analytics where you employ regression modeling, using variables that are too closely related may yield results that don’t reveal any new or real insights. Data exploration in the early stages can help you to identify variables that are truly independent, and to avoid multicollinearity.  
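
One simple way to screen for closely related variables is to compute the Pearson correlation for every pair of numeric columns and flag the pairs above a cutoff. The sketch below assumes made-up columns and an arbitrary threshold of 0.9; both are illustrative choices, and pairwise correlation is only a first check, since it will not catch a variable that is a combination of several others.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns, invented for illustration only.
columns = {
    "price_usd": [10, 20, 30, 40, 50],
    "price_cents": [1000, 2000, 3000, 4000, 5000],  # same information, rescaled
    "units_sold": [9, 7, 8, 3, 4],
}

THRESHOLD = 0.9  # assumed cutoff; choose one that suits your analysis

# Collect every pair whose absolute correlation exceeds the cutoff.
flagged = [
    (a, b)
    for a, b in combinations(columns, 2)
    if abs(pearson(columns[a], columns[b])) > THRESHOLD
]

print("Highly correlated pairs:", flagged)
```

In this toy data, `price_usd` and `price_cents` carry the same information, so including both in a regression would add nothing new; the pair is flagged, while `units_sold` is not.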

Data exploration allows you to get familiar with your variables, form working hypotheses, and discover complex relationships early on in your analytics process. When exploring, think of yourself as a data detective playing Twenty Questions. Remember to start broad and then narrow in, and to always approach your data with an attitude of curiosity and intrigue. Data always has something to tell us; we just need eyes and ears alert enough to see and listen. 

Learn More. 

Learn more about exploratory analysis, why it's important, and the future of data exploration