Exploratory Data Analysis

You have a new dataset and it’s time to dig in. You may have some information about how the data was collected (or not!), but you almost certainly don’t already know what’s in the data - what artifacts, data entry errors, missing data, and outliers may show up. Exploratory Data Analysis, or EDA, is like an exploratory hike through unknown territory. You may have prepared ahead of time - you may have a map, hiking boots, and so on, but you cannot predict what wildlife you will see or what obstacles you might encounter.

Throughout this section, we’ll use the palmerpenguins data to illustrate different ways to explore data.

# Run if you don't have the package installed
# install.packages("palmerpenguins")
library(palmerpenguins)
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
from palmerpenguins import load_penguins
penguins = load_penguins()
penguins.head()
  species     island  bill_length_mm  ...  body_mass_g     sex  year
0  Adelie  Torgersen            39.1  ...       3750.0    male  2007
1  Adelie  Torgersen            39.5  ...       3800.0  female  2007
2  Adelie  Torgersen            40.3  ...       3250.0  female  2007
3  Adelie  Torgersen             NaN  ...          NaN     NaN  2007
4  Adelie  Torgersen            36.7  ...       3450.0  female  2007

[5 rows x 8 columns]

Examining the Data

Roger Peng’s e-book “Exploratory Data Analysis with R” contains a checklist for EDA that is a useful starting point.

This checklist covers mostly numerical summaries, but these numerical summaries can be incredibly useful for getting familiar with the data before proceeding with a graphical exploration.

  1. Formulate your question - come up with at least one question to guide your analysis. This prevents you from trying to explore too many paths at once. (You don’t necessarily need to stick to this path if you find another interesting one, though.)

  2. Read in your data

  3. Check the packaging - look at the number of rows and columns and ensure they match your expectations.

  4. Examine the structure of your data (e.g. using str() in R or DataFrame.info() in pandas)

  5. Look at the top and the bottom of your data using functions like head() and tail()

  6. Check your “n”s - examine counts of data by relevant variables, like location, participant, time to ensure the data matches your expectations.

  7. Validate with at least one external data source - check that the observations in your data fall within an acceptable range defined by some outside data source.

  8. Try the easy solution first - get a simple answer to your question with a quick plot or table

  9. Challenge your solution - spend a bit more time answering your question by e.g. summarizing your data across a relevant dimension or examining the effect of important covariates.

  10. Follow up - do you have the right data to answer your question? Do you need additional data? Do you need to adjust your question?

Numerical EDA

It is often good to calculate some basic summary statistics for each variable, along with the assessments above and the graphical summaries we’ll focus on in this section.

The skimr package in R produces very visually appealing numerical summaries.

library(skimr)
skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

The skimpy package in python is modeled on the skimr R package, but somehow has a much better name.

from skimpy import skim
skim(penguins)
╭─────────────────────────────── skimpy summary ───────────────────────────────╮
│          Data Summary                Data Types                              │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                       │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                       │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                       │
│ │ Number of rows    │ 344    │ │ float64     │ 4     │                       │
│ │ Number of columns │ 8      │ │ string      │ 3     │                       │
│ └───────────────────┴────────┘ │ int64       │ 1     │                       │
│                                └─────────────┴───────┘                       │
│                                   number                                     │
│ ┏━━━━━━┳━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━┓  │
│ ┃ colu ┃    ┃      ┃      ┃      ┃      ┃      ┃      ┃      ┃      ┃     ┃  │
│ ┃ mn_n ┃    ┃      ┃      ┃      ┃      ┃      ┃      ┃      ┃      ┃ his ┃  │
│ ┃ ame  ┃ NA ┃ NA % ┃ mean ┃ sd   ┃ p0   ┃ p25  ┃ p50  ┃ p75  ┃ p100 ┃ t   ┃  │
│ ┡━━━━━━╇━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━┩  │
│ │ bill │  2 │ 0.58 │   44 │  5.5 │   32 │   39 │   44 │   48 │   60 │ ▃▇▆ │  │
│ │ _len │    │      │      │      │      │      │      │      │      │ ▇▃  │  │
│ │ gth_ │    │      │      │      │      │      │      │      │      │     │  │
│ │ mm   │    │      │      │      │      │      │      │      │      │     │  │
│ │ bill │  2 │ 0.58 │   17 │    2 │   13 │   16 │   17 │   19 │   22 │ ▃▅▆ │  │
│ │ _dep │    │      │      │      │      │      │      │      │      │ ▇▆▂ │  │
│ │ th_m │    │      │      │      │      │      │      │      │      │     │  │
│ │ m    │    │      │      │      │      │      │      │      │      │     │  │
│ │ flip │  2 │ 0.58 │  200 │   14 │  170 │  190 │  200 │  210 │  230 │ ▂▇▇ │  │
│ │ per_ │    │      │      │      │      │      │      │      │      │ ▃▆▃ │  │
│ │ leng │    │      │      │      │      │      │      │      │      │     │  │
│ │ th_m │    │      │      │      │      │      │      │      │      │     │  │
│ │ m    │    │      │      │      │      │      │      │      │      │     │  │
│ │ body │  2 │ 0.58 │ 4200 │  800 │ 2700 │ 3600 │ 4000 │ 4800 │ 6300 │ ▂▇▆ │  │
│ │ _mas │    │      │      │      │      │      │      │      │      │ ▅▃▁ │  │
│ │ s_g  │    │      │      │      │      │      │      │      │      │     │  │
│ │ year │  0 │    0 │ 2000 │ 0.82 │ 2000 │ 2000 │ 2000 │ 2000 │ 2000 │  ▇  │  │
│ │      │    │      │      │      │      │      │      │      │      │ ▇ ▇ │  │
│ └──────┴────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴─────┘  │
│                                   string                                     │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name      ┃ NA   ┃ NA %   ┃ words per row      ┃ total words     ┃  │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩  │
│ │ species          │    0 │      0 │                  1 │             344 │  │
│ │ island           │    0 │      0 │                  1 │             344 │  │
│ │ sex              │   11 │    3.2 │               0.97 │             333 │  │
│ └──────────────────┴──────┴────────┴────────────────────┴─────────────────┘  │
╰──────────────────────────────────── End ─────────────────────────────────────╯

Graphical EDA

When examining a data set, it is often useful to start out by examining single variables. We often care about distributions of data, both for getting to know the dataset and for considering modeling possibilities once we transition to confirmatory data analysis.

The following chunks load in the graphics packages we’ll use for the rest of this section.

library(ggplot2)
import seaborn as sns 
import seaborn.objects as so
import matplotlib.pyplot as plt

One-dimensional summaries

There are several basic options for distribution assessment:

  • Boxplots (numerical data) - visually simple graphical five-number summaries
  • Histograms (numerical data) - show full distribution of the data, useful for assessing skewness, etc.
  • Density plot (numerical data) - continuous version of histogram
  • Barplots (categorical data) - categorical equivalent of histogram.

It is often common to initially make these plots for one variable and then to begin to explore conditional distributions by drawing multiple plots for different values of a different variable.

The following examples show exploratory plots for the palmerpenguins data.

Boxplots

ggplot(penguins) + geom_boxplot(aes(y = body_mass_g))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

ggplot(penguins) + geom_boxplot(aes(x = sex, y = body_mass_g))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

ggplot(penguins) + geom_boxplot(aes(x = species, y = body_mass_g))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).

sns.boxplot(data = penguins, y = "body_mass_g")
plt.show()

sns.boxplot(data = penguins, x = "sex", y = "body_mass_g")
plt.show()

sns.boxplot(data = penguins, x = "species", y = "body_mass_g")
plt.show()

Histograms

ggplot(penguins) + 
  geom_histogram(aes(x = body_mass_g))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

ggplot(penguins) + 
  geom_histogram(aes(x = body_mass_g, fill = sex), 
                 alpha = .5, position = "identity")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

ggplot(penguins) + 
  geom_histogram(aes(x = body_mass_g, fill = species), 
                 alpha = .5, position = "identity")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).

(
  so.Plot(penguins, x = "body_mass_g")
    .add(so.Bars(), so.Hist(bins=30))
    .show()
)

(
  so.Plot(penguins, x = "body_mass_g", color = "sex")
    .add(so.Bars(alpha=.5), so.Hist(bins=30))
    .show()
)

(
  so.Plot(penguins, x = "body_mass_g", color = "species")
    .add(so.Bars(alpha=.5), so.Hist(bins=30))
    .show()
)

Density Plots

ggplot(penguins) + 
  geom_density(aes(x = body_mass_g))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

ggplot(penguins) + 
  geom_density(aes(x = body_mass_g, fill = sex), 
               alpha = .5)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

ggplot(penguins) + 
  geom_density(aes(x = body_mass_g, fill = species), 
               alpha = .5)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_density()`).

so.Plot(penguins, x = "body_mass_g").add(so.Area(), so.KDE()).show()

(
  so.Plot(penguins, x = "body_mass_g", color = "sex")
    .add(so.Area(), so.KDE(common_norm=False))
    .show()
)

(
  so.Plot(penguins, x = "body_mass_g", color = "species")
    .add(so.Area(), so.KDE(common_norm=False))
    .show()
)

Bar Plots

ggplot(penguins) + 
  geom_bar(aes(x = species))

ggplot(penguins) + 
  geom_bar(aes(x = species, fill = sex), 
           position = "dodge")

ggplot(penguins) + 
  geom_bar(aes(x = species, fill = island))

(
  so.Plot(penguins, x = "species")
    .add(so.Bar(), so.Count())
    .show()
)

(
  so.Plot(penguins, x = "species", color = "sex")
    .add(so.Bar(), so.Count(), so.Dodge())
    .show()
)

(
  so.Plot(penguins, x = "species", color = "island")
    .add(so.Bar(), so.Count(), so.Stack())
    .show()
)

Two-dimensional summaries

Some of the one-dimensional summaries discussed above are easily modified into two dimensional summaries by coloring by or faceting by another variable. Other two-dimensional summaries require different types of plots: for instance, scatter plots, which show two numerical variables and can be modified to use color, shape, and facets to include even more information.

Scatterplots

ggplot(penguins, 
       aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() 
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

ggplot(penguins, 
       aes(x = bill_length_mm, y = bill_depth_mm, 
           color = species)) +
  geom_point() 
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

ggplot(penguins, 
       aes(x = bill_length_mm, y = bill_depth_mm, 
           color = species)) +
  geom_point() +
  facet_wrap(~year)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

(
  so.Plot(penguins, x = "bill_length_mm", y = "bill_depth_mm")
    .add(so.Dot())
    .show()
)

(
  so.Plot(penguins, x = "bill_length_mm", y = "bill_depth_mm", color = "species")
    .add(so.Dot())
    .show()
)

(
  so.Plot(penguins, x = "bill_length_mm", y = "bill_depth_mm", color = "species")
    .facet("year")
    .add(so.Dot())
    .show()
)

Scatter plot Matrices

Sometimes, we want to examine conditional relationships between many different variables at a time - for instance, to identify highly correlated variables when considering fitting a linear model. Scatter plot matrices allow us to examine relationships between many sets of two variables at once.

library(GGally)

ggpairs(penguins, columns = 3:6, aes(color = species))

penguin_measures = penguins.drop("year", axis = 1)
sns.pairplot(penguin_measures, hue = "species")

Analyzing An Example Dataset

An exploration of an interesting dataset showing the value of different types of plots for understanding your data.

Setup

# Data manipulation
library(tibble) # data frame++
library(dplyr) # wrangling

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr) # reshaping
library(magrittr) # pipes

Attaching package: 'magrittr'
The following object is masked from 'package:tidyr':

    extract
library(forcats) # factors

# Graphics
library(ggplot2)
theme_set(theme_bw())
# Data manipulation
import pandas as pd
import numpy as np

# Graphics
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so

Numerical EDA

library(skimr)
new_data <- read.csv("../data/new_data.csv")
skim(new_data)
Data summary
Name new_data
Number of rows 2142
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
group 0 1 1 1 0 13 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
V1 150 0.93 54.25 16.74 15.56 40.89 52.53 67.33 98.29 ▂▆▇▆▁
V2 146 0.93 47.88 26.95 0.02 22.19 47.80 71.97 99.69 ▇▇▇▇▆
from skimpy import skim
new_data = pd.read_csv("../data/new_data.csv")
skim(new_data)
╭─────────────────────────────── skimpy summary ───────────────────────────────╮
│          Data Summary                Data Types                              │
│ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                       │
│ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                       │
│ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                       │
│ │ Number of rows    │ 2142   │ │ float64     │ 2     │                       │
│ │ Number of columns │ 3      │ │ string      │ 1     │                       │
│ └───────────────────┴────────┘ └─────────────┴───────┘                       │
│                                   number                                     │
│ ┏━━━━━━━┳━━━━━┳━━━━━━┳━━━━━━┳━━━━┳━━━━━━━┳━━━━━┳━━━━━┳━━━━━┳━━━━━━┳━━━━━━━┓  │
│ ┃ colum ┃     ┃      ┃      ┃    ┃       ┃     ┃     ┃     ┃      ┃       ┃  │
│ ┃ n_nam ┃     ┃      ┃      ┃    ┃       ┃     ┃     ┃     ┃      ┃       ┃  │
│ ┃ e     ┃ NA  ┃ NA % ┃ mean ┃ sd ┃ p0    ┃ p25 ┃ p50 ┃ p75 ┃ p100 ┃ hist  ┃  │
│ ┡━━━━━━━╇━━━━━╇━━━━━━╇━━━━━━╇━━━━╇━━━━━━━╇━━━━━╇━━━━━╇━━━━━╇━━━━━━╇━━━━━━━┩  │
│ │ V1    │ 150 │    7 │   54 │ 17 │    16 │  41 │  53 │  67 │   98 │ ▂▆▇▇▅ │  │
│ │       │     │      │      │    │       │     │     │     │      │   ▁   │  │
│ │ V2    │ 146 │ 6.82 │   48 │ 27 │ 0.015 │  22 │  48 │  72 │  100 │ ▅▇▅▆▆ │  │
│ │       │     │      │      │    │       │     │     │     │      │   ▅   │  │
│ └───────┴─────┴──────┴──────┴────┴───────┴─────┴─────┴─────┴──────┴───────┘  │
│                                   string                                     │
│ ┏━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓  │
│ ┃ column_name      ┃ NA   ┃ NA %   ┃ words per row      ┃ total words     ┃  │
│ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩  │
│ │ group            │    0 │      0 │                  1 │            2142 │  │
│ └──────────────────┴──────┴────────┴────────────────────┴─────────────────┘  │
╰──────────────────────────────────── End ─────────────────────────────────────╯

Graphical EDA

Missing Data

library(visdat)

vis_dat(new_data) + 
  # Make color values a bit more colorblind friendly
  scale_fill_manual(values = c("#d73027", "#4575b4"), 
                    na.value = "#AAAAAA")

import missingno as msno
msno.matrix(new_data, label_rotation = 0)
plt.show()

Single Variable Distributions

new_data |> 
  pivot_longer(-group, names_to = "var", values_to = "value") |>
  ggplot() + 
    geom_boxplot(aes(x = value)) + 
    facet_grid(var~.)
Warning: Removed 296 rows containing non-finite outside the scale range
(`stat_boxplot()`).

new_data |>
  mutate(missing_val = is.na(V1) | is.na(V2)) |>
  mutate(missing_val = c("NO", "YES")[missing_val+1] |>
                       fct_relevel("YES")) |>
  ggplot() + 
  geom_bar(aes(x = group, fill = missing_val))

new_data = new_data.sort_values(['group', 'V1', 'V2'])
new_data_long = pd.melt(new_data, id_vars = 'group', value_vars = ['V1', 'V2'], var_name = 'var')
sns.catplot(data = new_data_long, x = 'value', y = 'var', kind = 'box')

plt.show()

new_data['missing_val'] = pd.isna(new_data.V1) | pd.isna(new_data.V2)
so.Plot(new_data, x = 'group', color = 'missing_val').add(so.Bar(), so.Hist(), so.Stack()).show()

Conditional Distributions

new_data |>
  pivot_longer(-group, names_to = "var", values_to = "value") |>
  ggplot() + 
    geom_boxplot(aes(x = group, y = value)) + 
    facet_grid(var~.)
Warning: Removed 296 rows containing non-finite outside the scale range
(`stat_boxplot()`).

sns.catplot(data = new_data_long, x = 'group', y = 'value', row = 'var', kind = "box")

plt.show()

Bivariate Distributions

library(naniar)

Attaching package: 'naniar'
The following object is masked from 'package:skimr':

    n_complete
new_data |>
  ggplot(aes(x = V1, y = V2)) + 
  geom_miss_point() + 
  ggtitle(paste(
    "naniar package:",
    "Include missing vars")) + 
  theme(legend.position = 
          "bottom")

new_data |>
  ggplot(aes(x = V1, y = V2)) + geom_miss_point(size = .6) + 
  facet_wrap(~group, nrow = 2) + 
  ggtitle("naniar package: Include missing variables") + 
  theme(legend.position = c(1, 1.25), legend.justification = c(1, 1.25),
        legend.direction = "horizontal")
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.

Python, to the best of my knowledge, doesn’t have a way to show scatterplots with missing data outside the range of the non-missing data. We’ll have to be content with a normal scatterplot to see the bivariate relationship.

so.Plot(new_data, x = 'V1', y = 'V2').add(so.Dots()).show()

so.Plot(new_data, x = 'V1', y = 'V2').add(so.Dots()).facet(row = 'group', wrap = 2).show()

Your Turn

Use the US county-level data

Dataset: US County-level data, https://shorturl.at/U9KUO
(You should be able to read this in directly from the URL)

About the data: https://github.com/evangambit/JsonOfCounties

  1. Read in the data in your favorite software and pick an interesting set of variables.
    Options:
  • County geometry
  • Weather (NOAA)
  • Demographics (Census)
  • Cause of death (CDC, averaged 1999-2019)
  • Employment (BLS)
  • Police Shootings (Wash Post)
  • Average Income (BEA)
  • COVID deaths (USAFacts/CDC)
  • Election Results
  1. Generate some questions to explore using EDA skills.

  2. Try out a few of the following techniques:

  • Tabular summaries
  • Visualizing missing data
  • Univariate distributions
  • Conditional distributions
  • Bivariate distributions
  • Scatterplot matrices