Exploratory Data Analysis

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

— John Tukey

Guiding Questions

  • What does this data have to tell me about the topic?

  • Is the data reliable?

  • Is the data complete?

  • How are the variables distributed?

  • How are the variables related?

Question Generation

  • Does the structure of the data match your expectations?

  • Did the data read in correctly?

  • How does the data fit in context?

  • Do the observations make sense?

    • scale
    • missingness
    • relationships with other variables
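The checks above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `toy_data` is a hypothetical stand-in built on the spot (it is not the `new_data` used later in these slides), and the 0 to 100 range is an assumed expectation for the sketch.

```r
# Hypothetical stand-in for a freshly read data set (NOT the slides' new_data)
set.seed(1)
toy_data <- data.frame(
  group = sample(c("a", "b", "c"), 20, replace = TRUE),
  V1    = c(runif(18, 0, 100), NA, NA),
  V2    = c(NA, runif(19, 0, 100)),
  stringsAsFactors = FALSE
)

# Structure: do the dimensions and column types match expectations?
str(toy_data)
col_types <- sapply(toy_data, class)

# Scale: are the numeric ranges plausible (assumed here: between 0 and 100)?
num_ranges <- sapply(toy_data[c("V1", "V2")], range, na.rm = TRUE)

# Missingness: how many values are missing in each column?
n_missing <- colSums(is.na(toy_data))

print(col_types)
print(num_ranges)
print(n_missing)
```

Surprises at this stage (a numeric column read as character, an implausible range, unexpected missing values) are often the source of the most useful EDA questions.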

Numerical EDA

  • Basic summary statistics

  • Made easier with tools like skimr (R) and skimpy (Python), which tabulate summary statistics alongside sparkline histograms

library(skimr)
skim(new_data)

Numerical EDA

Data summary
Name                     new_data
Number of rows           2142
Number of columns        3
_______________________
Column type frequency:
  character              1
  numeric                2
________________________
Group variables          None

Numerical EDA

var    n_missing  complete_rate  min  max  empty  n_unique  whitespace
group          0              1    1    1      0        13           0

Numerical EDA

var  n_missing  complete_rate      mean        sd          p0       p25       p50       p75      p100  hist
V1         150      0.9299720  54.24875  16.73711  15.5607495  40.89041  52.52804  67.32513  98.28812  ▂▆▇▆▁
V2         146      0.9318394  47.87741  26.94546   0.0151193  22.19145  47.79891  71.97336  99.69468  ▇▇▇▇▆
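To make the columns of the numeric summary above concrete, here is a minimal base R sketch that computes the same kinds of quantities by hand. It is not skimr's implementation. `toy_data` and the helper `summarise_numeric()` are hypothetical names introduced only for this illustration.

```r
# Hypothetical stand-in data (NOT the slides' new_data)
set.seed(2)
toy_data <- data.frame(
  V1 = c(rnorm(8, mean = 50, sd = 15), NA, NA),
  V2 = c(NA, runif(9, 0, 100))
)

# Hypothetical helper: one row of summary statistics for a numeric vector
summarise_numeric <- function(x) {
  c(
    n_missing     = sum(is.na(x)),           # count of NA values
    complete_rate = mean(!is.na(x)),         # share of non-missing values
    mean          = mean(x, na.rm = TRUE),
    sd            = sd(x, na.rm = TRUE),
    quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
  )
}

# One row per variable, one column per statistic
summary_table <- t(sapply(toy_data, summarise_numeric))
print(round(summary_table, 2))
```

The complete rate is simply one minus the share of missing values, which matches the table above: for V1, 1 - 150/2142 is approximately 0.93.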

Graphical EDA

  • Start with single variables (1D summaries)

  • Add in factor variables (Conditional 1D summaries)

  • Move to 2D summaries and pairwise scatterplots

  • For high dimensional numerical data, consider dimension reduction techniques (PCA, t-SNE, UMAP)

  • Be careful to notice/account for missing values

    • naniar and visdat R packages
    • missingno in Python
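As a rough illustration of the dimension reduction step, the sketch below runs PCA with base R's `prcomp()`. The built-in `iris` data is used purely as a stand-in (it has only four numeric columns, so it is not truly high dimensional), and rows with missing values are dropped first because `prcomp()` does not accept `NA`s.

```r
# iris ships with R; it is only a stand-in for a wider numeric data set
num_vars <- iris[, 1:4]

# prcomp() fails on NA, so keep complete rows (iris happens to have none missing)
keep     <- complete.cases(num_vars)
num_vars <- num_vars[keep, ]
species  <- iris$Species[keep]

# Center and scale so that no variable dominates because of its units
pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# 2D view of the first two components, colored by a factor variable
plot(pca$x[, 1:2], col = species, pch = 19, xlab = "PC1", ylab = "PC2")
```

Dropping incomplete rows is itself a choice worth checking: if missingness is related to the variables of interest, the reduced picture may be misleading.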

Graphical EDA

library(visdat)
library(ggplot2)  # needed for scale_fill_manual()
vis_dat(new_data) +
  # Make color values a bit more colorblind friendly
  scale_fill_manual(values = c("#d73027", "#4575b4"),
                    na.value = "#AAAAAA")

Graphical EDA

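library(tidyverse)  # provides pivot_longer() and ggplot()
# One boxplot per numeric variable, ignoring group
new_data |>
  pivot_longer(-group, names_to = "var", values_to = "value") |>
  ggplot() + geom_boxplot(aes(x = value)) + facet_grid(var ~ .)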

Graphical EDA

# Same boxplots, now conditioned on group
new_data |>
  pivot_longer(-group, names_to = "var", values_to = "value") |>
  ggplot() + geom_boxplot(aes(x = group, y = value)) + facet_grid(var ~ .)

Graphical EDA

# Flag rows missing V1 or V2, then count flagged and unflagged rows per group
new_data |>
  mutate(missing_val = is.na(V1) | is.na(V2)) |>
  ggplot() + geom_bar(aes(x = group, fill = missing_val))
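A numerical companion to the bar chart above is the share of incomplete rows in each group. The sketch below uses base R only. `toy_data` is a small hypothetical data set invented for the illustration, not the slides' `new_data`.

```r
# Hypothetical stand-in data with two groups of five rows each
toy_data <- data.frame(
  group = rep(c("a", "b"), each = 5),
  V1    = c(NA, 1:4, 5:9),
  V2    = c(1:5, NA, NA, 8:10),
  stringsAsFactors = FALSE
)

# Same flag as in the plot: a row is incomplete if V1 or V2 is missing
toy_data$missing_val <- is.na(toy_data$V1) | is.na(toy_data$V2)

# Proportion of incomplete rows per group (mean of a logical = share of TRUE)
missing_by_group <- tapply(toy_data$missing_val, toy_data$group, mean)
print(missing_by_group)
```

Proportions are easier to compare than raw counts when group sizes differ, which the stacked bars can hide.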

Graphical EDA

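library(naniar)
# geom_miss_point() keeps rows with missing values by plotting them
# below the observed minimum of the variable that is missing
new_data |>
  ggplot(aes(x = V1, y = V2)) +
  geom_miss_point() +
  ggtitle("naniar package: Include missing vars") +
  theme(legend.position = "bottom")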

Graphical EDA

new_data |>
  ggplot(aes(x = V1, y = V2)) + geom_miss_point(size = .6) + 
  facet_wrap(~group, nrow = 2) + 
  ggtitle("naniar package: Include missing variables") + 
  theme(legend.position = c(1, 1.25), legend.justification = c(1, 1.25),
        legend.direction = "horizontal")

Your Turn

Note

Dataset: US County-level data, https://shorturl.at/U9KUO

About the data: https://github.com/evangambit/JsonOfCounties

Tasks

  1. Read in the data in your favorite software and pick an interesting set of variables.

  2. Generate some questions to explore using EDA skills.

  3. Use numerical and graphical summaries to answer your questions.

Your Turn

Try out a few of the following techniques:

  • Tabular summaries
  • Visualizing missing data
  • Univariate distributions
  • Conditional distributions
  • Bivariate distributions
  • Scatterplot matrices
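For a starting point on the last few techniques, here is a minimal base R sketch. The built-in `mtcars` data stands in for the county data (whose column names are not assumed here), and the three variables chosen are arbitrary.

```r
# mtcars ships with R; it is only a stand-in for the county-level data
vars <- mtcars[, c("mpg", "hp", "wt")]

# Scatterplot matrix: every pairwise scatterplot in one display
pairs(vars, main = "Scatterplot matrix (stand-in data)")

# Numerical companion: pairwise correlations, using only complete pairs
cor_mat <- cor(vars, use = "pairwise.complete.obs")
print(round(cor_mat, 2))

# Conditional distribution: mpg within each level of a grouping variable
boxplot(mpg ~ cyl, data = mtcars, xlab = "cyl", ylab = "mpg")
```

The same three calls should carry over to any data frame once a handful of numeric columns and one grouping column have been picked.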