Feature Hierarchy

Clusters Beat Trend!? Testing Feature Hierarchy in Statistical Graphics

Susan VanderPlas & Heike Hofmann

Iowa State University

Graphics and Perception

The greatest value of a picture is when it forces us to notice what we never expected to see.

John Tukey

Gestalt Laws of Perception


The whole is different than the sum of the parts

Gestalt Plots

How do plot aesthetics change our perception of the plotted data?

Statistical Lineups

Which plot is the most different?
Null plot data are generated by a method consistent with the null hypothesis

The nullabor R package helps generate null data
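As a rough illustration of how a lineup can be assembled, the sketch below hides the real data among null panels created by permuting y against x, which is one null-generating method consistent with "no association". This is not the nullabor API; `make_lineup`, the default of 20 panels, and the permutation null are all assumptions made for the example.

```python
import random

def make_lineup(xs, ys, m=20, rng=None):
    """Return (panels, target_pos): m panels keyed 1..m, one holding the real data.

    Hypothetical helper: null panels are built by shuffling ys, which breaks any
    x-y association while keeping both marginal distributions intact.
    """
    rng = rng or random.Random()
    target_pos = rng.randrange(1, m + 1)  # random position for the real data
    panels = {}
    for pos in range(1, m + 1):
        if pos == target_pos:
            panels[pos] = list(zip(xs, ys))
        else:
            shuffled = list(ys)
            rng.shuffle(shuffled)  # permutation null
            panels[pos] = list(zip(xs, shuffled))
    return panels, target_pos
```

A viewer who picks `target_pos` more often than chance would predict provides evidence against the null hypothesis.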

Which plots are the most different?


Two-Target Lineups

5, 12

Data Generating Mechanism

Linear Model

Parameter: \(\sigma_T\), the variability around the trend line

  1. Generate evenly spaced \(x_i\) in \([-1, 1]\)
  2. Jitter \(x_i\)
  3. Generate \(y_i = x_i + e_i\), \(e_i \sim N(0, \sigma_T^2)\)
  4. Center and scale \(x_i, y_i\)
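The four steps above can be sketched in a few lines of standard-library Python. The jitter distribution is not specified on the slide, so the uniform jitter of half a grid step used here is an assumption, as are the function names and the sample size in the usage note.

```python
import random
import statistics

def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def simulate_trend(n, sigma_t, jitter=None, rng=None):
    """Simulate n points from the linear model (hypothetical helper, n >= 2)."""
    rng = rng or random.Random()
    step = 2 / (n - 1)
    # Step 1: evenly spaced x_i in [-1, 1]
    xs = [-1 + i * step for i in range(n)]
    # Step 2: jitter x_i (assumed: uniform, at most half a grid step)
    width = step / 2 if jitter is None else jitter
    xs = [x + rng.uniform(-width, width) for x in xs]
    # Step 3: y_i = x_i + e_i with e_i ~ N(0, sigma_t^2)
    ys = [x + rng.gauss(0, sigma_t) for x in xs]
    # Step 4: center and scale both coordinates
    return standardize(xs), standardize(ys)
```

For example, `simulate_trend(45, 0.25)` returns two lists of 45 standardized values; a smaller `sigma_t` gives a tighter trend.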

Cluster Model

Parameters: \(K\) clusters, \(\sigma_C\) cluster variability

  1. Generate \(K\) cluster centers \(c^x,c^y\) on a \(K\times K\) grid such that \(cor(c^x, c^y) \in [.25, .75]\)
  2. Center and standardize \(c^x, c^y\)
  3. Determine cluster size \(g_1, ..., g_K \sim Multinomial(K, p)\)
  4. Generate points around cluster centers: \((x_i, y_i) = (c^x_{g_i}, c^y_{g_i}) + (e_i^x, e_i^y)\) where \(e_i^x, e_i^y \sim N(0, \sigma_C^2)\)
  5. Center and scale \(x_i, y_i\)
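A minimal sketch of the cluster steps follows, under several assumptions that the slide does not settle: centers are placed one per row and column of the grid (a random permutation, redrawn until the correlation condition holds), the multinomial draw distributes `n` points with equal cluster probabilities, and all function names are hypothetical.

```python
import random
import statistics

def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    za, zb = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(za, zb)) / (len(a) - 1)

def simulate_clusters(n, k, sigma_c, rng=None):
    """Simulate n points around k cluster centers (hypothetical helper)."""
    if k < 3:
        raise ValueError("k must be at least 3 for a correlation in [.25, .75]")
    rng = rng or random.Random()
    # Step 1: centers on a k x k grid, redrawn until cor(cx, cy) is in [.25, .75]
    cx = list(range(1, k + 1))
    while True:
        cy = rng.sample(cx, k)
        if 0.25 <= pearson(cx, cy) <= 0.75:
            break
    # Step 2: center and standardize the cluster centers
    cx, cy = standardize(cx), standardize(cy)
    # Step 3: independent equal-probability labels (assumed), so the
    # resulting cluster sizes follow a multinomial distribution
    labels = [rng.randrange(k) for _ in range(n)]
    # Step 4: scatter points around their centers with sd sigma_c
    xs = [cx[g] + rng.gauss(0, sigma_c) for g in labels]
    ys = [cy[g] + rng.gauss(0, sigma_c) for g in labels]
    # Step 5: center and scale
    return standardize(xs), standardize(ys), labels
```

The guard on `k` reflects that with two centers the correlation can only be +1 or -1, so the rejection loop would never terminate.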


Mixture Model

Groups created by k-means clustering
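For readers unfamiliar with k-means, the sketch below shows a plain Lloyd-style iteration on 2-D points. It is an illustration of the general technique, not the implementation used in the study; the function name, the random initialization from the data, and the iteration cap are assumptions.

```python
import random

def kmeans(points, k, iters=100, rng=None):
    """Assign each 2-D point to one of k groups (hypothetical helper)."""
    rng = rng or random.Random()
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, g in zip(points, labels) if g == c]
            if members:  # an empty group keeps its previous center
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels, centers
```

Nothing in this procedure guarantees a minimum number of points per group, so very small or empty groups are possible.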


Experimental Design - Data Parameters

18 combinations of data parameters (2 values of \(K\) \(\times\) 3 values of \(\sigma_T\) \(\times\) 3 values of \(\sigma_C\))
3 replicates of each parameter set; 54 total lineup data sets

Experimental Design - Plot Aesthetics

10 Aesthetics \(\times\) 54 data sets = 540 plots
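The design counts can be checked by enumerating the full factorial. The level labels below are placeholders, since the actual parameter values and aesthetic names are not listed here; only the number of levels is taken from the slides.

```python
from itertools import product

# Placeholder level labels (assumed); only the counts come from the design.
K_LEVELS = ["K_1", "K_2"]
SIGMA_T_LEVELS = ["sigma_T_1", "sigma_T_2", "sigma_T_3"]
SIGMA_C_LEVELS = ["sigma_C_1", "sigma_C_2", "sigma_C_3"]
REPLICATES = 3
N_AESTHETICS = 10

# Every combination of data parameters
parameter_sets = list(product(K_LEVELS, SIGMA_T_LEVELS, SIGMA_C_LEVELS))
# Each parameter set is replicated to give distinct lineup data sets
datasets = [(params, rep)
            for params in parameter_sets
            for rep in range(1, REPLICATES + 1)]
# Each data set is rendered once per aesthetic
plots = [(aes, ds) for aes in range(1, N_AESTHETICS + 1) for ds in datasets]

print(len(parameter_sets), len(datasets), len(plots))  # prints: 18 54 540
```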

Experimental Design

Results

Most participants identified a mix of cluster and trend targets


Faceoff Model

\[C_{ijk} := \left\{\begin{array}{c}\text{Participant }k\text{ selects the cluster target }\\ \text{for dataset }j\text{ with aesthetic }i\end{array}\right\}\]


\[\text{logit} P(C_{ijk}|C_{ijk}\cup T_{ijk}) = \mathbf{W}\alpha + \mathbf{X}\beta + \mathbf{J}\gamma + \mathbf{K}\eta\]
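To read the model on the probability scale, assume \(T_{ijk}\) denotes the analogous event that participant \(k\) selects the trend target for dataset \(j\) with aesthetic \(i\), and write \(\ell_{ijk}\) (a symbol introduced here) for the value of the right-hand side for that observation. Because \(C_{ijk} \subseteq C_{ijk}\cup T_{ijk}\), the conditional probability is a simple ratio, and inverting the logit gives:

\[P(C_{ijk}\mid C_{ijk}\cup T_{ijk}) = \frac{P(C_{ijk})}{P(C_{ijk}\cup T_{ijk})} = \frac{1}{1+\exp(-\ell_{ijk})}\]

So \(\ell_{ijk} = 0\) corresponds to an even chance of choosing the cluster target among participants who picked at least one target, and positive values favor the cluster target.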


Participant Reasoning: Plain plots

Participant Reasoning: Trend plots

Participant Reasoning: Color plots

Participant Reasoning: Color + Ellipse plots

Participant Reasoning



Some of the null plots were missing an ellipse: we failed to enforce group-size constraints on the k-means algorithm.
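One way such a constraint could be enforced is to reject any clustering with an undersized group and try again. The sketch below is an assumption about how that might look, not a description of what the authors did or later changed; `cluster_fn`, `min_size`, and `max_tries` are hypothetical names, and the threshold needed to draw an ellipse is not stated in the slides.

```python
from collections import Counter

def groups_large_enough(labels, k, min_size):
    """True if each of the k groups has at least min_size members."""
    counts = Counter(labels)
    return all(counts[g] >= min_size for g in range(k))

def cluster_with_min_size(cluster_fn, points, k, min_size, max_tries=50):
    """Re-run a clustering routine until every group is large enough.

    cluster_fn(points, k, attempt) is a hypothetical callable returning one
    label in range(k) per point; attempt can be used to vary the random seed.
    """
    for attempt in range(max_tries):
        labels = cluster_fn(points, k, attempt)
        if groups_large_enough(labels, k, min_size):
            return labels
    raise RuntimeError("no clustering met the group-size constraint")
```

Rejection is the simplest option; a constrained clustering algorithm would be an alternative when acceptable solutions are rare.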

Conclusion
