What Makes a Good Graph?

Graphical Testing and Principles for Graph Design

Susan Vanderplas

2023-03-27

Introductions

  • PhD/MS in Statistics from Iowa State

  • BS in Applied Math and Psychology from Texas A&M

  • Research Areas

    • Forensic Science - automated algorithms for pattern evidence
    • Data Science - automation, data pipelines, tools
    • Visualization - experimental evaluation of graphics, new methods
  • Fundamental Goals

    • Make visualizations designed to leverage human perception to understand data (and communicate about data)
    • Design algorithms to mimic people’s capabilities (vision/perception)

Why do we use Visualizations?

QR code link to https://pollev.com/susanvanderp753

The Good, The Bad, and the Ugly

The Good

The Bad

There are also many global versions of this map showing traffic to English-language websites, which are indistinguishable from maps of where native English-speaking internet users live.

The Ugly

Top 500 Supercomputers by Processor Family. Moxfyre, CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons

Statistical Visualizations

Statistics and Charts

A statistic is a quantity computed from values in a sample, used for a statistical purpose.
Source: Wikipedia

\[\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i\]
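The sample mean above is the canonical example of a statistic: one number computed from a sample. A minimal sketch (the data values are made up for illustration):

```python
# A statistic: a single quantity computed from values in a sample.
values = [4.2, 5.1, 3.8, 4.9, 5.0]   # illustrative sample
n = len(values)
x_bar = sum(values) / n               # x-bar = (1/n) * sum of x_i
print(x_bar)  # 4.6
```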

A chart is a graphical representation for data visualization, in which the data is represented by symbols.
Source: Wikipedia

By DanielPenfield - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9402369

Charts are (usually) computed from values in a sample and used for a statistical purpose.



So… charts are statistics!

Testing Statistics: Example

Data from Tidy Tuesday, 2020-09-01

Testing Statistics

  • If statistics are charts, then what is the reference distribution?

  • What constitutes an “extreme” or “significant” chart?

Hypothesis testing:

  • take a sample

  • calculate a test statistic

  • compare test statistic to reference distribution
    (formed by \(H_0\))

  • if it is unlikely, reject null hypothesis
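The classical steps above can be sketched numerically. A minimal Python illustration of a one-sample t statistic (the `one_sample_t` helper and the data are assumptions for this sketch, not anything from the talk):

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Test statistic for H0: mean = mu0 -- the sample mean's
    distance from mu0, scaled by the standard error."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return (x_bar - mu0) / se

t = one_sample_t([5.1, 4.9, 5.3, 5.0, 5.2], mu0=4.0)
print(round(t, 2))  # 15.56 -- far out in the reference distribution, so reject H0
```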

Graphical hypothesis testing:

  • take a sample

  • create a test statistic/graph

  • compare graph to a reference distribution of other graphs generated under \(H_0\)

  • if test graph “stands out” then reject null hypothesis
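The graphical steps run in parallel: the "test statistic" is the plot of the real data, and the reference distribution is a set of null plots. A minimal sketch of the lineup construction (the `make_lineup` helper and the permutation-based null generation are assumptions for illustration; dedicated lineup tools are used in practice):

```python
import random

def make_lineup(x, y, n_nulls=19, seed=42):
    """Hide the real (x, y) data among null datasets generated
    under H0 (here: no association, via permutation of y).
    Returns the list of panels and the true data's position."""
    rng = random.Random(seed)
    nulls = []
    for _ in range(n_nulls):
        y_perm = y[:]
        rng.shuffle(y_perm)           # break any x-y association
        nulls.append((x, y_perm))
    pos = rng.randrange(n_nulls + 1)  # random slot for the real data
    panels = nulls[:pos] + [(x, y)] + nulls[pos:]
    return panels, pos

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
panels, pos = make_lineup(x, y)
print(len(panels))  # 20 panels; if viewers pick panel `pos`, reject H0
```

Each panel would then be drawn as a scatterplot; if viewers reliably pick out the panel holding the real data, it "stands out" from the null distribution.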

Testing Statistics: Example

Testing Statistics

  • The plot is a statistical lineup

  • The method is visual inference
    (a graphical hypothesis test)

  • Many factors influence the results

    • the data
    • the plot type
    • the plot aesthetics
      (color, shape, etc.)
    • extra statistical features
      (trend lines, error bars)

How much do graphical features matter?

Which plot(s) are the most different?

Which plot(s) are the most different?

Which plots are the most different?

31 Evaluations

  Panel    % selected
  12         9.7%
  5         29.0%
  18        32.3%
  Other     29.1%

22 Evaluations

  Panel    % selected
  12        59.1%
  5          9.1%
  18
  Other     31.7%

Two-Target Lineups

  • Modify lineup protocol for tests of
    competing hypotheses \(H_1\) and \(H_2\)

  • \(H_1\) and \(H_2\) target plots

  • 18 null plots generated using a
    mixture model consistent with \(H_0\)

Experimental Design - Parameters

  • \(K = 3, 5\) clusters

  • \(N = 15 K\) points

  • \(\sigma_T = 0.25, 0.35, 0.45\) (variability around the trend line)

  • \(\sigma_C = \begin{cases}0.25, 0.30, 0.35 & (K = 3)\\ 0.20, 0.25, 0.30 & (K = 5)\end{cases}\) (variability around the cluster centers)

  • \(\lambda = 0.5\) (mixture parameter)



18 combinations of plot parameters (2 values of \(K\) \(\times\) 3 values of \(\sigma_T\) \(\times\) 3 values of \(\sigma_C\))

3 replicates of each parameter set = 54 total lineup data sets
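Enumerating the design directly from the slide's parameters confirms the counts (a small sketch; the variable names are just for illustration):

```python
from itertools import product

K_values = [3, 5]
sigma_T = [0.25, 0.35, 0.45]
# sigma_C depends on K, per the slide
sigma_C = {3: [0.25, 0.30, 0.35], 5: [0.20, 0.25, 0.30]}

params = [(k, st, sc)
          for k in K_values
          for st, sc in product(sigma_T, sigma_C[k])]
print(len(params))    # 18 parameter combinations

datasets = [p for p in params for _ in range(3)]  # 3 replicates each
print(len(datasets))  # 54 lineup data sets
```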

Experimental Design - Aesthetics

10 Aesthetics \(\times\) 54 data sets = 540 plots

Experimental Design

  • 1201 participants from Mechanical Turk

  • Each participant evaluates 10 plots (12,010 evaluations)

    • Each \(\sigma_C \times \sigma_T\) value with one replicate, randomized across \(K\) values
    • All 10 aesthetic types
  • Participants select the plot or plots which are most different

    • Provide a short explanation
    • Rate confidence level

Results

Most participants identified a mix of cluster and trend targets

Results

Faceoff Model

  • Examine trials in which participants identified at least one target: 9959 trials

  • Compare P(select cluster target) to P(select trend target)

\[C_{ijk} := \left\{\begin{array}{c}\text{Participant }k\text{ selects the cluster target }\\ \text{for dataset }j\text{ with aesthetic }i\end{array}\right\}\]

Faceoff Model

\[\text{logit} P(C_{ijk}|C_{ijk}\cup T_{ijk}) = \mathbf{W}\alpha + \mathbf{X}\beta + \mathbf{J}\gamma + \mathbf{K}\eta\]

  • \(\alpha\): vector of fixed effects describing data parameters \(\sigma_C,\sigma_T, K\)

  • \(\beta\): vector of fixed effects describing aesthetics \(1 \leq i \leq 10\)

  • \(\gamma_j\): random effect of dataset, \(\gamma_j\sim N(0, \sigma^2_{\text{data}})\)

  • \(\eta_k\): random effect of participant, \(\eta_k\sim N(0, \sigma^2_{\text{participant}})\)

  • \(\epsilon_{ijk}\): error associated with single evaluation of plot \(ij\) by participant \(k\), \(\epsilon_{ijk}\sim N(0, \sigma^2_e)\)
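On the logit scale the linear predictor sums the fixed and random effects; the inverse logit maps it back to P(select cluster target | a target was selected). A minimal numeric sketch (the effect values are made up for illustration, not fitted estimates):

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor
    to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Illustrative (made-up) effects: data parameters, aesthetic,
# dataset random effect, participant random effect.
alpha_x, beta_x, gamma_j, eta_k = 0.4, 0.8, -0.1, 0.2
p_cluster = inv_logit(alpha_x + beta_x + gamma_j + eta_k)
print(round(p_cluster, 3))  # ≈ 0.786: this combination favors the cluster target
```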

Faceoff Model

Responses: Plain plots

Responses: Trend

Responses: Color

Responses: Color + Ellipse

Participant Reasoning

Some of the null plots were missing an ellipse: we failed to enforce group-size constraints on the k-means algorithm.

Conclusion

Making Good Charts

  • Plot aesthetics matter
    • non-additive effects
    • what do you want to emphasize?
  • Multiple encoding is useful:
    “show the data” in a way that makes it easy to understand
  • Lineups are powerful tools for understanding graphical perception
  • Our perception is more complicated than most statistical models
    (and also, hard to trick/evade!)