What Makes a Good Graph?

Graphical Testing and Principles for Graph Design

Susan Vanderplas

2023-03-27

Introductions

PhD/MS in Statistics from Iowa State
BS in Applied Math and Psychology from Texas A&M
Research Areas
- Forensic Science - automated algorithms for pattern evidence
- Data Science - automation, data pipelines, tools
- Visualization - experimental evaluation of graphics, new methods
Fundamental Goals
- Make visualizations designed to leverage human perception to understand data (and communicate about data)
- Design algorithms to mimic people’s capabilities (vision/perception)

Why do we use Visualizations?

QR code link to https://pollev.com/susanvanderp753

The Good, The Bad, and the Ugly

The Good

The Bad

There are also many global versions of this map, showing traffic to English-language websites, that are indistinguishable from maps of where internet users who are native English speakers live.

The Ugly

Top 500 Supercomputers by Processor Family. Moxfyre, CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons

Statistical Visualizations

Statistics and Charts

A statistic is a quantity computed from values in a sample used for a statistical purpose
Source: Wikipedia

\[\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i\]

A chart is a graphical representation for data visualization, in which the data is represented by symbols
Source: Wikipedia

By DanielPenfield - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=9402369

Charts are computed from values in a sample (usually) and used for a statistical purpose

So… charts are statistics!

Testing Statistics: Example

Data from Tidy Tuesday, 2020-09-01

Testing Statistics

If charts are statistics, then what is the reference distribution?
What constitutes an “extreme” or “significant” chart?

Hypothesis testing:

take a sample
calculate a test statistic
compare test statistic to reference distribution (formed by \(H_0\))
if it is unlikely, reject null hypothesis

Graphical hypothesis testing:

take a sample
create a test statistic/graph
compare graph to a reference distribution of other graphs generated under \(H_0\)
if test graph “stands out” then reject null hypothesis
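The graphical testing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the slides: it assumes the null hypothesis is "no association between x and y", so null panels are generated by permuting y. The function name `make_lineup` and the example data are hypothetical.

```python
import random

def make_lineup(x, y, n_panels=20, seed=None):
    """Build a lineup: one panel of real data hidden among null panels.

    Assumption: H0 is "no x-y association", so each null panel is made
    by permuting y. Other null hypotheses need other generators.
    Returns (panels, true_position) with a 0-based true_position.
    """
    rng = random.Random(seed)
    true_position = rng.randrange(n_panels)  # slot that holds the real data
    panels = []
    for i in range(n_panels):
        if i == true_position:
            panels.append((list(x), list(y)))
        else:
            y_null = list(y)
            rng.shuffle(y_null)  # break any relationship between x and y
            panels.append((list(x), y_null))
    return panels, true_position

# Hypothetical data with a clear upward trend
noise = random.Random(1)
x = list(range(30))
y = [2.0 * xi + noise.gauss(0, 3) for xi in x]

panels, true_position = make_lineup(x, y, seed=42)
```

Each entry of `panels` would then be drawn with the same plot type and aesthetics, and viewers asked which panel looks different.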

Testing Statistics: Example

Now, this is a very easy example, because we all know there’s been a massive increase in crop yields over the last 50 years. But, you can see how this paradigm is powerful - you can easily tell which plot is “different”, and if I ask 20 different people to evaluate it, I’d wager at least 19 would say “plot 15 is different”.

You’ll also notice that we didn’t have to ask anything statistical in nature - all of our hypotheses are embedded into the statistical lineup through the generation of the other 19 plots. We call these lineups because they’re similar to the criminal procedure of the same name.

The powerful part of this is that we can test for statistical significance of data even in cases where the effect is very subtle, or not easily mathematically quantified, as long as we can generate realistic “null” data through a reasonable mechanism.
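The "19 out of 20 people" intuition can be turned into a rough p-value. The sketch below rests on a simplifying assumption that the talk does not state: under \(H_0\), each viewer independently picks one of the 20 panels uniformly at random, so the number of viewers who pick the data panel is Binomial(viewers, 1/20). The function name `lineup_p_value` is hypothetical.

```python
from math import comb

def lineup_p_value(n_viewers, n_correct, n_panels=20):
    """P(at least n_correct of n_viewers pick the data panel) under H0.

    Assumes independent viewers who each choose exactly one panel
    uniformly at random when the data panel is indistinguishable.
    """
    p = 1.0 / n_panels
    return sum(
        comb(n_viewers, k) * p**k * (1 - p) ** (n_viewers - k)
        for k in range(n_correct, n_viewers + 1)
    )

# 19 of 20 viewers pick the data panel in a 20-panel lineup
p_value = lineup_p_value(n_viewers=20, n_correct=19)
```

With a single viewer who picks the data panel, this reduces to 1/20 = 0.05, which is why a 20-panel lineup is a natural default.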

Testing Statistics

The plot is a statistical lineup
The method is visual inference (a graphical hypothesis test)
Many factors influence the results
- the data
- the plot type
- the plot aesthetics (color, shape, etc.)
- extra statistical features (trend lines, error bars)

How much do graphical features matter?

Which plot(s) are the most different?

Which plots are the most different?

31 Evaluations

Panel	% selected
12	9.7%
5	29.0%
18	32.3%
Other	29.1%

22 Evaluations

Panel	% selected
12	59.1%
5	9.1%
18	–
Other	31.7%

Two-Target Lineups

Modify lineup protocol for tests of competing hypotheses \(H_1\) and \(H_2\)
- \(H_1\) and \(H_2\) target plots
- 18 null plots generated using a mixture model consistent with \(H_0\)

Experimental Design - Parameters

\(K = 3, 5\) clusters
\(N = 15 K\) points
\(\sigma_T = 0.25, 0.35, 0.45\) (variability around the trend line)
\(\sigma_C = \begin{cases}0.25, 0.30, 0.35 & (K = 3)\\ 0.20, 0.25, 0.30 & (K = 5)\end{cases}\) (variability around the cluster centers)
\(\lambda = 0.5\) (mixture parameter)

18 combinations of plot parameters (2 values of \(K\) \(\times\) 3 values of \(\sigma_T\) \(\times\) 3 values of \(\sigma_C\))

3 replicates of each parameter set = 54 total lineup data sets
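The counts above can be checked by enumerating the design. This is a small sketch using the parameter values listed on this slide; the dictionary layout and names such as `parameter_grid` are illustrative choices, not the code used in the study.

```python
from itertools import product

SIGMA_T = [0.25, 0.35, 0.45]
# sigma_C levels depend on the number of clusters K
SIGMA_C = {3: [0.25, 0.30, 0.35], 5: [0.20, 0.25, 0.30]}
N_REPLICATES = 3
N_AESTHETICS = 10

def parameter_grid():
    """Enumerate every (K, sigma_T, sigma_C) combination in the design."""
    grid = []
    for k, sigma_c_levels in SIGMA_C.items():
        for sigma_t, sigma_c in product(SIGMA_T, sigma_c_levels):
            grid.append(
                {"K": k, "N": 15 * k, "sigma_T": sigma_t,
                 "sigma_C": sigma_c, "lambda": 0.5}
            )
    return grid

grid = parameter_grid()
# One lineup data set per parameter combination and replicate
datasets = [dict(params, replicate=r)
            for params in grid
            for r in range(1, N_REPLICATES + 1)]
n_plots = len(datasets) * N_AESTHETICS
```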

Experimental Design - Aesthetics

10 Aesthetics \(\times\) 54 data sets = 540 plots

Experimental Design

1201 participants from Mechanical Turk
Each participant evaluates 10 plots (12010 evaluations)
- Each \(\sigma_C \times \sigma_T\) value with one replicate, randomized across \(K\) values
- All 10 aesthetic types
Participants select the plot or plots which are most different
- Provide a short explanation
- Rate confidence level

Results

Most participants identified a mix of cluster and trend targets

Results

Faceoff Model

Examine trials in which participants identified at least one target: 9959 trials
Compare P(select cluster target) to P(select trend target)

\[C_{ijk} := \left\{\begin{array}{c}\text{Participant }k\text{ selects the cluster target }\\ \text{for dataset }j\text{ with aesthetic }i\end{array}\right\}\]

\(T_{ijk}\) is the analogous event for the trend target.

Faceoff Model

\[\text{logit} P(C_{ijk}|C_{ijk}\cup T_{ijk}) = \mathbf{W}\alpha + \mathbf{X}\beta + \mathbf{J}\gamma + \mathbf{K}\eta\]

\(\alpha\): vector of fixed effects describing data parameters \(\sigma_C,\sigma_T, K\)
\(\beta\): vector of fixed effects describing aesthetics \(1 \leq i \leq 10\)
\(\gamma_j\): random effect of dataset, \(\gamma_j\sim N(0, \sigma^2_{\text{data}})\)
\(\eta_k\): random effect of participant \(\eta_k\sim N(0, \sigma^2_{\text{participant}})\)
\(\epsilon_{ijk}\): error associated with single evaluation of plot \(ij\) by participant \(k\), \(\epsilon_{ijk}\sim N(0, \sigma^2_e)\)
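Because the model is written on the logit scale, reading an effect as a probability requires the inverse logit. The sketch below shows that conversion for a single evaluation. The function names and the numeric effect sizes are hypothetical and are not estimates from the study.

```python
import math

def inv_logit(eta):
    """Map a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def p_cluster_given_target(data_effect, aesthetic_effect,
                           dataset_effect=0.0, participant_effect=0.0):
    """P(cluster target chosen | some target chosen) for one evaluation.

    The arguments stand in for the W*alpha, X*beta, gamma_j and eta_k
    terms of the model; all values passed below are made up.
    """
    eta = data_effect + aesthetic_effect + dataset_effect + participant_effect
    return inv_logit(eta)

# Hypothetical comparison: an aesthetic that favors clusters vs. one that favors trends
p_favor_cluster = p_cluster_given_target(0.0, 1.0)
p_favor_trend = p_cluster_given_target(0.0, -1.0)
```

A linear predictor of zero corresponds to a probability of 0.5, meaning the cluster and trend targets are equally likely to be chosen.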

Faceoff Model

Overall, when multiple aesthetics emphasize point similarity over linear continuity, participants are more likely to identify the cluster target than the trend target. When viewing graphical displays, the Gestalt principle of similarity tends to dominate over the Gestalt principle of continuity.

Interestingly, the two conflict conditions produce opposite effects: with color, ellipse, trend, and error, participants are significantly more likely to select the trend target; with color and trend, participants are significantly more likely to select the cluster target. This may be because the ellipses in the linear target are all in a line, which, when combined with the error bands, may serve to further highlight the continuity of the points. This effect is related to the Gestalt principle of common region, which, when combined with the continuity effect, would strengthen the likelihood of participant detection of the trend target plot. When the ellipses and error bands are not present, the principle of common region is not recruited and the similarity effect dominates over the continuity effect created by the trend line.

In addition, while not shown here, the estimates for \(\alpha_C\) and \(\alpha_T\), the components of \(\alpha\) associated with \(\sigma_C\) and \(\sigma_T\), are highly significant: as variability increases, the strength of the target’s signal decreases and the probability of detecting the corresponding target also decreases.

Responses: Plain plots

We asked participants to briefly describe their reasoning for choosing a specific plot or plots. We removed stopwords and “stemmed” answers so that “groups”, “group”, “grouping”, “grouped” are all the same word, then plotted the words used in the reasoning as wordclouds, where the size of the word is proportional to the frequency of its appearance in the participant explanations.

These 3 wordclouds show participant reasoning for participants who selected null plots, cluster targets, and trend targets, respectively, when shown a lineup with no additional aesthetics. Participants who selected cluster targets were clearly concerned with the clustered nature of the points; participants who selected the trend target were more concerned with the linear relationship between x and y. Participants who selected null plots were concerned with variability and outliers. As we didn’t give the participants guidance on what feature to judge “different” by, it is not surprising that some selected features other than clustering and linear trends.
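The text preprocessing behind the wordclouds (lowercase, drop stopwords, stem, count) can be sketched as follows. The stopword list and the suffix-stripping `crude_stem` function are deliberately simple stand-ins, assumed here for illustration; the actual analysis presumably used a standard stemmer, and the example explanations are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the study)
STOPWORDS = {"the", "a", "an", "are", "is", "in", "of", "and", "it", "to"}
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_frequencies(explanations):
    """Count stemmed, non-stopword tokens across all explanations."""
    counts = Counter()
    for text in explanations:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                counts[crude_stem(word)] += 1
    return counts

# Invented participant explanations
explanations = [
    "The points are grouped",
    "Grouping is tighter",
    "distinct groups",
    "one group",
]
freq = word_frequencies(explanations)
```

In a wordcloud, each word would then be drawn with size proportional to its count in `freq`.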

Responses: Trend

Responses: Color

Responses: Color + Ellipse

When Color and Ellipses are shown, it is clear that participants who selected the cluster targets did so because the clusters were “distinct” or “didn’t touch” or didn’t overlap. Separation and space, as well as cluster size/cohesion are also commonly expressed sentiments.

The addition of ellipses also emphasizes the linear trend target more than we had initially hoped: in many cases, a line of ellipses serves to create a sense of continuity due to the spatial arrangement of the ellipses. In most cases, though, this effect was minor compared to the visual weight of the small, compact clustered groups of points surrounded by an ellipse.

At the beginning of the results section, we talked about how plots with ellipse aesthetics were more likely to lead to null plot selection. Participant explanations provide some additional insight: Words like “one”, “two”, “oval”, and “missing” provide clues as to what exactly was so different about some of these plots. I’ve reproduced a plot on the next slide to demonstrate this effect.

Participant Reasoning

Some of the null plots were missing an ellipse: we did not enforce group-size constraints in the k-means algorithm.

The addition of the ellipse aesthetic highlighted the fact that some groups had as few as one or two points. These groups were assigned using the k-means algorithm, which doesn’t by default have a constraint on the sizes of groups. With 540 plots and 10800 panels to proofread before running the experiment, we missed the fact that some null plots were missing an ellipse, as a group needs at least 3 points to have a valid bounding ellipse.
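A check like the one we were missing is easy to automate. The sketch below flags groups that are too small to draw an ellipse for, given the group labels assigned to one panel. The function names and the example labels are hypothetical; the 3-point minimum comes from the bounding-ellipse requirement described above.

```python
from collections import Counter

MIN_POINTS_FOR_ELLIPSE = 3  # a group needs at least 3 points for a valid ellipse

def undersized_groups(labels, n_groups, min_size=MIN_POINTS_FOR_ELLIPSE):
    """Return {group: size} for every group with fewer than min_size points.

    Groups that received no points at all are reported with size 0.
    """
    counts = Counter(labels)
    return {g: counts.get(g, 0)
            for g in range(n_groups)
            if counts.get(g, 0) < min_size}

def panel_is_valid(labels, n_groups):
    """True when every group in the panel can have an ellipse drawn."""
    return not undersized_groups(labels, n_groups)

# Hypothetical labels for a K = 3 panel of 15 points: group 2 has one point
labels = [0] * 7 + [1] * 7 + [2] * 1
problems = undersized_groups(labels, n_groups=3)
```

Running this over every panel before the experiment would reject or regenerate any null plot that fails the check, instead of relying on proofreading 10800 panels by eye.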

It’s safe to say that there was a strong effect of the ellipse aesthetic; however, that effect did not induce participants to act in the hypothesized manner by selecting the target plots. Working with humans is difficult sometimes.

We re-ran this experiment with more even null plot cluster allocation, but hit an additional snag: participants cued in on the odd cluster shapes generated from the mixture model. They were still using the ellipses, but once again, the features participants made their decision on weren’t the features we expected them to examine.

To me, this means that visual inference is not only incredibly powerful for assessing single plots and for designed experiments like this one; it’s also a good way to test data when you aren’t entirely sure which effect is most important. By manipulating the plot, using the lineup procedure, and asking for feedback, you can crowdsource which features people cue in on and determine which features are most critical to model effectively.

Conclusion

Making Good Charts

Plot aesthetics matter
- non-additive effects
- what do you want to emphasize?

Multiple encoding is useful -
“show the data” in a way that makes it easy to understand

Lineups are powerful tools for understanding graphical perception

Our perception is more complicated than most statistical models
(and also, hard to trick/evade!)