class: center, middle, inverse, title-slide # Visual Statistics ## Communication and Graphical Testing ### Susan Vanderplas ### March 16, 2021 --- ## Introductions - PhD/MS in Statistics from Iowa State - BS in Applied Math and Psychology from Texas A&M - Research Areas - Forensic Science - automated algorithms for pattern evidence - Data Science - automation, data pipelines, tools - Visualization - Fundamental Goals - Make visualizations designed to leverage human perception to understand data (and communicate about data) - Design algorithms to mimic people's capabilities (vision/perception) --- class:middle ## Why do we use Visualizations? .large.center[https://pollev.com/susanvanderp753] <!-- <iframe src="https://pollev-embeds.com/discourses/dFoDVGNn2jsoUu9hww9D9/respond" width="800px" height="500px"></iframe> --> --- ## Why do we use Visualizations? <iframe src="https://embed.polleverywhere.com/discourses/NFICXszwVrNRqLfxPe5fy?controls=none&short_poll=true" width="800px" height="500px"></iframe> --- class:inverse ## The Good, the Bad, and the Ugly ![The Hertzsprung-Russell Diagram](../2021-BSE-GSA-Graphics/index_files/figure-html/HR-diagram-1.png) ??? When drawn well, statistical graphics help us understand our data and see things that we didn't know were there. The relationship between star magnitude, star color, and spectral class wasn't well understood until someone created a chart like this that showed the color index (or spectral class) against the absolute brightness. Then, it was much easier to see that the chart described a life-cycle - stars start out in the main sequence, and then become giants, dwarfs, or slowly change spectral class over time as they cool down in temperature. Well designed graphs can help us understand the natural phenomenon behind the raw numerical data we've collected. --- class:inverse ## The Good, the Bad, and the Ugly <iframe src="https://xkcd.com/1138/#comic" width="800px" height="560px"/> ??? Of course, even decent visualizations can't compensate for lousy data... --- class:inverse ## The Good, the Bad, and the Ugly ![Top 500 Supercomputers by Processor Family](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ef/Processor_families_in_TOP500_supercomputers.svg/1280px-Processor_families_in_TOP500_supercomputers.svg.png) ??? As with anything, graphics require a combination of "art" and "science" - you not only have to use the best method to display the data (which this isn't, necessarily), you also have to use some judgment as to how to show what you're hoping to show... and this is a good example of what happens when that doesn't happen. --- class:middle,center,inverse # Statistical Visualizations ??? I'm going to start out talking about some basic definitions, and then I'll show you one of my favorite ways that we use to determine how well statistical charts work. Hopefully, during this talk, I'll also convince you that there are other reasons beyond display to use statistical graphics. --- ## Statistics and Charts .pull-left[ A **statistic** is a quantity computed from values in a sample used for a statistical purpose .small[Source: [Wikipedia](https://en.wikipedia.org/wiki/Statistic)] ] .pull-right[ A **chart** is a graphical representation for data visualization, in which the data is represented by symbols .small[Source: [Wikipedia](https://en.wikipedia.org/wiki/Chart)] ] <br/><br/> -- .center.large[But, charts are computed from values in a sample (usually) and used for a statistical purpose] <br/><br/> -- .xx.center.emph.cerulean[So... charts **are** statistics!] --- ## Testing Statistics: Example .bottom[Data from [Tidy Tuesday, 2020-09-01](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-01/readme.md)] <img src="index_files/figure-html/unnamed-chunk-2-1.png" width="100%" /> ??? Suppose we have this data, and we want to determine whether there is a relationship between Maize production and time. If we were testing this statistically, we'd fit a linear regression to the data, and test to see whether the slope is equal to 0, right? We can do that graphically as well: if there is no relationship, then it doesn't matter which Y value is assigned to which X value - we can just shuffle the production values independently of time. --- ## Testing Statistics - If statistics are charts, then what is the reference distribution? - What constitutes an "extreme" or "significant" chart? .pull-left[ Hypothesis testing: - take a sample - calculate a test statistic - compare test statistic to reference distribution (formed by `\(H_0\)`) - if it is unlikely, reject null hypothesis ].pull-right[ Graphical hypothesis testing: - take a sample - create a test statistic/graph - compare graph to a reference distribution of other graphs generated under `\(H_0\)` - if test graph "stands out" then reject null hypothesis ] ??? Using the hypothesis that there is no relationship between X and Y, we can create a reference distribution of other graphs generated by shuffling the Y values. Do you think you'll be able to pick out the reference chart? --- ## Testing Statistics: Example <img src="index_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" /> ??? Now, this is a very easy example, because we all know there's been a massive increase in crop yields over the last 50 years. But, you can see how this paradigm is powerful - you can easily tell which plot is "different", and if I ask 20 different people to evaluate it, I'd wager at least 19 would say "plot 15 is different". You'll also notice that we didn't have to ask anything statistical in nature - all of our hypotheses are embedded into the statistical lineup through the generation of the other 19 plots. We call these lineups because they're similar to the criminal procedure of the same name. The powerful part of this is that we can test for statistical significance of data even in cases where the effect is very subtle, or not easily mathematically quantified, as long as we can generate realistic "null" data through a reasonable mechanism. --- ## Testing Statistics .pull-left[ - The plot is a **statistical lineup** - The method is **visual inference** (a graphical hypothesis test) - Many factors influence the results - the data - the plot type - the plot aesthetics .small[(color, shape, etc.)] - extra statistical features .small[trend lines, error bars] ] .pull-right[ ![](index_files/figure-html/unnamed-chunk-3-1.png) ] --- class:middle,center,inverse # How much do graphical features matter? ??? Now I'm going to talk a bit about a specific experiment I ran to determine which features matter the most when creating a plot. There is not much guidance on how to create plots that is experimentally driven, so I wanted to try to lay a foundation for scientifically validated plot creation guidelines. To talk about this, though, I'll need a bit of help. --- ## Which plot(s) are the most different? <img src="figure/set-48-k-5-sdline-0.45-sdgroup-0.25-TREND.png" style="display:block;margin-left:auto;margin-right:auto;" width="70%"/> --- ## Which plot(s) are the most different? <img src="figure/set-48-k-5-sdline-0.45-sdgroup-0.25-COLOR.png" style="display:block;margin-left:auto;margin-right:auto;" width="70%"/> --- ## Which plots are the most different? .pull-left[ <img src="figure/set-48-k-5-sdline-0.45-sdgroup-0.25-COLOR.png" style="display:block;margin-left:auto;margin-right:auto;" width="80%"/> .center[31 Evaluations] Panel | % selected ----: | -------: 12 | 9.7% 5 | 29.0% 18 | 32.3% Other | 29.1% ].pull-right[ <img src="figure/set-48-k-5-sdline-0.45-sdgroup-0.25-TREND.png" style="display:block;margin-left:auto;margin-right:auto;" width="80%"/> .center[22 Evaluations] Panel | % selected ----: | -------: 12 | 59.1% 5 | 9.1% 18 | -- Other | 31.7% ] ??? These plots are from the study I'm going to talk about, but just as a proof of concept - both of these plots contain the same data. However, which plot was selected as "interesting" varies massively. So we can determine that the data isn't the part that matters here - the way the plot is designed matters. I'll talk about how we designed the study and then I'll show you more conclusive results. --- ## Two-Target Lineups .pull-left[ - Modify lineup protocol for tests of competing hypotheses `\(H_1\)` and `\(H_2\)` - `\(H_1\)` and `\(H_2\)` target plots - 18 null plots generated using a mixture model consistent with `\(H_0\)` ].pull-right[ <img src="figure/set-48-k-5-sdline-0.45-sdgroup-0.25-TREND.png" style="display:block;margin-left:auto;margin-right:auto;" width="100%"/> ] ??? To generate this data, we used a mixture model - so we generated data from a cluster model, and data from a linear model, and then to create the null plots, we selected points randomly from each model type. I'm going to skip over how we designed the simulation model, because we don't have time to get into the details, but it was a fun ride. --- ## Experimental Design - Parameters - `\(K = 3, 5\)` clusters - `\(N = 15 K\)` points - `\(\sigma_T = 0.25, 0.35, 0.45\)` (variability around the trend line) - `\(\sigma_C = \begin{array}{cc}0.25, 0.30, 0.35 (K = 3)\\0.20, 0.25, 0.30 (K = 5)\end{array}\)` (variability around the cluster centers) - `\(\lambda = 0.5\)` (mixture parameter) <br/><br/> .center[ 18 combinations of plot parameters ( `\(2K \times 3\sigma_T \times 3\sigma_C\)` ) 3 replicates of each parameter set = 54 total lineup data sets ] --- ## Experimental Design - Aesthetics <img src="index_files/figure-html/aes-combo-1.png" width="90%" /> 10 Aesthetics `\(\times\)` 54 data sets = 540 plots --- ## Experimental Design - 1201 participants from Mechanical Turk - Each participant evaluates 10 plots (12010 evaluations) - Each `\(\sigma_C \times \sigma_T\)` value with one replicate, randomized across `\(K\)` values - All 10 aesthetic types - Participants select the plot or plots which are most different - Provide a short explanation - Rate confidence level ??? The experiment is set up as a balanced incomplete block design, where each participant sees all 10 aesthetic types, and 9 distinct variability combinations, randomized across number of groups. Participants are instructed to select the plots which are most different, and then asked to briefly explain their reasoning and rate their confidence in their answer. --- ## Results <img src="index_files/figure-html/unnamed-chunk-4-1.png" width="80%" /> Most participants identified a mix of cluster and trend targets ??? We expect that there will be more cluster plot identifications than trend plot identifications, as there are more aesthetic combinations which emphasize clustering. This plot shows that most participants identified some cluster and some trend target plots; that is, that participants were sensitive to the different aesthetics in the charts. --- ## Results <img src="index_files/figure-html/unnamed-chunk-5-1.png" width="80%" /> ??? This plot shows the aesthetic types and proportion of target plot selections. Participants were more likely to choose cluster targets across plot aesthetic types. When ellipses were present, there were many more participants selecting null plots; we will discuss why this occurred at the end of the presentation. --- ## Faceoff Model - Examine trials in which participants identified at least one target: 9959 trials - Compare P(select cluster target) to P(select trend target) `$$C_{ijk} := \left\{\begin{array}{c}\text{Participant }k\text{ selects the cluster target }\\ \text{for dataset }j\text{ with aesthetic }i\end{array}\right\}$$` ??? We fit several models to the participant responses, but the most interesting of these models is a conditional model which analyzes only trials in which participants selected one of the target plots. We compare the probability that participants select the cluster target to the probability that participants select the trend target, modeling the implicit choice between the two competing data generating models. --- ## Faceoff Model `$$\text{logit} P(C_{ijk}|C_{ijk}\cup T_{ijk}) = \mathbf{W}\alpha + \mathbf{X}\beta + \mathbf{J}\gamma + \mathbf{K}\eta$$` - `\(\alpha\)`: vector of fixed effects describing data parameters `\(\sigma_C,\sigma_T, K\)` - `\(\beta\)`: vector of fixed effects describing aesthetics `\(1 \leq i \leq 10\)` - `\(\gamma_j\)`: random effect of dataset, `\(\gamma_j\sim N(0, \sigma^2_{\text{data}})\)` - `\(\eta_k\)`: random effect of participant `\(\eta_k\sim N(0, \sigma^2_{\text{participant}})\)` - `\(\epsilon_{ijk}\)`: error associated with single evaluation of plot `\(ij\)` by participant `\(k\)`, `\(\epsilon_{ijk}\sim N(0, \sigma^2_e)\)` ??? The faceoff model is a logistic mixed effects model, with fixed effects describing the effects of data parameters sigma C, sigma T, and K, as well as the effects of the aesthetics. We include random effects to describe dataset and participant variability, as well as the error associated with each plot identification by a single participant. All errors are orthogonal due to the study design and are assumed to be iid. --- ## Faceoff Model <img src="index_files/figure-html/unnamed-chunk-7-1.png" width="75%" /> ??? Overall, when there are multiple aesthetics which emphasize point similarity over linear continuity, participants are more likely to identify the cluster target relative to the trend target. When viewing graphical displays, the Gestalt principal of similarity tends to dominate over the gestalt principal of Interestingly, the two conflict conditions produce opposite effects - with color, ellipse, trend, and error, participants are significantly more likely to select the trend target; with color and trend, participants are significantly more likely to select the cluster target. This may be because the ellipses in the linear target are all in a line, which, when combined with the error bands may serve to further highlight the continuity of the points. This effect is related to the gestalt principal of common region, which, when combined with the continuity effect would strengthen the likelihood of participant detection of the trend target plot. When the ellipses and error bands are not present, the principal of common region is not recruited and the similarity effect dominates over the continuity effect created by the trend line. In addition, while not shown here, the estimates for parameters alpha_C and alpha_T are highly significant: as variability increases, the strength of the target's signal decreases and the probability of detecting the corresponding target also decreases. --- ## Responses: Plain plots <img src="index_files/figure-html/reasoning-plain-1.png" width="100%" /> ??? We asked participants to briefly describe their reasoning for choosing a specific plot or plots. We removed stopwords and "stemmed" answers so that "groups", "group", "grouping", "grouped" are all the same word, then plotted the words used in the reasoning as wordclouds, where the size of the word is proportional to the frequency of its appearance in the participant explanations. These 3 wordclouds show participant reasoning for participants who selected null plots, cluster targets, and trend targets, respectively, when shown a lineup with no additional aesthetics. Participants who selected cluster targets were clearly concerned with the clustered nature of the points; participants who selected the trend target were more concerned with the linear relationship between x and y. Participants who selected null plots were concerned with variability and outliers. As we didn't give the participants guidance on what feature to judge "different" by, it is not surprising that some selected features other than clustering and linear trends. --- ## Responses: Trend <img src="index_files/figure-html/reasoning-trend-1.png" width="100%" /> ??? When a trend line aesthetic is added to the plots, the reasoning changes somewhat for participants identifying cluster target and null plots. Participants who selected cluster targets talk about the separation (presumably, the linear trend line separating the groups) between lines, points, and groups of points; participants who selected null plots talk about the outliers, angle of the trend line, and spread of points around the line. --- ## Responses: Color <img src="index_files/figure-html/reasoning-color-1.png" width="100%" /> ??? When points are colored by group, participants who select null plots use the color of the points to highlight the outliers which they are using as cues. The logic for selecting the cluster or trend target plots does not change much, but the reasoning of those who selected null plots demonstrates that the color aesthetic facilitates description of outlying points. Even though participants selected the wrong plot, in many cases they still used logic indicating that the gestalt organization of the plot through color cues is quite powerful. --- ## Responses: Color + Ellipse <img src="index_files/figure-html/reasoning-colorellipse-1.png" width="100%" /> ??? When Color and Ellipses are shown, it is clear that participants who selected the cluster targets did so because the clusters were "distinct" or "didn't touch" or didn't overlap. Separation and space, as well as cluster size/cohesion are also commonly expressed sentiments. The addition of ellipses does also emphasize the linear trend target more than we had initially hoped - in many cases, a line of ellipses serves to create a sense of continuity due to the spatial arrangement of the ellipses. In most cases, though, this effect was minor compared to the visual weight of the small, compact clustered groups of points surrounded by an ellipse. At the beginning of the results section, we talked about how plots with ellipse aesthetics were more likely to lead to null plot selection. Participant explanations provide some additional insight: Words like "one", "two", "oval", and "missing" provide clues as to what exactly was so different about some of these plots. I've reproduced a plot on the next slide to demonstrate this effect. --- ## Participant Reasoning .center[<img src="figure/89903c2363590827b167328f09a9f076.png" width = "70%" class="one-col-image" style="margin-bottom:auto;"/>] -- Some of the null plots were missing an ellipse - We failed to enforce group size constraints on k-means algorithm. ??? The addition of the ellipse aesthetic highlighted the fact that some groups had as few as one or two points. These groups were assigned using the k-means algorithm, which doesn't by default have a constraint on the sizes of groups. With 540 plots and 10800 panels to proofread before running the experiment, we missed the fact that some null plots were missing an ellipse, as a group needs at least 3 points to have a valid bounding ellipse. It's safe to say that there was a strong effect of the ellipse aesthetic; however, that effect did not induce participants to act in the hypothesized manner by selecting the target plots. Working with humans is difficult sometimes. We re-ran this experiment with more even null plot cluster allocation, but hit an additional snag: participants cued in on the odd cluster shapes generated from the mixture model. They were still using the ellipses, but once again, the features participants made their decision on weren't the features we expected them to examine. To me, this means that not only is visual inference incredibly powerful for assessing single plots, and when used in designed experiments like this one; it's also a good way to test data when you aren't entirely sure what effect is most important. By manipulating the plot, using the lineup procedure, and asking for feedback, you can crowdsource what features people cue onto and determine which features are most critical to model effectively. --- ## Conclusion - Plot aesthetics matter - non-additive effects - what do you want to emphasize? -- - Multiple encoding is useful - "show the data" in a way that makes it easy to understand -- - Lineups are powerful tools for understanding graphical perception -- - Our perception is more complicated than most statistical models (and also, hard to trick/evade!) --- ## Why do we use Visualizations? <iframe src="https://embed.polleverywhere.com/discourses/NFICXszwVrNRqLfxPe5fy?controls=none&short_poll=true" width="800px" height="500px"></iframe> ??? If we return briefly to this question of why we use visualizations, I'm hoping that I've convinced you that there is one other way to use visualizations - as statistics (that are testable) to determine what we see and don't see in a chart. I've shown you one way that I use statistical lineups, but there are others, including in a traditional hypothesis testing context. Visual inference is an incredibly powerful technique for examining statistical hypothesis that are sometimes not as well specified - if you don't have a model that forces things to be normal and identically distributed, for instance, you may still be able to test that hypothesis visually, even if you can't fit a full model to the data.