A chart is good if it allows the user to draw useful conclusions that are supported by data. Obviously, this definition depends on the purpose of the chart - a simple EDA chart is going to have a different purpose than a chart showing e.g. the predicted path of a hurricane, which people will use to make decisions about whether or not to evacuate.
Unfortunately, while our visual system is amazing, it is not always as accurate as the computers we use to render graphics. We have physical limits in the number of colors we can perceive, our short term memory, attention, and our ability to accurately read information off of charts in different forms.
You’ve almost certainly noticed that some graphical tasks are easier than others. Part of the reason for this is that certain tasks require active engagement and attention to search through the visual stimulus; others, however, just “pop” out of the background. We call these features that just “pop” without active work preattentive features; technically, they are detected within the first 250ms of viewing a stimulus [1].
Take a look at Figure 21.1; can you spot the point that is different?
Color and shape are commonly used graphical features that are processed preattentively. Some people suggest using this to pack more dimensions into multivariate visualizations [2], but in general, knowing which features are processed quickly (color or shape alone) and which are processed more slowly (combinations of preattentively processed features) allows you to design a chart that requires less cognitive effort to read.
As awesome as it is to be able to use preattentive features to process information, we should not use combinations of preattentive features to show different variables. Take a look at Figure 21.2 - part (a) shows the same grouping in color and shape, part (b) shows color and shape used to encode different variables.
Here, it is easy to differentiate the points in Figure 21.2(a), because they are dual-encoded. However, it is very difficult to pick out the different groups of points in Figure 21.2(b) because the combination of preattentive features requires active attention to sort out.
Careful use of preattentive features can reduce the cognitive effort required for viewers to perceive a chart.
Encode only one variable using preattentive features, as combinations of preattentive features are not processed preattentively.
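One way to follow this guideline in code is to build a single lookup that assigns each group both a color and a marker, so the two channels can never drift apart and encode different variables. The sketch below is only an illustration: the function names `dual_encoding` and `scatter_dual` are hypothetical, the colors are taken from the Okabe-Ito palette (an assumption on my part, chosen because it is often recommended as colorblind friendly), and the markers are standard matplotlib marker codes.

```python
# Okabe-Ito colors (assumed palette choice, not from the chapter).
COLORS = ["#E69F00", "#56B4E9", "#009E73", "#0072B2", "#D55E00", "#CC79A7"]
MARKERS = ["o", "^", "s", "D", "v", "P"]  # matplotlib marker codes


def dual_encoding(groups):
    """Map each distinct group to one (color, marker) pair.

    Color and shape always change together, so both channels carry the
    same variable instead of two different ones.
    """
    distinct = sorted(set(groups))
    if len(distinct) > len(COLORS):
        raise ValueError("too many groups for this palette")
    return {g: (c, m) for g, c, m in zip(distinct, COLORS, MARKERS)}


def scatter_dual(ax, x, y, groups):
    """Draw one scatter layer per group on an existing matplotlib Axes."""
    style = dual_encoding(groups)
    for g, (color, marker) in style.items():
        xs = [xi for xi, gi in zip(x, groups) if gi == g]
        ys = [yi for yi, gi in zip(y, groups) if gi == g]
        ax.scatter(xs, ys, color=color, marker=marker, label=str(g))
    ax.legend()
    return style
```

With matplotlib installed, you would create an axes with `fig, ax = plt.subplots()` and then call `scatter_dual(ax, x, y, groups)`. Because the palette has only six entries, the function refuses to plot more groups than that rather than silently recycling colors.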
Our eyes are optimized for perceiving the yellow/green region of the color spectrum, as shown in Figure 21.3. Why? Well, our sun produces yellow light, and plants tend to be green. It’s pretty important to be able to distinguish different shades of green (evolutionarily speaking) because it impacts your ability to feed yourself. There aren’t that many purple or blue predators, so there is less selection pressure to improve perception of that part of the visual spectrum.
Not everyone perceives color in the same way. Some individuals are colorblind or color deficient [3]. We have three types of cones used for color detection, as well as cells called rods, which detect light intensity (brightness/darkness). In about 5% of the population (10% of XY individuals, <1% of XX individuals), one or more of the cone types may be missing or malformed, leading to color blindness - a reduced ability to perceive different shades. The rods, however, function normally in almost all of the population, which means that light/dark contrasts are extremely safe, while contrasts based only on the hue of the color are problematic in some instances.
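If you want a rough numerical check that two colors differ in lightness and not just in hue, one option is the contrast ratio defined in the WCAG 2.x accessibility guidelines. This is a technique I am bringing in for illustration, not something the discussion above depends on, and the function names are my own. The sketch assumes colors are given as sRGB triples of integers from 0 to 255.

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color given as 0-255 integers."""
    def linearize(channel):
        c = channel / 255
        # Undo the sRGB gamma curve to get linear light.
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio: 1 means equal lightness, about 21 is black vs. white."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


black, white = (0, 0, 0), (255, 255, 255)
red, green = (255, 0, 0), (0, 128, 0)
print(round(contrast_ratio(black, white), 1))
print(round(contrast_ratio(red, green), 1))
```

A pure red and a medium green differ a great deal in hue but very little in luminance, so their ratio is close to 1. That is the kind of pair that is risky for readers who cannot rely on hue.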
You can take a test designed to screen for colorblindness here
Your monitor may affect how you score on these tests - I am colorblind, but on some monitors, I can pass the test, and on some, I perform worse than normal. A different test is available here.
In reality, I know that I have issues with perceiving some shades of red, green, and brown. I have particular trouble with very dark or very light colors, especially when they are close to grey or brown.
In addition to colorblindness, factors other than the actual color value, such as context, are important in how we experience color.
Our brains are extremely dependent on context and make excellent use of the large amounts of experience we have with the real world. As a result, we implicitly “remove” the effect of things like shadows as we make sense of the input to the visual system. This can result in odd things, like the checkerboard and shadow shown in Figure 21.4 - because we’re correcting for the shadow, B looks lighter than A even though when the context is removed they are clearly the same shade.
There are R packages such as RColorBrewer and dichromat that have color palettes which are aesthetically pleasing, and, in many cases, colorblind friendly (dichromat is better for that than RColorBrewer). You can also take a look at other ways to find nice color palettes.
We have a limited amount of memory that we can instantaneously utilize. This mental space, called short-term memory, holds information for active use, but only for a limited amount of time.
1 4 2 2 3 9 8 0 7 8
What was the third number?
Without rehearsing the information (repeating it over and over to yourself), the “try it out” task may have been challenging. Short-term memory has a capacity of between 3 and 9 “bits” of information.
In charts and graphs, short term memory is important because we need to be able to associate information from e.g. a key, legend, or caption with information plotted on the graph. As a result, if you try to plot more than ~6 categories of information, your reader will have to shift between the legend and the graph repeatedly, increasing the amount of cognitive labor required to digest the information in the chart.
Where possible, try to keep your legends to 6 or 7 characteristics.
Implications and Guidelines
Limit the number of categories in your legends to minimize the short term memory demands on your reader.
Use colors and symbols which have implicit meaning to minimize the need to refer to the legend.
Add annotations on the plot, where possible, to reduce the need to re-read captions.
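The first guideline can be applied before plotting by collapsing infrequent categories into a single catch-all level. The helper below is a minimal sketch under my own assumptions: `lump_rare` is a hypothetical name, the default of 6 levels simply mirrors the rule of thumb above, and "Other" is an arbitrary label.

```python
from collections import Counter


def lump_rare(labels, max_levels=6, other="Other"):
    """Keep the most frequent labels and relabel the rest as `other`.

    The result has at most `max_levels` distinct values, counting `other`.
    """
    if max_levels < 1:
        raise ValueError("max_levels must be at least 1")
    labels = list(labels)
    counts = Counter(labels)
    if len(counts) <= max_levels:
        return labels  # already few enough categories
    keep = {label for label, _ in counts.most_common(max_levels - 1)}
    return [label if label in keep else other for label in labels]


species = ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d", "e"]
print(sorted(set(lump_rare(species, max_levels=3))))  # ['Other', 'a', 'b']
```

Whether lumping is appropriate depends on the chart's purpose: if the rare categories are the interesting ones, faceting or direct labeling may be a better way to keep the legend short.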
Imposing order on visual chaos.
What does Figure 21.5 look like to you?
When faced with ambiguity, our brains use available context and past experience to try to tip the balance between alternate interpretations of an image. When there is still some ambiguity, many times the brain will just decide to interpret an image as one of the possible options.
Did you see something like “3 circles, a triangle with a black outline, and a white triangle on top of that”? In reality, there are 3 angles and 3 pac-man shapes. But, it’s much more likely that we’re seeing layers of information, where some of the information is obscured (like the “mouth” of the pac-man circles, or the middle segment of each side of the triangle). This explanation is simpler, and more consistent with our experience.
Now, look at the logo for the Pittsburgh Zoo.
Do you see the gorilla and lioness? Or do you see a tree? Here, we’re not entirely sure which part of the image is the figure and which is the background.
The ambiguous figures shown above demonstrate that our brains are actively imposing order upon the visual stimuli we encounter. There are some heuristics for how this order is applied which impact our perception of statistical graphs.
The catchphrase of Gestalt psychology is
The whole is greater than the sum of the parts
That is, what we perceive and the meaning we derive from the visual scene is more than the individual components of that visual scene.
You can read about the gestalt rules here, but they are also demonstrated in the figure above.
In graphics, we can leverage the Gestalt principles of grouping to create order and meaning. If we color points by another variable, we are creating groups of similar points which assist with the perception of groups instead of individual observations. If we add a trend line, we create the perception that the points are moving “with” the line (in most cases), or occasionally, that the line is dividing up two groups of points. Depending on what features of the data you wish to emphasize, you might choose different aesthetic mappings, facet variables, and factor orders.
Suppose I want to emphasize the change in life expectancy between 1982 and 2007. For this, we’ll use the Gapminder [4] data which is found in the gapminder packages in R and Python.
I could use a bar chart (showing only 4 countries for space):
# %pip install gapminder
from gapminder import gapminder
import pandas as pd
import seaborn.objects as so
# Keep only the two years being compared
my_gap_82_07 = gapminder[gapminder.year.isin([1982, 2007])]
# Show only 4 countries for space
subdata = my_gap_82_07[my_gap_82_07.country.isin(
    ["Korea, Rep.", "China", "Afghanistan", "India"])]
# Treat year as a category so each year gets its own bar color
subdata = subdata.assign(yearFactor=pd.Categorical(subdata.year))
plot = (
    so.Plot(subdata, x="country", y="lifeExp", color="yearFactor")
    .add(so.Bar(), so.Dodge())
    .label(y="Life Expectancy")
)
plot.show()
Or, I could use a line chart
# Same 4 countries, drawn as one line per country
subdata2 = my_gap_82_07[my_gap_82_07.country.isin(
    ["Korea, Rep.", "China", "Afghanistan", "India"])]
plot = (
    so.Plot(subdata2, x="year", y="lifeExp", color="country")
    .add(so.Lines())
    .label(y="Life Expectancy")
)
plot.show()
Or, I could use a box plot
import matplotlib.pyplot as plt
import seaborn as sns
# All countries, with year as a category so each year gets its own box
subdata3 = my_gap_82_07.assign(yearFactor=pd.Categorical(my_gap_82_07.year))
sns.boxplot(data=subdata3, x="yearFactor", y="lifeExp")
plt.show()
Which one best demonstrates that in every country, the life expectancy increased?
The line segment plot connects related observations (from the same country) but allows you to assess similarity between the lines (e.g. almost all countries have positive slope). The same information goes into the creation of the other two plots, but the bar chart is extremely cluttered, and the boxplot doesn’t allow you to connect single country observations over time. So while you can see an aggregate relationship (overall, the average life expectancy increased) you can’t see the individual relationships.
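What the line chart lets you judge by eye, the sign of each country's slope, can also be computed directly. The sketch below is a hedged illustration using only the standard library: `change_by_group` is a hypothetical helper, and the toy rows are made-up values chosen to show one decreasing group, not Gapminder figures.

```python
from collections import defaultdict


def change_by_group(records, start, end):
    """Return {group: value_at_end - value_at_start} for long-format rows.

    `records` is an iterable of (group, time, value) tuples. Groups that
    lack either time point are skipped.
    """
    by_group = defaultdict(dict)
    for group, time, value in records:
        by_group[group][time] = value
    return {
        g: times[end] - times[start]
        for g, times in by_group.items()
        if start in times and end in times
    }


# Made-up toy values for illustration only (not Gapminder figures).
toy = [
    ("A", 1982, 50.0), ("A", 2007, 60.0),
    ("B", 1982, 70.0), ("B", 2007, 75.0),
    ("C", 1982, 65.0), ("C", 2007, 64.0),
]
changes = change_by_group(toy, 1982, 2007)
decreased = sorted(g for g, d in changes.items() if d <= 0)
print(decreased)  # ['C']
```

A box plot of the same rows would discard the pairing that this calculation (and the line chart) relies on, which is why it cannot show individual relationships.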
The aesthetic mappings and choices you make when creating plots have a huge impact on the conclusions that you (and others) can easily make when examining those plots (see footnote 3).
In order to read data off of a chart correctly, several things must happen in sequence:
If step 1 is not done correctly, the chart is misleading or inaccurate. However, steps 2 and 3 depend on our brains accurately perceiving and estimating information mentally. These steps can involve a lot of effort, and as mental effort increases, we tend to take shortcuts. Sometimes, these shortcuts work well, but not always.
When you design a chart, it’s good to consider what mental tasks viewers of your chart need to perform. Then, ask yourself whether there is an equivalent way to represent the data that requires fewer mental operations, or a different representation that requires easier mental calculations.
When making judgments corresponding to numerical quantities, there is an order of tasks from easiest (1) to hardest (6), with equivalent tasks at the same level (see footnote 4).
If we compare a pie chart and a stacked bar chart, the bar chart asks readers to make judgments of position on a non-aligned scale, while a pie chart asks readers to assess angle. This is one reason why pie charts tend not to be a good general option – people must compare values using area or angle instead of position or length, which is a more difficult judgment under most circumstances. When there are a limited number of categories (2-4) and you have data that is easily compared to quarters of a circle, it may be justifiable to use a pie chart over a stacked bar chart - some studies have shown that pie charts are preferable under these conditions. As a general rule, though, we have an easier time comparing position than angle or area.
When creating a chart, it is helpful to consider which variables you want to show, and how accurate reader perception needs to be to get useful information from the chart. In many cases, less is more - you can easily overload someone, which may keep them from engaging with your chart at all. Variables which require the reader to notice small changes should be shown on position scales (x, y) rather than using color, alpha blending, etc.
Consider the hierarchy of graphical tasks again. You may notice a general increase in dimensionality from 1-3 to 4 (2d) to 5 (3d). In general, showing information in 3 dimensions when 2 will suffice can be misleading. Just how misleading depends a lot on the type of chart you’re using. Most of the time, the addition of an extra dimension causes an increase in chart area allocated to the item that is disproportionate to the actual numerical value being represented.
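A little arithmetic shows how quickly the distortion grows. The sketch below assumes the simplest case, in which a glyph is scaled by the data value along every one of its dimensions. That is an assumption for illustration (real 3D chart effects vary), and `drawn_size_ratio` is a hypothetical name.

```python
def drawn_size_ratio(value_ratio: float, dimensions: int) -> float:
    """Ratio of drawn sizes when each dimension of a glyph scales with the value."""
    if dimensions < 1:
        raise ValueError("dimensions must be at least 1")
    return value_ratio ** dimensions


# A value that doubles, drawn as a bar, a scaled icon, and a scaled solid.
for dims, label in [(1, "bar length"), (2, "icon area"), (3, "solid volume")]:
    print(f"{label}: value x2 is drawn as x{drawn_size_ratio(2.0, dims):g}")
```

Under this assumption, a value that doubles takes up four times the area as a 2D icon and eight times the volume as a 3D solid, so the visual impression overstates the change.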
Ted ED: How to spot a misleading graph - Lea Gaslowitz
Business Insider: The Worst Graphs Ever
Extra dimensions and other annotations are sometimes called “chartjunk” and should only be used if they contribute to the overall numerical accuracy of the chart (i.e., they should not be purely decorative).
1. When the COVID-19 outbreak started, many maps were using white-to-red gradients to show case counts and/or deaths. The emotional association between red and blood, danger, and death may have caused people to become more frightened than what was reasonable given the available information.
2. Lisa Charlotte Rost. What to consider when choosing colors for data visualization.
3. See this paper for more details. This is the last chapter of my dissertation, for what it’s worth. It was a lot of fun. (no sarcasm, seriously, it was fun!)
4. See this paper for the major source of this ranking; other follow-up studies have been integrated, but the essential order is largely unchanged. Most of the items in this ranking were not examined in the linked paper, but are a synthesis of different experiments and conceptual knowledge in psychology as well as statistical graphics.