18  Data Visualization Basics

Published

December 17, 2024

This section is a very light overview of how you might create charts in R and python. Chapter 20 covers the topic in much more depth.

18.1 Objectives

  • Use ggplot2/plotnine to create a chart
  • Begin to identify issues with data formatting

18.2 Package Installation

You will need the plotnine (python) and ggplot2 (R) packages for this section.

install.packages("ggplot2")

To install plotnine, pick one of the following methods (you can read more about them and decide which is appropriate for you in Section 9.8.3.1).

From the system terminal:

pip3 install plotnine matplotlib

From R:

This package installation method requires that you have a virtual environment set up (that is, if you are on Windows, don’t try to install packages this way).

reticulate::py_install(c("plotnine", "matplotlib"))

From python:

In a python chunk (or the python terminal), you can run the following command. This depends on something called “IPython magic” commands, so if it doesn’t work for you, try the System Terminal method instead.

%pip install plotnine matplotlib

Once you have run this command, please comment it out so that you don’t reinstall the same packages every time.
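
If you are not sure whether the installation worked, one way to check from python is to ask whether each package can be found. This is a minimal sketch using only the standard library; the helper name is_installed is made up for this example and is not part of plotnine or pip.

```python
import importlib.util


def is_installed(package_name):
    """Return True if python can locate the named package (without importing it)."""
    return importlib.util.find_spec(package_name) is not None


# Report on the two packages this section needs
for name in ["plotnine", "matplotlib"]:
    status = "installed" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If either package is reported as missing, try one of the other installation methods above.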

18.3 First Steps

Now that you can read data into R and python and define new variables, you can create plots! Data visualization is a skill that takes a lifetime to learn, but for now, let’s start out easy and talk about how to make (basic) plots in R (with ggplot2) and in python (with plotnine, which is a ggplot2 clone).

18.3.1 Graphing HBCU Enrollment

Let’s work with Historically Black College and University enrollment.

18.3.1.1 Loading Libraries

library(ggplot2)
hbcu_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-02/hbcu_all.csv')

import pandas as pd
from plotnine import *

hbcu_all = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-02/hbcu_all.csv')

18.3.2 Making a Line Chart

ggplot2 and plotnine work with data frames.

If you pass a data frame in as the data argument, you can refer to columns in that data with “bare” column names: you don’t have to reference the full data object using df$name or df.name. Instead, use name in R or "name" (a quoted string) in python.


ggplot(hbcu_all, aes(x = Year, y = `4-year`)) + geom_line() +
  ggtitle("4-year HBCU College Enrollment")


ggplot(hbcu_all, aes(x = "Year", y = "4-year")) + geom_line() + \
  ggtitle("4-year HBCU College Enrollment")

18.3.3 Data Formatting

If your data are in the right format, ggplot2 is very easy to use; if your data aren’t formatted neatly, it can be a real pain. If you want to plot multiple lines, you need to either list each variable you want to plot, one by one, or (more likely) get your data into “long form”. We’ll learn more about how to do this type of data transformation when we talk about reshaping data.

You don’t need to know exactly how this works, but it is helpful to see the difference in the two datasets:

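library(tidyr)
hbcu_long <- pivot_longer(hbcu_all, -Year, names_to = "type", values_to = "value")

# Melt every column except Year (columns[1:11] would silently drop the last column).
# pd.melt names the new columns "variable" and "value" by default.
hbcu_long = pd.melt(hbcu_all, id_vars = ['Year'], value_vars = hbcu_all.columns[1:])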
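
Wide form (hbcu_all), first six rows:

| Year | Total enrollment | Males | Females | 4-year | 2-year | Total - Public | 4-year - Public | 2-year - Public | Total - Private | 4-year - Private | 2-year - Private |
|------|------------------|-------|---------|--------|--------|----------------|-----------------|-----------------|-----------------|------------------|------------------|
| 1976 | 222613 | 104669 | 117944 | 206676 | 15937 | 156836 | 143528 | 13308 | 65777 | 63148 | 2629 |
| 1980 | 233557 | 106387 | 127170 | 218009 | 15548 | 168217 | 155085 | 13132 | 65340 | 62924 | 2416 |
| 1982 | 228371 | 104897 | 123474 | 212017 | 16354 | 165871 | 151472 | 14399 | 62500 | 60545 | 1955 |
| 1984 | 227519 | 102823 | 124696 | 212844 | 14675 | 164116 | 151289 | 12827 | 63403 | 61555 | 1848 |
| 1986 | 223275 | 97523 | 125752 | 207231 | 16044 | 162048 | 147631 | 14417 | 61227 | 59600 | 1627 |
| 1988 | 239755 | 100561 | 139194 | 223250 | 16505 | 173672 | 158606 | 15066 | 66083 | 64644 | 1439 |

Long form (hbcu_long), first six rows:

| Year | type | value |
|------|------|-------|
| 1976 | Total enrollment | 222613 |
| 1976 | Males | 104669 |
| 1976 | Females | 117944 |
| 1976 | 4-year | 206676 |
| 1976 | 2-year | 15937 |
| 1976 | Total - Public | 156836 |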

In the long form of the data, we have a row for each data point (year x measurement type), not for each year.
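
To see why the long form has more rows, here is a small sketch using pandas. The data frame hbcu_small is a hypothetical two-year, two-column subset built by hand from the values shown above; it is not loaded from the real file.

```python
import pandas as pd

# Hypothetical mini dataset: two years and two measurement columns,
# with values copied from the wide data shown above
hbcu_small = pd.DataFrame({
    "Year": [1976, 1980],
    "Males": [104669, 106387],
    "Females": [117944, 127170],
})

# Reshape to long form: every column except Year is stacked into
# a "variable" column (the old column name) and a "value" column
hbcu_small_long = pd.melt(hbcu_small, id_vars=["Year"])

print(hbcu_small.shape)       # wide: one row per year
print(hbcu_small_long.shape)  # long: one row per year x measurement type
print(hbcu_small_long)
```

Two years times two measurement types gives four rows in the long form, compared with two rows in the wide form.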

18.3.4 Making a (Better) Line Chart

If we had wanted to show all of the available data before, we would have needed to add a separate line for each column, coloring each one manually, and then we would have wanted to create a legend manually (which is a pain). Converting the data to long form means we can use ggplot2/plotnine to do all of this for us with only a single geom_line statement. Having the data in the right form to plot is very important if you want to get the plot you’re imagining with relatively little effort.


ggplot(hbcu_long, aes(x = Year, y = value, color = type)) + geom_line() +
  ggtitle("HBCU College Enrollment")


ggplot(hbcu_long, aes(x = "Year", y = "value", color = "variable")) + geom_line() + \
  ggtitle("HBCU College Enrollment") + \
  theme(subplots_adjust={'right':0.75}) # This moves the key so it takes up 25% of the area

18.4 References