install.packages("ggplot2")
18 Data Visualization Basics
This section is intended as a very light overview of how you might create charts in R and python. Chapter 20 will be much more in depth.
18.1 Objectives
- Use ggplot2/plotnine to create a chart
- Begin to identify issues with data formatting
18.2 Package Installation
You will need the plotnine (python) and ggplot2 (R) packages for this section.
To install plotnine, pick one of the following methods. You can read more about them, and decide which is appropriate for you, in Section 9.8.3.1.
pip3 install plotnine matplotlib
The next installation method, which runs from R, requires that you have a virtual environment set up, so if you are on Windows, don’t try to install packages this way.
reticulate::py_install(c("plotnine", "matplotlib"))
In a python chunk (or the python terminal), you can run the following command. This depends on something called “IPython magic” commands, so if it doesn’t work for you, try the System Terminal method instead.
%pip install plotnine matplotlib
Once you have run this command, please comment it out so that you don’t reinstall the same packages every time.
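If you are not sure whether the installation succeeded, one way to check is to ask python whether it can locate each package. The sketch below uses only the standard library; the helper name `is_installed` is made up for this illustration and is not part of plotnine or pip.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the packages used in this section.
for name in ("plotnine", "matplotlib", "pandas"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If any package is reported as missing, rerun one of the installation methods above in the same environment that your python chunks use.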
18.3 First Steps
Now that you can read data into R and python and define new variables, you can create plots! Data visualization is a skill that takes a lifetime to learn, but for now, let’s start out easy: let’s talk about how to make (basic) plots in R (with ggplot2) and in python (with plotnine, which is a ggplot2 clone).
18.3.1 Graphing HBCU Enrollment
Let’s work with Historically Black College and University enrollment.
18.3.1.1 Loading Libraries
import pandas as pd
from plotnine import *
hbcu_all = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-02/hbcu_all.csv')
18.3.2 Making a Line Chart
ggplot2 and plotnine work with data frames.
If you pass a data frame in as the data argument, you can refer to columns in that data with “bare” column names (you don’t have to reference the full data object using df$name or df.name; you can instead use name or "name").
18.3.3 Data Formatting
If your data are in the right format, ggplot2 is very easy to use; if your data aren’t formatted neatly, it can be a real pain. If you want to plot multiple lines, you need to either list each variable you want to plot, one by one, or (more likely) get your data into “long form”. We’ll learn more about how to do this type of data transformation when we talk about reshaping data.
You don’t need to know exactly how this works, but it is helpful to see the difference in the two datasets:
library(tidyr)
hbcu_long <- pivot_longer(hbcu_all, -Year, names_to = "type", values_to = "value")
hbcu_long = pd.melt(hbcu_all, id_vars = ['Year'], value_vars = hbcu_all.columns[1:]) # melt every column except Year
Year | Total enrollment | Males | Females | 4-year | 2-year | Total - Public | 4-year - Public | 2-year - Public | Total - Private | 4-year - Private | 2-year - Private |
---|---|---|---|---|---|---|---|---|---|---|---|
1976 | 222613 | 104669 | 117944 | 206676 | 15937 | 156836 | 143528 | 13308 | 65777 | 63148 | 2629 |
1980 | 233557 | 106387 | 127170 | 218009 | 15548 | 168217 | 155085 | 13132 | 65340 | 62924 | 2416 |
1982 | 228371 | 104897 | 123474 | 212017 | 16354 | 165871 | 151472 | 14399 | 62500 | 60545 | 1955 |
1984 | 227519 | 102823 | 124696 | 212844 | 14675 | 164116 | 151289 | 12827 | 63403 | 61555 | 1848 |
1986 | 223275 | 97523 | 125752 | 207231 | 16044 | 162048 | 147631 | 14417 | 61227 | 59600 | 1627 |
1988 | 239755 | 100561 | 139194 | 223250 | 16505 | 173672 | 158606 | 15066 | 66083 | 64644 | 1439 |
Year | type | value |
---|---|---|
1976 | Total enrollment | 222613 |
1976 | Males | 104669 |
1976 | Females | 117944 |
1976 | 4-year | 206676 |
1976 | 2-year | 15937 |
1976 | Total - Public | 156836 |
In the long form of the data, we have a row for each data point (year x measurement type), not for each year.
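To make the row counts concrete, here is a small sketch that reshapes a two-year, two-measurement slice of the table above. The names `wide_df` and `long_df` are invented for this example, and `var_name="type"` is passed only so that the column names match the long table shown above (by default, pd.melt names that column `variable`).

```python
import pandas as pd

# Two years and two measurements copied from the wide table above.
wide_df = pd.DataFrame({
    "Year": [1976, 1980],
    "Males": [104669, 106387],
    "Females": [117944, 127170],
})

# Stack the measurement columns: one row per (year, measurement type) pair.
long_df = pd.melt(wide_df, id_vars=["Year"], var_name="type", value_name="value")

print(wide_df.shape)  # 2 rows (one per year), 3 columns
print(long_df.shape)  # 4 rows (2 years x 2 measurement types), 3 columns
print(long_df)
```

The wide version grows a new column for every measurement type, while the long version grows new rows and always keeps the same three columns.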
18.3.4 Making a (Better) Line Chart
If we had wanted to show all of the available data before, we would have needed to add a separate line for each column, coloring each one manually, and then we would have wanted to create a legend manually (which is a pain). Converting the data to long form means we can use ggplot2/plotnine to do all of this for us with only a single geom_line statement. Having the data in the right form to plot is very important if you want to get the plot you’re imagining with relatively little effort.
ggplot(hbcu_long, aes(x = "Year", y = "value", color = "variable")) + geom_line() + \
ggtitle("HBCU College Enrollment") + \
theme(subplots_adjust={'right':0.75}) # This moves the key so it takes up 25% of the area