18 Data Visualization Basics

Published

August 16, 2025

This section is intended as a very light overview of how you might create charts in R and python. Chapter 20 will be much more in depth.

Objectives

Use ggplot2/seaborn to create a chart
Begin to identify issues with data formatting

18.1 Package Installation

You will need the seaborn (python) and ggplot2 (R) packages for this section.

install.packages("ggplot2")

To install seaborn, pick one of the following methods. You can read more about them and decide which is appropriate for you in Section 10.3.2.1.

pip3 install seaborn matplotlib

This package installation method requires that you have a virtual environment set up; if you are on Windows, don’t try to install packages this way.

reticulate::py_install(c("seaborn", "matplotlib"))

In a python chunk (or the python terminal), you can run the following command. It relies on IPython “magic” commands, so if it doesn’t work for you, use the pip3 command in the system terminal instead.

%pip install seaborn matplotlib

Once you have run this command, please comment it out so that you don’t reinstall the same packages every time.
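If you are not sure whether the installation worked, a quick check along the following lines can help. This is only a sketch using the python standard library; the helper name missing_packages is made up for illustration, and pandas is included in the list because it is imported later in this section.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    # find_spec() returns None when a top-level package is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["seaborn", "matplotlib", "pandas"]
missing = missing_packages(required)

if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If a package shows up as missing, rerun one of the installation methods above in the same environment that your python chunks use.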

18.2 First Steps

Now that you can read data into R and python and define new variables, you can create plots! Data visualization is a skill that takes a lifetime to learn, but for now, let’s start out easy: let’s talk about how to make (basic) plots in R (with ggplot2) and in python (with seaborn, which has a similar interface).

18.2.1 Graphing HBCU Enrollment

Let’s work with Historically Black College and University enrollment.

hbcu_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-02/hbcu_all.csv')

library(ggplot2)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

hbcu_all = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-02/hbcu_all.csv')

18.2.2 Making a Line Chart

ggplot2 and seaborn work with data frames.

If you pass a data frame in as the data argument, you can refer to columns in that data with “bare” column names (you don’t have to reference the full data object using df$name or df.name; you can instead use name or "name").

R
Python

ggplot(hbcu_all, aes(x = Year, y = `4-year`)) + geom_line() +
  ggtitle("4-year HBCU College Enrollment")

plot = sns.lineplot(hbcu_all, x = "Year", y = "4-year")
plot.set_title("4-year HBCU College Enrollment")
plt.show()

18.2.3 Data Formatting

If your data are in the right format, ggplot2 is very easy to use; if your data aren’t formatted neatly, it can be a real pain. If you want to plot multiple lines, you need to either list each variable you want to plot, one by one, or (more likely) get your data into “long form”. We’ll learn more about how to do this type of data transformation when we talk about reshaping data.

You don’t need to know exactly how this works, but it is helpful to see the difference in the two datasets:

library(tidyr)
hbcu_long <- pivot_longer(hbcu_all, -Year, names_to = "type", values_to = "value")

# Melt every column except Year; pandas names the new columns "variable" and "value" by default
hbcu_long = pd.melt(hbcu_all, id_vars = ['Year'], value_vars = hbcu_all.columns[1:])

Year	Total enrollment	Males	Females	4-year	2-year	Total - Public	4-year - Public	2-year - Public	Total - Private	4-year - Private	2-year - Private
1976	222613	104669	117944	206676	15937	156836	143528	13308	65777	63148	2629
1980	233557	106387	127170	218009	15548	168217	155085	13132	65340	62924	2416
1982	228371	104897	123474	212017	16354	165871	151472	14399	62500	60545	1955
1984	227519	102823	124696	212844	14675	164116	151289	12827	63403	61555	1848
1986	223275	97523	125752	207231	16044	162048	147631	14417	61227	59600	1627
1988	239755	100561	139194	223250	16505	173672	158606	15066	66083	64644	1439

Year	type	value
1976	Total enrollment	222613
1976	Males	104669
1976	Females	117944
1976	4-year	206676
1976	2-year	15937
1976	Total - Public	156836

In the long form of the data, we have a row for each data point (year × measurement type), not for each year.
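To see the change in shape on a small scale, here is a sketch that reshapes a few values copied from the wide table above. The object names hbcu_small and hbcu_small_long are made up for illustration, and only two of the measurement columns are included.

```python
import pandas as pd

# Three years and two measurement columns copied from the wide table above
# (a small illustrative subset, not the full dataset)
hbcu_small = pd.DataFrame({
    "Year": [1976, 1980, 1982],
    "Males": [104669, 106387, 104897],
    "Females": [117944, 127170, 123474],
})

# Stack every column except Year into name/value pairs
hbcu_small_long = pd.melt(hbcu_small, id_vars=["Year"])

print(hbcu_small.shape)               # (3, 3): one row per year
print(hbcu_small_long.shape)          # (6, 3): one row per year and measurement type
print(list(hbcu_small_long.columns))  # ['Year', 'variable', 'value']
```

Note that pd.melt names the new columns variable and value unless you tell it otherwise, while the R code above chose the names type and value.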

18.2.4 Making a (Better) Line Chart

If we had wanted to show all of the available data before, we would have needed to add a separate line for each column, coloring each one manually, and then we would have wanted to create a legend manually (which is a pain). Converting the data to long form means we can use ggplot2/seaborn to do all of this for us with only a single plot statement (geom_line or sns.lineplot). Having the data in the right form to plot is very important if you want to get the plot you’re imagining with relatively little effort.

R
Python

ggplot(hbcu_long, aes(x = Year, y = value, color = type)) + geom_line() +
  ggtitle("HBCU College Enrollment")

plot = sns.lineplot(hbcu_long, x = "Year", y = "value", hue = "variable")
plot.set_title("HBCU College Enrollment")
plt.show()
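For contrast, here is a rough sketch of the manual approach described at the start of this subsection, working from wide-format data. It uses matplotlib directly rather than seaborn, and the small data frame hbcu_small (a name made up for illustration) holds a few values copied from the wide table in the previous subsection. The colors are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A small illustrative subset of the wide-format data (not the full dataset)
hbcu_small = pd.DataFrame({
    "Year": [1976, 1980, 1982],
    "Males": [104669, 106387, 104897],
    "Females": [117944, 127170, 123474],
})

fig, ax = plt.subplots()

# One plot call per column, each with a hand-picked color and label
ax.plot(hbcu_small["Year"], hbcu_small["Males"], color="tab:blue", label="Males")
ax.plot(hbcu_small["Year"], hbcu_small["Females"], color="tab:orange", label="Females")

# The legend and axis labels also have to be set up by hand
ax.legend(title="type")
ax.set_xlabel("Year")
ax.set_ylabel("value")
ax.set_title("HBCU College Enrollment")
```

Calling plt.show() afterwards displays the figure. With two columns this is manageable, but every additional column needs its own plot call, color, and label, which is exactly the bookkeeping that the long-form data and a single lineplot call take care of.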