if (!"plotly" %in% installed.packages())
install.packages("plotly")
library(plotly)
library(readr) # reading in data
library(dplyr) # cleaning data
library(tidyr) # merging data
library(lubridate) # dates and times
library(stringr) # string manipulation
library(ggplot2) # plotting
# all of the data is located in the same folder of a github repo
# so let's not type it out 5x
url_stub <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/"
volcano <- read_csv(paste0(url_stub, "volcano.csv"))
eruptions <- read_csv(paste0(url_stub, "eruptions.csv"))
events <- read_csv(paste0(url_stub, "events.csv"))
sulfur <- read_csv(paste0(url_stub, "sulfur.csv"))
trees <- read_csv(paste0(url_stub, "tree_rings.csv"))
30 Interactive Graphics
31 Animated and Interactive Graphics
Interactive and animated graphics are one of the major advantages of using the Rmarkdown ecosystem - because you can easily create web pages in markdown (without the pain of HTML), you aren’t limited by paper any more. We’ll cover two different technologies that allow you to create different types of interactive charts, graphs, and interfaces.
It is helpful to think about interactivity in a couple of different ways:
What does it require? Do you need to be doing statistical calculations in the background, or can you precompute all of the data ahead of time?
-
What type of activity or interactivity do you need?
- Zoom in/out?
- Provide additional information in response to user actions (mouseover, click)
- Provide information over time (animation)
- Keep track of a data point over multiple plots? (linked plots)
- Keep track of one or more data points and change their appearance based on user interaction (brushing)
- Allow the user to change the underlying statistical model or data?
(This is not a full list of all of the types of interactivity, just a few of the more common options)
In this section, we’ll cover two ways to easily create interactive graphics or applets in R and python. There are, of course, many others – many javascript libraries have extensions to R or python that may facilitate creating interactive graphics.
31.1 Objectives
- Create animated and interactive charts using appropriate tools
31.2 Plotly
Plotly PlotlyOpenSource2022? is a graphing library that uses javascript to add interactivity to graphics. There are several different ways to create plotly graphs in R or python. Here, we’ll discuss 3 approaches: - Working with plotly in R directly - Working with plotly in python directly - Using ggplotly
, which converts a ggplot to a plotly plot automatically
Resources:
We’ll demonstrate plotly’s capabilities using the volcanoes
data from Tidy Tuesday.
# Uncomment and run this line if you don't have plotly installed
# %pip install plotly
import plotly.express as px
import plotly.io as pio # this allows plotly to play nice with markdown
import pandas as pd
# all of the data is located in the same folder of a github repo
# so let's not type it out 5x
= "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/"
url_stub = pd.read_csv(url_stub + "volcano.csv")
volcano = pd.read_csv(url_stub + "eruptions.csv")
eruptions = pd.read_csv(url_stub + "events.csv")
events = pd.read_csv(url_stub + "sulfur.csv")
sulfur = pd.read_csv(url_stub + "tree_rings.csv") trees
Let’s try out plotly while doing a bit of exploratory data analysis on this dataset.
volcano
= volcano.query("tectonic_settings != 'Unknown'")
volcano2 'zone', 'crust']] = volcano2.tectonic_settings.\
volcano2[[str.split(" / ", expand = True)
# Remove anything after ( as well as ? if it exists
= volcano2.assign(volcano_type =
volcano2 'primary_volcano_type'].\
volcano2[str.replace(r"(\(.*)?\??$", "", regex = True))
Let’s start by seeing whether the elevation of a volcano changes based on the type of zone it’s on - we might expect that Rift zone volcanos (where plates are pulling away from each other) might not be as high.
p <- volcano %>%
ggplot(aes(x = zone, fill = zone, y = elevation)) +
geom_boxplot() +
coord_flip()
ggplotly(p)
The plot_ly
function is pipe friendly.
Variable mappings are preceded with ~
to indicate that the visual appearance changes with the value of the variable.
= px.box(volcano2, x = "elevation", color = "zone")
fig
file = 'plotly-python/boxplot-volcano.html'
file = file, auto_open = False)
pl.io.write_html(fig, ## NameError: name 'pl' is not defined
## Error: object 'py' not found
It doesn’t really look like there’s much difference.
Since I’m trying to do this without the tidyverse, I’ll try out the new base R pipe, |>
, and the corresponding new anonymous function notation, \()
.1
# First, compute the density
elevation_dens <- split(volcano, ~volcano_type) |>
lapply(FUN = \(df) {
tmp <- density(df$elevation)[c("x", "y", "bw")] |>
as.data.frame()
}) |>
do.call(what = "rbind") |>
as.data.frame()
elevation_dens <- cbind(volcano_type = row.names(elevation_dens), elevation_dens) |>
transform(volcano_type = gsub("\\.\\d{1,}$", "", volcano_type))
plot_ly(data = elevation_dens, x = ~x, y = ~y, type = "scatter", mode = "line", color = ~volcano_type)
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly as pl
# This creates a list of vectors, one for each type of volcano
= volcano.groupby("volcano_type")["elevation"].count()
volcano ## KeyError: 'volcano_type'
= volcano.groupby("volcano_type").elevation.apply(list)
type_list ## KeyError: 'volcano_type'
= volcano.volcano_type.unique()
type_labels ## AttributeError: 'DataFrame' object has no attribute 'volcano_type'
= ff.create_distplot(type_list, group_labels = type_labels, show_hist=False)
fig ## NameError: name 'type_list' is not defined
file = 'plotly-python/distplot-volcano.html'
file = file, auto_open = False) pl.io.write_html(fig,
Here, the interactivity actually helps a bit: we don’t need to use the legend to see what each curve corresponds to. We can see that submarine volcanoes are typically much lower in elevation (ok, duh), but also that subglacial volcanoes are found in a very limited range. If we double-click on a legend entry, we can get rid of all other curves and examine each curve one by one.
I added the rug layer after the initial bout because I was curious how much data each of these curves were based on. If we want only curves with n > 10 observations, we can do that:
p <- volcano %>%
group_by(volcano_type) %>% mutate(n = n()) %>%
filter(n > 15) %>%
ggplot(aes(x = elevation, color = volcano_type)) +
geom_density() +
# Rug plots show each observation as a tick just below the x axis
geom_rug(aes(text = paste0(volcano_name, ", ", country)))
ggplotly(p)
If we want to specify additional information that should show up in the tooltip, we can do that as well by adding the text
aesthetic even though geom_rug doesn’t take a text aesthetic. You may notice that ggplot2 complains about the unknown aesthetic I’ve added to geom_rug: That allows us to mouse over each data point in the rug plot and see what volcano it belongs to. So we can tell from the rug plot that the tallest volcano is Ojas de Salvado, in Chile/Argentina (I believe that translates to Eyes of Salvation?).
# First, compute the density
elevation_dens <- split(volcano, ~volcano_type) |>
lapply(FUN = \(df) {
tmp <- density(df$elevation)[c("x", "y", "bw")] |>
as.data.frame()
tmp$n = nrow(df)
tmp
}) |>
do.call(what = "rbind") |>
as.data.frame()
elevation_dens <- cbind(volcano_type = row.names(elevation_dens), elevation_dens) |>
transform(volcano_type = gsub("\\.\\d{1,}$", "", volcano_type)) |>
subset(n > 15)
plot_ly(data = elevation_dens, x = ~x, y = ~y,
type = "scatter", mode = "line", color = ~volcano_type)
import plotly.figure_factory as ff
= volcano.assign(count = volcano.groupby("volcano_type").\
volcano "count"))
volcano_type.transform(## KeyError: 'volcano_type'
= volcano.query("count > 15").sort_values(["volcano_type"])
common_volcano ## pandas.errors.UndefinedVariableError: name 'count' is not defined
"label"] = common_volcano.volcano_name + ", " + common_volcano.country
common_volcano[## NameError: name 'common_volcano' is not defined
# This creates a list of vectors, one for each type of volcano
= common_volcano.groupby("volcano_type").elevation.apply(list)
type_list ## NameError: name 'common_volcano' is not defined
# rug_text = common_volcano.groupby("volcano_type").label.apply(list)
= common_volcano.volcano_type.unique()
type_labels ## NameError: name 'common_volcano' is not defined
= ff.create_distplot(type_list, group_labels = type_labels, rug_text = rug_text, show_hist=False)
fig ## NameError: name 'type_list' is not defined
fig.show()
At any rate, there isn’t nearly as much variation as I was expecting in the elevation of different types of volcanoes.
ggplotly makes it very easy to generate plots that have a ggplot2 equivalent; you can customize these plots further using plotly functions that we’ll see in the next section. But first, try the interface out on your own.
Conduct an exploratory data analysis of the eruptions dataset. What do you find?
head(eruptions)
## # A tibble: 6 × 15
## volcano_number volcano_name eruption_number eruption_category area_of_activity
## <dbl> <chr> <dbl> <chr> <chr>
## 1 266030 Soputan 22354 Confirmed Erupti… <NA>
## 2 343100 San Miguel 22355 Confirmed Erupti… <NA>
## 3 233020 Fournaise, … 22343 Confirmed Erupti… <NA>
## 4 345020 Rincon de l… 22346 Confirmed Erupti… <NA>
## 5 353010 Fernandina 22347 Confirmed Erupti… <NA>
## 6 273070 Taal 22344 Confirmed Erupti… <NA>
## # ℹ 10 more variables: vei <dbl>, start_year <dbl>, start_month <dbl>,
## # start_day <dbl>, evidence_method_dating <chr>, end_year <dbl>,
## # end_month <dbl>, end_day <dbl>, latitude <dbl>, longitude <dbl>
summary(eruptions %>% mutate(eruption_category = factor(eruption_category)))
## volcano_number volcano_name eruption_number
## Min. :210010 Length:11178 Min. :10001
## 1st Qu.:263310 Class :character 1st Qu.:12817
## Median :290050 Mode :character Median :15650
## Mean :300284 Mean :15667
## 3rd Qu.:343030 3rd Qu.:18464
## Max. :600000 Max. :22355
##
## eruption_category area_of_activity vei
## Confirmed Eruption :9900 Length:11178 Min. :0.000
## Discredited Eruption: 166 Class :character 1st Qu.:1.000
## Uncertain Eruption :1112 Mode :character Median :2.000
## Mean :1.948
## 3rd Qu.:2.000
## Max. :7.000
## NA's :2906
## start_year start_month start_day evidence_method_dating
## Min. :-11345.0 Min. : 0.000 Min. : 0.000 Length:11178
## 1st Qu.: 680.0 1st Qu.: 0.000 1st Qu.: 0.000 Class :character
## Median : 1847.0 Median : 1.000 Median : 0.000 Mode :character
## Mean : 622.8 Mean : 3.451 Mean : 7.015
## 3rd Qu.: 1950.0 3rd Qu.: 7.000 3rd Qu.:15.000
## Max. : 2020.0 Max. :12.000 Max. :31.000
## NA's :1 NA's :193 NA's :196
## end_year end_month end_day latitude
## Min. :-475 Min. : 0.000 Min. : 0.00 Min. :-77.530
## 1st Qu.:1895 1st Qu.: 3.000 1st Qu.: 4.00 1st Qu.: -6.102
## Median :1957 Median : 6.000 Median :15.00 Median : 17.600
## Mean :1917 Mean : 6.221 Mean :13.32 Mean : 16.866
## 3rd Qu.:1992 3rd Qu.: 9.000 3rd Qu.:21.00 3rd Qu.: 40.821
## Max. :2020 Max. :12.000 Max. :31.00 Max. : 85.608
## NA's :6846 NA's :6849 NA's :6852
## longitude
## Min. :-179.97
## 1st Qu.: -77.66
## Median : 55.71
## Mean : 31.57
## 3rd Qu.: 139.39
## Max. : 179.58
##
# Historical (very historical) dates are a bit of a pain to work with, so I
# wrote a helper function which takes year, month, and day arguments and formats
# them properly
fix_date <- function(yyyy, mm, dd) {
# First, negative years (BCE) are a bit of a problem.
neg <- yyyy < 0
subtract_years <- pmax(-yyyy, 0) # Years to subtract off later
# for now, set to 0
year_fixed <- pmax(yyyy, 0) # this will set anything negative to 0
# sometimes the day or month isn't known, so just use 1 for both.
# recorded value may be NA or 0.
day_fixed <- ifelse(is.na(dd), 1, pmax(dd, 1))
month_fixed <- ifelse(is.na(mm), 1, pmax(mm, 1))
# Need to format things precisely, so use sprintf
# %0xd ensures that you have at least x digits, padding the left side with 0s
# lubridate doesn't love having 3-digit years.
date_str <- sprintf("%04d/%02d/%02d", year_fixed, month_fixed, day_fixed)
# Then we can convert the dates and subtract off the years for pre-CE dates
date <- ymd(date_str) - years(subtract_years)
}
erupt <- eruptions %>%
# Don't work with discredited eruptions
filter(eruption_category == "Confirmed Eruption") %>%
# Create start and end dates
mutate(
start_date = fix_date(start_year, start_month, start_day),
end_date = fix_date(end_year, end_month, end_day),
# To get duration, we have to start with a time interval,
# convert to duration, then convert to a numeric value
duration = interval(start = start_date, end = end_date) %>%
as.duration() %>%
as.numeric("days"))
Let’s start out seeing what month most eruptions occur in…
# Note, I'm using the original month, so 0 = unknown
p <- ggplot(erupt, aes(x = factor(start_month))) + geom_bar()
ggplotly(p)
# I could rename some of the factors to make this pretty, but... nah
Another numerical variable is VEI, volcano explosivity index. A VEI of 0 is non-explosive, a VEI of 4 is about what Mt. St. Helens hit in 1980, and a VEI of 5 is equivalent to the Krakatau explosion in 1883. A VEI of 8 would correspond to a major Yellowstone caldera eruption (which hasn’t happened for 600,000 years). Basically, VEI increase of 1 is an order of magnitude change in the amount of material the eruption released.
We can also look at the frequency of eruptions over time. We’ll expect some historical bias - we don’t have exact dates for some of these eruptions, and if no one was around to write the eruption down (or the records were destroyed) there’s not going to be a date listed here.
p <- erupt %>%
filter(!is.na(end_date)) %>%
filter(start_year > 0) %>%
ggplot(aes(x = start_date, xend = start_date,
y = 0, yend = duration,
color = evidence_method_dating)) +
geom_segment() +
geom_point(size = .5, aes(text = volcano_name)) +
xlab("Eruption Start") +
ylab("Eruption Duration (days)") +
facet_wrap(~vei, scales = "free_y")
ggplotly(p)
As expected, it’s pretty rare to see many eruptions before ~1800 AD, which is about when we have reliable historical records2 for most of the world (exceptions include e.g. Vestuvius, which we have extensive written information about).
p <- erupt %>%
filter(!is.na(end_date)) %>%
# Account for recency bias (sort of)
filter(start_year > 1800) %>%
ggplot(aes(x = factor(vei), y = duration)) +
geom_violin() +
xlab("VEI") +
ylab("Eruption Duration (days)") +
scale_y_sqrt()
ggplotly(p)
It seems that the really big eruptions might be less likely to last for a long time, but it is hard to tell because there aren’t that many of them (thankfully).
head(eruptions)
## # A tibble: 6 × 15
## volcano_number volcano_name eruption_number eruption_category area_of_activity
## <dbl> <chr> <dbl> <chr> <chr>
## 1 266030 Soputan 22354 Confirmed Erupti… <NA>
## 2 343100 San Miguel 22355 Confirmed Erupti… <NA>
## 3 233020 Fournaise, … 22343 Confirmed Erupti… <NA>
## 4 345020 Rincon de l… 22346 Confirmed Erupti… <NA>
## 5 353010 Fernandina 22347 Confirmed Erupti… <NA>
## 6 273070 Taal 22344 Confirmed Erupti… <NA>
## # ℹ 10 more variables: vei <dbl>, start_year <dbl>, start_month <dbl>,
## # start_day <dbl>, evidence_method_dating <chr>, end_year <dbl>,
## # end_month <dbl>, end_day <dbl>, latitude <dbl>, longitude <dbl>
# Historical (very historical) dates are a bit of a pain to work with, so I
# wrote a helper function which takes year, month, and day arguments and formats
# them properly
fix_date <- function(yyyy, mm, dd) {
# First, negative years (BCE) are a bit of a problem.
neg <- yyyy < 0
subtract_years <- pmax(-yyyy, 0) # Years to subtract off later
# for now, set to 0
year_fixed <- pmax(yyyy, 0) # this will set anything negative to 0
# sometimes the day or month isn't known, so just use 1 for both.
# recorded value may be NA or 0.
day_fixed <- ifelse(is.na(dd), 1, pmax(dd, 1))
month_fixed <- ifelse(is.na(mm), 1, pmax(mm, 1))
# Need to format things precisely, so use sprintf
# %0xd ensures that you have at least x digits, padding the left side with 0s
# lubridate doesn't love having 3-digit years.
date_str <- sprintf("%04d/%02d/%02d", year_fixed, month_fixed, day_fixed)
# Then we can convert the dates and subtract off the years for pre-CE dates
date <- ymd(date_str) - years(subtract_years)
}
erupt <- eruptions %>%
# Don't work with discredited eruptions
filter(eruption_category == "Confirmed Eruption") %>%
# Create start and end dates
mutate(
start_date = fix_date(start_year, start_month, start_day),
end_date = fix_date(end_year, end_month, end_day),
# To get duration, we have to start with a time interval,
# convert to duration, then convert to a numeric value
duration = interval(start = start_date, end = end_date) %>%
as.duration() %>%
as.numeric("days"))
Let’s start out seeing what month most eruptions occur in…
# Note, I'm using the original month, so 0 = unknown
erupt %>%
count(start_month) %>%
plot_ly(
data = .,
x = ~start_month,
y = ~n,
type = "bar"
)
Another numerical variable is VEI, volcano explosivity index. A VEI of 0 is non-explosive, a VEI of 4 is about what Mt. St. Helens hit in 1980, and a VEI of 5 is equivalent to the Krakatau explosion in 1883. A VEI of 8 would correspond to a major Yellowstone caldera eruption (which hasn’t happened for 600,000 years). Basically, VEI increase of 1 is an order of magnitude change in the amount of material the eruption released.
erupt %>%
filter(!is.na(end_date)) %>%
# Account for recency bias (sort of)
filter(start_year > 1800) %>%
plot_ly(x = ~ factor(vei),
y = ~ duration,
split = ~factor(vei),
type = "violin") %>%
layout(yaxis = list(type="log"))
It seems that the really big eruptions might be less likely to last for a long time, but it is hard to tell because there aren’t that many of them (thankfully).
In Python, negative dates are even more of a pain to work with if you’re using standard libraries, so we’ll install the astropy
class with pip install astropy
. BCE dates are still a pain in the … but they at least work.
eruptions.head()## volcano_number volcano_name ... latitude longitude
## 0 266030 Soputan ... 1.112 124.737
## 1 343100 San Miguel ... 13.434 -88.269
## 2 233020 Fournaise, Piton de la ... -21.244 55.708
## 3 345020 Rincon de la Vieja ... 10.830 -85.324
## 4 353010 Fernandina ... -0.370 -91.550
##
## [5 rows x 15 columns]
# Historical (very historical) dates are a bit of a pain to work with, so I
# wrote a helper function which takes year, month, and day arguments and formats
# them properly
from astropy.time import Time,TimeDelta
import numpy as np
import math
def fix_date(yyyy, mm, dd):
# The zero, one columns allow using pd.max(axis = 1) where we'd use pmax in R
= yyyy <= 0
neg = -yyyy
nyear
= max([yyyy, 1])
year = max([nyear, 0]) + neg
subtract_year
= dd
day = mm
month
if math.isnan(day):
= 1
day if math.isnan(month):
= 1
month
if day == 0:
= 1
day if month == 0:
= 1
month
= "%04d-%02d-%02d" % (year, month, day)
dateformat
= Time(dateformat, format = "iso", scale = 'ut1')
date = date - TimeDelta(subtract_year*365, format= 'jd')
datefix return datefix
= eruptions.query("eruption_category == 'Confirmed Eruption'")
erupt 0, inplace = True)
erupt.fillna('start_date'] = erupt.apply(lambda x: fix_date(x.start_year, x.start_month, x.start_day), axis = 1)
erupt['end_date'] = erupt.apply(lambda x: fix_date(x.end_year, x.end_month, x.end_day), axis = 1)
erupt['duration'] = erupt.end_date - erupt.start_date
erupt[# Convert back to numeric
'duration'] = erupt.duration.apply(lambda x: x.to_value("jd", "decimal")) # Julian day erupt[
Let’s start out seeing what month most eruptions occur in…
import plotly.express as px
= erupt.groupby("start_month").count()
tmp = tmp.reset_index()
tmp # Note, I'm using the original month, so 0 = unknown
= px.bar(tmp, x = 'start_month', y = 'volcano_number')
fig
file = 'plotly-python/eruptplot-volcano.html'
file = file, auto_open = False) pl.io.write_html(fig,
Another numerical variable is VEI, volcano explosivity index. A VEI of 0 is non-explosive, a VEI of 4 is about what Mt. St. Helens hit in 1980, and a VEI of 5 is equivalent to the Krakatau explosion in 1883. A VEI of 8 would correspond to a major Yellowstone caldera eruption (which hasn’t happened for 600,000 years). Basically, VEI increase of 1 is an order of magnitude change in the amount of material the eruption released.
# VEI is volcano explosivity index
= px.bar(
fig "vei").count().reset_index(),
erupt.groupby(= "vei", y = "volcano_number")
x
file = 'plotly-python/vei-volcano.html'
file = file, auto_open = False) pl.io.write_html(fig,
"duration_yr"] = erupt.duration/365.25
erupt[## TypeError: unsupported operand type(s) for /: 'decimal.Decimal' and 'float'
= px.box(
fig
erupt,= "vei",
x = "duration_yr",
y = "all"
points
)## ValueError: Value of 'y' is not the name of a column in 'data_frame'. Expected one of ['volcano_number', 'volcano_name', 'eruption_number', 'eruption_category', 'area_of_activity', 'vei', 'start_year', 'start_month', 'start_day', 'evidence_method_dating', 'end_year', 'end_month', 'end_day', 'latitude', 'longitude', 'start_date', 'end_date', 'duration'] but received: duration_yr
file = 'plotly-python/vei-duration-volcano.html'
file = file, auto_open = False) pl.io.write_html(fig,
It seems that the really big eruptions might be less likely to last for a long time, but it is hard to tell because there aren’t that many of them (thankfully)
Plotly integration with ggplot2 is nice, but obviously not a universal summary of what it can do. Let’s look at another example of plotly in R/python without ggplot2 integration.
We start with a scatterplot of volcanoes along the earth’s surface:
plot_ly(type = "scattergeo", lon = volcano$longitude, lat = volcano$latitude)
= px.scatter_geo(volcano, lon = "longitude", lat = "latitude")
fig
file = 'plotly-python/scatter-volcano.html'
file = file, auto_open = False) pl.io.write_html(fig,
And then we can start customizing.
plot_ly(type = "scattergeo", lon = volcano$longitude,
lat = volcano$latitude,
mode = "markers",
# Add information to mouseover
text = ~paste(volcano$volcano_name, "\n",
"Last Erupted: ", volcano$last_eruption_year),
# Change the markers because why not?
marker = list(color = "#d00000", opacity = 0.25)
)
= px.scatter_geo(volcano,
fig = "longitude", lat = "latitude",
lon = "volcano_name",
hover_name = ["last_eruption_year"])
hover_data =dict(size=12, opacity = 0.25, color = 'red'),
fig.update_traces(marker=dict(mode='markers')) selector
file = 'plotly-python/scatter-hover-volcano.html'
file = file, auto_open = False) pl.io.write_html(fig,
Plotly will handle some variable mappings for you, depending on which “trace” (type of plot) you’re using.
The plot_ly
function is also pipe friendly. Variable mappings are preceded with ~
to indicate that the visual appearance changes with the value of the variable.
# Load RColorBrewer for palettes
library(RColorBrewer)
volcano %>%
group_by(volcano_type) %>%
mutate(n = n()) %>%
filter(n > 15) %>%
plot_ly(type = "scattergeo", lon = ~longitude, lat = ~latitude,
mode = "markers",
# Add information to mouseover
text = ~paste(volcano_name, "\n",
"Last Erupted: ", last_eruption_year),
color = ~ volcano_type,
# Specify a palette
colors = brewer.pal(length(unique(.$volcano_type)), "Paired"),
# Change the markers because why not?
marker = list(opacity = 0.5)
)
= volcano.groupby("volcano_type").agg({'volcano_number': ['size']})
volc_sub ## KeyError: 'volcano_type'
= ["n"]
volc_sub.columns ## NameError: name 'volc_sub' is not defined
= volc_sub.reset_index()
volc_sub ## NameError: name 'volc_sub' is not defined
= volc_sub.query("n >= 15")
volc_sub ## NameError: name 'volc_sub' is not defined
= pd.merge(volc_sub['volcano_type'], volcano, on = 'volcano_type', how = 'inner')
volc_sub ## NameError: name 'volc_sub' is not defined
= px.scatter_geo(volc_sub,
fig = "longitude", lat = "latitude",
lon = "volcano_type",
color = "volcano_name",
hover_name = ["last_eruption_year"])
hover_data ## NameError: name 'volc_sub' is not defined
=dict(size=12, opacity = 0.25),
fig.update_traces(marker=dict(mode='markers')) selector
file = 'plotly-python/scatter-hover-update-volcano.html'
file = file, auto_open = False) pl.io.write_html(fig,
31.3 Leaflet maps
Leaflet is another javascript library that allows for interactive data visualization. We’re only going to briefly talk about it here, but there is extensive documentation that includes details of how to work with different types of geographical data, chloropleth maps, plugins, and more.
To explore the leaflet package, we’ll start out playing with a dataset of Bigfoot sightings assembled from the Bigfoot Field Researchers Organization’s Google earth tool
if (!"leaflet" %in% installed.packages()) install.packages("leaflet")
library(leaflet)
library(readr)
bigfoot_data <- read_csv("https://query.data.world/s/egnaxxvegdkzzrhfhdh4izb6etmlms")
We can start out by plotting a map with the location of each sighting. I’ve colored the points in a seasonal color scheme, and added the description of each incident as a mouseover label.
bigfoot_data %>%
filter(classification == "Class A") %>%
mutate(seasoncolor = str_replace_all(season, c("Fall" = "orange",
"Winter" = "skyblue",
"Spring" = "green",
"Summer" = "yellow")),
# This code just wraps the description to the width of the R terminal
# and inserts HTML for a line break into the text at appropriate points
desc_wrap = purrr::map(observed, ~strwrap(.) %>%
paste(collapse = "<br/>") %>%
htmltools::HTML())) %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers(~longitude, ~latitude, color = ~seasoncolor, label = ~desc_wrap)
Of course, because this is an interactive map library, we aren’t limited to any one scale. We can also plot data at the city level:
# library(nycsquirrels18)
# data(squirrels)
squirrels <- readr::read_csv("data/nycsquirrels.csv")
## Error: 'data/nycsquirrels.csv' does not exist in current working directory ('/home/susan/Projects/Class/stat-computing-r-python/part-advanced-topics').
head(squirrels)
## Error: object 'squirrels' not found
squirrels %>%
mutate(color = tolower(primary_fur_color)) %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers(~long, ~lat, color = ~color)
## Error: object 'squirrels' not found
We can also plot regions, instead of just points. I downloaded a dataset released by the state of California, Crotch’s Bumble Bee Range - CDFW dataset, which shows the range of the Crotch’s Bumble Bee (Bombus crotchii).
I’ve set this chunk to not evaluate because it causes the book to be painfully large.
library(sf)
bees <- st_read("../data/Crotch_s_Bumble_Bee_Range_-_CDFW_[ds3095].geojson")
bees <- sf::st_transform(bees, 4326)
bees %>%
leaflet() %>%
addTiles() %>%
addPolygons(stroke = F, fillOpacity = 0.25,
fillColor = "yellow")
Download the Shapefiles for the 116th Congress Congressional Districts. Unzip the file and read it in using the code below (you’ll have to change the file path). Use the MIT Election Data and Science Lab’s US House election results, and merge this data with the shapefiles to plot the results of the 2018 midterms in a way that you think is useful (you can use any of the available data).
Some notes:
- FIPS codes are used to identify the state and district, with 00 indicating at-large districts (one district for the state) and 98 indicating non-voting districts.
- If you would like to add in the number of citizens of voting age, you can get that information here but you will have to do some cleaning in order to join the table with the others.
- Minnesota’s Democratic-farmer-labor party caucuses with the Democrats but maintains its name for historical reasons. You can safely recode this if you want to.
library(sf)
# Read in the districts
ziptemp <- tempfile(fileext=".zip")
shapeurl <- "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_cd116_5m.zip"
download.file(shapeurl, destfile = ziptemp, mode = "wb")
unzip(ziptemp, exdir = "data/116_congress")
congress_districts <- st_read("data/116_congress/cb_2018_us_cd116_5m.shp")
## Reading layer `cb_2018_us_cd116_5m' from data source
## `/home/susan/Projects/Class/stat-computing-r-python/part-advanced-topics/data/116_congress/cb_2018_us_cd116_5m.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 441 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.1473 ymin: -14.55255 xmax: 179.7785 ymax: 71.35256
## Geodetic CRS: NAD83
# Read in the results
election_results <- read_csv("data/1976-2020-house.csv") %>%
filter(year == 2018) %>%
mutate(state_fips = sprintf("%02d", as.integer(state_fips)),
district = sprintf("%02d", as.integer(district)))
## Error: 'data/1976-2020-house.csv' does not exist in current working directory ('/home/susan/Projects/Class/stat-computing-r-python/part-advanced-topics').
# Clean up congress districts
congress_districts <- congress_districts %>%
# Convert factors to characters
mutate(across(where(is.factor), as.character)) %>%
# Handle at-large districts
mutate(district = ifelse(CD116FP == "00", "01", CD116FP))
library(sf)
library(htmltools) # to mark labels as html code
# Read in the results
election_results <- election_results %>%
group_by(state, state_fips, state_po, district, stage) %>%
arrange(candidatevotes) %>%
mutate(pct = candidatevotes/totalvotes) %>%
mutate(party = str_to_lower(party)) %>%
# Keep the winner only
filter(pct == max(pct)) %>%
# Fix Minnesota
mutate(party = ifelse(party == "democratic-farmer-labor", "democrat", party))
## Error: object 'election_results' not found
# Read in the districts
congress_districts <- st_read("data/116_congress/cb_2018_us_cd116_5m.shp") %>%
mutate(geometry = st_transform(geometry, crs = st_crs("+proj=longlat +datum=WGS84")))
## Reading layer `cb_2018_us_cd116_5m' from data source
## `/home/susan/Projects/Class/stat-computing-r-python/part-advanced-topics/data/116_congress/cb_2018_us_cd116_5m.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 441 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.1473 ymin: -14.55255 xmax: 179.7785 ymax: 71.35256
## Geodetic CRS: NAD83
# Clean up congress districts
congress_districts <- congress_districts %>%
# Convert factors to characters
mutate(across(where(is.factor), as.character)) %>%
# Handle at-large districts
mutate(district = ifelse(CD116FP == "00", "01", CD116FP))
# Merge
congress_districts <- congress_districts %>%
left_join(election_results, by = c("STATEFP" = "state_fips", "CD116FP" = "district")) %>%
mutate(party = factor(party, levels = c("republican", "democrat")),
short_party = ifelse(party == "republican", "R", "D"),
label = paste0(state_po, "-", district, candidate, " (", short_party, ")"))
## Error: object 'election_results' not found
# Define a palette
region_pal <- colorFactor(c("#e9141d", "#0015bc"), congress_districts$party)
congress_districts %>%
leaflet() %>%
addTiles() %>%
addPolygons(stroke = TRUE, fillOpacity = ~pct/2,
# still want to see what's underneath, even in safe districts
fillColor = ~region_pal(party), color = ~region_pal(party),
label = ~label)
## Error in eval(f[[2]], metaData(data), environment(f)): object 'label' not found
31.4 Shiny
Take a few minutes and poke around the RStudio Shiny user showcase. It helps to have some motivation, and to get a sense of what is possible before you start learning something.
One of the more amusing ones I found was an exploration of lego demographics.
Shiny is a framework for building interactive web applications in R (and now in Python too!). Unlike plotly and other graphics engines, Shiny depends on an R instance on a server to do computations. This means Shiny is much more powerful and has more capabilities, but also that it’s harder to share and deploy - you have to have access to a web server with R installed on it. If you happen to have a server like that, though, Shiny is pretty awesome. Posit runs a service called shinyapps.io that will provide some limited free hosting, as well as paid plans for apps that have more web traffic, but you can also create Shiny apps for local use - I often do this for model debugging when I’m using neural networks, because they’re so complicated.
Posit has a set of well produced video tutorials to introduce Shiny. I’d recommend you at least listen to the introduction if you’re a visual/audio learner (the whole tutorial is about 2 hours long). There is also a written tutorial if you prefer to learn in written form (7 lessons, each is about 20 minutes long).
I generally think it’s better to send you to the source when there are well-produced resources, rather than trying to rehash something to put my own spin on it.
One other interesting feature to keep in mind when using Shiny - you can integrate Shiny reactivity into Rmarkdown by adding runtime: shiny
to the markdown header.
31.4.1 Other interactive tools
htmlwidgets - a generic wrapper for any Javascript library (htmlwidgets is used under the hood in both Leaflet and Plotly R integration)
dash - Another dashboard program supported by plotly.
dash
is the python equivalent ofshiny
, but also has R integration (though I’m not sure how well it’s supported).
31.4.2 Debugging
Debugging with Dean - Shiny debugging - YouTube video with debugging in realtime.
Using Shiny in Production - Joe Cheng
This is me experimentally trying to replace the tidyverse, and honestly, I’m not a fan.↩︎
There are obviously exceptions - we can figure out the exact date and approximate time that there was an earthquake along the Cascadia subduction zone based on a combination of oral histories of the indigenous people and records of a massive tsunami in Japan Excellent read, if you’re interested, and the Nature paper.↩︎