15  Programming With Data

Published

April 17, 2024

At this point, you’ve learned how to write functions. You know the basics of how to create new variables, how data frames and lists work, and how to use markdown.

And yet… these are skills that take some practice when applied to new data. We’re going to take a break from the fire-hose of syntax you’ve learned and focus on applying what you’ve learned to problems related to data. The goal is to reinforce the skills you’ve already learned and help you find your feet a bit as you work through data analysis.

I’ll provide sample code for tasks like basic plots and tables that we haven’t covered yet - you should feel free to modify and tinker with these chunks as you go along. This chapter will also provide a preview of some of the packages we’re going to work with in the next few sections (because I’m going to show you some code for e.g. summarizing a dataset and plot a few things, even without having covered that material).

It’s 100% expected that you would be oscillating between just maybe understanding something and feeling completely lost again during this chapter. Hopefully, that feeling will get better over the next few weeks… but for now, just stick with it.

As you’ve probably guessed by now, this section will primarily be focused on examples.

15.1 Objectives

  • Write functions to create simple plots and data summaries

  • Apply syntax knowledge to reference variables and observations in common data structures

  • Create new variables and columns or reformat existing columns in provided data structures

15.2 Artwork Dimensions

The Tate Art Museum assembled a collection of 70,000 artworks (last updated in 2014). They cataloged information including accession number, artwork dimensions, units, title, date, medium, inscription, and even URLs for images of the art.

15.2.1 Reading in the Data

library(readr)
artwork <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-12/artwork.csv')
import pandas as pd
artwork = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-01-12/artwork.csv')

15.2.2 Visual Summaries

When you first access a new dataset, it’s fun to explore it a bit. I’ve shown a summary of the variables (character variables summarized with completion rates and # unique values, numeric variables summarized with quantiles and mean/sd) generated using the R skimr and Python skimpy packages (which we’ll talk about in the next chapter).

You may need to run install.packages("skimr") in the R terminal if you have not used the package before.

library(skimr)
skim(artwork)
Data summary
Name artwork
Number of rows 69201
Number of columns 20
_______________________
Column type frequency:
character 12
logical 1
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
accession_number 0 1.00 6 7 0 69201 0
artist 0 1.00 4 120 0 3336 0
artistRole 0 1.00 5 24 0 19 0
title 0 1.00 1 320 0 43529 0
dateText 0 1.00 4 75 0 2736 0
medium 6384 0.91 3 120 0 3401 0
creditLine 3 1.00 14 820 0 3209 0
dimensions 2433 0.96 4 248 0 25575 0
units 3341 0.95 2 2 0 1 0
inscription 62895 0.09 14 14 0 1 0
thumbnailUrl 10786 0.84 55 57 0 58415 0
url 0 1.00 48 134 0 69201 0

Variable type: logical

skim_variable n_missing complete_rate mean count
thumbnailCopyright 69201 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 39148.03 25980.47 3 19096.00 37339 54712 129068 ▇▇▅▁▁
artistId 0 1.00 1201.06 2019.42 0 558.00 558 1137 19232 ▇▁▁▁▁
year 5397 0.92 1867.23 72.01 1545 1817.00 1831 1953 2012 ▁▁▇▆▆
acquisitionYear 45 1.00 1910.65 64.20 1823 1856.00 1856 1982 2013 ▇▁▁▁▅
width 3367 0.95 323.47 408.81 3 118.00 175 345 11960 ▇▁▁▁▁
height 3342 0.95 346.44 538.04 6 117.00 190 359 37500 ▇▁▁▁▁
depth 66687 0.04 479.20 1051.14 1 48.25 190 450 18290 ▇▁▁▁▁
# Base pandas
artwork.describe()
##                   id      artistId  ...         depth  thumbnailCopyright
## count   69201.000000  69201.000000  ...   2514.000000                 0.0
## mean    39148.026213   1201.063251  ...    479.197772                 NaN
## std     25980.468687   2019.422535  ...   1051.141734                 NaN
## min         3.000000      0.000000  ...      1.000000                 NaN
## 25%     19096.000000    558.000000  ...     48.250000                 NaN
## 50%     37339.000000    558.000000  ...    190.000000                 NaN
## 75%     54712.000000   1137.000000  ...    450.000000                 NaN
## max    129068.000000  19232.000000  ...  18290.000000                 NaN
## 
## [8 rows x 8 columns]

You may need to run pip install skimpy in the terminal if you have not used the package before.


# Skimpy package - like skimr
from skimpy import skim
skim(artwork)
## ╭─────────────────────────────── skimpy summary ───────────────────────────────╮
## │          Data Summary                Data Types                              │
## │ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓                       │
## │ ┃ dataframe         ┃ Values ┃ ┃ Column Type ┃ Count ┃                       │
## │ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩                       │
## │ │ Number of rows    │ 69201  │ │ string      │ 12    │                       │
## │ │ Number of columns │ 20     │ │ float64     │ 6     │                       │
## │ └───────────────────┴────────┘ │ int64       │ 2     │                       │
## │                                └─────────────┴───────┘                       │
## │                                   number                                     │
## │ ┏━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━┳━━━━━━┳━━━━━┳━━━━━━┳━━━━━┓  │
## │ ┃ colu ┃      ┃      ┃      ┃      ┃      ┃     ┃      ┃     ┃      ┃     ┃  │
## │ ┃ mn_n ┃      ┃      ┃      ┃      ┃      ┃     ┃      ┃     ┃      ┃ his ┃  │
## │ ┃ ame  ┃ NA   ┃ NA % ┃ mean ┃ sd   ┃ p0   ┃ p25 ┃ p50  ┃ p75 ┃ p100 ┃ t   ┃  │
## │ ┡━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━╇━━━━━━╇━━━━━╇━━━━━━╇━━━━━┩  │
## │ │ id   │    0 │    0 │ 3900 │ 2600 │    3 │ 190 │ 3700 │ 550 │ 1300 │ ▇▇▇ │  │
## │ │      │      │      │    0 │    0 │      │  00 │    0 │  00 │   00 │ ▁▁▁ │  │
## │ │ arti │    0 │    0 │ 1200 │ 2000 │    0 │ 560 │  560 │ 110 │ 1900 │  ▇  │  │
## │ │ stId │      │      │      │      │      │     │      │   0 │    0 │     │  │
## │ │ year │ 5397 │  7.8 │ 1900 │   72 │ 1500 │ 180 │ 1800 │ 200 │ 2000 │     │  │
## │ │      │      │      │      │      │      │   0 │      │   0 │      │ ▇▁▃ │  │
## │ │ acqu │   45 │ 0.07 │ 1900 │   64 │ 1800 │ 190 │ 1900 │ 200 │ 2000 │  ▇  │  │
## │ │ isit │      │      │      │      │      │   0 │      │   0 │      │ ▁▂▃ │  │
## │ │ ionY │      │      │      │      │      │     │      │     │      │     │  │
## │ │ ear  │      │      │      │      │      │     │      │     │      │     │  │
## │ │ widt │ 3367 │ 4.87 │  320 │  410 │    3 │ 120 │  180 │ 340 │ 1200 │  ▇  │  │
## │ │ h    │      │      │      │      │      │     │      │     │    0 │     │  │
## │ │ heig │ 3342 │ 4.83 │  350 │  540 │    6 │ 120 │  190 │ 360 │ 3800 │  ▇  │  │
## │ │ ht   │      │      │      │      │      │     │      │     │    0 │     │  │
## │ │ dept │ 6668 │ 96.3 │  480 │ 1100 │    1 │  48 │  190 │ 450 │ 1800 │  ▇  │  │
## │ │ h    │    7 │    7 │      │      │      │     │      │     │    0 │     │  │
## │ │ thum │ 6920 │  100 │  nan │  nan │  nan │ nan │  nan │ nan │  nan │     │  │
## │ │ bnai │    1 │      │      │      │      │     │      │     │      │     │  │
## │ │ lCop │      │      │      │      │      │     │      │     │      │     │  │
## │ │ yrig │      │      │      │      │      │     │      │     │      │     │  │
## │ │ ht   │      │      │      │      │      │     │      │     │      │     │  │
## │ └──────┴──────┴──────┴──────┴──────┴──────┴─────┴──────┴─────┴──────┴─────┘  │
## │                                   string                                     │
## │ ┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓  │
## │ ┃ column_name         ┃ NA     ┃ NA %   ┃ words per row    ┃ total words  ┃  │
## │ ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩  │
## │ │ accession_number    │      0 │      0 │                1 │        69201 │  │
## │ │ artist              │      0 │      0 │              3.3 │       226531 │  │
## │ │ artistRole          │      0 │      0 │                1 │        69504 │  │
## │ │ title               │      0 │      0 │                5 │       346084 │  │
## │ │ dateText            │      0 │      0 │              1.2 │        85288 │  │
## │ │ medium              │   6384 │   9.23 │              3.4 │       235115 │  │
## │ │ creditLine          │      3 │      0 │               10 │       698469 │  │
## │ │ dimensions          │   2433 │   3.52 │              5.3 │       363937 │  │
## │ │ units               │   3341 │   4.83 │             0.95 │        65860 │  │
## │ │ inscription         │  62895 │  90.89 │             0.18 │        12612 │  │
## │ │ thumbnailUrl        │  10786 │  15.59 │             0.84 │        58415 │  │
## │ │ url                 │      0 │      0 │                1 │        69201 │  │
## │ └─────────────────────┴────────┴────────┴──────────────────┴──────────────┘  │
## ╰──────────────────────────────────── End ─────────────────────────────────────╯

15.2.3 Accessing one column

First, let’s pull out the year for each piece of artwork in the dataset and see what we can do with it…

head(artwork$year)
## [1]   NA   NA 1785   NA 1826 1826

We reference a column of the dataset by name using dataset_name$column_name, and since our data is stored in artwork, and we want the column named year, we use artwork$year to get access to the data we want.

artwork.year.head()
## 0       NaN
## 1       NaN
## 2    1785.0
## 3       NaN
## 4    1826.0
## Name: year, dtype: float64

We reference a column of the dataset by name using dataset_name.column_name or dataset_name['column_name'], and since our data is stored in artwork and we want the column year, we use artwork.year or artwork['year'] to access the data we want.

I’ve used the head command to show only the first few values (so that the output isn’t overwhelming).

15.2.4 Variable Summary

When we have output like this, it is useful to summarize the output in some way:

summary(artwork$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1545    1817    1831    1867    1953    2012    5397

That’s much less output, but we might want to instead make a chart:

hist(artwork$year, breaks = 30)

artwork.year.describe()
## count    63804.000000
## mean      1867.227823
## std         72.012718
## min       1545.000000
## 25%       1817.000000
## 50%       1831.000000
## 75%       1953.000000
## max       2012.000000
## Name: year, dtype: float64

The df.describe() command provides us with a 5-number summary and then some additional statistics.

We can also create a chart:

artwork.year.hist(bins = 30)

Personally, I much prefer the graphical version. It’s informative (though it does leave out NA values) and shows that there are pieces going back to the 1500s, but that most pieces were made in the early 1800s or late 1900s.

15.2.5 Create a Histogram (base graphics/matplotlib)

We might be interested in the aspect ratio of the artwork - let’s take a look at the input variables and define new variables related to aspect ratio(s).

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=1, ncols=3) # 3 subplots

artwork.width.hist(bins = 30, ax = axes[0])
artwork.depth.hist(bins = 30, ax = axes[1])
artwork.height.hist(bins = 30, ax= axes[2])

# Set subplot titles
axes[0].title.set_text("width")
axes[1].title.set_text("depth")
axes[2].title.set_text("height")

plt.show()

So all of our variables are skewed quite a bit, and we know from the existence of the units column that they may not be in the same unit, either.

15.2.6 Summary Tables

Let’s make a table of the units column so that we can see what the frequency of various units are in the dataset.

table(artwork$units, useNA = 'ifany')
## 
##    mm  <NA> 
## 65860  3341
artwork.units.value_counts(dropna=False)
## units
## mm     65860
## NaN     3341
## Name: count, dtype: int64

Everything that has specified units is in mm. That makes things easier.

15.2.7 Defining a new variable

To define a new variable that exists on its own, we might do something like this:

aspect_hw <- artwork$height/artwork$width
par(mfrow = c(1, 2))
hist(aspect_hw, breaks = 30)
hist(log(aspect_hw), breaks = 30)

import numpy as np

fig, axes = plt.subplots(nrows=1, ncols=2) # 2 subplots

aspect_hw = artwork.height/artwork.width
aspect_hw.hist(bins = 30, ax = axes[0])
np.log(aspect_hw).hist(bins = 30, ax = axes[1])

Most things are pretty square-ish, but there are obviously quite a few exceptions in both directions.

The one problem with how we’ve done this is that we now have a data frame with all of our data in it, and a separate variable aspect_hw, that is not attached to our data frame. That’s not ideal - it’s easy to lose track of the variable, it’s easy to accidentally “sort” the variable so that the row order isn’t the same as in the original data frame… there are all sorts of potential issues.

15.2.8 Adding a new column

The better way to define a new variable is to add a new column to the data frame:

To define a new variable that exists on its own, we might do something like this:

artwork$aspect_hw <- artwork$height/artwork$width
artwork['aspect_hw'] = artwork.height/artwork.width

Note that when you create a new column in a pandas dataframe, you have to use df['colname'] on the left hand side, even if you use df.colname syntax on the right hand side.

(We’ll learn a shorter way to do this later, but this is functional, if not pretty, for now).

The downside to this is that we have to write out artwork$aspect_hw or artwork.aspect_hw each time we want to reference the variable. That is a pain, but one that’s relatively temporary (we’ll get to a better way to do this in a couple of weeks). A little bit of extra typing is definitely worth it if you don’t lose data you want to keep.

Assign your calculations to a variable or column!

One mistake I see people make frequently is to calculate height/width, but then not assign that value to a variable.

If you’re not using <- in R1 or = in Python, then you’re not saving that information to be referenced later - you’re just calculating values temporarily and possibly printing them as output.

15.2.9 Conclusions

It’s important to keep track of where you’re putting the pieces you create during an analysis - just as important as keeping track of the different sub-components when you’re putting a lego set together or making a complex recipe in the kitchen. Forgetting to assign your calculation to a variable is like dumping your glaze down the sink or throwing that small lego component into the trash.

15.3 Dogs of NYC

New York City provides a whole host of open-data resources, including a dataset of dogs licensed in the city on an annual basis (link is to the NYC Open Data Page).

CSV link (this data is ~23 MB)

15.3.1 Read in data

library(readr)

if (!file.exists("../data/NYC_dogs.csv")) {
  # if the file doesn't exist, download it!
  download.file(
    "https://data.cityofnewyork.us/api/views/nu7n-tubp/rows.csv?accessType=DOWNLOAD", # url for download
    destfile = "../data/NYC_dogs.csv", # location to store the file
    mode = "wb" # need this to get downloads to work on windows
  )
}

dogs <- read_csv("../data/NYC_dogs.csv")
head(dogs)
## # A tibble: 6 × 8
##   AnimalName AnimalGender AnimalBirthYear BreedName    ZipCode LicenseIssuedDate
##   <chr>      <chr>                  <dbl> <chr>          <dbl> <chr>            
## 1 PAIGE      F                       2014 American Pi…   10035 09/12/2014       
## 2 YOGI       M                       2010 Boxer          10465 09/12/2014       
## 3 ALI        M                       2014 Basenji        10013 09/12/2014       
## 4 QUEEN      F                       2013 Akita Cross…   10013 09/12/2014       
## 5 LOLA       F                       2009 Maltese        10028 09/12/2014       
## 6 IAN        M                       2006 Unknown        10013 09/12/2014       
## # ℹ 2 more variables: LicenseExpiredDate <chr>, `Extract Year` <dbl>
from os.path import exists # to test whether files exist
import pandas as pd
import requests # to download a file

if ~exists("../data/NYC_dogs.csv"):
  response = requests.get("https://data.cityofnewyork.us/api/views/nu7n-tubp/rows.csv?accessType=DOWNLOAD")
  open("../data/NYC_dogs.csv", "wb").write(response.content)
## 38715790

dogs = pd.read_csv("../data/NYC_dogs.csv")
dogs.head()
##   AnimalName AnimalGender  ... LicenseExpiredDate Extract Year
## 0      PAIGE            F  ...         09/12/2017         2016
## 1       YOGI            M  ...         10/02/2017         2016
## 2        ALI            M  ...         09/12/2019         2016
## 3      QUEEN            F  ...         09/12/2017         2016
## 4       LOLA            F  ...         10/09/2017         2016
## 
## [5 rows x 8 columns]

15.3.2 Work with Dates

One thing we might want to do first is to transform the license dates (LicenseIssuedDate, LicenseExpiredDate) into actual dates instead of characters.

We will use the lubridate package to do this, because it is designed to make working with dates and times very easy.

You may need to run install.packages("lubridate") in the R console if you have not used the package before.

library(lubridate)
head(dogs$LicenseExpiredDate) # Dates are in month-day-year format
## [1] "09/12/2017" "10/02/2017" "09/12/2019" "09/12/2017" "10/09/2017"
## [6] "10/30/2019"

dogs$LicenseExpiredDate <- mdy(dogs$LicenseExpiredDate)
dogs$LicenseIssuedDate <- mdy(dogs$LicenseIssuedDate)

head(dogs$LicenseExpiredDate)
## [1] "2017-09-12" "2017-10-02" "2019-09-12" "2017-09-12" "2017-10-09"
## [6] "2019-10-30"

You may need to run pip install datetime in the terminal if you have not used the package before.

from datetime import date

dogs[['LicenseExpiredDate','LicenseIssuedDate']].head() # Before
##   LicenseExpiredDate LicenseIssuedDate
## 0         09/12/2017        09/12/2014
## 1         10/02/2017        09/12/2014
## 2         09/12/2019        09/12/2014
## 3         09/12/2017        09/12/2014
## 4         10/09/2017        09/12/2014

format_str = "%m/%d/%Y" # date format in the dataset

dogs['LicenseExpiredDate'] = pd.to_datetime(dogs.LicenseExpiredDate, format = format_str)
dogs['LicenseIssuedDate'] = pd.to_datetime(dogs.LicenseIssuedDate, format = format_str)

dogs[['LicenseExpiredDate','LicenseIssuedDate']].head() # After
##   LicenseExpiredDate LicenseIssuedDate
## 0         2017-09-12        2014-09-12
## 1         2017-10-02        2014-09-12
## 2         2019-09-12        2014-09-12
## 3         2017-09-12        2014-09-12
## 4         2017-10-09        2014-09-12

It might be interesting to see when licenses have been issued over time, so let’s make a histogram. This time, I’m going to use ggplot graphics with the ggplot2 package in R and the plotnine package in python (which is the python version of the R package).

15.3.3 Create a Histogram (ggplot2/plotnine)

You may need to run install.packages("ggplot2") in the R console if you have not used ggplot2 before.

library(ggplot2)

ggplot(
  data = dogs, 
  aes(x = LicenseIssuedDate) # Specify we want LicenseIssueDate on the x-axis
) + 
  geom_histogram() # Create a histogram

You may need to run pip install plotnine in the terminal if you have not used the package before.

from plotnine import *

(
  ggplot(aes(x = 'LicenseIssuedDate'), data = dogs) + 
  geom_histogram() # Create a histogram
)
## ggplot.__init__() got multiple values for argument 'data'

There is an interesting periodicity to the license issue dates.

15.3.4 Compute License Length

I’m also curious about how long a license tends to be held for - we can get this information by subtracting the issue date from the expiration date.

dogs$LicenseLength <- dogs$LicenseExpiredDate - dogs$LicenseIssuedDate
summary(dogs$LicenseLength)
##   Length    Class     Mode 
##   616890 difftime  numeric
head(dogs$LicenseLength)
## Time differences in days
## [1] 1096 1116 1826 1096 1123 1874

We can see that directly subtracting date-times gives us a license length in days. That’s useful enough, I guess, but it might be more useful in years… unfortunately, that’s not an option for difftime()

library(ggplot2)
dogs$LicenseLength <- difftime(dogs$LicenseExpiredDate, dogs$LicenseIssuedDate, units = "weeks")

# 52 weeks in a year so we'll just convert as we plot
ggplot(data = dogs, aes(x = LicenseLength / 52 )) + geom_histogram() + 
  scale_x_continuous(limits = c(0,10))

dogs["License_length"] = dogs.LicenseExpiredDate - dogs.LicenseIssuedDate

dogs.License_length.describe()
## count                         616808
## mean     513 days 21:59:06.164122400
## std      389 days 21:45:53.476324672
## min                  1 days 00:00:00
## 25%                365 days 00:00:00
## 50%                366 days 00:00:00
## 75%                411 days 00:00:00
## max               7913 days 00:00:00
## Name: License_length, dtype: object
dogs.License_length.head()
## 0   1096 days
## 1   1116 days
## 2   1826 days
## 3   1096 days
## 4   1123 days
## Name: License_length, dtype: timedelta64[ns]

dogs["License_length_yr"] = dogs.License_length.dt.days/365.25
(
  ggplot(aes(x = "License_length_yr"), data = dogs) + 
  geom_histogram(bins = 30)+
  scale_x_continuous(limits = (0,10))
)
## ggplot.__init__() got multiple values for argument 'data'

In python, we have to first access the “days” attribute of the timedelta64 data type (this gives us a number) using dogs.Licence_length.dt.days and then divide by 365.25 (number of days in a year, on average).

15.3.5 Explore Boroughs

Another question that I have when looking at this dataset is a bit more superficial - are the characteristics of different areas different? The dogs data frame has a Borough column, but it’s not actually filled in, so we’ll need to get rid of it and then add Borough back in by zip code.

To look at this, we’ll need a bit more data. I found a list of NYC zip codes by borough, which we can merge in with the data we already have to get puppy registrations by borough. Then, we can see if e.g. the top 10 breeds are different for different boroughs. To simplify this, I’m going to link to a file to merge in, and not show you the specifics of how I read the table from this site.

borough_zip <- read_csv("https://raw.githubusercontent.com/srvanderplas/unl-stat850/main/data/nyc_zip_borough.csv")
## Error in open.connection(structure(4L, class = c("curl", "connection"), conn_id = <pointer: 0x58e>), : HTTP error 404.

# Remove the Borough column from dogs
dogs <- dogs[, which(names(dogs) != "Borough")]
dogs <- merge(dogs, borough_zip, by = "ZipCode")
## Error in eval(expr, envir, enclos): object 'borough_zip' not found
head(dogs)
## # A tibble: 6 × 9
##   AnimalName AnimalGender AnimalBirthYear BreedName    ZipCode LicenseIssuedDate
##   <chr>      <chr>                  <dbl> <chr>          <dbl> <date>           
## 1 PAIGE      F                       2014 American Pi…   10035 2014-09-12       
## 2 YOGI       M                       2010 Boxer          10465 2014-09-12       
## 3 ALI        M                       2014 Basenji        10013 2014-09-12       
## 4 QUEEN      F                       2013 Akita Cross…   10013 2014-09-12       
## 5 LOLA       F                       2009 Maltese        10028 2014-09-12       
## 6 IAN        M                       2006 Unknown        10013 2014-09-12       
## # ℹ 3 more variables: LicenseExpiredDate <date>, `Extract Year` <dbl>,
## #   LicenseLength <drtn>
borough_zip = pd.read_csv("https://raw.githubusercontent.com/srvanderplas/unl-stat850/main/data/nyc_zip_borough.csv")
## HTTP Error 404: Not Found

dogs = dogs.drop('Borough', axis = 1) # drop borough column
## "['Borough'] not found in axis"
dogs = pd.merge(dogs, borough_zip, on = 'ZipCode')
## name 'borough_zip' is not defined
dogs.head()
##   AnimalName AnimalGender  ... License_length License_length_yr
## 0      PAIGE            F  ...      1096 days          3.000684
## 1       YOGI            M  ...      1116 days          3.055441
## 2        ALI            M  ...      1826 days          4.999316
## 3      QUEEN            F  ...      1096 days          3.000684
## 4       LOLA            F  ...      1123 days          3.074606
## 
## [5 rows x 10 columns]

Now that we have borough, let’s write a function that will take a dataset and spit out a list of the top 5 dog breeds registered in that area.

15.3.6 Custom Summary Function

top_5_breeds <- function(data) {
  # Inside the function, our dataset is called data, not dogs
  tmp <- table(data$BreedName) 
  return(sort(tmp, decreasing = T)[1:5]) # top 5 breeds with counts
}

def top_5_breeds(data):
  tmp = pd.value_counts(data.BreedName)
  return tmp.iloc[0:5]

15.3.7 For Loop Summary

Now, using that function, lets write a for loop that loops through the 5 boroughs and spits out the top 5 breeds in each borough:

boroughs <- unique(borough_zip$Borough) # get a list of the 5 boroughs
## Error in eval(expr, envir, enclos): object 'borough_zip' not found
for (i in boroughs) {
  # Get subset of data frame corresponding to the Borough
  dogs_sub <- dogs[dogs$Borough == i,]
  # Get top 5 dog breeds
  result <- as.data.frame(top_5_breeds(dogs_sub))
  # set names
  names(result) <- c("Breed", "Freq")
  # Add Borough as a new column
  result$Borough <- i
  # Add rank as a new column
  result$rank <- 1:5
  
  print(result)
}
## Error in eval(expr, envir, enclos): object 'boroughs' not found
boroughs = borough_zip.Borough.unique()
## name 'borough_zip' is not defined
for i in boroughs:
  # get subset of data frame corresponding to the borough
  dogs_sub = dogs.query("Borough == @i")
  # Get top 5 breeds
  result = top_5_breeds(dogs_sub)
  # Convert to DataFrame and make the index another column
  result = result.to_frame().reset_index()
  # Rename columns
  result.rename(columns = {'index':'BreedName','BreedName':'count'})
  # Add Borough column
  result["Borough"] = i
  # Add rank column
  result["rank"] = range(1, 6)

  print(result)
## name 'boroughs' is not defined

More information on pandas query function (use \@varname to use a variable in a query).

15.3.8 Summary Data Frame

If we wanted to save these results as a summary data frame, we could totally do that!

breeds_by_borough <- data.frame() # create a blank data frame

for (i in boroughs) {
  # Get subset of data frame corresponding to the Borough
  dogs_sub <- subset(dogs, Borough == i)
  # Get top 5 dog breeds
  result <- as.data.frame(top_5_breeds(dogs_sub))
  # set names
  names(result) <- c("Breed", "Freq")
  # Add Borough as a new column
  result$Borough <- i
  # Add rank as a new column
  result$rank <- 1:5
  
  breeds_by_borough <- rbind(breeds_by_borough, result)
}
## Error in eval(expr, envir, enclos): object 'boroughs' not found

breeds_by_borough
## data frame with 0 columns and 0 rows

We could even sort our data by the rank and Borough for easier comparisons:


breeds_by_borough[order(breeds_by_borough$rank, 
                        breeds_by_borough$Borough),]
## Error in order(breeds_by_borough$rank, breeds_by_borough$Borough): argument 1 is not a vector
breeds_by_borough = pd.DataFrame() # Create a blank dataframe

for i in boroughs:
  print(i)
  # get subset of data frame corresponding to the borough
  dogs_sub = dogs.query("Borough== @i")
  # Get top 5 breeds
  result = top_5_breeds(dogs_sub)
  # Convert to DataFrame and make the index another column
  result = result.to_frame().reset_index()
  # Rename columns
  result.rename(columns = {'index':'BreedName','BreedName':'count'})
  # Add Borough column
  result["Borough"] = i
  # Add rank column
  result["rank"] = range(1, 6)
  # Append to blank dataframe
  breeds_by_borough = breeds_by_borough.append(result)
## name 'boroughs' is not defined

breeds_by_borough.head()
## Empty DataFrame
## Columns: []
## Index: []
breeds_by_borough.tail()
## Empty DataFrame
## Columns: []
## Index: []

We could even sort our data by the rank and Borough for easier comparisons:


breeds_by_borough.sort_values(['rank', 'Borough'])
## 'rank'

Soon we’ll learn a much shorter set of commands to get these types of summaries, but it’s important to know how a for loop connects to the concept of summarizing data by a factor (in this case, by borough).

Try it out: NYC dogs

Look at the name, age, or gender of dogs registered in NYC and see if you can come up with a similar function and way of summarizing the data in a for-loop. You may want to calculate the mean or quantiles (for numeric variables), or list the most common dog names/genders in each borough.

15.4 Swearing in Tarantino Films

Content warning: This section contains analysis of swear words and deaths. I will not censor the words used in these movies, as they are legitimate data and could lead to an interesting analysis. Feel free to skip this example if it makes you uncomfortable.

Quentin Jerome Tarantino (/ˌtærənˈtiːnoʊ/; born March 27, 1963) is an American film director, screenwriter, producer, actor, and author. His films are characterized by stylized violence, extended dialogue including a pervasive use of profanity, and references to popular culture. [1]

15.4.1 Read in data

library(readr)

tarantino <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/tarantino/tarantino.csv")
head(tarantino)
## # A tibble: 6 × 4
##   movie          type  word     minutes_in
##   <chr>          <chr> <chr>         <dbl>
## 1 Reservoir Dogs word  dick           0.4 
## 2 Reservoir Dogs word  dicks          0.43
## 3 Reservoir Dogs word  fucked         0.55
## 4 Reservoir Dogs word  fucking        0.61
## 5 Reservoir Dogs word  bullshit       0.61
## 6 Reservoir Dogs word  fuck           0.66
import pandas as pd

tarantino = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/tarantino/tarantino.csv")
tarantino.head()
##             movie  type      word  minutes_in
## 0  Reservoir Dogs  word      dick        0.40
## 1  Reservoir Dogs  word     dicks        0.43
## 2  Reservoir Dogs  word    fucked        0.55
## 3  Reservoir Dogs  word   fucking        0.61
## 4  Reservoir Dogs  word  bullshit        0.61

15.4.2 Create a Density Plot (ggplot2/plotnine)

You may need to run install.packages("ggplot2") in the R console if you have not used ggplot2 before.

library(ggplot2)

ggplot(
  data = tarantino, 
  aes(x = minutes_in, color = movie)
) + 
  geom_density() + 
  facet_wrap(~type)

You may need to run pip install plotnine in the terminal if you have not used the package before.

from plotnine import *

(
  ggplot(tarantino, aes(x = 'minutes_in', color = 'movie')) + 
  geom_density() + 
  facet_wrap("type")
)
## <Figure Size: (640 x 480)>

So, from these plots, we can see that in at least two movies, there are high spikes in deaths about 1/3 and 2/3 of the way in; in another movie, most of the deaths occur in the first 25 minutes. Swearing, on the other hand, seems to be fairly evenly distributed throughout the movies.

15.4.3 Compare Swear Words Used by Movie

As there are a very large number of swear words and variants in Tarantino movies, let’s work with only the 6 most common swear words in the data set. To do this, we have to:

  1. Select only rows that have words (as opposed to deaths)
  2. Assemble a list of the 6 most common words
  3. Select only rows with those words
library(ggplot2)
library(dplyr)
# Step 1
tarantino_words <- tarantino[tarantino$type == "word",]

# Step 2
word_freq <- sort(table(tarantino_words$word), decreasing = T)
# word_freq has the counts of how many times the words appear
# we need the names that are above those counts
swear6 <- names(word_freq)[1:6]

# Step 3
word_6 <- tarantino_words[tarantino_words$word %in% swear6,]



ggplot(
  data = word_6, 
  aes(x = movie, fill = word)
) + 
  geom_bar() + 
  coord_flip()

from plotnine import *

# Step 1 - remove deaths
tarantino_words = tarantino.query("type == 'word'")

# Step 2 - 6 most common words


(
  ggplot(tarantino, aes(x = 'minutes_in', color = 'movie')) + 
  geom_density() + 
  facet_wrap("type")
)
## <Figure Size: (640 x 480)>

XXX Under construction - I will add more as I get time.

[1]
Wikipedia contributors, “Quentin Tarantino,” Wikipedia. Aug. 2023 [Online]. Available: https://en.wikipedia.org/wiki/Quentin_Tarantino. [Accessed: Aug. 29, 2023]