library(dplyr) # load the pipe %>%
rnorm(100) %>%
mean()
## [1] -0.1306156
rnorm(100) |> mean()
## [1] 0.1521778
6 Data Structures
Aside: Data Pipes
Pipes are useful items for moving things from one place to another. In data programming, pipes are operators that let us move data around. In R, we have two primary pipes that are similar (you may see both used if you google for code online). Any R version after 4.1 has a built-in pipe, |>
; the tidyverse
libraries use a pipe from the magrittr
package, %>%
.
For right now, it’s ok to think of the two pipes as essentially the same (but you can read about the differences here).
Fundamentally, a pipe allows you to take a function b()
and apply it to x
, like b(x)
, but write it as x |> b()
or x %>% b()
. This is particularly useful in cases where there are multiple sequential analysis steps, because where in regular notation you have to read the functions from the inside out to understand the sequential steps, with pipes, you have a clear step-by-step list of the order of operations.
In Python, there is a pipe
function in the Pandas library that works using .pipe(function)
notation. You can see this example for more information. From what I’ve seen reading code online, however, pipes are less commonly used in Python code than they are in R code. That’s ok - languages have different conventions, and it is usually best to adopt the convention of the language you’re working in so that your code can be read, run, and maintained by others more easily.
Use the rnorm
function in R to generate 100 draws from a standard normal distribution, then use the pipe to calculate the mean.
Calculate the mean of 100 random normal variables in python.
import numpy as np
import pandas as pd
= pd.Series(np.random.normal(size = 100))
nums
nums.mean()## -0.08844537735507918
The conclusion here is that it’s far easier to not use the pipe in python because the .function
notation that python uses mimics the step-by-step approach of pipes in R even without using the actual pipe function. When you use data frames instead of Series, you might start using the pipe, but only in some circumstances.
6.1 Motivation: Working with Multiple Vectors
In the previous chapter, we talked about homogeneous structures: arrangements of data, like vectors and matrices, where every entry in the larger structure has the same type. In this chapter, we’ll be talking about the root of most data science analysis projects: the data frame.
Like an excel spreadsheet, data frames are arrangements of data in columns and rows.
This format has two main restrictions:
- Every entry in each column must have the same data type
- Every column must have the same number of rows
The picture above shows a data frame of 4 columns, each with a different data type (brick size/hue). The data frame has 12 rows. This picture may look similar to one that we used to show logical indexing in the last chapter, and that is not a coincidence. You can get everything from a data frame that you would get from a collection of 4 separate vectors… but there are advantages to keeping things in a data frame instead.
In the previous chapter, we learned how to make different vectors in R, numpy, and pandas. Consider for a moment https://worldpopulationreview.com/states, which lists the population of each state. You can find this dataset in CSV form here.
Multiple vectors in Python
(I’m going to cheat and read this in using pandas functions we haven’t learned yet to demonstrate why this stuff matters.)
import pandas as pd
= pd.read_html("https://worldpopulationreview.com/states")[0]
data list(data.columns) # get names
# Create a few population series
## ['Rank', 'State', '2022 Population', 'Growth 2022', '2021 Population', '2010 Census', 'Growth Since 2010', '% of US', 'Density (mi²)']
= pd.Series(data['2022 Population'].values, index = data['State'].values)
population2022 = pd.Series(data['2021 Population'].values, index = data['State'].values)
population2021 = pd.Series(data['2010 Census'].values, index = data['State'].values) population2010
Suppose that we want to sort each population vector by the population in that year.
import pandas as pd
= pd.read_html("https://worldpopulationreview.com/states")[0]
data
= pd.Series(data['2022 Population'].values, index = data['State'].values).sort_values()
population2022 = pd.Series(data['2021 Population'].values, index = data['State'].values).sort_values()
population2021 = pd.Series(data['2010 Census'].values, index = data['State'].values).sort_values()
population2010
population2022.head()## Wyoming 582233
## Vermont 622882
## District of Columbia 718355
## Alaska 720763
## North Dakota 774008
## dtype: int64
population2021.head()## Wyoming 581075
## Vermont 623251
## District of Columbia 714153
## Alaska 724357
## North Dakota 770026
## dtype: int64
population2010.head()## Wyoming 564487
## District of Columbia 605226
## Vermont 625879
## North Dakota 674715
## Alaska 713910
## dtype: int64
The only problem is that by doing this, we’ve now lost the ordering that matched across all 3 vectors. Pandas Series are great for this, because they use labels that allow us to reconstitute which value corresponds to which label, but in R or even in numpy arrays, vectors don’t inherently come with labels. In these situations, sorting by one value can actually destroy the connection between two vectors!
Vector-based analysis in R
<- read.csv("https://raw.githubusercontent.com/srvanderplas/Stat151/main/data/population2022.csv")
df
# Use vectors instead of the data frame
<- df$State
state <- df$Pop
pop2022 <- df$Pop2021
pop2021 <- df$Pop2010
pop2010
# Create a vector to index population in 2022 in order
<- order(pop2022)
order2022
# To keep variables together, we have to do things like this:
head(state[order2022])
## [1] "Wyoming" "Vermont" "District of Columbia"
## [4] "Alaska" "North Dakota" "South Dakota"
head(pop2022[order2022])
## [1] 582233 622882 718355 720763 774008 902542
# It makes more sense just to reorder the whole data frame:
head(df[order2022,])
## rank State Pop Growth Pop2021 Pop2010 growthSince2010
## 52 52 Wyoming 582233 0.0020 581075 564487 0.0314
## 51 51 Vermont 622882 -0.0006 623251 625879 -0.0048
## 50 50 District of Columbia 718355 0.0059 714153 605226 0.1869
## 49 49 Alaska 720763 -0.0050 724357 713910 0.0096
## 48 48 North Dakota 774008 0.0052 770026 674715 0.1472
## 47 47 South Dakota 902542 0.0066 896581 816166 0.1058
## Percent density
## 52 0.0017 5.9967
## 51 0.0019 67.5797
## 50 0.0021 11776.3115
## 49 0.0021 1.2631
## 48 0.0023 11.2173
## 47 0.0027 11.9052
The primary advantage to data frames is that rows of data are kept together. Since we often think of a row of data as a single observation in a sample, this is an extremely important feature that makes data frames a huge improvement on a collection of vectors of the same length: it’s much harder for observations in a single row to get shuffled around and mismatched!
In R, data frames are built in as type data.frame
, though there are packages that provide other implementations of data frames that have additional features, such as the tibble
package used in many other common packages. We will cover functions from both base R and the tibble
package in this chapter.
In Python, we will use the pandas
library, which is conventionally abbreviated pd
. So before you use any data frames in python, you will need to add the following line to your code: import pandas as pd
.
6.2 Creating Data Frames
6.2.1 From Scratch
If you want to create a data frame “from scratch” in either R or python, the easiest way to do so is to construct a list of vectors.
Data sourced from Wikipedia’s List of Oldest dogs
Creating Data frames from scratch in R
<- c("Bluey", "Bramble", "Chanel", "Max")
dog_names <- c(29.41, 25, 21, 29.77)
dog_ages
# Using the data.frame function
<- data.frame(Name = dog_names, Age = dog_ages)
data
data## Name Age
## 1 Bluey 29.41
## 2 Bramble 25.00
## 3 Chanel 21.00
## 4 Max 29.77
# Using the tibble function
library(tibble)
<- tibble(Name = dog_names, Age = dog_ages)
data # Notice the difference in how the object is printed...
data## # A tibble: 4 × 2
## Name Age
## <chr> <dbl>
## 1 Bluey 29.4
## 2 Bramble 25
## 3 Chanel 21
## 4 Max 29.8
# Using the tribble function in the tibble package
<- tribble(~Name, ~Age,
data "Bluey", 29.41,
"Bramble", 25,
"Chanel", 21,
"Max", 29.77)
# This allows you to write out the data yourself in table format
# Column Names are indicated by putting ~ before the (bare) column name
data## # A tibble: 4 × 2
## Name Age
## <chr> <dbl>
## 1 Bluey 29.4
## 2 Bramble 25
## 3 Chanel 21
## 4 Max 29.8
Creating Data frames from scratch in python
import pandas as pd
# Create a list of lists
= [['Bluey', 29.41],
data 'Bramble', 25],
['Chanel', 21],
['Max', 29.77]]
[
= pd.DataFrame(data, columns = ['Name', 'Age'])
data
# Create a dict with lists
= {'Name': ['Bluey', 'Bramble', 'Chanel', 'Max'],
data 'Age': [29.41, 25, 21, 29.77]}
= pd.DataFrame(data) data
I am intentionally not discussing dictionaries (dicts) in Python at this point - my goal is to get you up and running to do data analysis in Python with as little overhead as possible. If you are interested, you can read up on dicts in Python 4 Everybody. We will hopefully find time to come back and discuss the finer points of lists, dicts, tuples, and other constructs later in the semester or in a subsequent course.
6.2.2 Reading in Data
One of the easier ways to create a data frame (rather than typing the whole thing in) is to read in data from somewhere else - a file, a table on a webpage, etc. We’re not going to go into the finer points of this (you’ll get into that in Stat 251, Data Wrangling), but it is important to at least know how to read in relatively nicely formatted data.
One nice source of (relatively neat) data is the TidyTuesday github repository1.
In Base R, we can read the data in using the read.csv
function
<- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-08/airmen.csv')
airmen head(airmen)
## name last_name first_name graduation_date
## 1 Adams, John H., Jr. Adams John H., Jr. 1945-04-15T00:00:00Z
## 2 Adams, Paul Adams Paul 1943-04-29T00:00:00Z
## 3 Adkins, Rutherford H. Adkins Rutherford H. 1944-10-16T00:00:00Z
## 4 Adkins, Winston A. Adkins Winston A. 1944-02-08T00:00:00Z
## 5 Alexander, Halbert L. Alexander Halbert L. 1944-11-20T00:00:00Z
## 6 Alexander, Harvey R. Alexander Harvey R. 1944-04-15T00:00:00Z
## rank_at_graduation class graduated_from pilot_type
## 1 2nd Lt SE-45-B TAAF Single engine
## 2 2nd Lt SE-43-D TAAF Single engine
## 3 2nd Lt SE-44-I-1 TAAF Single engine
## 4 2nd Lt TE-44-B TAAF Twin engine
## 5 2nd Lt SE-44-I TAAF Single engine
## 6 2nd Lt TE-44-D TAAF Twin engine
## military_hometown_of_record state aerial_victory_credits
## 1 Kansas City KS <NA>
## 2 Greenville SC <NA>
## 3 Alexandria VA <NA>
## 4 Chicago IL <NA>
## 5 Georgetown IL <NA>
## 6 Georgetown IL <NA>
## number_of_aerial_victory_credits reported_lost reported_lost_date
## 1 0 <NA> <NA>
## 2 0 <NA> <NA>
## 3 0 <NA> <NA>
## 4 0 <NA> <NA>
## 5 0 <NA> <NA>
## 6 0 <NA> <NA>
## reported_lost_location web_profile
## 1 <NA> https://cafriseabove.org/john-h-adams-jr/
## 2 <NA> https://cafriseabove.org/paul-adams/
## 3 <NA> https://cafriseabove.org/rutherford-h-adkins/
## 4 <NA> <NA>
## 5 <NA> https://cafriseabove.org/halbert-l-alexander/
## 6 <NA> https://cafriseabove.org/harvey-r-alexander/
If we want instead to create a tibble, we can use the readr
package’s read_csv
function, which is a bit more robust.
library(readr)
<- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-08/airmen.csv')
airmen head(airmen)
## # A tibble: 6 × 16
## name last_name first_name graduation_date rank_at_graduat… class
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 Adams, John H… Adams John H., … 1945-04-15 00:00:00 2nd Lt SE-4…
## 2 Adams, Paul Adams Paul 1943-04-29 00:00:00 2nd Lt SE-4…
## 3 Adkins, Ruthe… Adkins Rutherfor… 1944-10-16 00:00:00 2nd Lt SE-4…
## 4 Adkins, Winst… Adkins Winston A. 1944-02-08 00:00:00 2nd Lt TE-4…
## 5 Alexander, Ha… Alexander Halbert L. 1944-11-20 00:00:00 2nd Lt SE-4…
## 6 Alexander, Ha… Alexander Harvey R. 1944-04-15 00:00:00 2nd Lt TE-4…
## # … with 10 more variables: graduated_from <chr>, pilot_type <chr>,
## # military_hometown_of_record <chr>, state <chr>,
## # aerial_victory_credits <chr>, number_of_aerial_victory_credits <dbl>,
## # reported_lost <chr>, reported_lost_date <dttm>,
## # reported_lost_location <chr>, web_profile <chr>
In pandas
, we can read the csv using pd.read_csv
import pandas as pd
= pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-08/airmen.csv")
airmen
airmen.head()## name ... web_profile
## 0 Adams, John H., Jr. ... https://cafriseabove.org/john-h-adams-jr/
## 1 Adams, Paul ... https://cafriseabove.org/paul-adams/
## 2 Adkins, Rutherford H. ... https://cafriseabove.org/rutherford-h-adkins/
## 3 Adkins, Winston A. ... NaN
## 4 Alexander, Halbert L. ... https://cafriseabove.org/halbert-l-alexander/
##
## [5 rows x 16 columns]
6.3 Working with Data Frames
6.3.1 Summaries (part 1)
Often, we want to know what a data frame contains. R and pandas both have easy summary methods for a data frame:
summary(airmen)
## name last_name first_name
## Length:1006 Length:1006 Length:1006
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## graduation_date rank_at_graduation class
## Min. :1942-03-06 00:00:00.000 Length:1006 Length:1006
## 1st Qu.:1943-10-22 00:00:00.000 Class :character Class :character
## Median :1944-05-23 00:00:00.000 Mode :character Mode :character
## Mean :1944-07-02 13:18:52.462
## 3rd Qu.:1945-04-15 00:00:00.000
## Max. :1948-10-12 00:00:00.000
## NA's :11
## graduated_from pilot_type military_hometown_of_record
## Length:1006 Length:1006 Length:1006
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## state aerial_victory_credits number_of_aerial_victory_credits
## Length:1006 Length:1006 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.1118
## 3rd Qu.:0.0000
## Max. :4.0000
##
## reported_lost reported_lost_date reported_lost_location
## Length:1006 Min. :1943-07-02 Length:1006
## Class :character 1st Qu.:1943-07-02 Class :character
## Mode :character Median :1943-07-02 Mode :character
## Mean :1943-07-02
## 3rd Qu.:1943-07-02
## Max. :1943-07-02
## NA's :1004
## web_profile
## Length:1006
## Class :character
## Mode :character
##
##
##
##
library(skimr) # Fancier summaries
skim(airmen)
Name | airmen |
Number of rows | 1006 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 13 |
numeric | 1 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 9 | 28 | 0 | 1003 | 0 |
last_name | 0 | 1.00 | 3 | 12 | 0 | 617 | 0 |
first_name | 0 | 1.00 | 3 | 17 | 0 | 804 | 0 |
rank_at_graduation | 5 | 1.00 | 3 | 14 | 0 | 7 | 0 |
class | 20 | 0.98 | 3 | 9 | 0 | 72 | 0 |
graduated_from | 0 | 1.00 | 4 | 23 | 0 | 4 | 0 |
pilot_type | 0 | 1.00 | 11 | 13 | 0 | 5 | 0 |
military_hometown_of_record | 9 | 0.99 | 3 | 19 | 0 | 366 | 0 |
state | 11 | 0.99 | 2 | 5 | 0 | 48 | 0 |
aerial_victory_credits | 934 | 0.07 | 31 | 137 | 0 | 50 | 0 |
reported_lost | 1004 | 0.00 | 1 | 1 | 0 | 1 | 0 |
reported_lost_location | 1004 | 0.00 | 23 | 23 | 0 | 1 | 0 |
web_profile | 813 | 0.19 | 34 | 95 | 0 | 190 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
number_of_aerial_victory_credits | 0 | 1 | 0.11 | 0.46 | 0 | 0 | 0 | 0 | 4 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
graduation_date | 11 | 0.99 | 1942-03-06 | 1948-10-12 | 1944-05-23 | 52 |
reported_lost_date | 1004 | 0.00 | 1943-07-02 | 1943-07-02 | 1943-07-02 | 1 |
Notice that the type of summary depends on the data type.
# All variables - strings are summarized with NaNs
= 'all')
airmen.describe(include
# Only summarize numeric variables
## name ... web_profile
## count 1006 ... 193
## unique 1003 ... 190
## top Brothers, James E. ... https://cafriseabove.org/captain-graham-smith-...
## freq 2 ... 2
## mean NaN ... NaN
## std NaN ... NaN
## min NaN ... NaN
## 25% NaN ... NaN
## 50% NaN ... NaN
## 75% NaN ... NaN
## max NaN ... NaN
##
## [11 rows x 16 columns]
= [np.number])
airmen.describe(include
# Only summarize string variables (objects)
## number_of_aerial_victory_credits
## count 1006.000000
## mean 0.111829
## std 0.457844
## min 0.000000
## 25% 0.000000
## 50% 0.000000
## 75% 0.000000
## max 4.000000
= ['O'])
airmen.describe(include
# Get counts of how many NAs in each column
## name ... web_profile
## count 1006 ... 193
## unique 1003 ... 190
## top Brothers, James E. ... https://cafriseabove.org/captain-graham-smith-...
## freq 2 ... 2
##
## [4 rows x 15 columns]
=True)
airmen.info(show_counts## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1006 entries, 0 to 1005
## Data columns (total 16 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 name 1006 non-null object
## 1 last_name 1006 non-null object
## 2 first_name 1006 non-null object
## 3 graduation_date 995 non-null object
## 4 rank_at_graduation 999 non-null object
## 5 class 986 non-null object
## 6 graduated_from 1006 non-null object
## 7 pilot_type 1006 non-null object
## 8 military_hometown_of_record 997 non-null object
## 9 state 995 non-null object
## 10 aerial_victory_credits 72 non-null object
## 11 number_of_aerial_victory_credits 1006 non-null float64
## 12 reported_lost 2 non-null object
## 13 reported_lost_date 2 non-null object
## 14 reported_lost_location 2 non-null object
## 15 web_profile 193 non-null object
## dtypes: float64(1), object(15)
## memory usage: 125.9+ KB
In pandas, you will typically want to separate out .describe() calls for numeric and non-numeric columns. Another handy function in pandas is .info(), which you can use to show the number of non-NA values. This is particularly useful in sparse datasets where there may be a LOT of missing values and you may want to find out which columns have useful information for more than just a few rows.
6.3.2 Indexing
To access a subset of a data frame, we index by [row, column]
in both languages (though in python we need a helper function tagged on the end of the object).
Indexing in python (lots of output)
# .iloc allows for integer location-based indexing
0:4,] # leave the space for cols blank to get all columns
airmen.iloc[## name ... web_profile
## 0 Adams, John H., Jr. ... https://cafriseabove.org/john-h-adams-jr/
## 1 Adams, Paul ... https://cafriseabove.org/paul-adams/
## 2 Adkins, Rutherford H. ... https://cafriseabove.org/rutherford-h-adkins/
## 3 Adkins, Winston A. ... NaN
##
## [4 rows x 16 columns]
0:4,[0, 3, 5]] # include a vector of column indices
airmen.iloc[
# .loc allows for using the row and column indexes
## name graduation_date class
## 0 Adams, John H., Jr. 1945-04-15T00:00:00Z SE-45-B
## 1 Adams, Paul 1943-04-29T00:00:00Z SE-43-D
## 2 Adkins, Rutherford H. 1944-10-16T00:00:00Z SE-44-I-1
## 3 Adkins, Winston A. 1944-02-08T00:00:00Z TE-44-B
'0':'4',]
airmen.loc[## name ... web_profile
## 0 Adams, John H., Jr. ... https://cafriseabove.org/john-h-adams-jr/
## 1 Adams, Paul ... https://cafriseabove.org/paul-adams/
## 2 Adkins, Rutherford H. ... https://cafriseabove.org/rutherford-h-adkins/
## 3 Adkins, Winston A. ... NaN
## 4 Alexander, Halbert L. ... https://cafriseabove.org/halbert-l-alexander/
## .. ... ... ...
## 395 Hicks, Frederick P. ... NaN
## 396 Higginbotham, Mitchell L. ... NaN
## 397 Highbaugh, Earl B. ... NaN
## 398 Highbaugh, Richard B. ... NaN
## 399 Hill, Charles A., Jr. ... NaN
##
## [400 rows x 16 columns]
0:4,'name':'first_name'] # columns between name and first_name
airmen.loc[## name last_name first_name
## 0 Adams, John H., Jr. Adams John H., Jr.
## 1 Adams, Paul Adams Paul
## 2 Adkins, Rutherford H. Adkins Rutherford H.
## 3 Adkins, Winston A. Adkins Winston A.
## 4 Alexander, Halbert L. Alexander Halbert L.
0:4,[0,3,5]] # can't use position indexes with .loc
airmen.loc[## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: "None of [Int64Index([0, 3, 5], dtype='int64')] are in the [columns]"
##
## Detailed traceback:
## File "<string>", line 1, in <module>
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 961, in __getitem__
## return self._getitem_tuple(key)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 1149, in _getitem_tuple
## return self._getitem_tuple_same_dim(tup)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 827, in _getitem_tuple_same_dim
## retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 1191, in _getitem_axis
## return self._getitem_iterable(key, axis=axis)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 1132, in _getitem_iterable
## keyarr, indexer = self._get_listlike_indexer(key, axis)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 1327, in _get_listlike_indexer
## keyarr, indexer = ax._get_indexer_strict(key, axis_name)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
## self._raise_if_missing(keyarr, indexer, axis_name)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
## raise KeyError(f"None of [{key}] are in the [{axis_name}]")
This uses a function of pandas we have not previously explored: slicing. Slicing in pandas acts very similar to R’s seq method for integers, in that you can set your start and end points and use : between them. However, in python, the last index is the (non-inclusive) endpoint, so 0:3 will give you 0, 1, 2. If you want all columns after a certain index, you can use x:
, where x is the starting index.
Indexing in R
1:5, ]
airmen[## # A tibble: 5 × 16
## name last_name first_name graduation_date rank_at_graduat… class
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 Adams, John H… Adams John H., … 1945-04-15 00:00:00 2nd Lt SE-4…
## 2 Adams, Paul Adams Paul 1943-04-29 00:00:00 2nd Lt SE-4…
## 3 Adkins, Ruthe… Adkins Rutherfor… 1944-10-16 00:00:00 2nd Lt SE-4…
## 4 Adkins, Winst… Adkins Winston A. 1944-02-08 00:00:00 2nd Lt TE-4…
## 5 Alexander, Ha… Alexander Halbert L. 1944-11-20 00:00:00 2nd Lt SE-4…
## # … with 10 more variables: graduated_from <chr>, pilot_type <chr>,
## # military_hometown_of_record <chr>, state <chr>,
## # aerial_victory_credits <chr>, number_of_aerial_victory_credits <dbl>,
## # reported_lost <chr>, reported_lost_date <dttm>,
## # reported_lost_location <chr>, web_profile <chr>
1:5, c(1, 4, 6)]
airmen[## # A tibble: 5 × 3
## name graduation_date class
## <chr> <dttm> <chr>
## 1 Adams, John H., Jr. 1945-04-15 00:00:00 SE-45-B
## 2 Adams, Paul 1943-04-29 00:00:00 SE-43-D
## 3 Adkins, Rutherford H. 1944-10-16 00:00:00 SE-44-I-1
## 4 Adkins, Winston A. 1944-02-08 00:00:00 TE-44-B
## 5 Alexander, Halbert L. 1944-11-20 00:00:00 SE-44-I
1:5, c("name", "first_name")]
airmen[## # A tibble: 5 × 2
## name first_name
## <chr> <chr>
## 1 Adams, John H., Jr. John H., Jr.
## 2 Adams, Paul Paul
## 3 Adkins, Rutherford H. Rutherford H.
## 4 Adkins, Winston A. Winston A.
## 5 Alexander, Halbert L. Halbert L.
In R, we can also easily pull out a single column using the $
method. Note that this gives us a vector (that is, we’ve lost the connection to the row index).
$name[1:5]
airmen## [1] "Adams, John H., Jr." "Adams, Paul" "Adkins, Rutherford H."
## [4] "Adkins, Winston A." "Alexander, Halbert L."
head(airmen["name"]) # head() just gives us the first few rows
## # A tibble: 6 × 1
## name
## <chr>
## 1 Adams, John H., Jr.
## 2 Adams, Paul
## 3 Adkins, Rutherford H.
## 4 Adkins, Winston A.
## 5 Alexander, Halbert L.
## 6 Alexander, Harvey R.
In python, we can also easily pull out a single column:
airmen.name## 0 Adams, John H., Jr.
## 1 Adams, Paul
## 2 Adkins, Rutherford H.
## 3 Adkins, Winston A.
## 4 Alexander, Halbert L.
## ...
## 1001 Young, Albert L.
## 1002 Young, Benjamin, Jr.
## 1003 Young, Eddie Lee
## 1004 Young, Lee W.
## 1005 Young, William W.
## Name: name, Length: 1006, dtype: object
'name']
airmen[## 0 Adams, John H., Jr.
## 1 Adams, Paul
## 2 Adkins, Rutherford H.
## 3 Adkins, Winston A.
## 4 Alexander, Halbert L.
## ...
## 1001 Young, Albert L.
## 1002 Young, Benjamin, Jr.
## 1003 Young, Eddie Lee
## 1004 Young, Lee W.
## 1005 Young, William W.
## Name: name, Length: 1006, dtype: object
The df.column
notation, called attribute access only works in some circumstances: where the column name is not the same as a method (e.g. min
) and is a valid Python identifier (so df.1
does not work). When attribute access does not work, you can still access the column by name using df['colname']
(standard indexing).
Indexing in python with pandas (Pandas documentation) This includes good information on which indexing operations are most efficient and recommended for production code.
Slicing dataframes in R - The Pirate’s Guide to R
Indexing, Slicing, and Subsetting DataFrames in Python - Visualization in Python for Ecologists
6.3.3 Row and Column Names
In both R and python, data frames have two sets of names: a set of row names, and a set of column names. In my experience, it is much more common to use column names in R and less common to actually use row names2; in Python it seems that people tend to use both sets of names frequently.
Row and Column Names in Python
Let’s start with column names. Column names can be defined by creating a Series
object (remember, that’s just a fancy name for an indexed vector) and assigning it to the df.columns
object, where df
is the name of the data frame.
# Get index of column names
airmen.columns
# We can set the names using simple assignment
## Index(['name', 'last_name', 'first_name', 'graduation_date',
## 'rank_at_graduation', 'class', 'graduated_from', 'pilot_type',
## 'military_hometown_of_record', 'state', 'aerial_victory_credits',
## 'number_of_aerial_victory_credits', 'reported_lost',
## 'reported_lost_date', 'reported_lost_location', 'web_profile'],
## dtype='object')
= pd.Series(['Name', 'Last', 'First', 'Graduation_Date', 'Graduation_Rank', 'Class', 'Graduated_From', 'Pilot_Type', 'Hometown', 'State', 'Aerial_Victory_Credits', 'Num_Aerial_Victory_Credits', 'Reported_Lost', 'Reported_Lost_Date', 'Reported_Lost_Location', 'Web_Profile'])
airmen.columns
# Now the names are capitalized
airmen.columns## Index(['Name', 'Last', 'First', 'Graduation_Date', 'Graduation_Rank', 'Class',
## 'Graduated_From', 'Pilot_Type', 'Hometown', 'State',
## 'Aerial_Victory_Credits', 'Num_Aerial_Victory_Credits', 'Reported_Lost',
## 'Reported_Lost_Date', 'Reported_Lost_Location', 'Web_Profile'],
## dtype='object')
# Get index of row names
# this structure has numeric row names
airmen.index # we can access individual rows using the numeric index (iloc)
## RangeIndex(start=0, stop=1006, step=1)
0]
airmen.iloc[# we can also access individual rows using the regular index (loc)
## Name Adams, John H., Jr.
## Last Adams
## First John H., Jr.
## Graduation_Date 1945-04-15T00:00:00Z
## Graduation_Rank 2nd Lt
## Class SE-45-B
## Graduated_From TAAF
## Pilot_Type Single engine
## Hometown Kansas City
## State KS
## Aerial_Victory_Credits NaN
## Num_Aerial_Victory_Credits 0.0
## Reported_Lost NaN
## Reported_Lost_Date NaN
## Reported_Lost_Location NaN
## Web_Profile https://cafriseabove.org/john-h-adams-jr/
## Name: 0, dtype: object
0]
airmen.loc[# this doesn't work because the row names are integers
## Name Adams, John H., Jr.
## Last Adams
## First John H., Jr.
## Graduation_Date 1945-04-15T00:00:00Z
## Graduation_Rank 2nd Lt
## Class SE-45-B
## Graduated_From TAAF
## Pilot_Type Single engine
## Hometown Kansas City
## State KS
## Aerial_Victory_Credits NaN
## Num_Aerial_Victory_Credits 0.0
## Reported_Lost NaN
## Reported_Lost_Date NaN
## Reported_Lost_Location NaN
## Web_Profile https://cafriseabove.org/john-h-adams-jr/
## Name: 0, dtype: object
'Adams, John H., Jr.']
airmen.loc[
# We can set row names using simple assignment
## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 'Adams, John H., Jr.'
##
## Detailed traceback:
## File "<string>", line 1, in <module>
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 967, in __getitem__
## return self._getitem_axis(maybe_callable, axis=axis)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 1202, in _getitem_axis
## return self._get_label(key, axis=axis)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 1153, in _get_label
## return self.obj.xs(label, axis=axis)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/generic.py", line 3876, in xs
## loc = index.get_loc(key)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexes/range.py", line 389, in get_loc
## raise KeyError(key)
= airmen.name
airmen.index
# Row names are changed
## Error in py_call_impl(callable, dots$args, dots$keywords): AttributeError: 'DataFrame' object has no attribute 'name'
##
## Detailed traceback:
## File "<string>", line 1, in <module>
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/generic.py", line 5583, in __getattr__
## return object.__getattribute__(self, name)
# now the row name index is a string and we can look names up this way
airmen.index
# we can still access individual rows using the numeric index (iloc)
## RangeIndex(start=0, stop=1006, step=1)
0]
airmen.iloc[# we can mo longer access individual rows using the regular index (loc)
# with a numeric value
## Name Adams, John H., Jr.
## Last Adams
## First John H., Jr.
## Graduation_Date 1945-04-15T00:00:00Z
## Graduation_Rank 2nd Lt
## Class SE-45-B
## Graduated_From TAAF
## Pilot_Type Single engine
## Hometown Kansas City
## State KS
## Aerial_Victory_Credits NaN
## Num_Aerial_Victory_Credits 0.0
## Reported_Lost NaN
## Reported_Lost_Date NaN
## Reported_Lost_Location NaN
## Web_Profile https://cafriseabove.org/john-h-adams-jr/
## Name: 0, dtype: object
0]
airmen.loc[# but because we set the row names to be the individuals actual names,
# we can use those in the .loc statement
## Name Adams, John H., Jr.
## Last Adams
## First John H., Jr.
## Graduation_Date 1945-04-15T00:00:00Z
## Graduation_Rank 2nd Lt
## Class SE-45-B
## Graduated_From TAAF
## Pilot_Type Single engine
## Hometown Kansas City
## State KS
## Aerial_Victory_Credits NaN
## Num_Aerial_Victory_Credits 0.0
## Reported_Lost NaN
## Reported_Lost_Date NaN
## Reported_Lost_Location NaN
## Web_Profile https://cafriseabove.org/john-h-adams-jr/
## Name: 0, dtype: object
'Adams, John H., Jr.']
airmen.loc[## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: 'Adams, John H., Jr.'
##
## Detailed traceback:
## File "<string>", line 1, in <module>
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 967, in __getitem__
## return self._getitem_axis(maybe_callable, axis=axis)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 1202, in _getitem_axis
## return self._get_label(key, axis=axis)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexing.py", line 1153, in _get_label
## return self.obj.xs(label, axis=axis)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/generic.py", line 3876, in xs
## loc = index.get_loc(key)
## File "/__w/Stat151/Stat151/renv/python/virtualenvs/renv-python-3.8/lib/python3.8/site-packages/pandas/core/indexes/range.py", line 389, in get_loc
## raise KeyError(key)
When we select certain rows using the row names, we typically refer to the row names as the key.
Read more about database keys here if you are interested. This is material that we will cover in Stat 351, but it may be useful for you now if you are interested in learning to program more efficiently from the start.
Row and Column Names in R
In R, column and row names are just normal vectors - no special data types here!
names(airmen)
## [1] "name" "last_name"
## [3] "first_name" "graduation_date"
## [5] "rank_at_graduation" "class"
## [7] "graduated_from" "pilot_type"
## [9] "military_hometown_of_record" "state"
## [11] "aerial_victory_credits" "number_of_aerial_victory_credits"
## [13] "reported_lost" "reported_lost_date"
## [15] "reported_lost_location" "web_profile"
# Set new column names
names(airmen) <- c('Name', 'Last', 'First', 'Graduation_Date', 'Graduation_Rank', 'Class', 'Graduated_From', 'Pilot_Type', 'Hometown', 'State', 'Aerial_Victory_Credits', 'Num_Aerial_Victory_Credits', 'Reported_Lost', 'Reported_Lost_Date', 'Reported_Lost_Location', 'Web_Profile')
# Using new names
$Name[1:5]
airmen## [1] "Adams, John H., Jr." "Adams, Paul" "Adkins, Rutherford H."
## [4] "Adkins, Winston A." "Alexander, Halbert L."
If we want to set row names in R, we can try the obvious approach: ::: {.cell}
rownames(airmen) <- airmen$Name
## Warning: Setting row names on a tibble is deprecated.
## Warning: non-unique values when setting 'row.names': 'Brothers, James E.',
## 'Walker, William H.', 'Watkins, Edward W.'
## Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
:::
But this runs into trouble, since we have some duplicate names. We can see which names are duplicates by using the table
command combined with sort
and head
to truncate the output. I’m going to use the pipe
command, |>
, to separate these steps. This is equivalent to head(sort(table(airmen$Name), decreasing = T))
but is much easier to read since it can be read as a “recipe” of steps.
table(airmen$Name) |>
sort(decreasing = T) |>
head()
##
## Brothers, James E. Walker, William H. Watkins, Edward W.
## 2 2 2
## Adams, John H., Jr. Adams, Paul Adkins, Rutherford H.
## 1 1 1
R requires rownames to be unique, so we are better off with some other identifier that we don’t have here (SSN, military ID number, etc.) that is guaranteed to be unique. Since we don’t have that kind of information, we really don’t get any advantage by setting the rownames in R.
6.3.4 Creating New Columns
In both Python and R, it is very easy to add new (derived) columns to a data frame, using methods similar to how you access data from a pre-existing column:
$initials <- paste0(substr(airmen$First, 1, 1), substr(airmen$Last, 1, 1))
airmenhead(airmen$initials)
## [1] "JA" "PA" "RA" "WA" "HA" "HA"
'initials'] = airmen['First'].str.slice(0,1) + airmen['Last'].str.slice(0,1)
airmen['initials'][0:7]
airmen[## 0 JA
## 1 PA
## 2 RA
## 3 WA
## 4 HA
## 5 HA
## 6 RA
## Name: initials, dtype: object
Another way to create new variables in R involves the use of the dplyr
package. There are at least 2 advantages to using this package for these types of tasks:
- There is a consistent way to call each function and engage with the data (this API - application programming interface - is designed around using the pipe,
%>%
discussed above) - You don’t have to reference the data frame name and the column name using
df$colname
; instead, you use the function on the data frame and work with “bare” column names inside that function.
library(dplyr)
<- airmen %>%
airmen mutate(initials = paste0(substr(First, 1, 1), substr(Last, 1, 1)))
select(airmen, 1:3, initials) # Select lets us choose to only show some columns
## # A tibble: 1,006 × 4
## Name Last First initials
## <chr> <chr> <chr> <chr>
## 1 Adams, John H., Jr. Adams John H., Jr. JA
## 2 Adams, Paul Adams Paul PA
## 3 Adkins, Rutherford H. Adkins Rutherford H. RA
## 4 Adkins, Winston A. Adkins Winston A. WA
## 5 Alexander, Halbert L. Alexander Halbert L. HA
## 6 Alexander, Harvey R. Alexander Harvey R. HA
## 7 Alexander, Robert R. Alexander Robert R. RA
## 8 Alexander, Walter G., III Alexander Walter G., III WA
## 9 Allen, Carl V. Allen Carl V. CA
## 10 Allen, Clarence W. Allen Clarence W. CA
## # … with 996 more rows
Notice that by running mutate on the data frame, we automatically get a tibble back.
You can do a similar trick with the .assign
function in pandas, but unlike in R, you still have to reference the dataframe object within the .assign function.
airmen.assign(= airmen['First'].str.slice(0,1) + airmen['Last'].str.slice(0,1)
initials
)## Name ... initials
## 0 Adams, John H., Jr. ... JA
## 1 Adams, Paul ... PA
## 2 Adkins, Rutherford H. ... RA
## 3 Adkins, Winston A. ... WA
## 4 Alexander, Halbert L. ... HA
## ... ... ... ...
## 1001 Young, Albert L. ... AY
## 1002 Young, Benjamin, Jr. ... BY
## 1003 Young, Eddie Lee ... EY
## 1004 Young, Lee W. ... LY
## 1005 Young, William W. ... WY
##
## [1006 rows x 17 columns]
6.3.5 Subsets of Rows
Another major operation commonly performed on data frames is to choose only a certain set of rows, either randomly or using a logical condition.
In base R, we would use the subset() function:
subset(airmen, Last == "Young")
## # A tibble: 5 × 17
## Name Last First Graduation_Date Graduation_Rank Class Graduated_From
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 Young, A… Young Albe… 1944-03-12 00:00:00 2nd Lt SE-4… TAAF
## 2 Young, B… Young Benj… 1945-05-23 00:00:00 2nd Lt SE-4… TAAF
## 3 Young, E… Young Eddi… 1946-05-14 00:00:00 Flight Officer SE-4… TAAF
## 4 Young, L… Young Lee … 1945-06-27 00:00:00 Flight Officer SE-4… TAAF
## 5 Young, W… Young Will… 1945-09-08 00:00:00 2nd Lt SE-4… TAAF
## # … with 10 more variables: Pilot_Type <chr>, Hometown <chr>, State <chr>,
## # Aerial_Victory_Credits <chr>, Num_Aerial_Victory_Credits <dbl>,
## # Reported_Lost <chr>, Reported_Lost_Date <dttm>,
## # Reported_Lost_Location <chr>, Web_Profile <chr>, initials <chr>
In python, we use the .query()
function:
'Last == "Young"')
airmen.query(## Name ... initials
## 1001 Young, Albert L. ... AY
## 1002 Young, Benjamin, Jr. ... BY
## 1003 Young, Eddie Lee ... EY
## 1004 Young, Lee W. ... LY
## 1005 Young, William W. ... WY
##
## [5 rows x 17 columns]
In dplyr
, we use the filter
function:
%>%
airmen filter(Last == "Young")
## # A tibble: 5 × 17
## Name Last First Graduation_Date Graduation_Rank Class Graduated_From
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 Young, A… Young Albe… 1944-03-12 00:00:00 2nd Lt SE-4… TAAF
## 2 Young, B… Young Benj… 1945-05-23 00:00:00 2nd Lt SE-4… TAAF
## 3 Young, E… Young Eddi… 1946-05-14 00:00:00 Flight Officer SE-4… TAAF
## 4 Young, L… Young Lee … 1945-06-27 00:00:00 Flight Officer SE-4… TAAF
## 5 Young, W… Young Will… 1945-09-08 00:00:00 2nd Lt SE-4… TAAF
## # … with 10 more variables: Pilot_Type <chr>, Hometown <chr>, State <chr>,
## # Aerial_Victory_Credits <chr>, Num_Aerial_Victory_Credits <dbl>,
## # Reported_Lost <chr>, Reported_Lost_Date <dttm>,
## # Reported_Lost_Location <chr>, Web_Profile <chr>, initials <chr>
We can also sample from a data frame. This can be useful when working with a large dataset, or when simulating to determine how a method performs on some data.
For simplicity, I’m going to primarily show you the dplyr and pandas equivalents - it is possible to do this in base R, but it’s much easier to do in dplyr
.
Sampling in R
%>% sample_frac(.01)
airmen ## # A tibble: 10 × 17
## Name Last First Graduation_Date Graduation_Rank Class Graduated_From
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 Cheek, … Cheek Quen… 1945-09-08 00:00:00 2nd Lt TE-4… TAAF
## 2 Bratche… Brat… Ever… 1943-08-30 00:00:00 2nd Lt SE-4… TAAF
## 3 Hill, C… Hill Char… 1944-02-08 00:00:00 2nd Lt TE-4… TAAF
## 4 Goodwin… Good… Luth… 1944-09-08 00:00:00 1st Lt TE-4… TAAF
## 5 Cross, … Cross Will… 1943-10-01 00:00:00 Flight Officer SE-4… TAAF
## 6 Foreman… Fore… Walt… 1943-04-29 00:00:00 2nd Lt SE-4… TAAF
## 7 Ramsey,… Rams… Pier… 1945-09-08 00:00:00 Flight Officer TE-4… TAAF
## 8 Hockada… Hock… Wend… 1944-05-23 00:00:00 2nd Lt SE-4… TAAF
## 9 DeBow, … DeBow Char… 1942-03-06 00:00:00 2nd Lt SE-4… TAAF
## 10 Rhodes,… Rhod… Geor… 1943-10-01 00:00:00 2nd Lt SE-4… TAAF
## # … with 10 more variables: Pilot_Type <chr>, Hometown <chr>, State <chr>,
## # Aerial_Victory_Credits <chr>, Num_Aerial_Victory_Credits <dbl>,
## # Reported_Lost <chr>, Reported_Lost_Date <dttm>,
## # Reported_Lost_Location <chr>, Web_Profile <chr>, initials <chr>
%>% sample_n(5)
airmen ## # A tibble: 5 × 17
## Name Last First Graduation_Date Graduation_Rank Class Graduated_From
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 Smith, E… Smith Edwa… 1943-07-28 00:00:00 2nd Lt SE-4… TAAF
## 2 Terry, R… Terry Robe… NA Flight Officer <NA> TAAF
## 3 Johnson,… John… Lang… 1943-05-28 00:00:00 2nd Lt SE-4… TAAF
## 4 Long, Cl… Long Clyd… 1945-04-15 00:00:00 Flight Officer SE-4… TAAF
## 5 Jones, M… Jones Majo… 1944-04-15 00:00:00 2nd Lt SE-4… TAAF
## # … with 10 more variables: Pilot_Type <chr>, Hometown <chr>, State <chr>,
## # Aerial_Victory_Credits <chr>, Num_Aerial_Victory_Credits <dbl>,
## # Reported_Lost <chr>, Reported_Lost_Date <dttm>,
## # Reported_Lost_Location <chr>, Web_Profile <chr>, initials <chr>
Sampling rows is similarly easy in python:
=.01)
airmen.sample(frac## Name ... initials
## 750 Rogers, Amos A. ... AR
## 299 Gay, Thomas L. ... TG
## 324 Gray, Elliott H. ... EG
## 59 Bohannon, Horace A. ... HB
## 154 Cisco, Arnold W. ... AC
## 20 Armstrong, William P. ... WA
## 931 Whyte, James W., Jr. ... JW
## 528 Lester, Clarence D. ... CL
## 862 Toney, Mitchel N ... MT
## 331 Greenlee, George B., Jr. ... GG
##
## [10 rows x 17 columns]
= 5)
airmen.sample(n ## Name Last ... Web_Profile initials
## 514 Lane, Earl R. Lane ... NaN EL
## 322 Govan, Claude B. Govan ... NaN CG
## 799 Smith, Frederick D. Smith ... NaN FS
## 475 Johnson, Theopolis W. Johnson ... NaN TJ
## 360 Harris, Cassius Harris ... NaN CH
##
## [5 rows x 17 columns]
6.3.6 Summaries (part 2)
Using subsets of rows and columns, we can also create summaries of data frames that are more customized to what we want. This is particularly powerful when we combine it with material we learned in the previous chapter: for loops.
There are more efficient ways to generate summaries than what I will show you here, but it’s important to see how for loops might apply to this use case before we talk about ways to do this using e.g. the dplyr
package in R. This will help you understand what those more efficient functions are doing “under the hood”.
If we want to create a data frame that has one row for each unique value of a variable in our source data, we can iterate through the data taking subsets and creating a set of summary variables that are customized to our problem.
Suppose we want to take the airmen
data and calculate the total number of aerial victory credits from each state.
We start out with the following game plan:
Create an empty summary dataset. We plan for the result to have one row for each state.
Iterate through the states in the dataset
For each state, subset the data within the loop to get only values from that state
Calculate the total number of aerial victory credits
Add a row to the summary dataset with the values from the state
Exit the loop
We want to check and make sure that the reported aerial victory credit count is what we’re expecting:
table(airmen$Num_Aerial_Victory_Credits, useNA = 'ifany')
##
## 0 1 1.5 2 3 4
## 934 43 1 19 6 3
'Num_Aerial_Victory_Credits'].value_counts()
airmen[## 0.0 934
## 1.0 43
## 2.0 19
## 3.0 6
## 4.0 3
## 1.5 1
## Name: Num_Aerial_Victory_Credits, dtype: int64
Creating a summary dataset in R with a loop
# Create an empty data frame to store the results
<- data.frame(State = NULL, total_victory_credits = NULL)
state_victories
for (i in unique(airmen$State)) {
# Get a subset of the data with only airmen from the selected state
<- filter(airmen, State == i)
state_sub
<- sum(state_sub$Num_Aerial_Victory_Credits)
victory_credits
# Append our row for this state onto the summary data frame
<- rbind(
state_victories
state_victories, data.frame(State = i, total_victory_credits = victory_credits)
)
}
%>%
state_victories arrange(desc(total_victory_credits)) # Sort in descending order
## State total_victory_credits
## 1 NY 14.0
## 2 CA 12.0
## 3 IL 10.0
## 4 MO 10.0
## 5 VA 6.5
## 6 OH 6.0
## 7 KS 5.0
## 8 PA 4.0
## 9 IA 4.0
## 10 GA 4.0
## 11 OK 4.0
## 12 NJ 3.0
## 13 AL 3.0
## 14 IN 3.0
## 15 TX 3.0
## 16 OR 3.0
## 17 WA 3.0
## 18 WV 2.0
## 19 FL 2.0
## 20 TN 2.0
## 21 WI 2.0
## 22 SC 1.0
## 23 MD 1.0
## 24 NC 1.0
## 25 MA 1.0
## 26 CT 1.0
## 27 DC 1.0
## 28 NE 1.0
## 29 In 0.0
## 30 RI 0.0
## 31 Haiti 0.0
## 32 <NA> 0.0
## 33 MI 0.0
## 34 LA 0.0
## 35 CO 0.0
## 36 MS 0.0
## 37 MN 0.0
## 38 AZ 0.0
## 39 AR 0.0
## 40 KY 0.0
## 41 HT 0.0
## 42 VT 0.0
## 43 CN 0.0
## 44 VI 0.0
## 45 DE 0.0
## 46 KN 0.0
## 47 TD 0.0
## 48 Unk 0.0
## 49 WY 0.0
Creating a summary dataset in python with a loop
# Create an empty data frame to store the results
= pd.DataFrame(columns = ['state', 'total_victory_credits'])
state_victories
for i in airmen.State.unique():
# Get a subset of the data with only airmen from the selected state
= airmen.loc[airmen.State == i]
state_sub
= state_sub['Num_Aerial_Victory_Credits'].sum()
victory_credits
= [i, victory_credits]
state_victories.loc[i]
\
state_victories. "total_victory_credits"], ascending = False). \
sort_values([1:10,]
iloc[# Putting slashes at the end of the line here allows me to separate
# the operations into steps, rather like using a pipe in R
# It only works if nothing is after the \ though, so
# I have to put the comment below the code instead of inline
## state total_victory_credits
## CA CA 12.0
## IL IL 10.0
## MO MO 10.0
## VA VA 6.5
## OH OH 6.0
## KS KS 5.0
## GA GA 4.0
## IA IA 4.0
## OK OK 4.0
Writing this code from scratch, not knowing Python that well, required the following google queries and pages:
python summary table: https://pbpython.com/sidetable.html (I didn’t use sidetable, but I did use the pandas code here that shows how to use
.value_counts()
)create an empty pandas dataframe: Query result. I used Method 3.
unique values in python: Query result (not helpful). This worked for numpy but not for pandas series/dataframe columns.
pandas column unique values: Query result (helpful)
pandas sort decreasing: Query result
Because I outlined my steps above, though, I could figure out pretty easily how to google for exactly what I needed at each step. Writing out the steps ahead of time, and knowing how to work using data frames and row-by-row operations helps with googling solutions that work!
6.3.7 Try it (all) Out!
Let’s look at a dataset of dog breed rankings (1 = low, 5 = high) for different traits. We can read in this data using the following code:
<- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv') breed_traits
= pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv') breed_traits
Can you complete the following tasks?
Pull out any dogs that have “Terrier” in the name. Hint: Testing with strings in python, Using
grep
andgrepl
in RCreate a data frame of all dogs who are 4 or 5 on “Affectionate with Family” and name it good_with_family. Make sure you keep all of the columns in the original data frame.
Create a data frame with all variables relating to coat (shedding level, coat type, coat length, grooming frequency) as well as the dog breed. Make sure you keep all rows, but only include the necessary columns.
Draw a random sample of 10 dogs from your dataset.
Create a new variable, named mess, that is the product of the dog’s shedding level and their drooling level. (It’s probably a bad idea to multiply categorical variables, but in this case we’re going to go with it.) Summarize this numeric variable.
Solutions (R)
library(dplyr) # data frame manipulations
library(stringr) # string comparison functions in tidyverse
# 1: Terriers
<- breed_traits[grepl("Terrier", breed_traits$Breed),]
terrorists <- breed_traits %>%
terrorists2 filter(str_detect(Breed, "Terrier"))
# 2: Good with Family
<- breed_traits[breed_traits$Affectionate.With.Family > 3,]
good_with_family <- breed_traits %>%
good_with_family2 filter(Affectionate.With.Family > 3)
# 3: Coat variables
<- breed_traits[,c(1, 4:5, 7:8)] # count column by column
coat <- breed_traits %>%
coat2 select(Breed, matches(c("Coat", "Grooming", "Shedding"))) # tidy way
<- breed_traits[,grepl("Coat|Grooming|Shedding|Breed", names(breed_traits))] # match string names
coat3
# 4: random sample of 10 dogs
<- breed_traits[sample(1:nrow(breed_traits), 10),] # base R way
sample_dogs <- breed_traits %>% sample_n(10) # tidy way
sample_dogs2
# 5: mess
$mess <- breed_traits$Drooling.Level * breed_traits$Shedding.Level
breed_traits<- breed_traits %>%
breed_traits mutate(mess2 = Drooling.Level*Shedding.Level)
summary(breed_traits$mess)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 4.841 6.000 20.000
%>% select(mess2) %>% summary()
breed_traits ## mess2
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 3.000
## Mean : 4.841
## 3rd Qu.: 6.000
## Max. :20.000
Solutions (python)
# 1: Terriers
= breed_traits.loc[breed_traits.Breed.str.match(r".*Terrier.*"),]
terrorists
# 2: Good with Family
= breed_traits.query("`Affectionate With Family` > 3")
good_with_family 'Affectionate With Family'].min() # just checking
good_with_family[
# 3: Coat variables
## 4
= breed_traits.loc[:,["Breed", "Shedding Level", "Coat Grooming Frequency", "Coat Type", "Coat Length"]]
coat = breed_traits.iloc[:,breed_traits.columns.str.match(r"Breed|Coat|Shedding")]
coat2
# 4: random sample of 10 dogs
= breed_traits.sample(n = 10)
sample_dogs
# 5: mess
'mess'] = breed_traits['Drooling Level'] * breed_traits['Shedding Level']
breed_traits[
breed_traits.mess.describe## <bound method NDFrame.describe of 0 8
## 1 9
## 2 8
## 3 8
## 4 9
## ..
## 190 2
## 191 3
## 192 2
## 193 6
## 194 3
## Name: mess, Length: 195, dtype: int64>
min()
breed_traits.mess.## 0
max()
breed_traits.mess.## 20
6.4 Basic Plotting Examples
Now that you can read data in to R and python and define new variables, you can create plots! We’ll focus a bit more on this later in the class, but for now, I’d like to take a few minutes to explain how to make (basic) plots in R (with ggplot2
) and in python (with plotnine
, which is a ggplot2 clone).
Let’s work with Historically Black College and University enrollment in this example:
<- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-02/hbcu_all.csv')
hbcu_all
library(ggplot2)
import pandas as pd
from plotnine import *
= pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-02/hbcu_all.csv') hbcu_all
ggplot2 and plotnine work with data frames. If you pass a data frame in as the data argument, you can refer to columns in that data with “bare” column names (you don’t have to reference the full data object using df$name
or df.name
; you can instead use name
or "name"
):
ggplot(hbcu_all, aes(x = Year, y = `4-year`)) + geom_line() +
ggtitle("4-year HBCU College Enrollment")
= "Year", y = "4-year")) + geom_line() + \
ggplot(hbcu_all, aes(x "4-year HBCU College Enrollment")
ggtitle(## <ggplot: (8775799194874)>
If your data is in the right format, ggplot2 is very easy to use; if your data aren’t formatted neatly, it can be a real pain. If you want to plot multiple lines, you need to either list each variable you want to plot, one by one, or (more likely) you want to get your data into “long form”. You don’t need to know exactly how this works, but it is helpful to see the difference in the two datasets:
library(tidyr)
<- pivot_longer(hbcu_all, -Year, names_to = "type", values_to = "value")
hbcu_long
head(hbcu_all)
## # A tibble: 6 × 12
## Year `Total enrollment` Males Females `4-year` `2-year` `Total - Public`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1976 222613 104669 117944 206676 15937 156836
## 2 1980 233557 106387 127170 218009 15548 168217
## 3 1982 228371 104897 123474 212017 16354 165871
## 4 1984 227519 102823 124696 212844 14675 164116
## 5 1986 223275 97523 125752 207231 16044 162048
## 6 1988 239755 100561 139194 223250 16505 173672
## # … with 5 more variables: `4-year - Public` <dbl>, `2-year - Public` <dbl>,
## # `Total - Private` <dbl>, `4-year - Private` <dbl>, `2-year - Private` <dbl>
head(hbcu_long)
## # A tibble: 6 × 3
## Year type value
## <dbl> <chr> <dbl>
## 1 1976 Total enrollment 222613
## 2 1976 Males 104669
## 3 1976 Females 117944
## 4 1976 4-year 206676
## 5 1976 2-year 15937
## 6 1976 Total - Public 156836
= pd.melt(hbcu_all, id_vars = ['Year'], value_vars = hbcu_all.columns[1:11]) hbcu_long
In the long form of the data, we have a row for each data point (year x measurement type), not for each year.
ggplot(hbcu_long, aes(x = Year, y = value, color = type)) + geom_line() +
ggtitle("HBCU College Enrollment")
= "Year", y = "value", color = "variable")) + geom_line() + \
ggplot(hbcu_long, aes(x "HBCU College Enrollment") + \
ggtitle(={'right':0.75}) # This moves the key so it takes up 25% of the area
theme(subplots_adjust## <ggplot: (8775799133931)>
Footnotes
Tidy Tuesday is a collaborative project where the R community gets together and explores a dataset, cleaning it, visualizing it, and generally working to collectively hone R skills together. You can find some very nice YouTube livestreams, as well as lots of examples using the #tidytuesday twitter tag.↩︎
More advanced data-frame like structures, such as
tibbles
, actively discourage the use of row names.↩︎