## url <- "https://www.mesonet.org/index.php/dataMdfMts/dataController/getFile/202206070000/mdf/TEXT/"
data <- read.fwf(url,
skip = 3, # Skip the first 2 lines (useless) + header line
widths = c(5, 6, 6, 7, 7, 7, 7, 6, 7, 7, 7, 8, 9, 6, 7, 7, 7, 7, 7, 7,
7, 8, 8, 8)) # There is a row with the column names specified
data[1:6,] # first 6 rows
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
## 1 ACME 110 0 60 29.9 4.4 4.3 111 9.0 0.8 6.4 0.00 959.37 267 29.6 3.6 25.4
## 2 ADAX 1 0 69 29.3 1.7 1.6 98 24.9 0.6 3.4 0.00 971.26 251 29.0 0.6 24.6
## 3 ALTU 2 0 52 31.7 5.5 5.4 89 7.6 1.0 7.8 0.00 956.12 287 31.3 3.5 26.5
## 4 ALV2 116 0 57 30.1 2.5 2.4 108 10.3 0.5 3.6 55.63 954.01 266 30.1 1.7 23.3
## 5 ANT2 135 0 75 29.1 1.1 1.1 44 21.1 0.3 2.0 0.00 985.35 121 28.9 0.5 25.9
## 6 APAC 111 0 58 29.9 5.1 5.1 107 8.5 0.7 6.6 0.00 954.47 224 29.7 3.6 26.2
## V18 V19 V20 V21 V22 V23 V24
## 1 29.4 27.4 22.5 20.6 1.55 1.48 1.40
## 2 28.7 25.6 24.3 -998.0 1.46 1.52 -998.00
## 3 32.1 27.6 24.0 -998.0 1.72 1.50 -998.00
## 4 30.3 26.2 21.1 -998.0 1.49 1.40 -998.00
## 5 29.0 26.3 22.8 21.4 1.51 1.39 1.41
## 6 29.1 26.6 24.3 20.5 1.59 1.47 1.40
17 Data Input
17.1 Objectives
- Read in data from common formats into R or Python
- Identify delimiters, headers, and other essential components of files
17.2 Overview: Data Formats
In order to use statistical software to do anything interesting, we need to be able to get data into the program so that we can work with it effectively. For the moment, we’ll focus on tabular data - data that is stored in a rectangular shape, with rows indicating observations and columns that show variables. This type of data can be stored on the computer in multiple ways:
- as raw text, usually in a file that ends with .txt, .tsv, .csv, .dat, or sometimes no file extension at all. These types of files are human-readable. If part of a text file gets corrupted, the rest of the file may be recoverable.
- as a binary file. Binary files are compressed files that are readable by computers but not by humans. They generally take less space to store on disk (but the same amount of space when read into computer memory). If part of a binary file is corrupted, the entire file is usually affected.
  - R, SAS, Stata, SPSS, and Minitab all have their own formats for storing binary data. Packages such as foreign in R will let you read data from other programs, and packages such as haven in R will let you write data into binary formats used by other programs (see the sketch after this list).
  - [1] describes why binary file formats exist, and why they're not necessarily optimal.
- in a spreadsheet. Spreadsheets, such as those created by MS Excel, Google Sheets, or LibreOffice Calc, are not binary formats, but they're not raw text files either. They're a hybrid - a special type of markup that is specific to the filetype and the program it's designed to work with. Practically, they may function like a poorly laid-out database, a text file, or a total nightmare, depending on who designed the spreadsheet.
- in a database. Databases are typically composed of a set of one or more tables, with information that may be related across tables. Data stored in a database may be easier to access, and may not require that the entire data set be stored in computer memory at the same time, but you may have to join several tables together to get the full set of data you want to work with.
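As a quick sketch of that haven workflow (writing and then re-reading a Stata file; mtcars and the temporary file path are placeholders, not part of an original example):

library(haven)                   # read/write SAS, SPSS, and Stata binary formats
f <- tempfile(fileext = ".dta")  # a throwaway path for a Stata-format file
write_dta(mtcars, f)             # write an R data frame as Stata binary data
mtcars_copy <- read_dta(f)       # read it back in as a tibble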
There are, of course, many other non-tabular data formats - some open and easy to work with, some impenetrable. A few which you may come across:
- Web data structures: XML (eXtensible Markup Language), JSON (JavaScript Object Notation), YAML. These structures have their own formats and field delimiters, but more importantly, are not necessarily easily converted to tabular structures. They are, however, useful for handling nested objects, such as trees. When read into R or SAS, these file formats are usually treated as lists, and may be restructured afterwards into a format useful for statistical analysis. See Chapter 27 for some tools to work with these files.
- Spatial files: Shapefiles are the most common version of spatial files, though there are a seemingly infinite number of different formats, and new formats pop up at the most inconvenient times. Spatial files often include structured encodings of geographic information plus corresponding tabular-format data that goes with the geographic information. Chapter 28 covers some of the tools available for working with spatial data.
To be minimally functional in R and Python, it’s important to know how to read in text files (CSV, tab-delimited, etc.). It can be helpful to also know how to read in XLSX files. We will briefly discuss binary files and databases, but it is less critical to remember how to read these in without consulting an online reference.
17.3 Text Files
There are several different variants of text data which are relatively common, but for the most part, text data files can be broken down into fixed-width and delimited formats. What’s the difference, you say?
17.3.1 Fixed-width files
Col1 Col2 Col3
 3.4  4.2  5.4
27.3 -2.4 15.9
In a fixed-width text file, the position of the data indicates which field (variable/column) it belongs to. These files are fairly common outputs from older FORTRAN-based programs, but may be found elsewhere as well - if you have a very large amount of data, a fixed-width format may be more efficient to read, because you can select only the portions of the file which matter for a particular analysis (and so you don’t have to read the whole thing into memory).
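A minimal sketch of that selective reading, using the small table above and read.fwf's convention that a negative width means "skip this many characters":

# The same small table, with each field exactly 5 characters wide
txt <- "  3.4  4.2  5.4\n 27.3 -2.4 15.9"
read.fwf(textConnection(txt), widths = c(5, 5, 5))  # read all three fields
read.fwf(textConnection(txt), widths = c(5, -5, 5)) # skip the middle field entirely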
17.3.1.1 Fixed Width File IO
In base R (no extra packages), you can read fixed-width files using read.fwf, but you must specify the column breaks yourself, which can be painful:

url <- "https://www.mesonet.org/index.php/dataMdfMts/dataController/getFile/202206070000/mdf/TEXT/"
data <- read.fwf(url,
                 skip = 3, # Skip the first 2 lines (useless) + the row of column names
                 widths = c(5, 6, 6, 7, 7, 7, 7, 6, 7, 7, 7, 8, 9, 6, 7, 7, 7, 7, 7, 7,
                            7, 8, 8, 8)) # Widths of the 24 columns, counted by hand
data[1:6,] # first 6 rows
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
## 1 ACME 110 0 60 29.9 4.4 4.3 111 9.0 0.8 6.4 0.00 959.37 267 29.6 3.6 25.4
## 2 ADAX 1 0 69 29.3 1.7 1.6 98 24.9 0.6 3.4 0.00 971.26 251 29.0 0.6 24.6
## 3 ALTU 2 0 52 31.7 5.5 5.4 89 7.6 1.0 7.8 0.00 956.12 287 31.3 3.5 26.5
## 4 ALV2 116 0 57 30.1 2.5 2.4 108 10.3 0.5 3.6 55.63 954.01 266 30.1 1.7 23.3
## 5 ANT2 135 0 75 29.1 1.1 1.1 44 21.1 0.3 2.0 0.00 985.35 121 28.9 0.5 25.9
## 6 APAC 111 0 58 29.9 5.1 5.1 107 8.5 0.7 6.6 0.00 954.47 224 29.7 3.6 26.2
## V18 V19 V20 V21 V22 V23 V24
## 1 29.4 27.4 22.5 20.6 1.55 1.48 1.40
## 2 28.7 25.6 24.3 -998.0 1.46 1.52 -998.00
## 3 32.1 27.6 24.0 -998.0 1.72 1.50 -998.00
## 4 30.3 26.2 21.1 -998.0 1.49 1.40 -998.00
## 5 29.0 26.3 22.8 21.4 1.51 1.39 1.41
## 6 29.1 26.6 24.3 20.5 1.59 1.47 1.40

You can count all of those spaces by hand (as above), you can use a different function, or you can write code to do it for you.
# I like to cheat a bit....
# Read the first few lines in
tmp <- readLines(url, n = 20)[-c(1:2)]
# split each line into a series of single characters
tmp_chars <- strsplit(tmp, '')
# Bind the lines together into a character matrix
# do.call applies a function to an entire list - so instead of doing 18 rbinds,
# one command will put all 18 rows together
tmp_chars <- do.call("rbind", tmp_chars) # (it's ok if you don't get this line)
# Make into a logical matrix where T = space, F = not space
tmp_chars_space <- tmp_chars == " "
# Add up the number of rows where there is a non-space character
# space columns would have 0s/FALSE
tmp_space <- colSums(!tmp_chars_space)
# We need a nonzero column followed by a zero column
breaks <- which(tmp_space != 0 & c(tmp_space[-1], 0) == 0)
# Then, we need to get the widths between the columns
widths <- diff(c(0, breaks))
# Now we're ready to go
mesodata <- read.fwf(url, skip = 3, widths = widths, header = F)
# read header separately - if you use header = T, it errors for some reason.
# It's easier just to work around the error than to fix it :)
mesodata_names <- read.fwf(url, skip = 2, n = 1, widths = widths, header = F,
stringsAsFactors = F)
names(mesodata) <- as.character(mesodata_names)
mesodata[1:6,] # first 6 rows
## STID STNM TIME RELH TAIR WSPD WVEC WDIR WDSD WSSD
## 1 ACME 110 0 60 29.9 4.4 4.3 111 9.0 0.8
## 2 ADAX 1 0 69 29.3 1.7 1.6 98 24.9 0.6
## 3 ALTU 2 0 52 31.7 5.5 5.4 89 7.6 1.0
## 4 ALV2 116 0 57 30.1 2.5 2.4 108 10.3 0.5
## 5 ANT2 135 0 75 29.1 1.1 1.1 44 21.1 0.3
## 6 APAC 111 0 58 29.9 5.1 5.1 107 8.5 0.7
## WMAX RAIN PRES SRAD TA9M WS2M TS10 TB10 TS05
## 1 6.4 0.00 959.37 267 29.6 3.6 25.4 29.4 27.4
## 2 3.4 0.00 971.26 251 29.0 0.6 24.6 28.7 25.6
## 3 7.8 0.00 956.12 287 31.3 3.5 26.5 32.1 27.6
## 4 3.6 55.63 954.01 266 30.1 1.7 23.3 30.3 26.2
## 5 2.0 0.00 985.35 121 28.9 0.5 25.9 29.0 26.3
## 6 6.6 0.00 954.47 224 29.7 3.6 26.2 29.1 26.6
## TS25 TS60 TR05 TR25 TR60
## 1 22.5 20.6 1.55 1.48 1.40
## 2 24.3 -998.0 1.46 1.52 -998.00
## 3 24.0 -998.0 1.72 1.50 -998.00
## 4 21.1 -998.0 1.49 1.40 -998.00
## 5 22.8 21.4 1.51 1.39 1.41
## 6 24.3 20.5 1.59 1.47 1.40
You can also write fixed-width files if you really want to:
if (!"gdata" %in% installed.packages()) install.packages("gdata")
library(gdata)
write.fwf(mtcars, file = tempfile())
The readr package creates data-frame-like objects called tibbles (a souped-up data frame) and is much friendlier to use.
library(readr) # Better data importing in R
read_table(url, skip = 2) # Gosh, that was much easier!
## # A tibble: 120 × 24
## STID STNM TIME RELH TAIR WSPD WVEC WDIR WDSD WSSD WMAX RAIN
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ACME 110 0 60 29.9 4.4 4.3 111 9 0.8 6.4 0
## 2 ADAX 1 0 69 29.3 1.7 1.6 98 24.9 0.6 3.4 0
## 3 ALTU 2 0 52 31.7 5.5 5.4 89 7.6 1 7.8 0
## 4 ALV2 116 0 57 30.1 2.5 2.4 108 10.3 0.5 3.6 55.6
## 5 ANT2 135 0 75 29.1 1.1 1.1 44 21.1 0.3 2 0
## 6 APAC 111 0 58 29.9 5.1 5.1 107 8.5 0.7 6.6 0
## 7 ARD2 126 0 61 31.2 3.3 3.2 109 9.1 0.6 4.3 0
## 8 ARNE 6 0 49 30.4 4.5 4.4 111 11.1 0.9 6.4 0
## 9 BEAV 8 0 42 30.5 6.1 6 127 8.7 0.9 7.9 0
## 10 BESS 9 0 53 -999 5.3 5.2 115 8.6 0.6 7 0
## # ℹ 110 more rows
## # ℹ 12 more variables: PRES <dbl>, SRAD <dbl>, TA9M <dbl>, WS2M <dbl>,
## # TS10 <dbl>, TB10 <dbl>, TS05 <dbl>, TS25 <dbl>, TS60 <dbl>, TR05 <dbl>,
## # TR25 <dbl>, TR60 <dbl>
By default, pandas' read_fwf will guess at the format of your fixed-width file.
import pandas as pd

url = "https://www.mesonet.org/index.php/dataMdfMts/dataController/getFile/202006070000/mdf/TEXT/"
data = pd.read_fwf(url, skiprows = 2) # Skip the first 2 lines (useless)
data.head()
## STID STNM TIME RELH TAIR WSPD ... TS05 TS25 TS60 TR05 TR25 TR60
## 0 ACME 110.0 0.0 53.0 31.8 5.2 ... 31.6 25.2 21.7 3.09 2.22 1.48
## 1 ADAX 1.0 0.0 55.0 32.4 1.0 ... 29.6 26.8 -998.0 2.61 1.88 -998.00
## 2 ALTU 2.0 0.0 31.0 35.6 8.9 ... 30.7 26.1 -998.0 3.39 2.47 -998.00
## 3 ALV2 116.0 0.0 27.0 35.8 6.7 ... 25.6 22.6 -998.0 2.70 1.60 -998.00
## 4 ANT2 135.0 0.0 73.0 27.8 0.0 ... 30.2 26.8 23.8 1.96 1.73 1.33
##
## [5 rows x 24 columns]
17.3.2 Delimited Text Files
Delimited text files are files where fields are separated by a specific character, such as a space, comma, semicolon, or tab. Often, delimited text files will have the column names as the first row in the file.
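For instance, the small table shown above in fixed-width form would look like this as a comma-delimited file with a header row (a made-up sample):

Col1,Col2,Col3
3.4,4.2,5.4
27.3,-2.4,15.9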
17.3.2.1 Comma Delimited Files
url <- "https://raw.githubusercontent.com/srvanderplas/datasets/main/clean/pokemon_gen_1-9.csv"
pokemon_info <- read.csv(url, header = T, stringsAsFactors = F)
pokemon_info[1:6, 1:6] # Show only the first 6 lines & cols
gen pokedex_no
1 1 1
2 1 2
3 1 3
4 1 3
5 1 4
6 1 5
img_link
1 https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png
2 https://img.pokemondb.net/sprites/sword-shield/icon/ivysaur.png
3 https://img.pokemondb.net/sprites/sword-shield/icon/venusaur.png
4 https://img.pokemondb.net/sprites/sword-shield/icon/venusaur-mega.png
5 https://img.pokemondb.net/sprites/sword-shield/icon/charmander.png
6 https://img.pokemondb.net/sprites/sword-shield/icon/charmeleon.png
name variant type
1 Bulbasaur <NA> Grass,Poison
2 Ivysaur <NA> Grass,Poison
3 Venusaur <NA> Grass,Poison
4 Venusaur Mega Grass,Poison
5 Charmander <NA> Fire
6 Charmeleon <NA> Fire
There is a family of read_xxx functions in readr to read files delimited with commas (read_csv), tabs (read_tsv), or generic delimited files (read_delim).
The most common delimited text format is CSV: comma-separated values.
library(readr)
url <- "https://raw.githubusercontent.com/srvanderplas/datasets/main/clean/pokemon_gen_1-9.csv"
pokemon_info <- read_csv(url)
pokemon_info[1:6, 1:6] # Show only the first 6 lines & cols
# A tibble: 6 × 6
gen pokedex_no img_link name variant type
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 1 https://img.pokemondb.net/sprites/sword-… Bulb… <NA> Gras…
2 1 2 https://img.pokemondb.net/sprites/sword-… Ivys… <NA> Gras…
3 1 3 https://img.pokemondb.net/sprites/sword-… Venu… <NA> Gras…
4 1 3 https://img.pokemondb.net/sprites/sword-… Venu… Mega Gras…
5 1 4 https://img.pokemondb.net/sprites/sword-… Char… <NA> Fire
6 1 5 https://img.pokemondb.net/sprites/sword-… Char… <NA> Fire
There is a family of read_xxx functions in pandas, including functions to read files delimited with commas (read_csv) as well as generic delimited files (read_table).
import pandas as pd

url = "https://raw.githubusercontent.com/srvanderplas/datasets/main/clean/pokemon_gen_1-9.csv"
pokemon_info = pd.read_csv(url)
pokemon_info.iloc[:6, :6] # Show only the first 6 rows & cols
Sometimes, data is available in files that use other characters as delimiters. This can happen when commas are an important part of the data stored in the file, but can also just be a choice made by the person generating the file. Either way, we can’t let it keep us from accessing the data.
17.3.2.2 Other Character Delimited Files
# Download from web
download.file("https://geonames.usgs.gov/docs/stategaz/NE_Features.zip", destfile = '../data/NE_Features.zip')
## Error in download.file("https://geonames.usgs.gov/docs/stategaz/NE_Features.zip", : cannot open URL 'https://geonames.usgs.gov/docs/stategaz/NE_Features.zip'
# Unzip to `data/` folder
unzip('../data/NE_Features.zip', exdir = '../data/')
# List files matching the file type and pick the first one
fname <- list.files("../data/", 'NE_Features_20', full.names = T)[1]
# see that the file is delimited with |
readLines(fname, n = 5)
## [1] "FEATURE_ID|FEATURE_NAME|FEATURE_CLASS|STATE_ALPHA|STATE_NUMERIC|COUNTY_NAME|COUNTY_NUMERIC|PRIMARY_LAT_DMS|PRIM_LONG_DMS|PRIM_LAT_DEC|PRIM_LONG_DEC|SOURCE_LAT_DMS|SOURCE_LONG_DMS|SOURCE_LAT_DEC|SOURCE_LONG_DEC|ELEV_IN_M|ELEV_IN_FT|MAP_NAME|DATE_CREATED|DATE_EDITED"
## [2] "171013|Peetz Table|Area|CO|08|Logan|075|405840N|1030332W|40.9777645|-103.0588116|||||1341|4400|Peetz|10/13/1978|"
## [3] "171029|Sidney Draw|Valley|NE|31|Cheyenne|033|410816N|1030116W|41.1377213|-103.021044|405215N|1040353W|40.8709614|-104.0646558|1255|4117|Brownson|10/13/1978|03/08/2018"
## [4] "182687|Highline Canal|Canal|CO|08|Sedgwick|115|405810N|1023137W|40.9694351|-102.5268556|||||1119|3671|Sedgwick|10/13/1978|"
## [5] "182688|Cottonwood Creek|Stream|CO|08|Sedgwick|115|405511N|1023355W|40.9197132|-102.5651893|405850N|1030107W|40.9805426|-103.0185329|1095|3592|Sedgwick|10/13/1978|10/23/2009"
# a file delimited with |
nebraska_locations <- read.delim(fname, sep = "|", header = T)
nebraska_locations[1:6, 1:6]
## FEATURE_ID FEATURE_NAME FEATURE_CLASS STATE_ALPHA STATE_NUMERIC
## 1 171013 Peetz Table Area CO 8
## 2 171029 Sidney Draw Valley NE 31
## 3 182687 Highline Canal Canal CO 8
## 4 182688 Cottonwood Creek Stream CO 8
## 5 182689 Sand Draw Valley CO 8
## 6 182690 Sedgwick Draw Valley CO 8
## COUNTY_NAME
## 1 Logan
## 2 Cheyenne
## 3 Sedgwick
## 4 Sedgwick
## 5 Sedgwick
## 6 Sedgwick
There is a family of read_xxx functions in readr to read files delimited with commas (read_csv), tabs (read_tsv), or generic delimited files (read_delim).
nebraska_locations <- read_delim(fname, delim = "|")
nebraska_locations[1:6, 1:6]
## # A tibble: 6 × 6
## FEATURE_ID FEATURE_NAME FEATURE_CLASS STATE_ALPHA STATE_NUMERIC COUNTY_NAME
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 171013 Peetz Table Area CO 08 Logan
## 2 171029 Sidney Draw Valley NE 31 Cheyenne
## 3 182687 Highline Canal Canal CO 08 Sedgwick
## 4 182688 Cottonwood Cre… Stream CO 08 Sedgwick
## 5 182689 Sand Draw Valley CO 08 Sedgwick
## 6 182690 Sedgwick Draw Valley CO 08 Sedgwick
We can actually read in the file without unzipping it, so long as we download it first - readr does not support reading remote zipped files, but it does support reading zipped files locally. If we know ahead of time what our delimiter is, this is the best choice, as it reduces the amount of file clutter we have in our working directory.
nebraska_locations <- read_delim("../data/NE_Features.zip", delim = "|")
## Error: '../data/NE_Features.zip' does not exist in current working directory ('/home/susan/Projects/Class/stat-computing-r-python/part-wrangling').
nebraska_locations[1:6, 1:6]
## # A tibble: 6 × 6
## FEATURE_ID FEATURE_NAME FEATURE_CLASS STATE_ALPHA STATE_NUMERIC COUNTY_NAME
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 171013 Peetz Table Area CO 08 Logan
## 2 171029 Sidney Draw Valley NE 31 Cheyenne
## 3 182687 Highline Canal Canal CO 08 Sedgwick
## 4 182688 Cottonwood Cre… Stream CO 08 Sedgwick
## 5 182689 Sand Draw Valley CO 08 Sedgwick
## 6 182690 Sedgwick Draw Valley CO 08 Sedgwick
There is a family of read_xxx functions in pandas, including functions to read files delimited with commas (read_csv) as well as generic delimited files (read_table).
Pandas can access zipped data files and unzip them while reading the data in, so we don’t have to download the file and unzip it first.
# a file delimited with |
url = "https://geonames.usgs.gov/docs/stategaz/NE_Features.zip"
nebraska_locations = pd.read_table(url, delimiter = "|")
## urllib.error.HTTPError: HTTP Error 503: Service Unavailable
nebraska_locations
## NameError: name 'nebraska_locations' is not defined
Rebrickable.com contains tables of almost any information imaginable concerning Lego sets, conveniently available at their download page. Because these data sets are comparatively large, they are available as compressed CSV files - that is, the .gz extension indicates gzip compression applied to the CSV file.
The readr package and pandas can both handle .csv.gz files with no problems. Try reading in the data using the appropriate function from each package. Can you save the data as an uncompressed CSV?
import pandas as pd

legosets = pd.read_csv("https://cdn.rebrickable.com/media/downloads/sets.csv.gz")
legosets.to_csv("../data/lego_sets_py.csv")
17.4 Spreadsheets
17.4.1 Spreadsheet IO
This example uses data from NYC Slice. The author maintains a Google sheet of the pizza slices he has photographed, which we can download as an Excel sheet and import.
In R, the easiest way to read Excel data in is to use the readxl package. There are many other packages with different features, however - I have used openxlsx in the past to format spreadsheets to send to clients, for instance. You are far more likely to have problems with the arcane format of the Excel spreadsheet than with the package used to read the data in. It is usually helpful to open the spreadsheet up in a graphical program first to make sure the formatting is as you expected it to be.
if (!"readxl" %in% installed.packages()) install.packages("readxl")
library(readxl)
url <- "https://docs.google.com/spreadsheets/d/1EY3oi9ttxybG0A0Obtwey6BFu7QLqdHe02JApijgztg/export?format=xlsx"
# Only download the data if it doesn't exist in the data folder
if (!file.exists("../data/nyc_slice.xlsx")) {
download.file(url, destfile = "../data/nyc_slice.xlsx", mode = "wb")
}
# Read in the downloaded data
pizza_data <- read_xlsx("../data/nyc_slice.xlsx", sheet = 1)
pizza_data[1:10, 1:6]
## # A tibble: 10 × 6
## `Link to IG Post` Name location_lat location_lng Date
## <chr> <chr> <chr> <chr> <chr>
## 1 https://www.instagram.com/p/CjszJ-fOP5… Ange… 40.6232544 -73.9379223… 2022…
## 2 https://www.instagram.com/p/CjdcPNAufP… Ozon… 40.6808917 -73.8426307 2022…
## 3 https://www.instagram.com/p/CjQdNsaOZl… Pino… 40.6000148 -73.9994551 2022…
## 4 https://www.instagram.com/p/Ci5XblnOnM… La R… 40.7133354 -73.8294102 2022…
## 5 https://www.instagram.com/p/CiiLAtkON_… Rony… 40.7482509 -73.9923498 2022…
## 6 https://www.instagram.com/p/CiS-44nucN… John… 40.8545616 -73.8658818… 2022…
## 7 https://www.instagram.com/p/CiSmQnjutQ… Preg… 40.8631291 -73.8585108 2022…
## 8 https://www.instagram.com/p/CiIO6oFuxp… N & … 40.6004632 -73.9430723… 2022…
## 9 https://www.instagram.com/p/ChaZUsxuFs… Pepp… 40.9036613 -73.8504667… 2022…
## 10 https://www.instagram.com/p/ChNd9wqOqG… Rocc… 40.8676344 -73.8836046 2022…
## # ℹ 1 more variable: `Date Expanded (times in EST)` <chr>
import pandas as pd

pizza_data = pd.read_excel("../data/nyc_slice.xlsx")
## ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl.
pizza_data
## NameError: name 'pizza_data' is not defined

As the error message indicates, pandas needs the openpyxl package (pip install openpyxl) to read .xlsx files.
In general, it is better to avoid working in Excel, as it is not easy to reproduce the results (and Excel is horrible about dates and times, among other issues). Saving your data in more reproducible formats will make writing reproducible code much easier.
The Nebraska Department of Motor Vehicles publishes data on licensed drivers, broken down by license type.
Read in the data using your language(s) of choice. Be sure to look at the structure of the excel file, so that you can read the data in properly!
url <- "https://dmv.nebraska.gov/sites/dmv.nebraska.gov/files/doc/data/ld-totals/NE_Licensed_Drivers_by_Type_2021.xls"
download.file(url, destfile = "../data/NE_Licensed_Drivers_by_Type_2021.xls", mode = "wb")
library(readxl)
ne_plates <- read_xls(path = "../data/NE_Licensed_Drivers_by_Type_2021.xls", skip = 2)
ne_plates[1:10,1:6]
## # A tibble: 10 × 6
## Age \nOperator's \nLicense …¹ Operator's\nLicense …² Motor-\ncycle\nLicen…³
## <chr> <dbl> <dbl> <dbl>
## 1 <NA> NA NA NA
## 2 14 0 0 0
## 3 15 0 0 0
## 4 16 0 0 0
## 5 17 961 33 0
## 6 18 18903 174 0
## 7 19 22159 251 0
## 8 20 22844 326 1
## 9 21 21589 428 0
## 10 22 22478 588 0
## # ℹ abbreviated names: ¹`\nOperator's \nLicense -\nClass O`,
## # ²`Operator's\nLicense - \nClass O/\nMotorcycle\nClass M`,
## # ³`Motor-\ncycle\nLicense /\nClass M`
## # ℹ 2 more variables: `Commercial Driver's License` <chr>, ...6 <chr>
You may need to install xlrd via pip for this code to work.
import pandas as pd

ne_plates = pd.read_excel("https://dmv.nebraska.gov/sites/dmv.nebraska.gov/files/doc/data/ld-totals/NE_Licensed_Drivers_by_Type_2021.xls", skiprows = 2)
## ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 2.0.1 for xls Excel support Use pip or conda to install xlrd.
ne_plates
## NameError: name 'ne_plates' is not defined
17.4.2 Google Sheets
Of course, some spreadsheets are available online via Google sheets. There are specific R and python packages to interface with Google sheets, and these can do more than just read data in - they can create, format, and otherwise manipulate Google sheets programmatically. We’re not going to get into the power of these packages just now, but it’s worth a look if you’re working with collaborators who use Google sheets.
This section is provided for reference, but the details of API authentication are a bit too complicated to require of anyone who is just learning to program. Feel free to skip it and come back later if you need it.
The first two tabs below show authentication-free options for publicly available spreadsheets. For anything that is privately available, you will have to use API authentication via GSpread or googlesheets4 in Python and R, respectively.
17.4.2.1 Reading Google Sheets
Let’s demonstrate reading in data from google sheets in R and python using the Data Is Plural archive.
One simple hack-ish way to read google sheets in Python (so long as the sheet is publicly available) is to modify the sheet url to export the data to CSV and then just read that into pandas as usual. This method is described in [2].
import pandas as pd

sheet_id = "1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk"
sheet_name = "Items"
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"

data_is_plural = pd.read_csv(url)
This method would likely work just as well in R and would not require the googlesheets4 package. The method is described in [2] for Python, but I have adapted the code for use in R.
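A sketch of that R adaptation (it uses the same public sheet and CSV-export URL trick as the Python version above):

library(readr)
sheet_id <- "1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk"
sheet_name <- "Items"
url <- paste0("https://docs.google.com/spreadsheets/d/", sheet_id,
              "/gviz/tq?tqx=out:csv&sheet=", sheet_name)
data_is_plural <- read_csv(url)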
The googlesheets4 code below is set not to run when the textbook is compiled because it requires some interactive authentication.
Copy this code and run it on your computer to read in a sheet from Google Drive directly. You will see a prompt in the R console that you'll have to interact with, and there may also be a browser pop-up where you will need to sign in to Google.
library(googlesheets4)
gs4_auth(scopes = "https://www.googleapis.com/auth/drive.readonly") # Read-only permissions
data_is_plural <- read_sheet("https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0")
These instructions are derived from [3]. We will have to install the GSpread package: type pip install gspread into the terminal.
Then, you will need to obtain a client token JSON file following these instructions.
import gspread as gs
import pandas as pd
I’ve stopped here because I can’t get the authentication working, but the method seems solid if you’re willing to fiddle around with it.
Using a method of your choice, read in this spreadsheet of dog sizes and make a useful plot of dog height and weight ranges by breed.
Coming soon!
17.5 Binary Files
R has binary file formats which store data in a more compact form. It is relatively common for government websites, in particular, to provide SAS data in binary form. Python, as a more general computing language, has many different ways to interact with binary data files, as each programmer and application might want to save their data in binary form in a different way. As a result, there is not a general-purpose binary data format for Python data. If you are interested in reading binary data in Python, see [4].
17.5.1 Binary File IO
.Rdata is perhaps the most common R binary data format, and can store several objects (along with their names) in the same file.
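The code that created R_binary.Rdata isn't shown in this section; a minimal sketch consistent with the objects loaded below (legos from the Rebrickable data read in earlier, plus a small string variable) would be:

legos <- read_csv("../data/lego_sets.csv")
my_var <- "This variable contains a string"
save(legos, my_var, file = "../data/R_binary.Rdata")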
If we look at the file sizes of lego_sets.csv (619 KB) and R_binary.Rdata (227.8 KB), the size difference between binary and flat file formats is obvious.
We can load the R binary file back in using the load() function.
rm(legos, my_var) # clear the files out
ls() # all objects in the working environment
## [1] "breaks" "data" "data_is_plural"
## [4] "fname" "legosets" "mesodata"
## [7] "mesodata_names" "ne_plates" "nebraska_locations"
## [10] "openai_key" "pizza_data" "pokemon_info"
## [13] "sheet_id" "sheet_name" "tmdb_key"
## [16] "tmp" "tmp_chars" "tmp_chars_space"
## [19] "tmp_space" "url" "widths"
load("../data/R_binary.Rdata")
ls() # all objects in the working environment
## [1] "breaks" "data" "data_is_plural"
## [4] "fname" "legos" "legosets"
## [7] "mesodata" "mesodata_names" "my_var"
## [10] "ne_plates" "nebraska_locations" "openai_key"
## [13] "pizza_data" "pokemon_info" "sheet_id"
## [16] "sheet_name" "tmdb_key" "tmp"
## [19] "tmp_chars" "tmp_chars_space" "tmp_space"
## [22] "url" "widths"
Another (less common) binary format used in R is the RDS format. Unlike Rdata, the RDS format does not save the object name - it only saves its contents (which also means you can save only one object at a time). As a result, when you read from an RDS file, you need to store the result of that function into a variable.
Because RDS formats don’t save the object name, you can be sure that you’re not over-writing some object in your workspace by loading a different file. The downside to this is that you have to save each object to its own RDS file separately.
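A minimal sketch of the RDS round trip (presumably how the RDSlego.rds file used in the Python example below was created):

saveRDS(legos, "../data/RDSlego.rds")       # save one object; its name is not stored
lego_sets <- readRDS("../data/RDSlego.rds") # store the result in a (new) variable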
We first need to install the pyreadr package by running pip install pyreadr in the terminal.
import pyreadr

rdata_result = pyreadr.read_r('../data/R_binary.Rdata')
rdata_result["legos"] # Access the variables using the variable name as a key
## set_num ... img_url
## 0 0003977811-1 ... https://cdn.rebrickable.com/media/sets/0003977...
## 1 001-1 ... https://cdn.rebrickable.com/media/sets/001-1.jpg
## 2 0011-2 ... https://cdn.rebrickable.com/media/sets/0011-2.jpg
## 3 0011-3 ... https://cdn.rebrickable.com/media/sets/0011-3.jpg
## 4 0012-1 ... https://cdn.rebrickable.com/media/sets/0012-1.jpg
## ... ... ... ...
## 23860 YODACHRON-1 ... https://cdn.rebrickable.com/media/sets/yodachr...
## 23861 YOTO-1 ... https://cdn.rebrickable.com/media/sets/yoto-1.jpg
## 23862 YOTO-2 ... https://cdn.rebrickable.com/media/sets/yoto-2.jpg
## 23863 YTERRIER-1 ... https://cdn.rebrickable.com/media/sets/yterrie...
## 23864 ZX8000-1 ... https://cdn.rebrickable.com/media/sets/zx8000-...
##
## [23865 rows x 6 columns]
"my_var"]
rdata_result[## my_var
## 0 This variable contains a string
= pyreadr.read_r('../data/RDSlego.rds')
rds_result None] # for RDS files, access the data using None as the key since RDS files have no object name.
rds_result[## set_num ... img_url
## 0 0003977811-1 ... https://cdn.rebrickable.com/media/sets/0003977...
## 1 001-1 ... https://cdn.rebrickable.com/media/sets/001-1.jpg
## 2 0011-2 ... https://cdn.rebrickable.com/media/sets/0011-2.jpg
## 3 0011-3 ... https://cdn.rebrickable.com/media/sets/0011-3.jpg
## 4 0012-1 ... https://cdn.rebrickable.com/media/sets/0012-1.jpg
## ... ... ... ...
## 23860 YODACHRON-1 ... https://cdn.rebrickable.com/media/sets/yodachr...
## 23861 YOTO-1 ... https://cdn.rebrickable.com/media/sets/yoto-1.jpg
## 23862 YOTO-2 ... https://cdn.rebrickable.com/media/sets/yoto-2.jpg
## 23863 YTERRIER-1 ... https://cdn.rebrickable.com/media/sets/yterrie...
## 23864 ZX8000-1 ... https://cdn.rebrickable.com/media/sets/zx8000-...
##
## [23865 rows x 6 columns]
First, let’s download the NHTS data.
library(httr)
# Download the file and write to disk
res <- GET("https://query.data.world/s/y7jo2qmjqfcnmublmwjvkn7wl4xeax",
write_disk("../data/cen10pub.sas7bdat", overwrite = T))
You can see more information about this data in [5].
if (!"sas7bdat" %in% installed.packages()) install.packages("sas7bdat")
library(sas7bdat)
data <- read.sas7bdat("../data/cen10pub.sas7bdat")
head(data)
## HOUSEID HH_CBSA10 RAIL10 CBSASIZE10 CBSACAT10 URBAN10 URBSIZE10 URBRUR10
## 1 20000017 XXXXX 02 02 03 04 06 02
## 2 20000231 XXXXX 02 03 03 01 03 01
## 3 20000521 XXXXX 02 03 03 01 03 01
## 4 20001283 35620 01 05 01 01 05 01
## 5 20001603 -1 02 06 04 04 06 02
## 6 20001649 XXXXX 02 03 03 01 02 01
If you are curious about what this data means, then by all means, take a look at the codebook (XLSX file). For now, it’s enough that we can see roughly how it’s structured.
First, we need to download the SAS data file. This requires writing a function that streams the file from the URL and writes it to disk, which is what this code chunk does.
# Source: https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests
import requests

def download_file(url, local_filename):
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return local_filename

download_file("https://query.data.world/s/y7jo2qmjqfcnmublmwjvkn7wl4xeax", "../data/cen10pub.sas7bdat")
## '../data/cen10pub.sas7bdat'
You can see more information about this data in [5].
To read SAS files, we use the read_sas function in pandas.
import pandas as pd

data = pd.read_sas("../data/cen10pub.sas7bdat")
data
## HOUSEID HH_CBSA10 RAIL10 ... URBAN10 URBSIZE10 URBRUR10
## 0 b'20000017' b'XXXXX' b'02' ... b'04' b'06' b'02'
## 1 b'20000231' b'XXXXX' b'02' ... b'01' b'03' b'01'
## 2 b'20000521' b'XXXXX' b'02' ... b'01' b'03' b'01'
## 3 b'20001283' b'35620' b'01' ... b'01' b'05' b'01'
## 4 b'20001603' b'-1' b'02' ... b'04' b'06' b'02'
## ... ... ... ... ... ... ... ...
## 150142 b'69998896' b'XXXXX' b'02' ... b'01' b'03' b'01'
## 150143 b'69998980' b'33100' b'01' ... b'01' b'05' b'01'
## 150144 b'69999718' b'XXXXX' b'02' ... b'01' b'03' b'01'
## 150145 b'69999745' b'XXXXX' b'02' ... b'01' b'03' b'01'
## 150146 b'69999811' b'31080' b'01' ... b'01' b'05' b'01'
##
## [150147 rows x 8 columns]
Read in two of the files from an earlier example, and save the results as an Rdata file with two objects. Then save each one as an RDS file. (Obviously, use R for this)
In RStudio, go to Session -> Clear Workspace. (This will clear your environment)
Now, using your RDS files, load the objects back into R with different names.
Finally, load your Rdata file. Are the two objects the same? (You can actually test this with all.equal() if you're curious.)
Then, load the two RDS files and the Rdata file in Python. Are the objects the same?
library(readxl)
library(readr)
pizza <- read_xlsx("../data/nyc_slice.xlsx", sheet = 1, guess_max = 7000)
legos <- read_csv("../data/lego_sets.csv")
save(pizza, legos, file = "../data/04_Try_Binary.Rdata")
saveRDS(pizza, "../data/04_Try_Binary1.rds")
saveRDS(legos, "../data/04_Try_Binary2.rds")
rm(pizza, legos) # Limited clearing of workspace...
load("../data/04_Try_Binary.Rdata")
pizza_compare <- readRDS("../data/04_Try_Binary1.rds")
lego_compare <- readRDS("../data/04_Try_Binary2.rds")
all.equal(pizza, pizza_compare)
## [1] TRUE
all.equal(legos, lego_compare)
## [1] TRUE
import pyreadr

robjs = pyreadr.read_r('../data/04_Try_Binary.Rdata')
pizza = robjs["pizza"] # Access the variables using the variable name as a key
legos = robjs["legos"]
pizza_compare = pyreadr.read_r('../data/04_Try_Binary1.rds')[None]
lego_compare = pyreadr.read_r('../data/04_Try_Binary2.rds')[None]
pizza.equals(pizza_compare)
legos.equals(lego_compare)
17.6 Learn more
- Slides from Jenny Bryan’s talk on spreadsheets (sadly, no audio. It was a good talk.)
- The vroom package works like read_csv but allows you to read in and write to many files at incredible speeds.