17  Data Input

Published December 17, 2024

17.1 Objectives

  • Read in data from common formats into R or Python
  • Identify delimiters, headers, and other essential components of files
Cheatsheets!

These may be worth printing off as you work through this module.

17.2 Overview: Data Formats

In order to use statistical software to do anything interesting, we need to be able to get data into the program so that we can work with it effectively. For the moment, we’ll focus on tabular data - data that is stored in a rectangular shape, with rows indicating observations and columns indicating variables. This type of data can be stored on the computer in multiple ways:

  • as raw text, usually in a file that ends with .txt, .tsv, .csv, .dat, or sometimes, there will be no file extension at all. These types of files are human-readable. If part of a text file gets corrupted, the rest of the file may be recoverable.

  • as a binary file. Binary files are compressed files that are readable by computers but not by humans. They generally take less space to store on disk (but the same amount of space when read into computer memory). If part of a binary file is corrupted, the entire file is usually affected.

    • R, SAS, Stata, SPSS, and Minitab all have their own formats for storing binary data. Packages such as foreign in R will let you read data from other programs, and packages such as haven in R will let you write data into binary formats used by other programs.
    • [1] describes why binary file formats exist, and why they’re not necessarily optimal.
  • in a spreadsheet. Spreadsheets, such as those created by MS Excel, Google Sheets, or LibreOffice Calc, are not binary formats, but they’re also not raw text files either. They’re a hybrid - a special type of markup that is specific to the filetype and the program it’s designed to work with. Practically, they may function like a poorly laid-out database, a text file, or a total nightmare, depending on who designed the spreadsheet.

Note

There is a collection of spreadsheet horror stories here and a series of even more horrifying tweets here.
Also, there’s this amazing comic:
[xkcd comic by Randall Munroe, xkcd.com, CC-BY-NC 2.5]

  • in a database. Databases are typically composed of a set of one or more tables, with information that may be related across tables. Data stored in a database may be easier to access, and may not require that the entire data set be stored in computer memory at the same time, but you may have to join several tables together to get the full set of data you want to work with.
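For example, here is a minimal sketch of pulling a subset of rows from a SQLite database using the DBI and RSQLite packages (the file name my_database.sqlite and table name measurements are hypothetical):

library(DBI)     # generic database interface
library(RSQLite) # SQLite driver

# Connect to a local SQLite database file (hypothetical path)
con <- dbConnect(RSQLite::SQLite(), "my_database.sqlite")

dbListTables(con) # see which tables are available

# Pull only the rows we need - the full table never has to fit in memory
measurements <- dbGetQuery(con, "SELECT * FROM measurements WHERE year = 2020")

dbDisconnect(con) # always close the connection when you're done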

There are, of course, many other non-tabular data formats – some open and easy to work with, some impenetrable. A few which you may come across:

  • Web data structures: XML (eXtensible Markup Language), JSON (JavaScript Object Notation), YAML. These structures have their own formats and field delimiters, but more importantly, are not necessarily easily converted to tabular structures. They are, however, useful for handling nested objects, such as trees. When read into R or Python, these file formats are usually treated as lists, and may be restructured afterwards into a format useful for statistical analysis (see the sketch after this list). See Chapter 27 for some tools to work with these files.

  • Spatial files: Shapefiles are the most common version of spatial files, though there are a seemingly infinite number of different formats, and new formats pop up at the most inconvenient times. Spatial files often include structured encodings of geographic information plus corresponding tabular format data that goes with the geographic information. Chapter 28 covers some of the tools available for working with spatial data.
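As a small illustration of the read-as-a-list workflow mentioned in the web data structures item above, here is a sketch using the jsonlite package (the JSON string is made up for this example):

library(jsonlite) # JSON parsing in R

# A tiny, made-up JSON record with a nested field
json_str <- '{"name": "Ada", "languages": ["R", "Python"], "id": 1}'

parsed <- fromJSON(json_str)
str(parsed) # a list: $name, $languages (a character vector), $id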

To be minimally functional in R and Python, it’s important to know how to read in text files (CSV, tab-delimited, etc.). It can be helpful to also know how to read in XLSX files. We will briefly discuss binary files and databases, but it is less critical to remember how to read these in without consulting an online reference.

17.3 Text Files

There are several different variants of text data which are relatively common, but for the most part, text data files can be broken down into fixed-width and delimited formats. What’s the difference, you say?

17.3.1 Fixed-width files

Col1    Col2    Col3
 3.4     4.2     5.4
27.3    -2.4    15.9

In a fixed-width text file, the position of the data indicates which field (variable/column) it belongs to. These files are fairly common outputs from older FORTRAN-based programs, but may be found elsewhere as well - if you have a very large amount of data, a fixed-width format may be more efficient to read, because you can select only the portions of the file which matter for a particular analysis (and so you don’t have to read the whole thing into memory).
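For example, readr’s read_fwf() lets you pull out just the character positions you care about; here is a sketch (the file name and positions are hypothetical):

library(readr)

# Read only two fields from a fixed-width file by character position
# (file name and positions are hypothetical)
dat <- read_fwf(
  "measurements.txt",
  col_positions = fwf_positions(
    start = c(1, 9),  # first character of each field
    end = c(8, 14),   # last character of each field
    col_names = c("station", "temp")
  )
)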

17.3.1.1 Fixed Width File IO

In base R (no extra packages), you can read fixed-width files in using read.fwf, but you must specify the column breaks yourself, which can be painful.

url <- "https://www.mesonet.org/index.php/dataMdfMts/dataController/getFile/202206070000/mdf/TEXT/"
data <- read.fwf(url, 
         skip = 3, # Skip the first 2 lines (useless) + the line with column names
         widths = c(5, 6, 6, 7, 7, 7, 7, 6, 7, 7, 7, 8, 9, 6, 7, 7, 7, 7, 7, 7,
                    7, 8, 8, 8))

data[1:6,] # first 6 rows
##      V1  V2 V3 V4   V5  V6  V7  V8   V9 V10 V11   V12    V13 V14  V15 V16  V17
## 1  ACME 110  0 60 29.9 4.4 4.3 111  9.0 0.8 6.4  0.00 959.37 267 29.6 3.6 25.4
## 2  ADAX   1  0 69 29.3 1.7 1.6  98 24.9 0.6 3.4  0.00 971.26 251 29.0 0.6 24.6
## 3  ALTU   2  0 52 31.7 5.5 5.4  89  7.6 1.0 7.8  0.00 956.12 287 31.3 3.5 26.5
## 4  ALV2 116  0 57 30.1 2.5 2.4 108 10.3 0.5 3.6 55.63 954.01 266 30.1 1.7 23.3
## 5  ANT2 135  0 75 29.1 1.1 1.1  44 21.1 0.3 2.0  0.00 985.35 121 28.9 0.5 25.9
## 6  APAC 111  0 58 29.9 5.1 5.1 107  8.5 0.7 6.6  0.00 954.47 224 29.7 3.6 26.2
##    V18  V19  V20    V21  V22  V23     V24
## 1 29.4 27.4 22.5   20.6 1.55 1.48    1.40
## 2 28.7 25.6 24.3 -998.0 1.46 1.52 -998.00
## 3 32.1 27.6 24.0 -998.0 1.72 1.50 -998.00
## 4 30.3 26.2 21.1 -998.0 1.49 1.40 -998.00
## 5 29.0 26.3 22.8   21.4 1.51 1.39    1.41
## 6 29.1 26.6 24.3   20.5 1.59 1.47    1.40

You can count all of those spaces by hand (not shown), you can use a different function, or you can write code to do it for you.


# I like to cheat a bit....
# Read the first few lines in
tmp <- readLines(url, n = 20)[-c(1:2)]

# split each line into a series of single characters
tmp_chars <- strsplit(tmp, '') 

# Bind the lines together into a character matrix
# do.call applies a function to an entire list - so instead of doing 18 rbinds, 
# one command will put all 18 rows together
tmp_chars <- do.call("rbind", tmp_chars) # (it's ok if you don't get this line)

# Make into a logical matrix where T = space, F = not space
tmp_chars_space <- tmp_chars == " "

# Add up the number of rows where there is a non-space character
# space columns would have 0s/FALSE
tmp_space <- colSums(!tmp_chars_space)

# We need a nonzero column followed by a zero column
breaks <- which(tmp_space != 0 & c(tmp_space[-1], 0) == 0)

# Then, we need to get the widths between the columns
widths <- diff(c(0, breaks))

# Now we're ready to go
mesodata <- read.fwf(url, skip = 3, widths = widths, header = F)
# read header separately - if you use header = T, it errors for some reason.
# It's easier just to work around the error than to fix it :)
mesodata_names <- read.fwf(url, skip = 2, n = 1, widths = widths, header = F, 
                           stringsAsFactors = F)
names(mesodata) <- as.character(mesodata_names)

mesodata[1:6,] # first 6 rows
##    STID   STNM   TIME    RELH    TAIR    WSPD    WVEC   WDIR    WDSD    WSSD
## 1  ACME    110      0      60    29.9     4.4     4.3    111     9.0     0.8
## 2  ADAX      1      0      69    29.3     1.7     1.6     98    24.9     0.6
## 3  ALTU      2      0      52    31.7     5.5     5.4     89     7.6     1.0
## 4  ALV2    116      0      57    30.1     2.5     2.4    108    10.3     0.5
## 5  ANT2    135      0      75    29.1     1.1     1.1     44    21.1     0.3
## 6  APAC    111      0      58    29.9     5.1     5.1    107     8.5     0.7
##      WMAX     RAIN      PRES   SRAD    TA9M    WS2M    TS10    TB10    TS05
## 1     6.4     0.00    959.37    267    29.6     3.6    25.4    29.4    27.4
## 2     3.4     0.00    971.26    251    29.0     0.6    24.6    28.7    25.6
## 3     7.8     0.00    956.12    287    31.3     3.5    26.5    32.1    27.6
## 4     3.6    55.63    954.01    266    30.1     1.7    23.3    30.3    26.2
## 5     2.0     0.00    985.35    121    28.9     0.5    25.9    29.0    26.3
## 6     6.6     0.00    954.47    224    29.7     3.6    26.2    29.1    26.6
##      TS25    TS60     TR05     TR25     TR60
## 1    22.5    20.6     1.55     1.48     1.40
## 2    24.3  -998.0     1.46     1.52  -998.00
## 3    24.0  -998.0     1.72     1.50  -998.00
## 4    21.1  -998.0     1.49     1.40  -998.00
## 5    22.8    21.4     1.51     1.39     1.41
## 6    24.3    20.5     1.59     1.47     1.40

You can also write fixed-width files if you really want to:

if (!"gdata" %in% installed.packages()) install.packages("gdata")
library(gdata)
write.fwf(mtcars, file = tempfile())

The readr package creates data-frame-like objects called tibbles (a souped-up data frame), and it is much friendlier to use.

library(readr) # Better data importing in R

read_table(url, skip = 2) # Gosh, that was much easier!
## # A tibble: 120 × 24
##    STID   STNM  TIME  RELH   TAIR  WSPD  WVEC  WDIR  WDSD  WSSD  WMAX  RAIN
##    <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ACME    110     0    60   29.9   4.4   4.3   111   9     0.8   6.4   0  
##  2 ADAX      1     0    69   29.3   1.7   1.6    98  24.9   0.6   3.4   0  
##  3 ALTU      2     0    52   31.7   5.5   5.4    89   7.6   1     7.8   0  
##  4 ALV2    116     0    57   30.1   2.5   2.4   108  10.3   0.5   3.6  55.6
##  5 ANT2    135     0    75   29.1   1.1   1.1    44  21.1   0.3   2     0  
##  6 APAC    111     0    58   29.9   5.1   5.1   107   8.5   0.7   6.6   0  
##  7 ARD2    126     0    61   31.2   3.3   3.2   109   9.1   0.6   4.3   0  
##  8 ARNE      6     0    49   30.4   4.5   4.4   111  11.1   0.9   6.4   0  
##  9 BEAV      8     0    42   30.5   6.1   6     127   8.7   0.9   7.9   0  
## 10 BESS      9     0    53 -999     5.3   5.2   115   8.6   0.6   7     0  
## # ℹ 110 more rows
## # ℹ 12 more variables: PRES <dbl>, SRAD <dbl>, TA9M <dbl>, WS2M <dbl>,
## #   TS10 <dbl>, TB10 <dbl>, TS05 <dbl>, TS25 <dbl>, TS60 <dbl>, TR05 <dbl>,
## #   TR25 <dbl>, TR60 <dbl>

By default, pandas’ read_fwf will guess at the format of your fixed-width file.

import pandas as pd
url = "https://www.mesonet.org/index.php/dataMdfMts/dataController/getFile/202006070000/mdf/TEXT/"
data = pd.read_fwf(url, skiprows = 2) # Skip the first 2 lines (useless)
data.head()
##    STID   STNM  TIME  RELH  TAIR  WSPD  ...  TS05  TS25   TS60  TR05  TR25    TR60
## 0  ACME  110.0   0.0  53.0  31.8   5.2  ...  31.6  25.2   21.7  3.09  2.22    1.48
## 1  ADAX    1.0   0.0  55.0  32.4   1.0  ...  29.6  26.8 -998.0  2.61  1.88 -998.00
## 2  ALTU    2.0   0.0  31.0  35.6   8.9  ...  30.7  26.1 -998.0  3.39  2.47 -998.00
## 3  ALV2  116.0   0.0  27.0  35.8   6.7  ...  25.6  22.6 -998.0  2.70  1.60 -998.00
## 4  ANT2  135.0   0.0  73.0  27.8   0.0  ...  30.2  26.8   23.8  1.96  1.73    1.33
## 
## [5 rows x 24 columns]

17.3.2 Delimited Text Files

Delimited text files are files where fields are separated by a specific character, such as space, comma, semicolon, tabs, etc. Often, delimited text files will have the column names as the first row in the file.
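For comparison with the fixed-width example above, here is the same small table stored as a comma-delimited file:

Col1,Col2,Col3
3.4,4.2,5.4
27.3,-2.4,15.9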

17.3.2.1 Comma Delimited Files

url <- "https://raw.githubusercontent.com/srvanderplas/datasets/main/clean/pokemon_gen_1-9.csv"

pokemon_info <- read.csv(url, header = T, stringsAsFactors = F)
pokemon_info[1:6, 1:6] # Show only the first 6 lines & cols
##   gen pokedex_no
## 1   1          1
## 2   1          2
## 3   1          3
## 4   1          3
## 5   1          4
## 6   1          5
##                                                                img_link
## 1     https://img.pokemondb.net/sprites/sword-shield/icon/bulbasaur.png
## 2       https://img.pokemondb.net/sprites/sword-shield/icon/ivysaur.png
## 3      https://img.pokemondb.net/sprites/sword-shield/icon/venusaur.png
## 4 https://img.pokemondb.net/sprites/sword-shield/icon/venusaur-mega.png
## 5    https://img.pokemondb.net/sprites/sword-shield/icon/charmander.png
## 6    https://img.pokemondb.net/sprites/sword-shield/icon/charmeleon.png
##         name variant         type
## 1  Bulbasaur    <NA> Grass,Poison
## 2    Ivysaur    <NA> Grass,Poison
## 3   Venusaur    <NA> Grass,Poison
## 4   Venusaur    Mega Grass,Poison
## 5 Charmander    <NA>         Fire
## 6 Charmeleon    <NA>         Fire

There is a family of read_xxx functions in readr to read files delimited with commas (read_csv), tabs (read_tsv), or generic delimited files (read_delim).
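These functions are all thin wrappers around the same machinery; for instance, the two calls below should produce identical results (file.csv is a hypothetical file):

library(readr)
# read_csv is read_delim with the delimiter preset to a comma
df1 <- read_csv("file.csv")
df2 <- read_delim("file.csv", delim = ",")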

The most common delimited text format is CSV: comma-separated value.

library(readr)
url <- "https://raw.githubusercontent.com/srvanderplas/datasets/main/clean/pokemon_gen_1-9.csv"
pokemon_info <- read_csv(url)
pokemon_info[1:6, 1:6] # Show only the first 6 lines & cols
## # A tibble: 6 × 6
##     gen pokedex_no img_link                                  name  variant type 
##   <dbl>      <dbl> <chr>                                     <chr> <chr>   <chr>
## 1     1          1 https://img.pokemondb.net/sprites/sword-… Bulb… <NA>    Gras…
## 2     1          2 https://img.pokemondb.net/sprites/sword-… Ivys… <NA>    Gras…
## 3     1          3 https://img.pokemondb.net/sprites/sword-… Venu… <NA>    Gras…
## 4     1          3 https://img.pokemondb.net/sprites/sword-… Venu… Mega    Gras…
## 5     1          4 https://img.pokemondb.net/sprites/sword-… Char… <NA>    Fire 
## 6     1          5 https://img.pokemondb.net/sprites/sword-… Char… <NA>    Fire 

There is a family of read_xxx functions in pandas including functions to read files delimited with commas (read_csv) as well as generic delimited files (read_table).

import pandas as pd

url = "https://raw.githubusercontent.com/srvanderplas/datasets/main/clean/pokemon_gen_1-9.csv"
pokemon_info = pd.read_csv(url)
pokemon_info.iloc[0:6, 0:6] # Show only the first 6 lines & cols

Sometimes, data is available in files that use other characters as delimiters. This can happen when commas are an important part of the data stored in the file, but can also just be a choice made by the person generating the file. Either way, we can’t let it keep us from accessing the data.

17.3.2.2 Other Character Delimited Files

# Download from web
download.file("https://geonames.usgs.gov/docs/stategaz/NE_Features.zip", destfile = '../data/NE_Features.zip')
# Unzip to `data/` folder
unzip('../data/NE_Features.zip', exdir = '../data/')
# List files matching the file type and pick the first one
fname <- list.files("../data/", 'NE_Features_20', full.names = T)[1]

# see that the file is delimited with |
readLines(fname, n = 5)
## [1] "FEATURE_ID|FEATURE_NAME|FEATURE_CLASS|STATE_ALPHA|STATE_NUMERIC|COUNTY_NAME|COUNTY_NUMERIC|PRIMARY_LAT_DMS|PRIM_LONG_DMS|PRIM_LAT_DEC|PRIM_LONG_DEC|SOURCE_LAT_DMS|SOURCE_LONG_DMS|SOURCE_LAT_DEC|SOURCE_LONG_DEC|ELEV_IN_M|ELEV_IN_FT|MAP_NAME|DATE_CREATED|DATE_EDITED"
## [2] "171013|Peetz Table|Area|CO|08|Logan|075|405840N|1030332W|40.9777645|-103.0588116|||||1341|4400|Peetz|10/13/1978|"                                                                                                                                                        
## [3] "171029|Sidney Draw|Valley|NE|31|Cheyenne|033|410816N|1030116W|41.1377213|-103.021044|405215N|1040353W|40.8709614|-104.0646558|1255|4117|Brownson|10/13/1978|03/08/2018"                                                                                                  
## [4] "182687|Highline Canal|Canal|CO|08|Sedgwick|115|405810N|1023137W|40.9694351|-102.5268556|||||1119|3671|Sedgwick|10/13/1978|"                                                                                                                                              
## [5] "182688|Cottonwood Creek|Stream|CO|08|Sedgwick|115|405511N|1023355W|40.9197132|-102.5651893|405850N|1030107W|40.9805426|-103.0185329|1095|3592|Sedgwick|10/13/1978|10/23/2009"

# a file delimited with |
nebraska_locations <- read.delim(fname, sep = "|", header = T)
nebraska_locations[1:6, 1:6]
##   FEATURE_ID     FEATURE_NAME FEATURE_CLASS STATE_ALPHA STATE_NUMERIC
## 1     171013      Peetz Table          Area          CO             8
## 2     171029      Sidney Draw        Valley          NE            31
## 3     182687   Highline Canal         Canal          CO             8
## 4     182688 Cottonwood Creek        Stream          CO             8
## 5     182689        Sand Draw        Valley          CO             8
## 6     182690    Sedgwick Draw        Valley          CO             8
##   COUNTY_NAME
## 1       Logan
## 2    Cheyenne
## 3    Sedgwick
## 4    Sedgwick
## 5    Sedgwick
## 6    Sedgwick

There is a family of read_xxx functions in readr to read files delimited with commas (read_csv), tabs (read_tsv), or generic delimited files (read_delim).


nebraska_locations <- read_delim(fname, delim = "|")
nebraska_locations[1:6, 1:6]
## # A tibble: 6 × 6
##   FEATURE_ID FEATURE_NAME    FEATURE_CLASS STATE_ALPHA STATE_NUMERIC COUNTY_NAME
##        <dbl> <chr>           <chr>         <chr>       <chr>         <chr>      
## 1     171013 Peetz Table     Area          CO          08            Logan      
## 2     171029 Sidney Draw     Valley        NE          31            Cheyenne   
## 3     182687 Highline Canal  Canal         CO          08            Sedgwick   
## 4     182688 Cottonwood Cre… Stream        CO          08            Sedgwick   
## 5     182689 Sand Draw       Valley        CO          08            Sedgwick   
## 6     182690 Sedgwick Draw   Valley        CO          08            Sedgwick

We can actually read in the file without unzipping it, so long as we download it first - readr does not support reading remote zipped files, but it does support reading zipped files locally. If we know ahead of time what our delimiter is, this is the best choice as it reduces the amount of file clutter we have in our working directory.

nebraska_locations <- read_delim("../data/NE_Features.zip", delim = "|")
nebraska_locations[1:6, 1:6]
## # A tibble: 6 × 6
##   FEATURE_ID FEATURE_NAME    FEATURE_CLASS STATE_ALPHA STATE_NUMERIC COUNTY_NAME
##        <dbl> <chr>           <chr>         <chr>       <chr>         <chr>      
## 1     171013 Peetz Table     Area          CO          08            Logan      
## 2     171029 Sidney Draw     Valley        NE          31            Cheyenne   
## 3     182687 Highline Canal  Canal         CO          08            Sedgwick   
## 4     182688 Cottonwood Cre… Stream        CO          08            Sedgwick   
## 5     182689 Sand Draw       Valley        CO          08            Sedgwick   
## 6     182690 Sedgwick Draw   Valley        CO          08            Sedgwick

There is a family of read_xxx functions in pandas including functions to read files delimited with commas (read_csv) as well as generic delimited files (read_table).

Pandas can access zipped data files and unzip them while reading the data in, so we don’t have to download the file and unzip it first.

# a file delimited with |
import pandas as pd

url = "https://geonames.usgs.gov/docs/stategaz/NE_Features.zip"
nebraska_locations = pd.read_table(url, delimiter = "|")
nebraska_locations
Try it out: Reading CSV files

Rebrickable.com contains tables of almost any information imaginable concerning Lego sets, conveniently available at their download page. Because these data sets are comparatively large, they are available as compressed CSV files - that is, the .gz extension is a gzip compression applied to the CSV.

The readr package and pandas can handle .csv.gz files with no problems. Try reading in the data using the appropriate function from that package. Can you save the data as an uncompressed csv?

library(readr)
legosets <- read_csv("https://cdn.rebrickable.com/media/downloads/sets.csv.gz")
write_csv(legosets, "../data/lego_sets.csv")

import pandas as pd

legosets = pd.read_csv("https://cdn.rebrickable.com/media/downloads/sets.csv.gz")
legosets.to_csv("../data/lego_sets_py.csv")

17.4 Spreadsheets

17.4.1 Spreadsheet IO

This example uses data from NYC Slice. The author maintains a Google sheet of the slices he has photographed, which we can download as an Excel sheet and import.

In R, the easiest way to read Excel data in is to use the readxl package. There are many other packages with different features, however - I have used openxlsx in the past to format spreadsheets to send to clients, for instance. By far, you are more likely to have problems with the arcane format of the Excel spreadsheet itself than with the package used to read the data in. It is usually helpful to open the spreadsheet up in a graphical program first to make sure the formatting is as you expected it to be.

if (!"readxl" %in% installed.packages()) install.packages("readxl")
library(readxl)

url <- "https://docs.google.com/spreadsheets/d/1EY3oi9ttxybG0A0Obtwey6BFu7QLqdHe02JApijgztg/export?format=xlsx"
# Only download the data if it doesn't exist in the data folder
if (!file.exists("../data/nyc_slice.xlsx")) {
  download.file(url, destfile = "../data/nyc_slice.xlsx", mode = "wb")
}

# Read in the downloaded data
pizza_data <- read_xlsx("../data/nyc_slice.xlsx", sheet = 1)
pizza_data[1:10, 1:6]
## # A tibble: 10 × 6
##    `Link to IG Post`                       Name  location_lat location_lng Date 
##    <chr>                                   <chr> <chr>        <chr>        <chr>
##  1 https://www.instagram.com/p/CjszJ-fOP5… Ange… 40.6232544   -73.9379223… 2022…
##  2 https://www.instagram.com/p/CjdcPNAufP… Ozon… 40.6808917   -73.8426307  2022…
##  3 https://www.instagram.com/p/CjQdNsaOZl… Pino… 40.6000148   -73.9994551  2022…
##  4 https://www.instagram.com/p/Ci5XblnOnM… La R… 40.7133354   -73.8294102  2022…
##  5 https://www.instagram.com/p/CiiLAtkON_… Rony… 40.7482509   -73.9923498  2022…
##  6 https://www.instagram.com/p/CiS-44nucN… John… 40.8545616   -73.8658818… 2022…
##  7 https://www.instagram.com/p/CiSmQnjutQ… Preg… 40.8631291   -73.8585108  2022…
##  8 https://www.instagram.com/p/CiIO6oFuxp… N & … 40.6004632   -73.9430723… 2022…
##  9 https://www.instagram.com/p/ChaZUsxuFs… Pepp… 40.9036613   -73.8504667… 2022…
## 10 https://www.instagram.com/p/ChNd9wqOqG… Rocc… 40.8676344   -73.8836046  2022…
## # ℹ 1 more variable: `Date Expanded (times in EST)` <chr>
You may need to install openpyxl via pip for this code to work.

import pandas as pd

pizza_data = pd.read_excel("../data/nyc_slice.xlsx")
## ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.
pizza_data
## NameError: name 'pizza_data' is not defined

In general, it is better to avoid working in Excel, as it is not easy to reproduce the results (and Excel is horrible about dates and times, among other issues). Saving your data in more reproducible formats will make writing reproducible code much easier.
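For example, once you have the spreadsheet data in R, you can write it back out as plain CSV for future use (a sketch, assuming the pizza_data object from above):

library(readr)
# Save the imported spreadsheet data in a plain-text, reproducible format
write_csv(pizza_data, "../data/nyc_slice.csv")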

Try it out

The Nebraska Department of Motor Vehicles publishes a database of vehicle registrations by type of license plate. Link

Read in the data using your language(s) of choice. Be sure to look at the structure of the excel file, so that you can read the data in properly!

url <- "https://dmv.nebraska.gov/sites/dmv.nebraska.gov/files/doc/data/ld-totals/NE_Licensed_Drivers_by_Type_2021.xls"
download.file(url, destfile = "../data/NE_Licensed_Drivers_by_Type_2021.xls", mode = "wb")
library(readxl)
ne_plates <- read_xls(path = "../data/NE_Licensed_Drivers_by_Type_2021.xls", skip = 2)
ne_plates[1:10,1:6]
## # A tibble: 10 × 6
##    Age   \nOperator's \nLicense …¹ Operator's\nLicense …² Motor-\ncycle\nLicen…³
##    <chr>                     <dbl>                  <dbl>                  <dbl>
##  1 <NA>                         NA                     NA                     NA
##  2 14                            0                      0                      0
##  3 15                            0                      0                      0
##  4 16                            0                      0                      0
##  5 17                          961                     33                      0
##  6 18                        18903                    174                      0
##  7 19                        22159                    251                      0
##  8 20                        22844                    326                      1
##  9 21                        21589                    428                      0
## 10 22                        22478                    588                      0
## # ℹ abbreviated names: ¹​`\nOperator's \nLicense -\nClass O`,
## #   ²​`Operator's\nLicense - \nClass O/\nMotorcycle\nClass M`,
## #   ³​`Motor-\ncycle\nLicense /\nClass M`
## # ℹ 2 more variables: `Commercial Driver's License` <chr>, ...6 <chr>

You may need to install xlrd via pip for this code to work.

import pandas as pd

ne_plates = pd.read_excel("https://dmv.nebraska.gov/sites/dmv.nebraska.gov/files/doc/data/ld-totals/NE_Licensed_Drivers_by_Type_2021.xls", skiprows = 2)
## ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 2.0.1 for xls Excel support Use pip or conda to install xlrd.
ne_plates
## NameError: name 'ne_plates' is not defined

17.4.2 Google Sheets

Of course, some spreadsheets are available online via Google sheets. There are specific R and python packages to interface with Google sheets, and these can do more than just read data in - they can create, format, and otherwise manipulate Google sheets programmatically. We’re not going to get into the power of these packages just now, but it’s worth a look if you’re working with collaborators who use Google sheets.

This section is provided for reference, but the details of API authentication are a bit too complicated to require of anyone who is just learning to program. Feel free to skip it and come back later if you need it.

The first two tabs below show authentication-free options for publicly available spreadsheets. For anything that is privately available, you will have to use API authentication via GSpread or googlesheets4 in Python and R, respectively.

17.4.2.1 Reading Google Sheets

Let’s demonstrate reading in data from Google sheets in R and Python using the Data Is Plural archive.

One simple hack-ish way to read google sheets in Python (so long as the sheet is publicly available) is to modify the sheet url to export the data to CSV and then just read that into pandas as usual. This method is described in [2].

import pandas as pd

sheet_id = "1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk"
sheet_name = "Items"
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"

data_is_plural = pd.read_csv(url)

This method would likely work just as well in R and would not require the googlesheets4 package.

This method is described in [2] for Python, but I have adapted the code to use in R.

library(readr)
sheet_id = "1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk"
sheet_name = "Items"
url = sprintf("https://docs.google.com/spreadsheets/d/%s/gviz/tq?tqx=out:csv&sheet=%s", sheet_id, sheet_name)

data_is_plural = read_csv(url)

This code is set not to run when the textbook is compiled because it requires some interactive authentication.

Copy this code and run it on your computer to read in a sheet from google drive directly. You will see a prompt in the R console that you’ll have to interact with, and there may also be a browser pop-up where you will need to sign in to google.

library(googlesheets4)
gs4_auth(scopes = "https://www.googleapis.com/auth/drive.readonly") # Read-only permissions
data_is_plural <- read_sheet("https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0")

These instructions are derived from [3]. We will have to install the GSpread package: type pip install gspread into the terminal.

Then, you will need to obtain a client token JSON file following these instructions.

import gspread as gs
import pandas as pd

I’ve stopped here because I can’t get the authentication working, but the method seems solid if you’re willing to fiddle around with it.

Try It Out!

Using a method of your choice, read in this spreadsheet of dog sizes and make a useful plot of dog height and weight ranges by breed.

Coming soon!

17.5 Binary Files

R has binary file formats which store data in a more compact form. It is relatively common for government websites, in particular, to provide SAS data in binary form. Python, as a more general computing language, has many different ways to interact with binary data files, as each programmer and application might want to save their data in binary form in a different way. As a result, there is not a general-purpose binary data format for Python data. If you are interested in reading binary data in Python, see [4].

17.5.1 Binary File IO

.Rdata is perhaps the most common R binary data format, and can store several objects (along with their names) in the same file.

legos <- read_csv("../data/lego_sets.csv")
my_var <- "This variable contains a string"
save(legos, my_var, file = "../data/R_binary.Rdata")

If we look at the file sizes of lego_sets.csv (619 KB) and R_binary.Rdata (227.8 KB), the size difference between binary and flat file formats is obvious.
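You can verify this programmatically with file.size() (assuming both files exist at these paths):

file.size("../data/lego_sets.csv")   # size in bytes of the flat file
file.size("../data/R_binary.Rdata")  # size in bytes of the binary file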

We can load the R binary file back in using the load() function.

rm(legos, my_var) # clear the files out

ls() # all objects in the working environment
##  [1] "breaks"             "data"               "data_is_plural"    
##  [4] "fname"              "legosets"           "mesodata"          
##  [7] "mesodata_names"     "ne_plates"          "nebraska_locations"
## [10] "openai_key"         "pizza_data"         "pokemon_info"      
## [13] "sheet_id"           "sheet_name"         "tmdb_key"          
## [16] "tmp"                "tmp_chars"          "tmp_chars_space"   
## [19] "tmp_space"          "url"                "widths"

load("../data/R_binary.Rdata")

ls() # all objects in the working environment
##  [1] "breaks"             "data"               "data_is_plural"    
##  [4] "fname"              "legos"              "legosets"          
##  [7] "mesodata"           "mesodata_names"     "my_var"            
## [10] "ne_plates"          "nebraska_locations" "openai_key"        
## [13] "pizza_data"         "pokemon_info"       "sheet_id"          
## [16] "sheet_name"         "tmdb_key"           "tmp"               
## [19] "tmp_chars"          "tmp_chars_space"    "tmp_space"         
## [22] "url"                "widths"

Another (less common) binary format used in R is the RDS format. Unlike Rdata, the RDS format does not save the object name - it only saves its contents (which also means you can save only one object at a time). As a result, when you read from an RDS file, you need to store the result of that function into a variable.

saveRDS(legos, "../data/RDSlego.rds")

other_lego <- readRDS("../data/RDSlego.rds")

Because RDS formats don’t save the object name, you can be sure that you’re not over-writing some object in your workspace by loading a different file. The downside to this is that you have to save each object to its own RDS file separately.
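If you have many objects to save, you can script the bookkeeping; here is a sketch that loops over a named list (using the objects from above):

# Save several objects to separate RDS files in one pass
to_save <- list(legos = legos, my_var = my_var)
for (nm in names(to_save)) {
  saveRDS(to_save[[nm]], file.path("../data", paste0(nm, ".rds")))
}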

We first need to install the pyreadr package by running pip install pyreadr in the terminal.

import pyreadr

rdata_result = pyreadr.read_r('../data/R_binary.Rdata')
rdata_result["legos"] # Access the variables using the variable name as a key
##             set_num  ...                                            img_url
## 0      0003977811-1  ...  https://cdn.rebrickable.com/media/sets/0003977...
## 1             001-1  ...   https://cdn.rebrickable.com/media/sets/001-1.jpg
## 2            0011-2  ...  https://cdn.rebrickable.com/media/sets/0011-2.jpg
## 3            0011-3  ...  https://cdn.rebrickable.com/media/sets/0011-3.jpg
## 4            0012-1  ...  https://cdn.rebrickable.com/media/sets/0012-1.jpg
## ...             ...  ...                                                ...
## 23860   YODACHRON-1  ...  https://cdn.rebrickable.com/media/sets/yodachr...
## 23861        YOTO-1  ...  https://cdn.rebrickable.com/media/sets/yoto-1.jpg
## 23862        YOTO-2  ...  https://cdn.rebrickable.com/media/sets/yoto-2.jpg
## 23863    YTERRIER-1  ...  https://cdn.rebrickable.com/media/sets/yterrie...
## 23864      ZX8000-1  ...  https://cdn.rebrickable.com/media/sets/zx8000-...
## 
## [23865 rows x 6 columns]
rdata_result["my_var"]
##                             my_var
## 0  This variable contains a string

rds_result = pyreadr.read_r('../data/RDSlego.rds')
rds_result[None] # for RDS files, access the data using None as the key since RDS files have no object name.
##             set_num  ...                                            img_url
## 0      0003977811-1  ...  https://cdn.rebrickable.com/media/sets/0003977...
## 1             001-1  ...   https://cdn.rebrickable.com/media/sets/001-1.jpg
## 2            0011-2  ...  https://cdn.rebrickable.com/media/sets/0011-2.jpg
## 3            0011-3  ...  https://cdn.rebrickable.com/media/sets/0011-3.jpg
## 4            0012-1  ...  https://cdn.rebrickable.com/media/sets/0012-1.jpg
## ...             ...  ...                                                ...
## 23860   YODACHRON-1  ...  https://cdn.rebrickable.com/media/sets/yodachr...
## 23861        YOTO-1  ...  https://cdn.rebrickable.com/media/sets/yoto-1.jpg
## 23862        YOTO-2  ...  https://cdn.rebrickable.com/media/sets/yoto-2.jpg
## 23863    YTERRIER-1  ...  https://cdn.rebrickable.com/media/sets/yterrie...
## 23864      ZX8000-1  ...  https://cdn.rebrickable.com/media/sets/zx8000-...
## 
## [23865 rows x 6 columns]

First, let’s download the NHTS data.

library(httr)
# Download the file and write to disk
res <- GET("https://query.data.world/s/y7jo2qmjqfcnmublmwjvkn7wl4xeax", 
           write_disk("../data/cen10pub.sas7bdat", overwrite = T))

You can see more information about this data here [5].

if (!"sas7bdat" %in% installed.packages()) install.packages("sas7bdat")

library(sas7bdat)
data <- read.sas7bdat("../data/cen10pub.sas7bdat")
head(data)
##    HOUSEID HH_CBSA10 RAIL10 CBSASIZE10 CBSACAT10 URBAN10 URBSIZE10 URBRUR10
## 1 20000017     XXXXX     02         02        03      04        06       02
## 2 20000231     XXXXX     02         03        03      01        03       01
## 3 20000521     XXXXX     02         03        03      01        03       01
## 4 20001283     35620     01         05        01      01        05       01
## 5 20001603        -1     02         06        04      04        06       02
## 6 20001649     XXXXX     02         03        03      01        02       01

If you are curious about what this data means, then by all means, take a look at the codebook (XLSX file). For now, it’s enough that we can see roughly how it’s structured.

First, we need to download the SAS data file. This requires writing a function to stream the file from the URL to disk, which is what this code chunk does.

# Source: https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests
import requests
def download_file(url, local_filename):
  # NOTE the stream=True parameter below
  with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open(local_filename, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192): 
        f.write(chunk)
  return local_filename

download_file("https://query.data.world/s/y7jo2qmjqfcnmublmwjvkn7wl4xeax", "../data/cen10pub.sas7bdat")
## '../data/cen10pub.sas7bdat'

You can see more information about this data here [5].

To read SAS files, we use the read_sas function in Pandas.

import pandas as pd

data = pd.read_sas("../data/cen10pub.sas7bdat")
data
##             HOUSEID HH_CBSA10 RAIL10  ... URBAN10 URBSIZE10 URBRUR10
## 0       b'20000017'  b'XXXXX'  b'02'  ...   b'04'     b'06'    b'02'
## 1       b'20000231'  b'XXXXX'  b'02'  ...   b'01'     b'03'    b'01'
## 2       b'20000521'  b'XXXXX'  b'02'  ...   b'01'     b'03'    b'01'
## 3       b'20001283'  b'35620'  b'01'  ...   b'01'     b'05'    b'01'
## 4       b'20001603'     b'-1'  b'02'  ...   b'04'     b'06'    b'02'
## ...             ...       ...    ...  ...     ...       ...      ...
## 150142  b'69998896'  b'XXXXX'  b'02'  ...   b'01'     b'03'    b'01'
## 150143  b'69998980'  b'33100'  b'01'  ...   b'01'     b'05'    b'01'
## 150144  b'69999718'  b'XXXXX'  b'02'  ...   b'01'     b'03'    b'01'
## 150145  b'69999745'  b'XXXXX'  b'02'  ...   b'01'     b'03'    b'01'
## 150146  b'69999811'  b'31080'  b'01'  ...   b'01'     b'05'    b'01'
## 
## [150147 rows x 8 columns]
Note that pandas reads the character fields in as byte strings (the b'...' prefixes above); you may need to decode these into regular strings before working with them.

Try it out

Read in two of the files from an earlier example, and save the results as an Rdata file with two objects. Then save each one as an RDS file. (Obviously, use R for this)

In RStudio, go to Session -> Clear Workspace. (This will clear your environment)

Now, using your RDS files, load the objects back into R with different names.

Finally, load your Rdata file. Are the two objects the same? (You can actually test this with all.equal() if you’re curious)

Then, load the two RDS files and the Rdata file in Python. Are the objects the same?

library(readxl)
library(readr)
pizza <- read_xlsx("../data/nyc_slice.xlsx", sheet = 1, guess_max = 7000)
legos <- read_csv("../data/lego_sets.csv")

save(pizza, legos, file = "../data/04_Try_Binary.Rdata")
saveRDS(pizza, "../data/04_Try_Binary1.rds")
saveRDS(legos, "../data/04_Try_Binary2.rds")

rm(pizza, legos) # Limited clearing of workspace... 


load("../data/04_Try_Binary.Rdata")

pizza_compare <- readRDS("../data/04_Try_Binary1.rds")
lego_compare <- readRDS("../data/04_Try_Binary2.rds")

all.equal(pizza, pizza_compare)
## [1] TRUE
all.equal(legos, lego_compare)
## [1] TRUE
import pyreadr

robjs = pyreadr.read_r('../data/04_Try_Binary.Rdata')
pizza = robjs["pizza"]
legos = robjs["legos"] # Access the variables using the variable name as a key

pizza_compare = pyreadr.read_r('../data/04_Try_Binary1.rds')[None]
lego_compare = pyreadr.read_r('../data/04_Try_Binary2.rds')[None]

pizza.equals(pizza_compare)
legos.equals(lego_compare)

17.6 Learn more

17.7 References

[1]
BetterExplained, “A little diddy about binary file formats,” Better Explained, 2017. [Online]. Available: https://betterexplained.com/articles/a-little-diddy-about-binary-file-formats/. [Accessed: Jan. 13, 2023]
[2]
M. Schäfer, “Read Data from Google Sheets into Pandas without the Google Sheets API,” Towards Data Science. Dec. 2020 [Online]. Available: https://towardsdatascience.com/read-data-from-google-sheets-into-pandas-without-the-google-sheets-api-5c468536550. [Accessed: Jun. 07, 2022]
[3]
M. Clarke, “How to read Google Sheets data in Pandas with GSpread,” Practical Data Science. Jun. 2021 [Online]. Available: https://web.archive.org/web/20211025204025/https://practicaldatascience.co.uk/data-science/how-to-read-google-sheets-data-in-pandas-with-gspread. [Accessed: Jun. 07, 2022]
[4]
C. Maierle, “Loading binary data to NumPy/Pandas,” Towards Data Science. Jul. 2020 [Online]. Available: https://towardsdatascience.com/loading-binary-data-to-numpy-pandas-9caa03eb0672. [Accessed: Jun. 07, 2022]
[5]
US Department of Transportation, “National Household Travel Survey (NHTS) 2009,” data.world. Mar. 2018 [Online]. Available: https://data.world/dot/national-household-travel-survey-nhts-2009. [Accessed: Jun. 13, 2022]