40  Databases

Published

May 28, 2025

Coming soon!

There are many different database formats. Some of the most common databases are SQL* related formats and Microsoft Access files.

Note

You can get through this class without this section. Feel free to skip it and come back when/if you need it.

This excellent GitHub repo contains code to connect to multiple types of databases in R, python, PHP, Java, SAS, and VBA

40.1 Microsoft Access

To get access to MS Access databases, you will need to become familiar with how to install ODBC drivers. These drivers tell your operating system how to connect to each type of database (so you need a different driver to get to MS Access databases than to get to SQL databases).

and This set of instructions appears to contain all of the right steps for Windows and Mac and has been updated recently (Feb 2022) teamexploratoryHowImportData2022?.

I installed mdbtools on Ubuntu and have the following entry in my /etc/odbcinst.ini file:

[MDBTools]
Description=MDBTools Driver
Driver=libmdbodbc.so
Setup=libmdbodbc.so
FileUsage=1
UsageCount=1

Adding this entry to the file may be part of the mdbtools installation - I certainly have no memory of doing it myself, but this may help if you’re troubleshooting, so I’ve included it.

40.1.1 Database IO: Scottish Witchcraft

For this demo, we’ll be using the Scottish Witchcraft DatabasejuliangoodareSurveyScottishWitchcraft2003?, which you can download from their website, or acquire from the course data folder if you don’t want to register with the authors. A description of the dataset is also available.

In R, we can read in MS Access files using the Hmisc package, as long as the mdbtools library is available on your computer1.

if (!"Hmisc" %in% installed.packages()) install.packages("Hmisc")
library(Hmisc)
db_loc <- "../data/Witchcraftsurvey_download.mdb"

mdb.get(db_loc, tables = TRUE) # get table list
## Error in system(paste("mdb-tables -1", file), intern = TRUE): error in running command
mdb.get(db_loc, tables = "WDB_Trial")[1:6,1:10] # get table of trials, print first 6 rows and 10 cols
## Error in system(paste("mdb-schema -T", shQuote(tab), file), intern = TRUE): error in running command

Many databases have multiple tables with keys that connect information in each table. We’ll spend more time on databases later in the semester - for now, it’s enough to be able to get data out of one. #### Python {-}

There are several tutorials out there to access MS Access databases using packages like pyodbc e.g. datatofishHowConnectPython2021?. I couldn’t quite get these working on Linux, but it is possible you may have better luck on another OS. With that said, the solution using pandas_access seems to be much simpler and require less OS configuration, so it’s what I’ll show here.

First, we have to install pandas_access using pip install pandas_access.

import pandas_access as mdb
db_filename = '../data/Witchcraftsurvey_download.mdb'

# List tables
for tbl in mdb.list_tables(db_filename):
  print(tbl)
## FileNotFoundError: [Errno 2] No such file or directory: 'mdb-tables'

# Read a small table.
trials = mdb.read_table(db_filename, "WDB_Trial_Person")
## FileNotFoundError: [Errno 2] No such file or directory: 'mdb-schema'

This isn’t perfectly stable - I tried to read WDB_Trial and got errors about NA values in an integer field - but it does at least work.

My hope is that you never actually need to get at data in an MS Access database - the format seems to be largely dying out.

40.1.2 SQLite

SQLite databases are contained in single files with the extension .SQLite. These files can still contain many different tables, though. They function as databases but are more portable than SQL databases that require a server instance to run and connecting over a network (or running a server on your machine locally). As a result, they provide an opportunity to demonstrate most of the skills required for working with databases without all of the configuration overhead.

Let’s try working with a sqlite file that has only one table in R:

if (!"RSQLite" %in% installed.packages()) install.packages("RSQLite")
if (!"DBI" %in% installed.packages()) install.packages("DBI")
library(RSQLite)
library(DBI)

# Download the baby names file:
download.file("http://2016.padjo.org/files/data/starterpack/ssa-babynames/ssa-babynames-for-2015.sqlite", destfile = "../data/ssa-babynames-2015.sqlite")

con <- dbConnect(RSQLite::SQLite(), "../data/ssa-babynames-2015.sqlite")
dbListTables(con) # List all the tables
## [1] "babynames"
babyname <- dbReadTable(con, "babynames")
head(babyname, 10) # show the first 10 obs
##    state year    name sex count rank_within_sex per_100k_within_sex
## 1     AK 2015  Olivia   F    56               1              2367.9
## 2     AK 2015    Liam   M    53               1              1590.6
## 3     AK 2015    Emma   F    49               2              2071.9
## 4     AK 2015    Noah   M    46               2              1380.6
## 5     AK 2015  Aurora   F    46               3              1945.0
## 6     AK 2015   James   M    45               3              1350.5
## 7     AK 2015  Amelia   F    39               4              1649.0
## 8     AK 2015     Ava   F    39               4              1649.0
## 9     AK 2015 William   M    44               4              1320.5
## 10    AK 2015  Oliver   M    41               5              1230.5

You can of course write formal queries using the DBI package, but for many databases, it’s easier to do the querying in R. We’ll cover both options later - the R version will be in the next module.

This example was created using datacarpentryAccessingSQLiteDatabases2019? as a primary reference.

If you haven’t already downloaded the database file, you can do that automatically in python using this code:

import urllib.request
urllib.request.urlretrieve("http://2016.padjo.org/files/data/starterpack/ssa-babynames/ssa-babynames-for-2015.sqlite", "../data/babynames-2015.sqlite")

You don’t have to install the sqlite3 module in python using pip because it’s been included in base python since Python 2.5.

import pandas as pd
import sqlite3

con = sqlite3.connect('../data/babynames-2015.sqlite')

babyname = pd.read_sql_query("SELECT * from babynames", con)
## pandas.errors.DatabaseError: Execution failed on sql 'SELECT * from babynames': no such table: babynames
babyname
## NameError: name 'babyname' is not defined

con.close() # You must close any connection you open!

  1. A currently maintained version of the library is here and should work for UNIX platforms. It may be possible to install the library on Windows using the UNIX subsystem, per this thread↩︎