if (!"Hmisc" %in% installed.packages()) install.packages("Hmisc")
library(Hmisc)
db_loc <- "../data/Witchcraftsurvey_download.mdb"
mdb.get(db_loc, tables = TRUE) # get table list
## Error in system(paste("mdb-tables -1", file), intern = TRUE): error in running command
mdb.get(db_loc, tables = "WDB_Trial")[1:6,1:10] # get table of trials, print first 6 rows and 10 cols
## Error in system(paste("mdb-schema -T", shQuote(tab), file), intern = TRUE): error in running command
40 Databases
Coming soon!
There are many different database formats. Some of the most common databases are SQL* related formats and Microsoft Access files.
You can get through this class without this section. Feel free to skip it and come back when/if you need it.
40.1 Microsoft Access
To get access to MS Access databases, you will need to become familiar with how to install ODBC drivers. These drivers tell your operating system how to connect to each type of database (so you need a different driver to get to MS Access databases than to get to SQL databases).
and This set of instructions appears to contain all of the right steps for Windows and Mac and has been updated recently (Feb 2022) teamexploratoryHowImportData2022?.
I installed mdbtools
on Ubuntu and have the following entry in my /etc/odbcinst.ini
file:
[MDBTools]
Description=MDBTools Driver
Driver=libmdbodbc.so
Setup=libmdbodbc.so
FileUsage=1
UsageCount=1
Adding this entry to the file may be part of the mdbtools
installation - I certainly have no memory of doing it myself, but this may help if you’re troubleshooting, so I’ve included it.
40.1.1 Database IO: Scottish Witchcraft
For this demo, we’ll be using the Scottish Witchcraft DatabasejuliangoodareSurveyScottishWitchcraft2003?, which you can download from their website, or acquire from the course data folder if you don’t want to register with the authors. A description of the dataset is also available.
In R, we can read in MS Access files using the Hmisc
package, as long as the mdbtools library is available on your computer1.
Many databases have multiple tables with keys that connect information in each table. We’ll spend more time on databases later in the semester - for now, it’s enough to be able to get data out of one. #### Python {-}
There are several tutorials out there to access MS Access databases using packages like pyodbc
e.g. datatofishHowConnectPython2021?. I couldn’t quite get these working on Linux, but it is possible you may have better luck on another OS. With that said, the solution using pandas_access seems to be much simpler and require less OS configuration, so it’s what I’ll show here.
First, we have to install pandas_access
using pip install pandas_access
.
import pandas_access as mdb
= '../data/Witchcraftsurvey_download.mdb'
db_filename
# List tables
for tbl in mdb.list_tables(db_filename):
print(tbl)
## FileNotFoundError: [Errno 2] No such file or directory: 'mdb-tables'
# Read a small table.
= mdb.read_table(db_filename, "WDB_Trial_Person")
trials ## FileNotFoundError: [Errno 2] No such file or directory: 'mdb-schema'
This isn’t perfectly stable - I tried to read WDB_Trial
and got errors about NA values in an integer field - but it does at least work.
My hope is that you never actually need to get at data in an MS Access database - the format seems to be largely dying out.
40.1.2 SQLite
SQLite databases are contained in single files with the extension .SQLite. These files can still contain many different tables, though. They function as databases but are more portable than SQL databases that require a server instance to run and connecting over a network (or running a server on your machine locally). As a result, they provide an opportunity to demonstrate most of the skills required for working with databases without all of the configuration overhead.
Let’s try working with a sqlite file that has only one table in R:
if (!"RSQLite" %in% installed.packages()) install.packages("RSQLite")
if (!"DBI" %in% installed.packages()) install.packages("DBI")
library(RSQLite)
library(DBI)
# Download the baby names file:
download.file("http://2016.padjo.org/files/data/starterpack/ssa-babynames/ssa-babynames-for-2015.sqlite", destfile = "../data/ssa-babynames-2015.sqlite")
con <- dbConnect(RSQLite::SQLite(), "../data/ssa-babynames-2015.sqlite")
dbListTables(con) # List all the tables
## [1] "babynames"
babyname <- dbReadTable(con, "babynames")
head(babyname, 10) # show the first 10 obs
## state year name sex count rank_within_sex per_100k_within_sex
## 1 AK 2015 Olivia F 56 1 2367.9
## 2 AK 2015 Liam M 53 1 1590.6
## 3 AK 2015 Emma F 49 2 2071.9
## 4 AK 2015 Noah M 46 2 1380.6
## 5 AK 2015 Aurora F 46 3 1945.0
## 6 AK 2015 James M 45 3 1350.5
## 7 AK 2015 Amelia F 39 4 1649.0
## 8 AK 2015 Ava F 39 4 1649.0
## 9 AK 2015 William M 44 4 1320.5
## 10 AK 2015 Oliver M 41 5 1230.5
You can of course write formal queries using the DBI package, but for many databases, it’s easier to do the querying in R. We’ll cover both options later - the R version will be in the next module.
This example was created using datacarpentryAccessingSQLiteDatabases2019? as a primary reference.
If you haven’t already downloaded the database file, you can do that automatically in python using this code:
import urllib.request
"http://2016.padjo.org/files/data/starterpack/ssa-babynames/ssa-babynames-for-2015.sqlite", "../data/babynames-2015.sqlite") urllib.request.urlretrieve(
You don’t have to install the sqlite3
module in python using pip because it’s been included in base python since Python 2.5.
import pandas as pd
import sqlite3
= sqlite3.connect('../data/babynames-2015.sqlite')
con
= pd.read_sql_query("SELECT * from babynames", con)
babyname ## pandas.errors.DatabaseError: Execution failed on sql 'SELECT * from babynames': no such table: babynames
babyname## NameError: name 'babyname' is not defined
# You must close any connection you open! con.close()
A currently maintained version of the library is here and should work for UNIX platforms. It may be possible to install the library on Windows using the UNIX subsystem, per this thread↩︎