erDiagram EMPLOYEE { int ID string first string last string middle string address string phone int ssn } BACKGROUND_CHECK { int ID int EMPLOYEE_ID date date string results } TAX_INFO { int ID int EMPLOYEE_ID int year float withholding } TRAINING { int ID int TRAINING_TYPE int EMPLOYEE_ID float score } EMP_AGREE { int ID int EMPLOYEE_ID int POSITION_ID int POSITION_TYPE date START_DATE date END_DATE int benefits_class float pay_hourly_equiv float hours_wk int pay_period } ORG_REL { int ID int EMPLOYEE_ID int SUPERVISOR_ID } EMPLOYEE }|--|| EMP_AGREE : has EMPLOYEE }|--|| ORG_REL : has EMPLOYEE ||--|{ TAX_INFO : has EMPLOYEE ||--|{ TRAINING : has EMPLOYEE ||--|{ BACKGROUND_CHECK : has accTitle: Sample Company Employee-related Records accDescr { This diagram contains information about corporate records relating to a single employee, such as employee details, supervisory relationships, employment agreement details, training records, tax information, and background checks. }
34 Record-based Data and List Processing Strategies
Chapter 33 introduces how XML and HTML documents are constructed and demonstrates different techniques for scraping data from the web.
Chapter 35 introduces Application Programming Interfaces (APIs) to get data from the web in a cleaner, more efficient way. Web-based data often uses different formats, like JSON (JavaScript Object Notation), to provide data from requests in a structured way. Before we can effectively use APIs, it helps to review some basic patterns and methods for working with record based data and converting it into the rectangular data that most statistical analyses are built around.
Prerequisites
- Working knowledge of data wrangling techniques (Chapter 23, Chapter 24, Chapter 25)
- Familiarity with table joins (Chapter 27)
- Familiarity with functional programming (Chapter 29)
- Familiarity with XML file structures (Chapter 33)
This chapter will assume that you’ve used (or at least seen) techniques like split-apply-combine or map-reduce, anonymous functions, and table joins (full, left, right, inner) and similar techniques before. Here, we will focus on how these strategies apply specifically to record-based, hierarchically formatted data that is often found in XML and JSON files.
Objectives
- Differentiate between tabular and record-based data structures
- Develop strategies to transform record-based data into tabular data
- Recognize situations where multiple linked tables or nested list-columns may be required to represent the data effectively in a tabular format
- Transform data in record-based formats such as XML and JSON into single or multiple linked rectangular data tables.
- Implement data cleaning and quality control measures to ensure that data is read in and transformed correctly and important information is not lost.
34.1 Data Models
If you are reading this book, chances are you’re approaching programming from a more statistical or data-sciency point of view, rather than as a computer scientist. As a result, you probably have a general expectation that data will be laid out in a rectangular form, with rows representing observations or individuals and columns representating variables, measurements, or other dimensions of data which are applicable to each observation or individual.
This is an assumption which is much more common (at least in my experience) in statistics than in computer science more generally, though of course there are statisticians working on all sorts of different data types, including those we will discuss here.
34.1.1 Relational Data
Relational data is a particular type of data model that assumes table-based data storage. That is, when we access data in spreadsheets, CSVs, and so on, we are working with relational data. In computer science terms, a relation consists of a heading and a body.
- The heading defines a set of attributes that have a name and a data type (mathematically, a domain).
- The body is a set of tuples (a tuple is a collection of \(n\) values, and is a formal data type in Python, but not in R), where there are as many values as are defined in the heading.
This is all a very technical way of describing the composition of a Data Frame, as this book did in Section 11.6, where a DataFrame is a heterogeneous list of columns where:
- Every entry in each column must have the same data type
- Every column must have the same number of rows
34.1.2 Record-based Data Models
Before the relational data model became popular, however, there was the hierarchical data model. In the 1960s, computers began to be utilized for data storage, and this naturally led to record-based data models.
In a record-based data model, data are stored as records that are a collection of fields, where each field is a single value with an associated (usually fixed length/size) type. The fields in a record determine the record’s type.
34.1.2.1 Hierarchical Data Models
A generic entity or class can be defined as a collection of fields in a more formal object-oriented hierarchical representation. Links connect records with associated records, forming a tree.
This type of data structure is incredibly common, but it does not always (easily) reduce to tabular data. In many cases, though, it is possible to represent hierarchical data as a set of tables that relate to each other through keys.
When a company hires an employee, many different records may be generated:
- employee information (name, address, phone number, ssn)
- initial paperwork (background check, tax information, training records)
- employment agreement details (position type - permanent/contract/intern, start date, benefits, pay amount, which position the employee reports to)
In record-based data models, it can be complicated to actually do anything with the set of records. It might be relatively easy to e.g. list out all employees, but other related tasks, like determining how many people one individual is supervising, may require sifting through every ORG_REL record and could be complicated by how the records are stored (are they all text files in folders by employees?). Record-based data models, whether hierarchical or not, were originally a digital extension of physical records (think rows of file cabinets in old movies).
Read in this XML file of sample employee data and
-
Assemble a table of all of the employee information in Figure 34.1 (that is, ID, first, middle, and last name, address, phone number, and social security number).
- Can you do this using data processing functions like
map_dfr
andas_list
in R orread_xml
in pandas (you’ll have to use chained operations in R and custom arguments in python)? - Identify any employees with an invalid social security number using your tabular data representation.
- Can you do this using data processing functions like
Identify the supervisor who has the most people reporting to them, without converting the data to tabular format, and then retrieve a list of all of that person’s direct reports as employee IDs.
Identify whether there are any employees who took the same training twice, without converting the data to tabular format.
Do you prefer to work with tabular data or hierarchical data? Why?
library(xml2)
library(purrr)
library(dplyr)
library(stringr)
info <- read_xml("../data/sample_employee_data.xml")
records <- xml_find_all(info, "//*/EMPLOYEE")
df <- records |>
as_list() |>
map_dfr(~.x |>
unlist(recursive = T) |>
t() |>
as.data.frame() |>
set_names("id", "first", "last", "middle",
"address", "phone", "ssn")
) |>
mutate(valid_ssn = str_count(ssn, "\\d")==9)
head(df)
## id first last middle
## 1 2824 Michael Phillips E
## 2 1409 William Gonzalez B
## 3 5506 Donald Watts L
## 4 5012 Erica Johnson C
## 5 4657 William Townsend D
## 6 3286 Mark Fernandez J
## address phone
## 1 341 Bonilla Extensions\nLake Jacktown, VA 72058 +1-701-028-0259x700
## 2 85759 Danielle Lights\nLake Anthonymouth, SD 72346 540.044.8808x1629
## 3 68711 Janet Wall\nMcdonaldmouth, MT 77414 +1-059-051-0485
## 4 755 Brandon Mill Suite 800\nNorth Phillip, LA 12123 (680)650-9821x6860
## 5 00624 Johnson Harbor Apt. 211\nWoodwardchester, AK 03498 026-982-2613
## 6 910 Mathew Mall Suite 805\nSchwartzmouth, OK 23331 630.559.6490x275
## ssn valid_ssn
## 1 821097324 TRUE
## 2 394613566 TRUE
## 3 367179644 TRUE
## 4 330054170 TRUE
## 5 164054528 TRUE
## 6 297177301 TRUE
filter(df, !valid_ssn)
## id first last middle
## 1 7873 Susan Mason K
## 2 5552 Cameron Miller C
## 3 6635 Alexander Marsh A
## 4 5333 Kayla Parker M
## address phone
## 1 PSC 4488, Box 1248\nAPO AE 96883 152.806.9336x6388
## 2 0819 Douglas Drives Suite 857\nDicksonfort, WY 67373 001-727-389-3454x96815
## 3 94716 Karen Square\nSmithside, WA 74182 +1-126-214-2102x641
## 4 1921 David Spur\nNorth Sarah, IA 01256 549-226-0959x23219
## ssn valid_ssn
## 1 12078633 FALSE
## 2 9361578 FALSE
## 3 39197635 FALSE
## 4 69647139 FALSE
The purrr::pluck()
function is a good way to pull out the information we need, once we convert the xml file to a list structure (which is still not a tabular form).
supervisors <- xml_find_all(info, "//*/ORG_REL/*")|>
as_list() |>
map_chr(~purrr::pluck(., "SUPERVISOR_ID", 1))
employees <- xml_find_all(info, "//*/ORG_REL/*")|>
as_list() |>
map_chr(~purrr::pluck(., "EMPLOYEE_ID", 1))
supervisor_reports <- supervisors |>
table() |> sort(decreasing = T)
employees[which(supervisors == names(supervisor_reports)[1])]
## [1] "1434" "4611" "3547" "6925"
# Just for context
filter(df, id%in%employees[which(supervisors == names(supervisor_reports)[1])])
## id first last middle
## 1 1434 Elizabeth Harris B
## 2 4611 Deanna Doyle C
## 3 3547 Sherri Warner T
## 4 6925 Christopher Jones P
## address
## 1 720 James Passage\nPort Justinfort, NC 55779
## 2 328 Kelley Junctions Suite 782\nNorth Jacqueline, PA 71278
## 3 280 Allen Highway Apt. 700\nSummerview, TX 62208
## 4 26004 Kelly Rest Apt. 898\nNew Katherine, OH 96504
## phone ssn valid_ssn
## 1 +1-448-259-7632x679 531909398 TRUE
## 2 001-272-144-5412x0950 226062294 TRUE
## 3 001-011-455-7532x304 228640945 TRUE
## 4 (195)612-9014x134 533899921 TRUE
# This is how easy it is in tabular form...
training_str <- xml_find_all(info, "//*/TRAINING/TRAININ")|>
as_list() |>
map_dfr(~.x |>
unlist(recursive = T) |>
t() |>
as.data.frame() |>
set_names("id", "type", "employee_id", "score")
) |> group_by(employee_id) |>
summarize(n = n(), n_unique = length(unique(type))) |>
filter(n != n_unique)
from bs4 import BeautifulSoup
## ModuleNotFoundError: No module named 'bs4'
import pandas as pd
## ModuleNotFoundError: No module named 'pandas'
= pd.read_xml("../data/sample_employee_data.xml",
df ={"EMPLOYEE": ["id", "first", "last", "middle",
iterparse"address", "phone", "ssn"]})
## NameError: name 'pd' is not defined
df.head()## NameError: name 'df' is not defined
34.1.2.2 Network Data Models
In a hierarchical data model, each record has only one parent. This is, as it turns out, a fairly restrictive constraint, as in the real world, there can be many-to-many relationships that are not strictly hierarchical - imagine trying to represent genealogical data with the restriction that each node can have only one parent!
Another form of record-based data model is a network. This model allows many-to-many relationships between records and even reciprocal links between two or more records (a “cycle” in network terms).
In an attempt to demonstrate how complex a network data model can get, I asked ChatGPT to generate a data set of romantic relationships between Grey’s Anatomy characters over the show’s 21 seasons.
graph TD %% Nodes and Edges with Relationship Labels MeredithGrey -- romantic --> DerekShepherd MeredithGrey -- romantic --> NathanRiggs MeredithGrey -- romantic --> AndrewDeLuca MeredithGrey -- romantic --> NickMarsh MeredithGrey -- one-night --> GeorgeOMalley CristinaYang -- romantic --> OwenHunt CristinaYang -- romantic --> PrestonBurke OwenHunt -- flirtation --> EmmaMarling OwenHunt -- romantic --> AmeliaShepherd OwenHunt -- romantic --> TeddyAltman AlexKarev -- romantic --> IzzieStevens AlexKarev -- romantic --> JoWilson AlexKarev -- one-night --> LexieGrey IzzieStevens -- romantic --> GeorgeOMalley IzzieStevens -- romantic --> DennyDuquette GeorgeOMalley -- romantic --> CallieTorres GeorgeOMalley -- one-night --> MeredithGrey CallieTorres -- romantic --> EricaHahn CallieTorres -- romantic --> ArizonaRobbins CallieTorres -- one-night --> MarkSloan ArizonaRobbins -- flirtation --> CarinaDeLuca ArizonaRobbins -- one-night --> LaurenBoswell MarkSloan -- romantic --> LexieGrey MarkSloan -- one-night --> CallieTorres MarkSloan -- flirtation --> AddisonMontgomery MarkSloan -- flirtation --> TeddyAltman LexieGrey -- romantic --> JacksonAvery JacksonAvery -- romantic --> AprilKepner JacksonAvery -- romantic --> MaggiePierce AprilKepner -- romantic --> MatthewTaylor JoWilson -- romantic --> JasonMyers AndrewDeLuca -- romantic --> MaggiePierce AmeliaShepherd -- romantic --> AtticusLincoln MaggiePierce -- romantic --> WinstonNdugu TeddyAltman -- romantic --> TomKoracick LeviSchmitt -- romantic --> NicoKim MirandaBailey -- romantic --> BenWarren RichardWebber -- romantic --> CatherineFox RichardWebber -- romantic --> EllisGrey EllisGrey -- romantic --> RichardWebber CatherineFox -- romantic --> RichardWebber TomKoracick -- romantic --> TeddyAltman AtticusLincoln -- romantic --> AmeliaShepherd NathanRiggs -- romantic --> MeganHunt