34  Record-based Data and List Processing Strategies

Published

June 18, 2025

Chapter 33 introduces how XML and HTML documents are constructed and demonstrates different techniques for scraping data from the web.
Chapter 35 introduces Application Programming Interfaces (APIs) to get data from the web in a cleaner, more efficient way. Web-based data often uses different formats, like JSON (JavaScript Object Notation), to provide data from requests in a structured way. Before we can effectively use APIs, it helps to review some basic patterns and methods for working with record based data and converting it into the rectangular data that most statistical analyses are built around.

Prerequisites

This chapter will assume that you’ve used (or at least seen) techniques like split-apply-combine or map-reduce, anonymous functions, and table joins (full, left, right, inner) and similar techniques before. Here, we will focus on how these strategies apply specifically to record-based, hierarchically formatted data that is often found in XML and JSON files.

Objectives

  • Differentiate between tabular and record-based data structures
    • Develop strategies to transform record-based data into tabular data
    • Recognize situations where multiple linked tables or nested list-columns may be required to represent the data effectively in a tabular format
  • Transform data in record-based formats such as XML and JSON into single or multiple linked rectangular data tables.
  • Implement data cleaning and quality control measures to ensure that data is read in and transformed correctly and important information is not lost.

34.1 Data Models

If you are reading this book, chances are you’re approaching programming from a more statistical or data-sciency point of view, rather than as a computer scientist. As a result, you probably have a general expectation that data will be laid out in a rectangular form, with rows representing observations or individuals and columns representating variables, measurements, or other dimensions of data which are applicable to each observation or individual.

This is an assumption which is much more common (at least in my experience) in statistics than in computer science more generally, though of course there are statisticians working on all sorts of different data types, including those we will discuss here.

34.1.1 Relational Data

Relational data is a particular type of data model that assumes table-based data storage. That is, when we access data in spreadsheets, CSVs, and so on, we are working with relational data. In computer science terms, a relation consists of a heading and a body.

  • The heading defines a set of attributes that have a name and a data type (mathematically, a domain).
  • The body is a set of tuples (a tuple is a collection of \(n\) values, and is a formal data type in Python, but not in R), where there are as many values as are defined in the heading.

This is all a very technical way of describing the composition of a Data Frame, as this book did in Section 11.6, where a DataFrame is a heterogeneous list of columns where:

  • Every entry in each column must have the same data type
  • Every column must have the same number of rows

34.1.2 Record-based Data Models

Before the relational data model became popular, however, there was the hierarchical data model. In the 1960s, computers began to be utilized for data storage, and this naturally led to record-based data models.

In a record-based data model, data are stored as records that are a collection of fields, where each field is a single value with an associated (usually fixed length/size) type. The fields in a record determine the record’s type.

34.1.2.1 Hierarchical Data Models

A generic entity or class can be defined as a collection of fields in a more formal object-oriented hierarchical representation. Links connect records with associated records, forming a tree.

This type of data structure is incredibly common, but it does not always (easily) reduce to tabular data. In many cases, though, it is possible to represent hierarchical data as a set of tables that relate to each other through keys.

Demo: Hierarchical Employee Data

When a company hires an employee, many different records may be generated:

  • employee information (name, address, phone number, ssn)
  • initial paperwork (background check, tax information, training records)
  • employment agreement details (position type - permanent/contract/intern, start date, benefits, pay amount, which position the employee reports to)
erDiagram
  EMPLOYEE {
    int ID
    string first
    string last
    string middle
    string address
    string phone
    int ssn
  }
  BACKGROUND_CHECK {
    int ID
    int EMPLOYEE_ID
    date date
    string results
  }
  TAX_INFO {
    int ID
    int EMPLOYEE_ID
    int year
    float withholding
  }
  TRAINING {
    int ID
    int TRAINING_TYPE
    int EMPLOYEE_ID
    float score
  }
  EMP_AGREE {
    int ID
    int EMPLOYEE_ID
    int POSITION_ID
    int POSITION_TYPE
    date START_DATE
    date END_DATE
    int benefits_class
    float pay_hourly_equiv
    float hours_wk
    int pay_period
  }
  ORG_REL {
    int ID
    int EMPLOYEE_ID
    int SUPERVISOR_ID
  }
  EMPLOYEE }|--|| EMP_AGREE : has
  EMPLOYEE }|--|| ORG_REL : has
  EMPLOYEE ||--|{ TAX_INFO : has
  EMPLOYEE ||--|{ TRAINING : has
  EMPLOYEE ||--|{ BACKGROUND_CHECK : has


  accTitle: Sample Company Employee-related Records
  accDescr {
    This diagram contains information about corporate records relating to a single employee, such as employee details, supervisory relationships, employment agreement details, training records, tax information, and background checks.
  }
Figure 34.1: This entity-relationship diagram contains information about corporate records relating to a single employee, such as employee details, supervisory relationships, employment agreement details, training records, tax information, and background checks. One

In record-based data models, it can be complicated to actually do anything with the set of records. It might be relatively easy to e.g. list out all employees, but other related tasks, like determining how many people one individual is supervising, may require sifting through every ORG_REL record and could be complicated by how the records are stored (are they all text files in folders by employees?). Record-based data models, whether hierarchical or not, were originally a digital extension of physical records (think rows of file cabinets in old movies).

Example: Hierarchical Employee Data

Read in this XML file of sample employee data and

  1. Assemble a table of all of the employee information in Figure 34.1 (that is, ID, first, middle, and last name, address, phone number, and social security number).

    • Can you do this using data processing functions like map_dfr and as_list in R or read_xml in pandas (you’ll have to use chained operations in R and custom arguments in python)?
    • Identify any employees with an invalid social security number using your tabular data representation.
  2. Identify the supervisor who has the most people reporting to them, without converting the data to tabular format, and then retrieve a list of all of that person’s direct reports as employee IDs.

  3. Identify whether there are any employees who took the same training twice, without converting the data to tabular format.

Do you prefer to work with tabular data or hierarchical data? Why?

library(xml2)
library(purrr)
library(dplyr)
library(stringr)

info <- read_xml("../data/sample_employee_data.xml")
records <- xml_find_all(info, "//*/EMPLOYEE")
df <- records |> 
  as_list() |>
  map_dfr(~.x |> 
            unlist(recursive = T) |> 
            t() |> 
            as.data.frame() |> 
            set_names("id", "first", "last", "middle", 
                      "address", "phone", "ssn")
  ) |>
  mutate(valid_ssn = str_count(ssn, "\\d")==9)
head(df)
##     id   first      last middle
## 1 2824 Michael  Phillips      E
## 2 1409 William  Gonzalez      B
## 3 5506  Donald     Watts      L
## 4 5012   Erica   Johnson      C
## 5 4657 William  Townsend      D
## 6 3286    Mark Fernandez      J
##                                                    address               phone
## 1          341 Bonilla Extensions\nLake Jacktown, VA 72058 +1-701-028-0259x700
## 2       85759 Danielle Lights\nLake Anthonymouth, SD 72346   540.044.8808x1629
## 3                68711 Janet Wall\nMcdonaldmouth, MT 77414     +1-059-051-0485
## 4      755 Brandon Mill Suite 800\nNorth Phillip, LA 12123  (680)650-9821x6860
## 5 00624 Johnson Harbor Apt. 211\nWoodwardchester, AK 03498        026-982-2613
## 6       910 Mathew Mall Suite 805\nSchwartzmouth, OK 23331    630.559.6490x275
##         ssn valid_ssn
## 1 821097324      TRUE
## 2 394613566      TRUE
## 3 367179644      TRUE
## 4 330054170      TRUE
## 5 164054528      TRUE
## 6 297177301      TRUE
filter(df, !valid_ssn)
##     id     first   last middle
## 1 7873     Susan  Mason      K
## 2 5552   Cameron Miller      C
## 3 6635 Alexander  Marsh      A
## 4 5333     Kayla Parker      M
##                                                address                  phone
## 1                     PSC 4488, Box 1248\nAPO AE 96883      152.806.9336x6388
## 2 0819 Douglas Drives Suite 857\nDicksonfort, WY 67373 001-727-389-3454x96815
## 3              94716 Karen Square\nSmithside, WA 74182    +1-126-214-2102x641
## 4               1921 David Spur\nNorth Sarah, IA 01256     549-226-0959x23219
##        ssn valid_ssn
## 1 12078633     FALSE
## 2  9361578     FALSE
## 3 39197635     FALSE
## 4 69647139     FALSE

The purrr::pluck() function is a good way to pull out the information we need, once we convert the xml file to a list structure (which is still not a tabular form).

supervisors <- xml_find_all(info, "//*/ORG_REL/*")|> 
  as_list() |>
  map_chr(~purrr::pluck(., "SUPERVISOR_ID", 1)) 
employees <- xml_find_all(info, "//*/ORG_REL/*")|> 
  as_list() |>
  map_chr(~purrr::pluck(., "EMPLOYEE_ID", 1)) 

supervisor_reports <- supervisors |>
  table() |> sort(decreasing = T)

employees[which(supervisors == names(supervisor_reports)[1])]
## [1] "1434" "4611" "3547" "6925"

# Just for context
filter(df, id%in%employees[which(supervisors == names(supervisor_reports)[1])])
##     id       first   last middle
## 1 1434   Elizabeth Harris      B
## 2 4611      Deanna  Doyle      C
## 3 3547      Sherri Warner      T
## 4 6925 Christopher  Jones      P
##                                                      address
## 1               720 James Passage\nPort Justinfort, NC 55779
## 2 328 Kelley Junctions Suite 782\nNorth Jacqueline, PA 71278
## 3           280 Allen Highway Apt. 700\nSummerview, TX 62208
## 4         26004 Kelly Rest Apt. 898\nNew Katherine, OH 96504
##                   phone       ssn valid_ssn
## 1   +1-448-259-7632x679 531909398      TRUE
## 2 001-272-144-5412x0950 226062294      TRUE
## 3  001-011-455-7532x304 228640945      TRUE
## 4     (195)612-9014x134 533899921      TRUE
# This is how easy it is in tabular form... 
training_str <- xml_find_all(info, "//*/TRAINING/TRAININ")|> 
  as_list() |>
  map_dfr(~.x |> 
            unlist(recursive = T) |> 
            t() |> 
            as.data.frame() |> 
            set_names("id", "type", "employee_id", "score")
  ) |> group_by(employee_id) |>
  summarize(n = n(), n_unique = length(unique(type))) |>
  filter(n != n_unique)
from bs4 import BeautifulSoup
## ModuleNotFoundError: No module named 'bs4'
import pandas as pd
## ModuleNotFoundError: No module named 'pandas'

df = pd.read_xml("../data/sample_employee_data.xml", 
                 iterparse={"EMPLOYEE": ["id", "first", "last", "middle", 
                                         "address", "phone", "ssn"]})
## NameError: name 'pd' is not defined
df.head()
## NameError: name 'df' is not defined

34.1.2.2 Network Data Models

In a hierarchical data model, each record has only one parent. This is, as it turns out, a fairly restrictive constraint, as in the real world, there can be many-to-many relationships that are not strictly hierarchical - imagine trying to represent genealogical data with the restriction that each node can have only one parent!

Another form of record-based data model is a network. This model allows many-to-many relationships between records and even reciprocal links between two or more records (a “cycle” in network terms).

Demo

In an attempt to demonstrate how complex a network data model can get, I asked ChatGPT to generate a data set of romantic relationships between Grey’s Anatomy characters over the show’s 21 seasons.

graph TD

%% Nodes and Edges with Relationship Labels

MeredithGrey -- romantic --> DerekShepherd
MeredithGrey -- romantic --> NathanRiggs
MeredithGrey -- romantic --> AndrewDeLuca
MeredithGrey -- romantic --> NickMarsh
MeredithGrey -- one-night --> GeorgeOMalley

CristinaYang -- romantic --> OwenHunt
CristinaYang -- romantic --> PrestonBurke

OwenHunt -- flirtation --> EmmaMarling
OwenHunt -- romantic --> AmeliaShepherd
OwenHunt -- romantic --> TeddyAltman

AlexKarev -- romantic --> IzzieStevens
AlexKarev -- romantic --> JoWilson
AlexKarev -- one-night --> LexieGrey

IzzieStevens -- romantic --> GeorgeOMalley
IzzieStevens -- romantic --> DennyDuquette

GeorgeOMalley -- romantic --> CallieTorres
GeorgeOMalley -- one-night --> MeredithGrey

CallieTorres -- romantic --> EricaHahn
CallieTorres -- romantic --> ArizonaRobbins
CallieTorres -- one-night --> MarkSloan

ArizonaRobbins -- flirtation --> CarinaDeLuca
ArizonaRobbins -- one-night --> LaurenBoswell

MarkSloan -- romantic --> LexieGrey
MarkSloan -- one-night --> CallieTorres
MarkSloan -- flirtation --> AddisonMontgomery
MarkSloan -- flirtation --> TeddyAltman

LexieGrey -- romantic --> JacksonAvery

JacksonAvery -- romantic --> AprilKepner
JacksonAvery -- romantic --> MaggiePierce

AprilKepner -- romantic --> MatthewTaylor

JoWilson -- romantic --> JasonMyers

AndrewDeLuca -- romantic --> MaggiePierce

AmeliaShepherd -- romantic --> AtticusLincoln

MaggiePierce -- romantic --> WinstonNdugu

TeddyAltman -- romantic --> TomKoracick

LeviSchmitt -- romantic --> NicoKim

MirandaBailey -- romantic --> BenWarren

RichardWebber -- romantic --> CatherineFox
RichardWebber -- romantic --> EllisGrey

EllisGrey -- romantic --> RichardWebber

CatherineFox -- romantic --> RichardWebber

TomKoracick -- romantic --> TeddyAltman

AtticusLincoln -- romantic --> AmeliaShepherd

NathanRiggs -- romantic --> MeganHunt
Figure 34.2