36  Working with PDFs

Published

July 28, 2025

When I started my first job out of graduate school, one particular process used by my coworkers completely mystified me: they would print out a document and then immediately scan it back in, with the scan emailed to themselves. This seemed like a waste of toner, paper, and time to me – why would anyone do such a thing? Eventually, I found out that the printer would automatically recognize the text and add a text layer to a PDF that previously couldn’t be highlighted. So, my coworkers had found a handy workaround to manually typing out the numbers they needed from older PDF documents! As clever as this was, it was also unnecessary: a lot of paper, toner, and time could have been saved if the company had just provided Optical Character Recognition (OCR) software to its workers.

In this chapter, you’ll learn about PDF document structure, as well as how to use OCR programs.

The image is a comic strip depicting a person sitting at a news desk. The person has simplistic features with a round head, short hair, and no facial features drawn. A speech bubble above them says "According to a new PDF..." Behind them is a sign reading "Breaking News." Crossed out phrases above them include "According to a new preprint...", "...an unpublished study...", and "According to a new paper uploaded to a preprint server, but which has not undergone peer review...". Below the main image, there's a section titled "Benefits of just saying 'a PDF':" followed by three bullet points. • AVOIDS IMPLICATIONS ABOUT PUBLICATION STATUS • IMMEDIATELY RAISES QUESTIONS ABOUT AUTHOR(S) • STILL IMPLIES "THIS DOCUMENT WAS PROBABLY PREPARED BY A PROFESSIONAL, BECAUSE NO NORMAL HUMAN TRYING TO COMMUNICATE IN 2020 WOULD CHOOSE THIS RIDICULOUS FORMAT".

DOWNSIDES: Adobe people may periodically email your newsroom to ask you to call it an ‘Adobe® PDF document,’ but they’ll reverse course once they learn how sarcastically you can pronounce the registered trademark symbol.
CC BY-NC 2.5 by Randall Munroe. source

Objectives

  • Identify the type of PDF and the data it contains.
  • Develop a strategy to extract the data from a PDF programmatically, including steps to improve the success of Optical Character Recognition (OCR) if necessary.
  • Augment the PDF files with OCR to add a text layer, if necessary, before extracting information.
  • Extract information from PDFs programmatically and format the information appropriately.
  • Implement quality control and data cleaning measures that handle the most common OCR errors gracefully.

36.1 Introduction

Over the objections of open data organizations, data archivists [1], and programmers [2], [3], companies and government agencies frequently use PDF (portable document format) documents to store and release data. Election results (Oregon, 2024), property appraisals (Lancaster County, Nebraska), public health reports (Centers for Disease Control Morbidity and Mortality Weekly Reports), and more are locked up in PDF format instead of being stored properly in nicely formatted CSVs, spreadsheets, or databases [4]. Even though we object to the storage mechanism, learning how to deal with data stored in PDF format is a valuable skill for the aspiring data scientist. Even if you never work with PDF data in a professional capacity (and I hope you’re that lucky), these skills are very useful for public data side projects.

36.1.1 PDF File Format

The PDF file format was created by Adobe in 1993 and remained proprietary until 2008, when it became an open standard under the control of the International Organization for Standardization (ISO) [5]. As the acronym suggests, the Portable Document Format is intended to be readable on any computer. This was something of a novel idea in the 1990s, when Mac and Windows users often used different document creation software, and a file created on one system frequently could not be opened faithfully (or at all) on the other.

The technical details of a PDF file are complex [6], [7]. However, conceptually, there are four required components to a PDF document, as shown in Figure 36.1:

  • Header
    • PDF version number
    • an arbitrary sequence of binary data that signals the file is binary, preventing applications from treating the document as a text file (which could corrupt it)
  • Body (relationships between body components shown in Figure 36.2)
    • Page tree - serves as the root of the document, and may be as simple as a list of pages.
    • Pages - each page is defined independently and contains its own metadata, links to resources, and content (defined separately).
    • Resources - objects required to render a page, such as fonts.
    • Content - text and graphics which appear visually on the page.
    • Catalog - an indication to programs as to where to start reading the document. Often this is just a link to the root page tree.
  • Cross-reference table - Records the location in the file of each object in the body of the file so that when viewing a page, only objects from that page are loaded into memory.
  • Trailer - tells applications how to read the file
    • A reference to the catalog which links to the document root
    • Location of the cross-reference table
    • Size of the cross-reference table
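This trailer structure is easy to see in the raw bytes: a PDF ends with the startxref keyword, the byte offset of the cross-reference table, and the %%EOF marker. Here’s a minimal sketch in pure python that recovers that offset (no PDF library; find_startxref is a throwaway helper I’m defining here, and the toy trailer below is not a complete PDF):

```python
def find_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset of the cross-reference table.

    Per the PDF file structure, the document ends with:
        startxref
        <byte offset of the xref table>
        %%EOF
    so we only need to scan the last kilobyte or so of the file.
    """
    tail = pdf_bytes[-1024:]
    idx = tail.rfind(b"startxref")
    if idx == -1:
        raise ValueError("no startxref keyword found; is this a PDF?")
    # The token immediately after "startxref" is the offset.
    return int(tail[idx + len(b"startxref"):].split()[0].decode("ascii"))

# A toy file tail (not a complete PDF) just to exercise the function:
toy = b"%PDF-1.4\n...objects and xref table...\nstartxref\n1234\n%%EOF\n"
find_startxref(toy)  # 1234
```

Real readers do exactly this: they jump to the end of the file, find the cross-reference table via this offset, and then load only the objects needed for the requested page.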

Reference [8] has some good examples showing rendered pages alongside the PDF document code that creates them.

Within the page, streams are often used to define the page’s appearance (the other option is lattices, which can be used to divide the page up into sections). To add text, commands are issued to define the font, position the text cursor, and place the text onto the page. Text is positioned from the bottom left corner, with \(Y\) defining the vertical and \(X\) defining the horizontal location. Line breaks and other formatting features are not part of the PDF format – those operations are performed by another program before the file is saved as PDF. As a result, text commands in a PDF can be fragmented: a continuous paragraph of words may be written in the PDF file as separate commands, with other page elements (figure captions, page numbers, images) interleaved between them. In addition, PDF documents allow the kerning of text (the space between letters) to be adjusted in ways that can make it difficult to separate the characters visually. A common example is the character sequence ff or fi, which is often stored as a single ligature glyph (ﬀ, ﬁ) and can be read in interesting ways by OCR: sometimes as the Unicode ligature itself, sometimes as Cyrillic characters, and sometimes the characters are left out entirely or misplaced.
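To make this concrete, here is a simplified content-stream fragment (the operators come from the PDF specification; the font name /F1 and the coordinates are arbitrary choices for illustration). Notice that what reads as one line on the page is emitted as two separate show-text commands:

```
BT                     % begin a text object
/F1 12 Tf              % select font resource /F1 at 12 points
72 708 Td              % position the text cursor: x = 72, y = 708 (from bottom left)
(Working with ) Tj     % show some text
(PDFs) Tj              % ...continued by a second, independent command
ET                     % end the text object
```

A text-extraction tool sees only these fragments and their coordinates; it has to guess how they join up into words, lines, and paragraphs.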

Demo: PDF Fun

Consider a Home appraisal record from Lancaster County, NE. Opening the card in a PDF reader and selecting all the text yields a disorganized text file (and this PDF was actually created using modern methods and is relatively clean!)

A few observations, marked up in Figure 36.3:

  • The title on page 1 is on line 39 of the file, and it appears that the data from the first column is on lines 1-38.
  • When there are multi-column tables, as in the “Inspection History” table in the middle of the first page, the values are listed by column, but missing values (the times of inspection) are not indicated at all!
  • The Appraised Values table has columns Land, Building, Total, and Method. Line 142 of the text file shows an entry for “Total Method”, and it is clear that the text for the two columns has been combined.

Now, perhaps we could write a script that would disentangle some of this information and format it properly, though I think the missing values would still be unrecoverable without someone visually mapping the data to the corresponding lines.

The arrangement of the text you get from selecting all text is different in different PDF applications – I tried it with Okular and Firefox, and got totally different orders of text boxes.

The scope of the actual problem only becomes visible when you look at a second PDF document and the corresponding text file. Figure 36.4 shows the two PDF files and their corresponding text files, with the comparable portion of each PDF and text file highlighted.

Hopefully you’re beginning to understand how challenging this whole extracting-data-from-PDFs thing can be! What we would really want to do here is detect the column boundaries somehow, and then read the data in from each table column-wise – this would be easier than postprocessing the jumbled text, and the \(x\) and \(y\) coordinates would also help us determine which data correspond to the same rows. Hold on to that thought – we’ll come back to it.
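As a preview of that idea: many extraction libraries (including pdfplumber, which appears later in this chapter) can report each word along with its coordinates, and we can cluster words into visual rows by their vertical position. This is only a sketch – group_rows and its tol tolerance are helpers I’m making up, and the word dictionaries below just mimic the shape of pdfplumber’s extract_words() output:

```python
def group_rows(words, tol=3.0):
    """Cluster word boxes into visual rows by their 'top' coordinate.

    words: list of dicts with 'text', 'x0' (left edge), and 'top'
    (distance from the top of the page) -- the shape pdfplumber's
    extract_words() emits. Words whose tops land in the same tol-sized
    vertical band are treated as one row.
    """
    rows = {}
    for w in words:
        band = int(w["top"] // tol)  # quantize the vertical position
        rows.setdefault(band, []).append(w)
    # Order words left-to-right within each row, rows top-to-bottom.
    return [
        [w["text"] for w in sorted(rows[b], key=lambda w: w["x0"])]
        for b in sorted(rows)
    ]

words = [  # a fake fragment of an appraisal-card table
    {"text": "Land",     "x0": 10, "top": 100.2},
    {"text": "$45,000",  "x0": 10, "top": 112.0},
    {"text": "Building", "x0": 80, "top": 100.8},
    {"text": "$124,700", "x0": 80, "top": 111.6},
]
group_rows(words)
# [['Land', 'Building'], ['$45,000', '$124,700']]
```

With row membership settled, the x coordinates can then be used to assign each word to a column – which is essentially what table-extraction tools do under the hood.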

36.2 Types of PDF Files

36.2.1 Layers

Practically, we can think of PDFs as having a text layer, an image layer, or both (hybrid PDFs). A PDF with a text layer will allow you to select embedded text and copy it into a text file, while a PDF that just has an image layer does not. It is also possible to have a PDF that has an image layer with a corresponding text layer on top. Optical Character Recognition takes a PDF with only image layers and creates a text file (or layer, depending on the tool) by identifying the characters in the document and converting those characters to text with a corresponding \((x,y)\) location in the document. Different OCR programs use different conventions for this process, and the quality of the image matters a lot as well - some images are just not good enough to produce a passable transcription of the text using automatic methods.

Thus, if we want to think about classifying PDF files by type, we might come up with the following groups:

  • A PDF file that has an image layer but no text layer is sometimes called a “raster” PDF, because an image that’s made up of pixels is a raster image.
  • A PDF file that has a text layer but no image layer is, unsurprisingly, called a text PDF.
  • A file that has both text and image layers is a hybrid PDF.

How we ingest data from PDF files depends heavily on the type of PDF we have.
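A rough programmatic version of this classification might look like the following sketch. The classify_page helper is mine, not a library function – in practice you would feed it values obtained from a tool such as pdfplumber (page.extract_text() and page.images):

```python
def classify_page(extracted_text, n_images):
    """Rough page-level classification from two observable facts:
    whether text extraction returned anything, and whether the page
    carries embedded images."""
    has_text = bool(extracted_text and extracted_text.strip())
    if has_text and n_images > 0:
        return "hybrid"
    if has_text:
        return "text"
    return "image" if n_images > 0 else "empty"

classify_page("LANCASTER COUNTY APPRAISAL CARD", 0)  # 'text'
classify_page("", 1)                                 # 'image'
classify_page("Parcel ID: ...", 3)                   # 'hybrid'
```

If classify_page says "image", you know OCR will be needed before any text can be extracted; "text" or "hybrid" pages can go straight to a text-extraction library.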

Demo: Types of PDF Files

For this demo, I’ve converted the first page of one of the Lancaster County, NE property appraisal PDFs into a text-based version and an image-based version.

Open these up in your favorite PDF editor and try to highlight the text in each. How does it work?

We can also examine the format of a PDF file using R and python libraries.

You’ll need the pdftools package, which you can install with install.packages("pdftools"). This may require you to install libpoppler on Linux, but versions for other operating systems should be self-contained.

The pdf_info function gives us information from the PDF header, and the pdf_text function tries to extract the text, if it exists.

Text-Based PDF
library(pdftools)
library(stringr)

pdf_info("../data/Lancaster-County-NE-Real-Estate-AppraisalCard-28749-2025-147568-pg1-text.pdf")$version
pdf_text("../data/Lancaster-County-NE-Real-Estate-AppraisalCard-28749-2025-147568-pg1-text.pdf") |>
  str_split("\n") |>
  unlist() |>
  head()
## [1] "1.3"
## [1] "                                                                   LANCASTER COUNTY APPRAISAL CARD"                                                                                                        
## [2] "     Parcel ID: 10-24-201-025-000                                         Tax Year: 2025                                                Run Date: 7/15/2025 12:23:27 PM                  Page       1 of 2"
## [3] "    OWNER NAME AND MAILING ADDRESS                                                                                 SALES INFORMATION"                                                                      
## [4] "EASTDALE RENTALS LLC                           Date                Type            Sale Amount          Validity              Multi           Inst.Type                            Instrument #"           
## [5] "Attn: JEFF & ANITA EASTMAN                     05/20/2022          Improved                   $0        Disqualified                          Warranty Deed                        2022025796"             
## [6] "2501 S 74 ST                                   04/23/1996          Improved              $37,000        Disqualified                          Warranty Deed                        1996016959"
1. Split the text into lines.
2. Remove the list structure – make a vector.
3. Show the first few lines.
Image-Based PDF
pdf_info("../data/Lancaster-County-NE-Real-Estate-AppraisalCard-28749-2025-147568-pg1-image.pdf")$version
## [1] "1.5"
pdf_text("../data/Lancaster-County-NE-Real-Estate-AppraisalCard-28749-2025-147568-pg1-image.pdf") 
## [1] ""

You will need the pdfplumber package [9], which you can install with pip.

The pdf object contains metadata and pages, and page text (if it exists) can be accessed with the pdf.pages[i].extract_text() method.

Text-Based PDF
import pdfplumber
with pdfplumber.open("../data/Lancaster-County-NE-Real-Estate-AppraisalCard-28749-2025-147568-pg1-text.pdf") as pdf:
  metadata = pdf.metadata
  first_page =  pdf.pages[0]
  text = first_page.extract_text()
  pdf.close()

metadata
text.split("\n")[0:5]
## {'CreationDate': "D:20250715122327-05'00'", 'Producer': 'iText# by Gerald Henson (r0.95 of lowagie.com, based on version Paulo build 103)'}
## ['LANCASTER COUNTY APPRAISAL CARD', 'Parcel ID:10-24-201-025-000 Tax Year: 2025 Run Date: 7/15/2025 12:23:27 PM Page 1 of 2', 'OWNER NAME AND MAILING ADDRESS SALES INFORMATION', 'EASTDALE RENTALS LLC Date Type Sale Amount Validity Multi Inst.Type Instrument #', 'Attn: JEFF & ANITA EASTMAN 05/20/2022 Improved $0 Disqualified Warranty Deed 2022025796']
1. Use the pdfplumber library.
2. Open the file and call it pdf.
3. Get the metadata.
4. Get the data for the first page.
5. Extract the text from the first page.
6. Close the file (important to release memory).
7. Split the text by \n and show the first few lines.
Image-Based PDF
with pdfplumber.open("../data/Lancaster-County-NE-Real-Estate-AppraisalCard-28749-2025-147568-pg1-image.pdf") as pdf:
  metadata = pdf.metadata
  first_page =  pdf.pages[0]
  text = first_page.extract_text()
  pdf.close()

metadata
## {'Producer': 'cairo 1.16.0 (https://cairographics.org)', 'CreationDate': "D:20250715133207-05'00"}

text.split("\n")[0:5]
## ['']

Notice that the text does not exist for the image-based PDF. I created the image version of the PDF by opening the PDF in an image editor and saving the resulting file as PDF within that image editor, so it’s not surprising that the text layer is not present in that version of the file.

36.2.2 The Trouble with Text Layers

Data scientists are often interested in data from tables. Unfortunately the pdf format is pretty dumb and does not have notion of a table (unlike for example HTML). Tabular data in a pdf file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data… [10]

The quote above gives you some idea of the challenge of extracting text from a PDF, but it’s likely you’ve already come across this challenge when trying to copy text out of a PDF and into some other program for editing. Incidentally, the PDF representation of text is also a reason why some groups are shifting away from the format entirely – it is difficult to make PDF documents accessible to screen readers, because the text isn’t inherently ordered in any way.1

Text extraction from PDFs can be annoying, but extracting structured text in tables is an even harder challenge. Hopefully, the next few demonstrations will help you appreciate why it’s so challenging, as well as introduce you to the tools you may need to convert image layers to text layers.

36.2.3 Converting Images to Text Using Optical Character Recognition

In order to convert image layers to text layers, we need to use Optical Character Recognition (OCR).

Most free OCR tools are based on the tesseract library [11], which you can access using pytesseract in python [12] or the tesseract R package [13]. You can also run tesseract from the command line if you install the library for your operating system and language.

Demo: Optical Character Recognition

For the sake of shorter commands, let’s assume I’m working with a 1-page PDF file named file.pdf and want to create file.txt which contains the text of the image-based PDF.

pdftoppm -png ../data/file.pdf file
tesseract -l eng ../data/file-1.png ../data/file-1-bash
1. Convert the PDF to PNG to work with Tesseract. If the PDF file has more than one page, this will create file-1.png through file-n.png images that can be fed into tesseract using a bash for loop.
2. Extract the text to a text file.
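That bash for loop might look something like the sketch below. To keep it runnable anywhere, touch stands in for the images pdftoppm would produce, and echo prints each tesseract command instead of executing it – drop the echo (and the touch setup) to actually run OCR:

```shell
# Work in a scratch directory and pretend pdftoppm already
# produced one PNG per page of the document:
cd "$(mktemp -d)"
touch file-1.png file-2.png

# OCR each page image. ${img%.png} strips the extension, since
# tesseract appends .txt to the output name itself.
for img in file-*.png; do
  echo tesseract -l eng "$img" "${img%.png}"
done
```

Each pass through the loop would produce one text file per page (file-1.txt, file-2.txt, …), which you can then concatenate or process page by page.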
Text

Link to the text file

## Parcel ID: 10-24-201-025-000
## 
## EASTDALE RENTALS LLC
## Attn: JEFF & ANITA EASTMAN,
## 2501 S 74 ST
## 
## LINCOLN, NE 68506
## 
## Additional Owners
## No.
## 
## 2250 SHELDON ST
## LINCOLN, NE 68503
## 
## MAGEE AOE NOON
## 
## Prop Class: Residential Improved
## 
## Primary Use: Conversion-Apt
## 
## Living Units: 2
## 
## Zonina: R4-Residential District
## 
## Nbhd: 8NC01 - North Central -
## CVDU
## 
## Tax Unit Grp: 0001
## 
## Schl Code Base: 55-0001 Lincoln
## 
## Exemptions:
## 
## Flaas:
## 
## | PROPERTY FACTORS
## 
## GBA: 0
## 
## NRA:
## 
## Location:
## 
## Parkina Type:
## Parking Quantitv:
## 
## ENGLESIDE ADDITION, BLOCK 2, Lot 23
## 
## LANCASTER COUNTY APPRAISAL CARD
## 
## Tax Year: 2025 Run Date: 7/15/2025 12:23:27 PM Page 1 of 2
## 
## Date Type Sale Amount Validity Multi Inst.Type Instrument #
## 05/20/2022 Improved $0 Disqualified Warranty Deed 2022025796
## 04/23/1996 Improved $37,000 Disqualified Warranty Deed 1996016959
## 09/08/1994 Improved $0 Disaualified Death Certificate 1994045456
## 
## Number Issue Date Amount Status Type Description
## 
## Date Time Process Reason Appraiser Contact-Code
## 
## 07/11/2022 Interview and Measure - 01 General Review MRC Tenant - 2
## 
## 11/20/2015 11:45AM —_—No Answer At Door, Exterior - 04 General Review afo
## 
## 07/15/2010 Field Review - 08 Final Review CAB
## 
## 0504/2010 No Answer At Door, Measured - 05 General Review TMJ
## 
## 09/17/2008 Field Review - 08 Final Review CAB
## 
## Year Level Case # Status Action Year Land Building Total
## 2025 $45,000 $124,700 $169,700
## 2024 $45,000 $124,700 $169,700
## 2023 $45,000 $115,700 $160,700
## 2022 $25,000 $69,400 $94,400
## 2021 $25,000 $69,400 $94,400
## 
## Land Building Total Method
## 
## Current $45,000 $124,700 $169,700 IDXVAL
## Prior $45,000 $124,700 $169,700 IDXVAL
## Cost $158,280 Market $332,300 GRM $169,700
## Income $0 MRA $160,100 Ovr
## 
## Method Type AC/SF (Units Inft Factt Inf2 Fact2 InflC FactC Avg Unit Val Land Value
## 
## Site RPI-Primary Interior 45,000 45,000
## 
## Total Acres 0.15 GIS SF 6402 Mkt Land Total $45,000
## 
## Taxable Aq Land Total $0

After OCR, you can search for text in the text output, but it doesn’t provide all of the features that a hybrid PDF with both an image and a text layer provides – the text isn’t associated with the \((x,y)\) location on the page(s). Also, note that the OCR isn’t perfect: because the word “Flags:” under Exemptions in the left column is partially obscured, it is transcribed as “Flaas:”, which isn’t an English word. Using OCR can introduce errors into the data in ways that are not necessarily predictable (though predictable issues include confusing capital O for 0, lowercase l for 1, and so on). However, OCR is miles better than doing things manually!
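Some of those predictable substitutions can be repaired mechanically after the fact. The following is only a sketch – clean_numeric and its substitution table are mine, not a standard tool, and real post-processing rules should be tuned to your documents (for instance, these rules would happily “correct” an OCR mis-join like 2501S to 25015):

```python
# Common digit look-alikes that OCR produces inside numeric fields.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

def clean_numeric(field: str) -> str:
    """Repair digit look-alikes, but only in tokens that look numeric,
    so ordinary words pass through untouched."""
    def fix(tok: str) -> str:
        core = tok.strip("$,.%:")
        looks_numeric = any(c.isdigit() for c in core) and all(
            c.isdigit() or c in "OolISB,.$" for c in core
        )
        return tok.translate(OCR_DIGIT_FIXES) if looks_numeric else tok
    return " ".join(fix(t) for t in field.split())

clean_numeric("Cost $1O5,28O Market $332,3OO")
# 'Cost $105,280 Market $332,300'
```

Rules like these belong in the quality-control stage of a pipeline, paired with sanity checks (do the Land and Building values sum to the Total?) that catch the errors no substitution table anticipates.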

On Mac and Linux, you will likely need to install some system packages to make the tesseract package installable.

Try installing:
 * deb: libtesseract-dev libleptonica-dev (Debian, Ubuntu, etc)
 * rpm: tesseract-devel leptonica-devel (Fedora, CentOS, RHEL)
 * brew: tesseract (Mac OSX)

The Windows R package should contain the dependencies required.

# install.packages("tesseract")
library(tesseract)
library(pdftools)
library(stringr)

pdf_convert(pdf = "../data/file.pdf", filenames = "../data/file-%d-r.%s")
text <- ocr("../data/file-1-r.png", engine = tesseract("eng"))
writeLines(text, "../data/file-1-r.txt")
## Converting page 1 to ../data/file-1-r.png... done!
## [1] "../data/file-1-r.png"
1. Convert the pdf to an image. Provide the placeholder %d in the string for the page number, and %s for the extension.
2. Run OCR on the image using the English language engine.

Link to the text file created with R

## Parc: 102420125000 Tox Yous 2125 Fun Date: Tn e227 PM_——_—Pooe_1 of 2
## omen ue ano mauincavoness ale ont
## Serer aaracsrion lowe ees So Staules tar bees ‘oro
## PROPERTY STS ADDRESS aun pats
## 20 S204 st hmber——sueOHe Amount Sts Toe esc
## event Propenr?HoRNTON
## Uineunis: 2 magento,
## bape aot FO cen72008 Ft Bevew “08 ral lene ec
## ronan mecenrapremnmstony —sseDvatuenisromy
## ae a $e00 Stat Steir
## | pnorenryractons ie Soo. oan “Bee
## Socino: arrnatseovawues
## [Araaupesonenon cent som st2t7— SD xvAL
## 
## ‘cosmo Wort ABE ORM HERD
##  asereanminconiion
## Towlawes 015 ossr 612 etLand Tot $4500
## 
## Taiabe Aa Land Tt 8

Why is this OCR version so much worse than the version using Bash? They’re using the same tesseract library under the hood!

A look at the documentation of pdftoppm shows that its default resolution for image conversion is 150 DPI, whereas the default resolution used by pdf_convert in R is 72 DPI. If we pass in 150 DPI, what happens?

library(tesseract)
library(pdftools)
library(stringr)

pdf_convert(pdf = "../data/file.pdf", dpi  = 150, filenames = "../data/file-%d-r-150dpi.%s")
## Converting page 1 to ../data/file-1-r-150dpi.png... done!
## [1] "../data/file-1-r-150dpi.png"
text <- ocr("../data/file-1-r-150dpi.png", engine = tesseract("eng"))
writeLines(text, "../data/file-1-r-150dpi.txt")

Link to the text file created with R at 150 DPI

## LANCASTER COUNTY APPRAISAL CARD
## Parcel ID: 10-24-201-025-000 Tax Year: 2025 Run Date: 7/15/2025 12:28:27 PM Page 1 of 2
## EASTDALE RENTALS LLC Date Type Sale Amount Validity Multi Inst.Type Instrument #
## Attn: JEFF & ANITA EASTMAN 05/20/2022 Improved $0 Disqualified Warranty Deed 2022025796
## 2501 $ 74ST 04/23/1996 Improved $37,000 _Disaualified Warranty Deed 1996016959
## LINCOLN, NE 68506 0908/1994 Improved $0 Disqualified Death Certificate 1994045456
## Additional Owners
## No.
## 2250 SHELDON ST Number Issue Date Amount Status Type Description
## LINCOLN, NE 68503
## Prop Class: Residential Improved
## Primary Use: Conversion-Apt
## Livina Units: 2 INSPECTION HISTORY
## vo. slemtial Diet Date Time Process Reason Appraiser Contact-Code
## Zonina: R4-Residential District 07/11/2022 Interview and Measure - 01 General Review MRC Tenant -2
## Noha: 8NCO1 - North Central - 11202015 11:45AM —_No Answer At Door, Exterior - 04 General Review afo
## cvbu 07/15/2010 Field Review - 08 Final Review CAB
## br: 05/04/2010 No Answer At Door, Measured - 05 General Review TMJ
## Tax Unit Grp: 0001 09/17/2008 Field Review - 08 Final Review CAB
## Schl Code Base: 55-0001 Lincoln
## Exemptions: ~ RECENTAPPEALHISTORY = ASSESSEDVALUEHISTORY
## Year Level Case # Status Action Year Land Building Total
## Flaas: 2025 $45,000 $124,700 $169,700
## : 2024 $45,000 $124,700 $169,700
## | Property Factors poe peso Ser) Sjonaoo
## 2022 $25,000 $69,400 $94,400
## GBA: 0 2021 $25,000 $69,400 $94,400
## NRA:
## ponent  APPRAISEDVALUES
## Parkina Type:
## Parkina Quantity: Land Building Total Method
## — = LEGAL DESCRIPTION, Current $45,000 $124,700 $169,700 IDXVAL
## ENGLESIDE ADDITION, BLOCK 2, Lot 23 Prior $45,000 $124,700 $169,700 IDXVAL
## Cost $158,280 Market $332,300 GRM $169,700
## Income $0 MRA $160.100 Ovr
## Method Type ACSFUnits Infl_ Fact! ~—sInf2._—sFact2=—SsInflC_— Fact. ~=— Av Unit Val Land Value
## Site RPI-Primary Interior u 45,000 45,000
## TotalAcres 0.15 GIS SF 6402 Mkt Land Total $45,000
## Taxable Aa Land Total $0

If you don’t already have images of the pages, you first have to create one (imgBlob) for each page, and then run image_to_string on that image. Note that this method does not require you to write the image out to a separate file: the image is only stored in memory. This might be preferable to the bash method, which produces one file for each PDF page and will use up disk space unless you delete the intermediate files at some point after the text extraction is complete.

from PIL import Image
import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("../data/file.pdf")

for pageNum, imgBlob in enumerate(pages):
  text = pytesseract.image_to_string(imgBlob, lang='eng')
  with open(f'../data/file-{pageNum+1}-py.txt', 'w') as the_file:
    the_file.write(text)
## 1796
1. Read in the PDF (by default, uses 200 dpi).
2. Run OCR on each page of the PDF.
3. Write a file for each page of the PDF containing the text from OCR.

If you already have the image created, it’s simple to get the text out with pytesseract, using Image.open(<file>) instead of convert_from_path().

pytesseract.image_to_string(Image.open('../data/file-1.png')).split("\n")[0:5] 
## ['Parcel ID: 10-24-201-025-000', '', 'EASTDALE RENTALS LLC', 'Attn: JEFF & ANITA EASTMAN,', '2501 S 74 ST']

Link to the text file created with Python

## Parcel ID: 10-24-201-025-000
## 
## EASTDALE RENTALS LLC
## Attn: JEFF & ANITA EASTMAN
## 2501S 74ST
## 
## LINCOLN, NE 68506
## 
## Additional Owners
## No.
## 
## 2250 SHELDON ST
## LINCOLN, NE 68503
## 
## Prop Class: Residential Improved
## 
## Primary Use: Conversion-Apt
## 
## Living Units: 2
## 
## Zonina: R4-Residential District
## 
## Nbhd: 8NC01 - North Central -
## CVDU
## 
## Tax Unit Grp: 0001
## 
## Schl Code Base: 55-0001 Lincoln
## 
## Exemptions:
## 
## Flaas:
## 
## GBA: 0
## 
## NRA:
## 
## Location:
## 
## Parking Tvpe:
## 
## Parkina Quantitv:
## 
## ENGLESIDE ADDITION, BLOCK 2, Lot 23
## 
## LANCASTER COUNTY APPRAISAL CARD
## Tax Year: 2025
## 
## Date Type Sale Amount Validity
## 
## 05/20/2022 Improved $0 Disqualified
## 04/23/1996 Improved $37,000 Disaualified
## 09/08/1994 Improved $0 Disqualified
## 
## Number Issue Date
## 
## Date Time Process
## 
## 07/11/2022 Interview and Measure - 01
## 11/20/2015 11:45 AM No Answer At Door, Exterior - 04
## 07/15/2010 Field Review - 08
## 
## 05/04/2010 No Answer At Door, Measured - 05
## 09/17/2008 Field Review - 08
## 
## Case # Status Action
## 
## Amount _ Status Type
## 
## Run Date: 7/15/2025 12:23:27 PM
## 
## Description
## 
## Reason
## General Review
## General Review
## Final Review
## General Review
## Final Review
## 
## Current
## Prior
## Cost
## 
## Income
## 
## Page 1 of 2
## 
## Multi Inst.Type Instrument #
## Warranty Deed 2022025796
## Warranty Deed 1996016959
## Death Certificate 1994045456
## 
## Appraiser Contact-Code
## MRC Tenant - 2
## 
## afo
## 
## CAB
## 
## TMJ
## 
## CAB
## 
## Land Buildina Total
## $45,000 $124,700 $169,700
## $45,000 $124,700 $169,700
## $45,000 $115,700 $160,700
## $25,000 $69.400 $94,400
## $25,000 $69.400 $94,400
## 
## Land Buildina Total Method
## $45,000 $124,700 $169,700 IDXVAL
## $45.000 $124,700 $169,700 IDXVAL
## 
## $158,280 Market $332,300 GRM $169,700
## $0 MRA $160,100 Ovr
## 
## Method Type
## Site RPI-Primary Interior
## 
## Total Acres 0.15 GIS SF 6402
## 
## ACSF Units Inf1 Fact1
## 
## Inf2 Fact2
## 
## InflC FactC Land Value
## 
## 45,000
## 
## Avg Unit Val
## 45,000
## 
## Mkt Land Total $45,000
## Taxable Aq Land Total $0
## Parcel ID: 10-24-201-025-000
## 
## EASTDALE RENTALS LLC
## Attn: JEFF & ANITA EASTMAN
## 2501S 74ST
## 
## LINCOLN, NE 68506
## 
## Additional Owners
## No.
## 
## 2250 SHELDON ST
## LINCOLN, NE 68503
## 
## Prop Class: Residential Improved
## 
## Primary Use: Conversion-Apt
## 
## Living Units: 2
## 
## Zonina: R4-Residential District
## 
## Nbhd: 8NC01 - North Central -
## CVDU
## 
## Tax Unit Grp: 0001
## 
## Schl Code Base: 55-0001 Lincoln
## 
## Exemptions:
## 
## Flaas:
## 
## GBA: 0
## 
## NRA:
## 
## Location:
## 
## Parking Tvpe:
## 
## Parkina Quantitv:
## 
## ENGLESIDE ADDITION, BLOCK 2, Lot 23
## 
## LANCASTER COUNTY APPRAISAL CARD
## Tax Year: 2025
## 
## Date Type Sale Amount Validity
## 
## 05/20/2022 Improved $0 Disqualified
## 04/23/1996 Improved $37,000 Disaualified
## 09/08/1994 Improved $0 Disqualified
## 
## Number Issue Date
## 
## Date Time Process
## 
## 07/11/2022 Interview and Measure - 01
## 11/20/2015 11:45 AM No Answer At Door, Exterior - 04
## 07/15/2010 Field Review - 08
## 
## 05/04/2010 No Answer At Door, Measured - 05
## 09/17/2008 Field Review - 08
## 
## Case # Status Action
## 
## Amount _ Status Type
## 
## Run Date: 7/15/2025 12:23:27 PM
## 
## Description
## 
## Reason
## General Review
## General Review
## Final Review
## General Review
## Final Review
## 
## Current
## Prior
## Cost
## 
## Income
## 
## Page 1 of 2
## 
## Multi Inst.Type Instrument #
## Warranty Deed 2022025796
## Warranty Deed 1996016959
## Death Certificate 1994045456
## 
## Appraiser Contact-Code
## MRC Tenant - 2
## 
## afo
## 
## CAB
## 
## TMJ
## 
## CAB
## 
## Land Buildina Total
## $45,000 $124,700 $169,700
## $45,000 $124,700 $169,700
## $45,000 $115,700 $160,700
## $25,000 $69.400 $94,400
## $25,000 $69.400 $94,400
## 
## Land Buildina Total Method
## $45,000 $124,700 $169,700 IDXVAL
## $45.000 $124,700 $169,700 IDXVAL
## 
## $158,280 Market $332,300 GRM $169,700
## $0 MRA $160,100 Ovr
## 
## Method Type
## Site RPI-Primary Interior
## 
## Total Acres 0.15 GIS SF 6402
## 
## ACSF Units Inf1 Fact1
## 
## Inf2 Fact2
## 
## InflC FactC Land Value
## 
## 45,000
## 
## Avg Unit Val
## 45,000
## 
## Mkt Land Total $45,000
## Taxable Aq Land Total $0
## Parcel ID: 10-24-201-025-000
## 
## EASTDALE RENTALS LLC
## Attn: JEFF & ANITA EASTMAN
## 2501S 74ST
## 
## LINCOLN, NE 68506
## 
## Additional Owners
## No.
## 
## 2250 SHELDON ST
## LINCOLN, NE 68503
## 
## Prop Class: Residential Improved
## 
## Primary Use: Conversion-Apt
## 
## Living Units: 2
## 
## Zonina: R4-Residential District
## 
## Nbhd: 8NC01 - North Central -
## CVDU
## 
## Tax Unit Grp: 0001
## 
## Schl Code Base: 55-0001 Lincoln
## 
## Exemptions:
## 
## Flaas:
## 
## GBA: 0
## 
## NRA:
## 
## Location:
## 
## Parking Tvpe:
## 
## Parkina Quantitv:
## 
## ENGLESIDE ADDITION, BLOCK 2, Lot 23
## 
## LANCASTER COUNTY APPRAISAL CARD
## Tax Year: 2025
## 
## Date Type Sale Amount Validity
## 
## 05/20/2022 Improved $0 Disqualified
## 04/23/1996 Improved $37,000 Disaualified
## 09/08/1994 Improved $0 Disqualified
## 
## Number Issue Date
## 
## Date Time Process
## 
## 07/11/2022 Interview and Measure - 01
## 11/20/2015 11:45 AM No Answer At Door, Exterior - 04
## 07/15/2010 Field Review - 08
## 
## 05/04/2010 No Answer At Door, Measured - 05
## 09/17/2008 Field Review - 08
## 
## Case # Status Action
## 
## Amount _ Status Type
## 
## Run Date: 7/15/2025 12:23:27 PM
## 
## Description
## 
## Reason
## General Review
## General Review
## Final Review
## General Review
## Final Review
## 
## Current
## Prior
## Cost
## 
## Income
## 
## Page 1 of 2
## 
## Multi Inst.Type Instrument #
## Warranty Deed 2022025796
## Warranty Deed 1996016959
## Death Certificate 1994045456
## 
## Appraiser Contact-Code
## MRC Tenant - 2
## 
## afo
## 
## CAB
## 
## TMJ
## 
## CAB
## 
## Land Buildina Total
## $45,000 $124,700 $169,700
## $45,000 $124,700 $169,700
## $45,000 $115,700 $160,700
## $25,000 $69.400 $94,400
## $25,000 $69.400 $94,400
## 
## Land Buildina Total Method
## $45,000 $124,700 $169,700 IDXVAL
## $45.000 $124,700 $169,700 IDXVAL
## 
## $158,280 Market $332,300 GRM $169,700
## $0 MRA $160,100 Ovr
## 
## Method Type
## Site RPI-Primary Interior
## 
## Total Acres 0.15 GIS SF 6402
## 
## ACSF Units Inf1 Fact1
## 
## Inf2 Fact2
## 
## InflC FactC Land Value
## 
## 45,000
## 
## Avg Unit Val
## 45,000
## 
## Mkt Land Total $45,000
## Taxable Aq Land Total $0
## Parcel ID: 10-24-201-025-000
## 
## EASTDALE RENTALS LLC
## Attn: JEFF & ANITA EASTMAN
## 2501S 74ST
## 
## LINCOLN, NE 68506
## 
## Additional Owners
## No.
## 
## 2250 SHELDON ST
## LINCOLN, NE 68503
## 
## Prop Class: Residential Improved
## 
## Primary Use: Conversion-Apt
## 
## Living Units: 2
## 
## Zonina: R4-Residential District
## 
## Nbhd: 8NC01 - North Central -
## CVDU
## 
## Tax Unit Grp: 0001
## 
## Schl Code Base: 55-0001 Lincoln
## 
## Exemptions:
## 
## Flaas:
## 
## GBA: 0
## 
## NRA:
## 
## Location:
## 
## Parking Tvpe:
## 
## Parkina Quantitv:
## 
## ENGLESIDE ADDITION, BLOCK 2, Lot 23
Assessment of OCR Methods

What is interesting is that even when controlling the DPI, the OCR programs in each language generate different text files. Bash and Python are fairly similar, but R’s text file is ordered by horizontal lines, not columns. We could probably fix this with the right set of options, but a more straightforward approach might be to crop the images into separate chunks for each table and section, which should produce cleaner and more interpretable OCR output. If we wanted to process many of these files and load the information into a database, we would first need to work out how to crop the images consistently.
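One way to prototype the crop-into-chunks idea is to define each section of the card as fractions of the page size, then convert those fractions to pixel boxes, so the same regions can be reused across pages scanned at different resolutions. This is a minimal sketch; the function name, region names, and fractions here are made-up placeholders that would need tuning against the actual scans.

```python
def crop_boxes(page_w, page_h, regions):
    """Convert fractional (left, top, right, bottom) regions to pixel boxes.

    `regions` maps a section name to fractions of the page size; the
    returned integer boxes are suitable for PIL's Image.crop().
    """
    boxes = {}
    for name, (l, t, r, b) in regions.items():
        boxes[name] = (round(l * page_w), round(t * page_h),
                       round(r * page_w), round(b * page_h))
    return boxes

# A 300 DPI letter-size page is 2550 x 3300 pixels; these fractions are
# placeholders you would pick by inspecting a few sample scans.
boxes = crop_boxes(2550, 3300, {
    "sales_table": (0.05, 0.10, 0.95, 0.35),
    "owner_info":  (0.05, 0.35, 0.50, 0.60),
})
```

Each chunk could then be OCR'd separately, keeping one table per text file instead of interleaving columns from different sections.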

In many cases, we don’t want to end up with only a text file – we want the context of the image, but we’d also like to be able to see the text, use Ctrl/Cmd-F to find the right page, and copy the text back out of the file.

Consider Hamlet, which I downloaded from The Internet Archive – it has the original page images, but it also has a text layer that allows you to search through the document and find, for instance, the 11 instances of the word “skull”, four of which are on page 82. In general, the text layer is either not displayed or, more commonly, sits behind the image layer, letting the reader select and copy the words without the text showing over top of the image. How are these hybrid PDF documents created?

Demo: Creating Hybrid PDFs

When I need OCR and don’t necessarily want to bother with R, I prefer to use a program called ocrmypdf, which is built on tesseract and available for most Linux distributions. I started using ocrmypdf before I realized that it’s actually a Python package that provides a command-line interface. It’s possible to call the command from within R or Python, but once you have the package installed and the binary on your system path, it’s just as easy to run the program from the terminal – everything else is just overhead.

Let’s OCR the Lancaster County, NE home appraisal image PDF and see what we come up with.

ocrmypdf -l eng ../data/file.pdf ../data/file-ocrmypdf.pdf

The output of ocrmypdf is a hybrid PDF that has the text and image data superimposed (the text is not visible until you highlight it).

36.3 Working with PDFs Programmatically

36.3.1 Reading Text from Raster-based PDFs with OCR

In many real-world situations, we may not need to read tabular data out of a PDF. For instance, Google’s Library Project scanned millions of books, allowing Google to show the frequency of a word’s use over time. This wouldn’t be possible without a very good OCR library (which is one reason Google took over development of tesseract after HP open-sourced it).

The primary challenges when reading text from an OCR’d PDF are:

  1. Cleaning up lower-quality scans – noise removal, background removal, and brightness/contrast adjustments may be necessary before the OCR step.
  2. Ensuring word order is maintained across lines.
  3. Removing page numbers and header/footer information.

36.3.1.1 Extended Example: Trial Transcripts

Let’s start out with a relatively simple case, based on a real problem I tried to tackle in 2022. We have a number of court trial transcripts, and we want to systematically search for certain phrases to determine how common they are. We’ll ignore, for now, the problem of actually understanding what the words mean and classifying them properly (that would require language or topic modeling), and focus on the problem of just getting the trial transcripts converted into readable plain text.
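Once the transcripts are plain text, the phrase search itself is straightforward. A minimal sketch, assuming the transcripts have already been read into strings – whitespace inside each phrase is matched loosely, because OCR output routinely breaks lines mid-sentence:

```python
import re

def phrase_counts(texts, phrases):
    """Count case-insensitive occurrences of each phrase across documents.

    Whitespace in a phrase matches any run of whitespace (including
    newlines), so phrases split across OCR'd lines are still found.
    """
    results = {}
    for phrase in phrases:
        pattern = re.compile(r"\s+".join(map(re.escape, phrase.split())),
                             re.IGNORECASE)
        results[phrase] = sum(len(pattern.findall(t)) for t in texts)
    return results
```

Counts like these are only as good as the OCR underneath them, which is why the rest of this section worries so much about the conversion step.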

The first step is to run OCR on the first day’s trial transcript. I acquired these transcripts from a professional attorney information site for demonstration purposes, as the transcripts I was originally working with were confidential.

I will show code for running OCR on all pages of the file, but it doesn’t make sense to run that code all at once – there is always some tweaking to be done, so I usually run 1 page, then 5 pages, then 10, and then the full set, optimizing the code as I go.

Step 1: Testing OCR of full page(s)
pdf="../data/legal-transcript-trial_-_day_1.pdf"
path="../data/legal-transcript-trial-day-1"
mkdir -p "$path"
pdftoppm -png -r 300 "$pdf" "$path/page"

# for file in "$path"/*.png
# do
#   tesseract -l eng "$file" "${file%.*}-bash"
# done

tesseract -l eng "$path/page-001.png" "$path/page-001-bash"
1. Convert each page of the PDF to a separate PNG.
2. Iterate over the PNGs (commented out because there’s no point in running it multiple times at the moment).
3. Convert each PNG to a corresponding text file, removing the extension and appending -bash to the file name (tesseract adds the .txt extension itself). ${file%.*} is a bash idiom for removing the file extension, and the -bash suffix distinguishes this output from the R and Python versions.
4. The single-file version of the command in the for loop – the biggest difference is that we specify the input and output file names instead of using the $file placeholder.

We could speed this up using GNU parallel if we wanted to do so – see this script for an example of what that would look like.
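If you would rather stay in Python than learn GNU parallel’s syntax, concurrent.futures gives the same fan-out. This is a sketch: ocr_page here is a stand-in for a real call to pytesseract or the tesseract CLI, and threads are appropriate because tesseract does its heavy lifting outside the Python interpreter.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(png_path):
    # Stand-in for the real call, e.g.
    #   subprocess.run(["tesseract", "-l", "eng", png_path, out_base])
    # or pytesseract.image_to_string(Image.open(png_path)).
    return f"text of {png_path}"

def ocr_all(png_paths, workers=4):
    # pool.map preserves input order, so results line up with page numbers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, png_paths))
```

With a few hundred pages per transcript, even four workers cut the wall-clock time substantially.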

Any further cleaning work should probably be done in some other language (we could go through the use of awk and sed here, but that’s probably beyond the scope of this book).

cat ../data/legal-transcript-trial-day-1/page-001-bash.txt
## cat: ../data/legal-transcript-trial-day-1/page-001-bash.txt: No such file or directory
library(tesseract)
library(pdftools)
library(stringr)
library(purrr)
# pdf_convert(
#   pdf = "../data/legal-transcript-trial_-_day_1.pdf",
#   filenames = "../data/legal-transcript-trial-day-1/page-%03d.%s",
#   dpi = 300)
png_files <- list.files("../data/legal-transcript-trial-day-1/",
                        "png$", full.names = TRUE)
# text <- map_chr(png_files, ~ocr(., engine = tesseract("eng")))
text1 <- ocr(png_files[1], engine = tesseract("eng"))
1. Convert PDF pages to PNGs using R (commented out to avoid re-running the conversion).
2. List all PNG files in the folder.
3. OCR every PNG file and return the text as a character vector (commented out).
4. OCR the first PNG and return the text.
text1[[1]]
## [1] "IN THE CIRCUIT COURT FOR FREDERICK COUNTY, MARYLAND\nEXLINE-HASSLER\nPlaintiff a\nVv. Civil Docket\nNo. 10-C-12-000410\nPENN NATIONAL INSURANCE, ET AL.,\n: Defendant\nOFFICIAL TRANSCRIPT OF PROCEEDINGS\n(JURY TRIAL - DAY ONE)\nFrederick, Maryland\nJanuary 22, 2013\nBEFORE: . .\nTHE HONORABLE JULIE S. SOLT, JUDGE\nAPPEARANCES:\n| For the Plaintiff:\n| LAURA C. ZOIS, ESQUIRE\nJOHN B. BRATT, ESQUIRE\n) For the Defendant:\nWALTER E. GILLCRIST, JR., ESQUIRE\nANNE K. HOWARD, ESQUIRE\n| For Penn National Insurance, et al.: |\n/ GUIDO PORCARELLI, ESQUIRE\n,\n3 TRANSCRIBED BY:\nVictoria Eastridge\ni Official Transcriber\n. 100 W. Patrick Street\n> Frederick, Maryland 21701\n"
text1[[1]] |> 
  str_split("\n", simplify = F) |> unlist()
##  [1] "IN THE CIRCUIT COURT FOR FREDERICK COUNTY, MARYLAND"
##  [2] "EXLINE-HASSLER"                                     
##  [3] "Plaintiff a"                                        
##  [4] "Vv. Civil Docket"                                   
##  [5] "No. 10-C-12-000410"                                 
##  [6] "PENN NATIONAL INSURANCE, ET AL.,"                   
##  [7] ": Defendant"                                        
##  [8] "OFFICIAL TRANSCRIPT OF PROCEEDINGS"                 
##  [9] "(JURY TRIAL - DAY ONE)"                             
## [10] "Frederick, Maryland"                                
## [11] "January 22, 2013"                                   
## [12] "BEFORE: . ."                                        
## [13] "THE HONORABLE JULIE S. SOLT, JUDGE"                 
## [14] "APPEARANCES:"                                       
## [15] "| For the Plaintiff:"                               
## [16] "| LAURA C. ZOIS, ESQUIRE"                           
## [17] "JOHN B. BRATT, ESQUIRE"                             
## [18] ") For the Defendant:"                               
## [19] "WALTER E. GILLCRIST, JR., ESQUIRE"                  
## [20] "ANNE K. HOWARD, ESQUIRE"                            
## [21] "| For Penn National Insurance, et al.: |"           
## [22] "/ GUIDO PORCARELLI, ESQUIRE"                        
## [23] ","                                                  
## [24] "3 TRANSCRIBED BY:"                                  
## [25] "Victoria Eastridge"                                 
## [26] "i Official Transcriber"                             
## [27] ". 100 W. Patrick Street"                            
## [28] "> Frederick, Maryland 21701"                        
## [29] ""

R picks up more of the characters near the left edge of the page – the gutter marks and line numbers – where tesseract via bash did not.

from PIL import Image
import pytesseract
from pdf2image import convert_from_path

#pages = convert_from_path("../data/legal-transcript-trial_-_day_1.pdf", dpi = 300)

pages = convert_from_path("../data/legal-transcript-trial_-_day_1.pdf", 
                          dpi = 300, first_page=1, last_page=1)

for pageNum,imgBlob in enumerate(pages):
  text = pytesseract.image_to_string(imgBlob, lang='eng')
  with open(f'../data/legal-transcript-trial-day-1/page-{pageNum+1:03}-py.txt', 'w') as the_file:
    the_file.write(text)
## 708
1. Convert all pages to images held in memory, not saved to files (commented out).
2. Convert only the first page to an image in memory.
3. OCR the image.
4. Save the text to a file. page-{pageNum+1:03}-py.txt specifies the output filename: since Python is 0-indexed, we add 1 to the page number index to get the actual page number of the PDF (we don’t normally count pages from 0). The :03 component formats the number as an integer padded to 3 digits with leading 0s, so 1 becomes 001, 10 becomes 010, and so on. The f at the front of the string indicates that this is an f-string – a formatted string literal.
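The zero-padding behavior is easy to check in isolation before trusting it with a few hundred filenames:

```python
# f-string zero-padding, as used for the output filenames above:
# {n + 1:03} shifts the 0-based index to a 1-based page number and
# pads it to three digits with leading zeros.
names = [f"page-{n + 1:03}-py.txt" for n in range(3)]
# names[0] is "page-001-py.txt"
```

Three digits is enough here because no transcript in this set runs past 999 pages; a longer document would need :04.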
cat ../data/legal-transcript-trial-day-1/page-001-py.txt
## CIRCUIT COURT FOR FREDERICK COUNTY
## 
## COURT HOUSE
## FREDERICK, MARYLAND 21701
## 
## IN THE CIRCUIT COURT FOR FREDERICK COUNTY, MARYLAND
## 
## EXLINE-HASSLER
## 
## Plaintiff
## Vv. Civil Docket
## 
## No. 10-C-12-000410
## PENN NATIONAL INSURANCE, ET AL.,
## 
## Defendant
## 
## OFFICIAL TRANSCRIPT OF PROCEEDINGS
## 
## (JURY TRIAL - DAY ONE)
## 
## Frederick, Maryland
## 
## January 22, 2013
## 
## BEFORE:
## 
## THE HONORABLE JULIE S. SOLT, JUDGE
## 
## APPEARANCES :
## 
## For the Plaintiff:
## LAURA C. ZOIS, ESQUIRE
## JOHN B. BRATT, ESQUIRE
## 
## For the Defendant:
## WALTER E. GILLCRIST, JR., ESQUIRE
## ANNE K. HOWARD, ESQUIRE
## 
## For Penn National Insurance, et al.:
## GUIDO PORCARELLI, ESQUIRE
## 
## TRANSCRIBED BY:
## Victoria Eastridge
## Official Transcriber
## 
## 100 W. Patrick Street
## Frederick, Maryland 21701

The Python-generated file looks pretty dang perfect to me, which is cool.

It’s always preferable to refine your methods before running them on the whole set of files, which is why I only OCR’d the first page. Looking at the text produced by bash and R (but not Python!), it’s clear that we’ll need to do some cleaning. If possible, we’d like to extract the line numbers as a column, but if not, we would at least like to remove them from the text. There is no way to do this without at least looking at the PDF or PNG files; if we want to do a really good job, we’ll need to manually clean each page (ugh).

Step 2: Cropping

We could reduce this work a bit if we could automatically recognize the vertical-ish lines separating the transcript content from the line numbers, but even doing that automatically is tricky and would likely involve line detection, rotation, and cropping for each image.

However, it might be easy enough to make an educated guess, as this PDF is actually a “best case” OCR scenario. The pages don’t have a ton of dust, copy artifacts, etc., so if we can crop off the top, bottom, and left margins, we’d probably get cleaner data. In addition, the pages aren’t that skewed and are mostly in the same alignment – we might be able to get away with specifying a crop boundary for all pages.

I could figure out how to crop images in R or Python, but honestly, I think the easiest way to do this is just to use the command line. I encourage you to search for how to do this for a folder of files on the command line, because no one remembers the syntax for this stuff unless they’re doing it all the time. I consulted this page and this page and eventually pieced something together. The $(echo $f | sed 's/\.png/-crop.png/') is probably not the most elegant way to rename the output files, but it works (which the other methods I tried did not). This requires that imagemagick is installed and on the system path.

Then, we have to figure out how much to crop. Imagemagick’s convert command takes a crop geometry of the form <Width>x<Height>+OffsetX+OffsetY. I opened a few files in an image-viewing program that had a crop function and drew some rectangles to see how different cropping arguments would work, as in Figure 36.5.
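Geometry strings are easy to get subtly wrong, so a small helper to build and sanity-check them can save a wasted re-render of a few hundred pages. A sketch – the helper names are mine, and the strict parser only accepts the positive-offset form used below:

```python
import re

def geometry(width, height, off_x, off_y):
    """Build an imagemagick -crop geometry string: <W>x<H>+X+Y."""
    return f"{width}x{height}+{off_x}+{off_y}"

def parse_geometry(geom):
    """Parse '<W>x<H>+X+Y' back into (width, height, off_x, off_y)."""
    m = re.fullmatch(r"(\d+)x(\d+)\+(\d+)\+(\d+)", geom)
    if m is None:
        raise ValueError(f"not a crop geometry: {geom!r}")
    return tuple(int(g) for g in m.groups())
```

Round-tripping the string through parse_geometry before the batch run is a cheap way to catch a transposed width/offset.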

rm -f ../data/legal-transcript-trial-day-1/*-crop.png
for f in ../data/legal-transcript-trial-day-1/*.png; 
  do convert -crop 2000x2700+450+300 +repage "$f" "$(echo "$f" | sed 's/\.png/-crop.png/')";
done
1. Removing previously cropped files first prevents cropping *-crop.png files into *-crop-crop.png files and accumulating ridiculous numbers of files.
2. Crop each page to a 2000-pixel-wide, 2700-pixel-tall region, starting (from the top left) at x = 450, y = 300. The +repage argument resets the coordinates of the PNG so that the canvas doesn’t keep the original size. $(echo $f | sed 's/\.png/-crop.png/') renames the output file to end in -crop.png… more challenging than expected.

Once the conversion command has run, it can be helpful to page through the cropped files looking for problems, and rerun the crop command if problems are identified, as in Figure 36.6.

Then, we re-run the OCR process on the cropped files.

Step 3: OCR on cropped images
# for file in ../data/legal-transcript-trial-day-1/*-crop.png
# do
#   tesseract -l eng "$file" "${file%.*}-bash"
# done

path="../data/legal-transcript-trial-day-1"
tesseract -l eng "$path/page-001-crop.png" "$path/page-001-crop-bash"
1. Iterate over the cropped PNGs (commented out because there’s no point in running it multiple times at the moment).
2. Convert each PNG to a corresponding text file, removing the extension and appending -bash (tesseract adds the .txt extension itself). ${file%.*} is a bash idiom for removing the file extension.
3. The single-file version of the command in the for loop – we specify the input and output file names instead of using the $file placeholder.
cat ../data/legal-transcript-trial-day-1/page-001-crop-bash.txt
## IN THE CIRCUIT COURT FOR FREDERICK COUNTY, MARYLAND
## EXLINE-HASSLER
## 
## Plaintiff
## Vv. Civil Docket
## No. 10-C-12-000410
## PENN NATIONAL INSURANCE, ET AL.,
## 
## Defendant
## 
## OFFICIAL TRANSCRIPT OF PROCEEDINGS
## 
## (JURY TRIAL - DAY ONE)
## 
## Frederick, Maryland
## 
## January 22, 2013
## 
## BEFORE:
## 
## THE HONORABLE JULIE S. SOLT, JUDGE
## 
## APPEARANCES :
## 
## For the Plaintiff:
## LAURA C. ZOIS, ESQUIRE
## JOHN B. BRATT, ESQUIRE
## 
## For the Defendant:
## WALTER E. GILLCRIST, JR., ESQUIRE
## ANNE K. HOWARD, ESQUIRE
## 
## For Penn National Insurance, et al.:
## GUIDO PORCARELLI, ESQUIRE
## 
## TRANSCRIBED BY:
## 
## Victoria Eastridge
## Official Transcriber
## 
## 100 W. Patrick Street
## Frederick, Maryland 21701

The text looks pretty perfect to me at this point.

library(tesseract)
library(pdftools)
library(stringr)
library(purrr)

png_files <- list.files("../data/legal-transcript-trial-day-1/",
                        "-crop.png$", full.names = TRUE)
textcrop <- ocr(png_files[1], engine = tesseract("eng"))
1. List the cropped PNG files in the folder.
2. OCR the first cropped PNG and return the text.
textcrop[[1]] |> 
  str_split("\n", simplify = F) |> unlist()
##  [1] "IN THE CIRCUIT COURT FOR FREDERICK COUNTY, MARYLAND"
##  [2] "EXLINE-HASSLER"                                     
##  [3] "Plaintiff a"                                        
##  [4] "Vv. Civil Docket"                                   
##  [5] "No. 10-C-12-000410"                                 
##  [6] "PENN NATIONAL INSURANCE, ET AL.,"                   
##  [7] ": Defendant"                                        
##  [8] "OFFICIAL TRANSCRIPT OF PROCEEDINGS"                 
##  [9] "(JURY TRIAL - DAY ONE)"                             
## [10] "Frederick, Maryland"                                
## [11] "January 22, 2013"                                   
## [12] "BEFORE: . ."                                        
## [13] "THE HONORABLE JULIE S. SOLT, JUDGE"                 
## [14] "APPEARANCES:"                                       
## [15] "| For the Plaintiff:"                               
## [16] "| LAURA C. ZOIS, ESQUIRE"                           
## [17] "JOHN B. BRATT, ESQUIRE"                             
## [18] ") For the Defendant:"                               
## [19] "WALTER E. GILLCRIST, JR., ESQUIRE"                  
## [20] "ANNE K. HOWARD, ESQUIRE"                            
## [21] "| For Penn National Insurance, et al.: |"           
## [22] "/ GUIDO PORCARELLI, ESQUIRE"                        
## [23] ","                                                  
## [24] "3 TRANSCRIBED BY:"                                  
## [25] "Victoria Eastridge"                                 
## [26] "i Official Transcriber"                             
## [27] ". 100 W. Patrick Street"                            
## [28] "> Frederick, Maryland 21701"                        
## [29] ""

R is still finding a few extra characters, but this is a vast improvement over the previous version. We could likely remove what’s left through some careful text cleaning.

from PIL import Image
import pytesseract
from pdf2image import convert_from_path

image="../data/legal-transcript-trial-day-1/page-001-crop.png"

text = pytesseract.image_to_string(Image.open(image))
f = open(f'../data/legal-transcript-trial-day-1/page-001-crop-py.txt', 'w')
f.writelines(text)
f.close()
cat ../data/legal-transcript-trial-day-1/page-001-crop-py.txt
## IN THE CIRCUIT COURT FOR FREDERICK COUNTY, MARYLAND
## EXLINE-HASSLER
## 
## Plaintiff
## Vv. Civil Docket
## No. 10-C-12-000410
## PENN NATIONAL INSURANCE, ET AL.,
## 
## Defendant
## OFFICIAL TRANSCRIPT OF PROCEEDINGS
## 
## (JURY TRIAL - DAY ONE)
## 
## Frederick, Maryland
## 
## January 22, 2013
## 
## BEFORE:
## THE HONORABLE JULIE S. SOLT, JUDGE
## 
## APPEARANCES :
## 
## For the Plaintiff:
## LAURA C. ZOIS, ESQUIRE
## JOHN B. BRATT, ESQUIRE
## 
## | For the Defendant:
## WALTER E. GILLCRIST, JR., ESQUIRE
## ANNE K. HOWARD, ESQUIRE
## 
## For Penn National Insurance, et al.:
## GUIDO PORCARELLI, ESQUIRE
## 
## 3 TRANSCRIBED BY:
## Victoria Eastridge
## t Official Transcriber
## . 100 W. Patrick Street
## > Frederick, Maryland 21701

In this case, the output from the cropped image is actually not as clean as the output from the cleaned, full-page version. This can happen because of page-segmentation settings that try to detect the main body of the page: if there isn’t enough margin left around the text, those settings sometimes work less well. It’s important to calibrate the pipeline to the tools you have available.
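If you control the cropping step yourself, one workaround is to pad the cropped image with a white border before OCR, so the page-segmentation heuristics have some margin to work with again. Here’s a minimal sketch using Pillow (already a dependency of pytesseract); the `add_margin` helper and the default border width are our own choices, not a tuned recommendation:

```python
from PIL import Image, ImageOps

def add_margin(path_in, path_out, border=50):
    """Pad a cropped page scan with a white border before OCR.

    ImageOps.expand adds `border` pixels of the fill color on
    every side, so the output is (w + 2*border, h + 2*border).
    """
    img = Image.open(path_in)
    padded = ImageOps.expand(img, border=border, fill="white")
    padded.save(path_out)
    return padded.size
```

You would then point tesseract (or pytesseract) at the padded file rather than the tightly cropped one.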

Step 4: Cleaning The Text Output

Once the pages have been cropped and OCR’d, then we need to clean up the text output. This might involve some of the following steps:

  • Concatenating the text files into a single file
  • Joining adjacent lines that are part of the same thought and from the same speaker
  • Assigning line numbers based on the order of the transcript
  • Separating out the speaker information, so that there is a column for speaker and a column for text
  • Running the transcript through spell check to handle any minor OCR errors, like using a zero instead of a capital O
  • Cleaning up punctuation marks that are often confused, like ( and { to ensure that all brackets match
  • Identifying portions of the transcript that are from depositions (which have Q and A at the beginning of each line, instead of speakers)

Which steps we undertake depends heavily on the goal of the analysis. We might not care about mismatched punctuation if the downstream text processing strips all punctuation anyway, but we may need to remove all of the speaker information so that only spoken text remains.

All of these steps are string processing tasks, which will not be repeated in this chapter.
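As one illustration of the kind of string processing involved, separating speaker labels from spoken text (and spotting deposition-style Q/A lines) is mostly a pair of regular expressions. This is a sketch; the patterns are guesses at typical transcript formatting, not rules derived from this particular document:

```python
import re

# Speaker lines typically look like "MS. ZOIS: Good morning." while
# deposition read-ins use a bare "Q" or "A" at the start of the line.
SPEAKER = re.compile(r"^\s*([A-Z][A-Z. ']+?):\s+(.*)$")
DEPO = re.compile(r"^\s*([QA])[.:]?\s+(.*)$")

def parse_line(line):
    """Return (speaker, text); speaker is None for continuation lines."""
    m = SPEAKER.match(line)
    if m:
        return m.group(1).strip(), m.group(2)
    m = DEPO.match(line)
    if m:
        return m.group(1), m.group(2)
    return None, line.strip()
```

Continuation lines (where the speaker comes back as None) would then be joined onto the previous row, which handles the “joining adjacent lines” step at the same time.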

36.3.2 Reading Tabular Data from Text or Hybrid PDFs

As mentioned above, tabular data is a particular challenge to read from PDF files, as the PDF specification doesn’t actually have any way to represent structured text. There are two common open-source libraries recommended for extracting tabular data from PDFs - tabula, which is a Java library [14], and camelot, a Python library [15]. There are interfaces to the tabula library in both R and Python (tabulapdf and tabula-py, respectively), but there is no R interface to camelot, as far as I am aware. In this chapter, I will focus primarily on tabula, since it works across both R and Python, but if you ever run into issues using it, consider camelot as well – it has some cool features [16].

Installation of tabulapdf in R depends on rJava, which can be tricky to set up, particularly on Windows. The tabulapdf GitHub page has more detailed instructions, including how to install Java on Windows using Chocolatey.

36.3.2.1 Example: BLS

The Bureau of Labor Statistics provides monthly Consumer Price Index news releases. An archive of these releases is available at https://www.bls.gov/bls/news-release/cpi.htm. Let’s acquire two years’ worth of monthly CPI reports, and focus on extracting the first table, “Consumer Price Index for All Urban Consumers (CPI-U): US city average, by expenditure category”.

As this material is covered in Chapter 33, I’m just going to provide the code to do this in R – you can see equivalent Python commands in Chapter 33. After having done this in R, I realized I probably could have accomplished the same task with a single wget command in bash; the lesson is that it’s important to pick your tools wisely.

library(rvest)
library(dplyr)
library(lubridate)
library(stringr)
library(purrr)

save_dir <- "../data/bls-pdfs/"
dir.create(save_dir, showWarnings = F)

url <- "https://www.bls.gov/bls/news-release/cpi.htm"

session <- read_html_live(url)

# PDFs are the 2nd link in each entry
links <- session$html_elements("li a:nth-child(2)")  

# Get the last 2 years of entries
link_tbl <- tibble(link = html_attr(links, "href"), 
                   date = str_extract(link, "\\d{8}")) |>
  na.omit() |>
  mutate(datestr = date, date = mdy(date)) |>
  filter(today() - years(2) <= date)


ua <- "Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0"
options(HTTPUserAgent = ua)

filelist <- paste0("https://www.bls.gov", link_tbl$link)
filesave <- paste0(save_dir, basename(link_tbl$link))

# The site is finicky about user agents, so we need to 
# specifically pass that in to the download.file method.
walk2(filelist, filesave, ~download.file(.x, destfile = .y, method = "wget", extra = paste0("-U \"", ua, "\"")))

You can either run the code above (assuming you have wget on your machine), or you can download a zip file of the PDFs.

# install.packages("tabulapdf")
library(tabulapdf)
library(pdftools)
library(purrr)
library(stringr)

files <- list.files(path = "../data/bls-pdfs", pattern = ".pdf$", full.names=T)

find_page_number <- function(file) {
  txt <- pdf_text(file)
  txt_by_page <- map_chr(txt, ~paste(., collapse=" "))
  which(str_detect(txt_by_page, "Table 1"))
}

# page_numbers <- map_int(files, find_page_number)
page_numbers <- c(9, 10, 8, 9, 8, 8, 9, 8, 9, 9, 8, 9, 9, 9, 9, 9, 9, 9, 9, 10, 9, 10, 9)

tables <- extract_tables(files[1], page = page_numbers[1],  output = "tibble")[[1]]

head(tables)
## # A tibble: 6 × 5
##   ...1                 ...2    ...3  `Unadjusted percent` Seasonally adjusted …¹
##   <chr>                <chr>   <chr> <chr>                <chr>                 
## 1 <NA>                 Relati… Unad… change               change                
## 2 <NA>                 impor-  <NA>  <NA>                 <NA>                  
## 3 Expenditure category tance   <NA>  Dec. Nov.            Sep. Oct. Nov.        
## 4 <NA>                 Nov.    Dec.… 2022- 2023-          2023- 2023- 2023-     
## 5 <NA>                 2023    2022… Dec. Dec.            Oct. Nov. Dec.        
## 6 <NA>                 <NA>    <NA>  2023 2023            2023 2023 2023        
## # ℹ abbreviated name: ¹​`Seasonally adjusted percent`

It appears that tabulapdf isn’t separating the columns the way we’d prefer. Let’s see if we can fix that.

There’s a function, locate_areas(), that works interactively - it opens a viewer tab and you select the table using the mouse, as in Figure 36.8.

locate_areas provides a sequence of coordinates that are relatively consistent across multiple full-page tables, so we might try to use those coordinates to improve our table parsing.

locate_areas(files[1], pages = rep(page_numbers[1], 5))

Here’s what I got running this 5 times for the first PDF in the list – this gives me boundaries for each column (without including the header).

Listening on http://127.0.0.1:6481
[[1]]
     top     left   bottom    right 
128.1042 209.2557 583.8484 244.2939 

[[2]]
     top     left   bottom    right 
128.1042 289.0649 584.8223 327.9962 

[[3]]
     top     left   bottom    right 
128.1042 369.8473 586.7699 410.7252 

[[4]]
     top     left   bottom    right 
128.1042 452.5763 586.7699 489.5611 

[[5]]
     top     left   bottom    right 
126.1566 535.3054 586.7699 575.2099 

We can actually run this for each column, keeping track of the left and right values, to get an even more precise way to read our data in. Here are my rough column alignments, using cpi_01112024.pdf as a test.

  • Table Start - 35
  • Col2 - 209
  • Col3 - 244
  • Col4 - 289
  • Col5 - 328
  • Col6 - 370
  • Col7 - 411
  • Col8 - 453
  • Col9 - 490
  • Col10 - 535
  • Table End - 575
tables <- extract_tables(
  files[1], page = page_numbers[1], 
  guess = F,
  col_names = F, 
  area = list(c(128, 35, 586, 575)), 
  columns = list(c(209, 244, 289, 328, 370, 411, 453, 490, 535))
)[[1]]

head(tables)
## # A tibble: 6 × 10
##   X1                          X2    X3    X4    X5    X6    X7    X8    X9   X10
##   <chr>                    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 All items.. . . . . . … 100     297.  307.  307.   3.4  -0.1   0     0.1   0.3
## 2 Food.. . . . . . . . .…  13.4   317.  325.  325.   2.7   0.1   0.3   0.2   0.2
## 3 Food at home.. . . . .…   8.55  299.  303.  303.   1.3  -0.1   0.3   0.1   0.1
## 4 Cereals and bakery pro…   1.16  345.  356.  354.   2.6  -0.7   0.2   0.5  -0.3
## 5 Meats, poultry, fish, …   1.78  320.  320.  320.  -0.1   0.1   0.7  -0.2   0.5
## 6 Dairy and related prod…   0.78  271.  268.  268.  -1.3   0.1   0.3   0.1   0.3

Ok, that looks good - let’s apply it to the rest of the reports now.

tables <- map2(files, page_numbers, ~extract_tables(
    .x, page = .y, 
    guess = F,
    col_names = F, 
    area = list(c(128, 35, 586, 575)), 
    columns = list(c(209, 244, 289, 328, 370, 411, 453, 490, 535))
    )[[1]]
  )

Then, we can read in the dates that are present in the header row, assuming that the major dividers stay the same between reports. I used locate_areas() to get the coordinates of each header that we care about.

> locate_areas(files[1], pages = page_numbers[1])

Listening on http://127.0.0.1:7339
[[1]]
      top      left    bottom     right 
 89.15168  36.98473 127.13037 572.29009 
headers <- map2(
  files, page_numbers,
  ~extract_tables(
    .x, page = .y,
    guess = F,
    col_names = F,
    area = list(c(89, 35, 128, 575)),
    columns = list(c(209, 244, 289, 328, 370, 411, 453, 490, 535))
  )[[1]]
)

fix_headers <- function(tbl) {
  modifiers <- c("", "Rel_imp.", rep("Unadj_idx.", 3), rep("Unadj_pct_chg.", 2), rep("Seas_adj_pct_chg.", 3))
  
  vars <- c(tbl[1,1],
    paste(unlist(tbl[2:3, 2]), collapse=""),
    paste(unlist(tbl[2:3, 3]), collapse=""),
    paste(unlist(tbl[2:3, 4]), collapse=""),
    paste(unlist(tbl[2:3, 5]), collapse=""),
    paste(unlist(tbl[1:4, 6]), collapse=""),
    paste(unlist(tbl[1:4, 7]), collapse=""),
    paste(unlist(tbl[1:4, 8]), collapse=""),
    paste(unlist(tbl[1:4, 9]), collapse=""),
    paste(unlist(tbl[1:4, 10]), collapse="")
    ) |> 
    str_remove_all("\\.") |>
    str_replace_all("[ -]", "_")
  
  paste0(modifiers, vars)
}

headers_fixed <- map(headers, fix_headers)

library(magrittr)
tables <- map2(tables, headers_fixed, ~set_names(.x, .y))

tables[[1]]
## # A tibble: 41 × 10
##    Expenditure_category      Rel_imp.Nov2023 Unadj_idx.Dec2022 Unadj_idx.Nov2023
##    <chr>                               <dbl>             <dbl>             <dbl>
##  1 All items.. . . . . . . …          100                 297.              307.
##  2 Food.. . . . . . . . . .…           13.4               317.              325.
##  3 Food at home.. . . . . .…            8.55              299.              303.
##  4 Cereals and bakery produ…            1.16              345.              356.
##  5 Meats, poultry, fish, an…            1.78              320.              320.
##  6 Dairy and related produc…            0.78              271.              268.
##  7 Fruits and vegetables. .…            1.47              349.              351.
##  8 Nonalcoholic beverages a…           NA                  NA                NA 
##  9 materials. . . . . . . .…            1.03              210.              216.
## 10 Other food at home.. . .…            2.33              263.              270.
## # ℹ 31 more rows
## # ℹ 6 more variables: Unadj_idx.Dec2023 <dbl>,
## #   Unadj_pct_chg.Dec2022_Dec2023 <dbl>, Unadj_pct_chg.Nov2023_Dec2023 <dbl>,
## #   Seas_adj_pct_chg.Sep2023_Oct2023 <dbl>,
## #   Seas_adj_pct_chg.Oct2023_Nov2023 <dbl>,
## #   Seas_adj_pct_chg.Nov2023_Dec2023 <dbl>
1
Pull headers out using roughly the same command as we used to get the tables, but with a different top and bottom area.
2
Write a function to clean the headers up a bit
3
modifiers are the top row of the variable names that aren’t captured by our headers object. They’re consistent from report to report. We’ll separate the modifier from the dates using ., for easier cleaning.
4
Extract only the components of the headers object that are needed – this depends on whether we’re talking about a single month-to-month comparison, or a time span. In the first column, we only need the “expenditure category” object.
5
Remove all . characters from the names so they don’t mess up our delimiter, and replace spaces and dashes with _.
6
Paste the two vectors together to get the names.
7
Apply the function to each header
8
Set the names of the variables in each table to the corresponding header.

Then, we just need to clean things up a bit more.

library(lubridate)
library(dplyr)
library(tidyr)
report_date <- str_replace(basename(files), "cpi_(.*)\\.pdf", "\\1") |> mdy()

cpi_data <- map2(tables, report_date, ~mutate(.x, report_date = .y))

cpi_data <- map(
  cpi_data, 
  ~pivot_longer(., -c(Expenditure_category, report_date),
                names_to="var",
                values_to = "val") |>
    separate(var, c("varname", "vardate"), sep = "\\.") |>
    pivot_wider(id_cols = c("Expenditure_category", "report_date", "vardate"), names_from = "varname", values_from = "val")
)

cpi_data <- cpi_data |>
  bind_rows() |>
  mutate(Expenditure_category = str_remove_all(Expenditure_category, "[\\. ]{1,}$") |>
           str_trim())

dim(cpi_data)
cpi_data
## [1] 6601    7
## # A tibble: 6,601 × 7
##    Expenditure_category report_date vardate      Rel_imp Unadj_idx Unadj_pct_chg
##    <chr>                <date>      <chr>          <dbl>     <dbl>         <dbl>
##  1 All items            2024-01-11  Nov2023        100        307.          NA  
##  2 All items            2024-01-11  Dec2022         NA        297.          NA  
##  3 All items            2024-01-11  Dec2023         NA        307.          NA  
##  4 All items            2024-01-11  Dec2022_Dec…    NA         NA            3.4
##  5 All items            2024-01-11  Nov2023_Dec…    NA         NA           -0.1
##  6 All items            2024-01-11  Sep2023_Oct…    NA         NA           NA  
##  7 All items            2024-01-11  Oct2023_Nov…    NA         NA           NA  
##  8 Food                 2024-01-11  Nov2023         13.4      325.          NA  
##  9 Food                 2024-01-11  Dec2022         NA        317.          NA  
## 10 Food                 2024-01-11  Dec2023         NA        325.          NA  
## # ℹ 6,591 more rows
## # ℹ 1 more variable: Seas_adj_pct_chg <dbl>
1
Determine the date of the report from the filename
2
Add a column with the corresponding report date to each table
3
Convert each table to long form with expenditure category and report date as ID columns.
4
Split the variable names from the period over which the variable is calculated. In theory, we should be able to determine the lag for each of these and not care about the date, but I don’t trust that the report has been that consistent over 2 years… paranoia.
5
Pivot wider, so that there’s a column for each variable name.
6
Bind all the tables together into a single table
7
Clean up the expenditure category names so that the dots are gone.

We could probably get this data cleaner – the lagged columns should be specified better, but this will do for now. Let’s at least do something interesting with this data that wouldn’t have been possible without reading data in from the tables.

library(ggplot2)
library(dplyr)
cpi_data |>
  filter(Expenditure_category %in% c("Energy", "Food", "Shelter", "Medical care services", "commodities", "Transportation services")) |>
  mutate(Category = str_replace_all(Expenditure_category, c("commodities"= "Non-food Goods", "Medical care services" = "Medical", "Transportation services" ="Transportation")) |>
           factor(levels = c("Shelter", "Non-food Goods", "Energy", "Food", "Medical", "Transportation"))) |>
  select(Category, report_date, Rel_imp) |>
  na.omit() |>
  ggplot(aes(x = report_date, y = Rel_imp, color = Category)) + geom_line() + 
  xlab("Date") + ylab("Relative Importance in CPI-U Calculation") + 
  theme_bw()

A line chart with date on the x-axis spanning from June 2023 to June 2025 and relative importance in the CPI-U calculation on the y-axis, ranging from approximately 0 to 40%. Six lines are shown: Shelter is the highest and has a slight bump up between March 2024 and January 2025, at which point it returns to approximately 35%. Non-food goods has a corresponding decrease under 20% during the same time span, but also has a slight downward trend during the period. Energy is next at around 13-14%, with a slight increase over the 2-year period. Food, Medical, and Transportation costs are all between 5 and 8%, with some slight variation during the period.

Chart of the relative importance of shelter, goods, energy, food, medical, and transportation costs in the CPI-U calculation.

If we learned anything from doing this in R, it’s that it probably won’t work the first time. So this code saves a bit of evaluation time by using some of the info we got from R, like the page numbers (I’ve adjusted these to match Python’s 0-based indexing).

import pdfplumber
import tabula  # provided by the tabula-py package, not the unrelated `tabula` package on PyPI
from glob import glob
import numpy as np

files = glob("../data/bls-pdfs/*.pdf")

def find_page_number(file):
  pdf = pdfplumber.open(file)
  # extract_text() can return None for image-only pages, so guard with `or ""`
  has_tb1 = ["Table 1" in (i.extract_text() or "") for i in pdf.pages]
  pdf.close()
  return int(np.where(has_tb1)[0][0])

# page_numbers = [find_page_number(i) for i in files]
page_numbers = [8, 8, 7, 8, 8, 8, 7, 8, 9, 8, 8, 7, 8, 8, 8, 9, 8, 8, 8, 7, 7, 9, 8]
tables = tabula.read_pdf(files[0], pages=page_numbers[0]+1)
tables

Ok, so this time, we get 5 columns, which isn’t quite right - it seems as if the major headers are determining the column structure.

Let’s see if we can define the table area and help things out.

tbl = tabula.read_pdf(files[0], pages=page_numbers[0]+1, area=[128, 35, 586, 575], pandas_options={'header': None})[0]
tbl
from itertools import chain
import pandas as pd

header = tabula.read_pdf(files[0], pages=page_numbers[0]+1, area=[89, 35, 128, 575], pandas_options={'header': None})[0]

def fix_headers(tbl):
  modifiers=[["", "Rel_imp."], ["Unadj_idx."]*3, ["Unadj_pct_chg."]*2, ["Seas_adj_pct_chg."]*3]
  modifiers=list(chain.from_iterable(modifiers))
  # https://stackoverflow.com/questions/11860476/how-to-unnest-a-nested-list
  
  spans=pd.Series([tbl.loc[0,0],
  ''.join(tbl.loc[1:2,1]),
  ''.join(tbl.loc[1:2,2]),
  ''.join(tbl.loc[1:2,3]),
  ''.join(tbl.loc[1:2,4]),
  ''.join(tbl.loc[0:3,5]),
  ''.join(tbl.loc[0:3,6]),
  ''.join(tbl.loc[0:3,7]),
  ''.join(tbl.loc[0:3,8]),
  ''.join(tbl.loc[0:3,9])])
  spans=spans.str.replace(r"\.", "", regex = True)
  spans=spans.str.replace("[ -]", "_", regex = True)
  
  # list + Series adds element-wise, prepending each modifier to its span
  return modifiers + spans

header = fix_headers(header)
header

def read_table_1(file, page_number):
  tbl    = tabula.read_pdf(file, pages=page_number+1, area=[128, 35, 586, 575], pandas_options={'header': None})[0]
  header = tabula.read_pdf(file, pages=page_number+1, area=[ 89, 35, 128, 575], pandas_options={'header': None})[0]
  header=fix_headers(header)
  tbl = tbl.rename(header, axis=1)
  return tbl

tables = [read_table_1(file, page_numbers[i]) for i,file in enumerate(files)]
tables[2]
1
Define a function to fix header text
2
First, create a nested list of modifiers that will repeat as many times as there are nested columns
3
Unnest the list of modifiers
4
Put the pieces of each header together properly
5
Clean up the header pieces a bit
6
Apply the function to one header to see if it works
7
Write a function to read table 1 from each report
8
First, read the contents of the table
9
Then, read in the header from the table
10
Fix the header using the fix_header function
11
Rename the columns of the table contents with the header values.

Then we just need to clean things up a bit more.

import pandas as pd
import os

report_date = pd.Series([os.path.basename(i) for i in files])
report_date = report_date.str.replace(r"cpi_|\.pdf$", "", regex=True)
report_date = pd.to_datetime(report_date, format="%m%d%Y")

cpi_data = pd.DataFrame()
for i,tbl in enumerate(tables):
  tbl['report_date'] = report_date[i]
  tbl = tbl.melt(id_vars=['Expenditure_category', 'report_date'], value_name='val', var_name='var')
  cols = pd.DataFrame(tbl['var'].str.split(r"\.").to_list(), columns=['varname', 'vardate'])
  tbl = pd.concat([tbl, cols], axis = 1)
  tbl = tbl.set_index(['report_date', 'Expenditure_category', 'vardate'])
  tbl = tbl.drop(['var'], axis=1)
  tbl_wide = tbl.pivot(columns='varname', values = 'val')
  tbl_wide = tbl_wide.reset_index()
  cpi_data = pd.concat([cpi_data, tbl_wide], axis=0)

cpi_data['Expenditure_category'] = cpi_data['Expenditure_category'].str.replace(r"[ \.]{1,}$", "", regex=True)

cpi_data.shape
cpi_data.head()
1
Determine the date of the report from the filename
2
Add a column with the corresponding report date to each table
3
Convert each table to long form with expenditure category and report date as ID columns.
4
Split the variable names from the period over which the variable is calculated. In theory, we should be able to determine the lag for each of these and not care about the date, but I don’t trust that the report has been that consistent over 2 years… paranoia.
5
Pivot wider, so that there’s a column for each variable name.
6
Bind all the tables together into a single table
7
Clean up the expenditure category names so that the dots are gone.

We could probably get this data cleaner – the lagged columns should be specified better, but this will do for now. Let’s at least do something interesting with this data that wouldn’t have been possible without reading data in from the tables.
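For instance, “specifying the lagged columns better” could mean turning span labels like Dec2022_Dec2023 into an end date plus an explicit lag in months. Here’s a sketch using only the standard library (`parse_lag` is a hypothetical helper, not part of the pipeline above):

```python
from datetime import datetime

def parse_lag(vardate):
    """Turn a span label into (end_date, lag_in_months).

    'Dec2023'          -> point-in-time value, lag 0
    'Nov2023_Dec2023'  -> one-month change
    'Dec2022_Dec2023'  -> twelve-month change
    """
    dates = [datetime.strptime(p, "%b%Y") for p in vardate.split("_")]
    end = dates[-1]
    if len(dates) == 1:
        return end, 0
    start = dates[0]
    return end, (end.year - start.year) * 12 + (end.month - start.month)
```

With that in hand, the percent-change columns could be keyed on (end date, lag) rather than on the raw label, which would survive small formatting changes between reports.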

tmp = cpi_data.query("~vardate.str.contains(r'_')")
tmp = tmp.assign(vardate = pd.to_datetime(tmp['vardate'], format="%b%Y"))

tmp2 = tmp.query('Expenditure_category.isin(["Energy", "Food", "Shelter", "Medical care services", "commodities", "Transportation services"])')

tmp2 = tmp2.assign(year = lambda x: x['vardate'].dt.year,
                   days = lambda x: (pd.to_datetime(x['year']+1, format='%Y') - 
                          pd.to_datetime(x['year'], format='%Y')).dt.days,
                   var_dec_date = lambda x: x.year + (x['vardate']-pd.to_datetime(x.year, format='%Y'))/ (x.days * pd.to_timedelta(1, unit="D")))

cat_repl = {'Medical care services':'Medical', 'commodities':'Goods', 'Transportation services':'Transit'}
tmp2=tmp2.rename(columns = {'Expenditure_category':'Category', 'var_dec_date': 'date'})
for old,new in cat_repl.items():
  tmp2.loc[:,'Category'] = tmp2.Category.str.replace(old, new, regex=False)

tmp_plot = tmp2[['date', 'Unadj_idx', 'Category']]
tmp_plot = tmp_plot.drop_duplicates()
tmp_plot = tmp_plot.assign(Unadj_idx = lambda x: pd.to_numeric(x.Unadj_idx))

import seaborn.objects as so
import seaborn as sns
import matplotlib.pyplot as plt

plot = sns.lineplot(data = tmp_plot, x = 'date', y = 'Unadj_idx', hue = 'Category')
plot.set(xlabel="Date", ylabel="Unadjusted Index")
plt.show()

A line chart with date on the x-axis spanning from June 2022 to June 2025 and the unadjusted index on the y-axis, ranging from approximately 170 to 625. Six lines are shown: Medical is the highest and has a slight increase over the period between June 2023 and June 2025, reaching just over 600. Transit is the next highest, and starts at around 360, increasing steadily with only a few wobbles to 400 by June 2025. Shelter costs are just below Transit, increasing from 350 to about 400 over the period of the graph in an almost entirely linear fashion. Food increases slightly from 300 to about 325. Energy starts at just above 300 and oscillates in an irregular fashion over the 3-year period. Finally, goods are almost entirely flat at perhaps 175.

Chart of the Unadjusted index values relating to shelter, goods, energy, food, medical, and transportation costs in the CPI-U calculation.
Other tools worth exploring include:

  • parsemypdf [17], a collection of AI-based parsing libraries
  • camelot, a Python library for table parsing

References

[1]
M. Klindt, “PDF/a considered harmful for digital preservation,” in iPRES 2017 conference proceedings, 2017 [Online]. Available: https://phaidra.univie.ac.at/o:931063. [Accessed: Jul. 15, 2025]
[2]
B. Edwards, “Why extracting data from PDFs is still a nightmare for data experts. Ars technica,” Mar. 11, 2025. [Online]. Available: https://arstechnica.com/ai/2025/03/why-extracting-data-from-pdfs-is-still-a-nightmare-for-data-experts/. [Accessed: Jul. 15, 2025]
[3]
Q. Zhang et al., “Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction.” arXiv, Oct. 28, 2024 [Online]. Available: http://arxiv.org/abs/2410.21169. [Accessed: Jul. 15, 2025]
[4]
J. B. Merrill, “Purifying the sea of PDF data, automatically. Medium,” Jul. 24, 2017. [Online]. Available: https://open.nytimes.com/purifying-the-sea-of-pdf-data-automatically-99e6043a09b3. [Accessed: Jul. 25, 2025]
[5]
Wikimedia contributors, “History of PDF,” Wikipedia. Oct. 30, 2024 [Online]. Available: https://en.wikipedia.org/w/index.php?title=History_of_PDF&oldid=1254285186. [Accessed: Jul. 15, 2025]
[6]
GNUpdf project, “Introduction to PDF,” Oct. 10, 2014. [Online]. Available: https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF. [Accessed: Jul. 15, 2025]
[7]
R. Hodson, PDF succinctly. Morrisville, NC: Syncfusion, Inc, 2012 [Online]. Available: https://web.archive.org/web/20140706124739/http://www.syncfusion.com/Content/downloads/ebook/PDF_Succinctly.pdf. [Accessed: Jul. 15, 2025]
[8]
J. C. King, “Adobe: Introduction to the insides of PDF,” Apr. 26, 2005 [Online]. Available: https://web.archive.org/web/20141212020737/http://www.adobe.com/content/dam/Adobe/en/technology/pdfs/PDF_Day_A_Look_Inside.pdf. [Accessed: Jul. 15, 2025]
[9]
J. Singer-Vine and The pdfplumber contributors, “Pdfplumber.” Jun. 2025 [Online]. Available: https://github.com/jsvine/pdfplumber. [Accessed: Jul. 15, 2025]
[10]
J. Ooms, “Pdftools: Text extraction, rendering and converting of PDF documents.” rOpenSci, Mar. 03, 2025 [Online]. Available: https://packages.ropensci.org/pdftools. [Accessed: Jul. 16, 2025]
[11]
S. Weil, R. Smith, and Z. Podobny, “Tesseract.” Google, Jul. 18, 2025 [Online]. Available: https://github.com/tesseract-ocr/tesseract. [Accessed: Jul. 18, 2025]
[12]
M. A. Lee and S. Hoffstaetter, “Pytesseract: Python-tesseract is a python wrapper for google’s tesseract-OCR.” Aug. 15, 2024 [Online]. Available: https://github.com/madmaze/pytesseract. [Accessed: Jul. 18, 2025]
[13]
J. Ooms, “Tesseract: Open source OCR engine.” 2025 [Online]. Available: https://CRAN.R-project.org/package=tesseract
[14]
M. Aristarán, M. Tigas, J. B. Merrill, J. Das, D. Frackman, and T. Swicegood, “Tabulapdf/tabula.” Tabula, Jul. 18, 2025 [Online]. Available: https://github.com/tabulapdf/tabula. [Accessed: Jul. 18, 2025]
[15]
vinayak-mehta and bosn, “Camelot: PDF table extraction for humans.” camelot-dev, Jul. 18, 2025 [Online]. Available: https://github.com/camelot-dev/camelot. [Accessed: Jul. 18, 2025]
[16]
Y. Dennis, “Tabula-py vs. Camelot: A duel of PDF table extraction titans. Medium,” Mar. 04, 2024. [Online]. Available: https://python.plainenglish.io/tabula-py-vs-camelot-a-duel-of-pdf-table-extraction-titans-61a534c5134d. [Accessed: Jul. 15, 2025]
[17]
R. Srivastava, “Genieincodebottle/parsemypdf.” Jul. 13, 2025 [Online]. Available: https://github.com/genieincodebottle/parsemypdf. [Accessed: Jul. 16, 2025]

  1. In the US, the Department of Justice’s interpretation of Title II of the Americans with Disabilities Act requires state and local government services (including education) to meet digital accessibility standards. Many universities directed professors to shift materials to Word or HTML instead of PDF, as making PDFs accessible requires proprietary Adobe software.↩︎

  2. I actually acquired 3 days worth of transcripts from a single trial, but each day has 200+ pages, so working with only one day seems reasonable.↩︎