Hidden Multiple Comparisons Increase Forensic Error Rates

Susan Vanderplas

2025-08-04

Outline

  • Motivation

  • CSI: Wire Cuts

  • What is the probability of a false positive?

  • Conclusions

This research is joint work with Heike Hofmann (Nebraska-Lincoln) and Alicia Carriquiry (Iowa State).

Slides Link

Motivation

Motivation

At trial, O’Neil testified that tools like the wire cutters found in Genrich’s residence were the only tools that could have been used to make the pipe bombs. “Agent O’Neil opined that the three of Mr. Genrich’s tools were the only tools in the world that could have made certain marks found on pieces from the four bombs,” Judge Gurley wrote in Monday’s opinion.

O’Neil was asked specifically at trial what he meant by the phrase “to the exclusion of any other tool.” “That the individual jaw, the location within that jaw on that particular side, was identified as having cut the wire in question to a degree of certainty to exclude any other tool,” he said, according to court transcripts referenced in Gurley’s decision. O’Neil said this was true of needle-nosed pliers used on the bomb, slip-joint pliers, as well as the wire cutter.

Source (emphasis added)

CSI: Wire Cuts

CAT5 crimper with wire cutter

Cut wires

Digital scan of blade cut and wire cut

Examiner Method

  • Between 2 and 4 cutting surfaces for each tool, \(s \in \{2, 4\}\)

  • Wires have 1-2 striated surfaces, \(w \in \{1, 2\}\)

Visual Examination

  • Compare sequential wire cuts or blade cuts to recovered wire

  • Examiner uses a comparison microscope to see both cuts, aligns striae manually

  • At least \(N_{ij} = b_i/d_j\) comparisons for each blade surface \(i\) (of length \(b_i\)) and cut wire surface \(j\) (of width \(d_j\))

  • Manual alignment means we can only estimate the minimum comparisons

A diagram showing a blade cut (long, thin) with striations compared to a wire cut (semi-circle shape, short) aligned with the blade cut. There are multiple non-overlapping positions along the blade cut where the wire cut could fit, and each is shown with an empty hemisphere.

Adjacent comparisons are non-overlapping \(\Rightarrow\) “independent”

Algorithmic Method

Scans → Cross-section → Signature (gross topography removed)

Cross-correlation is used to align the signature from the blade cut surface with the signature from the wire cut surface.
Image credit: Heike Hofmann
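
To make the alignment step concrete, here is a minimal sketch of cross-correlation alignment in Python with NumPy. The signatures are simulated stand-ins (real signatures come from the scans), and the function name `best_alignment` is purely illustrative, not part of any published pipeline.

```python
import numpy as np

# Simulated stand-ins: blade_sig is the long signature from a blade cut surface;
# wire_sig is the shorter signature from the recovered wire, hidden at offset 750.
rng = np.random.default_rng(0)
blade_sig = rng.normal(size=2000)
wire_sig = blade_sig[750:850] + rng.normal(scale=0.1, size=100)

def best_alignment(blade, wire):
    """Slide the wire signature along the blade signature and score every lag."""
    n = len(wire)
    scores = np.array([
        np.corrcoef(blade[lag:lag + n], wire)[0, 1]
        for lag in range(len(blade) - n + 1)
    ])
    return scores.argmax(), scores.max(), len(scores)

lag, score, n_lags = best_alignment(blade_sig, wire_sig)
print(f"best lag = {lag}, correlation = {score:.3f}, lags examined = {n_lags}")
```

Every lag examined is effectively one comparison, and only the best-scoring one is reported; this maximization over thousands of candidate positions is where the hidden multiple comparisons come from.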

Number of Comparisons

In our example,

Minimum
  • blade size: \(b = 15\) mm

  • \(s = 2\) cutting surfaces/blade

  • wire size: \(d = 2\) mm

  • \(w = 1\) wire surface

  • 7.5 comparisons per side

  • 15 comparisons total

Maximum
  • blade size: \(b = 15{,}000\ \mu\text{m}\)

  • \(s = 2\) cutting surfaces/blade

  • wire size: \(d = 2{,}000\ \mu\text{m}\)

  • \(w = 1\) wire surface

  • resolution: \(r = 0.645\ \mu\text{m/px}\)

  • 20,156 comparisons per side

  • 40,312 comparisons total

This assumes there’s a single wire and a single possible tool.
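
As a sanity check, the counts above can be reproduced with a few lines of Python (a sketch using the values on this slide, and assuming the maximum count allows one comparison per pixel-resolution offset, i.e. \((b - d)/r + 1\) positions per surface):

```python
# Minimum: non-overlapping placements of a d = 2 mm wire along a b = 15 mm blade
b_mm, d_mm, s, w = 15, 2, 2, 1
min_per_side = b_mm / d_mm          # 7.5 comparisons per cutting surface
min_total = min_per_side * s * w    # 15 comparisons total

# Maximum: one comparison per pixel-resolution offset along the blade
b_um, d_um, r = 15_000, 2_000, 0.645    # lengths in micrometers, resolution in um/px
max_per_side = (b_um - d_um) / r + 1    # ~20,156 comparisons per cutting surface
max_total = max_per_side * s * w        # ~40,312 comparisons total

print(min_per_side, min_total, round(max_per_side), round(max_total))
# 7.5 15.0 20156 40312
```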

Hidden Comparisons

  • We accept that there’s a false positive error rate with any method

  • Alignment produces hidden multiple comparisons

  • The standard wire comparison process produces many more comparisons

    • multiple angles

    • multiple substrate materials (sometimes)

    • multiple potential tools

  • Wires from crime scenes may be fragmented or damaged

Personal Motivation

An annotated picture of a messy garage, showing all of the tools which could conceivably make striated toolmarks, including visegrips, pliers, bolt cutters, pry bars, electrical tools, razor blades, hammers, tin snips, saws, and more.

My house has \(\approx\) 982 cm of (easily accessible) blade surface which might be used to cut wires. Not shown: the craft room, the kitchen, and the garden shed. My dad’s shop has \(\approx\) 2243 cm of blade surface. No one in either house is a professional craftsperson.

So How Bad Is the Problem?

For a per-comparison false positive error rate \(e\) and \(n\) independent comparisons,

\[P(\text{at least one false positive}) = 1 - \left(1 - e\right)^n\]
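
A few illustrative values (a Python sketch; the error rates are the ones discussed on the next slide, and independence between comparisons is assumed):

```python
def p_at_least_one_false_positive(e, n):
    """Family-wise probability of at least one false positive in n independent comparisons."""
    return 1 - (1 - e) ** n

# e: per-comparison false positive rate; n: number of (hidden) comparisons
for e in (0.0045, 0.02, 0.0724):
    for n in (10, 100, 1000, 40_312):
        print(f"e = {e:.2%}, n = {n:>6}: P(>=1 false positive) = "
              f"{p_at_least_one_false_positive(e, n):.1%}")
```

With the pooled 2% estimate, even 10 comparisons push the family-wise rate above 18%, and the 40,312 comparisons from the earlier slide make a false positive essentially certain.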

So How Bad Is the Problem?

  • Estimated false positive error rates for striated comparisons from bullets and firing pins: 0.45%–7.24%.

    • Not enough error rate studies exist for wires
    • Bullets have additional structure that facilitates alignment

Pooled estimate: 2%

  • probably an underestimate for wires (shorter striae sequences, more dependence on angle)

So How Bad Is the Problem?

If we want to ensure the family-wise false positive rate is under 10%…

A table showing the family-wise false positive rate for N comparisons under several estimates of the striated-comparison error rate. At a per-comparison error rate of 7%, more than 50% of 10-comparison examinations would be expected to contain a false positive; at 2%, more than 18% would. Additional columns give the corresponding rates for 100 and 1,000 comparisons, and the final column gives the maximum number of comparisons that can be performed at each error rate while keeping the family-wise false positive rate under 10%.
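
The final column of that table can be reproduced by solving \(1 - (1 - e)^n \le 0.10\) for \(n\); a short Python sketch (again assuming independent comparisons):

```python
import math

def max_comparisons(e, fwer=0.10):
    """Largest n with 1 - (1 - e)**n <= fwer, assuming independent comparisons."""
    return math.floor(math.log(1 - fwer) / math.log(1 - e))

for e in (0.0045, 0.02, 0.0724):
    print(f"e = {e:.2%}: at most {max_comparisons(e)} comparisons keep the "
          f"family-wise false positive rate under 10%")
```

At the pooled 2% estimate, only 5 comparisons can be made before the 10% budget is spent, far fewer than even the minimum of 15 computed earlier.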

Conclusions

  • Wire cut forensics are problematic

but… this problem shows up in database searches, too!

  • IAFIS (Integrated Automated Fingerprint ID System)
  • NIBIN (Ballistics Database)
  • NDIS (National DNA ID System) and CODIS (Combined DNA ID System)
  • PDQ (Paint Data Query)
  • FISH (Forensic ID System for Handwriting)

Conclusions

  • Automated intelligence algorithms are not a good substitute for investigative work

  • Example: Oregon

    • 3 firearms examiners for the state
    • Most firearms evidence run through NIBIN first
    • Promising leads forwarded to examiners for manual assessment

This is a setup that could generate false positives by design!

Conclusions

  • Examiners should report and Defense Attorneys should require

    • overall length or area of surfaces generated during the examination process (\(b\))
    • total consecutive length/area of recovered evidence (\(d\))
  • Studies relating length/area of comparison surface to error rates are essential!

    • No available black-box error rate for wire cuts
    • Studies should be difficult, like casework!
  • Any database search used at any stage of the process should be disclosed along with

    • \(N\) items in the database used for comparison
    • Number of results returned as ‘similar’ (top 20? top 5?)
    • Protocols for confirmatory assessment

Caveats

  • It is important to distinguish between matching algorithms and intelligence algorithms
  • Matching algorithms compare the similarity of two pieces of evidence
    • no database searches are involved
    • the algorithm may use a training data set (which is different from a searched database)
  • Defense Attorneys should be familiar with protocols for evidence assessment in the jurisdiction
    • What algorithms were used
    • What databases were searched
    • Were examiners provided with automatic database search results at the start?

Questions?

PNAS Paper Link

This work was partially funded by the Center for Statistics and Applications in Forensic Evidence (CSAFE) through Cooperative Agreements 70NANB15H176 and 70NANB20H019 between NIST and Iowa State University, which includes activities carried out at Carnegie Mellon University, Duke University, University of California Irvine, University of Virginia, West Virginia University, University of Pennsylvania, Swarthmore College and University of Nebraska, Lincoln.