With large-scale AI systems, the question of “How did we get to this conclusion?” is just as important as the conclusion itself. The history of the data—the provenance—is part of the artifact and the analysis. Provenance is the key to making AI systems both usable and explainable.
Starting with my PhD work on Teleoscope, a large-scale document curation system, I build provenance-tracking human-in-the-loop AI systems for communities that need to make careful and rigorous empirical decisions based on multimodal data. As Informatics Curator at the Beaty Biodiversity Museum, I ensure that our data is not just archived, but archival so that future scientists can trace the history of the data.
1. Human-in-the-Loop OCR Transcription
The Beaty has a wonderful archive of hundreds of letters sent between collectors detailing their trips to find new and interesting species of plants and animals in the Pacific Northwest. Some of these letters are typed, and some are handwritten. To perform any sort of analysis, the letters must first be transcribed. Some of the transcription can be done by machine, but much of it must be done by hand.
We are developing a provenance-tracking, human-in-the-loop, agentic AI OCR transcription pipeline for these letters. That is a mouthful, so to unpack it: we want to track and tune operations such as background removal and text segmentation, leverage handwriting samples for difficult passages, use LLMs for text correction and review, and finally allow humans to verify and update the transcriptions.
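To make the provenance-tracking idea concrete, here is a minimal sketch of one way a transcription could carry its own history: every operation, whether performed by a tool, a model, or a person, is appended to the record before its output is accepted. All names here (the classes, the operation labels, the actor strings) are illustrative assumptions, not the actual pipeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceStep:
    """One operation applied to a letter image or its transcript."""
    operation: str   # e.g. "background_removal", "ocr", "llm_correction"
    actor: str       # tool name, model name, or human reviewer
    params: dict     # settings used, so the step can be re-run
    timestamp: str

@dataclass
class Transcription:
    letter_id: str
    text: str
    history: list = field(default_factory=list)

    def apply(self, operation, actor, params, new_text):
        """Record the step before accepting its output, so every
        version of the text can be traced back to how it was made."""
        self.history.append(ProvenanceStep(
            operation, actor, params,
            datetime.now(timezone.utc).isoformat()))
        self.text = new_text

# Hypothetical run: machine OCR, then LLM cleanup, then human review.
t = Transcription("letter-0042", "")
t.apply("ocr", "ocr-engine", {"page_mode": "single_column"}, "Dear Mr. Brown ...")
t.apply("llm_correction", "llm-reviewer", {"task": "fix OCR errors"}, "Dear Mr. Brown, ...")
t.apply("human_review", "curator", {}, "Dear Mr. Browne, ...")
```

The point of the sketch is that the final text and its chain of custody live in the same object, so "How did we get to this transcription?" is always answerable.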
2. Reconciling Data Authorities
The Beaty relies on many worldwide data authorities that provide important scientific information such as atlases, taxonomic trees, and citizen science specimen photo archives. However, each of these authorities carries a socially contingent level of trust that is hard to compute: some are up to date, some contain conflicting specimen identifications, and some are full of errors. All of them must also be reconciled with the Beaty's internal data.
We are developing an agentic AI pipeline that tracks changes to these worldwide data sources along with our level of trust in each. As the data evolves, we want to see how our trust evolves with it.
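One very simplified way to picture "trust that evolves with the data" is to snapshot an authority at each version and derive a score from how often it agrees with our internal determinations. The source name, the snapshot fields, and the scoring rule below are all hypothetical; a real trust measure would be socially weighted, not purely computed from agreement rates.

```python
from dataclasses import dataclass

@dataclass
class AuthoritySnapshot:
    source: str     # illustrative authority name, not a real endpoint
    version: str
    conflicts: int  # disagreements with our internal determinations
    total: int      # records compared in this snapshot

def trust(snapshot):
    """Toy trust score: the fraction of compared records that agree
    with our internal data. A stand-in for a richer, curated measure."""
    return 1 - snapshot.conflicts / snapshot.total

# As the authority corrects its records between versions,
# the recorded trust rises alongside the data itself.
history = [
    AuthoritySnapshot("ExampleAtlas", "2023-01", conflicts=40, total=1000),
    AuthoritySnapshot("ExampleAtlas", "2024-01", conflicts=12, total=1000),
]
scores = [trust(s) for s in history]
```

Keeping the snapshots rather than only the latest score is what makes the trust itself provenance-tracked: we can explain not just how much we trust a source, but why, and since when.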
3. Navigating and Representing Biological Uncertainty
Many biological data types have an inherent uncertainty to them. Geographic locations seem precise when using latitude-longitude coordinates, but place names change over time, species move, and collectors forget where they were. Similarly, taxonomic determinations are often tentative or will change as new data emerges. Yet, current data standards and databasing methods assume that what is written down is fact, losing the sense of contingency that a note scrawled on a piece of paper might have.
We are developing methods for representing and assessing biological uncertainty in database records. Part of this is simply ensuring that histories of change and measures of precision are recorded, but new data standards are needed to make sense of these changes.
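As a sketch of what "recording precision and histories of change" might look like, the record below keeps an explicit uncertainty radius and basis with every georeference, and keeps every taxonomic determination rather than only the latest one. The field names loosely echo existing Darwin Core terms such as coordinateUncertaintyInMeters and identificationQualifier, but the classes and values are illustrative, not a proposed standard.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Georeference:
    lat: float
    lon: float
    uncertainty_m: float  # radius of plausible error, instead of false precision
    basis: str            # how the coordinates were derived

@dataclass
class Determination:
    taxon: str
    qualifier: Optional[str]  # e.g. "cf." marks a tentative identification
    determiner: str
    date: str

@dataclass
class SpecimenRecord:
    catalog_number: str
    georeference: Georeference
    determinations: list = field(default_factory=list)  # full history, newest last

# A collector's vague note becomes a wide radius, not a fake-precise point,
# and a later re-identification is appended without erasing the first.
rec = SpecimenRecord(
    "V123456",
    Georeference(49.26, -123.25, uncertainty_m=5000,
                 basis="gazetteer lookup of a place name in the field notes"),
)
rec.determinations.append(Determination("Abies grandis", "cf.", "J. Doe", "1952-06-01"))
rec.determinations.append(Determination("Abies amabilis", None, "R. Roe", "2021-03-15"))
```

The design choice is that contingency is first-class: the tentative "cf." and the 5 km radius survive into the database instead of being flattened into apparent fact.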