Intelligent Digitization: Our Multi-Stage AI Pipeline
The Processing Pipeline
Our document processing is built on a multi-stage pipeline designed to maximize transcription accuracy. Each stage builds on the output of the last, giving the computationally intensive analysis stages a reliable foundation to work from.
FILE
Initial Ingestion: Collects basic metadata (filename, URL, record ID) for each document file directly from NARA sources. Establishes the master list.
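The source doesn't publish its schema, but a minimal sketch of the kind of entry the master list might hold looks like this (the field names are illustrative, not the project's):

```python
from dataclasses import dataclass

@dataclass
class FileEntry:
    """One entry in the master list (illustrative fields only)."""
    filename: str   # the document file's name as published by NARA
    url: str        # source URL for later download
    record_id: str  # NARA record ID this file belongs to
```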
RECORD
Document Grouping: Organizes individual files under their corresponding NARA record IDs, linking related documents.
Acquisition: Downloads the actual PDF documents from NARA archives, storing them locally and tracking download status.
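Together, the two RECORD steps amount to a group-by on record ID plus a status-tracked download. A minimal sketch, reusing the hypothetical FileEntry above (requests is an assumption; the source doesn't name an HTTP library):

```python
import requests
from collections import defaultdict

def group_by_record(entries):
    """Link files that share a NARA record ID."""
    records = defaultdict(list)
    for entry in entries:
        records[entry.record_id].append(entry)
    return records

def download(entry, dest_dir):
    """Fetch one PDF and return a status the pipeline can record."""
    try:
        resp = requests.get(entry.url, timeout=60)
        resp.raise_for_status()
    except requests.RequestException:
        return "failed"
    with open(f"{dest_dir}/{entry.filename}", "wb") as f:
        f.write(resp.content)
    return "downloaded"
```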
PAGE
Deconstruction & Initial Extraction: Extracts individual pages from PDFs. Attempts direct text extraction and native image extraction first. Saves page images as a fallback.
OCR
Baseline Transcription: Processes page images using state-of-the-art OCR (ABBYY) to generate an initial text layer. This serves as one input for later analysis.
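ABBYY's API depends on which of its products is in use, so the sketch below hides the engine behind a hypothetical run_abbyy_ocr helper; the point is only where the baseline text layer sits in the pipeline:

```python
def run_abbyy_ocr(image_path: str) -> str:
    """Hypothetical wrapper; the real call depends on the ABBYY SDK in use."""
    raise NotImplementedError

def baseline_transcribe(image_path: str, out_path: str) -> None:
    """Persist the OCR result as one input (not the final word) for ANALYZE."""
    with open(out_path, "w") as f:
        f.write(run_abbyy_ocr(image_path))
```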
ANALYZE
Quality Assessment & Strategy: Analyzes page images and initial OCR text. Uses vision models to assess quality, detect handwriting/markings, determine if image correction is needed, and assign processing paths.
TRANSCRIPT
AI-Powered Transcription: Applies the strategy chosen in the ANALYZE stage: cost-efficient models for cleaner pages, high-tier vision models for complex pages, and multi-pass processing where a single pass is not enough.
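The routing itself reduces to a small dispatch on the ANALYZE verdict. A sketch (the tier boundaries are illustrative, and the three callables stand in for the actual model calls):

```python
def transcribe(page_image, assessment, cheap, vision, multi_pass):
    """Dispatch a page to a transcription tier based on its assessment."""
    if assessment["quality"] == "severe":
        return multi_pass(page_image)   # three-step recovery, sketched below
    if assessment["quality"] == "degraded" or assessment["handwriting"]:
        return vision(page_image)       # high-tier vision model
    return cheap(page_image)            # cost-efficient model for clean pages
```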
TXT
Raw Text Aggregation: Saves the final, processed text content for each individual page as a raw .txt file.
MD
Structured Document Assembly: Reconstructs the full document into a standardized Markdown format with YAML metadata, document abstract, summary, and page-by-page content.
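A minimal assembler might look like the following; the exact frontmatter fields and section headings are illustrative, since the source only specifies YAML metadata, an abstract, a summary, and page-by-page content:

```python
def assemble_markdown(record_id, title, abstract, summary, pages):
    """Rebuild one document as Markdown with YAML frontmatter."""
    front = "\n".join([
        "---",
        f"record_id: {record_id}",  # real code should YAML-escape values
        f"title: {title}",
        "---",
    ])
    sections = [front,
                f"## Abstract\n\n{abstract}",
                f"## Summary\n\n{summary}"]
    for i, text in enumerate(pages, start=1):
        sections.append(f"## Page {i}\n\n{text}")
    return "\n\n".join(sections)
```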
SUMMARY
Content Synthesis: Uses LLMs to analyze the complete text of a document, extracting key information, dates, entities, and connections to other documents.
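A single synthesis call over the assembled text is enough to sketch the idea (prompt wording and model are again our assumptions):

```python
from openai import OpenAI

client = OpenAI()

def summarize(full_text):
    """One-shot extraction of key facts, dates, entities, and cross-references."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption
        messages=[{"role": "user", "content": (
            "From the document below, extract key information, dates, named "
            "people/organizations/places, and references to other records.\n\n"
            + full_text)}],
    )
    return resp.choices[0].message.content
```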
Our AI Strategy
We combine several techniques to maximize the accuracy of our transcriptions:
Best-in-Class Baseline
We start with the best available commercial OCR (ABBYY) to get an initial text layer, acknowledging its limitations but using it as a foundation.
Leveraging Prior Work
We incorporate some of the best existing transcriptions as additional data points for comparison, building upon previous successful efforts.
AI Vision Analysis
Before final transcription, advanced vision AI models analyze each page image to assess quality, predict accuracy, and determine the optimal processing path.
Tiered AI Transcription
We choose the AI model to match document complexity: efficient models for clean pages, high-tier vision models for complex or degraded documents.
Multi-Pass Processing
For severely degraded pages, we use a three-step process: describe the page layout, segment and transcribe specific regions, then recombine the information.
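In code, the three steps reduce to describe, transcribe-per-region, recombine. A sketch with the model calls abstracted behind hypothetical callables:

```python
def multi_pass_transcribe(page_image, describe, transcribe_region, crop):
    """Three-step recovery for severely degraded pages."""
    regions = describe(page_image)  # 1: vision model returns region boxes
    parts = [transcribe_region(crop(page_image, r))  # 2: per-region pass
             for r in regions]
    return "\n\n".join(parts)       # 3: recombine in reading order
```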
Abstract Understanding
When literal transcription is impossible, we prioritize capturing the intended meaning and key information (the "5 Ws") over character-for-character accuracy.
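In practice this is a prompting decision; one illustrative instruction (our wording, not the project's):

```python
FIVE_WS_PROMPT = (
    "This page cannot be transcribed verbatim. Report what can still be "
    "determined: who is involved, what happened, when, where, and why, "
    "plus any legible names, dates, or reference numbers. Mark anything "
    "uncertain as [illegible]."
)
```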
Future Enhancements
Our pipeline is designed to be adaptable and extensible. Future enhancements include:
- Audio versions of documents (MP3 stage)
- Richer document descriptions and metadata (DESCRIPT stage)
- HTML and WIKI format conversions for enhanced accessibility
- Adaptation for MLK and RFK assassination files upon their release