Intelligent Digitization: Our Multi-Stage AI Pipeline
The Processing Pipeline
Our document processing is built on a multi-stage pipeline designed to maximize transcription accuracy. Each stage builds on the output of the last, giving the computationally intensive analysis stages a reliable foundation to work from.
FILE
Initial Ingestion: Collects basic metadata (filename, URL, record ID) for each document file directly from NARA sources. Establishes the master list.
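The source doesn't publish its schema, but a minimal sketch of the kind of entry the master list might hold looks like this (the field names are illustrative, not the project's):

```python
from dataclasses import dataclass

@dataclass
class FileEntry:
    """One entry in the master list (illustrative fields only)."""
    filename: str   # the document file's name as published by NARA
    url: str        # source URL for later download
    record_id: str  # NARA record ID this file belongs to
```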
RECORD
Document Grouping: Organizes individual files under their corresponding NARA record IDs, linking related documents.
Acquisition: Downloads the actual PDF documents from NARA archives, storing them locally and tracking download status.
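Together, the two RECORD steps amount to a group-by on record ID plus a status-tracked download. A minimal sketch, reusing the hypothetical FileEntry above (requests is an assumption; the source doesn't name an HTTP library):

```python
import requests
from collections import defaultdict

def group_by_record(entries):
    """Link files that share a NARA record ID."""
    records = defaultdict(list)
    for entry in entries:
        records[entry.record_id].append(entry)
    return records

def download(entry, dest_dir):
    """Fetch one PDF and return a status the pipeline can record."""
    try:
        resp = requests.get(entry.url, timeout=60)
        resp.raise_for_status()
    except requests.RequestException:
        return "failed"
    with open(f"{dest_dir}/{entry.filename}", "wb") as f:
        f.write(resp.content)
    return "downloaded"
```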
PAGE
Deconstruction & Initial Extraction: Extracts individual pages from PDFs. Attempts direct text extraction and native image extraction first. Saves page images as a fallback.
OCR
Baseline Transcription: Processes page images using state-of-the-art OCR (ABBYY) to generate an initial text layer. This serves as one input for later analysis.
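ABBYY's API depends on which of its products is in use, so the sketch below hides the engine behind a hypothetical run_abbyy_ocr helper; the point is only where the baseline text layer sits in the pipeline:

```python
def run_abbyy_ocr(image_path: str) -> str:
    """Hypothetical wrapper; the real call depends on the ABBYY SDK in use."""
    raise NotImplementedError

def baseline_transcribe(image_path: str, out_path: str) -> None:
    """Persist the OCR result as one input (not the final word) for ANALYZE."""
    with open(out_path, "w") as f:
        f.write(run_abbyy_ocr(image_path))
```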
ANALYZE
Quality Assessment & Strategy: Analyzes page images and initial OCR text. Uses vision models to assess quality, detect handwriting/markings, determine if image correction is needed, and assign processing paths.
TRANSCRIPT
AI-Powered Transcription: Applies the strategy chosen in the ANALYZE stage: cost-efficient models for cleaner pages, high-tier vision models for complex pages, and multi-pass processing where a single pass is not enough.
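The routing itself reduces to a small dispatch on the ANALYZE verdict. A sketch (the tier boundaries are illustrative, and the three callables stand in for the actual model calls):

```python
def transcribe(page_image, assessment, cheap, vision, multi_pass):
    """Dispatch a page to a transcription tier based on its assessment."""
    if assessment["quality"] == "severe":
        return multi_pass(page_image)   # three-step recovery, sketched below
    if assessment["quality"] == "degraded" or assessment["handwriting"]:
        return vision(page_image)       # high-tier vision model
    return cheap(page_image)            # cost-efficient model for clean pages
```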
TXT
Raw Text Aggregation: Saves the final, processed text content for each individual page as a raw .txt file.
MD
Structured Document Assembly: Reconstructs the full document into a standardized Markdown format with YAML metadata, document abstract, summary, and page-by-page content.
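A minimal assembler might look like the following; the exact frontmatter fields and section headings are illustrative, since the source only specifies YAML metadata, an abstract, a summary, and page-by-page content:

```python
def assemble_markdown(record_id, title, abstract, summary, pages):
    """Rebuild one document as Markdown with YAML frontmatter."""
    front = "\n".join([
        "---",
        f"record_id: {record_id}",  # real code should YAML-escape values
        f"title: {title}",
        "---",
    ])
    sections = [front,
                f"## Abstract\n\n{abstract}",
                f"## Summary\n\n{summary}"]
    for i, text in enumerate(pages, start=1):
        sections.append(f"## Page {i}\n\n{text}")
    return "\n\n".join(sections)
```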
SUMMARY
Content Synthesis: Uses LLMs to analyze the complete text of a document, extracting key information, dates, entities, and connections to other documents.
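A single synthesis call over the assembled text is enough to sketch the idea (prompt wording and model are again our assumptions):

```python
from openai import OpenAI

client = OpenAI()

def summarize(full_text):
    """One-shot extraction of key facts, dates, entities, and cross-references."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption
        messages=[{"role": "user", "content": (
            "From the document below, extract key information, dates, named "
            "people/organizations/places, and references to other records.\n\n"
            + full_text)}],
    )
    return resp.choices[0].message.content
```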
Our AI Strategy
We combine several techniques to maximize the accuracy of our transcriptions:
Best-in-Class Baseline
We start with the best available commercial OCR (ABBYY) to get an initial text layer, acknowledging its limitations but using it as a foundation.
Leveraging Prior Work
We incorporate some of the best existing transcriptions as additional data points for comparison, building upon previous successful efforts.
AI Vision Analysis
Before final transcription, advanced vision AI models analyze each page image to assess quality, predict accuracy, and determine the optimal processing path.
Tiered AI Transcription
We choose the AI model to match document complexity: efficient models for clean pages, high-tier vision models for complex or degraded documents.
Multi-Pass Processing
For severely degraded pages, we use a three-step process: describe the page layout, segment and transcribe specific regions, then recombine the information.
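In code, the three steps reduce to describe, transcribe-per-region, recombine. A sketch with the model calls abstracted behind hypothetical callables:

```python
def multi_pass_transcribe(page_image, describe, transcribe_region, crop):
    """Three-step recovery for severely degraded pages."""
    regions = describe(page_image)  # 1: vision model returns region boxes
    parts = [transcribe_region(crop(page_image, r))  # 2: per-region pass
             for r in regions]
    return "\n\n".join(parts)       # 3: recombine in reading order
```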
Abstract Understanding
When literal transcription is impossible, we prioritize capturing the intended meaning and key information (the "5 Ws") over character-for-character accuracy.
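In practice this is a prompting decision; one illustrative instruction (our wording, not the project's):

```python
FIVE_WS_PROMPT = (
    "This page cannot be transcribed verbatim. Report what can still be "
    "determined: who is involved, what happened, when, where, and why, "
    "plus any legible names, dates, or reference numbers. Mark anything "
    "uncertain as [illegible]."
)
```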
Future Enhancements
Our pipeline is designed to be adaptable and extensible. Future enhancements include:
- Audio versions of documents (MP3 stage)
- Richer document descriptions and metadata (DESCRIPT stage)
- HTML and WIKI format conversions for enhanced accessibility
- Adaptation for MLK and RFK assassination files upon their release