Improving Document Content Extraction with Multi-Modal LLM

A Case Study on Enhancing Our Ingestion Service

By Shelby Ramsey, May 9, 2025

We recently tackled a critical challenge in our ingestion pipeline: extracting content from images, graphs, and tables embedded in PDF, Word, and PowerPoint files. The issue had a significant impact on downstream features such as search and summarization. Many complex assets, especially PDFs, images, and presentations, were not processed accurately, leading to noisy or incomplete extractions and undermining user trust.

This problem was initially flagged through a combination of user feedback and monitoring alerts. Users reported missing text and improperly segmented outputs. Dashboards revealed unusually low embedding counts and degraded asset quality, suggesting silent, systemic failures to extract the full content of documents and images.

Our goal was to enable reliable, high-fidelity extraction from diverse file types without introducing significant latency or cost increases. We also aimed to avoid truncation from token limits and improve handling of image-heavy or scanned documents.

Debugging the Problem

Our investigation relied on real-world failures, job logs, and SQL queries. Key findings included:

  • Extraction Failures for Scanned Documents: Scanned images or image-based PDFs returned garbled text or huge, unstructured chunks.
  • Image and Table Handling: Tables and graphs in documents were either skipped or incorrectly extracted.
  • Legacy Tika Comparison: Our legacy Tika fallback, which relies on traditional OCR, failed outright on complex content, especially scanned or visually heavy assets.

The Fix and Implementation

1. Moving to a Vision-Enabled LLM After Evaluating Options

We introduced a microservice that:

  • Downloads the full PDF via a signed URL
  • Splits PDF files into 4–20 page chunks
  • Stores the resulting “shards” in cloud storage
  • Maintains shard metadata for idempotent reprocessing

This reduced the chance of hitting output token limits, allowed parallel extraction, and enabled recovery from partial failures.
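
As a concrete illustration, here is a minimal sketch of the shard-planning step in Go. The Shard record and planShards helper are hypothetical stand-ins for our internal types, and the actual PDF splitting and cloud upload are elided:

package extract

import "fmt"

// Shard describes one 4–20 page slice of a source PDF. StorageKey points at
// the uploaded shard in cloud storage; Status lets reprocessing skip shards
// that already completed, which keeps retries idempotent.
type Shard struct {
    DocID      string
    Index      int
    FirstPage  int // 1-based, inclusive
    LastPage   int // inclusive
    StorageKey string
    Status     string // "pending" | "extracted" | "failed"
}

// planShards computes page ranges of at most maxPages per shard so each
// model call stays under the output token budget.
func planShards(docID string, totalPages, maxPages int) []Shard {
    var shards []Shard
    for first := 1; first <= totalPages; first += maxPages {
        last := first + maxPages - 1
        if last > totalPages {
            last = totalPages
        }
        shards = append(shards, Shard{
            DocID:      docID,
            Index:      len(shards),
            FirstPage:  first,
            LastPage:   last,
            StorageKey: fmt.Sprintf("shards/%s/%04d.pdf", docID, len(shards)),
            Status:     "pending",
        })
    }
    return shards
}

Because each shard carries its own status, a retried job can skip any shard that already completed instead of re-extracting the whole document.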

2. Enhanced Prompt Engineering for Images and Tables

We refined prompts to better interpret visual content:

  • Describe shapes, layout, and colors
  • Summarize primary content and annotate key entities
  • Extract text, captions, and structured data

Example Markdown Output for an Image:

## Visual Elements
The image features a yellow, round shape representing a face.

## Primary Content
The main focus is the affirmative message conveyed through the word "YES".

## Text Content
YES

## Key Entities
- Character: Yellow blob-like cartoon
- Sign: White rectangle with "YES" in green
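
Pinning the model to a fixed Markdown skeleton is what makes outputs like the one above consistent across assets. Here is a sketch of how such a prompt could be assembled, continuing the extract package from the sketch above; the template text is illustrative, not our production prompt:

import "strings"

// imagePromptSections are the Markdown headings the model is told to emit,
// matching the example output above.
var imagePromptSections = []string{
    "Visual Elements", // shapes, layout, colors
    "Primary Content", // short summary of the main subject
    "Text Content",    // all visible text, verbatim
    "Key Entities",    // bulleted list of notable entities
}

// buildImagePrompt renders the instruction block sent alongside each image
// or page, forcing the output into a fixed Markdown skeleton.
func buildImagePrompt() string {
    var b strings.Builder
    b.WriteString("Describe this image in Markdown using exactly these sections:\n")
    for _, s := range imagePromptSections {
        b.WriteString("## " + s + "\n")
    }
    b.WriteString("Transcribe all visible text verbatim and render tables as Markdown.")
    return b.String()
}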

3. Extractor Interface Integration

We implemented a new Extractor interface:

// Extractor abstracts any extraction backend (the legacy Tika path or the
// new LLM-based service) behind a single method.
type Extractor interface {
    Extract(ctx context.Context, request *ExtractionRequest) (*Extracted, error)
}

The ingestion service dynamically routes extraction jobs through this interface to the LLM, applying sharding and improved prompts.
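
As a sketch of an LLM-backed implementation of this interface (continuing the same package, with a context import): ExtractionRequest, Extracted, llmClient, and LLMExtractor below are illustrative placeholders showing only the fields the sketch needs, and buildImagePrompt comes from the prompt sketch above:

// ExtractionRequest and Extracted stand in for the real request and
// response types; only the fields this sketch needs are shown.
type ExtractionRequest struct {
    ShardURL string // signed URL of one shard
}

type Extracted struct {
    Markdown string
}

// llmClient stands in for whatever vision-model client the service wraps;
// Describe sends a file (by signed URL) plus a prompt and returns Markdown.
type llmClient interface {
    Describe(ctx context.Context, fileURL, prompt string) (string, error)
}

// LLMExtractor satisfies Extractor by sending each shard, together with
// the structured prompt, to a vision-enabled model.
type LLMExtractor struct {
    client llmClient
}

func (e *LLMExtractor) Extract(ctx context.Context, request *ExtractionRequest) (*Extracted, error) {
    md, err := e.client.Describe(ctx, request.ShardURL, buildImagePrompt())
    if err != nil {
        return nil, err // token-limit errors are classified by the caller (next section)
    }
    return &Extracted{Markdown: md}, nil
}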

4. Explicit Error Handling for Token Limits

To improve user experience, we exposed meaningful failure messages:

switch {
case errors.Is(err, extract.ErrTokensExceeded.New("")):
    // A token-limit overflow is not recoverable by retrying, so the job is
    // marked fatally failed and the reason is surfaced to the frontend.
    done.FatalError(err)
    return &done
default:
    // All other errors follow the standard failure path.
    done.Error(err)
    return &done
}

This allowed frontends to inform users when a document exceeded extractable limits.
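
The snippet above uses our typed-error helper (ErrTokensExceeded.New). With only the Go standard library, the same classification could be expressed with a wrapped sentinel error, roughly as follows; checkTokenBudget and its parameters are illustrative:

// ErrTokensExceeded marks a shard whose content would overflow the model's
// output token budget; it is fatal and non-retryable.
var ErrTokensExceeded = errors.New("extraction exceeds model token limit")

// checkTokenBudget wraps the sentinel so callers can classify the failure
// with errors.Is while keeping shard-specific detail in the message.
func checkTokenBudget(shardID string, tokens, limit int) error {
    if tokens > limit {
        return fmt.Errorf("shard %s: %d tokens exceeds limit of %d: %w",
            shardID, tokens, limit, ErrTokensExceeded)
    }
    return nil
}

Callers then branch with errors.Is(err, ErrTokensExceeded), which is exactly the role the switch above plays with our typed-error variant.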

Validating the Fix

Quantitative Results:

  • Embedding counts for large documents rose dramatically, from under 10 to over 260.
  • Extraction times averaged ~4 seconds per page.
  • The PDF we use as a benchmark now yields 51 coherent chunks, up from 10.

Qualitative Improvements:

  • Tables are rendered as Markdown or HTML, depending on complexity.
  • Images now yield entity-rich, structured Markdown.
  • Mathematical formulas and code snippets are preserved without truncation.

Production Readiness:

  • Retry logic resumes jobs from incomplete shards (a sketch follows this list).
  • Conversion of unsupported formats (e.g., PPTX to PDF) is integrated seamlessly.
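
The resume behavior follows directly from the shard metadata introduced earlier: on retry, a worker skips shards already marked extracted and reruns only the rest. A sketch, reusing the hypothetical Shard, Extractor, and ExtractionRequest from the sketches above:

// resumeJob re-runs extraction only for shards that have not completed,
// so a retried job never re-extracts finished work.
func resumeJob(ctx context.Context, ex Extractor, shards []Shard) error {
    for i := range shards {
        if shards[i].Status == "extracted" {
            continue // finished on a previous attempt; skip
        }
        // In the real service a signed URL would be minted from StorageKey here.
        _, err := ex.Extract(ctx, &ExtractionRequest{ShardURL: shards[i].StorageKey})
        if err != nil {
            shards[i].Status = "failed"
            return err // caller persists shard statuses and schedules the next retry
        }
        shards[i].Status = "extracted"
    }
    return nil
}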

Example: "Untitled James Donovan Project" screenplay now produces 268 high-quality chunks.

Lessons Learned & Next Steps

Key Takeaways:

  • Token limits are a hard constraint; sharding is essential.
  • Incremental extraction improves resilience and clarity.
  • Prompt engineering can radically improve LLM output structure.
  • Clear failure signals improve UX and developer velocity.
  • Metric dashboards and raw chunk inspection are both crucial.

Next Steps:

  • Expand the microservice to support DOCX and content-density-based chunking
  • Improve semantic labeling for image extraction
  • Auto-tune prompts based on asset metadata
  • Add safeguards to prevent overload in the extractor service

Conclusion

By leveraging a multi-modal LLM, structured prompt engineering, and the new Extractor interface, we dramatically improved the accuracy and reliability of content extraction across formats. This upgrade enables better search, summarization, and overall trust in the system.
