How one compliance team extracted structured data from hundreds of unstructured documents

Mark Ku

October 21, 2025

Every compliance professional knows the pain: hundreds of PDF documents scattered across folders, each containing critical information buried in paragraphs of text. Company registrations, tax IDs, contact details, transaction histories—all locked away in formats designed for reading, not analysis.

One financial compliance team recently faced exactly this challenge. They needed to extract detailed information about business partners and their associated companies from over 50 individual registration documents. Each document contained a mix of personal identification numbers, corporate tax IDs, phone numbers, and transaction histories. The goal? Create a comprehensive, searchable database they could filter and analyze to meet regulatory requirements.

The traditional approach: hours of manual work

Before discovering Storytell, the team's process looked familiar to anyone who's done data extraction work:

Open each PDF individually
Manually copy relevant fields into a spreadsheet
Cross-reference information across multiple documents
Apply filters and create subsets for specific analyses
Repeat for every new document batch

For 50+ documents with multiple data points each, this process could take days. And that's assuming no errors crept in during manual transcription.

A different approach: natural language data extraction

Instead, the team uploaded all their documents to Storytell and simply described what they needed in plain language:

"Extract a detailed list from the attached documents containing: partner name, company name, corporate tax ID, company phone numbers, partner tax ID, and partner phone numbers. Present the result in table format for easy viewing."

Storytell's knowledge base search capability immediately went to work, analyzing all uploaded documents simultaneously and extracting the requested information into a structured table. But the real power emerged in the next steps.
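For readers who think in schemas, the first request effectively names the columns of the output table. The sketch below shows one way that table could be modeled and exported for review. It is not Storytell's internal format: the class name `PartnerRecord`, the field names, and the `to_csv` helper are all assumptions made for illustration, and the sample values in the usage note are obvious placeholders.

```python
import csv
import io
from dataclasses import dataclass, field, fields


@dataclass
class PartnerRecord:
    """One row of the requested table. Field names are illustrative only."""
    partner_name: str
    company_name: str
    corporate_tax_id: str
    partner_tax_id: str
    company_phones: list[str] = field(default_factory=list)
    partner_phones: list[str] = field(default_factory=list)


def to_csv(records: list[PartnerRecord]) -> str:
    """Serialize records to CSV, joining multiple phone numbers with '; '."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header row follows the dataclass field order.
    writer.writerow([f.name for f in fields(PartnerRecord)])
    for r in records:
        writer.writerow([
            r.partner_name,
            r.company_name,
            r.corporate_tax_id,
            r.partner_tax_id,
            "; ".join(r.company_phones),
            "; ".join(r.partner_phones),
        ])
    return buffer.getvalue()
```

Calling `to_csv([PartnerRecord("Jane Doe", "Acme Ltd", "TAX-0001", "PID-0001", ["555-0100"], ["555-0199"])])` returns a header line followed by one data row. Keeping phone numbers as lists reflects the fact that a single company or partner may have several.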

Iterative refinement through conversation

The initial extraction gave them a comprehensive dataset, but regulatory analysis often requires specific subsets. The team continued with natural language requests:

"Create a new spreadsheet based on the original, keeping all fields, but filtering only records that have a conventional installment plan, deferred and consolidated."

Then they got more specific:

"Filter companies that have exclusively one conventional installment plan that is deferred and consolidated. Ensure that no company with more than one installment is included."

And even more granular:

"Separate into a new list the companies that had only one conventional installment in 2025 and that had any other type of transaction in previous years."

Each request generated a new, precisely filtered dataset—all without writing a single line of SQL or opening a spreadsheet application.
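The team wrote no code, but spelling out the filters is a useful way to check that a returned subset means what you think it means. The sketch below is one reading of the first two requests, not a description of how Storytell evaluates them. The `InstallmentPlan` class, its fields, and the `"conventional"` label are hypothetical, and "exclusively one" is interpreted here as: the company has a single installment plan in total, and that plan is conventional, deferred, and consolidated.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InstallmentPlan:
    company: str
    kind: str          # hypothetical label, e.g. "conventional"
    deferred: bool
    consolidated: bool


def is_target_plan(plan: InstallmentPlan) -> bool:
    """True for a conventional plan that is both deferred and consolidated."""
    return plan.kind == "conventional" and plan.deferred and plan.consolidated


def companies_with_at_least_one(plans: list[InstallmentPlan]) -> set[str]:
    """First request: keep any company that has a qualifying plan."""
    return {p.company for p in plans if is_target_plan(p)}


def companies_with_exactly_one(plans: list[InstallmentPlan]) -> set[str]:
    """Second request: the qualifying plan must be the company's only plan."""
    total_per_company = Counter(p.company for p in plans)
    return {
        p.company
        for p in plans
        if is_target_plan(p) and total_per_company[p.company] == 1
    }
```

Running both functions over the same records and comparing the two sets is a quick sanity check: every company in the second set should also appear in the first, and any company that drops out should have more than one plan on file.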

What made this approach successful

Several factors contributed to the effectiveness of this workflow:

Multi-document comprehension: Rather than processing documents one at a time, Storytell analyzed the entire collection simultaneously, identifying relationships and patterns across documents that would be time-consuming to spot manually.

Semantic understanding: The system interpreted intent, not just keywords. A request for "exclusively one" installment plan was treated differently from one that would match "at least one", so the appropriate uniqueness constraint was applied.

Structured data generation: Information locked in unstructured text became queryable data, with Storytell automatically handling the schema design and data normalization.

Iterative refinement: Each new request built on previous results, allowing the team to progressively narrow their focus without re-processing everything from scratch.

Temporal logic: Complex conditions like "in 2025" versus "in previous years" were handled naturally, without needing to explain date comparison logic.
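To make the temporal condition concrete, here is a minimal sketch of the date logic behind a request such as "only one conventional installment in 2025 and any other type of transaction in previous years". It illustrates the rule and is not Storytell's implementation. The `Transaction` class, its fields, and the `"conventional_installment"` label are assumptions, and "previous years" is read as strictly earlier than the target year.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    company: str
    kind: str   # hypothetical label, e.g. "conventional_installment"
    year: int


def one_in_year_with_other_history(
    transactions: list[Transaction],
    year: int = 2025,
    kind: str = "conventional_installment",
) -> set[str]:
    """Companies with exactly one `kind` transaction in `year` and at
    least one transaction of a different kind in an earlier year."""
    count_in_year: dict[str, int] = {}
    has_other_history: set[str] = set()
    for t in transactions:
        if t.kind == kind and t.year == year:
            count_in_year[t.company] = count_in_year.get(t.company, 0) + 1
        elif t.kind != kind and t.year < year:
            has_other_history.add(t.company)
    return {
        company
        for company, n in count_in_year.items()
        if n == 1 and company in has_other_history
    }
```

Note the two edge cases the rule quietly decides: a different transaction type in the same year does not count as history, and an earlier transaction of the same type does not count as "any other type". If your reading of the requirement differs, those are the two comparisons to change.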

Broader applications for data extraction

While this specific example involved financial compliance, the pattern applies broadly to any scenario involving:

Legal document analysis: Extract clauses, dates, parties, and obligations from contracts, agreements, or regulatory filings

Research synthesis: Pull structured data from academic papers, reports, or technical documentation

Customer intelligence: Organize information from customer communications, support tickets, or feedback forms

HR and recruiting: Extract qualifications, experience, and contact details from resumes and application materials

Due diligence: Compile structured datasets from company filings, financial statements, and regulatory documents

The key insight is that whenever you have information locked in documents designed for human reading, natural language extraction can transform hours of manual work into minutes of conversational interaction.

From extraction to analysis

The compliance team's workflow demonstrates an important principle: data extraction is rarely the end goal. It's a stepping stone to analysis, reporting, and decision-making.

By cutting extraction from a task that could take days to one that took minutes, Storytell allowed the team to spend their energy where it matters most: interpreting the data, identifying patterns, and making informed decisions. The tedious work of finding and organizing information became automatic, while the meaningful work of analysis became primary.

Try it yourself

If you're facing similar challenges with unstructured documents, consider starting with a simple experiment:

Gather a small batch of similar documents (5-10 to start)
Define what structured information you need from them
Describe your requirements in plain language
Iterate with refinements based on initial results

You might be surprised how quickly conversational extraction becomes faster than manual processes—and how much more complex analysis becomes possible when you're not exhausted from data entry.

The future of knowledge work isn't about reading faster or copying more accurately. It's about asking better questions and letting AI handle the mechanical work of organizing answers. That future is already here.
