Technical advances in document understanding

Practical AI - Episode Summary

Episode Title: Technical advances in document understanding
Release Date: December 2, 2025
Hosts: Daniel Whitenack & Chris Benson
Theme: Exploring the evolving landscape of AI-powered document understanding and processing technologies, focusing on practical, real-world advances and implementations.

Episode Overview

In this “Fully Connected” episode, Daniel and Chris take a deep dive into the technical advances in document understanding. They explore the history, current state, and recent breakthroughs in document processing AI—discussing classical OCR, document structure models (like Docling), vision-language models, and the innovative DeepSeek OCR. The conversation balances technical explanation with practical advice, highlighting use cases, trade-offs, and implications for business workflows and AI practitioners.

Main Discussion Points & Insights

1. Why Document Processing Still Matters

Document processing is everywhere: Essential for daily business workflows—compliance, summarization, extraction, etc. (04:37)
Often “unsexy”, but powerful: Overlooked compared to LLMs and chatbots, yet central to real-world AI success.
Quote:
"Lurking below the surface of a lot of work in industry is document processing... it's really at the center of a lot of what happens in businesses day to day." — Daniel (05:29)

2. Classical OCR: How It Works and Its Limits

Old but evolving: OCR (Optical Character Recognition) has been around for decades but only became usable with modern deep learning (07:45, 10:03).
Classical pipeline: Image in → detect text regions → process each with CNNs or LSTMs → output character probabilities (11:12).
Efficient but “brute force”: Small models, can run on CPU/laptop, but layout and low-quality images are significant challenges (14:43).
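The last step of the classical pipeline, turning per-step character probabilities into text, can be sketched as follows. This is a minimal illustration that assumes a CTC-style greedy decoder, a common choice for CNN/LSTM recognizers; the episode summary does not name a specific decoding method. The blank symbol, the function name greedy_ctc_decode, and the probability values are all invented for the example.

```python
from typing import Dict, List

BLANK = "_"  # hypothetical blank symbol, as used by CTC-style decoders


def greedy_ctc_decode(steps: List[Dict[str, float]]) -> str:
    """Pick the most likely symbol per step, merge repeats, drop blanks."""
    decoded = []
    previous = None
    for probs in steps:
        best = max(probs, key=probs.get)
        # Emit a symbol only when it differs from the previous step
        # and is not the blank placeholder.
        if best != previous and best != BLANK:
            decoded.append(best)
        previous = best
    return "".join(decoded)


# Toy per-step probabilities for one detected text region (invented values).
region_steps = [
    {"c": 0.9, "a": 0.05, "t": 0.03, BLANK: 0.02},
    {"c": 0.6, "a": 0.2, "t": 0.1, BLANK: 0.1},
    {BLANK: 0.8, "c": 0.1, "a": 0.05, "t": 0.05},
    {"a": 0.85, "c": 0.05, "t": 0.05, BLANK: 0.05},
    {"t": 0.7, "a": 0.1, "c": 0.1, BLANK: 0.1},
    {"t": 0.6, BLANK: 0.3, "a": 0.05, "c": 0.05},
]

text = greedy_ctc_decode(region_steps)
print(text)
```

A real recognizer would produce these probabilities from a trained network; the decoding step itself is cheap, which is part of why classical OCR can run on a CPU.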
Quote:
"Early OCR was really not very good... it's almost costing me more effort than it's worth it. So things have changed dramatically." — Chris (10:03)

3. Document Structure Models: Adding Layout Intelligence

What they are: Models like Docling predict regions/structure (tables, titles, etc.), not just text (20:26).
No OCR, just structure: The structure model itself does not extract text; it is most useful for complex, multi-column, or data-rich documents.
Combine with OCR: First, detect structure, then send regions to OCR for text extraction—preserves layout for downstream tasks like RAG (Retrieval Augmented Generation) (25:06).
Practicality: More computationally expensive than OCR, but not as heavy as big vision models; still viable for many applications (27:07).
Quote:
"A document structure model... tries to predict the structure of the document... over here is a table, over here is a title... so when you have these more complex documents... you have this structure laid out." — Daniel (22:25)
Use in RAG: “One of the very frequent uses... is for the processing of documents that are feeding into a RAG pipeline…the cleaner and more context relevant you can make those chunks of text... the better results you’re going to get.” — Daniel (29:20)
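The "detect structure, then OCR each region" flow described in this section can be sketched roughly as below. This is a hedged illustration, not Docling's API: the Region class, the regions_to_chunks function, the label names, and the stubbed OCR callable are all hypothetical, and sorting by top-left corner is a simplification that would not handle multi-column pages correctly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Region:
    label: str  # hypothetical labels: "title", "paragraph", "table"
    box: Box


def regions_to_chunks(
    regions: List[Region], ocr: Callable[[Box], str]
) -> List[Dict[str, Optional[str]]]:
    """OCR each region and emit chunks tagged with their section title."""
    # Naive reading order: top-to-bottom, then left-to-right.
    ordered = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    chunks: List[Dict[str, Optional[str]]] = []
    current_title: Optional[str] = None
    for region in ordered:
        text = ocr(region.box).strip()
        if not text:
            continue
        if region.label == "title":
            # Titles become context for the chunks that follow them.
            current_title = text
            continue
        chunks.append(
            {"section": current_title, "type": region.label, "text": text}
        )
    return chunks


# Stand-in for a real OCR engine: looks up invented text by bounding box.
fake_text = {
    (50, 20, 550, 60): "Quarterly Report",
    (50, 100, 550, 250): "Revenue grew in Q3.",
    (50, 300, 550, 500): "Region | Sales\nEMEA | 10",
}
regions = [
    Region("table", (50, 300, 550, 500)),
    Region("title", (50, 20, 550, 60)),
    Region("paragraph", (50, 100, 550, 250)),
]

chunks = regions_to_chunks(regions, lambda box: fake_text.get(box, ""))
for chunk in chunks:
    print(chunk)
```

Keeping the section title and region type with each chunk is one way to make RAG chunks "cleaner and more context relevant", in the sense Daniel describes.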

4. Vision-Language Models: The Multimodal Leap

Definition: Take both images and text as input, output text (34:18).
LLM + Vision transformer: Fuse embeddings of text & images, trained jointly—great for general reasoning, Q&A over images, and document understanding.
Limits: Interpretability suffers; output is "just text," no clear mapping to document structure regions (37:41).
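To make the "fuse embeddings of text and images" idea concrete, the sketch below builds the combined input sequence a vision-language model might see. It assumes one common design, where the image is resized to a fixed resolution, cut into square patches, and the patch tokens are placed ahead of the text tokens; the episode summary does not specify the fusion method. The 448-pixel input size, 14-pixel patch size, and function names are illustrative assumptions, and real models use learned embeddings rather than string placeholders.

```python
from typing import List, Tuple


def count_patches(width: int, height: int, patch: int) -> int:
    """Number of non-overlapping square patches in a resized image."""
    return (width // patch) * (height // patch)


def fused_sequence(
    width: int, height: int, patch: int, prompt_tokens: List[str]
) -> List[Tuple[str, str]]:
    """Return (modality, token) pairs: image patches first, then text."""
    n_patches = count_patches(width, height, patch)
    image_slots = [("image", f"patch_{i}") for i in range(n_patches)]
    text_slots = [("text", tok) for tok in prompt_tokens]
    return image_slots + text_slots


# Any page, whatever its original size, is assumed to be resized to 448x448.
prompt = "What is the invoice total ?".split()
sequence = fused_sequence(448, 448, 14, prompt)
print(len(sequence), sequence[0], sequence[-1])
```

Because the image budget is fixed, a dense page and a simple photo get the same number of patch tokens, which is the fixed-resolution limitation the next section addresses.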
Quote:
"It doesn’t have to be used for document processing, but it could be used for that… It's a general-purpose reasoner over images." — Daniel (35:05)

5. DeepSeek OCR: The Resolution Revolution

What’s new: DeepSeek OCR splits input pages into high-res tiles (“vision tokens”) and combines them with a global low-res view, addressing the fixed-resolution issue in most vision-language models (39:51).
Why it matters: Preserves fine print, mathematical notation, and layout details that standard approaches often miss (41:50).
Heavier models: Requires GPU, but likely to shrink over future generations, following the path of LLMs.
Quote:
"DeepSeek is saying, let’s not lose all of that context… let’s take the tile, let’s tile this image… we’re not losing any of the resolution, but we’re also not losing the structure of where these kind of tiles are placed." — Daniel (44:03)
Broader trend: Shows that the state-of-the-art in document understanding is diverging in methods and architectures, not converging (47:02).
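The tiling idea described in this section can be sketched with plain geometry: cut the full-resolution page into tiles that remember their grid position, and pair them with one downscaled global view. This is an illustrative sketch only, not DeepSeek's implementation; the 1024-pixel tile size, the function names, and the A4-at-300-dpi page dimensions are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple


def tile_page(width: int, height: int, tile: int) -> List[Dict]:
    """Split a page into tiles, keeping each tile's grid position."""
    tiles = []
    for row, top in enumerate(range(0, height, tile)):
        for col, left in enumerate(range(0, width, tile)):
            # Edge tiles are clipped to the page boundary.
            box = (left, top, min(left + tile, width), min(top + tile, height))
            tiles.append({"row": row, "col": col, "box": box})
    return tiles


def global_view_size(width: int, height: int, target: int) -> Tuple[int, int]:
    """Size of a low-res overview whose longest side equals `target`."""
    scale = target / max(width, height)
    return (max(1, round(width * scale)), max(1, round(height * scale)))


# An A4 page scanned at 300 dpi is 2480 x 3508 pixels.
page_w, page_h = 2480, 3508
tiles = tile_page(page_w, page_h, 1024)
overview = global_view_size(page_w, page_h, 1024)

print(len(tiles), "tiles; last tile:", tiles[-1])
print("global view size:", overview)
```

The row and column stored with each tile are what preserve "the structure of where these kind of tiles are placed", while the tiles themselves keep the original pixel detail.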

Notable Quotes & Memorable Moments

On practical focus:
“We pride ourselves on bringing that practical, productive, and accessible approach… the difference in the conversations we have on this show is we’re really focused on getting people into this technology so that they can use it day to day in a fun way.” — Chris (06:39)
On the user experience:
“The cleaner and more context relevant you can make those chunks of text into your RAG system, the better results you’re going to get in the responses.” — Daniel (29:26)
On innovation:
“From my perspective, maybe from a nerdy perspective, document processing is very much not boring because there’s actually such a diversity and such innovation going on here.” — Daniel (47:02)
On motivation to learn:
“Maybe coming out of the holidays they can go back into the office and give an upgrade to their RAG system and be wizards at how effective RAG is being for their organization.” — Chris (47:23)

Key Timestamps for Major Segments

| Timestamp | Topic/Segment |
|-----------|---------------|
| 04:37 | Why Document Processing Still Matters |
| 07:45 | Brief history & evolution of OCR |
| 11:12 | Classical OCR Processing Pipeline Explained |
| 20:26 | Document Structure Models (Docling) |
| 22:25 | Structure vs OCR, JSON/Markdown outputs |
| 27:07 | When & why to use structure models (trade-offs) |
| 29:20 | Structure models in RAG systems |
| 34:18 | Vision-Language Models Explained |
| 35:05 | Multimodal input/output, use-cases |
| 39:51 | DeepSeek OCR and advances in resolution |
| 44:03 | Technical method: tiling and high-res context |
| 47:02 | Diversity & innovation in document models |
| 47:23 | Real-world implications and use cases |

Hosts’ Closing Thoughts

Chris: Listeners can go upgrade their RAG systems and improve organizational effectiveness thanks to modern document processing (47:23).
Daniel: The landscape is exciting, diverse, and progressing rapidly; today’s models offer flexibility for a range of real-world scenarios (47:02).
Holiday wishes & gratitude: The hosts express thanks to listeners for their engagement and support (48:11).

Takeaways for Listeners

Document processing remains a critical, vibrant domain in AI: it powers much of what drives modern workflows and compliance.
Innovation is rapid and multi-faceted: Advances span from efficient classical tools (OCR) up to highly sophisticated, context-preserving multi-modal models.
Choice of tool depends on use case: Simple scans may just need OCR; complex layouts or RAG integration benefit from structure models; newest approaches (like DeepSeek OCR) address previous technical limitations.
Technical diversity is growing: There is no “one size fits all”—practitioners must weigh tradeoffs in performance, context preservation, interpretability, and resource constraints.
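The tool-selection guidance in these takeaways can be written down as a small rule of thumb. The sketch below is a hypothetical heuristic assembled from the points in this summary (simple scans need only OCR, complex layouts or RAG benefit from structure models, fine detail favors the newer GPU-backed models); the function name, parameters, and returned labels are invented, and real decisions would also weigh cost, latency, and accuracy on your own documents.

```python
def suggest_approach(
    complex_layout: bool, feeds_rag: bool, fine_detail: bool, has_gpu: bool
) -> str:
    """Map coarse document traits to one of the approaches discussed."""
    # Fine print or math notation favors tiled, high-resolution models,
    # which the summary notes currently require a GPU.
    if fine_detail and has_gpu:
        return "high-resolution vision-language OCR"
    # Multi-column pages, tables, or RAG ingestion benefit from layout.
    if complex_layout or feeds_rag:
        return "structure model + OCR"
    # Clean, simple scans can stay on lightweight CPU-friendly OCR.
    return "classical OCR"


print(suggest_approach(False, False, False, False))
print(suggest_approach(True, True, False, False))
print(suggest_approach(True, False, True, True))
```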

For further learning or practical application, check out toolkits and models mentioned in the episode:

Tesseract, PaddleOCR (Classical OCR)
Docling (structure models)
Hugging Face’s small Docling model
Vision-Language models (Qwen 2.5, others)
DeepSeek OCR (for state-of-the-art use cases)

Stay practical, stay curious, and consider upgrading your document workflows with these modern advances!
