Podcast Host
Welcome to the Practical AI Podcast, where we break down the real-world applications of artificial intelligence and how it's shaping the way we live, work and create. Our goal is to help make AI technology practical, productive and accessible to everyone. Whether you're a developer, business leader or just curious about the tech behind the buzz, you're in the right place. Be sure to connect with us on LinkedIn, X or Bluesky to stay up to date with episode drops, behind-the-scenes content and AI insights. You can learn more at practicalai.fm. Now, on to the show.
Sponsor Representative
Well, friends, when you're building and shipping AI products at scale, there's one constant: complexity. Yes, you're wrangling models, data pipelines, deployment, infrastructure, and then someone says, let's turn this into a business. Cue the chaos. That's where Shopify steps in. Whether you're spinning up a storefront for your AI-powered app or launching a brand around the tools you built, Shopify is the commerce platform trusted by millions of businesses and 10% of all US e-commerce, from names like Mattel and Gymshark to founders just like you. With literally hundreds of ready-to-use templates, powerful built-in marketing tools and AI that writes product descriptions and headlines for you, and even polishes your product photography, Shopify doesn't just get you selling, it makes you look good doing it. And we love it. We use it here at Changelog. Check us out at merch.changelog.com. That's our storefront. And it handles the heavy lifting too: payments, inventory, returns, shipping, even global logistics. It's like having an ops team built into your stack to help you sell. So if you're ready to sell, you are ready for Shopify. Sign up now for your $1 per month trial and start selling today at shopify.com/practicalai. Again, that is shopify.com/practicalai.
Daniel Whitenack
Welcome to another Fully Connected episode of the Practical AI Podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, and I'm joined by Chris Benson, my co-host, who is a principal AI research engineer at Lockheed Martin. In these Fully Connected episodes, where it's just Chris and I, we try to dig into a few topics or deep dive into some learning resources that will help you level up your AI and machine learning game. Looking forward to this one, Chris. In reflecting before the episode, both of us are going into American Thanksgiving, which is tomorrow as we're recording this, with a lot of gratitude for the year. A lot happens in life, and it's a nice time to reflect and see the blessings that we have at Thanksgiving. And what a blessing to keep doing this show for going on eight years now. It's been a moment, having a lot of fun, stepping on a few mines along the way, but having fun generally. And I'm thankful to our listeners as well. I just want to take a moment to say thank you for sticking with us all these years. Chris and I have a lot of cool plans for the coming year, and there's energy behind the show. Lots of ideas going on that we'll talk about soon. But yeah, thank you to our listeners for sticking with us.
Chris Benson
Couldn't say it better. Thank you to the listeners for sticking with us. And I got to say, these fully connected shows in a lot of ways are, are so much fun. They're among my very favorites because, like, we get to talk to these most amazing guests in a typical episode, you know, where you're like talking to some of the smartest people in the world and being able to kind of understand how they see it and learn. And I know our listeners go along for the ride on that. But I also love when we just, you know, it's the Wednesday afternoon before Thanksgiving for you and me as we're recording this. I know people will be listening to it just after Thanksgiving, but it's a lot of fun just to jump into the conversation. And I know, I know we have some fun things to hit today, so I'm relaxed and looking forward to it. Daniel?
Daniel Whitenack
Yeah, yeah, for sure. And I don't know a more exciting topic for the Thanksgiving dinner table than document processing, which is what I brought forward today. What I was realizing, Chris, is we talk a lot about large language models. We have talked a lot about computer vision type of things on the show, maybe not as much recently, but certainly over the years. We've talked about all the chatbot stuff and all of that. But I think lurking below the surface of a lot of work in industry is document processing. And as the years have gone along and we've entered into the generative AI revolution, there has also been this stream of innovations in relation to processing documents in an automated way with models. And of course that reaches very practical places in terms of everyday business work. Right? I think often the most valuable workflows that people have day to day, or maybe the most annoying ones, are: this person sends me an email with this document, and I've got to extract this or do this or create a summary of that. Or I have new documents that are regulations related to compliance, and I need to process them and get them into somewhere. And that's really at the center of a lot of what happens in businesses day to day. So I thought, as we hopefully aren't yet in a coma after eating too much turkey, we could use this time when we're alert to talk about some of that.
Chris Benson
Great point there. And I kind of hate the name, you know, document processing. Before everyone out there turns us off, goes, oh my God, they're talking about document processing, and goes to sleep: this is pretty cool stuff. And it's important, modeling-wise, absolutely. And it is productive. We pride ourselves on bringing that practical, productive and accessible approach to it. And I think that's really important. One of the differences in the conversations we have on the show versus some other shows is the other ones tend to chase the headlines and the glam and stuff, and we're really focused on getting people into this technology so that they can use it day to day in a fun way. So before you say, I'm going to turn off for turkey, on document processing: this is pretty cool stuff. And as Daniel said, this has been going on all along; it just doesn't get the headlines anymore like it used to. So it's really worth diving into and saying, hey, look at what's possible now versus the last time we talked about it.
Daniel Whitenack
Yeah. And probably what initially prompted this is, of course, we've been working with some of these models internally. But also DeepSeek did release a DeepSeek-OCR model, which people have been talking a lot about, and which represents at least part of this stream of work that's been going on around document processing models. Now, just so people have a little bit of background or jargon on where we're headed: my thought is we really need to pick apart some of these different kinds of modeling, how they fit in, where they're practical, and maybe where they're not practical. In particular there is OCR, which has been around for the longest in terms of the things that we'll talk about. That stands for optical character recognition. Then there are language vision models, or LVMs. Then there are document structure type of models, like Docling. People might have heard of Docling. And then finally there's this latest model, DeepSeek-OCR, which is different from what people might think of in terms of OCR. So there are these different categories or families of methodologies here. And there's really, like you say, Chris, a lot happening in these different areas. But that's where we're headed in the conversation, for those listening, as we pick apart some of these things. I don't know, Chris, I kind of remember OCR happening for a very long time. Neither one of us, I think, grew up with computers that had OCR on them, or with computers in general. I do remember in grad school processing some papers or other things and applying some type of OCR, maybe in some tools, on these. But yeah, what's your history there?
Chris Benson
Yeah, well, I mean, early OCR was really not very good, you know, and this was certainly before the current generation of AI. And I'm using generation very broadly here, like the last 15 years. It's come a long way with these new technologies. I remember trying some of the pre-AI OCR technologies when I was younger and going, this is really not working. It's almost costing me more effort than it's worth. So things have changed dramatically. I mean, it's so good now, and there are so many approaches to it, as we're going to dive into.
Daniel Whitenack
Yeah, yeah. And I think maybe a good starting point, if we just start with OCR, is really thinking about the processing pipeline and the different components that are involved in it, because that really drives what compute is needed, how fast it is, how performant it is, and it distinguishes it as a category. Now, just by way of reference in terms of how things are processed through a quote "classical" OCR model or a typical OCR model: these would be things like Tesseract or PaddleOCR, these sorts of technologies. What happens is an image is input and then ideally text or characters are output. Let's contrast that with LLMs, because everyone's talking about LLMs now. The typical processing pipeline with LLMs is that not images come in, but text comes in. That text is split apart into tokens. Those tokens are assigned indices within a vocabulary. That array of indices is embedded into a dense representation, often by a transformer-based model, and then what is predicted on the output side is an array of probabilities corresponding to different tokens, such that you can know what is the most probable next token coming out of the model. So you have text come in, that text is split apart into tokens, that's embedded, and then the outputs are these probabilities of the next token. If we contrast that with the OCR model, first of all, we have a different type of input, right? We have an image, and that image is made of pixels. On the output side, actually not dissimilar to the LLM, there is an output of probabilities at the end; it's just an output of probabilities of characters. So if you look at a big image, it might have regions of characters or words in it. 
And what happens in the OCR model is you take that big image with a lot of characters. There might be some preprocessing on the image, like a resizing or something, but then there is one model that detects the areas or regions where there are characters or text, the text regions. Then you take each of these text regions and you put it through something like a convolutional neural network or an LSTM, and that outputs, through a sequence model, the probability of what characters correspond to that region. Right. So essentially that OCR model is really just looking at that big image, determining where there are characters or text regions, and then for each of those, predicting what that character or text region is. That's how the processing goes, which in some ways seems almost brute force. Right? You're splitting it apart into all of these regions.
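The two-stage flow Daniel describes (detect text regions, then predict characters for each region) can be sketched roughly as follows. This is a toy illustration, not how Tesseract or PaddleOCR is implemented: `detect_text_regions`, `recognize`, `ALPHABET`, and the hard-coded probabilities are all hypothetical stand-ins for a real detector and a real CNN/LSTM recognizer.

```python
from typing import Dict, List

# Toy alphabet; a real recognizer predicts over a much larger character set.
ALPHABET = ["a", "c", "t", " "]

# Hypothetical recognizer output: for each detected text region, one
# probability distribution over ALPHABET per character position.
FAKE_RECOGNIZER_OUTPUT: Dict[str, List[List[float]]] = {
    "region_0": [
        [0.10, 0.80, 0.05, 0.05],  # most likely "c"
        [0.90, 0.05, 0.03, 0.02],  # most likely "a"
        [0.10, 0.10, 0.70, 0.10],  # most likely "t"
    ],
    "region_1": [
        [0.60, 0.20, 0.10, 0.10],  # most likely "a"
        [0.10, 0.10, 0.60, 0.20],  # most likely "t"
    ],
}


def detect_text_regions(image: object) -> List[str]:
    """Stage 1 stand-in: a real detector would return bounding boxes."""
    return list(FAKE_RECOGNIZER_OUTPUT)


def recognize(region_id: str) -> List[List[float]]:
    """Stage 2 stand-in: a real model maps region pixels to character probabilities."""
    return FAKE_RECOGNIZER_OUTPUT[region_id]


def greedy_decode(char_probs: List[List[float]]) -> str:
    """Pick the most probable character at each position."""
    return "".join(
        ALPHABET[max(range(len(row)), key=row.__getitem__)] for row in char_probs
    )


def run_ocr(image: object) -> List[str]:
    """Detect regions, then decode each region independently."""
    return [greedy_decode(recognize(r)) for r in detect_text_regions(image)]


ocr_result = run_ocr(image=None)
print(ocr_result)
```

Note that the output is just a flat list of strings, one per region. Nothing in it says whether a region was a title, a table cell, or a paragraph.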
Chris Benson
As you were talking, though, I was also thinking back over the history of the show. This is the first time, I think, you've said LSTM in a while. How many years has it been since we talked about that? And recurrent neural networks, which were also involved, and then Transformers also starting to bridge the gap there. Wow. Taking us back a little ways there.
Daniel Whitenack
Taking us back. Yeah. So this is really, in a lot of ways, a brute force type of thing. You're splitting apart that image into these different regions and then for each one trying to detect which characters are there. Now, similar to what you were saying, we're talking about maybe convolutional models or architectures, maybe LSTMs, which is long short-term memory, a recurrent type of network. The models traditionally used in these OCR tools are rather small by today's standards. And as such, even though you're brute forcing all these characters, they are fairly efficient in terms of where you can run them. I can run one easily on my laptop. I can run it on a CPU. I don't have to have a large GPU.
Chris Benson
True. It's interesting how that evolution, and the different branches of possibility in terms of how you might approach the problem, have developed. Do you have any thoughts around that? We went from LSTMs and got to convolutional networks, and then Transformers started making an impact on that. Maybe after we come out of the break, we can talk a little bit about how those evolved and why the different selections became primary over time.
Sponsor Representative
Well, friends, it is time to let go of the old way of exploring your data. It's holding you back. But what exactly is the old way? Well, I'm here with Mark Dupuis, co-founder and CEO of Fabi, a collaborative analytics platform designed to help explorers like yourself. So, Mark, tell me about this old way.
Mark Dupuis
So the old way, Adam, if you're a product manager or a founder and you're trying to get insights from your data, you're wrestling with your Postgres instance or Snowflake or your spreadsheets, and you maybe don't even have the support of a data analyst or data scientist to help you with that work. Or if you are, for example, a data scientist or engineer or analyst, you're wrestling with a bunch of different tools, local Jupyter notebooks, Google Colab, or even your legacy BI, to try to build these dashboards that someone may or may not go and look at. And in this new way that we're building at Fabi, we are creating this all-in-one environment where product managers and founders can very quickly go and explore data regardless of where it is.
Daniel Whitenack
Right.
Mark Dupuis
So it can be in a spreadsheet, it can be in Airtable, it can be in Postgres or Snowflake. It's really easy to do everything from an ad hoc analysis to much more advanced analysis if, again, you're more experienced. With Python built in right there and our AI assistant, you can move very quickly through advanced analysis. And the really cool part is that you can go from ad hoc analysis and data science to publishing these as interactive data apps and dashboards, or better yet, delivering insights as automated workflows to meet your stakeholders where they are, in, say, Slack or email or spreadsheets. So if this is something that you're experiencing, if you're a founder or product manager trying to get more from your data, or if your data team today is just underwater and feels like it's wrestling with legacy BI tools and notebooks, come check out the new way and come try out Fabi.
Sponsor Representative
There you go. Well, friends, if you're trying to get more insights from your data, stop wrestling with it and start exploring it the new way with Fabi. Learn more and get started for free at fabi.ai. That's F-A-B-I dot AI. Again, fabi.ai.
Daniel Whitenack
Yeah, Chris, so you were just getting into, I guess, the why. Assuming we have OCR, right, that does work in the sense that you can predict characters and you can pick out these text regions. And OCR models have obviously gotten better over the years. So why is there a need for something else? Why is there a transition to other architectures or other things? What I would say is, if you think about that process of the image coming in and you splitting apart those text regions, you end up with all of this plain text output, and any sort of logic around the reconstruction of that document, especially related to the layout of the document, is problematic, I would say. And I would say these models are often highly dependent on the actual quality of the pixels that are input. Remember, the pixels are input here, and often the images are resized on the inputs to these models, or they need to be, just in terms of the input size. So you've got this combination of problems: not having an understanding of the layout, but also requiring clean scans of the documents, if you will, which is definitely a drawback of this approach, I would say.
Chris Benson
Yeah, I mean, I can remember back in the day with traditional OCR, that was not just a problem, it was constant. You would use the OCR on a document and you had to pretty meticulously go through the document afterwards to correct a lot of the errors. That didn't change really until we got past the traditional approach into more of the vision-based models. So definitely seeing the progression there.
Daniel Whitenack
Yeah, yeah. And that naturally transitions us into one of the things that is now a part of our world and helps with at least a part of that problem, the structure and layout problem, which are what are called document structure models. One of the most popular of these is called Docling. And there are different families of these. Docling might be confusing because there are some models that are labeled as Docling models, and there's also a toolkit called Docling that IBM released, which isn't actually just one model. It's a whole series of pipelines and options around document processing. But one of the core concepts here, whether it's in use in that library or in reference to a model, is that a document structure model, in terms of what it does, actually doesn't do any OCR. It doesn't detect text, it doesn't convert images to text, and this sort of thing. What it does is try to predict the structure of the document, or a structured representation of the document. Because remember, with OCR we don't really have that, right? We just have the prediction of these characters in these different croppings of the image. So with Docling or a similar document structure model, you have a document or an image that's input, and then a parser extracts layout primitives. That might be rectangles or certain shapes or vectors or fonts. Then a layout model, again part of this document structure model, makes predictions for what those regions should be classified as: things like titles or paragraphs or headings or tables, et cetera. And then the output of the model is not predicted characters. So I'm not getting characters or text out of this. 
What I'm getting is a structured output representation of the document, usually in JSON, Markdown or HTML format, which basically tells me: okay, you put in this document, over here is a table, over here is a title, this region corresponds to a heading, there's a paragraph over here. That way, when you have these more complex documents, maybe two-column papers or white papers with a bunch of tables or data sheets or that sort of thing, you have this structure laid out. You have the classification of that structure. And so a Docling model, or this type of document structure model, is often used in combination with an OCR model. It would go like this: a document comes in, you detect all the structure of the document, right, over here's a table, here's a paragraph, here's a title. Okay, well, now let me send that title bit into an OCR model and actually get the text associated with the title. Right. And so now you've overcome a little bit of that limitation of the raw OCR by applying this structure on top, and you can reconstruct the document as a Markdown document with all the tables and titles and that sort of thing.
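As a rough sketch of that combination (structure model first, OCR per region second, then reassembly as Markdown), assuming a simplified region format: `LayoutRegion`, `to_markdown`, and the lookup-table OCR below are hypothetical and are not the Docling toolkit's API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LayoutRegion:
    """One classified region from a (hypothetical) document structure model."""
    label: str    # e.g. "title", "heading", "paragraph"
    crop_id: str  # stand-in for the cropped pixels of that region


def to_markdown(regions: List[LayoutRegion], ocr: Callable[[str], str]) -> str:
    """Run OCR on each region and render it according to its predicted label."""
    blocks = []
    for region in regions:
        text = ocr(region.crop_id)
        if region.label == "title":
            blocks.append(f"# {text}")
        elif region.label == "heading":
            blocks.append(f"## {text}")
        else:
            # Paragraphs (and anything unrecognized) are emitted as plain text.
            blocks.append(text)
    return "\n\n".join(blocks)


# Stand-in OCR model: a lookup table instead of a real recognizer.
FAKE_OCR = {
    "crop_a": "Quarterly Report",
    "crop_b": "Summary",
    "crop_c": "Revenue grew.",
}

regions = [
    LayoutRegion("title", "crop_a"),
    LayoutRegion("heading", "crop_b"),
    LayoutRegion("paragraph", "crop_c"),
]

markdown = to_markdown(regions, FAKE_OCR.__getitem__)
print(markdown)
```

The regions are assumed to arrive already in reading order; in a real pipeline, establishing that order for multi-column layouts is part of what the structure model contributes.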
Chris Benson
It's funny, as you're going through the process there, as a very loose analogy, for those of us in the audience who are programmers like you and I, it reminds me a little bit of the way programming languages are compiled into this tree-structured format that's called an abstract syntax tree, an AST, where regardless of what the originating language is, it captures the essence of what the program is before it's compiled into machine code or whatever your target is. It feels like Docling is doing, at a higher level, obviously, a little bit of a similar thing in terms of capturing all that structure out of the doc.
Daniel Whitenack
Yeah, yeah. It would be like this: the OCR model has an output of character probabilities, right, the LLM has an output of token probabilities, and the document structure model has an output of this tree structure, the tree representation of the structure of the document. So it's that processing pipeline where you pick apart these layout primitives and then you classify each one. The main piece of this is the classification of each of these elements and then assembling that into this tree structure, which is certainly very useful. It's worth noting that this does help you handle more complicated documents. Again, though, it doesn't solve the text extraction piece. You still need to add that piece in. And often this is more computationally heavy than just raw OCR, which can run on CPUs. I think I've run Docling models also on CPU or in constrained environments. I think Hugging Face released a SmolDocling model, which is also geared towards that side of things. Obviously you have the same trade-offs with any kind of model size. The smaller ones maybe don't have the same level of performance but will run in more constrained environments. The larger ones maybe have higher performance, but they might need a GPU to run.
Chris Benson
As we talk about this, would you say that Docling is still a very modern and current way of doing things, given the fact that Hugging Face has released models? And are there use cases where you would not necessarily want to go to this, in your view? Like, I get the benefits that we've talked about, but where might we say it's not the right fit?
Daniel Whitenack
Yeah, I would say that you really want to use this when you need to preserve the structure of the documents that are input and you maybe have complex structures, again like the data sheets or multi-column layouts or a mix of columns and other things. This is really useful at that point. But if you just have a raw scan that's relatively clean, and all of it's just text, and you need to detect all of that text, then maybe an OCR model is totally fine and the structure model is overkill. Right. But I would say this is still very much in widespread use now and quite powerful. We've used it on a few different projects as well with good success. And even though it's a little bit more computationally expensive than OCR, it is not at the level of computation of language vision models and DeepSeek-OCR, which we'll talk about here in a second. That means you could still embed it within your application, maybe run it on a commodity GPU, that sort of thing. So it is still really useful in those ways as well.
Chris Benson
Thinking a little bit about different use cases: still today, if you go and use different types of office tools, and I don't necessarily mean Microsoft Office, but that genre of productivity tools, and you're doing file format changes across them, it's painful. Recently, I think about a week ago, I was trying to move a Keynote into a PowerPoint context, and you would think in 2025 we would have gotten past that, but we haven't. Do you think this is something that is either used at some level, or could be used, in terms of trying to capture that kind of complex structure and get it into a different format without losing the gist of what the communication was? Am I on target there?
Daniel Whitenack
Yeah, yeah, I think the limitation is in how rich that description is. Right? You might get these labels like heading, title, paragraph, table, et cetera. But ultimately, if you were to need to reconstruct that, you have to decide how you are going to render a table and how you are going to render a title, which may be very different from the original. Let's say it's a Keynote presentation and you're putting it in Google Slides or something like that. So I think that rendering piece is still quite a challenging one. Maybe this is a generalization, because we've actually used Docling models in other ways than what I'm about to say, but one of the very frequent uses of these models is for the processing of documents that are feeding into, let's say, a RAG, a retrieval-augmented generation, pipeline. Why would that be? It's because the cleaner and more context-relevant you can make those chunks of text going into your RAG system, the better results you're going to get in the responses from the RAG system. If you're just using OCR to process documents that have some complex structure, all the text might get jumbled up, and thus the knowledge and the context in the document is jumbled up. Even though all the pieces are there, they might be out of order or something like that. In the case of RAG, you actually don't need to render anything; you just need to parse it well and preserve the structure. Right. So I think Docling or these document structure models are a really good way to do that document processing for input to RAG pipelines, because there you probably just need things to be represented well in Markdown or some similar text format, not in a cool PDF that you recreate or something like that.
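To illustrate why preserved structure helps a RAG pipeline, here is a minimal sketch that splits structure-preserving Markdown into one chunk per heading, so each chunk carries its own section context. The function `chunk_by_heading` and the sample document are hypothetical; production chunkers typically also handle nested headings, tables, and size limits.

```python
from typing import Dict, List


def chunk_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into chunks, starting a new chunk at every heading line."""
    chunks: List[Dict[str, str]] = []
    current = {"heading": "", "text": ""}
    for line in markdown.splitlines():
        if line.startswith("#"):
            # Close the previous chunk if it has any content.
            if current["heading"] or current["text"].strip():
                chunks.append(current)
            current = {"heading": line.lstrip("#").strip(), "text": ""}
        else:
            current["text"] += line + "\n"
    if current["heading"] or current["text"].strip():
        chunks.append(current)
    for chunk in chunks:
        chunk["text"] = chunk["text"].strip()
    return chunks


doc = (
    "# Policy\n"
    "Intro text.\n"
    "## Scope\n"
    "Applies to all staff.\n"
    "## Exceptions\n"
    "None."
)

chunks = chunk_by_heading(doc)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

With jumbled OCR output there are no heading lines to split on, so the same text would land in arbitrary chunks and a retriever would have less context to match against.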
Chris Benson
Yeah, you know, I'm just thinking, it wasn't too far back, a year, a year and a half, that RAG was all new, and now it is so embedded into our workflows at lots and lots of organizations out there.
Daniel Whitenack
Yes.
Chris Benson
And I wonder how many people out there are using Docling in that capacity, as the input to that workflow. Having the contextual aspect of the information saved structurally in that way, yeah, I agree with you. That makes perfect sense intuitively, that you would definitely have a RAG system able to give you better answers on that.
Daniel Whitenack
Yeah.
Chris Benson
Do you, have you seen that in that use case much out there or is that very much one off? What's your gut feeling about that?
Daniel Whitenack
Yeah, definitely. I would say in particular toolkits like Docling, the toolkit, and there are other ones like MarkItDown, which I think is a toolkit from Microsoft. We've used those over and over in RAG systems, and I know other people do as well. Certainly people also use vision models, which we'll talk about here in a second. But I would say again, in the RAG system you want to preserve that structure. You don't want things out of order, but you really don't care how they're rendered necessarily. You just need to preserve the structure and ordering. And so that works out really well for RAG systems.
Sponsor Representative
So most design tools lock you behind a paywall before you can do anything real. And Framer, our sponsor, flips that script with Design Pages. You get a full-featured professional design experience, from vector workflows to 3D transforms to image exporting, and it's all completely free. And for the uninitiated, Framer has already built the fastest way to publish beautiful, production-ready websites. And now it is redefining how we design for the web. With their recent launch of Design Pages, which is a free canvas-based design tool, Framer is more than a site builder. It is a true all-in-one design platform. From social media assets to campaign visuals to vectors to icons, all the way down to a live site, Framer is where ideas go live, start to finish. So if you're ready to design, iterate and publish all in one tool, start creating for free today at framer.com/design and use our code practical AI for a free month of Framer Pro. Again, framer.com/design.
Daniel Whitenack
All right, Chris. Well, there are a couple of variations on the next types of models. Maybe it would be helpful to talk about language vision models, or vision language models, first, and then talk about DeepSeek-OCR, which is a different kind of animal, as it's not OCR like we talked about before and it's not a vision model like we're about to talk about. But the vision model is actually more similar to the LLM than to the OCR model, I think. So a language vision model means that the input to the model can actually be an image and a text prompt. This is often how it works if you go into a multimodal chat thing and you upload an image and say, hey, what's going on in here? Who is this in this photo, or what product is this in this photo? All of those sorts of things. So you want to ask about the image, or you want to give it as extra context to the language model. The language vision model takes an image and/or text, and then the output is similar to the large language model in the sense that it's just going to output a stream of probable tokens. So in one sense, this is not document processing, but it could be used for that. It doesn't have to be used for it. It could be used just to enhance the chat experience or to have a multimodal experience, or to reason over images, right, or even to classify images. It's a general-purpose reasoner over images. And what happens is you take a large language model and you add a vision transformer into the mix. The vision transformer takes the image and converts it into an embedding, the transformer piece of the LLM takes your text and converts it into an embedding, and then you smash both of those embeddings together into a vision-plus-text embedding. 
And that's what's used to generate the probability of the tokens on the output. So again, image or text coming in, text going out the other end. And where this plugs into document processing is I could upload an image of a document, right, and just say as my prompt, hey, reconstruct this table in this image. And maybe that works. It actually works quite well, depending of course on the model and what image you put in and that sort of thing.
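A toy sketch of that "smash the embeddings together" step follows. It assumes the fusion is a simple concatenation of image-patch embeddings and text-token embeddings into one sequence, which is one common design but not the only one. Every function, weight, and number here is hypothetical, and the mean pooling stands in for the attention layers a real model would use.

```python
import math
from typing import List

Vector = List[float]


def embed_image_patches(patches: List[List[float]]) -> List[Vector]:
    """Stand-in for a vision transformer: one 2-d embedding per image patch."""
    return [[sum(p) / len(p), max(p)] for p in patches]


# Toy text embedding table; a real model learns these during training.
TOKEN_EMBEDDINGS = {"describe": [0.2, 0.1], "table": [0.9, 0.4]}


def embed_text(tokens: List[str]) -> List[Vector]:
    """Look up one embedding per text token."""
    return [TOKEN_EMBEDDINGS[t] for t in tokens]


def fuse(image_embs: List[Vector], text_embs: List[Vector]) -> List[Vector]:
    """Assumed fusion: concatenate along the sequence axis."""
    return image_embs + text_embs


def next_token_probs(sequence: List[Vector], output_weights: List[Vector]) -> List[float]:
    """Mean-pool the fused sequence, project to logits, then apply softmax."""
    dim = len(sequence[0])
    pooled = [sum(v[i] for v in sequence) / len(sequence) for i in range(dim)]
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in output_weights]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


patches = [[0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]  # three fake image patches
tokens = ["describe", "table"]                  # the text prompt
OUTPUT_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3-token toy vocabulary

fused = fuse(embed_image_patches(patches), embed_text(tokens))
probs = next_token_probs(fused, OUTPUT_WEIGHTS)
print(len(fused), probs)
```

The point of the sketch is only the data flow: both modalities end up in one sequence of vectors, and the output is a probability distribution over text tokens, with no explicit link back to regions of the image.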
Chris Benson
I'm kind of curious as you're kind of going through that fusion process between the text and the image thing, do you have any insight on whether those operate kind of in parallel and come together at some point or like how, how that fusion, how that fusion is able to generate the better outcome? Is it one of those things we just know it does, or do you have any.
Daniel Whitenack
I think the key thing here, at least in my understanding, and our listeners can correct me if I'm spewing nonsense here, is that yes, there are these two pieces. The image input goes through the vision transformer, the text goes through different layers of a transformer network, those embeddings are generated, they're smashed together, but that whole system is jointly trained together towards the output. Right? So it's not like you train the one and then you train the other one and then you hope they work well together. It's kind of like you join them together at the hip to start with. You train the whole system on many, many, many of these kinds of inputs and outputs. And obviously it's not interpretable in the sense of knowing how or why it outputs certain things, but it is able to recreate that probable output. And that would be what I'd say is a major contrast with something like using Docling plus OCR, because then you actually do get a human observable structure of a document out, and text corresponding to that. With the language vision model, you toss an image and text in and text comes out, and there's no real interpretable connection between the structure or content of that text on the output and any region in the images or specific characters in the images. It's all just related via the semantics of those embeddings, not any sort of structure or anything like that.
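The "joined at the hip" idea of joint training can be illustrated with a deliberately tiny sketch. The assumptions are heavy: each "encoder" is reduced to a single scalar weight, the fusion is a simple sum, and the data is made up for illustration. What it does show is that one loss on the final output drives the updates to both pieces in the same step, rather than each piece being trained separately. The function name `train_jointly` and the example values are hypothetical.

```python
def train_jointly(examples, learning_rate=0.05, steps=2000):
    # One scalar weight per toy "encoder" (hypothetical stand-ins).
    w_image, w_text = 0.0, 0.0
    n = len(examples)
    for _ in range(steps):
        grad_image = 0.0
        grad_text = 0.0
        for image_feature, text_feature, target in examples:
            fused = w_image * image_feature + w_text * text_feature
            error = fused - target
            # Gradients of the mean squared error on the fused output.
            grad_image += 2 * error * image_feature / n
            grad_text += 2 * error * text_feature / n
        # Both weights are updated from the same output loss, in the same step.
        w_image -= learning_rate * grad_image
        w_text -= learning_rate * grad_text
    return w_image, w_text

# Made-up data where target = 2 * image_feature - 1 * text_feature.
examples = [(1.0, 0.0, 2.0), (0.0, 1.0, -1.0), (1.0, 1.0, 1.0), (2.0, 1.0, 3.0)]
w_image, w_text = train_jointly(examples)
print(round(w_image, 3), round(w_text, 3))
```

Neither weight is ever given its own separate objective here, which mirrors the point that the two halves only have to make sense together.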
Chris Benson
It's fascinating. I'm just kind of, once again, thinking back over the whole conversation and the maturity that's evolving in this capability. And so I guess as we've kind of hit that point, what's the next step in it? Where do you see things going?
Daniel Whitenack
Well, I think a next step, and it might not be where everything is going, is kind of represented by what DeepSeek has done with DeepSeek OCR. So there are many language vision models or vision language models. I've heard it both ways. There's the one we use, the Qwen2.5 vision language model. We've used that one quite a bit. Really great model. I mean, the reality is the best of these are all coming out of China, at least at the moment of this recording, in terms of the vision language model side of things. So there have been these models over time, but they have limitations in the sense that most of these vision language models still assume a fixed resolution of the input image. And they still require huge training data sets and that sort of thing. But I think one of the main limitations is this fixed resolution size, right? So no matter the size of your document, how it's structured, all of that, you're going to get this fixed resolution, which often does create problems. And so what DeepSeek OCR has done is that they have a different processing pipeline. It doesn't take the whole image as a whole image. What happens is it takes the input image and splits it apart into these image tokens, if you will. So small vision tokens that are kept at their higher resolution, and they're combined with the whole image, right? Or a global full resolution view, I think they call it. So you get this global page plus these tiles, and each of these tiles, these vision tokens, are smashed together with the global page.
And the idea is that it's a way of representing this image or this document in a compact token sequence where you are not limited by the resolution of your document. And what that means is that DeepSeek OCR, at least in terms of how it seems right now, does a good job at preserving certain shapes of characters, line breaks, alignments, very tiny mathematical equations or notation, right? You get sort of little dots or a caret above mathematical notation. And so really DeepSeek OCR is taking some of these ideas to the next level and preserving a lot of that information from the larger document in these full resolution tiles, which can then be processed through the model.
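The "global page plus tiles" idea can be sketched with nested lists standing in for an image. This is only an illustration of the tiling concept as described in the conversation, not DeepSeek's actual pipeline: the page size, the tile size, and the choice to build the global view by simple downsampling are all assumptions of this sketch, and the helper names (`make_page`, `split_into_tiles`, `downsample`) are hypothetical.

```python
def make_page(height, width):
    # A fake grayscale "page": each pixel value is derived from its position.
    return [[(r * width + c) % 256 for c in range(width)] for r in range(height)]

def split_into_tiles(page, tile_size):
    # Cut the page into tiles that keep the original pixels untouched,
    # and remember each tile's grid position so ordering is not lost.
    height, width = len(page), len(page[0])
    tiles = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            pixels = [row[left:left + tile_size] for row in page[top:top + tile_size]]
            tiles.append({"row": top // tile_size, "col": left // tile_size,
                          "pixels": pixels})
    return tiles

def downsample(page, step):
    # Assumed global view: keep every step-th pixel in each direction.
    return [row[::step] for row in page[::step]]

page = make_page(8, 12)
tiles = split_into_tiles(page, 4)
global_view = downsample(page, 4)

print(len(tiles))                               # 2 rows x 3 columns of tiles = 6
print(len(global_view), len(global_view[0]))    # 2 3
```

Each tile carries its `row` and `col`, so the fine detail inside a tile and its place on the page are both available, which is the property the discussion is pointing at.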
Chris Benson
Could you talk a little bit about like when we're talking about resolution, like what kind of level set, what does resolution mean in this context as we're talking about, you know, specific resolutions and then a multi resolution thing. Can you kind of clarify what that is?
Daniel Whitenack
Yeah, yeah. So just reducing it to thinking about a single page, right? If I have a single page of a document and I represent that as an image, it might be however many pixels by however many pixels, right? Let's say 1000 pixels by 1500 pixels. But in a vision language model, typically, regardless of what image you input, it's going to resize it to whatever, 256 by 256. And if you imagine taking that larger page and smashing it down into 256 by 256, you're going to lose little handwriting or diagrams or code or equations or little tiny fonts or footnotes, et cetera, all of that stuff. And so what DeepSeek is saying is, well, let's not lose all of that context, but let's also not have to keep everything in the same resolution. Let's tile this image. And now each tile keeps the original resolution of the document. But we also don't lose the ordering or the context of where that tile fits, because we have the global view of the page. And so it's kind of like when we put text into a transformer, we actually don't lose the ordering often either. Right. We understand where text is related to other text. And this is kind of a similar concept. We're not losing any of the resolution, but we're also not losing the structure of where these tiles are placed, if you will.
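The numbers in this example can be worked through directly. The sketch below uses the 1000 by 1500 page and the 256 by 256 target from the conversation; the 10 by 12 pixel footnote character and the 256-pixel tile size are assumptions added purely for illustration, and both function names are hypothetical.

```python
import math

def resized_feature_size(page_w, page_h, target, feature_w, feature_h):
    # Size of a small feature after the whole page is squeezed to target x target.
    return feature_w * target / page_w, feature_h * target / page_h

def tile_grid(page_w, page_h, tile):
    # Number of tile columns and rows needed to cover the page at full resolution.
    return math.ceil(page_w / tile), math.ceil(page_h / tile)

# A hypothetical 10 x 12 pixel footnote character on a 1000 x 1500 page.
w, h = resized_feature_size(1000, 1500, 256, 10, 12)
print(w, h)  # roughly 2.56 by 2.05 pixels after resizing

cols, rows = tile_grid(1000, 1500, 256)
print(cols, rows, cols * rows)  # 4 6 24
```

A character that shrinks to about two and a half pixels wide is effectively unreadable, while the same page covered by a 4 by 6 grid of tiles keeps that character at its original 10 by 12 pixels inside whichever tile contains it.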
Chris Benson
That makes perfect sense. And so it's kind of the natural progression, you know, if we're going back a few years and, and talking about the way convolutional neural networks are working and the fact that you were constantly having to go to, you know, reduce the size down, but that created problems in terms of you were doing analysis of what was in the pictures, you know, identification of, of whatever. And, and the lack of resolution could sometimes make that a challenge. And this, this solves that in a particular way.
Daniel Whitenack
Yeah, yeah. The last generation, or, I don't know, current or last, I don't know what generation we're in, the bulk of vision language models at the moment do not solve that because they still force this kind of fixed resolution. Now, at the same time, DeepSeek OCR is also a larger model. It does require GPUs to run. But this is only the first generation of these. Similar to vision language models and large language models, I'm sure there will be a gradual shrinking of these models at higher performance as more and more people train them. And who knows if this is the, quote, right approach to go down, but it is interesting. One of the things I find interesting here, Chris, is we talk a lot about large language models, and for the most part, they all operate the exact same way. And we've been talking about them operating the exact same way for some time. But if you look at the progression of these models, these multimodal models, as we've gone through this conversation, they all do operate in quite different ways. And so, to your point at the beginning, from my perspective, maybe from a nerdy perspective, document processing is very much not boring because there's actually such a diversity and such innovation going on here, with much more diversity on the model side and the technical side than what you see in large language models.
Chris Benson
And not only that, but our listeners have come through this with us. This is probably not something most of them have been hitting on lately. And so not only have they earned, if they're in the US at least, their Thanksgiving meal for tomorrow by the time they've done this, but maybe coming out of the holidays they can go back into the office and kind of give an upgrade to their RAG system and be wizards at how effective RAG is being for their organization. Because I definitely learned a bit along the way here about that. I have a whole bunch of use cases in mind now. I'm thinking, oh gosh, we can go back and do this and that and the other. So fantastic explanation of these different approaches and kind of the timeline about how they developed. So thanks for doing that.
Daniel Whitenack
Yeah, of course. And Happy Thanksgiving again, Chris. Happy Thanksgiving to all our listeners. Hope you enjoy your tofurkey.
Chris Benson
There you go. And even if you're outside the U.S., we are thankful for you listening in. Looking forward, we hope that whatever holidays you celebrate over the next few months are very good.
Podcast Host
Alright, that's our show for this week. If you haven't checked out our website, head to PracticalAI.fm and be sure to connect with us on LinkedIn, X, or Bluesky. You'll see us posting insights related to the latest AI developments and we would love for you to join the conversation. Thanks to our partner Prediction Guard for providing operational support for the show. Check them out at predictionguard.com. Also, thanks to Breakmaster Cylinder for the beats and to you for listening. That's all for now, but you'll hear from us again next week.
Episode Title: Technical advances in document understanding
Release Date: December 2, 2025
Hosts: Daniel Whitenack & Chris Benson
Theme: Exploring the evolving landscape of AI-powered document understanding and processing technologies, focusing on practical, real-world advances and implementations.
In this “Fully Connected” episode, Daniel and Chris take a deep dive into the technical advances in document understanding. They explore the history, current state, and recent breakthroughs in document processing AI—discussing classical OCR, document structure models (like Docling), vision-language models, and the innovative DeepSeek OCR. The conversation balances technical explanation with practical advice, highlighting use cases, trade-offs, and implications for business workflows and AI practitioners.
On practical focus:
“We pride ourselves on bringing that practical, productive, and accessible approach… the difference in the conversations we have on this show is we’re really focused on getting people into this technology so that they can use it day to day in a fun way.” — Chris (06:39)
On the user experience:
“The cleaner and more context relevant you can make those chunks of text into your RAG system, the better results you’re going to get in the responses.” — Daniel (29:26)
On innovation:
“From my perspective, maybe from a nerdy perspective, document processing is very much not boring because there’s actually such a diversity and such innovation going on here.” — Daniel (47:02)
On motivation to learn:
“Maybe coming out of the holidays they can go back into the office and give an upgrade to their RAG system and be wizards at how effective RAG is being for their organization.” — Chris (47:23)
| Timestamp | Topic/Segment |
|-----------|-------------------------------------------------|
| 04:37 | Why Document Processing Still Matters |
| 07:45 | Brief history & evolution of OCR |
| 11:12 | Classical OCR Processing Pipeline Explained |
| 20:26 | Document Structure Models (Docling) |
| 22:25 | Structure vs OCR, JSON/Markdown outputs |
| 27:07 | When & why to use structure models (trade-offs) |
| 29:20 | Structure models in RAG systems |
| 34:18 | Vision-Language Models Explained |
| 35:05 | Multimodal input/output, use-cases |
| 39:51 | DeepSeek OCR and advances in resolution |
| 44:03 | Technical method: tiling and high-res context |
| 47:02 | Diversity & innovation in document models |
| 47:23 | Real-world implications and use cases |
For further learning or practical application, check out the toolkits and models mentioned in the episode, including Docling, the Qwen2.5 vision language model, and DeepSeek OCR.
Stay practical, stay curious, and consider upgrading your document workflows with these modern advances!