A
Welcome to another snackable episode of the Business Lunch Podcast. Normally, it's me, Roland Frasier, and my business partner, Ryan Deiss. But these snackable episodes let me share research I've been doing in a format you can actually listen to with the help of AI. So here's today's episode on giving AI eyes beyond video transcripts. Let's get into it.
B
You know, it is. It's really funny how easily we just accept a little bit of magic in our daily workflows.
A
Oh, absolutely. We do it all the time.
B
Right. Like you paste a YouTube link into one of those, you know, AI video summarizers, you hit enter, and a few seconds later, boom, you get a neat little bulleted list of everything that happened.
A
Yeah. And it feels completely seamless.
B
It does. And you probably sit there and think, wow, the AI actually watched that entire video for me. But, well, I am here to completely shatter that illusion for you today.
A
It's a tough pill to swallow, but,
B
yeah, because it didn't watch a single frame. I mean, it didn't see the charts, it didn't observe the presenter's facial expressions, and it definitely didn't read the dense text on the screen.
A
Not at all.
B
All it did was quietly pull the transcript and just, you know, speed read the words.
A
Which is, to be fair, a really brilliant parlor trick.
B
Oh, yeah, for sure.
A
And for a long time, I mean, we have all been perfectly happy to fall for it, but when you really think about it, relying purely on a transcript, it creates a massive blind spot.
B
A huge one.
A
Right. Because for a vast majority of the content out there, you know, software tutorials, complex product demos, heavily researched keynote presentations, well, the spoken words are essentially just the background music.
B
Just the background music. I like that.
A
Yeah. Because the actual data, the undeniable proof, the real substance, all of that is
B
communicated visually, which is exactly why you are here with us today. Welcome in. If you are someone who, you know, wants to understand how things actually work under the hood and really how to get the most out of these AI tools without just falling for the marketing hype, this is exactly the place to be.
A
We're glad to have you.
B
Today we are exploring this really fascinating experiment outlined in an article from Crazy Experiments. And the authors of this piece, they set out to literally build, well, eyes for an AI, specifically Anthropic's Claude. Right, Claude. So it can actually process what is taking place on your screen rather than just what is being spoken into a microphone. Okay, let's unpack this, because the mechanics of how they solve this inherent blindness are incredibly clever.
A
They really are. Yeah. And to understand why this experiment is even necessary in the first place, you have to look at a fundamental quirk in Claude's architecture. Okay, Claude cannot natively take video as an input. It just can't. You can hand it an image file, you can hand it, you know, a massive text document, but if you try to hand it a standard MP4 video file, it just hits a brick wall.
B
It just doesn't know what to do with it.
A
Exactly. The architecture simply isn't built to ingest moving pictures. Now contrast that with Google's Gemini, which actually has this native video door built right into its foundation.
B
Right. Gemini can just take it.
A
Yeah, Gemini can take video frames and audio straight into its context window. Claude just doesn't have that native doorway.
B
And I imagine that is incredibly frustrating if you are someone who, you know, practically lives inside Claude's ecosystem all day long.
A
Oh, absolutely.
B
Specifically, if you are using Claude Code, their command-line tool, asking it to just, hey, watch this video and tell me what is on screen, it's an impossible request. So if the doorway doesn't exist, you basically have to build one yourself. And the authors of this experiment did exactly that. They stitched together four completely free open source pieces of software to create a custom Claude Code skill, and they called it Claude Video.
A
And what's fascinating here is the sheer elegance of the workaround. Because the developers, they didn't attempt to like, rewrite Claude's neural network to magically understand a video file format.
B
Right. That would be way too complex.
A
Way too complex. Instead, they approached it as a translation problem. They convert a format Claude inherently cannot understand, which is video, into a format it excels at understanding, which is text and images. Exactly. A highly structured stack of timestamped images and text.
B
You know, it's, it's like listening to a sports broadcast on the radio versus actually watching the game on television.
A
That's a great way to put it.
B
Right. On the radio, you hear the announcer say the score, you know, the quarterback threw the ball. But you completely miss the gravity-defying catch in the end zone or, you know, the specific defensive formation that allowed it to happen in the first place.
A
You missed the context.
B
Exactly. You are getting the narrative spine, but you are entirely missing the physical reality. So how do we get that full television broadcast into Claude's text-based brain? Well, the first hurdle is actually getting the file, right? They use this command-line utility called yt-dlp. It reaches out, downloads the video, and crucially, it grabs the native YouTube captions for free, assuming the creator actually uploaded them.
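A minimal sketch of that download step, assuming yt-dlp is installed and on the PATH. The function name, output directory, and choice of English VTT captions are illustrative assumptions, not the Claude Video skill's actual code; the flags themselves are standard yt-dlp options.

```python
def build_ytdlp_command(url: str, out_dir: str = "work") -> list[str]:
    """Build a yt-dlp command that downloads a video and any uploaded captions."""
    return [
        "yt-dlp",
        "--write-subs",          # save creator-uploaded captions when they exist
        "--sub-langs", "en",     # assumption: English captions are wanted
        "--sub-format", "vtt",   # WebVTT keeps per-line timestamps
        "-o", f"{out_dir}/video.%(ext)s",  # yt-dlp output template
        url,
    ]


# Usage (requires network access and yt-dlp, so it is left commented out):
# import subprocess
# subprocess.run(build_ytdlp_command("https://www.youtube.com/watch?v=..."), check=True)
```

yt-dlp also has a `--write-auto-subs` flag for YouTube's auto-generated captions; whether to trust those or fall back to a separate transcription pass is a design choice.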
A
But that's the catch, right? Relying on native captions is a huge gamble because a lot of creators just do not upload them.
B
Yeah, they just rely on the auto generated ones.
A
Exactly. So the tool needs a robust fallback. And this is where they introduce FFmpeg, which is, well, a legendary open source media processor.
B
It's everywhere.
A
It really is. The script uses FFmpeg to rip the audio track from the video, condensing it down into this tiny mono audio clip.
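As a rough illustration of that audio step, the command below drops the video stream and downmixes to a small mono file. The 16 kHz sample rate and 32 kbit/s bitrate are assumptions chosen for speech, not values taken from the episode, and the function name is hypothetical.

```python
def build_audio_command(video_path: str, audio_path: str) -> list[str]:
    """Build an FFmpeg command that extracts a compact mono audio track."""
    return [
        "ffmpeg", "-y",        # -y: overwrite the output file if it exists
        "-i", video_path,
        "-vn",                 # ignore the video stream
        "-ac", "1",            # downmix to a single (mono) channel
        "-ar", "16000",        # assumed 16 kHz sample rate, enough for speech
        "-b:a", "32k",         # assumed low bitrate to keep the upload small
        audio_path,
    ]


# Usage (requires FFmpeg on the PATH):
# import subprocess
# subprocess.run(build_audio_command("work/video.mp4", "work/audio.mp3"), check=True)
```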
B
Okay. And if those native captions don't exist, we move to the next layer of the translation. The script takes that stripped down mono audio clip and feeds it to an AI transcription tool called Whisper.
A
Right.
B
And I'm looking at the economics of this, and it is almost unbelievably cheap.
A
It's pennies.
B
Not even. They route the audio through Groq's Whisper Large v3 model, which costs roughly, get this, a fraction of a penny per minute. That's around $0.019. Or they can use OpenAI's whisper-1 model at less than a cent per minute. But again, if the YouTube video already has captions, this entire transcription layer gets bypassed, meaning the text extraction costs absolutely $0, which is amazing.
A
But once the text is secured, we still have to solve the visual problem, right?
B
The eyes.
A
Exactly. So the script calls on FFmpeg again, but this time it commands the software to mechanically extract individual frames from the video and save them as standard JPEG images.
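A hedged sketch of what such a frame-extraction call could look like. The `fps` and `scale` filters are standard FFmpeg; the function name, the file-name pattern, and the JPEG quality value are assumptions for illustration.

```python
def build_frames_command(
    video_path: str, out_dir: str, fps: float = 1.0, width: int = 512
) -> list[str]:
    """Build an FFmpeg command that samples frames and writes numbered JPEGs."""
    # fps=N samples N frames per second; scale=W:-2 resizes to width W and
    # picks an even height that preserves the aspect ratio.
    video_filter = f"fps={fps},scale={width}:-2"
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vf", video_filter,
        "-q:v", "3",                      # assumed JPEG quality (lower is better)
        f"{out_dir}/frame_%04d.jpg",      # frame_0001.jpg, frame_0002.jpg, ...
    ]
```

With `fps=1.0`, frame number n corresponds roughly to second n - 1 of the video, which is what lets each image be matched to a transcript timestamp later.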
B
So now you just have a folder full of JPEGs and a long text file of the transcript.
A
Yep.
B
How does Claude actually make sense of that? Like, how does it know which picture goes with which sentence?
A
And that is where the clever prompting comes in. The Claude Video skill physically lines them up in the prompt that's being sent to Claude.
B
Oh, I see.
A
Yeah. It takes a block of text from the transcript, it looks at the timestamp, and it inserts the corresponding JPEG image file directly underneath it in the context window.
B
That's so smart.
A
It repeats this process over and over, essentially creating this synchronized multimedia flipbook. Claude reads the text, looks at the picture from that exact moment, reads the next text, looks at the next picture.
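The flipbook idea can be sketched in a few lines. This is an illustration under stated assumptions, not the skill's real prompt builder: the `Segment` and `Frame` types, the block dictionaries, and the rule that a frame belongs to the segment whose time span contains it are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    end: float
    text: str


@dataclass
class Frame:
    timestamp: float  # seconds from the start of the video
    path: str


def fmt(seconds: float) -> str:
    """Format seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def interleave(segments: list[Segment], frames: list[Frame]) -> list[dict]:
    """Place each transcript segment before the frames captured during it."""
    blocks: list[dict] = []
    remaining = sorted(frames, key=lambda f: f.timestamp)
    for seg in sorted(segments, key=lambda s: s.start):
        blocks.append({"type": "text", "text": f"[{fmt(seg.start)}] {seg.text}"})
        # Attach every not-yet-used frame that falls before this segment ends.
        while remaining and remaining[0].timestamp < seg.end:
            frame = remaining.pop(0)
            blocks.append({"type": "image", "path": frame.path})
    # Frames after the last spoken segment still get included at the end.
    blocks.extend({"type": "image", "path": f.path} for f in remaining)
    return blocks
```

The resulting list alternates text and image blocks in time order, which is the shape a multimodal prompt needs so the model reads each sentence next to the matching screenshot.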
B
It's forcing the AI to process the visual evidence right alongside the spoken narrative.
A
Exactly.
B
Now, theory is great, and you know, a multimedia flipbook sounds wonderful on paper, but I always want to see the receipts.
A
Naturally.
B
Does this elaborate process actually capture anything that a basic text only transcript misses? So, to prove the value of this tool. The authors ran a head to head test.
A
Yeah, a real world test.
B
They used a 14-minute-and-45-second screen recording tutorial by a developer named Nate Herk, who is demonstrating how to give a powerful tool to Claude Code.
A
And we should point out a screen recording tutorial is arguably the absolute worst case scenario for a transcript only AI.
B
Oh, totally.
A
Think about the nature of a tutorial. The entire point of the video is the code being typed and the interface reacting on the screen. The narrator is usually just gesturing at the screen verbally saying things like as you can see here or let's click on this.
B
Right, which means nothing as text.
A
Exactly. The spoken words are entirely dependent on the visual context.
B
And we see this play out perfectly in the first round of their test. They isolated just the first 70 seconds of the tutorial. The host of the video asks Claude Code to pull some quote-unquote wins from his Skool community platform.
A
Okay.
B
When the authors fed only the transcript to the AI, it caught the names of the community members, Michael, Chris and Fernando. It also caught a spoken token cost: 260 tokens sent, 132,000 returned.
A
And that was it.
B
That was it. It missed absolutely everything else.
A
Wow.
B
Now I'm going to play devil's advocate for a second here.
A
Go for it.
B
If I am a busy person prepping for a meeting, isn't a transcript usually good enough? If I just want the gist of a video, like why do I need to know the exact file size of a JSON response or the specific dollar figures on a dashboard?
A
Well, if you're watching an entertainment vlog, sure, the gist is fine. But in a tutorial, a financial breakdown or a dashboard review, the narration is not the primary data.
B
Right.
A
The voiceover is acting as a tour guide, pointing at the real content. If you are a developer trying to replicate that environment or a business analyst trying to understand the financial stakes being presented, the gist is entirely useless.
B
It doesn't give you what you actually need to do the work.
A
Exactly. Relying on the transcript means you are actively choosing to ignore the primary data source.
B
And I am looking at the results from the frames-plus-transcript method for those exact same 70 seconds, and the difference is staggering.
A
It really is night and day.
B
The transcript alone simply said, here is Michael's win. But because Claude could see the synchronized frames, it pulled exactly what that win was. A $31,000 AI deal plus a retainer.
A
It's incredible.
B
The second win from Chrisatsu was a €16,000 deal. The third from Fernando Gomez was 3.2k plus 299k monthly recurring revenue for an agency in Malag.
A
And let me guess, not a single digit of that financial data was spoken out loud by the narrator.
B
Not a single digit. It completely flips the value of the summary from a vague overview to a precise data extraction.
A
And it gets even more granular with the technical details. You mentioned the token cost earlier. The host spoke those high level numbers, but the screen actually showed a massive complex breakdown, right? The frames allowed Claude to see that the response was a 529-kilobyte JSON file that it wrote on a 450-character cookie. And most importantly, it showed that only about 2,000 tokens actually hit the context window.
B
Wait, really? It saw all of that?
A
It saw all of it.
B
The tool even caught the user's exact operating environment purely from analyzing the visual interface. Like, it documented that they were using VS Code running Claude Code version 2.1.133 on the Opus 4.7 model featuring a 1 million context window running Claude Max.
A
It saw everything on the screen.
B
It saw the entire project file tree structured down the left side of the screen. And again, the narrator never uttered a single word of that.
A
This test perfectly illustrates the massive data gap we just blindly accept when we rely on standard summarizers. We are getting the sanitized headline, but we are completely missing the underlying receipts.
B
Okay, here's where it gets really interesting. What happens when the transcript actively misrepresents the core argument being presented?
A
Oh, this is a huge problem.
B
In round two of their test, they jumped ahead to the 4:16 mark of the video. The presenter puts up a slide comparing two different ways AI can interact with your computer's files. Specifically, it compares MCP servers, that is, the Model Context Protocol, a newer standard for AI tools, against a traditional CLI, or command-line interface. Okay, tracking. Out loud, the presenter says MCP used 35 times more tokens than the CLI and reliability drops from 100% to 72%.
A
I mean, if you were reading the transcript, that sounds like a solid, mathematically complete data point.
B
It really does. But it's like reading the abstract of a scientific paper and completely skipping the methodology section, where the actual proof lives. Yeah, because the slide on the screen showed the detailed receipts under a heading called schema bloat. The slide actually read: GitHub MCP = 43 tools = 30,000+ tokens before the agent does anything.
A
Wow.
B
The presenter never said that crucial data point out loud. He skipped right over the concrete numbers and just gave a sanitized ratio.
A
If we connect this to the bigger picture, this really touches on a fundamental human behavior and how we communicate in professional settings.
B
How so?
A
Well, when humans give presentations, we are terrified of reading our slides word for word.
B
Oh yeah, death by PowerPoint.
A
Exactly. It is the cardinal sin of public speaking. So we put the dense, concrete, citable data on the screen for the audience to absorb visually, and we speak a highly sanitized, high level summary that makes total sense. We say schema bloat and 35 times more, but we show 43 tools and 30,000 tokens. A transcript-only AI is fundamentally designed to capture that sanitized summary and permanently discard the concrete proof.
B
So we are basically weaponizing our own presentation advice against the AI.
A
Exactly.
B
We've trained humans to speak in summaries to keep the audience engaged, and now the AI is getting trapped by that very habit.
A
It's a fascinating paradox.
B
At this point, anyone listening to this is probably sold. Like, you want to give Claude eyes. But I am trying to figure out how this doesn't completely bankrupt you. If this is so incredibly powerful, why aren't we running every single video through this pipeline?
A
Well, we have to examine the hidden costs. And ironically, the true cost here is not financial.
B
It's not?
A
No. As we mentioned, the audio transcription is practically free and native captions cost nothing. The real budget you are spending here is context window tokens.
B
Ah, how does the math on that work? Like if I feed an hour long video to Claude, isn't it going to ingest thousands of images and immediately crash my context window?
A
That is the exact constraint the developers had to solve for. When you extract 80 frames from a video at the default 512-pixel width, you are eating up anywhere from 50,000 to 80,000 image tokens in your Claude context window.
B
Just for 80 frames?
A
Just for 80 frames. Images are incredibly token heavy compared to text, so the tool imposes hard limits. It is not magic vision that sees every single frame like a human eye.
B
Okay?
A
It mechanically samples the video at a maximum rate of 2 frames per second, and it enforces a hard cap of 100 frames total.
B
Wait, that means if you try to scan a full 45 minute keynote presentation with a 100 frame limit, the tool's gonna have to spread those frames out drastically to cover the runtime. You'll end up sampling, what, one frame every 27 seconds or so? You are almost guaranteed to miss the exact moment the speaker flashes that critical slide with the methodology on it.
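The arithmetic behind that 27-second figure is easy to check. The sketch below uses only the two limits quoted in the episode (at most 2 frames per second, at most 100 frames); the constant and function names are made up for illustration, and spreading the frames evenly is an assumption.

```python
MAX_FPS = 2.0      # sampling-rate ceiling quoted in the episode
MAX_FRAMES = 100   # hard frame cap quoted in the episode


def sampling_plan(duration_s: float) -> tuple[float, float]:
    """Return (frames per second, seconds between frames) for a clip.

    Assumes frames are spread evenly across the whole duration.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    fps = min(MAX_FPS, MAX_FRAMES / duration_s)
    return fps, 1.0 / fps


fps, gap = sampling_plan(45 * 60)   # a 45-minute keynote
print(f"{fps:.3f} fps, one frame every {gap:.0f} s")

fps, gap = sampling_plan(30)        # a 30-second window
print(f"{fps:.3f} fps, one frame every {gap:.1f} s")
```

A 45-minute video is 2,700 seconds, so 100 frames works out to one frame every 27 seconds, while a 30-second window can use the full 2 frames per second and still stay under the cap with 60 frames.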
A
Precisely. You cannot treat this as a blunt whole-video scanning tool. The solution is treating it as a precision surgical instrument. I like that. You use the cheap text transcript to find the general area of interest. Say you notice the speaker starts talking about schema bloat at the two-minute mark. Right. Then you tell the tool to densely sample a very specific narrow window, maybe from 2 minutes and 15 seconds to 2 minutes and 45 seconds, where, you know, the crucial visual breakdown happens.
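One way to picture that "transcript first, frames second" workflow is a keyword search over timestamped transcript segments that returns a padded time window. Everything here is a hypothetical sketch: the tuple layout, the 15-second padding, and the function name are assumptions, not part of the tool described in the episode.

```python
from typing import Optional


def find_window(
    segments: list[tuple[float, float, str]],
    keyword: str,
    pad_s: float = 15.0,
) -> Optional[tuple[float, float]]:
    """Return a (start, end) window in seconds around the first keyword mention.

    Each segment is (start_s, end_s, text). Returns None if nothing matches.
    """
    needle = keyword.lower()
    for start, end, text in segments:
        if needle in text.lower():
            # Pad both sides so the slide is caught even if it appears
            # slightly before or after the words are spoken.
            return max(0.0, start - pad_s), end + pad_s
    return None


transcript = [
    (0.0, 10.0, "Welcome back to the channel."),
    (120.0, 130.0, "The real problem here is schema bloat."),
    (130.0, 140.0, "Let me show you the numbers."),
]
print(find_window(transcript, "schema bloat"))
```

The returned start and end times could then be handed to FFmpeg's `-ss` and `-to` options so that only that slice of the video is sampled at a high frame rate.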
B
Okay, that makes a lot of sense. But there is also the issue of resolution trade offs, right?
A
Yeah, definitely.
B
The tool defaults to capturing frames at 512 pixels wide, which helps keep that massive token cost somewhat manageable. But at 512 pixels, trying to read tiny, dense code on a complex IDE screen is, well, a coin flip.
A
It really is.
B
The text can become a total blur. Now, you can command the tool to bump the resolution up to 1024 pixels so it can clearly read the fine print. But doing that literally quadruples the token cost for every single frame you capture.
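The "quadruples" claim follows from simple geometry if image token cost grows with pixel count, which is the assumption made in this sketch: doubling the 512-pixel default width at a fixed aspect ratio doubles the height too, so the area grows fourfold. The token figures plugged in are the episode's own 50,000 to 80,000 range for 80 frames; the function names are illustrative.

```python
def area_scale(old_width: int, new_width: int) -> float:
    """Pixel-area multiplier when width changes at a fixed aspect ratio."""
    return (new_width / old_width) ** 2


def scaled_tokens(tokens_at_old: int, old_width: int, new_width: int) -> int:
    """Estimate token cost at a new width, assuming cost tracks pixel area."""
    return round(tokens_at_old * area_scale(old_width, new_width))


print(area_scale(512, 1024))
low = scaled_tokens(50_000, 512, 1024)
high = scaled_tokens(80_000, 512, 1024)
print(f"80 frames at 1024 px: roughly {low:,} to {high:,} tokens")
```

Under that assumption, the same 80 frames would cost on the order of 200,000 to 320,000 tokens at the higher resolution, which is why high resolution pairs best with a narrow, targeted time window.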
A
And that adds up fast. But to combat that token bloat, the developers included a really brilliant little feature using the dedupe flag.
B
Dedupe. Deduplicate.
A
Exactly. This leverages an FFmpeg filter called mpdecimate. What it does is analyze the pixel differences between consecutive frames. Okay. So if you are watching a talking head interview or a screen recording where the user doesn't move their mouse or type anything for 10 seconds, the screen isn't fundamentally changing.
B
Right. It's basically the same picture.
A
Right. The tool recognizes that lack of movement and automatically drops those near identical consecutive frames. It flat out refuses to waste your precious tokens capturing 10 identical images of a static screen, which can cut your token costs by 30 to 50%.
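To make the idea concrete, here is a deliberately simplified illustration of dropping near-identical consecutive frames. It is not FFmpeg's actual algorithm: frames are modelled as flat lists of grayscale values, and the mean-absolute-difference metric and threshold are assumptions chosen for readability.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average per-pixel absolute difference between two equal-size frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def drop_near_duplicates(frames: list[list[int]], threshold: float = 2.0) -> list[int]:
    """Return the indices of frames that differ enough from the last kept frame."""
    kept: list[int] = []
    last_kept = None
    for index, frame in enumerate(frames):
        # Always keep the first frame; afterwards keep only visible changes.
        if last_kept is None or mean_abs_diff(frame, last_kept) > threshold:
            kept.append(index)
            last_kept = frame
    return kept


static_screen = [0, 0, 0, 0]
tiny_flicker = [0, 0, 0, 1]      # e.g. a blinking cursor
new_slide = [200, 200, 200, 200]
print(drop_near_duplicates([static_screen, tiny_flicker, new_slide, new_slide]))
```

Comparing against the last kept frame, rather than the immediately preceding one, means slow gradual changes eventually add up and trigger a new capture.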
B
Oh, wow. That is an incredibly elegant way to preserve the context window.
A
Yeah, it's very smart.
B
So now that we've covered the mechanics, the test results, and the token economy, I want to talk about the user experience. Because this Claude Video skill runs locally inside your terminal using Claude Code, the workflow feels completely different than using a standard web-based tool.
A
It does. The output doesn't just dead-end in a browser window. The video data becomes live in your terminal session. You have the full transcript and you have the visual context loaded directly into Claude's working memory, completely ready for you to interact with.
B
So what does this all mean for you, the user? Picture your terminal window right now. You type in the Claude Video command alongside a YouTube URL. Suddenly, instead of having to open a browser and sit through the video, your terminal downloads it, strips the frames, reads the text, and preps the data, all in the background. Yeah, it's like having a master sous chef in your kitchen. The sous chef perfectly prepares all your ingredients, you know, chopping the veggies, prepping the meat, so you, the head chef, can just step in and cook whatever query you need.
A
I love that analogy.
B
You can instantly pipe that live data and say, summarize the on screen steps into an actionable checklist, or extract every single dollar figure shown on that dashboard and format them into a markdown table.
A
You are fundamentally changing the nature of video here. You are turning it from a destination, something you passively have to sit through and watch, into a highly structured, queryable input.
B
That's a massive shift.
A
It is. But it is crucial to recognize that the best results come from the synergy of both streams. They cover each other's weaknesses.
B
How do you mean?
A
Well, a screenshot of a blank terminal window is completely useless without the transcript explaining what the host intends to type next.
B
Right.
A
Conversely, the transcript is useless without the frame showing the actual error code that popped up as the result of that action. It's
B
the perfect marriage of intent and result.
A
Exactly.
B
Now, just to manage expectations, there are scenarios where this specific tool will fail. Because the underlying script relies on yt-dlp to download the source file, it completely fails on gated content.
A
Right. That's a hard wall.
B
If a video requires a user login, if it is region locked, or if it is a private corporate video sitting on an internal server somewhere, the tool just cannot grab it.
A
Which leads to a really pragmatic framework for when you should actually deploy this skill, because you don't need it for everything. If your entire daily workflow already lives inside Google's ecosystem, you should honestly just use Gemini. It has native video ingestion built right in, so it is far less fiddly to set up.
B
Makes sense. And if you are just trying to pull the key arguments from an interview where, you know, two people are sitting in chairs talking, do not waste 80,000 image tokens. A cheap standard text transcript will give you everything you need. But if you do your heavy lifting in Claude and the video you are trying to understand actively shows you critical information like complex code structures, financial charts, or software interfaces, well, this skill is absolutely essential.
A
It effectively patches a silent failure that nobody in the AI space really wants to talk about. The immense danger of an AI confidently reporting on a reality it only half perceived.
B
We have covered so much ground today. We shattered the illusion that AI natively watches videos out of the box. We examined the massive quantifiable data gap between spoken summaries and visual reality, proving how you can miss $31,000 deals and complex software environments if you rely on text alone.
A
We really did.
B
We dug into the intricacies of the token economy, the mechanics of FFmpeg, the necessity of precision sampling, and ultimately, how to turn passive video into a structured, queryable knowledge base right in your terminal.
A
It is undeniably a massive leap forward for personal productivity. But looking at this experiment leaves me with a rather profound thought for you to consider as we wrap up.
B
Okay, let's hear it.
A
We are rapidly moving toward a world where AI will be able to perfectly perceive and index the entire visual world. Every terminal command you type, every microscopic piece of data on a slide you flash for just two seconds, every fleeting facial expression. If we know that our primary audience for this content might soon be an algorithm extracting raw data rather than a human absorbing a spoken narrative, how will that fundamentally change the way we design presentations, build software, or even communicate on camera?
B
That is a wild paradigm shift to consider. Are we going to start embedding, like, invisible metadata in our presentation slides specifically for the AI to read, knowing the human audience will never even see it?
A
It's very possible.
B
Fascinating stuff. Well, thank you so much for joining us on this deep dive. Keep questioning the tools you use, keep experimenting, and most importantly, keep learning. We'll catch you next time.
C
Hey business owners, I've got a quick question for you. Do you feel like you're missing the data you need to make strong business decisions? If so, it's probably time to build a CEO dashboard. It's an easy way to get everyone in your company literally on the same page, focusing on the numbers that matter. So The Scalable Company put together a free spreadsheet template that will give you everything you need to deploy your own dashboard. And to make it even easier, Ryan Deiss recorded a short training on how to use it. If you want to get your hands on the template, go to businesslunchpodcast.com/dashboard. That's businesslunchpodcast.com/dashboard, and you can download it for free.
Host: Roland Frasier
Date: June 2, 2026
This "snackable" episode explores a groundbreaking experiment: building “eyes” for AI, specifically for Anthropic’s Claude, allowing it to process not just the transcript, but also the visual data inside videos. Roland Frasier dives deep into the limitations of current AI video summarizers, the experimental tool Claude Video, and how giving AI access to synchronized visual evidence (frames) fundamentally changes what it can understand and extract from video content.
The episode exposes the hidden limitations in AI video summarization, reveals the clever engineering behind giving Claude the ability to “see,” and reframes video as an actively queryable knowledge source rather than a passive, linear medium. As AI tools evolve to perfectly parse not only what is said but also what is shown, Roland Frasier encourages listeners to rethink how we design, deliver, and document our digital content for both human and algorithmic audiences.