Summary

Podcast Summary: How I AI — Using Veo 3 to Create AI-Generated Music Videos (Tiny Desk Concert with Notorious B.I.G. and Kurt Cobain)

Host: Claire Vo
Guest: Anish Acharya (General Partner at Andreessen Horowitz, AI Consumer Investor)
Date: August 18, 2025
Episode Theme:
Exploring practical, creative, and accessible ways to use cutting-edge AI tools for personal projects, specifically focusing on generating music videos with AI and leveraging multimodal AI for cataloging collections.

Episode Overview

In this episode, Claire Vo and guest Anish Acharya dive into how AI has opened new avenues for creativity, centering on Anish’s process for creating a “Tiny Desk Concert” music video featuring artists who could never appear in one in real life, using AI tools like GPT-4o, Veo 3, Hedra, and multimodal models. They also explore practical consumer workflows, such as cataloging books or records from a video with Gemini Flash, and discuss how AI is redefining creative constraints and expanding possibilities for everyone.

Key Discussion Points & Insights

1. Remix Culture, Creativity, and AI in the Arts

AI as a Creative Multiplier:
- Anish discusses how remix culture–from mixtapes to hip-hop sampling–is the predecessor to today's AI-powered creativity. With modern AI, constraints are different, and creativity is amplified.
- Quote:
  "Sampling was the foundation of hip-hop, and I think AI is just the next manifestation of sampling—it'll be as important for music as hip-hop was." — Anish, [05:02]
AI Expanding Artistic Possibility:
- Both Claire and Anish emphasize that, rather than diminishing creativity, AI gives creators more tools and expands the scope of what’s possible in art, music, and writing.
- Quote:
  "It just gives me so much more tools, so much more breadth, so many more things I can play with and build. And so it really opens up this, like, creative artist side of me." — Claire, [04:32]

2. Workflow #1: Building an AI-Generated “Tiny Desk Concert” Video

a. The Project Concept and Motivation

Anish is inspired by the NPR Tiny Desk format and envisions resurrecting or combining artists like Notorious B.I.G. and Kurt Cobain for fictional performances via AI.
Uses current AI tools to achieve a respectful, creative result.

b. Step-by-Step Workflow Breakdown

| Step | Tool/Method | Insight or Quote |
|------|-------------|------------------|
| Generate Images | GPT-4o image generation | “I just ask it to generate an image of Kurt Cobain … playing a Tiny Desk concert.” – Anish [07:38] |
| Find Audio | YouTube for live audio; 4K Video Downloader | "I actually found a Biggie cover band playing live in Brooklyn, pulled that down from YouTube and extracted the actual vocals from Notorious B.I.G." – Anish [11:27] |
| Audio Processing | Adobe Audition (formerly Cool Edit Pro); Demucs for stem separation | "Demucs is this amazing technology that allows you to extract the vocals from any song." – Anish [15:36] |
| Video & Lip Sync | Hedra (upload a still and sync it to audio); alternatively Sync Labs | "Hedra is nice because it actually generates the video … and then also adds the audio." – Anish [09:35] |
| Editing and Stitching | Kapwing for video assembly | "Kapwing is so easy and so useful. Highly recommend it." – Anish [23:40] |
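
The audio and assembly steps in the table can also be scripted. The sketch below is not Anish's method (he used Adobe Audition and Kapwing); it swaps in the ffmpeg command-line tool and only builds the command lines without running them. The function names, file names, and the 15-second excerpt length are illustrative assumptions.

```python
def build_trim_command(source: str, start_s: float, duration_s: float, out_wav: str) -> list[str]:
    """Return an ffmpeg command that cuts an audio-only excerpt from a video."""
    return [
        "ffmpeg",
        "-ss", str(start_s),    # seek past the silent lead-in
        "-t", str(duration_s),  # keep only this many seconds
        "-i", source,
        "-vn",                  # drop the video stream, keep the audio
        out_wav,
    ]


def build_concat_list(clips: list[str]) -> str:
    """Return the text of an ffmpeg concat list file, one clip per line.

    Assumes the clip paths contain no single quotes.
    """
    return "".join(f"file '{clip}'\n" for clip in clips)


def build_concat_command(list_file: str, out_mp4: str) -> list[str]:
    """Return an ffmpeg command that joins the listed clips without re-encoding."""
    return [
        "ffmpeg",
        "-f", "concat",  # use the concat demuxer
        "-safe", "0",    # allow arbitrary paths in the list file
        "-i", list_file,
        "-c", "copy",    # stream copy; clips must share codec settings
        out_mp4,
    ]


# Hypothetical file names, for illustration only.
trim_cmd = build_trim_command("unplugged.mp4", 3, 15, "excerpt.wav")
clip_list = build_concat_list(["clip_01.mp4", "clip_02.mp4"])
concat_cmd = build_concat_command("clips.txt", "music_video.mp4")
```

Passing either command list to `subprocess.run(cmd, check=True)` would execute it, assuming ffmpeg is installed. Stream copy only works when every clip shares the same codec and resolution, which is likely but not guaranteed for clips from a single generator.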

Prompting Technique:
- Anish favors concise, open-ended prompts to let the AI explore creatively.
- Quote:
  "You’ve got to give the AI the space as well. If you overly constrain it, it just really struggles." — Anish [17:27]
Handling Technical Constraints:
- Accepts current short clip limitations as creative constraints that inspire new forms of art.
- Quote:
  “Once we actually got the technology to sample for more time, we actually got less creativity, I would argue. So I sort of love the constraints that the technology gives us today.” — Anish [14:31]

c. Demo & Reaction

Claire is emotionally moved by the quality and specificity of the generated video.
- Quote:
  “Something like this makes me almost want to cry… It always felt so inaccessible to get these amazing ideas that I had in my head into a thing.” — Claire [24:57]
Discussion on artifacts and limitations (e.g., AI-rendered cigarettes, duplicated characters), leading to unexpected, often delightful results.

3. Workflow #2: Cataloging Books and Records with Gemini Flash

a. Consumer-Focused Use Case

Video Instead of Images:
- Anish builds an app using Google AI Studio and Gemini Flash that catalogues his record and book collections by having users flip through their shelves on video.
- Quote:
  "I would have actually, I thought you were going to show us like you took a picture of it and you cataloged it. But this idea of a video and then extracting the frames, I just haven't changed my mental model to match these multimodal models..." — Claire [30:20]

b. Workflow Steps

Record a video flipping through a collection.
Use a prompt in Gemini Flash to extract and catalog book/album covers, author/artist names, and titles frame-by-frame.
Deploy the app via Cloud Run and shareable links.
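
One way to picture the cataloging steps above, as a minimal sketch rather than the app from the episode: pick timestamps at which to sample the shelf video, collect whatever the model reads from each frame, then merge duplicates, since the same cover shows up in several frames. `CatalogEntry`, `sample_timestamps`, and `merge_entries` are hypothetical names, the per-frame results are hand-written stand-ins for model output, and the real app may pass the whole video to Gemini Flash instead of individual frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    title: str
    creator: str  # author or artist


def sample_timestamps(duration_s: float, every_s: float) -> list[float]:
    """Timestamps (in seconds) at which to grab frames from the video."""
    if every_s <= 0:
        raise ValueError("every_s must be positive")
    count = int(duration_s // every_s) + 1
    return [round(i * every_s, 3) for i in range(count)]


def merge_entries(per_frame: list[list[CatalogEntry]]) -> list[CatalogEntry]:
    """Merge per-frame results, keeping the first spelling of each item.

    Two entries count as the same item when title and creator match
    after trimming whitespace and ignoring case.
    """
    seen: set[tuple[str, str]] = set()
    catalog: list[CatalogEntry] = []
    for frame_entries in per_frame:
        for entry in frame_entries:
            key = (entry.title.strip().lower(), entry.creator.strip().lower())
            if key not in seen:
                seen.add(key)
                catalog.append(entry)
    return catalog


# Stand-in for what a model might return for two overlapping frames.
frames = [
    [CatalogEntry("Nevermind", "Nirvana")],
    [CatalogEntry("nevermind ", "Nirvana"), CatalogEntry("Ready to Die", "The Notorious B.I.G.")],
]
catalog = merge_entries(frames)
```

Merging on a normalized key is a deliberately simple choice; a real collection would probably need fuzzier matching to cope with misread titles.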

Speed and Accessibility:
- Takes "15 minutes for a working demo" but requires more time for public deployment.
- Quote:
  "The era of personal software is upon us…” — Anish [33:22]

c. Applications & Vision

Enables regular users to build personal, hyper-custom tools.
Potential for expansion: kid’s book cataloging, fan fiction creation, educational content.

4. Bonus: Consumer AI Tools & Financial Planning with Comet

Comet Browser from Perplexity (AI-Powered Browser):
- Anish highlights Comet's ability to automate browsing and analyze personal finance dashboards.
- "The assistant feature in Comet makes every website dramatically more useful and it's been a big unlock for me." — Anish [37:04]
Personalized & Accessible AI for All:
- Discussion on how AI tools are becoming accessible to non-tech audiences (e.g., parents), and children’s intuitive use of AI for interactive learning and play.
- "You can just play with the technology instead of just being broadcast to from technology, which is really new.” — Anish [38:01]

Notable Quotes & Memorable Moments

"Sampling was the foundation of hip hop, and I think AI is just the next manifestation of sampling." — Anish [05:02]
"We become so attuned to what’s possible, we forget that this would be… witchcraft three years ago." — Anish [09:04]
"Give the AI the space as well…if you give it less constraints, sometimes it has unexpected results, but often they're unexpected, you know, delightful." — Anish [17:27]
"Something like this makes me almost want to cry… It always felt so inaccessible to get these amazing ideas that I had in my head into a thing." — Claire [24:57]
"The era of personal software is upon us." — Anish [33:22]
"My children form my consumer AI theses for me… my 6-year-old… put [Meta AI glasses] on his face and asked this personal AI a question." — Claire [39:49]
"Now everything kind of is [possible]." — Anish [40:56]

Timestamps for Important Segments

[03:42] – Why Anish got into AI for music and creativity
[06:09] – The Tiny Desk inspiration and working with AI-generated video
[09:35] – The key workflow: turning still frames and audio into video with Hedra & Sync Labs
[14:31] – The power of constraints in creative AI workflows
[19:00] – Using emotion and gesture in AI video generation
[23:40] – Using Kapwing for video editing
[24:57] – Claire’s emotional reaction to the AI-generated music video
[27:56] – Workflow #2: Using Gemini Flash for video-based cataloguing
[37:04] – How Comet AI browser boosts personal finance management
[38:01] – Ways AI will transform the consumer world and children’s creativity

Closing Thoughts

This episode showcases how practical, approachable, and inspiring today’s AI tools can be—not only for tech professionals but for anyone with a creative itch or organizational need. Both fun and functional workflows are deconstructed, and the ongoing shift from "what AI can do" to "what can I do with AI?" is center stage. Constraints aren’t a hindrance but a wellspring of new art, and the era of deeply personal, customizable software is unmistakably here.

Listen & learn more: howiaipod.com

Transcript

[00:00]
A
It's like the most creative satisfaction I've had in my whole life. So I generated all these clips in a pretty straightforward way. I used GPT-4o to help me with the prompts, said, hey, help me capture grunge 1990s Seattle, inspired by some of these music videos. And then as you can see, it gets progressively more like camcorder, grimy. So I generated all this stuff and then I threw it together into a music video. All right, let's watch it.
[00:29]
B
You get the patented Claire Vo raised hands reaction on this one. I cannot believe this is AI generated. It's so high quality. It's so specific in an aesthetic, in a wardrobe, in an emotion. You have inspired me. After this podcast, what music video am I going to make? It's so much fun. Welcome back to How I AI. I'm Claire Vo, product leader and AI obsessive, here on a mission to help you build better with these new tools. Today we have a fun and inspiring episode with Anish Acharya, General Partner at Andreessen Horowitz and AI consumer investor. But we're not going to talk about portfolio companies or the future of AI. No, we're going to use AI to build music videos, analyze our bookshelf, and help us plan our personal finances. Let's get to it. To celebrate 25,000 YouTube followers on How I AI, we're doing a giveaway. You can win a free year of my favorite AI products, including v0, Replit, Lovable, Bolt, Cursor, and of course ChatPRD, by leaving a rating and review on your favorite podcast app and subscribing on YouTube. To enter, simply go to howiaipod.com/giveaway, read the rules, leave us a review, and subscribe. Enter by the end of August and we will announce our winners in September. Thanks for listening. This episode is brought to you by Notion. Notion is now your do-everything AI tool for work, with new AI meeting notes, enterprise search, and research mode. Everyone on your team gets a note taker, researcher, doc drafter, brainstormer. Your new AI team is here, right where your team already works. I've been a longtime Notion user, and having used the new Notion AI features for the last few weeks, I can't imagine working without them. AI meeting notes are a game changer. The summaries are accurate, and extracting action items is super useful for standups, team meetings, one-on-ones, customer interviews, and yes, podcast prep. Notion's AI meeting notes are now an essential part of my team's workflow.
The fastest growing companies like OpenAI, Ramp, Vercel, and Cursor all use Notion to get more done. Try all of Notion's new AI features for free by signing up with your work email at notion.com/howiai. Anish, I am so excited to have you here. And let me tell you why. It is because I have spent the majority of this podcast talking about enterprise B2B product management, how to manage your manager or manage yourself as a manager, or how to vibe code. That has been the topic of How I AI, and today we are just going to have a little bit more fun. So why did you start to come to these AI projects that are a little less work related or technical and actually just a little bit more fun? How did you get here?
[03:42]
A
Great. Well, I'm excited to have some fun today. I mean, I've been passionate about music forever. I think most of us are. I've been DJing and making music for 30 years. But music is very constrained. You know, there's only so many ways you can work with it. An example of that is if you look at a track that has all the instruments mixed down into a final MP3 or WAV file. There's no way to just extract the vocal or just extract the drums. So you're really limited by a set of choices that were made in the studio. And, and with AI, you can do all this crazy stuff like disentangle a track into just the vocals and just the instrumentation. So what really got me excited at first was everything you could do with AI and audio. And then that of course fed into all of the new video models and videogen and lip sync and all of the new technologies we're seeing. So it's just, it's like the most creative satisfaction I've had in maybe my whole life.
[04:33]
B
Yeah, I agree with you. One of the things that I have so much fun with AI on is people are really worried that it takes away the most fun, most human, most creative parts of not just building things, but creating music, creating art, creating writing. And I in fact feel like it just gives me so much more tools, so much more breadth, so many more things I can play with and build. And so it really opens up this like creative artist side of me in a way that has been really hard to access as an adult, also with limited time.
[05:03]
A
Yeah, no, and it's actually a fun conversation we'll have over a glass of wine sometime. But if you look at music culture, music culture has kind of been defined by remix culture for the last 40 years. You know, like, the mixtape was the first time that you could take the music and do something, you know, the cassette tape, and do something of your own with it. And then that, of course, evolved into, you know, hip hop, which also sampled and which also had a lot of suspicion on it. But sampling was the foundation of hip hop, and I think AI is just the next manifestation of sampling, and it'll be as important for music as hip hop was.
[05:32]
B
Well, and we'll stop opining about AI and the arts. But the other thing that this remix culture makes me think about is kind of the next step that we've seen in the past couple years, which is kind of audio and video remixing. This, like, TikTok memes, these dances, these things where you're taking a snippet of creativity, turning it into your own thing, and then releasing it to the world in a new version. So I definitely think we're seeing this not just the audio side, but also at the video side, which brings us to your use case. So tell me what you built or what you created, maybe, and I'm excited to walk through how you got it done.
[06:10]
A
Amazing. Amazing. Great. Tiny Desk is the best. So if you haven't gotten into Tiny Desk, most people have seen it. It's just, it's so cool. It's so fun. And of course, you know, creativity loves constraints, and the constraints of Tiny Desk are incredible. There's a really good one from Clipse that just dropped last week. And I mean, anyway, there's an infinite number of them. It's a fun format. It's sort of like the Unplugged format of the 90s. So I love Tiny Desk. And I got to thinking about all of the artists I'd want to see on Tiny Desk. And of course, some of them are no longer able to be on Tiny Desk because they're not alive anymore. So that got me thinking about how I could do a Notorious B.I.G., Christopher Wallace, Tiny Desk. And do we have the tools and technologies? And, of course, can we do it in a way that's respectful and not derivative? And I did it, and it seemed like it kind of worked. Maybe we can cut to it so your audience can check it out. And the workflow is pretty simple.
[07:05]
B
We'll do a little clip of it, and then we can work through how it got there.
[07:14]
A
To all the ladies in the place with style and grace Allow me to.
[07:17]
B
Lace these livable dishes in your bushes who rock grooves and make moves with all the mommies the back of the.
[07:23]
A
Club Sipping Moet is where you bomb the back of the club mackin holes.
[07:27]
B
My crew's behind me Mad question asking blunt passing music blasting But I just can't quit because okay, we love it. It's great. And you made that?
[07:39]
A
I did make it, yes. And it took surprisingly little time. Yeah. So let me show you exactly how I made it. So I started with 4o. 4o is the best general purpose multimodal model, in my opinion. I use it for everything, and I just ask it to generate an image of, and we're going to do Kurt Cobain, that'll be fun today, from Nirvana, of course, that's from when I was in high school, playing a Tiny Desk concert. So let's see what it comes up with.
[08:05]
B
While this is loading, you know, you mentioned that 4o is the best kind of multimodal all-purpose model. I generally agree. You know, 4o image gen had this super viral moment a couple months ago when they released it. What do you feel like 4o image gen is particularly good at, compared to some of the other image gen models?
[08:25]
A
It's very good at prompt adherence. So you can do things. And I think that's because of the infrastructure underneath it. It's a different infrastructure from the diffusion-based models that preceded it. And BFL, Flux, a bunch of others do this now as well, and it's great. But I think it was just the most productive image model because you could manipulate it in such a fine-grained way.
[08:46]
B
Yep. And I remember the biggest improvement when 4o image gen came out is that it could actually spell things and write letters out. That was a magical moment. So I have to call out that NPR in the top corner of this image is actually done correctly. Look, there he is with his cardigan. There he is.
[09:04]
A
Okay, I'm gonna remove the guitar, actually, so that it is a cappella, because I think that might work a little bit better. But look, this is the vibe of Tiny Desk. You know, it's as if you're seeing a photo from the 90s in the Tiny Desk studio. So I just, I love this, and I think that we become so attuned to what's possible, we forget that this would be, you know, witchcraft three years ago. Witchcraft.
[09:27]
B
Right. What is the purpose of this? Are you storyboarding? Are you creating an asset that's going to go into another tool? Why start with image on this flow?
[09:35]
A
So I'll talk through essentially what I'm going to do. So there's this product called Hedra, which is, I think, the best way to take a still frame and add custom audio to it. So create a video that is sort of animated from the still frame and includes the audio with the right lip sync. And there's a bunch of amazing tools to do this. Sync Labs is one of my absolute favorites as well. But Hedra is nice because it actually generates the video. So it does the text to video or the frame to video, and then it also adds the audio. So what we're going to essentially do is take this frame, we're going to get the audio from YouTube, we're going to stem separate the audio so we get the audio track we want, and then we're going to put them together in Hedra, and that's it.
[10:21]
B
This really is remix culture.
[10:23]
A
It's amazing, isn't it?
[10:24]
B
It is amazing. Okay, so the asset that you really need to go into this videogen lip sync tool are two things. You need a still image that can be used to generate the video, and then you need some sort of audio to sync this to. So I know we're looking at this music example, but what other examples have you seen people use this kind of workflow for?
[10:47]
A
I think we underestimated how useful it would be to add custom audio to video. And there's been a bunch of great, you know, one of the early examples was taking a speech that somebody was giving. I know Javier Milei did a really famous one. And essentially changing the language to English and lip syncing it. That went really viral a couple of years ago. And then of course, you can imagine a character, a photo of a character that you generate, and then you want to animate them doing something and speaking at the same time. So, you know, stories are told this way, and these technologies make it really, really easy to do.
[11:21]
B
Oh, we got him. Great. Okay, so now he. He's got bad posture, but we'll.
[11:26]
A
We'll allow it.
[11:27]
B
Very grunge.
[11:28]
A
I think he always did. Yeah, exactly. Okay, so now we've got Kurt. Now what I would do if I didn't actually have a... So Tiny Desk has got a really specific acoustic aesthetic, which is, it sounds like live instrumentation. So for the Biggie example, I actually found a Biggie cover band playing live in Brooklyn, and I pulled that down from YouTube and then I extracted the actual vocals from the Notorious B.I.G. and laid them over. But in this case, Nirvana did a really famous New York City Unplugged concert in '93. So there's video and audio of them playing in the way that they would on Tiny Desk. So that is right here.
[12:10]
B
Even in the same cardigan.
[12:13]
A
Even in the same cardigan. Isn't that amazing?
[12:14]
B
Yep.
[12:15]
A
Okay, so I use this nifty Little tool called 4K Video Downloader, which is slightly sketchy, but that's okay.
[12:24]
B
I love these little utilities that you just, you know, you Google. Like, how do I get audio out of YouTube? And then you look at the scariest website possible and you just cross your fingers that your computer won't go up in flames and you download 4K video downloader.
[12:39]
A
Yes, my. Yes, my data is definitely going somewhere sketchy as a result of this.
[12:45]
B
So for the vibe coders that are listening, I have a request for startup, which is go, go find all these slightly scary little utils and build me ones that are less sketchy looking.
[12:56]
A
100%. 100%. It's a great idea. Okay, so now we actually have this. So we've got the video. Yep. Now we're gonna open Adobe Audition. Okay, so this is a tool that people who have been working in computer audio have been using for 30 years plus. It used to be called Cool Edit Pro. It's completely beloved and it's very, very easy to use, which is why so many of us use it. It was, of course, acquired by Adobe many years ago. It's now called Audition. So I go to audition and I take this video and I just drop it in. So here we actually have the audio from the video, which is really, really cool. I'm gonna zoom in and I'm going to see the first few seconds of it are blank. So let's just cut that out because we don't want to hear that. Then we're gonna zoom out and we're gonna take, I don't know, let's take 15 seconds and you can kind of see the audio, the video in the bottom left corner there.
[13:51]
B
Oh, got it. So it's combining the audio and video just so you know exactly what you're syncing up to.
[13:56]
A
Exactly.
[13:58]
B
And I'm gonna pretend like you're doing 15 seconds because we're doing a very efficient podcast here. But one of the limitations I know, having used some of these audio and video gen tools, is you're getting small clips right now with what we're working with. And so, you know what I'm looking forward to is the day where I can have the hour long Nirvana unplugged tiny desk.
[14:22]
A
Totally.
[14:22]
B
But, you know, do you ever feel constrained by the kind of length of assets being generated or the quality?
[14:31]
A
I mean, sort of. But again, I think constraints breed creativity. So, not to over-rotate on hip hop, but if you look at the reason that so many samples were used in hip hop in creative ways in the 80s and 90s, it was that the actual drum machines and samplers had very limited sampling time, so you could only sample a second of anything. So you couldn't really sample four bars. And that's why so many producers put tracks together that use these many 1/2 samples in surprising ways. And once we actually got the technology to sample for more time, we actually got less creativity, I would argue. So I sort of love the constraints that the technology gives us today.
[15:09]
B
Well, I also love my complaints. I'm like, isn't it annoying that you can't revive Nirvana and overlay their audio and generate a completely fictional concert for longer than 15 seconds in probably under a 30 minute podcast live? My complaints are so ridiculous because the idea of creating something like this even a year ago sounds so, as you said, impossible that we get so spoiled once we get used to these tools.
[15:37]
A
100%, right? No, exactly. Like, I mean, this stuff, we would have called it witchcraft three years ago. Okay. Now there's two things you can do with this. If we wanted to do an a cappella only version, for example, we can use a technology called Demucs. So Demucs is this amazing technology that allows you to extract the vocals from any song. So here, I've forgotten what the actual command line is, so I just do this. I looked it up in Perplexity: what's the actual way to extract two tracks with Demucs? We do this: demucs, two stems, vocals, and then let's go find the path. Okay, so this command is going to take that audio file we saved of the first 15 seconds of this concert and it's going to extract the vocals from the instrumentation. So this will be Kurt Cobain singing Come As You Are a cappella, which as far as I know has never happened. Which is pretty cool. And then we simply come back here and we say, start frame, upload an image. Use this. Okay, that's our Kurt Cobain. Audio script, upload audio. And let's use actually the full audio with all the instruments. Add a video, and then we just say, man singing on Tiny Desk.
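
The command Anish dictates here corresponds to Demucs's two-stem mode. The following is a minimal sketch that builds that invocation in Python, assuming the `demucs` command-line tool is installed. The function name, file name, and output folder are illustrative, and the command is only constructed, not run.

```python
def build_demucs_command(audio_path: str, out_dir: str = "separated") -> list[str]:
    """Build a Demucs call that splits a track into vocals and everything else."""
    return [
        "demucs",
        "--two-stems=vocals",  # one vocals stem plus one combined accompaniment stem
        "-o", out_dir,         # folder that receives the separated tracks
        audio_path,
    ]


# Hypothetical file name for the 15-second excerpt saved earlier.
command = build_demucs_command("come_as_you_are_15s.wav")
print(" ".join(command))
```

Running the list with `subprocess.run(command, check=True)` should leave a vocals file and an accompaniment file under the output folder; the exact subfolder and file names depend on the Demucs version and model in use.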
[17:01]
B
What I love about your prompting compared to other How I AI guests is every prompt has been sub six words. You're very simple in terms of describing what you want, and you get high quality outputs there. So I don't know what that says about the prompt engineering industrial complex, but proof here that you can use simple prompts to get pretty cool stuff if the tool behind the scenes does the work for you.
[17:28]
A
I think you've got to give the AI the space as well. You know, if you overly constrain it, it just really struggles to satisfy you. Whereas if you give it less constraints, you know, sometimes it has unexpected results, but often they're unexpected, you know, delightful.
[17:42]
B
Well, that's what I've heard a lot from folks that come from the more creative backgrounds. Designers in particular tend to be less precise in their prompting because they want that exploration space that then they can narrow in on. And so I really think it, it also comes into play. Your prompting technique can come into play based on kind of what profession or what background you're coming from. Engineers want like the most precise. They not only want the code to work, but they want the code to be written exactly how they would write it. And so they're very precise in their prompt. Where I found designers and more creative folks building different kinds of assets. Really like that wide open space.
[18:21]
A
Totally. Yes, exactly.
[18:23]
B
And while we're waiting for this to load, it might be interesting. I'm just looking at some of the options at the bottom here. So you have different kinds of models that you can use, including one that looks like they specifically fine-tuned for this. Different aspect ratios, orientation, length, probably based on the script. And then, you know, the prompt says prompt your character with emotion and gesture. So I am very curious if you put, like, angsty man singing versus cheerful man singing, if you'd get a different version here, even if the audio and video were the same.
[19:01]
A
It works really well. Absolutely. Yeah. No, this is such a useful storytelling product. It's amazing. And when you combine it with other video gen models like Veo 3, you can start to tell real stories, you know.
[19:11]
B
Yeah. Okay, let's check it out.
[19:13]
A
All right. All right. Pretty cool.
[19:28]
B
It's very good. It's very good.
[19:32]
A
Very satisfying even.
[19:33]
B
He even manages his mic well, you know, pulls back on some of those notes. That's incredible. And so, you know, could you take this and take different clips of the video and sort of generate a string of these videos and maybe put them together in a longer form version?
[19:54]
A
100%. Yeah. I actually was inspired by this, so I put together a music video, a little mini music video for a different Nirvana track. Can I show it to you right now?
[20:03]
B
Yes, we would love to see it.
[20:05]
A
Okay. I used Veo 3 to generate the clips and it turned out great, I think. Hold on one moment.
[20:11]
B
Yeah, and I think if you haven't tried Veo 3, it is pretty incredible. I mean, I can only generate like two and a half videos every day of three, you know, seven second length or whatever. I'm still capped on usage, but the quality is really good, the physics are really good. It's one of my favorite video models to play with right now, just as a consumer. To me, my experience with that model has been very similar to my first experience with Midjourney, where just the breadth of things coming out of the model was so incredible to me. So highly recommend folks give that model a little spin.
[20:55]
A
It's amazing. Yeah, you've, you've got to get on Gemini Ultra Claire.
[21:01]
B
We have a household Gemini Ultra account, but my husband is the video gen guy, so he's up there, and by the time I get to it, we've burned through some tokens. But, you know, I spent all the money on Cursor, so...
[21:21]
A
Fair, fair. I know, my wife for the first time this month was like, babe, what is Cursor? I'm like, ugh, don't worry about it.
[21:30]
B
I know, all these little secret AI tools popping up on the credit card. How I AI is now on Lenny's List with my personal selection of the best AI engineering courses on Maven. You can spend months thinking and playing with AI before really integrating it into your workflow or shipping an actual AI feature. If you want to start building, then these hands-on Maven courses are for you. Learn directly from Aishwarya Naresh Reganti, MIT instructor and AI scientist at AWS, or Sander Schulhoff, who has authored research with OpenAI, Hugging Face, and Stanford, to pivot into an AI role or successfully lead your company's next AI initiative. Visit maven.com/lenny to enroll now. Use code Lenny's List for a hundred dollars off. That's maven.com/lenny to get ahead in the AI era and start building.
[22:36]
A
So these are all the videos I generated in Google Flow. So I was trying to capture like a 1990s high school band auditorium, you know, a little dystopian energy. So I generated all these clips in a pretty straightforward way. I used GPT-4o to help me with the prompts because, as you can see, this is actually the beginning of my generations. This is like the complete wrong energy, you know. I don't know what this is. Like, early 80s, you know, synth pop or something. So then I went to GPT-4o and said, hey, help me capture, like, grunge 1990s Seattle, you know, inspired by some of these music videos. And then, as you can see, it gets progressively more like, you know, camcorder and sort of grimy. So I generated all this stuff, and then I threw it together into a music video, and I put the music behind it. I'll show it to you right now.
[23:25]
B
Amazing. So just restating this: 4o helping you refine your prompts to get the aesthetic right, the phrasing, the prompting right, give you some keywords. Veo to generate these, like, shorter clips. And then do you put it together in, like, Final Cut or something like that?
[23:41]
A
I put it together in Kapwing. Kapwing is so easy and so useful. Highly recommend it.
[23:46]
B
TikTok girl, so I use CapCut.
[23:48]
A
Yeah. Got to get on Kapwing. All right, let's watch it.
[24:32]
B
Jam.
[24:56]
A
That's it.
[24:58]
B
Okay. You get the patented Claire Vo raised hands on this one. I'm gonna tell you the real truth. Something like this makes me almost want to cry, because I really got into technology... I wanted, like everybody, to, like, make video games and, like, make movies and work for Pixar or direct, and it always felt so inaccessible to get these, like, amazing ideas that I had in my head into a thing. Like, could you film it? Could you access the people? Did you have the time? Did you have the music? Did you have the creator? And you just put together this amazing, amazing music video.
[25:37]
A
Thank you.
[25:37]
B
I'm so impressed.
[25:39]
A
Thank you. It was so fun. It was so easy. And I also, like, music videos are a lost art form.
[25:45]
B
Totally.
[25:45]
A
I'm so excited to see everybody making music videos for all their favorite tracks because what a cool way to contribute. And in no way does it actually dilute from the original. I think it's a. It's a testament to the original and our appreciation of it.
[25:58]
B
No, it looks like a love letter, and I have to call out: when I was watching it, there's a lot of it that I think is incredible. I like how the cameras, you know, like, pan and zoom in. The part that really got me was the sequential shots of the teenagers in the hall. And I was like, I cannot believe this is AI generated. It's so high quality. It's so specific in an aesthetic, in a wardrobe, in a motion. And it got me until, and again, Veo 3 is good at physics, until there's, like, a guy with, like, a pack of Camel cigarettes on his arm. And, like, the cigarettes are, like, halfway coming out.
[26:34]
A
Yes, yes, yes, Totally. That's right. Well, the. Actually, the. And the other funny artifact is if you look at the end when the band is playing and a bunch of people are jumping out of the crowd, four people jump out of the crowd at the same time. They look the same, and they're making the exact same, like, you know, like they look like acrobats at a circus or something.
[26:53]
B
It's like the end of, like, an 80s TV special where they all jump up.
[26:57]
A
Totally. Yes. Yes.
[26:59]
B
That's amazing. You have inspired me, truly. After this podcast, I'm like, what music video am I going to make? It's so much fun.
[27:07]
A
Do it, do it music.
[27:08]
B
I mean, music videos. You could do, like. Like fake movie trailers.
[27:12]
A
Yes.
[27:13]
B
Also documentaries. I mean, we're doing the fun art, you know, heart-and-soul-filling stuff. But I also think the ability to create educational materials that are compelling and interesting with this technology is also right there.
[27:29]
A
I mean, if you look at fan fiction, fan fiction's enormous because people want to contribute to the things they love. And now we get fan fiction for every medium. It's so cool.
[27:39]
B
Okay, Sold. All right, that was. That was just workflow number one. We're going to go pretty fast through workflow number two, which I think is a little bit more of a practical, practical one, but still connected to the arts. So walk us through what your second workflow is.
[27:57]
A
Cool. Yeah. So one of the things that I think is really under-hyped, underappreciated, underused is all of the multimodal capabilities. And the model that does this really well, actually, is Flash, Gemini Flash. So it's just great. It's one of the very few models that can do video analysis and ingestion. It can do all kinds of amazing things, and yet I don't see it being used out there a lot. I thought I would use it to create an app that would help me catalog my record collection, because, you know, like every DJ, I've got so many records, and it's such a pain to keep track of them and know which ones I had and which ones I didn't. So I did a very quick app on Friday that let me take a video of flipping through my record collection and then used Gemini to extract artist names, album names, photos.
[28:41]
B
It's.
[28:42]
A
It's really, really cool. So I thought today we could do something similar. Except for books.
[28:47]
B
This is amazing. And we were talking before we started recording. This is going to help me because over here I have like a hundred books and 100 records piled up on shelves that have definitely not been cataloged. So I can't wait to see what this looks like.
[29:02]
A
Perfect. I got you. Let's share. So here we are in Google AI Studio. So I'm sure folks are familiar with AI Studio, but if you're not, actually, I think it's the best product surface to interact with all the Gemini models. One of the best anyway, because it doesn't have all of the kind of overhead and links and constraints that a lot of the other Gemini products have. This feels like somebody just took a blank piece of paper and brought the best manifestation of the Gemini models forward. So I really love AI Studio. It's my starting point for all of these things. And then in AI Studio, you can see here you can of course chat, you can stream with your phone or with your webcam, you can generate media and you can build apps. This is a very good app builder and this is the best way to build off-the-shelf apps, I think, that integrate with Google models. So here I've typed, you know, create an app that takes a video of a person flipping through their book collection and extracts the author and title of every book shown. Then I give it a suggestion for how it could do it, which is: you could do this by taking the video and first extracting the frames that show distinct books, and then have a vision model analyze those frames to extract the information. Make sure you extract every book shown, say, sequentially.
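[Editor's sketch] The first step in the prompt above, picking out the frames that show distinct books, could be approximated roughly like this. This is a minimal sketch and not the code AI Studio generated: it assumes frames have already been decoded into flat lists of grayscale pixel values (0 to 255), and the function names and the threshold of 30 are hypothetical choices for illustration.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_distinct_frames(
    frames: Sequence[Sequence[int]], threshold: float = 30.0
) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame.

    The threshold is an assumed tuning knob, not a value from the episode.
    """
    kept: List[int] = []
    for i, frame in enumerate(frames):
        # Always keep the first frame; afterwards compare to the last kept one.
        if not kept or mean_abs_diff(frames[kept[-1]], frame) >= threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Five tiny fake "frames": two near-identical pairs and one new scene.
    demo = [[10] * 4, [12] * 4, [200] * 4, [198] * 4, [90] * 4]
    print(select_distinct_frames(demo))
```

Comparing each frame to the last kept frame, rather than to its immediate predecessor, keeps slow drift from being mistaken for a new book.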
[30:20]
B
What I have to call out here is, you know, what's interesting is people know that these models exist and they generally know some of the capabilities: vision, you know, text to speech, or speech to text, all this stuff. But what's really hard for people to do, and I appreciate you showing us, is think of novel ways you can access the abilities of those models. I actually thought you were going to show us, like, you took a picture of it and you cataloged it. But this idea of a video and then extracting the frames, I just haven't changed my mental model to match these multimodal models in order to, you know, take advantage of things that can be more efficient, allow you to do things. And so I really think it's great that you're coming to this from: how could I solve this with audio? How could I solve this with video? How could I solve this with text? And knowing that the models can do kind of the hard work on the back end.
[31:14]
A
Thanks. Yeah, look, I completely agree. And video is, of course, so much richer than image. And this is the way that we'll bring a lot of the outside world online, I think. So I've been really inspired by video. I saw something on Twitter where somebody had set up a mini app that watched him shoot free throws and kept count. I mean, there's just so many ways that this will be productive. I'm very passionate about AI for parents and I've got kind of a neat video idea in there as well. So to me there's, like, the sort of skeuomorphic technologies, which is using the new technology with the old assumptions, and then there's the native ways to use it. And this feels like a very native way to use the models.
[31:52]
B
Well, to connect the two things that you said, the, you know, basketball shooting analysis and kids: my husband did upload every single one of our eight-year-old's basketball games to a video analysis to get, like, each kid's stats. No way. Shooting percentages, all that. They actually don't even keep score at this age. So he got to, like, get the score.
[32:14]
A
I love that.
[32:15]
B
Yeah, I totally love that. Okay, so now we have an app.
[32:19]
A
Yeah. So I'm going to take a video here of me just flipping through my stack of books. Okay. I've taken the video.
[32:35]
B
Okay. And that took all of seven seconds, so.
[32:38]
A
Yeah, exactly. Yeah. Now, you know, the one edge here that's kind of interesting is this is really. It's really easy to get something working, but if you want to publish an app that a lot of other people can use, it then becomes more work.
[32:51]
B
Yeah.
[32:52]
A
So it probably took me 15 minutes to create this for my record collection, at least to create a working, primitive demo. But then it took me half a day to get it live so anybody could use it.
[33:03]
B
And what's interesting about that is I feel like a lot of individuals are just going to build their own tools and presume other people are going to build their own tools. And so maybe this will just inspire somebody to build their own record collection extractor, which might be faster than trying to find yours online and reusing something somebody built.
[33:22]
A
I mean the era of personal software is upon us, you know.
[33:26]
B
Totally. Okay, so what it's doing: taking this video, it's going to do frame-by-frame extraction, again, something that is just so time consuming, and then it's going to use the vision capabilities. Do you know what model is behind the scenes of all this? You said Flash?
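[Editor's sketch] The two stages Claire recaps, pulling frames and then handing each one to a vision model, can be outlined as below. This is a sketch under stated assumptions, not the app's actual code: `Book`, `FrameAnalyzer`, `catalog_frames`, and `fake_analyzer` are hypothetical names, and the real Gemini call is deliberately left out and replaced by a stand-in function so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Book:
    title: str
    author: str


# Hypothetical signature: takes one frame, returns a Book,
# or None if no book is legible in that frame.
FrameAnalyzer = Callable[[bytes], Optional[Book]]


def catalog_frames(frames: Iterable[bytes], analyze: FrameAnalyzer) -> List[Book]:
    """Run the analyzer over each frame in order and collect the books found."""
    books: List[Book] = []
    for frame in frames:
        book = analyze(frame)
        if book is not None:
            books.append(book)
    return books


def fake_analyzer(frame: bytes) -> Optional[Book]:
    """Stand-in for a vision model call: reads 'title|author' from the bytes."""
    text = frame.decode("utf-8")
    if "|" not in text:
        return None  # Treat anything else as an unreadable frame.
    title, author = text.split("|", 1)
    return Book(title=title.strip(), author=author.strip())


if __name__ == "__main__":
    demo_frames = [b"Example Title | Example Author", b"blurry frame"]
    for book in catalog_frames(demo_frames, fake_analyzer):
        print(f"{book.title} by {book.author}")
```

Passing the analyzer in as a function keeps the loop testable without network access; a real model call would slot in behind the same signature.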
[33:41]
A
It's Flash. Flash 1.5. And I can kind of skip ahead and show you what that would be, what this looks like. So here's one that I built yesterday with essentially the exact same prompt.
[33:53]
B
Yep.
[33:53]
A
So let's run it in parallel and see if this one's any happier with us.
[33:57]
B
Okay. And I did notice one was light mode and one was dark mode. Was that just.
[34:03]
A
Yeah, this is just some of the randomness of the models. Yeah, exactly.
[34:08]
B
Oh, I do. I do have to say I like the progress indicator of the second one. It told me how many frames it's extracting. Oh, look at this.
[34:16]
A
So here we go. You know, this is the Chris Dixon book, the Paul Graham book, this very nerdy book that Mark asked me to read when I was hired. This is a really good Thomas Sowell history book. Anyways, this is my entire stack of books, every single one of them. You can see a photo. It's extracted the author and the book name. And, like, you know, this is just a couple of prompts. That's it. And it generated it. So this is what's possible. And then if you go here to deploy with Cloud Run, you get a deployed version of it that's actually running on the cloud. And now you can send this link to anyone. Now, this is going to cost you API credit, so maybe you want to be a little bit deliberate, but you're pretty much ready to go with this really sophisticated video processing app that would have taken, I don't know, a month of time previously.
[35:07]
B
Yeah. Amazing. And so useful, because now I can figure out which of these also very nerdy books we have and which we've read. I also see some duplicates up there.
[35:17]
A
Totally. Yeah. It's not perfect. Yeah, exactly. Well, actually, in this case, the photo's duplicative, but it detected the Ben book and the Chris book separately, so.
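[Editor's sketch] In this run the duplication was in the photos, not the titles. If the same title did come back twice (for example, because a book stayed in view across several frames), one simple cleanup would be to dedupe on a normalized title and author pair. This is a minimal sketch; `BookRecord`, `normalize`, and `dedupe_books` are hypothetical names, not part of the app shown.

```python
from typing import Iterable, List, Tuple

BookRecord = Tuple[str, str]  # (title, author)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def dedupe_books(records: Iterable[BookRecord]) -> List[BookRecord]:
    """Keep the first occurrence of each (title, author) pair, preserving order."""
    seen = set()
    unique: List[BookRecord] = []
    for title, author in records:
        key = (normalize(title), normalize(author))
        if key not in seen:
            seen.add(key)
            unique.append((title, author))
    return unique


if __name__ == "__main__":
    extracted = [
        ("Book One", "Ann Lee"),
        ("book  one", "ann lee"),  # same book, seen again in a later frame
        ("Book Two", "Ann Lee"),
    ]
    for title, author in dedupe_books(extracted):
        print(f"{title} by {author}")
```

Exact matching after normalization will miss OCR-style misreadings; fuzzy matching would be a separate design decision.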
[35:27]
B
But yes, I need this, man. I need this for the pile of kids books I have up in my kid's closet so they even remember what they have. Okay, this is great. Well, thank you so much for showing us these fun use cases. I have to call out as we hop into our lightning round. One thing I noticed, which is you are using Comet.
[35:49]
A
I am using Comet.
[35:50]
B
Tell me a little bit more about why that new browser is your browser of choice and what are you getting out of it.
[35:59]
A
Comet is so good. I mean, I've been skeptical of the new browser thing because it just feels like the ways to improve the browser in the past have been very incremental. Ambitious, but there just wasn't that much surface area for new browser features. And now with Comet from Perplexity, it can do a bunch of really incredible things. My favorite thing that it can do is what's called RPA, which is where the models operate your browser on your behalf. So you've seen a bunch of examples of this, like, hey, go find me a flight and pay for it. Which is interesting. The way I've been using it is in my finances. So I'll go into Robinhood and I'll say, hey, why don't you tell me how my portfolio is performing? Why don't you tell me where I could get stocks that have similar upside at a lower cost basis? What stock should I buy next? Are any of these memes? I mean, you can just go so deep. And look, I could probably figure that out by clicking around the website and downloading the data, but now I don't have to. So this assistant feature in Comet makes every website dramatically more useful and it's been a big unlock for me.
[37:04]
B
I love this whole episode because you've actually shown a couple of use cases, including talking about personal finances with Comet, that really are consumer use cases. Again, as I said at the beginning, we're doing a lot of, like, how do you work this inside of an enterprise? How do you write code with it? But I think the real underappreciated transformation is going to come in consumer experience. I think we're so early. I mean, as somebody who does a podcast trying to educate people, I just realized we're so early on consumer adoption of AI. And so I have a question for you, which is: if you could get, like, my mom or one of my friends who is not in Silicon Valley, less in the middle of this, in a room and say, you know, let me show you three things in 15 minutes that will totally change how you think about your life, or things that you never knew were possible, what would be those things? What are the consumer-side things that you're excited about?
[38:01]
A
So I have kids and parenting's on my mind all the time. And the ways that my kids use models are amazing. So for my 4-year-old, ChatGPT reads her a bedtime story, but not just a bedtime story, one where she can ask infinite questions. You know, so what was the king's dragon's name? What color was it? Where did it come from? Did it have any kids? You know, she's really into unicorns and alicorns. Like, tell me a story about an alicorn and a golden egg. And so she can just really interact with the bedtime story. And ChatGPT is far more patient and creative than we usually are. So that's one way. And look, she can't really use a computer otherwise, other than watching YouTube. And then for my son, he'll set up two figures like Sandman and Spider-Man and then he'll take a photo of them in ChatGPT or one of the other models and say, hey, who would win? And then it'll do this whole, you know, oh, Sandman would win in these conditions, but maybe Spider-Man does this. So they're able to kind of play with the technology instead of just being broadcast to from technology, which is really new. That's, like, the near-term stuff. In the longer term, I think that the models can really help with a lot of social-emotional learning. If you look at the classroom, part of it of course is academics, but part of it is just teaching children to be good people for the world. And a lot of that comes in observing how they're behaving and interacting. And we never had a technology that could do that. If your kid went to a great school, there might be a second teacher in the classroom focused on social-emotional. So I think that's how AI shows up in the classroom. It's probably less like homework helpers and assignment generation and more observing the social dynamics in a classroom and helping kids be better people.
[39:49]
B
Yeah. Well, calling back to what we were saying earlier about trying to identify the AI-native way of doing things, I watch my children so much. I say that my children form my consumer AI theses for me, because the other day my 6-year-old was playing Minecraft and he wanted to know how to do a command, and he literally went to my purse, picked up my Meta AI glasses, put them on and said, hey, Meta, how do I transport to the woodland mansion in Minecraft? And I was like, wait, it's not type into ChatGPT. It's not even ask Alexa. He took this physical device and put it on his face.
[40:30]
A
Amazing.
[40:31]
B
And asked this personal AI a question, and that just really opened my mind. Again, I think multimodal is going to change things. I think hardware is going to have a real place to play here. And then this, like, AI-native generation is going to think about accessing information and building things in a totally, totally different way. So I am with you on all of that.
[40:56]
A
I love that yeah. And it's interesting because we have been taught what computers can and can't do, but they haven't been taught any of those things. So when I generate an image of, you know, a Harry Potter image for my son, I'm like, wow, do you see how I just generated that? He's like, dad, of course the computer can do that. So they just assume that everything's possible, and now everything kind of is.
[41:15]
B
Oh, my gosh, we had it. As I say, when I had to walk uphill both ways for my Internet, like, that's right. Okay, we'll get you out of here. One last question I have to ask. You have had such success with generating these complicated assets, but when AI is not listening to you, when it is giving you really poor results, what is your prompting technique to get it back on track?
[41:39]
A
I mean, I don't know if it's a prompting technique, but it's a mindset. Two things. One is go with it, you know, like, let it take you to some strange, unexpected places, and you might be amazed at the results. I think the other is just reducing this sunk cost fallacy thing where, you know, you create a GitHub branch, you try to do something really ambitious, and it's just, like, falling over, over and over again. Just abandon the branch and start over. Because you didn't actually do any work. You feel like you did work because it did work, but that's not you doing work. And I think being a lot more willing to abandon approaches that aren't working is the sweet spot.
[42:18]
B
I completely agree. Well, thank you so much for showing us all these workflows. It was totally inspiring. I want to get off this podcast so I can go play. So thank you for making my day, and I know everybody's going to love the episode.
[42:30]
A
Thank you, Claire. Super fun.
[42:32]
B
Thanks so much for watching. If you enjoyed this show, please like and subscribe here on YouTube or, even better, leave us a comment with your thoughts. You can also find this podcast on Apple Podcasts, Spotify, or your favorite podcast app. Please consider leaving us a rating and review, which will help others find the show. You can see all our episodes and learn more about the show at howiaipod.com. See you next time.