Transcript
A (0:00)
Google Gemini has just announced a brand new image generation model that may have caught them up to OpenAI, and maybe even surpassed them and a number of other players in the industry. So today on the podcast we'll be diving into the new capabilities of this image model, because there's a whole bunch of things that image models have struggled to do that I think they've done exceptionally well, and there are some areas where I think Google still has a lot of room for improvement. We'll be breaking down all of that today. Before we get into it, I just wanted to mention: if you want to see the posts I've been making about this or anything else, make sure to go follow me over on X, where I post a bunch of interesting things I've discovered and a lot of stuff about AI. My handle is jaden_ai, and I'll leave a link in the show notes. I'd love to connect with you there.

All right, so let's get into what Google has announced. The first thing, and the one I've been most excited about, is that it's capable of keeping characters consistent, faces included, across a number of different images. This is pretty cool, and there's a bunch of funny things I've been able to generate today; I'll give you more on that in a second. The big one is that you can have the same person inside of multiple shots. That's something ChatGPT kind of struggles with: you upload an image of yourself and ask it to generate you, and when it puts your face into an image, it doesn't really look like your face. Google, I think, does a really phenomenal job here. You can give it a picture of yourself, it will put you into the scene, and it will actually look realistic. It can also make edits to the background and the lighting in a photo, and you can extrapolate from there: it's just like a chat interface with the photo, where you talk to it, say what you want to edit, and it makes the edits. It does a really good job of this, and I'll sketch what that looks like as an API call in a second.

It also produces really high quality images. When I was first testing it out, I actually thought there was some sort of issue, because images looked kind of low quality inside of the chat. But I think that's something they're doing just to preserve bandwidth while you're chatting. If you actually click on an image (I have a 4K monitor, and if you're watching on Spotify or over on YouTube you'll see this), it blows it up and it looks really high quality. So it kind of looks grainy at first, but don't be fooled: it's actually good at generating some genuinely big, high quality images.
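To make that concrete, here's roughly what the talk-to-your-photo editing loop looks like as an API call. This is a minimal sketch assuming the google-genai Python SDK; the model string, file paths, and prompt are illustrative placeholders, so check Google's current docs before relying on it.

```python
# Minimal sketch of chat-style image editing, assuming the google-genai
# Python SDK. Model name, paths, and prompt are illustrative placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
# (The same SDK can also target Vertex AI via
#  genai.Client(vertexai=True, project=..., location=...).)

# Send a photo of yourself plus a plain-language edit instruction.
with open("me.jpg", "rb") as f:
    photo = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # launch-era name; may change
    contents=[
        types.Part.from_bytes(data=photo, mime_type="image/jpeg"),
        "Keep the person exactly the same, but change the background to "
        "a city street at night and warm up the lighting.",
    ],
)

# Responses can interleave text and image parts; save any image bytes.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("edited.jpg", "wb") as out:
            out.write(part.inline_data.data)
```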
Now, some of the things I'm most excited about here: when they made the announcement today, they actually rolled this out to everybody, so free users are getting access. I feel like there are so many times when we get these big image launches, or any sort of AI tool launch, and ChatGPT, for example, is really notorious for this: they'll say, hey, the new agent feature is rolling out, but first it goes to enterprise users, then the week after that it rolls out to paid Pro users, then education, and then free users get it a month later. And honestly, half the time, even if I'm in the second rollout wave with the Pro users, after a week or two I forget about whatever the feature is and never actually try it, because it's not in the news anymore. I have to go back months later and figure out, oh, what was that thing they announced where you can connect your Google Calendar, and go mess around with it. I think that's kind of bad for adoption. Google does this really well: they've just rolled this out, and not only is it available for all free users, it's completely available to everyone, including developers, on the API, on Google Vertex AI, basically anywhere you can access the Google models. It is currently out, and I think that's really cool.

Another thing that's fantastic, which our friend over at Google, Logan Kilpatrick, has mentioned: well, technically it's called Gemini 2.5 Flash Image, by the way. I know we call it Nano Banana. The whole Nano Banana thing comes from the fact that they put it into these anonymous benchmarking websites where people try out different image models side by side and pick which image they prefer. The big AI companies give their models code names there, so you don't know what tool you're actually testing or benchmarking. Nano Banana was this mysterious model that was doing really well and kind of beating everyone in the benchmarks. And by the way, what will often happen is someone like OpenAI or another player will put a model in there, and if it doesn't do well compared to the others, they'll just pull it down, work on it a little, try to make it better, and then retry. So honestly, it's kind of like testing the market ahead of time, and I think that's a great strategy. But anyways, there was this model called Nano Banana doing really well, no one knew what it was, everyone at Google kept making these subtle banana jokes, and then all of a sudden we get the launch: Gemini 2.5 Flash Image is out. It's a great tool. It's done phenomenally in the benchmarks: really great character consistency, creative edits, and of course all of Gemini's world knowledge. It's basically crushing Flux, Qwen, and GPT-4o in image editing.

Now, there's also the fact that Midjourney isn't in a lot of these benchmarks because they don't have an API, which I think is an important thing to note. Meta, we know, just signed a really big deal to have Midjourney embedded into Meta AI. And I think Midjourney has for a long time sort of been the frontier, best-in-class image generation model, but it gets forgotten because they don't have an API. It's not integrated into the software a lot of people use, so the distribution's a lot lower.
For a long time it was only on Discord, and of course now they have their own website, which is great, but I find myself not going to a separate website just for an image model. I'd rather have something like that embedded in a chat model I'm also using for other purposes. So in the past I'd sometimes use the Grok image generator, and nowadays I'm mostly using OpenAI's, but it feels like Google might actually be giving them a run for their money with this. Because even with ChatGPT, for example, when I need to make thumbnails for YouTube, I have my editors go use different custom platforms where I upload a whole bunch of pictures of myself so it can clone my face, because I used the OpenAI image generator for thumbnails for a long time and it just wasn't super accurate. Hopefully this is something Google Gemini will solve, and we may actually be able to just use this. I know YouTube thumbnails sound like a funny example, but you can extrapolate that to all sorts of graphic design where you actually need a consistent character, a consistent person, inside of it. People are doing some really cool things with this. Some are uploading a picture of themselves and getting it to generate them in a hundred different styles, and it looks great; the character stays very consistent. People have also uploaded a YouTube thumbnail and said, hey, swap my face onto this thumbnail, and it does a great job of that, which is something you previously paid for software to do. So overall they're making some really big strides, and it's quite impressive.

Now, with all of that out of the way, I want to tell you a couple of areas where I think they can improve, and a couple of funny glitches I found while using the platform. While I was testing it out this morning, I ran into a few funny things. The first thing I did was get it to generate a 90s grunge style photo of me; it was a prompt in their demo that I clicked on, and it did it. It wasn't super flattering, but that's fine. I then asked it to make me a YouTube thumbnail. One thing I'll say about this: technically it's tied to Gemini, so you'd assume it should be smart and know a lot of things. But when I asked it to make a thumbnail, it made a square image, and thumbnails are obviously landscape. OpenAI wouldn't make that mistake; it would understand and make the thumbnail a rectangle. So that was kind of the first strike. The images are good, but here's one thing I found. I gave it a funny prompt, and I'll read the whole thing so you know how robust it was, and then give my grading on how well it did. It was something like: make a YouTube thumbnail of me being chased by a shark underwater, looking terrified. The shark has the words "interest rates" on its side. A boat is above us.
A hook is in the water from a fisherman, and the hook has a piece of paper attached to it that says "buy now." I just tried to make it the most elaborate, descriptive thing I could, and it actually generated that image perfectly. There's a boat, there's a shark, the shark's got the words on its side, there's a person swimming in the water, and there's a "buy now" piece of paper on the hook. It was exactly what I wanted. But here's the funny thing: in the message before, I had uploaded a photo of myself, because it prompts you, upload a photo of yourself and we'll make your thing. So I did, and then in my next follow-up I said, make a thumbnail of me. And it didn't actually use that photo; it didn't make it a picture of me, it just put a random woman swimming in the water. So again, it feels like it's not so much the image generator; maybe Gemini itself wasn't quite doing its best. That said, I think it was running on Gemini 2.5 Flash, a quick model but not the smartest one, and I'd probably get better results if I upgraded to 2.5 Pro, which is interesting. And maybe that's a me problem, because I kept trying to get it to do things and the model didn't feel very smart; it would just keep regenerating. I uploaded a photo of myself and said, okay, use this person and put this person in the water, trying to make it not the woman. And it literally just took my head and stuck it on the woman. So it's me wearing a sports bra swimming in the water, and I'm like, come on. Very unflattering. I told it to change the dimensions, and it didn't; it made the image portrait instead of landscape. Again, that's probably all because I was using the faster model. It struggled a lot.

Then I completely changed direction and had it generate an image of a monkey chasing a person in the jungle. It did a really good job of that, and it looked photorealistic. The reason I bring that up is that up until then, the shark scene of me in the water looked kind of cartoony. And this is basically my pet peeve with all of these AI models: you ask for a photo of something, it comes out cartoony, and you can't really pass it off for a lot of different use cases. So I love it when it can be photorealistic. The thing I noticed is that if you want a photorealistic image, you need to describe a scene that could plausibly be real. If I ask for a shark with the words "interest rates" on its side, that sounds like something out of a cartoon, so it generates a cartoon, even if you tell it to make it photorealistic. But if you ask for something like a monkey chasing a person in the jungle, that sounds like a photorealistic possibility, and it actually made it photorealistic. So that's another thing that's a little tricky: getting it to be realistic.
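For what it's worth, the workaround I'd try for the dimensions problem is spelling the framing out in the prompt itself; as far as I know there was no dedicated aspect-ratio parameter exposed for this model at launch, so treat this as a hedged prompting sketch rather than a guaranteed control.

```python
# Hedged prompting sketch: bake the framing and realism cues into the text,
# since (to my knowledge) no dedicated aspect-ratio parameter was exposed
# for this model at launch. Same google-genai SDK assumption as earlier.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

prompt = (
    "A 16:9 widescreen landscape YouTube thumbnail, photorealistic, "
    "shot like a real underwater photograph with natural light rays: "
    "a swimmer being chased by a great white shark."
)

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # launch-era name; may change
    contents=prompt,
)

for part in response.candidates[0].content.parts:
    if part.inline_data is not None:  # image parts come back inline
        open("thumbnail.png", "wb").write(part.inline_data.data)
```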
Okay, the next thing I encountered on this whole journey was going up against the guidelines. I wanted to know: what is it actually capable of doing? Google has been famous in the past for their image model doing funny things. People would ask it to generate an image of World War II German soldiers, and because it had a diversity, equity, and inclusion kind of thing built in, the DEI thing, it would generate Black Nazi soldiers, because it didn't want to, I don't know, only generate white people. But obviously that's very historically inaccurate, and a lot of people found it offensive for a million reasons. So Google, I think, tried to backtrack and fix it. Now I was curious where Google is at today on what it's able to generate. The first thing I asked it to generate was a picture of me getting chased by a bunch of camel thieves in the desert, with a nuclear bomb exploding in the background and me holding a gun. I was like, okay, let's make it crazy, right? And of course it said no. I figured it was probably the gun, so I said, hey, generate me holding a gun. And it said, nope, can't do it. Okay, whatever; Google doesn't like guns, that makes sense.

But here's what was interesting. I then asked for a photo of me in the Sahara desert with a bunch of camel thieves chasing me. I pulled out the nuclear bomb and the gun stuff, tried to keep it simple, and it actually did generate that image. But it pulled things from the prompt above, the one that supposedly went against their guidelines. So it's still reading the whole chat thread when it generates the image, which was kind of weird to me, because I never said anything about a nuclear bomb for this photo, and yet in the background of the photo it generated there's a giant nuclear mushroom cloud. I think Google probably wants to look into that: a prompt can be against their guidelines, but a follow-up prompt can still pull elements out of it into the new image.
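Mechanically, that carryover makes sense: in a chat session, every prior turn is part of the history the model sees on the next turn. Here's a sketch of that flow using the SDK's chat interface, same assumptions as earlier; whether refused turns stay in the history Google replays is my inference from the behavior I saw, not something they document.

```python
# Sketch of why earlier prompts can leak into later images: a chat session
# carries the whole conversation history into each turn. Assumes the
# google-genai Python SDK; model name is the launch-era placeholder.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
chat = client.chats.create(model="gemini-2.5-flash-image-preview")

# Turn 1: refused, but (apparently) still part of the conversation.
chat.send_message("Me in the desert holding a gun, nuclear bomb behind me")

# Turn 2: the model sees turn 1 too, so details from it (the mushroom
# cloud) can resurface even though this prompt never mentions them.
response = chat.send_message("Me in the Sahara, chased by camel thieves")
```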
So anyways, an interesting thing I discovered. Overall, I think this is a really interesting model, and I tried all sorts of quite crazy things that I posted on X. One of the biggest things, which was kind of shocking to me, is that it would generate a picture of camel thieves chasing me in the Sahara, but if I asked it to draw elephant thieves chasing me in the savannah, it wouldn't. It will generate some groups of people and won't generate others. So my message to Google was basically: just be consistent. Either generate all of it or don't generate any of it. I'm assuming it's because of some discrimination clause, because the camel thieves it drew are all Arabic people with turbans holding guns, chasing me. You can see the picture on X, and you can imagine why some people might find that offensive. Personally, I don't really care what you generate with it; I just think it should be consistent, because if you then ask for elephant thieves, which you'd imagine would be people from Africa, it won't do it, probably because it thinks that would be discriminatory or bad in some way.

The reason I have this message for Google, and the reason I was testing all of this, is that as someone who has a software company and is constantly adding these APIs into our software, it's really tricky to add this stuff without knowing what it's actually capable of, what it can do and what it can't do. If you have a specific use case where you need to generate something specific, Google doesn't have very clear guidelines about what it will actually do. I did ask Gemini: what are your guidelines? What can you generate and what can't you? Because after it wouldn't generate a picture of a gun for me, I was like, okay, that makes sense, but what are these guidelines? And this is what it told me. It said: I'm currently unable to generate images that depict real people; this includes public figures, celebrities, or private individuals, even with an uploaded image. Now, that isn't true, because in their launch event they literally tell you to do exactly that, and I had done it with like 90% of my photos (a couple it wouldn't let me do, I think for other reasons). So if you ask it what it's capable of, it's not actually being accurate with what it tells you. After that it listed: violent or graphic content, okay, makes sense; sexually explicit material, makes sense; self-harm, makes sense; hate symbols or discriminatory content, makes sense. But "discriminatory content" can be a really big catch-all. Somewhere between the violence and the discriminatory content, I feel like there are going to be a lot of regular pictures people try to generate that, for one reason or another, Google doesn't like.

And the way this works is that when you generate an image, before it appears on your screen, it goes through a process over at Google that determines whether the image has anything harmful, anything that might offend people, et cetera. That technology is very common across image generators: NSFW filtering, right? Any sort of explicit material gets filtered out; models are trained to recognize what that is and either not generate those types of images, or generate them anyway but just never show them to people. What's tricky about what Google is trying to do here is that they have much bigger guardrails on their image generator, especially around discriminatory content. It's like, people with turbans chasing me with guns in the desert, oh, that's fine, but African people chasing me, it won't allow, because that would be discriminatory. It feels very tricky because there's an AI model deciding what's discriminatory and what isn't. So anyways, this is obviously a huge can of worms that Google is trying to tackle.
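From the developer side, you can at least detect when that moderation layer fires instead of guessing. A sketch, again assuming the google-genai Python SDK; the exact fields and enum values may vary by SDK version, so verify against the current reference.

```python
# Sketch of detecting the moderation layer from the API side, assuming
# the google-genai Python SDK; field names may vary across SDK versions.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents="A photo of me holding a gun in the desert",  # likely refused
)

# If the prompt itself was blocked, prompt_feedback says why.
if response.prompt_feedback and response.prompt_feedback.block_reason:
    print("Prompt blocked:", response.prompt_feedback.block_reason)
else:
    for candidate in response.candidates:
        # A safety-related finish_reason means the output was filtered
        # after generation rather than the prompt being rejected up front.
        print("Finish reason:", candidate.finish_reason)
```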
I don't really care what they do; I just want them to be clear on the rules. I'm just flagging this as an issue that's inside the model and probably should be addressed one way or another. But overall, I've been really impressed with the quality of the images Google is generating. It's not perfect by any means, but I think it's a huge step up, especially from Google's last image generation model. So I'm really excited to have them in the arena, and I'm excited to see it go head to head with Meta and what they've been able to get with Midjourney, once that comes into the Meta apps. I'll be really curious to see whether they're able to beat out Midjourney. I feel like Midjourney still has a leg up right now, so it'll be really interesting to see where this goes. In any case, thank you so much for tuning into the podcast. I know it's a bit of a long episode, but I was doing a deep dive; this is really cool technology, and there are a lot of hot-button, controversial things happening in this space right now, so I wanted to make sure I covered it all. Thanks so much for tuning in. Make sure to go follow me over on X, and check out the AI Box platform, aibox.ai, if you want to try out all the models I talk about on the show. I will catch you all in the next episode. Have a nice day.
