
A
Hello, I'm Andrew Mayne, and this is the OpenAI podcast. On today's episode, we're talking about Images 2.0 with researcher Kenji Hata and product lead Adele Lee. They'll discuss why the new model represents such a major leap forward, the evaluations that mattered most during development, and what people are creating with it now that it's widely available.
B
If DALL·E was the Stone Age, ImageGen 2.0 is the Renaissance. It's not only great artistically and aesthetically, but it also incorporates science, art, architecture all in one image.
C
We looked at it and we're like, all right, this is better than images 1.
A
Adele, tell me a little bit about how you became a product manager here.
B
So I joined OpenAI a little over two years ago, and before OpenAI, I was an investor my entire career.
A
Oh, wow.
B
So I was in private equity and spent three years at Redpoint Ventures investing in AI and software companies. And when I first joined OpenAI, it was for a completely different role. I was thinking about how we build out our data and compute infrastructure, and over time I made my way over to the product side, and for the last six months I have been working on ImageGen.
A
It's interesting how you style yourself going from one role, then finding yourself into this space here, which is kind of cool to think about the idea that you have this sort of ability to be useful in different ways.
B
Absolutely. And I think the role of a product manager is just to do the job that needs to be done, no matter what it is. And for ImageGen in particular, it's been really awesome to flex a lot of different muscles when it comes to building products, working with researchers like Kenji, but also thinking about, like, what is the gap in the market today that we want to fill and what is the opportunity that we want to grasp here. It's not the same market that it was a year ago when we first released ImageGen 1.0. Now it's a very different landscape. There are multiple image generation makers out there, and ChatGPT is a very different company and product itself too. And so really thinking about the evolution of ImageGen and its role within ChatGPT has been really, really exciting to me.
A
Kenji, how did you end up working on images?
C
Actually, I also started at OpenAI about two years ago. I was working on some random audio project initially; that was my first project. And then at the time I just found my way into helping them work on Images 1.0 prior to the launch. And so gradually I moved more and more onto the project and then I became full time on it.
A
Basically. What has the reception been like right now for the model?
B
In the last two weeks since we launched the model, usage is up more than 50%. More than 1.5 billion images are generated every week on ChatGPT. And we've seen viral trends emerge across the world, all the way from trends in Asia for color analysis and stickers to the US, where crayon and scribble are going viral, but also a lot of people exploring emergent use cases. And I think it shows the dynamic range of the model, but also how people are able to visually grasp the advancement of the model almost immediately. I think the visual communication reaction that we've seen from our users, for them to say, hey, this is the best, highest fidelity, highest quality and aesthetic model that we've seen, has been really awesome.
A
This felt like a really big shift, almost worthy of not even being called Images 2, but almost like just a new paradigm, because the capabilities are through the roof. What made that possible?
B
When we started working on this project, I think we sat down and we discussed what is the step change of capability and use cases that we wanted to build towards. And we believe that image generation has the ability to do so much more than what it does today. You could distill every single output or visual content that you see today into an image. And so that was the mandate that we set out to improve. And with this 2.0 model, we've improved on various different dimensions. One is text rendering. The ability for text on a page is so much better in fidelity. The language and words actually make sense and they're actual words. The second is multilingual. We've really focused on making this model work in various different languages. We're already seeing that people across the world in Asia and Europe are really resonating with these advancements. The third is photorealism. I think we really saw a lot of feedback from our previous models that the output wasn't very realistic or altered their face or their bodies. One of our mandates was how do we actually make the image feel more like yourself. And so all the things that you think the model knows, it does, because it has imbued the knowledge of the world into its conscience and is able to visually communicate that back to you as a user. And so putting that all together, I think we really get a state-of-the-art image generation model that is the best aesthetic model out there on the market right now. That really represents a new paradigm for image generation, which is a huge part of, I think, AI progress at large that we have an opportunity to work on here.
C
We often listen to feedback on social media too. So we kind of just take all these things and basically are just aware of them and try to make sure that they're mitigated or completely fixed in some cases in the next iteration.
A
What kind of use cases are you seeing? What are you seeing people do with this now?
C
I think one that's particularly close to the research team in general is infographics and text. I think text in images is so much better nowadays. So I think it just opens up a lot more productive use cases. And from the research side, we kind of think image generation used to always be about fun and maybe unproductive things, but now we're really seeing steps forward into productivity and image generation for any type of use case that you can imagine it for.
A
So you mentioned text. I remember the early models. No disrespect to chimpanzees, but getting it to spell something like "OpenAI" looked like a chimp did it. And now I'm looking at pages of text and finely detailed stuff. And I know that as models get smarter, variable binding, the ability to put things next to each other, improves. But this was just a big improvement.
C
Yeah, but I don't think it's completely unexpected. I think you see a lot of growth in between. First, between DALL·E 3 and GPT Images 1: if you asked for a grid of random objects, you go from maybe 5 to 8 in DALL·E 3 to maybe around 16 in Images 1. And then with 1.5, we went to about 25 to 36 consistently. And I think now we could probably do over a hundred. This is a test that we might do internally: we just ask ChatGPT, give me a list of a hundred random objects, right?
A
Yeah.
C
And then we just send that to our image generator and see how many are correct. And usually, you know, it'll get almost all 100 correct. But you see the constant growth over time. So I don't think it's completely unexpected. It's just a steady pace.
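[Editor's note: the grid test described above can be illustrated with a small scoring script. This is a minimal sketch under stated assumptions, not OpenAI's internal harness. The function names `build_grid_prompt` and `score_grid`, the prompt wording, and the idea that a human reviewer supplies the list of objects recognizable in the generated image are all hypothetical.]

```python
import math


def build_grid_prompt(objects):
    """Build a text prompt asking for one object per grid cell (hypothetical wording)."""
    # Smallest square grid that can hold every requested object.
    side = math.ceil(math.sqrt(len(objects)))
    listing = ", ".join(objects)
    return (
        f"Draw a {side}x{side} grid with exactly one object per cell, "
        f"in this order: {listing}."
    )


def score_grid(requested, recognized):
    """Return (correct_count, accuracy) given the objects a reviewer recognized."""
    # Normalize case and whitespace so "Apple" and "apple " count as the same object.
    requested_set = {name.strip().lower() for name in requested}
    recognized_set = {name.strip().lower() for name in recognized}
    correct = len(requested_set & recognized_set)
    accuracy = correct / len(requested_set) if requested_set else 0.0
    return correct, accuracy


if __name__ == "__main__":
    requested = ["apple", "bicycle", "cactus", "drum"]
    # Reviewer's labels for the generated image; "teapot" was never requested.
    recognized = ["Apple", "cactus", "drum", "teapot"]
    print(build_grid_prompt(requested))
    print(score_grid(requested, recognized))
```

[In this sketch, only requested objects that the reviewer also recognized count as correct, so extra or hallucinated objects do not raise the score. Scaling the object list from 16 to 36 to 100 would reproduce the kind of progression mentioned in the conversation.]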
A
That was a test I used to use for the really old models, back with Ada, Babbage, and Curie: list 100 science fiction books. And by the time it got to, like, 22, it would just start repeating, because you'd realize the model had reached the end of it. So we've seen stuff too, like 360-degree panoramas. How did that happen?
B
Yeah, that really came from the emerging capability of the model, which is the ability to render images in any aspect ratio. We discovered that people were generating really long, amazing panoramics, skinny bookmarks as well. And one of the cool capabilities of the model is that not only were you able to generate images in this panoramic aspect ratio, but you could also render images in the style of 360. And we saw that it was really fun to actually view these images in a 360 world itself. And so that was a really fun feature that we ended up adding into the product. And it's available on ChatGPT, on web and mobile right now.
A
First thing I did was I made a version of Dogs Playing Poker in there, so you could sit there like you're one of the dogs looking around. Which was not something I expected, but it's fun.
B
Yeah, I mean it's really awesome to see how people are exploring new use cases and fun things that they're creating with the model, even far beyond what we expected users to be using it for. I think when we were designing the model we were really deliberate in understanding what people really wanted to see from image generation. There was a lot of latent demand in image generation. People were mostly using it for personal use cases. But we definitely saw a lot of inklings of people wanting to push the model in certain directions that the model wasn't good at. So text rendering was definitely one of those dimensions that we really wanted to improve on. Multilingual was another. And I think world understanding generally is so much better in this model. And that typically means that now people online are sharing a bunch of examples of them using ImageGen for all different kinds of use cases that we didn't even think existed out there. So I think the model's understanding of aesthetic beauty across multiple different outputs, whether it's a fun meme, an image for a 5-year-old, or a professional consulting deck, the expansion of opportunity and outputs has been amazing to see in this latest model.
A
It's funny too how one of the things that was trending was taking popular images or photos of people and then having the model make like kind of janky looking Microsoft Paint versions of that.
B
Yes.
A
And did you think that was something you would see, that people were going to use this incredibly capable tool to then go make these silly looking things?
B
Yeah, it's funny because it takes a lot of intelligence to actually create something that is imperfect.
A
That's what I tell people all the time.
B
Yeah. And it's definitely very interesting. In the viral trends that we're seeing online right now, one thing that I think people are really striving for is authenticity, imperfection, nostalgia. We're seeing that in the MS Paint prompt, crayons, all different kinds of generations that people are creating. And that really feels like the theme of consumers is they want to interact with AI in a very authentic, imperfect way, that they want to show their imperfections and use AI to help make them look good, but also show a more fun and goofy side of themselves. And I think that self-expression via AI is something that we're really excited about, and I think it's really part of our mission as a company to make it easier for people to learn more and distribute that intelligence, but also letting them express a version of themselves that maybe wasn't possible before.
A
Kenji, was there a moment with this model where you're saying to yourself, wow, I think this is ready to go?
C
You know, as it's training, we take a checkpoint and then like we just sample from it, right? And just see, okay, how good is this thing? And I think like we just sampled a checkpoint, a model, an image, and we looked at it and we're like, all right, this is better than images 1. We were just like, okay.
A
I remember watching the iteration of one of the early versions of DALL·E and how at first it was sort of the wispy, sort of weird, tendril sort of thing. And talking to one of the researchers, like, is that going to go away? It's like, I think we're probably two runs away from that. And then just like that, the ability to predict that was amazing to me. And all of a sudden everything got crisp and clear. And then also, you know, years ago I'd played with GANs, and doing those things you have to squint and say, I think it's a pickup truck or something like that.
C
Yeah.
A
So it's interesting what you see as you say, okay, this just all of a sudden got much better.
C
And I mean it was just very obvious. You just take the early checkpoint, you sample an image from it, and then you sample an image from, you know, Images 1, and you just look at the two and you're just, there's just, there's.
A
Why do I like this garbage? This is.
C
I forgot what the image was. It might have just been a picture of a woman on the seaside, like, you know, overlooking a seaside. We just looked at it and we're like, all right, there's, like, no question.
A
Yeah, the big jump was the photorealism, going from something that was more of a glossy, idealized magazine cover to something that looked like a really good photograph. So help me understand, besides just more compute, how did this happen? How did you get a model that's much better and that also doesn't take an hour to generate an image? I remember in the DALL·E days, we would literally have to say, you know, tell us what you want, and then an hour later it'd be on Instagram. Now these things are in ChatGPT and it's faster. How is it getting more intelligent while you're maintaining the same speeds?
C
I think we learned a lot in each release, like between 1, 1.5, and now 2. And so we take each of the learnings that we've made. For example, speed, right? One of the things is like, oh, can we make the model more token efficient? Or something like that. And we did a lot of work to make it produce very good images with fewer tokens.
B
I think the post training for this model was very interesting in the sense that we really had to think about not only does the model understand world knowledge and how things look and science contexts, math, et cetera in an image, but also what is the taste that will resonate with users. What makes the model or output beautiful? How do you make it look realistic? These are all questions that we had to grapple with when we were post training this model. Because I think that one of the things that was really important for us was that this model was the strongest aesthetic model out there right now, which means that it has more creativity in various different outputs, no matter what that output is, if it's a professional output or a personal output. That range of training and the range of use case I think made training this model a very interesting problem.
A
Do you have any personal favorite benchmark tests you like to do? Things where you say, I want to see it make an image of this.
B
I have an eval that I call the me me me eval.
A
Okay.
B
It's essentially 100 photos of myself and my friends and my family, and I put everyone in goofy positions. I have a birthday card for just about every single person. And I think it's a really great eval in the sense that you know the faces of the people around you the best. You also want to create funny things with the model and do things that are relevant. And so one thing for me as the product manager that I'm testing is not only is the raw capability of the model really great, but also, does ChatGPT understand what I want in that context? You know, ChatGPT remembers that I have a brother, that I have a mom and dad, and what they like to do. And so does the model accurately know how to insert pieces of personalization in the moments that matter in the images? These are things that I'm testing for.
A
How about you?
C
Besides the grid one I mentioned earlier, that's probably the one I've used the most for a while. I think Divya and I were doing a lot about photorealism. We were trying real hard to push on that. Just basically. I know Divya's favorite one was like, a woman holding a jug of orange juice. I don't know if you see. Yeah, there's like, so many images of a woman holding a jug of orange juice.
B
Well, I actually feel like the researchers had a more standard set of images that they, like, lean on.
A
Yeah. And you get, like, the standard ones. Can it do somebody writing with their left hand, with a watch on their right hand, and a clock showing this. I think the big leap of the images, probably in 1 or 1.5, was, like, a half-full glass of wine.
C
The wine glass full to the brim.
A
Yeah, yeah, exactly. And there were ways I was able to prompt it to do it, but it was really hard. You had to get really descriptive, like, you know, red liquid inside this. This one is so fun to prompt. There was a thing people said, oh, can it do, like, pixel-accurate, pixel-art-style images? And somebody was like, no, it can't. And when I hear that, I'm like, okay, let's try. And I found out if I gave it, like, a 64 by 64 grid and I said, go draw the art in there, it did. It just was able to put art into there. And that was amazing to see those kinds of results. The promptability of this is insane. How do you plan for that? Does it just happen? You're like, oh, wow, this is better at understanding this.
B
People come to ImageGen with very vague prompts. Make it better, make me look better, make me cuter. All these things are really vague. And I think it's really the job of the model and the harness to distill that into actually what users want. And I think that's a personality of the model that we trained over time that we've really harnessed the power for. And honestly, I think it also yields a lot of really surprising results that people may not expect. And that surprise is just part of the fun of using ImageGen.
A
I've seen, like, two kinds of prompting sort of emerge. And I remember back with DALL·E, I thought, like, oh, I'm a prompt engineer. I'll be really good at this. And I'd, you know, make a raccoon in space and feel proud. And then I'd see an artist, somebody who wasn't a prompt engineer, somebody who actually came from that world, and I'd watch them use their language, and they were doing amazing things. And that seems like that's still holding true.
B
Definitely. I mean, we worked with a group of artists very closely when we developed this model, and we're very inspired by artists, designers, marketers, all these different professions that I think have a different way of approaching their profession. And one of the things that was very important for us is we wanted to take the inspiration as well as the best practices for those professions and distill that into the way that people interact with the model. And so that's something we've deliberately tried to focus on. One hack that I've seen work really well is the ability to upload inspiration or context into the model. And the model has an incredible ability to take the spirit of that context and translate it into the output.
A
It's interesting because I think that a lot of people worry that, oh, I just push a button, I get something beautiful. And with each model that gets better, it's easier, as you said, to not have to put a lot of effort into it, but when people do put effort into it, they are getting even more amazing results. And it seems like if you're artistically inclined, you're actually getting even greater control. Because now, like you said, it understands more about what you're talking about when you talk about depth of field and these other things or whatever you're trying to do. And as you mentioned, it was exciting to see with earlier models, artists who said, oh, I gave it my originals and it gave me these variations, and I know which one works. And just seeing that as this real creative amplifier.
B
Yeah, definitely. I think having creative direction or taste or judgment and bringing that to the model is the best way to push it further. I think one thing about this model that I'm really excited about is how it expands the creative outlet for people. I think the ability to create multiple different styles or types or variations has never been easier than with this ImageGen model. And I think it's also its understanding of different contexts. Like the way that it's able to shift what it's like to be generating an architectural diagram all the way to the aesthetics of a children's book. The ability for it to move so seamlessly across these vectors has been really awesome.
A
The ability to do great infographics and diagrams is very powerful. What kind of feedback have you been getting from people in research and education?
C
We actually have an internal alpha channel where we test our models. And in that there's, like, a sub-channel dedicated specifically to educators of any level, from elementary school all the way up to graduate level. One of the coolest things I saw was there was a biology professor, and he put up these graduate-level textbook page renderings of things I had no clue about. And he said it was perfectly accurate.
B
I think the ability for this model to distill very complex topics into something that is really easy to understand within an image is one of its strongest capabilities. And we've seen this with students and teachers who are using ImageGen to learn different concepts, to help them create study guides, and to help create personalized content. I think personalized learning is a huge trend that we're very passionate about. And I think the ImageGen model helps you as a teacher create something that every kid can understand in their own language and their own preference. And that is something that we're really excited about. We're thinking about this in the context of also, how do we bring more of the elements of ImageGen into ChatGPT at large so that when people are trying to learn concepts, we're teaching them with ImageGen.
A
I remember when I was in school, kind of prior to a lot of multimedia blowing up, posters were a big thing: a classroom poster explaining stuff. This really reminded me of how powerful an infographic can be, because it allows you to bring as much attention as you want to it. And you can spend the time looking at it and seeing it, and you can put a lot more detail into it.
B
I think one really awesome visual shift that I've seen with ImageGen is that now in internal presentations, over 50% of the slides are created with ImageGen. And that permeation of communication via images is so powerful when you're trying to explain your concepts or illustrate what you mean. And I think infographics and the text rendering capability as well as the composition of the text on the page is incredibly powerful with this model. The model's understanding of not only what to say, but how to present it is a superpower. And I'm really excited about future explorations of this where we can think about how do we make this even better, how do we improve the composition, the different kinds of outputs, and also make it editable in the product. These are directions that we're really excited about.
A
How do you see the progression of this? This is great, but typically anytime I talk to somebody at OpenAI about what they're working on, they're like, yeah, this is good, but.
B
I think we're still super early in exploring all the different use cases that people are really trying to push the model with. And so one of the things that we're really excited about is what is that next stage for ImageGen, which is to create the creative agent, ultimately the agent that can work alongside you, be your creative assistant and really understand how you work, what your preferences are, what is the output that you want to get to. And build the product and model ecosystem that helps users kind of have a personal interior designer, personal architect, personal wedding planner, et cetera, all in one image.
A
Then I'll tell you another thing that was kind of amazing. I write books, and so every now and then I have a book come out and I've got to change my social media headers. And I just went and I said, oh, find my book cover and, you know, create an appropriately sized social media header that I can put on X or Facebook or whatever. And, like, first shot: right aspect ratio, everything.
C
We basically did that from the start, or trained the models to be good at that from the start. I remember I worked on the initial de-risks of it. Basically it could do any aspect ratio that you ask.
B
Yeah, yeah. You can now really just easily specify the outcome that you want. Like in the case of yourself, you're like, I want promotional material. I don't have an idea. I didn't specify exactly what I wanted. But the model was able to do the research and then give it to you in the style and aspect ratio that was relevant to you. And that's super powerful. We're already seeing this. You're an author. I've talked to real estate agents who are using ImageGen to help them create listings for their apartments or stage their listings. YouTube creators have talked to me about using ImageGen for their thumbnails and promotional content. I've talked to top artists who want to use ImageGen to connect with their fans. And I think the ability for all different kinds of professions to start to use ImageGen to help them with visual creation is super powerful, especially if you're working in a visual and creative industry. ImageGen is such a hack in your professional toolkit. I think it has to be a part of everyone's everyday workflow in the future.
A
I think this feels like the first time where, for anything I can reasonably come up with, it does a pretty good job of it.
B
We think it's a new paradigm for generation altogether. Like, you know, we said this in the launch video: if DALL·E was the Stone Age, ImageGen 2.0 is the Renaissance.
A
Yeah.
B
And I think that is so true because the model is not only great artistically and aesthetically, but it also incorporates, you know, science, art, architecture, all in one image together. And I think that composition and knowledge that the model has just means that the outputs are so much more trustworthy, are more powerful and enable so many more use cases. I think that ImageGen and Codex is also an amazing intersection of the capabilities that we're setting out to create with both ImageGen as well as coding agents. So many people are using ImageGen as a first step to designing a new website or creating a new app. And I think that intersection of having a really strong aesthetic model, which is image generation, in combination with strong coding abilities means that now you're able to zero-shot really amazing apps from scratch with both of these tools.
A
Yeah, I asked it in Codex. I took my website and said, could you create me some, you know, some different concepts for it? And it did these contact sheets. I asked for contact sheets, and it gave me, like, four images there. And I said, oh, the one on the upper right, can you go make that? And I watched Codex go make that, which was like, this feels like magic. And then they've implemented it as part of pets. And so if you're using Codex and you say, hey, I want to have, like, I love ravens, so I have a raven. I said, can you make a raven? And then I watched it pull up the ImageGen tool and iterate and make the sprites for it.
B
Yeah, yeah. Sprite sheets are going viral. Same with game design. People are loving using ImageGen to help them create new worlds.
A
Any hints on how to do better sprite sheets?
C
I mean, I've tried to make GIFs internally, and I think if you just use the thinking mode or Codex and you basically just ask it to generate one initial sprite, it's really good. And then you can just say, can you make the rest?
B
The consistency across multi images has been amazing. We've seen a lot of people try creating 10 page comic books.
A
Yeah.
B
With consistent storylines, you know, multi page slides. I think that consistency of characters and aesthetics is completely unique to this model.
A
That was an example too where there were a lot of workflows out there for working with image models that were kind of janky, but you had to figure out how to do them. And it's great now because I can create characters and say, make a character sheet with the different poses and stuff, and just go feed it back in and say, okay, now doing this, now doing that. Often what we need is obviously a smarter model, but context length did so much for ChatGPT, did so much for coding. And an image model that's able to reliably use these references is incredibly capable.
B
Yeah, for sure. And we're still trying to improve that as well. It's not perfect today. We're really trying to develop this visual creation layer for people, because every single person has an aesthetic or personal style or preference. And we're really trying to imbue that into the product that we're building so that people can get to the output that they're wanting easier and faster with ImageGen.
A
Any parting prompt tips for people?
B
Well, one of the things I would suggest people try is ImageGen thinking. So if you navigate to the thinking or pro models, we have a more powerful version of ImageGen in that experience. And in that model you actually are able to search the web, analyze files, leverage tools under the hood, which then yields a better quality and higher composition photo. And the suggestion that I have for prompting that experience is be open-ended. I think the model will go and do the exploration itself to understand and try to reason and find information that matters. And I also think giving it a sense of an aesthetic is super helpful. Grounding that in a style has been really fruitful for a great result.
A
Good one, good one.
C
I think just being very particular about the style or what you like in general. For me, I like minimalist infographics. Sometimes I think the model can be a little dense and so I just, maybe I'm just a simplistic kind of guy. So I just like very, very clean, a very clean look. So I like that.
A
Adele, Kenji, thank you very much.
Host: Andrew Mayne
Guests: Kenji Hata (Researcher), Adele Lee (Product Lead)
Date: May 14, 2026
This episode explores the leap represented by OpenAI's Images 2.0 image generation model, which Adele Lee describes as a Renaissance moment for AI-driven visuals. Host Andrew Mayne sits down with OpenAI's Kenji Hata and Adele Lee to discuss technical advancements, user reactions, development insights, and the creative and practical impact this next-generation model is having across industries, from personal artistic expression to professional workflows and education.
“What is the gap in the market today that we want to fill and what is the opportunity that we want to grasp here.” — Adele Lee (01:14)
“I just found my way just working on helping them work on images 1.0 prior to the launch. And so gradually I moved more and more onto the project and then I became full time on it.” — Kenji Hata (02:01)
“The visual communication reaction that we've seen from our users... hey, this is the best, highest fidelity, highest quality and aesthetic model that we've seen has been really awesome.” — Adele Lee (02:29)
Technological Improvements
“The ability for text on a page is so much better... language and words actually make sense and they're actual words... making this model work in various different languages.” — Adele Lee (03:28)
Creative and Practical Expansion
“You could also render images in the style of 360... we saw that it was really fun to actually view these images in a 360 world itself.” — Adele Lee (07:36)
Personalized & Contextual Output
“It takes a lot of intelligence to actually create something that is imperfect... they want to interact with AI in a very authentic, imperfect way.” — Adele Lee (09:53)
“If DALL·E was the Stone Age, ImageGen 2.0 is the Renaissance. It's not only great artistically and aesthetically, but... also incorporates science, art, architecture all in one image.” — Adele Lee (00:19, reiterated at 24:44)
“If you asked for a grid of random objects, you go from maybe like 5 to 8 in DALL·E 3 to maybe around 16 in Images 1. And then with 1.5, we went to about 25 to 36 consistently. And I think now we could probably do over a hundred.” — Kenji Hata (06:21)
“One thing that I think people are really striving for is authenticity, imperfection, nostalgia... That really feels like the theme of consumers is they want to interact with AI in a very authentic, imperfect way.” — Adele Lee (09:59)
“Having creative direction or taste or judgment and bringing that to the model is the best way to push it further.” — Adele Lee (18:51)
“The way that it's able to shift what it's like to be generating an architectural diagram all the way to the aesthetics of a children's book... has been really awesome.” — Adele Lee (18:51)
“I have an eval that I call the me me me eval. It's essentially 100 photos of myself and my friends and my family, and I put everyone in goofy positions.” — Adele Lee (14:11)
The conversation conveys that Images 2.0 marks both a practical and a creative step forward in AI image generation. Its improvements in text rendering, multilingual support, photorealism, and contextual adaptability open new use cases for users across industries, making high-fidelity visual content more accessible and expressive. The podcast ends with prompting tips: try the thinking mode, stay open-ended, and ground the request in a clear aesthetic or style.