The AI World Reacts to OpenAI's Powerful New Tools
Release Date: April 22, 2025
Host: The Joe Rogan Experience of AI
Introduction to OpenAI's Latest Releases
In this episode, the host delves into OpenAI's significant advancements in the artificial intelligence landscape, particularly focusing on their newly upgraded transcription and voice-generating models. These enhancements are designed primarily for developers, promising substantial impacts across the AI ecosystem by enabling seamless integration into various software and services.
Upgraded Transcription Models: Moving Beyond Whisper
OpenAI has replaced its long-standing Whisper model with the new GPT-4o Transcribe and GPT-4o Mini Transcribe models. These models boast improved accuracy and are trained on a diverse, high-quality audio dataset, including data from "chaotic environments" to enhance robustness.
Improved Accuracy:
- Jeff Harris, a member of OpenAI's product staff, stated at [12:45], "These models are much improved versus Whisper on that front. Making sure the models are accurate is completely essential to getting a reliable voice experience."
Language Support:
- While English transcription has seen significant enhancements, languages such as Tamil, Telugu, and Malayalam still exhibit a word error rate of approximately 30%, indicating room for improvement.
Despite these advancements, OpenAI has chosen not to open-source the new transcription models. This marks a departure from their previous strategy with Whisper, aimed at retaining control over more powerful AI tools and potentially securing additional revenue streams.
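Since the new transcription models are exposed only through OpenAI's API rather than as open weights, using them means calling the hosted endpoint. The sketch below shows what that call could look like; the model identifiers ("gpt-4o-transcribe", "gpt-4o-mini-transcribe"), the SDK calls, and the file name are assumptions based on OpenAI's published API and should be checked against the current documentation.

```python
"""Sketch: calling OpenAI's hosted transcription models (not open-sourced,
so API access is the only route). Assumes the `openai` Python SDK >= 1.x
and an OPENAI_API_KEY environment variable."""


def pick_model(low_cost: bool = False) -> str:
    """Choose between the full model and the cheaper Mini variant."""
    return "gpt-4o-mini-transcribe" if low_cost else "gpt-4o-transcribe"


def transcribe(audio_path: str, low_cost: bool = False) -> str:
    """Send an audio file to the transcription endpoint and return the text."""
    from openai import OpenAI  # deferred import: requires `pip install openai`

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model=pick_model(low_cost),
            file=audio_file,
        )
    return result.text


if __name__ == "__main__":
    # "interview.wav" is a hypothetical local file.
    print(transcribe("interview.wav"))
```

The Mini variant trades some accuracy for lower per-minute cost, so the `low_cost` switch is one plausible way an application might choose between them.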
Enhanced Text-to-Speech Models: GPT-4o Mini TTS
OpenAI's new text-to-speech model, GPT-4o Mini TTS, introduces more nuanced and realistic voice generation. Unlike previous models that offered a limited selection of voices, GPT-4o Mini TTS is extensively steerable, letting developers customize the voice's tone, emotion, and style to fit specific contexts.
Steerability and Realism:
- At [08:30], the host explains, "Now you get to decide what the voice is. It's trained off of so many different styles and voices that it knows and you can put them all there."
- Jeff Harris further elaborated in an interview with TechCrunch at [20:15]: "In different contexts, you don't just want a flat monotone voice. Our big belief here is that developers and users want to really control not just what is spoken, but how things are spoken."
Use Cases:
- Customer Support: Adjusting the voice to be apologetic when addressing customer grievances.
- Motivational Services: Creating voices that can act as energetic gym coaches to motivate users.
These advancements enable more engaging and emotionally resonant interactions between AI agents and users, enhancing user experience across various applications.
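The steerability described above could be exercised through a free-text steering prompt passed alongside the input text. In this sketch, the model name ("gpt-4o-mini-tts"), the `instructions` parameter, and the voice name "coral" are assumptions based on OpenAI's published API reference; verify them before use.

```python
"""Sketch: steerable speech generation with GPT-4o Mini TTS.
Assumes the `openai` Python SDK >= 1.x and an OPENAI_API_KEY variable."""


def build_speech_request(text: str, style: str, voice: str = "coral") -> dict:
    """Assemble keyword arguments for the speech endpoint. `style` is a
    free-text steering prompt, e.g. 'calm and apologetic'."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": style,  # controls how, not what, is spoken
    }


def speak(text: str, style: str, out_path: str = "reply.mp3") -> str:
    """Generate audio and stream it to a local file; returns the path."""
    from openai import OpenAI  # deferred import: requires `pip install openai`

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        **build_speech_request(text, style)
    ) as response:
        response.stream_to_file(out_path)
    return out_path


if __name__ == "__main__":
    # The customer-support use case from the episode: an apologetic tone.
    speak(
        "I'm so sorry about the mix-up with your order.",
        style="Speak in a sincere, apologetic tone, slightly slower than normal.",
    )
```

Swapping the `style` string for something like "high-energy gym coach" would cover the motivational use case with no other code changes, which is the practical appeal of steerability.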
Developer Implications and Integration
With the release of these advanced models via API, developers can now embed sophisticated transcription and voice capabilities into their applications. This accessibility fosters innovation, allowing a wide range of industries to leverage OpenAI's technology for diverse applications.
Integration Flexibility:
- The host shared personal experience at [05:50]: "I'm building AI Box and I know a lot of other people use. I'll show you some demos of what this actually sounds like because I have been very, very impressed."
Unified Ecosystem:
- OpenAI's models are deeply integrated into one of the largest AI ecosystems, so improvements to their technology elevate the performance of countless applications and services that rely on their API.
Ethical Considerations and Potential Misuses
The host also touched upon the ethical implications of these powerful AI tools. With the ability to modulate voice tones based on user emotions, there's a potential for misuse in manipulating individuals' emotions or spreading misinformation.
Manipulative Potential:
- At [22:10], the host mused, "Imagine this is a possibility... these agents are out there, their ability to manipulate people or to help people improves."
Need for Safeguards:
- Emphasizing the importance of responsible AI development, the host stressed, "We have to build our own safeguards and understandings of how these things work."
OpenAI appears to be aware of these challenges and is likely to implement measures to prevent malicious use, although specific strategies were not detailed in the discussion.
Future Outlook and Conclusion
OpenAI's latest advancements in transcription and voice generation mark a significant leap forward in AI capabilities. By providing developers with more flexible and realistic tools, OpenAI is setting the stage for a new wave of AI-driven applications that offer more personalized and emotionally intelligent interactions.
Anticipated Growth:
- The host anticipates an increase in the number of AI agents, stating at [15:30], "We're going to see more and more agents pop up in the coming months."
Continued Evolution:
- As OpenAI continues to refine and expand its offerings, the integration of these advanced models is expected to enhance the functionality and user experience of applications across many sectors.
In summary, OpenAI's powerful new tools represent a pivotal moment in the AI ecosystem, offering unparalleled opportunities for innovation while also necessitating careful consideration of ethical implications. As these technologies become more embedded in everyday applications, their impact on both developers and end-users will be profound and far-reaching.
Notable Quotes:
- Jeff Harris on Voice Control [20:15]: "In different contexts, you don't just want a flat monotone voice. Our big belief here is that developers and users want to really control not just what is spoken, but how things are spoken."
- Host on Steerability [08:30]: "Now you get to decide what the voice is. It's trained off of so many different styles and voices that it knows and you can put them all there."
- Host on Ethical Implications [22:10]: "We have to build our own safeguards and understandings of how these things work."
