Transcript
Jordan Harbinger (0:00)
OpenAI has made some big new releases, and because they're aimed at developers, they're going to have an impact across the entire AI ecosystem. So I'm going to be getting into all of that. Essentially, they have upgraded their transcription and voice-generating AI models. This is something I've personally embedded into my own software, AI Box, and I know a lot of other people use it too. I'll show you some demos of what this actually sounds like, because I have been very, very impressed. Overall, when OpenAI makes a big move like this, it's a big deal, because it gets embedded into so many other software products and services. So I'll be talking about all of that. Before we get into the episode today, I wanted to mention: if you've ever wanted to grow and scale your business using AI tools, you need to join my AI Hustle school community. Every single week I release an exclusive video that I don't share anywhere else, showing how I use AI tools to grow and scale my companies: the workflows, the numbers, everything I can't really share publicly. We have over 300 members, and the thing I love about it is the range of people, from founders of hundred-million-dollar companies to people just getting started on their entrepreneurial journey. You get a lot of perspectives in there. So no matter where you're at, you're going to find people who can share great insights about the AI tools they're using and really help kickstart your journey. I used to have this at around $100 a month, and I've dropped the price to $19 a month, so it's discounted right now. It's a great deal, and if I ever raise the price in the future, you lock in the price now and it won't be raised on you. There's a link in the description. I'd love to have you join, and I'll see you in the school community.
All right, let's get into what OpenAI is doing. Like I mentioned, they have upgraded their transcription and voice-generating models, and specifically they're doing this in their API for developers. Based on what I've listened to, this is much better than their previous versions. I've done a lot of testing for my own software company, and essentially, transcription means you upload an audio file and it creates text, like doing captions; or you go the other direction, give it text, and it generates audio. It goes back and forth, right? The transcription side has been called Whisper, I think, and it's really, really cool. One thing I do want to mention here is that as they roll this out, we're getting closer and closer to the point where a lot of companies and AI labs are talking about agents and their agentic vision: how they're going to build automated systems that go and accomplish tasks independently. And what I think is really important is that for a lot of these things, you need a voice. Imagine an AI travel agent I can talk to about my trip that gives me recommendations. If you're just doing that by text, which you totally can, technically that would work and it can get things done. But for so many of these agents to feel more realistic, you need that voice. OpenAI has been really pioneering on the frontier with their voice models, and in their consumer app you have these really powerful voice models you can chat with, but those didn't always translate into what developers could get their hands on.
And so it's cool that now they have this API where you're able to do that, and they've improved a lot of things. Beyond just generating a generic voice, it sounds quite realistic. I'll give you a demo in a second, but I think this is amazing. The other thing I think is really interesting is what they actually said to TechCrunch in an interview, quote: "We're going to see more and more agents pop up in the coming months. And so the general theme is helping customers and developers leverage agents that are useful, available, and accurate." I believe that was OpenAI's head of product, Olivier Godement, talking about what a bunch of these updates were for. A lot of these are directed at business customers, right? It's not like your average person who uses ChatGPT on their phone cares that they came out with a better API for text-to-speech and speech-to-text. But the reason I'm so excited about it, and why I think it's important for you to know, is that whether you're a developer or not, every application you use that ties into this ecosystem, the biggest ecosystem of AI models there is, is going to start using these new models. They're just getting better and better, everything we use is getting better, and all of the agents we're going to be using in the coming months and years will be relying on this. For me, that's why I geek out about it and think it's cool. As for the specific updates: they've said their new text-to-speech model, gpt-4o-mini-tts, is now more nuanced and realistic-sounding, and it's also, in their words, more "steerable" compared to their previous speech models.
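For developers, the "steerable" part shows up as an instructions field sent alongside the input text. Here is a minimal sketch of what a request body for the new TTS model could look like, based on OpenAI's published audio speech endpoint; the voice name and the instruction strings are illustrative examples, not taken from the episode:

```python
# Sketch of a steerable TTS request body (POST /v1/audio/speech).
# The model name and field names follow OpenAI's published audio API;
# the specific voice and instruction text here are just examples.

def build_tts_request(text: str, style: str, voice: str = "coral") -> dict:
    """Assemble the JSON body for a text-to-speech request."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        # The new part: free-form steering instructions for tone and delivery.
        "instructions": style,
    }

req = build_tts_request(
    "Your package has been delayed. We're very sorry.",
    "Speak in a calm, apologetic customer-support tone.",
)
print(req["model"])  # gpt-4o-mini-tts
```

In a real integration you would send this body with your API key and receive audio bytes back; the point is that tone is now controlled by plain-language instructions rather than a fixed voice picker.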
So essentially, as a developer, you can now get it to say things in much more natural language. You can say "speak like a mad scientist" or "use a really serene voice," or, one I've tried, "act like you just went on a run and you're super out of breath." It can talk in all of those variations. What's interesting is this was available in the ChatGPT app as of months ago, but it wasn't available for developers to roll out. So OpenAI kind of had a monopoly on this really cool tech, which, I mean, they made it, so that's totally fair. But it's really exciting that developers are now going to be able to start building those nuanced voices into everything else. Anyone will be able to use this. OK, I'm going to give you a sample of what they're calling a true-crime-styled voice. And then they also have a sample of a female professional voice, a very serious-sounding voice. I think this is really amazing. And the cool thing is that it's so steerable. If I say, I want this type of person to speak in this type of way, acting like a gym coach, really rah-rah motivational, it will change how the thing talks. To me that's so exciting. It's beyond what we've had in the past: a drop-down where you pick your favorite of seven or eight voices. Now you get to decide what the voice is; it's trained on so many different styles and voices that you can ask for almost anything. Here's what they specifically said. Jeff Harris, a member of the product staff at OpenAI, said in an interview, quote: In different contexts, you don't just want a flat, monotone voice.
If you're in a customer support experience and you've made a mistake, you want to make the voice apologetic, and you can actually have the voice carry that emotion. Our big belief here is that developers and users want to really control not just what is spoken, but how things are spoken. End quote. I love this concept. If I'm calling customer support and I'm really mad, they could literally run sentiment analysis on what I'm saying: OK, this person is upset, change your tone to be more apologetic. Or, this person seems very happy, match their mood or vibe. There are all these possibilities, to the point where, and I know this sounds terrible, but this will happen, so put it on your radar: if I wanted to politically polarize people in a country and I had some robocaller using this, I could tell it, this person's really mad, get mad at them back, try to rile them up. I'm sure this is one of the things OpenAI is trying to stop from happening, but imagine that as a possibility. I'm putting it out there as something people will try to do. Am I excited about that one? No, and I think they'll probably shut it down. But I just say be aware, because as these agents get out there, their ability to manipulate people, or to help people, improves. We have to build our own safeguards and understanding of how these things work. It's very interesting what this will be capable of doing in the future. So, their new speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, are essentially replacing Whisper, the model they've had for a really long time. They said they've, quote, trained it on a diverse, high-quality audio data set. They don't ever tell you exactly where they got that data set from.
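The customer-support idea a moment ago, detect the caller's mood and steer the voice to match, could be wired up very simply. This is a toy sketch: the keyword-based sentiment check is invented for illustration (a real system would use a proper sentiment model), and the instruction strings are hypothetical:

```python
import string

# Toy sketch of sentiment-driven tone switching: pick a TTS steering
# instruction based on a crude keyword read of the caller's text.
# The keyword sets and instruction strings are made up for illustration.

ANGRY_WORDS = {"angry", "ridiculous", "unacceptable", "furious", "worst"}
HAPPY_WORDS = {"great", "thanks", "awesome", "love", "perfect"}

def pick_tone_instruction(caller_text: str) -> str:
    # Lowercase and strip punctuation so "unacceptable!" still matches.
    words = {w.strip(string.punctuation) for w in caller_text.lower().split()}
    if words & ANGRY_WORDS:
        return "Use a calm, apologetic tone and acknowledge the frustration."
    if words & HAPPY_WORDS:
        return "Match the caller's upbeat mood; sound warm and friendly."
    return "Use a neutral, professional tone."

print(pick_tone_instruction("This is unacceptable, I am furious!"))
```

The returned string would then be passed as the instructions field of the TTS request, so the same reply text gets spoken in a different tone per caller.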
They say they even trained it in, quote unquote, chaotic environments, which is interesting. What I would assume here, because they've been kind of scared to say anything about this in the past, is that a lot of this was probably YouTube. I mean, you can imagine: someone filmed a YouTube video of people arguing, someone filmed a YouTube video of someone apologizing, someone filmed a YouTube video of literally everything in the world, and they just grabbed the audio from that. That's my assumption of how they'd get such a powerful model, based on how some executives responded, "oh, I don't really know if we used YouTube," and then resigned, aka Mira Murati. I would just say it's almost definitely been trained on YouTube. Anyways, am I mad about that? I don't know. But I'm stoked the technology is better. Here's what Harris also said about this, quote: These models are much improved versus Whisper on that front. Making sure the models are accurate is completely essential to getting a reliable voice experience. And accurate in this context means that the models are hearing the words precisely and aren't filling in details that they didn't hear. End quote. So they're talking about not letting these things hallucinate. According to their own internal benchmarks, it is much more accurate. There's a metric they call word error rate, and it's around 30% right now for Indic and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada. That means roughly three out of every ten words the model gives you in those languages will differ from a human transcription, which is not fantastic. But for English, it's obviously much better. Now, OpenAI is not planning to make this transcription model openly available, which is a change from what they've done in the past.
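To make the "three out of every ten words" stat concrete: word error rate is just the word-level edit distance between the model's transcript and a human reference, divided by the number of words in the reference. A quick self-contained sketch:

```python
# Word error rate (WER): the minimum number of word-level substitutions,
# insertions, and deletions needed to turn the hypothesis into the
# reference, divided by the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, over words instead of chars.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# "sat" -> "sit" is one substitution, dropped "the" is one deletion:
# 2 errors over 6 reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

So a 30% WER means the transcript diverges from the human reference on about one word in three, which is why that number is considered weak even though the English scores are far better.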
Historically, they released new versions of Whisper for commercial use under an MIT license, and they're not doing that this time. They said that because this model is, quote, much bigger than Whisper, it's not a good candidate for an open release. So they're not open-sourcing it. This is the pattern they've followed for a while now, making things more and more closed-source and less and less open-source, which a lot of people, Elon Musk among them, have been upset about; there's been a lot of drama. So I do think this is very interesting. Here's another direct quote from them: These aren't the kind of models that you can just run locally on your laptop, like Whisper. We want to make sure that if we're releasing things in open source, we're doing it thoughtfully, and we have a model that's really honed for that specific need. And we think that end-user devices are one of the most interesting cases for open-sourcing models. End quote. AKA: it's too big and powerful, you can't run it on your computer, we're not releasing it open source. Obviously they make more money when they don't release things open source, so there's that element to it. You could say they're trying to save you from running it on hardware that can't handle it, or you could say they're trying to make more money; interpret it however you want. In any case, I'm excited about having access to this regardless. Yes, I'm happy to pay for it; as a developer, that's what I would expect, but I'm really happy to have the ability to access this technology. Very exciting, big update from them, and thanks so much for tuning in. If you enjoyed the episode today, if you learned anything new, I would love a review on the podcast. It would mean the world to me. I really appreciate all of the incredible people who have reviewed AI Chat over the years.
Thanks so much for tuning in, and if you want to join the AI Hustle school community, there is a link in the description. I would love to help you grow and scale your business or your career using AI tools; it's something I'm passionate about, and I've been making a video about it every week for over a year now, so it's been a ton of fun. Thanks again, and I will catch you next time.
