A
I'm grateful for marketers like you. Not the ones waiting for their boss to tell them what to learn, but marketers who actively plan for their future. Because you listen to this podcast, you're already ahead. You're seeking to understand AI instead of waiting to see how this AI thing turns out. But here's what I've learned after a decade of running conferences: interest doesn't create results, implementation does. That's why we created AI Business World 2026, where you'll master AI skills that make you indispensable, where you'll get your questions answered by experts, and where you'll connect with over a thousand marketers who are implementing AI right now. Years from now, you'll look back at this moment and remember this is when you got ahead. Head to AIBusinessWorld.live and secure your competitive advantage. Welcome to the AI Explored podcast, helping you put AI to work. And now, here's your host, Michael Stelzner. Hello, hello, hello. Thank you so much for joining me for the AI Explored podcast brought to you by Social Media Examiner. I'm your host, Michael Stelzner, and this is the podcast for marketers, creators, and business owners who want to know how to put AI to work. AI voice is becoming indistinguishable from human voices. There's a big AI voice opportunity for creative businesses right now. In today's episode of the AI Explored podcast, we'll explore AI voice agents. My special guest is an expert who helps businesses deploy AI voice solutions. He's the founder of Arose AI, an AI voice agency. His YouTube channel is exclusively focused on AI voice agents. Tommy Crist, welcome to the show. How are you doing today?
B
I'm good, Michael. Thank you so much for having me.
A
It's my pleasure to have you. Let me start with this question. How did you get into AI?
B
So I was 18 years old going into college, and I was always entrepreneurial. I'd always had some hunch that AI would be big, even before ChatGPT, but I was never technical or anything. Especially back then, it was machine learning, all this stuff that was just way over my head. And then, you know, I get into college, and in January 2024 I'm a semester in. My previous company had just puttered around; it was just okay. And I was getting really bored. In my dorm room, I started just watching YouTube videos and taught myself a bit about coding, a bit about AI. And then, you know, in spring of 2024 I really stumbled into voice AI. It had sort of become really viable for businesses. There were startups coming out of San Francisco that were making it really easy to build voice agents without using code, which was right up my alley. I started to learn a lot about that, really immersed myself in it, and started creating content around it. And then, you know, the business really followed from there as people started to reach out because they saw my content and wanted me to build it for their businesses.
A
So you are currently a college student still, right? And you're growing a business on the side, so you're, you're still pretty early in this game. I'm just curious, from your perspective, how big a deal do you feel like AI is right now? And when you started going to college, did you have any idea that this is what you're going to be doing?
B
I had no idea. But AI is huge. I mean, every single kid I know in college is using it, whether it's for their classes or, you know, a lot are looking for internships and jobs, and they're using it there. A lot of kids are scared AI will take their jobs. And so for sure, being in college, being around people who sort of, you know, grew up with technology, it's massive. It's everywhere. It's a part of everybody's lives. It's a bit different when I sort of get out into the wild with my business. I am working with folks who have run businesses, legacy businesses and stuff, where they're not as familiar with it. But, you know, AI is just everywhere, and kids are getting really comfortable with it.
A
So why did you decide to, just out of curiosity, go and start filming videos on YouTube?
B
I had had YouTube channels, like entertainment ones when I was younger, then gaming, stuff like that, when I was 11 or 12, just sharing my interests. And YouTube was really how I learned how to build these. I saw that all the creators I was watching, you know, most of them had a business where they would also sell it. And so I thought, you know, I already love YouTube. I'm on it. This is exactly how I learned. I'm a customer of their product, of the content. I felt I'd be pretty good at being able to build it and explain it. And I really wasn't at first, but you never are when you're starting. Then I just got better and better, and it ended up being a really good way to show people that you actually knew what you were doing with this new technology, in a way that I don't think I would have been able to communicate through something like, you know, cold email or any other channel to distribute my business.
A
Folks, Tommy is for sure the youngest guest that I've had on this particular show, but he's extremely savvy, as you're going to learn today. I found you through a YouTube video. I had received a newsletter in my inbox about somebody who came out with some new voice technology. I did a search, your videos came up, I reached out to you, and here we are. So folks, you know, folks that are a little bit older that have kids Tommy's age, this is the thing. This is an incredible opportunity, and Tommy is going to be exceptionally successful at what he does because he chose to go ahead and do what he did. So, Tommy, you've got a series of clients that you've worked with, and you've attracted a lot of attention with the work that you've done. What's one of the biggest misconceptions you are seeing out there when it comes to AI voice and AI voice agents?
B
So the biggest one I see is that it can be a plug-and-play deal where you sign up for some $50 a month service and you get an AI voice agent that works perfectly for your business. I can tell you firsthand, after building these for all this time, some agents, if they're super complicated, can take 80 to 100 hours to build. You know, some can be on the lower end, 20 or 40. But you have to put those hours in. So whether that's you working with another company or you putting it in yourself, that sort of time fee will have to be paid. Or else you're going to pay for it with a product that doesn't work. You can't just sign up for a piece of software and expect to put in an hour or two of work and reap all the benefits of it.
A
We're going to define a little bit about what AI voice agents are in just a minute, but let's start with when they're done well. When businesses employ AI voice inside of their business and they do it the way you recommend doing it, what are the benefits? What's the upside for those businesses?
B
A really obvious one is just 24/7 call handling. You never have to worry about someone working a 9 to 5 and calls coming in after hours. For some businesses, such as home services or emergency services, HVAC, et cetera, that is extremely important, especially for companies where, you know, people are just googling it. You show up in a Google search, and if you don't respond, they're going to the next number they see. And so that 24/7 call handling is really important. And you're only really paying for the time the AI is actually on the phone, so you pay literally for the value you get. Whereas, you know, if you have a receptionist or a human doing it, you might be paying for that entire eight-hour block when they're only on the phone for three or four hours. The second one I notice is reliable answers and functions such as booking meetings. Humans make mistakes. AI isn't perfect, but generally, if you build it a certain way and it responds that certain way nine out of ten times, that rate will continue forever, no matter how many calls it takes. You're really able to tune exactly what you want it to say and when you want it to say that. Third is unlimited scalability. Whether you take one call a day or a thousand calls a day, the agent will behave the same. There's no extra effort that you really have to put in to take that many extra calls. And again, your costs will just scale up equivalently. They're all variable costs. With that unlimited scalability to grow with your business, you never have to worry about bringing in an extra receptionist or outsourcing to a call center once you hit X number of calls. So there's sort of a peace of mind there. And then lastly, it's just really clear cost savings and ROI. If you track any metrics for phone calls in your business, it's pretty easy to then calculate how much those phone calls cost. So if you have a receptionist taking 100 calls a day, figure out the math of what those calls cost and then compare that to 8 to 12 cents per minute for an AI to do it. You can almost always see a clear ROI there. Generally, with my clients, we try to price our services based on the value we give them, and we almost always try to give them an 8 to 10x return on their investment within the first year. So, you know, if they're saving $60,000, we'll charge them $6,000 for that agent. I think 8 to 10x is completely reasonable for these types of systems when thinking about savings or ROI.
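As a rough illustration of the math Tommy describes here, the sketch below works through the per-minute cost and the value-based price. The 100 calls a day, the 8 to 12 cents per minute, and the $60,000 at 10x example come from the conversation; the 3-minute average call length and the function names are assumptions made up for this example, not figures from any platform.

```python
def annual_ai_call_cost(calls_per_day, avg_minutes_per_call, cost_per_minute, days_per_year=365):
    """Yearly spend when you pay only for the minutes the agent is on the phone."""
    return calls_per_day * avg_minutes_per_call * cost_per_minute * days_per_year


def value_based_price(annual_savings, target_return_multiple):
    """Price an agent so the client's savings are a target multiple of the price."""
    return annual_savings / target_return_multiple


# 100 calls a day and 8 to 12 cents per minute come from the conversation;
# the 3-minute average call length is an assumption for illustration only.
low = annual_ai_call_cost(100, 3, 0.08)
high = annual_ai_call_cost(100, 3, 0.12)
print(f"AI cost per year: ${low:,.0f} to ${high:,.0f}")

# The pricing example from the conversation: $60,000 saved at a 10x return.
price = value_based_price(60_000, 10)
print(f"Agent price: ${price:,.0f}")
```

To compare against a human receptionist, you would plug in your own call volume and average call length, then set the yearly AI cost next to what you currently pay for the same hours.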
A
I mean, first of all, wow, right? Let's define what the heck a voice agent is and let's give some examples of what we can actually do. You've already kind of hinted at a couple, but how do you define what a voice agent actually is? Because you know, the word agentic AI is all over the place, right? And the Word agent is all over the place. But when someone says voice agent to you, what does that really mean?
B
More technically, a voice agent is three different components, really three different AIs working in unison. I like to think of it like the ears, the brain, and the mouth. You first have the speech-to-text, which is the ears, where it transcribes whatever the human on the other line is saying and turns that into text. Then you have the brain, which is the more common one, the LLMs. We use GPTs in our own agency, so we'll use GPT 5.2 now. So that's the brain, and that's text-to-text. It takes what the caller said, runs it through whatever instructions you've given the agent, and then outputs what you want it to respond with. And then the last is the mouth, which is text-to-speech. These are also more common companies you might have heard of, such as ElevenLabs and Cartesia. That puts a voice to the text you outputted and speaks it back to the human. And it does all of that within about a second or so. Then off of that, there are different things you can do with that information, such as update your CRM, log the call in a Google Sheet, or maybe send someone an email, a bunch of different business functions and different software you can integrate off of the actual content of the call. So there's sort of that ears, brain, and mouth, and then the actions you can take.
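The ears, brain, and mouth description can be sketched as a tiny pipeline. This is a minimal sketch under stated assumptions, not how any of the platforms mentioned actually work: `fake_ears`, `fake_brain`, and `fake_mouth` are hypothetical stand-ins for real speech-to-text, LLM, and text-to-speech services, the business hours are invented, and the "action" is just an in-memory call log.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class VoiceAgent:
    ears: Callable[[bytes], str]       # speech-to-text: caller audio -> text
    brain: Callable[[str, str], str]   # text-to-text: (instructions, caller text) -> reply
    mouth: Callable[[str], bytes]      # text-to-speech: reply text -> audio
    instructions: str
    call_log: List[Dict[str, str]] = field(default_factory=list)

    def handle_turn(self, caller_audio: bytes) -> bytes:
        heard = self.ears(caller_audio)
        reply = self.brain(self.instructions, heard)
        # The "actions" step: here we only log the turn. A real build might
        # update a CRM, write to a spreadsheet, or send an email instead.
        self.call_log.append({"caller": heard, "agent": reply})
        return self.mouth(reply)


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_ears(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_brain(instructions: str, caller_text: str) -> str:
    if "hours" in caller_text.lower():
        return instructions
    return "Could you tell me a bit more about what you need?"


def fake_mouth(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(
    ears=fake_ears,
    brain=fake_brain,
    mouth=fake_mouth,
    instructions="We are open 9 to 5, Monday through Friday.",  # invented hours
)
audio_out = agent.handle_turn(b"What are your hours?")
print(audio_out.decode("utf-8"))
```

Keeping the three parts as swappable functions mirrors the point made in the conversation: the transcription, the LLM, and the voice are separate choices that a builder wires together.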
A
Okay, I've got a couple of questions. First of all, Gemini is a multimodal model, right, where it can see, it can hear, it can speak. Do you think that we're eventually going to get to the point where we're not just transcribing, but we're actually listening? Because, you know, if I say hello one way versus hello another way, you know what I mean? It means something different, but that won't come through in the transcript. Do you think we're getting to the point soon where it's actually going to listen? I'm curious what your thoughts are.
B
100%. For context on my journey through voice AI, these things sort of worked but really weren't great when I first started a year and a half ago. They made huge leaps, I would say, from spring 2024 to spring 2025. But in the last six months or so, and we've even seen this with LLMs, there have been diminishing returns and it's sort of plateaued. The next big jump for voice AI will 1000% be once it's fully multimodal, where you don't have a separate brain, ears, and mouth, and it can notice a lot more of the nuance of conversation. That's really where a lot of these agents currently fall short in performing like a human. I think we could get there. We obviously already have multimodal agents. The only problem is the latency and the cost. So you could build a voice agent with Gemini, and even OpenAI has their own now, but it costs a lot. It's about 50 cents per minute, I think, to use that, which is five times more, and the latency and the performance really aren't there. And so as that starts to get optimized, you know, I think that'll just be a massive jump.
A
Love it. Okay, so we talked about how voice agents have three components. They have ears that listen and transcribe, they have the brain which interprets it, and then they have the mouth which speaks back. I love that concept. And of course you can integrate it on the back end. We're going to get into a lot of this stuff. Let's talk about some actual examples of where people could put this to work. I know you've got a whole bunch that you and I agreed on, so let's go through some of those.
B
Yeah. So there are fundamentally two things you can do with phone calls: inbound and outbound. Either a call is coming to you or you're going out and making a call. The main inbound use case is any sort of receptionist, customer support, etc. Those all end up being really similar across businesses, and they're able to handle FAQs. You upload a knowledge base with any extra information you want the agent to know. And then you just sort of look at your business and figure out, okay, I get eight different types of calls. There's one to book meetings, there's one for FAQs, there's one to ask what our hours are. And you build in the responses to those. It's not a deterministic system, but you figure out what your common calls are, and it's able to answer all of those and then take action, like booking a call. I think the more interesting use cases, not more valuable, just more interesting because people think about them less, are a lot of the outbound ones. One really common one is follow-up. Say you are either an e-commerce platform or a shipping company, and there are a lot of porch pirates around the holiday season. You actually want to call people right before their package is picked up or delivered to make sure it's not sitting out there forever. That would not be viable to hire a call center or a team of humans to do. The ROI just isn't there. But if it only takes 10 cents to make that call, maybe it's worth doing. Another one is reactivation campaigns. I built one for a car wash. Basically, this car wash had a huge list of old dead leads, people that had churned or previous customers. And, you know, they wanted to give them a summer sale for their car washes. Whatever the deal was, I think it was from $37 a month to $19 a month. Again, to have a human call thousands of leads might take a month or two. The sale might be over by the time a human gets through all those numbers. But an AI can make those calls, hundreds or thousands a day, and pitch them on this new reactivation campaign they're running. So I think there are a lot of use cases, both replacing existing business functions and creating new ones that wouldn't otherwise be viable because they cost too much for a human to do, but that are still valuable to the customer and to the business.
A
Well, and let's be very clear, this outbound thing is not just a broadcast of a recording. This is an interactive agent. Is that correct?
B
Yes, that is correct. It can handle objections, everything.
A
Yeah, so that's really, really important, the fact that it's calling you and it's saying, hey, your package that you ordered has arrived on your porch a day early. Just wanted to let you know. Or it's about to show up, or it has shown up. Let me know if you got it, or call me back if you didn't get it, and I'll be happy to take care of you. And then presumably there's a number they can call back on. They can say, I didn't get it. Then they can begin a process of trying to, I don't know, put a request in with the delivery company or something like that. Right?
B
Yeah, exactly.
A
Very cool. Okay, how good do these things sound? Because I'm sure this is on the minds of a lot of people. Does this sound like us? Or does it literally sound like a computer talking?
B
So that is obviously a thing you have to work on and get a lot better at, and that comes down a lot to prompting. But with a lot of these voice agents, a lot of people won't know. I listen to a lot of calls. I've automated hundreds of thousands of calls, and I've listened to a lot. You sort of see it break down where older people might not recognize it as much, whereas younger people who recognize how LLMs like ChatGPT or Gemini talk might recognize it a bit faster. But from a pure text-to-speech standpoint of how the voice sounds, it's pretty indistinguishable. And this is interesting, because most people don't think about it: most people actually recognize it's AI from what it says, not how it sounds. Like when you're talking to ChatGPT and it's almost overly personable and trying to really appease you. The AI can be like that a lot of the time, unless you prompt it correctly to remove that. And then people will notice that.
A
There's a moment happening in companies right now. Leadership is discussing 2026 priorities. AI implementation is at the top of the list. And someone asks, who on our team can actually lead this? That's when it becomes clear: some marketers are prepared, most are not. The ones who invest in real AI training, the hands-on, implementation-focused kind, suddenly become invaluable. Here's what happens when you attend AI Business World 2026. You walk in uncertain, you walk out confident. You arrive overwhelmed by AI tools. You leave knowing exactly which ones to use and how. You come feeling replaceable. You return indispensable. You multiply your output. You end guesswork on which AI tools actually work. You create content faster while keeping your authentic voice. You automate hours of daily work, and you teach others inside the company how to do it. When your company asks who can lead our AI implementation, you'll raise your hand with complete confidence. Head to AIBusinessWorld.live and grab your tickets today so you can become your company's AI expert. So let's talk about this. We're going to get into some of the weeds here a little bit, but let's say everybody who's listening is like, okay, I'm interested. I want to explore how to create my own voice agent. Are there any things we need to consider before we start?
B
Yeah, of course. I think the number one thing businesses have to think about is, is this a use case worth actually solving, or does this just sound and look cool to tell my friends I implemented? I find a lot of the flashy use cases, such as automating a hundred percent of your calls with a receptionist, or outbound sales and all this stuff, sound great and then just sort of fall flat. It's really a big project to take on at the beginning. So making sure you have a clear use case that's rooted in a real bottleneck in your business, to scaling or profitability, and that it's something that can actually be solved by voice AI and not some other tool, is the number one thing you need to think about. Outside of that is obviously legal issues. Trump just made an executive order on AI. There's a lot of stuff moving. So I'm not a lawyer, and this is an extremely up-in-the-air area right now with AI. The one thing I can say about voice AI is that back when it first released in 2024, I think it was February 2024, so before a lot of these platforms even came out, the FCC, I believe, made a ruling specifically on outbound calls. And I actually have it pulled up, so I can read that. "The TCPA is the primary law the FCC uses to help limit junk calls. It restricts the making of telemarketing calls and the use of automatic telephone dialing systems and artificial or prerecorded voice messages. Under FCC rules, it also requires telemarketers to obtain prior express written consent from consumers before robocalling them." Basically, this ruling bundled AI-generated voices under robocalls, and so you have to make sure you follow that. I will say, about that ruling specifically, the only things it specifically mentioned were deepfakes or impersonating family members to get money, extremely nefarious things.
A
What's the name of that law if people want to do a little research on it.
B
So this is on www.fcc.gov, and the headline is "FCC Makes AI-Generated Voices in Robocalls Illegal." One thing I will mention is, again, this is a gray area. A lot of these things are changing. There have been a lot of laws made, both statewide and nationally, and some have been enforced and some haven't. This law has been in effect for almost two years, and I haven't actually seen anyone being charged under it.
A
Well, and here's how I would interpret this. If you want to start safe, do inbound calls. Right. To begin with, or do transactional calls. Right? Kind of like, hey, we're just letting you know that your package was delivered, that kind of thing. But you could just as easily do that over text, I would imagine. So probably start with inbound. Talk to me about the ethical side of this too, because there is the question of disclosure. You know, what's your thoughts on that?
B
Yeah, I've seen with my own clients that a lot of business owners are split. Some want to explicitly say in the opening line, hey, this is Melinda, the virtual receptionist for XYZ company. And some want to just make it as human as possible and not really have people ever know it's an AI. I've honestly seen the same results for both with my clients. I really don't think it matters a whole lot, so go with whatever you're more comfortable with. I know there were certain states, specifically out west, that did have laws where you had to explicitly mention, A, whether a phone call is recorded (all the platforms we build these on record the phone calls with the AI), and B, whether it was an AI. Now, again, with the executive order and everything, I'm not sure where those stand legally. But I can tell you from a practical perspective, I haven't seen a huge difference in customer tickets actually being solved, or hang-up rates, or anything, between the agents that explicitly mention they're AI and those that don't. And I think, just from a consumer perspective, you interact completely differently with something you know is an AI and something you don't. If you know something's an AI, you might actually have a better experience because you know how to interact with it. You know to maybe give it a bit more time or explain things a bit more clearly than if it was a human. But, you know, I've seen both sides and I haven't seen huge differences.
A
So something along the lines of, you've reached company X. My name is Tommy. I am the virtual customer support agent. And this interaction will be recorded for quality control purposes. Something along those lines, how can I be of assistance? Right. Some sort of script like that. Interesting. Okay, anything else before we get on to like the next step of the process? Because so far what we've talked about is you gotta have a legit use case. You need to understand if there's any laws. And then you also have to wrestle with whether you're going to disclose or not disclose that this is an AI. If not, what's next? I mean, like, what's the next part of the process if there's nothing else you want to add there?
B
No, nothing else I need to add. You really hit the nail on the head. After you figure all that out and you feel like you have a solid use case rooted in an actual ROI, then I would go into the discovery process. A lot of people just want to jump in and build it, and, you know, you can make a demo real quick and show your team so that they know what's going on. But before I dedicate serious time to this, I would actually map it out and think of it as if I were hiring an employee to do exactly what I want this voice agent to do. What would I tell them? What resources would I give them? What are the SOPs I'd tell them to follow on a daily basis? You don't strictly need to do that, since you can always edit these things, so it's not like you have to get it all right at the beginning. But laying it all out can save you a lot of time. Just walk through all that and then start building from there. You also always want to think from the beginning, and you should even do this at the use case stage, about the important metrics you actually want to track to see if this thing is even successful, because it costs you a lot more if you deploy it unsuccessfully and keep it running than if you were to never use it in the first place.
A
Real quick on the standard operating procedures: some businesses don't even have an actual real person that you talk to. I would imagine if you did have a real person, this would be a lot easier because you've already got a manual or a list of instructions, and maybe the AI is just coming in after hours, right? But are you working with businesses where they don't even have a standard operating procedure? And if so, what do they need to be thinking about in cases like that?
B
For example, we just worked with a company where we automated their customer support. Previously it was just a form submission, a "we'll get back to you within 48 hours" type of thing, with the name, a description of their issue, and then an email or phone number to reach back out to them. So for that, from my perspective, I would just ask them for their data. And if you're doing this for your own business, you just look at your data. You look at, you know, the past month, where you have X amount of support tickets, and you bucket these into what all of them are. A lot of them were just asking when their product would actually be at their doorstep. So that's one whole flow of the AI. You just think about, if someone were to actually ask that question on the phone, how would you want it to respond? And potentially, what software might you need to integrate with to look up their order number, et cetera? A lot of them are just FAQs. What are your hours? Does it have this ingredient in it, or whatever that might be. You bucket those into FAQs, questions about the product, questions about the service. So anywhere you can find data on what people are asking you and contacting you about, you can deduce from there, if that was asked to an actual human or an actual voice agent, what route you'd want to take them down in that flow, how you'd want to respond, and then what action to potentially take at the end of that.
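The bucketing step described above can be tried on a spreadsheet export of past tickets before any agent is built. The sketch below is a minimal keyword-based version; the bucket names, keywords, and sample tickets are all invented for illustration, and a real pass over your own data would likely need better matching than simple substrings.

```python
from collections import Counter

# Hypothetical buckets and keywords; replace with what your own tickets show.
BUCKETS = {
    "order_status": ("where is my order", "delivery", "shipping", "arrive", "tracking"),
    "hours": ("hours", "open", "close"),
    "product_question": ("ingredient", "contain", "allergen"),
}


def bucket_ticket(text: str) -> str:
    """Return the first bucket whose keywords appear in the ticket text."""
    lowered = text.lower()
    for bucket, keywords in BUCKETS.items():
        if any(keyword in lowered for keyword in keywords):
            return bucket
    return "other"


def summarize(tickets):
    """Count how many tickets land in each bucket."""
    return Counter(bucket_ticket(ticket) for ticket in tickets)


# Invented sample tickets standing in for a month of support data.
sample_tickets = [
    "When will my order arrive?",
    "What are your hours on Saturday?",
    "Does it contain peanuts?",
    "I want to change my billing email",
    "Is there tracking for my delivery?",
]
counts = summarize(sample_tickets)
for bucket, count in counts.most_common():
    print(bucket, count)
```

The largest buckets are the flows worth building first, and anything that keeps landing in "other" is a candidate for a handoff to a human.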
A
And I would imagine you're going to have to train the AI to ask one question at a time, otherwise it'll ask like six questions and they won't even know where to go. Is that accurate to assume?
B
Yes, for sure. There's a lot of them where you build them at first, and they will ramble. I make mine only really say one or two sentences at a time. Because they can. Yeah, they can go on for a while.
A
As people are listening to this, there's no reason this has to be done over the phone. I would imagine this could be on a website as well, could it not?
B
Yeah. And actually, a lot of the tools I build these agents with have an equivalent, you know, text chat feature where it's the same thing, just over, you know, chat.
A
But if they wanted to speak to it, couldn't they activate the voice mode on a computer as well? Potentially.
B
Yep, exactly. There's really two main ways people interact with these. One is through a website widget like that, which is, you know, I mean, almost every company has like the chat widget. So it's the same as that, just a little microphone, you click and then through a phone number where, you know, they have some sort of support phone number. When you call that, it's the agent on the other end.
A
Okay, let's talk about the tools, because there's a bazillion different tools, and I'd love you to kind of give us a flyover of the various different tools that people need to consider when they're going to go ahead and put something like this together.
B
Yeah, of course. I will real quickly mention that the tool is really important, and you have to think not just about the current functionality but about the actual team, because a lot of these are startups. The very first AI voice tool was called Air AI. They're being sued right now. It was like a scam. And so you don't want to put all your equity in a major business function into something that isn't going to last. All the tools I'm recommending now, I actually know some of the team, and they're great. The three main ones for building no-code solutions are, first, Retell AI. So that's R-E-T-E-L-L A-I.
A
R-E-T-E-L-L dot AI, is that right? Okay, Retell. Okay, got it.
B
I think their website is actually retellai.com. And then Vapi AI, which is vapi.ai. And then ElevenLabs has their own agent builder. All three of these platforms are really easy. In five minutes you can get a demo up and going. There are some features that might seem a bit overwhelming if you're not a developer or anything like that, but they're really entry-level tools. We use Retell AI for almost every project. That's the one we've found the most reliable. And, you know, a lot of these are fully custom. You have full control over the voice you use, the LLM, the prompt, and the features you want to actually integrate with. And it's as easy as logging in. They're all free to create an account. There's no monthly subscription. You'll get like 10 free minutes or whatever, and you can start building and testing right after you create an account.
A
Okay, so it is R-E-T-E-L-L-A-I dot com. You've got that correct. It sounds like retail, like a retail store, but it's Retell.
B
Yes.
A
So what is it about Retell in particular that makes it your favorite?
B
Like I mentioned, when I first started was literally when Vapi and Retell both launched, within a couple of weeks of each other. And they're sort of in this race among the AI voice agent builders. Originally I thought Vapi was the superior product. But Retell has a really small team, so that allows them to move really fast. And they focused a lot on the actual consumer product experience and on making it really easy to build voice agents, easy to understand. They also have incredible uptime, because you can imagine you're running calls 24/7 for your business. It's really important that the infrastructure you're building on doesn't go down, and I don't think Retell has gone down once in their existence. I think it's maybe once. It's like 99.99% uptime. And as well, you know, the cost is very transparent. They'll tell you straight up what your cost is depending on the mouth, brain, and ears that you use.
A
Okay, so with Retell and Vapi and ElevenLabs, there are other tools that you have to choose to make all this come together, is that correct?
B
Yes, there are. So those are purely the voice agent, which is the thing you speak to and that speaks back to you. If you want to integrate, for example, with your Google Calendar, say someone's calling to book an appointment with you, you can either do that through custom code, which is how it's been done since coding existed, or there are a lot of great automation platforms out there that help you do this. I think n8n, which is just n, the number 8, and then an n, and which is a very popular one, is the best. There's a ton of resources for it on YouTube, which I think is a really underrated part of choosing these tools when you're first getting into them: making sure there are great places to learn from. So n8n is great. They have free templates, and if you're a bit more tech savvy, it's actually open source, so you don't have to pay them a monthly subscription. You can just host it locally as well.
A
Okay, and what about on the voices side of it? Like, I know ElevenLabs actually makes voices, but it sounds like they also make a solution. So what's the leading source of the voice talent, for lack of better words?
B
To take a step back, just to make sure it's clear: platforms like Retell and Vapi are the voice infrastructure that brings together what I was talking about, the ears, mouth, and brain, in a way where you don't have to code anything. So they don't actually create the voices, the LLM, or the transcription. They just add extra functionality and make it super easy for you to put it all together. In terms of actual voices, ElevenLabs has long been the leader in text-to-speech. However, there are other companies coming out of the woodwork that are catching up. One is Cartesia. I think their newest voice model is really powerful specifically for voice AI. Just like you look at GPT and there are like a million GPTs, the text-to-speech providers will usually have one main model that sounds really good, and then they'll have a fast version that's meant for voice AI, like real-time speech. Cartesia's really been pushing that real-time speech forward. And their newest voice, Cartesia Sonic 3, I believe, is really powerful. It's actually quicker than ElevenLabs. It's cheaper than ElevenLabs. I think the sound is pretty equivalent. And then as it's been out longer, the reliability has gone up too.
A
So the quickness of it, which is latency, right, is what we're really talking about. Does it have to do with the amount of time it takes to listen, interpret, and respond? Or is it more, really... I don't know, help me understand what you mean by quick.
B
So for the latency of an agent generally, I think human response time in conversation is somewhere around 800 milliseconds to a second. And so you want your AI agent to be somewhere in there. With the three layers of AI voice agents, there's latency at each one. So the speech-to-text is generally pretty quick, 100, 200 milliseconds. The LLM can vary a lot. So if you use a bigger model, like GPT 5.2 versus 5.2 nano, that can be a 300 or 400 millisecond difference there. And then the voice, for the most part, is about equivalent to that, where it can be 300, 400 milliseconds there. And so say you gain an extra 100 milliseconds on the voice by using a quicker model: you're then able to use a more powerful LLM for your agent and have it still respond as quickly as an equivalent agent with a slower text-to-speech.
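The trade-off described here can be sketched as a simple latency budget. This is a minimal illustration in Python, not anything taken from Retell, Vapi, or a provider's API: the stage names and millisecond figures are assumptions picked from the ranges mentioned in the answer, and the 1,000 ms budget is the upper end of the conversational range cited.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One layer of the voice agent and its assumed latency."""
    name: str
    latency_ms: int


def total_latency_ms(stages: list[Stage]) -> int:
    """Add up the latency of every stage in the pipeline."""
    return sum(stage.latency_ms for stage in stages)


def fits_budget(stages: list[Stage], budget_ms: int = 1000) -> bool:
    """Return True when the whole pipeline responds within the budget."""
    return total_latency_ms(stages) <= budget_ms


# Hypothetical configurations; the numbers are illustrative, not measured.
smaller_llm_slower_voice = [
    Stage("speech_to_text", 150),
    Stage("smaller_llm", 300),
    Stage("standard_text_to_speech", 400),
]
larger_llm_faster_voice = [
    Stage("speech_to_text", 150),
    Stage("larger_llm", 400),
    Stage("faster_text_to_speech", 300),
]

for label, stages in [
    ("smaller LLM + slower voice", smaller_llm_slower_voice),
    ("larger LLM + faster voice", larger_llm_faster_voice),
]:
    print(label, total_latency_ms(stages), "ms", fits_budget(stages))
```

With these assumed numbers both configurations land on the same total, which is the point being made: the 100 ms saved on the voice is spent on a more capable LLM.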
A
Sometimes I've called these services and I hear this little click, click, click, click, click in the background, which is kind of like trying to fill the dead space. Do AI agents have the equivalent of that also? Just some sort of weird sound, almost like they're typing on a keyboard, but they're not. Do you understand what I'm saying?
B
Yeah. This is an example of one of the features platforms like Retell and Vapi offer, something they just call background sound. You can make it office sounds, you know, a whole plethora of stuff, like you're out in public. And they both, I'm pretty sure, allow you to upload your own background sounds that will be there both when the agent is talking and when it isn't. It's pretty subtle, to sort of fill that space, because it can be really awkward, especially when there's a hard cutoff. And that can be a tell that it's AI.
A
Okay, so we've talked about ElevenLabs and Cartesia, and you recommend Cartesia Sonic 2 or 3 as the leading voice solution, and they integrate with these different platforms, Vapi and Retell AI. What about on the transcription side of it, and what about on the LLM side of it? I'm just curious what your preferred ones are today.
B
Yeah, the transcription pretty much has a clear winner for voice AI, and that's Deepgram. I'm not even sure Retell allows you to switch the transcription, but I would recommend Deepgram. A lot of these platforms also offer specialized transcription for different industries, such as medical. There are just certain words in medical that it wouldn't normally transcribe, but now that it recognizes them, it will transcribe them properly on the transcript.
A
Then can you give it a library of certain words that you also use?
B
Yeah. So for example, it will hear my name and transcribe it as Christ when it's actually pronounced like Christmas, with a Y, but it'll transcribe it with an I. And so I could throw that in as sort of a keyword to recognize, and it'll transcribe it correctly.
A
Got it. Okay, cool. So we've talked about transcription. You recommend Deepgram. It's the one that is supposedly very fast and very accurate. Is that correct?
B
Yes.
A
And then what about the LLMs? You mentioned that there are lots of options out there. I know, for example, some LLMs are very fast. If you're using, for example, Gemini Flash, I would imagine it's faster than Gemini Pro, and I don't know if they have the equivalent kind of thing on ChatGPT. But where do you recommend people start with a decent LLM?
B
So for a long time in our agency, even after GPT 5 came out, we kept using GPT 4.1 nano or just straight 4.1. And that's because a lot of times when a brand new LLM comes out, it'll have a lot of traffic, which can kill the latency, even if that's not what the latency will be in a month or two. And so it's not always just about the smartest model. You really have to think about the latency and the performance, because you are competing with all these other apps, all these other people that are trying to call that API and use that LLM. And so I've heard 4.1 is really consistent. 5.1 is getting a lot more consistent now that 5.2 came out. I haven't tested 5.2 at scale yet. We primarily just use GPT. But I know Gemini, particularly Gemini 2.5 Flash, has been really consistent for other people that I know in the voice AI space as well.
A
I would imagine you get a chance to experiment with this when you're setting up these tools, right? You can pick your LLM and then it'll give you the metrics, it'll tell you how fast the response is. So you can kind of run variations and stuff like that.
B
Yeah, exactly. So when you're building it, it's as simple as going to the voice or the LLM, hitting a dropdown, and choosing which one you want. And when you're choosing one, it'll have the little latency in milliseconds there. And then it'll also have the cost per minute. And then for your agent, once you exit back into the normal agent screen, it'll tell you the total cost and the total expected latency for your agent. And so you can really play around with that.
A
Okay, any tips on how to make these better? Because obviously, you do have to deal with hallucination sometimes. And like you already mentioned, sometimes they can gab for a long time. So, any kind of tips?
B
The biggest one, for context: in our agency, when we work with someone, it's usually about two weeks to deploy it, and then another six weeks of literally just listening to calls and making adjustments. And so just listening to calls, identifying where it tripped up, and then finding what in the prompt most likely caused it to hallucinate or trip up there, whether you have to remove information or give it more information, is really where you're going to reap most of the benefits and improvements. And a lot of times, you know, it could be just one little thing. Every call, one little thing with the opening message, an extra comma that makes it pause too long, will throw off the people on the other end talking to the voice agent, and just deleting that one thing can make a huge difference in how the AI agent performs. And so really, don't underestimate the small adjustments. And don't think that just because it's not performing how you want, you have to rebuild everything.
A
What about the prompts? Like, any tips on how to make better prompts?
B
Yeah, I would structure it in a really simple and clear way, which means you want to clearly define the role of the agent and make it look simple to you. You don't want it to just be a block of text. So define the role of the agent. Let it know what it has access to: let it know at a high level what's in the knowledge base, and if it has any functions such as calendar booking, let it know about that. Let it know the context in which it's being called. So if you have an outbound agent, you have to tell it it's calling someone and, you know, they might not be expecting it. And then give it very clear instructions: in this case, do this. And then my sort of secret sauce that I like to include in every agent is two or three examples at the very end, which is literally like: you say this, "Hi, this is Tommy, the AI receptionist for XYZ company, how can I help you?" And then the user says that, and then you say this, then the user says that. Give it the perfect example of what you want it to do in two or three different common situations.
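The structure laid out in this answer (role, access, call context, instructions, then examples at the very end) can be sketched as a small prompt builder. This is only one way to lay it out, under stated assumptions: the function name `build_agent_prompt`, the section titles, the `book_appointment` function name, and all of the sample text are hypothetical and are not a format required by any voice platform.

```python
def build_agent_prompt(role, access, call_context, instructions, examples):
    """Assemble a voice-agent prompt as labeled sections instead of one block of text."""
    sections = [
        ("ROLE", [role]),
        ("WHAT YOU HAVE ACCESS TO", access),
        ("CALL CONTEXT", [call_context]),
        ("INSTRUCTIONS", instructions),
        ("EXAMPLES", examples),  # kept last, as described in the answer
    ]
    parts = []
    for title, lines in sections:
        body = "\n".join(f"- {line}" for line in lines)
        parts.append(f"{title}\n{body}")
    return "\n\n".join(parts)


# Hypothetical content for an inbound receptionist agent.
prompt = build_agent_prompt(
    role="You are Tommy, the AI receptionist for XYZ company.",
    access=[
        "Knowledge base: opening hours, services, pricing overview.",
        "Function book_appointment: books a slot on the calendar.",
    ],
    call_context="Inbound call. The caller dialed the main office line.",
    instructions=[
        "If the caller wants an appointment, ask for a preferred time, then call book_appointment.",
        "If you do not know the answer, say so and offer to take a message.",
    ],
    examples=[
        "You: Hi, this is Tommy, the AI receptionist for XYZ company. How can I help you? "
        "| User: I'd like to book an appointment. "
        "| You: Sure, what day and time works best for you?",
    ],
)
print(prompt)
```

For an outbound agent, the same builder would take a call context stating that the agent is the one calling and that the person may not be expecting it.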
A
Yeah, for example, like repeat back to the customer: am I hearing you correctly? You know, maybe before you actually take the action, right? So, what I'm hearing you say is this is your problem. Did I get that correct? And if they say yes, great. If they say no, then they add more information. What if all of a sudden they start speaking in a different language? Let's say somebody calls in and they're talking Spanish. Are most of these things smart enough to know how to speak Spanish and different languages?
B
Yeah, so they can. You just have to make sure that the brain, the ears, and the mouth have a Spanish mode. So for example, Cartesia's new Sonic is natively multilingual, and so it can switch languages. And, you know, the LLMs almost all can switch. But with the normal ElevenLabs Flash, they have an English-only version and they have a multilingual version, and the multilingual has more latency, and so that one wouldn't be able to switch to Spanish right away. And then the other thing is you want to make sure you choose an actual voice that can speak in that corresponding tongue, so it's not speaking Spanish in a purely American accent.
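The requirement that the ears, brain, mouth, and chosen voice all support the caller's language amounts to a set intersection. A minimal sketch, assuming made-up capability sets: the language codes and which component supports what are illustrative only and do not describe any real provider's models.

```python
def shared_languages(*component_languages: set[str]) -> set[str]:
    """Languages supported by every component passed in."""
    if not component_languages:
        return set()
    return set.intersection(*component_languages)


def can_switch_to(language: str, *component_languages: set[str]) -> bool:
    """True only if every component can handle the requested language."""
    return language in shared_languages(*component_languages)


# Hypothetical capabilities for each layer of the agent.
ears = {"en", "es"}                      # speech-to-text
brain = {"en", "es", "fr"}               # LLM
mouth_english_only = {"en"}              # text-to-speech, English-only model
mouth_multilingual = {"en", "es", "fr"}  # text-to-speech, multilingual model
chosen_voice = {"en", "es"}              # the specific voice picked for the agent

print(can_switch_to("es", ears, brain, mouth_english_only, chosen_voice))
print(can_switch_to("es", ears, brain, mouth_multilingual, chosen_voice))
```

A single English-only component is enough to block the switch, which is why the choice of text-to-speech model and voice matters as much as the LLM.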
A
What about things like integration and storing recordings, all that kind of stuff? Any tips on that?
B
Yeah, so I recommend storing your recordings in multiple places, at least two places. So for example, Retell, Vapi, they all store the call logs in their platform, but I would also recommend sending those out via webhook to a Google Sheet afterward as well, just to make sure you have it all stored somewhere. A lot of companies prefer a Google Sheet anyway instead of hopping inside that platform every time. And then for functions, there are like three core types of functions. I don't want to get too specific here, but there are ones you can run before the call, such as: okay, I have this person's phone number, let me look up if they're in our database so I can greet them by name.
A
Oh, that's cool. I like that.
B
Yeah, it is pretty cool. There's another type, which is in-call functions, such as: okay, they said they want this time, let me make sure we're available at that time on the calendar, and then book it in and confirm that it's booked. Then there are post-call ones, such as uploading all the information to a Google Sheet. I would recommend, if you can at all, putting as many functions post-call as possible. Pre-call is fine too, but it just adds complexity when you have a ton inside of the prompt and it's supposed to take all those actions while on the call. If someone hangs up early, that stuff's not done. And so try to be creative and think about things, such as updating the CRM, that don't need to be done on the call and can be done after the call. So say you identify they're a good lead. You don't need to act on that immediately while they're on the call, and pause the conversation, and add this complexity. You can log your call into a Google Sheet after the call has ended, and after it's logged into the Google Sheet, then you can update your CRM with that information automatically.
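The pre-call and post-call steps described across these two answers can be sketched in a few lines. This is a rough illustration under stated assumptions, not a platform integration: the contact list, the payload fields (`call_id`, `caller_number`, `summary`, `good_lead`), and the function names are hypothetical, an in-memory CSV stands in for the Google Sheet, and a plain list stands in for the CRM.

```python
import csv
import io

# Hypothetical contact list keyed by phone number (stand-in for a real database).
CONTACTS = {"+15550100": "Dana"}


def pre_call_greeting(caller_number: str, contacts: dict[str, str]) -> str:
    """Pre-call step: look the caller up and greet them by name if known."""
    name = contacts.get(caller_number)
    if name:
        return f"Hi {name}, thanks for calling. How can I help you?"
    return "Hi, thanks for calling. How can I help you?"


def post_call_log(call: dict, sheet_writer, crm_queue: list[dict]) -> None:
    """Post-call step: log the call first, then queue the CRM update."""
    is_good_lead = bool(call.get("good_lead"))
    sheet_writer.writerow([
        call["call_id"],
        call["caller_number"],
        call.get("summary", ""),
        "yes" if is_good_lead else "no",
    ])
    if is_good_lead:
        crm_queue.append({"phone": call["caller_number"], "status": "qualified"})


# An in-memory CSV stands in for the Google Sheet in this sketch.
sheet = io.StringIO()
writer = csv.writer(sheet)
crm_queue: list[dict] = []

greeting = pre_call_greeting("+15550100", CONTACTS)
finished_call = {
    "call_id": "call-001",
    "caller_number": "+15550100",
    "summary": "Wants a quote for a kitchen remodel",
    "good_lead": True,
}
post_call_log(finished_call, writer, crm_queue)

print(greeting)
print(sheet.getvalue().strip())
print(crm_queue)
```

Because the logging and the CRM update run after the call has ended, nothing is lost or left half-done in the conversation if the caller hangs up early; only the calendar check and booking would need to stay in-call.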
A
Wow, Tommy, this has been absolutely spectacular and I know we've just scratched the surface of what people need to understand when they're going to deploy AI voice agents. If people want to connect with you either on the socials or directly and possibly work with your agency, where do you want to send them?
B
You can check out my website, Arose AI, or Arose AI forward slash booking if you would like to book a call with me and work with me. As well, if you look up my name, just Tommy Crist, all my socials are under there. I'm mostly active on LinkedIn and YouTube.
A
Tommy, thank you so much for coming on the show today and sharing your wisdom with us.
B
Thank you for having me, Michael.
A
Hey, if you missed anything, we took all the notes for you over at socialmediaexaminer.com a89 Be sure to follow this show on your favorite podcasting app. And if you've been a listener for a while, we would love a review. And also do let your friends know about this show. You can tag me on Facebook, LinkedIn, and/or X. And do check out my other show, the Social Media Marketing Podcast. This brings us to the end of the AI Explored Podcast. I'm your host, Michael Stelzner. I'll be back with you next week. I hope you make the best out of your day and may AI help you become more successful. The AI Explored Podcast is a production of Social Media Examiner. What if you could get year round AI training? That's exactly what's waiting for you with our AI Business Society. To learn more, visit socialmediaexaminer.com AI.
Podcast: AI Explored
Host: Michael Stelzner (Social Media Examiner)
Guest: Tommy Crist (Founder, Arose AI)
Date: January 20, 2026
This episode dives deep into AI voice agents—what they are, how they work, and how marketers and business owners can practically deploy them. Michael Stelzner interviews Tommy Crist, founder of Arose AI, a rising expert in AI voice solutions, who shares actionable insights, detailed examples, and best practices for implementing AI voice agents in real-world business scenarios.
"I started just watching YouTube videos, taught myself a bit about coding, bit about AI. And then... really stumbled into voice. AI became really viable for businesses." — Tommy [02:00]
"Some agents, if they're super complicated, can take 80 to 100 hours to build... You can't just sign up for a software and expect to put in an hour or two of work and reap all the benefits of it." — Tommy [05:44]
"A Voice agent is three different components and really three different AIs working in unison... ears (speech to text), brain (LLM/text to text), and mouth (text to speech)." — Tommy [09:37]
Notable Quote:
“You’re only really paying for the time the AI is actually on the phone... unlimited scalability... clear cost savings and ROI." — Tommy [06:46–09:15]
Inbound:
Outbound:
"It only takes 10 cents to make that call. Maybe it's worth doing. ... An AI can make those calls, hundreds, thousands a day, pitch them on this new reactivation campaign they're running." — Tommy [13:07–15:17]
"It's not just a broadcast of a recording. This is an interactive agent.” — Michael [15:42]
“Handle objections, everything.” — Tommy [15:42]
"Most people actually recognize it's AI from what it says, not how it sounds." — Tommy [16:28]
“I will say that ruling specifically... only things that specifically mentioned were deep fakes or impersonating family members to get money, like extremely nefarious things." — Tommy [21:20]
"Everywhere you can find data on what people are asking you and contacting you about, you can find... what routes... you’d want to take them down in that flow...” — Tommy [25:55]
Key No-Code Platforms:
Supporting Tools for Integration:
How They Work:
"They focused a lot on... making it really easy to build voice agents, easy to understand. They also have an incredible uptime... 99.99%" — Tommy [30:25]
"I make mine only really say one or two sentences at a time. Because they can... go on for a while." — Tommy [27:33]
"Listening to calls, finding what in the prompt caused it to hallucinate or trip up there... really, don’t underestimate the small adjustments." — Tommy [39:04]
On the Future of Voice AI:
"The next big jump for voice AI will 1000% be once it’s fully multimodal... and you can notice a lot more of, you know, the nuance of conversation." — Tommy [11:30]
On Usefulness:
“Unlimited scalability... Whether you take one call a day or a thousand calls a day, the agent will behave the same.” — Tommy [06:46]
On Getting Started:
“In five minutes you can get a demo up and going... as easy as logging in. They're all free to create an account. There’s no monthly subscription.” — Tommy [29:16]
This episode equips marketers, creators, and business owners with a practical roadmap to deploying AI voice agents—focusing on strategy, compliance, tool selection, and iteration—to drive real ROI, not just hype.
For full show notes, visit socialmediaexaminer.com/aipod.