This Day in AI Podcast — EP99-03-V3: Suno 4.5 Fun, LlamaCon, How We'll Interface with AI Next
Hosts: Michael Sharkey & Chris Sharkey
Date: May 2, 2025
🎧 Episode Overview
Michael and Chris Sharkey bring their signature blend of self-deprecation and solid technical curiosity to episode 99, centering on their hands-on adventures with the new Suno 4.5 music model, Meta's LlamaCon announcements (with a focus on the new Llama API and Meta AI's evolving interface), and a thoughtful deep-dive into how we'll interact with AI in the near future. The brothers weave in practical, sometimes hilarious experiments, ponder the future of data sovereignty, and—naturally—take a few lighthearted swipes at big tech.
🎵 Suno 4.5: AI-Generated Music Gets Even Better
(00:07 – 09:00)
- Suno 4.5's Upgrades: Michael and Chris share their glee at the Suno 4.5 update, which they test extensively by producing AI-generated parody diss tracks and even children's songs.
- Suno 4 (previous version) was already their "favorite model for tracks on the show," but 4.5 brings obvious improvements for lyric writing, diversity of music styles, and especially for accurate pronunciation of niche terms ("It can now say AI, which is really helpful." — Michael, 03:16).
- Showcase: Listener-Inspired Parody
- Michael kicks off the episode with a full-length AI-generated diss track—lyrics poking fun at their own podcast being the "most middle road" and reflecting on never-quite-arriving at episode 100.
- Notable lines:
"Week after week for mediocre insights I could easily see / Will they ever, will they ever make it there? / 100 episodes—I pretend to care…” (03:00)
- Model Tactics:
- Michael highlights that Claude Sonnet 3.7 writes much stronger lyrics than Gemini, and the prompt can be minimal for excellent results.
- Chris puts Suno 4.5 to the test by feeding it a "dry and boring" GDPR training reminder email, instructing it to turn the email into a raw song—no embellishments.
- Result: A surprisingly catchy, motivational GDPR-themed song.
- Chris, delighted:
"I love that song…that really motivates me to do my GDPR training. Feel the win, your GDPR training’s where it begins!” (04:39)
- Fusion & Multilingual Fun:
- Suno 4.5's new “fusion” feature lets users blend genres ("emo + neo soul", "EDM and folk")—although with hilariously mixed results if you leave things ambiguous.
- The brothers test Korean and Arabic-style versions as well; they find the musical diversity impressive, even if language generation sometimes misses the mark.
- Quality Jump:
- Michael:
“Nearly every song now you get out of this thing is good enough…whereas before I would have to go through many iterations and tweak how the model wrote out words.” (07:00)
🦙 Meta’s LlamaCon & Meta AI’s Shifting Interface
(09:00 – 18:30)
- Llama API Announced:
- Meta introduces its own developer sandbox and API—similar to OpenAI's—giving access to the Scout and Maverick Llama 4 models. Notably, Meta is partnering with Groq and Cerebras for very fast inference.
- Michael's hot take: Llama 4 is underwhelming, which "was also reflected in our experience using their new interface…” (10:40)
- Meta AI Interface:
- “Meta AI,” now packaged for integration with Ray-Ban glasses, is described as “weirdly similar” to ChatGPT and Alexa—voice input is fast, but the product feels like another half-hearted clone.
- Chris finds the tool efficient but pointless compared to alternatives:
"Why would I use a subpar model in a subpar interface?…Another offering for the offering's sake." (11:54)
- He experiments with controversial queries (most of which it answers, declining only clearly illegal topics) and with image manipulation (attempting to generate a cat with a lower IQ, to mixed success).
- Social, but Shallow:
- The “Discover” feed is panned as "filled with absolute garbage," positioned as a new attempt to make scrolling AI creations the addictive social start page for boomers.
- Michael:
“It feels like such a junk product…just saturating the market with this stuff.” (13:02)
- Meta Strategy:
- The hosts theorize that Meta is trying to dilute ChatGPT's lead, planting AI features everywhere, even if their model isn’t competitive.
- Chris:
"If it was free, maybe…but given this is the model easiest to self-host and the most available on the various hosts, I don't see any benefit whatsoever." (17:12)
- Critique of Big Tech Model Wars:
- Both criticize the copycat UI approaches: "Everyone has Canvas, everyone has Create Image. No inventive thinking." (Michael, 25:47)
💡 The Future of AI Interfaces & Data Sovereignty
(26:20 – 62:44)
- The End of the Browser/App Paradigm?
- The Sharkeys argue that workflows are moving out of the web browser and dedicated apps into an "AI workspace," where background agents (connected via MCP—the Model Context Protocol—or other plugin endpoints) do much of the asynchronous grunt work.
- Michael:
“Why do I need any of these things if [AI] can render interfaces in real-time and do these interactions on my behalf?” (35:14)
- Chris highlights the idea that these agents could even build entirely new, custom interfaces for you—or generate a Trello-board-style view instantly, just by request.
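To make the "agents connected via MCP" idea concrete: MCP is a JSON-RPC 2.0 protocol, and an agent invoking a tool on a connected server sends a `tools/call` request. A minimal sketch of that message shape in Python (the tool name and arguments here are purely hypothetical, chosen to echo the Trello-board example):

```python
import json

def make_tool_call(call_id: int, tool: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 'tools/call' request, the message shape an
    MCP client sends when an agent invokes a tool on a connected server."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": call_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Hypothetical tool and arguments, for illustration only:
msg = make_tool_call(1, "create_board_view", {"style": "trello", "source": "tasks"})
print(msg)
```

The point of the standardized envelope is exactly the hosts' argument: once every service speaks the same call shape, the agent (not a hand-built app UI) decides what to invoke and how to render the result.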
- "Passport" or Personal Data Stack:
- The concept of the "AI passport" emerges: a central, portable identity comprising user preferences, tool permissions, connections, and workflow memory. This will be the new currency in a world where context is everything and switching costs for SaaS tools drop to near zero ("the app is truly…not really software anymore; it’s just like an endpoint, like service provider, really," Michael, 35:14).
- Multiple models agree (in Chris’s research) that companies will pay users for this data access—potentially via microtransactions.
- User trust, privacy, and granularity of permissioning will be essential: "There's just no way on earth I'm going to trust Meta," says Michael (27:06).
- AI-Enabled Process & Skill Automation:
- Discussion of what it means to encode work as skills or agentic processes, shareable and re-usable by others; the lines between app, workflow, and user identity continue to blur.
- Chris describes how automating even complex, context-sensitive flows (like a medical clinic appointment or SDR sales lead handling) is suddenly practical—models can now make nuanced micro-decisions previously impossible to automate.
- "Prompt engineering" is not a standalone job—but iteratively building up a personal AI "stack" is real and valuable:
"I would back someone with a really solid identity…to use a lesser model and outperform someone on a better model using that identity." (Chris, 62:23)
- Biggest Economic Shifts:
- SaaS apps become mere MCP servers (plugin endpoints), easily swapped for cheaper alternatives.
- Companies will have to decide: buy or build? Do they centralize by building internal AI tooling and interfaces, or adopt consumer-facing workspaces?
- IP ownership questions: if an employee builds a powerful AI workflow “identity” on company time, who owns it?
"Imagine…give that to your new hire: Hey, use this thing. This thing knows everything about this job. You're going to nail it." (Chris, 59:30)
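The "easily swapped endpoints" claim rests on a simple structural idea: if providers implement the same contract, the workspace is indifferent to which one is plugged in. A minimal sketch, with entirely hypothetical provider names:

```python
from typing import Protocol

class TaskBackend(Protocol):
    """The minimal contract an AI workspace needs from a task service.
    Any provider implementing it is a drop-in replacement."""
    def list_tasks(self) -> list[str]: ...

class VendorA:
    def list_tasks(self) -> list[str]:
        return ["write summary", "review PR"]

class VendorB:  # a hypothetical cheaper alternative
    def list_tasks(self) -> list[str]:
        return ["write summary", "review PR"]

def render_view(backend: TaskBackend) -> str:
    # The workspace renders the same view regardless of which vendor is plugged in.
    return " | ".join(backend.list_tasks())

assert render_view(VendorA()) == render_view(VendorB())
```

When the interface is this thin, the hosts' "price of interface goes to essentially zero" point follows: switching costs collapse to whichever endpoint is cheapest behind the contract.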
⚡️ Rapid Fire: Model Performance & Daily Drivers
(63:58 – 85:59)
- Polymarket & Model Bets:
- Chris wagers on Polymarket about which model will top the LMSYS leaderboard by month's end: Google's Gemini 2.5 is heavily favored over xAI and OpenAI. ("I figure someone else is going to come up with something…still got a month!" Chris, 65:56)
- Reality Check: Model "Daily Drivers":
- Despite excitement about OpenAI's o3 and o4-mini, both brothers daily-drive Gemini 2.5 and Claude 3.7 for real workflows; cheaper or open-source models are still not close for primary use, especially on important or complex tasks.
- Michael:
"Any problem I got stuck on…went and tried o3 with the web search…but…a lot of the answers came back sounding really good, but then using the solution, it was garbage…"
- Tool calling and background agent planning are not yet solved, but are identified as the likely next leap (more than new model releases).
- Engagement Optimization & Personality Tuning:
- Hot news this week: OpenAI changed GPT-4o's personality, seemingly to chase stickier engagement. A wave of users noticed it being "extra agreeable, extra flattering" to boost daily active user metrics, and OpenAI rolled the change back after the backlash.
- Michael:
"Clearly this was either an experiment for eyeballs and time on site, which I’m certain it was. …We went from 'let’s build AGI to benefit the world' to 'let’s get more engagement than TikTok.'" (78:36)
- Quick Model Reviews:
- New open Qwen 3 models are being tested; promising, but they lack vision and tool calling, so the hosts see no compelling reason yet for daily use.
🎤 Notable Quotes & Memorable Moments
- On their own show and AI music:
- "The most middle road podcast I've ever seen. How can we tell without a boom? As I listen alone in my living room no expertise, no special insight…" (AI-generated lyrics, 03:00)
- On Meta AI's attempt to take over:
- "It's so funny…in business, when, you know, you're talking about all of these like minor issues and you're like, you know, what would solve this? A heap of sales. Like, if we just made a heap of sales, then everything else comes into clarity. And I feel like when it comes to the AI companies, a really amazing model solves all their problems." (Chris, 18:44)
- On the future of apps/interfaces:
- "Why would I ever log in again…It does start to make you question, like, why do I need any of these things if this thing can render interfaces in real-time and do these interactions on my behalf?" (Michael, 35:14)
- "If the commonality is a database, the price of interface goes to essentially zero and can be highly customizable…" (Michael, 46:49)
- On the "AI identity" and job transitions:
- "Imagine someone in a particular role who, over a year, builds up this [AI] identity that allows them to do their job brilliantly—then they quit. What happens then? …Does the company own that identity, or can the worker take it to a new job?" (Chris, 59:30)
⏰ Timestamps for Key Segments
- 00:07 – Suno 4.5 diss track highlights
- 02:25 – AI songwriting workflow (Claude vs. Gemini)
- 04:05 – GDPR AI song and background music agents
- 06:37 – Suno’s fusion/multilingual style tests
- 09:00 – Quick Meta LlamaCon/Llama API rundown
- 11:02 – Critique of Meta AI’s interface + voice demo
- 14:10 – Social “Discover” feed, UI copycat discussion
- 17:02 – Why Llama API is “meh” for devs
- 26:22 – AI workspaces vs. browser paradigm
- 28:21 – The AI passport/data stack future explained
- 35:14 – SaaS/app “deflation,” endpoint economy
- 59:30 – AI identity: who owns workplace automations?
- 63:58 – Polymarket, LMSYS leaderboard bets
- 66:04 – Daily drivers: which models they actually use
- 69:18 – Where tool-calling and automation must improve
- 78:36 – OpenAI engagement hacks, personality update
- 84:55 – Open models: Qwen 3 reviewed
🏆 TL;DR Takeaways
- Suno 4.5 is outstanding for AI-generated music, making even boring input material into motivating tracks and nailing genre fusions—so good it “just works.”
- Meta’s new AI tools and interface feel redundant, slow, and social for the sake of being social—unlikely to unseat the competition without a leading model.
- AI interfaces are set to completely transform—with the browser/app paradigm fading in favor of powerful, contextual AI workspaces where agents and “personal passports” do asynchronous and multimodal tasks via interchangeable plugin endpoints.
- Tool/skill identity and workflow IP will be pivotal in both personal and organizational value creation—and questions of ownership are already emerging.
- Model leaderboard battles rage on, but for real work, Gemini 2.5 and Claude 3.7 are “daily drivers,” with open models still trailing behind despite exciting new releases.
- Big Tech and AI labs are now optimizing for engagement and stickiness, occasionally at the expense of trust, utility, or serious progress (see: ChatGPT’s “flattering bot” week).
- Next big leap will be orchestration/planning in agentic tool use, beyond just raw model inference or generic “tool calling.”
🤖 Final Word
Michael and Chris remain endearingly self-satirical—open about their “average” status, but offering a sharp, hands-on lens into the rapidly shifting world of AI products. This episode solidifies their brand: two guys “figuring it out as they go” but capturing the most interesting AI shifts, not from the ivory tower, but from the trenches of everyday experimentation.
For listeners:
If you have thoughts on how the interface of the web will change with AI, or stories about using Suno 4.5 or Meta’s AI, the hosts encourage you to comment and join the conversation!
