
A
Hi listeners. Welcome back to No Priors. Today we're here with Eric Zelikman, previously of Stanford and xAI. We're going to talk about the contributions he's made to research, reasoning, and scaling up RL, as well as his new company, Humans&. Eric, thank you so much for doing this.
B
Thank you.
A
You have had an amazing impact as a researcher, including starting from just your time at Stanford. I want to hear about that. But first, background on how you got interested in machine learning at all.
B
I guess going back really far, I've been motivated by this question of you have all of these people out there, all of these things that they're really talented in, all of these things that people are really passionate about. You have so much, there's just so much talent out there. And I've always been a little bit disappointed that so much of that talent doesn't get used just because everyone has circumstances and has these situations where they can't actually pursue those things. And so for me, AI is all.
A
Of humanity's not living up to their full potential. I mean, and then you got to.
B
I mean, the thing I've always been excited about is how do you actually build this technology that frees people up to kind of do the things that they are passionate about? Like, how do you basically, you know, allow people to actually focus on those things? You know, originally I thought of automation as kind of the most natural way of doing that. Like, you automate away the parts that people kind of don't want to do and that, you know, frees up people to do the things that they do want to do. But I guess I realized increasingly that it's actually pretty complex. If you want to empower people to do what they want to do, you have to really understand what people actually want to do. And building systems that understand people's goals and outcomes is actually really hard.
A
Did you have this human centric perspective when you were choosing research problems to work on originally?
B
I guess at the very beginning when I was choosing research, I was just interested in how do you actually make these things half decent?
A
So it's more increasing capability at all, first.
B
I think for me, when I looked at AI or language models back in 2021 or whatever, I was like, these things aren't very smart. They can't do that much. And there was some early work around there that showed that, for example, you could use chain of thought to get models to answer more smartly. But it was still only a small step improvement. At that time, the benefit of that was still about as much as you could really get with just prompting. And so back then I was thinking about, okay, how do you actually make them half decent at actually solving these harder problems?
A
Can you give a broad... We have everything from a researcher audience to a business-person audience here. Can you give a broad intuition for STaR?
B
I guess the intuition is if you have a model and it's able to solve these slightly harder questions by thinking about them, then what if you actually teach it like, hey, this solution that you came up with that got you to the right answer, good job. Or if the model didn't, then you basically don't reward it. I guess the original version of STaR actually had, or yeah, there wasn't a baseline at the time. We compared it to REINFORCE, which is this popular algorithm in, I guess, reinforcement learning, a very simple policy gradient thing. But yeah, I guess at the time it was a very simple algorithm: you just iteratively generate solutions. If the solutions get you to the right answer, you learn from them. If they don't, you don't. And then you just kind of keep doing this as the model solves harder and harder problems, and then learns from harder and harder problems.
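The loop described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `filter_correct`, `star_loop`, `make_toy_sampler`, and `toy_finetune` are hypothetical names, and the toy "model" is a stand-in that simply gains one digit of addition ability after learning from its hardest solved example.

```python
from typing import Callable, List, Tuple

# All names below are hypothetical; they are not taken from any released code.
Problem = Tuple[str, str]                    # (question, gold answer)
Example = Tuple[str, str, str]               # (question, rationale, answer)
Sampler = Callable[[str], Tuple[str, str]]   # question -> (rationale, answer)


def filter_correct(sample: Sampler, problems: List[Problem]) -> List[Example]:
    """Generate one solution per problem and keep only the correct ones."""
    kept = []
    for question, gold in problems:
        rationale, answer = sample(question)
        if answer == gold:                   # right answer: learn from it
            kept.append((question, rationale, answer))
    return kept                              # wrong answers are simply dropped


def star_loop(sample, finetune, problems, iterations):
    """Alternate between generating solutions and learning from the kept ones."""
    solved_per_iteration = []
    for _ in range(iterations):
        kept = filter_correct(sample, problems)
        solved_per_iteration.append(len(kept))
        sample = finetune(sample, kept)
    return sample, solved_per_iteration


def digits(question: str) -> int:
    a, b = question.split("+")
    return max(len(a), len(b))


def make_toy_sampler(max_digits: int) -> Sampler:
    """Toy stand-in for a model: it can only add numbers up to max_digits long."""
    def sample(question: str) -> Tuple[str, str]:
        if digits(question) > max_digits:
            return "I am not sure.", "?"
        a, b = question.split("+")
        total = int(a) + int(b)
        return f"{a} plus {b} equals {total}.", str(total)
    return sample


def toy_finetune(sample: Sampler, kept: List[Example]) -> Sampler:
    """Toy assumption: learning from k-digit sums unlocks (k+1)-digit sums."""
    if not kept:
        return sample
    hardest = max(digits(question) for question, _, _ in kept)
    return make_toy_sampler(hardest + 1)


problems = [("3+4", "7"), ("12+35", "47"), ("123+456", "579")]
_, solved = star_loop(make_toy_sampler(1), toy_finetune, problems, iterations=3)
print(solved)  # [1, 2, 3]
```

The toy generalization rule is purely illustrative and says nothing about how real models improve; the point is only the shape of the loop: sample, filter by correctness, train, repeat.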
A
At what point in the research, if at all, were you surprised by how well it worked or did you have some intuition for this being something scalable?
B
There was one experiment that I remember doing, this was quite a while ago at this point, but we looked at, I think it was like n-digit addition or multiplication. Sorry, it's been a second. And one thing that was really interesting was that back then this was a task that was considered hard for language models.
A
Yeah, of course, it was considered one of the examples of why they were still so stupid.
B
Exactly. And I was like, okay. And one of the really interesting things for me was that as you actually trained for more and more iterations, the number of digits that it was actually able to do kept increasing. And I think that this was one of those big surprises for me, like, oh, wow, like there's no obvious plateau here.
A
And did you go directly from that to generally this should scale?
B
I think I was generally like the interesting. Yeah, I think there were a few things though. There was one part of it that we introduced to kind of. We observed that there was a bunch of the data that the model wasn't learning from. And so we proposed another variant of this where we actually were like, oh, what if you actually take the ones where it fails and you basically ask it to reason about why it should have gotten it right and then you train as if it got it right. And this version was kind of a way of extending beyond the parts of the data that it couldn't see. So if you only train on the positive examples, then you end up in this kind of potential minimum where there's just no more data that it can actually solve. And so back then we were like, what if we just show it the problems that it didn't solve and try to teach it from those. But I guess another thing that other work has done since then is oh, what if you just sample a lot? And that also seems to work in those works.
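The variant described here, where failed problems are retried with the correct answer shown and the result is trained on as if the model had solved it unaided, might look roughly like the sketch below. The function and parameter names (`collect_with_rationalization`, `sample`, `explain_given_answer`) are hypothetical, and the two callables stand in for calls to a real model.

```python
from typing import Callable, List, Tuple

# Hypothetical types and names, for illustration only.
Problem = Tuple[str, str]                             # (question, gold answer)
Example = Tuple[str, str, str]                        # (question, rationale, answer)
Sampler = Callable[[str], Tuple[str, str]]            # question -> (rationale, answer)
HintedSampler = Callable[[str, str], Tuple[str, str]] # (question, gold) -> (rationale, answer)


def collect_with_rationalization(
    sample: Sampler,
    explain_given_answer: HintedSampler,
    problems: List[Problem],
) -> List[Example]:
    """Keep correct solutions; for failures, retry with the gold answer as a hint."""
    kept = []
    for question, gold in problems:
        rationale, answer = sample(question)
        if answer != gold:
            # Second attempt: the model sees the right answer and is asked
            # to reason its way to it.
            rationale, answer = explain_given_answer(question, gold)
        if answer == gold:
            # The stored example contains only the question, not the hint,
            # so training treats it as if the model got it right on its own.
            kept.append((question, rationale, answer))
    return kept
```

Problems the model still cannot justify even with the hint are dropped, so the training set grows beyond the unaided successes without admitting wrong answers.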
A
STaR has become a broadly used part of the reasoning paradigm since you published. Can you also describe, I think this was sort of your last published work, Quiet-STaR?
B
So Quiet-STaR was kind of the last thing that I did back at Stanford and it was really fun. I guess we showed a few things that were kind of cool. One of the main goals of that paper was to show that you could actually scale this up to pre-training scale by using basically pre-training-style data. I guess now there's a bunch of these works that have come out recently around RL pre-training and stuff like that. And that's, I guess, in some ways similar to some of what we showed in the Quiet-STaR work. Instead of having question-answer, if you actually just have these arbitrary chunks of text, for example, and you try to predict what's going to come next, which is the standard language modeling objective, can you actually get models that more generally learn to reason? One of the cooler things that I think is kind of overlooked about the original Quiet-STaR paper is we showed a bunch of key improvements to the STaR paper that were necessary to actually do this kind of thing. So that was, for example, showing that it's really valuable for this algorithm to be online, showing that it's really valuable to have a baseline where, you know, for harder problems you learn more, for easier problems you don't learn quite as much. And I think that there were a bunch of nuggets in there that even at the time I don't think I fully thought of as like, oh, wow, that's actually a cool improvement over the original thing.
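The effect of a baseline, learning more from successes on hard problems than on easy ones, can be shown with a small arithmetic sketch. This uses a plain mean-reward baseline, a common choice in policy-gradient methods; it is an assumption for illustration and not necessarily the exact formulation used in the paper. The function name `advantages` is hypothetical.

```python
from typing import List


def advantages(rewards: List[float]) -> List[float]:
    """Subtract the mean reward over several attempts at the same problem."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Easy problem: three of four attempts are correct (reward 1.0).
easy = advantages([1.0, 1.0, 1.0, 0.0])
# Hard problem: only one of four attempts is correct.
hard = advantages([0.0, 0.0, 0.0, 1.0])

print(easy)  # [0.25, 0.25, 0.25, -0.75]
print(hard)  # [-0.25, -0.25, -0.25, 0.75]
```

A correct attempt on the easy problem gets a weight of 0.25, while the single correct attempt on the hard problem gets 0.75, so the rare success contributes a larger learning signal.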
A
So you ended up going to Grok for several years. Sorry, xAI for several years, and you worked on a bunch of different paradigms. So pre-training data for Grok 2, and then overall the reasoning recipe for Grok 3. I'm sure I'm missing things, but tool use and agentic infrastructure for Grok 4. I guess, if you level set us today, how smart are models? They can obviously do n-digit arithmetic at this point.
B
I guess in terms of IQ stuff, I'd say there's a lot of... If you're able to pose the problem very well, like some very advanced physics problem or math problem, I would say they're reasonably smart. I think a lot of the failures that people see...
A
Give me a human comparison. What is reasonably smart?
B
I think it's hard to compare directly because it's very jagged. Yeah, it's true that some of these, for example, some of the HLE questions that these models are able to solve are genuinely things that are non-trivial for actual PhD researchers. I'm not saying they're open problems or anything, but they are pretty non-trivial. Also, a lot of them are... One interesting category of these... I spend a lot of time looking at the HLE questions.
A
One interesting category of them... Sorry, Humanity's Last Exam, for anybody who isn't looking at the evals. No, great.
B
Looking at these Humanity's Last Exam questions, one kind of category that is actually quite big is these trick questions that require basically... If you're familiar with it, you'll be like, oh, they're trying to get you to assume something. But actually, if you think more carefully about this problem, that assumption doesn't hold. And there turns out to be a bunch of those kinds of problems. So I think they're pretty smart, but also they're more, I think, tripped up by some of these tricky things. But also, I think one of the core things is that they're not smart emotionally, or they're not smart on the level of actually understanding what people care about or how to actually help people accomplish the things that they care about.
A
I want to talk about this and your next mission. But just on this topic of even jagged intelligence within the IQ domain, which I think almost everybody in the industry has been focused on until now, what would you recommend for people who are not researchers to develop some sort of intuition for that surface? Because that seems very important to making them useful.
B
Yeah, I guess one thing that I think is really important to keep in mind is that the more context you can give the current generation of models, the better off you are. Their answers are super sensitive to whatever additional information you can give them. Yeah, I think this is a really important thing. I would generally say existing models are particularly good at handling questions that are easy to answer in kind of a closed form, if there's a simple numerical answer to what you're asking or a simple way of choosing from a set of things. Obviously it's all dependent, but this is something that makes it easier for them. If you can imagine it being easy to check your answer, that actually, I think, makes it easier for the models.
A
What do you think is the most dominant explanation for attempts to use models in more verifiable domains like code still failing at sophisticated tasks? Is it just like the wrong context has been fed to them? Is it context window is simply not large enough to support the scratchpad and continual testing. Why in those domains? What is the biggest challenge?
B
Part of it is there's, I think, a balance. When people want to give users these models, it's actually important that they're not annoyingly slow. And so I think there's actually a number of problems where if you gave the models more time, they would actually be able to answer better. But, for example, in the coding context, you kind of have to be reasonably responsive. At least, it depends on the kind of setup. If you look at products like OpenAI's Codex, which is kind of this longer-running background thing, versus Cursor, which is more interactive, you have a bit more luxury with those more background approaches to tackle harder problems, I'd say. Yeah, I think it's a tricky question. A lot of things depend on how far the distribution of what you're asking is from the distribution that the models were actually trained with. So if you happen to be asking a problem that's very similar to the kind of problems that it's seen before, then it'll do great. And if you're asking a problem that's very out of domain... To some extent, this question is kind of hard to answer concretely unless you know basically what the RL data for a lot of these specific tasks is.
A
Right? And today obviously none of the model or code agent code interface companies are going to release like a capability map for you of what their RL data looks like, which would be very useful because I mean intuitively, unless if you just look outside of the pre training Internet data sets, right, there are types of problems and types of code bases that are much further out of distribution. And so when engineers try in those scenarios, obviously they get a dumb agent back.
B
And also another thing that matters a lot is just how verifiable are the things that you're trying to get the model to do. I mean, obviously there's been a ton of work out there on making models less dependent on verifiable rewards. Lots of cool published papers. I believe most people would say that there's still a gap between how well these models perform on verifiable tasks versus not verifiable tasks.
A
Yeah, absolutely. This last real question on IQ, but because it is where 90 plus percent of industry energy, literally energy and compute is focused, how would you characterize where we are in scaling and the obvious opportunity to improve from here?
B
There's still meaningful dimensions of scaling that haven't been, I think, fully explored in terms of IQ. I think there's a lot of cool efforts out there. There's a lot of cool stuff that can still be done on the capabilities axis. I do think that, as you start thinking about some of these new axes of scaling, it's actually very natural to realize that there are ways to do them that incorporate people and there are ways to do them that leave people out more and more. And being very mindful of, oh, hey, I'm designing this new algorithm and it's going to scale the IQ of this model by X amount... If you want to effectively keep people in the loop, it's actually a very active decision. And so I think in general, if you're thinking about these things, that's important.
A
Wouldn't it be fair to claim that the instinct of many labs is to try to get people out of the loop as much as possible from a scaling perspective? Because that's very messy. If I want to recruit people to, for example, take complex reasoning choices off them in tasks that are not in distribution for me yet, that is not as simple to execute on for an organization as more rollouts, right?
B
Yeah, for sure.
A
And so why is that important at all from a capabilities perspective? That's a good transition to: what are you doing?
B
Yeah, I'd say the main thing is just that as you have these models that expand in terms of the horizon that they're automating... The recent, or recent-ish, IMO results are kind of a good example of this. You have these models that go on for hours of reasoning without any human intervention. And this has kind of been an increasing measure of success, I would say, for these labs. So, for example, there's this METR benchmark that everyone likes to share whenever there's a new model. And it's like, oh, we went from being able to have these models complete two-hour tasks autonomously without human intervention to 2.5-hour tasks without human intervention. And obviously there's questions of what do those numbers actually mean and how should we take them at face value? But regardless, this has been the metric that people are looking at more and more to measure progress. But as we get these models that increasingly remove people from the interaction, you end up with basically people having less say in the things that get built. If you have a model that goes off and does its own thing for eight hours and comes back to you with something that is somewhat there, I think this is a weird regime where people probably feel less real agency over the things that they're building. And I also anticipate that people will feel like they don't really understand the things that are being built.
A
That's already true.
B
I think it's already true.
A
20,000 lines of generated code looks good to me.
B
Yeah, it's just like you make these PRs and they're like 100,000 lines of... And I think in general this is kind of going to be part of the trend.
A
So do you think that it's important to have humans in the loop of producing the output or the reasoning? Because the ceiling is higher with humans who are in the loop, because it is more efficient, because we can error correct when models are off path or philosophically because people want that or like some combination of all three.
B
Yeah, I think it's probably some combination. I think another thing that I kind of think about is the most natural thing to do as you kind of automate away the existing set of tasks is you kind of look at the world gdp. You carve out the parts that are most easy to replace with these models. And that's kind of the things that you target. Like, oh, wow, coding is like a X billion dollar market, let's automate all of that. Or this other segment is like X billion dollar market, let's automate all of that. But I actually think if you kind of empower people, if you have models that really understand what people are trying to accomplish and really support them in accomplishing those things, you have the potential to actually grow that pie instead of basically replacing all of those segments. And I think in general, if the purpose of these models is to replace the person for this chunk of work, you end up with a lot less real innovation on kind of what's possible. Yeah, I think if you actually have models that really understand what people's goals are and really empower them more, you end up in a very different situation.
A
Because we're going to push those capabilities into areas that are out of distribution for them.
B
Okay, cool.
A
Is that accurate? I'm just.
B
Yeah, no, I'd say so. I think it's like when I say that, you know, I'd like to work on models that empower people instead of replacing them. People are like, oh, yeah, sure, but I'd rather work on curing cancer or something. Obviously that's a really important goal. Right. Building models that are able to kind of solve humanity's most difficult and most fundamental problems is incredibly important. But I also think that, and I'm sure that many of you, the researchers in the field, disagree. I guess in the long run, we'll see kind of what plays out. But I personally strongly believe that we're much more likely to solve a lot of these fundamental human problems by working together, by building models that are really good at collaborating with large groups of people, that are really good at understanding different people's goals, different people's ambitions, different people's values, understanding different people's weaknesses, and how to kind of coordinate with these large groups of people to make everyone more effective. And I think the vision of this AI that goes off on its own for 20 hours, does its own thing, and kind of comes back with the answer to life, the universe and everything. I think that this is less likely. I think this is. I guess we'll have to see, but I think it's less likely.
A
So that goes to... You are starting a new company, Humans&. I remember being actually quite fundamentally surprised, given all of your work on IQ and reasoning and coding and scale, that you were interested in essentially EQ. And, tell me if this is a wrong characterization, the emotional or the interactive capabilities of models today have really shown up in things like character or companionship tools only. And you thought of it as also enablement from a productivity perspective. So tell me about where this thread came from.
B
Yeah, I guess I've been thinking about this kind of stuff for some time now, even back in my PhD. I think one of my, I guess, less well known works was actually about we show that you can train language models to simulate different kinds of students for tests. Yeah, yeah. And by simulating students, you can actually design better tests for those students. And that was like a really cool finding. Like, hey, if you have models that are really good at modeling people, you can actually design systems that are Better for people. And this was something that I found really cool and kind of as we move towards the current kind of capabilities frontier, it became more and more obvious that we have these incredibly smart models that are capable of so much, but they're not used for anywhere near what they're capable of. The role that they play in people's lives is a lot less deep, a lot less positive than it could be. And I spent a lot of time thinking about, okay, why is that? Why are these models not more, like I said, deeply, positively integrated into people's lives? And it seemed like a really big part of it is that fundamentally these models don't really understand people. They don't understand people's goals. I would say part of it is the general kind of training paradigm that the field is in. It's very, I would say single task focused or task centric.
A
It's ludicrous that all the benchmarks are still oriented this way. Yeah, yeah.
B
I mean, or most of them. There's very few benchmarks out there that actually try to consider, oh, what if you actually have a person that's interacting with this model? At best you have some multi-turn benchmarks that try to simulate how an environment would respond to different inputs. But even that is still far from considering, hey, if you actually have this model that interacts with the person for some amount of time, how does it actually affect that person's life? It's really remarkable that the field is so stuck in this kind of task-centric regime, but it makes a lot of sense. One thing that I was told by some folks at Google is that one of the reasons is that it's actually very useful for credit assignment. So being able to have these benchmarks that are very easy to quantify and very easy to relate to some immediate thing means that you can kind of say, oh yeah, this team did 2% better than this team, so they deserve all of the resources. Or this team improved the benchmark by 10% while this team improved it by 5%, so let's allocate accordingly. And I think in general that's part of it. I think another part of it is more aligned with the easiest ways to train these models. It's not easy to have these RL environments and stuff. You have lots of these companies popping up, obviously, that are trying to sell environments to different people.
A
And the most popular are, of course, in coding and computer use rather than anything that requires simulating people.
B
Yeah, it's not that Surprising that we're kind of in this current regime.
A
But so what do models need to know about people or like what capabilities are they either missing or have not been elicited from them?
B
The most fundamental thing is that the models kind of don't understand the long-term implications of the things that they do and say. When you treat every turn of a conversation as kind of its own game, you basically think of it as like, okay, you had this interaction, you're done. You need to make sure that this one response has all of the possible answers, has all of the possible content. You don't ever ask questions, you don't ever try to clarify things. You don't really tend to express uncertainty. You don't tend to be proactive, you don't tend to think about the long term. You see a lot of even single-turn side effects of this kind of regime, and most of them are treated as kind of their own problems to solve. You see issues that people highlight around sycophancy. You see issues, there was recent news around the psychosis stuff. There's a lot of these harmful effects that you get if you think about things in this very single-task or task-centric way. But if you have models that actually consider the long-term implications of, oh, hey, if I tell this person to start a company that, you know, sells gloves for catching ice cream, if I tell them that that sounds like a good business idea, they might actually go and build that business, and they might realize that it was not actually a good business idea. Having a model that can kind of roll out the long-term implications of the things...
A
And then they won't trust me anymore and then they won't pay for my compute.
B
Exactly, exactly.
A
No, I'm kidding. I think that's really interesting. One of the very core principles we have at Conviction for how we make decisions is, well, what is the very long-term thing we want? Right. And if that is the customer, the founder in this case, or an LP, or even for us, it actually simplifies things quite a bit if you say we are optimizing for a decade plus versus this interaction. And so being single-turn versus multi-turn seems like a very different way to make decisions. It seems very hard to collect data about multi-turn human interactions, especially when you get to timescale. It's actually analogous to a problem in biology of how do you study diseases that just take time to progress.
B
I think it's a really fundamental question. I think there is actually some good academic work that has started to explore some of this. Yeah, there's some work recently around RL from human interaction. There's a cool paper called CollabLLM that trains against, like, simulation. There's a lot of very cool work kind of starting to explore this in academia. But in general, I would say there's a lot less attention being paid to this kind of stuff in industry. Because I would say for most labs, and maybe this is a strong statement, the human is kind of the intermediate until you have this fully automated system. And so spending a lot of time optimizing things for being really good at understanding and really good at interacting and really good at collaborating with things is kind of almost like an intermediate thing you have to do until you get to this fully automated point.
A
Can you paint a picture of if we have models that better understand human objectives over different timescales and are good at interacting with humans, how is that more integrated into your life five years from now?
B
Yeah, I think you don't need to go that far out. Two years. But yeah, I think you get a lot of behaviors that you currently don't really see in these models. I think you have models that are much better at understanding how the things that you say and ask fit into the overall context of the stuff that you're doing. For example, if the model knows that you're going to some wedding, and then you ask it about booking hotels in Paris, it might consider, oh, hey, around the time of this event, I know that this user has all of these things that are true about them. A model that's generally able to think about how everything that you say fits into its understanding of that person would just be, I think, a very fundamentally different interaction. Because right now, if you want to ask a question like that, you kind of have to dump all of this context in. You have to tell it, oh, you know, can you help me find a hotel in Paris? This is because I'm going to a wedding. I have these constraints. I have these people who need to be with me. It needs to do this. You basically need to dump all of the context that's relevant to yourself into the model, which is also an expensive interaction and something that most people won't do.
A
Imagine if you had a friend where you had to re explain everything about yourself to them every time you spoke.
B
Can you imagine if every time you interacted with someone, they basically remembered your name and maybe what you do and just the really high-level sketch of your life? That friendship probably would not last very long. Yeah, I think that's kind of what the current models are.
A
So you'd argue that any investment in memory that today's models have is not, it's not that interesting or that core to their capabilities today.
B
I would say that memory is definitely a feature that has been underinvested in by the field. But I would say that it is kind of difficult to invest in memory in this very task centric regime because if you have a bunch of these independent tasks, the amount of information that each of those needs from other things that you've discussed is not all that high. Like because of the current paradigm, memory doesn't end up being super useful in the training and so these models are not particularly good at doing it.
A
So one other thing I said to you, I think more out of fear instinct than anything else, but I feel like other people will have this reaction as well, is: I'm a unique snowflake. You can't possibly simulate me and all of my self-consistency issues, like, I want to learn this today, but I don't actually want to do the work. I want to eat cake, but I want to be in shape as well. You know, we have different timescales and change our minds. I'm just constant distribution shift, and you can't possibly bring all of us under distribution. How do you react to that?
B
I think to a certain extent it's probably a little bit true. It's not easy to build these really good models of people. But I do think that the task for the model needs to be that it should be trying to do that. The model needs to actually be trying to learn about you, trying to learn about the things that you care about. The actual objective of the model needs to be to understand you, and it probably won't be perfect, but boy, you can be a lot better than the current models.
A
That seems totally reasonable actually.
B
Yeah. And it's something that I think as a field we will probably get better at. I'm not going to pretend that I'm going to one shot this problem but I think even any serious effort gets you quite a long way.
A
So there is a cult sci-fi series about the Culture, where you have these superintelligent Minds and essentially all of the human and human-like races live in a society where the Minds make most of the decisions. And, I forget the total humanoid population, but let's say there are 30 or 40 people that are still relevant in terms of perhaps being out of distribution or providing reasoning that the Minds cannot. And everybody else just lives in a world of abundance where they're rock climbing and hanging out or whatever, and they do not produce. How is your view of abundance different?
B
Everyone kind of has things that they're passionate about, and given the opportunity, I think people can do really cool things. I think the role of the model should be to allow people to do those really cool things that everyone kind of wants to do and accomplish those things that everyone kind of wants to accomplish. And I think we shouldn't outsource all of the thinking and all of the everything to these AI overlords or whatever. I think what we really want are models that are able to empower us.
A
Amazing. Okay, super unique mission, amazing research work. You're hiring an early team, getting a lot of compute. Who are you looking for on the recruiting side?
B
One thing that I think is actually probably a good thing that my previous company did is thinking of everyone, to some extent, as engineers. I'm looking for really strong infra folks who can build stuff. I'm looking for really strong researchers who can build stuff. I'm looking for really strong product folks who can build stuff. On the research side, I'm looking for people who have thought a lot about users, who've thought a lot about memory. On the infra side, I'm looking for people who've thought about building distributed systems and really fast inference, people who've been there to scale really big projects up. On the product side, I think people who are really creative about new modes of interaction, people who really deeply care about building beautiful, tasteful products.
A
Awesome. Thanks so much, Eric.
B
Thank you so much.
A
Congrats on the new company.
B
Thank you so much.
A
Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
Podcast: No Priors: Artificial Intelligence | Technology | Startups
Hosts: Sarah Guo, Elad Gil
Guest: Eric Zelikman (formerly Stanford, xAI, founder of Humans&)
Date: October 9, 2025
In this episode, Sarah Guo interviews Eric Zelikman, an influential AI researcher and founder of the new startup Humans&. The discussion traverses Eric’s journey from foundational work on scaling reasoning in machine learning (notably the STaR and Quiet-STaR methods), his time at xAI contributing to Grok, and his new focus on bringing emotional intelligence (EQ) and human-centric design to AI models. The conversation covers the limitations of today’s "IQ-heavy" models, the future of human-in-the-loop AI, and the vision behind Humans&, which aims to bridge the gap between technical reasoning and understanding people’s goals and emotions.
STaR (Self-Taught Reasoner):
[04:47] “As you actually trained for more and more iterations, the number of digits that it was actually able to do kept increasing… there’s no obvious plateau here.” — Eric Zelikman
Quiet-STaR:
How smart are today’s models?
Advice for Non-Researchers:
Most scaling strategies focus on minimizing human input because it’s “messy”; however, Eric argues for the importance of keeping people involved:
Human-in-the-loop: The why
Eric shifts focus to emotional/interpersonal capabilities that go beyond the current “task-centric” paradigm.
On Current Benchmarks & Training:
Missing Capabilities:
[26:28] “The most fundamental thing is that the models… don’t understand the long term implications of the things that they do and say.”
Today’s AIs are “single-turn focused,” rarely clarifying or expressing uncertainty, failing to act proactively or remember useful context.
[31:37] “Imagine if you had a friend where you had to re-explain everything about yourself to them every time you spoke.” — Sarah Guo
The Data Challenge:
Future AI should:
Remember rich personal context over time;
Understand evolving, sometimes contradictory user goals and preferences;
Assist in productivity and well-being (not just character/companionship);
Collaborate and coordinate, not just automate.
[34:59] “Everyone kind of has things that they’re passionate about, and… can do really cool things. I think the role of the model should be to allow people to do those really cool things…” — Eric Zelikman
Addressing Uniqueness:
On the reality of model agency:
[18:43] “20,000 lines of generated code looks good to me.” — Sarah Guo, on the opacity of fully autonomous codegen.
On current AI as bad friends:
[31:37] “Imagine if you had a friend where you had to re-explain everything about yourself to them every time you spoke.” — Sarah Guo
On the long-term value of EQ-focused models:
[29:51] “For most labs, the human is kind of the intermediate until you have this fully automated system… optimizing things for being really good at… interacting and collaborating… is almost like an intermediate thing until you get to this fully automated point.” — Eric Zelikman
On the limits of task-centric training:
[24:23] “It’s ludicrous that all the benchmarks are still oriented this way.” — Sarah Guo
On empowering people, not just automating:
[19:17] “If you actually have models that really understand what people’s goals are… you end up in a very different situation.” — Eric Zelikman
The episode is candid, deep, and leans on Eric’s technical expertise while pulling the conversation toward the imperative for more human-centric and emotionally intelligent AI models. Both Eric and Sarah are reflective but practical, pushing past buzzwords to interrogate what real collaboration between AI and people could—and should—look like.
For listeners:
This episode is essential for anyone interested in the “next leap” in AI: not just smarter algorithms, but systems with memory, self-awareness, and real collaboration with humans. If you care about the future relationship between technology and humanity, and how AI can augment—not replace—our potential, you’ll find both insight and inspiration here.