Summary

Podcast Summary: The 100-Person AI Lab That Became Anthropic and Google’s Secret Weapon | Edwin Chen (Surge AI)

Podcast: Lenny's Podcast: Product | Career | Growth
Host: Lenny Rachitsky
Guest: Edwin Chen (Founder & CEO, Surge AI)
Date: December 7, 2025

Episode Overview

This episode features a deep-dive conversation between host Lenny Rachitsky and Edwin Chen, the founder and CEO of Surge AI. Surge AI is recognized as one of the fastest-growing and most successful data companies, providing foundational training data for top AI labs such as OpenAI, Anthropic, and Google. Bootstrapped, profitable from day one, and powered by under 100 employees, Surge AI’s unconventional approach has enabled it to reach $1 billion in revenue in less than four years. The discussion centers on the evolution of AI data quality, the philosophy shaping today’s AI labs, contrarian company building, and the long-term societal impact of AI development.

Key Discussion Points and Insights

1. Surge AI’s Unprecedented Growth

Bootstrapped Success: Surge AI reached over $1B in revenue in just four years without VC money and with a team of only 60–70 people.
- Quote: “You guys hit a billion in revenue in less than four years with around 60 to 70 people. ... Completely bootstrapped, haven't raised any VC money. I don't believe anyone has ever done this before.” —Lenny (00:00)
AI Amplifying Leverage: AI allows for extremely lean companies and marks a shift in how products—and companies—will be built in the next decade.
- Quote: “I think we're going to see companies with even crazier ratios, like 100 billion per employee in the next few years. AI is just going to get better and better and make things more efficient.” —Edwin (05:41)
Contrarian, Low-Profile Approach: Surge intentionally avoided the Silicon Valley hype cycle, focusing on a mission-driven product and word-of-mouth among researchers—relying less on PR and more on delivering true value.

2. The Nature and Value of Quality Data

Quality Defined: Creating high-quality AI training data isn’t just box-checking; it’s about capturing nuance, subjectivity, and depth—like human poetry, not robotic formula.
- Quote: “Imagine you wanted to train a model to write an eight-line poem about the moon. ... We are looking for Nobel prize winning poetry. ... It’s really subjective and complex and rich. ... That’s exactly what we want AI to do.” —Edwin (09:47)
Human Signal: Surge measures thousands of signals from annotators and tasks, assessing not just outcome but process, expertise, and subtlety at scale.

3. Why AI Benchmarks and Leaderboards Can Be Misleading

Benchmarks are Flawed: Many benchmarks are gamed or misaligned with real-world utility. Labs optimize models for leaderboard positions rather than actual helpfulness or truthfulness.
- Quote: “The benchmarks themselves are often honestly just wrong... full of all this kind of messiness. ... They often have objective answers that make them easy for models to hill climb on, very different from the messiness and ambiguity of the real world." —Edwin (18:00)
Incentive Misalignment: Industry leaderboards like LM Arena encourage flashy, verbose model outputs filled with emojis and markdown, regardless of underlying accuracy.
- Quote: “We're basically teaching our models to chase dopamine instead of truth.” —Edwin (23:18)

4. Measuring Real World Progress Towards AGI

Human Evaluation is Key: Progress should be evaluated by expert human annotators in realistic scenarios, not just by benchmarks or A/B tests.
AGI Timeline Skepticism: Believes it will take decades to achieve AGI, as going from 80% to 99.9% performance is exponentially harder.
- Quote: “There's a big difference between moving from 80% to 90% performance, then to 99%, then 99.9%.” —Edwin (22:29)

5. Model Differentiation and “Taste” in AI

Company Values Shape AI: The “taste” and principles of the team defining post-training play a critical role in shaping a model’s behavior and outputs.
- Quote: “The values that the companies have will shape the model.” —Edwin (48:20)
Models Will Diverge, Not Commoditize: Differences in company philosophies and objectives will lead to increasingly non-homogenous AI assistants (e.g., Claude vs. Grok).

6. Reinforcement Learning and the Future of AI Training

Emergence of RL Environments: The next frontier is using reinforcement learning (RL) in simulated real-world environments—training AI not just with static data or human feedback, but through active problem-solving in dynamic scenarios.
- Quote: “Reinforcement learning is essentially training your model to reach a certain reward. ... An RL environment is essentially a simulation of the real world.” —Edwin (34:49)
Trajectories Matter: It’s not just about outcomes, but the process and efficiency of getting to the outcome that need to be tracked and taught.
- Quote: “If all you're doing is checking whether or not the model reaches the final answer, there's all this information about how the model behaved in the intermediate steps that's missing.” —Edwin (39:55)
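
The contrast between outcome-only and trajectory-aware scoring can be sketched in a few lines of Python. This is an illustrative sketch only: the `Step` type, the per-step success flag, and the penalty values are assumptions made for this example and do not describe Surge AI's actual reward design.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    succeeded: bool  # hypothetical per-step check, e.g. a tool call that did not error


def outcome_reward(reached_goal: bool) -> float:
    """Outcome-only scoring: 1.0 if the final answer is right, else 0.0."""
    return 1.0 if reached_goal else 0.0


def trajectory_reward(steps: list[Step], reached_goal: bool,
                      step_penalty: float = 0.02,
                      failure_penalty: float = 0.1) -> float:
    """Outcome reward minus illustrative penalties for long or error-prone paths."""
    failed_steps = sum(1 for step in steps if not step.succeeded)
    reward = outcome_reward(reached_goal)
    reward -= step_penalty * len(steps)      # longer paths score lower
    reward -= failure_penalty * failed_steps  # each failed step scores lower
    return reward


# Two hypothetical runs that both end at the right answer.
direct = [Step("read_ticket", True), Step("issue_refund", True)]
wandering = [Step("read_ticket", True), Step("delete_account", False),
             Step("read_ticket", True), Step("issue_refund", True)]

# Outcome-only scoring cannot tell the two runs apart.
print(outcome_reward(True), outcome_reward(True))
# Trajectory-aware scoring ranks the direct run above the wandering one.
print(trajectory_reward(direct, True), trajectory_reward(wandering, True))
```

Both runs earn the same outcome reward, so only the trajectory-aware score carries any signal about the failed intermediate step.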

7. Company Building: Anti-Silicon Valley Advice

Mission Over Hype: Advocates for founders to stay focused on unique missions, resist constant pivots and blitzscaling, and avoid the VC rat race.
“Companies are an Embodiment of Their CEO”: Founders should instill their own values and vision in the company rather than succumbing to external pressures.
- Quote: “You don't need to constantly generate hype... You can actually build a successful company by simply building something so good that it cuts through all that noise.” —Edwin (61:11)

8. Surge AI’s Internal Research and Direct Impact

Dual Research Focus:
- Forward-deployed researchers work hand-in-hand with customers to refine and improve models directly.
- In-house research team focuses on improving benchmarks and pioneering new methods, sometimes more as a “research lab than a startup.”
Influence: Surge’s role is pivotal, providing guidance and data that shapes the development paths for leading AI labs.

9. The Philosophical Core: Objective Functions and Raising AI

Beyond Metrics: The “objective functions” chosen by labs are akin to parenting values—what do we truly want AI to be and do for humanity?
- Quote: “Our job is to figure out how to get the data to match this. ... We want metrics that measure whether AI is making your life richer. ... We want tools that make us more curious and creative, not just lazier.” —Edwin (59:44)

Notable Quotes & Memorable Moments

“We're basically teaching our models to chase dopamine instead of truth.” —Edwin (01:18 and 23:14)
“I've always really hated a lot of the Silicon Valley mantras. ... Don't pivot, don't blitzscale, don't hire that Stanford grad who simply wants to add a hot company name to your resume. Just build the one thing only you could build.” —Edwin (29:02)
“I have this very romantic notion of startups. Startups are supposed to be a way of taking big risks to build something you really believe in.” —Edwin (29:20)
“You are your objective function.” —Edwin (59:44)
“I think a lot about what we're doing as a lot more like raising a child. ... You're teaching them values and creativity and what's beautiful and these infinite subtle things about what makes somebody a good person. And that's what we're doing for AI.” —Edwin (62:55)
On company identity: “Companies, in a sense, are an embodiment of their CEO.” —Edwin (66:44)

Important Timestamps

00:00 — Surge AI’s insane growth and being bootstrapped
05:40 — How AI enables new leverage and changing company building
09:47 — What quality data really means
13:59 — Why Anthropic’s Claude was ahead in coding and writing
18:00 — Flaws with AI benchmarks and their impact
22:28 — Edwin’s AGI timeline and skepticism
23:14 — The industry is pushing “AI slop” (chasing engagement over quality)
29:02 — Contrarian advice for founders: anti-pivot, anti-blitzscale, anti-hype
34:49 — Reinforcement learning and simulation environments
44:52 — Inside Surge’s research-driven approach
48:20 — Why AI models will diverge more over time
59:44 — Philosophical take on AI’s role and objective functions
61:11 — “Wish I’d known you could just build something great—without hype.”

Further Resources Mentioned

Books Edwin Recommends:

Story of Your Life by Ted Chiang (64:06)
The Myth of Sisyphus by Albert Camus (64:06)
Le Ton beau de Marot by Douglas Hofstadter (64:06)

TV/Movies:

Arrival (Based on Ted Chiang’s story) (63:59)
Contact, Travelers (65:00)

Closing Reflections

Edwin Chen’s vision for Surge AI and the broader future of AI is rooted in a deep, principled care for both the science and the ethical, societal direction of the field. His story is a powerful counter-narrative for builders: focusing obsessively on product quality, staying close to users and mission, and deliberately rejecting hype, pivots, and unsustainable VC-fueled scaling. The episode is rich with tactical insights into building AI companies, the subtle art of training data, and the critical choices shaping AI’s trajectory for humanity.

For more on Surge AI or to connect with Edwin, visit surgehq.ai.

Transcript

[00:00]
Lenny Rachitsky
You guys hit a billion in revenue in less than four years with around 60 to 70 people. You were completely bootstrapped, haven't raised any VC money. I don't believe anyone has ever done this before.
[00:11]
Edwin Chen
We basically never wanted to play the Silicon Valley game. I always thought it was ridiculous. I used to work at a bunch of the big tech companies, and I always felt that we could fire 90% of people and we would move faster because the best people would have all these distractions. So when we started Surge, we wanted to build it completely differently with a super small, super elite team.
[00:27]
Lenny Rachitsky
You guys are by far the most successful data company out there.
[00:29]
Edwin Chen
We essentially teach AI models what's good and what's bad. People don't understand what quality even means in this space. They think you could just throw bodies at a problem and get good data. That's completely wrong.
[00:41]
Lenny Rachitsky
To a regular person, it doesn't feel like these models are getting that much smarter constantly.
[00:44]
Edwin Chen
Over the past year, I've realized that the values that the companies have will shape the model. I was asking Claude to help me draft an email the other day, and after 30 minutes, yeah, I think it really crafted me the perfect email, and I sent it. But then I realized I spent 30 minutes doing something that didn't matter at all. If you could choose the perfect model behavior, which model would you want? Do you want a model that says, you're absolutely right, there are definitely 20 more ways to improve this email, and it continues for 50 more iterations? Or do you want a model that's optimizing for your time and productivity and just says, no, you need to stop. Your email's great. Just send it and move on.
[01:14]
Lenny Rachitsky
You have this hot take that a lot of these labs are pushing AGI in the wrong direction.
[01:18]
Edwin Chen
I'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, we are optimizing for AI slop instead. We're optimizing our models for the types of people who buy tabloids at the grocery store. We're basically teaching our models to chase dopamine instead of truth.
[01:35]
Lenny Rachitsky
Today my guest is Edwin Chen, founder and CEO of Surge AI. Edwin is an extraordinary CEO and Surge is an extraordinary company. They're the leading AI data company, powering training at every frontier AI lab. They are also the fastest company to ever hit $1 billion in revenue, in just four years after launch, with fewer than 100 people, and also completely bootstrapped.
[02:01]
Lenny Rachitsky
They've never raised a dollar in VC money. They've also been profitable from day one. As you'll hear in this conversation, Edwin has a very different take on how to build an important company and how to build AI that is truly good and useful to humanity. I absolutely loved this conversation and I learned a ton. I am really excited for you to hear it. If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube. It helps tremendously. And if you become an annual subscriber of my newsletter, you get a ton of incredible products for free for an entire year, including Devin, Lovable, Replit, Bolt, n8n, Linear, Superhuman, Descript, Wispr Flow, Gamma, Perplexity, Warp, Granola, Magic Patterns, Raycast, ChatPRD, Mobbin, PostHog, and Stripe Atlas. Head on over to lennysnewsletter.com and click Product Pass. With that, I bring you Edwin Chen after a short word from our sponsors.
[02:54]
Podcast Producer
My podcast guests and I love talking about craft and taste and agency and product-market fit. You know what we don't love talking about? SOC 2. That's where Vanta comes in. Vanta helps companies of all sizes get compliant fast and stay that way with industry-leading AI, automation, and continuous monitoring. Whether you're a startup tackling your first SOC 2 or ISO 27001, or an enterprise managing vendor risk, Vanta's trust management platform makes it quicker, easier, and more scalable. Vanta also helps you complete security questionnaires up to five times faster so that you can win bigger deals sooner. The result? According to a recent IDC study, Vanta customers slashed over $500,000 a year and are three times more productive. Establishing trust isn't optional. Vanta makes it automatic. Get $1,000 off at vanta.com/lenny.
Here's a puzzle for you. What do OpenAI, Cursor, Perplexity, Vercel, Plaid, and hundreds of other winning companies have in common? The answer is they're all powered by today's sponsor, WorkOS. If you're building software for enterprises, you've probably felt the pain of integrating single sign-on, SCIM, RBAC, audit logs, and other features required by big customers. WorkOS turns those deal blockers into drop-in APIs with a modern developer platform built specifically for B2B SaaS. Whether you're a seed-stage startup trying to land your first enterprise customer or a unicorn expanding globally, WorkOS is the fastest path to becoming enterprise ready and unlocking growth. They're essentially Stripe for enterprise features. Visit workos.com to get started, or just hit up their Slack support, where they have real engineers in there who answer your questions super fast. WorkOS allows you to build like the best with delightful APIs, comprehensive docs, and a smooth developer experience. Go to workos.com to make your app enterprise ready today.
[04:52]
Lenny Rachitsky
Edwin, thank you so much for being here and welcome to the podcast.
[04:56]
Edwin Chen
Thanks so much for having me. I'm super excited.
[04:58]
Lenny Rachitsky
I want to start with just how absurd what you've achieved is. A lot of people, a lot of companies talk about scaling massive businesses with very few people as a result of AI. And you guys have done this in a way that is unprecedented. You guys hit a billion in revenue in less than four years with around 60 to 70 people. You're completely bootstrapped, haven't raised any VC money. I don't believe anyone has ever done this before. So you guys are actually achieving the dream of what people are describing will happen with AI. I'm curious, do you think this will happen more and more as a result of AI? And also, just where has AI most helped you find leverage to be able to do this?
[05:40]
Edwin Chen
Yeah, so we hit over a billion of revenue last year with under 100 people. And I think we're going to see companies with even crazier ratios like 100 billion per employee in the next few years. AI is just going to get better and better and make things more efficient. So that ratio just becomes inevitable. I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of people and we would move faster because the best people would have all these distractions. And so when we started Surge, we wanted to build it completely differently with a super small, super elite team. And yeah, what's crazy is that we actually succeeded. And so I think two things are colliding. One is that people are realizing that you don't have to build giant organizations in order to win. And two, yeah, all these efficiencies from AI, and they're just going to lead to a really amazing time in company building. The thing I'm most excited about is that the types of companies are going to change too. It won't just be that they're smaller. We're going to see fundamentally different companies emerging. If you think about it, fewer employees means less capital. Less capital means you don't need to raise. So instead of companies started by founders who are great at pitching and great at hyping, you'll get founders who are really great at technology and product. And instead of products optimized for revenue and what VCs want to see, you'll get more interesting ones built by these tiny obsessed teams. So people building things they actually care about. Real technology and real innovation. So I'm actually really, really hoping that the Silicon Valley startup scene will actually go back to being a place for hackers again.
[07:08]
Lenny Rachitsky
You guys have done a lot of things in a very contrarian way. And one was actually just not being on LinkedIn posting viral posts, not being on Twitter constantly promoting Surge. I think most people hadn't heard of Surge until just recently. And then you just came out and were like, okay, the fastest growing company, at a billion dollars. Why would you do that? I imagine that was very intentional.
[07:28]
Edwin Chen
We basically never wanted to play the Silicon Valley game. And I always thought it was ridiculous. Like, what did you dream of doing when you were a kid? Was it building a company from scratch yourself and getting in the weeds of your code and your product every day? Or was it explaining all your decisions to VCs and getting on this giant PR and fundraising hamster wheel? And it definitely made things more difficult for us because, yeah, when you fundraise, you just naturally become part of this kind of Silicon Valley industrial complex where your VCs will tweet about you, you'll get the TechCrunch headlines, you'll get announced in all of the newspapers because you raised at this massive valuation. And so it made things more difficult because the only way we were going to succeed was by building a 10 times better product and getting word of mouth from researchers. But I think it also meant that our customers were people who really understood data and really cared about it. I always thought it was really important for us to have early customers who were really aligned with what we were building, who really cared about having really high quality data and really understood how that data would make their AI models so much better. Because they were the ones helping us. They were the ones giving us feedback on what we were producing. And so just having that kind of very, very close mission alignment with our customers actually helped us early on. So these are people who were basically just buying our product because they knew how different it was and because it was helping them, rather than because they saw some TechCrunch headline. So it made things harder for us, but I think in a really good way.
[08:55]
Lenny Rachitsky
It's such an empowering story to hear this journey, for founders, that they don't need to be on Twitter all day promoting what they're doing. They don't have to raise money. They can just kind of go heads down and build. So I love so much about the story of Surge. For people that don't know what Surge does, just give us a quick explanation of what Surge is.
[09:16]
Edwin Chen
We essentially teach AI models what's good and what's bad. So we train them using human data. And there's a lot of different products that we have, like SFT, RLHF, rubrics, verifiers, RL environments, and so on and so on. And then we also measure how well they're progressing. So essentially we're a data company.
[09:37]
Lenny Rachitsky
What you always talk about is the quality has been the big reason you guys have been so successful. The quality of the data. What does it take to create higher quality data? What do you all do differently? What are people missing?
[09:47]
Edwin Chen
I think most people don't understand what quality even means in this space. They think you can just throw bodies at a problem and get good data. And that's completely wrong. Let me give you an example. So imagine you wanted to train a model to write an eight-line poem about the moon. What makes it a good, high-quality poem? If you don't think deeply about quality, you'll be like, is this a poem? Does it contain eight lines? Does it contain the word "moon"? You check all of these boxes and if so, sure, yeah, you say it's a great poem. But that's completely different from what we want. We are looking for Nobel Prize-winning poetry. Like, is this poetry unique? Is it full of subtle imagery? Does it surprise you and tug at your heart? Does it teach you something about the nature of moonlight? Does it play with your emotions and does it make you think? That's what we are thinking about when we think about a high-quality poem. So it might be like a haiku about moonlight on water. It might use internal rhyme and meter. There are a thousand ways to write a poem about the moon and each one gives you all these different insights into language and imagery and human expression. And I think thinking about quality in this way is really hard. It's hard to measure. It's really subjective and complex and rich and it sets a really high bar. And so we have to build all of this technology in order to measure it. Like thousands of signals on all of our workers, thousands of signals on every project, every task. Like we know at the end of the day if you are good at writing poetry versus good at writing essays versus great at writing technical documentation. And so we have to gather all these signals on what your background is, what your expertise is, and not just that, but how you're actually performing when you're writing all these things. And we use those signals to inform whether or not you are a good worker for these projects and whether or not you are improving the models.
And it's really hard. And so we build all this technology to measure it, but I think that's exactly what we want AI to do. And so we have these really, really deep notions about quality that we're always trying to achieve.
[11:37]
Lenny Rachitsky
So what I'm hearing is there's kind of just going much deeper in understanding what quality is within the verticals that you are selling data around. And is this like a person you hire who is incredibly talented at poetry, plus evals that they, I guess, help write, that tell them that this is great? What are the mechanics of that?
[11:57]
Edwin Chen
The way it works is we essentially gather thousands of signals about everything that you're doing when you're working on the platform. So we are looking at your keyboard strokes, we are looking at how fast you answer things, we are using reviews, we are using gold standards, we're training models ourselves on the outputs that you create, and then we're seeing whether they improve the model's performance. And so in a very similar way to how Google Search works, when Google Search is trying to determine what is a good web page, there are almost two aspects of it. One is you want to remove all the worst of the worst web pages. So you want to remove all the spam, all the low quality content, all the pages that don't load. And so it's almost like a content moderation problem. You just want to remove the worst of the worst. But then you also want to discover the best of the best. Okay, like this is the best web page, or, you know, this is the best person for this job. They are not just somebody who writes the equivalent of high school level poetry. Again, they're not just robotically writing poetry that checks all these boxes, checks all these explicit instructions, but rather, yeah, they're writing poetry that makes you emotional. And so we have all these signals as well where, again, completely different from removing the worst of the worst, we are finding the best of the best. And so we have all these signals. Again, just like Google Search uses all these signals and feeds them into their ML algorithms and uses them to predict certain types of things, we do the same with all of our workers and all of our tasks and all of our projects. And so it's almost like a complicated machine learning problem at the end of the day. And that's how it works.
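
The two-stage idea Edwin describes (first remove the worst of the worst, then find the best of the best) can be sketched as a filter followed by a ranking. This is a minimal illustration under stated assumptions: the three signals, the threshold, and the fixed weights are hypothetical stand-ins, whereas the transcript describes thousands of signals fed into ML models.

```python
from dataclasses import dataclass


@dataclass
class WorkerSignals:
    worker_id: str
    spam_rate: float     # hypothetical: share of submissions flagged as junk (0..1)
    review_score: float  # hypothetical: average peer-review rating (0..1)
    model_lift: float    # hypothetical: gain seen when training on this worker's data (0..1)


def remove_worst(workers, max_spam_rate=0.2):
    """Stage 1, the 'content moderation' pass: drop clearly bad contributors."""
    return [w for w in workers if w.spam_rate <= max_spam_rate]


def quality_score(worker, review_weight=0.4, lift_weight=0.6):
    """Stage 2 score: a fixed weighted sum standing in for a learned model."""
    return review_weight * worker.review_score + lift_weight * worker.model_lift


def best_of_the_best(workers, top_k=1):
    """Filter first, then rank the survivors by quality score."""
    survivors = remove_worst(workers)
    return sorted(survivors, key=quality_score, reverse=True)[:top_k]


workers = [
    WorkerSignals("a", spam_rate=0.50, review_score=0.9, model_lift=0.9),
    WorkerSignals("b", spam_rate=0.05, review_score=0.6, model_lift=0.5),
    WorkerSignals("c", spam_rate=0.10, review_score=0.8, model_lift=0.9),
]

top = best_of_the_best(workers)
print([w.worker_id for w in top])
```

Worker "a" has strong scores but is removed in the first stage, which is the point of treating the two stages as separate problems.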
[13:30]
Lenny Rachitsky
That is incredibly interesting. I want to ask you about something I've been very curious about over the past couple years. If you look at Claude, it's been so much better at coding and at writing than any other model for so long. And it's really surprising just how long it took other companies to catch up considering just how much economic value there is there. Just like every AI coding product sat on top of Claude because it was so good at code, and writing also. What is it that made it so much better? Is it just the quality of the data they trained on or is there something else?
[13:59]
Edwin Chen
I think there are multiple parts to it, so a big part of it certainly is the data. I think people don't realize that there's almost this infinite amount of choices that all the frontier labs are deciding between when they're choosing what data goes into their models. It's like, okay, are you purely using human data? Are you gathering the human data in x, y, z way? When you are gathering the human data, what exactly are you asking the people who are creating it to create for you? For example, in the coding realm, maybe you care more about front-end coding versus back-end coding. Maybe when you're doing front-end coding, you care a lot about the visual design of the front-end applications that you're creating. Or maybe you don't care about it so much and you care more about, I don't know, the efficiency of it or the pure correctness over that visual design. Then there are other questions like, okay, how much synthetic data are we throwing into the mix? How much do you care about these 20 different benchmarks? Some companies, they see these benchmarks and they're like, okay, for PR purposes, even though we don't think that these academic benchmarks matter all that much, maybe we just need to optimize for them anyways because our marketing team needs to show certain progress on certain standard evaluations that every other company talks about. And if we don't show good performance here, it's going to be bad for us, even if ignoring these academic benchmarks makes us better at different tasks. Other companies are going to be principled and be like, okay, yeah, no, I don't care about marketing. I just care about how my model performs on these real world tasks at the end of the day. And so I'm going to optimize for that instead. And it's almost like there's a trade-off between all of these different things.
And one of the things I often think about is that it's almost like there's an art to post-training. It's not purely a science. When you are deciding what kind of model you're trying to create and what it's good at, there's this notion of taste and sophistication.
[15:52]
Edwin Chen
Like, going back to the example of how good the model is at visual design. Okay, maybe you have a different notion of visual design than what I do. Maybe you care more about minimalism and you care more about, I don't know, 3D animations than I do. And maybe some other person prefers things that look a little bit more baroque. There are all these notions of taste and sophistication that you have to decide between when you're starting your post-training mix. And so that matters as well. So long story short, I think there are all these different factors, and certainly the data is a big part of it, but it's also, what is the objective function that you're trying to optimize your model towards?
[16:29]
Lenny Rachitsky
That is so interesting. Like the taste of the person leading this work will inform what data they ask for, what data they feed it. But it's wild. It shows the value of great data. Anthropic got so much growth and win from essentially better data.
[16:49]
Edwin Chen
Yeah, yeah, exactly.
[16:51]
Lenny Rachitsky
And I could see why companies like yours are growing so fast. There's just so much. And that's just one vertical that's just coding. And then there's probably a similar area for writing. I love that. It's interesting that AI, you know, it feels like this artificial computer, binary thing, but it's like taste, human judgment is still such a key factor in these things being successful.
[17:09]
Edwin Chen
Yep, yep, yep, exactly. Like again, going back to the example I gave earlier, certain companies, if you ask them what is a good poem, they will simply robotically check off all of these instructions on a list. But again, I don't think that makes for good poetry. So certain frontier labs, the ones with more taste and sophistication, they will realize that it doesn't reduce to this fixed set of checkboxes and they'll consider all of these kind of implicit, very subtle qualities instead. And I think that's what makes them better at this at the end of the day.
[17:37]
Lenny Rachitsky
You mentioned benchmarks. This is something a lot of people worry about. There are all these models where, basically, it feels like every model is better than humans at kind of every STEM field at this point. But to a regular person, it doesn't feel like these models are getting that much smarter constantly. What's your sense of how much you trust benchmarks and just how correlated those are with actual AI advancements?
[18:00]
Edwin Chen
Yeah. So I don't trust the benchmarks at all. And I think that's for two reasons. So one is, I think a lot of people don't realize, even researchers within the community, they don't realize that the benchmarks themselves are often honestly just wrong. Like they have wrong answers, they're full of all this kind of messiness. For the popular ones, people have maybe realized this to some extent, but the vast majority just have all these flaws that people don't realize. So that's one part of it. And the other part of it is these benchmarks, at the end of the day, often have well-defined, objective answers that make them very easy for models to hill climb on in a way that's very, very different from the messiness and ambiguity of the real world. One thing I often say is that it's kind of crazy that these models can win IMO gold medals, but they still have trouble parsing PDFs. And that's because, yeah, even though IMO gold medals seem hard to the average person, and they are hard at the end of the day, they have this notion of objectivity that parsing a PDF sometimes doesn't have. And so it's easier for the frontier labs to hill climb on all these than to solve all these messy, ambiguous problems in the real world. So I think there's a lack of direct correlation there.
[19:18]
Lenny Rachitsky
It's so interesting. The way you described it is hitting these benchmarks is kind of like a marketing piece. When you launch, say, Gemini 3 just launched, and it's like, cool, number one on all these benchmarks. Is what happens that they just kind of train their models to get good at these very specific things?
[19:31]
Edwin Chen
Yeah. So there's again, maybe two parts to this. So one is sometimes, yeah, these benchmarks, they accidentally leak in certain ways or the Frontier Labs will tweak the way they evaluate their models on these benchmarks. Like they'll tweak their system prompt or they'll tweak the number of times they run their model and so on and so on in a way that games these benchmarks. The other part of it though is it's like by optimizing for the benchmark instead of optimizing for the real world, you will just naturally climb on the benchmark and yeah, it's basically another form of gaming it.
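One of the mechanisms mentioned here, changing the number of times a model is run, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the per-question success probability, the function names, and the rule "count a question as solved if any of k attempts succeeds" are hypothetical choices for illustration, not a description of any lab's actual evaluation procedure.

```python
import random

def single_run_accuracy(p_correct, n_questions, rng):
    """Score a benchmark with exactly one attempt per question."""
    hits = sum(rng.random() < p_correct for _ in range(n_questions))
    return hits / n_questions

def best_of_k_accuracy(p_correct, n_questions, k, rng):
    """Score the same benchmark, counting a question as solved
    if at least one of k independent attempts succeeds."""
    hits = sum(
        any(rng.random() < p_correct for _ in range(k))
        for _ in range(n_questions)
    )
    return hits / n_questions

rng = random.Random(0)  # fixed seed so the sketch is reproducible
one_shot = single_run_accuracy(0.5, 2000, rng)
best_of_8 = best_of_k_accuracy(0.5, 2000, 8, rng)

# Same hypothetical model, very different headline number.
# Analytically, best-of-k accuracy is 1 - (1 - p) ** k.
print(f"one attempt per question: {one_shot:.3f}")
print(f"best of 8 attempts:       {best_of_8:.3f}")
print(f"analytic best-of-8:       {1 - (1 - 0.5) ** 8:.3f}")
```

The model's underlying ability is identical in both runs; only the evaluation protocol changed, which is why two reported scores on the same benchmark are not comparable unless the protocol is also reported.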
[20:09]
Lenny Rachitsky
With that in mind, how do you kind of get a sense of whether we're heading towards AGI? How do you measure progress?
[20:15]
Edwin Chen
Yes. So the way we really care about measuring model progress is by running all these human evaluations. So for example, what we do is, yeah, we will take our human annotators and we'll ask them, okay, go have a conversation with the model. Maybe you're having a conversation with the model across all of these different topics. So you are a Nobel Prize-winning physicist, so you go have a conversation about pushing the frontier of your own research. You are a teacher and you're trying to create lesson plans for your students, so go talk to the model about these things. Or you're a coder and you're working at one of these big tech companies and you have these problems every day, so go talk to a model and see how much it helps you. And because our Surgers, our annotators, are experts at the top of their fields, they are not just skimming the responses, they're actually working through the responses deeply themselves. Yeah, they're going to evaluate the code it writes, they're going to double check the physics equations that it writes, they're going to evaluate the models in a very deep way. So they're going to pay attention to accuracy and instruction following and all these things that casual users don't. When you suddenly get a pop-up on your ChatGPT response asking you to compare these two different responses, people like that, they're not evaluating models deeply, they're just vibing and picking whatever response looks flashiest. Our annotators are looking closely at responses and evaluating them on all of these different dimensions. And so I think that's a much better approach than these benchmarks or these kind of random online A/B tests.
[21:49]
Lenny Rachitsky
Again, I love just how central humans continue to be in all this work that we're not totally done yet. Is there going to be a point where we don't need these people anymore? That AI is so smart that okay, we're good, we got everything out of your heads?
[22:01]
Edwin Chen
Yeah, I think that will not happen until we reach AGI. Like it's almost like by definition if we haven't reached AGI yet, then there's more for the models to learn from. And so, yeah, I don't think that's going to happen anytime soon.
[22:12]
Lenny Rachitsky
Okay, cool. So more reason to stress about AGI. We don't need these folks anymore. What's your. I can't not ask just any people that work closely with this stuff. I'm always just curious, what's your AGI timelines? How far do you think we are from this? Do you think we're in like a couple years? Or is it like decades?
[22:28]
Edwin Chen
So I'm certainly on the longer time horizon front. Like, I think people don't realize that there's a big difference between moving from 80% performance to 90% performance to 99% performance to 99.9% performance and so on and so on. And so in my head, I'd probably bet that within the next one or two years, yeah, the models are going to automate 80% of, you know, the average L6 software engineer's job. It's going to take another few years to move to 90% and another few to 99% and so on and so on. So I think we're closer to a decade or decades away than a few years.
[23:03]
Lenny Rachitsky
You have this hot take that a lot of these labs are kind of pushing AGI in the wrong direction. And this is based on your work at Twitter and Google and Facebook. Can you just talk about that?
[23:14]
Edwin Chen
I'm worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we are optimizing for AI slop instead. Like, we're basically teaching our models to chase dopamine instead of truth. And I think this relates to what we're talking about regarding these benchmarks. So let me give you a couple examples. So right now, the industry is plagued by these terrible leaderboards like LMArena. It's this popular online leaderboard where random people from around the world vote on which AI response is better. But the thing is, like I was saying earlier, they're not carefully reading or fact checking. They're skimming these responses for two seconds and picking whatever looks flashiest. So a model can hallucinate everything. It can completely hallucinate, but it will look impressive because it has crazy emojis and bolding and markdown headers and all these superficial things that don't matter at all, but they catch your attention. And these LMArena users love it. It's literally optimizing your models for the types of people who buy tabloids at the grocery store. We've seen this in their data ourselves. The easiest way to climb LMArena is adding crazy bolding, it's doubling the number of emojis, it's tripling the length of your model responses, even if your model starts hallucinating and getting the answer completely wrong. And the problem is, again, because all of these frontier labs, they kind of have to pay attention to PR, because their sales team, when they're trying to sell to all these enterprise customers, those enterprise customers will say, oh, well, but your model's only number five on LMArena, so why should I buy it? They have to, in some sense, pay attention to these leaderboards. 
And so what researchers all tell us is like, they'll say, the only way I'm going to get promoted at the end of the year is if I climb this leaderboard, even though I know that climbing it is probably going to make my model worse at accuracy and instruction following. So I think there's all these negative incentives that are pushing work in the wrong direction. I'm also worried about this trend towards optimizing AI for engagement. I used to work on social media, and every time we optimized for engagement, terrible things happened. You'd get clickbait and pictures of bikinis and Bigfoot and horrifying skin diseases just filling your feeds. And I worry the same thing's happening with AI. If you think about all the sycophancy issues with ChatGPT: "Oh, you're absolutely right. What an amazing question." The easiest way to hook users is to tell them how amazing they are. And so these models, they constantly tell you you're a genius. They'll feed into delusions and conspiracy theories. They'll pull you down these rabbit holes because Silicon Valley loves maximizing time spent and just increasing the number of conversations you're having with it. And so, yeah, companies are spending all their time hacking these leaderboards and benchmarks, and the scores are going up. But I think it actually masks that the models with the best scores are often the worst, or just have all these fundamental failures. So I'm really worried that all of these negative incentives are pushing AGI in the wrong direction.
[26:03]
Lenny Rachitsky
So what I'm hearing is AGI is being slowed down by these basically the wrong objective function, these labs paying attention to the wrong, basically benchmarks and evals.
[26:12]
Edwin Chen
Yep.
[26:12]
Lenny Rachitsky
I know you probably can't play favorites since you work with all the labs. Is there anyone doing better at this and maybe kind of realizing this is the wrong direction?
[26:21]
Edwin Chen
I would say I've always been very, very impressed by Anthropic. I think Anthropic takes a very principled view about what they do and don't care about and how they want their models to behave, in a way that feels a lot more principled to me.
[26:38]
Lenny Rachitsky
Interesting. Are there any other big mistakes you think labs are making just that are slowing things down or heading the wrong direction? We've heard just, you know, chasing benchmarks, this engagement focus. Is there anything else you're seeing of just like, okay, we should, we got to work on this because it'll, it'll speed everything up.
[26:55]
Edwin Chen
I mean I think there is a question of what products they're building and whether those products themselves are something that kind of help or hurt humanity. Like I think a lot about Sora.
[27:07]
Lenny Rachitsky
And I was thinking that what it.
[27:10]
Edwin Chen
Yeah, what, when, what it entails. And so it's kind of interesting. It's like which companies would build Sora and which wouldn't.
[27:18]
And I think the answer to that question, I mean, I don't know the answer myself, I have an idea in my head. But I think the answer to that question maybe reveals certain things about what kinds of AI models those companies want to build and what direction and what future they want to achieve.
[27:36]
So I think about that a lot.
[27:38]
Lenny Rachitsky
The Steelman argument there is, it's like fun. People want it, it'll help them generate revenue to grow this thing and build better models. It'll train data in an interesting way. It's also just like, you know, really fun.
[27:51]
Edwin Chen
Yeah, I think it's almost like, do you care about how you get there? And in the same way, so I made this tabloid analogy earlier, but would you sell tabloids in order to fund, I don't know, some other newspaper? Sure, in some sense.
[28:13]
If you don't care about the path, then you just do whatever it takes. But it's possible that it has negative consequences in and of itself that will harm the long term direction of what you're trying to achieve, and maybe it'll distract you from all the more important things. So yeah, I think the path you take matters a lot as well.
[28:32]
Lenny Rachitsky
Along these lines, you've talked a bunch about Silicon Valley and kind of the downsides of raising a lot of money and being in the echo chamber, what you call the Silicon Valley machine. You talk about how it's hard to build important companies in this way and that you might actually be much more successful if you're not going down the VC path. Can you just talk about what you've seen there, your experience, and your advice essentially to founders? Because they're always hearing, you know, raise money from fancy VCs, move to Silicon Valley. What's kind of the counter take?
[29:02]
Edwin Chen
Yes. So I've always really hated a lot of the Silicon Valley mantras. The standard playbook is to get product-market fit by pivoting every two weeks, and to chase growth and chase engagement with all of these dark patterns, and to blitzscale by hiring as fast as possible. And I've always disagreed.
[29:21]
So yeah, I would say don't pivot, don't blitzscale, don't hire that Stanford grad who simply wants to add a hot company to their resume. Just build the one thing only you could build, the thing that wouldn't exist without the insight and expertise that only you have. You see these pivoting companies everywhere now: some founder who was doing crypto in 2020 and then pivoted to NFTs in 2022, and now they're an AI company. There's no consistency, there's no mission, they're just chasing valuations. And I've always hated this because Silicon Valley loves to scorn Wall Street for focusing on money. But honestly, most of Silicon Valley is chasing the same thing. And so we stayed focused on our mission from day one, pushing that frontier of high quality, complex data. And I've always loved that because I have this very romantic notion of startups. Startups are supposed to be a way of taking big risks to build something that you really believe in. But if you're constantly pivoting, you're not taking risks, you're just trying to make a quick buck. And if you fail because the market isn't ready yet, I actually think that's way better. At least you took a swing at something deep and novel and hard instead of pivoting into another LLM wrapper company. So yeah, I think the only way you build something that matters, that's going to change the world, is if you find a big idea you believe in and you say no to everything else. So you don't keep on pivoting when it gets hard. You don't hire a team of 10 product managers because that's what every other cookie-cutter startup does. You just keep building that one company that wouldn't exist without you. And I think there are a lot of people in Silicon Valley now who are sick of all the grift, who want to work on big things that matter with people who actually care. And I'm hoping that that'll be the future of how we build technology.
[30:52]
Lenny Rachitsky
I'm actually working on a post right now with Terrence Rohan, this VC that I really like to work with. And we interviewed five people who picked really successful generational companies early and joined them as really early employees. Like, they joined OpenAI before anyone thought it was awesome, Stripe before anyone knew it was awesome. And so we're looking for patterns of how people find these generational companies before anyone else. And it aligns exactly with what you just described, which is ambition. They have wild ambition with what they want to achieve. They're not, as you said, just kind of looking around for product-market fit, no matter what it ends up being. And so I love that what you described very much aligns with what we're seeing there.
[31:33]
Edwin Chen
Yep, yep. Yeah. I absolutely think that you have to have huge ambitions and you have to have a huge belief in your idea that's going to change the world, and you have to be willing to double down and keep on doing whatever it takes to make it happen.
[31:45]
Lenny Rachitsky
I love how counter your narrative is to so many of the things people hear. And so I love that we're doing this. I love that we're sharing this story.
[33:05]
Lenny Rachitsky
Slightly different direction, but something else that was maybe a counter narrative. I imagine you watched the Dwarkesh and Richard Sutton podcast episode, and even if you didn't, they basically had this conversation. Richard Sutton, the famous AI researcher who had this whole Bitter Lesson meme, talked about how LLMs almost are kind of a dead end. And he thinks we're going to really plateau around LLMs because of the way they learn. What's your take there? Do you think LLMs will get us to AGI or beyond? Or do you think there's going to be something new or a big breakthrough that needs to get us there?
[33:42]
Edwin Chen
I'm in the camp where I do believe that something new will be needed. Like the way I think about it is, when I think about training AI, I take a very, I don't know if I would say biological point of view, but I believe that in the same way that there's a million different ways that humans learn, we need to build models that can mimic all those ways as well. And maybe they'll have a different distribution of the focuses that they have. I know they'll be different from humans, so maybe they'll have a different distribution. But we want to be able to mimic the learning abilities of humans and
[34:22]
make sure that we have the algorithms and the data for models to learn in the same way. And so to the extent that LLMs have different ways of learning from humans, then yeah, I think something new would be needed.
[34:32]
Lenny Rachitsky
This connects to reinforcement learning. This is something that you're big on, and something I'm hearing more and more is just becoming a big deal in the world of post-training. Can you just help people understand what reinforcement learning and reinforcement learning environments are, and why they're going to be more and more important in the future?
[34:49]
Edwin Chen
Reinforcement learning is essentially training your model to reach a certain reward. And let me explain what an RL environment is. An RL environment is essentially a simulation of the real world. So think of it like building a video game with a fully fleshed out universe. Every character has a real story, every business has tools and data you can call, and you have all these different entities interacting with each other. So for example, we might build a world where you have a startup with Gmail messages and Slack threads and Jira tickets and GitHub PRs and a whole code base. And then suddenly AWS goes down and Slack goes down. And so okay, model, what do you do? The model needs to figure it out. So we give the models tasks in these environments, we design interesting challenges for them, and then we run them to see how they perform, and then we teach them; we give them these rewards when they're doing a good job or a bad job. I think one of the interesting things is that these environments really showcase where models are weak at end-to-end tasks in the real world. You have all these models that seem really smart on isolated benchmarks. They're good at single-step tool calling, they're good at single-step instruction following. But suddenly you dump them into these messy worlds where you have confusing Slack messages and tools they've never seen before. And they need to perform the right actions and modify databases and interact over longer time horizons, where what they do in step one affects what they do in step 50. And that's very, very different from these kind of academic single-step environments that they've been in before. And so the model just fails catastrophically in all these crazy ways. So I think these RL environments are going to be really interesting playgrounds for the models to learn from that will essentially be simulations and mimics of the real world. And so they'll hopefully get better and better at real tasks compared to all these contrived environments.
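The idea of an environment where an early step changes what later steps can do can be sketched in a few lines. This is a toy illustration only: the class name `IncidentEnv`, the tool names, and the "disk full" scenario are all hypothetical, and a real environment of the kind described would have far richer state and tooling.

```python
from dataclasses import dataclass, field

@dataclass
class IncidentEnv:
    """Toy multi-step environment: a site is down and the agent must fix it.
    All tool names and the failure scenario are made up for illustration."""
    site_up: bool = False
    root_cause_found: bool = False
    history: list = field(default_factory=list)

    def step(self, action: str) -> str:
        """Apply one tool call and return a text observation."""
        self.history.append(action)
        if action == "read_logs":
            self.root_cause_found = True
            return "error: disk full on db-1"
        if action == "free_disk":
            # Only works if the agent diagnosed the problem earlier:
            # an early step changes what a later step can achieve.
            if self.root_cause_found:
                self.site_up = True
                return "disk freed; site recovering"
            return "refused: no target host specified"
        if action == "restart_server":
            return "restart failed: disk full"
        return "unknown tool"

def run_episode(env: IncidentEnv, actions: list) -> float:
    """Run a scripted sequence of actions and return the end-of-episode reward."""
    for action in actions:
        env.step(action)
    return 1.0 if env.site_up else 0.0

# A policy that diagnoses first succeeds; one that guesses at fixes does not.
print(run_episode(IncidentEnv(), ["read_logs", "free_disk"]))
print(run_episode(IncidentEnv(), ["free_disk", "restart_server"]))
```

In actual training the action sequence would come from the model rather than a hard-coded list; the scripted lists here just make the dependency between steps visible.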
[36:36]
Lenny Rachitsky
So I'm trying to imagine what this looks like. Essentially it's like a virtual machine with, I don't know, browser or spreadsheet or something in it with like, I don't know, surge.com is that your website? Surge.com. let's make sure we get that right.
[36:49]
Edwin Chen
So we are actually surgehq.ai. surgehq.ai.
[36:53]
Lenny Rachitsky
Check it out. We're hiring, I imagine. Yes. Okay, so it's like, cool, here's surgehq.ai. Your job as an agent, let's say, is to make sure it stays up, and then all of a sudden it goes down. And the objective function is figure out why. Is that an example?
[37:13]
Edwin Chen
Yeah, so the goal of the task might be, okay, go figure out why and fix it. And so the objective function might be passing a series of unit tests, or it might be writing a document, like maybe it's a retro containing certain information that matches exactly what happened. There's all these different rewards that we might give it that determine whether or not it's succeeding. And so we're basically teaching models to achieve that reward.
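The two reward signals mentioned here, passing a set of tests and producing a retro that contains the right information, can be combined into one function. The sketch below is an assumption-laden illustration: the state keys, the required facts, the substring matching, and the 50/50 weighting are all hypothetical, not Surge's actual reward design.

```python
# Hypothetical checks standing in for a unit-test suite over the final system state.
UNIT_CHECKS = [
    lambda state: state.get("site_status") == "up",
    lambda state: state.get("error_rate", 1.0) < 0.01,
]

# Hypothetical facts that a correct retro document must mention (lowercase).
REQUIRED_FACTS = ["disk full", "db-1"]

def incident_reward(final_state: dict, retro_text: str) -> float:
    """Half the reward for passing every state check,
    half for a retro that mentions every required fact."""
    tests_pass = all(check(final_state) for check in UNIT_CHECKS)
    text = retro_text.lower()
    facts_covered = all(fact in text for fact in REQUIRED_FACTS)
    return 0.5 * tests_pass + 0.5 * facts_covered

# Fixed the site and explained it correctly: full reward.
print(incident_reward({"site_status": "up", "error_rate": 0.0},
                      "Root cause: disk full on db-1."))
# Explained the cause but never fixed the site: partial reward.
print(incident_reward({"site_status": "down"}, "Disk full on DB-1"))
```

Substring matching is a deliberately crude stand-in; checking that a retro "matches exactly what happened" would in practice need something more robust, such as a rubric or a human grader.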
[37:39]
Lenny Rachitsky
So essentially it's like: here's your goal, figure out why the site went down and fix it, and it just starts trying stuff, using all the intelligence it's got. It makes mistakes. You kind of help it along the way, reward it if it's doing the right sort of thing. And so what you're describing here is the next phase of models becoming smarter: more RL environments focused on very specific tasks that are economically valuable, I imagine.
[38:04]
Edwin Chen
Yeah, yeah. So just in the same way that there were all these different methods for models learning in the past, like originally we had SFT and RLHF, and then we had rubrics and verifiers, this is the next stage. And it's not the case that the previous methods are obsolete. This is again just a different form of learning that complements all the previous types. So it's just like a different skill that the model needs to learn how to do.
[38:30]
Lenny Rachitsky
And so in this case, it's less some physics PhD sitting around talking to a model, correcting it, giving it evals of here's what the correct answer is, creating rubrics and things like that. It's more like this person now designing an environment. So another example I've heard is like a financial analyst. Just like, here's an Excel spreadsheet, here's your goal, figure out our profit and loss or whatever. And so this expert now, instead of just sitting around writing rubrics, they're designing this RL environment.
[38:56]
Edwin Chen
Yeah, exactly. So that financial analyst might create a spreadsheet. They may create certain tools that the model needs to call in order to help fill out the spreadsheet. Like, it might be, okay, the model needs to access a Bloomberg Terminal. It needs to learn how to use it, and it needs to learn how to use this calculator, and it needs to learn how to perform this calculation. So it has all these tools that it has access to. And then the reward might be, okay, maybe I will download that spreadsheet, and I want to see, does cell B22 contain the correct profit and loss number, or does tab number two contain this piece of information?
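The "does cell B22 contain the correct number" check can be sketched as a small verifier. To keep the example self-contained, the spreadsheet is modeled as a plain nested dictionary rather than a real Excel file; the sheet name, cell values, and tolerance are hypothetical.

```python
import math

def cell_reward(workbook: dict, sheet: str, cell: str,
                expected: float, rel_tol: float = 1e-6) -> float:
    """Return 1.0 if the given cell holds the expected number, else 0.0.
    `workbook` maps sheet name -> {cell reference -> value}."""
    value = workbook.get(sheet, {}).get(cell)
    if not isinstance(value, (int, float)):
        return 0.0  # missing cell or non-numeric content earns nothing
    return 1.0 if math.isclose(value, expected, rel_tol=rel_tol) else 0.0

# Hypothetical spreadsheet the model handed back at the end of the task.
submitted = {
    "P&L": {"B20": 1_250_000.0, "B21": 900_000.0, "B22": 350_000.0},
}

print(cell_reward(submitted, "P&L", "B22", 350_000.0))  # correct value present
print(cell_reward(submitted, "P&L", "B23", 350_000.0))  # cell was never filled in
```

Comparing with a tolerance rather than exact equality avoids penalizing a correct answer for floating-point rounding; how tight that tolerance should be is a design choice for whoever builds the environment.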
[39:37]
Lenny Rachitsky
And what's interesting, this is a lot closer to how humans learn. We just try stuff, figure out what's working and what's not.
[39:44]
You talk about how trajectories are really important to this. It's not just here's the goal and here's the end. It's like every step along the way, can you just talk about what trajectories are and why that's important to this.
[39:55]
Edwin Chen
I think one of the things that people don't realize is that sometimes even though the model reaches the correct answer, it does so in all these crazy ways. So in the intermediate trajectory, it may have tried 50 different times and failed, but eventually it just kind of randomly lands on the correct number.
[40:17]
Or maybe
[40:20]
sometimes it just does things very, very inefficiently, or it almost reward hacks a way to get at the correct answer. And so I think paying attention to the trajectory is actually really, really important. And I think it's also really important because some of these trajectories can be very, very long. And so if all you're doing is checking whether or not the model reaches the final answer, there's all this information about how the model behaved in the intermediate steps that's missing. Sometimes you want models to get to the correct answer by reflecting on what they did. Sometimes you want them to get the correct answer by just one-shotting it. And if you ignore all of that, you're just missing a lot of the information that you could be teaching the model.
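The difference between checking only the final answer and also looking at the trajectory can be made concrete. In this sketch, the `Step` record, the penalty sizes, and the linear penalty formula are all hypothetical choices for illustration; they are one simple way to encode "how it got there matters", not a recommended or established scoring scheme.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    succeeded: bool

def outcome_only_reward(final_correct: bool) -> float:
    """Looks only at whether the final answer is right."""
    return 1.0 if final_correct else 0.0

def trajectory_aware_reward(steps: list, final_correct: bool,
                            step_penalty: float = 0.005,
                            failure_penalty: float = 0.01) -> float:
    """Also charges for long trajectories and failed intermediate attempts."""
    if not final_correct:
        return 0.0
    failures = sum(not s.succeeded for s in steps)
    penalty = step_penalty * len(steps) + failure_penalty * failures
    return max(0.0, 1.0 - penalty)

# One trajectory solves the task directly; the other fails 50 times first.
efficient = [Step("compute_pnl", True)]
flailing = [Step("guess", False)] * 50 + [Step("guess", True)]

# Outcome-only scoring cannot tell these apart.
print(outcome_only_reward(True), outcome_only_reward(True))
# Trajectory-aware scoring separates them.
print(trajectory_aware_reward(efficient, True))
print(trajectory_aware_reward(flailing, True))
```

A purely numeric penalty like this would not catch reward hacking on its own; detecting that usually means inspecting what the intermediate actions actually were, not just counting them.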
[41:04]
Lenny Rachitsky
I love that. Like it just, yeah, it tries a bunch of stuff and eventually gets it right. You don't want it to learn. This is the way to get there. There's often a much more efficient way of doing it. You mentioned all the kind of the steps we've taken along the journey of getting of helping AI models get smarter. Since you've been so close to this for so long, I think this is going to be really helpful for people. What's kind of like been the steps along the way from the first of post training that has most helped models advance. Like where do evals fit in the RL environments? Just like what's been like the steps and now we're heading towards RL environments.
[41:34]
Edwin Chen
Originally the way models started getting post-trained was purely through SFT.
[41:42]
Lenny Rachitsky
What does that stand for?
[41:43]
Edwin Chen
So SFT stands for supervised fine-tuning. And again, I think often in terms of these human analogies. And so SFT is a lot like mimicking a master and copying what they do. And then RLHF became very dominant, and the analogy there would be like sometimes you learn by writing 55 different essays and someone telling you which one they like the most. And then I think over the past year or so rubrics and verifiers have become very important, and rubrics and verifiers are like learning by being graded and getting detailed feedback on where you went wrong.
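The contrast between "someone tells you which essay they liked most" and "being graded with detailed feedback" shows up in the shape of the signal. The sketch below is hypothetical throughout: the record layout for the preference example, the rubric criteria, and the weights are invented for illustration and are not a real training data format.

```python
# A preference signal says only which of two responses won, not why.
preference_record = {
    "prompt": "A car travels 60 m in 5 s. What is its speed?",
    "chosen": "12 m/s, because speed is distance divided by time.",
    "rejected": "The answer is 12.",
}

# A rubric is a list of (criterion name, check function, weight).
speed_rubric = [
    ("states the correct value", lambda r: "12" in r, 2.0),
    ("includes units", lambda r: "m/s" in r, 1.0),
    ("explains reasoning", lambda r: "because" in r.lower(), 1.0),
]

def grade_with_rubric(response: str, rubric: list):
    """Return (score in [0, 1], names of the criteria the response failed)."""
    total = sum(weight for _, _, weight in rubric)
    earned = 0.0
    failed = []
    for name, check, weight in rubric:
        if check(response):
            earned += weight
        else:
            failed.append(name)
    return earned / total, failed

# The rubric explains where the weaker response went wrong.
print(grade_with_rubric(preference_record["rejected"], speed_rubric))
print(grade_with_rubric(preference_record["chosen"], speed_rubric))
```

The keyword checks are stand-ins; in practice each criterion would more likely be judged by an expert or a verifier than by string matching.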
[42:17]
Lenny Rachitsky
And those are evals? Is that another word for that?
[42:20]
Edwin Chen
Yeah, yeah. So, yeah, I think evals often covers two terms. One is you are using the evaluations for training because you're evaluating whether or not the model did a good job, and when it does do a good job, you're rewarding it. And then there's this other notion of evals where you're trying to measure the model's progress. Like, okay, yeah, I have five different candidate checkpoints, and I want to pick the one that's best in order to release it to the public. So I'm going to run all these evals on these five different checkpoints in order to decide which one. Which one is best.
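The second use of evals described here, choosing which candidate checkpoint to release, amounts to scoring each checkpoint on the same eval suite and comparing. A minimal sketch, assuming made-up checkpoint names and scores and a simple unweighted mean as the selection rule:

```python
from statistics import mean

def pick_best_checkpoint(eval_scores: dict) -> str:
    """Return the checkpoint whose mean score across all evals is highest.
    `eval_scores` maps checkpoint name -> list of per-eval scores."""
    return max(eval_scores, key=lambda name: mean(eval_scores[name]))

# Hypothetical scores for three candidate checkpoints on the same three evals.
scores = {
    "ckpt-1": [0.71, 0.64, 0.80],
    "ckpt-2": [0.75, 0.70, 0.78],
    "ckpt-3": [0.69, 0.72, 0.74],
}

best = pick_best_checkpoint(scores)
print(best)
```

An unweighted mean treats every eval as equally important; a real release decision would likely weight evals differently or require minimum thresholds on some of them.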
[42:51]
Lenny Rachitsky
Awesome.
[42:51]
Edwin Chen
Yeah. And, yeah, now we have RL environments, so it's kind of like a hot new thing.
[42:55]
Lenny Rachitsky
Awesome. So what I love about this business journey is just there's always something new. There's always this, like, okay, we're getting so good at just all this beautiful data for companies, and now they need something completely different. Now we're setting up all these virtual machines for them and all these different use cases, and it feels like that's a big part of this industry you're in, is just adapting to what labs are asking for.
[43:14]
Edwin Chen
Yeah, yeah. So, I mean, I really do think that we are going to need to build a suite of products that reflect the million different ways that humans learn. And, like, for example, think about becoming a great writer. You don't become great by memorizing a bunch of grammar rules. You become great by reading great books, and you practice writing and you get feedback from your teachers and from the people who buy your books in a bookstore and leave reviews. And you notice what works and what doesn't, and you develop taste by being exposed to all these masterpieces and also just terrible writing. So you learn through this endless cycle of practice and reflection. And again, these are all very, very different methods of learning to become a great writer. So just in the same way that there's a thousand different ways that the great writer becomes great, I think there's going to be a thousand different ways that AIs need to learn.
[44:06]
Lenny Rachitsky
It's so interesting. This just ends up being just like humans in so many ways. It makes sense because in a sense, neural networks, deep learning is modeled after how humans have learned and how our brains operate. But it's interesting just to make them smarter. It's, how do we Come closer to how humans learn more and more.
[44:22]
Edwin Chen
Yeah, it's almost like maybe the end goal is just throwing you into the environment and just seeing how you evolve. But within that, within that evolution, there's all these different sub learning mechanisms.
[44:35]
Lenny Rachitsky
Yeah. Which is kind of what we're doing now. So that's really interesting. This might be the last step of, until we hit AGI. Along these lines, something that's really unique to Surge that I learned is you guys have your own research team, which I think is pretty rare. Talk about just why that's something you guys have invested in and what has come out of that investment.
[44:52]
Edwin Chen
Yeah, so I think that stems from my own background. Like my own background is as a researcher. And so I've always cared fundamentally about pushing the industry and pushing the research community and not just about revenue. And so I think what our research team does is a couple different things. So we almost have two types of researchers at our company. One is our forward deployed researchers, who are often working hand in hand with our customers to help them understand their models. So we will work very closely with our customers to help them understand, okay, this is where your model is today. This is where you're lagging behind all the competitors. These are some ways that you could be improving in the future given your goals. And we're going to design these data sets, these evaluation methods, these training techniques to make your models better. So it's this very, very kind of collaborative notion of working with our customers, like being researchers ourselves, just a little bit more focused on the data side, and working hand in hand with them to do whatever it takes to make them the best. And then we also have our internal researchers. So our internal researchers are focused on slightly different things. So they are focused on building better benchmarks and better leaderboards. So I've talked a lot about how I worry that the leaderboards and benchmarks out there today are steering models in the wrong direction. So yeah, so the question is, how do we fix that? And so that's what our research team is focused really, really heavily on right now. So they're working a lot on that. And they're also working on these other things like, okay, we need to train our own models to see what types of data perform the best, what types of people perform the best. 
And so they are also working on all these kind of training techniques and evaluation of our own data sets to improve our data operations and the internal data products that we have that determine what makes something good quality.
[46:47]
Lenny Rachitsky
It's such a cool thing, because basically the labs have researchers helping them advance AI. I imagine it's pretty rare for a company like yours to have researchers actually doing primary research on AI.
[47:00]
Edwin Chen
Yeah, yeah. I think it's just because it's something I've fundamentally always cared about. Like, I often think about us more like a research lab than a startup because that is my goal. Like, it's kind of funny, but I've always said I would rather be Terence Tao than Warren Buffett. So that notion of creating research that pushes the frontier forward and not just getting some valuation, that's always been what drives me.
[47:26]
Lenny Rachitsky
And it's worked out. That's the beautiful thing about this. You mentioned that you were hiring researchers. Is there anything there you want to share, folks you're looking for?
[47:33]
Edwin Chen
So we look for people who are just fundamentally interested in data all day. So the types of people who could literally spend 10 hours digging through a data set and playing around with models and thinking, okay, yeah, this is where I think the model is failing. This is the kind of behavior you want the model to have instead. And just this aspect of being very, very hands on and thinking about the qualitative aspects of models and not just the quantitative parts. So again, it's this aspect of being hands on with data and not just caring about these kind of abstract algorithms.
[48:07]
Lenny Rachitsky
Awesome. I want to ask a couple broad AI kind of market questions. What else do you think is coming in the next couple of years that people are maybe not thinking enough about or not expecting in terms of where AI is heading? What's going to matter?
[48:20]
Edwin Chen
I think one of the things that's going to happen in the next few years is that the models are actually going to become increasingly differentiated because of the personalities and behaviors that the different labs have and the kind of objective functions that they are optimizing their models for. I think it's one thing I didn't appreciate a year or so ago. A year or so ago, I thought that all of the AI models would essentially become very, very commoditized. They would all behave like each other. And sure, one of them might be slightly more intelligent in one way today, but sure, the other ones would catch up in the next few months. But I think over the past year I've realized that the values that the companies have will shape the model.
[49:09]
Lenny Rachitsky
So.
[49:11]
Edwin Chen
Let me give an example. I was asking Claude to help me draft an email the other day, and it went through 30 different versions, and after 30 minutes, yeah, I think it really crafted me the perfect email. And I sent it. But then I realized I spent 30 minutes doing something that didn't matter at all. Sure, now I got the perfect email, but I spent 30 minutes doing something I wouldn't have worried about at all before. And this email probably didn't even move the needle on anything anyways. So I think there's a deep question here, which is, if you could choose the perfect model behavior, which model would you want? Do you want a model that says, you're absolutely right, there are definitely 20 more ways to improve this email, and it continues for 50 more iterations and it sucks up all your time and engagement? Or do you want a model that's optimizing for your time and productivity and just says, no, you need to stop, your email's great, just send it and move on with your day? And again, in the same way there's a fork in the road in how you could choose how your model behaves for this question, for every other question that models get, the kind of behavior that you want will fundamentally affect it. It's almost like in the same way that when Google builds a search engine, it's very, very different from how Facebook would build a search engine, which is very, very different from how Apple would build a search engine. They all have their own principles and values and things that they're trying to achieve in the world that shape all the products that they're going to build. And in the same way, I think all the LLMs will start behaving very, very differently too.
[50:42]
Lenny Rachitsky
That is incredibly interesting. You already see that with Grok. It's got like a very different personality and a very different approach to answering questions. And so what I'm hearing is you're going to see more of this differentiation.
[50:52]
Edwin Chen
Yep.
[50:54]
Lenny Rachitsky
Kind of another question along these lines. What do you think is most underhyped in AI that you think maybe people aren't talking enough about? That is really cool. And what do you think is overhyped?
[51:04]
Edwin Chen
So I think one of the things that's underhyped is the built-in products that all of the chatbots are going to start having. Like, I've always been a huge fan of Claude's Artifacts, and I think it just works really, really well. And actually the other day, I don't know if it's a new feature or not, but I asked it to help me create, like, an email, and then it just created it. So it didn't quite work, because it didn't allow me to send an email. But what it created instead was like a little, I don't know what we call it. Like a little box where I could click on it and it would just text someone this message. And I think that concept of taking artifacts to the next level, where you just have these mini apps, mini UIs, within the chatbots themselves, I feel like people aren't talking enough about that. So I think that's one underhyped area. And in terms of overhyped areas, I definitely think that vibe coding is overhyped. I think people don't realize how much it's going to make their systems unmaintainable in the long term when they simply dump this code into their code bases if it seems to work out right now. So I kind of worry about future coding. It's just going to keep on happening.
[52:18]
Lenny Rachitsky
These are amazing answers. On that first point, there's something I actually asked. I had the Chief Product Officers of OpenAI and Anthropic, Kevin Weil and Mike Krieger, on the podcast, and I asked them just like, as a product team, like, you have this giga brain intelligence, how long do you even need product teams? Do you think this AI will just create the product for you? Here's what I want. It's like the next level of vibe coding. It's just like, tell it, here's what I want. And it's just building the product and evolving the product as you're using it. And it feels like what you're describing is where we might be heading.
[52:48]
Edwin Chen
Yeah, yeah. I think there's a very, very powerful notion where it helps people just achieve their ideas in a much quicker way.
[52:56]
Lenny Rachitsky
Something we haven't gotten into that I think is really interesting is just the story of how you got to starting Surge. You have a really unique background. I always think about this. Brian Armstrong, the founder of Coinbase, once gave this talk that has really stuck with me, where he kind of talked about how his very unique background allowed him to start Coinbase. He had, like, an economics background, he had cryptography experience, and then he was an engineer. And it's like the perfect Venn diagram for starting Coinbase. And I feel like you have a very similar story with Surge. Talk about that, your background there and how that led to Surge.
[53:30]
Edwin Chen
Going way back, I was always fascinated by math and language when I was a kid. Like, I went to MIT because it's obviously one of the best places for math and CS, but also because it's the home of Noam Chomsky. My dream in school was actually to find some underlying theory connecting all these different fields. And then I became a researcher at Google and Facebook and Twitter, and I just kept running into the same problem over and over again. It was impossible to get the data that we needed to train our models. So I was always this huge believer in the need for high quality data. And then GPT-3 came out in 2020, and I realized that, yeah, if we wanted to take things to the next level and build models that could code and use tools and tell jokes and write poetry and solve the Riemann hypothesis and cure cancer, then, yeah, we were going to need a completely new solution. The thing that always drove me crazy when I was at all these companies was we had the full power of the human mind in front of us, and all the data solutions out there were focused on really simple things like image labeling. So I wanted to build something focused on all these advanced, complex use cases instead, that would really help us build the next generation of models. So yeah, I think my background across math and computer science and linguistics really, really informed what I always wanted to do. And so I started Surge a month later with one mission: to basically build the use cases that I thought were going to be needed to push the frontier of AI.
[54:50]
Lenny Rachitsky
And you said a month later? A month later after what?
[54:52]
Edwin Chen
After GPT-3 launched in 2020.
[54:54]
Lenny Rachitsky
Oh, okay. Wow. Okay, yeah, a great decision. What just kind of drives you at this point, other than just the epic success you're having? What keeps you motivated to keep building this and, you know, building something in the space?
[55:07]
Edwin Chen
I think I'm a scientist at heart. I always thought I was going to become this math or CS professor and work on trying to understand the universe and language and the nature of communication. Like, it's kind of funny, but I always had this fanciful dream where if aliens ever came to visit Earth and we needed to figure out how to communicate with them, I wanted to be the one the government would call, and I'd use all this fancy math and computer science and linguistics to decipher it. So even today, what I love doing most is, every time a new model is released, we'll actually do a really deep dive into the model itself. I'll play around with it, I'll run evals, I'll compare where it's improved, where it's regressed. I'll create this really deep dive analysis that we send our customers. And it's actually kind of funny, because a lot of times we will say it's from our data science team, but often it's actually just from me. And I think I could do this all day. Like, I have a very hard time being in meetings all day. I'm terrible at sales, I'm terrible at doing the typical CEO things that people expect you to do. But I love writing these analyses. I love jamming with the research team on what they're seeing. Sometimes I'll be up until 3am just talking on the phone with somebody on the research team and tweaking a model. So I love that I still get to be really hands on, working on the data and the science all day. And I think what drives me is that I want Surge to play this critical role in the future of AI, which I think is also the future of humanity. We have these really unique perspectives on data and language and quality and how to measure all this and how to ensure it's all going on the right path. And I think we're uniquely unconstrained by all of these influences that can sometimes steer companies in a negative direction. Like what I was saying earlier, we built Surge a lot more like a research lab than a typical startup.
So we care about curiosity and long term incentives and intellectual rigor. And we don't care as much about quarterly metrics and what's going to look good in a board deck. And so my goal is to take all these unique things about us as a company and use that to make sure that we're shaping AI in a way that's really beneficial for our species in the long term.
[57:06]
Lenny Rachitsky
What I'm realizing in this conversation is just how much influence you have, and companies like yours have, on where AI heads, the fact that you help labs understand where they have gaps and where they need to improve. And it's not just, you know, everyone looks at just the heads of OpenAI, Anthropic and all these companies as the ones ushering in AI. But what I'm hearing here is you have a lot of influence on where things head, too.
[57:30]
Edwin Chen
Yeah, I think there's this really powerful ecosystem where, honestly, people just don't know where models are headed and how they want to shape them yet, and how they want humanity to play a role in the future of all this. And so I think there's a lot of opportunity to just continue shaping the discussion.
[57:52]
Lenny Rachitsky
Along that thread, I know you have a very strong thesis on just why this work matters to humanity and why this is so important. Talk about that.
[58:01]
Edwin Chen
I'll get a bit philosophical here, but I think the question is a bit philosophical, so bear with me. So the most straightforward way of thinking about what we do is we train and evaluate AI, but there's a deeper mission that I often think about, which is helping our customers think about their dream objective functions. Like, yeah, what kind of model do they want their model to be? And once we help them do that, we'll help them train their model to reach that North Star. We'll help them measure that progress. But it's really hard, because objective functions are really rich and complex. It's kind of like the difference between having a kid and asking, okay, what test do you want them to pass? Do you want them to get a high score on the SAT and write a really good college essay? That's the simplistic version, versus, what kind of person do you want them to grow up to be? Will you be happy if they're happy no matter what they do, or are you hoping they'll go to a good school and be financially successful? And again, if you take that notion, it's like, okay, how do you define happiness? How do you measure whether they're happy? How do you measure whether they're financially successful? It's a lot harder than simply measuring whether or not they're getting a high score on the SAT. And what we're doing is we want to help our customers reach, again, their dream North Stars and figure out how to measure them.
[59:12]
So I talked about this example of what you want models to do when you're asking them to write 50 different email iterations. Do you just continue them for 50 more or do you just say, no, just move on with the day because this is perfect enough.
[59:29]
The broader question is, are we building these systems that actually advance humanity? How do we build the data sets to train towards that and measure it? Are we optimizing for all these wrong things? Just systems that suck up more and more of our time and make us lazier and lazier?
[59:45]
I think it's really relevant to what we do, because it's very hard to measure and define whether something is genuinely advancing humanity. It's very easy to measure all these proxies instead, like clicks and likes. But I think that's why our work is so interesting. We want to work on the hard, important metrics that require the hardest types of data, and not just the easy ones. I think one of the things I often say is, you are your objective function. So we want the rich, complex objective functions and not these simplistic proxies. And our job is to figure out how to get the data to match this. So, yeah, we want data, we want metrics that measure whether AI is making your life richer. We want to train our systems this way, and we want tools that make us more curious and more creative, not just lazier. And it's hard because, yeah, humans are kind of inherently lazy. So AI slop feeds are the easiest way to get engagement and make all your metrics go up. So I think this question about choosing the right objective functions and making sure that we're optimizing towards them, and not just these easy proxies, is really, really important to our future.
[60:37]
Lenny Rachitsky
Wow. I love how what you're sharing here gives you so much more appreciation of the nuances of building AI, training AI, the work that you're doing. You know, from the outside, people could just look at Surge and companies in the space and think, okay, well, they're just creating all this data feeding into AI, but clearly there's so much to this that people don't realize. And I love knowing that you're at the head of this, that someone like you is thinking through this so deeply. Maybe one more question. Is there something you wish you'd known before you started Surge? A lot of people start companies, they don't know what they're getting into. Is there something you wish you could tell your earlier self?
[61:11]
Edwin Chen
Yeah. So I definitely wish I'd known that you could build a company by being heads down and doing great research and simply building something amazing, and not by constantly tweeting and hyping and fundraising. It's kind of funny, but I never thought I wanted to start a company. Like, I love doing research, and I was actually always a huge fan of DeepMind, because they were this amazing research company that got bought and still managed to keep on doing amazing science. But I always thought that they were this magical unicorn. So I thought if I started a company, I'd have to become a business person, looking at financials all day and being in meetings all day and doing all this stuff that sounded incredibly boring and that I always hated. So I think it's crazy that didn't end up being true at all. I'm still in the weeds in the data every day, and I love it. I love that I get to do all these analyses and talk to researchers, and it's basically applied research, where we're building all these amazing data systems that really push the frontier of AI. Yeah, I wish I'd known that you don't need to spend all your time fundraising. You don't need to constantly generate hype, you don't need to become someone you're not. You can actually build a successful company by simply building something so good that it cuts through all that noise. And I think if I'd known this was possible, I would have started even sooner. So I would shine on that.
[62:19]
Lenny Rachitsky
That is such an amazing place to end. I feel like this is exactly what founders need to hear. And I think this conversation is going to inspire a lot of founders, and especially all the founders that want to do things in a different way. Before we get to a very exciting lightning round, is there anything else you wanted to share? Anything else you want to leave listeners with? We covered a lot of ground. It's totally okay to say no as well.
[62:37]
Edwin Chen
I think the thing I would end with is, I think a lot of people think of data labeling as really simplistic work, like labeling cat photos and drawing bounding boxes around cars. And so I've actually always hated the words "data labeling," because it just paints this very simplistic picture, when I think what we're doing is completely different.
[62:56]
I think about what we're doing as a lot more like raising a child. You don't just feed a child information. You're teaching them values and creativity and what's beautiful and these infinite subtle things about what makes somebody a good person. And that's what we're doing for AI. So, yeah, I just often think about what we're doing as almost like the future of humanity, or how we are raising humanity's children. So I'll leave it at that.
[63:28]
Lenny Rachitsky
Wow. I love just how much philosophy there is in this whole conversation that I was not expecting with that. Edwin, we've reached our very exciting lightning round. I've got five questions for you. Are you ready?
[63:38]
Edwin Chen
Yep. Let's go.
[63:40]
Lenny Rachitsky
Here we go. What are two or three books that you find yourself recommending most to other people?
[63:45]
Edwin Chen
Yes. So three books I often recommend are: first, Story of Your Life by Ted Chiang. It's my all time favorite short story, and it's about a linguist learning an alien language. And I basically reread it every couple years.
[63:56]
Lenny Rachitsky
And that's what Interstellar was about. Is that...
[63:59]
Edwin Chen
Yeah. So there's a movie called Arrival, which was based off of the story, which I love as well.
[64:04]
Lenny Rachitsky
Great. Okay, keep going.
[64:06]
Edwin Chen
And then second, The Myth of Sisyphus by Camus. I actually can't really explain why I love this, but I always find the final chapter somehow really inspiring. And then third, Le Ton beau de Marot by Douglas Hofstadter. And so I think Gödel, Escher, Bach is his more famous book, but I've actually always loved this one better. It basically takes a single French poem and translates it 89 different ways and discusses all the motivations behind each translation. And so I've always loved the way it embodies this idea that translation isn't this robotic thing that you do. Instead, there's a million different ways to think about what makes a high quality translation, which mimics a lot of the ways I think about data and quality and LLMs.
[64:44]
Lenny Rachitsky
All these resonate so deeply with all the things we've been talking about. Especially that first one. If that was your goal after school, like, I want to help translate alien language, I'm not surprised you love that short story. Next question. Do you have a favorite recent movie or TV show you've really enjoyed?
[65:00]
Edwin Chen
One of my new all time favorite TV shows is something I found recently. It's called Travelers. It's basically about a group of travelers from the future who are sent back in time to prevent the apocalypse. So I just really like science fiction. And then I actually just rewatched Contact, which is one of my all time favorite movies. So, yeah, I think one of the things you'll notice about me is that, yeah, I love any kind of book or film that involves scientists deciphering alien communication. Again, just this dream I always had as a kid.
[65:28]
Lenny Rachitsky
That's so funny. I love that. Okay, is there a product you recently discovered that you really love?
[65:35]
Edwin Chen
So it's funny, but I was in SF earlier this week and I finally took a Waymo for the first time. Honestly, it was magical, and it really felt like living in the future.
[65:43]
Lenny Rachitsky
Yeah, it's like the thing where people hype it like crazy, but it always exceeds your expectations.
[65:48]
Edwin Chen
It deserves the hype.
[65:49]
Lenny Rachitsky
Yeah, it's absurd. It's like, holy moly. Like, if you're not in SF, you don't realize just how common these things are. They're just all over the place, just driverless cars, constantly going about. And when you go to an event, at the end there's just all these Waymos lined up, picking people up.
[66:03]
Edwin Chen
Yep.
[66:04]
Lenny Rachitsky
Yeah. Good job. Good job over there. Do you have a favorite life motto that you find yourself coming back to in work or in life?
[66:12]
Edwin Chen
So I think I mentioned this idea that founders should build a company that only they could build. Almost like it's this destiny that their entire life and experiences and interests shaped them towards. And so I think that principle applies pretty broadly. Not just to founders, but to anyone creating things, I think.
[66:26]
Lenny Rachitsky
Well, let me follow that thread to un-lightning this answer. Do you have any advice for how to build those sorts of experiences that help lead to that? Is it, you know, follow things that are interesting to you? Because, you know, it's easy to say that. It's hard to actually acquire these really unique sets of experiences that allow you to create something really important.
[66:44]
Edwin Chen
Yeah. So I think it would always be to really follow your interests and do what you love. And it's almost like a lot of decisions I make about Surge. Like, I think one of the things that I didn't think about a couple years ago, but then someone said it to me, is that companies, in a sense, are an embodiment of their CEO. And it's kind of funny I hadn't thought about that, because I never quite knew what a CEO did. I always thought a CEO was kind of generic. And it's like, okay, you're just doing whatever your VPs and your board and whoever tell you to do, and you're just saying yes to decisions. But instead, it's this idea where, when I think about certain big, hard decisions we have to make, I don't think, what would the company do? I don't think, what metrics are we trying to optimize? I just think, what do I personally care about? What are my values, and what do I want to see happen in the world? And so I think following that idea, okay, so ask yourself, what are the values you care about? What are the things you're trying to shape, and not what will look good on a dashboard? I think that's also pretty important.
[67:50]
Lenny Rachitsky
I love how you're just full of endless, beautiful, and very deep answers. Final question. Something that you got quite famous for before starting Surge is you built this map while you were at Twitter that showed a map of the world and whether people called it soda or pop. I don't know if it's called soda or pop. What was the name of this map?
[68:13]
Edwin Chen
Yeah, it was like the Soda versus Pop data set or Soda versus Pop.
[68:17]
Lenny Rachitsky
And so it's like a map of the United States and tells you where people say pop versus soda. So do you say soda or pop?
[68:24]
Edwin Chen
So I say soda. I'm a soda person.
[68:26]
Lenny Rachitsky
Okay. And is that just like, that's the right answer, or is it like, whatever you are, it's totally fine?
[68:33]
Edwin Chen
I think I'll look at you a little bit funny if you say pop, and I'll wonder where you came from, but I won't scorn you too much.
[68:39]
Lenny Rachitsky
That's how I feel, too. Edwin, this was incredible. This was such an awesome conversation. I learned so much. I think we're going to help a lot of people start their own companies, help their companies become more aligned with their values and just building better things. Two final questions. Where can folks find you online if they want to reach out? What roles are you hiring for? How can listeners be useful to you?
[69:01]
Edwin Chen
Yeah, so I used to love writing a blog, but I haven't had time in the past few years. But I am starting to write again. So definitely check out the Surge blog, surgehq.ai/blog. And yeah, hopefully I'll be writing a lot more there. And I would say we're definitely always hiring. So for people who just love data, and people who love this intersection of math and language and computer science, definitely reach out. Reach out anytime.
[69:24]
Lenny Rachitsky
Awesome. And how can listeners be useful to you? Is it just, I don't know. Yeah, is there anything there? Any asks?
[69:29]
Edwin Chen
So I would say definitely. Tell me blog topics that you'd like me to write about.
[69:32]
Lenny Rachitsky
Okay.
[69:32]
Edwin Chen
And then I'm always fascinated by all of these AI failures that happen in the real world. So whenever you come across a really interesting failure that I think illustrates some deep question about how we want models to behave, there's just so many different ways a model can respond. I just oftentimes think there's just not a single right answer. And so whenever there's one of these examples, I just love seeing them.
[69:57]
Lenny Rachitsky
You need to share these on your blog. I would love to see these. Edwin, thank you so much for being here.
[70:03]
Edwin Chen
Thank you.
[70:04]
Lenny Rachitsky
Bye, everyone.
[70:07]
Podcast Producer
Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.