Summary

Podcast Summary: Latent Space: The AI Engineer Podcast

Episode: The AI Coding Factory
Date: May 29, 2025
Guests: Matan and Eno, Co-founders of Factory AI
Hosts: Alessio (CTO, Decibel) and swyx (Founder, Smol AI)

Episode Overview

This episode features a deep dive into Factory AI, a company redefining software development with AI-driven autonomous agents ("droids") for enterprises. The founders, Matan and Eno, join hosts Alessio and swyx to discuss the unique journey of Factory’s founding, the evolving landscape of AI agents and code generation, product positioning, enterprise use cases, technical architecture, and their vision for the future of AI-assisted software engineering.

Key Topics and Discussion Points

The Founding Story of Factory AI

Timestamps: 00:16 – 06:33

LangChain Hackathon Origins: Matan and Eno met at a 2023 LangChain hackathon, bonding over a shared fascination for code generation and AI in software. Both had Princeton connections but had never meaningfully interacted before.
- “We had like 150 mutual friends, but somehow never had a one-on-one conversation... at this LangChain hackathon...very quickly just gets into code generation.”
  – Matan [00:35]
Career Transitions: Eno was at Hugging Face focused on code model advising; Matan was doing a PhD in theoretical physics at Berkeley but was "nerd sniped" into AI by the fundamental nature of code as a lens for machine intelligence.
- “The beauty of how code is just core to the way that machines, you know, develop intelligence, really kind of nerd sniped me and got me to leave what I was pursuing for 10 years.”
  – Matan [02:30]
Rapid Commitment: Within eight days of meeting, both founders quit their previous roles to start Factory AI, driven by an “intellectual love at first sight.”
- “It was eight days from us first meeting to me dropping out of my PhD and Eno quitting his job.”
  – Matan [06:18]

Evolving the AI Coding Landscape

Timestamps: 06:56 – 14:25

Market Positioning: Factory’s focus is on autonomous, end-to-end systems for software development in the enterprise, targeting legacy, large, messy codebases (e.g., COBOL migrations) often ignored by other AI coding tools.
- “There are hundreds of thousands of developers who work on code bases that are 30+ years old...the value you can provide is very dramatic.”
  – Matan [07:14]
Constraints of Existing Coding Tools: Most competitors center on IDE plugins for individual productivity, facing constraints in latency and context. Factory’s product is cloud-first, enabling more scalable, delegative workflows.
- “When you are freed of a lot of these constraints, you can start to more fundamentally reimagine what a platform needs to look like...the product experience of delegation is really, really immature right now.”
  – Eno [08:32]
“Droids” Not Agents: Factory’s agents, dubbed “droids,” are designed around planning and environmental grounding rather than endless, unreliable loops. The naming avoids the baggage of the “AI agent” hype.
- “Droids have a nice ring to it...our customers really love droids as a name—‘these are the droids you’re looking for’.”
  – Matan [13:16]
Division of Labor: Founders see the core of software developer work as shifting; AI will take more of the “inner loop” (code writing), with humans increasingly focused on planning and communication.
- “The outer loop of software development… is going to continue to be very human driven, while the inner loop… is probably going to get fully delegated to agents very soon.”
  – Eno [13:51]

Product Demo and Technical Differentiators

Timestamps: 14:37 – 36:19

Droid Specializations: Factory’s platform offers different autonomous agents (droids) tailored to key enterprise use cases:
- Knowledge Droid: Research, documentation, and technical writing.
- Code Droid: The “daily driver” for code changes, ticket execution, with workflow delegation.
- Reliability Droid: Popular for incident response and SRE work.
- “There are different droids available for key use cases that people tend to have...we have a Code Droid, Knowledge Droid, and a Reliability Droid.”
  – Eno [14:37]
Workflow Design: Emphasis on seeing agent thought processes ("X-ray into its brain") and integrating context from Slack, Linear, Jira, GitHub, Sentry, PagerDuty, etc.
- “As the agent is working, what matters most is seeing what the agent is doing and having a bit of like an X-ray into its brain.”
  – Eno [16:00]
Intelligent Delegation: Agents ask clarifying questions dynamically, shifting from prompt engineering dependence toward more natural, manager-like delegation.
- “A lot of users...should not need to prompt engineer agents...the system knows when to ask for clarification.”
  – Eno [18:23]
Proactive Context Synthesis: Factory’s platform generates synthetic insights into codebases (e.g., setup steps, module interconnections), reducing friction and preventing rote context ingestion.
- “As we index code bases, we’re actually generating these insights at a much more granular level across the entire code base. Systems should be proactive in finding that information.”
  – Eno [21:36]

Model Evaluation, Pricing, and Enterprise Integration

Timestamps: 24:10 – 44:14

Model Benchmarks: Internal evaluation suite built on task-based and behavioral specs, rather than just public benchmarks ("Big Bar vs Little Bar" syndrome).
- “There are so many customers that we have. That was purely because they saw the charts...so I think that motivates resources to be put on benchmarking.”
  – Matan [25:42]
- “Vibe-based...internally actually matters a lot. We use factory every day, so when we switch a model we very quickly get a sense of how things are changing.”
  – Matan [26:21]
Handling Model RL Preferences: Some new LLMs (e.g., Sonnet 3.7) appear to prefer specific coding styles or tools due to their RL post-training. Factory adapts its product to ensure consistency and optimal tool usage.
- “It smells like Claude code...what if you gave it a search tool that was way better than grep, but the model just loves to use grep?”
  – Eno [27:00]
Pricing Model: Usage-based, transparent billing on tokens consumed.
- “We’re fully usage based...I actually think that we get better users the more they understand what tokens are and how they’re used, you know, in each back and forth.”
  – Matan [36:23]
Enterprise Metrics: Focused on deliverable timelines and concrete ROI over traditional productivity metrics like code churn or number of commits.
- “At the end of the day, no one really cares about the metrics. What people really care about is developer sentiment...pulling in timelines is the best ROI.”
  – Matan [39:53]

The Changing Nature of Software Development and Team Structures

Timestamps: 31:07 – 44:14

Why Browser-Based (Not IDE): Factory intentionally eschews IDE plugins for a web-first interface, betting that as AI does more code generation, the optimal developer experience will change fundamentally.
- “Can you iterate your way from a horse to a car?...you do need to think from scratch, about what does that new way to develop look like.”
  – Matan [31:07]
Rise of Tiny Teams and The “AI Native” Attitude: Individual users and very small teams can now accomplish work previously requiring dozens or hundreds of engineers.
- “There are sometimes individuals who weren't even really developers who will use Factory and have more usage than a 100-person enterprise...crazy to see.”
  – Matan [58:34]

Bottlenecks, Future Vision, and Organizational Insights

Timestamps: 47:51 – 57:42

Technical Limiters:
- For Models: Need for LLMs capable of long, complex, goal-oriented agentic tasks (multi-hour sessions with persistent planning).
  - “Models that have been post-trained on more general agentic trajectories over very long time spans...that is probably one of the bigger blockers.”
    – Eno [48:27]
- For Dev Tools: More robust, semantic observability and analytics are still lacking; current tools only offer rudimentary traces.
  - “It is still surprising to me that observability, it remains very challenging... how do you build almost semantic observability into your product?”
    – Eno [50:03]
Customer Growth and Go-To-Market: Factory’s enterprise deployments are growing rapidly on the strength of "aha" moments with customers; go-to-market investments are ramping up.
- “We really just relied on word of mouth...and when every one of those ends up a happy customer, you need to increase top of funnel.”
  – Matan [52:24]
Unique Hiring Needs: Biggest challenge is finding highly technical people who can both interface with execs and dig into code hands-on ("Is this a junior Eno or not?").
- “I think a big rate limiter is...having both that ability to talk to the CIO, VP Engineering... and sit side by side with their developers and jump into the platform.”
  – Matan [53:52]
Importance of Brand, Vibe, and Team: Factory benefited from close collaboration with a design-minded team (including Matan’s brother), stressing the importance of cross-disciplinary creativity and creating a company culture that’s fun and social as well as ambitious.
- “I recommend working with a sibling...having that design perspective and the engineering perspective and bash those two things together until we get something perfect.”
  – Matan [55:19]

Notable Quotes & Memorable Moments

“It was eight days from us first meeting to me dropping out of my PhD and Eno quitting his job.”
– Matan [06:18]
“There are hundreds of thousands of developers who work on code bases that are 30+ years old...if you made a demo video doing some COBOL migration, that's not very sexy...but the value you can provide is very dramatic.”
– Matan [07:14]
“The product experience of delegation is really, really immature right now. And most enterprises see that as the holy grail, not just going 15% or 20% faster.”
– Eno [08:32]
“A lot of users, we believe, should not need to prompt engineer agents...if you're hyper optimizing every line...you're going to have a bad time.”
– Eno [18:23]
“We are taking that more ambitious angle...everything is going to change about software development...the time developers spend writing code is going to go way down. But the time spent planning is going to go way up.”
– Matan [31:07]
“No one really cares about the metrics. What people really care about is developer sentiment...pulling in timelines for big deliverables.”
– Matan [39:53]
“If you are highly, highly technical but you want to be a founder...interface with CIOs and CTOs, this is a huge opportunity.”
– Eno [54:26]
“I cannot recommend enough working with a sibling... having that design perspective and the engineering perspective and bash those two things together.”
– Matan [55:19]

Important Timestamps

| Segment | Timestamp |
|---|---|
| Founding story/hackathon meeting | 00:16 – 06:33 |
| Autonomous agent focus & "droids" concept | 06:56 – 14:25 |
| Product demo, use cases, and workflow | 14:37 – 36:19 |
| Model evaluation and pricing | 24:10 – 44:14 |
| Browser vs. IDE paradigm shift discussion | 31:07 – 36:19 |
| Metrics, ROI, and enterprise deployment | 36:19 – 44:14 |
| Bottlenecks and future model vision | 47:51 – 50:03 |
| Organizational growth, hiring, and design | 51:56 – 57:42 |
| Closing comments: AI-native teams, culture | 57:42 – 59:09 |

Takeaways & Key Insights

Factory AI is betting big on the shift from collaborative, IDE-centric code writing to a cloud-based, delegative model—the "AI coding factory."
Their platform is built for enterprises and real-world, often unsightly codebases—eschewing hype demos for high-impact, unsexy tasks.
The team is obsessive about product experience (especially explainability, workflow, and delegation) and sees a radical change ahead for the developer role.
Growth is accelerating among Fortune 500s; word-of-mouth and demonstration of dramatic ROI (shortening massive migrations from months to days) are key.
Technical and go-to-market scaling both hinge on hybrid technical-sales hires; brand and team culture are also seen as strategic superpowers.

For full show notes and resources, visit latent.space.

Transcript

[00:05]
B
Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host, swyx, founder of Smol AI.
[00:12]
C
Hey.
[00:13]
A
And today we're very blessed to have both founders of Factory AI. Welcome.
[00:17]
C
Thank you for having us.
[00:18]
D
Yeah, thank you.
[00:18]
A
Matan and Eno. My favorite story about the founding of Factory is that you met at the LangChain hackathon. And I'm very annoyed because I was at that hackathon and I didn't start a company, I didn't meet my co-founder. Maybe one of you wants to quickly retell that little anecdote, because I think it's always very fun.
[00:36]
C
Yeah, yeah. Both Eno and myself went to Princeton for undergrad. And what's really funny is, retrospectively, we had like 150 mutual friends, but somehow never had a one-on-one conversation. If you pulled us aside and asked us about the other, we probably knew, like, vaguely what they did, what they were up to, but never had a one-on-one conversation. And then at this LangChain hackathon, you know, we're walking around and catch a glimpse of each other out of the corner of our eye, you know, go up, have a conversation, and it very quickly just gets into code generation. And this was back in 2023, when code generation was all about BabyAGI and AutoGPT. That was the big focus point there, and both of us were speaking about it, both were very obsessed with it, and I like to say it was intellectual love at first sight, because basically every day since then we've been obsessively talking to each other about AI for software development.
[01:28]
A
If I recall, that LangChain hackathon wasn't about code generation. How did you sort of find your way through the idea maze to Factory?
[01:34]
D
Yeah, basically I think that we both came at it from slightly different angles. I was at Hugging Face, working primarily on advising CTOs and AI leaders at Hugging Face's customers, guiding them towards how to think about research strategy, how to think about what models might pop up. And in particular, we had a lot of people asking about code and code models, in the context of "we all want to build a fine-tuned version on our code base." In parallel, I had started to explore building. At the time the concept of an agent wasn't really clearly fleshed out, but imagine basically a while loop that wrote Python code and executed on it, for a different domain, for finance. On my mind was how not very helpful it felt for finance and how incredibly interesting it felt for software. And then when I met Matan, I believe that he was exploring as well.
[02:31]
C
Yeah, that's right. At the time, I was still doing a PhD at Berkeley, technically in theoretical physics, although for a year at that point I had really switched over into AI research. And I think the thing that pulled me away from string theory, which I'd been doing for, like, 10 years, into AI was really that string theory and, you know, physics and mathematics really make you appreciate fundamentalness, or things that are very general. And the fact that capability in code is really core to performance on any LLM. Like, loosely, the better any LLM is at code, the better it is at any downstream task, even if that's, like, writing poetry. And that fundamental beauty of how code is just core to the way that machines, you know, develop intelligence really kind of nerd sniped me and got me to leave what I was pursuing for 10 years. And that mixed also with the fact that code is one of the very few things, especially at the time, that you could actually validate, and so you could have that agentic loop where the LLM is generating the output and you're actually verifying in ground truth the quality of that output. It just made it extremely exciting to pursue.
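[Editor's illustration] The generate-then-verify loop described in this turn can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Factory's implementation: the candidate programs are hard-coded stand-ins for model outputs, and the names `generate_and_verify`, `passes_tests`, and `CANDIDATES` are hypothetical. The point it shows is that code can be checked against ground truth (here, known input/output pairs) before it is accepted.

```python
from typing import Callable, Iterable, Optional


def generate_and_verify(
    candidates: Iterable[str], verify: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate program that passes verification, else None."""
    for source in candidates:
        if verify(source):
            return source
    return None


def passes_tests(source: str) -> bool:
    """Run the candidate and check it against known input/output pairs."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # execute the candidate definition
        add = namespace["add"]
        return add(2, 3) == 5 and add(-1, 1) == 0
    except Exception:
        # A candidate that crashes or defines nothing counts as a failure.
        return False


# Hypothetical stand-ins for model outputs: the first is wrong, the second is right.
CANDIDATES = [
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
]

accepted = generate_and_verify(CANDIDATES, passes_tests)
```

In a real system the candidates would come from a model, and a failed verification would typically be fed back into the next generation attempt rather than simply skipped.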
[03:40]
B
How did you guys decide that it was time to do it? Because I think maybe if you go back, the technology is, like, cool at a hackathon, but then as you start to build a company, there are maybe a lot of limitations. How did you phase out the start of the company, from "okay, the models are not great today, so let's build the harness around it" to "now the models are getting a lot better, so it's time to go GA," as you're doing now, and all of that?
[04:03]
C
There's kind of a more quantitative answer and then a more qualitative answer. So the qualitative answer, kind of building off of what I said before, that it was intellectual love at first sight: I think it was also one of those things that was kind of just like, if you know, you know. We met and we got along so well, and basically, like, the next 72 hours, we didn't sleep. We were just building together on initial versions of what would become Factory. And when something like that happens, I think it's good to just lean in and not really question it and overanalyze. Yet at the same time, if you, you know, do actually go and analyze, I think there are exactly the considerations that you're talking about, which is, yeah, the models at the time, which I think at the time it was just 3.5 that was out, certainly that's not enough to have a fully autonomous engineering agent. But very clearly, if you build that harness or you build that scaffolding around it and bring in the relevant integrations or the sources of information that a human engineer would have, it's very clear how that trajectory would get to the point where more and more of the tasks that a developer would do actually come under that line where you can automate it.
[05:05]
D
Yeah. And I think that at the time, as you mentioned, there was BabyAGI and a couple of these other concepts that had come out, which involved putting a while loop around the LLM, feeding back some context. And on the other hand, there were papers coming out on chain of thought, self-reflection. Of course, the scaling law papers at this point had been somewhat established. And so there was kind of this clear direction where models were going to get better at reasoning, they were going to get better at having larger context windows, they were going to get cheaper, or at least the Pareto frontier of the model capabilities was going to be expanding so that good models would get cheap. The best models might stay the same price, but they start to get really smart. And this was, I wouldn't say super obvious, but if you spent a lot of time just reading through these papers and working through them, there was definitely a rumbling amongst most of the people in the community that that was going to continue to extend. And so you blend all of these together with kind of meeting somebody who has this kind of energy, who clearly wants to build. And I think that it became really obvious that the opportunity was available. I also think that we made a lot of very solid progress on the initial demo, enough to convince ourselves this was actually going to be possible.
[06:19]
C
To be clear though, it was eight days from us first meeting to me dropping out of my PhD and Eno quitting his job. So there was analysis, but it was also just like, yeah, let's do it.
[06:29]
D
Yeah, it's pretty, pretty crazy. Like eight days for sure. Yeah.
[06:33]
B
My first company was a hackathon project and I dropped out of school to actually found the company with one of my best friends. So the story resonates.
[06:40]
A
I think I'm doing hackathons wrong.
[06:42]
C
Maybe like I've.
[06:43]
A
I've had. I met one girlfriend out of it. Now that was about it.
[06:46]
B
Hey, that's, you know, some people might Say that's how you're.
[06:49]
C
Is it still ongoing?
[06:51]
D
No.
[06:51]
A
Oh yeah. I mean it's part of the funnel. Yeah.
[06:57]
B
So yeah, maybe codegen was not the topic of the hackathon back then, but I would say today, at every other event that I go to, codegen is part of it. There are a lot of codegen products. Do you guys want to just talk about what Factory is and maybe give a quick comparison to the different products that people might have heard about, and then we can kind of dive deeper?
[07:15]
C
Our focus is on building autonomous systems for the full end-to-end software development life cycle, and in particular for enterprises. So, okay, obviously code generation is very exciting. A lot of, you know, the best engineers coming out of any of the popular schools, they want to work on, like, RL, they want to do cool things with GPUs, you know, training models. Code is one of the most obvious things because it's very easy to resonate with if you're an engineer. And so that's led to a lot of the players in the space really focusing on coding in particular, and on solo developers or building a quick zero-to-one project, and really the use case that appeals to that profile. And I think something that we're focused on is the relatively underserved enterprise perspective, which is: there are hundreds of thousands of developers who work on code bases that are 30-plus years old. It's really ugly, it's really hairy, it's really messy. And if you made a demo video doing some COBOL migration, that's not very sexy, you wouldn't go viral, you wouldn't get a lot of views, because it's just not that visually appealing. But the value that you can provide, and how much you can improve those developers' lives, is very, very dramatic. And so seeing that underserved group is kind of why we've focused our sights there.
[08:33]
D
Yeah. And I would add that there are a lot of really interesting constraints that people take for granted in the broader market as being fundamental to the coding assistant, the SDLC assistant kind of market. In particular, a lot of the players look at a platform that has been the dominant tool for software developers: the IDE. Right. And you think, this is a tool that was designed 20-plus years ago, or has been iterated on for 20-plus years, primarily for a human being to write every line of code. When you take a tool like that and you start to introduce AI, you start to introduce additional constraints that exist just out of the nature of where you're interacting with these systems and where those systems live. So for example, latency matters a lot when you're inside of an IDE. The cost, when you are local-first and your typical consumer is on a free plan or a $20-a-month paid plan, limits the amount of high-quality inference you can do and the scale or volume of inference you can do per outcome. When you are freed of a lot of these constraints, you can start to more fundamentally reimagine what a platform needs to look like in order to shift from a very collaborative workflow, which is what I think we see with most tools today, to a more delegative workflow where you are actually managing and delegating your tasks to AI systems. So I think that the product experience of delegation is really, really immature right now. And most enterprises, though, see that as the holy grail, not, like, going 15% or 20% faster.
[10:07]
A
And you call them droids. Is there a just story behind the naming of either factory or droids?
[10:13]
C
Yeah, so we were initially incorporated as the San Francisco Droid company.
[10:20]
D
Really?
[10:20]
C
We were.
[10:21]
A
Is this before I bleep that out?
[10:22]
D
In the, in the live podcast?
[10:24]
C
Sorry.
[10:24]
A
Oh, you had to bleep that out.
[10:26]
D
No, no, no, no.
[10:27]
C
But our legal team advised us that Lucasfilm is particularly litigious and that we should change our name. And at the time, while we were thinking of renaming, I was still in my PhD, because we incorporated like two days after we met, which was also ridiculous. I was still in some ML class at Berkeley and I was reading a paper on actor-critic. In there, there was some equation that was, you know, some function of the actor, and we're just calling that Y. And so it was like F of A equals Y. A is actor. Put the actor in there and then it's, you know, F-actor-Y: Factory. Yeah. And so that's how it originally came about. It actually works quite well also with, you know, automation, that sort of thing.
[11:10]
D
And also the factory method, I think was we, at some point we had that written up and then I think that inspired this line of thinking and droids kind of remained because we felt that there was a lot of hype at the time around the concept of agent. But it referred to such a specific thing that everybody saw, which was this like endless while loop, unreliable system that just kind of went on and on and took a bunch of actions without guidance. Yeah, and our thought process was, well, that's not really what our system looks like at all. So even though underneath it is an agentic system, do we need to say that we're an agent company? Doesn't really make sense.
[11:48]
A
I kind of like that. You spoke last year at the World's Fair, and I put you on the agents track, but I almost didn't have an agents track because I was like, this is so ill-defined. And I think that instinct is good. But now the agents wave has kind of come back the other way, and now everyone's an agent. I think defining your own term for it and just getting out of that debate is a positive. Is it closer to a workflow, which is, I guess, the more commonly industry-accepted term now?
[12:18]
D
Yeah, no, that's a great question. I think that the original version of the droids was a lot closer to what we called workflows. So they were asynchronous and event-based. They would trigger, and each one was kind of deterministic, or, like, semi-deterministic. That was the original version. As the models have evolved, and as our ability to build out guardrails and the system has just improved, we've gone to the point where, when you interact with droids inside the Factory platform, they are relatively unbounded in the path that they take, and they are, in general, guided mainly by the concepts of planning, decision making, and environmental grounding. So they can stay loosely goal-oriented over a long duration without needing very hard-coded guardrails. But they still tend to hit their goal according to their original plan. So I think now "agent" actually is probably the proper way to describe them.
[13:13]
C
Sure.
[13:13]
A
But, you know, I think. I think droids have a nice ring to it.
[13:16]
C
It's also funny. Our customers really, really love droids as a name. Just, these are the droids you're looking for. I cannot tell you how many times, like, if, you know, with an enterprise customer, we'll do a poc and like, you know, a day later, you know, they're excited and, like, things go well, they'll share a screenshot and be like, these are the droids we're looking for. And honestly, every time, it's just. It's so fun.
[13:36]
A
I know. And everyone thinks they're the first to make a joke, but it really is better than, like, agents or intern or like, you know, autonomous human.
[13:46]
C
The number of human-named AI products.
[13:49]
A
It's actually a pretty good insight, 100%.
[13:51]
D
And actually I think we to a certain extent take a bit of an objection to the idea that these things are a replacement for a human being. I think that very much as we work through harder and harder problems with agents, it's become more clear that the outer loop of software development and what a software developer does, planning, talking with other human beings, interacting around what needs to get done, is something that's going to continue to be very human driven. While the inner loop, the actual execution of writing lines of code or writing down the doc is probably going to get fully delegated to agents very soon.
[14:26]
B
You just need to put. Roger, Roger.
[14:28]
D
Once they ask a question, one finishes the task.
[14:31]
C
Yeah, we have that emoji in our slack and use it very frequently.
[14:34]
B
Roger, Roger. Do we want to do a quick demo?
[14:38]
D
Yeah, happy to jump in. When you land on the platform, you're presented with the opening dashboard. We try to make it really obvious that there are different droids available for key use cases that people tend to have. Of course, you can always go and speak with a default droid that can do a lot of things pretty well. But what we've learned is that there are three major use cases that people keep coming back to the platform for. The first is knowledge and technical writing. So that's more of a deep-research-style system that will go and do some research. It will use tools available to it, search, et cetera, and come back with either a high-quality document or answers, and then you can go back and forth. The code droid is really the one that's the daily driver for a lot of folks. And this system allows you to actually delegate a task. I'll jump into that in a second. We can actually go through a full walkthrough. And then the reliability droid. This was pretty surprising to us, the degree to which people love doing incident response, the kind of SRE-style work, inside of the platform. I guess in retrospect it's nice, because no one loves to be on call at 3 a.m., waking up being like, what's happening? And being able to just pass an incident description or say, hey, something's going wrong, and have a system really compile all the evidence, write up an RCA, and provide that for you is super high leverage. And so that's actually one of the more popular droids that people use. But I can start by just going into the code droid. When you start a session with a droid, you're presented with this interface. It's a little different from typical, where we see a lot of tools really want to focus you in on the code. Our perspective is that code is important to review when it's completed. But as the agent is working, what matters most is seeing what the agent is doing and having a bit of, like, an X-ray into its brain.
So we have an activity log on the left and a context panel on the right. You'll notice as we go through this task that the context panel starts to get updated. And I'm going to start by just doing something that's a pretty common entry point. I'm going to paste a ticket into our platform. We have integrations with a bunch of different stuff: Linear, Jira, Slack, GitHub, Sentry, PagerDuty, you name it. We have a bunch of these integrations that our enterprise clients have wanted over time, such that you can easily pull this info in. And if I were to say something like, "Hey, can you help me with this ticket?" And then I'm going to use my @ command, which lets me easily reference code or code bases: "Hey, can you help me with this ticket in factory mono?" I like to be nice to them: "I'd love for your help." And so you'll note that right off the bat, the droid starts working. It's doing a semantic search on part of my query in that code base. And actually the system has access to a bunch of different tools here: memory, project management tools, GitHub, web search. Right now the code droid only has search enabled by default. But you'll note that as the system starts working, it may actually want those additional tools added so that it can do its job.
[17:38]
C
Maybe an important note there is, you know, as we deploy these droids in the Enterprise, I think something that we're pretty like ideological about is everyone expects these, you know, agentic systems to perform at the level of a human, right? Because that's what they're always going to compare them to. But in a lot of cases, they'll have these agents just in the ide. And that's like the equivalent of onboarding a human engineer. Just like throwing them into your code base and being like, all right, like go. But like, the reality is, when you onboard a human engineer, what do you actually onboard them to? Slack Notion, Linear, like datadog, sentry, pager dude. They have all of these other information sources that they need to actually be a productive engineer. And yes, in theory, if you're like really, really good and you don't need contextual information, you could just work based on code, but like that would be a lot harder and it would probably.
[18:23]
D
Take a lot more time, 100%. And having those connections ends up being super important as it works through harder problems. In particular, you can see that the first thing it did after that search was reference some of the information that it found, saying, hey, this is what I found so far. It gives an initial crack at a plan, right? Presents that really clearly to you and then goes to ask clarifying questions. So we believe a lot of users should not need to prompt engineer agents, right? If your time is being spent hyper optimizing every line and question that you pass to one of these systems, you're going to have a bad time. And so a lot of what we do is to be able to handle that. If I say help me with this ticket, right, there's clearly going to be some ambiguities. The system knows, when you give a very detailed request, to follow your instructions, and when you give more ambiguous requests, to ask for clarification. This is actually a really tricky thing to get right in the model, but we spend a lot of time thinking about it, and so I'm just going to answer some of these questions. Are there any UI mockups? No. Try to imitate the other examples. Two, only preview when the button is clicked. Three, no, it's actually not implemented. Four, just answering which specific fields must be displayed, your choice, and five, your choice. So now I'm basically saying to it, you decide for some, giving my preferences on others.
[19:52]
C
And this is really the balance of, you know, even delegation that will happen to non AI. As a good manager, you give autonomy to people that work with you when needed. But also if you're like, hey, I'm a little worried about this or I'm going to be really strict about what I expect here, you want to extract that behavior as well, because a lot of times, like Eno mentioned, if you give a really poor prompt and you just say, hey, go do it, it's going to go do it. But it'll probably make assumptions. And then at the end you might not be happy, but that's just because there were some constraints in your head that you didn't actually explicitly mention when you were communicating.
[20:25]
D
So yep, 100%.
[20:27]
B
Do you guys have like a template that you've seen work? When I onboarded to Devin, for example, they have the fix prompt button and then it refills it in their template, which is like give the agent instruction on how to debug, give the agent instruction on how to do this, and asks you to fill out these things. Do you guys have a similar thing where you think for each project these are the questions that matter, or is that more dynamic?
[20:51]
D
No, that's a great question. And it's something that we talk about a lot internally: it's surprising how many people are building products that have reactive information requests. Like, you know, please fill out this form to explain how to do this thing. Or you need to set up this dev environment yourself manually in order for this to work. We think of trying to be proactive with a lot of this stuff. So you'll notice in the right hand corner there's this project overview. Right. And the system started to code after doing some search. So that's going to pop up while we do this. But when I click into this project overview, what you're going to see is basically a... and I'm hiding it because I'm realizing this is actually semi sensitive. It's probably fine for the.
[21:35]
B
We can hide that in the video.
[21:36]
D
No worries. It's totally fine for folks to see that it's a mono repo. If I scroll down, that's when we'd get in a little bit of trouble. But inside that project overview, we're actually synthesizing a bunch of what we call synthetic insights on top of the code base. And that is looking at things like how to set up your dev environment. What is the structure of the code base? How do important modules connect to each other? And as we index code bases, we're actually generating these insights at a much more granular level across the entire code base. We think that in general, systems should be proactive in finding that information. However, with features like memory, and we have like a Droid YAML, you can set a bit of your guidelines. However, we also feel that it's like that XKCD about standard.
[22:21]
B
Standard, right.
[22:22]
D
Like everyone's got like a dot blank rules file and so we ingest those automatically from all of the popular providers as well.
[22:28]
A
Wow. Okay.
[22:30]
B
Is something like Cursor rules complementary? Because people might take this and then work on it in Cursor separately.
[22:37]
D
Yeah. What we found is that there is sometimes extraneous advice in those because people need to give a lot more guidance to those types of tools than they do to ours. So our system parses through and only picks the things that we don't already know.
[22:52]
C
Another thing that kind of comes to mind related to your question, and this is something we've been thinking about a lot as well, is as we have more and more enterprise customers, a lot of the developers in the enterprise are not going to be as up to date on every new model and how it changes its behavior. Something that's interesting that we're thinking about is these developers are getting familiar with Factory and how to get the most out of it. And then, let's say when we upgraded from Sonnet 3.5 to 3.7, we suddenly had a lot of developers being like, hey wait, it now does this less or it does this more. What's happening? Or when they go to Gemini, let's say, and they want longer context. And so something that I think is interesting is how much of the behavior difference from the models should we act as like a shock absorber for, so that they can basically as a user use it exactly how they've been using it before and get the same sort of output, but then also how much of that do we actually want to translate to the user? Because presumably over the next three years the way you interact with models will change and it's not just going to be up to behavior, but it's rather like, I guess it's alpha versus beta in the model. Some models have different personalities and it's just the way you prompt it to get the same out of it. And then there are others where it's like, I mean, for example the reasoning models, they just work in a fundamentally different way. And so you as the user should know how to interact differently. So that's something that's kind of fun to wrestle with.
[24:10]
B
How do you evaluate the new models?
[24:12]
D
We listened a lot to how the model providers actually think about building out their eval suites, and in particular trying to look at things like desired behavior versus actual behavior, and in a way that's sustainable for a small team. We don't have like, you know, $100 million to pay data providers. And so a lot of the evaluation ends up being a combination of point, task based evals. So like Aider has an awesome benchmark that we built on top of internally for code editing and file generation. And then for the top level agent loop we have our own behavioral spec where we set a bunch of high level principles, we break those down into tasks, those tasks then have grades and rubrics, and then we try to run those in order to determine whether the behavior we'd like, for example, asking questions when it's ambiguous versus not asking questions, matches up. And we also use that to optimize the prompts as well.
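A minimal sketch of the principle → task → rubric → grade structure described in this answer. Every name here (`RubricItem`, `EvalTask`, `grade`, `run_suite`, `stub_agent`) is hypothetical and not Factory's implementation; the string checks are toy stand-ins for whatever graders a real suite would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RubricItem:
    description: str
    check: Callable[[str], bool]  # returns True if the response satisfies this item
    weight: float = 1.0


@dataclass
class EvalTask:
    principle: str            # the high-level behavior this task probes
    prompt: str               # what is sent to the agent
    rubric: list[RubricItem]  # how the response is graded


def grade(task: EvalTask, response: str) -> float:
    """Weighted fraction of rubric items the response satisfies (0.0 to 1.0)."""
    total = sum(item.weight for item in task.rubric)
    earned = sum(item.weight for item in task.rubric if item.check(response))
    return earned / total if total else 0.0


def run_suite(tasks: list[EvalTask], agent: Callable[[str], str]) -> dict[str, float]:
    """Run every task through the agent and return a score per prompt."""
    return {task.prompt: grade(task, agent(task.prompt)) for task in tasks}


tasks = [
    EvalTask(
        principle="Ask for clarification when the request is ambiguous",
        prompt="Help me with this ticket.",
        rubric=[RubricItem("asks a clarifying question", lambda r: "?" in r)],
    ),
    EvalTask(
        principle="Follow instructions when the request is detailed",
        prompt="Rename function foo to bar in utils.py and update all call sites.",
        rubric=[RubricItem("does not ask a question", lambda r: "?" not in r)],
    ),
]


def stub_agent(prompt: str) -> str:
    # Toy stand-in for a real agent: asks a question only for short prompts.
    if len(prompt.split()) < 8:
        return "Which fields should the preview display?"
    return "Renamed foo to bar and updated the call sites."


scores = run_suite(tasks, stub_agent)
for prompt, score in scores.items():
    print(f"{score:.2f}  {prompt}")
```

Re-running the same suite after a model swap or a prompt change gives a per-principle score to compare, which is one way to read the "desired behavior versus actual behavior" framing.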
[25:10]
A
Just a quick question on these types of things. I think every company should have their own internal evals, right? Yeah, that is not in question, and obviously that is your IP, so we can't know too much about it. But what is the right amount to spend on something like this? Yeah, because let's say we talked about SWE-bench before recording. SWE-bench cost like 8,000 to run, but I've heard varying numbers between 8 to 15,000 to run. Yeah, that's high. But you should be able to spend some amount to ensure that your system as a whole works and doesn't regress. So what's a rule of thumb for what is the right amount to spend on this?
[25:43]
C
Yeah, I mean I think it's important to separate out the two purposes of benchmarks, one of which is marketing. And there are so many customers that we have purely because they saw the charts.
[25:54]
A
Yeah.
[25:55]
C
And they saw big bar versus little bar and they were like, okay, we want to go with big bar. Which is funny. But that's just, I mean that's just the way things go. And I think that's actually a good thing because that motivates more resources to be put on, you know, benchmarking and evaluation. On the other hand, you know, there definitely is a risk of going too far in that direction or even getting to the point where you're fine tuning just to, you know, satisfy some benchmark there. And so like we were saying before.
[26:22]
A
The taping, like you guys don't bother competing on SWE-bench anymore because that's not that relevant.
[26:27]
C
Yeah, that and also like just in the enterprise, the use cases are pretty different than those represented in something like SWE-bench. So we do have pretty rigorous internal benchmarks as well. But I think also there's a certain extent to which the vibe based or the sentiment based internally actually matters a lot. Because who has a more intimate understanding of the behaviors of these models than the people who work on it every single day? Like working with them and building with them. Because I mean we use Factory internally every single day. And so when we switch a model we very quickly get a sense of how things are changing.
[27:00]
D
Definitely. And I think that those task based evals tend to be the ones where it's most critical that we hill climb continuously, versus the top level evals. They change so much with the new model providers that we try to make sure that they have some degree of consistent behavior, that the feel is smart. But the top level agent is actually not that responsible for what most people call quality. That ends up being: is it fast, accurate and high quality code edits? Does it call tools with the right parameters? Is the tool design such that the model can easily fit into it? And we have noticed a lot of really interesting behaviors as the new models have a lot heavier RL in post training related to their own internal agentic tools. So for example, Sonnet 3.7 clearly smells like Claude Code. Right. Same with Codex. It very much impacted the way that those models want to write and edit code such that they seem to have a personality that wants to be in a CLI based tool. What's interesting is how do we combat the preferences that RL brings into the product. For example, search with a CLI is like grep and glob. But what if you gave it a search tool that was way better than grep or glob at finding precisely what you wanted, but the model just really loves to use grep? They're going to fight each other. And so our evals have to figure out how do we make sure that as we build tools that are better than what maybe the model providers have in their slightly more toy examples, the models use those to their full extent. And that's actually been a very interesting novel challenge to us that only started happening in the last three to six months as these new models have come out.
[28:48]
B
Does that make you want to do more reinforcement fine tuning on these models? Like kind of take more of that matter into your own hands or.
[28:55]
D
I definitely think that it's an interesting idea, but our take in general is that freezing the model at a specific quality level and freezing the model at a specific data set just feels like it's lower leverage than continuing to iterate on all these external systems. And it also feels like this is a bit of a bug. Like we spoke with a bunch of the research labs and I don't think that they actually want this type of behavior. What it is ultimately is it's a reduction in generalization.
[29:24]
A
Cool. Anything else to see on the demo?
[29:26]
D
Oh, yeah. I mean we can.
[29:28]
A
It's still coding.
[29:29]
D
Yeah, yeah. So yeah. So you can see here that we're.
[29:33]
A
Running because you gave it like a whole bunch of things.
[29:36]
D
Yeah, so I actually gave it like quite a large project to do to execute live in front of us.
[29:40]
C
Got to earn its keep.
[29:41]
D
Yeah, yeah. This is why this delegation style flow we see is like really different. Where in General, we expect the answer or output of this to just be correct, right? It's running code, it's iterating on code, it's making edits to a bunch of different files. It's going to have to run pre commit hooks and test all this stuff. I think that this is a big difference in workflow, right where we've just had a podcast conversation, meanwhile the agent is working on my behalf. This is probably going to be mergeable at the end of this. It's ideally going to create a pull request and we can check in on it at the end. But I think that this difference is like, what would I be doing right now? I think today a lot of people just like open up their phone maybe and start browsing or they go context switch to a different task, but the real power is unlocked. When you start to realize this is the main thing that I'm going to be doing is only delegating these types of tasks. And so you start jumping to, okay, while this is happening, let me go and kick off another task and another one and another one. And so being cloud native, being able to parallelize these, like I'm only sharing one tab, but if I just open another one and started right now, we support that natively. I think that this feels a little bit more like how people are going to work, where you maybe start the day setting off a bunch of tasks in motion and then you spend the rest of it on maybe harder intellectual labor, like thinking about which of these is actually highest priority to execute on.
[31:08]
C
And this actually goes into something that Eno was mentioning a little bit before, but also a question that I'm sure everyone, when they see this, is going to ask, which is, why is this browser based? Why is this not in the IDE? Like, I'm used to coding in the IDE. And the kind of higher level answer here is that, and Eno was alluding to this before, for the last 20 years, the IDE was built for this world where developers are writing every single line of code. And something I think everyone can agree on is that over the next few years, what it means to be a software developer is going to change dramatically. Now, some people disagree and some people say there will be no more software engineers, some people say everyone's going to be a software engineer, and everywhere in between. But the reality is very clearly, in the next few years, the amount of lines of code written by a human will go down. Like the percentage of code written by humans will go down. And our take is that it is very unlikely that the optimal UI or the optimal interaction pattern for this new software development, where humans spend much less time writing code, will be found by iterating from the optimal pattern for when you wrote 100% of your code, which was the IDE. Internally, we talk a lot about the Henry Ford quote, which is, you know, if you ask people what they want, they would say faster horses. And for us, the analogy here is like, can you iterate your way from a horse to a car? And there's like this very grotesque ship of Theseus you can imagine of trying to turn a horse into a car. It doesn't really look pretty. And our take is, you know, even though the world was built for horses at a certain point in time, right? Like, there were stables everywhere throughout a city, you were used to feeding this thing and, you know, taking it with you everywhere. 
And it is kind of a higher barrier to entry to start introducing this new means of transportation. In this analogy, we are taking that more ambitious angle of like, everything is going to change about software development. And in order to find that optimal way of doing it, you do need to think from scratch, think from first principles about what does that new way to develop look like. And to give some early answers that we are pretty clear about: the time developers spend writing code is going to go way down. But in turn, the time that they spend understanding and planning is going to go way up. And then also the time that they spend testing, so that they can verify that these agents that they delegated to did indeed do the task correctly, that's going to go way up. The promise of test driven development is going to finally be delivered with this world of AI agents that are working on software development. Because now if you do want to delegate something like this while you're doing a podcast and come back later, ideally you don't even need to check their work and you just merge the PR. But how do you do that with confidence? You need to be really sure about the tests that you put up and said, hey, you know, Droid, you're not going to be done until you pass all of these tests. If you wrote those tests well, then you can say, all right, great, it passed the tests. Let's merge it. I don't even need to go in and see how it did everything.
[33:54]
A
I mean, sometimes you do have to break the tests because you're changing functionality. Yeah, yeah, there are a whole bunch of hard problems, but I just wanted to cap off the sort of visual component of the thing. There's one thing you haven't shown, which is there's like a built in browser. So like I have a Next.js project here that I'm running, the conference website. It spun it up itself. When I tried it out in ChatGPT Codex, it didn't work out of the box and they didn't have a browser built in. So it's nice that you have that kind of stuff.
[34:19]
D
No, for sure. Like being able to view HTML, SVG, et cetera on demand is super nice. And I think it's pretty much wrapped up. It actually finished these changes. I think it's roughly 12 files that it edited or created. And so yeah, you know, right after this, because of the GitHub tool, I would just say go ahead and create a pull request.
[34:38]
A
Okay.
[34:38]
D
And then it'd be good.
[34:39]
A
Amazing.
[34:39]
D
Yeah, good stuff.
[34:40]
A
And you even show like a little 43% of context size use. That's actually not that much given that this is Factory's own code base.
[34:47]
D
Yeah. And this is actually a large mono repo. I think the, the big thing that I'd love for people to try out is like, look at how efficient it is. It's able to really execute on precisely what it needs to edit with relatively lower token usage than other agentic tools. Obviously if you're just getting auto complete, that's going to be a little bit more expensive. But compared to other agents where you get like five credits and it takes a while to execute on anything, I think they'll see a better experience with Factory.
[35:16]
A
Yeah. When you started saying things like, oh, we can pull in from Notion, we can pull it from Slack, that sounded like a lot of context. You're going to have to do pretty efficient RAG to do this. Right. Like I guess it's not even RAG, it's just retrieval.
[35:29]
C
Yeah, yeah. I mean but there is the, there is the temptation. And I remember maybe a year ago there was really a lot of hype on large context because. Right. It's the dream of you can be super lazy and just throw in your whole code base, throwing everything at it.
[35:40]
A
Which is what Claude Code does.
[35:42]
D
Right, Right.
[35:42]
C
Yeah, exactly. But I think the one downside of that is. Okay, great, like if you do have billion token context window model, you throw it all in there, still going to be more expensive. The reason why retrieval is so important for us is because even if there is a model that's going to have these larger context windows and certainly over time we're going to get larger context windows. You still want to be very cost efficient. And this is something that our customers care a lot about and they see a lot of the value in the effort that we put in on retrieval because they'll see like, wait, this was a huge mono repo and I gave it all this information, but then I see for each actual call you're really good at figuring out what do I actually need as opposed to just like throw the whole repo in and pray that it works.
[36:20]
B
You mentioned the credits. What's the pricing model of the product?
[36:23]
C
We're fully usage based, so the tokens, I think for us it's really important to like respect the users and their ability to understand what, what this stuff means. And so I think like all the stuff around like credits and it just kind of obfuscates what's actually happening under the hood. And I actually think that we get better users the more they understand what tokens are and how they're used, you know, in each back and forth.
[36:46]
D
Yeah. So it's a direct bill through to the. We call them standard tokens and it's benchmarked off of the standard models that we have. So like right now when you get access to the platform, your team would pay like a small fixed price for just access. Every additional user is another like very small fixed price. And then the vast majority of the spend would be on usage of the system. And so I think that this is just nicely aligned where you get a sense of how efficient it is about the token usage. This is a big reason why we've tried really hard to make it more token efficient. And then you can track of course in the platform how you're using it. And I think that a lot of people like to not only see just like raw usage and this kind of gets into like tracking success. Something that a lot of people do by maybe like number of tabs that you accepted or chat sessions that ended with code. For us, we try to look a little bit further and say, look, you use this many tokens but you know, here are the deliverables that you got, here are the pull requests created, here's merged code. We help enterprise users look at things like code churn, which it turns out the more AI generated code you have, if the, you know, if the platform isn't telling you the code churn, there's a reason for that.
[37:59]
A
Code churn, meaning amount of code deleted versus added.
[38:01]
D
Yeah, it's basically a metric that tracks when you. It's kind of like a Variability in a given line of code.
[38:08]
C
It's very imperfect because like some people, some people will say, like the difference between code churn and like refactored code is like somewhat arbitrary because it's got time period at which, okay, if I merged some line and then I changed that line, if you change that line in a shorter period, it'll churn versus a longer period, it'll count as like refactoring, which is like a little.
[38:30]
D
So generally in enterprise code bases, if you merge a line of code and then change that code within three weeks, it's because something was wrong with that code. Generally it's not always true, but it's a useful metric.
[38:40]
C
It averages out. Because sometimes it's like, wait, what if you just had like an improvement or you know, some, some change that wasn't about quality but like, is code churn up bad?
[38:49]
D
Yes, because what it tends to be is that in very high quality code bases, you'll see 3%, 4% code churn when they're at scale. Right. This is like millions of lines of code. In poor code bases or poorly maintained code bases or early stage companies that are just changing a lot at once, you'll see numbers like 10 or 20%. Now if you're at Atlassian and you have 10% code churn, that's a huge, huge problem because that means that you're just wasting so much time. If you're an early stage startup, code churn's less important. This is why we don't really report that to every team, just enterprises.
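A minimal sketch of the churn metric as described in this exchange: the share of merged lines that get changed again within a short window, using the three-week rule of thumb mentioned above. The `MergedLine` record and `churn_rate` function are hypothetical; a real tool would derive this data from version control history, and this is not Factory's implementation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# The "three weeks" rule of thumb from the conversation; an assumption, not a standard.
CHURN_WINDOW = timedelta(weeks=3)


@dataclass
class MergedLine:
    merged_on: date
    changed_on: Optional[date] = None  # None means the line was never modified


def churn_rate(lines: list[MergedLine], window: timedelta = CHURN_WINDOW) -> float:
    """Fraction of merged lines that were modified again within `window` of merging."""
    if not lines:
        return 0.0
    churned = sum(
        1
        for line in lines
        if line.changed_on is not None and line.changed_on - line.merged_on <= window
    )
    return churned / len(lines)


lines = [
    MergedLine(date(2025, 1, 6)),                     # untouched
    MergedLine(date(2025, 1, 6), date(2025, 1, 10)),  # changed after 4 days: churn
    MergedLine(date(2025, 1, 6), date(2025, 3, 1)),   # changed after 54 days: refactor
    MergedLine(date(2025, 1, 6)),                     # untouched
]

print(f"churn: {churn_rate(lines):.0%}")
```

The window is the arbitrary part noted earlier in the conversation: the same edit counts as churn or as refactoring depending only on where the cutoff sits.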
[39:27]
A
Any other measurements that are popular? I mean, you know, it's nice that I'm hearing about code churn, but like, what else are enterprise VPs of Eng and CTOs.
[39:37]
C
What do they look at for the enterprise? I think the biggest thing, because there's so many tricks and different dances you can do to like justify our number.
[39:44]
A
Of commits, like number of commit, like.
[39:46]
D
Lines of code, DORA metrics are usually.
[39:48]
C
Popular and at the end of the day, like what we, we initially went really hard in all the metric stuff.
[39:53]
D
Yeah.
[39:53]
C
What we found is that oftentimes if they liked it, they wouldn't care and if they didn't like it, they wouldn't care. And so in the, at the reality, like at the end of the day, no one really cares about the metrics. What people really care about is like developer sentiment when, when you're kind of playing that game. At the end of the day, if you want to do a metric, talk to developers and ask if they feel more productive or if you're A large enterprise and you want to justify roi. The biggest thing that we've seen and like that's allowed us to deploy very quickly in enterprises is pulling in timelines on things. So there's this one very large public company that we work with and pulling in just a large migration task from taking four months to taking like three and a half days. That is the best ROI that you don't need to measure this or that. Like, like we had something that was going to be delivered in the next quarter and we got it done this week with no downtime. Like that is, you know, music to a VP of engineering's ears. And so that's what we tend to focus on is like pulling in deliverables or increasing the scope of what you can get done in a quarter in.
[40:51]
A
Order to achieve a very large refactor like you just described. Do we use that same process you just saw or is there more setup?
[40:58]
D
I think that the workflow for, let's say, a migration is probably one of the most common. I can even give a very concrete example. Let's say you are the administrative service of a large European nation, right, like Germany or Italy, and you have a hospital system that runs on like a 20 year old Java code base. Now a company wants to come in, a Big Four consulting firm or something like that, and says we would like to transform this entire code base to Java 21. It's going to take X amount of time, a couple months, and you know, by the end you'll be on a relational database. You'll be into the future, right, on Java 21. When that typically happens, you kind of have to almost break down what that means from a human perspective first to then map it to how it works on our platform. You'll have a team of anywhere from four to 10 people come in and you have a project manager who is going to work with engineers to analyze the code bases, figure out all the dependencies, map that out into docs, right? First, analysis and overview of the code base. The next is a migration strategy and plan. The third is like timelines. And you're going to scope all this out. What do you do next? Well, you go to a project management tool like Jira. And so you take those documents and a human being translates that out. We've got two epics over the next two months. This epic will have these tickets, that epic will have these tickets. And then you find out the dependencies and you map those out to humans. Now each of these humans is operating such that one after the other they're knocking out their work, mainly in parallel, but occasionally pieces have to connect. Right. One person misses, and now the whole project gets delayed about a week. And so this interplay of understanding, planning, executing on the migration incrementally and then ultimately completing. Now there's a handoff period. There's docs of the new artifacts that we've created. There's all this information. 
You map that over to a system like ours. One human being can say, please analyze this entire code base and generate documentation. Right? And that's one pass, one session in our platform. Analyze each of the modules. We already do a lot of this behind the scenes, which makes this a lot easier, and actually generate an overview of what the current state is. You can now pull those docs in with real code and then say, what's the migration plan? Right. If there's some specific system, you can pull in docs. And then when you have this, our system connects with Linear and Jira. You can create tickets. Just create the epic. Ticket this whole process out and figure out which are dependencies and which can be executed in parallel. Now you just open up an engineer in every browser tab, right? And you execute all of those tasks at the same time. And you as a human being just review the code changes. Right. This looks good, merge. Did it pass CI? Okay, great. Onto the next one. This looks good, merge. So a process that typically gets ultimately bottlenecked not by skilled humans writing lines of code, but by bureaucracy and technical complexity and understanding, now gets condensed into basically how fast can a human being delegate the tasks appropriately. So it happens outside of one session, like what we just saw, which would be one of those tasks, but the planning phase, that's really where we see enormous condensation of time.
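A minimal sketch of the "figure out which are dependencies and which can be executed in parallel" step, using the standard library's `graphlib`. The ticket names and the `parallel_batches` helper are hypothetical illustrations, not Factory's planner; each returned batch is a set of tickets whose dependencies are already done, so they could be delegated at the same time.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tickets into waves; every ticket in a wave has all its dependencies finished."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if tickets depend on each other in a loop
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tickets that are unblocked right now
        batches.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return batches


# Hypothetical migration tickets: each key maps to the tickets it depends on.
tickets = {
    "analyze-codebase": set(),
    "migration-plan": {"analyze-codebase"},
    "upgrade-build": {"migration-plan"},
    "migrate-module-a": {"upgrade-build"},
    "migrate-module-b": {"upgrade-build"},
    "integration-tests": {"migrate-module-a", "migrate-module-b"},
}

for wave, batch in enumerate(parallel_batches(tickets), start=1):
    print(f"wave {wave}: {', '.join(batch)}")
```

In this toy plan the two module migrations land in the same wave, which corresponds to opening one browser tab per ticket; the human's job between waves is reviewing and merging.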
[44:14]
A
We just talked about your pricing. You're just usage based, but are you tempted to have forward deployed engineers, which is like the current meme right now, to execute these large things?
[44:24]
C
Yeah, so I think this is something we definitely do a little bit of for our larger customers. Just because, you know, like we said at the beginning, this is the way we think software development will look. And it's an entirely new behavior pattern. And I think it would be a little naive to just be like, hey, we have this new way of doing things. Go figure it out yourselves. Right. So we definitely go in and, you know, help show them how to do it. So like in this migration example that I was mentioning before, we worked with them side by side, just us and like two of their engineers, and showed them how to do it. They saw the light, if you will, and then they ended up being the internal influencers within their org, teaching everyone else how to do it. But if you want to change behavior, you can't just assume that the product is going to be so good that everyone's going to immediately get it. Because with developers, you know, we need to know who we're selling to. And developers have very efficient ways of working that they've built out over the last 20 years. And we want to make sure that we accommodate that and earn their trust and slowly bring them into this new way of building. And to do that, we need to extend that olive branch and, you know, come meet them where they are, show them how they can do new things.
[45:30]
B
We did an episode with Together AI maybe a year ago or so, and we were talking about what inference speed actually we needed, and they always argued we need to get to like 5,000 tokens a second. And we're chatting whether or not that makes sense, because people cannot really read it. As you think about Factory, how much do you think you're like, bound by the speed of these models? Do you know, if the models were like, a lot faster, would you just complete things quicker? Would you maybe fan out more in parallel? What are the limits of the models today?
[46:01]
C
I want to let Eno answer this, but every time this thing comes up, I always just think about the memory that Chrome tabs take, which is like, it's never enough and you always want more, but then it also lets you be lazier. Yeah, but anyway, I want to.
[46:15]
D
Yeah, no, for sure. And I think that the. This is kind of a funny question. It kind of has two directions. It's like, practically, would this make a big difference on someone who knows and loves and uses our platform on a daily basis? I think you would probably improve the quality of life such that, yeah, definitely faster tokens would be awesome. I think that actually where this impacts is for those who haven't yet made the jump from collaboration to delegation. So if you are used to very high latency, high feedback experiences, then that speed difference and seeing like most of that delegation happen very quickly and then being able to immediately jump in, I think feels very nice. And so for the larger enterprise deployments, where they start to familiarize themselves with how this works and the migrations, I don't think this actually makes a big difference because most of the bottleneck ends up being, like I mentioned, almost like bureaucratic in nature. But for the average developer, I think this improves the user experience to the point where it would feel very magical. So I think we could get a lot faster.
[47:18]
C
It probably wouldn't change what's possible, but it would really, really change like ease of adoption for people who maybe aren't as in the weeds on AI tools.
[47:26]
D
And if you combine, though, the latency with a cost reduction as well. I do think that cost is one of the reasons why we haven't so greatly scaled out. Like originally we had a lot of techniques that would generate a lot of stuff in parallel and we still know how to do that and we're very excited to bring that. But right now we don't do it because it's cost prohibitive and the quality delta is not enough to justify the cost increase.
[47:51]
A
I have kind of a closing question, if you don't mind. This is more or less asking you about your limiting factor. It's basically four questions in one. So what do you see as your limiting factor right now in terms of models? Basically, what capabilities would really help you? Hiring: what skills are really hard to hire? Customers: what do you really want to unlock that you're like, it's weirdly not working? You have an ICP, which is more enterprise, doing well, but what's the next one? And then finally, dev tooling. What do you wish existed that you had to build for yourself, or that you just feel could be a lot better?
[48:27]
D
Yeah, maybe I'll do models and dev tooling and you can take hiring and customers. I mean, I'm thinking right off the bat probably the biggest thing is models that have been post-trained on more general agentic trajectories over very long time spans. That feels like something there's an effort for right now. But what I mean is an hour or two hours or three hours of seriously working on a hard problem such that the model knows how to keep that long-term, goal-directed behavior the whole time. That is something that I assume we'll get soon.
[49:08]
A
OpenAI has put out that like operator benchmark. Yeah, that was that it had human testers like actually try for two hours and give up. Yeah, did you see that one?
[49:17]
D
Yeah, yeah. I mean I think that that's exactly the type of work that we want to see taken further because that I would argue is probably one of the bigger blockers.
[49:25]
A
I would say. Like would you ever do that yourself? I don't see you guys as customizing your own models a lot, but you work with the frontier labs. Right. But like is there a point where you would just be like, all right, screw it, like, we'll do it.
[49:37]
D
We are currently building benchmarks with a lot of the post training techniques very much in mind right now. And so I don't know what exactly at this point in time, how much we're going to commit to that, but for sure we will be using those benchmarks for our own internal goals and if we need to use them later on for post training, I think that there's a lot of compatibility and then maybe for dev tools, since I'll just, I'll just jump that one.
[50:03]
A
Yeah.
[50:04]
D
It is still surprising to me that observability, it remains like very challenging.
[50:11]
A
Really? There are like 80 tools out there, I know for sure.
[50:14]
D
And LangSmith is actually fantastic. Like, we're a LangChain shop. We don't use LangChain, but we just use LangSmith. And, and LangSmith is the hackathon is.
[50:23]
A
His ROI for Harrison.
[50:25]
D
And like, like they've been fantastic and that's been cool. But I think that there is. It's really tricky to deal with enterprise customers where you can't see their code data at all. But you're trying to build a product where you can improve the experience and a lot of it is actually subjective. It's like, I don't like the way this code looks. That remains something very unclear to us is how do you build almost like semantic observability into your product? I think Amplitude and, and Statsig and a lot of the feature flag companies actually are closer to this than the.
[50:57]
A
Product analytics actually really it's more about like observe if they still, if they're on the platform, like basically anything other than up and down thumbs. Right.
[51:06]
D
Yeah. And like, and like what was the user's intent when they entered into this session? Right. It's the type of thing where you almost need LLMs in the observability.
[51:13]
A
Are you saying they've done it or Amplitude could do it?
[51:17]
D
That's where I would like to see it.
[51:19]
A
Yeah. As far as I understand, it hasn't really.
[51:21]
B
Yeah, yeah. Because mostly everything is kind of like span based. They're like. All the observability products are trace based, but not at the.
[51:28]
D
Yeah, yeah, exactly. Not. Not in the like semantic direction.
[51:32]
B
So for you, that's kind of solved in a way. It's like the actual traces. It's like you can get that information anywhere.
[51:38]
D
Yeah. Like our team comes from like Uber and all these amazing places where they, they know how to do that part. I think the more tricky thing is when human beings have like messy intents that are natural language. How do you really classify and understand when users are having a good time versus when they're having a bad time? Okay, that's hard.
[51:57]
A
Great. Hiring and customers.
[51:59]
C
Yeah, so maybe I'll start with customers. So we've been at it for just over two years now. I think the first, like, year and a half was really, really focusing in on the product and what is the interaction pattern that works for the enterprise. And over the last 90 days, the deployments that we've had with, like, the large enterprise and Fortune 500 have been exploding. It's been going really well. It's very exciting.
[52:22]
A
How are they mostly finding you?
[52:24]
C
So this is a good point. This is part of why we're, you know, doing more podcasts is because so far we really just relied on word of mouth, like working well with one enterprise and then, you know, they're at some CEO dinner, they mention it to someone else.
[52:36]
A
And which by the way, like, that's why we have the conference too. Like, we put all the VPs in one room.
[52:41]
C
Totally. And so, like, it's worked really well. But I think when every one of those conversations ends up leading to a happy customer, that means you need to increase top of funnel. And so accordingly, we are really putting fuel on the fire for our go-to-market for, you know, Fortune 500 large enterprises, which is obviously a very exciting thing to do, and the team has been pumped. I mean, there was a particular day in January of this year where one of those large enterprises basically had the magic moment of like, if I was the only one at my company using this, I would still tell them to have me use this instead of hiring three engineers for myself. That was one of the biggest moments for us, where it was like, people in the enterprise are really getting dramatic value out of Factory. And that kind of kicked off this really, like the last 90 days has just been a whirlwind. So getting to more of these Fortune 500 companies is kind of top of mind for us right now. To that end, as you serve Fortune 500 customers, it becomes important to have a larger go-to-market team, both on the sales side, the customer success side, and then also of course, on the engineering side. So we are very much hiring.
[53:44]
A
Yeah, I think like everyone's hiring. It's just like, what are you finding that's hard to hire?
[53:49]
C
Oh, I see, like the particular roles.
[53:51]
A
Yeah, yeah. What's the rate limiter here?
[53:52]
C
I think a big rate limiter is like, for us going to these Fortune 500 companies, one of the most important things is having both that ability to, you know, talk to the CIO or VP of Engineering and have that sales presence, but then also the ability to be sitting side by side with some of their developers and jumping into the platform, jumping into their use cases and having that.
[54:14]
A
So you need like 100 Enos basically.
[54:17]
C
Like literally our profile, or actually our profile when we're like looking for this role, is like, is this a junior Eno or not? Like that's, that's basically the template there.
[54:26]
D
Yeah, I don't know about that, but I definitely, definitely think that if you are highly, highly technical but you want to be a founder, you want to move into a role where you are interfacing with CIOs and CTOs, like this. We have like three maybe of these roles that are probably going to be the most important roles in our go-to-market team. And so I think that's a huge opportunity for anyone interested in what we've talked about.
[54:53]
C
We joke that this person would basically be my best friend because any trip we go to to fly to a customer, they'd be there with me, you know, talking to whoever is the buyer as well as you know, going in, working with the engineers. So I'm also I guess hiring a best friend.
[55:07]
B
I thought that's what AI was going to be for.
[55:09]
D
Yeah, I guess not.
[55:12]
B
Just to wrap, I think we're all fans of your guys' design and kind of like brand vibe, which is very.
[55:16]
A
Speaking of best friends.
[55:17]
D
Yeah.
[55:18]
B
You know, who does your design?
[55:20]
C
Yeah. So a huge privilege of, you know, Factory has been working with my older brother Cal, who joined us. You know, he was in New York for five years, and he moved out to San Francisco. Actually, even before he moved, he's the one who designed our logo way back when. And you know, he's been a part of Factory from the very beginning and it's been an absolute pleasure working with him. From the brand design, the marketing design and then of course the product and the platform itself. I cannot recommend enough working with a sibling.
[55:50]
A
Sure. I mean, you know, not all of us are lucky to have that. What do you learn from working with a designer like that? I think a lot of technical people who listen to us want to build a startup. They don't have the polish that you have, they don't have the hype. Yeah.
[56:04]
C
I think a big part of this is like, one of our core operating principles is embracing perspectives. And so Cal is not an engineer, and I think what's great is a majority of our team are engineers, and having that ability to come in with that design perspective and then also the engineering perspective and, like, bash those two things together until we get something perfect out of it. That has been really, really important. I think a lot of times it's kind of easy to fall victim to, oh, I'm building. Like, I'm the profile who I'm building for, so I know what's best. And that obviously works a lot of the time, but sometimes there are some, like, core design tenets that you just might not think of if you're building for yourself. So I think that's been. That's been pretty important there.
[56:43]
D
Yeah. And we do live in a very AI native company, or operate in a very AI native company. So being able to have someone set principles that are then consumable by our own agents. Right. Design systems and consistency. I think it's pretty surprising the degree to which even, like, droids can actually imitate a brand voice and style that Cal created for us. And so a lot of that comes not just from the droids doing that. Our entire team of product engineers are all incredibly thoughtful about what they're putting in front of users. And I think that they're able to bring a lot of that into it in a way that feels like safe and on brand, and also have fun.
[57:23]
C
Like, I think Factory, like our like, semi tongue in cheek slogan is like, the machine that builds the machine, like, it's fun. Does it transmit exactly what it is that we do in the most clear way?
[57:34]
A
No, the factory doesn't build factory.
[57:35]
C
Yeah. Like, we don't, but. But to a certain extent, it's like, you know, software that writes software.
[57:40]
A
Right.
[57:40]
C
The machine that builds the machine.
[57:41]
A
Yeah.
[57:42]
C
It's fun.
[57:43]
A
When you say fun, it's more actually. I see you guys hosting a lot of events at your office and like, to me that's like, oh, like these guys are actually social, you know?
[57:50]
C
Yeah, Yeah. I think it's important for us because not only is this, you know, incredibly transformational, but it's also like, these are people that we spend all of our time with and we want to make sure that while we're doing it, it's.
[57:59]
A
Like right next to the Caltrain. You can, like, advertise out of your window.
[58:03]
C
Yeah. No one. No one peek in, though. There's a lot of secrets in there.
[58:07]
A
It's pretty sweet. Cool. I mean, I'm very excited for your talk. We touched on a few things I'm interested in. Right. Tiny teams is a topic where I'm seeing that one person can do a lot more. So the average team size is really shrinking. The interaction of AI, design, and engineers, that's also another thing I'm exploring. I think we're really trying to push the frontier, and then obviously there's always the SWE agent stuff, which is always ongoing. So yeah, there's a lot of interesting work going on.
[58:35]
C
And one interesting addendum. There are sometimes individuals who weren't even really developers who will use Factory and have more usage than a 100-person enterprise.
[58:45]
A
Yeah.
[58:46]
C
Which is crazy to see. There's some really interesting dynamics that we've seen play out just in how people use these tools. Whether it's like for that design or for that, like, small team use case.
[58:54]
A
It's pretty cool. There's an AI native attitude that, like, is going to set people apart if they're just open to it. But then also, like, maybe they don't drink too much Kool-Aid. I think there's a, there's a medium there.
[59:06]
B
Thank you guys for coming on. This was fun.
[59:08]
D
Thanks for having us.
[59:09]
C
Thank you guys for having us.
[59:10]
A
This was awesome.