![[State of Evals] LMArena's $1.7B Vision — Anastasios Angelopoulos, LMArena — Latent Space: The AI Engineer Podcast cover](https://substackcdn.com/feed/podcast/1084089/post/186610584/7b528f473ec8e5562101a90c6d5abf32.jpg)
A
Latent Space.
B
All right, we're here with Anastasios from Arena. I actually don't know your last name.
A
Even a. It's like, very Angelopoulos.
B
Yeah, yeah.
A
There you go.
B
Congrats on all the success. You got the arena handle.
A
Yeah, we did. We got the arena handle.
B
Thank you.
A
Big branding moment.
B
I think X is like being more commercial, so obviously you bought it, but at least you have a place to go to where you can be like, hey, we really like this. But I do think dropping LM has changed the feel of it. I don't know how you feel.
A
Yeah, I don't know. I mean, the reason we kept the LM at the beginning is because we started as LMSYS, right, out of the LMSYS, sort of, you know, conglomerate at Berkeley.
B
So we decided to. Language models.
A
Yeah, exactly.
B
So.
A
So we wanted to maybe broaden a little bit. Yeah. And we were the first arena, so we feel like let's kind of try to own that.
B
Yeah. Last time you. We had you guys on, you hadn't really spun out yet. And we. I did a call with Alessio and I was like, these guys are going to start a company. And I didn't know. I think you actually were already started at the time.
A
I don't. I don't remember.
B
Maybe because Anj... I chatted with Anj and he said he was your founding CEO.
A
He was indeed.
B
Which, like, people don't know. Yeah, Anj is a very interesting character. We have a podcast scheduled with him. Yeah, yeah. He does a lot more than normal VCs.
A
He does. He's been incredible to us.
B
You want to shout out some stuff that you did?
A
Yeah, absolutely. So, you know, the way the company started was as an incubation by Anj.
B
Yeah.
A
So what he did is he kind of, like, found us at Berkeley and picked us out of the basement and was like, hey, these guys seem like they're onto something. And he started working with us really early. He gave us some grants. You know, a16z was not the only one who wanted to do this. We also had a great grant from Sequoia, but Anj was, in particular, quite supportive of us and, you know, gave us some resources in order to continue building out Arena before we even were committed to starting a business. And in that capacity, he sort of, like, you know, formed an entity for us and this and that. And, you know, he was like, hey, you guys can walk away at any time if you guys don't want to start a business. I mean, it was really incredible, a very aggressive investment move by him. Right. Because of course, any money that he spent, at the end of the day, like, we could walk away and leave him with nothing. But I think he knew wisely that the right thing for Arena was to start a company out of it. It was the only way that we could scale, and that, you know, Wei-Lin, Ion, and I would ultimately see that and be excited about doing it ourselves, which ended up being the case.
B
Was there a moment for you where you were, I'm sure you were debating it yourself. You had other opportunities. What was the deciding factor for you?
A
It became clear that the only way to scale what we were building was to build a company out of it. That the world really needed something like Arena. Arena being really a place to sort of measure, understand and advance the frontier AI capabilities on real world users, on real world usage based on organic feedback. And that in order to achieve the scale and, you know, distribution necessary and the quality, of course, of the platform necessary to do this effectively, we would need to start a company out of it. You know, we considered other options. Are we going to keep doing this as an academic project? Are we going to do it as a nonprofit, blah, blah, blah. But ultimately, under those constructs, we didn't feel like we'd have the resources necessary to accomplish our mission.
B
So you raised 100 million or 80? 100. $100 million. That's a lot of resources.
A
Great.
B
Yeah. What's it for?
A
Well, you know, obviously on behalf of.
B
Like, everyone who's like, yeah, everybody. Dude, it's an Arena.
A
How are you going to spend your money? Yeah, yeah, it's great. Yeah, I think so. First of all, we don't necessarily need to spend all that money. Right? The, the purpose of money at a company is to give you cards to flip. It's to say, hey, you. You have enough resources necessary so that if your first bet fails, you can make another bet and another bet, of course. So that's not to say that we're going to spend all of it. Of course, you want to spend things responsibly. Having said that, the platform is actually quite expensive to run. We fund all of the inference on the platform. You know, the way, the way that it works is the platform.
B
You pay market rates, they don't give you.
A
No, no, we get discounts, but they're. But they are standard enterprise discounts, the same that would be given to any other customer.
B
Can you disclose any, I don't know, any numbers? I see numbers of votes. But, like, what's that in, like, monthly tokens or. I don't know about tokens.
A
You know, I'd have to back that out. But we have, like, you know, let's say, again, this is off the cuff, but I can safely say we have more than 5 million now. 5, 6 million. We have probably 250 million conversations that have happened over the course of the platform. We're on the order of, you know, mid tens of millions of conversations every month that are happening on the platform. It's actually quite large, you know, it's one of the largest consumer platforms for LLMs. Of course nothing is really comparing to.
B
ChatGPT but you know, but no still I mean I think the largest skilled ones I would benefit of this.
A
Is that like it's actually quite a diverse population. So 25% of the people on our platform, for example, do software for a living. Still at this scale.
B
How do you know?
A
Because we like do all sorts of. We either survey them or we'll like analyze the prompt distribution that's coming into the platform. Happy. Also to share more, we've done something called Expert arena which is like trying to understand the distribution of experts that are coming to the platform.
B
A lot of that can be like unauthenticated or whatever usage it is.
A
But about half of our users now are logged in.
B
Yeah.
A
And so we have some ability to understand them. And we also have surveys that we run on the platform that tell us a bit more about who are the actual users. Of course there's always response bias in surveys so you have to take it with a grain of salt. But nonetheless.
B
And if you don't know Anastasia's background, he's the guy to correct for response bias. Yeah. Okay.
A
There's a lot of guys like that. And girls.
B
And girls. You're not the only player. There's Artificial Analysis starting an arena.
A
You.
B
Yupp AI. Yep. It's like some crypto people that started this. I don't know if you have a conversation with them on, like, well, hey, this is our thing, or, like, let's work together on something. I don't know if you've. What the.
A
No, you know. So I've talked to the artificial. I've actually talked to both groups.
B
Both seem, you know, am I missing any major players? It's just those.
A
No, I think those. I don't know. Yeah, I don't think so.
B
Okay.
A
I think those are, like, some, of course large depending on how you define the term. Artificial Analysis obviously has, like, huge market mind share around the analysis of different AI systems.
B
Yeah, they told me they were just like, we are going to be the Gartner of AI.
A
Yeah, that's kind of their goal. And I think they're going after that consulting market and so on. Artificial Analysis, from what I understand, is like a team of consultants that are doing this. They seem like really, really nice guys going after that particular market. It's different from our platform in the sense that their analysis is based on sort of, like, aggregating public benchmarks and turning those into analytics. Independently rerunning, and independently rerunning.
B
Which.
A
Yeah, yeah, yeah, which matters. Yes. And using those in order to compile sort of reports and so on that educate the field on the performance of all these different models.
B
But they also have arenas.
A
They have arenas, but the arenas are not based on organic usage. Like, the thing that distinguishes our platform versus theirs is that the users are actually inputting their own use case. They're actually asking their own question. And that gives a level of realism that their platform doesn't have. Of course they specialize in a slightly different thing, but I see those platforms kind of diverging in that sense.
B
Yeah. And sometimes it's the only way to do this. So, for example, for AA, their video arena, it's pre-generated videos. You can't enter in your own video.
A
That's correct. But we're doing it organically.
B
Yeah, exactly. And so as a voter, it does help in terms of like, I don't have to wait.
A
It does. But also why would you go? Do you actually care about like other people's videos?
B
Form your own intuition maybe.
A
Yeah. Maybe you're like interested in comparing like.
B
I'm a shitty prompter, right?
A
Are you? I don't believe that.
B
I'm not a terrible prompter. I learned by example.
A
Don't denigrate yourself. Don't denigrate yourself.
B
There are many prompters that are much better than me, let's say that. Right. That's a fact.
A
People have all sorts of cool ways of prompting LLMs. Yeah.
B
It's educational to say the only way to learn is by like, well, look at other people's prompts and see the results and go like, oh, I didn't know you could do that. And that's hard.
A
Totally.
B
Yeah, yeah, yeah. Okay, so let's come back to Arena. Oh, one thing I do want to say is the number one use of funds is getting off Gradio.
A
Oh yeah. Yes. Well, listen, Gradio, incredible platform. Gradio scaled us to a million MAU. Yeah, that's incredible.
B
And of course, you tell the Hugging Face wants that.
A
Of course. Yeah. We're really, really grateful to Gradio for taking us so far. Eventually, you know, it became time for us to move off of that and go to a React.
B
I'm sure Hugging Face would have loved you to stay on.
A
They would have, I'm sure.
B
Was there a technical, like, reason you just couldn't get the performance?
A
Yeah, it just became hard to develop. And there were all these tools that we wanted in React and, you know, to do all the fancy things that you can do in React became kind of like one example.
B
I don't know what's a feature that you really wanted?
A
Let's say we wanted to create like our own custom, like loading icons for video with notifications.
B
Okay.
A
How are we going to do that in Gradio? It's hard.
B
Yeah. I mean, we'll make a custom component.
A
I'm sure the Hugging Face guys are going to come in and say like, hey, you can do that in Gradio, which maybe you can, but also fewer developers know. How are we going to hire for that? We have to reskill them. People are less familiar with that stack, you know.
B
So anyway, full React next year. So all this.
A
Yeah, yeah, yeah, all that.
B
Okay, cool. Other use of funds that might be interesting. Like, you know, basically.
A
No, that's basically it.
B
Resources. Okay.
A
Yeah, it's primarily inference that funds the free usage of the platform. And then also hiring, of course. Headcount.
B
Yeah.
A
We have an office, you know, that's in SF.
B
I'll tackle one of the major things this year, which I'm sure you're tired of thinking about, but I. For people who are not in the loop, this is going to be news to them. The Leaderboard illusion, the whole thing with Cohere. Let's summarize the. What they said and then your response.
A
So Leaderboard Illusion is a paper that critiques LMArena, and the main.
B
Pretty, like brutally.
A
Well, you know, I would say unscientifically.
B
And let's be clear, Cohere wasn't doing.
A
That well on the. Cohere was like 74. It's all good. You know, it's actually a respectable place that they had on the leaderboard. I don't even think it was really the Cohere people, like the Cohere model developers, doing this. It was more their research side. But in any case, what does the Leaderboard Illusion say? The claim is that LMArena was doing this undisclosed, quote unquote, private testing on our platform, that model providers will send us pre-release models and we'll expose them and so on and so forth, and that this creates so-called inequities in the leaderboard, you know, due to that pre-release testing. For example, they cited that Meta at some point tested a number of models with us. Of course we can't disclose all of the details of how all that was done, but that is the main claim of the paper. Now, our response to that paper is online, you can find it on response to Leaderboard Illusion. And our response to that paper is essentially pointing out a series of factual mistakes in the paper that call into question the validity of the claims. So you can go look at the first version of the paper on arXiv yourself and you'll see the claims. I think most scientists would view that.
B
As, oh, they've corrected it.
A
They've corrected, of course, because we, I mean, but they didn't correct everything. They just corrected some aspects that were just blatantly unscientific and false. But, you know, for example, they said that we only sampled like 9% open source models and like, you know, 60% closed source models, and that this created a gap between open and closed source. But in reality we're actually really supportive of open source models and it was more like 60, 40. And so that was, you know, one of the examples of an error in the claims. Another example is that they were claiming that there was some sort of bias introduced by this pre-release testing and that it was undisclosed. And the reality is, as you probably know, we've been doing this pre-release testing for a long time. Our community loves it. They love basically getting, like, secret code names. Yeah, the secret code names like Nano Banana. Yeah, all that. So Nano Banana, by the way, started on us. Right. And people loved it. It went, like, global sensation. Like a non-zero fraction of the global population.
B
Using talk to Lena about naming it Banana. Or was that no decision?
A
No, it was, it was their decision, I believe. But it was sort of this randomly generated thing and it just went no, no.
B
So apparently Naina, who's a PM.
A
Yeah, yeah.
B
It's named after her because her nickname is Nana. Oh.
A
Oh, that's sweet.
B
I didn't know that Nana put Banana on it. Yeah.
A
Oh, that's sweet. Yeah, I didn't know that. Didn't know the origin story. Yeah. I mean to us it just looked like sort of a random thing but.
B
It was, like, clearly head and shoulders above, which, like, before that there was Reve Image, remember?
A
Yeah, I do, of course, yeah.
B
And also of BFL and all those.
A
I mean all those models are also great. Yeah. And I think those teams are also improving quite quickly. But Nano Banana was a sensation. I mean, that moment alone changed Google's like roadmap. Yeah. Market share.
B
Yeah, seriously.
A
I mean, Google stock, billions of dollars are moving because of Nano and now.
B
There's like an OpenAI code red and everything.
A
I don't know about that. But yes, The Information reported this.
B
I would say, like the image generation, I would say has been like this weird part of AI overall because it's not strictly AGI critical. Like it's not reasoning, it's not feeding more context into the model. It is the model generating a visual representation. So it's basically, I always think like, well, Gemini used to get a lot of complaints for generating racist images or whatever.
A
Yeah, that was a hilarious moment.
B
And ChatGPT also had it in the past. And I'm like, well, can we just get rid of this? Do we have to do image generation? Because let's just focus the positive reputation of AI in general on language models and coding and the other stuff. But I'm wrong. I'm such a huge Nano Banana Pro shill.
A
Yeah, I totally agree. I was also kind of wrong about this. I didn't see the positive benefits. But actually I think that these multimodal models are going to become some of the most economically valuable aspects of AI, both in consumer and also in enterprise. Because one of the fastest growing market segments in AI adoption is marketing and design.
B
Yeah, ads. And so I'm a content creator.
A
Right, yeah, of course. I'm sure you're using it all the time.
B
Infinite supplies of diagrams and explainers and totally infographics.
A
Yeah, soon we're not going to be even making the papers. Yeah, they're just going to be our paper figures are going to be made by analysis.
B
Yes, yes, I do think they actually one-shot. So DeepSeek came out with V3.2 recently. I took their explanations, which are very wordy. They write very concise papers. It's 23 pages long, but it's very dense. And so I just took their explanations of the RL environment stuff and I fed it into Nano Banana Pro and it spat out an image that I used to understand the paper better. And the fact that I can just casually generate a paper-quality diagram that would usually take a PhD student a month in Photoshop or something to do is incredible.
A
Yeah, it's incredible. It is amazing.
B
I want to ask about your principles running Arena. I think you manage a giant community, 5 million MAUs. What have you decided are the core principles? I guess before becoming a company and now that you're a company, I don't know if there's anything that's changed for you.
A
I don't think anything has really changed. We want to provide the North Star of the industry and center the use cases of real users, foreground those so that people know what to target. The goal is to create a benchmark that is constantly fresh, that does not suffer overfitting because of the fact that we constantly have new data points coming in that tracks all the different new models, all the different new use cases of AI and gives the whole world sort of ground truth for how real users are using these models and how good they are on those use cases. We continue to do quite a few open source data releases. We've probably released more data than basically anybody on the real world use cases of AI. Millions and millions of conversations, real world conversations from real users that the community's using to study and improve on.
B
Yeah. And then I think, like, in terms of what you will build versus what you will not build. I guess I'm not necessarily caught up on everything that you've launched. I know recently you've done the dev or code arena.
A
Yeah, code arena.
B
That's the most Code arena.
A
Expert arena.
B
Yeah, Expert Arena. So basically, like, what is in the critical path for you, let's say for next year, and what have you decided you'll never do?
A
So let me first talk about things that I'll never do. The platform integrity comes first. Basically, the public leaderboard that we show on LMArena I think of as a charity; it's a loss leader for us. We don't really make money on the public leaderboard. You can't pay to get on the public leaderboard. It's not like a Gartner in that sense. It's not like any of these, like, you know, pay-to-play systems. It's never going to be like that. Models are going to be listed on the leaderboard whether or not the providers pay and whether or not they're getting a good score. They can't pay to take it off either. That's very important. And so what that means is that the leaderboard has a certain integrity that will never be compromised. Of course.
B
But not all preview models will make it onto this.
A
No, but that's okay. Those preview models have never been released. Yeah, right. Who cares about putting unreleased models on the leaderboard? The point is that for every released model, the score that you see on the leaderboard is statistically sound. It reflects the real-world capabilities of the model. Why? Because millions of people from around the world have voted for it and that's where that number comes from. All we do to compute that number is, millions of people are voting, we take those votes, and we turn them into a number. That's always going to remain a transparent and fair reflection of model performance. Where are we going? Lots of different new categories. I don't know if you recently saw, we exposed occupational and expert categories. We're at millions to tens of millions of users, right? So single-digit percentages means a lot. A single-digit percentage of our user base are in medicine, in legal, in business, you know, finance, accounting, creative, marketing, stuff like this. And we're able to show the performance of these models in all these different verticals because we have all these users in our user base. And we're working more towards multimodal, you know, video. We're soon to launch on the site at some point, you know, later this year or early next. So lots of things in the pipeline.
B
Amazing. Would you expose an API?
A
We've thought about it, yeah. I think it's a, it's a possibility, yeah.
B
What are the counter arguments? Why not?
A
Well, there's obviously a need for an API. The question is more of focus of our company just because we're a startup and so we really should be doing one thing well.
B
Arenas.
A
Yeah, arenas. So I'm not sure how far we want to sort of splay out and on what timeline we'd want to do that.
B
Yeah. Any other sort of like community management tips more broadly? Every AI company really wants to grow their community. You're obviously one of the strongest in the world. What's really, really worked well.
A
So first of all, I want to give a shout out to our community manager, Greg, who is doing an awesome job managing our community. Whether that's on Discord or on LMArena. He's really incredible. So I would say hire Greg, but don't. Don't hire Greg.
B
Don't hire Greg.
A
Don't hire Greg.
B
Find he's ours.
A
Find a Greg.
B
Find a Greg.
A
But in general, you know, the question of how do you get to so many users?
B
That is a tough question.
A
And keep and retain them. That is a tough question because consumer is one of the hardest markets in the world. There's a lot of websites in the world that people can go to, you know, why should they go to yours? And the reality is, if you want to create a really dominant product, you have to provide people value. And to be frank, I don't think we're all the way there yet. It's not like I have the solution and answer for how to build a great consumer product. If I did, we wouldn't be at tens of millions of users. We'd be at hundreds or, you know, we'd be at a billion users. We'd be like, is there a world.
B
You like are bigger than ChatGPT?
A
I don't know. I don't know that we need to be. Yeah. And I don't know that we ever will be, because that's an extraordinary generational product that they built. Right. It took a lot of time. And then to some extent, it also involved luck. There's a lot of lightning-in-a-bottle moments, like Nano Banana was for us, where our user base just, like, goes up by a lot. But when those users come, they can just as easily leave. So the way I think about it is every user is earned. You have to earn them every single day. They can leave at any moment. They're fickle. And so all the time you have to be thinking about, how do I provide this person value, learning how are they using my website, what more could I give them? And how do I build in all the retention mechanisms so that they stay? And then they're also bringing their friends.
B
Is there one that's working in terms of retention? Like you said, a lot of people are signing in now.
A
Yeah, sign in was a big driver of retention.
B
No, no, but what did you give them in order to encourage, like, history, persistent history. That's it. That's enough. Yeah, that's.
A
That's one thing that has had a big impact.
B
Okay. Yeah, cool. What do you want from people? What are you looking for help on, like, any call to action?
A
Yeah, we are always looking for people to come and join us. If you are one of the best people in the world in your area, whether that's consumer product, whether that is machine learning, whether that is, you know, B2B go-to-market, marketing, all these things, we need you at Arena. We're building, like, a high-performance team of real experts in everything that they do. And, you know, I'm always looking for excellent people to work with.
B
Do you need, like, what about partnerships? Right. Like, let's say I'm Cognition. I want to partner with LMArena, or just Arena. What works for you? What existing partnerships do you already have?
A
That.
B
That's really fruitful.
A
Yeah. So, I mean, we of course partner with all of the major model labs.
B
Yeah. And that's. This is straightforward. Like, hey, we have a new model here.
A
Here you go. Exactly. So I think the most straightforward thing would be, for someone like Cognition, it's like, let's evaluate that agent.
B
Yeah.
A
But we should be continuing to shape our. Well, Code Arena is an agent evaluation.
B
Is true.
A
Is true.
B
And focused on like all these arenas tend to focus on the model rather than the harness. So.
A
But that maybe should change. Maybe we should be evolving towards that direction. And I think the Code Arena is a good example of an arena that. A full-featured harness, like a Devin. Yeah. And so in my view, if I'm talking to Cognition, I'm saying, hey, let's get Devin on the arena and figure out how to loop together the Devin harness so that we can. I'm sure that there's something that could be really valuable there, especially given Devin. Last week, people were talking about Devin being dead. Did you see that?
B
Yeah.
A
People were saying, devin's gone. Devin's not gone.
B
Devin's everywhere, but people.
A
So can we highlight that for people and show them, hey, Devin is actually the best or one of the best in the world at doing what it does. LMArena can actually do that. And our place as a central evaluation platform allows that to happen. Yeah.
B
Love it. All right. Thank you for owning the state of evals.
A
Thanks so much.
B
Congrats on a wonderful year.
A
Appreciate it. Congrats to you too. Congrats on all the growing, you know, momentum in your podcast and in your career.
B
Thank you.
A
It's really impressive to see.
Latent Space: The AI Engineer Podcast
Date: January 6, 2026
Guest: Anastasios Angelopoulos, LMArena
Host: Latent.Space
This episode features Anastasios Angelopoulos, co-founder of LMArena (now simply "Arena"), a leading consumer platform for evaluating AI models through real-world usage and feedback. The discussion explores Arena’s origins, vision, mechanics, funding, competition, recent controversies, and principles for building a trustworthy evaluation platform at scale. The conversation covers behind-the-scenes decisions, the role of community, and Arena’s outlook for the future as a central evaluation and benchmarking engine for the AI world.
This episode offers a transparent, insider look at Arena—how it became the go-to place for LLM benchmarking, why organic real-world evaluation is crucial, and how the founders insist on integrity and openness even as the product and community scale rapidly. With ambitious plans, embrace of multimodality, and community-driven features, Arena aims to sustain its impact and trust as the “North Star” for AI evaluation.