
LMArena cofounders Anastasios N. Angelopoulos, Wei-Lin Chiang, and Ion Stoica sit down with a16z general partner Anjney Midha to talk about the future of AI evaluation.
A
Thanks for listening to the a16z AI podcast. We have a fascinating and lengthy discussion for you today, so we'll keep the introduction brief. If you're familiar with the world of generative AI models, you're likely familiar with LMArena, the leaderboard and competition space created and managed by a team at UC Berkeley. What began with a focus on language models has since expanded to cover vision models, coding models, and more. And very recently, the team behind LMArena announced they're starting a company to scale the project's reach and its impact. They want to amass a global community of AI users and use their collective experiences and ratings to make AI models more reliable and to help everyone find the right model for the right use case. So without further ado, here are LMArena founders Anastasios N. Angelopoulos, Wei-Lin Chiang, and Ion Stoica discussing the state and future of AI evaluation with a16z general partner Anjney Midha. They kick off the discussion with the importance of mass-scale, real-time testing and evaluation, right after these disclosures. As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. For more details, please see a16z.com/disclosures.
B
Yeah, yeah, that looks great.
C
Sometimes I get asked, what's the last exam that AI should take for humanity? And it seems like that's the wrong question to ask. We should be asking, what's the real-time exam you want your AIs to be taking before they get deployed, every hour, every second of the day? I think one of the things that's emerging for me is that Arena is misunderstood, partly because we're just early in AI. While benchmarks like MMLU and the idea of these static exams were useful three years ago, the future is about real-time evaluation, real-time systems, real-time testing in the wild. Now, one thing that concerns a lot of people is the reliability of these systems when we start going from chatbots that are good at, let's say, companionship and more consumer use cases to mission-critical systems: defense, healthcare, financial services. How will Arena have to evolve as we go beyond companionship or web dev to those kinds of mission-critical use cases?
B
I think that's one of the very reasons we wanted to create a company to support this project, to further scale the platform. Right now we are at a million monthly users. Now what if we scale it to 5 to 10, or even more, to capture an even more diverse user base across different industries? In that case we'll have the ability to really zoom in on all these different kinds of areas that people really care about, the mission-critical tasks these models will be used for.
D
Yeah, you can imagine that when we scale, we can have micro-sites for nuclear physicists, radiologists, and so forth. Right. And these experts are going to come there to get the best answers to, again, their research questions.
C
So that's interesting. Is there a future where, now that Arena is becoming a company, you could see a scientific lab or a shipping company or a defense company deploy their own arena on their own infrastructure, for their own users or their own prompts, essentially?
E
Many people have asked us for this already.
C
So these would be sort of private arenas, private evaluation.
E
And it's worth saying, I think when people have these mission-critical industries in mind, they are often thinking about the factual nature of the responses and so on and so forth. But in reality, even in such industries, the majority of questions that people ask are subjective. Okay, so the mythology that in hard sciences or in mission-critical industries people just have cut-and-dried questions and just need a retrieval and a lookup, that's completely false. The very reason these models are useful is that they allow you to sort of interpolate between these weird questions, answer questions that are not fully specified, and give responses that are geared to answer the question but might not have a fully factual basis. Right. And they might incorporate factual elements through RAG, let's say. But there's a subjective nature to the response, and that's a reality that everyone's going to have to live with. If these systems are going to be deployed in medicine and defense and so on and so forth, they're going to be deployed in places where the data is messy, because that's where they're useful. Okay, given that fact, how are you going to make sure that they're reliable? Well, you need something like Arena.
C
You know, at this point Arena is hard to miss in the AI space, whether it's Grok 3 coming out and Elon putting it up for the bulk of the keynote, or Demis using the WebDev Arena scores to demonstrate how good Gemini is. It's sort of become the standard bearer for evaluation and testing at all the big labs. But does that mean you guys have been helping them more than open source labs or smaller labs?
E
No, we work with model providers, small and large. Okay, so we work with basically anybody who wants to work with us and, within our constraints, try to be as helpful as possible. The fact is that part of the reason to build a company is that we don't want Wei-Lin having to serve all of these requests from people manually himself. So it's been a challenge, but we try to scale as much as possible. And in fact one of the things that we help everybody do is pre-release testing of their models. Okay, so it's not just that we work together to evaluate the models that are released, but we also try to be their release partners and say, hey, can we help you guys pick the models that do best on our user base, and use that as a guideline for which models they should actually release to the world. So that's the way our platform works, and that's getting us closer to something we've talked about: reliability is so important, these subjective measurements are so important. How are we going to get to a world where there's a CI/CD pipeline where people can test their models pre-release and make sure that they're doing well for all sorts of diverse people? Well, you need something like Arena to do that, and that's part of what the company is geared to do. So what people do is they can come and test a bunch of different models. We do this with basically every provider that comes to us. They can see which one's doing better or worse on the distribution of Arena users, and then they can use that information to help them decide which model to release. After they decide to release the model, it gets continually evaluated forever. So that's where you get the freshness of the data: the model continues to be tested. And we're pushing towards a world where these subjective and human considerations are part of model developers' final release pipeline.
C
So if I'm hearing you right, the more testing, the more reliable we should expect AI systems to get.
D
So it should be, like with any software system.
C
This is one of the fundamental debates in the space, right? What is the right measure of progress? And there's a body of work that tries to create exams, harder and harder exams. I've always found LMArena interesting for two reasons. One is that it takes the opposite approach, which says let the wisdom of the crowd guide us, and two, let open source actually define the examination. Right. I think this is quite important: if you get a group of experts in a room who decide what the right exam is for humanity, then inevitably that group's values get encoded in it, and the rest of the world has no way to use AI systems that are measured by a different set of values.
E
So it's great that people do these expert evals. It's totally fine. It's orthogonal to what we do. I'm glad we have them. But at the same time you have to ask yourself, what makes somebody an expert? Right. What are they an expert on? I think the whole world is moving in a direction against experts being the be-all and end-all of everything. Everybody actually has their own opinions and their own point of view. And in fact there are so many natural experts in the world on all sorts of topics that you don't necessarily need a PhD in order to be really intelligent and high-taste and have valuable opinions. And I think that's one of the things I'm proud of with Arena: it allows us to actually go and say, hey, where are the natural experts? And can we find data-driven ways to identify them? What if we can go look and say, hey, this person here in this random part of the world is actually incredible at coding and math? Their vote actually means so much, and their preferences are able to guide the future of AI. That's an amazing thing, and we hope to be able to scale it further.
C
Okay, but I'm going to push back a little bit on this. I'm going to channel a few of the criticisms I've heard.
E
Yeah, please.
C
From experts, which is that, look, if you're an expert, you've been blessed with the brains, the resources, and so on to be a highly educated individual in your field. We have a responsibility to guide humanity. It's our job to actually guide the masses. We should be defining what good human preference is versus not, because the masses, the everyday users, actually don't know what's good for them. The layperson prefers slop. Right. We've heard these arguments.
E
Yeah.
C
Is there a grain of truth or how do you think about that?
D
So I think a few things here. One, like Anastasios said, this alternative of creating hard exams is valuable. No question about that. The other thing I want to point out, and I took this kind of criticism about expert labeling to heart: I went to quite a few experts I know and respect and asked them, would you label? Almost everyone told me, no, I don't have time. Right. So there is a question there: are you really going to get the experts? And I don't know. I don't think so. Right. You're getting some people from that area who are willing to do the labeling, but the best people are not willing. Fundamentally, they don't have time. But if we offer these people, and we are not doing this now, I'm talking about the future, a platform where their community comes and asks questions to push the boundaries, to help with their research and things like that, then first of all, you are going to get these people, because they do that in order to advance their research, and we are going to get their votes as well. So fundamentally I do think that we can get them in Arena. Again, I'm not talking about today, I'm talking about the future.
E
It even happens today.
D
It even happens today.
E
It even happens today.
D
So you get real experts. The top ones. Okay. The unhirable people. That's a great way to say it. The other thing I want to say about the layman and so forth: Anj, you are funding a lot of AI companies at a16z, and you are on the board of many companies. What do these companies build? What is their product? Who uses their product? Right. It's not the top experts, it's the layman. These are their users. That's how OpenAI makes money, and many others. Then shouldn't the evaluation take into account the preferences of these users? The answer is obviously yes. Again, these exams, these benchmarks, are very important for understanding the capabilities of these models. No question about that. But MMLU is probably not as good at reflecting the preferences of the users of these AI products and services.
E
I want to just dig one level deeper here, which is that one of the things that we are really excited about is the question of why people vote the way they do. Who's voting left or right? Why are they doing it? On what kind of prompts do they vote for one model or another? On what kind of topics are models better? Basically, can we decompose human preference into its constituent components? Let's say you have a criticism. You say people vote based on slop. Emojis are driving votes. And there's a huge response length bias. It's true that people vote for longer responses preferentially over shorter responses, even given the same content. It's a well-known human bias. This gets back into RL. Can we learn this bias and actually adjust for it and correct for it? And the answer is yes. That's why we're making style control the default. So what we developed is this method called style control, which allows you to run not just, let's say, a Bradley-Terry regression, but also include certain covariates that model the effect of style and sentiment on how people vote. And what you get when you fit this model is not just a prediction of preference, but also an understanding of why people are voting the way they do. We're trying to target a causal quantity, which is the causal effect of, let's say, response length or sentiment, and so on and so forth. Okay. There's always more work to do to get things closer to the actual causal estimate that we want. But if we continue to decompose human preference into its constituent components, what we're building is an ever richer evaluation that can tell us all the factors that go into a response, and how you can optimize people's preferences keeping style fixed. Let's say I want to remain concise, but I want to maximize your preferences. Okay. How am I supposed to do that? Well, not a lot of people have that information, but we're building the methodology that allows you to do that. Okay.
And the question is, can we disentangle style versus substance?
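The idea of a Bradley-Terry regression with style covariates can be sketched as a logistic regression over battles. This is a minimal illustration on synthetic data, not LMArena's implementation: the function name `fit_style_controlled_bt`, the single length covariate, the plain gradient-descent fit, and all numbers are assumptions made for the example.

```python
import numpy as np

def fit_style_controlled_bt(model_a, model_b, a_wins, style_diff, n_models,
                            steps=2000, lr=0.5, l2=1e-3):
    """Logistic regression with Bradley-Terry features plus style covariates.

    Each battle row has +1 for model A, -1 for model B, and the style
    difference (A minus B). Assumes model_a != model_b in every battle.
    """
    n = len(a_wins)
    style_diff = np.asarray(style_diff, dtype=float).reshape(n, -1)
    X = np.zeros((n, n_models + style_diff.shape[1]))
    X[np.arange(n), model_a] = 1.0
    X[np.arange(n), model_b] = -1.0
    X[:, n_models:] = style_diff
    y = np.asarray(a_wins, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted P(A wins)
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # gradient step with small L2
    return w[:n_models], w[n_models:]

# Synthetic battles (hypothetical): the weakest model is the most verbose,
# and voters are assumed to favor longer answers.
rng = np.random.default_rng(0)
n_models, n_battles = 3, 4000
true_strength = np.array([0.0, 0.6, 1.2])
true_length_effect = 0.8
verbosity = np.array([1.0, 0.0, -1.0])

a = rng.integers(0, n_models, n_battles)
b = (a + rng.integers(1, n_models, n_battles)) % n_models  # guarantees b != a
length_diff = (verbosity[a] + rng.normal(size=n_battles)) \
    - (verbosity[b] + rng.normal(size=n_battles))
logit = true_strength[a] - true_strength[b] + true_length_effect * length_diff
a_wins = rng.random(n_battles) < 1.0 / (1.0 + np.exp(-logit))

strengths, style_coefs = fit_style_controlled_bt(a, b, a_wins, length_diff, n_models)
print("style-adjusted strengths:", np.round(strengths, 2))
print("estimated length effect:", np.round(style_coefs, 2))
```

The model coefficients play the role of style-adjusted scores, and the covariate coefficient estimates how much length alone moves votes in this toy setup. Whether such a coefficient can be read causally depends on assumptions the sketch does not check.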
D
Yeah, that's it.
E
There is an effect and you want to know about it, period. The platform helps you.
D
And what is the impact?
E
And what is the impact?
C
Yeah, so this is another, I think, important area to dig into. There was a moment where you guys decided that it was insufficient to keep measuring the progress of the models on coding with the base design of the platform, which is just Chatbot Arena. And I remember seeing a launch, I think you call it WebDev Arena, and showing up to a completely different interface and then realizing that this was a pretty big change for you guys. Why did we need a new kind of arena to correct for that effect? Why was that necessary?
B
I think that comes back to Ion's point a little bit: when you build an AI product, you want to know how people use it. You want to understand why users prefer this over that. And in order to collect that kind of data, you have to build a software product first. And as we know, over the past few years people have been building very different kinds of applications on top of AI, beyond just chatbots. The chatbot is one of the most widely used interfaces right now for humans to interact with AI. But these days people are applying these models to coding, and to things like tool use, agentic behavior, that kind of stuff.
D
Right.
B
So as the first step we were thinking, okay, how can we capture all these use cases? And the answer to that is we have to build something that people can use, again, an environment that people can test in real time to give us real-world feedback. So that was the original idea. That was last summer, just at the beginning of this kind of text-to-web, text-to-app trend. At the very beginning there was Claude Artifacts. That was the first, and we saw that, and we were amazed by it. And then, how can you build an eval for that?
E
Right. We have to give credit to Aryan on our team.
B
Yeah. Aryan had basically just joined the team, and he was actually interning. So we were saying, okay, why don't we build something new? At that time there was Claude Artifacts, so how can we build an evaluation for that kind of application, text-to-web? And that was the idea.
D
One small parenthesis. So far we've talked, the three of us, but the team eventually grew quite a bit. Right now, I don't know, it's almost 20 people, both graduate and especially undergraduate students, doing a lot of very exciting and interesting work to expand the capabilities and the reach of Chatbot Arena. So I just want to make sure it's clear that other colleagues of mine are involved here, like Joey and others. The credit goes far beyond the three of us. But that's what we want to provide. Whatever you want, we hope that you'll find the answer.
E
Fundamentally, if you're looking at the leaderboard, there's only one thing that really matters: do you care about the preferences of the community, the people that come to vote on our platform? That's it. That's what we measure. It's the only thing that we claim to measure. We don't claim to be an AGI benchmark. We are faithfully representing the preferences of our community. That's why it's so important to us that we continue to grow our community and that we get a diverse community of all different people: experts, non-experts, artists, scientists, different languages, everybody under the sun. We want them coming to this platform to express their preferences. It already happens to some extent, but if we can continue to grow it, what's going to happen is, again, in order for a model to do well, new people need to come in and vote for it. And so if we can provide this lens into the preferences of the world.
D
Things are moving, right? These are changing. Again, we have the study, and you want to talk about the freshness, right? We always see fresh prompts. It's not like we are going to see all these prompts over and over again, kind of saturated. That's not the case.
B
That was the very beginning of why we believe Arena is fundamentally different, related to this contamination. From the very beginning, Arena has tried to solve the contamination problem, the overfitting problem, which is when people test models on a static benchmark, or what people call overfitting. So how do you overcome overfitting? You collect new data.
D
Right.
B
That's how you overcome overfitting. And Arena is designed to collect new data every second. So all the questions are new, all the votes are new. And then we measure basically how different they are. What's the difference between all these prompts?
D
Right.
B
What does the distribution look like, and so on. And we conservatively estimate over 80%, something like that.
E
Prop up.
D
Correct. And this study was done by another member of our team, Lisa, and it basically measures how many fresh prompts you have in one day compared to what you've seen in the past three months. Right. And at a similarity score of something like 70, 75%, over 70% of these prompts are fresh: 75, 80%, or over 80%. Right. So a large number of prompts are fresh. And when we say fresh, we are talking about similarity. The bar is not so high that prompts only count as repeats if they are identical. No.
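The freshness measurement described here can be sketched as follows: a prompt counts as fresh if its highest similarity to any earlier prompt stays below a threshold. The speakers do not say which similarity measure the study used, so the bag-of-words cosine similarity, the 0.75 threshold, the function names, and the sample prompts below are all assumptions for illustration.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: lowercase whitespace tokens with counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two Counter vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def fresh_fraction(todays_prompts, past_prompts, threshold=0.75):
    """Share of today's prompts whose best match in the past is below threshold."""
    past_vecs = [bow(p) for p in past_prompts]
    fresh = 0
    for prompt in todays_prompts:
        vec = bow(prompt)
        max_sim = max((cosine(vec, pv) for pv in past_vecs), default=0.0)
        if max_sim < threshold:
            fresh += 1
    return fresh / len(todays_prompts) if todays_prompts else 0.0

# Hypothetical prompts standing in for "the past three months" and "one day".
past = [
    "write a python function to reverse a string",
    "what is the capital of france",
]
today = [
    "write a python function to reverse a string",    # exact repeat
    "explain bradley terry ranking in simple terms",   # unrelated to the past
    "what is the capital of france please",            # near duplicate
]
share = fresh_fraction(today, past)
print(f"fresh share: {share:.2f}")
```

A production version would more plausibly compare embedding vectors than raw token counts, but the thresholding logic would look the same.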
E
Yeah. Just to dig one click deeper into what Wei-Lin said, what is overfitting? Static benchmarks overfit. Why? It's because, as Ion said earlier, you're giving the student the same test over and over. You have a model, you test it, you look at whether or not it's improved on a static data set. Then you find another model, you test it, and you pick the one that does better and better. And what ends up happening is that the test becomes meaningless, because you've seen it so many times that you've memorized the answers. That's what overfitting is. Chatbot Arena is immune to overfitting by design. That means you're always getting fresh questions. In order to do well on the Arena, new users need to come and vote for your model. That's it. That means that users like it.
C
One thing I've noticed about the same researchers who often argue with me that Arena is a terrible evaluation system, that the leaderboard is not to be trusted, that the scores are rigged, that they're gameable: one, they also tend to celebrate when they're on top of the leaderboard. And two, I find increasingly, especially for specialist arenas like WebDev, there's a natural tendency to just accept that this is a really good indicator of the underlying capabilities. Why is that? Why is WebDev Arena such a good proxy for actual performance improvements when it comes to a capability like coding, which is quite general purpose? It's very counterintuitive. Programming is actually a very general-purpose discipline and skill. And yet it seems like capabilities on it, in a very general way, are captured well on a specialist arena like WebDev Arena. Why is that?
E
Yeah, first let me just say I think that all of these arenas have signal in them. It's not just WebDev Arena. It's just that people have more opinions on language, so web dev is a little bit more objective. It's easier to see the website: you build it, and this one's better than the other. There's a lot of signal in it. Also, it shatters the models. What I mean by that is it can be very clear that one model is way better than another. On WebDev Arena, immediately, bam, you see it. And why is that? It's just the capabilities of the models. Maybe Wei-Lin can say more.
B
I think it's just a much, much harder task, because it goes from text, a description of a website, and you have to first understand the request and then write code, right? And then the code has to satisfy certain requirements, say a style requirement or a component, that kind of stuff. And then it has to compile, right? We basically run it live in the browser, connected to a sandbox, that kind of stuff. So there are a lot of parts the model has to get right in order to build a website that people can really interact with.
D
Fundamentally, it discriminates much better across the models because it's a much harder problem. So very few get it right.
C
So in reality, for critics who say these static exams are hard tests and Arena is an easy-to-game benchmark, WebDev Arena is in fact a great example that shows it's a really hard evaluation. It's a hard test. It just proxies the real world better than some static multiple-choice test. Is that roughly right?
B
Yeah, for sure. And every input from a user is for a real-world task. They are trying to build some real website, and that also measures something beyond an academic benchmark where we imagine what users would do. This is really trying to approximate user intent and user preferences directly.
E
I have to say I also just completely disagree with the foundation of the question. The implicit assumption is that chat is easy, or that it's even easier than web dev. That's completely false. It's a completely naive perspective that people have on this, because it's hard to build something that people love. People are good at chat, but you like some people way better than others. It's subjective, everybody has their own opinions, and the landscape is very rich. You might like a very different model than me. Somebody else, let's say a musician, might like a much different model than I do. And understanding all of those differences is really hard. And anybody who thinks that is gameable is deluding themselves.
C
To take that argument to its strongest form, then, isn't the right way to evaluate whether the model is good or not to allow me to generate my own leaderboard?
E
You as a person? Absolutely. Yeah. And we should be giving you the tools to do that and we're currently building them.
C
So this is quite profound. You see the world going where everybody has their own personal arena.
E
Absolutely, absolutely. It should be personalized just for you. You should understand which models are best for you.
D
And it's going to be for your task: per person and per task. If you want to do different things, you may have a different leaderboard. Right. If you have a question about tax today, you go to different people than if you have a question about programming or whatever. So it's also going to depend on what task you want to accomplish. And I really want to go back to one thing. I try to think quite a bit about all the criticism, because on the face of it, intuitively, for some people it makes sense. That's why many people make the same criticism. And I think there is another thing going on. Why do people say Arena is not good? Because people are fooled. Right? That's fundamentally their argument. They're fooled by long answers, more emoji, and things like that. And when I look at it from that perspective as a human, what's in my mind is that I am not going to be fooled, right? These guys are going to be fooled by these things. I am not. That's why it's not good. That's why it's not a good proxy. The problem is that you are. We are. I am fooled, right? Everyone is fooled. And that's why Chatbot Arena, like Anastasios said, is a proxy. It provides a magnifying glass. But all of us have our own peculiarities, our own culture, our own history built on interaction with different people. We have different preferences. That's fundamentally what it is. And these preferences are not fully objective. People say it should be only about objective answers, but all of us are different. That's the fundamental disconnect between the criticism and what we actually provide, and what we believe everyone actually needs. Right.
C
To double-click on that a little bit, you said we all have our own culture, right? And Ben Horowitz, whom we all know well, has a quote I love: culture is not a set of beliefs, it's a set of actions. Right. So let's say the philosophy that Arena has is that the set of actions a user takes when using an AI model is the best source of truth for whether that model is good or not for them, versus some third-party closed source evaluator telling you.
D
What is good, telling you what is good for you.
B
Right?
D
And that's why, again, going back to the previous point, I think what you hear is that capturing human preference is fundamental, because we are building these AIs to interact with humans. Right. I think that's the foundation. But like we said in the previous discussion, people say, well, yeah, but these other people are fooled. This is not me. Right. I am fooled as well, but I don't believe it. Right? You always believe that you are better than you are. But then we can provide things like style control, okay, and so forth. You can remove that, adjust for that effect. Right? And we are going to provide more and more. You can adjust for that effect and that effect. Right. So you get your answers as well.
C
LMArena started as a research project. Take us back to how it began.
B
So it was started around two years ago, in late April 2023. And at that time, before Arena, we were working on a project called Vicuna, which is one of the first open models that was released, like a ChatGPT kind of clone. Yeah, and at that time LLaMA 1 was just released, which is a base model. It doesn't really know how to chat with humans; it has only gone through the pre-training process. At that time I don't think people called this post-training yet; people called it instruction fine-tuning, that kind of stuff. So we were in the lab exploring, how do we reproduce this, right? How do we make an open source version of ChatGPT? And then, credit to Lianmin, he had the idea that we could use some of the open data published on the Internet, which is user ChatGPT conversations. It was called ShareGPT, and it was a high-quality set of dialogues that users shared. So basically we came together as a group, a bunch of PhD students in the lab, and set an ambitious goal, which was to train and release this model in two weeks, that kind of stuff. And during that process the result was surprisingly good. At that time we were playing with the model, and then we thought, we've got to demo this to the world. So we basically just set up a website, put that model on the website, and released it. At that time there was a huge debate internally when we released it: how should we evaluate this model? How good is this model, really? It vibes well, right? Because when we compare it to LLaMA, the base one, you can feel the difference. The model just learned how to chat, learned how to speak like ChatGPT. So there was a huge debate: how do we evaluate this? At that time we didn't have much time, so we were like, okay, we either do this kind of labeling, come up with questions ourselves and label the data and then compare it with other models, or we do something automatic.
And at that time GPT-4 had just come out in March, and people were wondering what it could do, right? And we were like, okay, why don't we just use it for evaluation? We use GPT-4 as a judge to do this automatic eval. At that time no one believed in it. And again there was a huge debate, but we didn't have time, so we just ended up doing it. And again it worked surprisingly well, and we released it. But after that there was still a huge open problem, which is how do we evaluate these chatbots, right? So soon after, we came up with the idea: why don't we let everyone in the community vote on which model is better? Because at the time we served that model, and we also served some of the other open source models. At that time, every week there was a new fine-tune. So we registered a website, we demoed all of them, and we came up with a side-by-side UI where people can compare them. And soon after we said, okay, why don't we come up with a battle mode, where we anonymize their identities and people vote. So that was the original Arena, which was basically trying to solve our own problem: how do we evaluate these models and understand the differences between them.
D
Actually, when we first started, we did try to have some students, buy some pizza, get them in a room, and label the replies from Vicuna and other models to compare them. And obviously that didn't scale. And then it was this LLM-as-a-judge. We tried it and it worked surprisingly well. And then there was quite a bit of debate. Okay, we are going to build a platform to scale. Because the question remained. GPT-4 was released just two weeks before we started to use it as a judge. Anecdotally it seems that it's doing well, but still, how does it compare with humans, right? And that was the question: how do we scale the human evaluation? And we discussed quite a bit how to do it, because it was not clear. If you think about before, you just ask people: I have a prompt, the prompt is answered by all models, and then you label the answers, right? Good and bad. That's the typical way. But how would you scale up that process? Then you need to rank them, right? You have N answers to the same prompt from N models, and it is very hard to rank them. Think about it: they are slightly different in tone and so forth, and you try to rank them. And we thought quite a bit. I think the inspiration was how humans in real life rate, say, players or teams in games, right? Obviously you have the tournament. That's one way to do it, where the players play each other head to head. And based on that, you get some number of points for a win, loss, or tie, and then you have a leaderboard, right? But the problem is that in a tournament, typically the assumption is that the number of players does not change during the tournament, right? And also, in most tournaments, each player has to play everyone else.
So it's kind of N square problem for N is the number of players. And then we thought about, okay, there are other ways in real life how players or teams are ranked. When they don't play with each other, they don't have a chance either because a number of players is too large, or you need to also accommodate new players entering the game. And that's why we were about. Then we thought about, okay, there are disciplines where this is done like chess, that tennis, ATP rating and many others. And that was the idea. And we start, okay, why don't you do something like ELO score? Okay. And for that, what do you need? Oh, you need only head to head. And not everyone needs to play in the same tournament. And that's how we adopted. And that's why arena, it has its battle mode, which you have a prompt and answer from two randomized anonymized light language models and you can pick which one is better, where there is tied and so forth.
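The head-to-head rating idea described in this turn can be sketched in a few lines of Python. This is only an illustration of the standard Elo update, not the platform's actual code: the starting rating of 1000, the K-factor of 32, and the function names are all assumptions made for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b).

    outcome_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k (assumed value) controls how far one battle moves the ratings.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def run_battles(battles, initial: float = 1000.0, k: float = 32.0):
    """Process (model_a, model_b, outcome_a) battles in order.

    Models that have never been seen simply start at `initial`, which is
    how new entrants are accommodated without replaying a full tournament.
    """
    ratings = {}
    for model_a, model_b, outcome_a in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = elo_update(ra, rb, outcome_a, k)
    return ratings
```

Each battle only touches the two models involved, so nobody has to play everybody. Note that the result depends on the order in which battles are processed.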
C
And when was the moment when you joined the conversation and brought, from what I understand, the Bradley-Terry approach?
B
Yeah, it turned out it was deeper technically than we thought. So at that time, Ion was like, we need to find someone to back this up on the theory side, to have a solid foundation for ranking all these models. At that time, it was no longer really a fun project anymore. It started as a fun project.
D
And people started to pay attention to it, right, so you'd better do something. So I went to Michael Jordan, my colleague, a very famous machine learning and AI researcher and faculty member here. I had been working with him, actually, when he built these kinds of labs at Berkeley, cross-disciplinary labs. We were working with him in like 2005, 2006. He was joining the systems people and the database people to work together on exciting projects. And he told me, oh, I know exactly, I have the guy for you. It's Anastasios.
E
I saw what was being built. At the time, Arena was still not close to what it is today. There was, I think, not that much usage. I saw it and I thought, wow, what a great opportunity to do some interesting statistical modeling and theory, like understanding how we optimally sample models and how we perform this estimation. Okay, let's move from Elo to Bradley-Terry, because we're actually performing an estimate here. The Elo score moves over time; it doesn't converge. But Bradley-Terry models converge. How do we then construct confidence intervals properly for this estimand, and so on and so forth. All that stuff was super interesting.
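To make the Elo versus Bradley-Terry contrast concrete, here is a minimal sketch that fits a Bradley-Terry model to a whole batch of battles and attaches a bootstrap interval. It is not the team's implementation: the iterative fitting scheme, the small `prior` of virtual games (added only to keep every strength positive), the bootstrap itself, and all function names are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def fit_bradley_terry(battles, iters=200, prior=0.1):
    """Fit log-strengths from (winner, loser) pairs.

    Uses the classic fixed-point update p_i <- W_i / sum_j n_ij / (p_i + p_j).
    `prior` adds that many virtual games, split evenly, between every pair.
    With prior=0 every model needs at least one win. Scores have mean zero.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # keyed by sorted (model, model) pair
    models = set()
    for winner, loser in battles:
        models.update((winner, loser))
        wins[winner] += 1.0
        games[tuple(sorted((winner, loser)))] += 1.0
    models = sorted(models)
    for a, b in combinations(models, 2):
        games[(a, b)] += prior
        wins[a] += prior / 2
        wins[b] += prior / 2

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = 0.0
            for j in models:
                if i != j:
                    n = games.get(tuple(sorted((i, j))), 0.0)
                    denom += n / (strength[i] + strength[j])
            new[i] = wins[i] / denom
        # Fix the scale: only strength ratios are identified.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        strength = {m: v / math.exp(log_mean) for m, v in new.items()}
    return {m: math.log(s) for m, s in strength.items()}


def bootstrap_interval(battles, model, rounds=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for one model's log-strength."""
    rng = random.Random(seed)
    battles = list(battles)
    samples = []
    for _ in range(rounds):
        scores = fit_bradley_terry(rng.choices(battles, k=len(battles)))
        if model in scores:
            samples.append(scores[model])
    samples.sort()
    lo = samples[int((alpha / 2) * len(samples))]
    hi = samples[int((1 - alpha / 2) * len(samples)) - 1]
    return lo, hi
```

Unlike a running Elo score, the fit uses all battles at once, so shuffling the battle order leaves the estimate unchanged.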
B
To me, we were meeting like just.
E
Next to two doors down and we wrote on the whiteboard like five or six different topics. Yeah, I mean, we started working on them and the rest is history. Right.
C
In many ways, I feel like the birth of Arena couldn't have happened anywhere other than an interdisciplinary lab at a fundamental research university like Berkeley. Is that true or do you think.
E
Well, it certainly would have been worse if it came out of somewhere else. And the reason is that the fact that we come from Berkeley, from a university, really speaks to our scientific approach and neutrality. I think if it came from an industrial lab, people would always have questions about, oh well, these people, are they also training a model, and what's their incentive, and so on and so forth. But the reality is we were just doing this in order to evaluate models, and we came at it from a scientific perspective. That's it. And I think that's something that people can see when they look at us, and it builds a lot of trust in our business.
D
The other angle here is what you get in a lab like this one. Maybe in industry you can also get interdisciplinary teams. But they are going to be large teams, right? Because you take the team which is doing AI and the team which is doing systems and have them work together, and these are already large teams. But here what you get are small teams, a few people, where everyone can come from a different area. Right. We have people who are systems people. Early on we had to build the systems to serve these open source models. Right. We had to serve this Vicuna, like Wei-Lin mentioned. Right. Then you have to have people who are pretty good with data. Actually, when we used this ShareGPT data, we did quite a bit of data preprocessing, data curation and so forth. Right. Then when Anastasios joined, we had machine learning experts. Right. But the team was still like four or five people. And early on, small teams move very fast. So I think that's the difference: you have a team that is interdisciplinary and small. In industry you may get interdisciplinary teams, but they're going to be large.
C
If you could just teleport back in time to that moment in early 2023. Correct me if I'm wrong, but if I had to kind of summarize the research environment in the Bay Area at the time, most people basically were extolling the death of AI in academia. The idea that, oh, you can't really do any serious Research you can't contribute to the frontier of computer science or in AI from a research institute. It was quite common actually if you remember that.
D
And there's nothing more satisfying than proving these people wrong. Right? Right.
C
So what do you think people got wrong?
D
I think that, if I may take a step back here, because I'm old enough that I've seen a few of those. I remember when I was a student, I was doing systems work and networking. Those were the Internet days, and I was going to these conferences and there were panels: is research in operating systems dead? That was the topic of the panel. And the reason was that at that time, at the end of the previous century, Microsoft was dominating, and then Apple, and then of course there was FreeBSD and so forth. But then, although it didn't come from academia, there was Linux. Right. Actually, Linux was preceded by Minix, which came from academia, from the Netherlands. So that's one. Then in 2004, when we came here and started this lab, there was the same question about distributed systems: with so much research on distributed systems, what can academia do? Because Google was doing all this systems research: MapReduce, Google File System. All of that was happening at Google, the best people were going to Google and so forth. And then, from what we've done here, came Spark. Right. Which also came from academia. Right. And I think that's how this started too. People were surprised by Vicuna, just a bunch of students acting on their own initiative. Right. I learned about it almost after the fact. They picked this data set from the Internet which was high quality and used it. And people were so surprised by the quality. Right. That was when people were asking, is this real? Right. So I want an evaluation. Otherwise you just show me some anecdotal stuff, right, and, oh, okay, it looks good.
B
But how about is that some people didn't believe it.
D
Yes.
B
And then say this is a GPT4 wrapper.
C
Yeah, I remember that. I remember we were at NeurIPS later that year, right. And I remember Elon was sitting at a table next to me, and a really well-known researcher who's still at OpenAI asked me, oh, is that the team that worked on that Vicuna bot? And I said yes. And he said, oh yeah, I've been wanting to have a conversation with them because we think they're violating our terms of service, because they're just reselling our GPT-4. I don't know if he came and confronted you, but that was very much the default assumption people had. It was disbelief.
D
So, for a while, those were the best open source models. I don't know, three months, four months or whatever. So that's why the evaluation was so important back then, right? Because of the disbelief. So we tried to support it with some evaluation which seemed more objective, showing that it is indeed a good model. Again, since then there are many other things we've done here in open source, like LLM inference with vLLM and SGLang. But I do think this kind of sentiment still exists in industry. I was actually on a panel yesterday and it was the same kind of discussion. Okay, maybe academia should do this thing and industry should do that thing, right? Like, in academia you just can't do things like pre-training and so forth. I think at the end of the day it is about what resources you have and what problems you solve. And again, I think, through the examples I gave, if academia has resources it's going to surprise you. Clearly it's going to be at the very edge of innovation and creativity. That has almost always happened, right? In this case, of course, we didn't need huge resources, right? It was just a group of smart, passionate students. So when Chatbot Arena started there was a lot of excitement and so forth. But then there was this feeling, at least for some people in the group, that we are done here, we published the paper and so forth, right? For a while, if you look at the usage, it was dropping a little bit, and it almost died. Almost died. And Wei-Lin's main thrust of research at that time was different: graph neural networks, distributed graph neural networks, and things like that. And I remember at some point, at one of our one-on-one meetings, Wei-Lin came to me and said, look, instead of doing this kind of work, I am really passionate about Chatbot Arena. I really want to do it right and to focus on it.
And then it kind of started. Wei-Lin was like a one-man band, doing the backend and the marketing. He started to add more models to the leaderboard, market it, and so forth. And very soon after that Anastasios came, and then it was kind of magical, right? You have these people who are so passionate, and they are working so well together; they are so complementary in skills and even personalities. Then it started to shoot up. And I'm mentioning that because without that inflection point, which came long after it started as a project, we wouldn't be here.
C
I think there's a chart I saw recently that compared the number of models being released and tested on LMArena per year over the last two years. And if you look at Q1 of 2023, it was, I think, two models. And if you look at just this past quarter, I think there were 68 models or something like that. Right. In total, that first year there were about 12 models or so. And today it's over 280 or something on the platform. So it sounds like at some point it took the two of you realizing that this deserved to be more than a one-off paper. When was that?
E
I still remember when we worked on the paper for Chatbot Arena. It was like a couple of weeks of really hard work, and we were pushing all the way until the deadline. And afterwards I turned to my girlfriend at the time and I was like, you know what, I think this is going to be a pretty good paper. And yeah, Wei-Lin and I were talking at the time, but I think we started very early thinking about what this could become, trying to de-risk it in various ways, trying to build it and seeing, hey, is this growing? Can we keep building on it?
B
Another factor that really drove the growth is competition. The competition in AI became much more intense in early 2024, when Claude 3 came out.
D
So let me try to answer your question, because I think it's very interesting and maybe I have a more unique view. I started other companies which are based on projects coming from this lab, like Databricks with Spark, or Anyscale with Ray. And there the motion was pretty clear. You have a successful open source project which gets more and more popular. Then some companies start to use the project, and you get to the point of, okay, if I'm going to bet on this project to be part of my infrastructure, what happens when the students who built it, like Matei and so forth, graduate? Who is going to maintain it, who is going to evolve it? So in that particular case it's kind of natural: if this really is going to get even more successful, you have to have a company backing it, whether it's a new company or an existing company. And if there is no existing company, then the people on that project, if they want to push it further, almost have to start a company to have enough resources to push it. But this was different, for the reasons Anastasios said. It's Berkeley. It's trusted and neutral. And Wei-Lin, I remember, mentioned to me like one year ago, he's like, I think maybe you should have a company. And I told him, man, what are you talking about? This has to remain neutral. Maybe we do a kind of foundation and so forth. And this discussion actually went back and forth for a while. I was even frustrated. I'm telling this guy what I think should happen, that it should be just a kind of foundation and so forth. And he comes to me, and it's like he's not hearing it; he's telling me the same thing. Right.
B
So we were trying to like convince you, basically, right?
D
They were trying to convince me. Right. And I talked with some of these foundations and so forth. But it became very clear to me, when you started to get more and more demand, that you need so much funding to build such a platform. Right. Because you need to serve the models, and you need to build an entire scalable backend and things like that to do it. And then there's the UX. Right. So when you look at the sheer amount of work needed to push it to the next level, there is no way you can do it without significant funding. So that, for me, was a kind of inflection point. But these guys can say more, because they were convinced about this long before I was.
E
Yeah.
B
Another thing I think we were discussing last year, when we were trying to figure out whether this can really be a business that solves more fundamental problems in this space: I think Anastasios at that time was giving some perspective on the ever more granular evaluation that we can provide with the data. So do you want to say more about that?
E
Chatbot Arena, when you look at the leaderboard, runs something like a marginal regression, which means that the leaderboard ranks models on average across all users and all the prompts that they ask. But there's a vision where you take this to the logical extreme. There's the overall leaderboard. Then you can split the leaderboard into different categories: coding, math, hard prompts, and so on. But the real value is in, well, what if I can tell you which model is best for you, for your question, for your business? There are so many interesting methodological questions to ask there, and they actually require a lot of resources to answer. So one thing that we've been working on recently is called Prompt-to-Leaderboard. Prompt-to-Leaderboard asks the following question. You give me your prompt. Can we tell you which models are best for that prompt specifically? Now, the problem is we've never seen that prompt before. We've only seen any given prompt once or zero times, because most people don't ask every question under the sun. Fundamentally, it's a hard question because the thing that you're trying to estimate is, what if infinitely many people came to me and asked the same question and then voted? That's the thought experiment you're trying to run in your head. But you can't really answer that question by running a standard regression. So instead, what we came up with was a strategy for training language models that can output leaderboards. And it's actually a deep question, because what you're essentially doing is training LLMs to output these Bradley-Terry regressions that we were talking about earlier. And how do you do that? Well, you have to make sure that as you train the model, the regression sort of naturally emerges from the data, and the only thing you're getting is binary preferences. But nonetheless, it turns out that you can do it.
This has so much utility, and it requires so many resources to really scale up. It transforms the problem of testing and evaluation, which is normally kind of an unsexy problem. You think about it as, okay, how am I going to evaluate ML? Well, I'll just calculate the accuracy, right? But the reality is that that really doesn't reflect the heterogeneity of the model's performance in different settings and for different people. What Prompt-to-Leaderboard teaches us instead is that you can convert the problem of evaluation into the problem of learning. What if I learn something that can tell me how my models are performing in all different parts of the space? It turns out that you can do that by training big language models. And because language models are the intermediary that gets you to this evaluation, there's also a scaling law that comes along with it, which is to say that the more data you get and the bigger you build the platform, the better you can make your evaluations. The more granular you can make them, the more personalized you can make them. And that's a very powerful idea. And I think that's part of the reason why we were convinced, hey, this deserves to be a company of its own. It's a fundamental technical innovation that's going to change the way people approach the space.
D
And let me try to follow that with a less accurate explanation, but one that I think drives home the point of why the data is so important. So with Prompt-to-Leaderboard, you give your prompt and, again, like Anastasios said, we may have never seen the prompt. More likely, however, we have seen a lot of other prompts which are similar to your prompt. Right. So intuitively you can think that you can use the votes on these similar prompts as a proxy to compute how good the models are for your prompt. Now, from this maybe not-so-accurate analogy or explanation, you can see that the more data I have, the more prompts similar to your prompt I have, and so the more accurate I can be. There is another thing we didn't touch on, which is actually why I was so excited about the project early on. If you think, outside of Vicuna and our own story, about how people evaluated, and still evaluate, these models, you have these benchmarks: MMLU, HELM at that point, run across all of these models. The problem with that is that they are static, so you can overfit to them. And at that time, if you remember, there were already starting to be discussions, I'm talking about a year and a half ago, two years ago, about contamination, with very high-profile examples. And why? Because, as we know now, for these large language models the data is a bottleneck. So they train on all the possible data they can get their hands on on the Internet. And many of these benchmarks are also out there. So, probably not intentionally in many cases, they are going to train on some of the very benchmarks they are going to be evaluated on. So this is another fundamental problem. And I think the unique thing about Chatbot Arena is that it evolves over time. We think that the way people typically evaluate these models is like giving the student the same exam over and over again. Right.
Certainly we don't do that, or at least we try not to do that. Right. I'm talking as a faculty member now. Right. For each class, for each year, we need to give different exams. Right. So it's the same thing as with humans. Right. To evaluate humans, or to evaluate something that learns over time like these models, you need to evolve the benchmarks, right, the examination. So I think that's the unique part and unique value of Chatbot Arena, and probably these guys can say more about the freshness and the evolution of the benchmark over time.
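The "similar prompts as a proxy" intuition from this turn can be illustrated with a toy sketch. To be clear, this is not Prompt-to-Leaderboard, which the speakers describe as training a language model to output Bradley-Terry coefficients. Here, purely as an assumption for illustration, similarity is bag-of-words cosine similarity, each past vote is weighted by its similarity to the new prompt, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Crude prompt representation: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prompt_leaderboard(prompt, votes):
    """Rank models for `prompt` using similarity-weighted win rates.

    votes: list of (past_prompt, winner, loser) triples.
    Votes on prompts with no token overlap get weight zero and are ignored.
    """
    query = bag_of_words(prompt)
    weighted_wins = defaultdict(float)
    weighted_games = defaultdict(float)
    for past_prompt, winner, loser in votes:
        weight = cosine(query, bag_of_words(past_prompt))
        if weight == 0.0:
            continue
        weighted_wins[winner] += weight
        weighted_games[winner] += weight
        weighted_games[loser] += weight
    rates = {m: weighted_wins[m] / weighted_games[m] for m in weighted_games}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```

Even in this toy form the data argument is visible: the more past votes sit near the new prompt, the more weight the estimate rests on.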
C
What are the biggest differences between benchmarking and evaluation?
E
So let me just zoom out for a second. Benchmarks, how are they collected? What happens is that you ask a question or give an input, and then a human grades the output. And then what is the benchmark supposed to be? There's an answer key. A benchmark is like a test with an answer key. A human has to look at it and tell you what's right or wrong. The fundamental insight of the arena is that by virtue of the fact that we built this platform, we can do something closer to reinforcement learning. Benchmarks are like supervised learning. Arena is like reinforcement learning. In supervised learning, you can only do as well as the best human that you have, because what's happening is that you're learning from the teacher. In reinforcement learning, you're learning from the world. You're able to learn things better than the best human could ever teach you. Why? Because you're only getting these preferences. You're getting, was this good, was this bad? Nobody needs to tell you why. Nobody needs to tell you, hey, your writing style needs to improve in XYZ way and you should edit this sentence. Forget all of that. For the same reason that reinforcement learning has been so powerful in training language models, it is also powerful in evaluation. It can capture things that you and I, if we were looking, could never understand how to encode. Right? Its open-world nature allows you to go back and mine the data in order to extract insights that are much more profound than we could come up with ourselves.
C
So this seems to be the fundamental tension, right? Let's say you are a leader in the AI industry, at your product lab or your product company, and you say, we believe the most valuable thing for us to do is build useful AI products. We're not interested in benchmark hacking. We're interested in making truly useful products. If that is actually true, you should be strictly supportive of testing your systems more and more on arenas like WebDev Arena. Let's say you want to build a useful web development AI experience; then you should want your teams to be testing more and more on this product, right?
E
You want to do well on the distribution of natural use.
C
Let's say, then, that we expect anybody who's serious about building useful AI products to want to use testing environments like WebDev Arena more. Why are people complaining that some labs are testing more than others? And why are they saying that's a bad thing?
E
So, first of all, I think it's worth saying that we offer the same level of service to all labs. Okay? There's nobody that we treat preferentially or anything. It's a neutral platform. We want to help the ecosystem advance. But second of all, addressing your question more directly, people do not yet fully understand the arena. I think people still think about the arena as a benchmark. People still think about it as something like, oh, people can overfit on this thing. But what hasn't sort of permeated, and it's because it's just such a new way to approach evaluations, is when you have fresh data, you can't overfit. It just means you're doing well, period. There's no overfitting that can occur. What can happen is you can do well and you can argue with me about whether doing well is a good thing. Okay, that's perfectly fine. That's not where people's heads are at. I think people's heads are still at, oh, you tested so much. And that must mean that you're. Because people are used to it. Because people, oh, it's stat 101, so on and so forth. I know statistics. If you're doing well on this distribution, that's a strictly good thing, all else equal. And then people can choose, how much do I want to tune my model for chat? That's your choice. You can choose how much you care about this signal and that's okay too. So I think it's a fundamental misunderstanding. But I think as we go, as we continue building this, as it grows, people will become more educated on this topic and then I expect that the world will understand it.
D
And again, just to make sure, because we had this discussion among ourselves early on: overfitting refers to the same data. When you do supervised learning or something like that, you have train data, and then you have test data which you don't show during training, and you hope that the model is going to do well on the test data. Overfitting means the model is not doing well on the test data, but only on the train data. Right. If you think from that perspective, there cannot be overfitting, because we have continuously fresh data. The one thing people can say is that it's a particular domain, which is given by the set of users and so forth, and you are going to learn to do better in this domain, which is perfectly fine. Probably you should care about that, because the domain is a group of people you care about.
E
Right.
D
But overfitting has a very particular meaning, and it's very different from what people have in mind here. When they use the term overfitting, they mean, I'm going to learn how to do well on the Arena audience. Right? That's what they have in mind. Right. But again, that's fundamentally different.
C
Well, actually, let's talk for a second about the Arena audience, because you mentioned that's a critical part, right, as opposed to continuing to train your model to perform well on a static distribution. One of the things that shocked me, between the first time Wei-Lin and I chatted at the beginning of last year and the end of last year, was that your traffic had grown by 10x. The user base of the community had gone up by 10 times. Why is that? It certainly wasn't visible to me what's going on under the hood. Why are more and more people using Arena? And in your mind, is that one of the reasons why people don't realize how hard it is to actually overfit, why overfitting is almost not possible on.
B
Millions of people's preference. And I think one of the reason why people are kind of like surprised to see usage grow is because when they think about arena, they think about the leaderboard, right? They think about again, a benchmark. How would a million people use a benchmark? That's strange. But in reality, ARENA is basically real world testing and not just real world testing. The best AI from all the frontier labs. Does the demand grow over time for people to test the best AI, use the best AI?
E
Yes.
D
Right.
B
So that's the very foundation of arena, which is like this is like an open space where everyone can come here to compare all the AIs for their own use cases for free. And this demand we've been seeing has been growing and we believe it has a very strong potential to continue to grow. And in the same time we collect all sorts of comparison data that we can use for evaluation for all sorts of tactics.
D
So one thing that I want to point out, because we have been talking a lot about votes throughout this discussion, right? The vote is a fundamental construct which allows us to evaluate these models and so forth. So the votes have to be high quality. If they are not high quality, it's, like you said, garbage in, garbage out. And there are at least two reasons why we believe that the votes on Arena are high quality. One is that the people who ask the questions are the people who evaluate the answers. So presumably they have the context for that question and for those answers, as opposed to: I have one question and two answers, and I'm asking some random labeler to say which of these answers is better. This has been known in the information retrieval field for decades, and it's called the gold standard: when people evaluate the answers to their own questions. When an expert evaluates someone else's questions and answers, it's called silver, if I remember correctly. And the second thing: the people who vote in our case are intrinsically motivated. We are not asking them to vote; they can choose not to vote. Only people who want to vote, vote.
C
Relative to companies that pay humans, pay.
D
You to vote, or provide other kind of incentives like oh, if you vote more, we can give you more resources or something like that, right? Because you can imagine, you can easily imagine how you can get their wrong incentives which are not necessarily aligned. When I say wrong incentives, they are not necessarily aligned with increasing the voting quality.
C
One of the things that strikes me as I hear you guys talk about the design of the platform is that, unlike these other paid services where you can just, you know, essentially hand out cash or incentives, when you have somebody intrinsically motivated testing the quality of a model, that starts to look more and more like software testing. So 15 years ago, when software systems were starting to be deployed to the Internet, they were buggy, they were insecure, they were unstable, they were unreliable. And so as an industry we developed the idea of unit tests and CI/CD and A/B testing. And today software systems go through a fairly reliable set of checks before they get deployed to production. Am I wrong, or should I think about that as a pretty good analogy? If we'd like the arc of AI progress to head towards more and more reliability, then we actually want model developers and AI developers to be testing their systems more before they get released to the world.
D
So I think that's kind of where we started, and this is another exciting thing. We do believe, and you can see it right now, that one of the main challenges of adopting AI in a wide range of scenarios is actually reliability. Especially if you look at enterprises, right? Is this answer correct or not? That's fundamental, right? And like you said, it's very similar to software systems. For software systems we developed, like you said, these long and sophisticated testing processes. Right. CI/CD and so forth. Right. So you should think that you need something similar for these models. What we have right now, to tell the truth, is almost entirely static benchmarks. Right. This is what we are doing, right? You start training your model, and when the loss plateaus, you start testing checkpoints. Right. And you have 60, 70, 80 benchmarks or so, and you look at them in a spreadsheet to see which checkpoint is doing better, and then you can merge them. This is what happens today. Right. But like we discussed, if you really are going to build your application for humans, okay, you can still test on your static benchmarks. Nothing wrong with that. Very valuable. But you also want to test your models, your checkpoints, on Chatbot Arena, for all the reasons we mentioned during this discussion. Yeah. So ideally, in the limit, you want Arena to be part of your CI/CD for training the models.
C
We spent a bunch of time talking about how arena was born and how the big idea, at least the theoretical idea, is that to unlock more reliability in AI, we need more testing of AI. So let's spend a little bit of time going deeper on the practical realities of making that possible. What are the hardest challenges when it comes to actually building the best testing platform to make AI more reliable?
E
Arena is a very interesting platform. It's unique and it's kind of like n of 1 at the moment. And so there's a number of like technical challenges that are actually quite exciting. We're always looking to improve the platform both from the methodological side and from the infrastructure side. And what makes it unique is that it's this combination of AI machine learning, converting evaluations into learning algorithms, like reinforcement learning side of things, plus like pretty large scale infrastructure. A lot of people don't know this, but Chatbot arena is used by like a million plus monthly users. We get like tens of thousands of votes on a daily basis. We have over 150 million conversations that have been had on the platform. It's massive and it's like the leading platform for these kind of subjective, real world evaluations continuing to grow. So the infrastructure side is actually quite challenging. And then the question is, we have this unprecedented data set, how can we use it and leverage it maximally in order to actually target what we want, which is like the most granular possible evaluations and measurements of model performance.
C
Why is that hard? Why is granularity hard?
E
Well, granularity is challenging because fundamentally the question you're asking when you talk about granularity is: how does it work for this specific individual or this specific prompt or this specific use case? That is a hard question to answer. Why? It's because you, Anj, come to the platform, you ask three questions and you vote on one of them. How am I supposed to tell which model is best for you? It's a sparse problem, where there's a big matrix of users and queries, and the number of queries that the user could possibly ask is infinite, and the number of users is very large, and they've only asked three of them. How are you supposed to learn which model is best for that specific user? Well, you have to do something creative. And the methodology for that relates to all these core topics that are very deep in machine learning, statistics, recommendation systems, so on and so forth. But they come into kind of a new light when you think about language. So one example of a problem that we're working on towards the future is personalization. How am I supposed to create a personalized leaderboard for you? Let's say I have your prompt history and a few votes. Well, in order to run a regression that's just for you, I probably need hundreds of votes. It's just going to be too high variance unless I have that much data. But I'm never going to collect that much data on a user, or only for the most power users am I going to collect that much data at the moment. So we need a way to train models that look at your interaction history and then can compare you to other users and pool between users, so that you can create leaderboards for specific people, categories of people, so on and so forth. That is a challenging and interesting problem, and you need to do it using only the limited information that we have, which is binary preference data. How do you do that?
Well, it's a cool problem, it's a hard problem, and it's one that we have taken steps towards solving. And it's not just personalization. What about if I want to value the data? What if I want to tell you which data points are high signal, which users are high taste? What if I want to say, Anj, he's fantastic at bioinformatics, but when you ask him about history, this guy doesn't know what he's talking about. Or what if you want to say, hey, this person right here, they're a local expert in this particular topic and I really should upweight their opinions, let's say. Or this person's just voting noise. How do I take them out? We need to be able to do tasks like these. And they're fundamentally hard because of the structure of the data that we collect. But they're also very exciting methodologically, and we keep making progress on them, which is part of what fuels us. And it's all enabled by this massive infrastructure and platform. It needs to be done at scale, it needs to be done very quickly. And Wei-Lin is kind of the expert on this and he should speak more to that. Yeah.
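Editor's sketch: the "pool between users" idea described in this turn can be illustrated with a very simple shrinkage estimate, where a user's per-model win rate is pulled toward the crowd's win rate until that user has cast enough votes. This is only an illustration under stated assumptions, not the method the Arena team uses; the function name `pooled_win_rates`, the `prior_strength` parameter, and the vote format are all hypothetical.

```python
from collections import defaultdict


def pooled_win_rates(votes, prior_strength=20.0):
    """Shrink each user's per-model win rate toward the global win rate.

    votes: iterable of (user_id, model, won) tuples, with won in {0, 1}.
    prior_strength: hypothetical knob; roughly how many "virtual" crowd
        votes are mixed into every user's estimate.
    Returns {user_id: {model: shrunk_win_rate}}.
    """
    global_wins = defaultdict(float)
    global_n = defaultdict(float)
    user_wins = defaultdict(lambda: defaultdict(float))
    user_n = defaultdict(lambda: defaultdict(float))

    for user, model, won in votes:
        global_wins[model] += won
        global_n[model] += 1
        user_wins[user][model] += won
        user_n[user][model] += 1

    result = {}
    for user in user_n:
        result[user] = {}
        for model in global_n:
            prior = global_wins[model] / global_n[model]  # crowd win rate
            n = user_n[user].get(model, 0.0)              # this user's votes
            w = user_wins[user].get(model, 0.0)           # this user's wins
            # With few votes the estimate stays near the crowd's rate;
            # with many votes the user's own data dominates.
            result[user][model] = (w + prior_strength * prior) / (n + prior_strength)
    return result


if __name__ == "__main__":
    example = [("anj", "model_a", 1)] * 3 + [("crowd", "model_a", 0)] * 10
    print(pooled_win_rates(example)["anj"])
```

A user with three votes ends up close to the crowd average, which is the variance reduction the speaker is asking for; a power user with hundreds of votes would move mostly on their own data.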
B
So before we go into infrastructure, one related note: these sorts of ML problems we are looking at are also related to recommendation systems in the early days, where people tried to figure out the cold start problem, right? You only have very few data points per user, but you are trying to make personalized recommendations for them, like Netflix.
E
Netflix.
B
Netflix, yeah, for movies. What do people like? And as we lean towards a more personalized world, where companies try to build AI products for every consumer that leverage all these user histories and prompts, the models have memory now. So there's quite a bit of new methodology that needs to be developed, in particular in this kind of evaluation context.
C
It seems like there are two or three emerging frontiers of AI progress, right? Two or three years ago, models were pretty simple, and the vast majority of questions people had about the quality and performance of the models were mostly about in-context learning, right? I give the model a couple of examples. How good is it at predicting the next token or word in that sequence? It was a pretty simplistic measure. Fast forward two, three years, and models have gotten extraordinary. Models clearly look more and more like systems. And one of the systems improvements that you've described is memory, right? Relative to five, six months ago, when most AI assistants like ChatGPT didn't have memory but now do, people are starting to notice a discernible verticalization of the model and the systems layer, right? So famously OpenAI has spent a ton of time post-training their latest model, 4.1 or 4.5 or whatever it was, with the assumption built in that the model has access to the user's memory and context, right? How do you solve the problem of evaluating a model where the lines are blurring between model, system, and application, where this is turning into a full-stack sort of product experience, relative to a model that, let's say, doesn't have all of those? Because two, three years ago the side-by-side taste test naively looked easier to do, because it looked like Coke versus Pepsi or whatever. Right now it looks like a dessert versus an entree versus whatever. I'm doing a terrible job of the analogies, but you get what I'm saying, right? ChatGPT today, for example, has memory. Claude doesn't. Right. These are two consumer apps that look very similar on the surface, but under the hood are fairly different. The implementations are diverging, and yet on arena they're evaluated side by side. Right. So what does that future look like?
How do you guys disentangle the fact that the stack is becoming more and more verticalized and integrated across model, system, interface, and application, while arena today is largely side-by-side evaluation of models that people are used to thinking of as basically symmetric systems?
B
Yeah, I think, again, evaluation will only become more challenging and more specific to your applications. Just like every software system needs its own CI/CD pipeline that's very different from the others, I think the same thing will happen to all the AI products as well. So our belief is that in order to collect data or evaluations that really mean something, that matter to us, to the app builder, or to the user, we have to build a real-world environment for everyone to test, to use, and to give us real feedback. And then it's a combination of challenges in ML, product design, and engineering infrastructure. Because ultimately, we are already serving millions of users, and we are going to serve tens of millions of users.
E
Right.
B
How can we design a product that people really love to use, and that at the same time gives us the most organic feedback we could collect for different kinds of users, including memory? So what if we have memory in arena? That kind of application tests the long-context capability of the model to reason about the past, and then potentially a RAG system to retrieve relevant information from the user's past history, in order to create more personalized content for users, or more personalized leaderboards that help them choose the best AI for their use cases.
C
When ChatGPT has memory built in but Claude doesn't, how would that actually work in production? When I show up to the site and I'm trying to evaluate these models side by side, both are serving the model via an API, does that mean on the arena side you have to recreate memory and then abstract that away as a shared service that all the models consume, how would that implementation work?
B
Yeah, so I think increasingly we are going to go beyond just a single model, to where the model has the capability to connect to different sources of information. You could say, like, context.
E
The search arena is one example of this.
B
The search arena we launched a couple months ago is basically an arena specifically to evaluate models that have Internet access or web data access.
D
Right.
B
And in that case the model is not just the model itself; it has to be in combination with other components. And the same thing happens with memory, right? You have another component which is retrieving relevant information from user history, and this history is actually richer, not just prompts: it has all the battles between different models, the comparison data, and the preferences users expressed. And then it could also be multimodal, right? It can be an image, it can be a video or a PDF.
D
Right.
B
People upload long documents, that kind of stuff. So with all these different contexts and different modalities of data, how can we leverage them in order to create a more personal experience and then evaluate them? That would be a very interesting challenge.
E
Yeah, I would say there's basically two ways that we're moving forward. The first is the platform is going to continue to evolve for sure. We're going to keep creating new arenas, we're going to keep improving the arena to integrate things like an artifacts component and things like memory and so on and so forth. And the second is integrations. At the end of the day, if someone wants to evaluate their app, we should be able to provide them a toolkit that integrates with our services to do that.
C
So if, let's say I'm building a code editor.
E
Yeah.
C
But I'd like to understand which one of the 17 models out there is best for my users.
E
Exactly.
C
What does that look like? I use an arena SDK?
E
Exactly.
B
Got it.
E
And that's exactly right.
C
And so what would that look like? My users would generate a bunch of interactions that then the Arena SDK is serving on the arena side to run side by side. Or is that eval actually happening in my app?
E
I think it can happen in context. So what you can do is you can have some kind of a gateway that allows people to access all sorts of different models, even maybe the ones that they didn't handpick themselves, but the cutting-edge ones that maybe they don't even have access to. Maybe they're even pre-release. Right. And then what we can do is, on our side, use all the experience that we've built on sampling data tools, training models, this huge data set that we've collected that has all these multi-provider comparisons, to do things like choose what the best model is for your users, and understand how all the different models perform and all the cost-benefit trade-offs, the Pareto curve of cost versus performance of different models. All that is stuff that we can instrument, and we can do it using in-context feedback. Let's say somebody says, hey, let's hook into a thumbs up, thumbs down button and pass that back to the arena SDK. Well, we can look at that, and using that information we can produce leaderboards for that organization. We're the experts in doing that. Right. We've been doing this for years. Things like Prompt-to-Leaderboard and various technologies. You know D3? Yeah. So we're building a project now that we call data-driven debugging, D3. It's a little farther out. It's coming in a couple months. But the fundamental premise of that is that pairwise comparison feedback is not the only kind of feedback that we can use to construct leaderboards. We can construct leaderboards with any form of feedback, and because of that we can hook in not just to pop up pairwise preference comparisons for whatever company, which is of course something that we can do. Instead, what if I want to rank code models in part on how many times the code is copied or accepted? Yeah, the code change is accepted. What's the edit distance between the code that the model produced and the code that the human ended up shipping?
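Editor's sketch: the edit-distance signal mentioned at the end of this turn could be computed roughly as follows. It uses the classic Levenshtein distance between the code a model produced and the code that was eventually shipped, normalized by length, and ranks models by their average. The helper names and the sample format are hypothetical, and nothing here reflects how D3 is actually implemented.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def rank_by_edit_distance(samples):
    """Rank models by mean normalized edit distance (lower is better).

    samples: iterable of (model, generated_code, shipped_code) tuples.
    Returns a list of (model, mean_distance) sorted best first.
    """
    totals, counts = {}, {}
    for model, generated, shipped in samples:
        denom = max(len(generated), len(shipped), 1)  # avoid dividing by 0
        totals[model] = totals.get(model, 0.0) + levenshtein(generated, shipped) / denom
        counts[model] = counts.get(model, 0) + 1
    means = {m: totals[m] / counts[m] for m in totals}
    return sorted(means.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    demo = [
        ("model_a", "x = 1", "x = 1"),  # shipped unchanged
        ("model_b", "x = 1", "y = 2"),  # human rewrote part of it
    ]
    print(rank_by_edit_distance(demo))
```

Character-level distance is the simplest choice; a token-level or line-level diff would likely be less noisy on real code, but that is a design decision the conversation does not specify.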
C
So that's interesting. You're saying we're moving from a world where the primary signal that is used to figure out whether to improve any model is very explicit: thumbs up, thumbs down, binary preference. You see a future where every interaction I have within a product, engagement, retention, down to a GUI interaction, can help tell the model what to improve.
E
Absolutely. So that's exactly the kind of stuff that we can loop into the methodology that we've been developing and generate useful feedback for people to continue improving their models. If you want to create code that people are going to use, make sure that people are using it and that the edit distance is low and that people accept your changes. Okay. If you want to build an agent like a Devin, that's going to be your software engineer, how many of these PRs end up getting merged? This is the sort of stuff that we're building the technology to give you very rich insights into. And I think by virtue of the fact that we're developing this new methodology, we have an edge in being able to provide people that kind of service.
C
You talked earlier about Prompt-to-Leaderboard. One of the things that surprised me when I looked at the repo, and it's an open source repo, is how well the model performed on arena. Can you walk through and recap what it is, and then what happened when you actually deployed it on arena?
E
So I'll get a little bit into technical detail here because I think it's cool. So Prompt-to-Leaderboard, what does it do? If you look at the Chatbot Arena leaderboard, it's Bradley-Terry coefficients. Prompt-to-Leaderboard is a technology that we built that allows you to take a prompt and then produce Bradley-Terry coefficients for every model that are specific to that prompt. The Bradley-Terry coefficients are a leaderboard. Higher is better. It means you're more likely to win a battle. So what's the natural next step from producing a leaderboard? Well, let's make a router. Anj asks me a question. I'm going to produce a leaderboard just for that question. And then how about I route his question to the model that's on the top of the leaderboard? It turns out that when you do this, when you train a Prompt-to-Leaderboard model, which is, let's say, a 7 billion parameter model, and then you use it to route Anj's questions on the arena and everybody's questions, that model does better than any of the constituent models that were used in the router by a pretty substantial margin. Now, here's another thing that's yet more interesting. Because the Bradley-Terry coefficients have a particular parametric form and a statistical meaning, you can use them in downstream optimization problems. So one example of an optimization problem for a router is: maximize performance subject to a cost constraint. So the router can be, for example, a randomized router that chooses between different models. It has a random policy that says: hey, Anj asked me a question. With 50% probability I'm going to route here. With 50% probability, I'm going to route there. And I'm going to do so in such a way that my average cost is $0.01, and I'm going to maximize my performance subject to that. Now, if you trace the best performance that any individual model in the router can give you as a function of cost, that's like 2x worse than the router.
In other words, the router is giving you double the bang for your buck in terms of performance per cost. If you want to achieve an arena score of 1280 using the router, it'll cost you half as much as it costs you to use any individual model. That's amazing. What it means is that you're taking advantage of the heterogeneity in performance of these models across different parts of prompt space in order to properly route them. And by virtue of the fact that it has the statistical interpretation, you can cost-constrain it too. And that's why Prompt-to-Leaderboard is interesting: we believe it's a fundamental first step towards addressing this routing problem in a principled way. And from our perspective, it's the right way to do routing if you want to do routing to maximize preference. Even internally at OpenAI, they're doing these A/B tests. Right. If you want to maximize the sort of feedback that you get there and the engagement, then you should be using a strategy like Prompt-to-Leaderboard. So our hope is that this sort of thing would make it easier for them to avoid the dropdown and that they can actually implement it in their own product. I'm sure that they have strategies of their own, but maybe this can be helpful to them too.
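Editor's sketch: two pieces of this explanation lend themselves to a small example. First, under the Bradley-Terry model, the probability that one model beats another is a logistic function of the difference in their coefficients. Second, a randomized router with a budget can be written as a tiny linear program: choose routing probabilities that maximize the average coefficient while keeping the average cost at or below the budget. With one budget constraint, an optimal policy never needs to mix more than two models, so the sketch simply enumerates single models and pairs. The objective (average coefficient), the function names, and the numbers are assumptions for illustration; the conversation does not specify the exact objective used in Prompt-to-Leaderboard.

```python
import math
from itertools import combinations


def win_probability(theta_a: float, theta_b: float) -> float:
    """Bradley-Terry probability that model A beats model B,
    given their coefficients on the natural (logit) scale."""
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))


def route_under_budget(scores, costs, budget):
    """Pick routing probabilities that maximize the expected score
    while keeping the expected cost per query at or below the budget.

    scores, costs: dicts keyed by model name (hypothetical inputs; in the
        setting described above the scores would be per-prompt
        Bradley-Terry coefficients).
    Returns (policy, expected_score), where policy maps model -> probability.
    """
    best_value, best_policy = -math.inf, None

    # Option 1: send every query to one model that fits the budget.
    for m in scores:
        if costs[m] <= budget and scores[m] > best_value:
            best_value, best_policy = scores[m], {m: 1.0}

    # Option 2: mix a cheaper and a pricier model so that the average
    # cost lands exactly on the budget.
    for a, b in combinations(list(scores), 2):
        cheap, pricey = (a, b) if costs[a] <= costs[b] else (b, a)
        if costs[cheap] < budget < costs[pricey]:
            p = (budget - costs[cheap]) / (costs[pricey] - costs[cheap])
            value = (1 - p) * scores[cheap] + p * scores[pricey]
            if value > best_value:
                best_value, best_policy = value, {cheap: 1 - p, pricey: p}

    if best_policy is None:
        raise ValueError("no model fits within the budget")
    return best_policy, best_value


if __name__ == "__main__":
    scores = {"small": 0.0, "large": 1.0}      # made-up coefficients
    costs = {"small": 0.005, "large": 0.015}   # made-up dollars per query
    print(route_under_budget(scores, costs, budget=0.01))
```

With these made-up numbers the policy splits traffic evenly between the two models at an average cost of one cent, which mirrors the "50% here, 50% there" example in the turn above.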
C
You said that the experience will look different over time. The arena experience will look different than ChatGPT. Let's talk a little bit about the roadmap. What are the biggest things that you guys are working on over the next few months? And then let's go longer term.
E
Yeah. Well, two that we've already mentioned are personalization and a leaderboard of users. Right. Can we first of all help people figure out which models they like best, and sort of lean into that experience, incentivizing them to give us better votes and come here for their personal leaderboards and their personal metrics, and then let them drill down really deep into that?
B
And in that case, we align the interests of individuals and the platform as a whole. Because you don't want to mess up your personal leaderboard. Just like how people these days when they use social media, they don't like a random post because if they do that then their feed will be messed up.
D
Right.
B
So it's like, oh, I will be more careful voting, I'll be more careful looking at all these different models' answers and so on, which we believe will collectively create an even higher quality arena.
E
Absolutely, yeah. And then on the note of a user leaderboard, can we value the data in such a way that allows people to know where they stand in terms of what kind of questions they're asking, how useful they are? We think people are going to love that. It's such a fun thing to be able to see that in terms of math, I'm asking the best questions. I would love that. I would love it if I was asking the best statistics questions in the world. And I think people will use that and think, hey, I want to be on the top. And so can we continue to align the incentives? And by the way, once we do that, it'll make the leaderboard much more valuable, because it'll mean that we start removing the noise from people that might be sort of, oh, I don't know what these buttons are, click. And instead people are giving really intentional, really high-taste votes, and we're identifying who those people are and maybe even being able to personalize so carefully that we can produce leaderboards for different types of people. That would be incredible.
B
And then the flip side of it is that we as a platform have more visibility into who those users are, and how we can even customize the distribution. That's something model developers, or developers at large, care about.
D
Right.
B
Say I want to test my AI or my system with developers in Japan, let's say.
D
Right.
B
And then can we have the ability to customize that kind of distribution to target what are the most meaningful distributions that reflect your use cases?
C
One of the things that you guys have been pretty vocal about is open source. I think from day one LM arena has open sourced prompts, votes, chunk of the data that's being generated on the platform.
B
I think we publish updates to the leaderboard probably every week, and all the code and infrastructure we use to process the data is published as open source, along with research blogs and papers. Including Prompt-to-Leaderboard: we published the paper and open sourced the models, the code, and everything. Because we believe that this is critical in terms of building trust with the community, and also for really building the foundation so that we can enable more and more value on top of it. So for adoption, for trust, and then for collaborations.
C
As you guys have made the transition from being a research project to now being a company, what are the most important values you guys want to create and hold at the company as the team and the project grow?
E
Absolutely. Well, we are very focused on neutrality, innovation, trust. We come from an academic background and, yeah, we want to maintain the culture that this is a community-focused project that's going to continue to grow. Yes, it's going to be a company. The company is going to support the project that we've already built and allow it to grow. Yes, it's going to continue to change. It's going to change for the better. We're going to keep improving it, we're going to keep publishing papers, we're going to keep releasing open source, we're going to keep releasing open data, right? That's all going to be part of our culture. And it goes both ways, because that's the way that you recruit the best. People don't want to hole up at a company and develop a bunch of proprietary technology that is never going to be released, that's just going to sort of stay in the annals of their nearest neighbors within the company, and they're the only ones that are going to know. We want the world to know what the best ways are of evaluating these models and accelerating the ecosystem. And releasing this data is also a big part of our trust. If people want to ask the question, hey, how are models performing on there? Why are they performing well? Go look at the data. That's what we did with Llama, right? When people had questions about Llama, we just released the data. Easy, right? Just go look. And we plan on doing things like this for the lifetime of our company, right? That's how we're going to recruit the best researchers that are going to help us develop the methodology. That's how we're going to develop the best engineers who care about the whole ecosystem, not just one company. And ultimately that's how we're going to develop the best products. That's how we're going to become central to the space. We already are, but the way we're going to cement it is by remaining open and neutral.
C
And how would you resolve the tension that often exists when there are people who are concerned that, as AI gets more and more prevalent, and as AI systems start being deployed in pretty mission-critical industries like defense, healthcare, and so on, there's an argument to be made that these systems should be closed source and evaluated in a fairly locked-down environment, as opposed to being openly tested in this manner, and that this is actually irresponsible? How do you think about that cultural tension?
E
Listen, I'm not an expert in national security, but I think an evaluation platform like ours has many different ways of being used. If they want to evaluate it publicly, they can. If they need a private deployment, we can probably also do that for them. It just depends on the sort of level of national security risk, which is way above my pay grade. But for any of these things, you're going to need sort of these subjective community driven evaluations, that's for sure. If things are going to be deployed in the real world, you're going to need real people testing them.
B
Yeah. And also, there's a point where you've developed the model and this model is going to be used broadly by the public. There has to be a phase of testing it.
D
Right.
B
And what we are building is meant to bridge this gap between the lab building something that's the latest frontier research and the world at large using it. You need an environment to test in, in the sense that it's a more controlled environment, with the people and the distribution you want to customize, where you want to understand the preferences. There's a need for a platform like this to exist, and we want to serve it.
C
Yeah. Could you talk a little bit about Red Team Arena?
B
Yeah. So for example, this real world testing idea of arena can be applied to many different applications like we discussed.
D
Right.
B
From chatbot to web dev to different modalities like image, that kind of stuff. And red teaming as well, because red teaming at its core is a bunch of people trying to jailbreak the models to see if the model is really faithfully following what it has been instructed to do or graded to do.
D
Right.
B
So these days many frontier labs are publishing a kind of model spec, that kind of idea: how models should behave in this way, in that way.
D
Right.
B
But how do you make sure the model follows those instructions? You need real-world testing again, you need red teaming, you need a group of people knowledgeable in this space to help.
D
Right.
B
So again, this can be community driven too, because there's a vibrant community of jailbreakers. They want to help. And they also test for fun as well. So in Red Team Arena we have a leaderboard not just for models, but for users, for jailbreakers: who are the best jailbreakers that can identify issues across all the different models? So the very idea of real-world testing still applies here, and we believe it can still deliver value to the ecosystem.
C
So is it fair to say that if I wanted to understand the security or the safety sort of risks in a model, I could go to Red Team arena and look at the evals that the models are generating over there. How does Red Team arena actually work in practice to improve the security and reliability of these models?
B
Yeah, for sure. So I think it's the same as how we understand chatbot, web dev, that kind of thing. There will be many different applications people are trying to build on top of the models: customer service, or retrieval systems, that kind of stuff, right? You want the model to behave in a certain way and you want control.
E
Right.
B
And then in Red Team Arena the idea would be, why don't we build an environment to simulate those applications? So for example, can we build an environment to simulate customer service, where the model is instructed to not take certain actions, and then you as a jailbreaker are trying to break the model? Then the kind of signals that we would be getting from real-world testing and jailbreaking would be reflective of the particular use cases that people care about.
E
By the way, Red Team Arena right now is still a little bit of a prototype. We're continuing to work on it. But it's interesting to see: the model that refuses the most to answer the queries that people ask is not necessarily better. Some people want a model that's more controllable, some want a model that's going to say whatever they want. Some people want a model that's going to be completely safe, and you can use it PG-13 or rated G. That's okay as long as people have the choice.
C
So as we start to wrap up here, one question that a lot of people ask is what does the world look like, especially the world of evaluation and testing as we go from a pre training world to a post training world in a world of models to agents.
D
Right.
C
In some sense it seems like you guys were actually a little bit ahead of the curve where arena has always been an environment for agents more than a set of static. So as people start, as agents get better at long horizon tasks and tool calling and so on this future where a ton of work in the economy is done largely by fully end to end automated systems, does arena have to change in any fundamental way for that future or does it largely look the same?
E
Yeah, I think, as we've been talking about, what's fundamental is organic, real-world testing with feedback. That's not going to change. I can tell you that is not going to change. Will we have to adapt the UI? Yes. Will we have to improve the product? Yes. Will we have to launch new products for evaluation? Yes. Will we have to develop new methodology? Yes. Do the fundamentals change? I think no. I think the reality is, if you want to test your model for real-world use, you have to subject it to real-world use and collect feedback from real-world use, and that's it. So we're really excited about what the future has to hold there. We don't actually even know ourselves where the product is going to evolve over the next five to 10 years.
D
Right.
E
The ecosystem is moving so quickly, but wherever it goes, we're excited to follow.
D
Yep.
C
Awesome. Thanks guys.
B
Thank you.
E
Thank you.
A
If you made it this far, thanks so much for listening until the very end, and keep listening in the weeks to come as we have some great discussions lined up. Finally, if you enjoyed this discussion or anything else you've heard on this podcast, please do share it far and wide and rate the show on Apple Podcasts.
Date: May 30, 2025
Guests: Anastasios N. Angelopoulos, Wei-Lin Chiang, Ion Stoica (LMArena founders)
Host: Anjney Midha (a16z General Partner)
This episode dives deep into the journey and mission of LMArena—a platform that began as a UC Berkeley project and is now a company seeking to transform how AI models are evaluated. Rather than relying on static benchmarks, LMArena champions large-scale, real-time, community-driven testing to make AI systems more reliable, accountable, and suited to real-world applications. The founders share insights on the evolution of evaluation, the importance of subjective human preferences, new technical methods, and their commitment to neutrality and open source.
[01:26–04:54]
“The future is about real-time evaluation... in the wild.” — Anastasios N. Angelopoulos [01:26]
[02:26–06:54]
“In fact, one of the things that we help everybody do is pre-release testing of their models.” — Anastasios N. Angelopoulos [05:18]
[06:54–14:52]
“What we’re building is an ever richer evaluation that can tell us all the factors that go into response, how you can optimize people's preferences, keeping style fixed.” — Anastasios N. Angelopoulos [13:44]
[19:29–21:24]
“Chatbot arena is immune from overfitting by design. Means you're always getting fresh questions.” — Anastasios N. Angelopoulos [21:24]
[15:00–24:54]
“Programming is actually a very general-purpose discipline...and yet it seems the capabilities in a very general way are still being...captured well on a specialist arena like web dev arena.” — Anjney Midha (Host) [22:09]
[26:03–28:41]
“It should be personalized just for you. You should understand which models are best for you.” — Anastasios N. Angelopoulos [26:25]
[30:06–43:47]
“The inspiration was how humans in real life rate players or teams... head to head. That’s how we adopted...battle mode.” — Ion Stoica [37:08]
“It was started as a fun project.” — Wei-Lin Chiang [37:18]
[47:40–55:34]
“Prompt to leaderboard... you give me your prompt. Can we tell you which models are best for that prompt specifically?” — Anastasios N. Angelopoulos [52:25]
[58:43–63:40]
“People still think about Arena as a benchmark... But what hasn’t permeated is when you have fresh data, you can’t overfit.” — Anastasios N. Angelopoulos [61:14]
[68:43–82:11]
“You need something similar [to CI/CD] for these [AI] models right now...you also want to test your models, your checkpoints on Chatbot Arena for all the reasons we mentioned.” — Ion Stoica [68:43]
[95:10–99:45]
“In Red Team arena we have a leaderboard. Not just for model, but for jailbreaker — who is best at identifying issues.” — Wei-Lin Chiang [97:21]
[91:34–94:26]
“If people want to ask the question, ‘Hey, how are models performing?’ ... Just go look at the data. That’s what we did with Llama—we just released the data.” — Anastasios N. Angelopoulos [92:49]
On the end of “hard exams” era:
"Static exams were useful three years ago. The future is about real-time evaluation, real-time systems, real-time testing in the wild." — Anastasios N. Angelopoulos [01:26]
On democratizing expertise:
"Everybody actually has their own opinions and... there’s so many natural experts in the world... Their vote means so much." — Anastasios N. Angelopoulos [07:57]
On subjectivity & bias:
"People vote for longer responses preferentially. Can we learn this bias and actually adjust for it? The answer is yes. That’s why we're making style control default." — Anastasios N. Angelopoulos [13:44]
On why fresh data is disruptive:
"Overfitting means you’re doing well on the test data but only on the train data. There cannot be overfitting [in Arena] because we have continuously fresh data." — Ion Stoica [63:40]
On the vision for company & platform:
"Neutrality, innovation, trust... We want the world to know what the best ways are of evaluating these models and accelerating the ecosystem." — Anastasios N. Angelopoulos [92:49]
On open source & transparency:
"If people want to ask the question, hey, how are models performing on there? Why are they performing well? Go look at the data. That's what we did with Llama, right? Just go look." — Anastasios N. Angelopoulos [92:49]
LMArena’s journey is a case study in how large-scale, user-driven, transparent testing is reshaping the definition of “good” in AI—moving from abstract, expert-defined benchmarks to immediate, empirical, and subjective measures shaped by the entire community. Their commitment to openness, neutrality, and enabling both the industry and individual users signals a foundational shift in both the form and substance of AI progress.