
A
I think that we are at the threshold of a mathematical renaissance, which is to realize that there are so many unsolved problems that currently take, say, researchers months to crack, and that we believe are not out of reach for today's AI technology. The world is realizing that we need verification. Like, you know, we can have very cool Lovable websites, but how do I vibe code a nuclear reactor? Our worldview is that math reasoning is the true reasoning layer of AGI. From math you kind of get to code, and from code you can run a lot of real-world experiments in the software stack. The beautiful vision is for all the theoretical problems to be resolved in a satisfactory way.
B
Hi, I'm Matt Turck from FirstMark. Welcome to the MAD Podcast. Today my guest is Carina Hong, founder and CEO of Axiom. Carina is a 24-year-old Rhodes Scholar who's had an incredible personal journey, from competitive math Olympian in China to MIT, Oxford and Stanford Law. She started Axiom less than a year ago to build an AI mathematician, a system designed to be 100% correct 100% of the time, effectively solving the AI hallucination problem. We talked about how Axiom's AI aced the notoriously difficult Putnam math exam and then went on to autonomously solve four open research conjectures, why verification is becoming the missing layer in AI, how Axiom works under the hood, and what this could all mean for the future of AI. Please enjoy this great, high-signal conversation with Carina Hong. Hey Carina, welcome.
A
Hi, great to meet you.
B
So you are building an AI mathematician. Why does the world need an AI mathematician?
A
Yeah, so I think the idea where you have an infinite number of mathematical reasoning agents going out into industrial society to solve all the theoretical problems, I think that's incredibly compelling. I think through solving math we also realize that it can solve a lot of other problems, such as verification, such as optimization. I think that math is great. And if you solve math, you can have physics, you can have a lot of logic and properties, and you can extend to a lot of things. The world needs more math.
B
Great. And what kind of math are we talking about? Is that, is that high school math, is that competitive math or is that deep research math?
A
Yeah, I think people generally start with competitive math because, you know, you have a known solution, and then you kind of start hill climbing the infinitely high mountain of math. There are actually two axes of difficulty. One is how creative the solution is, and the other one, roughly speaking, is how abstract the mathematical object is. So say a qualifying exam can be incredibly abstract, but the creativity required to solve each problem might not be that high, might be very standard. On the other hand, an IMO problem is very easy to understand, even by high school students. Not very abstract, but it's incredibly creative.
B
You guys are ultimately a young startup, but you've already had incredible success. So let's talk about the Putnam last year first and maybe define what the Putnam is for people that may not be in the math world.
A
Yeah, 100%. So we started out in July, mid-July, and the Putnam was in December, so we were like a four-month-old startup. We were kind of looking at the Putnam as this really hard math competition, and most of the people actually got zero. So over 50% of humans got zero. And I think over the 100-year history of the Putnam there's only five human perfect scores, and you have six hours
B
to do it, and 12 questions, is that right?
A
That's right, you have 12 questions, you have two three-hour sessions, morning and afternoon. So yeah, that's the setup. I think it was a Saturday, it was December 6th, and we all kind of gathered at the Axiom office and decided to put Axiom Prover to the real-time test. So it's not a benchmark. You know, we got the exam from the proctor of the Putnam exam, and then we just basically threw it to the prover, and we announced that we got a perfect score.
B
So 12 out of 12.
A
That's right. Eight within the time limit and then 12 out of 12.
B
And then the more recent challenge that you guys solved, I think you solved four challenges. Maybe talk to that, that just happened.
A
Yeah, that's right. So I think that's because we have a lot of mathematician friends, and they all have a lot of really hard research conjectures. So for example, Professor L, and he has like four failed conjectures is the last one still standing. And he's an Israeli professor at the Technion. And there is also Dawei Chen, a Boston College professor who's an algebraic geometer, and he has actually known Professor Ken Ono, who's our founding mathematician, for four years, and they recently met at the Joint Mathematics Meetings conference. So he also supplied a problem. So people started sending problems to us, and then we just put the system to the test. And recently, I think a couple of weeks ago, we announced that Axiom Prover solved these four research-level open problems. And it's quite interesting, because it's probably the first AI to solve a research conjecture completely end to end and self-verified. That means the outputs are fully verified, 100% correct.
B
And that was without human intervention.
A
Without human intervention, that's right.
B
Okay. My very uneducated understanding of world-class math is that a lot of the top mathematicians, today and in history, ultimately operate through a combination of sheer IQ, deep knowledge and all those things, a lot of intuition and a little bit of serendipity. Where does that fit?
A
Yeah, I think there are kind of two parts. One is that there are some really important mathematical breakthroughs that happen once in a decade, or perhaps once in a decade for each domain. Those require very interesting eureka moments, deep intuitions, and almost lucky kinds of moments. And then there are a lot of other questions that are about proficiently, routinely applying the standard bag of tricks. And I think that a lot of the research questions can be solved by a combination of both. So a little bit of intuition, and a little bit of just going one step at a time. I think that we are at the threshold of a mathematical renaissance, which is to realize that there are so many unsolved problems that currently take, say, researchers months to crack, or even technical lemmas in those really long-standing conjectures, that we believe are not out of reach for today's AI technology, but only through, I think, very intricate system design and a hybrid use of different methods. And that's what Axiom is trying to do. These are the first batch. We hope to have a lot more coming. We actually have a few more research conjectures being proven every week, just from the supply of problems from mathematicians around the world. And we try to put those problems to use.
B
Fascinating. And is the system solving problems in a predictable way? Where I'm going with this is the whole Move 37 discussion where you find AI solving problems in almost alien kind of ways. Is that part of what you're doing? Is that what you're seeing? Or is that something that's coming up in the future?
A
It's interesting because there's, I think, a hindsight problem. So it's like, we didn't know how to do it. I definitely had no hope of solving those conjectures. And our founding mathematician, Professor Ono, didn't know how to do it either. Okay, now we've seen the Lean code, right? Like thousands of lines of Lean code, and we read through it, understand it, and maybe we see some of the techniques as sort of standard. But I think the application of them and the combination of them is also not entirely standard. I think it's somewhat at the level of a junior math researcher, say, you know, a postdoc or a junior professor. Obviously a lot of junior researchers do amazing work, and they have their Move 37 moments on some very long-standing open questions. But there is a lot of day-to-day research, and it feels like it's at that level. There's also this question of, because it's solving the problem in Lean, which is a kind of machine language, not a human natural language, the proofs actually look quite different. So we analyzed all 12 problem solutions from the Putnam exam and found that a lot of the solutions actually differ from the human solution. Because it is a Lean-based system, it is really good at routine bookkeeping, and it will actually choose a lot of the more mechanistic arguments over the ones that require a clever, say, one-picture solution. And on the other hand, you know, there is a lot of casework that humans shy away from that is just very easy for the machine. So what's difficult for humans versus AI might be slightly different.
B
So you mentioned Lean a second ago, and I guess that's going to take us a little bit into how the product and the model work. Maybe for people, again, who are not in the math world: what is Lean?
A
So Lean is a programming language for math proofs. I think that's kind of the one-line, high-level explanation. There's this concept called the Curry-Howard correspondence, which basically allows you to code math up as computer programs. And so Lean is similar to Python, but it can also serve as a kind of self-verifying function. So in the computer science analogy it's roughly both the C language and the GCC compiler, two in one. It's a formal language. There were a lot of other theorem-proving languages before Lean, such as Isabelle, such as Coq, now called Rocq (R-O-C-Q), and such as HOL, et cetera. So there is this family of formal languages, and Lean is one of them, and it's a very popular language. There are a lot of mathematicians around the world who use Lean, who choose to code their proofs up in Lean, and they can just run it, and then they will see a check mark, which shows that it's a logically correct proof. And if there's an error message, maybe there is a bug somewhere, or there is some sort of syntax error or, you know, type mismatch, just like in any other programming language. The fun fact is you can actually use Lean as a functional programming language. You can write an autograd in Lean, for example. And that's very interesting because basically it allows you to do both math and code at the same time. So if you think about a security protocol, you can implement the code in Lean but also prove its soundness in Lean. So it's a very flexible, adaptive language.
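Editor's note: a minimal Lean 4 sketch of the two points made above, that a proof is a program checked by the compiler (Curry-Howard) and that the same language can hold both code and a proof about that code. The names `Sketch`, `and_swap`, `double` and `double_eq` are illustrative and do not come from the conversation; the sketch assumes a recent Lean 4 toolchain where the `omega` tactic is available without extra libraries.

```lean
namespace Sketch

-- A proposition is a type; a proof is a program of that type.
-- This "program" takes a proof of p ∧ q and returns a proof of q ∧ p.
theorem and_swap (p q : Prop) (h : p ∧ q) : q ∧ p :=
  ⟨h.2, h.1⟩

-- Lean is also an ordinary functional programming language.
def double (n : Nat) : Nat := n + n

-- The same file can state and check a property of that code.
theorem double_eq (n : Nat) : double n = 2 * n := by
  unfold double
  omega

end Sketch

#eval Sketch.double 21  -- 42
```

If either proof were wrong, Lean would report an error instead of accepting the file, which is the "check mark" behaviour described in the answer.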
B
Great. People may have heard of both OpenAI and Google DeepMind sort of winning IMO the International Math Olympiad and other very hard to crack kind of like math problems. How does their approach differ from what it is that you guys are doing?
A
Yeah, I think the concept of formal theorem proving, like automated theorem proving as a field, actually existed before deep learning. So I think there were a lot of researchers, a lot of them in Europe, in 2018 and even before then, who were doing automated reasoning without the LLM component. And we actually have some of these people on our team, like the authors of ATPboost. And 2019 was a very interesting time. François Charton and Guillaume Lample, the co-founder of Mistral, had a paper which tried to put a transformer on symbolic integration and realized that it can beat computer algebra systems such as Matlab or Mathematica. I think Ilya was actually the reviewer of the paper, and he actually tweeted about it. There is this Fields Medalist, Tim Gowers, who said it was either amusing or game changing, because it was an open review and people didn't know if it was correct or not. And that was not amusing. That was the beginning of AI for math. And François is now also at Axiom. So there is a long history of what people have been trying to do with it. And I think Google started the AlphaGeometry effort in 2021, and that was a very exciting effort. They realized that if you convert the figures, the lines, triangles, circles, intersection points, into symbolic expressions in the vector language, a domain-specific language for Euclidean geometry, you can actually do those geometry problems a lot, a lot more easily using machines. And that's very interesting because it's just like a throwback to my childhood, when I was doing math Olympiad and I could never solve one Euclidean geometry problem. I don't know what's wrong with my brain. It's usually the easiest problem of every competition. So it's like, if you go to a math competition and you don't solve the geometry one, then obviously you can't solve the inequality one, and obviously you cannot solve the number theory or the combinatorics, like, Holy Grail. 
That's the one problem you must know how to solve. And I didn't know how to solve it. And I remember my teacher, the coach, taught me how to do the complex coordinate method, which is a very tedious way of converting everything that shows up in the figure into complex coordinates and just basically manipulating those algebraic expressions. And through that I could solve it. I would solve it a lot slower than other people, but at least I would solve it. But it's a very interesting philosophical point, which is that you can convert, you know, geometric figures into algebraic expressions. And I think that's what they did. I mean, not exactly the human version, but AlphaGeometry. And then that led to AlphaProof in 2024. I think Google got a silver medal at the IMO, missing the gold medal by only one point. That was my moment. That was my moment of IMO at least. And they couldn't solve the two combinatorics problems. And in 2025, no one solved the one combinatorics problem either. So in 2025 there was only one combinatorics problem. And I think that was kind of the history of things. There are also other players in the field, and also a lot of really great academic labs doing it. We kind of take the approach of, you know, we think that it's important for the system to be able to reason both informally and formally, and in a way bridge across these different abstractions, from high-level intuitions to low-level, more Lean-like formal checking.
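Editor's note: the complex-coordinate trick mentioned above can be sketched in a few lines of Python. This is a toy illustration of turning a geometric fact into algebra, not how AlphaGeometry or Axiom's system works. It checks Thales' theorem numerically: for a point C on the circle with diameter AB, the angle ACB is a right angle, which in complex coordinates means the ratio (a - c) / (b - c) is purely imaginary. The function name and the sample points are assumptions made for the example.

```python
import cmath


def is_right_angle_at(c: complex, a: complex, b: complex, tol: float = 1e-9) -> bool:
    """Angle ACB is a right angle iff (a - c) / (b - c) is purely imaginary."""
    ratio = (a - c) / (b - c)
    return abs(ratio.real) < tol and abs(ratio.imag) > tol


# Endpoints of a diameter of the unit circle.
a, b = complex(-1, 0), complex(1, 0)

# Sample points on the unit circle, away from the endpoints a and b.
points_on_circle = [cmath.exp(1j * theta) for theta in (0.3, 1.0, 2.5, -1.2)]
thales_holds = all(is_right_angle_at(c, a, b) for c in points_on_circle)

# A point that is not on the circle should fail the same algebraic test.
off_circle_is_right = is_right_angle_at(complex(0.5, 0.5), a, b)

print(thales_holds, off_circle_is_right)
```

A numerical check like this is only evidence for sampled points; a formal system would prove the algebraic identity once for every point on the circle.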
B
What does that mean formally versus informally?
A
Yeah, so informal is, say, reasoning in natural language, in English. And mathematicians, quite fascinatingly, have been doing reasoning in English, I mean, for thousands of years. I mean, they write formulas, sure, but mostly they write arguments, proofs, in English. And I think to us, math is code in a way. Mathematicians have been coding in English for centuries, for thousands of years. The formal language means Lean, and the output will be Lean machine code, which wouldn't be super readable to humans. But I think there are two beautiful things going on here. One is that this is the first time formal proving comes in to assist mathematicians in what is traditionally an informal reasoning subject. The second thing that's interesting is you can play the strings of both informal reasoning and formal reasoning. So you can bridge across these different levels of abstraction. And there is autoformalization, which is the capability of converting natural language reasoning to, say, the formal language. And that's harder than translation, because it's different from, say, translating between two programming languages. You're translating something that cannot be verified, natural language, into something that can be. And that direction is obviously very challenging but also very promising. There's also auto-informalization, which is to translate back, I mean from Lean to English. That's easier than autoformalization because most of the AI models have seen a lot more English than Lean.
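Editor's note: a small Lean 4 illustration of what autoformalization produces, assuming a recent toolchain with the `omega` tactic. The informal sentence, the theorem name `even_add_even`, and the choice to spell out "even" as "equal to 2 * k for some k" are all assumptions made for this example; a different formalizer could reasonably pick a different encoding.

```lean
namespace Sketch

-- Informal statement: "The sum of two even natural numbers is even."
-- One possible formal statement and proof:
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ m, b = 2 * m) :
    ∃ n, a + b = 2 * n :=
  match ha, hb with
  | ⟨k, hk⟩, ⟨m, hm⟩ => ⟨k + m, by omega⟩

end Sketch
```

The English sentence cannot be checked by a machine; the Lean statement can, which is the asymmetry described above. Going back from the Lean text to the English sentence is the auto-informalization direction.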
B
And just to make sure I understand. So are we saying that the OpenAI and Google DeepMind approach would fall into the informal, correct? Whereas you'd be falling into the formal?
A
Google was doing, I think, formal. As of the previous year's IMO, 2024, the AlphaProof system was a formal system.
B
So it's English and my words, not yours. But like more of a brute force kind of approach versus what you do is more neuro symbolic. Is that accurate?
A
That's how I would describe it. I think first of all, we're supportive of scaling. We think scaling works in a lot of scenarios. There's also this question of sample efficiency, which is, you know, how effective scaling potentially is. And I think the informal way to solve mathematics requires a vast amount of training data. You basically throw everything you can possibly find on the Internet at it. Now my question is, what if you also throw this vast amount of math text data at your AI, but you throw the Lean version of it as well? You know, we believe in doing things at big scale, and an Internet-scale dataset of Lean, in addition to the Internet-scale dataset of math, is going to be quite interesting. And I think that we shouldn't do pre-training. We shouldn't try to just train from scratch. I think focusing on post-training reinforcement learning can potentially get us better performance gain.
B
And to that exact point, help us contrast and compare. So RLVR, which is reinforcement learning with verifiable rewards, versus what it is that you do. Are those just completely different approaches? Because they both aim at the same thing, which is to basically get to perfection.
A
Yeah, I think the world is realizing that we need verification. Verification means very different things in math. I think in early 2025 or late 2024, it meant the numerical answer associated with each problem. Now the thing is, one issue is reward hacking. We have seen from, say, FrontierMath and other benchmarks which only require a numerical answer that the score doesn't necessarily reflect the model's capability in logical reasoning. So it's able to get to the answer without reasoning through it, which is quite fascinating. I mean, when I did math Olympiad before, there was always this classmate who was really good at guessing the answer. I don't like AIME, which is this exam where all the answers are between 000 and 999. Like, I remember there was one year where my friend told me that, you know, he just basically guessed three questions correctly, versus the rest of us who needed to reason it through. And like, Jesus Christ, he just put a zero in there and then somehow that answer was indeed zero. It sounds very unfair. And you know, in a way, in high school the teacher will ask you to show your work. So for a while I think verifiable reward meant that final numerical output. I think that people are now realizing it doesn't scale to math AGI, however you define it. Most adult mathematics, mathematical research, is proof based and requires step-by-step rigorous deduction based on logical reasoning. And a lot of the problems don't even have a numerical answer. Like, a lot of the problems are: prove that something exists, prove that something cannot exceed a certain value. You know, very seldom, beyond the math Olympiad, high schoolers' context, would you have a math question that is just about getting to the answer at the end of it. And it's a lot more difficult to get a verification reward for the intermediate steps, right? 
And so if you want to have a reasoning engine that really, truly masters logic and mathematical reasoning, then you need to somehow get a verifiable reward for the proof steps. Coding is great. I mean, people have seen that RL in coding has incredible gains. So can we turn math into code? Lean, which we just talked about, with the Curry-Howard correspondence, does exactly that: it turns a proof into a computer program. So that makes RLVR possible in our setup as well.
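Editor's note: a toy Python sketch of the distinction drawn above between rewarding only the final numerical answer and rewarding a derivation whose every step is checked. It is an illustration under stated assumptions, not Axiom's training setup: the "verifier" here just re-evaluates small integer expressions, standing in for the role a proof checker such as Lean plays, and all names (`evaluate`, `answer_only_reward`, `step_verified_reward`, the sample derivations) are hypothetical.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def evaluate(expr: str) -> int:
    """Safely evaluate an integer expression built from +, -, * and parentheses."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))


def answer_only_reward(steps: list[str], expected: int) -> float:
    """Reward 1.0 if the last line equals the expected answer; steps are ignored."""
    return 1.0 if evaluate(steps[-1]) == expected else 0.0


def step_verified_reward(problem: str, steps: list[str], expected: int) -> float:
    """Reward 1.0 only if every step preserves the value of the original problem."""
    values = [evaluate(s) for s in [problem] + steps]
    every_step_valid = all(v == values[0] for v in values)
    return 1.0 if every_step_valid and values[-1] == expected else 0.0


problem, expected = "(3+5)*2", 16
honest = ["8*2", "16"]    # each line follows from the previous one
lucky = ["3+10", "16"]    # invalid middle step, correct final number

print(answer_only_reward(honest, expected), answer_only_reward(lucky, expected))
print(step_verified_reward(problem, honest, expected),
      step_verified_reward(problem, lucky, expected))
```

The answer-only reward cannot tell the two derivations apart, while the step-verified reward rejects the one with the invalid middle step. That is the "show your work" point from the answer, in miniature.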
B
And just to drive it home for people, in case that's not obvious by now, what we're talking about is building an AI that is 100% correct 100% of the time. Right. So completely solving the hallucination problem, or the stochastic issue, and making AI perfect, the perfect prover. How generalizable do you think the approach that you guys are using is?
A
I think if you talk about perfect AI, people's first reaction is, wow, that's really valuable. I think a lot of the different labs are trying to reduce hallucination or increase accuracy in many, many different ways. There are a lot of industries where mistakes are extremely costly, and that's a blocker to AI deployment if you don't have that sort of provable guarantee. So that's the value of, say, catching the edge cases. And there is this additional value of trusting that your edge cases are covered. So two additional layers of value from reliable, consistently correct AI. In terms of how general this is, I think we start with math. Our worldview is that math reasoning is the true reasoning layer of AGI. And I think a lot of the labs share that view, labs across the US, China, Europe. And from math you kind of get to code. Math gives you proof of properties, and code gives you output. The output and the properties associated with it are two quite important parts of the digital world. And so from math you go to code, and from code you can run a lot of real-world experiments in the software stack, and then you can have a lot of other things. We don't claim to be doing things in the physical world at all. We are obviously not doing things that are non-verifiable. Say, just like mathematicians are stereotypically not the best writers, we are not building an AI that's very good at literature. But I think in a lot of fields, from math and code verification to verification applied to many different domains, it's incredibly valuable.
B
And do you need a Lean equivalent for each one of those domains as you expand?
A
That's a very interesting question. I think so. As you can see, even within math, right, sometimes the creation of a domain-specific language, like the vector language for Euclidean geometry, has its gains. It could be the case that in other domains something that is not exactly the abstraction of Lean is the right medium. But in that case you can do code translation, and you can build out the stack that's required to use your Lean-based theorem-proving engine. I think the gap between, say, Lean and another strongly typed language like Rust is a lot smaller than the gap between Lean and English. And I think that's where a lot of the commercial value is. I mean, if you can reason in between informal and formal space, that I think is going to unlock a lot of things beyond just the power of a formal theorem prover.
B
Yeah. And do you have a sense for where that threshold is? So if you have Math on one side and English on the other side. Effectively with your approach you're going to be able to cover kind of like all of science. And the second you start getting into non scientific fields, then the approach doesn't work anymore or you don't know yet and you're about to explore.
A
I think we want to try to figure out what the things are that can be done in software. One is, I think math and code really complement each other very well. I mean, there are a lot of great code generation companies, and we can provide provable guarantees and code verification. There are a lot of other domains where just that verified generation capability is incredibly valuable, like hardware. And then I think if you can have a lot of theory and you can have partners who are really good at real-world testing, then that is AI for science. And I think that's also incredibly promising. I feel like this is a generational effort. It's going to take a long time. The DNA of the company is going to remain math, and we're going to see what the best first market is, maybe verification, and the best second market, I don't know what that is. Could be optimization. A lot of things, I think, are waiting to be explored, but just the generation-verification loop itself, I think, is going to have a large TAM. And then, you know, there are a lot of things that we're also learning together with potential customers.
B
Yes, because you're also not even a year-old company. What, 10 months old?
A
Seven months old.
B
Seven months old. Okay, amazing. Before we go further, you mentioned your background a couple of times in passing and as I was prepping for this, it's just so fascinating. I want to spend a few minutes talking about it. You covered it in some other podcast, but I think the story is just amazing. So taking it from the top. So you grew up in China and you were a competitive math kid. Just tell us that story.
A
Competitive math kid is such a great term because you can break it in different ways. Competitive-math kid, or competitive math-kid.
B
So which one was it?
A
Both. Well, I think I like to win.
B
So walk us through. What is that experience and how formative was this?
A
It was extremely formative. I think of years of math Olympiad training. You have one goal, which is to score as high as possible on whatever the next math competition is. You have people in the same community circle who are also doing the same thing. You are friends with them, and you're competitors with them. There is a lot of background reading and learning exercises you need to do to over-prepare for every competition. I remember I did 75 exam papers to prepare for a competition that I didn't know if I would be selected for. And I didn't end up being selected for it.
B
How old were you in high school?
A
I think I was like 14, 15. I think I learned a few things. One is resilience. It's just like, I think you get addicted to pain and suffering, so that the word resilient is almost a paradox, because it's like you like it. Failure is a given. I think throughout that time, I mean, the exams just keep getting harder and the number of people competing keeps shrinking. Like in elementary school, I had like 1,000 friends competing for, you know, the spots for the middle school. Then middle school is like 90. Okay. And then high school, 25. That means the vast, vast majority of your friends lost the opportunity to compete. And that's a very interesting thing, I think, that it does to a child. But then I also learned about, you know, the other side, not just the math Olympiad. When I was, I think, 14, 15, I got into the Ross Mathematics Program, which is one of, I think, the best high school math camps in the States, at Ohio State University. I think my first trip to the United States was that summer. It was, you know, a summer of eternal joy. Every day I would be learning cool research math. They taught us undergrad math, and they asked us to deduce everything from the ground up. We were asked to prove that zero times anything is zero from a limited number of axioms. And that was very defining. It felt very different from math competition. It's not about how many people will win that award. It's that there is a vast amount of math that you just have no idea about, and you get to build it yourself, almost one brick after another. From the limited number of theorems you are provided, you prove new things. So we were given about 25, 30 problem sets, and each of the problem sets has problems that are probably just bookwork theorems that you would just learn in college. But instead of presenting it as something that is a given fact, it asks us to prove it. So our world of mathematical knowledge is constrained to how much we can prove. 
And that's actually what's going on right now with Axiom Prover. Axiom Prover obviously has access to a lot of the world's information, but because the Lean data is so scarce, a lot less than, say, the amount of code data out there, it's only a two-digit number of millions of tokens out there in the open world. Axiom Prover learns to prove things, and it kind of self-improves in a way where all the things that it has proved get fed back into it, into a kind of skill library that can be applied to the next challenge. There's also this self-challenging, conjecturing component that keeps giving it harder problems, just like my camp counselor gave me the problem sets. And this is a very beautiful process. I think without this right sequential order of introduction, I wouldn't have gone this far in math or loved math as much as I do. I think curiosity and discovery are a basic human need, and that definitely existed in, like, back then, the tender child, and that was used to motivate and inspire mathematical learning. I think that was a very beautiful process.
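Editor's note: the "zero times anything is zero" exercise described above can be written out in Lean 4 as a worked example of building a fact from a small set of axioms. The structure `MiniRing` and its field names are assumptions made for this sketch (the source does not list the axioms the Ross program provided), and the proof uses no libraries beyond core Lean.

```lean
-- A few ring-style axioms; only these may be used in the proof.
class MiniRing (R : Type) where
  zero : R
  add : R → R → R
  mul : R → R → R
  neg : R → R
  add_assoc : ∀ a b c : R, add (add a b) c = add a (add b c)
  add_zero : ∀ a : R, add a zero = a
  add_neg : ∀ a : R, add a (neg a) = zero
  right_distrib : ∀ a b c : R, mul (add a b) c = add (mul a c) (mul b c)

namespace MiniRing

theorem zero_mul {R : Type} [MiniRing R] (a : R) : mul (zero : R) a = zero := by
  -- Step 1: 0 * a = (0 + 0) * a = 0 * a + 0 * a.
  have h : mul (zero : R) a = add (mul zero a) (mul zero a) := by
    calc mul (zero : R) a = mul (add zero zero) a := by rw [add_zero]
      _ = add (mul zero a) (mul zero a) := right_distrib zero zero a
  -- Step 2: cancel one copy of 0 * a using its additive inverse.
  calc mul (zero : R) a = add (mul zero a) zero := (add_zero (mul zero a)).symm
    _ = add (mul zero a) (add (mul zero a) (neg (mul zero a))) := by rw [add_neg]
    _ = add (add (mul zero a) (mul zero a)) (neg (mul zero a)) :=
        (add_assoc (mul zero a) (mul zero a) (neg (mul zero a))).symm
    _ = add (mul zero a) (neg (mul zero a)) := by rw [← h]
    _ = zero := add_neg (mul zero a)

end MiniRing
```

Once `zero_mul` is checked it can be reused in later proofs, which mirrors the brick-by-brick process and the skill library idea described in the answer.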
B
Amazing. And then on some other incredible things that you've done. So you did MIT in 3 years I believe then you went to the UK on a Rhodes scholarship to study neuroscience. Why neuroscience? Was, was it all part of a grand plan towards AI or was it just your interest naturally carried you?
A
Yeah, I think the Rhodes program did a really good job, and probably too good a job, of encouraging us to just shift direction. It has this broad belief that you need a lot of disciplines and studies to help you become a global leader. That's what the Rhodes Scholarship is trying to nurture. For people who have a background in STEM, they will encourage you to go into liberal arts. I wasn't fully persuaded to go into liberal arts, so I picked something that's kind of in the middle, like neuroscience. Obviously, at Oxford I had a lot of math friends, and so math was still part of the equation. I was also trying to apply math in my neuroscience study, specifically maybe because I'm just afraid of animal experiments. Like, I'm probably just going to stay in, you know, data analysis, computational neuroscience. I was quite interested in topological data analysis and persistent homology. But later, I think it was two, three months after the school year started, I realized there's something called UCL Gatsby. UCL Gatsby is this premier, you know, AI hub in London, and Oxford to London is a short train ride, and there are so many world-class faculty there, doing really cool research in, say, theoretical machine learning, like, you know, analyzing the neurodynamics of stuff. They're also doing various other applied AI research, some with a cognitive science motivation, but really the core is AI.
B
Then you topped all of this with a joint PhD in math and JD in law at Stanford. All of this has been fascinating not just in terms of achievement, but in terms of range. And I'm just curious how you are able to do all of this and whether there's any lesson for anybody else. I mean, clearly there's an element of sheer IQ, but there must be something else.
A
I think there are a lot of things. Actually, in my junior year, I mean my last year at MIT, I was like, I kind of grew up with a very singular focus and goal, to do math Olympiad and then do math research. And the very next step was probably to go to grad school directly and probably not even do the Rhodes Scholarship. I wasn't sure, because a lot of the Rhodes Scholars are, you know, politicians or aspiring lawyers, judges. I was like, I just want to have some fun intellectually. At the end of the neuroscience degree, I was like, okay, I want to go to Stanford to start my math PhD. But also there was this incredible opportunity of Stanford Law School, literally one of the two, I think, top-ranked law schools in the country. And they have really good IP professors who marry AI and copyright law. They have a really good professor, Mitchell Polinsky, in law and economics, where you are basically doing differential equations, but you're analyzing deterrence and retribution, the ratio for each criminal law measure. And there are a lot of other things, cool methods to apply textualism to constitutional law. That's very similar to looking up definitions in math textbooks. And so I was like, okay, that's very cool. So, I mean, with the JD-PhD you have to spend one full resident year in law school. So that was my first year. I spent one year being a diligent law student. I was even trying to apply for clerkships. It was a fascinating year and I learned so much. And then the second year was a very interesting year, where I was browsing all the AI-for-math research papers and realized that, wait a second, there are so many ideas. I mean, from Draft, Sketch, and Prove to, I think, STP, self-play theorem proving. There are so many exciting papers, and I wished I just had the resources in industry to execute. And that was when, I think very, very soon after, I just basically decided to do Axiom, fully focused on the company.
B
Fascinating. Thank you for that. So let's actually go into the product now. So we alluded to some of this. Let's unpack how it actually works. So you mentioned there were three components. What is the architecture? What do those components do?
A
So I think our very broad vision is that we are going to have a conjecturer, we're going to have a prover, and then there is a knowledge base. Let me take a metaphor, because this is, I think, in a quite niche subfield of AI. Suppose you are sailing on an ocean. How do you know where to go? Your ship, which basically decides where to navigate, that's your conjecturer. And then you sail toward one direction and you land at this island. Okay, well, do you know if you have been on this island before? You don't necessarily know. Basically you need to look up your knowledge base. You want to make sure that this is indeed uncharted territory. And then once you realize that it is uncharted territory, how do you know if it's, say, India or the West Indies? Right? Is it, I don't know, is it going to have some rare metal? That's kind of where your prover starts coming in, to basically prove this new conjecture that is not in the knowledge base, that is mathematically correct and has merit. And then there's auto-formalization, which is the ability to reason across informal and formal space, kind of weaving all of these together.
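[Editor's sketch] The sailing metaphor above can be read as a simple loop: propose a statement, skip it if the knowledge base already contains it, otherwise try to prove it and record it on success. The Python below is only a toy illustration of that control flow under loud assumptions: every name in it (`KnowledgeBase`, `toy_conjecturer`, `toy_prover`, `discovery_loop`) is hypothetical, statements are just arithmetic claims, and nothing here reflects how Axiom's actual system is built.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple

# Toy statement type (an assumption): (a, b, c) stands for the claim "a + b = c".
Statement = Tuple[int, int, int]


@dataclass
class KnowledgeBase:
    """Charted territory: statements that are already known."""
    known: set = field(default_factory=set)

    def contains(self, stmt: Statement) -> bool:
        return stmt in self.known

    def add(self, stmt: Statement) -> None:
        self.known.add(stmt)


def toy_conjecturer() -> Iterable[Statement]:
    """Hypothetical conjecturer: one known claim, one new true claim, one false claim."""
    return [(1, 1, 2), (2, 2, 4), (2, 2, 2)]


def toy_prover(stmt: Statement) -> Optional[str]:
    """Hypothetical prover: returns a 'proof' string, or None when it cannot prove the claim."""
    a, b, c = stmt
    return "checked by arithmetic" if a + b == c else None


def discovery_loop(
    conjecturer: Callable[[], Iterable[Statement]],
    prover: Callable[[Statement], Optional[str]],
    kb: KnowledgeBase,
) -> List[Tuple[Statement, str]]:
    """Keep only statements that are both new to the knowledge base and proved."""
    discoveries = []
    for stmt in conjecturer():
        if kb.contains(stmt):
            continue  # this island is already charted
        proof = prover(stmt)
        if proof is not None:
            kb.add(stmt)  # newly verified knowledge
            discoveries.append((stmt, proof))
    return discoveries


kb = KnowledgeBase(known={(1, 1, 2)})
results = discovery_loop(toy_conjecturer, toy_prover, kb)
print(results)
```

In this toy run only the claim `(2, 2, 4)` survives: `(1, 1, 2)` is already in the knowledge base and `(2, 2, 2)` has no proof.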
B
And so the conjecture part, is it LLM based or are you in a completely non LLM world?
A
So we do post-training on, say, like, you know, open-source LLMs.
B
Okay, so there is an LLM. So how does that work then? What creates the conjecture? Is that prompt-based? Like, what goes into...
A
Yeah, I think I would say here the conjecturing part is still the under-development part. We have been, in the last, I think, seven months, very focused on the prover and also made a lot of progress on the knowledge base. So in the Putnam exam, right, you don't need to conjecture. You have 12 problems, they're incredibly hard, and they are, you know, basically tests for your prover. So Axiom prover tried the Putnam exam and got a perfect score. The underlying system is an ensemble of models, and there's also a set of deterministic tools, and there's a proprietary data set that's very large. So it's a combination of these three things that led to that success. Specifically, I think that the deterministic tooling is quite interesting, because these tools are actually written for Lean, in the language of Lean. So a bit like metaprogramming, and that's very interesting. We are actually going to release them on a public API, all these dozens of tools, very, very soon, beginning of March.
B
Great. Is that a big announcement today on the podcast?
A
I mean, it's releasing the infrastructure for mathematical reasoning. It's called the Axiom Lean Engine, AXLE. It's interesting because there are a lot of grassroots efforts from the open-source community to try to provide infrastructure tooling for Lean theorem proving, because Lean is a relatively new language and there are a lot of reasons why it could be a bit slow sometimes. There could also be things where, if you assume an axiom that's mathematically incorrect, if you assume n plus n equals n, then you will be able to prove 2 plus 2 equals 2. You don't want that; 2 plus 2 equals 4. So a lot of this sort of verify-proof is actually one of our prover tools that's about to be released, and that's actually 100 times faster than the other counterparts that are the open source like effort cloud comparator. So a lot of them are hopefully going to make everyone prove more theorems.
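[Editor's sketch] The "n plus n equals n" remark can be made concrete in Lean 4. The snippet below is an illustration only, not one of the tools being described: the axiom `bad` and the namespace `UnsoundDemo` are made up for this example. It shows that once an incorrect axiom is assumed, a false statement type-checks, and that `#print axioms` reveals the dependency, which is the kind of thing a proof-verification step would presumably need to catch.

```lean
namespace UnsoundDemo

-- Hypothetical unsound axiom, assumed only to show the failure mode.
axiom bad : ∀ n : Nat, n + n = n

-- With `bad` in scope, the false statement 2 + 2 = 2 type-checks.
theorem two_plus_two_eq_two : 2 + 2 = 2 := bad 2

-- Worse, `bad` makes the whole system inconsistent: `False` is provable.
theorem anything_goes : False :=
  absurd (bad 1) (by decide)

-- Lists the axioms the theorem depends on; `bad` shows up in the report.
#print axioms two_plus_two_eq_two

end UnsoundDemo
```

A checker that rejects any proof whose axiom list goes beyond the standard ones would refuse both theorems, even though Lean itself accepts the file.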
B
And in this architecture that you just described, compared again to what seems to be becoming the norm in other parts of AI, fundamentally this pre training LLM plus post training system, is there a trade off to your system in terms of is that more or less compute intensive, Is it more or less fast or slow?
A
We had a little bit of a cold start problem, right? I think the data is quite scarce. Where there are more than 1 trillion tokens of code, there's probably a lot less for Lean. So we had to basically take very both data beds to generate a lot of proprietary Lean data. So that's one difficulty.
B
And let's double-click on that. So how did you do that? So you created synthetic Lean data?
A
That's right. So it's interesting because when people talk about synthetic data generation in the unverified domain, you really don't know the quality. How do you know the synthetically generated financial advisor data is actually good? Then they have human experts to try to label and grade it. Here you have Lean, so you know that your thing is correct at least. And if you do good quality control on the statements, then you will have things that are of mathematical merit. And when we kind of take data beds, we use things like auto-formalization to convert existing math from informal language to formal language. We also do things that are more formal-system inspired, such as repair, fuzzing, et cetera, to make a lot more synthetic variants of the existing formal data that we currently have. So the other difficulty, I think, is that Lean runs on CPU and the LLM part runs on GPU, so you have a little bit of CPU-GPU engineering. It's just a very interesting effort where ideas are out there. You need a very strong, industry-strength engineering team to execute the many good ideas that maybe some academic researchers have produced, that some of our researchers have produced. In terms of how compute intensive, it's not horribly compute intensive. Definitely not compared to pre-training. I think that data is a large part of it. I think that good infra engineering is another part of it.
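[Editor's sketch] The point about verified synthetic data is that a checker, not a human grader, decides what enters the data set. The Python below is a toy stand-in for that generate-then-verify pattern, under stated assumptions: statements are arithmetic claims rather than Lean theorems, `verify` plays the role of the proof checker, and `fuzz` is a naive perturbation, not the repair or fuzzing methods mentioned above. All names are hypothetical.

```python
import random
from typing import Iterator, List, Tuple

# Toy statement type (an assumption): (a, b, c) stands for the claim "a + b = c".
Statement = Tuple[int, int, int]


def verify(stmt: Statement) -> bool:
    """Stand-in for a proof checker: accept only true claims."""
    a, b, c = stmt
    return a + b == c


def fuzz(seed_stmt: Statement, rng: random.Random, n: int) -> Iterator[Statement]:
    """Yield n randomly perturbed variants of a seed statement; most will be false."""
    a, b, c = seed_stmt
    for _ in range(n):
        da, db, dc = (rng.randint(-2, 2) for _ in range(3))
        yield (a + da, b + db, c + dc)


def synthesize(seeds: List[Statement], variants_per_seed: int = 50,
               rng_seed: int = 0) -> List[Statement]:
    """Grow a data set from seeds, keeping only variants the checker accepts."""
    rng = random.Random(rng_seed)
    kept = {s for s in seeds if verify(s)}
    for s in seeds:
        for candidate in fuzz(s, rng, variants_per_seed):
            if verify(candidate):  # unverified variants are discarded
                kept.add(candidate)
    return sorted(kept)


dataset = synthesize([(1, 2, 3), (10, 5, 15)])
print(len(dataset), "verified statements")
```

Whatever the generator produces, every item in `dataset` has passed the checker, which is the property that an unverified domain lacks.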
B
So as I was researching this, there were some numbers on a Putnam question basis where there were millions of tokens. Give us a sense for the order of...
A
It really, really varies. I think there have been cases where a hard problem in Putnam takes 1 million tokens and stuff. But there are also a lot of other things we could do. We in-house have something that can shorten proofs. So for example, you can shorten a proof significantly, 20 times shorter. There are different levels of how you would like to count how bulky a proof is.
B
And then still bearing in mind that you're a very young company. What is the current state of the product? Are you mostly focused on MVP kind of product that can solve this amazing problem, but that's not industrialized yet. What part is research versus what part is engineering and product?
A
So far we focus a lot more on, I'd say, so there's this team of really strong machine learning researchers and engineers, and they're all both researcher and engineer in one. They're really amazing. We have a lot of really good people from Meta, from Google Brain, from Anthropic, et cetera. And we keep hiring more and more frontier lab researchers. This part, I think, is focused on developing the core capability of the system. So we want to basically push the goalpost forward. So from the Putnam perfect score, that was four months in. Then two months later was the four research conjectures. And in the middle we also tested something that is transfer learning from math to code verification, so another evaluation on a community-recognized benchmark. We want to try to see where we can get, because we have really great mathematicians telling us how we should think about certain research problem targets. And we currently have really hard research math problems in house that we are tackling. It's solving some problems, it's also obviously getting stuck. So there's this part, and this part is the current focus of the company. Once we know where the frontier is, then we can try to say, okay, let's make it robust, let's make it production grade, so when a thousand or a million people hit it, it doesn't break. But this part is kind of an effort that's surrounding that, and sometimes people jump between different tasks as well. Now we learn something about the applied use cases. We also have a lot of subject matter experts, and we are hiring subject matter experts to join us to work on chip verification, to work on code verification.
B
To just double-click on something you just said: is there a long list of just pure math challenges ahead? For the uninitiated, is there an Everest in math?
A
There is, there is, yes. So what is it, roughly? By, I mean, journal submissions, a lot of other factors obviously. But you can think about currently the batch of papers that Axiom prover has autonomously proven and mathematicians have written. They can probably get into Journal of Number Theory, Journal of Algebra, that level. Well, that's a very different question from Annals of Mathematics or JAMS or Inventiones. That's one big jump. I think to get to that sort of result requires a lot of pushing, and we are not pushing it just to chase the amazing feeling, which is quite amazing, of proving something that is grand and has been open for a long time. At the same time we are basically teaching the model things that it could not do before, such as a more complex reasoning tree. So on the easy end of the Putnam problems, we have 40 nodes. On the hard end of research questions in house, we currently have a research problem with thousands of nodes. So it's a much wider and much deeper tree. And we want to see, are we going to hit a limit or not? We currently are not seeing one. So we really want to basically scale the complexity of the reasoning of the problem. We want to make the AI able to do library learning. That is, we have seen it, actually quite promisingly, auto-formalize definitions, which is really, really hard. So in math you have theorems, proofs, lemmas, propositions, and basically you have definitions, and that's very hard to ground. So you want to be able to auto-formalize definitions, you want to be able to have the model system explore definitions that will be relevant for further proof.
B
To progress through this series of problems. So if this is not super GPU intensive... fairly, fairly, yeah. And if you've built a way to create synthetic data that works, what is the fundamental bottleneck? Is that doing more of the same thing across more domains or is there an architecture evolution?
A
There's scale up and there's scale out. So we're currently scaling up in difficulty. We believe that is a defensible moat. We believe we are currently doing certain things interestingly and are ahead of the curve in terms of how we get rid of running out of context, these kinds of problems, how we scale learning from experiences, how we scale inference. I think we are doing that, and we are also scaling out, in a way, both. There are some math problems, they are not solved, not because they are like... or they're not auto-formalized. Actually there are interesting things you can do. You can choose an existing math result and try to auto-formalize it, or you can choose an unsolved math problem and try to prove it. Both are incredibly valuable. I think people talk about unsolved problems all the time, and there's a lot of value in actually picking good targets to try to auto-formalize. A lot of these are unsolved or, you know, not completed because they are very complicated in terms of the sheer volume of that result. So that's kind of scaling out within the domain of mathematics. And then there's also scaling out from math to other domains, such as code verification and then hardware verification.
B
Do you think that AI can win a Fields medal?
A
There is this friend who taught me a lot of things about math and he said that you don't celebrate when you win the Fields Medal, you celebrate when you get on the shortlist for the Fields Medal. So obviously there is only a finite number of awards. And I think that we really want Axiom prover to be able to solve one long-standing problem in mathematics such that you can objectively say, even if it's an AI, or double blind, whatever, that it would be on the shortlist.
B
And just to unpack that, why shortlist? Why is the.
A
Because then there are reasons whether... Oh, because then it gets political. Not quite political, I mean, sometimes fairness. For example, if a certain domain just got, you know, it's... Yeah, yeah. But to the shortlist, that is sort of the objective standard.
B
And to the broader question that I guess we alluded to a little bit earlier in the conversation, of just creating brand new, groundbreaking science: do you think AI is well on its way? I mean, obviously it's doing some, but in terms of humanity-altering, groundbreaking discovery?
A
The first couple, well, not the first couple of months, a couple months before we actually started executing, it was incredibly exciting for me on an intellectual level. Like every day I have this sort of excitement. It's like an I-drank-six-cups-of-coffee kind of excitement, for months. And the main source of that excitement, and I will actually tell you about my colleague and good friend Shubho's excitement about this in a bit, but my excitement is just that we are now realizing that we are at the threshold of a mathematical renaissance. We could also be at the threshold of theoretical discoveries in science. Massive, massive scientific discovery at the theory level. And I think what I mean by that is we have been in a very math-poor world. The supply of outlier mathematical reasoning skill is so lacking that people are in the scarcity mindset. You will hear discussions of, oh, this problem is so interesting; unfortunately, I'm solving that problem. They should all be solved. Everything that the human mind can conjecture, find interesting, find tasteful, should be solved by AI, hopefully the majority of them by Axiom prover. And then you have the question of how physicists, when they talk to mathematicians, generally will have interesting opportunities for collaboration. I actually have this paper with Professor Ken Ono and others, Shengtong Zhang and Michael Mertens, which addresses the elliptic expansion moonshine conjecture. And that kind of stemmed from, you know, I think three theoretical physicists. They conjectured this based on their observed physics phenomenon that I know, frankly, not very much about. But I can solve the math part. They come to the conclusion that what they believe is a beautiful phenomenon, that they find worthy to formulate as a conjecture and publish as a paper, gets a proof because they know some mathematician? That doesn't seem right.
I think that really the beautiful vision is for all the theoretical problems, all the curiosity, all the lack of understanding to be resolved in a satisfactory way across all scientific subjects. And beyond this, right, there are things that we still cannot solve, where we will get, not a closed-form solution, but an approximation as precise as possible. I mean, there's a lot of value, for example, in knowing what is after the 1,000th or 10,000th digit compared to what is after the third digit. The world is actually, a lot of the time, not diminishing returns; the last mile carries a huge amount of value. In search, for example, if you cover some edge case, you likely win, you will be a market winner. In writing, for example, right, if you write just that extra bit, or in any sort of creative art, getting that extra mile correct or done has a lot of value. Optimization, you know, precision, we can try a lot of these things as well. And then math kind of helps with both first-principles understanding and trial and error. It's kind of this cycle: you have some first-principles understanding, you try testing it, you try some trial and error, and you maybe give some sort of, you know, risk bound, uncertainty principle, robustness estimation, and then you go back to your first principles and then you go to your trial and error again. You have this sort of circle of discovery, and this is really not the end of it. So that's why I'm already very excited. I think this is going to be amazing. Ideas can diffuse between different fields. A bit like: you have the abacus, and now you have trade and commerce; you have calculus, integration, and you have thermodynamics, mechanics, the industrial revolution; you have the Babbage engine, which is to calculate log tables faster, and okay, well, you have the prototype of computer science. The rest is history.
You have number theory, you have RSA. You have all these mathematical tools that open up new discoveries and new use cases, in turn demanding more mathematical tools. Beyond this cycle, and this cycle marrying science, here comes code. We haven't even talked about that. And that's why Shubho is actually very excited. So Shubho, CTO of Axiom, before that was a long-time Meta veteran. He was an IC director. He believes code is math. Okay, so through all my good friends telling me about Lean, telling me about the Curry-Howard correspondence, I believe math is code. He believes code is math. What does that mean? Okay, so it means that you can try to fulfill the dream of Donald Knuth's literate programming, and have computer scientists, programmers, enjoy the luxury of mathematicians, where they can reason in natural language. And this is kind of starting to happen, right? Vibe coding, like front end, right? You know, we can have very cool lovable websites. But how do I vibe code a nuclear reactor? How do I vibe code control flow? How do I vibe code complex systems that require, quite honestly, superhuman hierarchical reasoning skill? It's interesting because you're not in code alone. You have code and you have math. So in addition to the flywheel we are already seeing in the coding companies, you have an additional layer of flywheel of verified code, and math starts to come in, and this flywheel of data keeps compounding. You have actually two, or, if you're counting the science part, three orders of flywheel.
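[Editor's sketch] The "math is code, code is math" remark refers to the Curry-Howard correspondence: a proposition corresponds to a type and a proof corresponds to a program of that type. The Lean 4 snippet below is a small illustration with made-up names (`CurryHowardSketch`, `modusPonens`, `applyFn`, `andSwap`, `pairSwap`); each theorem is paired with an ordinary function of the same shape.

```lean
namespace CurryHowardSketch

-- Logic: from a proof of P and a proof of P → Q, obtain a proof of Q.
theorem modusPonens (P Q : Prop) (hp : P) (hpq : P → Q) : Q :=
  hpq hp

-- Programming: the same term, read as function application on data.
def applyFn {α β : Type} (a : α) (f : α → β) : β :=
  f a

-- Logic: a proof of P ∧ Q is a pair of proofs, so swapping it proves Q ∧ P.
theorem andSwap (P Q : Prop) (h : P ∧ Q) : Q ∧ P :=
  ⟨h.2, h.1⟩

-- Programming: the same term, read as swapping the components of a tuple.
def pairSwap {α β : Type} (p : α × β) : β × α :=
  (p.2, p.1)

end CurryHowardSketch
```

The proof terms and the program bodies are the same expressions up to notation, which is why a proof checker and a type checker can be one and the same tool.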
B
How far do you think we are from that world where we have all.
A
That's why we have to execute something every couple of months. We have to move extremely fast. There's so much to do. I think Axiom is a very, very young company and we are at the very, very beginning tip, and already I personally feel some sort of shock and emotional response. And I know some of my mathematician friends, Scott Kominers, who's a Harvard microeconomics professor and also a Morgan Prize winner, we're good friends. We all have this sort of emotional response when Axiom prover proves Fel's conjecture, proves that almost all primes are partially regular, a partial Vandiver conjecture, which is one part of the original Vandiver conjectures that have been open for 90 years, and the parity of differentials for surfaces of genus 0 and 1, an algebraic geometry paper. We are really just leaping across a point. I mean, Putnam marked, I think, the end of AI trying math Olympiads. We are very glad that we got a perfect score. It is, by a lot of experts' grading, harder than IMO 2025. So it's the world's hardest math Olympiad test. And now we are leaping, we are leaping to research, and I think I'm going to have another similar emotional response if it really does solve one of those breakthrough mathematics problems. Interestingly, I think there are a lot of experts in domains that are currently overlooked by AI development. So if you're a software engineer, you feel like, oh, vibe coding really changed and improved your quality of life in a meaningful way. There are people who are in industries where, because of the lack of provable guarantees, they couldn't use AI. For example aero-astro, for example, I think, defense. There is no partial credit for a mostly verified GPU.
B
It's all or nothing. Or a mostly flying plane.
A
Yeah, yeah, yeah. A mostly verified formalized hypervisor. I think for these experts they are currently kind of hand holding a lot of the traditional tools. Their life has not been changed and I want to see the AI for math movement that axiom is hopefully leading to transfer to these domains and to try to solve some of those problems as well.
B
Speaking of emotional response, is the entire math field super excited about math AI, or do they feel like the rest of us, possibly disintermediated and replaced by AI?
A
So I think a lot of the adverse reaction from the math community about AI is actually coming from the fact that they cannot verify an informal solution. So suppose GPT generates a math proof of a million lines. No one's going to check that. I'm not going to check that. I know there are a lot of really great data labeling services, you know, but they couldn't have that level of expert to verify a proof of that length. It's just hard to do. On the other hand, if you have a formal certificate, like a stamp certifying that this is a correct Lean proof, I think people are a lot more receptive to it. That's why actually a lot of mathematicians, especially almost all of the new school of mathematicians, are accepting Lean. What they want to do is to have humans formalize in Lean. Now, it's really fun for humans to do Lean. We have a lot of the initial mathlib people here at Axiom, and it's a very fun team, to see them formalize some statements. But when it comes to, say, hundreds of thousands of lines of Lean code, which is already not hypothetical: an automated reasoning team at some big tech company currently has 260,000 lines of theorem proving code, not Lean, to verify one component of a hypervisor or CPU utilization. They actually did write that by hand, but it's just not quite sustainable.
B
So maybe just to close just one sort of final chapter in this conversation. So while I was listening, I was thinking about how someone with such a deep background in math becomes a CEO and what that transition was like and whether there's any lesson for anyone considering making that jump or any founder out there.
A
I remember reading the anecdote of Hamilton, that he writes down all his flaws and shortcomings every night and forces himself to correct them. I know what my habits and flaws and shortcomings are. I'm a very spontaneous person. I don't have the best ideas when it's planned. Which I think sometimes, you know, in the course of research, like math research, you have these very interesting eureka moments, I think. So I try to overcome that in my day to day. I try to do scheduling very intensely. I try to surround myself with people who inspire me to execute faster. And that's basically the entire team. And I think it's a great honor to be working with them every single day. I go into the office and I look around and I hear the discussions and I'm like, wow, I'm so lucky to be here. And I think that's something that, in this kind of talent market, I think people like to work with people who value their intellectual judgment, respect their voice and opinion. And I think Axiom has this sort of very flat, non-hierarchical structure: everyone's a member of technical staff or a mathematician, right? If you're early, you're a founding. Plus, this culture is really great for, I think, the open communication of ideas and debate of ideas and...
B
And that helps with the pace.
A
It helps with the pace of iteration, helps with being more correct. A lot of things I'm learning very rapidly. I mean, it's been a very interesting roller coaster, like, seven months. When I was in math, I mean, there's this sort of idea where you have taste, and sometimes this taste can be interpreted as arrogance. And I learned that ego is a really, really bad thing and we should just basically get rid of it.
B
Do you hire people for taste? And if so, how do you evaluate it?
A
Yeah, we hire people for taste, but not for ego. And I think that's a very interesting
B
kind of how do you select people? How do you evaluate someone's taste? Is that their prior work?
A
Right. So one is basically, that means I have to learn every day, because I need to have a certain ground, you know, a basic amount of taste. And also the other people who are senior at the company, the research scientists, need to have that sort of amount of taste for what makes us excited. And when we are excited, we just go after that person. We have had a lot of, I think, extraordinary hires from amazing and interesting backgrounds. And we really like this team. And I think that we raise the bar for hiring continuously. So we recently have had an uptick in the number of people who would like to join us, and in candidates that we're excited about. We take recruiting very, very seriously.
B
And how did you secure that founding team initially? Because that's kind of ridiculous in terms of caliber, just world-class mathematicians. World class. How did that all come about? One of them was your mentor. So now, technically, you mentioned that you were very flat, but technically working for you as CEO.
A
Right.
B
How did that come about?
A
So quite interestingly, I think at the start of this, I realized that there is a movement, and this movement of AI for math is very much by and large in academia, or actually people are hiding in labs secretly doing AI for math where their day job is something else. When I talked to each of these people, it was a very, I think, mutually exciting feeling from both parties. And basically it's like two whales who just kind of realize that only they can communicate in that frequency range. And that happened multiple times. I got very inspired by this. I think there were a lot of times at the beginning where fundraising was quite hard. I mean, I'm a nobody. No one should trust me with their large amount of money. And I think that was very difficult and challenging. But it was those conversations that basically made me realize I have to do this. I just have to, and have to be the best deal for this team, because my team deserves the best of the world. And that kind of intellectual alignment was a main theme. And the other thing I realized was that they found, say, maybe the other AI for math opportunities not particularly attractive, for one reason or another. Maybe they don't want to move geographically to London or to China, or maybe, you know, it's just a different kind of vision. So gathering them was a relatively natural process. And then after that, I think when you have a bunch of really smart and nice people, you just attract other smart and nice people, and especially people who are adventurous, rebellious people who like to disrupt. People who come from codegen want to disrupt codegen, people who come from math want to disrupt math, disrupt and I guess elevate as well, because they do have a lot of affection still for that field. So I think it's very interesting. It's almost like a tribe. Axiom is like a tribe. And we have more and more people joining us.
And sometimes I feel like both in those initial conversations and still now, and even after all these sort of like, you know, talking to the world about what we are doing, we still feel like secret keepers. We still feel like we cannot fully elaborate and emphasize the thing that we are seeing that is the next frontier of AI, that is a generation and verification loop, that is the discovery of verified knowledge.
B
Feels like a wonderful place to live in. This is all incredibly fun, compelling and inspiring. Thank you so much for spending time with us.
A
Yeah, appreciate it. Thank you so much.
B
Yeah, hi, it's Matt Turk again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't, or leaving a positive review or comment on whichever platform you're watching this or listening to this episode from. This really helps us build a podcast and get great guests. Thanks and see you at the next episode.
The MAD Podcast with Matt Turck
Episode: The AI That Beat the World’s Hardest Math Test (Putnam 2025) — Carina Hong
Date: February 26, 2026
Guest: Carina Hong, Founder & CEO of Axiom
This episode dives deep into the development and implications of Axiom, an AI mathematician built to be 100% correct, thus fully solving the AI hallucination problem—in math, at least. Host Matt Turck and guest Carina Hong discuss how Axiom aced the Putnam exam (the world’s hardest math test), autonomously solved open research conjectures, and what this means for the future of math, proof verification, AI, and science. They also explore Carina’s unique personal journey through competitive math, academia, and founding an ambitious tech company at age 24.
“We really want axiom prover to be able to solve one long-standing problem...so that you can objectively say, even if it's an AI, that would be in the [Fields medal] shortlist.” — Carina Hong [46:36]
Philosophical and procedural questions about awarding prizes to AI—but real advances would be undeniable.
| Timestamp | Segment / Topic |
|-----------|----------------|
| [00:00–02:02] | The need for AI mathematicians & scope of math |
| [03:06–04:05] | Putnam exam: Setup and Axiom’s achievement |
| [04:13–05:07] | Solving open research conjectures with AI |
| [05:31–06:58] | The role of creativity, intuition, and routine in mathematics |
| [07:19–08:59] | Differences between AI and human mathematical reasoning |
| [09:10–10:50] | What is Lean and why it matters |
| [11:07–13:38] | The history of AI in math and comparative approaches (OpenAI, DeepMind, Axiom) |
| [14:28–15:06] | Formal vs. informal reasoning; auto-formalization challenges |
| [16:06–19:13] | Verification, RLVR, and eliminating hallucinations |
| [20:46–23:45] | How generalizable is Axiom’s approach; moving from math to other scientific domains |
| [24:53–28:18] | Carina’s personal journey: China, MIT, Ross Math Camp |
| [29:53–32:00] | Transition to neuroscience, law, and founding Axiom |
| [34:06–36:47] | Axiom’s technical architecture and new product launch |
| [38:40–40:14] | Data scarcity & synthetic Lean data generation |
| [43:10–45:19] | Internal benchmarks and scaling bottlenecks in proof AI |
| [46:32–47:49] | Can AI win a Fields Medal? The real challenge in research math |
| [47:49–52:20] | The future: mathematical and scientific renaissance |
| [53:25–55:47] | Industry applications of verified AI: going beyond research |
| [56:01–57:25] | Math community response; formal proofs increase trust |
| [57:48–63:22] | Leadership, hiring, and founding an elite, adventurous team |
This episode offers a rare, detailed look at the future of automated mathematical reasoning and proof, the technical, philosophical, and societal shifts it will drive, and the remarkable journey of one of its pioneers.