
Loading summary
A
I'd like to thank our friends at Capital One for sponsoring today's episode. Capital One's tech team isn't just talking about multiagentic AI. They already deployed one. It's called Chat Concierge and it's simplifying car shopping using self reflection and layered reasoning with live API checks. It doesn't just help buyers find a car they love, it helps schedule a test drive, get pre approved for financing and estimate trade in value. Advanced, intuitive and deployed. That's how they stack. That's technology at Capital One. This podcast is sponsored by Google. Hey folks, I'm Amar, product and design lead at Google DeepMind. We just launched a revamped vibe coding experience in AI Studio that lets you mix and match AI capabilities to turn your ideas into reality faster than ever. Just describe your app and Gemini will automatically wire up the right models and APIs for you. And if you need a spark hit. I'm feeling lucky and we'll help you get started. Head to AI Studio to create your first app.
B
Math and coding are two important, perhaps the two biggest part of digital world and coding is heavily invested. Math is not. Not that math is not trackable, but that math has not been turned into programming language yet and there will be new markets and new use cases that get unlocked because of this.
A
All right everyone, welcome to another episode of the TWIML AI podcast. I am your host Sam Charrington. Today I'm joined by Karina Hong. Karina is founder and CEO at Axiom. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Karina, welcome to the podcast.
B
Thank you for having me. Great to be here.
A
I'm excited to dig into our conversation. We'll be talking about mathematical reasoning which is what you are working on there at Axiom and it is a very timely topic. Before we dig in, I'd love to have you share a little bit about your background.
C
I love math since as long as I can remember what most people don't. And I feel like Olympian math training gives you this sense of constant dopamine hit. Like you solve a problem and then you feel so good about yourself and then you move on to the next one. A bit later I went to MIT and started research mathematics career and that was a lot more pain and suffering. You usually are stuck on a problem for months and I remember I was working on a really hard problem for about like half a year and I would meet with my advisor every week and have nothing to report that's kind of the life of a research mathematician.
A
What particular field in mathematics?
C
Yeah, so I work on number theory and combinatorics. A lot of people who work in combinatorics come from the Olympic math background, so they can solve problems. Yeah. A lot faster than other combinatorics researcher. But number theory hopefully is better. You can read a lot of literature and make really interesting contributions without having this sort of outlier problem solving skill.
A
Nice. I took a real analysis class in grad school and I think that was about enough for me.
C
I love graduate analysis. Yeah, I remember back then at the graduate analysis class, I don't know if it's because no one is quite following. We were just like passing note and saying that the professor looks like Winnie the Pooh.
A
That must mean that the class was a lot easier for you than it was for me.
C
Or maybe we're just completely lost.
B
I don't know.
A
Nice. So you're working on this hard problem at mit. It took you half a year.
B
Yeah.
C
And then I think I realized at one point that was, wouldn't it be great if you can just let your intuitions take you? You have some lemma that you want to prove instead of being stuck there constantly verifying the details? Wouldn't it be great to just, you know, think that there's a high chance if your intuitions is right, that the lemma is correct and move on to the next one a bit Like, I mean those creative AI arts startups, right? Like Pika. Like you want to put a fox on the cartoon and then have the fox talk to a bunny instead of actually sketching it out like the cartoon creator.
A
Let's say it and let it be so.
C
Yeah, exactly. I think that might be the difference between say, Terry Tao and Karina Hong, which is perhaps both can have some sort of lemmas they want to prove, but Terry can just move on to the next one and Karina need to be stuck there for about six months before moving to the next one.
A
You know, one thing that strikes me about what you're doing is that we seem to be in a particular moment with regards to math and AI and you know, this confluence of math and reasoning. I recently had Christian Zegedy on the podcast. Like a week after we talked, he left to start another or he started another startup, Math Inc, to focus in this area. There's also Harmonic working on this area and probably untold others. I think the first thing I want to get your take on is why now? Why is so much investment and effort More broadly, going into mathematics and AI.
B
I think what we feel at Axiom.
C
Is the timing is right. And the timing is right because of three reasons we believe in AI mathematician being built by bringing three fields together, and that's AI, programming languages and math. If you look at the current landscape, informal models are really good at reasoning. They're really good at math and coding.
A
The various informal models like LLMs.
B
Yeah.
C
So informal models, large language models. The O series of OpenAI really marked an era of post training, reasoning, reinforcement, learning like this is great technique breakthroughs. So that's number one. Number two is on the four bone sides. Lean Four, which was rolled out September 2023, has been widely celebrated by the community. The number of people who learn Lean, use Lean on a daily basis have pretty much expanded. And a lot of mathematicians, field medalists, are joining force in using Lean and seeing, okay, well how can this formal language, this programming language, help us verify math and you know, hopefully have more and more digitalized mathematics. So that's number two. Number three, I would say, and this is maybe like Axiom's hot take is by turning math into programming language, you really need a lot of the cogen techniques and code generation has been crossing a threshold. We have seen really impressive performance of reinforcement applied to coding. And the hope is by turning mass proofs into programming language and using a lot of the code gen techniques that we think are a bit kind of overlooked, we can have really amazing performance gains on math as well. So three things are really converging together.
A
Are these CodeGen techniques overlooked in CodeGen or overlooked in application to math?
C
In application to math and for good reasons to look at. For example, there are probably more than a trillion tokens of Python code in the Internet. That's a lower bound. I think there are about 10 million tokens of lean code. So we have the Lean Universe, which is the largest Lean dataset. The developer Lean Universe is currently at Axiom and we kind of are estimating, wow, there is a 100,000 times data gap and that's ridiculous. Like scaling can't quite work with this amount of data scarcity.
A
Yeah, talk a little bit about Lean. I see that come up frequently in the conversation around mathematical reasoners. It's a programming language. What does it kind of look and feel like? What problems is it trying to solve? Who's using it?
B
Yeah, 100%. I think the magic of Lean is human mathematicians have been doing mathematical reasoning on pen and paper literally for thousands of years. And because of how they are trained, they do it In a very rigorous way it is by humans to, you know, make sure that each step strictly follows the previous one. Lean makes this more like how computer programs work. You can write a proof in Lean and the Lean proof will hopefully compile if the proof is correct. And if it's not, then it will report a bug, say a bug at line 37. Then you look at line 37 and you try to analyze exactly what the bug is about. It can be that you get the syntax wrong. So like every Python beginner gets the syntax wrong. Sometimes it could be not mathematical merit of the proof that's wrong, but just the syntax, or it can be that the math is just wrong.
C
And there's like a third kind of.
B
Problem that's a bit more stubborn, that is there's something about the types that are not matching. So LIND is built on something called dependency type theory. So then you have types associated to every object and sometimes these sort of bugs are harder to fix. So, so by trying to feed the error message back into say your prompt, you can do inference with feedback and you can do the standard like RL EF type of stuff. And hopefully you can have the machine see more lean, learn more lean and generate more lean, so it kind of stops making those mistakes. That's at the syntax level. But whenever there is a mistake, you can be more confident that something about the mass is.
A
And what does it look like? You know, I'm thinking about the math that we see in papers, and I'm also thinking about one of the first computer programming languages I learned and you know, which is in education was APL that reviews, which has all kind of like weird symbols and stuff like that. And I'm wondering if Lean looks like that.
B
Yeah, Lean has some interesting math symbols like the inverse A, which is like for all it has like some. It basically looks like a programming language. It's definitely not quite apl, looked a.
A
Little bit like hieroglyphics.
B
Oh, okay. Yeah, it's very interesting to think about the abstraction of these languages. Like with natural language being very readable but not quite easy to compute, right? You can't actually do computation with English and then you kind of go down the stack, you have programming language like Python, like Lean. We now think about, for example, Python being kind of translated almost to an intermediary language and then intermediary language and eventually the machine code that is doing the actual computation that's going down the stack. But it doesn't quite. There's some local examples where it goes up the stack A little bit. It doesn't kind of like go fully up back to the Python stack. That's not how compilers work. On the other hand, if you think about the highest stack being natural language math proof, and at the bottom you have Lean proof, and in the middle you have things that are needed to convert to formalize a natural language proof into Lean. You have maybe finer and finer blueprints like math that are 10 times longer than the original paragraph of proof. It actually is able to go down and up the stack freely, and I think that's quite interesting. So auto formalization is for a model to convert English proofs into linked proofs and informalization is going back up. So then you have like a full bidirectional circle, which is very interesting because then you have human mathematicians reasoning high level intuitions at the English layer, and then Lean is proving at the, you know, bottom layer. And these sort of formal proving signals then can feed back in to, to, to advise the mathematician. Okay, well maybe this part of verification isn't quite working to then guide the continuation of these high level intuition. That process is, it's very fun. It can, it can actually be transferred.
C
To a lot of other settings as well.
B
If you think about formal verification, other contacts such as chip design, such as code verification, those are quite interesting examples.
A
Is it, are you able to say conclusively that if a program compiles and runs in Lean, then the proof is without error mathematically? Like it seems like that in and of itself would be a research topic or a significant achievement like having that one to one mapping between validity in the lean space and validity in the mathematical space.
B
Yeah, so I think it's quite interesting that a lot of the mistakes actually happen as a misformalization of the statement. So if you're sure that your statement is like correctly formalized and your proof compiles, then congratulations, you have a correct proof. Now there might be multiple correct proofs and there might be for the same proof, multiple ways to write it in Lean. So it's not one to one matching per se, but, but you do have the sort of provable guarantee that if your statement is correctly formalized and then you have a working Lean program, that you have a proof right there.
A
You talked a little bit about like the three elements that are coming together to make this an interesting time for mathematical reasoning, Lean being one of those pieces. And then we're talking now about auto formalization. Like does auto formalization, is that part of Lean as a package or is it something that folks are Working on that kind of utilizes lean as a foundation. And what's the current state of auto formalization from your perspective?
B
We think of auto formalization, which is the process again for a model to convert natural language mass proofs into Lean. It's like surprisingly very hard. I mean it's, it's not just a translation problem. So I think there are two ways to think about auto formalization. One is as a data bet, right? So you can convert a lot of math data in the informal space to the formal space and then you kind of solve the data scarcity problem of formal math data. And so one way to think about it is it's a way to gather data. The other way to think about it is it is also a core technology in and itself. So auto formalization and proving have very sort of entangled capabilities. Like I think a model that can auto formalize very well likely can do better in proving and the other way around. So we strictly don't think that auto formalization and proving should be trained separately. Like they should be in one loop, they should be in one model. So I think these are the two ways we think of auto formalization. If you look at the entire mathematical archives, There are about 30 to 70 million statements and proofs pairs and none of these are in lean. I mean very basic stuff in domains that happen to be the domains that the lean developers lean matlib contributors like such as algebraic number theory, those are in, but differential topology for example, is not in. If you can do auto formalization well, you can get these very high quality seed data at all levels of difficulties of math and then you can convert these seed data into even more synthetic data. Then you can try to close the data gap we just talked about of math and coding.
A
Lacking those, how do you approach closing that data gap?
B
Yeah, I think auto formalization is important. We also think that the other parts of the synthetic data generation, after you have a lot of high quality data, how do you make it 100 times, 1,000 times more? I think those are interesting as well.
A
Maybe I'm not hearing this correctly, but part of what I'm hearing there is that we need more data to do auto formalization. But then more data and order formalization is needed to do proving. But in the first place, how do we get the data required to do auto formalization? Or more broadly, what approach are you taking to tackle auto formalization?
B
I think it is a cold start problem. Right, because machines have not seen enough lean. They don't know how to do lean well. So you do have that sort of like cold start chicken and egg problem. I think there are mature research techniques out there in the publication space that are trying to solve this problem by bringing the gap between natural language proofs and formal proofs closer. We think broadly reinforcement learning is an important part of that.
A
How do you see RL playing in this space?
B
So we have seen really amazing scores on various benchmarks from models that use reinforcement learning to to do proving well. We have seen for example Seed Prover is a really strong paper by, by Dance Seed foundation. And there's that Hilbert prover. Before that there is Kimman Approver Deep Seq Prover. We have seen reinforcement learning applied on formal math having pretty good results.
A
And is there a general approach that is being taken to apply RL to formal math?
B
I think people try different things actually. There are many different sort of designs that are more Monte Carlo tree search based models and there are more sort of creative agents design one can do. And there are interesting RL training recipes that each lab chooses to take. It's not, I think it's more like an art than a science.
A
How do folks approach objective design with, with proofs? Is there a kind of standard way to think about that?
B
Yeah, I think there is a difficulty. I think it's that the signal can be quite sparse. Right. So once the proof gets incredibly complicated like you know, it's a little bit hard to get it to work. So some people do see that as a challenge. But I mean it's a very like rapidly evolving field and people have been trying to you know, define good curriculums for the model to hill climb. So currently math models, formal math models are doing quite well in the high school Olympiad math level. You have seen two formal models actually do spectacularly in the imo. And then beyond that you can have more fine grained curriculum to try to teach the model harder and harder. Math actually across 2 axis 1 is how abstract the mathematical object is, the other one is how sort of eureka moment, how creative the proof is. And IMO is something that is low abstraction. High creativity requires a lot of genuine like you know, innovative approach. And then you have for example PhD qualifying exam that could be relatively straightforward if you know all the other concepts. And that's at a higher abstraction level than imo. And you can define curriculums for the model to do better and better across on both axis.
A
And what are the two models that have done well on the IMO through formal methods?
B
I think for example the seed prover model is doing pretty well and Aristotle is also Doing pretty well.
A
Yeah. What can you say about those models? Like what approaches have they taken that you think differentiate them? Among other attempts at this, I think so.
B
Both are industry labs with more compute and data budget resources.
C
Always helps, always helps.
B
We have seen for example the Aristotle one being more Monte Carlo tree search based and then the seed prover one having pretty, I think intricate designs, system design. So they have lammet pool, they have very different sort of designs for their light, medium and heavy inference tiers. Uh, so I think really there are many ways and eventually I think people will just try everything and see like which one works better.
A
And so taking it back to Axiom, it sounds like you've got this, you know, this big problem that you're going after and a lot of, you know, tools in various tool chests to kind of apply to the, the problem, a lot of levers to, to pull, like how do you think about where you invest or what bets you place or what needs to work well for you in order to be successful? What general direction are you taking through all this?
B
We believe in system that can prove and conjecture and self improve by having the prover and conjecture talk to each other. I think verification which exists here in the domain of formal math and self play really tied together. You can have the prover as the reward signals for the conjecture. And that is not a new idea. It hasn't been an idea that worked well in other domains before. And we want to take pretty bold bets on the data end as well through all the formalization and synthetic data generation. So I think of like our approach as sort of, we do a lot of things and they all eventually fit together pretty pretty nicely and cohesively, you know, use auto formalization to generate a lot of data and to extend that data by 100 times thousand times through synthetic data generation methods and then have a really strong prover that shows capabilities across benchmarks that test different capabilities, do benchmark and evaluation very carefully and then eventually have the prover talk to the conjecture for the conjecture to propose new problems and new curriculums and the prover to then prove it and then form like a self improving loop as the prototype of self improving AI.
A
And do you approach all of those in parallel or is there a particular starting place and that that's kind of foundational that you tackle first?
B
We currently are doing more than one thing for sure. But as you, you know, as you're, you're right that startup is always like a kind of constraint optimization problem. We currently feel like the, the, the, you know, the different parts are all very important. And by kind of not, you know, like constantly thinking about these, these ideas and staffing those teams. Actually it's a very kind of cohesive picture. Even though like obviously for an early stage startup there's too many ideas for us to, to execute all at once. But we're trying our best.
A
You mentioned that one of the elements of a solution that you think is important is self play. I think we've seen some significant successes there through like the AlphaGo AlphaZero style of models. Like do you, once you have the pieces, is it, you know, a straightforward thing to set them against each other and say, you know, go off and talk to one another and figure this out or is there also a lot of, you know, technical challenges just in like setting up your self play, you know, system and environments?
B
For sure. I think that is there, there are quite, quite a few, you know, difficulties there. You know, for example, right, if you, okay, if they just cannot prove like is it the fault of the conjecture or is it the fault of the prover? Is it that okay, well this problem is just bad or is it just that it's a good problem, then your prover like I think disentangling them. There are a lot of like open research directions I think that are under, under explored. But I think like the, the general idea which I think is quite creative and noble is that through math we can, we can try all these things. It's a sandbox for, for trying self improving AI. And the conjecture proposes new conjectures and based on the current knowledge base of, of math. And that knowledge base is evolving, it is expanding and contracting, expanding because of new things being proven, contracting because well maybe some of them are, you know, low merit. And then provers prove or disprove these conjectures, the results added back to the knowledge base and the cycle repeats and extending the body of knowledge and that I think we're going to see some difficulties in judging what is good mathematics from, you know, by the model. You know, what is elegance? What is novelty? There are certain sort of rough proxy measures that we can try, but not like a conclusive, you know, definition. I mean there are other papers out there. I think Yosha Bengio has this paper on, you know, looking at building AI mathematicians through the lens of compression. Well, is the first order logic viewpoint satisfying or not really? Right. If you have a theorem and then because of including that in your knowledge base, many theorems are then bore is that is that, you know, is that end all be all for like elegance, you know, there may be probably some statements that are extremely elegant from the perspective of human mathematicians that don't actually that kind of like conclude things right, like, like close, like a threat, like you know, nuke field. After we know this, there's nothing new to be born. So I think that might be a human mathematician sort of alignment problem with what is good. Taste I think is still very, very hard to measure.
A
Taste is something that I think we're grappling with across a lot of different fields in AI and it's not clear how we necessarily get there beyond more data that somehow exemplifies better taste. You know, which kind of brings us back to the data problem in your field.
B
And I think the data problem in our field is like for the really not so complicated, not like frontier research math, like just undergrad textbook. There haven't been that much data.
A
We're having this conversation in the context of the post LLM AI era, if that's what we want to call it. And a lot of that was kind of this unlock of self supervision through next token prediction, that kind of thing. Do you see an opportunity for self supervised data exploitation in mathematics? We've got this tremendous. There are tons of papers, many of them are digitized in mathematics. Even the ML papers that we look at have lots of math in them. You know, there's narrative about those proofs. You know, can we use that without necessarily requiring Axiom or Math Inc. Or everyone else to like take those manually convert them into lean and like build these supervised data sets? Do you think anything's there or.
B
We at Axiom actually believes in the informal part of reasoning being also important in the loop. I think if you have the informal data and quite aligned with your formal proof and you train them on the verified pairs, it's great. We think that's a, that's a good idea. Like you know, if you look at even not talking about mass verification, just talking about like Python, right there is this paper R Star Math. I remember was earlier this year in April or something. Microsoft Asia paper. They see that small model can achieve on par with O performance. Back then O was still very, very impressive. It's like amazing how fast the field moves. But training on the paired data of informal solution to high school math problems and the Python one. Yeah, I think that's interesting, right? And you also see papers like Lego Prover trying to have the informal proof as aligned with the formal proof as possible. We don't believe in only formal. We don't believe in purely informal for sure. Which we believe in kind of taking a hybrid approach and there are strength of the informal models, especially maybe in high level intuitions that could be very interesting to do that together with the formal side as well.
A
I have the impression that your team is not exclusively mathematicians or mathematically focused folks, maybe mathematically interested folks, but for folks that are coming into it with a background in AI and an interest in math in this field, like what's the way to get started?
B
I think shameless plug of we are hiring like applied AI researchers who have not worked in AI for math. I think we really believe in the transferability of techniques and just broader AI, mainstream AI and also AI and cogeneration. We think there are a lot of techniques that can be borrowed from these other fields and applied to math. In fact we find mathematicians being really helpful but for only some parts of this effort, not for example training the prover. We find mathematicians being helpful to give us strategic guidance on what are interesting auto formalization targets. We find mathematicians being helpful in working on the mathematical discoveries part, constructing examples or counter examples that settle long standing conjectures, so very specialized problems and have nothing to do with Lean. So we haven't actually talked about that part of the axiom roadmap led by Francois Charton's world famous AI for math researchers. On the discovery side we find mathematicians being extremely helpful to the team. But you know, for folks who come from, you know, the Strawberry line of work, I think there's like immediate sort of transferability of techniques and we also find we have really strong team coming from applying deep learning to code gen and that team have been authoring many amazing work like LLM compilers, kernel, LLM and are also working. They have also worked on cwm, the computational word modeling. We find those also quite, quite interesting. But I think to answer your question, for someone who is not super familiar with AI for math, there are many amazing survey papers. I think there's one paper called Formal Mathematical Reasoning in the Next Frontier in AI and first authors like Tai Yu Yang, they survey the field really well, how different pieces fit together. They also have many calls to action. The last, I think Section 5, like what are the open problems to consider? I find that very helpful. There is an AI for mass GitHub, repo listing all the papers and also including some that have nothing with Lean, just the math discovery site for example. I find those very helpful as well. There is an amazing community, I mean there are so many conferences each year that talk about AI for mass like ICER and I think Brown and there's big proof. And recently Albert Wohfog many conferences. And also I think there was something at the Simons Institute at Berkeley. It's a very welcoming and friendly community. People like to go and just learn stuff.
A
You mentioned some of the work that your team is doing Francois is doing on the discovery side. Can you talk a little bit about that?
B
So mathematical discovery tackles specialized problems. So it will be one specific problem instead of the benchmark. Usually have problems coming from all branches of math. Right. There's some algebra, some analysis, combinatorics. Francois and also his team's work tackles specialized problem that is known to be very hard. Usually these conjectures are open for about. There's one that's open for 130 years, the other one is open for for 30 years. And shocking. The 30 year old one is actually disproved. So it's actually not true. It's a very different landscape from the lean dirt proving side. There are many techniques that are helpful to tackle those problems. Pattern Boost, for example, is a work that is quite. It's a general method that can tackle these sort of specialized problems. And I think there are more works coming for that that I can't talk about yet.
A
Well, can we talk about Pattern Boost for a second? What's the general idea there?
B
Pattern Boost, for example, uses transformer to generate examples that satisfy certain pattern. And there are global techniques and local techniques that you know when you look at the examples you will perturb them like either locally or globally to synthetically generate more examples. And that has led to the discoveries of many interesting mathematical results. Patent booth kind of extended beyond the paper itself. And it can be a toolkit for generated methods that can be extended to evolutionary algorithms and diffusion methods potentially. But those are all under development, are very early stage. So similar to the techniques and Pattern Boost you can actually have say diffuse boost. That's like a work that is upcoming. And there are also other papers like there is this paper called End to End. It is an open source transformer for translation translation of integer sequences. And it's developed and open source by Francois, improved by Alberto Alfarino who's also at Axiom. And that is sometimes considered the first transformer kit for many mathematicians. And I think there's like one other branch of mathematical discoveries that is under explored for Lean, for example. It does not quite cover graph theory very well. It's very hard to represent graphs in Lean. But there are like, you know, you can potentially do graph foundation models or word models for graphs or other objects and these can probably derive from end to end. But it's a little bit unclear how that line of work will go and whether that will end up as a toolkit.
A
So with regards to auto formalization as kind of a linchpin technology and all of this, what is the current state of play like? Are we trying to get it to work at all? Do we or. I take that back because I'm imagining that if you constrain your problem sufficiently, you can create a trivial auto formalizer. So there are things out there, but like how do you characterize how far along we are with auto formalization? And what do those models look like? Do they look like LLMs where you give them some texts, you give them natural language description of a proof, and they spit out, you know, some representation of that proof, or does it look different than that?
B
I think auto formalization, you know, benchmarks are hard and I don't think there have been a good auto formalization benchmark yet in the community that's widely recognized. So when it comes to auto formalization, there are autoformalization of statements and there are other formalizations of proofs. Once you are sure that your statement is correctly formalized and lean auto formalization of proof is actually easy to verify, right? You see if it compiles or not.
A
When we talk about statements, are we talking about natural language or lean statements?
B
So statements as in like auto formalization of informal statements into lean statements. Now in Lean, you can only know if the syntax as a statement is correct or not. You don't actually, it's very hard to verify the statement like what is. So you can verify a lean prove is correctly serving the purpose of proving that certain statement. Right. You still need to.
A
What's an example of a statement that you think of as kind of a canonical thing that we might want to.
B
Prove, the statement is say like the problem, right? Like I, you know, Fermat's lab theorem, the state the without the proof when it was unsolved, that is, that is a statement. And that's what I'm kind of using. So like, if you think about like a benchmark of 100 problems, all these 100 problems are statements, you know, when you don't have to prove, you don't actually know if your statement is correctly formalized or not. You still need expert to like eyeball and that can be very difficult. So auto formalization of statements. Yeah, I think that haven't. There haven't been like a benchmark for testing that capability per se. But once your statement, you know that your statement is correctly formalized. You want, you want to check if the auto formalized proof compiles, as in whether it is serving the, you know, it is, it is proving the statement. But you want to be sure that it is the same underlying mathematics as your informal language proof, not another proof that also is correct. So, you know, one idea in the community is to test the capability of auto formalization of proofs. You select the problems that likely only have one proof that like you select a problem that you just don't believe. There's like, you know, the large language model will just. The auto formalized like, you know, model will just like generate another proof that no human knows. So that's kind of the way to know. But measuring auto formalization capabilities is hard and that's why people have been trying to auto formalize very difficult results. Right. The bet is we don't know another proof of Fermatla theorem because it's already so much effort to find one. So yeah, I think autoformalization measuring the capability of that is hard.
A
Yeah, there's something in there that I'm trying to get at with regards to auto formalization. So are you saying that there is no to date auto formalizer model? Like this is a goal and even if we constrain the problem to like high school geometry proofs, like, we can't auto formalize those. Like it's. And that was my conjecture that like at some point you can like limit your universe to proofs that are sufficiently simple and you can automate those even if you're just using regular expressions or what like you've got to be able to do it at some level of scale. And the issue is that you're trying to just solve harder problems. Is that right?
B
So I think, you know what I was saying, that is one, there is no good autoformization benchmark. That is the test that has not been quite established in the community. Second is if you look at auto formalizer Kimina, a collaboration effort between Moonshot AI and numina, a nonprofit organization, they have an auto formalizer and you can play with it. It's a demo, but it can only, I think from my experience of playing with it, only auto formalize short statements or extremely short proof. So like I think five lines and more proof, it does actually struggle. So, so there's that. So that, that's high school or I guess middle school level math. There are auto formalizer models that claim to auto formalize and actually, you know, release a link proof of very big theorems. But that is with a lot of human intervention. And I think that the statements and proofs that are being, you know, auto formalized actually rely on the author sometimes of the paper to break that mass down into very fine grained pieces that is workable for a model to auto formalize. So far we haven't seen you know, a completely human intervention free hard result being auto formalized yet.
A
It's a little surprising to me that there's such a gap in that I'm thinking about like high school geometry. Like there's gotta be, you know, if we're talking about geometry, like there's a whole bunch of proofs, they're kind of known. There's a whole bunch of like informal descriptions of those proofs. There's the opportunity to generate synthetic data I think because like we're, the relationships are constrained, you know, by geometry and they're like really well understood. And so we should be able to have some black box where you give it like a natural language description and it'll spit out a proof. Does that not exist?
B
I think your intuition is very right that there's something special about geometry. Geometry is actually geometry. High school Euclidean geometry is actually not covered by Lean as a formal language. So if you look at alpha geometry, that is actually another domain specific language for Euclidean geometry. It's using a vector based language. So other formalization that generally is described. I think converting the natural language proved to Lean like it does. I guess, I guess it's a very different system. So for all the formal, it's kind.
A
Of presupposing a different universe than it.
B
Is, you know, it is a, that is a different universe. The graph foundation model and discoveries I was talking about for graph theory is another universe.
A
Okay.
B
For some specialized problems I rely heavily on constructions, not proofs. You know, like pattern boost, right. Like that is another universe. They're like, okay, yeah, but there's like a big main universe is like lean based formal theorem proving for say like algebra analysis. These like mainstream MATLAB areas.
A
Got it, got it. And so that universe kind of presupposes a certain degree of I think both complexity but also abstraction.
B
Yeah, I think so.
A
And it seems more difficult than when.
B
We talk about like a broad universe. Sometimes you also have interesting problems you run into that maybe the proof relies on a dependency that Matlab doesn't have. So the proof is for an algebra problem, it relies on combinatorics dilemma. And combinatorics is not in Matlab yet. Majority of combinatorics is not unlike as much coverage in algebra. So Then you're like stuck. Library learning, which is the ability for model to come up with definitions or bring a new definition into the knowledge base. That has been also hard. So for the model to say, okay, well, I'm creating a definition which again is not approved. Right. So it cannot be like, verified. It's like the statement being hard to check, the definition is hard to check as well. So there are a lot of, I think, roadblocks that are in this.
A
Is there a role for simulation in these kind of proofs?
B
So, yeah, simulations are going to be a bit hard in that if you think about formal verification, it's quite rigorous. It's a proof. It's almost like many screwdrivers working together to make sure that the wheel actually is turning as it is. I think you would need other stuff to help you do simulations or say approximations and computation. Think like it will eventually be many parts working together, but proving that the property is sound is going to be an inevitable part of the engine.
A
I think what I hear you saying is that all roads lead back to auto formalization and then proving.
B
Yeah, I think we need auto formalization. Proving. We also need other things. I think we also need things that help us compute things faster. We need, you know, eventually, if you think about simulations, that it will definitely involve other things as well.
A
I think there's a certain degree of, like, when we talk about in the context of AI, you know, mathematics and reasoning, there's a degree to which it just kind of makes sense. Like, of course this would be a way to go. Like, mathematics is like a formal formalization, you know, like if we can solve math. And it addresses a lot of the issues that LLMs have. But then at the same time, it's like, you know, mathematics and mathematicians are these like, kind of, you know, geeky things in the corner. And like, we don't really want to think about that stuff. And it hasn't been, you know, like the commercial commercialization of that, like, it always expresses itself through some other more practical thing. Like, what do you think the, you know, the end benefits of these systems that you're trying to create are?
B
I think that's a great question. You know, some people think of Lean as like almost like a subculture. Like the people who love Lean and they use Dulib. It's not even like Discord, it's definitely not slack. It is a group of people who are very passionate. Some of my friends, and they're now also, like joining Axiom, are the initial Mathlib contributors. And they, I remember in 2020, it was during COVID and you know, that friend is teaching me how to play, how to play chess. And I asked him, you know, what do you do all day, you know, besides school and Covid days? So boring. He said, I'm doing chess and I'm doing Lean. So you know, it is, it is really. A lot of people think of it as a, as a subculture. I would say that a lot of things look like subculture when they were just early. Like I remember say, say GPU used to be subculture. Like it used to be just for like, you know, graphics and then the applications of that, you know, that is down the road in the future. I think this sort of geeky things, it is representing something that I think is going to be big. It is going to be a fundamental technology that for the first time you have provable guarantees of things in large language models. I think math and coding are two important, or perhaps the two biggest part of digital world and coding is heavily invested. Math is not, not that math is not tractable, but that math has not been turned into programming language yet. And by Curry, Howard correspondence and Lean, you are able to do that. It is sort of the natural direction. But also there are existing markets that will generate commercial value once the technology works. Formal verification of software, formal verification of hardware. And that is only the formal verification part. Right. And also not, not the super intelligent part. Yeah, I think there are, there are two things going on here, which is one, the main capability of large language models not fully trusted for the most high stake, you know, like safety critical domains. The other thing is they are not insanely smart, they're not super intelligent. Yet both angles, you know, unlock huge number of fields as, as applications. You know, one can even push further and say, hey, Javon's paradox is our opportunity. Once the price of outlier, abstract mathematical quantitative reasoning gets elastic, you will have elastic demand as well. And there will be new markets and new use cases that get unlocked because of this.
A
Yeah, the pushback that comes to mind for me is that with so much investment in AI right now, someone going directly after, you know, verification for code or verification for hardware, you know, might get there a lot quicker than going broadly after mathematical verification and then trying to apply it to this. So maybe the problem is solved by the time you get there.
B
I think the fundamental technology problem is so stubborn, you know, that it's very hard to get around. I think it's like, you know, pre chatgpt we are building a wrapper like that seems a bit Odd to me at least.
A
Are you arguing that you need the mathematical verification to do software and hardware verification? Like we're doing it to some degree now, right?
B
Well, not quite. Well, I think the fully automated way, like, you know, the no or very minimal human intervention part, it's pretty bad. We do believe that, you know, you can feel free to say, hey, you know, I want to in 2025 be a product company and I will develop better capabilities once my product is going and then I will make the model better, which is I think the, for example, the cursor way. Or you can say, well, I think the technology is like so far behind that I can't really build a good product if I don't have it working. And so therefore I choose to be a model their company and then unlock, you know, product in these vertical application. I'll be diligent in doing sort of product market fit iteration going down deeply in each of the vertical domains, which, you know, could be the code way. You have a really good model and then you have a delightful, delightful product. And we like to go the cloud code way, then the cursor way.
A
Yeah, that's, that's kind of the lab versus product company argument. Right.
B
But I also agree with you that because like lean is like somehow like you know, geeky and you know, there needs to be good product design, we're not kind of slacking on that. Like, you know, for example, I think link should be mostly hidden in the, any sort of interface with mathematicians because it's just not that readable. Like, you know, the, the right level of abstraction that one communicates should be, should be at the natural language layer. But then there, there needs to be lean system that is working to, to sound, you know, to make sure that it's like some like to. To it's. It's provable and, and make sure that it is actually sound and correct and rigorous.
A
Right, right. That, that's a really interesting point and insight and I think it kind of speaks to this question. Like, you know, thinking back to, you know, chatgpt, right. We had generative models. The unlock was to a larger degree UI and making it accessible. Like what's the unlocking UI for a mathematical reasoner or an auto formalization model? Like, I don't think we want that on the Internet as a chatbot. It's got to be something else, right?
B
Yeah, I think we have interesting like product brainstorming sessions already. We think that mathematicians work in a very specific way and researchers with large amount of quantitative reasoning in Their daily workflow also work a very specific way. And it's a little bit more like, you know, dreaming. It's very interesting product problem. Right. It's like a technology that hasn't quite existed yet. There are people, the audience generally are not the product people. Like, it's very hard to find intersections like product and mathematicians and so. Yeah, but I think there are cute designs that can make this a very frictionless, like delightful experience.
A
I don't know that there's an answer for this that's different from the entirety of our conversation, but in terms of like, you know, priorities and the next things that are, you know, on your mind for figuring out, like, how do you think about that?
B
Yeah. So are you like we're so back to the technology, right? Like, I think we want to have a very good system that can prove very well, that can auto formalize very well and that can conjecture very well, even though the very well part is still under explorations. It is, it is, it is a deep tech company kind of profile, a startup that we expect to, you know, like SpaceX having many rocket crashes before, before it surely work and like, you know, half the eventual renaissance. We expect like very sort of intense R and D focused period of the startup cycle as well. And we'll try to like, you know, reach these milestones one at a time and eventually have a comprehensive system that is really similar to what the word AI mathematician, you know, represents.
A
Awesome. Awesome. Well, Karina, thanks so much for jumping on and sharing a bit about what you're up to.
B
Thank you so much. This is really delightful conversation.
The TWIML AI Podcast #754: Building an AI Mathematician with Karina Hong (Axiom)
Date: November 4, 2025
Guest: Karina Hong, Founder and CEO at Axiom
Host: Sam Charrington
In this episode, Sam Charrington explores the burgeoning intersection of artificial intelligence, mathematics, and programming languages with guest Karina Hong, founder and CEO of Axiom. Together, they discuss the challenges and opportunities in building an "AI mathematician" — a system capable of formal mathematical reasoning, conjecture, and discovery. Karina shares her personal journey from math Olympian to research mathematician, the current state of formal mathematics with languages like Lean, the role of code generation and reinforcement learning, data challenges, and the roadmap to making mathematics as programmable as code.
“Wouldn’t it be great if you can just let your intuitions take you?… Wouldn’t it be great to just think there’s a high chance if your intuition is right that the lemma is correct and move on…” (04:06, Karina)
Three Key Trends Enabling Progress (05:51)
"The O series of OpenAI really marked an era of post-training, reasoning, reinforcement learning… great technique breakthoughs." (06:16, Karina)
Why Math Is Uniquely Unsolved
How Lean Works and Its Role (08:19–13:39)
“If your statement is correctly formalized and your proof compiles, then congratulations, you have a correct proof.” (13:39, Karina)
Limitations and Challenges
What Is Autoformalization?
The "Trillion vs. Million" Data Problem
Tackling Cold Start
“You do have that sort of like cold start chicken and egg problem…” (17:22, Karina)
Current State and Barriers
Role of RL in Formal Math
Model Strategies
Prover-Conjecturer Self-Improving Systems
“…form like a self improving loop as the prototype of self improving AI.” (22:13, Karina)
Synthetic Data and Benchmarking
“…it is going to be a fundamental technology that for the first time you have provable guarantees of things in large language models.” (47:50, Karina)
Francois Charton’s Team & Pattern Boost
Exploring Beyond Lean
On the challenge of math vs. coding:
“Math has not been turned into programming language yet and there will be new markets and new use cases that get unlocked because of this.” (01:05, Karina)
On autoformalization’s difficulty:
“Autoformalization… is surprisingly very hard. I mean, it’s not just a translation problem.” (14:46, Karina)
On the intersection of formal and informal reasoning:
“We don’t believe in only formal. We don’t believe in purely informal for sure… a hybrid approach… could be very interesting.” (29:19, Karina)
On the feedback loop of conjecturer/prover:
“You can have the prover as the reward signals for the conjecture… have the prover talk to the conjecture… form like a self-improving loop…” (22:06, Karina)
On commercialization:
“…there needs to be good product design… the right level of abstraction that one communicates should be at the natural language layer… [but] lean system working to make sure that it’s provable and correct…” (52:24, Karina)
| Timestamp | Segment Description | |--------------|--------------------------------------------------------------| | 02:20–04:41 | Karina’s early math journey and MIT research experience | | 05:48–07:39 | Why now? The three converging fields enabling AI math | | 08:19–13:39 | Lean explained: programming language for math, strengths | | 14:46–17:22 | Autoformalization explained and data gap challenge | | 17:54–20:24 | Reinforcement learning, formal math benchmarks, curriculums | | 21:55–24:52 | Axiom’s approach: Prover-conjecturer, self-play systems | | 27:40–30:07 | Data exploitation, informal vs. formal data | | 30:29–33:24 | Advice for newcomers, resources, and community | | 33:32–36:31 | Mathematical discovery: Pattern Boost and specialized problems| | 41:17–44:08 | Autoformalization limits and Lean’s coverage | | 47:50–50:32 | Commercial applicability and transformative potential | | 53:00–54:31 | Productization, UI/UX, and the challenge of adoption |
Karina Hong, through Axiom, is spearheading a future where math is not just solved by machines, but formalized, proven, and even creatively discovered with AI. The journey remains fraught with technical and data hurdles: scaling up autoformalization, bridging data gaps, expanding formal language coverage, and creating tools mathematicians will actually use. Yet, as math, programming, and AI coalesce, the prospect grows for both practical applications (verification, discovery) and foundational advances — not just for math, but across all high-stakes domains requiring rigorous, trustworthy reasoning.
Podcast fans in AI, mathematics, or software engineering will find this a rich, hopeful look at a deeply technical fast-moving frontier, still very much open for contribution and innovation.