Transcript
A (0:00)
Hello, I'm Andrew Mayne and this is the OpenAI podcast. Today our guests are researchers Sébastien Bubeck and Ernest Ryu, and we're going to talk about math: how it went from almost laughable to Olympiad level, and why you need math to reach AGI.
B (0:14)
The progress of the last few years has been nothing short of miraculous.
C (0:18)
We will be able to have LLMs solve problems that require more than 50 pages of thinking.
B (0:24)
Mathematics was just the perfect benchmark for seeing the model make progress over the last four years.
A (0:34)
Sébastien, Ernest, I'd love to know more about you. How would you explain your roles?
B (0:38)
Yeah, sure. So I have been working in mathematics for almost 20 years now. I used to work in optimization and the theory of machine learning. I was a professor at Princeton for a few years before moving to Microsoft, and now I'm a researcher at OpenAI. In the last few years I have really been trying to understand how AI can help mathematics, and to evaluate the progress that we're making in terms of solving difficult math problems with AI.
A (1:08)
Ernest, how about you?
C (1:09)
Yeah, so I've recently joined OpenAI as a researcher, but before that I was an applied mathematician working on optimization and machine learning theory. In my previous job I was a professor of mathematics in the UCLA Math department.
A (1:26)
So I think a lot of people have this perception that these models aren't good at math; they're literally called language models. How has that changed? What's gone on?
B (1:35)
Yeah, I think, you know, the progress of the last few years has been nothing short of miraculous. It's important to remember that two years ago we didn't even have reasoning models, let alone models that could prove, you know, difficult mathematical theorems. Today, two years later, the models are able to help Fields medalists in their day-to-day work. Really, the jump is just astounding. Maybe I can build a little bit more on that. Something which is important to understand is that everybody has been surprised by this progress, including us. To tell you a story, a year and a half ago I was at a workshop at a conference with fellow mathematicians, and I participated in a debate on whether scaling LLMs would help us resolve major open problems. So this was a debate a year and a half ago, and the room was very divided. In fact, they did a poll at the beginning, and I think it was something like 80% who said no, impossible that this would happen. So then the debate unfolded, and by the end of it the split was more like 50-50. So pretty good progress during that hour. This obviously was just so wrong in hindsight: a mere eight months later, the models were starting to be able to do research-level mathematics.
