
A
Okay, so look at the calendar. It is Monday, February 16, 2026. And if you're feeling a little bit of whiplash from the news cycle right now, you are definitely not alone. I mean, usually we get, what, one big tech announcement a month? Maybe.
B
It has been a singular week, hasn't it? Yeah, I think whiplash is probably the technical term for it at this point.
A
Singular is putting it mildly. I feel like we just lived through a year's worth of AI releases in about 96 hours. We've had drops from basically every major lab, and my feed is just a total blur of version numbers.
B
It really does feel that way. It's a saturation point. But amidst all that noise today, we really need to pause and focus on the one that has everyone actually stopping to read the white paper.
A
Exactly. We are talking about Google's latest: Gemini 3 Deepthink. And the claims here are, well, they're big. They aren't just saying, hey, we made it a bit faster or it writes better poetry. No, they're talking about solving problems that stump human experts.
B
Right. And what's interesting is the specific angle they're taking. This isn't just another chatbot update. We're looking at a model explicitly designed for modern science, research and engineering. It's a real pivot from chat to reasoning.
A
So for this deep dive, we are digging into the official technical announcement from Google DeepMind, which just dropped. But we don't just take the press release at face value. Never. So we are also pulling apart a very active, very skeptical, and honestly fascinating discussion thread from Hacker News, where people are actually testing this thing out in the wild.
B
And that community testing is just crucial because the marketing is always going to be shiny. But the users, the engineers, the coders, the scientists, they're the ones finding the cracks in the armor.
A
Exactly. And the mission today is to answer one big question. Is this model actually thinking, like, really reasoning through novel problems? Or is it just a ridiculously expensive version of autocomplete that's really good at faking it?
B
That is the billion dollar question. Or considering the compute costs, we're seeing maybe the trillion dollar question.
A
Let's jump right into the number that basically broke the Internet this week. We have to start with the benchmark.
B
ARC-AGI-2. Yeah, this is the big one. This is the test that keeps researchers up at night.
A
Gemini 3 Deepthink scored 84.6%.
B
Yep.
A
Now, I need you to contextualize that for me, because numbers are just numbers until you compare them. Where were we before this? What was the best before Monday?
B
So to give you a sense of the scale here, the previous heavy hitter, the model everyone was using as the gold standard, the one we all thought was basically untouchable, was Claude Opus 4.6. And its score was 68.8%.
A
Whoa. Okay, so we're talking about a 16 point jump.
B
Almost 16 points. In the world of AI benchmarks, you know, we usually celebrate a one or two point increase. A 16-point leap? That's not an iteration. That is a different species of software.
A
And for people who aren't glued to the benchmark leaderboards, just remind us what ARC AGI actually is. Why does this test matter more than, say, a math test or the bar exam?
B
Great question. So most benchmarks, they test knowledge. If I ask an AI who was the President in 1850, or solve this differential equation, it can often do that just by recognizing patterns it saw in its training data. It's essentially remembering.
A
It's open book testing.
B
It's exactly that. But ARC, the Abstraction and Reasoning Corpus, tests novel reasoning. It gives the AI these visual puzzles, grids of colored squares, and it has to figure out a hidden rule just by looking at a few examples. And crucially, these are patterns it has never, ever seen before.
A
So it's basically an IQ test for machines.
B
It is exactly that. It tests the ability to adapt to something brand new. You can't memorize the answer because the answer isn't in the database. So jumping from 68% to nearly 85% is startling. It suggests that the model isn't just matching patterns anymore. It's actually formulating hypotheses and testing them.
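To make the puzzle format concrete, here is a toy Python sketch of what an ARC-style task looks like: a few input/output grid pairs, a hidden transformation, and a "formulate and test hypotheses" loop. The grids and the three candidate rules are invented for illustration; real ARC-AGI-2 tasks draw on a far larger, unenumerable space of transformations, which is exactly why memorization fails.

```python
# Toy ARC-style task: infer a hidden grid transformation from a few
# input/output examples, then apply it to a new input. The grids and
# candidate rules here are invented for illustration only.

Grid = list[list[int]]

def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def flip_vertical(g: Grid) -> Grid:
    return g[::-1]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

CANDIDATE_RULES = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

# Demonstration pairs. The hidden rule is "flip each row left-right".
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 6, 7]],      [[7, 6, 5]]),
]

# Hypothesis testing: keep only the rules consistent with every example.
consistent = [
    name for name, rule in CANDIDATE_RULES.items()
    if all(rule(x) == y for x, y in train_pairs)
]
print("rules consistent with the examples:", consistent)  # ['flip_horizontal']

# Apply the inferred rule to an unseen test input.
test_input = [[4, 4, 0], [0, 9, 9]]
rule = CANDIDATE_RULES[consistent[0]]
print("predicted output:", rule(test_input))  # [[0, 4, 4], [9, 9, 0]]
```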
A
But, and there is always a but, we found a catch in the community analysis, and it's a massive one. It's the price tag, right?
B
This is where we have to look at the practical side. I mean, achieving that 84.6% score isn't efficient. The analysis on Hacker News pointed out that to get that performance, the model is doing a lot of internal monologue.
A
And time is money.
B
Compute is money. The estimated cost per task to solve just one of these hard problems is around $13.62.
A
Wait, hold on. $13 per question?
B
Per task? Yes.
A
Okay, so I'm not using this to generate my grocery list. Hey, Gemini, do I need milk? Yes, that will be $13, please.
B
Definitely not. And that distinction is just vital. We are moving into an era where we have a split market. We have fast-thinking models for chat, emails, and code completion, and those are cheap. And now we have these deep-thinking models for the heavy lifting.
A
But who pays 13 bucks for an answer?
B
Well, you have to frame it differently. Compare it to a human. Okay, if you are running a lab and you're hiring a grad student or a specialized consultant to solve a novel logic problem or debug a complex theorem, how much are you paying them per hour?
A
Way, way more than $13.
B
Exactly. If the output is valuable. If it solves a bottleneck in your research, $13 is basically free. So Google is positioning this not as a better Siri, but as a cheaper research assistant.
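A back-of-envelope sketch of how a cost-per-task estimate like that might be derived. All of the token counts and per-million-token prices below are hypothetical placeholders, not Google's actual pricing; only the ballpark of the result mirrors the Hacker News estimate discussed here.

```python
# Back-of-envelope cost-per-task arithmetic for a "deep thinking" model.
# Every number below is a hypothetical placeholder, NOT actual Gemini
# pricing; only the ballpark ($13-ish per hard task) mirrors the Hacker
# News estimate discussed in the episode.

PRICE_IN_PER_M = 5.00     # $ per million input tokens (hypothetical)
PRICE_OUT_PER_M = 25.00   # $ per million output tokens (hypothetical)

# Deep-think models burn most of their budget on hidden reasoning tokens.
input_tokens = 8_000          # the puzzle/prompt itself
reasoning_tokens = 520_000    # internal chain-of-thought (billed as output)
answer_tokens = 2_000         # the visible final answer

cost = (input_tokens * PRICE_IN_PER_M
        + (reasoning_tokens + answer_tokens) * PRICE_OUT_PER_M) / 1_000_000
print(f"estimated cost per task: ${cost:.2f}")  # ~$13.09 with these numbers
```

Whatever the real prices turn out to be, the structure is the point: the hidden reasoning tokens dominate the bill, so "thinking harder" scales cost almost linearly.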
A
That makes sense. Heavy thinking costs compute. It's a consultant, not a chatbot. And speaking of research assistants, let's move out of the abstract puzzles and look at what this thing is doing in the real world, because Google claims this is for modern science.
B
And they brought receipts. The technical report was full of these case studies.
A
They did. Let's talk about the physics example first. Lisa Carbone at Rutgers University.
B
This is a great example because of how niche it is. So, Lisa Carbone, she works in high energy physics. She's dealing with the mathematical structures needed to bridge Einstein's theory of gravity with quantum mechanics.
A
The holy grail of physics. The thing everyone has been trying to solve for a century, pretty much.
B
And this is a field with very little training data, because it's so cutting edge. I mean, the papers are being written now. She fed Gemini 3 Deepthink a highly technical mathematics paper, a draft she was working on or reviewing.
A
And what happened?
B
The model identified a subtle logical flaw in the paper.
A
A flaw that humans missed.
B
A flaw that had passed through human peer review unnoticed.
A
Okay, that gives me chills a little bit. We are not talking about it checking for typos or grammar. It understood the logic of high energy physics well enough to correct a human expert.
B
It's the aha moment. It's acting as a second pair of eyes, but a pair of eyes that has read every math textbook in existence and, you know, doesn't get tired.
A
Yeah.
B
And it's not just theoretical math. There was another case study from the Wang lab at Duke University.
A
The crystal growth one.
B
Yes. This is materials science. They were trying to grow these thin crystal films. Specifically, they needed a recipe to grow them larger than 100 micrometers, which I assume is hard. Previous methods couldn't hit that target. It's a messy problem. Chemistry and materials science are full of incomplete data. It's not like a math problem, where there's always a single clean answer. You have temperature, pressure, chemical mixtures. It's just chaos.
A
And an AI solves it.
B
Deepthink analyzed the variables and successfully designed a fabrication method, a recipe to achieve it.
A
It designed the recipe. It didn't just find one on Google.
B
No, it synthesized the known variables to create a new process.
A
So we have an AI debugging gravity theories and cooking up new semiconductors.
B
It aligns perfectly with what we're seeing in the user comments on Hacker News. Users are describing Gemini 3 as feeling book smart or academic. It really excels in these domains. Chemistry, physics, math, where you have messy data but rigorous underlying rules.
A
It's the absent minded professor who solves the unsolvable equation, but maybe forgets where he put his keys.
B
That is the perfect analogy. And actually that transitions us perfectly to the next segment because we've talked about high science. But can this absent minded professor play video games?
A
This was my favorite part of the thread.
B
The Balatro test, right?
A
For those who don't know, Balatro is a poker-themed roguelike game. It's not just poker: you have joker cards that give you multipliers, tarot cards that change your deck. It's all about strategy, probability, and risk management. I'm terrible at it, frankly.
B
It requires long term planning. You have to build an engine to score points.
A
Now, usually when we test AI on games, we think of it seeing the screen, we hook up a camera or feed it screenshots. But that's not what happened here, right?
B
No. This was a user named Rain Cole. They tested Gemini 3 by giving it a text description of the game state. No visuals, just words. You have a pair of kings. The blind is 600 chips. You have a joker that gives plus four multiplier.
A
So it's playing blind, essentially just reading the commentary track.
B
Precisely. It has to build a mental model of the game state entirely from text. And the result? It beat ante eight, which counts as winning a standard run, nine times out of 15. That is a 60% win rate.
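Here is a minimal sketch of what serializing a game state to text might look like. The state fields, the prompt format, and the action grammar are all invented for illustration; the actual harness Rain Cole used may look nothing like this.

```python
# A minimal sketch of driving a card game over text, in the spirit of the
# Balatro experiment described above. The state fields and prompt format
# are hypothetical; the real test harness may look nothing like this.

from dataclasses import dataclass, field

@dataclass
class GameState:
    ante: int
    blind_chips: int
    hand: list[str] = field(default_factory=list)
    jokers: list[str] = field(default_factory=list)

def describe(state: GameState) -> str:
    """Serialize the game state into plain English for the model."""
    return (
        f"Ante {state.ante}. The blind requires {state.blind_chips} chips.\n"
        f"Your hand: {', '.join(state.hand)}.\n"
        f"Active jokers: {', '.join(state.jokers) or 'none'}.\n"
        "Reply with exactly one action: PLAY <cards> or DISCARD <cards>."
    )

state = GameState(
    ante=1,
    blind_chips=600,
    hand=["King of Spades", "King of Hearts", "7 of Clubs", "2 of Diamonds"],
    jokers=["Joker (+4 mult)"],
)
print(describe(state))

# Reported result: winning 9 runs out of 15 attempts.
print(f"win rate: {9 / 15:.0%}")  # 60%
```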
A
Is that good? I mean, for a machine playing a video game via text, it's incredible.
B
The user noted that most human first-time players cannot beat ante eight on their first attempt. In fact, probably 99% of players fail their first run. And other models, like DeepSeek or even GPT-4o, generally can't do this at all. They lose track of the numbers or make illegal moves.
A
So it's demonstrating general intelligence. It wasn't trained to play Balatro specifically. It's reading the rules, understanding the strategy from text, and beating the game better than a novice human.
B
That's the "general" in artificial general intelligence. It's transferring reasoning from one domain to another. But.
A
I hear a but coming. The expert nuance.
B
It's not perfect. It's still software. The community notes mentioned that in one of the runs it tried to sell an invisible joker card.
A
Which is a valid move.
B
It is a valid move in the game logic. But doing so caused the test harness, the software the user wrote to run the test, to crash.
A
It crashed the test.
B
It did. So while the reasoning was sound, selling this card is good for my economy, the integration with the test environment had bugs. It reminds us that underneath all this thought, it's still code execution. But the strategic reasoning, the pure ability to play the game via text, that's a very strong signal.
A
It's fascinating that it can handle the abstract strategy of poker. But what about spatial reasoning? We mentioned ARC-AGI earlier, which is visual grids. But I want to talk about the pelican on a bicycle.
B
Ah, Simon Willison's famous test.
A
I love this test because it just sounds so ridiculous. You ask the model to generate SVG (scalable vector graphics) code of a pelican riding a bicycle.
B
It sounds silly, but think about what that actually requires. The model has to understand what a pelican looks like: the beak, the large body, the webbed feet. It has to understand what a bicycle looks like. And crucially, it has to understand the physics of how a pelican would physically sit on and ride a bicycle.
A
It's a physics problem and an anatomy problem wrapped in a coding problem.
B
Exactly. And previous models have failed hilariously at this. They just draw a blob of lines, or the bike is floating, or the pelican is sort of melting into the handlebars.
A
And Gemini 3?
B
It produced the best results seen so far. It wasn't just a mess. It understood the concept. The pelican was positioned correctly, the feet were on the pedals, or, you know, where the pedals should be. It demonstrated genuine spatial reasoning through code.
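To show what "answering in SVG" means, here is a deliberately crude sketch that emits a stick-figure bird on a bicycle. This is nothing like the model's actual output; it just illustrates that the task is pure code generation, with all the anatomy and physics living in where the coordinates are placed.

```python
# Emit a deliberately crude "bird on a bicycle" as SVG, to show what the
# pelican test asks for: the model must output coordinates like these,
# and the spatial reasoning lives entirely in where the numbers go.
# This drawing is nothing like Gemini's actual output.

svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="140">
  <!-- wheels -->
  <circle cx="50" cy="110" r="25" fill="none" stroke="black"/>
  <circle cx="150" cy="110" r="25" fill="none" stroke="black"/>
  <!-- frame and handlebars -->
  <path d="M50 110 L95 70 L150 110 M95 70 L120 60" stroke="black" fill="none"/>
  <!-- bird body seated over the frame, feet toward the pedals -->
  <ellipse cx="95" cy="50" rx="18" ry="12" fill="white" stroke="black"/>
  <circle cx="112" cy="40" r="6" fill="white" stroke="black"/>
  <!-- the oversized beak is what makes it read as a pelican -->
  <path d="M118 40 L140 44 L118 48 Z" fill="orange" stroke="black"/>
  <line x1="95" y1="62" x2="98" y2="85" stroke="black"/>
</svg>"""

with open("pelican.svg", "w") as f:
    f.write(svg)
print("wrote pelican.svg:", len(svg), "bytes")
```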
A
And this isn't just about drawing cartoons. This connects back to something in the Google announcement, doesn't it? The 3D printing feature?
B
It does. Google explicitly highlighted that you can give Deepthink a sketch, just a rough drawing on a napkin, and it generates the file to 3D print the object.
A
That feels like the pelican on a bicycle test applied to the real world.
B
It is. It's turning napkin math, or a napkin sketch, into physical reality. That spatial reasoning capability allows it to bridge the gap between "I have an idea in my head" and "here is the file to build it." I mean, that is huge for engineering.
A
So we have a model that crushes benchmarks, fixes physics papers, wins poker games and draws pelicans. It sounds like the perfect AI, but I want to do a vibe check.
B
Let's do the vibe check. Because it's not all sunshine.
A
No, it is not. Digging through the user discussions, there is a distinct debate happening between intelligence and what they call agentic utility.
B
This is a really important distinction for anyone thinking about using these tools. Intelligence is can you solve this hard riddle? Agentic utility is can you go follow these five instructions, write this Python script and not mess up the file paths?
A
And Gemini 3 is the absent minded professor again.
B
That's the consensus. Users are saying that for pure coding tasks, you know, simple stuff like write me a script to scrape this website or reformat this CSV file, they actually prefer other models. They prefer Claude Opus 4.6 or GPT 5.3 Codex.
A
Why? Is Gemini just too smart for its own good?
B
Sometimes, yes. One user complained that for simple coding it can be lazy, or it fails to follow negative constraints. You tell it don't do X, and it does X anyway because it thinks it knows better. Or it just loses track of the instruction because it's thinking about the broader logic.
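One practical way to catch that failure mode in your own pipelines is to lint the model's output against the stated constraints before accepting it. A minimal sketch; the constraint list and the sample output are placeholders, not anything from the thread.

```python
# Lint a model's output against negative constraints before accepting it.
# The constraints and sample output below are placeholders; wire this up
# to whatever model call your pipeline actually makes.

import re

NEGATIVE_CONSTRAINTS = {
    "no global variables": re.compile(r"^\s*global\s+\w+", re.MULTILINE),
    "no print statements": re.compile(r"\bprint\("),
}

def violated(output: str) -> list[str]:
    """Return the names of every constraint the output breaks."""
    return [name for name, pattern in NEGATIVE_CONSTRAINTS.items()
            if pattern.search(output)]

# Hypothetical model output for "rename files, and don't print anything".
model_output = "for f in paths:\n    print(f)\n    os.rename(f, f.lower())"

problems = violated(model_output)
if problems:
    print("rejecting output, violated:", problems)  # ['no print statements']
else:
    print("output passed constraint checks")
```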
A
That's frustrating.
B
It is. If you want a worker bee to follow orders, you might not want the genius physicist, you want the diligent clerk. And Claude Opus seems to be the diligent clerk right now.
A
And then there's the deep research caveat. We talked about it acting as a research assistant, but users noticed something worrying about its sources.
B
Hallucinations. But specific kinds of hallucinations.
A
Right. One user noted that while it's great for finding nuggets of inspiration in messy data, if you ask it to cite sources, it sometimes gets creative. It invented terms. It cited Reddit threads that don't exist. That's dangerous, especially for a model that's being pitched at scientists.
B
It is extremely dangerous. It implies that while the reasoning is strong, and the logic of the answer might be sound, the references can be fabricated. It's that dreaming quality of LLMs. It means you can use it to find the needle in the haystack, but you have to verify that the needle is actually real.
A
So it provides the 10% that is gold. The breakthrough thought. But a human has to double check to make sure you aren't publishing a paper that cites a ghost.
B
Exactly. It's a tool for experts who can verify the output. Not a magic box that does your homework perfectly every time. If you don't know the subject matter, Gemini 3 might trick you with a very convincing, very wrong citation.
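One habit that follows directly from this: programmatically check that cited URLs at least resolve before trusting them. A standard-library-only sketch; the URL list is a placeholder, and a live page only proves the reference exists, not that it says what the model claims.

```python
# Sanity-check that model-cited URLs at least resolve. A success here only
# proves the page exists, not that it supports the claim; a 404 is the
# smoking gun for a fabricated reference. Standard library only.

import urllib.error
import urllib.request

def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers an HTTP request without an error."""
    req = urllib.request.Request(
        url, method="HEAD",
        headers={"User-Agent": "citation-checker/0.1"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        return False

# Placeholder citations; substitute whatever the model actually gave you.
citations = [
    "https://example.com/a-paper-the-model-cited",
    "https://www.reddit.com/r/some_thread_that_may_not_exist/",
]
for url in citations:
    print(("OK   " if url_exists(url) else "DEAD "), url)
```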
A
I want to zoom out for a second and just look at the landscape, because we started this episode mentioning how crazy this week has been. We had Opus 4.6, GPT 5.3, GLM 5, and now Gemini 3, all in a matter of days.
B
Release week madness.
A
There is a theory on the forums about why this is happening. The Chinese New Year Theory.
B
Yes. This is a fascinating bit of industry folklore. The theory goes that Chinese labs, like the ones behind GLM 5, tend to release their big updates right before the Lunar New Year.
A
To get them out before the holiday break.
B
Right. It's the last push before everyone goes home to their families. And because the AI field is so competitive, when the US labs see a big release coming from China, they panic. They rush their own releases to stay relevant and dominate the news cycle.
A
So we get this sudden singularity of news where everyone drops their atomic bombs at the same time.
B
It seems that way. It creates this intense pressure cooker environment. And it also explains why some of these releases feel, well, specialized, like they are rushing to show off their specific superpower. Gemini with its reasoning, Claude with its coding, rather than a perfectly polished do-it-all product.
A
Speaking of product, we have to mention accessibility. If I want to use Deepthink right now, can I just log in to the free version?
B
No. Google is keeping this one behind the velvet rope. It is currently gated behind the Google AI Ultra subscription and the API.
A
So it's a premium tool.
B
It is. And honestly, given the cost we discussed, that $13-per-task figure, it makes sense. They can't let the free tier burn through that kind of compute. This is for professionals, researchers, and enterprises who can justify the expense.
A
Okay, let's unpack this for the listener. We've covered a lot of ground, from benchmark shattering scores to hallucinated Reddit threads.
B
It's a mixed bag, but a very, very impressive one.
A
If you are a learner or a researcher, what is the big takeaway here?
B
I think the takeaway is that we are seeing a divergence, a split between agents that do work for you, like coding assistants, and reasoning engines that think with you.
A
And Gemini 3 Deepthink is the reasoning engine.
B
Ideally, yes. If you are stuck on a hard problem, a physics equation, a complex architectural design, a strategic dilemma, this is your partner. It's the smartest person in the room. Even if they are a bit expensive and prone to making up a citation now and then.
A
But if you just need a Python script to rename some files, stick with Claude Opus or the standard models.
B
Right. Don't use a supercomputer to do simple arithmetic.
A
It really feels like we are entering the era of specialized intelligence.
B
We are. And that 84.6% on ARC-AGI-2 is the signal fire. It tells us that AI is learning to deal with novelty. It's learning to handle things it hasn't seen before.
A
And that leads me to my final thought for you, the listeners. And this is something to chew on.
B
Let's hear it.
A
We talked about Lisa Carbone. She used this AI to find a flaw in a peer-reviewed physics paper. An AI checked the work of top-tier human physicists and found a mistake.
B
It acted as a super reviewer.
A
Right, but flip that around. If an AI can find a flaw in a theory of gravity today, what happens a year from now? What happens when it doesn't just check the math, but writes the math?
B
That is the provocative thought.
A
What happens when Gemini 4 or 5 generates a scientific breakthrough, a new proof, a new material, a new theory that is correct, but is so complex that we aren't smart enough to verify it?
B
That is the crisis we are walking into. Are we ready to trust a proof we can't understand?
A
Exactly. If it's smarter than the reviewer, who reviews the AI?
B
That is going to be the central tension of the next decade of science. We might be moving from doing science to interpreting the science the AI did.
A
A fascinating and slightly terrifying place to leave it.
B
Indeed.
A
That is all for this deep dive into Gemini 3 Deepthink. Stay curious, keep learning, and maybe double-check those citations.
B
Always double check the citations.
A
We'll see you in the next one.
Date: February 17, 2026
Host: Next in AI
Episode Theme: A deep dive into Google’s latest large language model, Gemini 3 Deepthink, focusing on its leap in scientific and engineering reasoning, its real-world use cases, limitations, and industry impact.
This episode centers on Google’s Gemini 3 Deepthink—a model explicitly designed not just for dialogue, but for genuine scientific, research, and engineering problem-solving. Hosts break down its benchmark-shattering performance, real-world case studies, and hands-on testing by the developer community, particularly examining whether this is true machine reasoning or simply a more advanced pattern-matcher.
On the leap in reasoning:
“A 16-point leap… That's not an iteration. That is a different species of software.” – B, 02:46
On real-world impact:
“It understood the logic of high energy physics well enough to correct a human expert.” – A, 06:24
On utility:
“If it solves a bottleneck in your research, $13 is basically free.” – B, 05:07
On the absent minded professor analogy:
“It's the absent minded professor who solves the unsolvable equation, but maybe forgets where he put his keys.” – A, 07:49
On hallucinations:
“If you don’t know the subject matter, Gemini 3 might trick you with a very convincing, very wrong citation.” – B, 14:17
On the future of AI in science:
“What happens when Gemini 4 or 5 generates a scientific breakthrough… so complex that we aren’t smart enough to verify it?... Are we ready to trust a proof we can't understand?” – A & B, 17:41–17:57
Final Message:
If you’re solving hard, novel problems—think physics, materials science, or complex engineering—Gemini 3 Deepthink may soon become your indispensable partner. But don’t trust it for citations, and don’t pay supercomputer prices for Python scripts. Welcome to the age of specialist AI—and the frontier where the reviewer might no longer be smarter than the reviewed.