Steve Gibson (142:53)
Yeah, but when that first peg was stacked with four disks, the deeper thinking model's performance was restored: the simpler Claude 3.7 LLM collapsed to only finding the solution 35% of the time, whereas the thinking model held at 100%. As the disk count then increases above 4, both models' performance continues to drop, but the LRM holds a huge lead over the LLM until they get to 8 disks. The LLM is never able to solve that one, whereas the thinking model finds the 8-disk solution about 1 out of every 10 tries, about 10%, but 10 disks is beyond the reach of either. (The sketch after this paragraph shows just how quickly the required move count grows with the disk count.) The full research paper has lots of interesting detail about the various models' performance on the four puzzle types. I noted, however, that the nature of the other three puzzles seems to be pretty much beyond the grasp of any of this so-called AI. One of their more interesting findings was the appearance of what they term the three complexity regimes. Paraphrasing from the paper, under "How does complexity affect reasoning?" they wrote: motivated by these observations, to systematically investigate the impact of problem complexity on reasoning behavior, we conducted experiments comparing thinking and non-thinking model pairs across our controlled puzzle environments. Our analysis focused on matching pairs of LLMs with identical model backbones, specifically Claude 3.7 Sonnet with and without thinking, and DeepSeek R1 versus V3. For each puzzle, we vary the complexity by manipulating the problem size N, where N represents the disk count, the checker count, the block count, or the crossing elements. Results from these experiments demonstrate that, unlike observations from math, and that's probably one of the most significant things here. You know, we keep seeing, oh, these do better than a math PhD, and it's like, okay, how about frogs jumping over each other? Oh well, no, can't do frogs. No. So they said, there exist three regimes in the behavior of these models with respect to complexity. In the first regime, where problem complexity is low, we observed that non-thinking models are capable of obtaining performance comparable to, or even better than, thinking models with more token-efficient inference, meaning it's cheaper to do them. In the second regime, with medium complexity, the advantage of reasoning models capable of generating long chains of thought begins to manifest, and the performance gap between the model pairs increases. The most interesting regime is the third regime, where problem complexity is higher and the performance of both models collapses to zero. Results show that while thinking models delay this collapse, they ultimately encounter the same fundamental limitations as their non-thinking counterparts. I think it's important to address their decision to use puzzles as an evaluation mechanism versus math problems. They gave this a lot of thought, and on the math and puzzle environments question they wrote the following. They said: currently it is not clear whether the performance enhancements observed in recent reinforcement learning (RL) based thinking models, all of the LRMs we've been talking about, are attributable to increased exposure to established mathematical benchmark data, to the significantly greater inference compute allocated to thinking tokens, or to reasoning capabilities developed by RL-based training, that is, the reinforcement learning training.
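To make that disk-count scaling concrete, here's a minimal Python sketch of the classic recursive Tower of Hanoi solution. This is my own illustration, not code from the Apple paper; the point is simply that the optimal solution takes exactly 2^N - 1 moves, so 4 disks need 15 moves, 8 disks need 255, and 10 disks need 1,023, which is why each additional disk roughly doubles the length of the error-free move sequence a model has to produce.

```python
# Classic recursive Tower of Hanoi, purely to illustrate how the move count
# scales with the number of disks (not code from the Apple paper).

def hanoi_moves(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks as (disk, from_peg, to_peg) tuples."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi_moves(n - 1, source, spare, target, moves)  # park the n-1 smaller disks
    moves.append((n, source, target))                 # move the largest disk
    hanoi_moves(n - 1, spare, target, source, moves)  # stack the smaller disks back on top
    return moves

for n in (4, 5, 8, 10):
    print(n, "disks ->", len(hanoi_moves(n)), "moves")  # 15, 31, 255, 1023 (i.e., 2**n - 1)
```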
Recent studies have explored this question with established math benchmarks by comparing the upper-bound capabilities of reinforcement learning based thinking models with their non-thinking standard LLM counterparts. They've shown that under equivalent inference token budgets, non-thinking LLMs can eventually reach performance comparable to thinking models on benchmarks like MATH-500 and AIME24. We also conducted our comparative analysis of frontier LRMs like Claude 3.7 Sonnet with and without thinking and DeepSeek R1 versus V3. Our results confirm that on the MATH-500 dataset, the performance of thinking models is comparable to their non-thinking counterparts when provided with the same inference token budget. However, we observed that this performance gap widens on the AIME24 benchmark and widens further on AIME25. This widening gap presents an interpretive challenge. It could be attributed to either increasing complexity requiring more sophisticated reasoning processes, thus revealing genuine advantages of the thinking models for more complex problems, or reduced data contamination in the newer benchmarks, particularly AIME25. Interestingly, human performance on AIME25 was actually higher than on AIME24, suggesting that AIME25 might be less complex. Yet models perform worse on AIME25 than on AIME24, potentially suggesting that data contamination during the training of frontier LRMs is occurring. That is, there's more contamination from the older benchmark, because there's been more time for that contamination to happen, as compared to the newer testing benchmark. Given these non-justified observations, and the fact that mathematical benchmarks do not allow for controlled manipulation of problem complexity, we turned to puzzle environments that enable more precise and systematic experimentation. Okay, so we have the very real problem of data contamination, which makes judging what these AI models are actually doing difficult, meaning that the models may have previously encountered the problems during their training and simply memorized the answers. So they're not actually reasoning; they're not thinking or solving new problems. They're pattern matching at a very high level and just regurgitating. But even puzzles like the Towers of Hanoi and River Crossing exist on the Internet and are also presumably in the training data. The researchers talk about this under the heading Open Questions: Puzzling Behavior of Reasoning Models. They write: we present surprising results concerning the limitations of reasoning models in executing exact problem-solving steps, as well as demonstrating different behaviors of the models based on the number of moves. In the Tower of Hanoi environment, and here again, this is what I was talking about, even when we provide the algorithm to be used in the prompt, so that the model only needs to execute the prescribed steps, performance does not improve and the observed collapse still occurs at roughly the same point. This is noteworthy because finding and devising a solution should require substantially more computation for search and verification than merely executing a given algorithm. This further highlights the limitations of reasoning models in verification and in following logical steps to solve a problem, suggesting that further research is needed to understand the symbolic manipulation capabilities of such models.
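That distinction between devising a solution and merely executing or verifying one is easy to make concrete. Here is a minimal sketch, again my own illustration rather than the paper's actual simulator code, of the kind of deterministic, step-by-step check a puzzle environment can apply to a model's proposed Tower of Hanoi move sequence: replay each move and flag the first one that is illegal.

```python
# A toy step-by-step validator for a proposed Tower of Hanoi solution
# (my own sketch of the idea, not the paper's simulator).

def first_invalid_move(n_disks, moves):
    """Return the 1-based index of the first illegal move, or None if every move is legal."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # largest disk at the bottom
    for i, (src, dst) in enumerate(moves, start=1):
        if not pegs[src]:
            return i                      # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return i                      # tried to place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return None

# The optimal 3-disk solution passes; an obviously bad sequence is caught immediately.
good = [("A","C"), ("A","B"), ("C","B"), ("A","C"), ("B","A"), ("B","C"), ("A","C")]
bad  = [("A","B"), ("A","B")]
print(first_invalid_move(3, good))  # None
print(first_invalid_move(3, bad))   # 2 (puts disk 2 on top of disk 1)
```

Checking a move sequence like this is linear in its length, which is what makes the researchers' observation so striking: handing the models the generating algorithm, so all that remains is rote execution, didn't move the collapse point.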
Moreover, we observe very different behavior from the Claude 3.7 Sonnet thinking model in the Tower of Hanoi environment. The model's first error in the proposed solution often occurs much later, around move 100 when you have 10 disks, compared to the River Crossing environment, where the model can only produce a valid solution up until move 4. Note that this model also achieves near-perfect accuracy when solving the Tower of Hanoi with five disks, which requires 31 moves, while it fails to solve the River Crossing puzzle with just N equals 3, which has a solution in only 11 moves. This likely suggests that examples of River Crossing with N greater than 2 are scarce on the web, meaning LRMs may not have frequently encountered or memorized such instances during training. In other words, it is very, very difficult to test these models: you need clean models that have not absorbed contaminating information that allows them to appear to be creating new thought, as opposed to just finding something from the past. So this work by Apple's researchers is full of terrific insights, and I want to commend it to anyone who's interested in obtaining a more thorough understanding of where things probably stand at this point in time. I've got a link right under the title at the beginning of this in the show notes. So here's what the researchers conclude. They said: in this paper we systematically examine frontier large reasoning models through the lens of problem complexity using controllable puzzle environments. Our findings reveal fundamental limitations in current models. Despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds. So I'm going to repeat that, since I think that's the essence of this entire paper. Our findings reveal that despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds. So the models are doing much better at doing what their simpler LLM brethren have been doing, but the difference is fundamentally quantitative, not qualitative. Apple continues: we identified three distinct reasoning regimes. Standard LLMs outperform LRMs at low complexity, LRMs excel at moderate complexity, and both collapse at higher complexity. Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent compute scaling limit in LRMs. Our detailed analysis of reasoning traces further exposed complexity-dependent reasoning patterns, from inefficient overthinking on simpler problems to complete failure on complex ones. These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning. Finally, we presented some surprising results on LRMs that lead to several open questions for future work. Most notably, we observed their limitations in performing exact computation. For example, when we provided the solution algorithm for the Tower of Hanoi to the models, their performance on this puzzle did not improve. That is, they gave them the answer and it didn't help. Moreover, investigating the first failure move of the models revealed surprising behaviors. For instance, they could perform up to 100 correct moves in the Tower of Hanoi, but failed to provide more than five correct moves in the River Crossing puzzle.
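For a sense of how short that River Crossing instance really is, the paper's puzzle is an actor/agent variant of the classic missionaries-and-cannibals family. The sketch below uses the simpler classic rules rather than the paper's exact actor/agent formulation, so treat it as a rough, assumed stand-in: a breadth-first search over the tiny state space finds that the shortest solution for three pairs with a two-person boat is 11 crossings, which makes a first error by move 4 all the more striking.

```python
from collections import deque

# Breadth-first search over the classic missionaries-and-cannibals state space,
# a simplified stand-in for the paper's actor/agent River Crossing puzzle.
# State = (missionaries on left, cannibals on left, boat on left?).

def shortest_crossing(n=3, boat=2):
    def bank_ok(m, c):
        return m == 0 or m >= c                      # one group never outnumbered on a bank
    start, goal = (n, n, 1), (0, 0, 0)
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        (m, c, b), depth = frontier.popleft()
        if (m, c, b) == goal:
            return depth
        sign = -1 if b else 1                        # boat on left ferries people right, and vice versa
        for dm in range(boat + 1):
            for dc in range(boat + 1 - dm):
                if dm + dc == 0:
                    continue                         # the boat can't cross empty
                nm, nc = m + sign * dm, c + sign * dc
                if 0 <= nm <= n and 0 <= nc <= n and bank_ok(nm, nc) and bank_ok(n - nm, n - nc):
                    state = (nm, nc, 1 - b)
                    if state not in seen:
                        seen.add(state)
                        frontier.append((state, depth + 1))
    return None

print(shortest_crossing())  # 11 crossings for three pairs and a two-person boat
```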
We believe our results can pave the way for future investigations into the reasoning capabilities of these systems. And then finally, under Limitations, they just said: we acknowledge that our work has limitations. While our puzzle environments enable controlled experimentation with fine-grained control over problem complexity, they represent a narrow slice of reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning problems. You know, they're algorithmic, not knowledge-based. It is notable that most of our experiments rely on black-box API access to the closed frontier LRMs, limiting our ability to analyze internal states or architectural components. Furthermore, the use of deterministic puzzle simulators assumes that reasoning can be perfectly validated, you know, step by step. However, in less structured domains, such precise validation may not be feasible, limiting the transferability of this analysis to other, more generalizable reasoning. So in other words, this is only what it is. It may or may not be more widely applicable, and it may not even have any meaning or utility beyond the scope of these problems. There's not a great deal of real-world need, you know, for stacking disks on poles, after all. But for what it's worth, it does track with the intuition many of us have about where the true capabilities of today's AI fall. You know, terms like comprehend or understand or even reason really don't seem to apply. They're used by AI fanboys. Maybe they're just a lazy shorthand, but I don't feel that they're helpful. In fact, I think they're anti-helpful. So what I think we need is some new anti-anthropomorphic terminology to accompany this new technology. There's zero question that scale-driven computation has changed the world forever. Everyone is asking ChatGPT and other consumer AIs more and more questions every day, and that's only going to accelerate as the benefits become more widely known. AI does not need to become AGI or self-aware to be useful, and frankly, I would strongly prefer that it did not. To that end, I doubt that we have anything to worry about anytime soon, and perhaps not even for the foreseeable future. Thus the title of today's podcast, The Illusion of Thinking, because I believe that the fairest conclusion is that's all we have today: it's useful, but it's not thought.