Apple's AI Reasoning Study Challenged: New Research Questions "Thinking Collapse" Claims
A scientific dispute has erupted over the capabilities of advanced AI systems, with independent researchers challenging Apple's recent study that claimed to identify fundamental limitations in how AI systems reason through complex problems.
In a paper published this week titled "The Illusion of the Illusion of Thinking," researchers C. Opus and A. Lawsen have directly contested findings from Apple's research team led by Parshin Shojaee and Iman Mirzadeh. The Apple study, which made waves with its claim that AI systems experience "accuracy collapse" when facing increasingly complex puzzles, may have overlooked critical experimental design factors, according to the new analysis.
"Our analysis reveals that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures," write Opus and Lawsen in their rebuttal paper.
Was It Really a Thinking Failure?
The heart of the dispute centers on whether the observed failures in Apple's study represent true cognitive limitations or simply practical constraints in how the experiments were designed.
Apple's researchers used a methodology where AI models were asked to solve increasingly complex versions of classic puzzles like the Tower of Hanoi and River Crossing problems, listing every step of their solutions. As the puzzles grew more complex, the models' performance supposedly fell off a cliff.
But the new analysis found that the Tower of Hanoi experiments systematically exceeded the models' output token limits at precisely the points where "failure" was reported. In other words, the models were being asked to write answers that were too long for their maximum output capacity.
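To see why, consider the arithmetic. A full Tower of Hanoi solution for N disks contains 2^N - 1 moves, and every one of them has to be written out. The short calculation below is only illustrative - the per-move token cost and the output cap are assumptions for the sake of the example, not figures from either paper - but it shows how quickly exhaustive enumeration outgrows any fixed output budget.

```python
# Back-of-the-envelope check: how many tokens does a fully enumerated
# Tower of Hanoi solution need as the disk count grows?
# The per-move cost and output cap below are illustrative assumptions.

TOKENS_PER_MOVE = 10      # assumed cost of writing one move, e.g. "Move disk 3 from A to C"
OUTPUT_CAP = 64_000       # assumed maximum output tokens for a model

for n_disks in (8, 10, 12, 15):
    moves = 2 ** n_disks - 1              # minimum number of moves for n disks
    tokens = moves * TOKENS_PER_MOVE      # tokens needed to list them all
    verdict = "fits" if tokens <= OUTPUT_CAP else "exceeds the cap"
    print(f"{n_disks} disks: {moves:>6} moves ≈ {tokens:>7} tokens ({verdict})")
```

Under those assumptions, a 15-disk instance alone would need on the order of 300,000 output tokens just to list the moves, regardless of whether the model knows the algorithm.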
"A critical observation overlooked in the original study: models actively recognize when they approach output limits," the researchers note. They found evidence that models explicitly stated things like "The pattern continues, but to avoid making this too long, I'll stop here" when solving Tower of Hanoi problems.
This awareness suggests the models understood the solution pattern but chose to truncate their outputs due to practical constraints - a very different scenario from not understanding how to solve the problem.
The Impossible Puzzle Problem
Perhaps the most striking finding in the new analysis concerns the River Crossing puzzles used in the Apple study. These puzzles involve moving actors across a river using a boat with limited capacity, subject to various constraints.
The researchers discovered that for instances with 6 or more actors and a boat capacity of 3 (the parameters used in the original study), the puzzles are mathematically impossible to solve.
"By automatically scoring these impossible instances as failures, the authors inadvertently demonstrate the hazards of purely programmatic evaluation," Opus and Lawsen write. "Models receive zero scores not for reasoning failures, but for correctly recognizing unsolvable problems."
This finding casts serious doubt on the Apple study's conclusions about model performance on these puzzles. If the models were being asked to solve impossible problems, their "failure" to do so isn't evidence of reasoning limitations but rather a correct assessment of the problem's nature.
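Checking solvability before scoring is straightforward in principle. A generic breadth-first search like the sketch below can confirm whether any sequence of legal moves reaches the goal; the goal and successor functions are placeholders here, since the exact River Crossing rules would have to be encoded to reproduce the check from the rebuttal.

```python
from collections import deque

def is_solvable(start, is_goal, successors):
    """Breadth-first search over puzzle states.

    Returns True if some sequence of legal moves reaches a goal state.
    `start` must be hashable, `is_goal(state)` returns a bool, and
    `successors(state)` yields the states reachable in one move.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if is_goal(state):
            return True
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False  # search space exhausted: the instance is unsolvable

# Trivial smoke test: can we reach 10 from 0 in steps of +3? (No.)
print(is_solvable(0, lambda s: s == 10, lambda s: [s + 3] if s + 3 <= 10 else []))
```

Running such a check on the actor counts and boat capacity used in the study is exactly the verification step Opus and Lawsen argue was missing.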
Different Approach, Different Results
To test their hypothesis that the failures were due to format constraints rather than reasoning limitations, Opus and Lawsen conducted preliminary tests using a different approach. Instead of asking models to enumerate every move for Tower of Hanoi puzzles, they asked them to output a Lua function that would print the solution when called.
This approach dramatically changed the results. When tested on Tower of Hanoi with 15 disks - well beyond the "collapse" threshold reported in the original study - models including Claude-3.7-Sonnet, Claude Opus 4, OpenAI o3, and Google Gemini 2.5 all showed "very high accuracy" and completed the task in under 5,000 tokens.
"The generated solutions correctly implement the recursive algorithm, demonstrating intact reasoning capabilities when freed from exhaustive enumeration requirements," the researchers report.
This finding suggests that the models understand the underlying patterns and algorithms needed to solve these complex puzzles. When allowed to express their solutions in a more efficient format, they can handle much more complex instances than the Apple study suggested.
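For illustration, here is a Python analogue of that program-as-answer format (the original tests asked for a Lua function; this sketch is not code from the paper). The entire recursive solution fits in a few lines no matter how many disks are involved:

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Print the complete move sequence for n disks."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target)               # clear the smaller disks out of the way
    print(f"Move disk {n} from {source} to {target}")
    hanoi(n - 1, spare, target, source)               # rebuild them on the target peg

hanoi(3)   # 7 moves; hanoi(15) would print all 32,767 without the function growing at all
```

A grader can execute such a function to recover and verify the full move sequence, so accepting the compact form loses nothing except the typing.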
Rethinking How We Measure Problem Difficulty
The new paper also challenges Apple's use of "compositional depth" (minimum moves required) as the primary metric of problem complexity, arguing that this metric conflates mechanical execution with problem-solving difficulty.
They provide a comparative analysis of the puzzle types used in the original study:
Tower of Hanoi requires 2^N - 1 moves (exponential growth), but has a branching factor of 1, meaning there's only one correct move at each step. This makes it algorithmically simple despite requiring many moves.
River Crossing puzzles require approximately 4N moves, but have a branching factor greater than 4 and require complex constraint satisfaction and search. These puzzles are NP-hard.
Blocks World problems require approximately 2N moves, have a branching factor of O(N²), and also require search. These problems are PSPACE-complete, making them computationally very difficult.
"This explains why models might execute 100+ Hanoi moves while failing on 5-move River Crossing problems," the researchers note. The difficulty lies not in the number of moves, but in figuring out which moves to make.
Why This Matters for AI Evaluation
The findings from Opus and Lawsen have significant implications for how we evaluate AI systems, particularly when assessing their reasoning capabilities.
First, they highlight the importance of distinguishing between different types of limitations. A model that understands how to solve a problem but can't output the complete solution due to token limits is very different from a model that fundamentally doesn't understand the problem.
Second, they emphasize the need for careful puzzle design and verification. Using impossible puzzles as test cases without acknowledging their impossibility can lead to misleading conclusions about model capabilities.
Third, they suggest that evaluation frameworks should be more flexible in how they accept solutions. Allowing models to express solutions in different formats - such as algorithms or functions rather than exhaustive move lists - may provide a more accurate picture of their reasoning abilities.
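A sketch of what such a flexible grader might look like (the function names and the Tower of Hanoi verifier below are hypothetical, written for this article rather than taken from either paper):

```python
def verify_hanoi(moves, n_disks):
    """Check that a sequence of (disk, src, dst) moves legally solves n-disk Hanoi."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False                        # disk isn't on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                        # can't place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n_disks, 0, -1))

def grade(solution, n_disks):
    """Accept either an explicit move list or a callable that produces one."""
    moves = solution(n_disks) if callable(solution) else solution
    return verify_hanoi(moves, n_disks)

# A compact, program-style answer that the same grader can still score:
def recursive_answer(n, src="A", dst="C", spare="B"):
    if n == 0:
        return []
    return (recursive_answer(n - 1, src, spare, dst)
            + [(n, src, dst)]
            + recursive_answer(n - 1, spare, dst, src))

print(grade(recursive_answer, 15))  # True: same verification, far fewer output tokens
```

The verification step is unchanged; only the accepted answer format is broader.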
"The question isn't whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing," the researchers conclude, emphasizing that many apparent limitations may be artifacts of evaluation design rather than fundamental cognitive barriers.
The Ongoing Debate About AI Reasoning
This scientific dispute takes place against a backdrop of ongoing debate about the capabilities and limitations of large language models and their reasoning abilities.
Some researchers have argued that these models are fundamentally limited in their ability to reason, suggesting they merely mimic reasoning through pattern matching rather than engaging in genuine logical thought. Others have pointed to impressive performances on complex reasoning tasks as evidence that these models can engage in something akin to reasoning, even if it differs from human reasoning processes.
The findings from Opus and Lawsen don't resolve this broader debate, but they do suggest caution in interpreting apparent failures. What looks like a reasoning limitation may actually be a constraint imposed by the evaluation framework or the way questions are posed.
This aligns with a growing body of research suggesting that how we prompt or instruct AI systems can dramatically affect their performance. The same model might fail or succeed on identical tasks depending on how the task is framed or what output format is requested.
Moving Forward: Better Ways to Test AI Thinking
Opus and Lawsen conclude their paper with recommendations for future work in this area:
Design evaluations that can distinguish between reasoning capability and output constraints
Verify puzzle solvability before evaluating model performance
Use complexity metrics that reflect computational difficulty, not just solution length
Consider multiple solution representations to separate algorithmic understanding from execution
These recommendations point toward a more nuanced approach to evaluating AI reasoning capabilities - one that acknowledges the practical constraints of these systems while still pushing to understand their true cognitive limitations.
The researchers also note that budget constraints prevented them from running enough trials for a comprehensive statistical sample, leaving fuller validation of their findings to future work.
The Importance of Scientific Debate
The dispute between these research teams highlights a crucial aspect of AI research: the methods we use to evaluate these systems can significantly impact our understanding of their capabilities and limitations.
As AI systems become more sophisticated and are deployed in increasingly critical applications, accurate assessment of their abilities becomes ever more important. Mischaracterizing their limitations could lead to either unwarranted confidence in their abilities or unnecessary restrictions on their use.
The paper by Opus and Lawsen serves as a reminder that even as we create increasingly powerful AI systems, the human element in designing fair and accurate evaluations remains essential. Our understanding of these systems is only as good as the methods we use to test them.
Their work also highlights the value of scientific debate and replication in AI research. By challenging findings and reexamining methodologies, researchers can collectively work toward a more accurate understanding of what these systems can and cannot do.
As we continue to develop and deploy AI systems, maintaining this spirit of critical inquiry and careful evaluation will be essential to realizing their potential while understanding their true limitations.