When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

TL;DR: Zhou et al. (2026) challenge the assumption that more reasoning tokens always improve LLM accuracy. Their systematic study reveals diminishing marginal returns, a measurable “overthinking” crossover around 7K tokens, and problem-dependent optimal thinking lengths. The practical takeaway: uniform compute allocation is wasteful — adaptive early stopping can cut reasoning costs significantly without sacrificing accuracy.

The Problem: Does More Thinking Always Help?

The dominant paradigm in LLM reasoning is straightforward: longer chains of thought → better answers. Models like DeepSeek-R1 and OpenAI’s o-series are explicitly trained to reason longer, and test-time compute (TTC) scaling literature has consistently shown accuracy improvements as token budgets increase [1].

But this assumption — that thinking length and answer quality are monotonically related — had never been systematically challenged. Zhou et al. (2026) from Nanjing University and Baidu draw an analogy from economics: the law of diminishing marginal returns. Just as additional units of input eventually yield smaller increments of output, additional reasoning tokens may provide progressively less benefit [2]. And beyond a point, extended thinking might actually be harmful.

“More thinking leads to better answers” is the assumption — but is it true?

What the Paper Investigates

The authors set out to answer three questions:

How does the marginal utility of additional reasoning tokens change as compute budgets increase?
Do models exhibit “overthinking” — abandoning previously correct answers through extended reasoning?
Can adaptive stopping strategies reduce compute costs without sacrificing accuracy?

To study this, they introduce three key analytical tools:

Marginal Utility

Measured as the incremental accuracy gain from each additional block of reasoning tokens. If a model goes from 50% accuracy at 2K tokens to 55% at 4K, the marginal utility of those extra 2K tokens is +5 percentage points. This is the core metric for understanding when additional compute stops paying off.
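The definition above reduces to a difference between neighbouring points on the accuracy curve. A minimal sketch, assuming accuracy is given in percent at each budget; the function name and the per-1K normalization are illustrative choices, not taken from the paper:

```python
def marginal_utility(budgets, accuracies):
    """Accuracy gain between consecutive token budgets.

    budgets: increasing token budgets, e.g. [2000, 4000]
    accuracies: accuracy in percent at each budget
    Returns one dict per step with the gain in percentage points
    and the same gain normalized per 1,000 extra tokens.
    """
    if len(budgets) != len(accuracies):
        raise ValueError("budgets and accuracies must have the same length")
    steps = []
    for i in range(1, len(budgets)):
        extra_tokens = budgets[i] - budgets[i - 1]
        gain = accuracies[i] - accuracies[i - 1]
        steps.append({
            "from": budgets[i - 1],
            "to": budgets[i],
            "gain_pp": gain,
            "gain_pp_per_1k": gain / extra_tokens * 1000,
        })
    return steps


# The worked example from the text: 50% at 2K tokens, 55% at 4K tokens.
print(marginal_utility([2000, 4000], [50.0, 55.0]))
```

Normalizing by the number of extra tokens makes steps of different sizes comparable, which matters when the budget grid is not evenly spaced.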

Flip Events

Instead of just tracking aggregate accuracy, the authors track individual problems through their reasoning trajectories. A positive flip occurs when a model changes a wrong answer to a correct one. A negative flip occurs when it changes a correct answer to a wrong one. The flip ratio (negative/positive) reveals whether extended reasoning is net beneficial or harmful [3].
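Counting flips only needs per-problem correctness at two points along the reasoning trajectory. A minimal sketch, assuming correctness is available as two aligned lists of booleans; the data below is made up for illustration:

```python
import math


def count_flips(correct_before, correct_after):
    """Count positive (wrong -> correct) and negative (correct -> wrong) flips."""
    pairs = list(zip(correct_before, correct_after))
    positive = sum(1 for before, after in pairs if not before and after)
    negative = sum(1 for before, after in pairs if before and not after)
    return positive, negative


def flip_ratio(positive, negative):
    """Negative flips divided by positive flips; above 1.0 means net harm."""
    if positive == 0:
        # Undefined with no flips at all, unbounded with only negative flips.
        return math.inf if negative > 0 else math.nan
    return negative / positive


# Illustrative data: five problems checked at a shorter and a longer budget.
before = [True, True, False, False, True]
after = [True, False, True, False, False]
pos, neg = count_flips(before, after)
print(pos, neg, flip_ratio(pos, neg))
```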

Overthinking Indicators

Three behavioral signals, plus their combination, that predict negative flips:

Hesitation markers — phrases like “wait”, “actually”, “let me reconsider”
Answer oscillation — the model changing its intermediate answer multiple times
Confidence drop — decreasing probability assigned to the final answer
Combined — all signals together
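The three individual signals can each be approximated with a few lines of code. This is a rough sketch, not the paper's implementation: the marker list is just the three phrases quoted above, the intermediate answers are assumed to have been extracted already, and "confidence drop" is taken here to mean peak confidence minus latest confidence.

```python
import re

# Only the phrases quoted in the text; a real marker list would be longer.
HESITATION_PATTERN = re.compile(
    r"\b(wait|actually|let me reconsider)\b", re.IGNORECASE
)


def count_hesitation_markers(reasoning_text):
    """Number of hesitation phrases found in the reasoning trace."""
    return len(HESITATION_PATTERN.findall(reasoning_text))


def count_answer_changes(intermediate_answers):
    """Number of times consecutive intermediate answers differ."""
    return sum(
        1
        for prev, cur in zip(intermediate_answers, intermediate_answers[1:])
        if prev != cur
    )


def confidence_drop(confidences):
    """Peak confidence minus latest confidence (assumed definition)."""
    if not confidences:
        return 0.0
    return max(confidences) - confidences[-1]


trace = "So x = 4. Wait, actually that ignores the sign. Let me reconsider: x = -4."
print(count_hesitation_markers(trace))
print(count_answer_changes(["4", "-4", "4", "4"]))
print(confidence_drop([0.6, 0.9, 0.7]))
```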

Experimental Setup

The paper evaluates several models across mathematical and scientific reasoning benchmarks:

Component	Details
Models	DeepSeek-R1-32B, s1-32B, Qwen2.5-32B-Instruct
Datasets	AIME (competition math), MATH-500 (standardized math), GPQA Diamond (scientific reasoning)
Budget levels	500 to 16,000 tokens, controlled via Budget Forcing
Metrics	Accuracy, marginal utility, flip ratio, cost-adjusted efficiency

The Budget Forcing technique (from Muennighoff et al., 2025) allows controlled allocation of compute by either truncating or extending the model’s inference at predetermined token limits [4].
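The control loop behind this idea can be sketched without a real model. In the sketch below, `generate` is a hypothetical callable standing in for the model, tokens are plain strings, and the "Wait" cue follows the extension trick described in the s1 paper; none of this is the authors' code.

```python
def budget_forced_thinking(generate, prompt, min_tokens, max_tokens,
                           continue_cue="Wait"):
    """Force the thinking phase to use between min_tokens and max_tokens.

    generate(context, max_new_tokens) -> list of token strings; it may
    return fewer tokens than requested if the model ends its thinking.
    """
    thinking = []
    while len(thinking) < max_tokens:
        remaining = max_tokens - len(thinking)
        context = prompt + " " + " ".join(thinking)
        new_tokens = generate(context, remaining)
        # Truncate: never keep more than the remaining budget.
        thinking.extend(new_tokens[:remaining])
        if len(thinking) >= min_tokens:
            break
        # Extend: the model stopped too early, so append a cue and continue.
        thinking.append(continue_cue)
    return thinking[:max_tokens]


def toy_generate(context, max_new_tokens):
    """Hypothetical stand-in for a model that always thinks for 3 tokens."""
    return ["step"] * min(3, max_new_tokens)


print(budget_forced_thinking(toy_generate, "Q:", min_tokens=5, max_tokens=10))
```

With a real model the same loop would operate on token IDs and an end-of-thinking delimiter rather than on strings.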

Key Findings

Finding 1: Diminishing Returns Are Severe

Across all models and datasets, marginal utility declines sharply as compute budgets increase. For DeepSeek-R1-32B on AIME, the marginal utility per 500 tokens drops from +2.1% at low budgets to near zero around 7K tokens, and turns negative beyond 10K [2].

Finding 2: The Overthinking Crossover

The flip ratio exceeds 1.0 (meaning negative flips outnumber positive flips) at approximately 7K tokens for R1-32B on AIME. At 16K tokens, the flip ratio reaches 7.55: for every answer the model corrects, it switches more than seven correct answers to wrong ones.

Budget	Flip Ratio (R1-32B)	95% CI	p-value
2,000	0.32	[0.24, 0.41]	—
4,000	0.60	[0.48, 0.73]	—
6,000	0.87	[0.71, 1.05]	—
7,000	1.09	[1.01, 1.18]	0.014
8,000	1.42	[1.21, 1.68]	0.002
12,000	3.29	[2.87, 3.82]	<0.001
16,000	7.55	[6.12, 9.24]	<0.001

The crossover is statistically significant (p=0.014 at 7K, p=0.002 at 8K), confirmed by bootstrap confidence intervals [5].
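A percentile bootstrap is one common way to get intervals like these. The sketch below is an assumption about the general approach, not the paper's exact procedure, and the flip events are synthetic (100 positive, 109 negative, chosen only so the point estimate is 1.09).

```python
import random


def flip_ratio(events):
    """events: list of +1 (positive flip) and -1 (negative flip)."""
    return events.count(-1) / events.count(1)


def bootstrap_flip_ratio_ci(events, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the flip ratio."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_resamples):
        sample = rng.choices(events, k=len(events))
        # Skip resamples with no positive flips, where the ratio is undefined.
        if sample.count(1) > 0:
            ratios.append(flip_ratio(sample))
    if not ratios:
        raise ValueError("no resample contained a positive flip")
    ratios.sort()
    lower = ratios[int((alpha / 2) * (len(ratios) - 1))]
    upper = ratios[int((1 - alpha / 2) * (len(ratios) - 1))]
    return lower, upper


synthetic_events = [1] * 100 + [-1] * 109
print(flip_ratio(synthetic_events))
print(bootstrap_flip_ratio_ci(synthetic_events))
```

An interval whose lower bound sits above 1.0 is what supports calling the crossover significant; with only a couple of hundred events, as here, the interval is typically much wider than the ones in the table.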

Finding 3: s1-32B Is More Susceptible

Comparing two 32B models, s1-32B crosses the overthinking threshold earlier (~5K tokens) than R1-32B (~7K tokens). This suggests that overthinking susceptibility varies significantly by model architecture and training approach [6].

Finding 4: Optimal Budget Varies 7.5x by Difficulty

On MATH-500, the optimal budget for easy problems (Level 1) is ~1.0K tokens, while hard problems (Level 5) benefit from up to ~7.5K tokens. Uniform compute allocation is inherently suboptimal — easy problems get wasteful over-analysis while hard problems may get insufficient compute [7].
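A difficulty-aware budget can be as simple as a lookup. The sketch below uses only the two budgets quoted above (Level 1 and Level 5); the linear interpolation for Levels 2 to 4 is a placeholder assumption, not a result from the paper, and should be replaced with measured optima.

```python
# Approximate optimal budgets in tokens, from the MATH-500 figures above.
REPORTED_BUDGETS = {1: 1000, 5: 7500}


def budget_for_level(level, reported=REPORTED_BUDGETS):
    """Token budget for a difficulty level.

    Levels between the reported ones are linearly interpolated, which is
    an illustrative assumption rather than a measured optimum.
    """
    low, high = min(reported), max(reported)
    if not low <= level <= high:
        raise ValueError(f"level must be between {low} and {high}")
    fraction = (level - low) / (high - low)
    return round(reported[low] + fraction * (reported[high] - reported[low]))


for level in range(1, 6):
    print(level, budget_for_level(level))

# The spread between hardest and easiest levels quoted in the heading.
print(REPORTED_BUDGETS[5] / REPORTED_BUDGETS[1])
```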

Finding 5: Overthinking Generalizes Beyond Math

On GPQA Diamond (scientific reasoning), the same pattern holds: peak accuracy at ~10K tokens, with flip ratio crossing 1.0 around that point. Scientific reasoning benefits from slightly longer deliberation before overthinking dominates [8].

Finding 6: Natural Long Reasoning Shows the Same Pattern

A potential concern is that Budget Forcing itself introduces artifacts. The authors validate against 312 samples where R1-32B naturally produced >8K tokens. Natural and forced samples show nearly identical accuracy decline patterns, which suggests that overthinking is a genuine model behavior rather than an experimental artifact [9].

Overthinking Indicators in Practice

The paper evaluates three individual signals, and their combination, as predictors of when a model is about to overthink:

Indicator	Correlation with Negative Flips	Precision@80%
Answer oscillation	0.78	71.5%
Hesitation markers	0.71	64.2%
Confidence drop	0.63	58.7%
Combined	0.82	76.3%

Answer oscillation — the model repeatedly changing its intermediate answer — is the strongest individual predictor. Combining all signals achieves 76.3% precision at 80% recall, suggesting that production systems could use these signals for early stopping [10].
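"Precision at 80% recall" means: lower the alarm threshold until 80% of the true negative flips are caught, then ask what fraction of the raised alarms were right. A minimal sketch with made-up scores and labels; how the paper combines the signals into a score is not specified here, so `scores` is simply assumed to exist.

```python
def precision_at_recall(scores, labels, target_recall=0.8):
    """Precision at the first threshold that reaches the target recall.

    scores: indicator score per trajectory (higher = more suspicious)
    labels: 1 if a negative flip followed, else 0
    Ties between scores are not treated specially in this sketch.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    for n_flagged, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / total_positives >= target_recall:
            return true_positives / n_flagged
    return true_positives / len(ranked)


# Illustrative data only: ten trajectories, five of which flipped negatively.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
print(precision_at_recall(scores, labels, target_recall=0.8))
```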

Classification of Negative Flips

Manual analysis of 80 negative flip cases reveals three categories:

Category	%	Description
Genuine overthinking	67.5%	Model explicitly abandons a correct answer, second-guessing itself
Exploration divergence	20.0%	Model veers off into irrelevant reasoning paths
Degradation artifacts	12.5%	Token-level quality degrades, causing garbled output

The majority of negative flips (67.5%) are genuine overthinking — the model had the right answer and talked itself out of it [11].

Cost-Aware Evaluation

One of the paper’s more practical contributions is a cost-aware evaluation framework. Rather than reporting only accuracy, they compute:

Efficiency frontier — the set of (cost, accuracy) points that are Pareto-optimal
Cost-adjusted accuracy — accuracy normalized by compute cost
Optimal stopping point — where marginal utility crosses zero

The results are striking: stopping at ~7K tokens (the crossover point) on AIME reduces compute by ~56% compared to 16K tokens, while maintaining statistically indistinguishable accuracy. In real-world applications the savings may be larger still, since many queries do not need long reasoning chains at all [12].
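The three quantities in this framework, and the ~56% figure, can be reproduced in a few lines. A minimal sketch: the (budget, accuracy) points are invented for illustration, and "cost-adjusted accuracy" is taken to mean accuracy per 1,000 tokens, which is an assumption about the normalization.

```python
def pareto_frontier(points):
    """Keep (cost, accuracy) points that no cheaper-or-equal point beats."""
    frontier = []
    best_accuracy = float("-inf")
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier


def cost_adjusted_accuracy(cost, accuracy):
    """Accuracy per 1,000 tokens (assumed normalization)."""
    return accuracy / (cost / 1000)


def optimal_stopping_budget(points):
    """First budget after which the next step's accuracy gain is <= 0."""
    ordered = sorted(points)
    for (cost, accuracy), (_, next_accuracy) in zip(ordered, ordered[1:]):
        if next_accuracy - accuracy <= 0:
            return cost
    return ordered[-1][0]


def compute_saving(stop_budget, max_budget):
    """Fraction of tokens saved by stopping early."""
    return 1 - stop_budget / max_budget


# Made-up curve with a peak at 7K tokens, for illustration only.
curve = [(2000, 40.0), (4000, 50.0), (7000, 55.0), (12000, 54.0), (16000, 52.0)]
print(pareto_frontier(curve))
print(optimal_stopping_budget(curve))
print(compute_saving(7000, 16000))
```

The last line is the arithmetic behind the headline number: 1 - 7,000 / 16,000 = 0.5625, or roughly 56%.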

What This Means for Practitioners

1. Don’t Always Default to Max Tokens

If you’re deploying reasoning models (DeepSeek-R1, o-series, Claude with extended thinking), the default behavior is to maximize reasoning length. This paper suggests that’s often wasteful and sometimes harmful. For easier queries, shorter thinking yields the same or better results at a fraction of the cost.

2. Implement Adaptive Stopping

The paper’s overthinking indicators — especially answer oscillation — can be monitored in production. A simple heuristic: if the model changes its answer more than twice mid-reasoning, the probability of a negative flip increases significantly. You could:

Stop early when answer oscillation exceeds a threshold
Reduce max_tokens for inference requests classified as easy (based on problem length or model confidence)
Budget-aware routing — send easy queries to a fast, short-reasoning model and hard queries to a longer-reasoning one
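The first option, stopping on answer oscillation, fits naturally into a streaming loop. A minimal sketch, assuming some upstream step already extracts the model's current intermediate answer from the partial reasoning; the class name and the threshold of two changes (from the heuristic above) are illustrative.

```python
class OscillationMonitor:
    """Signals a stop once the intermediate answer has changed too often."""

    def __init__(self, max_changes=2):
        self.max_changes = max_changes
        self.last_answer = None
        self.changes = 0

    def update(self, answer):
        """Record the latest intermediate answer; return True to stop."""
        if answer is None:
            # No answer extracted at this checkpoint; nothing to compare.
            return False
        if self.last_answer is not None and answer != self.last_answer:
            self.changes += 1
        self.last_answer = answer
        return self.changes > self.max_changes


monitor = OscillationMonitor(max_changes=2)
for checkpoint, answer in enumerate(["4", "-4", "4", "-4", "4"]):
    if monitor.update(answer):
        print(f"stop at checkpoint {checkpoint} after {monitor.changes} changes")
        break
```

Which answer to return after stopping (the first, the most frequent, or the latest) is a separate design decision that this sketch leaves open.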

3. Report Efficiency Curves

When evaluating reasoning models, the paper recommends publishing accuracy at multiple compute budgets, not just the maximum. This is analogous to reporting inference latency alongside accuracy — a model that achieves 95% accuracy at 1K tokens is often more useful than one that achieves 96% at 16K tokens.

4. Watch for Model-Specific Overthinking Profiles

s1-32B overthought earlier than R1-32B (~5K vs ~7K), suggesting that overthinking susceptibility is a model-level property worth benchmarking. When choosing between reasoning models, the “optimal stopping point” is as important as peak accuracy [13].

Limitations and Open Questions

Scale studied: Up to 32B parameters. Do 70B+ models overthink later or not at all?
Domain scope: Math and science. Does overthinking occur in creative writing, code generation, or multi-turn conversation?
Indicator precision: 76.3% precision at 80% recall is useful but leaves room for improvement. Roughly one in four alarms would be false, so a production early-stopping system would need higher precision to avoid cutting off useful reasoning.
Budget Forcing artifacts: The natural-length validation addresses this partially, but the forced truncation/extension setup may still introduce biases not fully captured.

Bottom Line

Zhou et al. (2026) deliver a clear, empirically grounded challenge to the “more thinking is always better” assumption in LLM reasoning. The paper provides:

Measurement tools (marginal utility, flip ratio, overthinking indicators) that practitioners can adopt in their own evaluation pipelines
Actionable thresholds (the ~7K token crossover for 32B models, varied by difficulty)
A cost-aware framework that makes the economic case for adaptive compute allocation

For anyone deploying reasoning models in production, the key insight is counterintuitive: past a model- and task-dependent crossover, giving the model more compute can make it worse. The smart play isn’t max tokens; it’s knowing when to stop.

References

[1] DeepSeek-AI et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.

[2] Zhou, S., Ling, R., Chen, J., Wang, X., Fan, T., & Wang, H. (2026). When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling. arXiv:2604.10739.

[3] Chen, J. et al. (2024). Fine-grained answer analysis for reasoning in LLMs. arXiv preprint.

[4] Muennighoff, N. et al. (2025). s1: Simple Test-Time Scaling. arXiv:2501.19393.

[5] Zhou et al. (2026), Table 3: Bootstrap confidence intervals for flip ratios.

[6] Zhou et al. (2026), Figure 4: Flip event analysis comparing R1 vs s1.

[7] Zhou et al. (2026), Figure 5: Difficulty-stratified analysis on MATH-500.

[8] Zhou et al. (2026), Table 5: GPQA Diamond results (R1-32B).

[9] Zhou et al. (2026), Table 6: Natural vs. forced long reasoning validation.

[10] Zhou et al. (2026), Table 4: Overthinking indicator effectiveness.

[11] Zhou et al. (2026), Table 7: Qualitative analysis of negative flips.

[12] Zhou et al. (2026), Section 5: Cost-aware evaluation.

[13] Snell, C. et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314.
