Reasoning's Razor: When Thinking More Makes Safety Worse
Paper Review · Large Reasoning Models (LRMs) like DeepSeek-R1 and QwQ-32B have become remarkably capable at solving complex problems through extended chain-of-thought. The natural instinct is to apply this power to safety-critical tasks: detecting harmful content, catching hallucinations, flagging policy violations. More reasoning = more accuracy = safer AI, right?
A new paper challenges that intuition head-on. “Reasoning’s Razor” (Chegini et al., 2025) presents the first systematic study of reasoning in precision-sensitive classification tasks, and the findings are striking: reasoning can actively harm recall exactly where it matters most — at the stringent operating thresholds required for real-world safety deployment.
The Safety Deployment Problem
Before diving into the results, it helps to understand why operating points matter so much.
Consider a content moderation system processing millions of messages per day. If you set a threshold that produces a 20% false positive rate, you’re flagging and blocking 200,000 legitimate messages for every million processed — an unacceptable user experience. In practice, production systems must operate at extremely low false positive rates (FPR < 5%, often < 1%).
Formally, given a binary classifier with score \(s(x)\) and threshold \(\tau\), we predict “harmful” if \(s(x) > \tau\). The key metrics at any operating point are:
\[\text{Recall}(\tau) = \frac{TP(\tau)}{TP(\tau) + FN(\tau)}, \quad \text{FPR}(\tau) = \frac{FP(\tau)}{FP(\tau) + TN(\tau)}\]

The catch is that AUROC — the most commonly reported metric — aggregates performance across all thresholds. It can look great while hiding catastrophic failures at the specific low-FPR region where you’ll actually deploy.
This is the regime the paper targets.
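To make the operating-point argument concrete, here is a minimal sketch of evaluating recall at a fixed low-FPR threshold alongside AUROC. The scores below are synthetic toy data, not results from the paper; the helper uses scikit-learn's `roc_curve`.

```python
# Sketch: evaluate recall at a fixed low-FPR operating point instead of
# relying on AUROC alone. Labels and scores are toy placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def recall_at_fpr(y_true, scores, max_fpr=0.01):
    """Recall (TPR) at the most permissive threshold with FPR <= max_fpr."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    ok = fpr <= max_fpr
    return tpr[ok].max() if ok.any() else 0.0

rng = np.random.default_rng(0)
# Toy scores: positives (harmful) score higher on average than negatives.
y_true = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500)])

print(f"AUROC:            {roc_auc_score(y_true, scores):.3f}")
print(f"Recall @ FPR<=1%: {recall_at_fpr(y_true, scores, 0.01):.3f}")
print(f"Recall @ FPR<=5%: {recall_at_fpr(y_true, scores, 0.05):.3f}")
```

Two classifiers with the same AUROC can differ sharply on the `recall_at_fpr` numbers, which is exactly the gap the paper measures.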
What Are Reasoning Models and Why Do We Use Them?
Standard language models (like Llama or Gemma) produce a direct answer by predicting the next tokens. Large Reasoning Models (LRMs) instead generate extended internal reasoning traces — chains of thought that walk through the problem step by step — before committing to a final answer.
On reasoning-heavy benchmarks (math, coding, logic), this deliberate “thinking” dramatically improves performance. It’s natural to expect the same benefit for nuanced tasks like:
- Safety detection: Does this message contain harmful content?
- Hallucination detection: Is this model-generated claim factually supported?
Both tasks require careful judgment of subtle semantic signals, which is exactly where reasoning models shine. Or so the assumption goes.
The Paradox: Better Accuracy, Worse Safety
The paper evaluates a diverse set of models across five safety datasets (ToxicChat, WildGuard, AEGIS, HarmBench, BeaverTails) and six hallucination detection datasets (HaluEval, DROP, PubMedQA, COVID-QA, FinanceBench, RAGTruth), covering both fine-tuned and zero-shot settings.
Models tested include reasoning models (QwQ-32B, DeepSeek-R1, K2-Think) and standard LLMs (Llama, Gemma, Qwen variants).
The headline finding: reasoning models improve overall AUROC but degrade recall by 10–30% at critical low-FPR operating points.
Visually, imagine the Precision-Recall curve. Reasoning models push up the average-accuracy part of the curve, but the tail at high precision — exactly the operating regime a deployed system uses — collapses.
Why Does This Happen? Calibration Shifts
The root cause is how reasoning tokens affect confidence calibration.
Standard LLMs learn to associate certain output token patterns with certain confidence levels. When you introduce extended reasoning chains, the model’s final answer token is conditioned on a much longer, more varied context. This shifts the distribution of the final output probabilities.
The problem is asymmetric: reasoning traces for genuine harmful content look different from reasoning traces for borderline-safe content, but in ways that don’t consistently push scores in the right direction at the decision boundary. As reasoning trace length grows, this miscalibration amplifies.
Two related findings support this:
- Self-reported confidence is unreliable. When reasoning models express confidence in their final classification, those confidence scores are poorly calibrated for precision-critical deployment.
- Token-based scoring substantially outperforms self-reported confidence. Rather than using the model’s stated confidence, scoring directly from the token probabilities of the final classification output gives better-calibrated signals for the precision-sensitive regime.
Formally, let \(p_\theta(y \mid x, r)\) be the model’s token probability for label \(y\) given input \(x\) and reasoning trace \(r\). The token-based score is:
\[s_{\text{token}}(x) = p_\theta(\text{"harmful"} \mid x, r)\]

whereas self-reported confidence uses the model’s verbalized score, which can be inconsistent with the underlying token distribution. The paper shows \(s_{\text{token}}\) is significantly better calibrated at low-FPR thresholds.
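A minimal sketch of token-based scoring, assuming access to the model's logits at the final answer position. The tiny vocabulary and the token ids for "harmful"/"safe" below are hypothetical stand-ins for a real model's tokenizer, not an actual API:

```python
# Sketch: read p("harmful") off the final-position logits, renormalized
# over the two label tokens, instead of asking the model to verbalize a
# confidence. Vocabulary and token ids are toy assumptions.
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def token_score(final_logits, harmful_id, safe_id):
    """s_token = p('harmful') / (p('harmful') + p('safe'))."""
    p = softmax(final_logits)
    return p[harmful_id] / (p[harmful_id] + p[safe_id])

# Toy 5-token vocabulary; ids 0='harmful', 1='safe' (assumed for the demo).
logits = np.array([2.0, 1.0, -1.0, 0.3, -0.5])
print(f"s_token = {token_score(logits, harmful_id=0, safe_id=1):.3f}")
```

Renormalizing over just the two label tokens makes the score a proper probability of the "harmful" decision, independent of mass placed on unrelated vocabulary items.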
The Ensemble Fix
The authors don’t leave practitioners without a path forward. They show that a simple ensemble of reasoning and non-reasoning models recovers the strengths of both:
\[s_{\text{ensemble}}(x) = \alpha \cdot s_{\text{reasoning}}(x) + (1 - \alpha) \cdot s_{\text{standard}}(x)\]

This combination:
- Keeps the accuracy improvements of reasoning models in the average-accuracy regime
- Restores recall at high-precision operating points by anchoring on the better-calibrated standard model scores
The intuition is that the two models make different types of errors. Reasoning models excel at nuanced judgment calls but miscalibrate at extremes. Standard models are more conservatively calibrated. The ensemble balances these complementary weaknesses.
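The ensemble is a few lines of code. The scores and the value of \(\alpha\) below are placeholders; in practice \(\alpha\) would be tuned on a validation set at the target operating point (e.g., maximizing recall at FPR ≤ 1%):

```python
# Sketch: convex combination of reasoning-model and standard-model scores.
# alpha = 0.3 is a placeholder, not a value from the paper.
import numpy as np

def ensemble_score(s_reasoning, s_standard, alpha=0.5):
    """s_ens = alpha * s_reasoning + (1 - alpha) * s_standard."""
    return alpha * np.asarray(s_reasoning) + (1 - alpha) * np.asarray(s_standard)

s_r = np.array([0.92, 0.40, 0.75])  # reasoning-model scores (toy)
s_s = np.array([0.80, 0.10, 0.95])  # standard-model scores (toy)
print(ensemble_score(s_r, s_s, alpha=0.3))
```

Because the combination is done at the score level, no retraining is needed — only a sweep over \(\alpha\) on held-out data.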
Key Takeaways for Practitioners
This paper contains several important lessons for anyone building safety systems with LLMs:
- Don’t optimize for AUROC alone. When you’ll deploy at a specific operating point (e.g., FPR < 2%), evaluate at that operating point during development. A model that looks better on AUROC can look dramatically worse where you actually need it.
- Reasoning is not universally better. For tasks requiring high recall under strict precision constraints, non-reasoning or standard LLMs may outperform LRMs despite lower overall AUROC.
- Use token probabilities, not verbal confidence. When extracting a confidence score from an LLM, use the log-probability of the classification token rather than asking the model to verbalize a confidence level.
- Consider post-hoc recalibration. Standard calibration techniques (temperature scaling, Platt scaling) may be needed before deploying reasoning models in precision-critical settings.
- Ensembles work. Combining reasoning and standard models is a simple and effective strategy to get the best of both worlds.
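The recalibration takeaway can be sketched with temperature scaling, the simplest post-hoc method: fit a single temperature \(T\) on held-out data by minimizing negative log-likelihood, then divide logits by \(T\) at inference. All data below is synthetic and illustrative, and the grid search stands in for a proper optimizer:

```python
# Sketch: post-hoc temperature scaling for a binary classifier's logits.
# Fit T on held-out validation data; all logits/labels here are toy.
import numpy as np

def nll(T, logits, labels):
    """Binary negative log-likelihood of sigmoid(logits / T)."""
    p = 1.0 / (1.0 + np.exp(-logits / T))
    eps = 1e-12
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

def fit_temperature(logits, labels, grid=np.linspace(0.1, 5.0, 50)):
    """Pick the temperature that minimizes held-out NLL (simple grid search)."""
    return min(grid, key=lambda T: nll(T, logits, labels))

# Toy overconfident logits: large magnitudes, labels only weakly separated.
rng = np.random.default_rng(1)
labels = (rng.random(200) < 0.5).astype(float)
logits = 5.0 * (labels - 0.5) + rng.normal(0, 4.0, 200)
T = fit_temperature(logits, labels)
print(f"fitted T = {T:.2f}  (T > 1 indicates the raw scores were overconfident)")
```

Temperature scaling preserves the ranking of scores (so AUROC is unchanged) while adjusting confidence levels, which is exactly what matters when picking a threshold for a fixed FPR budget.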
Broader Implications
The paper is titled “Reasoning’s Razor” — a deliberate echo of Occam’s Razor. Just as Occam’s Razor warns against unnecessary complexity, Reasoning’s Razor warns that adding reasoning capability to a task doesn’t always improve the outcome. The razor cuts both ways.
This is part of a broader pattern in deep learning: capability improvements that help on aggregate metrics often come with unexpected failure modes at distribution extremes. This connects to related work on OOD detection (Yang et al., 2024), where models that are highly accurate in-distribution can catastrophically fail on edge cases.
As LRMs become default components in production AI pipelines, understanding these failure modes is not just academic — it’s critical engineering. Safety, content moderation, and hallucination detection are precisely the tasks where deployment failure is most consequential.
The takeaway is not to avoid reasoning models, but to deploy them with eyes open: evaluate at your actual operating threshold, use token-based scoring, and consider ensemble approaches for high-stakes applications.
This post reviews (Chegini et al., 2025). For related reading on uncertainty and reliability in language models, see our hallucination survey summary and LoRA-based OOD detection.
References
- Chegini, A., Kazemi, H., Souza, G., Safi, M., Song, Y., Bengio, S., Williamson, S., & Farajtabar, M. (2025). Reasoning’s Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection. arXiv preprint arXiv:2510.21049. https://arxiv.org/abs/2510.21049
- Yang, J., Zhou, K., Li, Y., & Liu, Z. (2024). Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, 132(12), 5635–5662.