Inference-time factuality improvement in LLMs: from layer contrasting to deep-thinking tokens
Paper Review · LLMs hallucinate (Huang et al., 2025). They generate text that sounds right but isn’t. Most fixes require retraining or plugging in external knowledge. But a growing set of papers asks a different question: can we make models more truthful just by changing how they decode?
This post covers five methods that do this: DoLa, SLED, DELTA, Lookback Lens, and Think@n. None of them touch the model weights. None need extra data.
What’s wrong with standard decoding?
When an LLM generates text, it takes the hidden state from its last layer, projects it to vocabulary space, applies softmax, and picks a token.
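For concreteness, here’s roughly what that step looks like in code. This is a minimal sketch using Hugging Face transformers, with GPT-2 as a small stand-in model (any causal LM that exposes its hidden states and vocabulary head works the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is just a small stand-in; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Standard decoding: only the final layer's hidden state is projected to vocab space.
last_hidden = out.hidden_states[-1][:, -1, :]   # final layer, last position
logits = model.lm_head(last_hidden)             # vocabulary projection
probs = torch.softmax(logits, dim=-1)
next_token = probs.argmax(dim=-1)               # greedy pick
print(tokenizer.decode(next_token))
```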
This is simple but has known problems (Stahlberg & Byrne, 2019):
- Sampling noise can push low-probability wrong tokens through.
- The model sometimes over-attends to recent tokens and loses track of the original input.
- The softmax bottleneck limits how expressive the output distribution can be, so a correct token can end up ranked below a plausible-sounding wrong one.
Training-time fixes have their own problems: fine-tuning on new knowledge can itself encourage hallucination (Gekhman et al., 2024), and RLHF can push models toward sycophancy. RAG helps but adds a lot of complexity.
The methods below all start from one observation: different transformer layers encode different levels of factual knowledge. By looking at how predictions evolve across layers, we can get more truthful outputs.
1. DoLa: decoding by contrasting layers
(Chuang et al., 2024) (ICLR 2024)
Factual knowledge in LLMs tends to be localized in specific transformer layers. DoLa exploits this by subtracting an early layer’s predictions from the final layer’s predictions.
The idea
In a transformer with \(L\) layers, each layer \(l\) can produce a next-token distribution \(p_l\) by projecting its hidden state through the vocabulary head. Standard decoding uses only the last layer \(p_L\).
DoLa computes:
\[p_{\text{DoLa}} = \text{softmax}\left(\log p_L - \log p_l\right)\]
where \(l\) is a dynamically selected early layer.
Why does subtracting help? Early layers capture syntax and surface patterns. Later layers refine those into factual associations. The subtraction removes the surface-level noise and keeps the factual signal.
Picking the right early layer
For each token, DoLa picks the layer with the highest Jensen-Shannon Divergence from the final layer. This targets layers where the most factual transformation is happening, rather than using a fixed layer that might not be informative for every token.
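Here’s a simplified sketch of that contrast. It follows the formula above but skips details of the real implementation (notably DoLa’s adaptive plausibility constraint, which restricts the contrast to tokens the final layer already considers plausible, and its bucketing of candidate layers); the function name and arguments are my own:

```python
import math
import torch
import torch.nn.functional as F

def dola_style_logits(hidden_states, lm_head, candidate_layers):
    """Contrast the final layer against a dynamically chosen early ("premature") layer.

    hidden_states: tuple of [batch, seq, dim] tensors, one per layer, as returned by
    a Hugging Face model called with output_hidden_states=True.
    candidate_layers: indices of the early layers to consider.
    Note: real implementations typically apply the model's final norm before
    projecting intermediate layers; omitted here for brevity.
    """
    final = F.log_softmax(lm_head(hidden_states[-1][:, -1, :]), dim=-1)

    # Pick the candidate layer with the highest Jensen-Shannon divergence from the final layer.
    best_jsd, best_early = -1.0, None
    for l in candidate_layers:
        early = F.log_softmax(lm_head(hidden_states[l][:, -1, :]), dim=-1)
        m = torch.logsumexp(torch.stack([final, early]), dim=0) - math.log(2)
        jsd = 0.5 * (F.kl_div(m, final, log_target=True, reduction="batchmean")
                     + F.kl_div(m, early, log_target=True, reduction="batchmean"))
        if jsd.item() > best_jsd:
            best_jsd, best_early = jsd.item(), early

    # The DoLa contrast: subtract the premature layer's log-probs from the final layer's.
    return final - best_early
```

Applying softmax to the returned scores gives the contrasted distribution from the formula above.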
Results
- +12-17 points on TruthfulQA for LLaMA models
- Gains on StrategyQA and GSM8K
- Minimal added latency
2. SLED: self logits evolution decoding
(Zhang et al., 2024) (NeurIPS 2024)
DoLa contrasts two layers. SLED uses all of them.
The idea
SLED projects the hidden state from every layer through the vocabulary head, producing a trajectory of distributions:
\[\{p_1, p_2, \ldots, p_L\}\]
This trajectory shows how the model’s predictions evolve from shallow guesses to factual assessments.
Instead of picking two layers to contrast, SLED computes a weighted combination of all layer distributions. It frames this as an approximate gradient step that pushes the output distribution toward factual accuracy.
You can think of it as the model asking itself: “based on how my predictions evolved across all my layers, what should I actually output?”
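The exact update rule is more involved than I want to reproduce here, but the raw ingredient, the layer-by-layer trajectory of distributions, is easy to extract. A minimal sketch (function name and shapes are my own; batch size 1 assumed):

```python
import torch
import torch.nn.functional as F

def layer_trajectory(hidden_states, lm_head):
    """Project every layer's last-position hidden state through the vocabulary head.

    hidden_states: tuple of [1, seq, dim] tensors, one per layer
    (from a forward pass with output_hidden_states=True).
    Returns a [num_layers, vocab] tensor of log-probabilities: the model's
    next-token prediction as it evolves from the first layer to the last.
    """
    return torch.stack([
        F.log_softmax(lm_head(h[0, -1, :]), dim=-1)
        for h in hidden_states
    ])
```

SLED then combines these distributions, weighting layers so the final output moves in the direction the predictions were already evolving; the paper frames that combination as an approximate gradient step, which I won’t try to condense into a few lines here.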
Why it works better than DoLa
DoLa picks two snapshots. SLED uses the full trajectory. The gradient formulation is also more principled since it doesn’t require the somewhat arbitrary selection of a premature layer.
Results
- Outperforms DoLa across Gemma, Qwen, Mixtral (1B to 45B parameters)
- State of the art on TruthfulQA, FACTOR, StrategyQA
- Negligible latency overhead
3. DELTA: contrastive decoding via input masking
(Huang & Chen, 2025) (arXiv 2025)
DELTA takes a different angle. Instead of contrasting across layers, it contrasts across inputs: run the model on the full prompt, then run it again on a masked version of the prompt, and subtract.
The idea
- Run the model on the full input. Get output distribution \(p(x_t \mid c)\).
- Mask random tokens in the input. Get output distribution \(p(x_t \mid \tilde{c})\).
- The masked version lacks grounding information, so it hallucinates more. Subtract its logits from the original.
This is related to context-aware decoding (Shi et al., 2024), but DELTA uses random partial masking rather than removing the context entirely. That makes it better at targeting hallucinations that come from contextual ambiguity specifically.
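A rough sketch of the idea is below. The mask token choice, masking rate, and contrast weight are placeholder choices of mine, not the paper’s settings, and I’ve used a context-aware-decoding-style combination for the final step:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def delta_style_logits(prompt, mask_rate=0.2, alpha=1.0):
    """Contrast a full-context pass against a randomly masked-context pass."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Pass 1: the full input.
        full_logits = model(ids).logits[:, -1, :]

        # Pass 2: the same input with a random subset of tokens replaced.
        mask_id = tokenizer.unk_token_id
        if mask_id is None:
            mask_id = tokenizer.eos_token_id
        masked_ids = ids.clone()
        masked_ids[torch.rand(ids.shape) < mask_rate] = mask_id
        masked_logits = model(masked_ids).logits[:, -1, :]

    # The masked pass has less grounding, so its predictions lean on priors;
    # subtracting them sharpens the contribution of the actual context.
    return (1 + alpha) * full_logits - alpha * masked_logits
```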
Results
- +3 points on SQuAD v1.1, +6 points on SQuAD v2
- +10 points on SQuAD v2 no-answer exact match
- Gains on TriviaQA and Natural Questions
4. Lookback Lens: are you even looking at the context?
(Chuang et al., 2024) (EMNLP 2024)
The first three methods modify the output distribution. Lookback Lens does something different: it detects hallucinations by watching where the model’s attention goes.
The hypothesis is simple. When a model hallucinates, it stops paying attention to the input context and starts attending to its own generated tokens.
The lookback ratio
For each attention head \(h\), compute:
\[r_h = \frac{\sum_{i \in \text{context}} \alpha_{h,i}}{\sum_{i \in \text{context}} \alpha_{h,i} + \sum_{j \in \text{generated}} \alpha_{h,j}}\]
High ratio means the model is grounding its output in the input. Low ratio means it’s talking to itself, which usually means it’s making things up.
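In code, the ratio is just a sum over attention weights. A minimal sketch for a single decoding step (the function name and tensor-layout assumptions are mine; it expects the attention tensors a Hugging Face model returns with output_attentions=True):

```python
import torch

def lookback_ratios(attentions, context_len):
    """Lookback ratio per (layer, head) for the most recently generated position.

    attentions: tuple of [batch, heads, tgt_len, src_len] tensors, one per layer.
    context_len: number of input/context tokens at the start of the sequence.
    """
    ratios = []
    for layer_attn in attentions:
        last_step = layer_attn[:, :, -1, :]                   # attention from the newest token
        to_context = last_step[..., :context_len].sum(-1)     # mass on the input context
        to_generated = last_step[..., context_len:].sum(-1)   # mass on generated-so-far tokens
        ratios.append(to_context / (to_context + to_generated + 1e-9))
    return torch.stack(ratios)   # [num_layers, batch, num_heads]
```

These per-head ratios, averaged over spans of generated tokens, are essentially the features the classifier described next works from.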
What they found
A simple linear classifier on these lookback ratios detects hallucinations about as well as methods that use full hidden states or entailment models.
A detector trained on a 7B model also works on a 13B model without retraining, which suggests the attention patterns that signal hallucination are consistent across model sizes.
For mitigation, they use classifier-guided decoding: penalize tokens that the classifier flags as likely hallucinations. This cuts hallucination by 9.6% on XSum summarization.
How it relates to DoLa and SLED
Lookback Lens comes from the same first author as DoLa (Yung-Sung Chuang), and the two papers complement each other. DoLa and SLED fix the output distribution to be more factual. Lookback Lens detects when hallucination is happening. You could combine them: use Lookback Lens to spot risky tokens, then apply stronger layer-contrasting corrections at those positions.
5. Think@n: it’s not about length, it’s about depth
(Chen et al., 2026) (Google, 2026)
This one came out of work on reasoning models like DeepSeek-R1. The question: do longer reasoning chains actually produce better answers?
The answer is no. What matters is how deeply the model processes each token.
Deep-thinking tokens
A deep-thinking token is one where the model’s internal prediction changes a lot across layers before settling on an answer. The Deep-Thinking Ratio (DTR) measures what fraction of tokens in a sequence are like this:
\[\text{DTR} = \frac{|\{t : \text{JSD}(p_l(x_t), p_L(x_t)) > \tau \text{ for some } l\}|}{T}\]
This is the same kind of layer-wise divergence that DoLa and SLED exploit. Tokens where logits diverge across layers are the ones where the model is doing real work.
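A sketch of that computation, reusing the same kind of layer-wise projections as above (the threshold and the exact shapes are my assumptions, not the paper’s settings):

```python
import math
import torch
import torch.nn.functional as F

def jsd(log_p, log_q):
    """Jensen-Shannon divergence between two log-probability vectors."""
    log_m = torch.logsumexp(torch.stack([log_p, log_q]), dim=0) - math.log(2)
    kl_pm = (log_p.exp() * (log_p - log_m)).sum()
    kl_qm = (log_q.exp() * (log_q - log_m)).sum()
    return 0.5 * (kl_pm + kl_qm)

def deep_thinking_ratio(per_token_hidden_states, lm_head, tau=0.1):
    """Fraction of tokens whose prediction shifts strongly somewhere across layers.

    per_token_hidden_states: one entry per generated token; each entry is the
    tuple of per-layer hidden states ([dim] vectors) at that token's position.
    tau: divergence threshold (an assumed value, not the paper's).
    """
    deep = 0
    for layers in per_token_hidden_states:
        final = F.log_softmax(lm_head(layers[-1]), dim=-1)
        max_div = max(jsd(F.log_softmax(lm_head(h), dim=-1), final) for h in layers[:-1])
        if max_div > tau:
            deep += 1
    return deep / len(per_token_hidden_states)
```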
The numbers
Raw token count correlates negatively with accuracy (\(r = -0.544\)). More tokens, more likely wrong.
DTR correlates positively with accuracy (\(r = 0.683\)). More deep thinking, more likely correct.
This validates the premise behind DoLa and SLED from a completely different angle, which I think is the most interesting part of this paper.
The Think@n strategy
Given a budget of \(n\) samples:
- Generate short prefixes (just 50 tokens).
- Estimate DTR from these prefixes.
- Only continue generating the highest-DTR candidates.
This matches or beats standard self-consistency while cutting inference costs by about 50%.
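As a schematic, the budget-allocation step looks like this. Here `estimate_dtr` and `continue_generation` are hypothetical callables standing in for the DTR computation above and the model’s generation loop, and the keep fraction is illustrative:

```python
def think_at_n(prefixes, estimate_dtr, continue_generation, keep_fraction=0.5):
    """Rank short candidate prefixes by deep-thinking ratio; only finish the best ones.

    prefixes: n short candidate generations (e.g. the first ~50 tokens of each sample).
    estimate_dtr: hypothetical callable returning a DTR score for a prefix.
    continue_generation: hypothetical callable that completes a generation from a prefix.
    """
    ranked = sorted(prefixes, key=estimate_dtr, reverse=True)
    survivors = ranked[: max(1, int(len(ranked) * keep_fraction))]
    return [continue_generation(p) for p in survivors]
```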
Both Think@n and Lookback Lens give you cheap signals about generation quality: one watches attention patterns, the other watches layer-wise prediction changes. Neither modifies the model.
Comparison
| Method | Signal | Mechanism |
|---|---|---|
| DoLa | Layer logits (2 layers) | Subtract early from final |
| SLED | Layer logits (all layers) | Gradient refinement |
| DELTA | Masked vs. full input | Subtract masked output |
| Lookback Lens | Attention weights | Detect + penalize |
| Think@n | Layer-wise divergence | Filter best samples |
The first three modify the output distribution per-token. Lookback Lens detects hallucinations per-token. Think@n operates per-sample.
Reading these papers in sequence, you can see the ideas build on each other. DoLa showed that layer contrasting works. SLED showed that using all layers beats picking two. DELTA showed the same contrasting trick works across inputs, not just layers. Lookback Lens showed that attention patterns alone carry enough signal to catch hallucinations. And Think@n closed the loop by showing that these layer-wise signals predict reasoning quality and can cut inference costs in half.
What I keep coming back to is how all five methods exploit the same thing: the evolution of representations across layers encodes factual confidence. If you’re working on uncertainty estimation or robustness, these are worth paying attention to. They give you interpretable confidence signals with zero additional training, and they could slot into existing uncertainty quantification setups without much friction.
References
- Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & others. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1–55.
- Stahlberg, F., & Byrne, B. (2019). On NMT search errors and model errors: Cat got your tongue? arXiv preprint arXiv:1908.10090.
- Gekhman, Z., Yona, G., Aharoni, R., Eyal, M., Feder, A., Reichart, R., & Herzig, J. (2024). Does fine-tuning LLMs on new knowledge encourage hallucinations? arXiv preprint arXiv:2405.05904.
- Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., & He, P. (2024). DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2309.03883
- Zhang, J., Juan, D.-C., Rashtchian, C., Ferng, C.-S., Jiang, H., & Chen, Y. (2024). SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models. Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2411.02433
- Huang, C. P., & Chen, H.-Y. (2025). Delta – Contrastive Decoding Mitigates Text Hallucinations in Large Language Models. arXiv preprint arXiv:2502.05825. https://arxiv.org/abs/2502.05825
- Shi, W., Han, X., Lewis, M., Tsvetkov, Y., Zettlemoyer, L., & Yih, W.-tau. (2024). Trusting your evidence: Hallucinate less with context-aware decoding. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), 783–791.
- Chuang, Y.-S., Qiu, L., Hsieh, C.-Y., Krishna, R., Kim, Y., & Glass, J. (2024). Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1419–1436. https://arxiv.org/abs/2407.07071
- Chen, W.-L., & others. (2026). Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens. arXiv preprint arXiv:2602.13517. https://arxiv.org/abs/2602.13517