Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs

¹Institute of Software, Chinese Academy of Sciences; ²University of Chinese Academy of Sciences; ³Hong Kong University of Science and Technology

Abstract

Large language models (LLMs) excel at complex tasks thanks to advances in their reasoning abilities. However, existing methods overlook the trade-off between reasoning effectiveness and efficiency, often encouraging unnecessarily long reasoning chains and wasting tokens. To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework that enables LLMs to achieve optimal reasoning with fewer tokens. Specifically, L2T treats each query-response interaction as a hierarchical session of multiple episodes and introduces a universal dense process reward that quantifies the episode-wise information gain in model parameters, requiring no extra annotations or task-specific evaluators. We further derive a fast estimate of this reward based on PAC-Bayes bounds and the Fisher information matrix; theoretical analysis shows that it significantly reduces computational complexity while maintaining high estimation accuracy. By immediately rewarding each episode's contribution and penalizing excessive updates, L2T optimizes the model via reinforcement learning to make the most of every episode and achieve effective updates. Empirical results on various reasoning benchmarks and base models demonstrate the advantage of L2T across different tasks, boosting both reasoning effectiveness and efficiency.

Motivation

Large language models (LLMs) have advanced from basic NLP tasks to complex applications such as code generation, web interaction, and personal assistance, largely driven by improvements in reasoning ability. Recent studies show that scaling test-time computation (e.g., generating more tokens during inference) can enhance reasoning performance, with gains that scale roughly logarithmically in the compute budget. Building on this, a new class of reasoning models integrates test-time compute scaling with reinforcement learning (RL), achieving state-of-the-art results on challenging benchmarks. These models employ chain-of-thought (CoT) reasoning to maintain logical coherence and explore deeper solution paths, thereby improving accuracy. However, existing approaches still struggle to balance reasoning effectiveness and efficiency. Most rely solely on final outcome rewards, providing no feedback for intermediate reasoning steps. This delayed reward structure encourages unnecessary chain extensions: models tend to take "one more step," causing redundant computation and reduced efficiency. Empirical evidence shows that such methods often use roughly twice as many tokens as needed to reach correct answers. Moreover, while moderate CoT extension helps on complex problems, excessive reasoning can harm accuracy on simpler ones. Since task difficulty varies widely, no fixed reasoning length is universally optimal. It is therefore essential to design dense process rewards that evaluate each reasoning step's contribution, enabling models to generate the most informative tokens efficiently while maintaining reasoning quality.

L2T

Observation

(i) Existing methods may fail to use test-time compute budgets efficiently, leading to wasted resources: both models use, on average, more than twice the minimum number of tokens required. For example, k=16 achieves accuracy comparable to or exceeding sequential generation at k=24 while using fewer tokens. (ii) Additional episodes add no new information and instead degrade performance due to context redundancy: for both models, Acc(k) peaks around k=16-20 and then declines as k increases. (iii) Questions of different difficulty tiers prefer different chain lengths: Tier 4 questions tend to benefit from longer chains, whereas Tier 1 questions can be answered correctly with short chains, and excessive reasoning depth may cause a marked accuracy drop (e.g., accuracy falls by over 5% at k=20). These findings underscore the limitations of existing methods, which ignore the balance between reasoning effectiveness and efficiency.

Overview of L2T

To this end, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for large language models (LLMs). At its core, L2T introduces a universal information-theoretic dense process reward that quantifies the information gain in model parameters. This reward comprises two components: (i) a fitting information gain term that guides the model to capture correctness-critical information during each update, and (ii) a compression penalty that prevents over-optimization, thereby preserving computational efficiency. By treating each question-answer pair as a multi-episode session and assigning immediate rewards to each episode, L2T encourages the model to focus on the progress of reasoning rather than the final outcome. This design effectively suppresses redundant reasoning steps and mitigates unnecessary computational overhead. The reward is agnostic to input formats, label types, and task domains, requiring no additional annotation. Through reinforcement learning, L2T optimizes the LLM (policy) to generate tokens that most contribute to answer correctness at every reasoning step.
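As a rough illustration only (the symbols r_t, \Delta I_{\mathrm{fit}}, \Delta C, \lambda, and \theta_t below are placeholders introduced here, not the paper's notation), the episode-wise reward can be read as a fitting information gain minus a weighted compression penalty:

r_t \;=\; \underbrace{\Delta I_{\mathrm{fit}}\!\left(\theta_{t-1} \rightarrow \theta_t\right)}_{\text{fitting information gain}} \;-\; \lambda \, \underbrace{\Delta C\!\left(\theta_{t-1} \rightarrow \theta_t\right)}_{\text{compression penalty}}

Here \theta_{t-1} and \theta_t denote the model parameters before and after episode t, and \lambda trades off capturing correctness-critical information against over-optimization.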

Specifically, L2T operates in three stages: (i) Problem reformulation: each question-answer interaction is reformulated as a hierarchical session composed of multiple episodes, where each episode corresponds to a reasoning segment supporting dense reward computation and optimization; (ii) Reward design: after each episode, the information-theoretic reward is computed using PAC-Bayes bounds and the Fisher Information Matrix, enabling early termination of unproductive reasoning and balancing effectiveness with efficiency; (iii) LLM fine-tuning: the LLM is optimized to maximize cumulative reward across tasks via reinforcement learning, ensuring both reasoning accuracy and computational efficiency.
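For intuition about stage (ii), the sketch below shows one way such an episode-wise reward could be estimated with a diagonal Fisher approximation. It is a minimal PyTorch illustration, not the authors' implementation: a toy linear model stands in for the LLM policy, and the names fisher_diag, info_gain, and lam are introduced here for illustration only.

# Minimal sketch of a dense, episode-wise reward from a diagonal Fisher
# approximation; a toy model replaces the LLM and a supervised step
# replaces the RL update.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "policy": a tiny scorer instead of a full LLM.
policy = nn.Linear(16, 4)
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def fisher_diag(model, inputs):
    """Diagonal empirical Fisher proxy: squared gradients of the log-likelihood
    (using the model's own argmax predictions as a cheap stand-in for sampling)."""
    model.zero_grad()
    log_probs = torch.log_softmax(model(inputs), dim=-1)
    labels = log_probs.argmax(dim=-1)
    nn.functional.nll_loss(log_probs, labels).backward()
    return [p.grad.detach() ** 2 for p in model.parameters()]

def info_gain(params_before, params_after, fisher):
    """Quadratic-form proxy for information gained by one episode's update:
    (delta_theta)^T diag(F) (delta_theta)."""
    gain = 0.0
    for pb, pa, f in zip(params_before, params_after, fisher):
        delta = pa - pb
        gain += (f * delta ** 2).sum().item()
    return gain

lam = 0.1  # trade-off weight for the compression penalty (illustrative value)

for episode in range(3):
    x = torch.randn(8, 16)  # inputs for one "episode"
    params_before = [p.detach().clone() for p in policy.parameters()]
    fisher = fisher_diag(policy, x)

    # Ordinary fitting update for this episode (stand-in for the RL step).
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(policy(x), torch.randint(0, 4, (8,)))
    loss.backward()
    optimizer.step()

    params_after = [p.detach() for p in policy.parameters()]
    gain = info_gain(params_before, params_after, fisher)
    penalty = sum(((pa - pb) ** 2).sum().item()
                  for pb, pa in zip(params_before, params_after))
    reward = gain - lam * penalty  # dense, episode-wise reward
    print(f"episode {episode}: reward = {reward:.6f}")

The quadratic form over the Fisher diagonal is a standard cheap proxy for parameter-space information change; the actual L2T reward is derived from PAC-Bayes bounds and may differ in form.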

[Figure: overview of the L2T framework]

Performance

[Figure: performance of L2T across reasoning benchmarks and base models]

Poster

BibTeX


      @article{wang2025learning,
        title={Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for {LLMs}},
        author={Wang, Jingyao and Qiang, Wenwen and Song, Zeen and Zheng, Changwen and Xiong, Hui},
        journal={arXiv preprint arXiv:2505.10425},
        year={2025}
      }