RL with Learnable Textual Feedback:
A Bilevel Approach

Utsav Singh*¹, Sidhaarth Sredharan*², Souradip Chakraborty³, Amrit Singh Bedi¹
¹University of Central Florida, ²Carnegie Mellon University, ³University of Maryland. *Equal contribution

Abstract

Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning remains sample-inefficient when terminal rewards are sparse. This has motivated a growing line of work on RL with textual feedback, where a critic model generates natural language feedback to guide a reasoning model (the actor), augmenting scalar rewards with richer learning signals. However, existing methods typically treat feedback as fixed or auxiliary, which misses a key property: feedback should not merely be correct, but should improve the policy (actor model) when provided in context. This motivates a paradigm of learnable textual feedback for RL. Yet the learnability and usefulness of feedback depend on the policy's ability to learn from it, making RL with learnable feedback an inherently bilevel problem. We formalize this coupling as a Stackelberg bilevel program and derive Bilevel Natural Language Actor-Critic (Bi-NAC), which jointly trains a critic to generate reward-improving feedback and an actor to exploit it. Across MATH-500, MBPP, and GPQA, Bi-NAC improves sample and parameter efficiency over RL and fixed-critic baselines: our 2B model outperforms the 3B GRPO baseline, achieving 46.6% versus 41.4% on MATH-500, while our 6B model surpasses the 7B GRPO baseline, achieving 49.3% versus 43.6% on GPQA.

💡 Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning is sample-inefficient under sparse terminal rewards.

💡 Prior work mitigates this by adding natural language critiques, yet it typically treats critique generation as fixed or auxiliary, causing actor-critic misalignment and leading to sub-optimal performance.

Bi-NAC: Bilevel Natural Language Actor-Critic

We formalize reasoning with textual feedback as a Stackelberg bilevel formulation and derive Bi-NAC, where the critic is optimized to generate feedback that maximizes the actor’s downstream performance, and the actor is optimized to generate refined outputs based on the feedback.

🎯 Actor-Critic Interaction

  1. Actor $\left(\pi^L_\theta(\cdot \mid x) \right)$ produces initial attempt $y_0$ given prompt $x$
  2. Critic $\left( \pi^H_\phi(\cdot \mid x, y_0) \right)$ generates textual feedback $z$ based on $(x, y_0)$
  3. Actor $\left( \pi^L_{\theta}(\cdot \mid x, y_0, z) \right)$ produces refined response $y_1$ conditioned on $(x, y_0, z)$
  4. Verifier $\left( R(x,y_1) \right)$ scores $y_1$ with binary reward $r \in \{0,1\}$
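The four steps above can be sketched as a single rollout function. This is only an illustrative sketch: `rollout`, `toy_actor`, `toy_critic`, and `toy_verifier` are hypothetical names that do not come from the paper, and the toy callables stand in for what would be LLM calls and a task-specific checker in practice.

```python
from typing import Callable, Optional, Tuple

def rollout(
    x: str,
    actor: Callable[[str, Optional[str], Optional[str]], str],
    critic: Callable[[str, str], str],
    verifier: Callable[[str, str], int],
) -> Tuple[str, str, str, int]:
    """One two-turn episode: attempt, feedback, refinement, binary reward."""
    y0 = actor(x, None, None)   # step 1: initial attempt from the prompt alone
    z = critic(x, y0)           # step 2: textual feedback on (x, y0)
    y1 = actor(x, y0, z)        # step 3: refinement conditioned on (x, y0, z)
    r = verifier(x, y1)         # step 4: binary reward on the refined response
    return y0, z, y1, r

# Toy stand-ins (assumptions for illustration only, not real models).
def toy_actor(x: str, y0: Optional[str], z: Optional[str]) -> str:
    # Deliberately wrong on the first turn, corrected once feedback is given.
    return "6" if z is None else "5"

def toy_critic(x: str, y0: str) -> str:
    return "Recheck the addition."

def toy_verifier(x: str, y1: str) -> int:
    return int(y1 == "5")

y0, z, y1, r = rollout("What is 2 + 3?", toy_actor, toy_critic, toy_verifier)
print(y0, "|", z, "|", y1, "|", r)
```

Note that the same actor is called in steps 1 and 3, so any update to its parameters changes both the initial attempt and the refinement.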

🎯 Bilevel Optimization

The actor and critic optimize the following bilevel objective:

$\max_{\phi, \theta} \mathcal{L}(\phi, \theta, \lambda) \!=\! \max_{\phi, \theta} U\!\bigl(\phi, \theta\bigr) \!+\! \lambda \big(L(\phi, \theta) \!-\! L(\phi, \theta^*(\phi)) \big), \quad$ where
Upper Level Objective: $ \max_{\phi} U(\phi, \theta) = \max_{\phi} \mathbb{E}_{\substack{x \sim \mathcal{D}, \, y_0 \sim \pi^L_\theta(\cdot \mid x) , z \sim \pi^H_\phi(\cdot \mid x, y_0) , y_1 \sim \pi^L_{\theta}(\cdot \mid x, y_0, z)}} \left[ R(x,y_1) \right], \quad$ and
Lower Level Objective: $ \max_{\theta} L(\phi, \theta) = \max_{\theta} \mathbb{E}_{\substack{x \sim \mathcal{D}, \, y_0 \sim \pi^L_\theta(\cdot \mid x) , z \sim \pi^H_\phi(\cdot \mid x, y_0) , y_1 \sim \pi^L_{\theta}(\cdot \mid x, y_0, z)}} \left[ R(x, y_1) \right]$.
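One way to read the penalty term is sketched below. It assumes that $\theta^*(\phi)$ denotes the actor's best response to a fixed critic $\phi$, which the notation suggests but which is not spelled out here, and that $\lambda > 0$.

```latex
% Assumption: \theta^*(\phi) is the actor's best response to a fixed critic \phi.
\theta^*(\phi) \in \arg\max_{\theta} L(\phi, \theta)
\;\Longrightarrow\;
L(\phi, \theta) - L(\phi, \theta^*(\phi)) \le 0 \quad \text{for every } \theta .

% Hence, for \lambda > 0, the penalized objective never exceeds the upper-level one,
% and the two coincide when \theta is lower-level optimal:
\mathcal{L}(\phi, \theta, \lambda) \le U(\phi, \theta),
\qquad
\mathcal{L}(\phi, \theta^*(\phi), \lambda) = U(\phi, \theta^*(\phi)).
```

Under this reading, the penalty is zero only when the actor is already a best response to the current critic, and a larger $\lambda$ penalizes deviations from that best response more heavily.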

Bi-NAC Architecture

Top Left: In prior approaches, the actor and critic are not aligned (e.g., auxiliary critics or fixed-feedback critics). This misalignment often leads to ineffective guidance: the feedback is either incorrect or ignored by the actor, resulting in incorrect final outputs. Bottom Left: An illustrative example in which both the feedback and the final output are incorrect. Top Right: Bi-NAC explicitly models the dependency between feedback generation and policy improvement by training the critic to maximize the actor's final task performance, while the actor is trained to leverage that feedback effectively. Bottom Right: An illustrative example in which correct, aligned feedback guides the actor from a flawed initial response to the correct final solution.

Bi-NAC Teaser

Key Insights

❌ The Misalignment Problem

Fixed or auxiliary critics can generate ineffective guidance due to misalignment between actor and critic, leading to sub-optimal performance.

🎲 Advantage Collapse

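Prior approaches like GRPO suffer under sparse terminal rewards. On MBPP, 74% of GRPO training groups receive zero reward, causing advantage collapse: when every response in a group gets the same reward, the group-relative advantages are all zero and the group provides no policy-gradient signal.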
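The collapse can be seen in a minimal sketch of group-normalized advantages. The helper `group_advantages` and the `eps` constant are illustrative assumptions; actual GRPO implementations differ in details such as the standard-deviation estimator.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group where every sampled response fails the verifier.
all_fail = group_advantages([0, 0, 0, 0])

# A group with one success out of four samples.
mixed = group_advantages([0, 1, 0, 0])

print(all_fail)  # every entry is zero, so this group carries no learning signal
print(mixed)     # the successful sample gets a positive advantage
```

A group in which all samples succeed collapses in the same way, since only the spread of rewards within the group matters.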

💡 Dense Feedback Signal

Textual feedback provides denser supervision beyond binary terminal rewards; however, it may lead to sub-optimal performance due to actor-critic misalignment.

✅ Bi-NAC's Solution

Bi-NAC leverages a bilevel formulation to enable effective actor-critic alignment, significantly improving LLM reasoning performance.

Results

1. Parameter Efficiency: Bi-NAC vs GRPO

| Method | Size | MATH-500 | MBPP | GPQA |
|---|---|---|---|---|
| GRPO | 3B | 41.4 | 61.6 | 36.4 |
| Bi-NAC | 2B | 46.6 (+5.2) | 66.7 (+5.1) | 41.2 (+4.8) |
| GRPO | 7B | 48.4 | 72.2 | 43.6 |
| Bi-NAC | 6B | 51.4 (+3.0) | 75.0 (+2.8) | 49.3 (+5.7) |

Takeaway 1: Bi-NAC exhibits faster convergence and better parameter and sample efficiency than GRPO baselines.

2. Comparison between fixed/auxiliary and aligned feedback (Bi-NAC)


Feedback comparison: (Left) Bi-NAC significantly outperforms methods that rely on fixed or auxiliary critics on MATH-500, demonstrating the efficacy of our bilevel framework.

Takeaway 2: Adding fixed feedback to GRPO gives only marginal gains. Bi-NAC's bilevel training yields strong gains.

3. Comparison with State-of-the-Art Methods (1B/3B/8B scales)

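| Method | 1B MATH | 1B MBPP | 1B GPQA | 3B MATH | 3B MBPP | 3B GPQA | 8B MATH | 8B MBPP | 8B GPQA |
|---|---|---|---|---|---|---|---|---|---|
| BC | 26.2 | 50.0 | 30.2 | 42.4 | 68.5 | 37.1 | 53.0 | 74.5 | 39.0 |
| Hier-NFT | 25.8 | 41.3 | 26.8 | 41.3 | 61.0 | 32.7 | 50.9 | 72.8 | 34.2 |
| ArCHer | 30.6 | 52.4 | 33.6 | 43.6 | 69.5 | 40.9 | 55.5 | 76.0 | 43.0 |
| SCoRe | 39.8 | 62.3 | 34.4 | 45.2 | 70.7 | 42.6 | 56.8 | 77.3 | 44.6 |
| Bi-NAC | 46.6 | 66.4 | 40.6 | 51.4 | 75.0 | 49.3 | 60.2 | 79.8 | 56.3 |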

Takeaway 3: Bi-NAC's gains persist across model scales, and it consistently outperforms all baselines, with especially large improvements on challenging benchmarks like GPQA (+11.7 over the strongest baseline, SCoRe, at the 8B scale).
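The margins behind this takeaway can be recomputed from the table. The short script below is a convenience check only: the scores are copied from the table above, and the variable names are illustrative.

```python
# Column order matches the table: scale first, then benchmark.
columns = [
    "1B MATH", "1B MBPP", "1B GPQA",
    "3B MATH", "3B MBPP", "3B GPQA",
    "8B MATH", "8B MBPP", "8B GPQA",
]
baselines = {
    "BC":       [26.2, 50.0, 30.2, 42.4, 68.5, 37.1, 53.0, 74.5, 39.0],
    "Hier-NFT": [25.8, 41.3, 26.8, 41.3, 61.0, 32.7, 50.9, 72.8, 34.2],
    "ArCHer":   [30.6, 52.4, 33.6, 43.6, 69.5, 40.9, 55.5, 76.0, 43.0],
    "SCoRe":    [39.8, 62.3, 34.4, 45.2, 70.7, 42.6, 56.8, 77.3, 44.6],
}
bi_nac = [46.6, 66.4, 40.6, 51.4, 75.0, 49.3, 60.2, 79.8, 56.3]

# For each column, find the strongest baseline and Bi-NAC's margin over it.
margins = {}
for i, col in enumerate(columns):
    best_name, best_scores = max(baselines.items(), key=lambda kv: kv[1][i])
    margins[col] = (best_name, round(bi_nac[i] - best_scores[i], 1))

for col, (name, gap) in margins.items():
    print(f"{col}: +{gap} over {name}")
```

On these numbers, SCoRe is the strongest baseline in every column, and the largest margin is on GPQA at the 8B scale.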

4. Feedback Optimality & Compatibility Analysis (MATH-500)

Does Bi-NAC generate better feedback, and does the actor effectively use it?

| Method | Trained Critic | Trained Actor | Accuracy ↑ | FO ↑ | Δacc (t1,t2) ↑ | Δi→c (t1,t2) ↑ |
|---|---|---|---|---|---|---|
| Hier-NFT | ❌ | ❌ | 35.4 | 2.1 | 9.6 | 12.2 |
| SCoRe | ❌ | ✅ | 39.6 | 2.1 | 7.4 | 17.8 |
| Hier-FT | ✅ | ❌ | 38.0 | 3.0 | 8.8 | 16.0 |
| Bi-NAC w/o BL | ✅ | ✅ | 41.8 | 3.6 | 9.6 | 19.2 |
| Bi-NAC | ✅ | ✅ | 46.6 | 4.2 | 11.4 | 21.2 |

FO: Feedback Optimality (1-5 scale, LLM judge) | Δacc (t1,t2): accuracy increase from turn 1 to turn 2 | Δi→c (t1,t2): fraction flipped from incorrect to correct
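As a rough illustration of the two turn-level metrics, the sketch below computes them from per-problem correctness flags. It assumes that both metrics are reported as percentages of all evaluated problems; the exact normalization is not stated here, and `turn_metrics` is a hypothetical helper.

```python
from typing import Dict, Sequence

def turn_metrics(correct_t1: Sequence[bool], correct_t2: Sequence[bool]) -> Dict[str, float]:
    """Delta-accuracy and incorrect-to-correct flip rate, both in percent.

    Assumption: both quantities are normalized by the total number of problems.
    """
    if len(correct_t1) != len(correct_t2) or not correct_t1:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(correct_t1)
    acc_t1 = 100 * sum(correct_t1) / n
    acc_t2 = 100 * sum(correct_t2) / n
    # Problems that were wrong at turn 1 and right at turn 2.
    flipped = sum((not a) and b for a, b in zip(correct_t1, correct_t2))
    return {"delta_acc": acc_t2 - acc_t1, "delta_i_to_c": 100 * flipped / n}

# Five toy problems: two are fixed by feedback, one regresses.
t1 = [False, False, True, True, False]
t2 = [True, False, True, False, True]
metrics = turn_metrics(t1, t2)
print(metrics)
```

Under this normalization, Δi→c can exceed Δacc whenever some initially correct answers become incorrect after feedback, which matches the pattern in the table.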

Takeaway 4: Bi-NAC achieves the highest feedback optimality (FO=4.2), and the actor effectively leverages the feedback, demonstrating successful alignment between critic and actor.

5. Single-Model Bi-NAC

Can a single model do both critique generation and response refinement?
We tested a LLAMA-3.2-1B model variant where a single LLM generates both the responses and feedback.

| Variant | MATH-500 | MBPP |
|---|---|---|
| Bi-NAC (2 models) | 46.56 | 66.73 |
| Bi-NAC (1 model) | 46.84 | 65.24 |

Takeaway 5: The performance of the single-model variant is comparable to the two-model variant, which demonstrates Bi-NAC's practicality.

BibTeX


@misc{singh2026rllearnabletextualfeedback,
      title={RL with Learnable Textual Feedback: A Bilevel Approach}, 
      author={Utsav Singh and Sidhaarth Sredharan and Souradip Chakraborty and Amrit Singh Bedi},
      year={2026},
      eprint={2605.24547},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.24547}, 
}