RL with Learnable Textual Feedback: A Bilevel Approach

Utsav Singh*¹, Sidhaarth Sredharan*², Souradip Chakraborty³, Amrit Singh Bedi¹

¹University of Central Florida ²Carnegie Mellon University ³University of Maryland *Equal contribution

TL;DR: A framework that improves LLM reasoning by aligning the actor and critic models via a bilevel formulation.

Trained, aligned feedback yields faster convergence and higher reasoning accuracy. (Left) Vanilla GRPO on MBPP with LLaMA-3.2-1B exhibits a zero-reward phase (around 600 steps) where advantage collapse stalls learning. Fixed feedback (critiques from a frozen LLM not optimized for task reward) yields only marginal gains; replacing it with learnable feedback, where the critic is optimized for the actor's downstream reward, yields a larger improvement. (Right) On MATH-500, an untrained actor reaches 26.2% and vanilla RL reaches 36.6%; fixed feedback adds little (39.6%), but our Bi-NAC, which jointly trains actor and critic via a bilevel objective, reaches 46.6% (+7.0 points over fixed feedback). Takeaway: feedback quality drives sample efficiency; aligning critic and actor through bilevel optimization yields larger gains than static feedback-based methods.

Abstract

Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning remains sample-inefficient when terminal rewards are sparse. This has motivated a growing line of work on RL with textual feedback, where a critic model generates natural language feedback to guide a reasoning model (the actor), augmenting scalar rewards with richer learning signals. However, existing methods typically treat feedback as fixed or auxiliary, which misses a key property: feedback should not merely be correct, but should improve the policy (actor model) when provided in context. This motivates a paradigm of learnable textual feedback for RL. Yet the learnability and usefulness of feedback depend on the policy's ability to learn from it, making RL with learnable feedback an inherently bilevel problem. We formalize this coupling as a Stackelberg bilevel program and derive Bilevel Natural Language Actor-Critic (Bi-NAC), which jointly trains a critic to generate reward-improving feedback and an actor to exploit it. Across MATH-500, MBPP, and GPQA, Bi-NAC improves sample and parameter efficiency over RL and fixed-critic baselines: our 2B model outperforms the 3B GRPO baseline, achieving 46.6% versus 41.4% on MATH-500, while our 6B model surpasses the 7B GRPO baseline, achieving 49.3% versus 43.6% on GPQA.

Bi-NAC: Bilevel Natural Language Actor-Critic

We formalize reasoning with textual feedback as a Stackelberg bilevel formulation and derive Bi-NAC, where the critic is optimized to generate feedback that maximizes the actor’s downstream performance, and the actor is optimized to generate refined outputs based on the feedback.

🎯 Actor-Critic Interaction

Actor $\left(\pi^L_\theta(\cdot \mid x) \right)$ produces an initial attempt $y_0$ given prompt $x$
Critic $\left( \pi^H_\phi(\cdot \mid x, y_0) \right)$ generates textual feedback $z$ based on $(x, y_0)$
Actor $\left( \pi^L_{\theta}(\cdot \mid x, y_0, z) \right)$ produces a refined response $y_1$ conditioned on $(x, y_0, z)$
Verifier $\left( R(x,y_1) \right)$ scores $y_1$ with a binary reward $r \in \{0,1\}$
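
The four steps above can be sketched as a single rollout function. This is a minimal illustration under stated assumptions, not the authors' implementation: `actor`, `critic`, and `verifier` are hypothetical callables standing in for $\pi^L_\theta$, $\pi^H_\phi$, and $R$, and the toy stand-ins exist only so the sketch runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rollout:
    prompt: str          # x
    first_attempt: str   # y_0
    feedback: str        # textual feedback from the critic
    refined: str         # y_1
    reward: int          # r in {0, 1}


def rollout(
    prompt: str,
    actor: Callable[[str, Optional[str], Optional[str]], str],
    critic: Callable[[str, str], str],
    verifier: Callable[[str, str], int],
) -> Rollout:
    """One actor -> critic -> actor -> verifier pass (hypothetical interface)."""
    y0 = actor(prompt, None, None)       # initial attempt from the prompt alone
    feedback = critic(prompt, y0)        # critique conditioned on (x, y_0)
    y1 = actor(prompt, y0, feedback)     # refinement conditioned on (x, y_0, feedback)
    r = verifier(prompt, y1)             # binary reward on the refined answer only
    return Rollout(prompt, y0, feedback, y1, r)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_actor(x, y0, feedback):
    if feedback is None:
        return "5"                       # deliberately flawed first attempt
    return "4" if "recheck" in feedback else y0


def toy_critic(x, y0):
    return "recheck the addition" if y0 != "4" else "looks right"


def toy_verifier(x, y1):
    return int(y1 == "4")


result = rollout("What is 2 + 2?", toy_actor, toy_critic, toy_verifier)
print(result.first_attempt, "->", result.refined, "reward:", result.reward)
```

Note that only the refined response is scored, so the critic's feedback affects the reward solely through its effect on the actor's second attempt.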

🎯 Bilevel Optimization

The actor and critic optimize the following bilevel objective:

$\max_{\phi, \theta} \mathcal{L}(\phi, \theta, \lambda) \!=\! \max_{\phi, \theta} U\!\bigl(\phi, \theta\bigr) \!+\! \lambda \big(L(\phi, \theta) \!-\! L(\phi, \theta^*(\phi)) \big), \quad$ where

Upper Level Objective: $ \max_{\phi} U(\phi, \theta) = \max_{\phi} \mathbb{E}_{\substack{x \sim \mathcal{D}, \, y_0 \sim \pi^L_\theta(\cdot \mid x) , z \sim \pi^H_\phi(\cdot \mid x, y_0) , y_1 \sim \pi^L_{\theta}(\cdot \mid x, y_0, z)}} \left[ R(x,y_1) \right], \quad$ and

Lower Level Objective: $ \max_{\theta} L(\phi, \theta) = \max_{\theta} \mathbb{E}_{\substack{x \sim \mathcal{D}, \, y_0 \sim \pi^L_\theta(\cdot \mid x) , z \sim \pi^H_\phi(\cdot \mid x, y_0) , y_1 \sim \pi^L_{\theta}(\cdot \mid x, y_0, z)}} \left[ R(x, y_1) \right]$.
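
One way to read the penalty term in $\mathcal{L}(\phi, \theta, \lambda)$, assuming (from the notation, not from an explicit statement on this page) that $\theta^*(\phi)$ denotes the actor's best response to a given critic and that $\lambda \ge 0$:

$$\theta^*(\phi) \in \arg\max_{\theta} L(\phi, \theta) \quad\Longrightarrow\quad L(\phi, \theta) - L(\phi, \theta^*(\phi)) \le 0 \ \text{ for all } \theta .$$

Under that reading, the penalty is zero exactly when the actor attains its best response to the current critic and negative otherwise, so a larger $\lambda$ pushes the joint maximization of $\mathcal{L}$ toward actor parameters that are (near-)optimal for the current feedback.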

Bi-NAC Architecture

Top Left: In prior approaches the actor and critic are not aligned (e.g., auxiliary or fixed-feedback critics). This misalignment often leads to ineffective guidance: the feedback is either incorrect or ignored by the actor, resulting in incorrect final outputs. Bottom Left: An illustrative example in which both the feedback and the final output are incorrect. Top Right: Bi-NAC explicitly models the dependency between feedback generation and policy improvement by training the critic to maximize the actor's final task performance, while the actor is trained to leverage that feedback. Bottom Right: An illustrative example in which correct, aligned feedback guides the actor from a flawed initial response to the correct final solution.

Key Insights

❌ The Misalignment Problem

Fixed or auxiliary critics can generate ineffective guidance due to misalignment between actor and critic, leading to sub-optimal performance.

🎲 Advantage Collapse

Prior approaches like GRPO suffer under sparse terminal rewards. On MBPP, 74% of GRPO training groups receive zero reward on every sample, causing advantage collapse.
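
To see why an all-zero group stalls learning, consider the group-relative advantage as it is commonly defined for GRPO: each reward minus the group mean, divided by the group standard deviation. The sketch below is illustrative only; the function name and the `eps` stabilizer are assumptions, not details taken from this page.

```python
import numpy as np


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    `eps` is an assumed numerical stabilizer that avoids division by zero.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Every sample in the group fails: all advantages are exactly zero,
# so the policy-gradient update for this group carries no signal.
all_zero = group_advantages([0, 0, 0, 0])

# A single success in the group restores a non-trivial signal.
mixed = group_advantages([0, 0, 1, 0])

print(all_zero)
print(mixed)
```

The same collapse happens when every sample in a group succeeds, since the rewards are again identical.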

💡 Dense Feedback Signal

Textual feedback provides denser supervision beyond binary terminal rewards; however, it may lead to sub-optimal performance due to actor-critic misalignment.

✅ Bi-NAC's Solution

Bi-NAC leverages a bilevel formulation to enable effective actor-critic alignment, significantly improving LLM reasoning performance.

Results

1. Parameter Efficiency: Bi-NAC vs GRPO

Method	Size	MATH-500	MBPP	GPQA
GRPO	3B	41.4	61.6	36.4
Bi-NAC	2B	46.6 (+5.2)	66.7 (+5.1)	41.2 (+4.8)

GRPO	7B	48.4	72.2	43.6
Bi-NAC	6B	51.4 (+3.0)	75.0 (+2.8)	49.3 (+5.7)

Takeaway 1: Bi-NAC exhibits faster convergence and better parameter and sample efficiency than GRPO baselines.

2. Comparison between fixed/auxiliary and aligned feedback (Bi-NAC)

Feedback comparison: (Left) Bi-NAC significantly outperforms methods that rely on fixed or auxiliary critics on MATH-500, demonstrating the efficacy of our bilevel framework.

Takeaway 2: Adding fixed feedback to GRPO gives only marginal gains. Bi-NAC's bilevel training yields strong gains.

3. Comparison with State-of-the-Art Methods (1B/3B/8B scales)

Method	1B MATH	1B MBPP	1B GPQA	3B MATH	3B MBPP	3B GPQA	8B MATH	8B MBPP	8B GPQA
BC	26.2	50.0	30.2	42.4	68.5	37.1	53.0	74.5	39.0
Hier-NFT	25.8	41.3	26.8	41.3	61.0	32.7	50.9	72.8	34.2
ArCHer	30.6	52.4	33.6	43.6	69.5	40.9	55.5	76.0	43.0
SCoRe	39.8	62.3	34.4	45.2	70.7	42.6	56.8	77.3	44.6
Bi-NAC	46.6	66.4	40.6	51.4	75.0	49.3	60.2	79.8	56.3

Takeaway 3: Bi-NAC's gains persist across model scales, and it consistently outperforms all baselines, with especially large improvements on challenging benchmarks like GPQA (+11.7 points over the best baseline at 8B scale).
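
The margins behind this takeaway can be recomputed directly from the table. The snippet below is a small arithmetic check; the dictionary layout and variable names are illustrative choices, and the numbers are copied from the table above.

```python
# Columns in table order: (scale, benchmark).
columns = [(s, b) for s in ("1B", "3B", "8B") for b in ("MATH", "MBPP", "GPQA")]

scores = {
    "BC":       [26.2, 50.0, 30.2, 42.4, 68.5, 37.1, 53.0, 74.5, 39.0],
    "Hier-NFT": [25.8, 41.3, 26.8, 41.3, 61.0, 32.7, 50.9, 72.8, 34.2],
    "ArCHer":   [30.6, 52.4, 33.6, 43.6, 69.5, 40.9, 55.5, 76.0, 43.0],
    "SCoRe":    [39.8, 62.3, 34.4, 45.2, 70.7, 42.6, 56.8, 77.3, 44.6],
    "Bi-NAC":   [46.6, 66.4, 40.6, 51.4, 75.0, 49.3, 60.2, 79.8, 56.3],
}

# Margin of Bi-NAC over the strongest baseline in each column, in points.
margins = {}
for i, col in enumerate(columns):
    best_baseline = max(v[i] for name, v in scores.items() if name != "Bi-NAC")
    margins[col] = round(scores["Bi-NAC"][i] - best_baseline, 1)

for col, m in margins.items():
    print(col, f"+{m}")
```

Every margin is positive, and the largest is the 8B GPQA column.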

4. Feedback Optimality & Compatibility Analysis (MATH-500)

Does Bi-NAC generate better feedback, and does the actor effectively use it?

Method	Trained Critic	Trained Actor	Accuracy ↑	FO ↑	Δ_acc (t₁,t₂) ↑	Δ_i→c (t₁,t₂) ↑
Hier-NFT	❌	❌	35.4	2.1	9.6	12.2
SCoRe	❌	✅	39.6	2.1	7.4	17.8
Hier-FT	✅	❌	38.0	3.0	8.8	16.0
Bi-NAC w/o BL	✅	✅	41.8	3.6	9.6	19.2
Bi-NAC	✅	✅	46.6	4.2	11.4	21.2

FO: Feedback Optimality (1-5 scale, LLM judge) | Δ_acc (t₁,t₂) ↑ Accuracy increase from turn 1 to turn 2 | Δ_i→c (t₁,t₂) ↑ Fraction flipped from incorrect to correct

Takeaway 4: Bi-NAC achieves the highest feedback optimality (FO=4.2), and the actor effectively leverages the feedback, demonstrating successful alignment between critic and actor.
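
As a rough guide to how the two turn-level metrics could be computed from per-problem correctness at turn 1 and turn 2, here is a minimal sketch. It assumes both metrics are percentages over all evaluated problems; the page does not state the denominator for $\Delta_{i \to c}$, so treat that choice, along with the function and key names, as assumptions.

```python
def turn_metrics(correct_t1, correct_t2):
    """Compute turn-level metrics from per-problem correctness flags.

    Returns percentages over all problems (assumed denominator):
      delta_acc    : accuracy at turn 2 minus accuracy at turn 1
      delta_i_to_c : share of problems that flip from incorrect to correct
    """
    if len(correct_t1) != len(correct_t2) or not correct_t1:
        raise ValueError("need two non-empty lists of equal length")
    n = len(correct_t1)
    net_gain = sum(correct_t2) - sum(correct_t1)
    flipped = sum(1 for a, b in zip(correct_t1, correct_t2) if not a and b)
    return {
        "delta_acc": 100.0 * net_gain / n,
        "delta_i_to_c": 100.0 * flipped / n,
    }


# Toy data: 5 problems, 1 = correct, 0 = incorrect (not from the paper).
t1 = [1, 0, 0, 1, 0]
t2 = [1, 1, 0, 0, 1]
metrics = turn_metrics(t1, t2)
print(metrics)
```

In the toy data, two problems flip to correct and one regresses, so $\Delta_{i \to c}$ exceeds $\Delta_{acc}$; the gap between the two is the share of problems that went from correct to incorrect.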

5. Single-Model Bi-NAC

Can a single model do both critique generation and response refinement?
We tested a LLaMA-3.2-1B variant in which a single LLM generates both the responses and the feedback.

Variant	MATH-500	MBPP
Bi-NAC (2 models)	46.56	66.73
Bi-NAC (1 model)	46.84	65.24

Takeaway 5: The performance of the single-model variant is comparable to the two-model variant, which demonstrates Bi-NAC's practicality.

BibTeX


@misc{singh2026rllearnabletextualfeedback,
      title={RL with Learnable Textual Feedback: A Bilevel Approach}, 
      author={Utsav Singh and Sidhaarth Sredharan and Souradip Chakraborty and Amrit Singh Bedi},
      year={2026},
      eprint={2605.24547},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.24547}, 
}