arxiv:2601.05242

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Published on Jan 8
Submitted by LIU Shih-yang on Jan 9
#1 Paper of the day

Abstract

AI-generated summary: Multi-reward reinforcement learning suffers from reward normalization collapse in GRPO, which GDPO addresses by decoupling reward normalization for improved training stability and performance across reasoning tasks.

As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
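
To make the collapse argument concrete, here is a minimal sketch contrasting the two orderings on a toy group of rollouts. It assumes per-group z-score normalization and equal reward weights; the function names `grpo_advantages` and `gdpo_advantages` are illustrative and not taken from the paper's released code.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style: sum the K rewards per rollout, then z-normalize the summed
    reward across the group of G rollouts. `rewards` has shape (G, K)."""
    total = rewards.sum(axis=1)                               # (G,)
    return (total - total.mean()) / (total.std() + eps)

def gdpo_advantages(rewards: np.ndarray, weights=None, eps: float = 1e-6) -> np.ndarray:
    """GDPO-style (as described in the abstract): z-normalize each reward
    across the group separately, then aggregate the normalized values."""
    _, K = rewards.shape
    w = np.ones(K) if weights is None else np.asarray(weights, dtype=float)
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)  # (G, K)
    return normed @ w                                          # (G,)

# Toy group of 4 rollouts scored by a correctness reward and a format reward.
# Rollouts 1-3 all sum to 1, so GRPO assigns them one identical advantage even
# though they trade correctness against formatting in different ways.
r = np.array([[1.0, 1.0],   # correct, well formatted
              [1.0, 0.0],   # correct, badly formatted
              [0.0, 1.0],   # wrong,   well formatted
              [0.0, 1.0]])  # wrong,   well formatted
print(grpo_advantages(r))   # [ 1.73, -0.58, -0.58, -0.58] -> resolution lost
print(gdpo_advantages(r))   # [ 1.58, -0.73, -0.42, -0.42] -> distinct combinations stay distinct
```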

Community

Paper author / submitter

GDPO is a drop-in replacement for GRPO in verl and TRL — only minor code changes needed.

We release a slurm-free, easy-to-run implementation supporting multiple RL frameworks (verl / TRL / NeMo-RL) so you can quickly validate GDPO on tool-calling and math reasoning tasks.

⏱️ Each run can be completed in ~1 hour on 8×A100s, or ~2.5 hours on a single A100.

🔄 Switching from GRPO to GDPO is easy.
👉 Try it yourself: https://github.com/NVlabs/GDPO

Really cool paper!
I've created a podcast that explains the key concepts:
https://researchpod-share.vercel.app/episode/c83f1820-279a-4cc0-afe1-b927a0c20ec8


I enjoyed listening to the AI paper podcast!

When you RL models for real-world use, you care about more than one thing: accuracy, conciseness, alignment, faithfulness, etc.

But most RL pipelines still compress all of that into one scalar advantage in the loss function — and a lot of preference signal gets washed out.

We’re introducing GDPO, a simple fix that lets you express multi-dimensional preferences with a single advantage. Key idea: swap the order of reward normalization and aggregation.

Works out-of-the-box as a GRPO add-on—code is provided for veRL, TRL, and NeMo-RL.
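
Read as a code-level summary of that order swap, here is a hedged sketch: `rewards` is a hypothetical (group_size x num_rewards) array, not an object from the veRL, TRL, or NeMo-RL APIs, and `znorm` stands in for whatever per-group normalization the pipeline uses.

```python
import numpy as np

def znorm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # z-score over the group dimension (axis 0)
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

rewards = np.random.rand(8, 3)          # 8 rollouts in a group, 3 reward signals

adv_grpo = znorm(rewards.sum(axis=1))   # aggregate first, then normalize (GRPO)
adv_gdpo = znorm(rewards).sum(axis=1)   # normalize each reward, then aggregate (GDPO)
```

Only the advantage computation changes; the rest of the policy-gradient update can stay as it is, which is what makes the swap a drop-in modification.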


Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 5